ANALYSIS AND PREDICTION OF MALICIOUS USERS ON ONLINE SOCIAL NETWORKS:
APPLICATIONS OF MACHINE LEARNING AND NETWORK ANALYSIS IN SOCIAL
SCIENCE
by
Adam Badawy
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(POLITICAL SCIENCE AND INTERNATIONAL RELATIONS)
August 2020
Copyright 2020 Adam Badawy
Acknowledgments
I would like to thank Emilio Ferrara for his support throughout this PhD and for having me as a part
of his lab. I appreciate Patrick James's support and guidance since I joined the POIR department.
I would also like to thank Pablo Barbera for being a great mentor and a friend. I am grateful for
Kristina Lerman for having me on her project as well as for Aram Galstyan. I would like to thank
all of my colleagues and staff at the Information Sciences Institute (ISI) and the Political Science
and International Relations (POIR) Department for their support and collegiality since I joined
USC. Special thanks goes to Veri Chavarin for always being there for POIR students.
I owe special thanks to all my co-authors. Besides the faculty co-authors whom I mentioned earlier,
I owe a great debt of gratitude to the bright students I have worked with. It was a pleasure
working with Aseel Addawood, with whom I produced two papers. It was also a pleasure working
with Ashok Deb and Luca Luceri. I am particularly grateful for working alongside Anna Sapienza
and Laura Alessandretti. Although we did not produce any papers together, I would like to thank
them for their support and friendship. Special thanks also goes to the PostDocs at ISI, Homa
Hosseinmardi and Goran Muric, for being supportive and friendly to everyone.
I would like to thank all of my colleagues at POIR, especially Cody Brown, Damiela Maag,
Maria Perez, David Somogyi, Shiming Yang, and Xinru Ma, for being great friends. I would like
to give special thanks to my colleague and my wife, Evgeniia; words cannot express how
grateful I am for meeting her and falling in love with her. Evgeniia has always been there for me,
and I can say with confidence that the most important thing I have achieved in this PhD is not
this dissertation but meeting my wife and falling in love with her!
It goes without saying that I would like to thank my family, my mom and my dad, Nadia and
Badawy, for their support throughout my life. I definitely would not have reached where I am today
without them. I would like to thank my sisters, Heba, Yasmine, and Mona, for always being there
for me. I would like to thank my brother-in-law, Mohammed, for being a great friend. I am grateful
for having the most adorable nephews and niece, Yazan, Kinan, and Misk. I am also grateful for
having life-long friends that I grew up with and miss hanging out with like the old days: Chris
Carnabatu, Aminur (Tito) Rahman and Michael Jermakian.
Contents
Acknowledgments ii
List of Tables vii
List of Figures ix
Abstract xii
1 Introduction 1
1.1 The Role of Trolls and Bots in the 2016 US Presidential Elections . . . . . . . . . . 6
1.2 Predicting Users Who Spread Trolls' Content . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Understanding the Spread of Misinformation and the Centrality of Trolls . . . . . . 10
2 Analyzing the Digital Traces of Political Manipulation: The 2016 Russian In-
terference Twitter Campaign 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Twitter Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Classification of Media Outlets . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Russian Trolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Data Analysis & Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Retweet Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Bot Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Geo-location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.5 Activity Summary of Russian Trolls . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.1 RQ1: Political Ideology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 RQ2: Social Bots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.3 RQ3: Geospatial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Who Falls for Online Political Manipulation? 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Twitter Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Russian Trolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Spreaders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Data Analysis & Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Political Ideology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Bot Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3 Engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Characterizing the 2016 Russian IRA Influence Campaign 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 Twitter Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.2 Classification of Media Outlets . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Methods and Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Bot Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.3 Geolocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 RQ1: Political Ideology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 RQ2: Temporal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.3 RQ3: Social Bots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.4 RQ4: Geospatial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 Conclusion 80
Bibliography 85
Appendices 96
A Appendix for Chapter 1 97
List of Tables
2.1 Twitter Data Descriptive Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Liberal & Conservative Domain Names. . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Descriptive Statistics on Russian trolls. . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Descriptive statistics of the Retweet Network. . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Precision & Recall scores for the seed users and hyper-partisan users test sets. . . . . 23
2.6 Breakdown of the Russian trolls by political ideology, with the ratio of conservative
to liberal trolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.7 Top 20 stemmed words from the tweets of Russian trolls classified as liberal and
conservative. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8 Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls. . . . . . . 27
2.9 Breakdown by political ideology of users who spread Russian trolls' content and
wrote original tweets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 Bot analysis on spreaders (those with bot scores). . . . . . . . . . . . . . . . . . . . . 28
3.1 Twitter Data Descriptive Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Descriptive Statistics on Russian trolls. . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls. . . . . . . 37
3.4 List of features employed to characterize users in the dataset . . . . . . . . . . . . . . 38
3.5 Liberal & Conservative Domain Names. . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Precision & Recall scores for the seed users and hyper-partisan users test sets. . . . . 41
3.7 Breakdown of overall users, trolls, and spreaders by political ideology . . . . . . . . . 41
3.8 Machine Learning Models from the Baseline (Metadata) to Full Model. . . . . . . . . 46
4.1 Descriptive Statistics of Russian trolls. . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Twitter Data Descriptive Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Liberal & Conservative Domain Names (excluding left-center and right-center) . . . 62
4.4 Descriptive statistics of the Retweet Network. . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Distribution of tweets by country (Top 10 countries by tweet count) . . . . . . . . . 68
4.6 Breakdown of the Russian trolls by political ideology, with the ratio of conservative
to liberal trolls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7 Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls. . . . . . . 70
4.8 All spreaders by political ideology; Bot analysis for 115,396 spreaders (out of a 200k
random sample of spreaders). Ratio: conservative/liberal . . . . . . . . . . . . . . . 70
4.9 Top 20 meaningful lemmatized words from the tweets of Russian trolls classified as
Conservative and Liberal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.10 Users that reported Russia as their location . . . . . . . . . . . . . . . . . . . . . . . 76
List of Figures
2.1 Timeline of the volume of tweets (in blue) and users (in red) generated during our
observation period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Distribution of tweets with links to the top five liberal media outlets. . . . . . . . . . 20
2.3 Distribution of tweets with links to the top five conservative media outlets. . . . . . 20
2.4 Distribution of the probability density of bot scores assigned to liberal users who
retweet Russian trolls (blue) and to conservative users who do so (red). . . . . . . . . 29
2.5 Proportion of the number of retweets by conservative users (excluding bots) of Rus-
sian trolls per each state normalized by the total number of conservative tweets by
state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Feature correlation heat map for all users in the dataset. . . . . . . . . . . . . . . . . 39
3.2 Probability density distributions of bot scores assigned to spreaders (red) and non-
spreaders (blue). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Friend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Sentiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Temporal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8 User . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Area under the ROC curve plot for the five models under evaluation using Gradient
Boosting. For each of the five models, I use the fold that yields the highest
AUC among the trained ones. It is evident that the addition of bot scores, political
ideology, and tweet count variables is important in improving the performance of
the classifiers. The legend shows the average AUC scores for each model. . . . . . . 50
3.10 Relative importance of the features using Gradient Boosting for the full model (best
performing fold) in predicting users as spreaders vs. non-spreaders. Political Ideology
explains over 25% of the variance, followed by Followers Count, Statuses Counts, and
Bot Scores, each explaining roughly 5% to 10% of the variance. . . . . . . . . . . . . 51
3.11 Upward Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.12 Downward Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Timeline of the volume of tweets (in blue) generated during the observation period
and the users that produced these tweets (in red). . . . . . . . . . . . . . . . . . . . 61
4.2 Distribution of tweets with links to liberal media outlets. . . . . . . . . . . . . . . . 63
4.3 Distribution of tweets with links to conservative media outlets. . . . . . . . . . . . . 64
4.4 Overall Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Content Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 Friend Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Metadata Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Network Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 Sentiment Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.10 Temporal Bot Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.11 Proportion of the number of retweets by liberal users of Russian trolls per each state
normalized by the total number of liberal tweets by state. . . . . . . . . . . . . . . . 77
4.12 Proportion of the number of retweets by conservative users of Russian trolls per each
state normalized by the total number of conservative tweets by state. . . . . . . . . . 78
A.1 Example of a Russian troll tweet, exhibit 1 . . . . . . . . . . . . . . . . . . . . . . . 98
A.2 Example of a Russian troll tweet, exhibit 2 . . . . . . . . . . . . . . . . . . . . . . . 99
A.3 Example of a Russian troll tweet, exhibit 3 . . . . . . . . . . . . . . . . . . . . . . . 100
A.4 Example of a Russian troll tweet, exhibit 4 . . . . . . . . . . . . . . . . . . . . . . . 101
A.5 Example of a Russian troll tweet, exhibit 5 . . . . . . . . . . . . . . . . . . . . . . . 102
Abstract
Until recently, social media was seen to promote democratic discourse on social and political issues.
However, this powerful communication platform has come under scrutiny for allowing hostile actors
to exploit online discussions in an attempt to manipulate public opinion. A case in point is the
ongoing U.S. Congress investigation of Russian interference in the 2016 U.S. election campaign,
with Russia accused of, among other things, using trolls (malicious accounts created for the purpose
of manipulation) and bots (automated accounts) to spread misinformation and politically biased
information. In this dissertation, I explore the effects of this manipulation campaign, taking a
closer look at users who re-shared the posts produced on Twitter by the Russian troll accounts
publicly disclosed by the U.S. Congress investigation.
In chapter two, I find that conservatives retweet Russian trolls significantly more often than
liberals and produce 36 times more tweets. Additionally, most of the troll content originated in,
and was shared by users from, Southern states. Using state-of-the-art bot detection techniques, I
estimate that about 4.9% of liberal and 6.2% of conservative users were bots. Text
analysis of the content shared by trolls reveals that they have a mostly conservative, pro-Trump
agenda. Although an ideologically broad swath of Twitter users was exposed to Russian trolls in
the period leading up to the 2016 U.S. Presidential election, it was mainly conservatives who helped
amplify their message.
In chapter three, I attempt to answer the following two questions. First, I test whether predicting
users who spread trolls' content is feasible, in order to gain insight into how to contain their
influence in the future; second, I identify the features that are most predictive of users who either
intentionally or unintentionally play a vital role in spreading this malicious content. The machine
learning model utilized in this dissertation is able to identify users who spread the
trolls' content very accurately (average AUC score of 96%, using 10-fold cross-validation). I show that political ideology,
bot likelihood scores, and some activity-related account metadata are the most predictive features
of whether a user spreads trolls' content or not.
In chapter four, I replicate the work done in chapter two with a new dataset of 13 million election-
related posts shared on Twitter over the course of 2016 by over a million distinct users. I also conduct
novel network analysis on the retweet network. This dataset includes accounts associated with the
identified Russian trolls as well as users sharing posts in the same time period on a variety of topics
around the 2016 elections. Conservative users who retweet Russian trolls produced significantly
more tweets than liberal ones, about eight times as many. Additionally, the trolls'
position in the retweet network is stable over time, unlike the users who retweet them, who come to form the
core of the election-related retweet network by the end of 2016. Using state-of-the-art bot detection
techniques, I estimate that about 5% of liberal and 11% of conservative users are bots.
Text analysis of the content shared by trolls reveals that conservative trolls talk about refugees,
terrorism, and Islam, while liberal trolls talk more about school shootings and the police.
Chapter 1
Introduction
The initial optimism about the role of social media as a driver of social change has been fading
away, following the rise in concerns about the negative consequences of malicious behavior online.
Such negative outcomes have been particularly evident in the political domain. The spread of
misinformation (Shorey and Howard, 2016a, Tucker et al., 2017) and the increasing role of bots
(Bessi and Ferrara, 2016) and trolls in the 2016 US presidential elections have increased the interest
in understanding this phenomenon, quantifying its impact, and establishing tools for the automatic
detection and prediction of malicious actor activity.
A diverse amalgamation of actors, including trolls, bots, fake-news websites, highly partisan media
outlets, and even politicians, play overlapping roles in producing and spreading
misinformation on online social networks. Recent research spanning the social sciences and computer
science has explored the mechanisms of misinformation campaigns, the effect of such campaigns on
the political discourse online, and the building of state-of-the-art algorithms to detect bots and trolls. This
dissertation will mostly focus on trolls but, to a lesser extent, will shed light on the role of bots
among the trolls.
The term \trolls" has been used from the early days of the internet. It was used to describe
people who deliberately annoy others on the internet to elicit an emotional response from them.
Trolls attempt to post provoking messages to sow discord among users online (Phillips, 2015). In
the US, they usually attempt to trick mainstream media by reporting fake news and having the
former broadcast these stories to the public on their behalf. Another tactic of trolls is to post racist
and sexist content to enrage liberals which usually lead to the polarization of the political discourse
1
online. This outcome fuels the perception of the rise of political polarization and an increased
coverage of mainstream new outlets enforcing this point (Higgin, 2013, Herring et al., 2002).
Trolls are not only users who deliberately annoy others out of differences in political ideology or
for the pure enjoyment of others' frustration. A large portion of trolls are hired individuals paid
by companies, politicians, political parties, or foreign countries to write posts containing fake news
or hateful material (Mihaylov et al., 2015). The most famous examples of such trolls are the ones
deployed in the Russian interference campaign during the 2016 US presidential elections.
Research on Russian interference campaigns has a long history, going back mainly to the
practice of dezinformatsiya (disinformation) spearheaded by the Soviet Union. This practice included
tactics to influence Western public opinion by planting false or distorted stories in Western
media, or even in the media of non-aligned countries, from which they would eventually reach the Western world
(Ziegler, 2018, Maréchal, 2017). In contrast to the Soviet era, the internet and social media today
have amplified the effect of this kind of "information warfare" by countries like Russia (Diamond
et al., 2016, Tucker et al., 2017, Roberts, 2018, Sanovich et al., 2018).
In this dissertation, I focus on the role of Russian trolls in the recent US presidential elections.
In the context of the 2016 US election, I define trolls as users who exhibit a clear intent to deceive
or create conflict [1]. Their actions are directed at harming the political process and causing distrust
in the political system. My definition captures the new phenomenon of paid political trolls who
are employed by political actors for a specified goal. The most recent and important example of
this phenomenon is the Russian "troll farms": trolls paid by the Russian government to influence
conversations about political issues with the aim of creating discord and hate among different groups
(Gerber and Zavisca, 2016).
Survey data from the Pew Research Center (Gottfried and Shearer, 2016) show that two-thirds
of Americans get their news from social media. Moreover, they are being exposed to more political
content written by ordinary people than ever before. Bakshy et al. (2015a) report that 13% of
posts by Facebook users (those who report their political ideology) are political news. This raises the
question of how much influence the Russian trolls had on the national political conversation prior
to the 2016 US election and after, and how much influence such trolls will have in upcoming
elections.
[1] See Appendix A for examples of Russian trolls' content. Appendix A shows images of Twitter posts from a prolific Russian troll Twitter account, 'Ten Gop'. These images come from the following source: https://theoutline.com/post/3397/here-are-the-most-shared-tweets-created-by-russian-trolls-that-twitter-scrubbed-from-the-internet
The use of trolls and bots in political manipulation campaigns around the globe is well docu-
mented through an array of reports by mainstream media outlets and academics (see Tucker et al.
(2018) for a comprehensive review on the role of misinformation, bots, and trolls on social media).
This phenomenon is not entirely new: researchers warned about the potential for online political
manipulation for over a decade (Howard, 2006, Hwang et al., 2012). Reports tracking and studying
this phenomenon date back to the early 2010s (Ratkiewicz et al., 2011, Metaxas and Mustafaraj,
2012, Ratkiewicz et al., 2011). Since then, an increasing number of such events have been recorded
in the context of several elections, both in the United States (Bessi and Ferrara, 2016, Kollanyi
et al., 2016, Shorey and Howard, 2016a, Woolley and Howard, 2016, Woolley, 2016, Marwick and
Lewis, 2017, Wang et al., 2016) and all over the world, including in South America (Forelle et al.,
2015, Suárez-Serrato et al., 2016), the U.K. (Howard and Kollanyi, 2016), and Italy (Cresci et al.,
2017).
Although trolls do not necessarily need to be automated accounts, in many cases bots play a
substantial role in political manipulation. Bessi and Ferrara (2016) report that 400k bots were
responsible for posting 3.8 million tweets in the last month of the 2016 US presidential election,
which is one-fifth of the total volume of online conversations they collected. Specifically, Russian
political manipulation campaigns did not target only the US (Badawy et al., 2018a): there is
evidence of Russian interference in German electoral campaigns (Applebaum and Colliver, 2017),
British elections (Gorodnichenko et al., 2018), and the Catalonian referendum (Stella et al., 2018).
Russian-affiliated accounts were also reported in the 2017 French presidential elections, where
bots were detected during the so-called MacronLeaks disinformation campaign (Ferrara, 2017).
Moreover, a recent NATO report claims that around 70% of accounts tweeting in Russian and
directed at Baltic countries and Poland are bots.
Russian political manipulation online has not been limited to targets outside Russia's borders. Domestically, there is
strong evidence that trolls and bots were present on multiple occasions. Ananyev and Sobolev
(2017) provide evidence of Russian government-affiliated trolls being able to change the direction
of conversations on LiveJournal, a popular blogging platform in Russia in the 2000s. Moreover, the
same entity that controlled many of the trolls studied in this work, the Russian "troll factory",
run by the Internet Research Agency, had its trolls contribute to Wikipedia in support of positions
and historical narratives put forward by the current Russian government (Labzina, 2017).
Online political manipulation is not only a Russian phenomenon. There is strong evidence of
similar efforts by various governments to control political discussion online, particularly with the
use of bots. King et al. (2017) show that the so-called "50-centers" (low-paid workers
who post online on behalf of the Chinese government) try to distract Chinese citizens online from
politically controversial topics. Furthermore, Miller and Gallagher (2018) estimate that Chinese
astroturfers produce about 15% of all comments made on 19 popular Chinese news websites.
In Korea, bots were utilized as part of a secret intelligence operation in support of the reelection
of the incumbent party's candidate (Keller et al., 2017).
But what are the tactics for spreading misinformation on online social networks? The first
tactic is to remove some information from online social networks (OSNs) while leaving other information in place, creating
the impression that the information left on the platform is truthful and ubiquitous. This
tactic is mostly utilized by authoritarian states, like China (King et al., 2013) and, to a lesser extent,
Russia, which can control what appears on their national OSNs.
The second tactic is to manipulate search algorithms so that some news is shown more than other news.
Common techniques here include inserting keywords to raise the ranking of certain websites
and creating webpages that point to each other. This algorithmic manipulation takes place on OSNs as
well, particularly in relation to trending topics and hashtags on Twitter. Common strategies
include hijacking hashtags, exploiting mentions of popular users, and other ways to dominate the conversation
online in the form of counter-messaging delivered in bulk, or simply derailing online conversations
with meaningless content.
There is strong evidence that political manipulation campaigns are on the rise, but how effective
are they? In the case of the recent US presidential elections, Allcott and Gentzkow (2017) find that,
even though "fake news" stories were widely shared on social media during the 2016 election, the
average American saw only a few of these stories. Despite this finding, we should not underestimate
the potential role misinformation might play in distorting the views of citizens. Misinformed individuals
hold consistently different opinions from those who are exposed to more factual knowledge.
In some experimental studies, people who were exposed to accurate information about political
issues often changed their views accordingly (Gilens, 2001, Sides, 2016). Other studies show that
ignorance distorts collective opinion from what it would be if people were provided with more
information about politics (Althaus, 1998, Bartels, 1996, Gilens, 2001).
This distortion of individual opinion can lead to distortions at the aggregate level, as in the
collective public opinion. On many occasions, these distortions might be initiated, encouraged,
and exploited by domestic political elites or foreign powers. Manipulating actors attempt to
construct their own "truth" and push their version of the story into the public sphere to be adopted
by their target audience. For example, some politicians might resort to distortion to win elections or
avoid accountability for their performance in office (Fritz et al., 2004, Flynn et al., 2017). Examples
of misperceptions, whether caused by misinformation or not, that distort modern public policy
debates in the US abound. For instance, US citizens hold drastically exaggerated perceptions
about the amount of U.S. federal welfare spending, the amount of foreign aid, and the number of immigrants in the
country. A recent Kaiser Family Foundation poll found that, on average, Americans estimated
that 31% of the federal budget goes to foreign aid, with very few people aware that the actual
percentage does not exceed 1% (DiJulio et al., 2016). Similarly, another survey found that fewer
than one in ten respondents knew that welfare spending amounts to less than 1% of the federal
budget (Kuklinski et al., 2000). Moreover, it has been found that Americans tend to overestimate
the size of the immigrant population (Hopkins et al., 2019). All these political misperceptions
play a negative role in American political life and strengthen the already polarized environment
that the US finds itself in right now. In the case of the recent presidential elections, we could see
that trolls were spreading misinformation about immigration, minority issues, and the government
in general. The similarities with the above-mentioned cases of misinformation are obvious: trolls
adopt classical techniques aimed at spreading misperceptions to push the agenda of the initiator of
such campaigns.
In recent years, growing ideological and affective polarization has been accompanied by an increase
in conspiracy theories and partisan misinformation. Belief in false and unsupported claims is frequently
skewed by partisanship and ideology, suggesting that our vulnerability to them is increased
by directionally motivated reasoning. Directionally motivated reasoning is defined as the tendency
to selectively accept or reject information depending on its consistency with our prior beliefs and
attitudes (Kunda, 1990, Taber and Lodge, 2006). This tendency made the recent US presidential
elections a good target for misinformation by internal and external actors: in an environment of
severe polarization, at both the elite and mass level, false or misrepresented information that reinforces
a person's preexisting motivated perception or opinion can be quite effective. Nyhan and Reifler
(2010) and Flynn et al. (2017) show that motivated reasoning can even undermine the effectiveness
of corrective information, which sometimes fails to reduce misperceptions among vulnerable groups.
This dissertation tackles the issue of misinformation in the 2016 US presidential elections. It
includes three chapters. The first chapter analyzes how users' political ideology
affects their propensity to engage with political trolls, examines the role of bots, and explores the areas
of the US that the trolls penetrated most successfully. The second chapter examines whether we
can predict which users will become susceptible to misinformation by spreading content promoted by
Russian trolls, and explores the features distinguishing users who spread trolls' messages. The third and
last chapter tackles the same questions as the first chapter but with a new, comprehensive
dataset that covers a longer time span than the one used in the first chapter. Additionally, it examines
the question of how central the Russian trolls were in the information-spreading network in
2016 (before and after the US presidential election). The next few sections provide a
summary of the work done in each chapter and the insights gained from it.
1.1 The Role of Trolls and Bots in the 2016 US Presidential Elections
In this section, I explain the questions I attempt to answer in the first chapter and elaborate on the
data and methods used to do so. The first part of this chapter concerns the role of the users' political
ideology. I investigate political ideology in terms of patterns of engagement with Russian
trolls and propagation of trolls' content. I examine whether the effect was more pronounced among
liberals or conservatives, or evenly spread across the political spectrum. The second part of this
chapter concerns the role of social bots. I examine whether social bots played a role in spreading
content produced by Russian trolls and analyze their positioning on the political spectrum. The
third part of this chapter examines the specific geographic areas of the US that trolls were most
successful at penetrating.
I use a Twitter dataset that was collected using a list of hashtags and keywords related to
the 2016 U.S. Presidential election. The list was crafted to contain a roughly equal number of
hashtags and keywords associated with each major Presidential candidate. I classify users by their
ideology based on the political leaning of the media outlets they share. Using a polarity rule, I label
Twitter users as liberal or conservative depending on the number of tweets they produce with links
to liberal or conservative sources. I utilize a list of 2,752 Twitter accounts identified as Russian
trolls, compiled and released by the U.S. Congress, to identify trolls, users who spread their
message, and users who do not spread their message in the aforementioned dataset.
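As a rough illustration, the polarity rule can be thought of as a simple majority count over the media links a user shares. The sketch below is an assumption about its form (the dissertation does not spell out tie-breaking), not the actual implementation.

```python
# Minimal sketch of a polarity rule for ideology labeling (illustrative only;
# the tie-breaking behavior is an assumption).
def label_by_media_links(n_liberal_links, n_conservative_links):
    """Label a user by which side's media outlets they link to more often."""
    if n_liberal_links > n_conservative_links:
        return "liberal"
    if n_conservative_links > n_liberal_links:
        return "conservative"
    return None  # undetermined when the counts tie (or both are zero)

print(label_by_media_links(n_liberal_links=2, n_conservative_links=9))
```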
I construct a retweet network whose nodes are Twitter users, with a directed link between
two users if one retweeted a post of the other. I use a label propagation algorithm to classify Twitter
accounts as liberal or conservative. In a network-based label propagation algorithm, each node is
assigned a label, which is updated iteratively based on the labels of the node's network neighbors:
a node takes the most frequent label among its neighbors as its own new label.
The algorithm keeps updating labels iteratively and stops when the labels no longer change.
Additionally, I use Botometer (a.k.a. BotOrNot) to identify automated accounts in the dataset.
Botometer is a machine learning framework that extracts and analyzes a set of over one thousand
features, spanning content, network structure, temporal activity, user profile data, and sentiment,
to produce a score that suggests the likelihood that the inspected account is indeed a social
bot. Lastly, I use either the location of tweets produced by users or the self-reported location in
users' profiles to identify the state users belong to, with varying degrees of success.
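The following sketch illustrates the kind of label propagation described above on a toy retweet graph; the function, seed handling, and convergence details are assumptions, not the dissertation's exact implementation.

```python
# Minimal sketch of label propagation on a retweet network (illustrative only).
from collections import Counter
import networkx as nx

def propagate_labels(G, seed_labels, max_iter=100):
    """Iteratively assign each node the most frequent label among its neighbors.

    G           -- undirected view of the retweet network (networkx Graph)
    seed_labels -- dict mapping seed users to 'liberal' or 'conservative'
    """
    labels = dict(seed_labels)
    for _ in range(max_iter):
        changed = False
        for node in G.nodes():
            if node in seed_labels:            # keep seed users fixed
                continue
            neighbor_labels = [labels[n] for n in G.neighbors(node) if n in labels]
            if not neighbor_labels:
                continue
            new_label = Counter(neighbor_labels).most_common(1)[0][0]
            if labels.get(node) != new_label:
                labels[node] = new_label
                changed = True
        if not changed:                        # stop when labels no longer change
            break
    return labels

# Example usage with a toy retweet graph
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])
seeds = {"a": "liberal", "d": "conservative"}
print(propagate_labels(G, seeds))
```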
In terms of the first question, regarding the role of political ideology, I look at the users who
re-broadcast content produced by Russian trolls, a.k.a. spreaders. There are 28,274 spreaders
in the dataset who engage with Russian trolls. Spreaders also produce over 1.5 million original
tweets and over 12 million tweets and retweets, not counting the ones from Russian trolls. Looking
at the content of the top users, I can easily identify them as conservative; moreover, I find that
the most active of them produced thousands of tweets, an unreasonably large amount for
such a short period (a few weeks). Thus I suspect some of them may be bots. I next look at
users' activities by political leaning. There are more than 42 thousand original tweets posted
by the liberal spreaders and more than 1.5 million by conservative ones. Out of the 28 thousand
spreaders who engage with Russian trolls, 892 are liberals and 27,382 are conservatives. The
top stemmed words in the liberals' tweets indicate support for Clinton, while the conservatives'
postings openly support Trump. The top URLs for the liberals include media outlets such as
Huffington Post and NBC News, while conservatives most frequently tweeted news from Breitbart,
The Gateway Pundit, and Info Wars. As for profile URLs, liberals mostly had social network
accounts, while conservatives, besides social network accounts, sometimes listed links like
"www.donaldjtrump.com" and "lyingcrookedhillary.com".
As for the second question, I discover that some accounts exhibited suspicious activity levels.
I obtain bot scores for 34,160 out of the 40,224 spreaders. The number of users that have a bot
score above 0.5, and can therefore safely be considered bots, stands at 2,126 accounts. Out of the
34,160 spreaders with bot scores, 1,506 are liberal, and 75 of them have bot scores above 0.5, about
4.9% of the total. As for the conservatives, there are 32,513 spreaders, 2,018 of which have bot
scores greater than 0.5, representing around 6.2% of the total. Twitter users rebroadcast almost
exclusively political content produced by like-minded accounts, and I note that most spreaders
are conservatives. Therefore, assessing the overall influence of bots on the diffusion of Russian
trolls' content in the population at large is challenging. In terms of tweet/retweet production,
the fifteen hundred liberal spreaders produced nearly 225 thousand tweets/retweets, with 18,749
tweets/retweets by users who have a bot score above 0.5, representing around 8.3%. The 32
thousand conservative spreaders produced almost 12 million tweets/retweets, with 955,583 from
users with a bot score above 0.5, or 8% of the total. Setting aside the disproportion between the number of
liberals and conservatives, the mean bot score of the conservative spreaders (0.3) is
higher than that of the liberals (0.24). I performed a two-sided t-test for the null hypothesis that the
two distributions have identical mean values; the p-value is effectively zero, meaning that we can
reject the null.
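A minimal sketch of the significance test just described, assuming a two-sided Welch t-test (whether the equal-variance assumption was relaxed is not stated in the text); the score arrays are synthetic stand-ins whose means roughly match the reported 0.30 and 0.24.

```python
# Sketch of the two-sided t-test comparing mean bot scores of conservative vs.
# liberal spreaders (the arrays below are placeholders, not the real scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
conservative_scores = rng.beta(3, 7, size=32_513)    # mean ~0.30 (illustrative)
liberal_scores = rng.beta(3, 9.5, size=1_506)        # mean ~0.24 (illustrative)

t_stat, p_value = stats.ttest_ind(conservative_scores, liberal_scores,
                                  equal_var=False)   # Welch's correction assumed
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
# A vanishingly small p-value rejects the null hypothesis that the two groups
# have identical mean bot scores.
```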
For question three, I show a figure with the proportion of retweets of Russian trolls' content by
conservative users (classified according to the label propagation algorithm and excluding bots)
for each state, normalized by the state's total number
of conservative tweets. I notice that some states exhibit very high proportions of retweets per
total number of tweets for conservatives. I tested the deviations by using a two-tailed t-test on
the z-scores of each deviation, calculated on the distribution of ratios (average = 1.54, standard
deviation = 0.71). Among the most active states, South Dakota leads the ranking with 8 retweets
of Russian trolls (z = 3.65, p-value < 0.001), followed by Tennessee (479 retweets, z = 3.61, p-value
< 0.01) and Wyoming (19 retweets, z = 3.20, p-value = 0.019). Due to the small number of liberal
spreaders (i.e., only 1,506 liberal users engaged with retweeting Russian trolls), a similar analysis
does not produce any statistically significant results.
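The per-state deviation test can be sketched as follows. The ratios below are back-computed from the reported z-scores for illustration, and the two-tailed p-values use a normal approximation, so they will not exactly reproduce the values reported in the text.

```python
# Sketch of the deviation test on per-state retweet ratios: convert each state's
# ratio to a z-score using the reported mean (1.54) and standard deviation (0.71)
# of the ratio distribution, then take a two-tailed p-value.
from scipy import stats

mean, std = 1.54, 0.71                       # reported distribution of ratios
example_ratios = {"South Dakota": 4.13, "Tennessee": 4.10, "Wyoming": 3.81}

for state, ratio in example_ratios.items():
    z = (ratio - mean) / std
    p = 2 * stats.norm.sf(abs(z))            # two-tailed, normal approximation
    print(f"{state}: z = {z:.2f}, p = {p:.4f}")
```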
1.2 Predicting Users Who Spread Trolls' Content
In the previous chapter, I focus on describing the dataset and analyzing the role of political ideology
in shaping online political discourse. In this chapter, I focus on answering two questions. First, can
we predict which users will become susceptible to the Russian online misinformation campaign by
spreading content promoted by Russian trolls? Second, what features distinguish users who spread
trolls' messages? The goals of these questions are, first, to test whether it is possible to identify
the users who will be vulnerable to manipulation and participate in spreading the messages trolls
post; I refer to such users as spreaders. My second goal is to better understand what distinguishes
spreaders from non-spreaders. If we can predict who will become a spreader, we can design a
counter-campaign which might stop the manipulation before it achieves its goal.
I use the same dataset described in the prior section, as well as the list of trolls provided to the
public by the US Congress. In terms of spreaders, out of the forty thousand total spreaders, 28,274
of them produce original tweets (the rest only generated retweets). Overall, these twenty-eight
thousand spreaders produce over 1.5 million original tweets and over 12 million other tweets and
retweets, not counting the ones from Russian trolls.
In order to answer the questions posed in this chapter, I gather a set of features about the users.
The features can be grouped under five categories: metadata, psychological (LIWC), engagement,
activity, and other. The metadata category includes variables such as the number of followers, number
of likes, number of friends, and tweet count, whether the account is geo-enabled, has a
background image, and is verified, and the age of the account. The psychological (LIWC) features capture
emotion from language, such as positive and negative emotions, anxiety, anger, sadness, and
affection. The activity variables convey the number of characters, hashtags, mentions, and URLs
produced by users, normalized by the number of tweets they post.
The other category mainly includes two features: political ideology and bot scores. The former
uses the methodology described in the previous section and is explained at more length in chapter
three. For the bot score, I use Botometer, as mentioned in the previous section. Botometer gives
an account a score from 0 to 1 according to the account's likelihood of being an automated
account, or bot. The score depends on the following categories of features: metadata, friends,
network, temporal, content, and sentiment features. I will expand on what these categories of
features mean in chapter three. As for the engagement features, I measure user engagement in four
activities: retweets, mentions, replies, and quotes. A user's engagement is measured through three
components: the quantity, longevity, and stability of each activity.
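A minimal sketch of how such a per-user feature table might be assembled from toy inputs; the column names, toy values, and normalization details are assumptions rather than the dissertation's actual feature-engineering code.

```python
# Sketch: combine activity, metadata, and "other" features into one table,
# with one row per user (all values are placeholders).
import pandas as pd

tweets = pd.DataFrame({          # toy per-tweet records
    "user_id": [1, 1, 2],
    "n_hashtags": [2, 0, 1],
    "n_mentions": [1, 1, 0],
    "n_urls": [0, 1, 1],
    "n_chars": [120, 80, 140],
})

# Activity features: per-tweet counts averaged over each user's tweets
activity = tweets.groupby("user_id").mean().add_prefix("avg_")

metadata = pd.DataFrame({        # toy account metadata
    "user_id": [1, 2],
    "followers_count": [150, 9_000],
    "statuses_count": [2_300, 48_000],
    "verified": [False, False],
}).set_index("user_id")

other = pd.DataFrame({           # ideology label and Botometer score
    "user_id": [1, 2],
    "political_ideology": ["liberal", "conservative"],
    "bot_score": [0.12, 0.67],
}).set_index("user_id")

features = metadata.join([activity, other])   # one row per user
print(features)
```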
I use different machine learning classifiers on different models (each model includes a subset
of the features, with the full model including all the features). I was able to achieve an average
AUC score of 96% with 10-fold cross-validation in distinguishing spreaders from non-spreaders,
using Gradient Boosting for the full model on a subset of the dataset where the outcome variable
has a roughly equal number of spreaders and non-spreaders. Moreover, I verify my results on the
full dataset as well as on datasets where I increased the number of features but dropped any rows with missing
values. I was still able to achieve an average AUC score of over 90% for the full model using Gradient
Boosting. In terms of feature importance, political ideology is the most prominent for the balanced
dataset, as well as in the validation settings. Number of followers, tweet counts, and bot scores were also
among the most predictive features both in the balanced and in the other datasets.
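A minimal sketch of the cross-validated Gradient Boosting setup described above, using synthetic data in place of the real balanced feature matrix; scikit-learn defaults stand in for whatever hyperparameters were actually used.

```python
# Sketch of 10-fold cross-validated Gradient Boosting with AUC scoring,
# plus the feature-importance ranking used afterwards (synthetic data only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20,
                           weights=[0.5, 0.5], random_state=42)  # balanced toy data
clf = GradientBoostingClassifier()

auc_scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(f"average AUC: {auc_scores.mean():.3f}")

# Relative feature importance from a fitted model
clf.fit(X, y)
print(clf.feature_importances_[:5])
```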
1.3 Understanding the Spread of Misinformation and the Centrality of Trolls
This chapter improves upon the first chapter by 1) extending the span of the data to one year,
covering the period before and after the 2016 US elections, rather than just two months as in the previous chapter; and 2)
using sophisticated network analysis to understand the influence of malicious users across time. I
collect Twitter data for a year leading into the election to answer the following questions. First,
what is the role of the users' political ideology? I investigate whether political ideology affects who
engages with Russian trolls, and how that may have helped propagate trolls' content. Second, how
central are the trolls in the information-spreading network in 2016 (before and after the
US presidential election)? I offer analyses of the position of trolls and the users who spread their
messages in the retweet network progressively in time, from the beginning of 2016 to the end of
that year. Third, what is the role of social bots? I characterize whether social bots play a role
in spreading content produced by Russian trolls and, if that is the case, where on the political
spectrum the bots are situated. Fourth, do trolls succeed in specific areas of the US? I offer an extensive
analysis of the geospatial dimension and how it affects the effectiveness of the Russian interference
operation; I test whether users located within specific states participate in the consumption and
propagation of trolls' content more than others.
I obtain a dataset of over 13 million tweets generated by over a million distinct users during
2016. I successfully determine the political ideology of most of the users using label propagation
on the retweet network, with precision and recall exceeding 84%. Next, applying advanced machine
learning techniques developed to discover social bots to the users who engage with Russian
trolls, I find that bots existed among both liberal and conservative users. I perform text analysis
on the content Russian trolls disseminated, and find that conservative trolls are concerned with
particular causes, such as refugees, terrorism, and Islam, while liberal trolls write about issues
related to the police and school shootings. Additionally, I offer an extensive geospatial analysis of
tweets across the United States, showing that tweet volume is mostly proportionate to the states' population
size, as expected; however, a few outliers emerge for both liberals and conservatives.
I use network analysis to map the position of trolls and spreaders in the retweet network over
the course of 2016. I measure the influence of trolls by measuring where they are located
in the retweet network. I choose the retweet network in particular because retweeting is the main
vehicle for spreading information on Twitter. There are multiple
ways to measure the position of a user in the network he/she is embedded in. I choose the k-core
decomposition technique because it captures the notion of who is in the core of the network versus
the periphery, while giving an ordinal measure reflecting the number of connections.
I measure the centrality of trolls across time by dividing the number of trolls by the total number
of nodes in every core, for every snapshot of the network. Since I want to measure the evolution of
the trolls' importance, I construct monthly retweet networks, where every network contains all the
previous nodes and edges up to the time in question. The first snapshot of the network contains the
nodes and edges observed up to one month after the initialization of the data collection, while the
last snapshot includes all nodes and edges in the retweet network. This analysis shows that trolls
make up a substantial portion of the lower cores and remain fairly stable across time. Replicating
the analysis for spreaders shows that spreaders' centrality increases progressively across time
and that they dominate the core of the network toward the end.
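A minimal sketch of the k-core measurement described above, assuming networkx's core-number routine and a toy snapshot; in the actual analysis, each snapshot is a cumulative monthly retweet network built from the data, and the troll list comes from the Congress-released accounts.

```python
# Sketch: for a retweet-network snapshot, compute core numbers and the
# fraction of nodes in each core shell that are known trolls.
from collections import Counter
import networkx as nx

def troll_share_by_core(G, troll_ids):
    """Return {core number: fraction of that shell's nodes that are trolls}."""
    core = nx.core_number(G)                      # node -> core number
    totals = Counter(core.values())
    trolls = Counter()
    for node, k in core.items():
        if node in troll_ids:
            trolls[k] += 1
    return {k: trolls[k] / totals[k] for k in sorted(totals)}

# Toy cumulative snapshot; node 3 stands in for a troll account.
snapshot = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)])
print(troll_share_by_core(snapshot, troll_ids={3}))
```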
Chapter 2
Analyzing the Digital Traces of Political Manipulation: The 2016 Russian Interference Twitter Campaign
2.1 Introduction
Social media have helped foster democratic conversation about social and political issues: from
the Arab Spring (González-Bailón et al., 2011), to the Occupy Wall Street movement (Conover et al.,
2013) and other civil protests (González-Bailón et al., 2013, Varol et al., 2014), Twitter and other
social media platforms appeared to play an instrumental role in involving the public in policy and
political conversations by collectively framing the narratives related to particular social issues and
coordinating online and off-line activities. The use of digital media for political discussions during
presidential elections has been examined by many studies, including studies of the past four U.S. Presidential
elections (Adamic and Glance, 2005, Diakopoulos and Shamma, 2010, Bekafigo and McBride, 2013,
Carlisle and Patton, 2013, DiGrazia et al., 2013) and of other countries like Australia (Gibson and
McAllister, 2006, Bruns and Burgess, 2011) and Norway (Enli and Skogerbø, 2013). Findings that
focused on the positive effects of social media, such as increasing voter turnout (Bond et al., 2012)
or exposure to diverse political views (Bakshy et al., 2015b), contributed to the general praise of
these platforms as tools for promoting democracy and civic engagement (Shirky, 2011, Loader and
Mercea, 2011, Effing et al., 2011, Tufekci and Wilson, 2012, Tufekci, 2014).
However, concerns regarding the possibility of manipulating public opinion and spreading political
misinformation or fake news through social media were also raised early on (Howard, 2006).
These effects were later documented by several studies (Ratkiewicz et al., 2011, Conover et al., 2011,
El-Khalili, 2013, Woolley and Howard, 2016, Shorey and Howard, 2016b, Bessi and Ferrara, 2016,
Ferrara, 2017, Fourney et al., 2017). Social media have proven to be effective tools to influence
individuals' opinions and behaviors (Aral et al., 2009, Aral and Walker, 2012, Bakshy et al., 2011,
Centola, 2011, 2010), and some studies have even evaluated the current tools to combat misinformation
(Pennycook and Rand, Pennycook and Rand). Computational tools, like troll accounts and social
bots, have been designed to perform this type of influence operation at scale, by cloning or
emulating the activity of human users while operating at a much higher pace (e.g., automatically
producing content following a scripted agenda) (Hwang et al., 2012, Messias et al., 2013, Ferrara
et al., 2016, Varol et al., 2017); however, it should be noted that bots have also been used, in some
instances, for positive interventions (Savage et al., 2016, Monsted et al., 2017).
Early accounts of the adoption of bots to attempt to manipulate political communication with
misinformation started in 2010, during the U.S. midterm elections, when social bots were employed
to support some candidates and smear others; in that instance, bots injected thousands of tweets
pointing to Web sites with fake news (Ratkiewicz et al., 2011). Similar cases were reported during
the 2010 Massachusetts special election (Metaxas and Mustafaraj, 2012); these campaigns are
often referred to as Twitter bombs, or political astroturf. Unfortunately, determining
the actors behind these operations was often impossible (Kollanyi et al., 2016, Ferrara et al., 2016).
Prior to this work, only a handful of other operations were linked to specific actors (Woolley
and Howard, 2016), e.g., the alt-right attempt to smear a presidential candidate before the 2017
French election (Ferrara, 2017). This is because governments, organizations, and other entities with
sufficient resources can obtain the technological capabilities necessary to covertly deploy hundreds
or thousands of accounts and use them to either support or attack a given political target. Reverse-
engineering these strategies has proven a challenging research venue (Freitas et al., 2015, Alarifi
et al., 2016, Subrahmanian et al., 2016, Davis et al., 2016), but it can ultimately lead to techniques
to identify the actors behind these operations.
Manipulation through misinformation, or "fake news," has been gaining notoriety as a result
of the 2016 U.S. Presidential election (Allcott and Gentzkow, 2017, Shao et al., 2017, Pennycook
and Rand, 2017, Mele et al., 2017, Guess et al., 2018, Zannettou et al., 2018). Data from Facebook
and Twitter show that deceptive, made-up content, marketed as political news, was shared with
millions of Americans before the 2016 election [1,2], although only a handful of studies have examined
this phenomenon in detail (Guess et al., 2018).
[1] https://blog.twitter.com/official/en_us/topics/company/2017/Update-Russian-Interference-in-2016--Election-Bots-and-Misinformation.html
[2] https://www.theguardian.com/technology/2017/oct/30/facebook-russia-fake-accounts-126-million
One difficulty facing such studies is objectively determining what is fake news, as there is a range
of untruthfulness from simple exaggeration to outright lies. Beyond factually wrong information,
it is difficult to classify information as fake.
Rather than facing the conundrum of normative judgment and arbitrarily determining what is
fake news and what is not, in this study I focus on user intents, specifically the intent to deceive,
and their effects on the Twitter political conversation prior to the 2016 U.S. Presidential election.
Online accounts that are created and operated with the primary goal of manipulating public
opinion (for example, promoting divisiveness or conflict on some social or political issue) are commonly
known as Internet trolls (trolls, in short) (Buckels et al., 2014). To label some accounts or
sources of information as trolls, a clear intent to deceive or create conflict has to be present. A
malicious intent to harm the political process and cause distrust in the political system was evident
in 2,752 now-deactivated Twitter accounts that were later identified as being tied to Russia's
"Internet Research Agency" troll farm. The U.S. Congress released a list of these accounts as part
of the official investigation of Russian efforts to interfere in the 2016 U.S. Presidential election.
Since their intent was clearly malicious, the Russian troll accounts and their messages are the
subject of my scrutiny: I study their spread on Twitter to understand the extent of the Russian
interference effort and its effects on the election-related political discussion.
2.1.1 Research Questions
In this chapter, I aim to answer three crucial research questions regarding the effects of the interference
operation carried out by Russian trolls:
RQ1 What was the role of the users' political ideology? I will investigate whether political ideology
affected who engaged with Russian trolls, and how that may have helped propagate trolls'
content. If that was the case, I will determine if the effect was more pronounced among
liberals or conservatives, or evenly spread across the political spectrum.
RQ2 What was the role of social bots? Second, I will characterize whether social bots played a
role in spreading content produced by Russian trolls and, if that was the case, where on the
political spectrum bots were situated.
RQ3 Did trolls especially succeed in specific areas of the US? Last, I will offer an extensive
analysis of the geospatial dimension and how it affected the effectiveness of the Russian
interference operation; I will test whether users located within specific states participated in
the consumption and propagation of trolls' content more than others.
I collected Twitter data over a period of a few weeks in the months leading up to the election. By
continuously polling the Twitter Search API for relevant, election-related content using hashtag-
and keyword-based queries, I obtained a dataset of over 43 million tweets generated by about 5.7
million distinct users between September 16 and November 9, 2016. I was able to successfully
determine the political ideology of most of the users using label propagation on the retweet network,
with precision and recall exceeding 90%. Next, applying advanced machine learning techniques
developed to discover social bots (Ferrara et al., 2016, Subrahmanian et al., 2016, Davis et al.,
2016) to users who engaged with Russian trolls, I found that bots existed among both liberal and
conservative users (although it is worth noting that most of these users are conservative and
pro-Trump). I performed text analysis on the content Russian trolls disseminated, and found that
they were mostly concerned with conservative causes and were spreading pro-Trump material. Additionally,
I offer an extensive geospatial analysis of tweets across the United States, showing that
tweet volume is mostly proportionate to the states' population size, as expected, albeit a few outliers emerge,
suggesting that some Southern states may have been fertile ground for this operation.
2.1.2 Summary of Contributions
My findings presented in this work can be summarized as follows:
- I propose a novel way of measuring the consumption of manipulated content through the analysis of the activities of Russian trolls on Twitter in the run-up period to the 2016 U.S. Presidential election.
- Using network-based machine learning methods, I am able to accurately determine the political ideology of most users in my dataset, with precision and recall above 90%.
- State-of-the-art bot detection on users who engaged with Russian trolls shows that bots were engaged in both liberal and conservative domains; however, the majority of the users in my dataset are conservative, thus most bots were on the conservative side as well.
- Text analysis shows that Russian trolls were mostly promoting conservative causes and were, specifically, spreading pro-Trump material.
- I offer a comprehensive geospatial analysis showing that some states, such as Tennessee, were overly engaged with the production and diffusion of Russian trolls' content.
My comprehensive analysis indicates that although the consumption and dissemination of con-
tent produced by Russian trolls was distributed broadly over the political spectrum, it was especially
concentrated among the conservative Twitter accounts. These accounts helped amplify the oper-
ation carried out by trolls to manipulate public opinion during the period leading up to the 2016
U.S. Presidential election.
2.2 Data Collection
2.2.1 Twitter Dataset
I created a list of hashtags and keywords that relate to the 2016 U.S. Presidential election. The list was crafted to contain a roughly equal number of hashtags and keywords associated with each major Presidential candidate: I selected 23 terms, including five terms referring to the Republican Party nominee Donald J. Trump (#donaldtrump, #trump2016, #neverhillary, #trumppence16, #trump), four terms for Democratic Party nominee Hillary Clinton (#hillaryclinton, #imwithher, #nevertrump, #hillary), and several terms related to debates. To make sure my query list was comprehensive, I also added a few keywords for the two third-party candidates, including the
Libertarian Party nominee Gary Johnson (one term), and Green Party nominee Jill Stein (two
terms).
By querying the Twitter Search API at an interval of 10 seconds, continuously and without interruptions between the 15th of September and the 9th of November 2016, I collected a large dataset containing 43.7 million unique tweets posted by nearly 5.7 million distinct users. Table 2.1 reports some aggregate statistics of the dataset, while Figure 2.1 shows the timeline of the volume of tweets and users during the aforementioned period. The data collection infrastructure ran inside an Amazon Web Services (AWS) instance to ensure resilience and scalability. I chose to use the Twitter Search API to make sure that I obtained all tweets that contain the search terms of interest posted during the data collection period, rather than a sample of unfiltered tweets. This precaution avoids certain issues related to collecting sampled data using the Twitter Stream API that had been reported in the literature (Morstatter et al., 2013).
Table 2.1: Twitter Data Descriptive Statistics.
Statistic Count
# of Tweets 43,705,293
# of Retweets 31,191,653
# of Distinct Users 5,746,997
# of Tweets/Retweets with a URL 22,647,507
2.2.2 Classification of Media Outlets
I classify users by their ideology based on the political leaning of the media outlets they shared. The classification algorithm is described later in the paper; here, I describe the methodology for obtaining ground-truth labels for these outlets.
I use lists of partisan media outlets compiled by third-party organizations, such as AllSides[3] and Media Bias/Fact Check.[4] The combined list includes 249 liberal outlets and 212 conservative outlets. After cross-referencing with domains obtained in my Twitter dataset, I identified 190 liberal and 167 conservative outlets. I picked five media outlets from each partisan category that appeared most frequently in my Twitter dataset and compiled a list of users who tweeted from these outlets. The list of media outlets/domain names for each partisan category is reported in Table 2.2.
[3] https://www.allsides.com/media-bias/media-bias-ratings
[4] https://mediabiasfactcheck.com/
Figure 2.1: Timeline of the volume of tweets (in blue) and users (in red) generated during our observation
period.
Table 2.2: Liberal & Conservative Domain Names.
Liberal Conservative
www.huffingtonpost.com www.breitbart.com
thinkprogress.org www.thegatewaypundit.com
www.politicususa.com www.lifezette.com
shareblue.com www.therebel.media
www.dailykos.com theblacksphere.net
Overall, 161,907 tweets in the dataset contained a URL that pointed to one of the top-five liberal media outlets, which were tweeted by 10,636 users. For the conservative outlets, the numbers are 184,720 tweets and 7,082 users. Figures 2.2 and 2.3 show the distribution of tweets with URLs from liberal and conservative outlets, respectively. As can be seen in the figures, Huffington Post and Breitbart make up more than 60% of the total volume.
Figure 2.2: Distribution of tweets with links to the top five liberal media outlets.
Figure 2.3: Distribution of tweets with links to the top five conservative media outlets.
I used a polarity rule to label Twitter users as liberal or conservative depending on the number of tweets they produced with links to liberal or conservative sources. In other words, if a user had more tweets with URLs to liberal sources, he/she would be labeled as liberal, and vice versa. Although the overwhelming majority of users include URLs that are either liberal or conservative, I removed any users that had an equal number of tweets from each side.[5] My final set of labeled users includes 29,832 users.
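A minimal sketch of this polarity rule is shown below. The domain sets come from Table 2.2, while the function and variable names are illustrative and assume each user's tweeted URLs are available as full links (with scheme); this is a sketch of the rule described above, not the exact implementation.

```python
from urllib.parse import urlparse

LIBERAL = {"www.huffingtonpost.com", "thinkprogress.org", "www.politicususa.com",
           "shareblue.com", "www.dailykos.com"}
CONSERVATIVE = {"www.breitbart.com", "www.thegatewaypundit.com", "www.lifezette.com",
                "www.therebel.media", "theblacksphere.net"}

def label_user(urls):
    """Majority vote over partisan domains: 'liberal', 'conservative', or None for ties."""
    lib = sum(urlparse(u).netloc in LIBERAL for u in urls)       # expects full URLs, e.g. https://...
    con = sum(urlparse(u).netloc in CONSERVATIVE for u in urls)
    if lib > con:
        return "liberal"
    if con > lib:
        return "conservative"
    return None  # users with an equal number of tweets from each side are dropped

# usage (hypothetical input): labels = {u: label_user(urls) for u, urls in user_urls.items()}
```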
2.2.3 Russian Trolls
I used a list of 2,752 Twitter accounts identified as Russian trolls that was compiled and released by the U.S. Congress.[6] Table 2.3 offers some descriptive statistics of the Russian troll accounts.
Table 2.3: Descriptive Statistics on Russian trolls.
Value
# of Russian trolls 2,735
# of trolls in data 221
# of trolls who wrote original tweets 85
# of original tweets 861
Out of the accounts appearing on the list, 221 exist in my dataset, and 85 of them produced original tweets (861 tweets). Russian trolls in my dataset retweeted 2,354 other distinct users 6,457 times. Trolls retweeted each other only 51 times. Twitter users can choose to report their location in their profile. Most of the self-reported locations of accounts associated with Russian trolls were within the U.S. (however, a few provided Russian locations in their profile), and most of the tweets were from users whose location was self-reported as Tennessee and Texas (49,277 and 26,489, respectively). Russian trolls were retweeted 83,719 times, but most of these retweets were for three troll accounts only: 'TEN_GOP', 49,286; 'Pamela_Moore13', 16,532; and 'TheFoundingSon', 8,755, together making up over 89% of the times Russian trolls were retweeted. Russian trolls were retweeted by 40,224 distinct users.
[5] I used five categories (left, left center, center, right center, right) to make sure I have a final list of users who are unequivocally liberal or conservative and do not fall in the middle. The media outlet lists for the left/right center and center were compiled from the same sources.
[6] See https://www.recode.net/2017/11/2/16598312/russia-twitter-trump-twitter-deactivated-handle-list; it is a list produced by the U.S. Congress that contains Twitter handles/accounts accused of being Russian trolls working for Russia's "Internet Research Agency" troll farm.
Table 2.4: Descriptive statistics of the Retweet Network.
Statistic Count
# of nodes 4,678,265
# of edges 19,240,265
Max in-degree 278,837
Max out-degree 12,780
Density 8.79E-07
2.3 Data Analysis & Methods
2.3.1 Retweet Network
I construct a retweet network, containing nodes (Twitter users) with a directed link between them
if one user retweeted a post of another. Table 2.4 shows the descriptive statistics of the retweet
network. It is a sparse network with a giant component that includes 4,474,044 nodes.
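A compact sketch of how such a retweet network can be assembled with python-igraph (the library also used for label propagation below) is given here; it assumes tweets are the dictionaries returned by the Twitter API, where retweets carry a retweeted_status field, and is illustrative rather than the exact code used in this work.

```python
from collections import Counter
import igraph as ig

def build_retweet_network(tweets):
    """Directed edge i -> j each time user i retweets user j; edge weight = retweet count."""
    edge_counts = Counter()
    for t in tweets:
        rt = t.get("retweeted_status")
        if rt:
            edge_counts[(t["user"]["screen_name"], rt["user"]["screen_name"])] += 1
    users = sorted({u for pair in edge_counts for u in pair})
    index = {u: i for i, u in enumerate(users)}
    g = ig.Graph(directed=True)
    g.add_vertices(len(users))
    g.vs["name"] = users
    g.add_edges([(index[a], index[b]) for a, b in edge_counts])
    g.es["weight"] = list(edge_counts.values())   # how many times a retweeted b
    return g
```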
2.3.2 Label Propagation
I used label propagation[7] to classify Twitter accounts as liberal or conservative, similar to prior work (Conover et al., 2011). In a network-based label propagation algorithm, each node is assigned a label, which is updated iteratively based on the labels of the node's network neighbors. In label propagation, a node takes the most frequent label of its neighbors as its own new label. The algorithm proceeds updating labels iteratively and stops when the labels no longer change (see Raghavan et al. (2007) for more information). The algorithm takes as parameters (i) edge weights, i.e., how many times node i retweeted node j, and (ii) seeds (the list of labeled nodes). I fix the seeds' labels so they do not change in the process, since this seed list also serves as my ground truth.
I constructed a retweet network where each node corresponds to a Twitter account and a link
exists between pairs of nodes when one of them retweets a message posted by the other. I used
the 29K users mentioned in the media outlets section as seeds, those who mainly retweet messages
from either the liberal or the conservative media outlets in Table 2.2, and label them accordingly.
I then run label propagation to label the remaining nodes in the retweet network.
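A sketch of this step using python-igraph's community_label_propagation routine, which accepts initial labels and a mask of fixed (seed) vertices, is shown below. The seed dictionary, the 0/1 encoding of liberal/conservative, and the graph built in the earlier sketch are assumptions of this illustration, not the study's exact implementation.

```python
from collections import Counter

def propagate_labels(g, seed_labels):
    """seed_labels: {screen_name: 0 (liberal) or 1 (conservative)}; returns a label per account."""
    initial = [seed_labels.get(name, -1) for name in g.vs["name"]]  # -1 marks unlabeled nodes
    fixed = [lab >= 0 for lab in initial]                           # freeze the seed labels
    clusters = g.community_label_propagation(weights="weight", initial=initial, fixed=fixed)
    # community ids may be renumbered, so map each community back to the
    # majority seed label it contains
    votes = {}
    for name, comm in zip(g.vs["name"], clusters.membership):
        if name in seed_labels:
            votes.setdefault(comm, Counter())[seed_labels[name]] += 1
    comm_label = {c: v.most_common(1)[0][0] for c, v in votes.items()}
    return {name: comm_label.get(comm)                              # None if no seed reached it
            for name, comm in zip(g.vs["name"], clusters.membership)}
```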
To validate the results of the label propagation algorithm, I applied stratified 5-fold cross-validation to the set of 29K seeds. I train the algorithm on 4/5 of the seed list and see how it performs on the remaining 1/5. The precision and recall scores are around 0.91.
[7] I used the algorithm in the Python version of the igraph library (Csardi and Nepusz, 2006).
Table 2.5: Precision & Recall scores for the seed users and hyper-partisan users test sets.
Seed Users Hyper-Partisan Users
Precision 0.91 0.93
Recall 0.91 0.93
To further validate the labeling algorithm, I noticed that a group of Twitter accounts put media outlet URLs as their personal link/website. I compiled a list of these hyper-partisan Twitter users who have the domain names from Table 2.2 in their profiles and used the same approach explained in the previous paragraph (stratified 5-fold cross-validation). The precision and recall scores for the test set for these users were around 0.93. Table 2.5 shows the precision and recall scores for the two validation methods I used; both labeled more than 90% of the test set users correctly, cementing my confidence in the performance of the labeling algorithm.
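The masking procedure behind this validation can be sketched as follows: hold out one fifth of the seeds, rerun the propagation with the remaining four fifths, and score the held-out fifth. The helper propagate_labels comes from the sketch above; the exact scoring pipeline used in the study may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score

def validate_seeds(g, seed_labels, n_splits=5):
    """Stratified k-fold validation of the seed labels against propagated labels."""
    users = np.array(list(seed_labels))
    y = np.array([seed_labels[u] for u in users])
    prec, rec = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(users, y):
        train_seeds = {u: seed_labels[u] for u in users[train_idx]}
        predicted = propagate_labels(g, train_seeds)               # from the sketch above
        y_pred = [predicted.get(u, 0) for u in users[test_idx]]    # unreached users default to 0 here
        prec.append(precision_score(y[test_idx], y_pred))
        rec.append(recall_score(y[test_idx], y_pred))
    return np.mean(prec), np.mean(rec)
```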
2.3.3 Bot Detection
Determining whether a human or a bot controls a social media account has proven a very challenging task (Ferrara et al., 2016, Subrahmanian et al., 2016). I used an openly accessible solution called Botometer (a.k.a. BotOrNot) (Davis et al., 2016), consisting of both a public Web site (https://botometer.iuni.iu.edu/) and a Python API (https://github.com/IUNetSci/botometer-python), which allows for making this determination. Botometer is a machine-learning framework that extracts and analyses a set of over one thousand features, spanning content and network structure, temporal activity, user profile data, and sentiment analysis, to produce a score that suggests the likelihood that the inspected account is indeed a social bot. Extensive analysis revealed that the two most important classes of features to detect bots are, perhaps unsurprisingly, the metadata and usage statistics associated with the user accounts. The following indicators provide the strongest signals to separate bots from humans: (i) whether the public Twitter profile looks like the default one or is customized (it requires some human effort to customize the profile, therefore bots are more likely to exhibit the default profile setting); (ii) absence of GPS metadata (humans often use smartphones and the Twitter iPhone/Android app, which record the physical location of the mobile device as a digital footprint); and (iii) activity statistics such as the total number of tweets and frequency of posting (bots often exhibit incessant activity and excessive amounts of tweets), the proportion of retweets over original tweets (bots retweet content much more frequently than they generate new tweets), the proportion of followers over followees (bots usually have fewer followers and more followees), account creation date (bots are more likely to have recently created accounts), and randomness of the username (bots are likely to have randomly generated usernames).
Botometer was trained with thousands of instances of social bots, from simple to sophisticated, yielding an accuracy above 95 percent (Davis et al., 2016). Typically, Botometer returns likelihood scores above 50 percent only for accounts that look suspicious under scrupulous analysis. I adopted the Python Botometer API to systematically inspect the most active users in my dataset. The Python Botometer API queries the Twitter API to extract 300 recent tweets and publicly available account metadata, and feeds these features to an ensemble of machine learning classifiers, which produce a bot score. To label accounts as bots, I use the fifty-percent threshold, which has proven effective in prior studies (Davis et al., 2016): an account is considered to be a bot if the overall Botometer score is above 0.5.
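The scoring loop can be sketched with the botometer-python package as below. The credentials are placeholders, and both the constructor argument for the API key and the field that holds the overall score vary across Botometer versions, so they should be treated as assumptions rather than the study's exact code.

```python
import botometer

twitter_app_auth = {
    "consumer_key": "...",
    "consumer_secret": "...",
    "access_token": "...",
    "access_token_secret": "...",
}
# newer releases of the package use rapidapi_key; older ones used mashape_key
bom = botometer.Botometer(wait_on_ratelimit=True, rapidapi_key="...", **twitter_app_auth)

def is_bot(screen_name, threshold=0.5):
    """Apply the fifty-percent rule described above to a single account."""
    result = bom.check_account(screen_name)
    score = result["scores"]["universal"]   # field name is an assumption; depends on the API version
    return score > threshold
```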
2.3.4 Geo-location
There are two ways to identify the location of tweets produced by users. One way is to collect the coordinates of the location the tweets were sent from; however, this is only possible if users enable the geolocation option on their Twitter accounts. The second way is to analyze the self-reported location text in users' profiles. The latter includes substantially more noise, since many people write fictitious or imprecise locations (for example, they may identify the state and the country they reside in, but not the city).
There were 36,351 tweets with exact coordinates in my dataset. The distribution of tweets across the fifty states tended to be concentrated in the South, with Kentucky being the state with the highest number of geolocated tweets. It is hard to know why that is the case; besides, geo-tagged tweets comprise less than 0.1% of the whole dataset. Tweets and users' self-reported locations make up substantially more of my dataset than geo-tagged tweets. More than 3.8 million Twitter users provided a location in their profile, and out of those that are intelligible and located within the US, 1.6 million remained. From users' locations, I mapped over
Table 2.6: Breakdown of the Russian trolls by political ideology, with the ratio of conservative to liberal
trolls.
Liberal Conservative Ratio
# of trolls 107 108 1
# of trolls w/ original tweets 15 64 4.3
# of original tweets 44 844 19
10.5 million tweets to U.S. states. The distribution of the tweets and users seems to be as expected population-wise, although it is slightly lower than expected for the state of California, given that it is the most populous state in the nation. The top three states to originate tweets in my dataset are Texas, New York, and Florida.
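A best-effort mapping of free-text profile locations to U.S. states can look like the sketch below; the abbreviation fallback and the (truncated) state dictionary are illustrative, not the exact matching rules used for the figures reported here.

```python
import re

US_STATES = {"California": "CA", "Florida": "FL", "Kentucky": "KY", "New York": "NY",
             "Tennessee": "TN", "Texas": "TX"}     # remaining states omitted for brevity
ABBREVS = set(US_STATES.values())

def location_to_state(location):
    """Match a self-reported location string to a state code, or return None."""
    if not location:
        return None
    for name, code in US_STATES.items():
        if name.lower() in location.lower():
            return code
    match = re.search(r"\b([A-Z]{2})\b\s*$", location.strip())   # e.g. "Austin, TX"
    if match and match.group(1) in ABBREVS:
        return match.group(1)
    return None
```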
2.3.5 Activity Summary of Russian Trolls
The predicted labels for the 215 Russian troll accounts in my dataset are almost equally divided between liberal and conservative, with 107 accounts labeled as liberal and 108 labeled as conservative. However, the two groups are extremely different in terms of their activity (Table 2.6). There are only 15 liberal Russian trolls who wrote original tweets, and 64 conservative trolls who produced original content. Left-leaning trolls wrote 44 original tweets, while conservatives wrote 844 original tweets. Table 2.7 shows the top 20 stemmed words from tweets of liberal and conservative trolls, respectively.
2.4 Results
Let me address the three research questions I sketched earlier:
RQ1 What was the role of the users' political ideology?
RQ2 What was the role of social bots?
RQ3 Did trolls especially succeed in specific areas of the US?
In Section 2.4.1, I analyze how political ideology affects engagement with content posted by Russian trolls. Section 2.4.2 focuses on social bots and how they contributed to spreading content
Table 2.7: Top 20 stemmed words from the tweets of Russian trolls classied as liberal and conservative.
Liberal count Conservative count
trump 14 trumpforpresid 486
debat 10 trump 241
nevertrump 6 trumppence16 227
like 5 hillaryforprison2016 168
2016electionin3word 5 vote 127
elections2016 4 maga 113
imwithh 4 neverhillari 106
obama 3 election2016 102
need 3 hillari 100
betteralternativetodeb 3 hillaryclinton 85
women 3 trump2016 80
would 3 draintheswamp 50
vote 3 trumptrain 48
mondaymotiv 2 debat 48
last 2 realdonaldtrump 45
oh 2 electionday 43
thing 2 clinton 41
damn 2 makeamericagreatagain 34
see 2 votetrump 32
defeat 2 america 31
produced by Russian trolls. Finally, in Section 2.4.3 I show how users contributed to consumption
and propagation of trolls' content based on their location.
2.4.1 RQ1: Political Ideology
Users who rebroadcast content produced by Russian trolls, a.k.a. spreaders, tell a fascinating story (Tables 2.8 and 2.9). There are 28,274 spreaders in my dataset who engaged with Russian trolls and wrote original tweets. Spreaders also produced over 1.5 million original tweets and over 12 million tweets and retweets, not counting the ones from Russian trolls. Looking at the content of the top users, I can easily identify them as conservative; besides, the most active of them produced thousands of tweets, an unreasonably large amount in such a short period (a few weeks), thus I suspect some of them may be bots.
I next look at users' activities by political leaning. There are more than 42 thousand original tweets posted by the liberal spreaders and more than 1.5 million by conservative ones. Out of the 28 thousand spreaders who engaged with Russian trolls, 892 are liberals and 27,382 are conservatives.
Table 2.8: Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls.
Value
# of spreaders 40,224
# of times retweeted trolls 83,719
# of spreaders with original tweets 28,274
# of original tweets >1.5 Million
# of original tweets and retweets >12 Million
Table 2.9: Breakdown by political ideology of users who spread Russian trolls' content and wrote original
tweets.
Liberal Conservative Ratio
# of spreaders 892 27,382 31
# of tweets >42,000 >1.5 Million 36
The top stemmed words in the liberals' tweets indicate support for Clinton, while the conservatives' postings openly support Trump (Table 2.7). The top URLs for the liberals include media outlets such as Huffington Post and NBC News, while conservatives tweeted most frequently news from Breitbart, The Gateway Pundit, and Info Wars. As for profile URLs, liberals mostly had social network accounts, while conservatives, besides social network accounts, sometimes included links such as "www.donaldjtrump.com" and "lyingcrookedhillary.com".
2.4.2 RQ2: Social Bots
As mentioned above, some accounts exhibited suspicious activity levels. Using the approach explained in Section 2.3.3, I was able to obtain bot scores for 34,160 out of the 40,224 spreaders. The number of users that have a bot score above 0.5, and can therefore safely be considered bots according to prior work (Varol et al., 2017), stands at 2,126 accounts. Out of the 34,160 spreaders with bot scores, 1,506 are liberal, and 75 of them have bot scores above 0.5, about 4.9% of the total. As for the conservatives, there are 32,513 spreaders, 2,018 of which have bot scores greater than 0.5, representing around 6.2% of the total. Results are summarized in Table 2.10.
Twitter users rebroadcast almost exclusively political content produced by like-minded accounts, and I noted that most spreaders are conservatives. Therefore, assessing the overall influence of bots on the diffusion of Russian trolls' content in the population at large is challenging. In terms of tweet/retweet production, the fifteen hundred liberal spreaders produced nearly 225 thousand
Table 2.10: Bot analysis on spreaders (those with bot scores).
Liberal Conservative Ratio
# of spreaders 1,506 32,513 22
# of tweets 224,943 11,928,886 53
# of bots 75 2,018 27
# of tweets by bots 18,749 955,583 51
tweets/retweets with 18,749 tweets/retweets by users who have a bot score above 0.5, representing
around 8.3%. The 32 thousand conservative spreaders produced almost 12 million tweets/retweets,
with 955,583 from users with bot score above 0.5, or 8% of the total.
Figure 2.4 shows the probability density of bot scores of the liberal and conservative spreaders. Again, putting aside the disproportionate number of liberals to conservatives, the mean value of the bot scores of the conservative spreaders (0.3) is higher than the liberal one (0.24). I performed a two-sided t-test for the null hypothesis that the two distributions have identical mean values; the p-value is essentially zero, meaning that I can reject the null.
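This comparison amounts to a two-sided independent-samples t-test; a sketch with SciPy follows, where the score arrays are illustrative names and Welch's unequal-variance variant is chosen because the two groups differ so much in size.

```python
import numpy as np
from scipy.stats import ttest_ind

liberal_scores = np.asarray(liberal_spreader_bot_scores)            # illustrative variable names
conservative_scores = np.asarray(conservative_spreader_bot_scores)

# two-sided test for equal means; equal_var=False gives the Welch variant
t_stat, p_value = ttest_ind(conservative_scores, liberal_scores, equal_var=False)
print(f"means: conservative={conservative_scores.mean():.2f}, "
      f"liberal={liberal_scores.mean():.2f}, p-value={p_value:.2e}")
```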
2.4.3 RQ3: Geospatial Analysis
Figure 2.5 shows the proportion of retweets by conservative users (classified according to the label propagation algorithm and excluding bots) of the content produced by Russian trolls in each state, normalized by the total number of conservative tweets from that state. The ratio is computed as ρ_S = (T_S / P_S) × 100, where T_S is the total number of conservative retweets of trolls from a given state S, and P_S is the total number of conservative tweets from that state.
I notice that some states exhibit very high proportions of retweets per total number of tweets for conservatives. I tested the deviations by using a two-tailed t-test on the z-scores of each deviation, calculated on the distribution of ratios (average = 1.54, standard deviation = 0.71). Among the most active states, South Dakota leads the ranking with 8 retweets of Russian trolls (ρ = 3.65, p-value < 0.001), followed by Tennessee (479 retweets, ρ = 3.61, p-value < 0.01) and Wyoming (19 retweets, ρ = 3.20, p-value = 0.019). Due to the small number of liberal spreaders (i.e., only 1,506 liberal users engaged with retweeting Russian trolls), a similar analysis does not produce any statistically significant results.
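A sketch of this per-state computation is given below; it uses a normal approximation for the two-tailed p-values on the z-scores, which stands in for the test described above, and the per-state count dictionaries are assumed inputs.

```python
import numpy as np
from scipy.stats import norm

def state_ratios(troll_retweets_by_state, conservative_tweets_by_state):
    """rho_S = (T_S / P_S) * 100, plus a z-score and two-tailed p-value per state."""
    states = sorted(conservative_tweets_by_state)
    rho = {s: 100.0 * troll_retweets_by_state.get(s, 0) / conservative_tweets_by_state[s]
           for s in states}
    values = np.array([rho[s] for s in states])
    mu, sigma = values.mean(), values.std()
    z = {s: (rho[s] - mu) / sigma for s in states}
    p = {s: 2 * norm.sf(abs(z[s])) for s in states}   # two-tailed, normal approximation
    return rho, z, p
```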
Figure 2.4: Distribution of the probability density of bot scores assigned to liberal users who retweet Russian
trolls (blue) and for conservative users (red).
Figure 2.5: Proportion of the number of retweets by conservative users (excluding bots) of Russian trolls per
each state normalized by the total number of conservative tweets by state.
2.5 Conclusions
The dissemination of information and the mechanisms for democratic discussion have radically
changed since the advent of digital media, especially social media. Platforms like Twitter have
been extensively praised for their contribution to democratization of public discourse on civic and
political issues. However, many studies have also highlighted the perils associated with the abuse of
these platforms. The spread of deceptive, false, and misleading information aimed at manipulating public opinion is among those risks.
In this work, I investigated the role and effects of misinformation, using the content produced by Russian trolls on Twitter as a proxy for misinformation. I collected tweets posted during the period between September 16 and November 9, 2016 related to the U.S. Presidential election, using the Twitter Search API and a manually compiled list of keywords and hashtags. I showed that misinformation (produced by Russian trolls) was shared more widely by conservatives than liberals on Twitter. Although there were about 4 times as many Russian trolls posting conservative views as liberal ones, the former produced almost 20 times more content. In terms of users who retweeted these trolls, there were about 30 times more conservatives than liberals. Conservatives
also outproduced liberals on content, at a rate of 35:1. Using a state-of-the-art bot detection method, I estimated that about 4.9% of the liberal users and 6.2% of the conservative users are bots.
The spread of misinformation by malicious actors can have severe negative consequences. It can amplify malicious information and polarize political conversation, causing confusion and social instability. Scientists are currently investigating the consequences of such phenomena (Woolley and Howard, 2016, Shorey and Howard, 2016b). I plan to explore in detail how malicious information spreads via exposure and the role of peer effects. In conclusion, it is important to stress that, although my analysis unveiled the current state of the political debate and the agenda pushed by the Russian trolls who spread malicious information, it is impossible to account for all the malicious efforts aimed at manipulation during the last presidential election. State and non-state actors, local and foreign governments, political parties, private organizations, and even individuals with adequate resources (Kollanyi et al., 2016) could obtain operational capabilities and technical tools to construct misinformation campaigns and deploy armies of social bots to affect the direction of online conversations. Future efforts will be required from the social and computational sciences communities to study this issue in depth and develop more sophisticated detection techniques capable of unmasking and fighting these malicious efforts.
Chapter 3
Who Falls for Online Political
Manipulation?
3.1 Introduction
The initial optimism about the role of social media as a driver of social change has been fading
away, following the rise in concerns about the negative consequences of malicious behavior online.
Such negative outcomes have been particularly evident in the political domain. The spread of mis-
information (Shorey and Howard, 2016a, Tucker et al., 2017) and the increasing role of bots (Bessi
and Ferrara, 2016) in the 2016 US presidential elections has increased the interest in automatic
detection and prediction of malicious actor activity.
In this study, I focus on the role of Russian trolls in the recent US presidential elections. Trolls are usually described as users who intentionally "annoy" or "bother" others in order to elicit an emotional response. They post inflammatory messages to spread discord and cause emotional reactions (Phillips, 2015). In the context of the 2016 US election, I define trolls as users who exhibit a clear intent to deceive or create conflict. Their actions are directed to harm the political process and cause distrust in the political system. My definition captures the new phenomenon of paid political trolls who are employed by political actors for a specified goal. The most recent and important example of such a phenomenon is the Russian "troll farms": trolls paid by the Russian government to influence conversations about political issues, aimed at creating discord and hate
among different groups (Gerber and Zavisca, 2016).
Survey data from the Pew Research Center (Gottfried and Shearer, 2016) show that two-thirds of Americans get their news from social media. Moreover, they are being exposed to more political content written by ordinary people than ever before. Bakshy et al. (2015a) report that 13% of posts by Facebook users (those who report their political ideology) are political news. This raises the question of how much influence the Russian trolls had on the national political conversation prior to the 2016 US election, and how much influence such trolls will have in upcoming elections. Although I do not discuss the effect that these trolls had on the political conversation prior to the election, I focus my efforts in this paper on the following two questions:
RQ1: Can I predict which users will become susceptible to the manipulation campaign by spreading content promoted by Russian trolls?
RQ2: What features distinguish users who spread trolls' messages?
The goal of these questions is, first, to test whether it is possible to identify the users who will be vulnerable to manipulation and participate in spreading the messages trolls post. I refer to such users as spreaders in this paper. My second goal is to better understand what distinguishes spreaders from non-spreaders. If I can predict who will become a spreader, I can design a counter-campaign, which might stop the manipulation before it achieves its goal.
For this study, I collected Twitter data over a period of seven weeks in the months leading up to the election. By continuously polling the Twitter Search API for relevant, election-related content using hashtag- and keyword-based queries, I obtained a dataset of over 43 million tweets generated by about 5.7 million distinct users between September 16 and November 9, 2016. First, I cross-referenced the list of Russian trolls published by the US Congress with my dataset and found that 221 Russian troll accounts exist in my dataset. Next, I identified the list of users who retweeted the trolls. I gather important features about the users and use a machine learning framework to address the questions posed earlier.
I used different machine learning classifiers on different models (each model includes a subset of the features, with the full model including all the features). I am able to achieve an average AUC score of 96% for a 10-fold validation in terms of distinguishing spreaders from non-spreaders using Gradient Boosting for the full model on a subset of the dataset where the outcome variable has
a roughly equal number of spreaders vs. non-spreaders. Moreover, I verified my results on the full dataset as well as on datasets where I increase the features but drop any rows with missing values. I am still able to achieve over a 90% average AUC score in the full model using Gradient Boosting. In terms of feature importance, political ideology is the most prominent for the balanced dataset, as well as in the validation settings. Number of followers, statuses (number of tweets), and bot scores were also among the most predictive features, both in the balanced and the other datasets.
3.2 Related Literature
The use of trolls and bots in political manipulation campaigns around the globe is well documented
through an array of reports by mainstream media outlets and academics (see Tucker et al. (2018)
for a comprehensive review on the role of misinformation, bots, and trolls on social media). This
phenomenon is not entirely new: researchers warned about the potential for online political manipulation for over a decade (Howard, 2006, Hwang et al., 2012). Reports tracking and studying this phenomenon date back to the early 2010s (Ratkiewicz et al., 2011, Metaxas and Mustafaraj, 2012). Since then, an increasing number of such events has been recorded in the
context of several elections, both in the United States (Bessi and Ferrara, 2016, Kollanyi et al.,
2016, Shorey and Howard, 2016a, Woolley and Howard, 2016, Woolley, 2016, Marwick and Lewis,
2017, Wang et al., 2016) and all over the world, including in South America (Forelle et al., 2015, Suárez-Serrato et al., 2016), the U.K. (Howard and Kollanyi, 2016), and Italy (Cresci et al., 2017).
Although trolls do not necessarily need to be automated accounts, in many cases bots play a substantial role in political manipulation. Bessi and Ferrara (2016) report that 400k bots were responsible for posting 3.8 million tweets in the last month of the 2016 US presidential election, which is one-fifth of the total volume of the online conversations they collected. Specifically, Russian political manipulation campaigns did not only target the US: there is evidence of Russian interference in German electoral campaigns (Applebaum and Colliver, 2017), British elections (Gorodnichenko et al., 2018), and the Catalonian referendum (Stella et al., 2018). Russian-affiliated accounts were also reported in the 2017 French presidential elections, where bots were detected during the so-called MacronLeaks disinformation campaign (Ferrara, 2017). Moreover, a recent NATO report claims that around 70% of accounts tweeting in Russian and directed at Baltic countries and Poland are
bots.
Russian political manipulation online did not stop at Russia's borders. Domestically, there is strong evidence that trolls and bots were present on multiple occasions. Ananyev and Sobolev (2017) provide evidence of Russian government-affiliated trolls being able to change the direction of conversations on the LiveJournal blog, a popular platform in Russia in the 2000s. Moreover, the same entity that controlled many of the trolls studied in this paper, the Russian "troll factory" run by the Internet Research Agency, had its trolls contribute to Wikipedia in support of positions and historical narratives put forward by the current Russian government (Labzina, 2017).
Online political manipulation is not only a Russian phenomenon. There is strong evidence of similar efforts by various governments to control political discussion online, particularly with the use of bots. King et al. (2017) show that the so-called "50-centers" (low-paid government workers who work online on behalf of the Chinese government) try to distract Chinese citizens online from politically controversial topics. Even further, Miller and Gallagher (2018) estimate that Chinese astroturfers produce about 15% of all comments made on 19 popular Chinese news websites. In Korea, bots were utilized as part of a secret intelligence operation in support of the incumbent party's candidate's reelection (Keller et al., 2017).
3.3 Data Collection
3.3.1 Twitter Dataset
I created a list of hashtags and keywords that relate to the 2016 U.S. Presidential election. The
list was crafted to contain a roughly equal number of hashtags and keywords associated with each
major Presidential candidate: I selected 23 terms, including five terms referring to the Republican Party nominee Donald J. Trump (#donaldtrump, #trump2016, #neverhillary, #trumppence16, #trump), four terms for Democratic Party nominee Hillary Clinton (#hillaryclinton, #imwithher, #nevertrump, #hillary), and several terms related to debates. To make sure my query list was comprehensive, I also added a few keywords for the two third-party candidates, including the
Libertarian Party nominee Gary Johnson (one term), and Green Party nominee Jill Stein (two
terms).
By querying the Twitter Search API continuously and without interruptions between September
15 and November 9, 2016, I collected a large dataset containing 43.7 million unique tweets posted by nearly 5.7 million distinct users. Table 3.1 reports some aggregate statistics of the dataset. The data collection infrastructure ran inside an Amazon Web Services (AWS) instance to ensure resilience and scalability. I chose to use the Twitter Search API to make sure that I obtained all tweets that contain the search terms of interest posted during the data collection period, rather than a sample of unfiltered tweets. This precaution avoids known issues related to collecting sampled data using the Twitter Stream API that had been reported in the literature (Morstatter et al., 2013).
Table 3.1: Twitter Data Descriptive Statistics.
Statistic Count
# of Tweets 43,705,293
# of Retweets 31,191,653
# of Distinct Users 5,746,997
# of Tweets/Retweets with a URL 22,647,507
3.3.2 Russian Trolls
I used a list of 2,752 Twitter accounts identified as Russian trolls that was compiled and released by the U.S. Congress.[1] Table 3.2 offers some descriptive statistics of the Russian troll accounts. Out of the accounts appearing on the list, 221 exist in my dataset, and 85 of them produced original tweets (861 tweets). Russian trolls in my dataset retweeted 2,354 other distinct users 6,457 times. Trolls retweeted each other only 51 times. Twitter users can choose to report their location in their profile. Most of the self-reported locations of accounts associated with Russian trolls were within the U.S. (however, a few provided Russian locations in their profile), and most of the tweets were from users whose location was self-reported as Tennessee and Texas (49,277 and 26,489, respectively). Russian trolls were retweeted 83,719 times, but most of these retweets were for three troll accounts only: 'TEN_GOP' received 49,286 retweets; 'Pamela_Moore13', 16,532; and 'TheFoundingSon', 8,755. These three accounts make up over 89% of the times Russian trolls were retweeted. Overall, Russian trolls were retweeted by 40,224 distinct users.
[1] See https://www.recode.net/2017/11/2/16598312/russia-twitter-trump-twitter-deactivated-handle-list
Table 3.2: Descriptive Statistics on Russian trolls.
Value
# of Russian trolls 2,735
# of trolls in data 221
# of trolls who wrote original tweets 85
# of original trolls' tweets 861
3.3.3 Spreaders
Users who rebroadcast content produced by Russian trolls, hereafter referred to as spreaders, may tell a fascinating story and will thus be the subject of my further investigation. Out of the forty thousand total spreaders, 28,274 produced original tweets (the rest only generated retweets). Overall, these twenty-eight thousand spreaders produced over 1.5 million original tweets and over 12 million other tweets and retweets, not counting the ones from Russian trolls (cf. Table 3.3).
Table 3.3: Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls.
Value
# of spreaders 40,224
# of times retweeted trolls 83,719
# of spreaders with original tweets 28,274
# of original tweets >1.5 Million
# of other tweets and retweets >12 Million
3.4 Data Analysis & Methods
In order to answer the questions posed in this paper, I gather a set of features about the users
to (i) predict the spreaders with the highest accuracy possible and (ii) identify feature(s) which
best distinguish spreaders from the rest. Table 3.4 shows all the features I evaluated in this paper,
grouped under the following categories: Metadata, Linguistic Inquiry and Word Count (LIWC),
Engagement, Activity, and Other variables.
To understand what each variable in the Metadata and LIWC categories means, see the Twitter documentation page[2] and Pennebaker et al. (2015), respectively. The Activity variables convey the number of characters, hashtags, mentions, and URLs produced by users, normalized by the number of tweets they post.
[2] https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object
Table 3.4: List of features employed to characterize users in the dataset.
Metadata: # of followers, # of favourites, # of friends, Status count, Listed count, Default Profile, Geo-enabled, Background-image, Verified, Account Age
LIWC: Word Count, Positive Emotion, Negative Emotion, Anxiety, Anger, Sadness, Analytic, Clout, Affection, Tone
Engagement: Retweet variables, Mention variables, Reply variables, Quote variables
Activity: # of characters, # of hashtags, # of mentions, # of URLs
Other: Political Ideology, Bot Score, Tweet Count
Tweet Count, under Other, is the number of a user's tweets appearing in my dataset. The remaining variables are more involved and warrant a detailed explanation: I explain how the Political Ideology, Bot Score, and Engagement variables were computed in the following sections. One may wonder how much the features evaluated here correlate with each other, and whether they provide informative signals in terms of predictive power about the spreaders. Figure 3.1 shows that, besides the Engagement variables, most of the features are not highly correlated with each other (Pearson correlation is shown; results do not vary significantly for Spearman correlation). There are, however, a few notable exceptions: Word Count and Tweet Count, LIWC Positive Emotion and Affection, and Anxiety and Anger; these pairs all show very high correlation. This is not surprising, considering that these constructs are conceptually close to one another. As for the Engagement variables, I can see a "rich get richer" effect here, where users who have higher scores on some of the sub-features in the Engagement category are also higher on other sub-features. For example, by construction the Retweet h-index will be proportional to the number of times a user is retweeted, and similarly for replies, quotes, and mentions; all these Engagement features are explained in detail in Section 3.4.3.
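The correlation check behind Figure 3.1 can be reproduced along these lines, assuming a pandas DataFrame called features with one row per user and the columns of Table 3.4 (the variable name and plotting choices are illustrative):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# features: pandas DataFrame of user-level features (assumed to exist)
corr = features.corr(method="pearson")      # switch to method="spearman" for the rank version
sns.heatmap(corr, cmap="coolwarm", center=0, square=True)
plt.tight_layout()
plt.savefig("feature_correlation.png", dpi=200)
```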
Figure 3.1: Feature correlation heat map for all users in the dataset.
3.4.1 Political Ideology
Classification of Media Outlets
I classify users by their ideology based on the political leaning of the media outlets they share. I use lists of partisan media outlets compiled by third-party organizations, such as AllSides[3] and Media Bias/Fact Check.[4] The combined list includes 249 liberal outlets and 212 conservative outlets. After cross-referencing with domains obtained in my Twitter dataset, I identified 190 liberal and 167 conservative outlets. I picked five media outlets from each partisan category that appeared most frequently in my Twitter dataset and compiled a list of users who tweeted from these outlets. The list of media outlets/domain names for each partisan category is reported in Table 3.5.
[3] https://www.allsides.com/media-bias/media-bias-ratings
[4] https://mediabiasfactcheck.com/
Table 3.5: Liberal & Conservative Domain Names.
Liberal Conservative
www.huffingtonpost.com www.breitbart.com
thinkprogress.org www.thegatewaypundit.com
www.politicususa.com www.lifezette.com
shareblue.com www.therebel.media
www.dailykos.com theblacksphere.net
I used a polarity rule to label Twitter users as liberal or conservative depending on the number of tweets they produced with links to liberal or conservative sources. In other words, if a user had more tweets with links to liberal sources, he/she would be labeled as liberal, and vice versa. Although the overwhelming majority of users include links that are either liberal or conservative, I remove any users that had an equal number of tweets from each side;[5] this avoids the conundrum of breaking ties with some arbitrary rule. My final set of labeled users includes 29,832 users.
Label Propagation
I used label propagation[6] to classify Twitter accounts as liberal or conservative, similar to prior work (Conover et al., 2011). In a network-based label propagation algorithm, each node is assigned a label, which is updated iteratively based on the labels of the node's network neighbors. In label propagation, a node takes the most frequent label of its neighbors as its own new label. The algorithm proceeds updating labels iteratively and stops when the labels no longer change (see Raghavan et al. (2007) for more information). The algorithm takes as parameters (i) edge weights, i.e., how many times node i retweeted node j, and (ii) seeds (the list of labeled nodes). I fix the seeds' labels so they do not change in the process, since this seed list also serves as my ground truth.
I construct a retweet network where each node corresponds to a Twitter account and a link
exists between pairs of nodes when one of them retweets a message posted by the other. I use the
29K users mentioned in the media outlets section as seeds, those who mainly retweet messages
from either the liberal or the conservative media outlets in Table 3.5, and label them accordingly.
[5] I use five categories (left, left center, center, right center, right) to make sure I have a final list of users who are unequivocally liberal or conservative and do not fall in the middle. The media outlet lists for the left/right center and center were compiled from the same sources.
[6] I used the algorithm in the Python implementation of the igraph library (Csardi and Nepusz, 2006).
I then run label propagation to label the remaining nodes in the retweet network.
To validate the results of the label propagation algorithm, I applied stratified 5-fold cross-validation to the set of 29K seeds. I train the algorithm on four-fifths of the seed list and test how it performs on the remaining one-fifth. The average precision and recall scores are both over 91%.
To further validate the labeling algorithm, I notice that a group of Twitter accounts put media outlet URLs as their personal link/website. I compile a list of the hyper-partisan Twitter users who have the domain names from Table 3.5 in their profiles and use the same approach explained in the previous paragraph (stratified 5-fold cross-validation). The average precision and recall scores for the test set for these users are above 93%. Table 3.6 shows the average precision and recall scores for the two validation methods I use: both labeled over 90% of the test set users correctly, cementing my confidence in the performance of the labeling algorithm. Table 3.7 reports users, trolls, and spreaders by political ideology.
Table 3.6: Precision & Recall scores for the seed users and hyper-partisan users test sets.
Seed Users Hyper-Partisan Users
Precision 91% 93%
Recall 91% 93%
Table 3.7: Breakdown of overall users, trolls, and spreaders by political ideology.
Liberal Conservative
# of users >3.4 M >1 M
# of trolls 107 108
# of spreaders 1,991 38,233
3.4.2 Bot Detection
Determining whether a human or a bot controls a social media account has proven a very challenging task (Ferrara et al., 2016, Subrahmanian et al., 2016). I use an openly accessible solution called Botometer (a.k.a. BotOrNot) (Davis et al., 2016), consisting of both a public Web site (https://botometer.iuni.iu.edu/) and a Python API (https://github.com/IUNetSci/botometer-python), which allows for making this determination with high accuracy. Botometer is a machine-learning framework that extracts and analyses a set of over one thousand features,
spanning six subclasses:
User : Metadata features that include the number of friends and followers, the number of tweets produced by the user, profile description, and settings.
Friends : Four types of links are considered here: retweeting, mentioning, being retweeted, and being mentioned. For each group separately, Botometer extracts features about language use, local time, popularity, etc.
Network : Botometer reconstructs three types of networks: retweet, mention, and hashtag co-
occurrence networks. All networks are weighted according to the frequency of interactions or
co-occurrences.
Temporal : Features related to user activity, including average rates of tweet production over
various time periods and distributions of time intervals between events.
Content : Statistics about the length and entropy of tweet text, and Part-of-Speech (POS) tagging techniques, which identify different types of natural language components, or POS tags.
Sentiment : Features such as: arousal, valence and dominance scores (Warriner et al., 2013),
happiness score (Kloumann et al., 2012), polarization and strength (Wilson et al., 2005), and
emotion score (Agarwal et al., 2011).
I utilize Botometer to label all the spreaders, and I get bot scores for over 34K out of the total 40K spreaders. Since using Botometer to get scores for all non-spreaders (i.e., over 5.7M users) would take an unfeasibly long time (due to Twitter's restrictions), I randomly sample the non-spreader user list and use Botometer to get scores for a roughly equivalent-sized list of non-spreader users. The randomly selected non-spreader list includes circa 37K users. To label accounts as bots, I use the fifty-percent threshold, which has proven effective in prior studies (Davis et al., 2016): an account is considered to be a bot if the overall Botometer score is above 0.5. Figure 3.2 shows the probability distribution for spreaders vs. non-spreaders. While most of the density is under the 0.5 threshold, the mean of spreaders (0.3) is higher than the mean of non-spreaders. Additionally, I used a t-test to verify that the difference is significant at the 0.001 level (p-value).
As for the plots in Figures 3.3, 3.4, 3.5, 3.6, 3.7, and 3.8, it is evident that the spreaders are different on almost all the Botometer subclass scores, except for the temporal features. The differences in all
Figure 3.2: Probability density distributions of bot scores assigned to spreaders (red) and non-spreaders
(blue).
plots are statistically significant (p < 0.001). Besides, looking at the distributions, I can see that the differences in user characteristics (metadata), friends, and network distributions are substantial as well. Moreover, the mean of spreaders is higher in all the subclass features. Beyond knowing that spreaders are different from non-spreaders on this set of aggregated features, it is hard to tell in which way exactly without looking deeper at the features that make up these aggregated features.
3.4.3 Engagement
I plan to measure user engagement in four activities: retweets, mentions, replies, and quotes.
Engagement of a user is measured through three components: the quantity, longevity, and stability
Figure 3.3: Content
in each activity. For instance, for a set of N users, this measure would calculate the engagement index score of a user i ∈ N by including the following:
1) the number of retweets, replies, mentions, and quotes of user i by the other users in N;
2) the time difference between the last and the first quote, reply, and retweet per tweet;
3) the consistency of mentioning, replying to, retweeting, and quoting user i by the other users in N across time (per day);
4) the number of unique users who retweeted, commented on, mentioned, and quoted user i.
Item three is measured using the h-index (Hirsch, 2005). The measure captures two notions: how highly referenced and how continuously highly referenced a user is by other members of the network (Lietz et al., 2014). This measure was originally proposed to quantify an individual's scientific research output. In this context, a user has index h if for h days he/she is referenced at least h times, and in all but h days no more than h times.
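A small sketch of this daily h-index computation follows; the (user, day) event format is an assumption about how the retweet log is stored, and the same pattern would apply to replies, mentions, and quotes.

```python
from collections import Counter, defaultdict

def h_index(daily_counts):
    """Largest h such that the user is referenced at least h times on at least h days."""
    h = 0
    for i, c in enumerate(sorted(daily_counts, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

def retweet_h_index(retweet_events):
    """retweet_events: iterable of (retweeted_user, day) pairs (illustrative format)."""
    per_user_day = Counter(retweet_events)              # (user, day) -> count
    per_user = defaultdict(list)
    for (user, _day), count in per_user_day.items():
        per_user[user].append(count)
    return {user: h_index(counts) for user, counts in per_user.items()}
```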
Figure 3.4: Friend
3.5 Results
Predicting spreaders on the original dataset may be considered a daunting task: only a relatively small fraction of users engaged with Russian trolls' content (about 40K out of 5.7M users). However, for the same reason, if a model were to trivially predict that no user will ever engage with Russian trolls, the model would be accurate most of the time (i.e., most users won't be spreaders), even if its recall would be zero (i.e., the model would never correctly predict any actual spreaders); given that I want to predict spreaders, such a model would not be very useful in practice. In other words, my setting is a typical machine-learning example of a highly unbalanced prediction task.
To initially simplify the prediction task, I created a balanced dataset that is limited to users who have bot scores.[7] This balanced dataset has about 72K users, with 34K spreaders and 38K non-spreaders. To test the ability to detect spreaders and to see which features are most important in distinguishing between the two groups, I leverage multiple classifiers and multiple models: the
[7] I will get back to the original prediction task on the highly unbalanced dataset later in this section.
Figure 3.5: Network
first model serves as a baseline, with each subsequent model including more variables until I reach the full model. Since my goal was not to devise new techniques, I used four off-the-shelf machine learning algorithms: Extra Trees, Random Forest, Adaptive Boosting, and Gradient Boosting. I train the classifiers using stratified 10-fold cross-validation with the following preprocessing steps: (i) replace all categorical missing values with the most frequent value in the column; (ii) replace numerical missing values with the mean of the column (a sketch of this pipeline is given after Table 3.8).
Table 3.8 shows all the models I evaluate, from the simplest baseline model (Metadata) to the full model that includes all the features I present in Table 3.4.
Table 3.8: Machine Learning Models from the Baseline (Metadata) to the Full Model.
Model Features
1 Metadata
2 Metadata + LIWC
3 Metadata + LIWC + Activity
4 Metadata + LIWC + Activity + Engagement
5 Metadata + LIWC + Activity + Engagement + Other
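A sketch of this training setup with scikit-learn is shown below. The feature matrix X and the binary target y (1 = spreader) are assumed to be a pandas DataFrame and array built from the features in Table 3.4; the imputation mirrors the preprocessing described above, while the classifier hyperparameters are left at their defaults, so this is illustrative rather than the study's exact configuration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X (DataFrame) and y (0/1 array) are assumed inputs
numeric_cols = X.select_dtypes(include=np.number).columns
categorical_cols = X.columns.difference(numeric_cols)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),                 # mean-impute numeric gaps
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),    # mode-impute categoricals
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

model = Pipeline([("prep", preprocess),
                  ("gbm", GradientBoostingClassifier(random_state=0))])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC = {auc_scores.mean():.3f}")
```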
Figure 3.6: Sentiment
For Gradient Boosting, which is the best performing classifier among the four I evaluate, I obtained average AUC scores for the 10 folds that range from 85% to 96%. Figure 3.9 shows the ROC curve plots for each model (using the fold with the highest AUC score among the trained ones). The jump in AUC from 89% to 96% between Model 4 and Model 5 shows that the addition of bot scores and political ideology is meaningful in distinguishing spreaders from non-spreaders (the legend in Figure 3.9 shows the average AUC score for each model). To better understand the contribution of the features in predicting the target values (i.e., spreader vs. non-spreader), I look at the variable importance plot of the Gradient Boosting results for Model 5. The variable importance plot (cf. Figure 3.10) provides a list of the most significant variables in descending order by mean decrease in the Gini criterion. The top variables contribute more to the model than the bottom ones and can discriminate better between spreaders and non-spreaders. In other words, features are ranked based on their predictive power according to the given model. Figure 3.10 shows that, according to Model 5 and Gradient Boosting, political ideology is the most predictive feature, followed by number of followers, statuses/tweets count (obtained from the
Figure 3.7: Temporal
metadata), and bot score, in descending order of importance. The plot does not show all the
features, since the omitted features contribute very little to the overall predictive power of the
model.
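The ranking itself can be read off the fitted classifier's impurity-based importances, as in this short continuation of the sketch above (get_feature_names_out requires a recent scikit-learn, and the top features named in the comment are examples, not exact output):

```python
import pandas as pd

model.fit(X, y)                                   # fit once on the balanced dataset
gbm = model.named_steps["gbm"]
names = model.named_steps["prep"].get_feature_names_out()
importances = pd.Series(gbm.feature_importances_, index=names).sort_values(ascending=False)
print(importances.head(10))                       # e.g. political ideology, followers count, ...
```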
Feature importance plots reveal which features contribute most to classification performance, but they do not describe the nature of the relationship between the outcome variable and the predictors. Although predictive models are sometimes used as black boxes, partial dependence plots (cf. Figures 3.11 & 3.12) can tell a lot about the structure and direction of the relationship between the target and independent variables. They show these relationships after the model is fitted, while marginalizing over the values of all other features. The x-axis captures the range of a given feature, with the feature values normalized between 0 and 1.[8]
Using partial dependence, I illustrate that the target variable (spreader) has positive relationships with the following features: political ideology, statuses count, bot scores, and friends count.
[8] Political ideology should be considered in the range from 0 (to identify left-leaning users) to 1 (for right-leaning ones).
Figure 3.8: User
Figure 3.11 visualizes these relationships (I put political ideology on a different y-axis in order to show that its magnitude of influence on the target variable is significantly higher compared to all other features, including the downward-trend features in Figure 3.12). This suggests that moving from left to right political leaning increases the probability of being a spreader; a larger number of posts, more friends (a.k.a. followees), and higher bot scores are also associated with a higher likelihood of being a spreader.
On the other hand, I can see that the outcome variable has a negative relationship with followers count, account age, characters count, and word count, as shown in Figure 3.12. This means that having fewer followers, having a recently created account, and posting shorter tweets with fewer words are all characteristics associated with a higher probability of being a spreader.
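These one-way relationships can be drawn with scikit-learn's partial dependence utilities, as in the sketch below; the column names are illustrative, and model and X are the fitted pipeline and feature table from the earlier sketches.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# column names are assumptions about how the features are stored
features_of_interest = ["political_ideology", "statuses_count", "bot_score", "friends_count"]
PartialDependenceDisplay.from_estimator(model, X, features_of_interest)
plt.tight_layout()
plt.show()
```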
Going back to the original highly unbalanced dataset, I aim to validate the results above using two strategies: (i) I run Gradient Boosting (with the same preprocessing steps) on the whole dataset of 5.7M users for the five models I outlined in Table 3.8; (ii) I run the Gradient Boosting classifier on models without imputation and with all rows containing missing values deleted. For the first approach,
Figure 3.9: Area under the ROC curve for the five models under evaluation using Gradient Boosting. For each model I plot the fold that yields the highest AUC among the trained ones. It is evident that the addition of the bot score, political ideology, and tweet count variables is important in improving the performance of the classifiers. The legend shows the average AUC scores for each model.
50
Figure 3.10: Relative importance of the features using Gradient Boosting for the full model (best performing
fold) in predicting users as spreaders vs. non-spreaders. Political Ideology explains over 25% of the variance,
followed by Followers Count, Statuses Counts, and Bot Scores, each explaining roughly 5% to 10% of the
variance.
the average AUC scores ranged form 83% for the baseline model to 98% for the full model. For the
second approach, due to the sparsity of some features, the overall number of observations decreases
signicantly when these features are added. Putting the overall number of observations aside, the
average ROC scores for a 10-fold validation for the roughly same set of models specied earlier range
from 84% to 91%. In terms of feature importance, political ideology is again the most important
in the full model, with status count and bot scores following it in importance. In summary, the
results above remain consistent when validating on the highly-unbalanced prediction task.
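The 10-fold validation described here can be sketched as follows; the feature matrix is random placeholder data, so the code only illustrates the evaluation loop (stratified folds scored by ROC AUC with scikit-learn), not the reported numbers.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 6))                              # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > 0).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv, scoring="roc_auc")
print(aucs.mean(), aucs.std())                           # average AUC across the 10 folds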
3.6 Discussion and Conclusion
The results in the previous section show that (i) with some insight on users who spread or produce malicious content, I am able to predict the users that will spread their message to a broader audience; (ii) in the case I focus on, the 2016 US presidential elections, political ideology was highly predictive of who is going to spread trolls' messages vs. not. Moreover, looking at the top predictive features in Figure 3.10, basic metadata features give a strong signal in terms of differentiating spreaders from non-spreaders, along with the bot score. Looking at the subclass features from Botometer, Figures 3.4, 3.5, and 3.8 show that spreaders and non-spreaders are significantly different on the dimensions of friends, network, and user metadata, with spreaders having higher bot scores on all three (thus, they have a higher likelihood of being a bot according to those subclasses).

Figure 3.11: Upward Trends

Figure 3.12: Downward Trends
Looking at the partial dependence plots, I can deduce that spreaders write a lot of tweets (counting retweets as well), have higher bot scores, and tend to be more conservative (conservative is labeled as the highest numerical value in the political ideology feature). Also, since the range of the y-axis tells about the range of influence a feature has on the target value, it is evident that political ideology has by far the most influence on distinguishing between spreaders and non-spreaders. On the other hand, I can also deduce that spreaders do not write much original content, tend not to have that many followers, and have more recently established user accounts. In the downward trends in Figure 3.12, I can see that followers count and account age have more influence on the target value in comparison to the other features in this plot.
To conclude, this paper focused on predicting spreaders who fall for online manipulation campaigns. I believe that identifying likely victims of political manipulation campaigns is the first step in containing the spread of malicious content. Access to reliable and trustworthy information is a cornerstone of any democratic society. The declining trust of citizens of democratic societies in mainstream news and their increased exposure to content produced by ill-intended sources pose a great danger to democratic life. The social science literature provides ample evidence that mis-perceptions at the individual level can aggregate into a distortion of collective public opinion (Bartels, 2002, Baum and Groeling, 2009), which can have severe policy implications (Fritz et al., 2004, Flynn et al., 2017). Thus, I believe that studying how political manipulation content spreads and who spreads it is extremely important, and it is an issue that many social media platforms should attempt to contain.
Chapter 4
Characterizing the 2016 Russian IRA Influence Campaign
4.1 Introduction
Social media have helped foster democratic conversation about social and political issues: from the Arab Spring (González-Bailón et al., 2011), to the Occupy Wall Street movements (Conover et al., 2013) and other civil protests (González-Bailón et al., 2013, Varol et al., 2014, Stella et al., 2018), Twitter and other social media platforms appeared to play an instrumental role in involving the public in policy and political conversations by collectively framing the narratives related to particular social issues, and coordinating online and off-line activities. The use of digital media for political discussions during presidential elections is examined by many studies, including the past four U.S. Presidential elections (Adamic and Glance, 2005, Diakopoulos and Shamma, 2010, Bekafigo and McBride, 2013, Carlisle and Patton, 2013, DiGrazia et al., 2013), and other countries like Australia (Gibson and McAllister, 2006, Bruns and Burgess, 2011), and Norway (Enli and Skogerbø, 2013). Findings that focus on the positive effects of social media, such as increasing voter turnout (Bond et al., 2012) or exposure to diverse political views (Bakshy et al., 2015b), contribute to the general praise of these platforms as a tool for promoting democracy and civic engagement (Shirky, 2011, Loader and Mercea, 2011, Effing et al., 2011, Tufekci and Wilson, 2012, Tufekci, 2014).
However, concerns regarding the possibility of manipulating public opinion and spreading political propaganda or fake news through social media were also raised early on (Howard, 2006). These effects are documented by several studies (Ratkiewicz et al., 2011, Conover et al., 2011, El-Khalili, 2013, Woolley and Howard, 2016, Shorey and Howard, 2016b, Bessi and Ferrara, 2016, Ferrara, 2017, Fourney et al., 2017). Social media have been proven to be an effective tool to influence individuals' opinions and behaviors (Aral et al., 2009, Aral and Walker, 2012, Bakshy et al., 2011, Centola, 2011, 2010), and some studies even evaluate the current tools to combat misinformation (Pennycook and Rand). Computational tools, like troll accounts and social bots, have been designed to perform this type of influence operation at scale, by cloning or emulating the activity of human users while operating at a much higher pace (e.g., automatically producing content following a scripted agenda) (Hwang et al., 2012, Messias et al., 2013, Ferrara et al., 2016, Varol et al., 2017, Ferrara, 2018); however, it should be noted that bots have also been used, in some instances, for positive interventions (Savage et al., 2016, Monsted et al., 2017).
Early accounts of the adoption of bots to attempt to manipulate political communication with misinformation started in 2010, during the U.S. midterm elections, when social bots were employed to support some candidates and smear others; in that instance, bots injected thousands of tweets pointing to Web sites with fake news (Ratkiewicz et al., 2011). Similar cases are reported during the 2010 Massachusetts special election (Metaxas and Mustafaraj, 2012); these campaigns are often referred to as Twitter bombs, or political astroturf (Ferrara et al., 2016, Varol et al., 2017). Unfortunately, oftentimes determining the actors behind these operations was impossible (Kollanyi et al., 2016, Ferrara et al., 2016). Prior to this work, only a handful of other operations were linked to specific actors (Woolley and Howard, 2016), e.g., the alt-right attempt to smear a presidential candidate before the 2017 French election (Ferrara, 2017). This is because governments, organizations, and other entities with sufficient resources can obtain the technological capabilities necessary to covertly deploy hundreds or thousands of accounts and use them to either support or attack a given political target. Reverse-engineering these strategies has proven a challenging research avenue (Freitas et al., 2015, Alarifi et al., 2016, Subrahmanian et al., 2016, Davis et al., 2016), but it can ultimately lead to techniques to identify the actors behind these operations.
One difficulty facing such studies is objectively determining what is fake news, as there is a range of untruthfulness from simple exaggeration to outright lies. Beyond factually wrong information, it is difficult to classify information as fake.
Rather than facing the conundrum of normative judgment and arbitrarily determining what is fake news and what is not, in this study I focus on user intents, specifically the intent to deceive, and their effects on the Twitter political conversation prior to the 2016 U.S. Presidential election. Online accounts that are created and operated with the primary goal of manipulating public opinion (for example, promoting divisiveness or conflict on some social or political issue) are commonly known as Internet trolls (trolls, in short) (Buckels et al., 2014). To label some accounts or sources of information as trolls, a clear intent to deceive or create conflict has to be present. A malicious intent to harm the political process and cause distrust in the political system was evident in 2,752 now-deactivated Twitter accounts that were later identified as being tied to Russia's "Internet Research Agency" troll farm, which was also active on Facebook (Dutt et al., 2018). The U.S. Congress released a list of these accounts as part of the official investigation of Russian efforts to interfere in the 2016 U.S. Presidential election.

Since their intent is clearly malicious, the Russian troll accounts and their messages are the subject of my scrutiny: I study their spread on Twitter to understand the extent of the Russian interference effort and its effects on the election-related political discussion.
4.1.1 Research Questions
In this paper, I aim to answer four crucial research questions regarding the effects of the interference operation carried out by Russian trolls:

RQ1 What is the role of the users' political ideology? I investigate whether political ideology affects who engages with Russian trolls, and how that may have helped propagate trolls' content. If that is the case, I will determine if the effect is more pronounced among liberals or conservatives, or evenly spread across the political spectrum.

RQ2 How central are the trolls in the network of spreading information in the year of 2016 (pre- and post-US presidential election)? I offer analyses of the position of trolls and the users who spread their messages in the retweet network progressively in time, from the beginning of 2016 to the end of that year.

RQ3 What is the role of social bots? I characterize whether social bots play a role in spreading content produced by Russian trolls and, if that is the case, where on the political spectrum the bots are situated.

RQ4 Do trolls succeed in specific areas of the US? I offer an extensive analysis of the geospatial dimension and how it affects the effectiveness of the Russian interference operation; I test whether users located within specific states participate in the consumption and propagation of trolls' content more than others.
This paper improves upon my previous work (Badawy et al., 2018b) by 1) extending the span of the data to one year before and after the 2016 US elections rather than just two months as in the previous paper; 2) using sophisticated network analysis to understand the influence of malicious users across time. I collect Twitter data for a year leading into the election. I obtained a dataset of over 13 million tweets generated by over a million distinct users in the year of 2016. I successfully determine the political ideology of most of the users using label propagation on the retweet network, with precision and recall exceeding 84%. Next, using advanced machine learning techniques developed to discover social bots (Ferrara et al., 2016, Subrahmanian et al., 2016, Davis et al., 2016) applied to users who engage with Russian trolls, I find that bots exist among both liberal and conservative users. I perform text analysis on the content Russian trolls disseminated, and find that conservative trolls are concerned with particular causes, such as refugees, terrorism, and Islam, while liberal trolls write about issues related to the police and school shootings. Additionally, I offer an extensive geospatial analysis of tweets across the United States, showing that it is mostly proportionate to the states' population size, as expected; however, a few outliers emerge for both liberals and conservatives.
4.1.2 Summary of Contributions
Findings presented in this work can be summarized as:

- I propose a novel way of measuring the consumption of manipulated content through the analysis of activities of Russian trolls on Twitter in the year of 2016.
- Using network-based machine learning methods, I accurately determine the political ideology of most users in the dataset, with precision and recall above 84%.
- I use network analysis to map the position of trolls and spreaders in the retweet network over the year of 2016. I define spreaders in this paper as users who retweet trolls at least once. Although retweeting a troll once does not make a user malicious, I use this term to focus on the act of spreading the message of trolls, not the intention.
- State-of-the-art bot detection on users who engage with Russian trolls shows that bots are engaged in both liberal and conservative domains.
- Text analysis shows that conservative Russian trolls mostly promote conservative causes regarding refugees, terrorism, and Islam, as well as talking about Trump, Clinton, and Obama. For liberal trolls, the discussion is focused on school shootings and the police, but Trump and Clinton are among the top words used as well.
- I offer a comprehensive geospatial analysis showing that certain states over-engaged with the production and diffusion of Russian trolls' content.
4.2 Data Collection
4.2.1 Twitter Dataset
To collect Twitter data about the Russian trolls, I use a list of 2,752 Twitter accounts identified as Russian trolls that was compiled and released by the U.S. Congress (https://www.recode.net/2017/11/2/16598312/russia-twitter-trump-twitter-deactivated-handle-list). To collect the tweets, I use Crimson Hexagon (https://www.crimsonhexagon.com/), a social media analytics platform that provides paid datastream access. This tool allows me to obtain tweets and retweets produced by trolls and subsequently deleted in the year of 2016. Table 4.1 offers some descriptive statistics of the Russian troll accounts. Out of the accounts appearing on the list, 1,148 accounts exist in the dataset, and a little over a thousand of them produced more than half a million original tweets.

Table 4.1: Descriptive Statistics of Russian trolls.
# of Russian trolls: 2,752
# of trolls in data: 1,148
# of trolls who wrote original tweets: 1,032
# of original tweets: 538,166

I also collect users that did not retweet any troll, since this helps provide a better understanding of trolls' behaviour online vs. normal users and how it affects the overall discourse on Twitter. I collect non-trolls' tweets using two strategies. First, I collect tweets of such users using a list of hashtags and keywords that relate to the 2016 U.S. Presidential election. This list is crafted to contain a roughly equal number of hashtags and keywords associated with each major Presidential candidate: I select 23 terms, including five terms referring to the Republican Party nominee Donald J. Trump (#donaldtrump, #trump2016, #neverhillary, #trumppence16, #trump), four terms for Democratic Party nominee Hillary Clinton (#hillaryclinton, #imwithher, #nevertrump, #hillary), and several terms related to debates. To make sure the query list was comprehensive, I add a few keywords for the two third-party candidates, including the Libertarian Party nominee Gary Johnson (one term) and Green Party nominee Jill Stein (two terms). My second strategy is to collect tweets from the same users that do not include the key terms mentioned above.

Table 4.2: Twitter Data Descriptive Statistics.
# of Tweets: 13,631,266
# of Retweets: 10,556,421
# of Distinct Users: 1,089,974
# of Tweets/Retweets with a URL: 10,621,071

Table 4.2 reports some aggregate statistics of the data. It shows that the number of retweets and of tweets with URLs is quite high, more than 3/4 of the dataset. Figure 4.1 shows the timeline of the volume of tweets and of the users who produced these tweets in 2016, with a spike around the time of the election.
4.2.2 Classification of Media Outlets
I classify users by their ideology based on the political leaning of the media outlets they share. The classification algorithm is described in Section 4.3.1. In this section, I describe the methodology of obtaining ground truth labels for the media outlets.
Figure 4.1: Timeline of the volume of tweets (in blue) generated during the observation period and of the users who produced these tweets (in red).
Table 4.3: Liberal & Conservative Domain Names (excluding left-center and right-center)
Liberal Conservative
change.org dailycaller.com
cnn.com nypost.com
dailykos.com americanthinker.com
democracynow.org bizpacreview.com
huffingtonpost.com breitbart.com
nydailynews.com dailymail.co.uk
rawstory.com judicialwatch.org
rollingstone.com nationalreview.com
thedailybeast.com thegatewaypundit.com
theroot.com
I use lists of partisan media outlets compiled by third-party organizations, such as AllSides (https://www.allsides.com/media-bias/media-bias-ratings) and Media Bias/Fact Check (https://mediabiasfactcheck.com/). I combine liberal and liberal-center media outlets into one list and conservative and conservative-center into another. The combined list includes 641 liberal and 398 conservative outlets. However, in order to cross-reference these media URLs with the URLs in the Twitter dataset, I need to get the long URLs for most of the links in the dataset, since most of them are shortened. As this process is quite time-consuming, I take the top 5,000 URLs by count and then retrieve the long version for those. These top 5,000 URLs account for more than 2.1 million, or around 1/5, of the URLs in the dataset.
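A hypothetical helper for this un-shortening step is sketched below: it follows HTTP redirects with a HEAD request via the requests library and falls back to the original string on failure. The example short link is made up, and a production run over 5,000 URLs would need batching, caching, and retry logic.

import requests

def expand_url(short_url: str, timeout: float = 5.0) -> str:
    """Follow redirects to recover the final (long) URL; fall back on error."""
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return resp.url
    except requests.RequestException:
        return short_url

print(expand_url("https://bit.ly/example-shortlink"))  # placeholder shortened URL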
After cross-referencing the 5,000 long URLs with the media URLs, I observe that 7,912 tweets in the dataset contain a URL that points to one of the liberal media outlets and 29,846 tweets contain a URL pointing to one of the conservative media outlets. Figures 4.2 & 4.3 show the distributions of tweets with URLs from liberal and conservative outlets. As can be seen in the figures, American Thinker dominated the URLs being shared in the conservative sample under study, while in the liberal sample more media outlets were equally represented. Table 4.3 shows the list of the left and right media outlets/domain names after removing left/right-center outlets from the list.

Figure 4.2: Distribution of tweets with links to liberal media outlets.

Figure 4.3: Distribution of tweets with links to conservative media outlets.

I use a majority rule to label Twitter users as liberal or conservative depending on the number of tweets they produce with links to liberal or conservative sources. In other words, if a user has more tweets with URLs to liberal sources, he/she is labeled as liberal, and vice versa. Although the overwhelming majority of users include URLs that are either liberal or conservative, I remove any users that have an equal number of tweets from each side. The final set of labeled users includes 10,074 users.
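The majority rule itself reduces to a per-user count, as in the minimal pandas sketch below; the tiny DataFrame is invented for illustration, and users with ties are dropped, mirroring the rule described above.

import pandas as pd

# One row per tweet: the posting user and the leaning of the linked outlet (toy data).
tweets = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "leaning": ["liberal", "liberal", "conservative",
                "conservative", "conservative", "liberal", "conservative"],
})

counts = (tweets.groupby(["user_id", "leaning"]).size()
                 .unstack(fill_value=0)
                 .reindex(columns=["liberal", "conservative"], fill_value=0))

# Majority rule; users with an equal number of liberal and conservative links are dropped.
labels = counts.idxmax(axis=1)[counts["liberal"] != counts["conservative"]]
print(labels)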
4.3 Methods and Data Analysis
4.3.1 Label Propagation
I use label propagation, as implemented in the Python version of the igraph library (Csardi and Nepusz, 2006), to classify Twitter accounts as liberal or conservative, similar to prior work (Conover et al., 2011). In a network-based label propagation algorithm, each node is assigned a label, which is updated iteratively based on the labels of the node's network neighbors. In label propagation, a node takes the most frequent label of its neighbors as its own new label. The algorithm proceeds by updating labels iteratively and stops when the labels no longer change (see Raghavan et al. (2007) for more information). The algorithm takes as parameters (i) weights, i.e., the in-degree or how many times node i retweets node j, and (ii) seeds (the list of labeled nodes). I fix the seeds' labels so they do not change in the process, since this seed list also serves as the ground truth.
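A minimal sketch of seeded label propagation with python-igraph is shown below. The toy edge list and the two seed labels are invented; the real network is the weighted, directed retweet graph described next, so this simplified example only illustrates how fixed seed labels are passed to the algorithm.

import igraph as ig

# Toy graph standing in for the retweet network.
g = ig.Graph(edges=[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])

# 0 = liberal seed, 1 = conservative seed, -1 = unlabeled.
initial = [0, -1, -1, -1, -1, 1]
fixed = [lab != -1 for lab in initial]     # seed labels never change during propagation

clusters = g.community_label_propagation(initial=initial, fixed=fixed)
print(clusters.membership)                 # propagated label for every node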
I construct a retweet network, containing nodes (Twitter users) with a directed link between them if one user retweets a post of another. Table 4.4 shows the descriptive statistics of the retweet network. It is a sparse network of over 1.4 million distinct users, with a giant component that includes 1,357,493 nodes. The number of distinct users in the retweet network is larger than the number of unique users in the dataset, because the users in the retweet network include retweeted users collected from the text of the tweets, while the distinct number of users in the dataset counts only the users I have tweets for, both original and retweets.
Table 4.4: Descriptive statistics of the Retweet Network.
Statistic Count
# of nodes 1,407,190
# of edges 4,874,786
Max in-degree 198,262
Max out-degree 8,458
Density 2.46e-06
I use more than 10,000 users mentioned in the media outlets section as seeds (those who mainly retweet messages from either the liberal or the conservative media outlets in Figures 4.2 & 4.3) and label them accordingly. I then run label propagation to label the remaining nodes in the retweet network. To validate the results of the label propagation algorithm, I apply stratified 5-fold cross-validation to the set of more than 10,000 seeds. I train the algorithm on 4/5 of the seed list and see how it performs on the remaining 1/5. The average precision and recall scores for the 5 folds are around 0.84. Since I combine liberal and liberal-center into one list (same for conservatives), I can see that the algorithm is not only labeling the far liberal or conservative users correctly, which is a relatively easier task, but is performing well on the liberal/conservative center as well.
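The seed-holdout check can be sketched as follows: hide one portion of the seed labels, re-run the propagation, and score the hidden seeds. The graph, seed ids, and labels below are toy values, and only 3 folds are used so the tiny example works; the dissertation's validation uses 5 folds on more than 10,000 seeds, so treat this purely as an illustration of the procedure.

import numpy as np
import igraph as ig
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score

# Two loosely connected clusters standing in for the retweet network.
edges = [(0, 1), (1, 2), (2, 3), (0, 2),      # "liberal" side
         (4, 5), (5, 6), (6, 7), (4, 6),      # "conservative" side
         (3, 4)]                              # one bridge edge
g = ig.Graph(edges=edges)

seed_ids = np.array([0, 1, 2, 5, 6, 7])       # vertex ids of labeled seeds
seed_labels = np.array([0, 0, 0, 1, 1, 1])    # 0 = liberal, 1 = conservative

precisions, recalls = [], []
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(seed_ids, seed_labels):
    initial = [-1] * g.vcount()
    for i in train_idx:                       # reveal only the training seeds
        initial[seed_ids[i]] = int(seed_labels[i])
    fixed = [lab != -1 for lab in initial]
    membership = g.community_label_propagation(initial=initial, fixed=fixed).membership
    # Map community ids back to seed labels via the revealed (training) seeds.
    id_to_label = {membership[seed_ids[i]]: int(seed_labels[i]) for i in train_idx}
    pred = [id_to_label[membership[seed_ids[i]]] for i in test_idx]
    true = seed_labels[test_idx]
    precisions.append(precision_score(true, pred, zero_division=0))
    recalls.append(recall_score(true, pred, zero_division=0))

print(np.mean(precisions), np.mean(recalls))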
4.3.2 Bot Detection
Determining whether a human or a bot controls a social media account has proven a very challenging task (Ferrara et al., 2016, Subrahmanian et al., 2016, Kudugunta and Ferrara, 2018). I use an openly accessible solution called Botometer (a.k.a. BotOrNot) (Davis et al., 2016), consisting of both a public Web site (https://botometer.iuni.iu.edu/) and a Python API (https://github.com/IUNetSci/botometer-python), which allows for making this determination with high accuracy. Botometer is a machine-learning framework that extracts and analyzes a set of over one thousand features, spanning six subclasses:
User: Meta-data features that include the number of friends and followers, the number of tweets produced by the user, profile description, and settings.
Friends: Four types of links are considered here: retweeting, mentioning, being retweeted, and being mentioned. For each group separately, Botometer extracts features about language use, local time, popularity, etc.
Network: Botometer reconstructs three types of networks: retweet, mention, and hashtag co-occurrence networks. All networks are weighted according to the frequency of interactions or co-occurrences.
Temporal: Features related to user activity, including average rates of tweet production over various time periods and distributions of time intervals between events.
Content: Statistics about the length and entropy of tweet text and Part-of-Speech (POS) tagging techniques, which identify different types of natural language components, or POS tags.
Sentiment: Features such as arousal, valence, and dominance scores (Warriner et al., 2013), happiness score (Kloumann et al., 2012), polarization and strength (Wilson et al., 2005), and emotion score (Agarwal et al., 2011).
Botometer is trained with thousands of instances of social bots, from simple to sophisticated, yielding an accuracy above 95 percent (Davis et al., 2016). Typically, Botometer returns likelihood scores above 50 percent only for accounts that look suspicious under scrupulous analysis. I adopted the Python Botometer API to systematically inspect the most active users in the dataset. The Python Botometer API queries the Twitter API to extract 300 recent tweets and publicly available account metadata, and feeds these features to an ensemble of machine learning classifiers, which produce a bot score. To label accounts as bots, I use the fifty-percent threshold, which has proven effective in prior studies (Davis et al., 2016): an account is considered to be a bot if the overall Botometer score is above 0.5.
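The query loop is sketched below against the published botometer Python client (https://github.com/IUNetSci/botometer-python). The credentials are placeholders, the handle is invented, and the layout of the response dictionary (and whether a RapidAPI key is required) has changed across Botometer versions, so the score lookup is only indicative.

import botometer

twitter_app_auth = {
    "consumer_key": "XXX",             # placeholder credentials
    "consumer_secret": "XXX",
    "access_token": "XXX",
    "access_token_secret": "XXX",
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key="XXX",
                          **twitter_app_auth)

result = bom.check_account("@example_handle")          # hypothetical account
overall = result["raw_scores"]["english"]["overall"]   # key names vary by Botometer version
print("bot" if overall > 0.5 else "human", overall)    # 0.5 threshold, as in the text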
4.3.3 Geolocation
There are two ways to identify the location of tweets produced by users. One way is to collect the coordinates of the location the tweets were sent from; however, this is only possible if users enable the geolocation option on their Twitter accounts. The second way is to analyze the self-reported location text in users' profiles. The latter includes substantially more noise, since many people write fictitious or imprecise locations; for example, they may identify the state and the country they reside in, but not the city.
In this paper, I use the location provided by Crimson Hexagon, which uses two methodologies. First, it extracts the geotagged locations, which are only available for a small portion of the Twitter data. For the tweets which are not geotagged, Crimson Hexagon estimates the users' countries, regions, and cities based on "various pieces of contextual information", for example, their profile information as well as users' time zones and languages. Using the country and state information provided by Crimson Hexagon, this dataset has over 9 million tweets with country location. As shown in Table 4.5, more than 7.5 million of these geolocated tweets come from the US, with the UK, Nigeria, and Russia trailing behind with 287k, 277k, and 192k tweets, respectively. There are more than 4.7 million US tweets with state information provided. The top four states are as expected: California, Texas, New York, and Florida, with 742k, 580k, 383k, and 360k tweets.
Table 4.5: Distribution of tweets by country (Top 10 countries by tweet count)
Country # of tweets
USA 7,548,463
UK 287,031
Nigeria 277,507
Russia 192,305
Canada 145,371
India 91,464
Australia 79,188
Mexico 70,268
Indonesia 37,716
France 28,452
4.4 Results
Let us address the four research questions I sketched earlier:
RQ1 What is the role of the users' political ideology?
RQ2 How central are trolls and the users who retweet them, the spreaders?
RQ3 What is the role of social bots?
RQ4 Did trolls especially succeed in specific areas of the US?
In Section 4.4.1, I analyze how political ideology affects engagement with content posted by Russian trolls. Section 4.4.2 shows how the position of trolls and spreaders evolves over time in the retweet network. Section 4.4.3 focuses on social bots and how they contribute to the spreading of content produced by Russian trolls. Finally, in Section 4.4.4 I show how users contribute to the consumption and the propagation of trolls' content based on their location.
4.4.1 RQ1: Political Ideology
The label propagation method succeeds in labeling most of the users as either liberal or conservative; however, the algorithm was not able to label some users outside of the giant component. Table 4.6 shows the number of trolls by ideology, and I can see that there are almost twice as many conservative trolls as liberal ones, both in terms of the number of trolls overall and the number of trolls who wrote original tweets. Although the number of liberal trolls is half as large, they produce more tweets than conservative trolls. The mean and standard deviation of the number of tweets for liberal trolls are 978 and 3,168, while for conservatives they are 354 and 1,202, respectively. It is hard to determine the distribution of the tweets from these numbers, but by looking at the distribution of the number of original tweets for liberal and conservative trolls, I can see that the distribution of original tweets for liberal trolls is more even than for conservative ones.
Table 4.6: Breakdown of the Russian trolls by political ideology, with the ratio of conservative to liberal
trolls.
Liberal Conservative Ratio
# of trolls 339 688 2
# of trolls w/ original tweets 306 608 1.98
# of original tweets 299,464 215,617 0.7
Table 4.7 shows descriptive statistics of spreaders. The table shows that only a few spreaders wrote original tweets and that more than half of the tweets are retweets of trolls. There are fewer conservative spreaders, but they write substantially more tweets than their liberal counterparts (see Table 4.8). Besides talking about the candidates, liberals talk about being black, women, and school shootings. Conservatives talk about being American, Obama, terrorism, refugees, and Muslims (see Table 4.9 for the top 20 words used by liberals and conservatives). Although the difference in language use between liberals and conservatives is important to understand, analyzing other aspects of their respective behavior online is crucial, and that is what the rest of the paper focuses on.
Table 4.7: Descriptive statistics of spreaders, i.e., users who retweeted Russian trolls.
# of spreaders: 720,558
# of times retweeted trolls: 3,540,717
# of spreaders with original tweets: 21,338
# of original tweets: 319,565
# of original tweets and retweets: 7,357,717

Table 4.8: All spreaders by political ideology; bot analysis for 115,396 spreaders (out of a 200k random sample of spreaders). Ratio: conservative/liberal.
                      Liberal      Conservative   Ratio
# of spreaders        446,979      273,546        0.6
# of tweets           1,715,696    5,641,988      3.2
# of bots             3,528        4,896          1.4
# of tweets by bots   26,233       181,604        7

Table 4.9: Top 20 meaningful lemmatized words from the tweets of Russian trolls classified as Conservative and Liberal.
Conservative   Count    Liberal    Count
trump          10,362   police     15,498
hillary        5,494    trump      13,999
people         4,479    man        13,118
clinton        4,295    black      9,942
obama          2,593    year       8,627
one            2,447    state      7,895
american       2,388    woman      7,748
woman          2,167    shooting   6,564
day            2,152    killed     6,407
donald         2,148    people     5,913
time           2,105    clinton    5,606
refugee        2,103    school     5,143
president      2,079    shot       5,132
terrorist      1,994    city       4,754
country        1,980    win        4,543
muslim         1,963    cop        4,515
year           1,893    day        4,478
need           1,856    fire       4,408
think          1,835    officer    4,397
al             1,823    death      4,305

4.4.2 RQ2: Temporal Analysis
Analyzing the influence of trolls across time is one of the most important questions in terms of the spread of political propaganda. The way I measure the influence of trolls is by measuring where they are located in the retweet network. I choose the retweet network in particular because retweeting is the main vehicle for spreading information on Twitter. In terms of the location of a user, there are multiple ways to measure the position of a user in the network he/she is embedded in. I choose the k-core decomposition technique, because it captures the notion of who is in the core of the network vs. the periphery, while giving an ordinal measure reflecting the number of connections.
The k-core of a graph is the maximal subgraph in which every node has at least degree k. In a directed network, k is the sum of in-degree and out-degree. The k-core decomposition is a recursive approach that progressively trims the least connected nodes in a network (those with lower degree) in order to identify the most central ones (Barberá et al., 2015).
I measure the centrality of trolls across time by dividing the number of trolls by the total number of nodes in every core, for every snapshot of the network. Since I want to measure the evolution of the trolls' importance, I construct monthly retweet networks, where every network contains only the nodes and edges of that month. The reason for this setup is to see how the trolls' and spreaders' positions in the retweet network evolve from month to month throughout the year of 2016. From this analysis, I find that most of the trolls were in the middle and higher k-cores in the first eight months or so of 2016, as opposed to being mostly in the high k-cores toward the end of that year. As for spreaders, their proportion across most of the k-cores increases progressively and uniformly across time and dominates the retweet network toward the end of 2016.
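For one monthly snapshot, the measurement reduces to computing each node's core number and the share of trolls per k-shell, as in the sketch below. It uses networkx for brevity rather than whatever library the analysis actually relied on, and the edge list and troll set are toy placeholders; on directed graphs networkx defines the core degree as in-degree plus out-degree, matching the definition above.

from collections import Counter
import networkx as nx

# Toy monthly retweet network: an edge (u, v) means u retweeted v.
month_edges = [("troll_a", "u1"), ("u1", "u2"), ("u2", "troll_a"),
               ("u2", "u3"), ("u3", "u1"), ("u4", "u1")]
trolls = {"troll_a"}

g = nx.DiGraph(month_edges)
core = nx.core_number(g)                    # k-core index per node (in + out degree)

total_per_core = Counter(core.values())
troll_per_core = Counter(core[n] for n in trolls if n in core)
for k in sorted(total_per_core):
    frac = troll_per_core.get(k, 0) / total_per_core[k]
    print(k, frac)                          # fraction of trolls in each k-shell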
4.4.3 RQ3: Social Bots
Since there are many spreaders in this dataset, I take a random sample of 200,000 spreaders and use the approach explained in Section 4.3.2 to obtain bot scores. I use Botometer to obtain bot scores only for the sample and not for all of the spreaders, since it takes a considerably long time to get bot scores for such a number of accounts due to the Twitter API call limit. I am able to obtain bot scores for 115,396 of the 200,000 spreaders.
The number of users that have bot scores above 0.5, and can therefore be safely considered
bots according to prior work (Varol et al., 2017), stands at 8,424 accounts. Out of the 115,396
spreaders with bot scores, 68,057 are liberal, and 3,528 of them have bot scores above 0.5, about
5% of the total. As for the conservatives, there are 45,807 spreaders, 4,896 of which have bot scores
greater than 0.5, representing around 11% of the total. As the results summarized in Table 4.8 show, although the number of conservative spreaders is lower than that of liberal ones, there are more bots among conservatives, and they write considerably more tweets.
The top 100 liberal posters produce 341,940 tweets (including retweets). Out of the top 100
liberal accounts, 15 have bot scores above 0.5 and they post 14,815 tweets, which is only 4% of the
total tweets produced by the top 100 liberal accounts. For the top 100 conservative accounts, they
produce 828,334 tweets. Out of the top 100 conservative accounts, 25 accounts have bot scores
above 0.5 and produce 85,102 tweets, a bit more than 10% of the total tweets produced by the
top 100 most productive conservative posters. It is evident that among the top most productive
posters, conservative ones produce more tweets, include more bots, and the bot accounts produce
more tweets than their liberal counterparts.
Figure 4.4 shows the probability density of the bot scores of the liberal and conservative spreaders. Again, putting aside the disproportionate number of liberals to conservatives, the mean value of the bot scores of the conservative spreaders (0.29) is higher than the liberal one (0.18). I performed a two-sided t-test for the null hypothesis that the two distributions have identical mean values, and the p-value is well below 0.001, meaning that I can reject the null. Furthermore, conservative spreaders have higher means on all of the Botometer subclass scores in comparison to their liberal counterparts.
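The comparison of means amounts to a two-sample t-test, sketched below with SciPy on synthetic bot-score arrays whose means merely mimic the reported 0.18 and 0.29 (the Welch variant is shown; the dissertation does not specify whether equal variances were assumed).

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for the two groups' bot scores (means near 0.18 and 0.29).
liberal_scores = rng.beta(2, 9, 68_057)
conservative_scores = rng.beta(2, 5, 45_807)

t_stat, p_value = stats.ttest_ind(conservative_scores, liberal_scores, equal_var=False)
print(t_stat, p_value)   # two-sided p-value for the difference in means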
Figure 4.4: Overall Bot Score
From the plots in Figures 4.5, 4.6, 4.7, 4.8, 4.9, & 4.10, it is evident that conservative spreaders have higher means on all of the Botometer subclass scores in comparison to their liberal counterparts. The differences in all plots are statistically significant (p < 0.001). Moreover, looking at the distributions, I can see that the differences in the user characteristics (metadata) and "Friend" categories are substantively large (around a 0.10 difference).
Figure 4.5: Content Bot Score
4.4.4 RQ4: Geospatial Analysis
Figures 4.11 and 4.12 show the proportion of retweets by liberal and conservative users (classified according to the label propagation algorithm) of the content produced by Russian trolls in each state, normalized by the total number of liberal and conservative tweets in that state, respectively. The ratio is computed as ρ_S = T_S / P_S, where T_S is the total number of liberal/conservative retweets of liberal/conservative trolls from a given state S, and P_S is the total number of liberal/conservative tweets from that state.
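Computing ρ_S and the z-scores discussed later in this section is a small aggregation step, sketched below with pandas; the state counts are made-up numbers, chosen only to show the shape of the calculation.

import pandas as pd

per_state = pd.DataFrame({
    "state":          ["WY", "MT", "KS", "CA"],
    "troll_retweets": [120, 80, 60, 900],        # T_S (toy values)
    "total_tweets":   [400, 300, 250, 5000],     # P_S (toy values)
})
per_state["rho"] = per_state["troll_retweets"] / per_state["total_tweets"]
# Standardize the ratios to flag states that deviate from the national average.
per_state["z"] = (per_state["rho"] - per_state["rho"].mean()) / per_state["rho"].std()
print(per_state.sort_values("z", ascending=False))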
Figure 4.6: Friend Bot Score

Figure 4.7: Metadata Bot Score

Figure 4.8: Network Bot Score

Figure 4.9: Sentiment Bot Score

Figure 4.10: Temporal Bot Score

Table 4.10: Users that reported Russia as their location
# of trolls from Russia: 438
# of spreaders from Russia: 1,980
# of overall users from Russia: 3,017

I notice that a few states exhibit very high proportions of retweets per total number of tweets for liberals and conservatives. I test the deviations using a two-tailed t-test on the z-scores of each deviation, calculated on the distribution of ratios. For conservatives, the average is 0.34 and the standard deviation is 0.12, while for liberal states, the average is 0.26 and the standard deviation is 0.13. For conservatives, Wyoming leads the ranking (ρ = 3.65, p-value = 0.001). For liberals, Montana (ρ = 0.54, p-value = 0.023) and Kansas (ρ = 0.51, p-value = 0.046) lead the ranking and are the only states with ratios that are statistically significant. It is also interesting to note that a little less than half of the trolls have Russia as their location, while only a small number of spreaders and other users have Russia as their location (see Table 4.10).
Figure 4.11: Proportion of the number of retweets by liberal users of Russian trolls per each state normalized
by the total number of liberal tweets by state.
Figure 4.12: Proportion of the number of retweets by conservative users of Russian trolls per each state
normalized by the total number of conservative tweets by state.
4.5 Conclusions
The dissemination of information and the mechanisms for democratic discussion have radically
changed since the advent of digital media, especially social media. Platforms like Twitter are
praised for their contribution to democratization of public discourse on civic and political issues.
However, many studies highlight the perils associated with the abuse of these platforms. The
spread of deceptive, false, and misleading information aimed at manipulating public opinion is among those risks.
In this work, I investigated the role and effects of trolls, using the content produced by Russian trolls on Twitter as a proxy for propaganda. I collected tweets posted during the year of 2016 by Russian trolls, spreaders, and other users who tweeted in this period. I showed that propaganda (produced by Russian trolls) is shared more widely by conservatives than liberals on Twitter. Although the number of liberal spreaders is close to 2:1 in comparison to conservative ones, the latter write about 3.2 times as many tweets as the liberal spreaders. Using a state-of-the-art bot detection method, I estimated that about 5% and 11% of the liberal and conservative users, respectively, are bots. Conservative bot spreaders produce about 7 times as many tweets as liberal ones.
The spread of propaganda by malicious actors can have severe negative consequences. It can amplify malicious information and polarize political conversation, causing confusion and social instability. Scientists are currently investigating the consequences of such phenomena (Woolley and Howard, 2016, Shorey and Howard, 2016b). I plan to explore in detail the issue of how malicious information spreads via exposure and the role of peer effects. In conclusion, it is important to stress that, although my analysis unveiled the current state of the political debate and agenda pushed by the Russian trolls who spread malicious information, it is impossible to account for all the malicious efforts aimed at manipulation during the last presidential election. State- and non-state actors, local and foreign governments, political parties, private organizations, and even individuals with adequate resources (Kollanyi et al., 2016) could obtain operational capabilities and technical tools to construct propaganda campaigns and deploy armies of social bots to affect the directions of online conversations. Future efforts will be required by the social and computational science communities to study this issue in depth and develop more sophisticated detection techniques capable of unmasking and fighting these malicious efforts.
Chapter 5
Conclusion
The dissemination of information and the mechanisms for democratic discussion have radically
changed since the advent of digital media, especially social media. Platforms like Twitter have been
extensively praised for their contribution to the democratization of public discourse on civic and
political issues. However, many studies have also highlighted the perils associated with the abuse of
these platforms. The spread of deceptive, false, and misleading information aimed at manipulating public opinion is among those risks. In the remainder of the chapter, I briefly summarize the main findings of the empirical chapters and propose how we could move this research agenda forward in future research.
In this work, I investigate the role and effects of misinformation, using the content produced by Russian trolls on Twitter as a proxy for misinformation. Overall, in this dissertation I use Twitter datasets related to the U.S. Presidential election of 2016. In chapter two, I show that misinformation (produced by Russian trolls) was shared more widely by conservatives than liberals on Twitter. Although there were about four times as many Russian trolls posting conservative views as liberal ones, the former produced almost 20 times more content. In terms of users who retweeted these trolls, there were about 30 times more conservatives than liberals. Conservatives also outproduced liberals on content, at a rate of 35:1. Using a state-of-the-art bot detection method, I estimated that about 4.9% and 6.2% of the liberal and conservative users, respectively, are bots.
In chapter three, I show that (i) with some insight on users who spread or produce malicious content, I am able to predict those that will spread trolls' messages; (ii) in the case of the 2016 US presidential election, political ideology was highly predictive of who is going to spread trolls' messages. I show that spreaders and non-spreaders are significantly different on the dimensions of friends, network, and user metadata, with spreaders having higher bot scores on all three. Basic metadata features give a strong signal in terms of differentiating spreaders from non-spreaders, along with the bot score.
I also show that spreaders write a lot of tweets (counting retweets as well), have higher bot scores, and tend to be more conservative (conservative is labeled as the highest numerical value in the political ideology feature). It is evident that political ideology has by far the most influence on distinguishing between spreaders and non-spreaders. On the other hand, I can deduce that spreaders do not write much original content, tend not to have that many followers, and have more recent accounts.
In chapter four, I investigate the role and effects of trolls, using a more representative Twitter dataset. I use tweets posted during the year of 2016 by Russian trolls, spreaders, and other users who tweeted in this period. I show that propaganda (produced by Russian trolls) is shared more widely by conservatives than liberals on Twitter. Although the number of liberal spreaders is close to 2:1 in comparison to conservative ones, the latter write about 3.2 times as many tweets as the liberal spreaders. Using a state-of-the-art bot detection method, I estimate that about 5% and 11% of the liberal and conservative users, respectively, are bots. Conservative bot spreaders produce about seven times as many tweets as liberal ones. These results are in sync with the results in chapter one, but show that the magnitude of the effect of Russian trolls on conservatives is less than estimated in chapter two.
It is important to stress that, although my analysis unveiled the current state of the political debate and agenda pushed by the Russian trolls who spread malicious information, it is impossible to account for all the malicious efforts aimed at political manipulation during the last presidential election. State- and non-state actors, local and foreign governments, political parties, private organizations, and even individuals with adequate resources (Kollanyi et al., 2016) could obtain operational capabilities and technical tools to construct misinformation campaigns and deploy armies of social bots to affect the directions of online conversations.
Overall, many questions about how misinformation gets spread and its effect on people remain to be answered. For example, what are the effects of exposure to disinformation on individual beliefs and behavior? How does the spread of disinformation through images and video differ from the spread of misinformation through text? This is something this dissertation does not address, nor do many of the studies that have been conducted to this day. Additionally, does exposure via different platforms affect users differently from exposure through a single platform (Tucker et al., 2018)?
One key result from the analysis conducted in this dissertation is that conservatives propagated
the content produced by trolls more than liberals. But the question remains whether conservatives
are more prone to share or even believe such content than liberals due to different psychological traits. Is there another reason that can explain this phenomenon? One major caveat of taking this result at face value is that trolls produced more content targeting conservatives than liberals. Although that is true in terms of overall quantity, proportionately conservatives shared more of the trolls' content that targeted them than liberals did. By most measures, conservatives did propagate trolls' content more than liberals; however, most of the analysis done in this dissertation and in most similar analyses focuses on a very short period, so it is unclear whether this pattern would hold in a different time period or geographical location. In addition to the limitation of having this issue studied only for very few years, most datasets collected from online social networks tend to have their own collection biases that may influence this sort of analysis.
I believe that the main question about misinformation, besides measuring its prevalence, is: how does it really affect people's beliefs and actions? This question is extremely hard to answer with observational data from an online platform. There are a few theories that attempt to explain the mechanism of how people's beliefs and actions may get affected by exposure to misinformation. Some believe that individuals update their political positions in response to new information, in a direction that is consistent with what they learned (Achen, 1992, Bullock, 2009). Others, however, demonstrated the existence of boomerang effects that lead individuals to reinforce their previous positions (Lewandowsky et al., 2012, Nyhan and Reifler, 2010). I believe the latter is more likely the case, as there is strong evidence that people do not easily change their beliefs based on persuasion campaigns (see Kalla and Broockman (2018)), although exposure may lead to higher mobilization of motivated individuals.
This dissertation and the current research on this issue attempt to understand how troll campaigns are established in terms of account creation, content production, and propagation. This analysis is ultimately intended to be used to counter such campaigns. I believe that certain insights gained from this research can be used to fight such campaigns. This could be done by recognizing troll/bot accounts at an early stage and suspending them, lowering the prevalence of their content in newsfeeds, or removing them completely. I believe that the research in this dissertation and similar research in the literature could help with this effort. As for counter-messaging, there is a smaller portion of the literature that utilizes mostly experimental methods to test the effect of counter-messaging, and I believe that some of these measures can be effective as well (see Munger (2017)). But this approach has an obvious downside of alienating some users and even pushing them to use other platforms that do not implement counter-trolling measures.
Another issue with the current literature on misinformation is its inability to combine data from different online social platforms to study the question of misinformation. This issue leads to overemphasizing the role of certain online social networks, such as Twitter, due to the relative availability of their data. There are current efforts to study misinformation from different platforms, but the ability to map the data to the same users across different platforms remains elusive. This hurdle hampers our ability to understand the effect of misinformation from each platform on different individuals and how the platforms collectively and individually play a role in propelling misinformation and possibly affecting people's perceptions and behaviors.
How information is produced and shared across online social networks and traditional media sources is another important question mostly missed by the current literature on misinformation. For example, it is important to know how political rumors migrate from online
social networks to traditional media outlets and vice versa (Tucker et al., 2018). It would be
interesting to know how hyperpartisan media outlets use their online accounts to propagate their
message online, particularly when it comes to videos shared across both media.
Overall, the literature on misinformation is predominantly based on the analysis of textual data rather than visual or audiovisual content. Visual and audiovisual content are hard to analyze for several reasons. First, there are no agreed-upon concepts and measures of visual political content (Griffin, 2013). Second, storing and retrieving image and audiovisual material is harder in comparison to text and requires large storage capacity. Third, image and audiovisual content is harder to analyze and code for. Last, the computational tools available for image and audiovisual content are in their infancy in comparison to those for text, in addition to requiring huge computational resources unavailable to most researchers.
Another major limitation of the current literature on misinformation lies in its overemphasis on
the US case and, to a lesser extent, major European countries. There are several reasons why that is the case. One, most of the researchers working on misinformation do not speak the languages of most of the underrepresented countries where misinformation is rampant, which makes it hard to analyze the data. Moreover, even if a researcher has the resources to translate the text data, he/she usually lacks adequate knowledge of the political situation in these countries, hampering his/her attempt to distinguish between normal political polarization and targeted misinformation campaigns.
underrepresented countries, even for comparative studies, as it does not raise the same level of
interest among the public and other researchers. Besides, compared to the US and European cases,
it does not attract the same level of funding.
Finally, similar to the aforementioned issue, the current literature gives too much attention to the liberal vs. conservative division, following the political division existing in the US. More refined studies are needed that focus on the effects of misinformation on groups cross-sectionally and on people at different sides of the political spectrum in accordance with the political divisions in Europe. Overall, substantial efforts are required from the social and computational sciences communities to study these issues in depth, develop more sophisticated techniques to estimate the prevalence and the effect of misinformation actors, and build tools capable of unmasking and fighting these malicious efforts.
Bibliography
Achen, C. H. (1992). Social psychology, demographic variables, and linear regression: Breaking the iron triangle in voting research. Political Behavior 14(3), 195-211.
Adamic, L. A. and N. Glance (2005). The political blogosphere and the 2004 US election: divided they blog. In 3rd Int Work Link Disc.
Agarwal, A., B. Xie, I. Vovsha, O. Rambow, and R. Passonneau (2011). Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media, pp. 30-38. Association for Computational Linguistics.
Alarifi, A., M. Alsaleh, and A. Al-Salman (2016). Twitter turing test: Identifying social machines. Information Sciences 372.
Allcott, H. and M. Gentzkow (2017). Social media and fake news in the 2016 election. Journal of Economic Perspectives 31(2), 211-36.
Althaus, S. L. (1998). Information effects in collective preferences. American Political Science Review 92(3), 545-558.
Ananyev, M. and A. Sobolev (2017). Fantastic beasts and whether they matter: Do internet trolls influence political conversations in Russia? In preparation.
Applebaum, Anne, P. P. M. S. and C. Colliver (2017). "Make Germany great again": Kremlin, alt-right and international influences in the 2017 German elections.
Aral, S., L. Muchnik, and A. Sundararajan (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106(51).
Aral, S. and D. Walker (2012). Identifying influential and susceptible members of social networks. Science 337(6092), 337-341.
Badawy, A., E. Ferrara, and K. Lerman (2018a). Analyzing the digital traces of political manipulation: The 2016 Russian interference Twitter campaign. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 258-265. IEEE.
Badawy, A., E. Ferrara, and K. Lerman (2018b). Analyzing the digital traces of political manipulation: The 2016 Russian interference Twitter campaign. In ASONAM.
Bakshy, E., J. Hofman, W. Mason, and D. Watts (2011). Everyone's an influencer: quantifying influence on Twitter. In 4th WSDM.
Bakshy, E., S. Messing, and L. A. Adamic (2015a). Exposure to ideologically diverse news and opinion on Facebook. Science 348(6239), 1130-1132.
Bakshy, E., S. Messing, and L. A. Adamic (2015b). Exposure to ideologically diverse news and opinion on Facebook. Science 348(6239).
Barberá, P., N. Wang, R. Bonneau, J. T. Jost, J. Nagler, J. Tucker, and S. González-Bailón (2015). The critical periphery in the growth of social protests. PloS one 10(11), e0143611.
Bartels, L. M. (1996). Uninformed votes: Information effects in presidential elections. American Journal of Political Science, 194-230.
Bartels, L. M. (2002). Beyond the running tally: Partisan bias in political perceptions. Political Behavior 24(2), 117-150.
Baum, M. A. and T. Groeling (2009). Shot by the messenger: Partisan cues and public opinion regarding national security and war. Political Behavior 31(2), 157-186.
Bekafigo, M. A. and A. McBride (2013). Who tweets about politics? Political participation of Twitter users during the 2011 gubernatorial elections. Social Science Computer Review 31(5), 625-643.
Bessi, A. and E. Ferrara (2016). Social bots distort the 2016 US presidential election online discussion. First Monday 21(11).
Bond, R. M., C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H. Fowler (2012). A 61-million-person experiment in social influence and political mobilization. Nature 489(7415).
Bruns, A. and J. E. Burgess (2011). The use of Twitter hashtags in the formation of ad hoc publics. In 6th ECPR General Conference.
Buckels, E. E., P. D. Trapnell, and D. L. Paulhus (2014). Trolls just want to have fun. Personality and Individual Differences 67.
Bullock, J. G. (2009). Partisan bias and the Bayesian ideal in the study of public opinion. The Journal of Politics 71(3), 1109-1124.
Carlisle, J. E. and R. C. Patton (2013). Is social media changing how we understand political engagement? An analysis of Facebook and the 2008 presidential election. Political Research Quarterly 66(4).
Centola, D. (2010). The spread of behavior in an online social network experiment. Science 329(5996), 1194-1197.
Centola, D. (2011). An experimental study of homophily in the adoption of health behavior. Science 334(6060), 1269-1272.
Conover, M., B. Gonçalves, J. Ratkiewicz, A. Flammini, and F. Menczer (2011). Predicting the political alignment of Twitter users. In Proc. 3rd IEEE Conference on Social Computing, pp. 192-199.
Conover, M., J. Ratkiewicz, M. R. Francisco, B. Gonçalves, F. Menczer, and A. Flammini (2011). Political polarization on Twitter. ICWSM 133, 89-96.
Conover, M. D., C. Davis, E. Ferrara, K. McKelvey, F. Menczer, and A. Flammini (2013). The geospatial characteristics of a social movement communication network. PloS one 8(3), e55957.
Conover, M. D., E. Ferrara, F. Menczer, and A. Flammini (2013). The digital evolution of occupy wall street. PloS one 8(5), e64679.
Cresci, S., R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi (2017). The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 963–972. International World Wide Web Conferences Steering Committee.
Csardi, G. and T. Nepusz (2006). The igraph software package for complex network research. InterJournal, Complex Systems 1695(5), 1–9.
Davis, C. A., O. Varol, E. Ferrara, A. Flammini, and F. Menczer (2016). Botornot: A system to evaluate social bots. In Proc. 25th International Conference on World Wide Web, pp. 273–274.
Diakopoulos, N. A. and D. A. Shamma (2010). Characterizing debate performance via aggregated twitter sentiment. In SIGCHI Conf.
Diamond, L., M. F. Plattner, and C. Walker (2016). Authoritarianism goes global: The challenge to democracy. JHU Press.
DiGrazia, J., K. McKelvey, J. Bollen, and F. Rojas (2013). More tweets, more votes: Social media as a quantitative indicator of political behavior. PloS one 8(11), e79449.
DiJulio, B., M. Norton, and M. Brodie (2016). Americans' views on the u.s. role in global health. Technical report, Kaiser Family Foundation.
Dutt, R., A. Deb, and E. Ferrara (2018). 'Senator, We Sell Ads': Analysis of the 2016 Russian Facebook Ads Campaign. In Third International Conference on Intelligent Information Technologies (ICIIT 2018).
Effing, R., J. Van Hillegersberg, and T. Huibers (2011). Social media and political participation: are facebook, twitter and youtube democratizing our political systems? Electronic participation, 25–35.
El-Khalili, S. (2013). Social media as a government propaganda tool in post-revolutionary egypt. First Monday 18(3).
Enli, G. S. and E. Skogerbø (2013). Personalized campaigns in party-centred politics: Twitter and facebook as arenas for political communication. Information, Communication & Society 16(5), 757–774.
Ferrara, E. (2017). Disinformation and social bot operations in the run up to the 2017 french presidential election. First Monday 22.
Ferrara, E. (2018). Measuring social spam and the effect of bots on information diffusion in social media. In Complex Spreading Phenomena in Social Systems, pp. 229–255. Springer, Cham.
Ferrara, E., O. Varol, C. Davis, F. Menczer, and A. Flammini (2016). The rise of social bots. Comm. of the ACM 59(7), 96–104.
Ferrara, E., O. Varol, F. Menczer, and A. Flammini (2016). Detection of promoted social media campaigns. In Tenth International AAAI Conference on Web and Social Media, pp. 563–566.
Flynn, D., B. Nyhan, and J. Reifler (2017). The nature and origins of misperceptions: Understanding false and unsupported beliefs about politics. Political Psychology 38(S1), 127–150.
Forelle, M., P. Howard, A. Monroy-Hernández, and S. Savage (2015). Political bots and the manipulation of public opinion in venezuela. arXiv preprint arXiv:1507.07109.
Fourney, A., M. Z. Racz, G. Ranade, M. Mobius, and E. Horvitz (2017). Geographic and temporal trends in fake news consumption during the 2016 us presidential election. In CIKM, Volume 17.
Freitas, C., F. Benevenuto, S. Ghosh, and A. Veloso (2015). Reverse engineering socialbot infiltration strategies in twitter. In ASONAM.
Fritz, B., B. Keefer, and B. Nyhan (2004). All the president's spin: George W. Bush, the media, and the truth. Simon and Schuster.
Gerber, T. P. and J. Zavisca (2016). Does russian propaganda work? The Washington Quarterly 39(2), 79–98.
Gibson, R. K. and I. McAllister (2006). Does cyber-campaigning win votes? online communication in the 2004 australian election. Journal of Elections, Public Opinion and Parties 16(3), 243–263.
Gilens, M. (2001). Political ignorance and collective policy preferences. American Political Science Review 95(2), 379–396.
González-Bailón, S., J. Borge-Holthoefer, and Y. Moreno (2013). Broadcasters and hidden influentials in online protest diffusion. American Behavioral Scientist 57(7), 943–965.
González-Bailón, S., J. Borge-Holthoefer, A. Rivero, and Y. Moreno (2011). The dynamics of protest recruitment through an online network. Scientific reports 1, 197.
Gorodnichenko, Y., T. Pham, O. Talavera, et al. (2018). Social media, sentiment and public opinions: Evidence from #brexit and #uselection. Technical report.
Gottfried, J. and E. Shearer (2016). News Use Across Social Media Platforms 2016. Pew Research Center.
Griffin, M. (2013). Visual communication. In The Handbook of Communication History, pp. 149–168. Routledge.
Guess, A., B. Nyhan, and J. Reifler (2018). Selective exposure to misinformation: Evidence from the consumption of fake news during the 2016 u.s. presidential campaign. Technical report.
Herring, S., K. Job-Sluder, R. Scheckler, and S. Barab (2002). Searching for safety online: Managing "trolling" in a feminist forum. The information society 18(5), 371–384.
Higgin, T. (2013). FCJ-159 /b/lack up: What trolls can teach us about race. The Fibreculture Journal (22 2013: Trolls and The Negative Space of the Internet).
Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America 102(46), 16569.
Hopkins, D. J., J. Sides, and J. Citrin (2019). The muted consequences of correct information about immigration. The Journal of Politics 81(1), 315–320.
Howard, P. (2006). New media campaigns and the managed citizen.
Howard, P. N. and B. Kollanyi (2016). Bots, #strongerin, and #brexit: Computational propaganda during the uk-eu referendum.
Hwang, T., I. Pearce, and M. Nanis (2012). Socialbots: Voices from the fronts. Interactions 19(2), 38–45.
Kalla, J. L. and D. E. Broockman (2018). The minimal persuasive effects of campaign contact in general elections: Evidence from 49 field experiments. American Political Science Review 112(1), 148–166.
Keller, F. B., D. Schoch, S. Stier, and J. Yang (2017). How to manipulate social media: Analyzing political astroturfing using ground truth data from south korea. In ICWSM, pp. 564–567.
King, G., J. Pan, and M. E. Roberts (2013). How censorship in china allows government criticism but silences collective expression. American Political Science Review 107(2), 326–343.
King, G., J. Pan, and M. E. Roberts (2017). How the chinese government fabricates social media posts for strategic distraction, not engaged argument. American Political Science Review 111(3), 484–501.
Kloumann, I. M., C. M. Danforth, K. D. Harris, C. A. Bliss, and P. S. Dodds (2012). Positivity of the english language. PloS one 7(1), e29484.
Kollanyi, B., P. N. Howard, and S. C. Woolley (2016). Bots and automation over twitter during the first us presidential debate.
Kudugunta, S. and E. Ferrara (2018). Deep neural networks for bot detection. Information Sciences 467(October), 312–322.
Kuklinski, J. H., P. J. Quirk, J. Jerit, D. Schwieder, and R. F. Rich (2000). Misinformation and the currency of democratic citizenship. Journal of Politics 62(3), 790–816.
Kunda, Z. (1990). The case for motivated reasoning. Psychological bulletin 108(3), 480.
Labzina, E. (2017). Rewriting knowledge: Russian political astroturfing as an ideological manifestation of the national role conceptions. In preparation.
Lewandowsky, S., U. K. Ecker, C. M. Seifert, N. Schwarz, and J. Cook (2012). Misinformation and its correction: Continued influence and successful debiasing. Psychological science in the public interest 13(3), 106–131.
Lietz, H., C. Wagner, A. Bleier, and M. Strohmaier (2014). When politicians talk: Assessing online conversational practices of political parties on twitter. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media.
Loader, B. D. and D. Mercea (2011). Networking democracy? social media innovations and participatory politics. Inf. Commun. Soc 14(6).
Maréchal, N. (2017). Networked authoritarianism and the geopolitics of information: Understanding russian internet policy. Media and Communication 5(1), 29–41.
Marwick, A. and R. Lewis (2017). Media manipulation and disinformation online. New York: Data & Society Research Institute.
Mele, N., D. Lazer, M. Baum, N. Grinberg, L. Friedland, K. Joseph, W. Hobbs, and C. Mattsson (2017). Combating fake news: An agenda for research and action.
Messias, J., L. Schmidt, R. Oliveira, and F. Benevenuto (2013). You followed my bot! transforming robots into influential users in twitter. First Monday 18(7).
Metaxas, P. T. and E. Mustafaraj (2012). Social media and the elections. Science 338(6106), 472–473.
Mihaylov, T., G. Georgiev, and P. Nakov (2015). Finding opinion manipulation trolls in news community forums. In Proceedings of the nineteenth conference on computational natural language learning, pp. 310–314.
Miller, B. and M. Gallagher (2018). The progression of repression: When does online censorship move toward real world repression?
Monsted, B., P. Sapiezynski, E. Ferrara, and S. Lehmann (2017). Evidence of complex contagion of information in social media: An experiment using twitter bots. PLOS ONE 12(9), 1–12.
Morstatter, F., J. Pfeffer, H. Liu, and K. M. Carley (2013). Is the sample good enough? comparing data from twitter's streaming api with twitter's firehose. In ICWSM, pp. 400–408.
Munger, K. (2017). Tweetment effects on the tweeted: Experimentally reducing racist harassment. Political Behavior 39(3), 629–649.
Nyhan, B. and J. Reifler (2010). When corrections fail: The persistence of political misperceptions. Political Behavior 32(2), 303–330.
Pennebaker, J. W., R. L. Boyd, K. Jordan, and K. Blackburn (2015). The development and psychometric properties of liwc2015. Technical report.
Pennycook, G. and D. G. Rand. Assessing the Effect of "Disputed" Warnings and Source Salience on Perceptions of Fake News Accuracy.
Pennycook, G. and D. G. Rand (2017). Who falls for fake news? the roles of analytic thinking, motivated reasoning, political ideology, and bullshit receptivity.
Phillips, W. (2015). This is why we can't have nice things: Mapping the relationship between online trolling and mainstream culture. MIT Press.
Raghavan, U. N., R. Albert, and S. Kumara (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical review E 76(3), 036106.
Ratkiewicz, J., M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and F. Menczer (2011). Truthy: mapping the spread of astroturf in microblog streams. In 20th WWW Conf., pp. 249–252.
Ratkiewicz, J., M. Conover, M. R. Meiss, B. Gonçalves, A. Flammini, and F. Menczer (2011). Detecting and tracking political abuse in social media. ICWSM 11, 297–304.
Roberts, M. E. (2018). Censored: distraction and diversion inside China's Great Firewall. Princeton University Press.
Sanovich, S., D. Stukal, and J. A. Tucker (2018). Turning the virtual tables: Government strategies for addressing online opposition with an application to russia. Comparative Politics 50(3), 435–482.
Savage, S., A. Monroy-Hernandez, and T. Höllerer (2016). Botivist: Calling volunteers to action using online bots. In 19th CSCW.
Shao, C., G. L. Ciampaglia, O. Varol, A. Flammini, and F. Menczer (2017). The spread of fake news by social bots.
Shirky, C. (2011). The political power of social media: Technology, the public sphere, and political change. Foreign affairs, 28–41.
Shorey, S. and P. N. Howard (2016a). Automation, algorithms, and politics: A research review. Int. J Comm. 10.
Shorey, S. and P. N. Howard (2016b). Automation, algorithms, and politics: A research review. Int. J Comm. 10.
Sides, J. (2016). Stories or science? facts, frames, and policy attitudes. American Politics Research 44(3), 387–414.
Stella, M., E. Ferrara, and M. De Domenico (2018). Bots sustain and inflate striking opposition in online social systems. arXiv preprint arXiv:1802.07292.
Suárez-Serrato, P., M. E. Roberts, C. Davis, and F. Menczer (2016). On the influence of social bots in online protests. In International Conference on Social Informatics, pp. 269–278. Springer.
Subrahmanian, V., A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, and F. Menczer (2016). The darpa twitter bot challenge. Computer 49(6).
Taber, C. S. and M. Lodge (2006). Motivated skepticism in the evaluation of political beliefs. American journal of political science 50(3), 755–769.
Tucker, J., A. Guess, P. Barberá, C. Vaccari, A. Siegel, S. Sanovich, D. Stukal, and B. Nyhan (2018). Social media, political polarization, and political disinformation: A review of the scientific literature.
Tucker, J. A., Y. Theocharis, M. E. Roberts, and P. Barberá (2017). From liberation to turmoil: social media and democracy. Journal of democracy 28(4), 46–59.
Tufekci, Z. (2014). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. ICWSM.
Tufekci, Z. and C. Wilson (2012). Social media and the decision to participate in political protest: Observations from tahrir square. Journal of Communication 62(2), 363–379.
Varol, O., E. Ferrara, C. Davis, F. Menczer, and A. Flammini (2017). Online human-bot interactions: Detection, estimation, and characterization. In ICWSM, pp. 280–289.
Varol, O., E. Ferrara, F. Menczer, and A. Flammini (2017). Early detection of promoted campaigns on social media. EPJ Data Science 6(13).
Varol, O., E. Ferrara, C. L. Ogan, F. Menczer, and A. Flammini (2014). Evolution of online user behavior during a social upheaval. In Proceedings of the 2014 ACM conference on Web science.
Wang, Y., Y. Li, and J. Luo (2016). Deciphering the 2016 us presidential campaign in the twitter sphere: A comparison of the trumpists and clintonists. In ICWSM, pp. 723–726.
Warriner, A. B., V. Kuperman, and M. Brysbaert (2013). Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior research methods 45(4), 1191–1207.
Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pp. 347–354. Association for Computational Linguistics.
Woolley, S. C. (2016). Automating power: Social bot interference in global politics. First Monday 21(4).
Woolley, S. C. and P. N. Howard (2016). Automation, algorithms, and politics: Introduction. Int. Journal of Commun. 10.
Zannettou, S., T. Caulfield, E. De Cristofaro, M. Sirivianos, G. Stringhini, and J. Blackburn (2018). Disinformation warfare: Understanding state-sponsored trolls on twitter and their influence on the web. arXiv:1801.09288.
Ziegler, C. E. (2018). International dimensions of electoral processes: Russia, the usa, and the 2016 elections. International Politics 55(5), 557–574.
Appendices
Appendix A
Appendix for Chapter 1
Figure A.1: Example of a Russian troll tweet, exhibit 1
Figure A.2: Example of a Russian troll tweet, exhibit 2
Figure A.3: Example of a Russian troll tweet, exhibit 3
Figure A.4: Example of a Russian troll tweet, exhibit 4
Figure A.5: Example of a Russian troll tweet, exhibit 5