Essays on the Unintended Consequences of
Digital Platform Designs
by
Isamar Troncoso
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)
May 2022
Copyright 2022 Isamar Troncoso
Acknowledgments
I am indebted to my advisors, Davide Proserpio and Lan Luo, for their support, guid-
ance, patience, and encouragement during my doctoral studies. I am grateful to have the
opportunity to work with and learn from each of them.
I am also grateful to Xiao Liu, Dina Mayzlin, and Mohammed Alyakoob, for sitting on
my committee and providing valuable feedback on my dissertation. I would also like to thank
Kristin Diehl and Kalinda Ukanwa for their several helpful conversations and suggestions.
I am grateful to my parents, brothers, fiancé, and friends for their continuous and un-
conditional support in these challenging years. I couldn’t have made it without it.
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Does Gender Matter? The Effect of Management Responses on Reviewing
Behavior 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 The Tripadvisor Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 The Effect of Management Responses on the Likelihood to Review . . . . . . 10
1.4.1 Two-Way Fixed-Effects Regression . . . . . . . . . . . . . . . . . . . 11
1.4.2 Hotels Self-Selection into Treatment . . . . . . . . . . . . . . . . . . . 15
1.4.3 Reviewers’ Self-Selection into Disclosing their Gender . . . . . . . . . 19
1.5 Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5.1 How do Reviewers React to the Presence of Management Responses? 21
1.5.2 Management Responses and Gender Bias . . . . . . . . . . . . . . . . 24
1.6 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2 Look the Part? The Role of Profile Pictures in Online Labor Marketplaces 33
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Observational Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.2.2 Labeling Profile Pictures Based on Perceived Job Fit . . . . . . . . . 44
2.2.3 Estimating Hiring Preferences . . . . . . . . . . . . . . . . . . . . . . 46
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3 Experimental Study 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.1 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4 Experimental Study 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4.1 What a High Job Fit Programmer Looks Like? . . . . . . . . . . . . 67
2.4.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Bibliography 78
A Appendix to Chapter 1 84
A.1 Event Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.2 The Effect of Management Responses on Ratings . . . . . . . . . . . . . . . 86
A.3 Determinants of the Timing of Hotels’ First Response . . . . . . . . . . . . . 86
A.4 Including Hotel-Specific Time Trends . . . . . . . . . . . . . . . . . . . . . . 90
A.5 Addressing Reviewers’ Self-Selection into Disclosing their Gender . . . . . . 90
A.5.1 Predicting the Gender of the Reviewers . . . . . . . . . . . . . . . . . 91
A.5.2 Analyzing Gender Disclosure Decisions . . . . . . . . . . . . . . . . . 95
A.6 Review LDA Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.7 Coding of Contentious Responses . . . . . . . . . . . . . . . . . . . . . . . . 102
A.8 Coding of Contentious Reviews . . . . . . . . . . . . . . . . . . . . . . . . . 103
B Appendix to Chapter 2 104
B.1 Validating API Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.2 Details on the Deep Learning Image Classifier . . . . . . . . . . . . . . . . . 105
B.3 Observational Data Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 109
B.4 Pretesting our Visual Manipulation in Second Choice Experiment . . . . . . 113
List of Tables
1.1 Summary statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Percentage of reviews written by self-identified female users . . . . . . . . . . 9
1.3 TWFE estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 TWFE estimates using response visibility as treatment variable . . . . . . . 19
1.5 Survey statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6 Survey estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7 Comparison of response text characteristics by reviewer’s gender . . . . . . . 26
1.8 Reviewer’s gender on management response type . . . . . . . . . . . . . . . . 28
1.9 The effect of contentious responses on the likelihood to review . . . . . . . . 29
1.10 Contentious responses by hotel type . . . . . . . . . . . . . . . . . . . . . . . 30
2.1 Applications descriptives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2 Hiring outcomes descriptives . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Estimating hiring preferences in observational data . . . . . . . . . . . . . . 49
2.4 Interplay between perceived job fit and reputation in hiring outcomes . . . . 52
2.5 Attributes and levels used in the first choice experiment . . . . . . . . . . . . 59
2.6 Summary statistics of applications received for the job description used in our
conjoint study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.7 Part-worths and attribute importance for with picture condition across three
reputation diagnosticity conditions . . . . . . . . . . . . . . . . . . . . . . . 64
2.8 What does a high fit programmer look like? . . . . . . . . . . . . . . . . . . 69
2.9 Impact of profile picture manipulation on the “treated freelancer” hiring out-
comes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
A.1 The effect of management responses and reviewer gender on star-ratings . . . 87
A.2 Determinants of the timing of the first response . . . . . . . . . . . . . . . . 89
A.3 TWFE estimates including hotel-specific time trends . . . . . . . . . . . . . 90
A.4 TWFE estimates including observations with inferred and predicted gender . 93
A.5 Text-based classifier confusion matrix . . . . . . . . . . . . . . . . . . . . . . 94
A.6 Text-based classifier false negative rate (FNR) and false positive rate (FPR) 95
A.7 The effect of management responses on gender disclosure . . . . . . . . . . . 97
A.8 Coders’ average evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
A.9 Coders’ average evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.10 Review topic interpretation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A.11 The effect of reviewer’s gender on management response type . . . . . . . . . 101
B.1 Agreement between labels provided by Cloud Vision APIs and MTurkers . . 104
B.2 VGG architecture (modified and tuned) . . . . . . . . . . . . . . . . . . . . . 108
B.3 Variables included in the conditional logit model: Profile picture, reputation,
and performance variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
B.4 Variables included in the conditional logit model: Application variables and
additional controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
B.5 Summary Statistics for observational data: Continuous variables . . . . . . . 111
B.6 Summary Statistics for observational data: Discrete variables . . . . . . . . . 112
B.7 Pretesting picture versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
List of Figures
1.1 Example of Tripadvisor user profile. . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Adoption of management responses over time. . . . . . . . . . . . . . . . . . 8
1.3 Distribution of star-ratings. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Response visibility over time. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Example of a job list and three applications on Freelancer.com. . . . . . . . . 42
2.2 Example of a choice task under the two platform design conditions in experi-
mental study 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.3 Visual manipulations in experimental study 2. . . . . . . . . . . . . . . . . . 71
2.4 Plain profile pictures in experimental study 2. . . . . . . . . . . . . . . . . . 71
2.5 Example of a choice task under the two experimental conditions in study 2. . 72
A.1 Inspecting parallel trends assumption. . . . . . . . . . . . . . . . . . . . . . . 85
B.1 Examples of perceived freelancer-job fit as programmer label as provided by
human raters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.2 Examples of perceived freelancer-job fit as programmer label as predicted by
the image classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Abstract
This thesis brings together two research papers that empirically investigate how different
digital platform design choices can lead to unintended consequences for their users' behavior.
The first paper focuses on the context of online review platforms and explores the impact
of management responses on reviewing behavior, emphasizing potential gender differences.
Using data from Tripadvisor.com, this paper shows that after managers begin responding to
their reviewers, the probability that a negative review comes from a user who self-identifies
as female decreases. Based on a survey conducted among online review platform users,
this paper also shows that such a decrease can be explained by the fact that female users
are more likely to perceive management responses as a potential source of conflict with the
manager. Based on text analysis of management responses on Tripadvisor, this paper also
shows that female users are more likely to receive contentious responses (i.e., responses where
the manager is confrontational, tries to discredit the reviewer, or responds aggressively).
Overall, the findings in this paper suggest that management responses can lead to selective
attrition of female reviewers.
The second paper focuses on online labor platforms and explores the role of profile pic-
tures in hiring outcomes, emphasizing the potential role of appearance-based perceptions
of a worker’s fit for the job (e.g., whether a worker "looks the part"). Using data from
Freelancer.com, this paper shows that workers who "look the part" are more likely to be
hired. Based on two choice experiments, this paper also shows that the effect of "looking
the part" goes above and beyond gender and race (i.e., accessories or image background can
also play a role) and is stronger when reputation systems are less diagnostic. Overall,
these findings illustrate that profile pictures can influence hiring outcomes, especially when
reputation systems are not diagnostic enough to differentiate candidates.
Chapter 1
Does Gender Matter? The Effect of
Management Responses on Reviewing
Behavior
1.1 Introduction
User-generated content has gained tremendous popularity in the past two decades. Every
day, users scan and read Tripadvisor reviews to choose which hotel to stay in, Amazon
reviews to decide which product to buy, or Yelp reviews to select a restaurant for dinner.
This motivates firms to invest time, money, and effort to monitor and manage their online
reputation. A popular way of doing so is the practice of publicly responding to individual
reviews.
The current literature on online reputation management suggests that the adoption of
management responses has the potential to affect consumers’ reviewing behavior and, conse-
quently, a firm’s reputation. A common finding is that management responses can affect the
volume and sway the valence of subsequent consumer reviews (Proserpio and Zervas 2017,
Chevalier et al. 2018, Wang and Chaudhry 2018). In this paper (joint work with Davide
Proserpio and Francesca Valsesia), we study whether changes in the ratings associated with
consumer reviews can be explained by changes in the distribution of who decides to write a
review following the adoption of management responses.
In doing so, our paper answers the following question: Do management responses affect the
reviewing behavior of different consumers in different ways? And if so, why?
On the one hand, the motivations to write a review may vary depending on both the
observable and unobservable characteristics of the users. The presence of management re-
sponses may affect such motivations (e.g., to praise a business for a positive experience or
to complain about a negative experience), which in turn can lead to an increase or decrease
of reviews from a certain type of users. On the other hand, because responses to customers’
reviews are open-ended pieces of text written by real people, the content and tone of a re-
sponse may vary depending not only on the content and valence of the review but also on
observable characteristics of the reviewer such as gender, age, or race. Any differences in the
writing style of management responses can, in turn, affect the subsequent decisions to write
a review by some consumers.
In this paper, among the several observable variables that can be used to characterize
reviewers, we focus on their gender and study whether management responses differentially
affect the reviewing behavior of reviewers who self-identify as either female or male.¹ This
choice is motivated by the growing body of research in sociology, marketing, and economics
studying gender bias, gender stereotypes, and behavioral differences across genders, in a
variety of scenarios, and, more recently, the manifestation of these phenomena in online
settings (Bohren et al. 2019a, Small et al. 2007, Vasilescu et al. 2012).
We start by providing evidence suggesting that management responses differentially affect
the reviewing behavior of self-identified male and female reviewers. We use data collected
from Tripadvisor to show that the probability of observing a negative review written by
¹ We use the term “self-identify” because we use the gender information that some reviewers decided to
publicly disclose. One benefit of using reviews for which reviewers disclose their gender is the possibility of
exploring reviewing behavior as a function of the gender reviewers identify with. We nonetheless acknowledge
that the binary nature of the data available doesn’t allow for exploring other forms of gender bias. In the
Appendix, we show that our results hold when we include users that did not publicly disclose their gender,
but whose gender we infer using either their username or a text-based machine learning algorithm.
a self-identified female user decreases after hotels start responding to reviews. We then
move to study the mechanisms behind the change in reviewing behavior observed in the
data and explore two related potential factors that can explain how management responses
might affect different reviewers: 1) how male and female reviewers react to the presence of
management responses and 2) discrimination, i.e., gender bias in how managers respond to
reviewers. To study the reaction of reviewers to management responses, we survey online re-
view platform users and find that, when management responses are available, self-identified
female users are more concerned about potential confrontations that can arise if they write
a negative review. To study whether gender bias exists, we analyze the text of real manage-
ment responses collected from Tripadvisor. We find significant linguistic differences between
responses to reviews written by self-identified male and female users. In particular, when
the review is negative, management responses to female reviewers tend to be less positive,
to be angrier, and to contain more negations. We reinforce these findings by classifying
responses to negative reviews based on their content, and we provide evidence that manage-
ment responses to self-identified female users tend to be more aggressive, confrontational,
and more likely to discredit the reviewer. Besides suggesting that managers are biased in the
way in which they address reviewers, these results suggest female users are indeed correct in
perceiving management responses as a potential source of conflict. We conclude the paper
by showing that hotels with more contentious responses receive fewer negative reviews from
self-identified female reviewers, effectively linking reviewing behavior with how managers
respond to reviews.
Overall, our results suggest that management responses can change the distribution of
reviews written by self-identified male and female users, and that these changes can be par-
tially explained by differences in the way users react to the presence of management responses
and differences in how managers address female and male reviewers. These findings have
implications for both hotels and review platforms. From the hotels’ point of view, responding
to reviews and being aggressive towards self-identified female reviewers reduces the arrival of
negative reviews which, in turn, may lead to higher ratings and, therefore, demand. However,
hotels should consider the potential downsides of this behavior. For instance, discouraging
reviews from a specific group of users might affect a business’s ability to learn about these
consumers and their concerns. Moreover, given users’ tendency to trust reviews written by
those perceived to be similar to themselves (Wang et al. 2008, Ayeh et al. 2013), lower rep-
resentation of a given group may have repercussions on prospective customers’ intention to
book. Therefore, hotels may need to reconsider their online reputation management strate-
gies and dedicate more time to training their employees to address reviews in a way that
does not discourage the participation of any user.
When it comes to platforms, this work shows that management responses can be misused
to discriminate and that the participation of those users more likely to be discriminated
against can decrease. This occurs despite platforms like Tripadvisor or Yelp employing
algorithms to curate the user-generated content posted on their website, which suggests that
these platforms should design better curation algorithms to prevent this type of behavior.
For example, they should update the guidelines for how to write responses to reviews to
include a statement of diversity and inclusion. This is key for two reasons. First, platforms
compete for consumers’ time. To the extent that discrimination is responsible for a segment
of users choosing to migrate to competing platforms, this could result in market share losses.
Second, in any social environment, consumers look at the behavior of others to gauge the
acceptable norms in that environment (Kashima et al. 2013). The risk that platforms face
is that discrimination might be perceived to be allowed on a rating platform, which might
have negative consequences on how social interactions are conducted on the platform.
1.2 Literature Review
The recent literature on online reputation management provides strong evidence of the
impact of management responses on both the volume and valence of consumer ratings and
reviews. By studying hotel reviews on Tripadvisor and Expedia, Proserpio and Zervas (2017)
show that management responses increase review volume by 12% and improve ratings by
0.12 stars, and explain that this effect is due to management responses increasing the cost of
writing a negative review and reducing the cost of writing a positive review. Chevalier
et al. (2018) find consistent results when it comes to the effect of management responses on
review volume but seemingly opposite results when it comes to their effect on review valence.
Their findings suggest that management responses increase reviewers’ motivation to provide
feedback; however, because negative reviews are seen as more helpful, management responses
affect these types of reviews disproportionately. Such mixed findings are partially reconciled
by Wang and Chaudhry (2018), who show that reviewers react differently to management
responses to positive and negative reviews.
In contrast to the above papers, we study the impact of management responses on who
decides to write reviews. In particular, we focus on the role played by the gender of the re-
viewer. This choice is motivated by the growing body of research studying gender differences,
gender bias, and gender stereotypes in several settings and, more recently, the manifestation
of these phenomena in online settings. This literature highlights behavioral differences be-
tween genders that could affect how users behave in the presence of management responses.
Women, for example, are more inclined to be givers of praise and compliments rather than
givers of information (Tannen 1991, Holmes 1988, Herbert 1990, Johnson and Roen 1992),
to the point that compliment giving has been used to successfully infer the user gender in
electronic conversations (Thomson and Murachver 2001). Moreover, men and women tend
to manage conflict differently. Women are more likely to avoid direct confrontation (Soren-
son et al. 1995, Brewer et al. 2002, Small et al. 2007) and, when dealing with conflictual
situations, are more guarded, in part because they can more easily be labeled as arrogant or
unfeminine (Tannen 1991).
More directly related to our work, recent work started exploring gender differences and
gender bias in online settings. For example, Gallus and Bhatia (2020) find women choose to
contribute to less controversial conversational domains on Wikipedia. Moreover, Vasilescu
et al. (2012) show that question-and-answer platforms have a significant underrepresenta-
tion of female participants. This phenomenon may be the consequence of these users facing
hostile online environments, as documented by Bohren et al. (2019a) in the case of Stack
Overflow.
Following this stream of research, we focus on online review platforms, and investigate the
effect of management responses on the reviewing behavior of users based on their gender.
The gender of reviewers might matter for various reasons. Female reviewers might react
differently to the presence of management responses, and managers might be responsible for
addressing female reviewers differently. These factors, in turn, can affect female reviewers’
likelihood of writing a review. We explore these hypotheses next.
1.3 The Tripadvisor Dataset
We collect data from Tripadvisor, one of the most popular online review platforms. This
user-generated content platform provides travel-related reviews for hotels, restaurants, air-
lines, and experiences. Statistics from 2019 indicate that the platform contains more than
830 million reviews and, on average, receives 460 million visitors every month.²
Although visitors of the website can access and contribute to Tripadvisor without regis-
tering to the website, the platform encourages users to join the travel community by offering
access to special hotel deals and to its TripCollective enhanced contributor program. To join
this travel community, users create a basic user profile that they can edit at any time and
that includes attributes such as age, gender, and location. These variables are provided at
the user’s discretion. The user profile is then made publicly available, and is thus accessible
to other reviewers and hotel managers (see Figure 1.1 for an example).
² See: https://tripadvisor.mediaroom.com/US-about-us.
Figure 1.1: Example of Tripadvisor user profile.
Around 2009, Tripadvisor started giving hotel managers the opportunity to respond to
reviews. A response to a review is an open-ended piece of text that Tripadvisor displays below
the review it addresses.
Following Chevalier et al. (2018), we collected the entire set of reviews and responses
for all the hotels in the 25th to 75th largest US cities. This resulted in a total of 5,413
hotels, 2,028,872 reviews, and 962,395 responses, spanning a period of 15 years (2001–2016).
Further, we collected information for each one of the 1,463,950 users in our sample, including
username, entire review history (for a total of over 27M reviews), and, when disclosed,
personal information such as gender, age, and location.
For the main analyses reported in this paper, we focus on reviews of users who disclosed
their gender. We discuss the advantages and limitations of this approach in Section 1.4.3,
and show in Appendix A.5 that our results are robust to the inclusion of all reviews.
1.3.1 Descriptive Statistics
In our sample, 4,018 hotels (74% of all the hotels) responded to at least one of the reviews
written by their customers, a managerial practice that has been gaining popularity in recent
years (see Figure 1.2). Overall, the total number of responses is 962,395, roughly 47% of the
number of reviews; this means that, by the end of 2016, on average, close to one out of two
reviews had received a response.
Figure 1.2: Adoption of management responses over time.
Panel (a) shows the cumulative percentage of hotels that respond to reviews, and Panel (b)
shows the cumulative percentage of reviews with a response by year.
Table 1.1 reports summary statistics for reviews and responses. Hotels that respond to
their reviews have, on average, a higher review volume and a higher average rating compared
to hotels that do not respond to reviews. The table also shows that the time between
consecutive management responses is, on average, higher than the time between reviews
(approximately 18 days and 6 days, respectively), and that hotels respond to about half of
the reviews they receive in a month.
Turning to the reviews for which we have self-disclosed gender information, we have
673,604 reviews (or 33% of the total number of reviews) written by 379,938 reviewers (or
26% of the total number of reviewers) of 5,081 hotels (or 94% of the total number of hotels).
Table 1.2 provides information about the distribution of reviewers by gender and review
valence. Reviews from users who self-identify as female are, on average, more positive than
those from users who self-identify as male. This is true independently of whether we look at
the complete dataset or separately at the periods before and after the hotel began responding
Table 1.1: Summary statistics
Number of hotels 5,413
Without responses
Number of hotels 1,395
Avg. rating 3.18
Avg. number of reviews 43.37
Avg. number of monthly reviews 2.50
With responses
Number of hotels 4,018
Avg. rating 3.80
Avg. number of reviews 489.88
Avg. number of responses 239.52
Avg. number of monthly reviews 7.33
Avg. number of monthly responses 3.58
Avg. days between reviews* 5.90
Avg. days between responses* 17.82
Avg. monthly response rate* 0.51
* After first response
to reviews. Table 1.2 also suggests that the fraction of reviews from self-identified female
users declines after hotels begin to respond to reviews.
Table 1.2: Percentage of reviews written by self-identified female users

                                    Reviews
                            All      ≤ 2 stars   ≥ 3 stars
Pct. female reviewers...
  overall                 52.54%      51.25%      52.68%
  in the before period    55.25%      54.66%      55.34%
  in the after period     51.75%      49.68%      51.93%
We next analyze and compare review distributions by star-rating and traveler category.
Panel (a) of Figure 1.3 shows that the rating distribution of self-identified female reviewers
is slightly more positive than that of self-identified male reviewers. Panel (b) of Figure 1.3
shows that reviews from self-identified female reviewers are more likely to fall into the Couple,
Family, or Friends traveler categories, while self-identified male reviewers are more likely to
fall into the Business traveler category.
Finally, in Panel (c) of Figure 1.3, we plot the response rate (i.e., the fraction of the
reviews that receive a response from the hotel’s manager) by gender and star-rating. We
observe that the response rate to reviews from self-identified male reviewers (43.73% on
average) is higher than that of reviews from self-identified female reviewers (40.81% on average),
a difference that persists across all star-ratings.
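These response-rate and composition statistics can be computed directly from review-level data. The sketch below is a minimal pandas version under assumed column and file names (gender, stars, has_response); it illustrates the descriptive step rather than reproducing the authors' code.

```python
# Descriptive sketch for Panel (c) of Figure 1.3: the share of reviews that received a
# management response, by the reviewer's self-disclosed gender and by star-rating.
# Column and file names are hypothetical placeholders.
import pandas as pd

reviews = pd.read_csv("tripadvisor_reviews.csv")

# Restrict to reviews whose author publicly disclosed their gender, as in the main analyses.
disclosed = reviews[reviews["gender"].isin(["female", "male"])]

response_rate_by_gender = disclosed.groupby("gender")["has_response"].mean()
response_rate_by_star = (
    disclosed.groupby(["gender", "stars"])["has_response"].mean().unstack("stars")
)

print(response_rate_by_gender.round(4))
print(response_rate_by_star.round(4))
```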
Figure 1.3: Distribution of star-ratings.
Panel (a) shows the distribution of reviews’ star-ratings by gender; Panel (b) shows the
distribution of reviews’ traveler category by gender; Panel (c) shows the response rate, i.e.,
the fraction of reviews with a response, by gender and star-rating.
1.4 The Effect of Management Responses on the Likelihood to Review
In this section, we investigate whether the probability that a given review comes from a
self-identified female reviewer changes following the adoption of management responses.
1.4.1 Two-Way Fixed-Effects Regression
To estimate the impact of management responses on the likelihood that a given review
comes from a user who disclosed that their gender is female, we implement a two-way fixed effects
(TWFE) identification strategy that exploits the fact that 21% of the hotels in our Tripad-
visor sample never responded to their reviews, while among the ones that do respond, the
practice has been gradually gaining popularity over time.³ We compare changes in the likeli-
hood that a given review comes from a self-identified female reviewer for hotels that respond
to reviews, before and after the hotels begin to respond, with respect to a baseline of changes
among hotels that did not (or did not yet) respond to reviews over the same period.⁴
One of the main assumptions behind our identification strategy is that, in the absence
of the treatment, the outcome variable for treated and control hotels would have evolved in
a similar way. While this assumption is untestable (the counterfactual outcome for treated
hotels is unobserved), a common practice to partially test it is to implement an event study
in which treatment leads and lags are added to the model. In Appendix A.1, we provide
evidence that this assumption is likely satisfied in our settings.
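For concreteness, the sketch below shows one way such an event study could be set up; the file and column names (female, hotel_id, review_month, first_response_month) are assumptions for illustration rather than the authors' implementation.

```python
# Hypothetical event-study sketch for the parallel-trends check: leads and lags of
# treatment relative to each hotel's first management response, with hotel and
# year-month fixed effects and hotel-clustered standard errors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("reviews_gender_panel.csv",
                 parse_dates=["review_month", "first_response_month"])
df["year_month"] = df["review_month"].dt.to_period("M").astype(str)

# Months elapsed since the hotel's first response (negative = leads), binned at +/- 6.
months = ((df["review_month"].dt.year - df["first_response_month"].dt.year) * 12
          + (df["review_month"].dt.month - df["first_response_month"].dt.month)).clip(-6, 6)

def label(k):
    if pd.isna(k):
        return "never"                       # never-treated hotels
    return f"lead{int(-k)}" if k < 0 else f"lag{int(k)}"

df["event_time"] = months.map(label)

# The month just before the first response ("lead1") is the omitted reference period.
model = smf.ols(
    "female ~ C(event_time, Treatment(reference='lead1')) + C(hotel_id) + C(year_month)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["hotel_id"]})

# Lead coefficients close to zero are consistent with parallel pre-treatment trends.
print(model.params.filter(like="event_time"))
```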
Our specification takes the following form:
$$\mathrm{Female}_{ijt} = \beta_1\,\mathrm{Treated}_j + \beta_2\,\mathrm{After}_{ijt} + \beta_3\,\mathrm{After}_{ijt} \times \mathrm{Treated}_j + \gamma_j + \tau_t + X'_{ijt}\delta + \epsilon_{ijt}, \qquad (1.1)$$
where the dependent variable $\mathrm{Female}_{ijt}$ indicates whether review $i$ for hotel $j$ written at time
$t$ comes from a user who disclosed that her gender is female. $\mathrm{Treated}_j$ is an indicator for
whether the hotel ever responded to reviews; $\mathrm{After}_{ijt}$ is an indicator for whether review $i$
of hotel $j$ written at time $t$ was written after the hotel adopted management responses; and
the interaction $\mathrm{After}_{ijt} \times \mathrm{Treated}_j$, whose coefficient is of interest, measures changes in the
probability that a given review of a hotel that responds to reviews comes from a user who
self-identified as female, after the hotel began to respond, against a baseline of changes in
the same probability for hotels that do not respond to reviews. The specification includes
hotel fixed effects $\gamma_j$, which allow accounting for unobserved time-invariant hotel-specific
confounders (e.g., certain hotels attract more female customers), and year-month fixed effects
$\tau_t$, which allow controlling for unobserved time-specific confounders that affect all hotels
(e.g., seasonality effects). Additionally, we incorporate a set of time-varying controls, $X'_{ijt}$,
such as the reviewer traveler type (solo, business, family, friend), an indicator for whether
a review rating is above or equal to 3 stars, and the length of the review. Moreover, as is
common in this type of analysis, and to further reduce concerns about unobserved time-
varying confounders between the treated and control hotels, we allow for treatment-specific
time trends (Goodman-Bacon 2021, Rambachan and Roth 2019).
³ When we consider only reviews for which gender is disclosed, we are left with 5,081 hotels, out of which
4,002 responded to at least one review.
⁴ In doing so, we implicitly assume that management responses only affect customers' likelihood of writing
a review given a hotel visit, but not their likelihood of visiting a hotel. This implicit assumption has been
used in other recent studies (Proserpio and Zervas 2017, Wang and Chaudhry 2018).
Because our specification includes hotel fixed effects, Equation 1.1 can be rewritten as:

$$\mathrm{Female}_{ijt} = \beta_1\,\mathrm{After}_{ijt} + \gamma_j + \tau_t + X'_{ijt}\delta + \epsilon_{ijt}, \qquad (1.2)$$

where the coefficient of $\mathrm{After}_{ijt}$ is of interest. We estimate Equation 1.2 using OLS and,
following standard practice (Bertrand et al. 2004), clustering standard errors at the hotel
level to account for serial correlation in the dependent variable.⁵
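As a rough illustration of this estimation step, the sketch below runs Equations 1.2 and 1.3 as linear probability models with hotel and year-month fixed effects and hotel-clustered standard errors; variable names are placeholders, and the treatment-specific time trend used in the paper is omitted for brevity.

```python
# Minimal sketch of estimating Equations 1.2 and 1.3 (assumed column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("reviews_gender_panel.csv")
df["log_review_length"] = np.log1p(df["review_length"])

fe_and_controls = "C(traveler_segment) + log_review_length + C(hotel_id) + C(year_month)"
cluster = {"groups": df["hotel_id"]}

# Equation 1.2: effect of the post-response period on the probability that a review
# comes from a self-identified female reviewer.
eq2 = smf.ols(f"female ~ after + {fe_and_controls}", data=df).fit(
    cov_type="cluster", cov_kwds=cluster)

# Equation 1.3: allow the effect to differ for positive (3+ star) reviews; Treated_j
# itself is absorbed by the hotel fixed effects.
eq3 = smf.ols(f"female ~ after * positive + treated:positive + {fe_and_controls}",
              data=df).fit(cov_type="cluster", cov_kwds=cluster)

print(eq2.params["after"])                       # overall effect (column 1 of Table 1.3)
print(eq3.params[["after", "after:positive"]])   # beta_1 and beta_2 (column 2)
```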
We report the estimates of Equation 1.2 in column 1 of Table 1.3. The coefficient of
interest is negative but not statistically significant, suggesting that, overall, the likelihood
of observing a review written by a self-identified female reviewer is not different from that
of a self-identified male reviewer.
We, therefore, explore whether the effect of management responses is different depending
on the valence of the review. For this purpose, we categorize reviews into negative reviews
(1 and 2 stars), and positive reviews (3 stars and above).⁶
⁵ Because our dependent variable is binary, we could have opted for a logistic regression. However, linear
regression is generally the best strategy to estimate causal effects of treatments on binary outcomes, especially
when the model includes interaction terms or fixed effects and its coefficients are directly interpreted in terms
of probabilities (Gomila 2019).

Table 1.3: TWFE estimates

                            (1)          (2)
After                    -0.003       -0.020
                         (0.003)      (0.005)
After × Positive                       0.018
                                      (0.005)
Positive                  0.013        0.024
                         (0.002)      (0.014)
Treated × Positive                     0.025
                                      (0.015)
Controls:
  Traveler segment          Yes          Yes
  log Review length         Yes          Yes
Observations            673,604      673,604
Adjusted R²               0.043        0.043

Note: The dependent variable is whether the user gender of review i of hotel j at time t is female.
All specifications include hotel and year-month fixed effects, and a treatment-specific time trend.
Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.

Then, we estimate the following specification:
$$\mathrm{Female}_{ijt} = \beta_1\,\mathrm{After}_{ijt} + \beta_2\,\mathrm{After}_{ijt} \times \mathrm{Positive}_{ijt} + \beta_3\,\mathrm{Treated}_j \times \mathrm{Positive}_{ijt} + \beta_4\,\mathrm{Positive}_{ijt} + \gamma_j + \tau_t + X'_{ijt}\delta + \epsilon_{ijt}, \qquad (1.3)$$
where everything is as in Equation 1.1, and $\mathrm{Positive}_{ijt}$ is a binary indicator that is equal
to one if the review star-rating is 3 stars or above. The two coefficients of interest are
$\beta_1$, which measures the effect of management responses on the likelihood that a negative
review comes from a self-identified female reviewer, and $\beta_2$, which measures the differential
effect for positive reviews written by self-identified female reviewers.
⁶ Results are qualitatively the same when considering 3-star reviews as negative.
As before, we estimate Equation 1.3 using OLS and clustering standard errors at the hotel
level. We report these results in column 2 of Table 1.3. We observe that both $\beta_1$ and $\beta_2$
are statistically significant.
$\beta_1$ tells us that management responses affect the likelihood of observing a negative review
coming from a female reviewer; $\beta_2$ tells us that the effect of management responses on positive
reviews is statistically different from that on negative reviews. The estimates suggest that
the probability that a negative review comes from a self-identified female reviewer decreases
by 2% ($p < 0.05$), while the probability that a positive review comes from a self-identified
female reviewer does not change significantly ($\beta_1 + \beta_2 = -0.2\%$, $p = 0.552$). Using the pre-
response period as the baseline (see Table 1.2), the estimated decrease in the probability that
a negative review comes from a self-identified female reviewer translates to -3.7% ($\beta_1/\mathrm{Baseline}$
= -0.02/0.54).
Overall, our results are consistent with management responses differentially affecting self-
identified female and male reviewers' likelihood to write reviews. More specifically, our results
suggest that in the presence of management responses it is less likely to observe negative
reviews coming from self-identified female reviewers. Finally, turning to the reputational
benefits obtained from responding to reviews (Proserpio and Zervas 2017), our results suggest
that the increase in ratings due to management responses is partially driven by self-identified
female reviewers being more positive after managers begin to respond to reviews.⁷
Of course, as with any observational study, the above analysis does not come without limi-
tations. There are two types of concerns about the results presented in this section. First,
hotels' self-selection into responding to reviews, and second, reviewers' self-selection into
disclosing their gender. Next, we proceed to describe each of these concerns in detail and
present additional analyses aiming to address them.
⁷ We formally confirm this in Appendix A.2.
1.4.2 Hotels Self-Selection into Treatment
As previous studies on management responses suggest (Chevalier et al. 2018, Proserpio
and Zervas 2017), it is likely that the decision of hotels to begin to respond to reviews is
endogenous. For example, when the outcome of interest is the hotels’ average rating, it
is possible that hotels start responding to promote quality improvements that would also
(positively) affect future customer ratings.
In our case, it is not immediately clear how endogeneity would play a role. For example,
it seems unlikely that hotel managers decide to start responding to reviews when the number
of female reviewers increases or decreases, because such a quantity is not readily available.⁸
However, a more plausible scenario is as follows: Suppose that female and male reviewers
write about completely different topics (for example, female reviewers write about food and
male reviewers write about amenities) and that some topics but not others trigger managers
to respond to reviews. Then suppose that, along with adopting management responses,
hotels take some corrective actions with regards to the issues brought up by the reviews that
triggered the responses (for instance, suppose that men complain a lot about Wi-Fi speed
and the hotel improves it substantially). If these (unobserved to the researchers) actions
reduce the likelihood of men writing reviews, then this scenario could bias (upward, in this
case) our estimates.
We conduct three analyses aimed at reducing these types of concerns. First, we follow Sea-
mans and Zhu (2014), and estimate a discrete-time hazard model to explore the extent to
which the proportion of reviews written by female reviewers can predict hotels’ timing of the
first response. We report these results in Table A.2 in Appendix A.3. These results suggest
that the observed proportion of reviews written by female users does not determine when
hotels begin to respond to reviews. Second, to further reduce concerns about time-varying
confounders, we incorporate hotel-specific time trends into our specification. We report these
results in Table A.3 in Appendix A.4.⁹
⁸ Of course, some hotels could decide to invest time and effort in measuring such a quantity as we did in
this paper, but we argue that this is an unlikely scenario.
Third, we implement a new identification strategy
that relies on a different—and likely to be exogenous—treatment variable. We present this
analysis next.
Management response visibility as a treatment variable To further reduce endo-
geneity concerns related to hotels' self-selection into responding, we implement an alternative
identification strategy that relies on a treatment variable that is likely to be exogenous.
We exploit the facts that (1) Tripadvisor displays 10 reviews (and the responses they
received, if any) chronologically ordered on each page; (2) users are more likely to read the
first page of hotel reviews than subsequent ones; and (3) reviews arrive faster than responses
do (see Table 1.1), and we create a treatment variable, the response visibility, that is likely
to be outside the hotels’ control, and thus likely to be exogenous. For every review i of
hotel $j$ written at time $t$, we compute the fraction of the latest 10 reviews (those in the first
page) for hotel $j$ that received a response by time $t$. In doing so, we follow a similar strategy
adopted by both Proserpio and Zervas (2017) and Wang and Chaudhry (2018). This variable
is likely to be exogenous because hotels cannot easily manipulate the number of responses
that an incoming potential reviewer would see when visiting the Tripadvisor hotel page.
This is because visibility depends both on the frequency at which responses arrive (which
is something hotels can control) but also on the frequency at which reviews arrive (which
hotels cannot control). Using this variable, we estimate the following specification:
$$\mathrm{Female}_{ijt} = \beta\,\mathrm{Visibility}_{ijt} + \gamma_j + \tau_t + X'_{ijt}\delta + \epsilon_{ijt}, \qquad (1.4)$$

where everything is as in Equation 1.2, and $\mathrm{Visibility}_{ijt}$ is as defined above.
⁹ One may argue that there might be confounders that vary at the hotel-year-month level and that hotel-
year-month fixed effects could help in this case. However, hotel-year-month fixed effects are almost perfectly
collinear with our treatment variable After. The only case in which they are not is when the variable After
changes from zero to one within the same month for the same hotel. This is true for only 641 out of 4,002
hotels that ever responded to reviews in our dataset and for which we have reviews from reviewers who
disclosed their gender.
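A sketch of how this visibility measure could be constructed from review-level data is shown below; column names are hypothetical, and reviews with fewer than ten predecessors simply use whatever prior reviews exist.

```python
# Hypothetical construction of the response-visibility treatment: for each review, the
# fraction of the hotel's 10 most recent preceding reviews (the first results page) that
# had already received a management response by the time the review was written.
import pandas as pd

reviews = pd.read_csv("tripadvisor_reviews.csv",
                      parse_dates=["review_date", "response_date"])
reviews = reviews.sort_values(["hotel_id", "review_date"])

def visibility_for_hotel(g: pd.DataFrame) -> pd.Series:
    values = []
    for i, t in enumerate(g["review_date"]):
        window = g.iloc[max(0, i - 10):i]      # up to 10 reviews preceding review i
        if len(window) == 0:
            values.append(0.0)
            continue
        answered = (window["response_date"].notna()
                    & (window["response_date"] <= t)).sum()
        values.append(answered / len(window))
    return pd.Series(values, index=g.index)

reviews["visibility"] = (
    reviews.groupby("hotel_id", group_keys=False).apply(visibility_for_hotel)
)
```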
As in our previous analysis, we estimate Equation 1.4 using OLS and clustering standard
errors at the hotel level. In addition, we estimate this specification by restricting the sample
to hotels that respond to reviews, and to the reviews that arrived after each hotel’s first
response was published. By focusing solely on hotels that do respond to reviews, we are
eliminating concerns about potential differences between responding and non-responding
hotels that can drive the results. By focusing solely on the period after hotels begin to
respond, we are eliminating concerns about the timing of the first response being endogenous;
this is because all travelers writing reviews in the after period should have experienced any
observable or unobservable change implemented by the hotels that could also influence their
decision to start responding to reviews (e.g., improvements in quality).
Using this specification, the effect of management responses is identified by comparing the
probability that a given review comes from a self-identified female reviewer when responses
are more likely to be visible with respect to the same probability in the case in which
responses are less likely to be visible (even if the hotel adopted management responses).
To better describe the variation that we are exploiting in this specification, we resort
to the treatment variation plot discussed in Imai et al. (2018). In Figure 1.4, we plot the
response visibility for each hotel and year-month while ordering hotels based on the fraction
of months they are treated.¹⁰ We observe substantial variation in the treatment assignment
both across hotels and over time.¹¹ Moreover, we can clearly observe from the color mix on
the individual plot lines that for many hotels the treatment turns on and off many times,
confirming our assumption that hotels cannot easily manipulate the response visibility, and
that this variable is therefore likely to be exogenous.¹²
We report these results in column 1 of Table 1.4. The estimates are consistent with the
sign of the estimates reported in column 1 of Table 1.3 and statistically significant, i.e., we
observe a statistically significant decrease of 0.9% ($p < 0.01$) in the probability that a given
review comes from a self-identified female reviewer.
¹⁰ We convert the response visibility to a binary variable that we set to one if visibility is greater than
zero.
¹¹ The average value and standard deviation of the variable Visibility for hotels that respond to reviews
is about 0.35 and 0.37, respectively.
¹² Papers studying questions in which the treatment variable turns on and off many times are not uncom-
mon in the literature; see, for example, Acemoglu et al. (2019) and Scheve and Stasavage (2012).

Figure 1.4: Response visibility over time.
Treatment variation plot for visualizing the distribution of treatment across hotels and time
for the specification using response visibility as a treatment variable.
Next, we explore whether the effect of management responses is different depending on
whether the review is positive or negative. Our specification is the following:

$$\mathrm{Female}_{ijt} = \beta_1\,\mathrm{Visibility}_{ijt} + \beta_2\,\mathrm{Visibility}_{ijt} \times \mathrm{Positive}_{ijt} + \gamma_j + \tau_t + X'_{ijt}\delta + \epsilon_{ijt}, \qquad (1.5)$$

where everything is as in Equation 1.4, and $\mathrm{Positive}_{ijt}$ is a binary indicator that is equal to
one if the review star-rating is three or above. The coefficients of interest are $\beta_1$, which measures
the effect of management responses on the probability of observing a negative review from
a female user, and $\beta_2$, which measures the differential effect for positive reviews.
Table 1.4: TWFE estimates using response visibility as treatment variable

                              (1)          (2)
Visibility                 -0.009       -0.020
                           (0.002)      (0.007)
Visibility × Positive                    0.012
                                        (0.007)
Positive                    0.019        0.015
                           (0.003)      (0.004)
Controls:
  Traveler segment            Yes          Yes
  log Review length           Yes          Yes
Observations              520,833      520,833
Adjusted R²                 0.044        0.044

Note: The dependent variable is whether the user gender of review i of hotel j at time t is female.
All specifications include hotel and year-month fixed effects. Cluster-robust standard errors at the
individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
As before, we estimate Equation 1.5 using OLS and clustering standard errors at the hotel
level. We report these results in column 2 of Table 1.4. The results are again consistent with
those reported in column 2 of Table 1.3. The probability that a negative review comes from a
user who discloses that her gender is female decreases by 2% ($p < 0.01$), and the probability that
a positive review comes from a self-identified female reviewer decreases by 0.8% ($p < 0.01$).
Overall, the results presented in this section are consistent with the findings discussed in
Section 1.4.1, suggesting that hotels’ self-selection is not affecting our results.
1.4.3 Reviewers’ Self-Selection into Disclosing their Gender
So far, our analyses have focused on the subsample of reviews written by users who
publicly disclosed information about their gender. The advantage of this approach is that
we do not have to rely on algorithms to infer the gender of those reviewers that decided not
to disclose their gender. However, this choice might raise two types of concern.
The first concern is related to who decides to disclose their gender. If reviewers who
voluntarily disclose their gender are systematically different from those who do not do so,
our results may be biased. Consider, for example, the hypothetical case in which female users
who disclose their gender are much more affected by management responses than female users
who decide to not disclose their gender. In this case, our estimates would be overstating the
effect of management responses. To reduce this concern, in Appendix A.5.1, we replicate
our main results using a larger sample of reviewers for whom we predict gender using two
different algorithms, one based on the reviewers’ username and one based on the text of the
reviews.
The second concern is related to the fact that management responses could also affect
the reviewers' decision to disclose their gender. If such a decision is affected by the presence
of management responses, and this effect varies by gender and review valence, our estimates
should be viewed as the combined effect of users deciding not to review and deciding to review
but not to disclose their gender. In Appendix A.5.2, we show that our main estimates are
likely a combination of female reviewers being less likely to write negative reviews and less
likely to disclose their gender in the presence of management responses.
1.5 Mechanisms
In Section 1.4, we showed that the likelihood of a given negative review coming from a
self-identified female user decreases after hotels begin to respond to reviews. In this section,
we discuss the potential mechanisms behind this effect.
1.5.1 How do Reviewers React to the Presence of Management Responses?
We start by discussing a survey we conducted among online review platform users aimed
at understanding whether the presence of management responses differentially affects the
motivation to review of male and female reviewers.
The survey was completed by 599 online review platform users (278 males and 321 fe-
males) in exchange for a small compensation. Respondents were recruited through Amazon
Mechanical Turk and were only allowed to participate in the survey if they indicated they
regularly used online review platforms such as Tripadvisor and Yelp.
Respondents were first reminded that online review platforms such as Tripadvisor and
Yelp allow business owners and managers to respond to reviews, and were then asked to
indicate how they feel about management responses.
Based on prior literature (Proserpio and Zervas 2017, Chevalier et al. 2018), we identified
three potential reactions to the introduction of management responses: (1) praise: excite-
ment for the possibility of directly praising managers for a positive experience; (2) complain:
excitement for the possibility of directly complaining to managers for a negative experience;
and (3) concern: concern that managers would discredit their negative reviews and be con-
frontational if they wrote a negative review. To measure these three reactions, respondents
were asked to rate their agreement (1 = completely disagree, 7 = completely agree) with the
statements reported in Table 1.5. Each reaction was measured as the average of three items
(all statements were presented to respondents in random order, correlations and discriminant
validity across items are reported in the note to Table 1.5). Further, we collected additional
information about the respondents including demographic variables (age, race, income level,
education, marriage status, native language) and their frequency of use of Tripadvisor, that
we incorporated as controls in our analysis.
Based on our review of the literature, we expected female users to be more excited about
the prospect of directly praising managers and business owners if they had a positive
Table 1.5: Survey statements
1. If I had a positive experience, I am pleased I would be able to compliment owners
and managers directly for their good work.
2. If I had a positive experience, I am excited I would be able to thank owners and
managers directly for doing a good job.
3. If I had a positive experience, I am thrilled owners and managers would be able to
respond to my compliments.
4. If I had a negative experience, I am glad I would be able to demand compensation
to a owner or manager directly.
5. If I had a negative experience, I am thankful I would be able to complain directly
to a owner or manager.
6. If I had a negative experience, I am happy owners and managers would be able to
respond to my complaints.
7. If I wrote a negative review, I am bothered owners and managers could try to
discredit me.
8. If I wrote a negative review, I am worried owners and managers could be confronta-
tional.
9. If I wrote a negative review, I am concerned owners and managers could respond
aggressively.
Note: Statements 1–3 measure Praise (α = 0.88), 4–6 measure Complain (α = 0.73), and 7–9 measure
Concern (α = 0.88). The confirmatory factor analysis including these statements supports discriminant
validity between the three constructs: the average variance extracted is 0.81, 0.58, and 0.78 for Praise,
Complain, and Concern respectively, while squared correlations are 0.404 for Praise-Complain, 0.039 for
Praise-Concern, and 0.182 for Complain-Concern.
experience (Holmes 1988, Herbert 1990, Johnson and Roen 1992) and to be more concerned
about confrontation if they were to leave a negative review (Sorenson et al. 1995, Brewer
et al. 2002); we did not necessarily expect a difference across genders in the excitement
about being able to complain directly and be heard out by managers following a negative
experience.
To test whether there are significant differences between female and male respondents,
we estimate the following specification:
$$\mathrm{Reaction}_i = \beta_0 + \beta_1\,\mathrm{Female}_i + \beta_2\,\mathrm{Controls}_i + \epsilon_i, \qquad (1.6)$$
where the dependent variable of interest is the score reported by the respondent $i$ for every
one of the three reactions, and $\mathrm{Female}_i$, whose coefficient is of interest, is an indicator of
whether the gender of the respondent $i$ is female.
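A minimal sketch of this regression for one of the three reaction scores is given below; the variable names and the set of control dummies are assumptions, and the percentage difference quoted in the text is illustrated here, for simplicity, as the Female coefficient relative to the intercept.

```python
# Illustrative estimation of Equation 1.6 for the "Concern" reaction score
# (assumed column names for the survey data and controls).
import pandas as pd
import statsmodels.formula.api as smf

survey = pd.read_csv("survey_responses.csv")

controls = ("C(native_english) + age + C(usage_frequency) + C(income)"
            " + C(ethnicity) + C(married) + C(education)")

model = smf.ols(f"concern ~ female + {controls}", data=survey).fit()

beta1 = model.params["female"]
beta0 = model.params["Intercept"]
print(f"Female coefficient: {beta1:.3f}")
print(f"Relative to the intercept: {100 * beta1 / beta0:.2f}%")
```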
We report the estimates of Equation 1.6 in Table 1.6. We find that female reviewers
are more excited than male reviewers about the possibility of directly and publicly praising
managers for their good work (when including controls, the score for female participants is
5.33% (p< 0:1) higher than the score for male participants), and that female reviewers are
more concerned than males about potential confrontations that could arise with managers
if they wrote a negative review (when including controls, the score for female respondents
is 16.05% (p < 0:01) higher than the score for male participants).
13
Finally, and in line
with our expectations, we find no significant difference in excitement about the possibility
of directly and publicly complaining to managers about a negative experience.
Table 1.6: Survey estimates

                    Praise              Complain             Concern
                  (1)      (2)        (3)      (4)        (5)      (6)
Female          0.292    0.280     -0.015   -0.009      0.397    0.453
               (0.090)  (0.094)   (0.111)  (0.113)    (0.144)  (0.148)
Intercept       5.941    6.383      5.109    3.891      3.182    2.107
               (0.066)  (0.825)   (0.081)  (0.991)    (0.105)  (1.294)
Controls           No      Yes        No      Yes         No      Yes
Observations      599      575       599      575        599      575
Adjusted R²     0.016    0.021    -0.002    0.048      0.011    0.060
Note: The control variables include whether the respondent is an English native speaker
and the age, Tripadvisor usage frequency, income, ethnicity, married status, and edu-
cation level of the respondent. The value of the control variables was not available for
27 respondents.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
If female reviewers' concerns about facing a manager's confrontational behavior inhibit
them from writing negative reviews, these results can explain why, in our Tripadvisor dataset,
we observe fewer self-identified female reviewers leaving a negative review after hotels begin
to respond to reviews. We next proceed to investigate whether the concerns that female
users have about management responses are well founded by analyzing how hotel managers
respond to reviews based on the gender of the reviewers.
¹³ We report the percentage difference in the average reaction reported by female and male respondents,
computed as $\beta_1/\beta_0$.
1.5.2 Management Responses and Gender Bias
In this section, we explore whether female reviewers' concern that management responses can be a source of conflict is supported in our data by examining whether there is any evidence of gender bias in the way managers address reviewers when responding to them.
Linguistic Analysis of Management Responses
To explore the possibility of gender bias in the way hotel managers respond to reviews,
we first use the Linguistic Inquiry and Word Count (LIWC) text analysis tool (Pennebaker
et al. 2001) to process the 284,271 responses in the Tripadvisor sample of reviews for which
the reviewer’s gender is disclosed by the reviewer. LIWC is a dictionary-based tool for text
analysis that categorizes words in a given text into several dimensions including different
emotions, thinking styles, social concerns, and parts of speech. For each linguistic variable,
the output of LIWC is the percentage of words in the text pertaining to the dictionary
associated with that variable.
The purpose of this analysis is to investigate whether management responses are more
contentious (e.g., aggressive, confrontational, or dismissive) when written to female reviewers
than when addressed to male reviewers. While LIWC does not provide direct measures of
contentiousness, it does provide several variables that can be used as a proxy for these types
of responses. Specifically, we focus on the following variables: positive emotion (e.g., thank,
like, improve, lovely), negative emotion (e.g., dislike, hostile, suspicious, liar), anger (e.g.,
aggressive, liar, rude, violent), negations (negative verbs or negation verbs), and second-
(you) and third-person singular pronouns (she and he).
Positive emotion words reflect a more positive tone in a response, while negations and negative emotion words, and anger in particular, reflect a more negative tone. We include second- and third-person personal pronouns because pronominal usage is a semantic resource that can be used to establish the social distance and power relationship between two parties (Brown et al. 1960). In our context, responses that have higher values for the second-person variable are more likely to be apologetic, while responses with higher values for the third-person variable are more likely to be critical toward the person who wrote the review. This is consistent with studies of political discourse analysis and conference presentations suggesting that second-person pronouns are used in positive contexts to create rapport (Polo 2018) and that third-person pronouns are often used in a negative context as a way to invalidate the opposition (Håkansson 2012). We formally test how these pronouns are used in the context of management responses in Appendix ??.
Using the subsample of reviews where users disclosed their gender, we conduct a t-test
of the difference in the scores of the above LIWC variables for responses to female and
male reviewers. We report our results in Table 1.7. In columns 1–3, we show the results
for responses to negative reviews (1 and 2 stars), and in columns 4–6, we show the results
for responses to neutral or positive reviews (3 stars or above). The results suggest that
when responding to negative reviews, management responses to female reviewers tend to
be less positive and more negative, to use fewer second-person singular pronouns and more
third-person singular pronouns, to be angrier, and to use more negations words.
Taken at face value, these differences provide suggestive evidence that responses to female
reviewers are less favorable than responses to male reviewers.
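For readers who want to reproduce this style of comparison without the proprietary LIWC dictionaries, the sketch below shows the general pattern: score each response as the percentage of its words that fall in a category word list, then compare responses addressed to self-identified female and male reviewers with t-tests. The word lists, file name, and column names are illustrative placeholders, not the LIWC lexicon or our data layout.

```python
import re
import pandas as pd
from scipy import stats

# Toy stand-ins for the licensed LIWC dictionaries; the real categories are far larger.
CATEGORIES = {
    "positive_emotion": {"thank", "like", "improve", "lovely"},
    "negative_emotion": {"dislike", "hostile", "suspicious", "liar"},
    "anger": {"aggressive", "liar", "rude", "violent"},
}

def category_share(text, words):
    """Percentage of tokens in `text` that belong to the given category word list."""
    tokens = re.findall(r"[a-z']+", str(text).lower())
    if not tokens:
        return 0.0
    return 100.0 * sum(tok in words for tok in tokens) / len(tokens)

# Assumed columns: response_text, reviewer_female (0/1), review_negative (0/1).
responses = pd.read_csv("management_responses.csv")
for name, words in CATEGORIES.items():
    responses[name] = responses["response_text"].apply(lambda t: category_share(t, words))

# Compare responses to negative reviews by reviewer gender (columns 1-3 of Table 1.7).
neg = responses[responses["review_negative"] == 1]
for name in CATEGORIES:
    to_female = neg.loc[neg["reviewer_female"] == 1, name]
    to_male = neg.loc[neg["reviewer_female"] == 0, name]
    t_stat, p_val = stats.ttest_ind(to_female, to_male, equal_var=False)
    print(f"{name}: diff={to_female.mean() - to_male.mean():.3f}, t={t_stat:.2f}, p={p_val:.3f}")
```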
Table 1.7: Comparison of response text characteristics by reviewer's gender

                               Responses to negative reviews      Responses to positive reviews
LIWC Variable                  To female  To male  Difference     To female  To male  Difference
Positive emotions                7.974     8.170     -0.196        11.482    11.491     -0.009
Negative emotions                1.526     1.460      0.066         0.361     0.345      0.016
Second pers. pron. (you)         7.445     7.564     -0.119         8.885     8.914     -0.029
Third pers. pron. (she/he)       0.049     0.041      0.008         0.050     0.041      0.009
Negations                        0.990     0.948      0.042         0.365     0.353      0.012
Anger                            0.074     0.067      0.007         0.018     0.018      0.000
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.

Management Responses Classification

So far, we have used the presence of specific words in the text of the responses to reviews to provide evidence of discriminatory behavior towards female reviewers by the hotels. There are two limitations associated with the previous analysis. First, LIWC does not provide direct measures of contentious responses; second, the previous analysis does not take into account the content of the reviews to which managers respond. This can bias the estimates if, compared to male reviewers, female reviewers use a different tone (e.g., they are more angry or aggressive) in their reviews and tone drives how managers decide to address the feedback received. In this section, we overcome these limitations by formally classifying responses depending on their text content and directly controlling for the review content.
To classify responses, we asked two coders to classify a random subsample of 500 responses to negative reviews (2 stars or below) as contentious or not. We define a response as contentious if the manager either responds aggressively to the reviewer, is confrontational with the reviewer, or tries to discredit the reviewer (we refer the reader to Appendix A.7 for a detailed discussion of this coding exercise). We also tried replacing the two coders with Amazon Mechanical Turk coders and increasing the number of coders from two to five, and obtained similar results. We then used this information to train a text-based classifier to predict the response type (contentious or not) for all the responses to negative reviews in our dataset. Following standard practice, we split the responses dataset into an 80% training sample and a 20% test sample. We selected the results from a Naive Bayes classifier based on tf-idf, which achieves an 85.0% accuracy rate and an 87.4% ROC-AUC score; other types of classifiers led to similar performance.
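A minimal version of this classification step, assuming a hand-labeled file with placeholder column names, could look as follows: a tf-idf representation of each response feeds a multinomial Naive Bayes model evaluated on a held-out 20% split, and the fitted pipeline is then applied to all responses to negative reviews.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, roc_auc_score

# Hand-labeled subsample: response text plus a 0/1 "contentious" label from the coders.
labeled = pd.read_csv("labeled_responses.csv")  # assumed columns: response_text, contentious

X_train, X_test, y_train, y_test = train_test_split(
    labeled["response_text"], labeled["contentious"], test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, pred))
print("ROC-AUC:", roc_auc_score(y_test, prob))

# Once validated, the fitted pipeline labels every response to a negative review.
all_responses = pd.read_csv("responses_to_negative_reviews.csv")  # assumed column: response_text
all_responses["contentious_pred"] = model.predict(all_responses["response_text"])
```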
To control for the review content, we include in our specification three different sets of variables: (1) three review topic weights that we estimate using latent Dirichlet allocation (LDA) (Blei et al. 2003); we determined the optimal number of LDA topics to be three using the perplexity score, and results are not sensitive to the inclusion of a higher number of topics (see Appendix A.6 for more details); (2) whether a review is classified as contentious, following an approach similar to the one we adopted for classifying management responses (see Appendix A.8 for more details); and (3) review characteristics such as the review length (in characters) and LIWC anger and tone (defined as the difference between the number of positive and negative words).
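As an illustration of the topic-weight controls, the sketch below fits a three-topic LDA model on the review texts and attaches each review's topic proportions as columns; the input file and column names are placeholders.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = pd.read_csv("reviews.csv")  # assumed column: review_text

# LDA operates on raw term counts rather than tf-idf weights.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(reviews["review_text"])

# Three topics, as selected via the perplexity score in the paper.
lda = LatentDirichletAllocation(n_components=3, random_state=42)
topic_weights = lda.fit_transform(counts)  # rows sum to 1: per-review topic proportions

for k in range(3):
    reviews[f"topic_{k}"] = topic_weights[:, k]

# Perplexity (lower is better) can guide the choice of the number of topics.
print("Perplexity (3 topics):", lda.perplexity(counts))
```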
After classifying management responses and creating the controls for the review content,
we proceed to estimate a model where we regress the type of response (contentious or not)
against the gender of the reviewer. We report the estimates in Table 1.8. Depending on the
specification, these results suggest that responses to female reviewers are 0.9% to 1.2% more
likely to be classified as contentious responses, i.e., responses in which the manager either
responds aggressively, is confrontational, or tries to discredit the reviewer. This is true even
after controlling for the review content, suggesting that, even when the reviews are similar
in their characteristics, managers are more likely to write contentious responses for female
reviewers.
Overall, the results presented in this section are consistent with gender bias in the way
hotel managers address reviewers and suggest female reviewers’ concerns about management
responses being a potential source of conflict are indeed justified.
Table 1.8: Reviewer's gender on management response type

                              (1)       (2)       (3)       (4)
Female                       0.012     0.011     0.009     0.009
                            (0.003)   (0.003)   (0.003)   (0.003)
Intercept                    0.053     0.048     0.060     0.063
                            (0.002)   (0.004)   (0.004)   (0.005)
Controls:
  Review characteristics      No        Yes       Yes       Yes
  Review LDA topics           No        No        Yes       Yes
  Review is contentious       No        No        No        Yes
Observations               27,941    27,941    27,941    27,941
Adjusted R2                 0.001     0.002     0.004     0.004

Note: The dependent variable is an indicator of whether the management response to review i of hotel j written at time t is classified as contentious. We include controls for review characteristics, i.e., the review length and LIWC variables for anger and tone (defined as the difference between positive and negative words); review topics, estimated using the LDA algorithm; and whether the review is classified as contentious. Standard errors are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.

The effect of contentious responses on the likelihood to review We continue this section by investigating whether hotels with more contentious responses receive fewer reviews from self-identified female reviewers, effectively linking the analysis reported in Section 1.4 with the analysis discussed above (we thank an anonymous reviewer for this suggestion). To do so, we estimate the following specification:
\[
\begin{aligned}
\text{Female}_{ijt} = {} & \beta_1 \text{After}_{ijt} + \beta_2 \text{Positive}_{ijt} + \beta_3 \text{After}_{ijt} \times \text{Positive}_{ijt} \\
& + \beta_4 \text{Treated}_{ijt} \times \text{Positive}_{ijt} + \beta_5 \text{After}_{ijt} \times \log \text{Contentious Responses}_{jt} \\
& + \beta_6 \text{After}_{ijt} \times \text{Positive}_{ijt} \times \log \text{Contentious Responses}_{jt} \\
& + \gamma_j + \tau_t + X'_{ijt}\delta + \varepsilon_{ijt}, \qquad (1.7)
\end{aligned}
\]
where everything is as in Equation 1.2, and log Contentious Responses_jt is the log of 1 + the cumulative number of contentious responses written by hotel j at time t. The coefficients of interest are β5, which measures the additional impact (on top of the effect of responding to reviews) of contentious responses on the likelihood that a negative review comes from a self-identified female reviewer, and β6, which measures the differential impact for positive reviews. We report these results in Table 1.9. We observe that β5 is negative and significant, suggesting that self-identified female reviewers write even fewer negative reviews for hotels that respond and write more contentious responses. For example, moving from zero to two (the median) cumulative contentious responses, the effect on the likelihood of observing a negative review from a self-identified female reviewer changes from -1.35% (p<0.05) to -1.96% (p<0.01). β6 is positive, significant, and of the same magnitude as β5, suggesting that the additional impact of contentious responses is indistinguishable from zero for positive reviews: the effect on the likelihood of observing a positive review from a self-identified female reviewer is -0.22% (p=0.40) and -0.19% (p=0.51) for hotels with zero and two (the median) cumulative contentious responses, respectively.
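A specification of this form can be estimated as a linear probability model with hotel and year-month fixed effects and hotel-clustered standard errors. The sketch below uses statsmodels with assumed column names and omits the treatment-specific time trend for brevity; it illustrates the structure of the estimation rather than reproducing Table 1.9 exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per review. Assumed columns: female, after, positive, treated,
# cum_contentious_responses, traveler_segment, review_length, hotel_id, year_month.
reviews = pd.read_csv("reviews_panel.csv")
reviews["log_contentious"] = np.log1p(reviews["cum_contentious_responses"])

# Linear probability model with hotel and year-month fixed effects (Equation 1.7).
# With thousands of hotels, an absorbing/within estimator is faster, but explicit
# dummies keep this sketch self-contained.
model = smf.ols(
    "female ~ after + positive + after:positive + treated:positive"
    " + after:log_contentious + after:positive:log_contentious"
    " + C(traveler_segment) + np.log(review_length)"
    " + C(hotel_id) + C(year_month)",
    data=reviews,
)

# Cluster-robust standard errors at the individual hotel level, as in Table 1.9.
result = model.fit(cov_type="cluster", cov_kwds={"groups": reviews["hotel_id"]})
print(result.summary())
```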
Table 1.9: The effect of contentious responses on the likelihood to review

                                                             (1)
After                                                      -0.014
                                                           (0.006)
After × Positive                                            0.011
                                                           (0.006)
After × log Contentious Responses                          -0.006
                                                           (0.003)
After × Positive × log Contentious Responses                0.006
                                                           (0.002)
Positive                                                    0.024
                                                           (0.014)
Treated × Positive                                          0.025
                                                           (0.015)
Controls:
  Traveler segment                                           Yes
  log Review length                                          Yes
Observations                                              673,604
Adjusted R2                                                 0.043

Note: The dependent variable is whether the user gender of review i of hotel j at time t is female. All specifications include hotel and year-month fixed effects, and a treatment-specific time trend. Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
Which types of hotels write contentious responses? Finally, we investigate which types of hotels are more likely to write contentious responses. To explore this question, we classify hotels based on their operation (chain vs. independent) and price category (budget to luxury); we thank STR for providing us with this information. Intuitively, chain and high-end hotels are more likely to have dedicated resources to train their employees on how to address consumers properly.

To test this hypothesis we compute, for each hotel, the percentage of the responses to negative reviews that are classified as contentious and regress this variable against the hotel type. We report these results in Table 1.10. In column 1, we show the results by hotel operation and in column 2 by price category. As predicted, we observe that chain and pricier hotels are less likely to write contentious responses.
Table 1.10: Contentious responses by hotel type

                        (1)              (2)
                     Operation      Price Category
Is Chain              -0.227
                      (0.009)
Economy                                 -0.007
                                        (0.011)
Midprice                                -0.035
                                        (0.010)
Upscale                                 -0.050
                                        (0.011)
Luxury                                  -0.059
                                        (0.011)
Intercept              0.304            0.121
                      (0.007)          (0.007)
Observations           3,788            2,568
Adjusted R2            0.157            0.015

Note: The dependent variable is the percentage of responses to negative reviews from hotel i that are classified as contentious. In column 1, the reference level is independent hotels; in column 2, the reference level is budget hotels.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
1.6 Discussion and Conclusions
The results presented in this paper demonstrate that management responses can have a
significant impact on who decides to write a review on an online rating platform.
The finding that management responses can be misused to discriminate against different
groups of reviewers is particularly troubling for platforms given their efforts to curate the
content posted on their websites. Our results should be considered a red flag that plat-
form curation algorithms need to be improved and that stronger emphasis should be put
on promoting a fair communication process for anyone using their service. In addition to
any ethical concerns, these considerations also have managerial relevance. First, a review
site allowing hotels to discriminate against a specific group can be responsible for a seg-
ment of users choosing to migrate to competing platforms, ultimately resulting in market
share losses. These losses can be dynamically exacerbated over time, since principles of
homophily (McPherson et al. 2001) suggest that users will likely favor using platforms in
which they have better representation. Second, both managers and reviewers likely look at
the behavior of others to gauge the acceptable norms on a given platform (Kashima et al.
2013). To the extent that discrimination is perceived to be allowed, this could have negative
downstream consequences on how social interactions are conducted on the platform.
From the hotels’ point of view, there is an interesting trade-off. Responding to reviews
and being aggressive towards self-identified female reviewers is associated with a reduction
in negative reviews written by these users, which, in turn, may lead to higher ratings and, therefore, demand. However, this comes at the cost of discouraging reviews from a specific group of users, which may have negative consequences such as bad word of mouth, which may affect demand and revenue, among many others. Studying these consequences can be an interesting avenue for future research.
Our results should be taken seriously by both platforms and hotels. While travelers should feel free and safe when writing their opinion, we found evidence of discrimination and that this discrimination can, along with gender idiosyncratic differences, affect reviewing behavior.
Whether it affects only one or many travelers, discrimination should not be tolerated and
both platforms and hotels should strive to completely eliminate this kind of behavior.
Our work adds another piece of evidence to the growing literature studying discrimination
in online settings. While being aware of discrimination is not enough to solve the problem,
providing evidence of the problem and its consequences could help nudge platforms to better
design their websites and improve their curation algorithms, and could help firms to improve
how they handle their online customers and complaints.
Chapter 2
Look the Part? The Role of Profile Pictures in Online Labor Marketplaces
(Joint work with Lan Luo)
2.1 Introduction
Freelancing websites have gained tremendous popularity in recent years, connecting millions of employers and independent freelancers worldwide. In 2019, the freelance workforce constituted 35% of the U.S. workforce (source: https://www.upwork.com/i/freelancing-in-america/2019/). Aiming to establish trust among participants, these platforms often actively encourage or even require freelancers to include a personal picture in their profiles. For example, Freelancer.com informs freelancers that "using your real face lets employers know that they are dealing with a real person and not a stranger" in their promotional materials (source: https://www.freelancer.com/community/articles/profile-picture-tips-and-tricks). Similarly, Upwork.com tells freelancers that "clients want to feel they can trust a freelancer before they engage them for a project and your profile photo is an important part of the equation" (source: https://support.upwork.com/hc/en-us/articles/211063208-Sample-Profiles-and-Best-Practices). Although well-intentioned, including profile pictures as part of freelancers' digital resumes may lead to unintended negative consequences, such as
discrimination or hiring biases (Luca 2017). Furthermore, it is not immediately evident how
profile pictures contribute to the goal of establishing trust relative to other design choices,
such as standardized reputation systems (Filippas et al. 2019).
Within this context, we investigate whether profile pictures may facilitate hiring biases
based on appearance-based perceptions. Specifically, we empirically explore the extent to
which employers’ perceptions of the freelancer-job fit (whether a freelancer is perceived to
have the skills and abilities required for the job) as inferred from the freelancer’s profile
picture may influence hiring outcomes. More importantly, we extend prior research by
examining the extent to which such a bias goes above and beyond known prejudice variables
such as demographics and attractiveness. For instance, when two applicants from the same
gender and race seem to be equally qualified for the job, will an employer who is recruiting
a programmer be more likely to hire a freelancer who looks more like a programmer?
Our research is inspired by anecdotal evidence suggesting that people often rely on their appearance-based perceptions to judge whether an individual appears suitable for a particular kind of job. One such example is the widespread social media backlash in response to a recruiting campaign implemented by OneLogin, placed in the BART public transit stations of San Francisco in the summer of 2015 (source: https://medium.com/the-coffeelicious/you-may-have-seen-my-face-on-bart-8b9561003e0f). According to some social media users, one of the ads portraying a female platform engineer failed to represent what a female engineer should look like, partly because she was too attractive to be a real engineer (source: https://www.colorado.edu/engineering/ilooklikeanengineer-campaign). Many other stories on the web illustrate that the stereotypical image of a software engineer is "normally male White or Asian who are average height and build, not athletic and prefer little to no human contact" (see, for example, https://www.linkedin.com/pulse/armando-you-dont-look-like-software-engineer-armando-pantoja/). Such beliefs are so strong that, to date, some organizations have launched initiatives to fight against them and their downstream consequences on career choices; see, for example, Girls who code (https://girlswhocode.com/) and I am a scientist (https://www.iamascientist.info/vision).
In the psychology literature, there is plenty of evidence that appearance-based impres-
sions can impact decisions (Olivola and Todorov 2010b). Evidence from the political domain,
for example, suggests that appearance-based inferences about candidates’ competence can
predict election outcomes better than chance (Olivola and Todorov 2010a). And merely
having a “conservative”-looking face seems to benefit candidates running in conservative ar-
eas (Olivola et al. 2012). However, to the best of our knowledge, no academic research has
investigated the potential downstream consequences of appearance-based inferences on hir-
ing decisions in online labor marketplaces. We note that Luo et al. (2008) has suggested that
consumers often use both objective and subjective criteria to evaluate a product. In a similar
vein, we conjecture that employers in online marketplaces may use both objective criteria
(such as online reputation variables) and subjective judgments (such as appearance-based
perceptions of job fit) when choosing a freelancer to hire.
Within this context, we explore the role that profile pictures serve in online labor mar-
ketplaces in the presence of an online reputation system. More specifically, we carry out
an observational study and two experimental studies to explore answers to the following
research questions:
1. Can perceptions of job fit inferred from profile pictures help explain hiring outcomes?
2. What is the interplay between profile pictures and the reputation system? In particular, do perceptions of job fit complement or substitute online reputation?
3. Can freelancers exploit their profile pictures to improve their hiring outcomes?
It is worth noting that, by focusing on perceived job fit as our primary construct of
interest, we diverge from prior research that has mostly studied profile pictures as signals of
demographics and attractiveness (Pope and Sydnor 2011, Doleac and Stein 2013, Ruffle and
Shtudiner 2015, Hannák et al. 2017). While recognizing that perceived job fit may indeed be
partially explained by demographics and attractiveness, this study departs from an emphasis
on these variables in two important ways. First, perceptions of job fit may be formed
holistically, based on multiple visual cues from a profile picture, many of which extend beyond
demographics and attractiveness. For example, visual cues such as wearing glasses can
influence perceptions of intelligence (Wei and Stillwell 2017), and consequently, perceptions
of fit for jobs associated with such a trait (e.g., STEM jobs). Second, perceptions of job
fit are job-category specific, i.e., the same profile picture may lead to different evaluations
depending on the job under consideration. For example, while a “nerdy-looking” profile
picture may be perceived as a high fit for a programming task, it may be perceived as a low
fit for creative marketing tasks.
Our observational study is based on Freelancer.com, one of the largest freelancing plat-
forms worldwide. We collected data for all jobs posted between January and June 2018 that
ended in a contract, resulting in 79,038 projects in four different categories (website and soft-
ware, graphic design, writing, and sales and marketing), with 2,462,043 applications from
220,385 freelancers. We leverage modern computer vision techniques (e.g. Zhang et al. 2017,
Zhang and Luo 2019, Hartmann et al. 2019) to generate labels indicating whether profile
pictures are perceived as a high fit for the job. We find that freelancers with pictures that
are perceived as a high fit, or those who “look the part", are more likely to be hired. Such
a result holds even after we control for known prejudice variables that can be inferred from
the profile pictures such as demographics and attractiveness. Interestingly, we also discover
that a profile picture that “looks the part” seems to complement, rather than substitute, in-
formation from the platform’s standard reputation systems. Namely, the effect of perceived
job fit strengthens as the freelancer's online reputation gets stronger. Overall, these findings
are consistent with our conjecture that profile pictures may serve as facilitators of hiring
biases based on appearance-based perceptions of the candidates' fit for the job. Such
a "look the part" bias remains after we control for the focal candidate's gender, race, age, or
attractiveness.
Some may wonder if the possibility of freelancers using fake profile pictures may invalidate
our findings. However, if such is the case, employers should respond little to this type of
“cheap talk” (Pope and Sydnor 2011) and we should find no effect of profile pictures. In other
words, if fake profile pictures induce a possible confound in our estimates, such a confound
should go against what we aim to uncover and our observational study’s findings will provide
a lower bound of the effect of interest. Additionally, platforms’ efforts to verify freelancers’
identity and impose penalties on users who use fake profile pictures may further relieve these concerns (source: https://www.freelancer.com/support/Profile/profile-picture-guidelines).
Nevertheless, given that freelancers may only apply for jobs for which they believe that
they will be perceived as a high fit, findings from our observational study may be confounded
by a self-selection bias. As such, results from our observational study can only be interpreted
as correlational rather than causal. To address these concerns, we further carry out two
choice experiments in which we randomize freelancers’ characteristics in each choice task
and leverage several manipulations to further explore causal effects of profile pictures that
we cannot readily accomplish using secondary data.
In our first experimental study, we implement a between-participants choice experiment
similar to commonly used choice-based conjoint studies (e.g. Luo et al. 2008, Aribarg et al.
2017). By adopting a factorial design where we manipulate perceived job fit to be orthogonal
to race and gender, we can readily separate out the effect of “look the part” from such prej-
udice variables. Furthermore, aiming to better understand why profile pictures complement
rather than substitute standardized reputation systems, we manipulate the diagnosticity of
the reputation systems and the accessibility of profile pictures. Our results suggest that
when the reputation systems are diagnostic (e.g., they provide enough variation to rank
candidates based on reputation alone), whether a profile picture is perceived as a high fit
does not significantly impact hiring choices. However, as the diagnosticity of the reputa-
tion system decreases and gets closer to the reputation levels we observe in the secondary
data, we observe that: 1) profile pictures become more important for hiring choices; 2) a
profile picture that is perceived as a high job fit leads to better hiring outcomes; and 3) par-
ticipants use profile pictures as tiebreakers, i.e., to choose among top candidates who have
similar strong online reputation. These findings reinforce and provide a causal interpretation
for the role of profile pictures we observe in our observational study, i.e., profile pictures as
facilitators of hiring biases based on perceptions of job fit.
Even when profile pictures might facilitate hiring bias, platforms may not have strong
incentives to remove them, especially considering that their use has become the norm in many
online marketplaces (Luca 2017). As such, we employ a second within-participant choice
experiment to explore whether freelancers, especially those who are in the disadvantaged
group (i.e., perceived as a low fit), can exploit their profile pictures to mitigate such a
bias. With this goal in mind, we manipulate the accessories and the background of the
profile picture for a set of focal candidates such that each candidate has two versions of
profile pictures (one that “looks the part” and one that “does not look the part”). Our
results suggest that, for a given freelancer, a home-office background that includes a visible
computer, and/or wearing glasses can increase his/her perceived job fit as a programmer,
which in turn, translates into a higher probability of obtaining the job. Such an effect is
more pronounced for freelancers who do not naturally look like a stereotypical programmer.
Our research makes the following contributions to the literature. First, we contribute to
the literature on profile pictures as facilitators of discrimination (e.g. Pope and Sydnor 2011,
Doleac and Stein 2013, Edelman and Luca 2014, Ruffle and Shtudiner 2015, Ert et al. 2016,
Hannák et al. 2017, Baert 2018), by showing that profile pictures can facilitate hiring biases
based on appearance-based perceptions of the focal candidate’s fit for the job (whether the
candidate “looks the part”). Indeed, we show that this bias goes above and beyond well-
studied discrimination variables such as demographics or attractiveness. To the best of
our knowledge, our research is the first empirical study evincing such an appearance-based
bias in online labor marketplaces. Moreover, departing from prior discrimination studies
that rely on either purely observational (e.g. Chan and Wang 2018, Malik et al. 2019) or
experimental data (e.g. Rosenblat 2008, Ukanwa and Rust 2020), we complement a large-
scale observational study with two lab experiments to exploit the external validity of the
former and the clean causal identification of the latter.
Second, we contribute to the literature on online marketplaces and the influence of online
reputation by examining its interplay with profile pictures. Prior research has shown that
online reputation variables are highly valuable in freelancer platforms (e.g. Yoganarasimhan
2013, Benson et al. 2020), and that the influence of different reputation variables hinges on
their diagnosticity (Watson et al. 2018) or their variance (Sun 2012). We add to this liter-
ature by showing that as reputation variables become less diagnostic and their importance
decreases, profile pictures start to play a more important role in the final hiring decision.
We believe that this research is among the first to examine and uncover such an intertwined
effect between online reputation and profile pictures.
Last but not least, we add to the marketing and economics literature on discrimination,
by proposing an alternative mechanism behind hiring biases. Prior research has suggested
mechanisms in the vein of: 1) taste-based discrimination models (Gary 1957), where decision-
makers have a systematic preference for individuals from a specific group (see, for example,
Malik et al. 2019); or 2) statistical discrimination models (Phelps 1972, Arrow et al. 1973),
where decision-makers use their beliefs about groups as proxies for unobservables (see, for
example, Cui et al. 2020, Ukanwa and Rust 2020). We suggest a novel mechanism in the vein
of statistical discrimination, akin to lexicographic heuristics. In particular, our research sug-
gests that employers use profile pictures as tiebreakers to arrive at their final hiring decisions
when multiple freelancers are sufficiently qualified for the job. Such a mechanism differs from
other heuristics proposed in the literature, in which employers first use their beliefs about
racial groups to screen out candidates before examining these applicants’ credible credentials
(Bertrand and Mullainathan 2004).
Our findings offer useful implications for freelancing platforms and freelancers. In today’s
online labor marketplace, freelancing platforms are usually characterized by a large number
of applications per job and by an extremely inflated reputation system. As such, we believe
that neither standardized reputation systems (Cui et al. 2020) nor changes in the accessibility
of profile pictures (Luca 2017) are sufficient to reduce hiring biases on these platforms. We
suggest instead that platforms need to explore methods to make their reputation systems
more diagnostic, i.e., to provide enough variation for employers to easily rank freelancers.
For freelancers, our findings provide guidelines on how to address the hiring biases that may
place them at a disadvantage. We show that, with simple changes in the background and
accessories in their profile pictures, freelancers can signal to employers that they are a good
fit for the job and improve their hiring outcomes. Finally, because the freelance workforce
has become a significant part of the U.S. workforce (35% of the U.S. workforce in 2019; source: https://www.upwork.com/i/freelancing-in-america/2019/), our results may also have important implications for policymakers. For example, policymakers could evaluate whether certain prohibited employment practices in physical workplaces, such as "employers should not ask for a photograph of an applicant" (U.S. Equal Employment Opportunity Commission; source: https://www.eeoc.gov/prohibited-employment-policiespractices#pre-employment_inquiries), should be extended to online labor marketplaces.
The remainder of the paper is organized as follows. First, we describe our observational
study in Section 2.2. Next, we present our first and second experimental studies in Section 2.3
and Section 2.4, respectively. Finally, we conclude the paper in Section 2.5.
2.2 Observational Study
In this section, we present a large-scale observational study in which we explore whether
and how profile pictures are associated with hiring outcomes in an online freelancing platform.
In Section 2.2.1, we begin with a description of the empirical setting and the data collection
process. Then, in Section 2.2.2, we explain the computer vision techniques we implement
to label the profile pictures in our sample, i.e., to create measures that summarize the
information these profile pictures convey. Next, in Section 2.2.3, we estimate employers’
hiring preferences as a function of freelancers’ reputation and profile picture variables, among
other control variables. Finally, we summarize and discuss our main findings in Section 2.2.4.
2.2.1 Data
Our observational study is based on data we collect from Freelancer.com, the world’s
largest freelancing crowd-sourcing marketplace, as measured by the number of users and
jobs. To participate in this platform, freelancers must first register and create their user
profiles, including a description of their skills, a summary of their expertise, and a profile
picture. Employers also need to register in the platform and verify their payment method,
after which they can list a job and wait for freelancers to apply for the position. In Figure 2.1,
we illustrate a job listing with 3 of the more than 70 applications it received within one
day. Note that a job listing posted by the employer (upper part of Figure 2.1) consists
of a description of the task, the budget, and desired freelancer skills. The applications
submitted by the freelancers (bottom part of Figure 2.1) include the requested price, a
summary of the freelancer’s reputation (number of reviews and average rating), and an
application description (text), among other details. Once the job is completed, both the
employer and the freelancer have the option to review each other.
Figure 2.1: Example of a job listing and three applications on Freelancer.com.
Using the website API (see https://developers.freelancer.com/), we collect data for all job listings posted between January and June 2018 that ended in a contract (i.e., one freelancer was hired). We focus on the four largest job categories on the platform, which together account for nearly 85% of all job
listings: Website, IT and Software (42%); Design, Media and Architecture (24%); Writing
and Content (13%); and Sales and Marketing (6%). Our sample consists of 79,038 jobs with
2,462,043 applications from 220,385 freelancers.
In what follows, we provide descriptive statistics of our two main variables of interest:
freelancers’ profile pictures and reputation. We also provide descriptive statistics on the
number of applications each job receives and its hiring outcomes.
Profile Pictures Approximately 92% of the freelancers in the sample have a profile picture. We use the Face++ Cloud Vision API to identify whether a profile picture has a human face on it (instead of logos or avatars), and if so, the apparent gender and race. To validate the labels provided by the API, we recruit human raters to manually code a subsample of images; as we show in Web Appendix B.1, there is a high level of agreement between the labels provided by the Face++ Cloud API and the human raters. We note that 67% of the profile pictures have a human face on it (33% are logos or avatars). Within this group, most profile pictures are labeled as male freelancers (73% male, 27% female). In terms of freelancers' race, the majority of the profile pictures are labeled as Indian (45%) and White (30%), followed by Black (15%) and Far East Asian (10%). The race labels provided by the Face++ API do not include a Hispanic label; indeed, prior research has recognized the difficulty of distinguishing between White and Hispanic race when inferring race from profile photos (Davis et al. 2019). To address this concern, we also use alternative race labels provided by the Clarifai API, which include American Indian, Asian, Black, Hawaiian or Pacific Island, Hispanic, Middle Eastern, and White; all the results presented in this section are consistent with this alternative race variable. Because of the large number of freelancers from Asia, we opted for the race variable based on the Face++ API, which distinguishes Indians from Far Eastern Asians.
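For reference, a call to a face-analysis API of this kind typically looks like the sketch below. The endpoint and attribute names follow our recollection of the public Face++ Detect documentation and should be verified against the current API reference; the credentials and image URL are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"        # placeholder credentials
API_SECRET = "YOUR_API_SECRET"

def detect_face(image_url: str) -> dict:
    """Query a face-detection API for attributes of a profile picture."""
    resp = requests.post(
        "https://api-us.faceplusplus.com/facepp/v3/detect",  # assumed Face++ endpoint
        data={
            "api_key": API_KEY,
            "api_secret": API_SECRET,
            "image_url": image_url,
            # Attribute names are assumptions based on the Face++ documentation.
            "return_attributes": "gender,age,smiling,beauty,ethnicity",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

result = detect_face("https://example.com/profile.jpg")  # placeholder image URL
faces = result.get("faces", [])
has_human_face = len(faces) > 0  # no face detected usually means a logo or avatar
if has_human_face:
    attrs = faces[0]["attributes"]
    print(attrs["gender"]["value"], attrs["age"]["value"])
```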
Freelancers’ Reputation Among the freelancers who have some reputation (i.e., at least
one review), the average number of reviews is 31, with an average rating of 4.73 out of 5
stars. Moreover, the mode of the average ratings among freelancers is 5-stars, an extremely
positive distribution which is typical of these types of platforms (Filippas et al. 2019).
Applications and hiring outcomes In column 1 of Table 2.1, we show the average
number of applications each job in each category receives. In column 2 of the same table, we
show the average number of applications from freelancers with a reputation equal to or above
the average (i.e., at least 31 reviews and an average rating of 4.73 or higher). These numbers
suggest that employers not only receive a large number of applications (at least 26), but they
also receive a considerable number of applications from candidates with a reputation above
the average (at least 9).
We also inspect the characteristics of the winner, i.e., the freelancer who was hired by the
employer for each job. In Table 2.2, we show the percentage of times that the winner was the
freelancer who offered the lowest price, had the highest number of reviews, or had the highest
average rating. We observe that simple decision rules, such as selecting the candidate that
offers the lowest price or the candidate with the highest reputation, explain no more than
25% of the employers’ choices. This implies that the employers’ decision process may be
more complex than these simple decision rules. We formally explore the hiring preferences
of employers on this platform in Section 2.2.3.
Table 2.1: Applications descriptives

                                        Avg. number of applications per job
Job Category                            Total     Reputation Above Average
Websites, IT and Software                 27                11
Design, Media and Architecture            39                20
Writing and Content                       26                 9
Sales and Marketing                       29                13
Note: Average reputation is defined as 31 reviews and an average rating of 4.73 as inferred from observational data.

Table 2.2: Hiring outcomes descriptives

                                        Percentage of times the winner had...
Job Category                            Lowest price   Highest N. Reviews   Highest Rating
Websites, IT and Software                   20%               11%                12%
Design, Media and Architecture              23%               18%                24%
Writing and Content                         20%               13%                22%
Sales and Marketing                         23%               22%                19%
2.2.2 Labeling Profile Pictures Based on Perceived Job Fit
A critical step in preparing the data for estimating hiring preferences is to label profile pictures based on perceived job fit, i.e., a label to approximate whether, based on a freelancer's profile picture, the employer perceives that he/she has the skills to meet the demands of a specific job. Given that we need to create such labels for 220,385 freelancers over 4 job categories (which amounts to a total of 881,540 labels), it is manually prohibitive to label all
profile pictures in our observational data using human raters. Therefore, we leverage modern
computer vision techniques to generate such labels.
We conjecture that appearance-based perceptions of job fit result from a complex process weighting multiple visual cues, and hence are formed holistically. Therefore, our prediction model for job-category-specific perceived fit is an image classifier based on deep convolutional neural networks, which are well known for their ability to detect and capture nonlinear relationships between visual patterns. We employ several different Convolutional Neural Network (CNN) architectures for this classification task, including VGG16, Inception, and ResNet. In our setting, the architecture that performs best in terms of out-of-sample accuracy is VGG16. In what follows, we briefly explain the three key steps that we follow to complete this image classification task. The reader can refer to Web Appendix B.2 for more details, such as how we modify the VGG16 architecture and fine-tune the hyperparameters for our specific task.
The first step is to create a training set of profile pictures labeled as "low job fit" or "high job fit" for each job category in our data set. To this end, we select a random subsample of 1,000 profile pictures and ask 5-10 human raters to score each image based on their perceived freelancer-job fit for each of the four job categories in our sample using a 5-point Likert scale. We recruit raters who reside in the U.S. through Amazon Mechanical Turk to ensure quality control of human coders. Although people often hold more or less universal beliefs about the stereotypical look for each job category, we acknowledge that employers from different cultures might judge "perceived job fit" differently; however, this is a reasonable approximation for our observational data, where a considerable share of employers are from North America. We use the average rating across raters to denote perceived job fit for each profile picture in each job category (the inter-rater agreement as measured by Cronbach's alpha is 0.76, 0.64, 0.66, and 0.72 for the perceived job fit as a programmer, a graphic designer, a writer, and a sales and marketing person, respectively). That is, each profile picture receives four scores, one per job category, consistent with our claim that perceived job fit is job-category specific. Following standard practices in the literature (Zhang et al. 2015, 2017, Zhang and Luo 2019, Liu et al. 2019), we convert the 5-point Likert scale to binary levels (low and high) to mitigate potential noise in the training data.
The second step is to train the classifier. To do so, we split the 1,000 labeled profile pic-
tures resulting from the first step into an 80% training and 20% validation sets. Additionally,
we leverage traditional data-augmentation techniques such as horizontal flip, rotations, and
shifts (Krizhevsky et al. 2012) to increase sample size during training. The VGG16 architec-
ture we adopt achieves out-of-sample accuracy as high as 81%, 72%, 79%, and 81% for the
perceived job fit as a programmer, a graphic designer, a writer, and a sales and marketing
person, respectively.
The third and final step is to extrapolate the learned parameters from step two above to
generate a job-category specific perceived job fit label for all profile pictures in our observa-
tional data. More specifically, we define each label as the predicted probability that a given
image is perceived as a high fit for a particular job category. We illustrate some examples of
the job fit as programmer label as predicted by this classifier in Figure B.2 of Appendix B.2.
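A minimal version of this transfer-learning setup in Keras is sketched below; the directory layout, image size, and training hyperparameters are illustrative assumptions rather than the exact configuration we use (which is documented in Web Appendix B.2).

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

IMG_SIZE = (224, 224)  # assumed input resolution

# Augmented training images ("low_fit" / "high_fit" subfolders), one directory per job category.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    validation_split=0.2,
)
train = datagen.flow_from_directory("labels/programmer", target_size=IMG_SIZE,
                                    class_mode="binary", subset="training")
val = datagen.flow_from_directory("labels/programmer", target_size=IMG_SIZE,
                                  class_mode="binary", subset="validation")

# VGG16 convolutional base pre-trained on ImageNet, with a small binary head on top.
base = VGG16(weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,))
base.trainable = False  # fine-tuning of deeper blocks can be enabled later

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # probability of "high job fit"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train, validation_data=val, epochs=10)

# Step three: the predicted probability serves as the perceived-job-fit label
# for every profile picture in the observational data.
# fit_score = model.predict(profile_picture_batch)
```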
2.2.3 Estimating Hiring Preferences
Upon obtaining a job-category specific label for each profile picture in our sample, we
proceed to estimate hiring preferences to explore answers to our first two research questions:
(i) can perceived job fit, as inferred from freelancers’ profile pictures, help explain hiring
outcomes? and (ii) what is the interplay between profile pictures and the reputation system?
Can perceived job fit help explain hiring outcomes?
To empirically investigate this question, we estimate hiring preferences using a conditional
logit model with the following specification:
\[
\text{Hired}_{ijt} = \alpha_t + \beta_1 \text{ProfilePicture}_{j} + \beta_2 \text{Reputation}_{j} + \beta_3 \text{Performance}_{j} + \beta_4 \text{Application}_{jt} + \beta_5 \text{Controls}_{ij} + \varepsilon_{ijt} \qquad (2.1)
\]
where the dependent variable Hired_ijt indicates whether employer i hired freelancer j to complete job t; a minimal estimation sketch follows the variable list below. The independent variables include (a more detailed explanation and descriptive statistics of these variables are presented in Web Appendix B.3):
Profile Picture: Set of eight variables extracted from freelancers' profile pictures, including
(i) perceived job fit, or whether the freelancer “looks the part” for the focal job, (ii) gender,
(iii) race, (iv) age, (v) beauty, (vi) smile, (vii) whether the freelancer has a profile picture,
and (viii) whether there is a human in the picture. Note that items (i)-(vi) are conditional
on (vii)-(viii); that is, we can only obtain these variables for freelancers with a profile
picture with a human on it.
Reputation: Set of two reputation variables, including (i) number of reviews, and (ii)
average rating. These are the most widely used variables to account for the information
from standardized reputation systems that employers have available to make their choices
(e.g., Yoganarasimhan 2013, Chan and Wang 2018).
Performance: Set of three variables that describe freelancers’ performances on previous
jobs, including (i) total earnings made on similar jobs, (ii) percentage of jobs completed
on time, and (iii) percentage of jobs completed on budget. These variables allow us to
account for additional information about freelancers’ experience in the platform.
Application: Set of six variables that describe the application submitted by a freelancer (see bottom part of Figure 2.1), including: (i) price (normalized within job because the magnitude of the price can vary significantly across jobs), (ii) number of days to complete the job (normalized within job for the same reason as stated above), (iii) log application word count, (iv) application similarity (which we explain below), (v) whether the application is recommended by the platform and highlighted in the first position (only one such freelancer per job), and (vi) the relative position in the list of applications as displayed to the employer. Relative positions are determined by the platform's application ranking algorithm (details are unknown to the authors). Although such positions can change in real time as new applications arrive, the website API only gives us information for the final position; we acknowledge that this is a limitation in using this variable to control for order effects. Nevertheless, in the observational data we observe that applications arrive within a very short time window (on average, 90% of the total applications arrive within 90 minutes), suggesting that application positions become relatively stable shortly after the job is posted. The third variable, application word count, which we create from the text description submitted by freelancers, can serve as a proxy for the amount of information provided by the freelancer and for the motivation the freelancer demonstrates for the project. We also create the fourth variable, application similarity, as the cosine similarity between the text application the freelancer submitted and the job description the employer posted. This variable captures whether the freelancer mentioned certain keywords, such as specific requirements or skills needed for the job, which can serve as a proxy for matched quality.
Additional Controls: Set of eight variables that serve as controls for additional information
about freelancers available in their profiles, including: (i) whether the freelancer is a pre-
ferred freelancer (a status granted by the platform to distinguish experienced freelancers),
(ii) whether the freelancer has passed exams on relevant skills required by the employer,
(iii) whether the freelancer is from a developed country, (iv) whether the freelancer is
from the same country as the employer, (v) whether the freelancer has a previous review
from the same employer, serving as a proxy for whether they have worked together, (vi)
freelancer’s region of residence, (vii) freelancer’s membership category (e.g., Intro, Basic,
Plus), which serves as a proxy of freelancer’s involvement with the platform, and (viii)
whether the freelancer has provided verification for his/her profile.
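As referenced above, the sketch below illustrates one way to estimate a conditional logit of this form: the log-likelihood sums, over jobs, the log-probability that the hired freelancer is chosen from that job's set of applicants, and is maximized numerically. The data layout and the reduced feature list are simplifying assumptions; the full specification includes all variable groups described in this section.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# One row per (job, applicant); `hired` is 1 for the chosen freelancer in each job.
apps = pd.read_csv("applications.csv")  # assumed columns: job_id, hired, plus features
feature_cols = ["perceived_job_fit", "has_picture", "human_face",
                "n_reviews", "avg_rating", "price_norm", "log_word_count", "cosine_sim"]
X = apps[feature_cols].to_numpy(dtype=float)
y = apps["hired"].to_numpy()
job_ids = apps["job_id"].to_numpy()

def neg_loglik(beta):
    """Conditional (within-job) logit: job-specific constants cancel out of the choice probabilities."""
    util = X @ beta
    ll = 0.0
    for job in np.unique(job_ids):
        mask = job_ids == job
        u = util[mask]
        u = u - u.max()                      # numerical stability
        log_probs = u - np.log(np.exp(u).sum())
        ll += log_probs[y[mask] == 1].sum()  # log-probability of the hired applicant
    return -ll

result = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print(dict(zip(feature_cols, result.x)))
```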
Results We report our estimates in Table 2.3, where each column corresponds to a different model specification of Equation 2.1. In column 1 of Table 2.3, we include a baseline model that includes all variables described above, with the exception of all profile-picture-related variables. The estimates for the reputation and application variables are consistent with prior findings in the literature (Yoganarasimhan 2013, Chan and Wang 2018). For example, as the number of reviews and average rating increase, a freelancer's probability of being hired also increases. On the contrary, as price increases, a freelancer's probability of being hired decreases. We also observe that the longer the description (higher word count) and the higher the similarity between the employer's description and the freelancer's description (higher cosine similarity), the higher the probability of being hired. In what follows, we consider this specification our baseline model, since it includes all the information the employers can obtain without the profile picture of the freelancers.
Table 2.3: Estimating hiring preferences in observational data

                                      (1)          (2)          (3)
Profile Pictures Variables:
  Perceived Job Fit                               0.166        0.158
  Has Picture                                     0.175        0.171
  Human                                           0.013        0.121
Reputation Variables:
  Low N. Reviews                     1.090        1.105        1.109
  Mid N. Reviews                     1.192        1.205        1.212
  High N. Reviews                    1.768        1.780        1.788
  Avg. Rating                        0.186        0.183        0.181
Application Variables:
  Price                             -1.860       -1.860       -1.861
  Log Application Word Count         0.050        0.050        0.051
  Cosine Similarity                  1.386        1.381        1.369
Additional Variables:
  Performance Variables               Yes          Yes          Yes
  Other Application Variables         Yes          Yes          Yes
  Control Variables                   Yes          Yes          Yes
  Gender, Race, and Age               No           No           Yes
  Beauty and Smile                    No           No           Yes
N                                2,462,043    2,462,043    2,462,043
LL                                -199,464     -199,391     -199,344
AIC                                399,021      398,879      398,799
BIC                                399,606      399,502      399,511

Note: Conditional logit estimates using robust standard errors. The dependent variable is whether the employer i hired freelancer j for the job t. In column 1, we estimate the model controlling for everything with the exception of profile-picture-related variables. The baseline number of reviews is zero. In column 2, we add the variable of interest, the perceived job fit score. In column 3, we incorporate additional picture-related control variables including demographics, beauty, and smile.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
In column 2 of Table 2.3, we incorporate our primary variable of interest: perceived
job fit. Under this specification, the main coefficient of interest is positive and significant
(β = 0.166, p < 0.01), suggesting that freelancers who are perceived as a higher fit for the
job from their profile pictures are more likely to be hired. The model fit (AIC and BIC)
favors this model specification relative to the baseline model (column 1), i.e., perceived
freelancer-job fit from profile pictures can help explain hiring outcomes in our data.
Finally, in column 3 of Table 2.3, we estimate the full model including all the variables
described in Table B.3 and Table B.4. This comprises all variables used in column 2 plus
additional profile picture variables such as gender, race, age, beauty, and smile as additional
controls. Interestingly, we observe that even with the inclusion of these additional controls,
the sign and significance for the perceived job fit parameter remain consistent (β = 0.158, p < 0.01). In other words, we find that perceptions of job fit have a significant positive correlation
with hiring outcomes above and beyond such known discriminatory variables.
Overall, our results suggest that in the presence of online reputation systems, incorpo-
rating variables from freelancers’ profile pictures can help explain hiring outcomes in online
labor markets. More importantly, above and beyond other discriminatory variables such as
demographics and attractiveness, we find that freelancers who 1) have a profile picture; 2)
have a profile picture in which a human face is visible (rather than logos or avatars); and 3)
are perceived to be a high fit for the focal job based on their profile pictures are more likely
to be hired. (Arguably, employers may form two additional relevant perceptions from freelancer profile pictures: perceived competence and perceived professionalism. As a robustness check, we label images based on these two variables following the same procedure as in Section 2.2.2 and include them in the choice models. We find that the inclusion of these two additional controls does not change the main results presented in this section; due to the page limit, we omit these results from the paper, and they are available upon request.)
What is the interplay between perceived job fit and reputation system?
Thus far, we have shown that appearance-based perceptions of freelancer-job fit pos-
itively correlate with hiring outcomes. In this section, we further explore the interplay
between perceived job fit and reputation system. For instance, does perception of job fit as
inferred from profile pictures substitute or complement information from the online reputa-
tion system? And what can such an interplay inform us regarding the possible mechanism
behind our findings? To empirically explore these questions, we incorporate an interaction
term between perceived job fit and online reputation into the employers’ utility specification:
\[
\begin{aligned}
\text{Hired}_{ijt} = {} & \alpha_t + \beta_1 \text{ProfilePicture}_{j} + \beta_2 \text{Reputation}_{j} + \beta_3 \text{PerceivedJobFit}_{j} \times \text{Reputation}_{j} \\
& + \beta_4 \text{Performance}_{j} + \beta_5 \text{Application}_{jt} + \beta_6 \text{Controls}_{ij} + \varepsilon_{ijt} \qquad (2.2)
\end{aligned}
\]
where the dependent and independent variables are as defined in Equation 2.1 (see Tables ??). The independent variable Reputation_j in the interaction term represents either the number of reviews or the average rating of freelancer j.
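Relative to the conditional-logit sketch for Equation 2.1, the interaction specification only requires constructing an additional regressor before re-estimating the same model; the snippet below shows that step under the same assumed column names.

```python
import pandas as pd

apps = pd.read_csv("applications.csv")  # same assumed layout as in the earlier sketch

# Strong-reputation indicator: number of reviews at or above the mode (59 in our data).
apps["reviews_above_mode"] = (apps["n_reviews"] >= 59).astype(float)

# Interaction regressor for Equation 2.2: perceived job fit x strong reputation.
apps["fit_x_reputation"] = apps["perceived_job_fit"] * apps["reviews_above_mode"]

# The conditional-logit estimation then proceeds exactly as before, with
# "reviews_above_mode" and "fit_x_reputation" appended to the feature list.
```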
Results We report our results in Table 2.4. In column 1, we include the estimates of the
full model without interaction terms as a reference (column 3 from Table 2.3). In column 2, we interact perceived job fit with an indicator of whether the freelancer has a number of reviews above the mode (59 reviews). In column 3, we interact it with an indicator of whether the freelancer has an average rating above the mode (4.89 stars). In column 4, we interact it with an indicator of whether the freelancer has both a number of reviews and an average rating above the mode.
Interestingly, our estimates suggest that the effect of perceived job fit strengthens as the
focal freelancer’s reputation gets stronger. More specifically, the interaction coefficients in
Table 2.4 suggest that perceived job fit has a positive and significant correlation with hiring
outcomes for freelancers with a number of reviews above the mode (β = 0.256, p < 0.01), with an average rating above the mode (β = 0.337, p < 0.01), and with a number of reviews and average rating above the mode (β = 0.412, p < 0.01). However, freelancers with
reputation below the mode do not benefit significantly from the perceptions of job fit their
Table 2.4: Interplay between perceived job fit and reputation in hiring outcomes

                                                   (1)        (2)        (3)        (4)
Profile Pictures Variables:
  Perceived Job Fit                               0.158      0.026     -0.043      0.043
  Has Picture                                     0.171      0.217      0.190      0.224
  Human                                           0.121      0.126      0.123      0.133
Interaction Effects:
  Perceived Job Fit × N. Reviews Above Mode                  0.256
  Perceived Job Fit × Rating Above Mode                                 0.337
  Perceived Job Fit × N. Reviews & Rating Above Mode                               0.412
Reputation Variables:
  Low N. Reviews                                  1.109      1.129      1.527      1.322
  Mid N. Reviews                                  1.212      1.178      1.646      1.377
  High N. Reviews                                 1.788      1.697      2.224      1.909
  Avg. Rating                                     0.181      0.182      0.068      0.141
Additional Variables:
  Performance Variables                            Yes        Yes        Yes        Yes
  Application Variables                            Yes        Yes        Yes        Yes
  Control Variables                                Yes        Yes        Yes        Yes
  Gender, Race, and Age                            Yes        Yes        Yes        Yes
  Beauty and Smile                                 Yes        Yes        Yes        Yes
N                                             2,462,043  2,462,043  2,462,043  2,462,043
LL                                             -199,344   -199,249   -199,079   -198,995
AIC                                             398,799    398,613    398,272    398,105
BIC                                             399,511    399,337    398,997    398,830

Note: Conditional logit estimates using robust standard errors. The dependent variable is whether the employer i hired freelancer j for the job t. In column 1, we present the model with full covariates from Table 2.3 as the baseline. In column 2, the overall effect of perceived freelancer-job fit for freelancers with a number of reviews above the mode (≥ 59) is 0.282 (p = 0.000). In column 3, the overall effect of perceived freelancer-job fit for freelancers with a rating above the mode (≥ 4.89) is 0.294 (p = 0.000). In column 4, the overall effect of perceived freelancer-job fit for freelancers with a number of reviews and rating above the mode is 0.455 (p = 0.000).
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
pictures elicit. Indeed, the coefficient for perceived job fit is either insignificant or of very
modest magnitude for these freelancers.

Table 2.4: Interplay between perceived job fit and reputation in hiring outcomes

                                                      (1)        (2)        (3)        (4)
Profile Pictures Variables:
  Perceived Job Fit                                 0.158      0.026      0.043      0.043
  Has Picture                                       0.171      0.217      0.190      0.224
  Human                                             0.121      0.126      0.123      0.133
Interaction Effects:
  Perceived Job Fit × N. Reviews Above Mode                    0.256
  Perceived Job Fit × Rating Above Mode                                   0.337
  Perceived Job Fit × N. Reviews & Rating Above Mode                                 0.412
Reputation Variables:
  Low N. Reviews                                    1.109      1.129      1.527      1.322
  Mid N. Reviews                                    1.212      1.178      1.646      1.377
  High N. Reviews                                   1.788      1.697      2.224      1.909
  Avg. Rating                                       0.181      0.182      0.068      0.141
Additional Variables:
  Performance Variables, Application Variables, Control Variables,
  Gender, Race, and Age, Beauty and Smile
N                                             2,462,043  2,462,043  2,462,043  2,462,043
LL                                             -199,344   -199,249   -199,079   -198,995
AIC                                             398,799    398,613    398,272    398,105
BIC                                             399,511    399,337    398,997    398,830
Note: Conditional logit estimates using robust standard errors. The dependent variable is whether the employer i
hired freelancer j for the job t. In column 1, we present the model with full covariates from Table 2.3 as the base-
line. In column 2, the overall effect of perceived freelancer-job fit on freelancers with a number of reviews above the
mode (above 59) is 0.282 (p = 0.000). In column 3, the overall effect of perceived freelancer-job fit on freelancers with
a rating above the mode (above 4.89) is 0.294 (p = 0.000). In column 4, the overall effect of perceived freelancer-job fit
on freelancers with a number of reviews and rating above the mode is 0.455 (p = 0.000).
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
In sum, our findings suggest that perceptions of job fit complement rather than substitute
reputation systems. Considering the high volume of applications that employers receive and
the fact that reputation systems are highly inflated on these platforms (Filippas et al. 2019),
we conjecture that one mechanism that can rationalize our findings is that employers use profile
pictures as tiebreakers to arrive at their final hiring decisions when multiple freelancers seem
to be sufficiently qualified for the job.
We further explore how our proposed mechanism relates to other known mechanisms
in the literature. Prior research has suggested two main theories of discrimination: taste-
based discrimination (Gary 1957) or statistical discrimination (Phelps 1972, Arrow et al.
1973). Mechanisms akin to taste-based discrimination suggest that decision-makers have a
systematic preference for workers from a specific group (see, for example, Malik et al. 2019).
In our setting, this would imply that the effect of the perceived job fit does not change with
freelancers’ reputation. As such, our results suggest that taste-based discrimination is not
at play in our context.
Mechanisms akin to statistical discrimination suggest that decision-makers discriminate
as a rational solution to an incomplete information problem. As such, our proposed mecha-
nism is consistent with this view, but operates in a different manner than other previously
proposed mechanisms in this vein. For example, Phelps (1972) and Arrow et al. (1973)
suggest that decision-makers use their beliefs about group averages as a proxy for unobserv-
ables. Under this mechanism, one should expect lower discrimination when individuals from
the disadvantaged group can provide signals of their quality or skills (e.g. Agrawal et al.
2016, Cui et al. 2020). In our setting, such a mechanism would imply that the effect of the
perceived job fit variable should diminish as freelancers gain a high reputation. In other
words, perceived job fit could serve as a substitute for reputation. Therefore, we believe that
this mechanism does not explain our finding.18
18 One might argue that employers believe that the quality signals from individuals from the discrimi-
nated group are less accurate (Altonji and Blank 1999). In our setting, employers may have such concerns
regarding the accuracy of the average ratings. For example, employers may interpret a "5-star rating" as
an accurate quality signal from freelancers who look the part, but as an inaccurate quality signal from free-
lancers who do not look the part (e.g., they believe the rating is inflated or reflects reciprocity rather than
true quality). However, it is less clear whether this mechanism can explain our findings based on the more
objective quality signal, the number of reviews (e.g., given that the number of reviews is an objective number
by itself, it is hard to imagine that employers would think that the number of reviews alone would be a more
accurate signal for freelancers who look the part than those who do not). Moreover, the precision of a quality
signal should increase with the amount of relevant information (which equates to the number of reviews in
our case) (Bohren et al. 2019b). Considering that the effect of perceived job fit gets stronger for freelancers
with a high reputation when the number of reviews is high (i.e., when the quality signal is more precise)
(Table 2.4, column 2), we believe this mechanism is not at play in our context.
An alternative mechanism in the vein of statistical discrimination that might rationalize
our finding is the heuristic proposed by Bertrand and Mullainathan (2004). The authors
suggest that employers may rely on their beliefs about group averages to screen out candi-
dates before carefully examining their credentials. Within our context, this means employers
use profile pictures to first screen out candidates who do not look the part, then focus on
reputation variables to make their choice among the remaining candidates. In that case,
the main effect of perceived job fit should not go away when we include an interaction term
in our model. Given that we find perceived job fit only matters for freelancers with a strong
reputation, we conjecture a tiebreaker lexicographic heuristic in which employers first rely on
reputation variables to obtain a shortlist of candidates, and then use the profile picture to
decide whom to offer the job to.
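To make the conjectured decision rule concrete, the stylized sketch below (with hypothetical field names and an arbitrary shortlist size) illustrates such a tiebreaker lexicographic heuristic: rank applicants on reputation first, then break the tie among the shortlisted candidates using perceived job fit. It is an illustration of the conjecture, not a model we estimate.

```python
from typing import Dict, List


def tiebreaker_choice(applicants: List[Dict], shortlist_size: int = 3) -> Dict:
    """Stylized two-stage heuristic.

    Each applicant dict is assumed to carry the hypothetical fields
    'avg_rating', 'n_reviews', and 'perceived_job_fit'.
    """
    # Stage 1: shortlist the applicants with the strongest reputation.
    shortlist = sorted(
        applicants,
        key=lambda a: (a["avg_rating"], a["n_reviews"]),
        reverse=True,
    )[:shortlist_size]

    # Stage 2: among these near-indistinguishable candidates, pick the one
    # whose profile picture elicits the highest perceived job fit.
    return max(shortlist, key=lambda a: a["perceived_job_fit"])
```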
2.2.4 Discussion
In summary, the results from our observational data suggest that profile pictures can help
explain hiring outcomes in online labor markets. More specifically, we show that freelancers
whose profile pictures elicit perceptions of high fit for the job or “look the part” are more
likely to be hired. Moreover, we show that “looking the part” benefits freelancers with a high
reputation, but not freelancers with a low reputation. This suggests that profile pictures
complement rather than substitute reputation systems. We conjecture that the underlying
mechanism behind such a finding is that employers use profile pictures as tiebreakers. We
think it is possible that, due to the extremely positive distribution of ratings on the platform,
reputation systems are not diagnostic enough to allow employers to rank candidates based
on reputation variables alone. We believe this is a novel mechanism not discussed in prior
research. We explore this conjecture further in our first choice experiment.
It is worth noting that our findings above are subject to some limitations common to
observational studies. One such limitation is related to endogeneity concerns with respect to
supply-side biases. On Freelancer.com, freelancers endogenously decide which job to apply
for and which price to request, two decisions that could be based on the expected hiring
outcome. For example, a freelancer who believes that she/he does not look the part as a
programmer might not apply for a programming job, or they might lower the bidding price
to get some advantage over the other candidates who do look the part. As such, findings
from our observational data can only be interpreted as correlational rather than causal. In
the experimental studies we present in the following sections, we address these concerns by
employing an orthogonal design to randomize freelancers' and applications' characteristics
within each choice task.
Another limitation we face is the lack of variability in the observational data, which com-
promises our ability to fully explore two of our research questions. First, because ratings
are extremely positive throughout the platform, we cannot explore the interplay between
the role of profile pictures and the diagnosticity of the reputation system. We address this
limitation by manipulating the diagnosticity of the reputation system in the first experimen-
tal study in Section 2.3. Second, because we only observe one profile picture per freelancer
in the observational data, we cannot determine whether freelancers can exploit their profile
pictures to improve their hiring outcomes. In the second experimental study, we address this
limitation by manipulating the profile pictures for a set of treated freelancers (i.e., changing
the background and accessories in the picture) in Section 2.4.
2.3 Experimental Study 1
In this section, we present a choice experiment in which we explore the interplay between
the role of profile pictures and the diagnosticity of the reputation system, as well as the
influence of profile picture accessibility on hiring outcomes. We describe the experimental
design in Section 2.3.1, and present our results in Section 2.3.2. Finally, we discuss and
summarize our main findings in Section 2.3.3.
2.3.1 Experimental Design
Our choice experiment is similar in spirit to a choice-based conjoint experiment. Partic-
ipants were asked to imagine that they would like to hire a programmer from a freelancing
platform to develop a personal website. We chose this scenario under the following two
considerations: 1) it falls under the largest job category in the platform we use for our ob-
servational study (Websites, IT and Software); and 2) we can easily familiarize participants
with the desired outcome by providing some examples of personal websites. After being
presented with the scenario, each participant was given sixteen choice tasks, with each task
comprising ten hypothetical applications for the focal job. The participant was instructed
to choose the freelancer that he/she would like to hire in each hiring scenario.
The experiment is a 3×2 between-participants design, in which we manipulate 1) the
diagnosticity of the reputation system and 2) the accessibility of profile pictures. Below, we
explain the details of our experimental design.
Choice profiles Although our observational study is highly information-rich with a large
number of controls, a typical conjoint experiment often does not comprise more than six
or eight attributes (Orme 2002, Luo et al. 2008, Aribarg et al. 2017). As such, we identify
the following six attributes (perceived job fit, gender, race, online reputation, price, and
certification) based on 1) the main goal of this research; and 2) their importance in hiring
outcomes as discovered from our observational study. Following the convention in conjoint
studies, we instructed the participants to assume that all attributes not presented were
constant across all job applications (Rao et al. 2014).
Unlike most other conjoint experiments in which text descriptions convey all attributes,
we use profile pictures to depict the first three attributes: perceived job fit, gender, and
race. To separate out the effects of race and gender from that of perceived job fit, we
incorporate sixteen different combinations of perceived job fit (low, high), race (Far East
Asian, Black, Indian, and White), and gender (female, male). Namely, we include both low
and high fit pictures for each race-gender combination such that perceived job fit is balanced
and orthogonal to race and gender variables in the conjoint design. Such a setup allows us
to address endogeneity concerns related to supply biases in our observational study (e.g.,
females may be reluctant to apply for programming jobs due to their belief that they are not
perceived as a high fit for this type of job), and also allows us to correct for other imbalances
in the observational data (e.g., disproportional participation of freelancers from a given race
or gender group). Furthermore, we include four profile pictures per possible combination of
perceived job fit, gender, and race, such that our findings are not subject to idiosyncratic
characteristics related to one particular picture. The levels of perceived job fit are defined
by human raters, i.e., as in the training data used in Section 2.2.2.
19
We also use the same
image size as the platform from our observational study (64×64 pixels) so that
our choice experiment can mimic the platform to the extent possible.
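For illustration, the sixteen balanced cells of this part of the design can be enumerated as below (a sketch only; in the experiment, each cell is populated with four different human-rated pictures, for 64 images in total):

```python
from itertools import product

fits = ["low", "high"]
races = ["Far East Asian", "Black", "Indian", "White"]
genders = ["female", "male"]

# 2 x 4 x 2 = 16 cells, so perceived job fit is balanced and orthogonal
# to race and gender; each cell is filled with 4 distinct profile pictures.
cells = list(product(fits, races, genders))
assert len(cells) == 16
```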
We define online reputation as the number of reviews and average review rating. We
incorporated different levels for this attribute to manipulate the diagnosticity of the repu-
tation system (Watson et al. 2018), which will be explained in more detail in the following
section. The fifth attribute is price. To represent the realistic price range for this type of
job on the platform, we listed a similar job on Freelancer.com and we scanned similar jobs
on our data, and defined price levels as $100, $150, and $200. The last attribute is cer-
tification, indicating whether the freelancer has successfully passed an exam on a relevant
19
We consider images with average rating score greater (lower) than three on a 5-point scale as a high
(low) perceived job fit.
57
skill for the job. We use this attribute to replace the application description used in the
platform (see Figure 2.1) because it is challenging to manipulate preset levels of application
descriptions with multiple versions of texts. In contrast, certification is a clean manipulation
that summarizes the key information the freelancer can include in the text, such as “I have
experience with WordPress.” Moreover, the results from the observational study indicate
that certifications correlate significantly with hiring outcomes.
Based on the six attributes described above (also summarized in Table 2.5), we use an
orthogonal factorial design to generate sixteen choice tasks with ten applications each, in-
cluding 2 warm-up and 2 hold-out tasks. Although most conjoint studies do not include as
many as ten alternatives per choice task, we chose ten applications to mimic the number of
candidates that employers can simultaneously compare on most freelancer platforms. For
example, Freelancer.com splits the total number of applications into multiple pages and dis-
plays 8 applications per page, while Upwork displays ten applications per page. It is also
worth noting that this is a number significantly smaller than the total number of applicants
employers receive on these platforms. Recall from Table 2.1 that in our observational data,
such jobs received on average at least 26 applications. Therefore, our choice task is a simpli-
fied version of the actual choice most employers have to go through in practice. We chose ten
applicants per task to strike a balance between avoiding overwhelming participants with too
many applications in each task and approximating a realistic scenario to the extent possible.

Table 2.5: Attributes and levels used in the first choice experiment

Perceived job fit: whether the candidate is perceived as a high fit for the job, as inferred from
  his/her profile picture. Levels: Low, High.
Gender: gender of the candidate, as inferred from the profile picture. Levels: Female, Male.
Race: race of the candidate, as inferred from the profile picture. Levels: Asian, Black, Indian, White.
Reputation: number of reviews and average rating, shown as (N. Reviews, Avg. Rating). Levels:
  High diagnosticity condition: (None), (10, 4.5), (100, 4.5), (10, 5.0), (100, 5.0);
  Mid diagnosticity condition: (30, 4.8), (300, 4.8), (30, 5.0), (300, 5.0);
  Low diagnosticity condition: (100, 4.9), (300, 4.9), (100, 5.0), (300, 5.0).
Price: price the candidate is charging to complete the job. Levels: $100, $150, $200.
Certification: whether the candidate has certified a skill that is relevant for the job. Levels: Yes, No.
Reputation diagnosticity manipulation Our rationale for manipulating the diagnos-
ticity of the online reputation variables is twofold. First, we would like to examine if the
importance of “looking the part” changes as online reputation becomes more or less diagnos-
tic. Second, such a setup helps us to dive further into our proposed mechanism regarding
the role of profile pictures as tiebreakers.
We manipulate the levels of the reputation attribute to create three diagnosticity con-
ditions: high, mid, and low. To define the reputation levels for each condition, we post a
job similar to the one used in our choice experiment on Freelancer.com (i.e., recruiting a
freelancer to create a personal website) and use the statistics of the freelancers that apply
for that job as reference values (see Table 2.6).
20
Table 2.6: Summary statistics of applications received for the job description used in our
conjoint study
Min 25th percentile 50th percentile 75th percentile Max
Number of Reviews 1 27 88 309 1992
Average Rating 4.3 4.8 4.9 5.0 5.0
Price $30 $120 $150 $220 $450
Note: We received a total of 75 applications for this job. The number of reviews and average
ratings were obtained for the subset of 86% candidates that have at least one review.
20
As a robustness check, we also searched our observational data for jobs similar to the one used in this
choice experiment, which we identified as jobs in the Websites, IT, and Software category containing the
words “personal website” or “personal blog” in their description. We found that these jobs have comparable
statistics to the ones presented in Table 2.6.
For the high diagnosticity condition, we use five reputation levels: {None}, and the
four possible combinations N. Reviews = {10, 100} × Avg. Rating = {4.5, 5.0}. Note
that our reputation attribute collapses two variables from reputation systems, N. Reviews
and Avg. Rating, which allows us to accommodate the {None} level without imposing
any restrictions on the orthogonal design. By using these levels, the high diagnosticity
condition represents a hypothetical scenario in which reputation systems are considerably
less positive, and thus more diagnostic, compared to the reference levels in Table 2.6, the values we
observe in the observational data, or online labor platforms in general (Filippas et al. 2019).
For the mid diagnosticity level, we use four reputation levels resulting from the combi-
nation N. Reviews = {30, 300} × Avg. Rating = {4.8, 5.0}, values that correspond to the
25th and 75th percentiles of the reference levels in Table 2.6. These levels can represent
a case in which the employer emphasizes their choices among the top 75% of applicants in
terms of reputation. By adopting these levels, the mid diagnosticity condition is also a good
approximation for all the jobs in our observational data, where the mode of the number of
reviews and average rating at the application level is 59 and 4.9, respectively.
For the low diagnosticity condition, we use four reputation levels resulting from the
combination N. Reviews = {100, 300} × Avg. Rating = {4.9, 5.0}, values that correspond to
the 50th and 75th percentiles of the reference levels in Table 2.6. These levels can represent a
case in which employers only consider the top 50% of candidates in terms of online reputation.
In the observational data, we discover that 90% of the time, the candidate who was
hired for the job ranked among the top 25% in terms of reputation relative to the other
candidates applying for the same job. As such, we believe that our mid and low diagnosticity
assumptions where employers emphasize their choices among the top 75th or 50th percentiles
of applicants based on reputation are reasonable, especially considering that we are only
showing 10 applications to the employer in each choice task, while in the observational data
each job in the Websites, IT and Software category receives an average of 27 applications.
Profile picture accessibility manipulation In traditional labor markets, recruiters are
not allowed to request the inclusion of photographs in job applications.
21
Although recruiters
will eventually meet the candidates before making their offer (e.g., in-person interviews),
avoiding the use of profile pictures in the early stage of the application process aims to
reduce hiring biases (Ruffle and Shtudiner 2015). To explore an analogous potential solution
within the context of online labor marketplaces, we examine a platform design that changes
the timing in which profile pictures are shown to employers (Luca 2017).
21 U.S. Equal Employment Opportunity Commission. Source: https://www.eeoc.gov/prohibited-employment-policiespractices#pre-employment_inquiries
With this goal in mind, we manipulate the accessibility of profile pictures to create two
conditions: 1) with picture; and 2) picture only shown after a click. The first condition is the
same as the current platform design, where participants can immediately see a freelancer’s
profile picture along with other attributes of his/her application. In the second condition,
freelancers’ profile pictures are not automatically displayed on the platform interface, but
they are revealed when participants click a tab next to the freelancer’s application. We
illustrate these two conditions in Figure 2.2. In the after click condition, we are particularly
interested in how many profile pictures were clicked open and the type of profiles that
participants choose to click on. Such findings will yield interesting insights regarding the
mechanism underlying the participants’ decision processes.
Data collection We recruit participants through Amazon Mechanical Turk. We randomly
assign participants to one of the six experimental conditions, obtaining an average sample
size of 100 participants per condition. To control for order effects, we randomize the order
in which the freelancer’s profiles are shown to participants in each choice task.
Aiming to relieve concerns regarding the quality of the responses, we implement three
quality checks. First, we recruit participants located in the U.S. who had completed at least
1,000 HITs and had an approval rate equal to or above 99%. Second, after participants read
the survey’s instructions, they answered three questions to verify that they had read and
understood the choice task and the attributes. We do not record responses from participants
who fail to provide the right answer to any of these three questions. Lastly, we compare
results when excluding participants who answered the survey too quickly or too slowly, and
our key findings remain qualitatively the same.
Figure 2.2: Example of a choice task under the two platform design conditions in experi-
mental study 1. Left panel: picture condition; right panel: after click condition. In each
panel, participants see the prompt "Scenario 3/16: If you had to choose among the following
10 candidates, which one would you hire?"; in the after click condition, they are also re-
minded that they can see the worker's profile picture by clicking on the button next to
his/her username.
2.3.2 Results
To analyze responses from the “with picture” condition, we use a multinomial logit model
to obtain the part-worths parameters reported in Table 2.7. To evaluate the performance of
the conjoint model, we compute the in-sample and out-of-sample hit rates. Considering that
our choice tasks have ten applicants in each task, we compare our results to a baseline of 10%
hit rate (i.e., prediction by chance). Across experimental conditions, the minimum in-sample
hit rate was 36%, i.e., 3.6 times better than the baseline, and the minimum out-of-sample hit
rate computed using the 2 hold-out profiles was 23%, i.e., 2.3 times better than the baseline.
We also note that the estimated parameters for attributes such as reputation and price are
consistent with our findings from the observational data and prior literature across the three
diagnosticity conditions. For example, ceteris paribus participants prefer freelancers with
higher reputation levels, who have a certification, and who offer a low price. In what follows,
we summarize our findings regarding the effect of perceived job fit and insights from the
after click condition.
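For reference, hit rates of this kind can be computed by predicting, within each choice task, the alternative with the highest estimated utility and checking whether it matches the participant's actual choice. A minimal sketch with hypothetical column names:

```python
import pandas as pd


def hit_rate(df: pd.DataFrame) -> float:
    """df: one row per (participant, task, alternative) with hypothetical
    columns 'resp_id', 'task_id', 'chosen' (0/1), and 'utility' (model score)."""
    # Flag, within each choice task, the alternative(s) with the highest utility.
    is_predicted = (
        df.groupby(["resp_id", "task_id"])["utility"].transform("max").eq(df["utility"])
    )
    # A task counts as a hit when the predicted alternative is the one chosen.
    n_hits = int((is_predicted & (df["chosen"] == 1)).sum())
    n_tasks = df.groupby(["resp_id", "task_id"]).ngroups
    return n_hits / n_tasks
```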
The effect of perceived job fit In Table 2.7, we observe that when the diagnosticity of
the reputation system is high, the perceived job fit coefficient is not significant. However, as
the reputation system becomes less diagnostic, this coefficient turns positive and significant.
In other words, the role of profile pictures depends on how diagnostic the online reputation
variables are, a finding that seems in tension with the mechanism where profile pictures
are first used to screen out candidates who don’t look the part. If participants of our study
indeed use whether the freelancers "look the part" to first rule out applicants, as in the heuristic
suggested by Bertrand and Mullainathan (2004), profile pictures should play a role regardless
of how diagnostic the online reputation variables are.
Table 2.7: Part-worths and attribute importance for the with picture condition
across three reputation diagnosticity conditions

                                        Reputation diagnosticity condition
                                         High        Mid        Low
Part-worths
Perceived Freelancer-Job Fit:
  High                                  0.025      0.173      0.192
Reputation:
  Low N Reviews - Low Rating            0.836
  Low N Reviews - High Rating           2.145      1.343      0.440
  High N Reviews - Low Rating           2.439      1.444      0.425
  High N Reviews - High Rating          3.896      2.811      1.231
Certification:
  Yes                                   1.564      1.249      0.954
Price:
  $150                                  0.807      0.945      0.717
  $200                                  1.591      2.055      1.145
Additional Controls:
  Gender and Race
N                                      12,480     11,400     11,280
LL                                     1,712.5    1,767.0    2,217.7
AIC                                    3,449.0    3,556.0    4,457.4
BIC                                    3,538.2    3,636.7    4,538.0
Note: Part-worths are obtained using a logit model with robust standard errors. The de-
pendent variable is whether the participant i chose freelancer j in the conjoint profile t. In
column (1), the {None} reputation level is set as the reference level. In columns (2) and (3), the
{Low N Reviews - Low Rating} level is set as the reference level.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
Attribute importance Based on the part-worths parameters in Table 2.7, we compute
the importance of the different attributes across the three experimental conditions. We note
that, as the reputation system becomes less diagnostic, reputation becomes less important (de-
creasing from 52% in the high, to 42% in the mid, to 30% in the low diagnosticity condition),
and profile picture variables (gender, race, and perceived job fit) become more important
(increasing from 6% in the high, to 10% in the mid, to 18% in the low diagnosticity condition).
We also observe that in the mid and low diagnosticity conditions, perceived job fit is roughly
as important as gender and race.
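Attribute importances of this kind are conventionally computed as each attribute's part-worth range divided by the sum of ranges across attributes. A minimal sketch with illustrative (not estimated) part-worth values:

```python
# Illustrative part-worths per attribute level (not the Table 2.7 estimates);
# the first level of each attribute is the reference level at 0.
part_worths = {
    "reputation":      [0.0, 0.44, 0.43, 1.23],
    "certification":   [0.0, 0.95],
    "price":           [0.0, -0.72, -1.15],
    "profile_picture": [0.0, 0.19],   # e.g., perceived job fit low vs. high
}

# Importance of an attribute = its utility range as a share of the total range.
ranges = {attr: max(v) - min(v) for attr, v in part_worths.items()}
total_range = sum(ranges.values())
importance = {attr: 100 * r / total_range for attr, r in ranges.items()}
print(importance)
```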
Insights from the after click condition To analyze the after click condition responses,
we examine how participants interacted with the profile pictures. First, we look at the
percentage of participants who clicked on at least one picture. We observe that when the
reputation system is very diagnostic, 32% of the participants clicked on at least one profile
picture before making their choice. When the reputation system is not diagnostic, that
percentage rises to 60%.
Second, we compare the reputation variables of the candidates whose profile pictures were
and were not clicked on. We observe that candidates whose profile pictures were clicked on
have a significantly higher number of reviews and higher ratings than candidates whose
profile pictures were not clicked on. This pattern is consistent across the three diagnosticity
conditions.
Lastly, focusing on participants who click on at least one picture, we compute the average
number of pictures they clicked on. We observe that participants clicked on three pictures
per question (on average), independent of the diagnosticity condition.
Together, these statistics suggest that, on average, participants clicked on the profile
pictures of three out of the ten candidates, particularly those with a high reputation, be-
fore making their choice. In other words, rather than clicking open the profile pictures of
all candidates, participants seem to only look at a selective set of them contingent on can-
didates’ reputation, a finding that further rules out the mechanism proposed by Bertrand
and Mullainathan (2004) where profile pictures are used as the first criterion to screen out
candidates.
2.3.3 Discussion
In summary, our first experimental study provides three key insights. First, we show
that as the reputation system becomes less diagnostic, reputation becomes less important,
and conversely, profile pictures become more important for hiring choices. Indeed, when
reputation systems are not very diagnostic, profile pictures that look the part lead to better
hiring outcomes. Considering that our mid diagnosticity, with picture experimental condition
best approximates the observational data, our experimental findings
reinforce and provide a causal interpretation for the effect of perceived job fit we obtain in
our observational study.
Second, we show that even with the nominal cost of having to click on a box to see the
freelancers' profile pictures, participants do so only for a small number of freelancers with a
high reputation. These experimental findings reinforce our proposed mechanism of profile
pictures serving as tiebreakers.
Last but not least, our findings imply that as long as the reputation system is not very
diagnostic, a platform design in which participants have access to freelancers’ profile pictures
at the cost of one click might not be sufficient to mitigate the “look the part” hiring bias. The
fact that profile pictures have become the norm on online platforms (Luca 2017),22 and the
fact that many freelance platforms strongly encourage or, in some cases, require freelancers to
upload their profile picture, may suggest that these platforms do not have strong incentives to
remove profile pictures. Such observations motivate the following question: is there anything
that freelancers can do with their profile pictures to address the biases that might go against
them? We explore answers to this question in the next section.
22 One exception in the context of the sharing economy is Airbnb, which by 2018 decided to hide guests'
photos from hosts until the booking process is completed. Source: https://news.airbnb.com/update-on-profile-photos/
2.4 Experimental Study 2
In this section, we present the second experimental study where we explore whether
freelancers can exploit their profile pictures to improve their hiring outcomes. First, we
present an analysis to understand what a high fit programmer looks like, or in other words,
how freelancers can enhance their profile picture to "look the part," in Section 2.4.1. Then, we
describe the experimental design in Section 2.4.2, and our results in Section 2.4.3. Finally,
we discuss and summarize our main findings in Section 2.4.4.
2.4.1 What Does a High Job Fit Programmer Look Like?
In this subsection we explore what a high job fit programmer looks like. We focus
our analysis on programmers given the experimental setting of our second choice experiment.
A similar analysis can be done for other job categories in our data. To explore this question,
we focus on the ratings for perceived job fit as a programmer provided by human raters,
i.e., the ratings we collected to train the image classifier in Section 2.2.2. More specifically,
we estimate a regression using these ratings as the dependent variable, and three sets of
explanatory variables: 1) demographics, 2) physical appearance, and 3) background and
accessories. We include demographics (e.g., race and gender) and physical appearance (e.g.,
beauty) variables because we expect the perceived job fit ratings to reflect some stereotypical
beliefs about computer and software engineers. For example, as noted in the introduction,
scholarly and anecdotal evidence suggests that the stereotypical computer or software engi-
neer is male, likely White or Asian, and not necessarily attractive. We also include a smile
variable, to control for such a facial expression that generally elicits positive responses from
observers (Fagerstrøm et al. 2017). Finally, we include background and accessory variables
that might signal that a freelancer spends considerable time in front of a computer (e.g.,
the presence of a computer), or that could serve as a proxy for perceived intelligence (e.g.,
wearing glasses). All labels for these explanatory variables are obtained from the Face++ and
Clarifai Cloud Vision APIs (validation for these labels is provided in Appendix B.1).
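A minimal sketch of this regression, assuming a dataframe with one row per rated picture and hypothetical column names mirroring the three blocks of covariates in Table 2.8 (it is not the exact estimation code):

```python
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.read_csv("profile_picture_ratings.csv")  # assumed file and column names

# Dependent variable: average human rating of perceived fit as a programmer.
# Covariate blocks: demographics, physical appearance, background and accessories.
formula = (
    "perceived_fit_rating ~ human + female + far_east_asian + black + indian"
    " + beauty + smile"
    " + outdoor_background + sunglasses + normal_glasses + computer"
)
ols_fit = smf.ols(formula, data=ratings).fit()
print(ols_fit.summary())
```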
We present our results in Table 2.8. Our estimates in column 1, which only include
demographics variables, suggest that females are perceived as a lower fit as programmers,
and Indians are perceived as a higher fit as programmers. These estimates seem to be
consistent with the stereotypical beliefs about programming workers. Our estimates in
column 2, which adds physical appearance variables, suggest that beauty does not signifi-
cantly correlate with perceived job fit ratings for programmers, and smile has a positive and
significant correlation with perceived job fit ratings. Finally, in column 3, we incorporate
the background and accessory variables. The estimates show that pictures taken outdoors
are associated with a lower perceived fit as a programmer, while the presence of a computer
and wearing eyeglasses are associated with a higher perceived fit.
Based on the results in Table 2.8, the variables we include in the model explain 32% of
the variation of the perceived job fit ratings. Such a modest level of model fit is consistent
with our notion that perceptions of job fit from profile pictures are often associated with
a set of stereotypical beliefs (such as inferred capabilities and personality traits) above and
beyond simple visual cues such as the ones we could readily label using existing APIs.
Overall, these results suggest that freelancers' characteristics that cannot be easily changed,
such as demographic and physical appearance, play a significant role in employers’ percep-
tions of their fit as a programmer. More importantly, our results also suggest that other aspects
that can be easily modified in the pictures (e.g., background and accessories) can help free-
lancers "look the part," i.e., to be perceived as a higher fit as programmers. We explore this
further in our second choice experiment.

Table 2.8: What does a high fit programmer look like?

                           (1)        (2)        (3)
Intercept                2.430      2.430      2.404
Demographics:
  Human                  0.790      0.700      0.664
  Female                 0.199      0.273      0.237
  Far East Asian         0.085      0.107      0.064
  Black                  0.064      0.054      0.045
  Indian                 0.150      0.164      0.157
Physical Appearance:
  Beauty                            0.055      0.004
  Smile                             0.266      0.259
Background and Accessory:
  Outdoor Background                           0.143
  Sunglasses                                   0.246
  Normal Glasses                               0.397
  Computer                                     0.637
N                          961        961        961
Adj R2                   0.244      0.260      0.336
Note: OLS estimates. The dependent variable is the perceived job fit score as a pro-
grammer as provided by human raters (averaged across raters). The White race label is
set as the reference level.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
2.4.2 Experimental Design
The scenario of our second choice experiment is similar to the one used in Section 2.3,
i.e., a hypothetical choice experiment similar to conjoint where participants are asked to
imagine that they are hiring a programmer from a freelancing platform to design a personal
web page. This experiment has a within-participants design, in which we manipulate the
profile picture of one “treated” candidate per choice task. Below, we explain the details of
our experimental design.
Choice profiles We use the choice tasks from the mid diagnosticity and with picture
condition in Section 2.3.1 as the focal starting point in this study. The motivation for this
decision is that this condition closely follows the characteristics of the platform from the
observational study. In particular, we select ten out of the sixteen choice tasks where at least
3 of the ten candidates were chosen more than 10% of the time (i.e., candidates chosen
better than by chance) in the first choice experiment. We employ this criterion under the
consideration that we do not expect our visual manipulation to have a significant effect
on candidates who have very little chance of being hired, particularly based on our previous
findings that profile pictures are used as tiebreakers.
Profile picture manipulation We manipulate the profile picture of one "treated" ap-
plicant per choice task to create two conditions: "does not look the part" and "looks the
part." By manipulating the profile picture of only one applicant per task, we can keep the
characteristics of the remaining applicants constant and cleanly measure the impact of the
visual manipulation on the hiring outcomes of the one "treated freelancer." For each
choice task, the "treated freelancer" is selected at random among the three candidates chosen
more than 10% of the time in our previous study (i.e., chosen better than by chance). We
also ensure that at least one candidate from each gender and race group was selected.
Next, we exploit findings from Table 2.8 to create the visual manipulations for our two
experimental conditions. More specifically, we hired a professional photo editor to create
two versions of the profile pictures for each “treated freelancer”: (i) a version with outdoor
background and no glasses; and (ii) a version with a home-office background including a
computer and wearing glasses. We depict our manipulations in Figure 2.3.
Based on our prior findings (Table 2.8), we expect that for each pair in Figure 2.3, the
version on the left (outdoor background without glasses) will be perceived as a lower fit
as a programmer compared to the version on the right (indoor background with a computer
visually in sight and wearing glasses). As a manipulation check, we conduct a pretest survey
in which Mturkers rate one of the two versions of each picture (chosen at random) based on
their perceived job fit as programmers. The results, which we report in the Appendix B.4,
confirm that the manipulation works as expected. Hence, we use the version with outdoor
background and no glasses for the “does not look the part condition,” and the version with
home-office background, computer, and wearing glasses for the “looks the part condition” in
our experiment.
Figure 2.3: Visual manipulations in experimental study 2.
Note: Each pair represents a "treated freelancer." For each pair, the image on the left (outdoor
background without glasses) is used in the "does not look the part condition," and the image
on the right (home-office background with computer and wearing glasses) is used in the
"looks the part picture condition."
We also aim to explore whether our visual manipulation has a greater impact on free-
lancers who do not have the appearance of the stereotypical programmer independent of
picture background and accessories. Therefore, we created a plain version of the profile pic-
tures without any background and not wearing glasses, as illustrated in Figure 2.4. Using
the same pretest survey we used as the manipulation check, we asked Mturkers to rate on
a 7-point Likert scale the degree to which each freelancer in the plain picture version looks
like a stereotypical programmer.
Figure 2.4: Plain profile pictures in experimental study 2.
Plain version of the profile pictures used to measure whether each freelancer looks like a
stereotypical programmer without accessories or background
Figure 2.5: Example of a choice task under the two experimental conditions in study 2.
Left panel: does not look the part condition; right panel: looks the part condition.
Note: In this example, the "treated candidate" corresponds to the candidate shown as the sev-
enth alternative (Oyinda). Note that besides the visual manipulation in her profile picture,
everything else (her other attributes and other freelancers' attributes and profile pictures)
remains constant across conditions.
Data collection We recruit 300 participants from Amazon Mechanical Turk to complete
our second choice experiment. The experimental condition is randomized at the participant-
conjoint task level. Namely, for each participant and conjoint task, we randomly show either
the “does not look the part” or “looks the part” profile picture manipulation for the treated
freelancer of that conjoint task. We illustrate the two experimental conditions for one choice
task in Figure 2.5. To control for order effects, we also randomize the order in which the
freelancer’s profiles were shown to participants. Lastly, we follow the same quality checks
we used in the first experimental study to relieve concerns regarding responses’ quality.
2.4.3 Results
To analyze the effect of our profile picture manipulation, we focus on the hiring outcomes
of the treated freelancers by estimating a logit model with the following specification:
$$\text{Hired Treated Freelancer}_{ijt} = \alpha_t + \beta_1\,\text{Looks the part Condition}_{ijt} + \beta_2\,\text{Stereotypical}_j + \beta_3\,\text{Looks the part Condition}_{ijt} \times \text{Stereotypical}_j + \varepsilon_{ijt} \qquad (2.3)$$
where Hired Treated Freelancer_ijt indicates whether participant i hired the treated freelancer
j in the choice task t; α_t is a choice task fixed effect that allows us to control for the character-
istics of the treated freelancer that are not being manipulated (price, reputation, certifica-
tion, gender, and race) and all characteristics of the remaining freelancers in the choice set;
Looks the part Condition_ijt is an indicator of whether respondent i sees the profile picture
from the "looks the part condition" for the treated freelancer j in the choice task t; and
Stereotypical_j is the average stereotypical programmer rating for the treated freelancer j in
the plain version of the picture.
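As an illustration of how Equation 2.3 could be estimated, here is a minimal sketch assuming one row per (participant, choice task) observation of the treated freelancer and hypothetical column names (hired_treated, looks_the_part, stereotypical, task_id); the choice-task fixed effects are absorbed with indicator variables, and this is a sketch of the specification rather than the exact estimation code.

```python
import pandas as pd
import statsmodels.formula.api as smf

exp2 = pd.read_csv("experiment2_treated.csv")  # assumed file and column names

# looks_the_part is the condition dummy, stereotypical the plain-picture rating,
# and C(task_id) provides the choice-task fixed effects alpha_t in Equation 2.3.
spec = "hired_treated ~ looks_the_part * stereotypical + C(task_id)"
logit_fit = smf.logit(spec, data=exp2).fit(disp=0)
print(logit_fit.summary())
```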
We report the results in Table 2.9. Our results suggest that, keeping everything else
constant, a picture that looks the part can increase a freelancer's likelihood of being hired
(β = 1.051, p < 0.05). Moreover, the negative and significant interaction term (β = −0.267,
p < 0.10) suggests that such an effect is stronger for freelancers who do not naturally look like a
stereotypical programmer in the plain picture.
Table 2.9: Impact of profile picture manipulation on the "treated
freelancer" hiring outcomes

                                              Hired Treated Freelancer
Intercept                                              13.493
Looks the part Condition                                1.051
Stereotypical                                           3.363
Looks the part Condition × Stereotypical               -0.267
Controls:
  Freelancer FE                                           Yes
N                                                       3,020
LL                                                     -1,514
AIC                                                     3,052
BIC                                                     3,124
Note: Logistic regression estimates using robust standard errors. The depen-
dent variable is whether respondent i hired the treated freelancer j in the
choice task t.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
2.4.4 Discussion
In summary, this experimental study provides two insights. First, for the same freelancer
(holding gender and race constant), subtle visual manipulations of his/her profile pictures
(background and accessories) can help elicit perceptions of higher job fit. These findings
provide further evidence that the “look the part” bias goes above and beyond well-known
prejudice variables such as gender and race.
Second, we show that changing the background and accessories in the profile pictures can
directly impact freelancers’ hiring outcomes. We find that using home-office background and
wearing glasses help freelancers looking for programming jobs, especially for freelancers who
do not look like a stereotypical programmer. As such, our findings have valuable implications
for freelancers who are at a greater disadvantage by profile picture requirements, specifically
regarding what they can do to counteract the biases that they face.
2.5 Conclusions
Based on an observational study and two experimental studies, our paper provides evi-
dence that the use of profile pictures in online labor marketplaces can facilitate hiring biases
from appearance-based perceptions of freelancer-job fit. We show that more favorable per-
ceptions of job fit lead to better hiring outcomes, and that such perceptions become more
important for hiring decisions as the reputation system becomes less diagnostic. Interestingly,
we find that profile pictures complement rather than substitute information from standard-
ized reputation systems. Last but not least, we show that freelancers can improve their hiring
outcomes by strategically selecting the background and accessories of their profile pictures.
Moreover, this is especially true for freelancers who do not have a stereotypical look for the
focal job, or in other words, freelancers who “do not look the part.”
Our paper makes the following contributions to the discrimination and online market-
places literature. First, to the best of our knowledge, our paper is among the first com-
prehensive studies to identify a "look the part" bias on hiring outcomes in online labor
markets. More importantly, we show that such a bias goes above and beyond gender, race,
and attractiveness, three discriminatory variables commonly studied in the literature (Pope
and Sydnor 2011, Hannák et al. 2017, Chan and Wang 2018, Malik et al. 2019), and it affects
hiring outcomes. Second, this paper is among the first to explore the
interplay between standardized reputation systems and profile pictures. We show that when
the former is not sufficiently diagnostic, the latter becomes more important in hiring out-
comes. In doing so, we shed light on some limitations of standardized reputation systems as
moderators of discrimination (Cui et al. 2020) in online labor marketplaces. Lastly, we pro-
pose a novel mechanism in line with statistical discrimination, more specifically in the vein
of heuristic decision-making. We suggest that employers use profile pictures as tiebreakers
when multiple applicants seem to be similarly qualified.
Our findings have important ethical implications for platforms, especially regarding the
prevalence of profile pictures as a key design element. Indeed, we show that standardized
reputation systems (Cui et al. 2020) or changes in the prevalence of profile pictures by
making pictures only available after click (Luca 2017) are not sufficient solutions for freelance
platforms, where many jobs receive a large number of applications and the reputation system
is such that ratings are extremely positive. Instead, our findings suggest that freelance
platforms need to explore solutions that make their reputation system more diagnostic, e.g.,
to provide enough variation for employers to easily rank applicants. Furthermore, because the
freelance workforce has become a significant part of the U.S. workforce, our results may also
have important implications for policymakers. For example, policymakers could evaluate
whether certain prohibited employment policies/practices, such as “employers should not
ask for a photograph of an applicant,”
23
should also apply in the context of online labor
marketplaces. Last but not least, our findings provide freelancers with guidelines on how to
mitigate the hiring bias that may place them at a disadvantage. We show that, with simple
changes in the background or accessories in their profile pictures, freelancers can signal to
employers that they are a good fit for the job, which in turn, can translate into better hiring
outcomes.
23 U.S. Equal Employment Opportunity Commission. Source: https://www.eeoc.gov/prohibited-employment-policiespractices#pre-employment_inquiries
Finally, our research also provides some fruitful directions for future research. First, while
we focus on the role of profile pictures on hiring outcomes, future research could explore the
role of videos in a similar vein. For example, Fiverr.com allows freelancers to upload a video
in their profiles to present themselves to potential employers. Compared to a profile picture,
a video may have an even stronger impact on employers’ perceptions of the freelancer, for
instance, through the dynamic of their facial gestures (e.g., dynamics of a smile, Krumhuber
et al. 2009) or through their voice (e.g., perceptions of competence, Burgoon 1978). Second,
while we have shown that the diagnosticity of a platform design is essential to moderate
hiring biases, future research can further explore how to design a platform to satisfy this
condition, or alternative platform design choices that could help to address the “look the
part” bias we uncover in this research. Lastly, while we focus on the context of online
labor marketplaces, there are many other online platforms in which profile pictures are a
key component of the platform design (e.g., LinkedIn.com; Care.com; Healthgrades.com).
Although such platforms only facilitate the initial touchpoint between employers and service
providers (the two parties will eventually work together in physical spaces), future studies
could extend our research by exploring the business implications of digital profile pictures
and appearance-based inferences in these alternative settings.
Bibliography
Acemoglu, D., Naidu, S., Restrepo, P. and Robinson, J. A. (2019). Democracy does cause
growth. Journal of Political Economy, 127 (1), 47–100.
Agrawal, A., Lacetera, N. and Lyons, E. (2016). Does standardized information in online
markets disproportionately benefit job applicants from less developed countries? Journal of
International Economics, 103, 1–12.
Altonji, J. G. and Blank, R. M. (1999). Race and gender in the labor market. Handbook of
labor economics, 3, 3143–3259.
Aribarg, A., Burson, K. A. and Larrick, R. P. (2017). Tipping the scale: The role of dis-
criminability in conjoint analysis. Journal of Marketing Research, 54 (2), 279–292.
Arrow, K. et al. (1973). The theory of discrimination. Discrimination in Labor Markets, 3 (10),
3–33.
Ayeh, J. K., Au, N. and Law, R. (2013). “Do we believe in TripAdvisor?” Examining credibility
perceptions and online travelers’ attitude toward using user-generated content. Journal of
Travel Research, 52 (4), 437–452.
Baert, S. (2018). Facebook profile picture appearance affects recruiters’ first hiring decisions. new
media & society, 20 (3), 1220–1239.
Benson, A., Sojourner, A. and Umyarov, A. (2020). Can reputation discipline the gig econ-
omy? Experimental evidence from an online labor market. Management Science, 66 (5), 1802–
1825.
Bertrand, M., Duflo, E. and Mullainathan, S. (2004). How much should we trust differences-
in-differences estimates? The Quarterly Journal of Economics, 119 (1), 249–275.
— and Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal?
A field experiment on labor market discrimination. American Economic Review, 94 (4), 991–
1013.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research, 3 (Jan), 993–1022.
Bohren, J. A., Imas, A. and Rosenberg, M. (2019a). The dynamics of discrimination: Theory
and evidence. American Economic Review, 109 (10), 3395–3436.
—, — and — (2019b). The dynamics of discrimination: Theory and evidence. American Economic
Review, 109 (10), 3395–3436.
Brewer, N., Mitchell, P. and Weber, N. (2002). Gender role, organizational status, and
conflict management styles. International Journal of Conflict Management, 13 (1), 78–94.
Brown, R., Gilman, A. et al. (1960). The pronouns of power and solidarity.
Burgoon, J. K. (1978). Attributes of the newscaster’s voice as predictors of his credibility. Jour-
nalism Quarterly, 55 (2), 276–300.
Canziani, A., Paszke, A. and Culurciello, E. (2016). An analysis of deep neural network
models for practical applications. arXiv preprint arXiv:1605.07678.
Chan, J. and Wang, J. (2018). Hiring preferences in online labor markets: Evidence of a female
hiring bias. Management Science, 64 (7), 2973–2994.
Chevalier, J. A., Dover, Y. and Mayzlin, D. (2018). Channels of impact: User reviews when
quality is dynamic and managers respond. Marketing Science, 37 (5), 688–709.
Cui, R., Li, J. and Zhang, D. J. (2020). Reducing discrimination with reviews in the sharing
economy: Evidence from field experiments on Airbnb. Management Science, 66 (3), 1071–
1094.
Davis, D. R., Dingel, J. I., Monras, J. and Morales, E. (2019). How segregated is urban
consumption? Journal of Political Economy, 127 (4), 1684–1738.
Doleac, J. L. and Stein, L. C. (2013). The visible hand: Race and online market outcomes. The
Economic Journal, 123 (572), F469–F492.
Edelman, B. G. and Luca, M. (2014). Digital discrimination: The case of airbnb. com. Harvard
Business School NOM Unit Working Paper, (14-054).
Ert, E., Fleischer, A. and Magen, N. (2016). Trust and reputation in the sharing economy:
The role of personal photos in airbnb. Tourism Management, 55, 62–73.
Fagerstrøm, A., Pawar, S., Sigurdsson, V., Foxall, G. R. and Yani-de Soriano, M.
(2017). That personal profile image might jeopardize your rental opportunity! On the relative
impact of the seller's facial expressions upon buying behavior on Airbnb. Computers in
Human Behavior, 72, 123–131.
Filippas, A., Horton, J. J. and Golden, J. M. (2019). Reputation inflation. National Bureau
of Economic Research Working Paper.
Gallus, J. and Bhatia, S. (2020). Gender, power and emotions in the collaborative production of
knowledge: A large-scale analysis of wikipedia editor conversations. Organizational Behavior
and Human Decision Processes, 160, 115–130.
Gary, S. B. (1957). The economics of discrimination. The American Catholic Sociological Review,
18, 276.
Gomila, R. (2019). Logistic or linear? Estimating causal effects of treatments on binary outcomes
using regression analysis. Working Paper.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal
of Econometrics, 225 (2), 254–277.
Håkansson, J. (2012). The use of personal pronouns in political speeches: A comparative study
of the pronominal choices of two American presidents.
Hannák, A., Wagner, C., Garcia, D., Mislove, A., Strohmaier, M. and Wilson, C. (2017).
Bias in online freelance marketplaces: Evidence from TaskRabbit and Fiverr. In Proceedings of
the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp.
1914–1933.
Hartmann, J., Heitmann, M., Schamp, C. and Netzer, O. (2019). The power of brand selfies in
consumer-generated brand images. Columbia Business School Research Paper, (Forthcoming).
Herbert, R. K. (1990). Sex-based differences in compliment behavior 1. Language in Society,
19 (2), 201–224.
Holmes, J. (1988). Paying compliments: A sex-preferential politeness strategy. Journal of Prag-
matics, 12 (4), 445–465.
Imai, K., Kim, I. S. and Wang, E. (2018). Matching methods for causal inference with time-series
cross-section data. Princeton University, 1.
Johnson, D. M. and Roen, D. H. (1992). Complimenting and involvement in peer reviews:
Gender variation. Language in Society, 21 (1), 27–57.
Kashima, Y., Wilson, S., Lusher, D., Pearson, L. J. and Pearson, C. (2013). The ac-
quisition of perceived descriptive norms as social category learning in social networks. Social
Networks, 35 (4), 711–719.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–
1105.
Krumhuber, E., Manstead, A. S., Cosker, D., Marshall, D. and Rosin, P. L. (2009).
Effects of dynamic attributes of smiles in human and synthetic faces: A simulated job interview
setting. Journal of Nonverbal Behavior, 33 (1), 1–15.
Lakoff, R. (1973). Language and woman’s place. Language in Society, 2 (1), 45–79.
— (1986). You say what you are: Acceptability and gender-related language. In Dialect and Lan-
guage Variation, Elsevier, pp. 403–414.
Liu, X., Lee, D. and Srinivasan, K. (2019). Large-scale cross-category analysis of consumer
review content on sales conversion leveraging deep learning. Journal of Marketing Research,
56 (6), 918–943.
Luca, M. (2017). Designing online marketplaces: Trust and reputation mechanisms. Innovation
Policy and the Economy, 17 (1), 77–93.
Luo, L., Kannan, P. and Ratchford, B. T. (2008). Incorporating subjective characteristics in
product design and evaluations. Journal of Marketing Research, 45 (2), 182–194.
Malik, N., Singh, P. V., Lee, D. D. and Srinivasan, K. (2019). A dynamic analysis of beauty
premium. Available at SSRN 3208162.
McPherson, M., Smith-Lovin, L. and Cook, J. M. (2001). Birds of a feather: Homophily in
social networks. Annual Review of Sociology, 27 (1), 415–444.
Netzer, O., Lemaire, A. and Herzenstein, M. (2019). When words sweat: Identifying signals
for loan default in the text of loan applications. Journal of Marketing Research, 56 (6), 960–980.
Newman, M. L., Groom, C. J., Handelman, L. D. and Pennebaker, J. W. (2008). Gender
differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45 (3),
211–236.
Olivola, C. Y., Sussman, A. B., Tsetsos, K., Kang, O. E. andTodorov, A. (2012). Repub-
licans prefer republican-looking leaders: Political facial stereotypes predict candidate electoral
success among right-leaning voters. Social Psychological and Personality Science, 3 (5), 605–
613.
— and Todorov, A. (2010a). Elected in 100 milliseconds: Appearance-based trait inferences and
voting. Journal of Nonverbal Behavior, 34 (2), 83–110.
— and — (2010b). Fooled by first impressions? Reexamining the diagnostic value of appearance-
based inferences. Journal of Experimental Social Psychology, 46 (2), 315–324.
Orme, B. (2002). Formulating attributes and levels in conjoint analysis. Sawtooth Software Research
Paper, pp. 1–4.
Parkhi, O. M., Vedaldi, A. and Zisserman, A. (2015). Deep face recognition. British Machine
Vision Association.
Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic inquiry and word
count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71 (2001), 2001.
Phelps, E. S. (1972). The statistical theory of racism and sexism. The American Economic Review,
62 (4), 659–661.
Polo, F. J. F. (2018). Functions of “you” in conference presentations. English for Specific Purposes,
49, 14–25.
Pope, D. G. and Sydnor, J. R. (2011). What’s in a picture? Evidence of discrimination from
prosper. com. Journal of Human Resources, 46 (1), 53–92.
Proserpio, D. and Zervas, G. (2017). Online reputation management: Estimating the impact of
management responses on consumer reviews. Marketing Science, 36 (5), 645–665.
Rambachan, A. and Roth, J. (2019). An honest approach to parallel trends. Unpublished
manuscript, Harvard University.
Rao, V. R. et al. (2014). Applied conjoint analysis. Springer.
Rosenblat, T. S. (2008). The beauty premium: Physical attractiveness and gender in dictator
games. Negotiation Journal, 24 (4), 465–481.
Ruffle, B. J. and Shtudiner, Z. (2015). Are good-looking people more employable? Management Science, 61 (8), 1760–1776.
Scheve, K. and Stasavage, D. (2012). Democracy, war, and wealth: lessons from two centuries
of inheritance taxation. American Political Science Review, 106 (1), 81–102.
Seamans, R. and Zhu, F. (2014). Responses to entry in multi-sided markets: The impact of
craigslist on local newspapers. Management Science, 60 (2), 476–493.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Small, D. A., Gelfand, M., Babcock, L. and Gettman, H. (2007). Who goes to the bar-
gaining table? The influence of gender and framing on the initiation of negotiation. Journal
of Personality and Social Psychology, 93 (4), 600.
Sorenson, P. S., Hawkins, K. and Sorenson, R. L. (1995). Gender, psychological type and
conflict style preference. Management Communication Quarterly, 9 (1), 115–126.
Srivastava, N.,Hinton, G.,Krizhevsky, A.,Sutskever, I. andSalakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine
Learning Research, 15 (1), 1929–1958.
Sun, M. (2012). How does the variance of product ratings matter? Management Science, 58 (4),
696–707.
Tannen, D. (1991). You just don’t understand. Simon & Schuster Audio.
Thomson, R. and Murachver, T. (2001). Predicting gender from electronic discourse. British
Journal of Social Psychology, 40 (2), 193–208.
Ukanwa, K. and Rust, R. T. (2020). Algorithmic discrimination in service. Available at SSRN
3654943.
Vasilescu, B., Capiluppi, A. and Serebrenik, A. (2012). Gender, representation and online
participation: A quantitative study of stackoverflow. In Social Informatics (SocialInformatics),
2012 International Conference on, IEEE, pp. 332–338.
Wang, Y. and Chaudhry, A. (2018). When and how managers’ responses to online reviews affect
subsequent reviews. Journal of Marketing Research, 55 (2), 163–177.
Wang, Z., Walther, J. B., Pingree, S. and Hawkins, R. P. (2008). Health information,
credibility, homophily, and influence via the internet: Web sites versus discussion groups.
Health communication, 23 (4), 358–368.
Watson, J., Ghosh, A. P. and Trusov, M. (2018). Swayed by the numbers: the consequences
of displaying product review attributes. Journal of Marketing, 82 (6), 109–131.
Wei, X. and Stillwell, D. (2017). How smart does your profile image look? Estimating in-
telligence from social network profile images. In Proceedings of the Tenth ACM International
Conference on Web Search and Data Mining, pp. 33–40.
Yoganarasimhan, H. (2013). The value of reputation in an online freelance marketplace. Mar-
keting Science, 32 (6), 860–891.
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701.
Zhang, M. and Luo, L. (2019). Can user posted photos serve as a leading indicator of restaurant
survival? Evidence from yelp. Available at SSRN 3108288.
Zhang, S., Lee, D., Singh, P. V. and Srinivasan, K. (2017). How much is an image worth?
Airbnb property demand estimation leveraging large scale image analytics. Airbnb Property
Demand Estimation Leveraging Large Scale Image Analytics (May 25, 2017).
Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text
classification. In Advances in Neural Information Processing Systems, pp. 649–657.
Appendix A
Appendix to Chapter 1
A.1 Event Study
Here, by visualizing the estimates of event studies we implemented for our main specifications, Equations 1.2 and 1.3, we provide evidence that the parallel trends assumption is likely satisfied. We start by partitioning the time around the first response date of each hotel into 4-month intervals, so that [0, 4) is interval 0, [4, 8) is interval 1, [-4, 0) is interval -1, and so on. (We choose relatively large intervals to account for the fact that 66% of observations in our dataset have missing gender information.) As is standard in this type of analysis, we set never-treated hotels, i.e., hotels that never respond to reviews, to have the baseline interval (-1 in our case) to which we compare the rest of the interval coefficients (Goodman-Bacon 2021). Finally, we limit our analysis to ±12 intervals and bin the end points together, i.e., every interval outside the analysis period (above and below 12 intervals) is set to have the minimum interval value in the pre-treatment period and the maximum interval value in the post-treatment period. We then estimate Equations 1.2 and 1.3 but replace the variable After_ijt with the intervals described above. We plot the interval coefficients and 90% confidence intervals of Equation 1.2
in Panel (a) of Figure A.1, and the interval coefficients and their interactions with the review valence, along with their respective 90% confidence intervals, in Panels (b) and (c) of Figure A.1, respectively. What emerges from Figure A.1 is that, in the pre-treatment period, the difference in trends for treated and control hotels is close to zero in all panels. Moreover, in the post-treatment period, we observe a slightly negative trend in Panel (a), which is consistent with the negative but not significant estimates we report in column 1 of Table 1.3, and a more pronounced decrease in Panel (b) and an increase in Panel (c), both of which are consistent with the estimates reported in column 2 of Table 1.3.
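For concreteness, the interval construction described above can be sketched as follows (a minimal pandas sketch; the data frame `reviews` and the columns `review_date` and `first_response_date` are hypothetical names, not the ones used in our data):

```python
import numpy as np
import pandas as pd

def add_event_intervals(reviews: pd.DataFrame, width_days: int = 120,
                        max_interval: int = 12) -> pd.DataFrame:
    """Assign each review to a 120-day (roughly 4-month) interval relative to
    the hotel's first response, binning the end points at +/- max_interval."""
    days = (reviews["review_date"] - reviews["first_response_date"]).dt.days
    # Interval 0 covers [0, 120) days, interval -1 covers [-120, 0), and so on.
    interval = np.floor(days / width_days)
    # Every interval outside the analysis window is binned into the extremes.
    interval = interval.clip(lower=-max_interval, upper=max_interval)
    # Never-treated hotels (no response date) are assigned the baseline (-1).
    reviews["interval"] = interval.fillna(-1).astype(int)
    return reviews
```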
Figure A.1: Inspecting the parallel trends assumption. Panel (a): change in the likelihood that a review is from a female reviewer; Panel (b): change in the likelihood that a negative review is from a female reviewer; Panel (c): change in the likelihood that a positive review is from a female reviewer with respect to a negative review. All panels plot estimates against 120-day intervals around the first response.

Note: Estimated coefficients and 90% confidence intervals from regressing the likelihood that a review comes from a self-identified female reviewer on a set of 120-day intervals around the first response time of each hotel. In Panel (a), we plot the estimates of Interval for a regression without interaction with the valence of the reviews; in the bottom panels, we plot the estimates for a regression in which we add the interaction between the 120-day intervals and the review valence: Panel (b) plots the Interval coefficients, while Panel (c) plots the interaction Interval × Positive coefficients. The p-values of the slope of the best-fit line through the pre-treatment coefficients are 0.071, 0.221, and 0.582 for Panels (a), (b), and (c), respectively.
A.2 The Effect of Management Responses on Ratings
Our findings implicitly suggest that by responding to reviews and by being aggressive
towards self-identified female reviewers, hotels are likely to reduce the arrival of negative
reviews which, in turn, should increase their ratings. Thus, responding to reviews and being
aggressive is likely to be good for their online reputation.
We can explicitly test this hypothesis by estimating the following regression:
Stars_ijt = β_1 After_ijt + β_2 Female_ijt + β_3 After_ijt × Female_ijt + α_j + τ_t + ε_ijt,

where the dependent variable is the star-rating of review i of hotel j at time t. After_ijt is an indicator for whether review i of hotel j written at time t was written after the hotel adopted management responses, and Female_ijt indicates whether review i is from a self-identified female reviewer. α_j and τ_t are hotel and year-month fixed effects, respectively. Finally, we include treatment-specific time trends.
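A minimal sketch of how such a two-way fixed-effects specification could be estimated with clustered standard errors (statsmodels; the data frame `reviews` and its column names are hypothetical, and the treatment-specific time trend is approximated here by a treated-by-time interaction):

```python
import statsmodels.formula.api as smf

# Review-level OLS with hotel and year-month fixed effects entered as dummies,
# a treatment-specific linear time trend, and standard errors clustered by hotel.
twfe = smf.ols(
    "stars ~ after * female + treated:time_trend"
    " + C(hotel_id) + C(year_month)",
    data=reviews,
).fit(cov_type="cluster", cov_kwds={"groups": reviews["hotel_id"]})

print(twfe.params[["after", "female", "after:female"]])
```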
We report the estimates of this regression in column 1 of Table A.1. First, β_1 is positive and significant, showing that, on average, reviews coming from self-identified male reviewers are more positive after hotels begin to respond to them. Second, β_2 is positive, suggesting that reviews coming from female reviewers are, on average, more positive in the pre-response period. Third, β_3 is positive and significant, suggesting that after hotels begin to respond to reviews, reviews coming from female reviewers are more positive than those coming from self-identified male reviewers. (It is worth noting that the finding that reviews are more positive after hotels begin to respond replicates the findings of Proserpio and Zervas (2017).)
Table A.1: The effect of management responses and reviewer gender on star-ratings

                        (1)
After                  0.049
                      (0.011)
Female                 0.038
                      (0.006)
After × Female         0.015
                      (0.007)
Observations          673,604
Adjusted R²            0.174

Note: The dependent variable is the star-rating of review i of hotel j written at time t. All specifications include hotel and year-month fixed effects, and a treatment-specific time trend. Cluster standard errors at the hotel level are reported in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.

A.3 Determinants of the Timing of Hotels' First Response

To empirically explore what drives managers to begin responding to reviews, we follow Seamans and Zhu (2014) and estimate a discrete-time hazard model that predicts the
timing of hotels’ first review response as a function of review volume, ratings, review topics,
and hotel operation. Our specification takes the following form:
Respond_it = X'_it β + τ_t + ε_it,        (A.1)

where the dependent variable Respond_it indicates whether hotel i responded to reviews in period (year-month) t. The vector X'_it includes variables such as the cumulative number of reviews and the cumulative rating for hotel i up to time t, the number of reviews, average ratings, the fraction of negative reviews, the fraction of reviews containing LDA topics 1 and 2, the fraction of the reviews that were written by female users for hotel i at time t, and hotel operation (chain vs. independent). Finally, we include year-month dummies, τ_t, in the model.
We estimate Equation A.1 using a logit model, dropping all observations for hotel i
after the month of its first response. We report the estimates in Table A.2, where we
gradually include controls as we move from column 1 to column 3. The results suggest
that determinants of the first management response include the number of reviews, average
ratings, type of hotel operation, and the fraction of negative reviews, but not review topics
(column 2) or the fraction of reviews written by female reviewers (column 3). (These results are in line with those discussed in Proserpio and Zervas (2017), where the authors show that hotels with better ratings are more likely to respond, and that a negative shock to ratings precedes the first hotel response.)
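A sketch of how this discrete-time hazard model could be estimated is given below (statsmodels; the hotel-month panel `panel` and its column names are hypothetical; cluster-robust standard errors at the hotel level, as reported in Table A.2, would be added when fitting):

```python
import statsmodels.formula.api as smf

# Keep each hotel's months only up to (and including) its first response;
# hotels that never respond contribute their full history.
at_risk = panel[
    panel["first_response_month"].isna()
    | (panel["year_month"] <= panel["first_response_month"])
]

hazard = smf.logit(
    "respond ~ cum_reviews + cum_rating + n_reviews + avg_rating + chain"
    " + prop_negative + prop_topic1 + prop_topic2 + prop_female"
    " + C(year_month)",          # year-month dummies
    data=at_risk,
).fit()

print(hazard.summary())
```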
Table A.2: Determinants of the timing of the first response

                                    (1)        (2)        (3)
Cumulative Reviews                0.001      0.001      0.001
                                 (0.001)    (0.001)    (0.001)
Cumulative Rating                 0.400      0.400      0.400
                                 (0.015)    (0.015)    (0.015)
Reviews                           0.044      0.044      0.044
                                 (0.023)    (0.023)    (0.023)
Rating                            0.390      0.396      0.392
                                 (0.013)    (0.016)    (0.017)
Chain                             0.897      0.914      0.915
                                 (0.046)    (0.048)    (0.048)
Prop. of Negative Reviews         1.730      1.720      1.709
                                 (0.051)    (0.051)    (0.054)
Prop. of Topic 1 Reviews                     0.075      0.073
                                            (0.057)    (0.057)
Prop. of Topic 2 Reviews                     0.029      0.031
                                            (0.067)    (0.067)
Prop. of Reviews from Females                           0.032
                                                       (0.043)
Intercept                         5.436      5.442      5.443
                                 (0.238)    (0.238)    (0.238)
Time dummies                       Yes        Yes        Yes
Observations                     574,281    574,281    574,281
Log-likelihood                   -18,276    -18,274    -18,274
AIC                               36,855     36,856     36,857
BIC                               38,567     38,590     38,603

Note: The dependent variable is whether the hotel i starts responding to reviews in time t. In columns 2 and 3 we include controls for review topic, i.e., the proportion of reviews from topics estimated using the LDA algorithm. Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
A.4 Including Hotel-Specific Time Trends
In Table A.3, we show that our main results are robust to the inclusion of hotel-specific
time trends.
Table A.3: TWFE estimates including hotel-specific time trends

                           (1)         (2)
After                    0.0005       0.015
                        (0.003)     (0.005)
After × Positive                      0.018
                                    (0.006)
Positive                  0.013       0.024
                         (0.002)    (0.015)
Treated × Positive                    0.026
                                    (0.016)
Controls:
  Review valence           Yes         Yes
  Traveler segment         Yes         Yes
  log Review length        Yes         Yes
Observations             673,604     673,604
Adjusted R²               0.043       0.043

Note: The dependent variable is whether the user gender of review i of hotel j at time t is female. All specifications include hotel and year-month fixed effects, and a hotel-specific time trend. Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
A.5 Addressing Reviewers’ Self-Selection into Disclosing
their Gender
In the main paper, we focused on the sample of reviewers that self-identified as male or female. To reduce concerns about this choice affecting the estimates, we replicate here our main results using larger data samples. The first is a sample that includes reviewers whose gender we can infer from their Tripadvisor username. The second is
the complete dataset, and we infer each reviewer’s gender using a text-based classifier.
A.5.1 Predicting the Gender of the Reviewers
Username-based classifier Our username-gender matching algorithm involves three steps. The first, similar to the gender resolution algorithm used by Vasilescu et al. (2012), is to identify the first name of the user (some usernames indicate first and last name directly, separated by a space (John Smith) or capital letters (JohnSmith); if this is not the case, we remove all numbers and punctuation marks and treat the resulting outcome as the first name). In the second step, we match each first name with a dictionary of name-gender pairs obtained from the software R. For any given name, the dataset provides information about the likelihood of the name belonging to either a male or a female based on historical birth databases; we set gender equal to female or male if the respective matched probability is greater than 0.7. For the names that are not identified using this procedure, we check if the username contains one of the top 100 most popular female and male names in the US (see https://www.ssa.gov/oact/babynames/decades/century.html). Using this approach, we obtain gender information for about 620,000 additional reviewers, so that this sample includes 61.21% of all reviews in our dataset. We report the results using this subsample in columns 1 and 2 of Table A.4. The estimates are almost identical to those reported in Table 1.3 in the paper. These results suggest that the sample of reviewers who disclose their gender is not systematically different from the sample including reviewers whose gender is identified using the reviewers' username.
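A minimal sketch of this matching logic (the dictionary `name_probs` stands in for the R name-gender data, and `top_female`/`top_male` for the SSA top-100 name lists; all names here are hypothetical):

```python
import re

def infer_gender_from_username(username, name_probs, top_female, top_male):
    """Return 'female', 'male', or None following the three-step procedure."""
    # Step 1: extract a candidate first name. Usernames like "John Smith" or
    # "JohnSmith" indicate it directly; otherwise strip digits and punctuation.
    parts = re.findall(r"[A-Z][a-z]+", username) or username.split()
    first = (parts[0] if parts else re.sub(r"[^A-Za-z]", "", username)).lower()
    # Step 2: match against the name-gender dictionary and keep the match
    # only if the probability exceeds 0.7.
    probs = name_probs.get(first, {})      # e.g. {"female": 0.93, "male": 0.07}
    if probs:
        gender, p = max(probs.items(), key=lambda kv: kv[1])
        if p > 0.7:
            return gender
    # Step 3: otherwise check whether the username contains one of the
    # top-100 most popular US female or male names.
    low = username.lower()
    if any(n.lower() in low for n in top_female):
        return "female"
    if any(n.lower() in low for n in top_male):
        return "male"
    return None
```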
Text-based classifier We implement a text-based machine learning classifier to impute
the gender of all reviewers in our dataset. This approach is motivated by the fact that
women and men use language differently. Lakoff (1973, 1986) shows that the language of
women is characterized by the use of a special vocabulary (more color words or fewer swear
words, for example), adjectives that describe feelings, intensifiers (so, much, etc.), and words
that suggest hesitation, among others. More recent studies reinforce these findings, showing
that women use more words related to psychological and social processes, while men use
more words focused on objects and impersonal topics (Newman et al. 2008). Following these
linguistic literature insights, we base our text-classification algorithm on a word-frequency
approach. More specifically, we represent each review as a bag-of-words, and use this infor-
mation as the input of a logistic regression classifier aimed at predicting the gender of their
author. However, note that, rigorously speaking, the text classifier is learning female-style
writing and not the reviewers’ gender.
To train the model, we use the full review history of every user for which we have
gender information, amounting to approximately 16M reviews. For estimation purposes,
we collapse all the reviews for a given user into a single observation (i.e., the data for the
text classifier has as many rows as the number of users). Following standard practice, we
split the observations in 80%-20% train and test sets, respectively. Using this classifier, and considering the ground truth to be the gender as self-disclosed by reviewers, we obtain an 87.7% out-of-sample accuracy rate and an 88.3% ROC-AUC score. (We obtain similar results using alternative text representation algorithms and classifiers.)
To reassure ourselves that our classifier is indeed able to distinguish between male- and female-style writing, we analyze the relative importance of the words used for the predictions.
We find that the most predictive words for female users are husband, lovely, loved, delicious,
beautiful, while for male users they are wife, quality, business, average, beer. These results
provide further support that our classifier effectively distinguishes male- and female-style
writing.
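A minimal sketch of this word-frequency approach (scikit-learn; the data frame `users`, with one row per user holding the concatenated review text and a binary self-disclosed gender label, is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    users["all_review_text"], users["is_female"], test_size=0.2, random_state=0
)

clf = make_pipeline(
    CountVectorizer(min_df=10),              # bag-of-words representation
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("ROC-AUC :", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Inspect which words drive the predictions (most male- vs female-leaning).
vocab = clf.named_steps["countvectorizer"].get_feature_names_out()
coefs = clf.named_steps["logisticregression"].coef_[0]
order = coefs.argsort()
print("male-leaning  :", [vocab[i] for i in order[:5]])
print("female-leaning:", [vocab[i] for i in order[-5:]])
```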
Using this approach, we approximate the gender of all reviewers and are therefore able
to use the complete set of reviews in our dataset in this analysis. We report the results using
this sample in columns 3 and 4 of Table A.4. Overall, the estimates are consistent with those
reported in Table 1.3 in the paper. However, the magnitudes of the estimates are smaller
than those reported in the paper.
These differences can be due to a few factors. First, errors by the classifier might lead
to attenuation bias. Second, it might be that gender disclosure is affected by the presence
of management responses and that this effect varies by gender and review valence. Third,
reviewers that disclose their gender might be different from those that do not. While whether
reviewers are different is not directly testable with our observational data, we can directly
test the remaining two hypotheses. We start by discussing the classifier errors below, and
then analyze gender disclosure decisions in Appendix A.5.2.
Table A.4: TWFE estimates including observations with inferred and predicted gender

                       Gender inferred            Gender predicted
                       from usernames             from text
                       (1)          (2)           (3)          (4)
After                 0.006        0.021         0.004        0.008
                     (0.002)      (0.005)       (0.002)      (0.003)
After × Positive                   0.017                      0.004
                                  (0.005)                    (0.003)
Positive              0.013        0.016         0.007        0.024
                     (0.002)      (0.011)       (0.002)      (0.010)
Treated × Positive                 0.015                      0.029
                                  (0.012)                    (0.011)
Controls:
  Traveler segment     Yes          Yes           Yes          Yes
  log Review length    Yes          Yes           Yes          Yes
Observations        1,241,901    1,241,901     1,967,780    1,967,780
Adjusted R²           0.036        0.036         0.035        0.035

Note: The dependent variable is whether the user gender of review i of hotel j at time t is female. In addition to reviews for which user gender is self-identified, in columns 1 and 2 we include reviews for which we can infer gender based on the username. In columns 3 and 4, we include reviews for which gender is predicted from the review text. All specifications include hotel and year-month fixed effects, and a treatment-specific time trend. Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
Errors of the text-based gender classifier Here we discuss how and when the results
using a dependent variable inferred with a text-based classifier can be biased. Classifiers are probabilistic methods and as such make mistakes, i.e., they may misclassify a female reviewer
as a male reviewer or vice versa. These errors may bias our results if they affect one gender
more than the other, and if they are systematically correlated with the response timing of
each hotel that decides to respond to reviews. While we view this as an unlikely scenario
because the timing of response varies substantially across hotels, here we investigate whether
the error of the text-based gender classifier could potentially lead to any bias in our results.
Table A.5 presents the confusion matrix from the gender classifier based on the text of
the reviewer. The rows represent the gender as predicted by the classifier, and the columns
represent the gender as reported by the reviewer. Given the confusion matrix, we can
compute the true positive rate (i.e., the rate of female reviewers correctly classified as female),
the true negative rate (i.e., the rate of male reviewers correctly classified as male), and their
complements, the false negative rate (i.e., the rate of female reviewers classified as male) and
the false positive rate (i.e., the rate of male reviewers classified as female).
Table A.5: Text-based classifier confusion matrix

                           Reported
                        Female      Male
Predicted   Female      44,315     5,346
            Male         4,998    34,892
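The rates discussed next follow directly from this matrix; as a quick arithmetic check:

```python
# Entries from Table A.5 (rows = predicted gender, columns = reported gender).
tp = 44_315   # predicted female, reported female
fp = 5_346    # predicted female, reported male
fn = 4_998    # predicted male,   reported female
tn = 34_892   # predicted male,   reported male

tpr = tp / (tp + fn)   # 0.8986 -> 89.86%
tnr = tn / (tn + fp)   # 0.8671 -> 86.71%
print(f"TPR={tpr:.2%}  TNR={tnr:.2%}  FNR={1 - tpr:.2%}  FPR={1 - tnr:.2%}")
```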
The true positive rate is TPR = 44,315/(44,315 + 4,998) = 89.86%, and the true negative rate is TNR = 34,892/(34,892 + 5,346) = 86.71%. Equivalently, the false negative rate is FNR = 1 − TPR = 10.14%, and the false positive rate is FPR = 1 − TNR = 13.29%.
Since the FPR is greater than the FNR, our classifier is more likely to misclassify male
reviewers as female reviewers or, in other words, our classifier is likely overestimating the
number of female reviewers in our dataset. This means that, in the case where there is a
systematic correlation between these errors and our treatment variable, our results are likely
to be positively biased so that our results on the effect of negative reviews coming from
female reviewers should be considered as a lower bound.
As an additional test, we check the FPR and FNR separately for the period before and
after hotels begin to respond to reviews and for positive and negative reviews. We report
these rates in Table A.6. The FPR is greater than the FNR in all cases, suggesting that
there is no systematic correlation between the classifier errors and the treatment variable or
the valence of the review.
Table A.6: Text-based classifier false negative rate (FNR) and false positive rate (FPR)

                     Before Period                          After Period
            All reviews   ≤ 2 Stars   ≥ 3 Stars    All reviews   ≤ 2 Stars   ≥ 3 Stars
FNR           10.86%       14.79%      10.17%         9.86%       12.36%       9.55%
FPR           13.34%       14.83%      13.06%        13.27%       13.54%      13.23%
A.5.2 Analyzing Gender Disclosure Decisions
Here, we analyze the gender disclosure decision as a function of management responses,
reviewer gender, and review valence using the gender inferred by the algorithms described
above as a gender-distribution ground truth.
To measure the effect of management responses on the disclosure decision, we estimate
the following specification:
Disclosed_ijt = β_1 Female_ijt + β_2 After_ijt + β_3 Positive_ijt + β_4 After_ijt × Female_ijt
                + β_5 Positive_ijt × Female_ijt + β_6 After_ijt × Positive_ijt        (A.2)
                + β_7 After_ijt × Female_ijt × Positive_ijt + α_j + τ_t + ε_ijt,

where Disclosed_ijt indicates whether the user who wrote review i for hotel j in time t self-identified as either male or female. Female_ijt indicates whether the user who wrote review i is female as inferred by either one of the algorithms described above, and Positive_ijt indicates whether review i is positive. α_j and τ_t are hotel and year-month fixed effects, respectively.
We estimate Equation A.2 using OLS and clustering standard errors at the hotel level. We
report the results in Table A.7. In column 1, we report the estimates using the sample of
reviews in which gender is inferred with the reviewers’ username as ground truth and, in
column 2, the results using all reviews and with gender inferred using the text-based classifier
as ground truth.
The results reported in column 1 suggest that the probability that a reviewer self-identifies as male or female decreases in a similar way after hotels begin to respond to reviews (β_2 is negative and significant, and β_4 is negative and not significant). Moreover, we do not observe any significant effect by review valence.
These results are consistent with the fact that we obtain very similar estimates of the im-
pact of management responses on reviewing behavior using both the sample of self-identified
reviewers (Table 1.3) and the sample which includes reviews for which gender is inferred
using the reviewers’ username (columns 1 and 2 of Table A.4).
The results reported in column 2 suggest that, after hotels begin to respond to reviews, the probability of reviewers self-identifying as male or female changes as a function of the review valence and gender. We observe that the probability that a reviewer self-identifies as female decreases when the review is negative (β_4 is negative and significant), and the decrease is significantly smaller for positive reviews (β_7 is positive and significant). Finally, the probability that a reviewer self-identifies as male decreases when the review is positive (β_6 is positive and significant), and it does not change for negative reviews (β_2 is small and not significant).
These results are consistent with the fact that, using the full sample of reviews to measure
the impact of management responses on reviewing behavior (columns 3 and 4 of Table A.4),
we obtain estimates with magnitudes that are smaller than those obtained using the sample
of reviews in which reviewers disclose their gender (Table 1.3). Therefore, assuming that the
sample of reviews for which gender is inferred with the text-based classifier is close to the
real distribution of male and female reviewers, the results reported in the paper should be
viewed as the combined effect of deciding not to review and deciding to review but not to
disclose gender.
Table A.7: The effect of management responses on gender disclosure

                                Gender inferred     Gender predicted
                                from usernames      from text
                                      (1)                 (2)
Female                               0.006               0.015
                                    (0.005)             (0.004)
After                                0.038               0.004
                                    (0.008)             (0.007)
Positive                             0.062               0.123
                                    (0.005)             (0.005)
After × Female                       0.008               0.016
                                    (0.006)             (0.004)
Female × Positive                    0.004               0.017
                                    (0.005)             (0.004)
After × Positive                     0.006               0.047
                                    (0.007)             (0.006)
After × Female × Positive            0.001               0.009
                                    (0.007)             (0.005)
Observations                       1,241,901           1,936,688
Adjusted R²                          0.180               0.101

Note: The dependent variable is whether the user who wrote review i of hotel j at time t self-disclosed their gender. All specifications include hotel and year-month fixed effects. In column 1, we assume that the true gender of the reviewers is given by the username prediction. In column 2, we assume that the true gender of the reviewers is given by the text-based classifier. Cluster-robust standard errors at the individual hotel level are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
Second-person pronouns We formally test whether the use of second-person singular pronouns increases the perception that the manager is apologizing to and sympathizing with the reviewer by asking two coders
to judge a random subsample of 200 responses (half of them with high and half of them
with no usage of second-person singular pronouns). Participants were asked to read the
content of each response carefully and to evaluate (using a 7-point Likert scale) to what
extent the manager is: (1) sincerely apologizing for the customer’s negative experience and
(2) sympathizing with the customer writing the review. We report the results of this analysis
in Table A.8.
(We average the raters' scores for each review first. Krippendorff's alpha measure of intersubject reliability is 0.724 and 0.610 for the two questions we asked, respectively.)
The results show that responses containing more second-person pronouns are
more likely to be perceived as responses where the manager is apologizing and sympathizing
with the customer.
Table A.8: Coders’ average evaluations
2nd PP = 0 2nd PP > 0
Mean SD Mean SD
Apologize 1.475 (0.084) 2.955 (0.115)
Sympathize 1.600 (0.850) 2.625 (1.023)
Third-person pronouns We formally test whether the use of third-person singular pro-
nouns increases the perception of criticism from the manager to the reviewer by asking two
coders to judge a random subsample of 200 responses (half of them with high and half of
them with no usage of third-person singular pronouns). Participants were asked to read the
content of each response carefully, and to evaluate (using a 7-point Likert scale) to what
extent the manager is: (1) taking responsibility for the customer’s negative experience, (2)
blaming the consumer for his/her negative experience, and (3) being critical of the customer
writing the review. We report the results of this analysis in Table A.9.
(We average the raters' scores for each review first. Krippendorff's alpha measure of intersubject reliability is 0.709, 0.743, and 0.666 for the three questions we asked, respectively.)
The results show
that responses containing third-person singular pronouns are in fact perceived as blaming
and criticizing the reviewer and trying to diminish the hotel’s responsibility.
Table A.9: Coders’ average evaluations
3rd PP = 0 3rd PP > 0
Mean SD Mean SD
Responsibility 5.505 1.072 2.480 1.499
Blame 1.275 0.664 3.845 1.838
Critical 1.645 0.905 4.420 1.979
A.6 Review LDA Topics
To analyze the meaning of the topics obtained by the LDA algorithm, we search for the
most representative words of each topic (Netzer et al. 2019). We present these results in
Table A.10, for the different topics obtained using 3, 4, and 5 topics.
Table A.10: Review topic interpretation.
LDA Topic More representative words in the topic
3 topics, Perplexity = 147.59
Amenities hotel, room, clean, staff, breakfast, great, nice, stay, good, friendly, comfort-
able, location, area, helpful, restaurant, bed, would, stayed, free, well
Place hotel, great, room, stay, time, pool, view, service, strip, place, vegas, stayed,
staff, casino, like, food, good, one, nice, really
Service room, hotel, one, night, desk, would, front, get, day, front-desk, stay, bed,
time, could, check, floor, door, like, back, got
4 topics, Perplexity = 148.50
Experience/Stay hotel, great, stay, staff, time, service, stayed, place, always, best, wonderful,
year, experience, back, room, make, made, every, friendly, like
Location room, hotel, pool, nice, view, good, strip, great, vegas, get, one, area, casino,
also, floor, night, would, really, like, stayed
Staff room, hotel, desk, front, night, would, one, front-desk, day, get, stay, time,
check, could, bed, back, got, door, told, even
Amenities/Service hotel, room, clean, staff, breakfast, nice, friendly, good, stay, great, com-
fortable, location, helpful, area, restaurant, bed, would, room-clean, free,
parking
5 topics, Perplexity = 149.18
Overall Experience hotel, stay, time, great, staff, place, year, stayed, service, always, like, best,
back, make, one, wonderful, every, feel, made, experience
Location/Attributes room, hotel, pool, great, strip, vegas, view, casino, good, nice, time, resort,
service, get, food, stay, would, restaurant, really, stayed
Staff room, desk, front, hotel, front-desk, would, night, one, check, day, get, time,
stay, told, back, could, got, said, went, asked
Overall Service staff, hotel, friendly, room, clean, great, stay, helpful, breakfast, nice, good,
location, comfortable, restaurant, would, staff-friendly, room-clean, service,
airport, well
Amenities room, hotel, nice, bed, breakfast, night, good, area, one, parking, bathroom,
clean, small, pool, also, street, free, would, get, coffee
In Table A.11, we replicate the results reported in Table 1.8 in the paper, but using 4
and 5 LDA topics. None of the estimates change significantly.
Table A.11: The effect of reviewer's gender on management response type

                              4 topics    5 topics
Female                         0.010       0.010
                              (0.003)     (0.003)
Intercept                      0.037       0.041
                              (0.009)     (0.005)
Controls:
  Review characteristics        Yes         Yes
  Review LDA topics             Yes         Yes
  Review is contentious         Yes         Yes
Observations                   27,941      27,941
Adjusted R²                    0.006       0.008

Note: The dependent variable is an indicator of whether the management response to review i of hotel j written at time t is classified as contentious. We include controls for review characteristics, i.e., the review length and LIWC variables for anger and tone (defined as the difference between positive and negative words), review topics estimated using the LDA algorithm, and whether the review is classified as contentious. Standard errors are shown in parentheses.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
A.7 Coding of Contentious Responses
We classify management responses as contentious by asking two coders to judge a random
subsample of 500 responses. The coders read the following instructions:
In the attached file, there are several management responses to hotel reviews that customers
wrote on TripAdvisor. In all cases, the guest had a negative experience they rated as one
and two-star and wrote a review about. A manager then wrote a response to these reviews.
We’d like your help to classify management responses depending on whether the response is
contentious. A contentious response contains at least one of the following cues:
- The manager tries to discredit the reviewer, i.e., rather than apologizing, the manager disputes what the reviewer says or tries to undercut the reviewer's credibility. For example, the response denies the claim(s) of the reviewer or refuses to accept it (them) as accurate.
- The manager is confrontational with the reviewer, i.e., rather than apologizing, the response is hostile towards the reviewer, argumentative against the reviewer, or uses irony to mock the reviewer.
- The manager is responding aggressively, i.e., rather than apologizing, the response is rude and uses aggressive language.
Please focus on the content of the management responses, and classify each response de-
pending on whether the above cues are present in the response text (2 = yes), and somewhat
present in the response text (1 = somewhat), or not present at all in the response text (0
= no). There are no right or wrong answers, but please try to be as consistent as possible
throughout.
The percentage of agreement between the two coders is 84%, and the Cohen’s kappa
coefficient is 0.55. Disagreements were solved by discussion. To train the text classifier, we
group responses into two levels: non-contentious if it is coded as 0, and contentious if coded
as either 1 or 2.
Using this subsample of 500 responses, we train a text classifier to predict the labels for
all the responses to negative reviews in our data. Following standard practice, we split the
responses dataset in an 80% training sample and a 20% test sample. We select the results
from the Naive Bayes Classifier based on tf-idf, which achieves an 85.0% out-of-sample accuracy rate and an 87.4% ROC-AUC score.
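A minimal sketch of this tf-idf Naive Bayes classifier (scikit-learn; the data frames `responses`, holding the 500 hand-labeled responses, and `neg_responses`, holding all responses to negative reviews, are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    responses["text"], responses["contentious"], test_size=0.2, random_state=0
)

nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, nb.predict(X_test)))
print("ROC-AUC :", roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]))

# Once validated, label every response to a negative review in the data.
neg_responses["contentious_pred"] = nb.predict(neg_responses["text"])
```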
A.8 Coding of Contentious Reviews
We classify reviews as contentious by asking five MTurk coders to judge a random sub-
sample of 500 reviews. The coders read the following instructions:
You will see five reviews that customers wrote on TripAdvisor. In each case, the guest had
a negative experience he/she rated as 1 and 2-star and wrote a review about it. We’d like
your help to classify each review depending on whether the review is contentious. A
contentious review may contain at least one of the following cues:
- The reviewer is using an aggressive tone, i.e., the review is rude and uses aggressive or disrespectful language.
- The reviewer is confrontational with the hotel staff, i.e., the review is hostile towards the hotel staff and management, uses irony to mock the staff, or makes threats towards the staff.
- The reviewer blames the hotel staff, i.e., the review explicitly considers specific staff members to be directly responsible for their negative experience.
Please classify the content of each of the following responses on whether the above cues are
present in the review text (YES) or are not present at all in the review text (NO). There
are no right or wrong answers, but please try to be as consistent as possible throughout.
To train the text classifier, we label each review based on the average rating given by the
five coders into one of two levels: non-contentious if the average rating is below or equal to
0.5, and contentious if the average rating is above 0.5.
Using this subsample of 500 reviews, we train a text classifier to predict the labels for all
the negative reviews in our data. Following standard practice, we split the responses dataset
in an 80% training sample and a 20% test sample. We select the results from the Naive
Bayes Classifier based on tf-idf, which achieves a 74.0% out-of-sample accuracy rate and a
70.7% ROC-AUC score.
Appendix B
Appendix to Chapter 2
B.1 Validating API Labels
To validate the quality of the outputs from the Cloud Vision APIs, we use a subsample of
100 images and recruit human raters through Amazon Mechanical Turk to provide the same
labels we collected from the Cloud Vision APIs. Each rater was presented with 20 randomly
selected images. Given that this classification task is relatively straightforward, each image
was labeled by 3 different raters. In case of disagreement, we label the image based on the
majority of votes.
In Table B.1, we present the percentage of the agreement between the labels provided by
the Cloud Vision APIs and those provided by MTurkers.
Table B.1: Agreement between labels provided by Cloud Vision APIs and MTurkers
Variable Labels Percentage of agreement
Human Yes, No 94.62%
Gender Female, Male 83.33%
Race Asian, Black, Indian, White 63.88%
Outdoor Background Yes, No 84.94%
Eye wear None, Sunglasses, Glasses 80.56%
Computer in Picture Yes, No 97.84%
B.2 Details on the Deep Learning Image Classifier
Training data To create initial training data for our classification task, we select a random
subsample of 1,000 images and recruit human raters through Amazon Mechanical Turk.
Raters were asked to provide ratings on the freelancer’s perceived job fit in each job category
on 5-point Likert scales, wherein 1 is the lowest fit, and 5 is the highest fit. Each image was
rated by five to ten independent raters and for each of the job categories in our data (i.e.,
job fit as a programmer, job fit as a graphic designer, job fit as a writer, job fit in sales and
marketing). Following standard practices in the literature (Zhang et al. 2015, 2017, Zhang
and Luo 2019, Liu et al. 2019), we convert the 5-point Likert scale to binary levels (low and
high) to mitigate potential noise in the training data. More specifically, as in Zhang et al.
(2017), we use the average score across raters to label each image as either low job fit or
high job fit. (Specifically, for each image i we first take the mean score, score_i, averaged across the five raters. We then define two thresholds τ_1 = score − gap/2 and τ_2 = score + gap/2, where score is the average score for all the images in the subsample. Finally, we label each profile picture i as low job fit if score_i < τ_1 and as high job fit if score_i > τ_2, and we discard images with score_i ∈ (τ_1, τ_2) from the training sample. We use gap = 0.8; the results did not change significantly using alternative values of gap = 0.5 and gap = 1.0.)
To illustrate some examples of the resulting training data, we provide some
examples for the perceived freelancer-job fit as a programmer label in Figure B.1.
Figure B.1: Examples of the perceived freelancer-job fit as programmer label as provided by human raters (left: low fit as programmer; right: high fit as programmer).
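A small sketch of this gap-based labeling rule (pandas; `mean_scores`, a series with the average rater score per image, is a hypothetical name):

```python
import pandas as pd

def label_job_fit(mean_scores: pd.Series, gap: float = 0.8) -> pd.Series:
    """Convert mean rater scores into binary low/high job-fit labels,
    discarding images inside the ambiguous band of width `gap`."""
    grand_mean = mean_scores.mean()
    low_cut, high_cut = grand_mean - gap / 2, grand_mean + gap / 2
    labels = pd.Series(float("nan"), index=mean_scores.index)
    labels[mean_scores < low_cut] = 0.0   # low job fit
    labels[mean_scores > high_cut] = 1.0  # high job fit
    return labels                         # NaN rows are dropped before training
```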
We then use our training data to train different Convolution Neural Networks (CNN)
architectures, including VGG-16, ResNet, and Inception (Canziani et al. 2016). In doing
so, we follow three standard practices to reduce over-fitting problems that can arise due to
the modest size of our training set. First, we apply transfer learning to considerably reduce
the number of trainable parameters (Zhang et al. 2017, Hartmann et al. 2019, Zhang and
Luo 2019), using the VGG16 pre-trained weights for the ImageNet classification (Simonyan and Zisserman 2014) to fine-tune the model. (The ImageNet dataset contains images for a wide range of categories, including people, objects, animals, etc. We choose these pre-trained weights over other alternatives, such as the VGG-Face pre-trained weights for face detection and face recognition tasks (e.g., Parkhi et al. 2015), to be consistent with our conjecture that objects in the image and its background can also influence perceptions of the freelancer-job fit. This is, indeed, what we exploit in our second experimental study (Section 2.4), where we show that freelancers can improve their hiring outcomes by changing the background and accessories in their profile pictures.)
Second, we use several data-augmentation
techniques such as horizontal flip, rotations, and shifts (Krizhevsky et al. 2012) to transform
the images during training. Third, we add a dropout layer between the last convolutional layer and the classifier (Srivastava et al. 2014) to mitigate over-fitting.
In our setting, the CNN architecture that performs the best in terms of out-of-sample
accuracy is VGG16, which achieves 81% accuracy for the job fit as a programmer label, 72% accuracy for the job fit as a designer label, 79% accuracy for the job fit as a writer label, and 81% accuracy for the job fit in sales and marketing label. Using the parameters learned by this architecture, we predict the labels for the remaining pictures in our data. We provide
some examples of the predictions for the job fit as programmer label in Figure B.2.
Figure B.2: Examples of the perceived freelancer-job fit as programmer label as predicted by the image classifier (left: fit as programmer score <= 0.5; right: fit as programmer score > 0.5).
Architecture We implement a modified version of the VGG-16 CNN architecture as em-
ployed by Hartmann et al. (2019). We freeze the first four convolutional blocks, because
their layers extract generic information or low-level features, such as contours, textures, and
colors, that can serve for a wide range of classification tasks (Zhang et al. 2017, Hartmann
et al. 2019). We initialize the model weights with pre-trained weights (further description
below) and fine-tune the parameters of the last convolutional block, which consists of three
convolutional layers followed by a max-pooling layer. We then add two fully connected layers, where the last layer is the output layer. All operations above are standard practices in
the VGG-16 CNN architecture.
The resulting architecture is illustrated in Table B.2. Note that, by freezing the first con-
volutional blocks we are training a smaller proportion (57%) of the total number of param-
eters in the original VGG architecture. Moreover, the majority of the trainable parameters
(69%) are fine-tuned with pre-trained weights.
To further relieve concerns regarding over-fitting due to the modest size of our training
set, we also estimate an alternative variation of Table B.2 in which we freeze the fifth
convolutional block, i.e., we only train the parameters from the last fully connected layer
and output layer. In doing this, we train only 18% of the total parameters in the model.
Using this modification, we obtained similar accuracy rates (for most of the job categories we observe a decrease of 1% in the accuracy rate, except for the job fit as a designer, for which we observe a decrease of 8%).
Furthermore, we re-estimate
the conditional logit models (Table 2.3 and Table 2.4), and results using these alternative
job fit scores remain consistent. These results are available upon request.
Hyper-parameters During the training, we use the Adadelta algorithm for optimization,
a method that dynamically adapts learning rates and has been shown to be robust to noisy
gradient information and selection of hyper-parameters (Zeiler 2012). We use batch size
equal to 16, number of epochs equal to 100, and a binary cross-entropy loss function. We use
ReLU activation for our convolutional layers, and SoftMax activation for the output layer.
For data augmentation, we allow for horizontal flips, zoom range, width range, and height
range equal to 0.2, and rotation range equal to 15 degrees.
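A minimal Keras sketch of this setup, with blocks 1-4 frozen, block 5 fine-tuned, a dropout layer before the classifier, Adadelta optimization, and the augmentation settings listed above (directory paths, the dropout rate, and the sigmoid output used for the binary label are assumptions of this sketch, not values reported here):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    # Freeze convolutional blocks 1-4; fine-tune only block 5.
    layer.trainable = layer.name.startswith("block5")

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dropout(0.5),                    # dropout rate: placeholder value
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary perceived job-fit label
])
model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss="binary_crossentropy", metrics=["accuracy"])

# Data augmentation: horizontal flips, shifts and zoom of 0.2, 15-degree rotations.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255, horizontal_flip=True, rotation_range=15,
    zoom_range=0.2, width_shift_range=0.2, height_shift_range=0.2,
)
train_flow = augmenter.flow_from_directory(
    "train_images/", target_size=(224, 224), class_mode="binary", batch_size=16
)
model.fit(train_flow, epochs=100)
```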
Table B.2: VGG architecture (modified and tuned)
Number of Parameters
Layer Output Shape Total Trainable
Input (224, 224, 3) 0 0
Convolutional Block 1
Convolutional Layer 1.1 (224, 224, 3) 1,792 0 (frozen)
Convolutional Layer 1.2 (224, 224, 3) 36,928 0 (frozen)
MaxPooling Layer 1 (112, 112, 64) 0 0
Convolutional Block 2
Convolutional Layer 2.1 (112, 112, 128) 73,856 0 (frozen)
Convolutional Layer 2.2 (112, 112, 128) 147,584 0 (frozen)
MaxPooling Layer 2 (56, 56, 128) 0 0
Convolutional Block 3
Convolutional Layer 3.1 (56, 56, 256) 295,168 0 (frozen)
Convolutional Layer 3.2 (56, 56, 256) 590,080 0 (frozen)
Convolutional Layer 3.3 (56, 56, 256) 590,080 0 (frozen)
MaxPooling Layer 3 (28, 28, 256) 0 0 (frozen)
Convolutional Block 4
Convolutional Layer 4.1 (28, 28, 512) 1,180,160 0 (frozen)
Convolutional Layer 4.2 (28, 28, 512) 2,359,808 0 (frozen)
Convolutional Layer 4.3 (28, 28, 512) 2,359,808 0 (frozen)
MaxPooling Layer 4 (14, 14, 512) 0 0
Convolutional Block 5
Convolutional Layer 5.1 (14, 14, 512) 2,359,808 2,359,808 (fine-tuned)
Convolutional Layer 5.2 (14, 14, 512) 2,359,808 2,359,808 (fine-tuned)
Convolutional Layer 5.3 (14, 14, 512) 2,359,808 2,359,808 (fine-tuned)
MaxPooling Layer 5 (7, 7, 512) 0 0
Flatten Layer 25,088 0 0
Dropout Layer 25,088 0 0
Fully Connected Layer 128 3,211,392 3,211,392
Output Layer (Prediction) 1 129 129
Total - 17,926,209 10,290,945
B.3 Observational Data Variables
We provide a description of the variables we collected from Freelancer.com in Table B.3,
and Table B.4 and summary statistics of these variables in Table B.5 and Table B.6.
Table B.3: Variables included in the conditional logit model:
Profile picture, reputation, and performance variables
Variable name Description
Profile Picture Variables
Perceived Job Fit Perceived job fit score as predicted by the VGG-16 classifier
Has Picture? Whether the freelancer has a profile picture
Human Whether a human face is detected in the profile picture, obtained
from Face++ API
Gender Female or Male, obtained from Face++ API
Race White, Black, Asian, or Indian, obtained from Face++ API
Age Age, obtained from Face++ API
Beauty Beauty score (ranging from 0 to 100), obtained from Face++
API
Smile Smile score (ranging from 0 to 100), obtained from Face++ API
Reputation Variables
Number of Reviews Cumulative number of reviews at the time of the application
Avg. Rating Cumulative average rating at the time of the application
Performance Variables
Earning score Total earning score (ranging from 0 to 10) from previous projects
that required similar skills and were successfully completed
Percentage of jobs on time Percentage of previous jobs delivered on time
Percentage of jobs on budget Percentage of previous jobs delivered on budget
Table B.4: Variables included in the conditional logit model: Application variables and
additional controls
Variable name Description
Application Variables
Price Price the freelancer requests to complete the job, normalized within
the job
Number of Days Number of days offered by the freelancer to complete the job, normal-
ized within the job
Application log word count Log of the number of words in the application description submitted
by the freelancer
Application similarity Cosine similarity (based on bag of words) between the application
description submitted by the freelancer and the job description posted
by the employer
Recommended Freelancer? Whether the freelancer is recommended by the platform and high-
lighted in the top position of the list of applicants, as seen by the
employer
1
Application Position Position of the application relative to the entire list of applications,
as seen by the employer
Additional controls
Preferred Freelancer? Whether the freelancer is part of the preferred freelancer program,
a program implemented by the platform to distinguish experienced
freelancers who have been evaluated based on several criteria
2
Exam on required skill? Whether the freelancer passed an exam on a skill required by the
employer (e.g., Word-Press)
3
From Developed Country Whether the freelancer is from a developed country
From Employers’ Country Whether the freelancer and the employer are from the same country
Freelancer Region Region of Residence of the freelancer (e.g., North America)
Previously Reviewed? Whether the freelancer has a review from the same employer. We
use this as a proxy for whether the freelancer was hired by the same
employer in the past.
Membership Category Membership category of the freelance: Free, Intro, Basic, Plus, Pro-
fessional, Premier
4
Profile Verification Whether the freelancer verified his/her profile
1
The choice of the recommended freelancer is based on his/her reviews and previous experience. Specific details
of the criteria used by the platform are unknown. There is only one "Recommended Freelancer" per job.
2
See: https://www.freelancer.com/support/General/what-are-preferred-freelancers.
3
These exams are implemented by the platform. See: https://www.freelancer.com/exam/exams/
4
For more information, see: https://www.freelancer.com/membership/index.php.
Table B.5: Summary Statistics for observational data: Continuous
variables
Variable Mean, Std. Dev. 25th, 50th, 75th Percentiles Min, Max
Profile Picture Variables
Perceived Job Fit 0.538, 0.303 0.32, 0.587, 0.791 0, 0.985
Has Picture 0.987, 0.111 1, 1, 1 0, 1
Human 0.719, 0.449 0, 1, 1 0, 1
Age* 30.663, 8.632 24, 29, 35 1, 94
Beauty
62.867, 11.782 54.527, 63.14, 71.637 14.626, 93.829
Smile
47.598, 44.006 1.529, 35.04, 99.669 0, 100
Reputation Variables
Number of Reviews 151.791, 371.217 3, 33, 139 0, 4734
Average Rating 3.963, 1.873 4.622, 4.858, 4.954 0, 5
Average Rating
4.804, 0.458 4.793, 4.893, 4.969 0, 5
Performance Variables
Earning Score 4.489, 2.724 2.512, 5.21, 6.59 0, 9.058
Percentage of jobs on time 0.763, 0.372 0.821, 0.945, 0.991 0, 1
Percentage of jobs on budget 0.771, 0.375 0.848, 0.956, 0.994 0, 1
Application Variables
Price (Normalized) 0.281, 0.273 0.063, 0.2, 0.429 0, 1
Number of days (Normalized) 0.294, 0.295 0.091, 0.2, 0.4 0, 1
Application Log (Word Count) 3.448, 1.549 3.178, 3.912, 4.454 0, 6.861
Application Similarity 0.199, 0.142 0.085, 0.199, 0.298 0, 1
Application Position 24.618, 21.682 8, 18, 36 1, 100
Additional Controls
Preferred Freelancer? 0.112, 0.316 0, 0, 0 0, 1
Exam on required skill? 0.157, 0.507 0, 0, 0 0, 1
From Developed Country 0.116, 0.32 0, 0, 0 0, 1
From Employers Country 0.057, 0.233 0, 0, 0 0, 1
Previously Reviewed? 0.005, 0.07 0, 0, 0 1, 1
Profile Complete 0.99, 0.099 1, 1, 1 0, 1
Conditional on Human = 1. Note that the min and max of age label are both outliers with either
babies or a very old individual in the picture.
Conditional on Number of Reviews > 0.
Table B.6: Summary Statistics for observational data: Discrete variables
Distribution
Profile Picture Variables
Gender
28.85% Female, 71.15% Male
Race
10.42% Black, 55.61% Indian, 8.72% Far East Asian, 25.25% White
Additional Controls
Freelancer Region (23 in total) 69.02% Southern Asia, 5.14% Northern America, 3.56%
Eastern Europe, 2.97% Eastern Asia, 2.63% South-Eastern
Asia, 16.68% Other
Membership Category 33.76% Free, 5.91% Intro, 3.52% Basic, 14.16% Plus, 42.66% Premium
Conditional on Human = 1.
B.4 Pretesting our Visual Manipulation in the Second Choice Experiment
We run a pretest survey to verify if the visual manipulation in our second choice exper-
iment works as intended, i.e., the two versions of the profile pictures of the same freelancer
elicit different perceptions of the freelancer-job fit. To this aim, we recruit participants on
Amazon Mechanical Turk and asked them to rate one picture version of each candidate
(chosen at random; ten in total). Specifically, we asked them to rate each picture based on:
(i) their perception of the freelancer-job fit as a programmer; and (ii) their likelihood to hire
the freelancer in the picture to build them a website. All questions are measured using a
7-point Likert scale.
We report results from the pretest survey in Table B.7. In column 1, the dependent
variable is the respondents’ perception of the freelancer-job fit as a programmer. We observe
a positive and significant effect of both manipulations: profile pictures with a home-office
background or with glasses are perceived as a higher job fit. We observe no significant inter-
action between the two manipulations. In column 2, the dependent variable is respondents’
likelihood to hire the freelancer in the picture. Again, we observe a positive effect of both manipulations: respondents report a higher likelihood to hire freelancers whose profile pictures have a home-office background or glasses, and we observe no significant interaction between the two manipulations.
Table B.7: Pretesting picture versions

                          Perceived Job Fit    Likelihood to Hire
Home-office                     0.730                0.766
Glasses                         0.451                0.437
Home-office × Glasses           0.065                0.049
Intercept                       4.175                4.021
Controls:
  Freelancer FE                  Yes                  Yes
N                               2,000                2,000
LL                             -3,348               -3,484
AIC                             6,745                7,017
BIC                             6,879                7,151

Note: OLS estimates with robust standard errors.
Significance levels: * p<0.1, ** p<0.05, *** p<0.01.
Abstract
This thesis brings together two research papers that empirically investigate how different digital platform design choices can lead to unintended consequences for their users' behavior.
The first paper focuses on the context of online review platforms and explores the impact of management responses on reviewing behavior, emphasizing potential gender differences. Using data from Tripadvisor.com, this paper shows that after managers begin responding to their reviewers, the probability that a negative review comes from a user that self-identifies as female decreases. Based on a survey conducted among online review platform users, this paper also shows that such a decrease can be explained by the fact that female users are more likely to perceive management responses as a potential source of conflict with the manager. Based on text analysis of management responses on Tripadvisor, this paper also shows that female users are more likely to receive contentious responses (i.e., responses where the manager is confrontational, tries to discredit the reviewer, or responds aggressively). Overall, the findings in this paper suggest that management responses can lead to selective attrition of female reviewers.
The second paper focuses on online labor platforms and explores the role of profile pictures in hiring outcomes, emphasizing the potential role of appearance-based perceptions of a worker’s fit for the job (e.g., whether a worker "looks the part"). Using data from Freelancer.com, this paper shows that workers who "look the part" are more likely to be hired. Based on two choice experiments, this paper also shows that the effect of "looking the part" goes above and beyond gender and race (i.e., accessories or image background can also play a role) and it is stronger when reputation systems are less diagnostic. Overall, these findings illustrate that profile pictures can influence hiring outcomes, especially when reputation systems are not diagnostic enough to differentiate candidates.