PREDICTING AND MODELING HUMAN BEHAVIORAL CHANGES USING DIGITAL TRACES

by

Farshad Kooti

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2016

Copyright 2016 Farshad Kooti

Dedication

To Maryam, who keeps me happy and sane....

Acknowledgements

This dissertation was only possible with the help and support of my family, friends, and colleagues.

First and foremost, I would like to thank my advisor, Kristina Lerman, for her help and support in all stages of my PhD. I learnt a lot from her, not only for my academic career, but also for my everyday life. I was fortunate that she gave me the freedom to work on the projects that I was passionate about and had the background for. Moreover, she was open to many collaborations that I find very important for building my career. I cannot imagine having a better advisor and mentor for my PhD. I would also like to thank my qualification and dissertation committee, Cyrus Shahabi, Morteza Dehghani, Andrew Gordon, and Jelena Mirkovic, who generously gave their time to offer me valuable comments toward improving my work.

I was exceptionally fortunate to work with amazing mentors and collaborators from whom I learnt a lot. My interest in online social network analysis and data science started in 2004 with my bachelor's thesis at the University of Tehran. My supervisor, Masoud Asadpour, introduced me to this fascinating area, and I will always be indebted to him for getting to know this field in its early days. I would also like to thank my master's advisor, Krishna Gummadi, who helped me realize the power of critical reasoning and the joy of research. I did great internships at QCRI, Yahoo! Labs Barcelona and Playa Vista, and Facebook. I met amazing people in all these places whom I learnt a lot from. My sincere thanks go to Ingmar Weber, Luca Aiello, Mihajlo Grbovic, Karthik Subbian, Winter Mason, and Lada Adamic for all their advice and the exciting projects. I was also fortunate to work with many other great researchers: Meeyoung Cha, Nemanja Djuric, Amin Mantrach, Vladan Radosavljevic, Nathan Hodas, Aram Galstyan, Emilio Ferrara, Esteban Moro, Philipp Singer, and Markus Strohmaier. Thank you all for the thoughtful discussions and everything that I learnt from you.

I was incredibly lucky to be part of the ISI family. I learnt a lot from all the brilliant researchers during the discussions and questions in the seminars, talks, social hours, and the GSSs. Academic and non-academic discussions with my friends at ISI were always fun, rewarding, and helped my research. Thanks everyone for the amazing years: Mohsen Taheriyan, Jeon-Hyung Kang, Nima Pourdamghani, Marjan Ghazvininejad, Majid Ghasemi, Suradej Intagorn, Xin-Zeng Wu, Hao Wu, Sahil Garg, and Shuyang Gao. Our pool games with Mohsen, Nima, and Majid at ISI were especially fun. I would also like to thank the ISD admins Alma Nava, Kary Lau, and Peter Zamar, who helped me from the first days with settling in Los Angeles and with all the events and travel reimbursements. The last four years in LA were especially fun because of my friends outside of ISI. Thank you all for so many great memories: Shahab, Ali Boz, Ali Mamad, Moji, Dina, Athena, Ali Zandi, Ladan, Ali Nej, Athena, and Behzad.

There is no way to express how grateful I am to my parents, Mohammad and Akram, who supported and motivated me from the first days of school till today.
They were a constant source of love, support, and strength all these years, and I wouldn't be here if it were not for them. I wholeheartedly thank my sister, Leila, who always gave me love and courage despite the long distance. I also thank my brother, Hessam, who helped me with my studies and guided me from my first days of school till my last days of PhD. He was the main reason that I chose Computer Science for my studies, which is one of my best decisions in life. Kheili doosetoon daram, merci baraye hame chi (I love you all so much; thank you for everything).

Last but not least, I thank my wife and best friend, Maryam, for all the support, encouragement, and passion that she gave me during my PhD. She patiently kept up with the long working nights before deadlines. My PhD years couldn't have been any better, thanks to her.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Online Behavior
  1.2 Motivation
  1.3 Research Questions
  1.4 Thesis Proposal
  1.5 Contributions and Overview

2 Background and Related Work
  2.1 Cyclic Patterns
  2.2 Analysis of Sessions
  2.3 Fatigue and Cognitive Depletion
  2.4 Online Shopping
  2.5 Information Overload

3 Short-term behavioral changes
  3.1 Twitter
    3.1.1 Session-level behavioral changes
    3.1.2 Role of user characteristics
  3.2 Facebook
    3.2.1 Behavioral Changes during the Session
  3.3 Reddit
    3.3.1 Data and methodology
    3.3.2 Performance deterioration
  3.4 Email

4 Predicting Online Behavior
  4.1 Facebook
    4.1.1 Session Length
    4.1.2 Number of Stories Viewed
    4.1.3 Return Time
  4.2 Email
    4.2.1 Predicting Reply Time
    4.2.2 Predicting Reply Length
    4.2.3 Predicting the End of the Thread

5 Modeling Online Behavior Using Digital Traces
  5.1 Online Shopping
    5.1.1 Purchase pattern analysis
    5.1.2 Predicting purchases
  5.2 Uber
    5.2.1 Understanding Riders
    5.2.2 Understanding Drivers
    5.2.3 Rider vs Driver
    5.2.4 Prediction
  5.3 iPhone Digital Spending
    5.3.1 Avid Gamers
    5.3.2 Modeling
  5.4 Handling Heterogeneity in Large Data Sets

6 Discussion
  6.1 Network Paradoxes on Twitter
    6.1.1 Friendship Paradox on Twitter
    6.1.2 Friend Activity Paradox
    6.1.3 Virality Paradox
  6.2 Information Overload
  6.3 Origins of the Paradoxes
    6.3.1 Statistical Origins of Paradoxes
    6.3.2 Behavioral Origins of Paradoxes

7 Conclusion and Future Work

Reference List

A Online Shopping
  A.1 Introduction
  A.2 Data set
  A.3 Purchase Pattern Analysis
    A.3.1 Demographic Factors
    A.3.2 Temporal Factors
    A.3.3 Social Factors
  A.4 Predicting Purchases
    A.4.1 Price of the Next Purchase
    A.4.2 Time of the Next Purchase
  A.5 Related Work
  A.6 Conclusion

B Uber
  B.1 Introduction
  B.2 Related Work
  B.3 Data set
  B.4 Understanding Riders
    B.4.1 Demographics and number of rides
    B.4.2 Duration, length, and cost of rides
    B.4.3 Income, surge, and car type
    B.4.4 Ride dynamics
    B.4.5 Promotions
    B.4.6 Rider attrition
  B.5 Understanding Drivers
    B.5.1 Demographics
    B.5.2 Hours, income, and rating
    B.5.3 Driver retention
  B.6 Rider vs Driver
    B.6.1 Demographic comparison
    B.6.2 Effect of matching
  B.7 Prediction
    B.7.1 Riders
    B.7.2 Drivers
  B.8 Conclusion

C iPhones Digital Marketplace
  C.1 Introduction
  C.2 Data set and Marketplace
  C.3 Big spenders
    C.3.1 Characteristics of big spenders
    C.3.2 App adoption and abandonment
  C.4 Purchase Model
    C.4.1 Temporal model
    C.4.2 Novelty prediction
    C.4.3 App prediction
  C.5 Related Work
  C.6 Conclusion

List of Tables

3.1 Pearson correlation between features.
3.2 Mixed-effects model results. In (a), the models study the effect of session length on the quality of the first comment C_1 in a session; i.e., the data only contains the first comments of sessions. In (b), the models investigate the effect of the session index i on the quality of the respective comment C_i; the data includes all comments in sessions with more than a single comment. Each table highlights the most appropriate models for each quality feature based on extensive model analytics: lmer refers to linear mixed-effects models, while glmer refers to generalized linear mixed-effects models. All coefficients are strongly significant, as derived from model comparisons based on BIC statistics.
4.1 Top 5 features for predicting the length of the session and their information gain.
4.2 Result of logistic regression on the independent variables for the length of the session. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
4.3 Prediction accuracy using different sets of features.
4.4 Top 5 features for predicting the number of stories viewed in a session and their information gain.
4.5 Result of logistic regression on the independent variables for the number of stories read in the session. *** p-value < 0.001.
4.6 Top 5 features for predicting the break time and their information gain.
4.7 Result of logistic regression on the independent variables for the break length. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
4.8 Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, or predicting randomly (same group sizes).
4.9 Top 5 most predictive features for predicting reply time and their χ² value.
4.10 Top 5 most predictive features for predicting length of reply and their χ² value.
4.11 Top 5 most predictive features for predicting the last email in a thread and their χ² value.
4.12 Summary of the prediction results. Accuracy: percentage of correctly classified samples. AUC: weighted average of Area Under the Curve for classes. RMSE: Root Mean Square Error. The improvements are reported over the majority vote baseline.
5.1 Differences in the categories of products purchased by women and men.
5.2 Top predictive features for prediction of the price of the next item and their χ² value.
5.3 Top predictive features for prediction of the time of the next purchase and their χ² value.
5.4 Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, or predicting randomly. Most used: the group the user had the most in earlier purchases. AUC: weighted average of Area Under the Curve for classes. RMSE: Root Mean Square Error. The improvements are reported over the majority vote baseline.
5.5 Comparison of riders by race, age, and gender.
5.6 Size and centers of clusters of riders from their monthly number of rides, along with their demographic breakdown.
5.7 Demographics of riders in each cluster.
5.8 Comparison of drivers of different races.
5.9 Size and centers of the clusters of drivers from their monthly hours worked.
5.10 Demographics of drivers in each cluster.
5.11 Results of logistic regression on the independent variables for the riders. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
5.12 Results of logistic regression on the independent variables for the drivers. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
5.13 Results of logistic regression on the independent variables for abandonment prediction. *** p-value < 0.001.
5.14 Top 5 closest apps based on cosine similarity for 3 apps.
5.15 An example of Simpson's paradox, where Treatment A works better for both small stones and large stones separately, but in aggregate Treatment B falsely seems to work better.
6.1 Mean of average size of received cascades for under- and overloaded users. Overloaded users have a larger mean across all four groups, sending, respectively, 1) less than 5 tweets, 2) 5-19, 3) 20-59, and 4) 60+ tweets.
6.2 Network properties. (Top) Assortativity of attributes of connected users and (bottom) within-node correlations of the attribute with degree in the empirical data (Emp.) and in the shuffled networks after a controlled (Contr.) and full shuffle (Shuffle) of attributes.
A.1 Top 5 most purchased products.
A.2 Top 5 products with the most money spent on them.
A.3 Top product categories purchased by women and men.
A.4 Top products purchased by women and men.
A.5 Differences in the products purchased by younger (18-22 yo) and older (60-70 yo) users.
A.6 Top 5 items with the most number of repurchases.
A.7 Top predictive features for prediction of the price of the next item and their χ² values.
A.8 Top predictive features for prediction of the time of the next purchase and their χ² values.
A.9 Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, or predicting randomly. Most used: the group the user had the most in earlier purchases. AUC: weighted average of Area Under the Curve for classes. RMSE: Root Mean Square Error. The improvements are reported over the majority vote.
B.1 Comparison of riders by race, age, and gender.
B.2 Comparison between riders who used promotions and those who did not.
B.3 Size and centers of clusters of riders from their monthly number of rides, along with their demographic breakdown.
B.4 Demographics of riders in each cluster.
B.5 Comparison of drivers of different races.
B.6 Size and centers of the clusters of drivers from their monthly hours worked.
B.7 Demographics of drivers in each cluster.
B.8 Percentage of above-average weeks for women and men drivers, given the percentage of women or men drivers.
B.9 Results of logistic regression on the independent variables for the riders. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
B.10 Results of logistic regression on the independent variables for the drivers. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05.
C.1 Top 10 apps by in-app earnings, with demographics.
C.2 Top 5 gender-biased categories.
C.3 Top 5 gender-biased categories.
C.4 AIC for different distributions. Lower AIC scores are preferred.
C.5 Results of logistic regression on the independent variables for abandonment prediction. *** p-value < 0.001.
C.6 Top 5 closest apps by cosine similarity for 3 apps.
C.7 New app prediction accuracy.

List of Figures

3.1 Timeline of user activity on Twitter segmented into sessions. The timeline is a time series of tweets, including normal tweets, retweets, and replies. These activities fall into sessions. A period between consecutive tweets lasting longer than 10 minutes indicates a break between sessions.
3.2 Distribution of the time interval between consecutive tweets.
3.3 Distribution of the number of tweets in a session.
3.4 Visualization of clustering of sessions using the fraction of normal tweets, replies, and retweets.
3.5 Fraction of different tweet types given the time from the user's last tweet.
3.6 Fraction of tweets that are replies posted during sessions of a given length in time and number of tweets in the session. The data was binned, and only bins with more than 100 sessions are included.
3.7 Change in the fraction of tweets of each type over the course of sessions in which users posted 10, 20, or 30 tweets.
3.8 Change in the fraction of tweets of each type over the course of sessions of length 10 in shuffled data.
3.9 Relative change in the fraction of tweets of each type over the course of sessions with 10, 20, or 30 tweets.
3.10 Fraction of tweets that are replies to tweets posted since the beginning of the same session (for sessions with 10 tweets).
3.11 Fraction of long tweets posted over the course of sessions of a given length (10 tweets). Long tweets are defined as non-reply tweets that are longer than 130 characters.
3.12 Percentage change of spelling errors made in tweets over the course of a session, relative to shuffled data.
3.13 Relative change in the tweet type throughout a session for users with few friends and many friends. The change is relative to shuffled sessions with 10 tweets.
3.14 Relative change in the tweet type throughout a session for users with low and high activity.
3.15 Change in time spent per story given the time in the session, along with the 95% confidence interval.
3.16 Change in the fraction of stories that have been viewed earlier and are not new, for the web users.
3.17 Change in time spent on different content types given the time in the session (web users), along with the 95% confidence interval.
3.18 Change in time spent on stories for people of different ages given the time in the session (web users), along with the 95% confidence interval.
3.19 Change in time spent on stories for people with different numbers of friends given the time in the session (web users), along with the 95% confidence interval.
3.20 Change in time spent on stories given the time in the session during different times of the day (web users), along with the 95% confidence interval.
3.21 Change in time spent viewing videos given the time in the session (web users).
3.22 Performance of comments within sessions. We show the average Reddit score for comments in sessions of length 10 (original session data, blue solid line). The average rating of each comment decreases starkly, by about 0.3 points for each comment after the first one in the session. This suggests the presence of (super-linear) performance deterioration throughout user sessions. The effect disappears in randomized data having shuffled comments within sessions (red dashed line).
3.23 Time differences between consecutive comments of users on Reddit. The log-scaled histogram shows a peak for very short time scales (minutes) and very long ones (1 day), suggesting daily routines. A natural valley emerges between both peaks, arguing for the choice of a one-hour break between comments for sessionizing.
3.24 Sessions and randomization. Circles represent comments C_i, and arrows depict the time difference t_{i,j} between subsequent comments C_i and C_j. Sessions are derived by breaking at time differences exceeding 60 min. Original data sessions are shown in the first row. The middle row shows randomized sessions, where time differences between comments are swapped for deriving new sessions while retaining the original order of comments. The bottom row depicts the randomized index data, where sessions are retained but the order of comments within sessions is swapped.
3.25 Empirical observations. This figure visualizes the average of all four quality features of interest at their respective position in a session. The colors (different markers) indicate different session lengths (number of comments written in a session, 1 up to a length of 5). The x-axis depicts a comment's index within the session, and the y-axis gives the average feature value (with error bars). The first row (a) depicts the original session data, while the second (randomized session data) and third row (randomized index data) visualize results for the randomized data. The results indicate that earlier comments in a session tend to be of higher quality than later ones. Additionally, there appears to be a relation between the session length and the performance of the first comment in a session (stacking of lines). These clear patterns for the original data (a) mostly disappear for both of our randomized data sets (b, c).
3.26 Illustration of an email thread.
3.27 (a) Median reply time for different steps of threads for a given thread length. Replies become faster, except for the very last reply, which is much slower. (b) Median length of reply for different steps of threads for a given thread length. Calculated on dyadic conversations.
3.28 Reply time and length as a function of the length of a conversation for dyadic interactions with fewer than 50 steps in a thread, which are 99.7% of all threads. Each plot shows the median, 25th, and 75th percentile of the measure vs. the number of messages in a thread. Longer threads have shorter reply delays and lengths.
3.29 Correlation between time to reply and length of reply for outgoing and incoming emails for dyadic conversations. There is a strong correlation up to a length of 200 words (more than 83% of all replies).
4.1 Correlation coefficient between the features. Boxes with a cross have a statistically significant correlation.
5.1 Demographic analysis. (a) Percentage of online shoppers, (b) number of items purchased, (c) average price of products purchased, and (d) total spent by men and women, broken down by age.
5.2 Distribution of the number of days between purchases.
5.3 Daily number of rides in our data set.
5.4 Distribution of rider age given their gender.
5.5 Riders and surge pricing. (a) Percentage of rides with surge pricing as a function of rider age and gender. (b) Comparison of income of riders who had at least one ride with a surge fare and the rest of the riders.
5.6 Type of Uber car requested by riders given their income.
5.7 Number of drivers for a given age and gender.
5.8 Comparison of income of riders and drivers.
5.9 Correlation between the features of the riders. Pairs without a statistically significant correlation are crossed (p-value < 0.05).
5.10 Correlation between the features of the drivers. Pairs without a statistically significant correlation are crossed (p-value < 0.05). 1 and 2 show the value from the first or second week.
5.11 Percentage of users, purchases, and money spent on each category.
5.12 Effect of gender and age on the spending of users on phone purchases.
5.13 Effect of income on spending. There are more than 10k users for each given age.
5.14 Heatmap of the median amount of money spent by the users in each country and the US.
5.15 Lorenz curve of the spending of the users on in-app purchases, showing high disparity among users.
5.16 Fraction of avid gamers, given the income of the users.
5.17 Change in delay and spending in the first 10 days of purchases from an app.
5.18 Change in delay and spending in the last 10 days of purchases from an app.
5.19 Results of fitting the time between purchases to a Pareto distribution.
5.20 Pairwise correlation coefficient among the features for predicting purchases from new apps.
5.21 Average time spent reading posts at a given minute in the session for all Facebook sessions.
5.22 Relationship between purchase price and time to next purchase. The 0.95 confidence interval is also shown, but it is too small to be observed.
5.23 Relationship between purchase price and time to next purchase, with 0.95 confidence interval.
6.1 An example of a directed network of a social media site. Users receive information from their friends and broadcast information to their followers.
6.2 Variants of the friendship paradox on Twitter showing that your (a) friends and (b) followers are better connected than you are (i.e., have more friends on average) and (c, d) are more popular than you are (i.e., have more followers on average).
6.3 Comparison of a user's activity and the average activity of his or her friends (measured by the number of tweets posted by them). Most (88%) of the users are less active than their friends on average.
6.4 Comparison of average size of posted and received cascades of users with their friends. For the vast majority of users, their friends both receive and post URLs with higher average cascade size, indicating a virality paradox.
6.5 Growth in the volume of incoming information as a function of the user's connectivity and the user activity it stimulates. Lines in (a) show the best power-law and linear fits.
6.6 User activity as a function of the number of followers and friends the user has.
6.7 Comparison of size of posted cascades of overloaded and underloaded users, grouped by their activity.
6.8 Comparison of size of received cascades of overloaded and underloaded users, grouped by their activity.
6.9 (a) Estimated mean grows with sample size. Three distributions, exponential (Exp), log-normal, and Pareto, each result in different biased estimates of the mean. The larger the difference between median and population mean, the larger the discrepancy. Thus, a user will always observe a paradox when calculating the mean of its neighbors when the population mean is greater than the median. (b) Effect of using mean vs. median on the fraction of users with a given number of friends estimated to be in the paradox condition in a random network with no correlations. Users' attributes are drawn independently from x^{-1.2}.
6.10 Percentage of users in the paradox regime on Twitter and Digg after shuffling the number of friends (top row) and the number of followers (bottom row). Error bars show the 0.95 confidence interval.
6.11 Percentage of users in the paradox regime on Twitter and Digg after shuffling the user attribute. Error bars show the 0.95 confidence interval.
6.12 Percentage of users in the paradox regime for the shuffled attribute, but keeping the attribute-connectivity correlation (controlled shuffling).
A.1 Distribution of the number of purchases made by users.
A.2 Distribution of total money spent by users.
A.3 Number of times different items have been purchased.
A.4 Number of item purchases as a function of price.
A.5 Demographic analysis broken down by age: (a) percentage of online shoppers; (b) number of items purchased; (c) average price of products purchased; and (d) total spent by men and women.
A.6 Effect of income on purchasing behavior.
A.7 Number of purchases in a day and average weekly.
A.8 Number of purchases in each hour of the day.
A.9 Distribution of the number of days between purchases.
A.10 Relationship between purchase price and time to next purchase (0.95 confidence intervals are shown yet too small to be observed).
A.11 Relationship between purchase price and time to next purchase, with 0.95 confidence interval.
B.1 Daily number of rides in our data set.
B.2 Distribution of rider age given their gender.
B.3 Rider activity as a function of age and gender.
B.4 Riders and surge pricing. (a) Percentage of rides with surge pricing as a function of rider age and gender. (b) Comparison of income of riders who had at least one ride with a surge fare and the rest of the riders.
B.5 Type of Uber car requested by riders given their income.
B.6 Number of rides at different times of the day and days of the week.
B.7 Number of drivers for a given age and gender.
B.8 Average number of hours worked and weekly earnings of drivers, given their age and gender.
B.9 Comparison of drivers with at least one surge ride to the rest of the drivers.
B.10 Percentage of above-average ratings given the percentage of surge earnings in a week.
B.11 Percentage of above-average weeks for a given age and gender.
B.12 Comparison of income of riders and drivers.
B.13 Correlation between the features of the riders. Pairs without a statistically significant correlation are crossed (p-value < 0.05).
B.14 Correlation between the features of the drivers. Pairs without a statistically significant correlation are crossed (p-value < 0.05). 1 and 2 show the value from the first or second week.
C.1 Percentage of users, purchases, and money spent on each category.
C.2 Effect of gender and age on iPhone purchases.
C.3 Effect of income on spending. There are more than 10k users for each income category.
C.4 Heatmap of the median amount of money spent by the users in each country and the US.
C.5 PDF and CDF of users' spending on in-app purchases.
C.6 Lorenz curve of the spending of the users on in-app purchases, showing high disparity among users.
C.7 Lorenz curve of the earnings of the apps, showing extremely high inequality in the earnings of the apps.
C.8 Fraction of big spenders, given the income of the users.
C.9 Change in delay and spending in the first 10 purchases from an app.
C.10 Change in delay and spending in the last 10 days of purchases from an app.
C.11 Results of fitting the time between purchases to a Pareto distribution.
C.12 Pairwise correlation coefficient among the features for predicting purchases from new apps.

Abstract

People are increasingly spending more time online. Understanding how this time is spent and what patterns exist in online behavior is essential for improving systems and user experience. One of the main characteristics of online activity is its diurnal, weekly, and monthly patterns, reflecting human circadian rhythms and sleep cycles, as well as work and leisure schedules. These patterns range from mood changes reflected on Twitter at different times of the day and days of the week to reading stories on news aggregator websites. Using large-scale data from multiple online social networks, we uncover temporal patterns that take place at far shorter time scales. Specifically, we demonstrate short-term, within-session behavioral changes, where a session is defined as a period of time during which a person engages continuously with the online social network without a long break. On Twitter, we show that people prefer easier tasks, such as retweeting, over more complicated tasks, such as posting an original tweet, later in a session. Also, tweets posted later in a session are shorter and are more likely to contain a spelling mistake. We focus on information consumption on Facebook and show that people spend less time reading a story as they spend more time in the session.
More interestingly, the rate of the change depends on the type of the content: people are more likely to spend time on photos and videos later in a session than on textual posts. We also found changes in the quality of the content generated on Reddit: comments that are posted later in a session get lower scores from other users, receive fewer replies, and have lower readability. All these findings are evidence of short-term behavioral changes in the type of activity that users perform. Moreover, we identify the factors that affect these short-term behavioral changes, the age of the person being the most significant factor. We find that other factors, such as gender, location, and time of the day, also play a considerable role in the behavioral changes. All these correlations can be used to predict the online behavior of individuals with high accuracy; e.g., we can predict the length of an activity session or of a break on Facebook with much higher accuracy than competitive baselines. Our observations are compatible with cognitive depletion theories, which suggest that people's performance drops as they perform a sustained activity for a period of time, and they verify small-scale laboratory studies conducted by psychologists.

We also investigate behavioral changes that are more general than short-term changes, in the context of consumer behavior. We analyze data from purchases that people made online, including purchasing goods, taking rides with ride-sharing apps, and purchases from Apple's App Store. We show that there is significant heterogeneity in these large-scale data sets and that not handling this heterogeneity can result in false findings. We present an approach to test for false findings using randomization and show, in the case of a mistake, how the mistake can be resolved.

Chapter 1
Introduction

1.1 Online Behavior

In recent years, human communications and daily activity have increasingly moved to online platforms: more than 200 billion emails are sent daily [1], 350 thousand tweets are sent in just a minute [2], and 300 thousand status updates are posted on Facebook each minute [3].

[1] http://www.radicati.com/wp/wp-content/uploads/2013/04/Email-Statistics-Report-2013-2017-Executive-Summary.pdf
[2] http://www.internetlivestats.com/twitter-statistics/
[3] https://zephoria.com/top-15-valuable-facebook-statistics/
[4] https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/

Unlike offline behavior, online activity and communications are usually logged, and detailed traces of millions of users are recorded. This large-scale data has enabled researchers to analyze and understand human behavior in ways not possible even a decade ago. For example, the well-known "small-world" experiment was conducted by Milgram in 1967 to demonstrate the searchability of social networks [Milgram, 1967]. In this experiment, random people in Nebraska or Kansas were asked to send a letter to a particular target person in Boston, only by forwarding the letter to people whom they knew personally. This experiment took months and required a lot of effort and cost to measure the network paths. The same question can now be answered on Facebook [4] or other online social networks in much shorter time and at much larger scale (1.6B people in the Facebook study vs.
Multiple studies of online behavior have shown that online activity exhibits strong temporal regularities on a daily, weekly, and seasonal scales. For example, the mood expressed by Twitter users worldwide shows daily and seasonal variation, where people exhibit more positive sentiments in the evenings, weekends, and warmer months [Golder and Macy, 2011]. Daily patterns of food consumption, as well as increased nightlife activity on the weekends, emerge from Foursquare check- in data [Grinberg et al., 2013]. And, voting for news stories submitted to the social news aggregator Digg displays clear daily and weekly cycles of activity [Szabo and Huberman, 2010]. These patterns can be attributed to psychological states governed by circadian rhythms, sleep cycles, and seasonal changes in day length or other cycles in human life, e.g., the monthly income payments. In this dissertation, I show that regular changes in online activity take place on a far shorter time scale: minutes instead of days or months. I argue that these behavioral changes re ect cognitive processes that result from fatigue or loss of interest, because in the short time scale most of the other factors are constant. Moreover, the patterns exist across dierent platforms, so they are highly unlikely to arise due to a particular design or the way people use the platform. I conduct several empirical studies using data from multiple popular social networking platforms, including Twitter [Kooti et al., 2014, Hodas et al., 2013], Reddit, Facebook, and emails [Kooti et al., 2015]. I use large samples of the pop- ulation from each of these platforms: more than 16 billion emails in my analysis of emails, and more than ten billion interactions from Facebook. Interestingly, 2 very similar activity patterns could be observed in individuals' behavior across these platforms, which have completely dierent purposes and interfaces. Exis- tence of the same trends in dierent platforms, strongly suggests that the ndings are rooted in the human's behavior and are not just the result of a particular application or interface. One of the biggest challenges in understanding online behavior is the strong heterogeneity among users that may result in false trends in aggregate. In order to partly account for heterogeneity, I segment the time series of an individual's activity on these platforms into sessions. I dene a session as a series of consecutive interactions without a break longer than some xed threshold. I experimented with dierent ways of dening sessions and dierent thresholds, and my ndings are qualitatively very similar with dierent denitions of session. I nd clear behavioral changes throughout the session, with people tending to do easier tasks later in a session. In Twitter, individuals tend to post shorter tweets and more retweets later in the session and when there is a short delay between consecutive actions. Similarly in emails, a signicant correlation between length of an email sent and the time it took to be composed. In Facebook, I observed that people tend to prefer photos over textual posts later in the session and also generally they spend less time reading each story later in the session. And on Reddit, I found that quality of comments decreases as users post more comments in a session. I also quantify the role of a variety of factors and demographic characteristics of the individuals such as age, gender, and time of the day. 
I use my findings to train classifiers to predict a variety of individuals' behaviors, such as the length of a session, when the next action is going to take place, and what the next action will be. These predictions could be used to design better systems that allow people to use the platform more effectively. Moreover, using the predictions, I can find the predictive power of different features, which gives us insight into the behavioral dynamics of people in sessions.

Loss of attention and a preference for cognitively easier tasks is one potential mechanism that can explain the patterns found in my work. My observations are consistent with the "ego depletion" phenomenon identified by psychologists, in which a period of sustained mental effort leads to declines in cognitive performance and even loss of self-control [Baumeister and Vohs, 2007]. While the mechanism of ego depletion is not clearly understood, the decrease in cognitive performance following a period of mental exertion has been well documented in many settings [Healy et al., 2004, van der Linden et al., 2003, Butler et al.]. In this dissertation, I use the term "cognitive constraint" to refer to the finite capacity individuals have for performance, which is depleted by mental work, including using online social networks.

Individuals' cognitive constraints and the structure of online interactions affect global collective phenomena, resulting, for example, in information overload. Earlier work has shown that information overload can affect information diffusion globally, due to the decreased susceptibility of highly connected people to information propagation [Hodas and Lerman, 2012]. Moreover, a recent study on Twitter has shown that individuals who are willing to spend more time on Twitter construct their network in a way that increases the diversity of the information they receive; since effort is a proxy for a person's cognitive capacity (or willingness) to process information, cognitive factors appear to play a considerable role in network structure [Kang and Lerman, 2015].

1.2 Motivation

As people increasingly use online social networks and online tools, they are also deluged with information. Information overload is defined as receiving more information than an individual can process effectively, and it can have adverse effects, such as missing important information. Information overload can be seen as a natural outcome of the interplay between cognitive constraints and network effects. Understanding its outcomes and the resulting changes in behavior is the first step in overcoming information overload.

One interesting network effect that results in information overload is the "friendship paradox," or Feld's paradox. This paradox states that, on average, your friends have more friends than you do. This is due to the over-representation of extremely popular individuals in the average number of friends [Feld, 1991]. The paradox has been empirically demonstrated both in online social networks, such as Facebook [Ugander et al., 2011], and offline [Feld, 1991, Zuckerman and Jost, 2001]. My analyses show that this paradox can be extended to directed networks, such as Twitter and Digg, and also holds true for other characteristics of individuals, such as activity and the virality of their content. Moreover, interestingly, the paradoxes hold not only for the mean over friends but also for the median, which means the paradoxes also have a behavioral root.
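The mean-versus-median distinction is easy to see on a toy graph. The sketch below (hypothetical friendship lists, not the Twitter or Digg data analyzed in Chapter 6) compares each user's own friend count with the mean and median friend count of their friends:

from statistics import mean, median

# Hypothetical undirected friendship lists: one hub connected to many
# low-degree users, the configuration that produces the paradox.
friends = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}

for user, fs in friends.items():
    own = len(fs)
    friend_degrees = [len(friends[f]) for f in fs]
    print(user, own, mean(friend_degrees), median(friend_degrees))
# Every user except the hub has friends with a higher mean (and median)
# degree than their own: "your friends have more friends than you do."

Only the hub escapes the paradox; its presence in everyone else's friend list inflates their friends' average, which is exactly the over-representation effect described above.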
The observed paradoxes have an unexpected result: if we wish to receive more information, we can usually choose to incorporate more individuals into our online social networks, e.g., by following them on Twitter or friending them on Facebook. However, as we grow our social network, we dramatically increase the volume of incoming information, since, as my analysis shows, not only are your friends better connected than you, they also tend to be more active, producing more information than you are willing to consume. Thus, the increase in information flow collides with our innate cognitive limitations and does not increase our ability to appreciate the totality of our relationships. By increasing the incoming flow of information, we dilute our attention and reduce the visibility of any individual tweet [Hodas and Lerman, 2012]. Receiving too much information may exceed our ability and desire to maintain existing social connections, even if they are unreciprocated [Kwak et al., 2011]. Thus, users will naturally attempt to regulate the amount of incoming information by tuning the number of users they follow.

My goal is to understand how these cognitive constraints affect online behavior, in both the production and consumption of information. Interestingly, as explained above, these individual-level behavioral changes affect the friends of each individual and cause a global change in the network. For example, the fact that information overload causes users to miss some important information decreases the chance that the missed content will be disseminated by the user, and this limits many of the cascades [Hodas and Lerman, 2012].

In short, I am interested in studying the role of cognitive constraints in the production and consumption of information. Understanding these effects would help us better understand the factors affecting mental fatigue and cognitive depletion, using large-scale data. In addition, it would give insights for improving user experience on social networking platforms by correctly accounting for cognitive depletion and fatigue. In this dissertation, I focus on such short-term behavioral changes.

The motivation for understanding short-term behavioral changes is twofold. First, the trends can be leveraged to predict human behavior, and the predictions could be used for multiple purposes, such as improving user experience by adjusting the system or better targeting of advertisements. Second, from a psychological point of view, these trends can be explained by cognitive depletion, providing large-scale evidence for theories that have been demonstrated in much smaller-scale studies [Baumeister and Vohs, 2007, Healy et al., 2004, van der Linden et al., 2003, Butler et al.].

Challenges

There are multiple challenges in studying large-scale data sets. First, gathering and cleaning large-scale data sets takes a lot of time, and many special cases have to be handled; e.g., in the context of emails, email applications have different formats for quoting the original email in a reply, and detecting and eliminating the quoted text required a lot of effort to cover the different possible variations.

Second, since the data includes tens of millions of users and billions of interactions, the computer code analyzing the data must be highly efficient and able to process the data in parallel; otherwise, a simple calculation over the whole data set might take months, whereas the same calculation can be done in just a few minutes with parallel and efficient code.
Lastly, there is a significant level of heterogeneity within a large population. Usually the majority of the population consists of inactive users, but there is a highly active minority. Aggregating over all users at the same time might result in finding a false trend. Simpson's paradox is a special case of this problem, where one trend exists for the whole population, but the opposite trend is observed if the population is divided into two sub-groups [Simpson, 1951]. I take two steps to address this issue: first, whenever a distribution is heavy-tailed, I use the median instead of the mean, which is robust to outliers; second, I segment the behavior of users, e.g., by age or into sessions, in a way that significantly decreases the heterogeneity within each group. I conduct shuffle tests to verify that my findings are robust and not the result of a mixture of heterogeneous populations or of the way the experiment is conducted. I explain the details of the shuffle test and how to handle heterogeneous populations in Chapter 5.

1.3 Research Questions

In this dissertation, I investigate short-term behavioral changes and then, more generally, the role of individual traits, including demographics, in human behavior. In order to properly study these questions, I had to develop methods to handle heterogeneity. In particular, my research questions are:

Q1. What are the short-term patterns in the quantity and quality of user-generated content, and how do they differ across different user populations (age, gender, network position, etc.)?

Daily, weekly, and monthly patterns have been observed in human behavior. Here, I investigate whether human behavior changes over shorter time frames, e.g., minutes or even seconds. I am interested in a variety of behavioral changes:

- The type of action that users take: e.g., posting an original tweet, a reply, or a retweet on Twitter.
- The rate and type of information being consumed: e.g., the time spent on each post on Facebook, and how that changes over time for a particular content type, such as a photo or a textual post.
- The quality of content generated: e.g., the score of a comment on Reddit given by other users, or the readability of the comment.

To study these behaviors, I create sessions of activity and observe the changes within each session. Moreover, it would be interesting to know how different characteristics of the users affect these behavioral changes, e.g., what is the difference
I show that mixing dierent populations together in an analysis can lead to wrong conclusions, and how the problem could be xed. Moreover, in these more general data sets, I focus on how people spend money online in dierent platforms such as online shopping (such as Amazon and Ebay), Uber, and iPhone digital market. I elucidate the role of demographics in the spending behavior. 1.4 Thesis Proposal Short-term behavioral changes, potentially stemming from cognitive constraints, can be observed and measured in online behaviors and these 9 patterns can be leveraged to predict users' behavior much more accu- rately than competitive baselines. In this dissertation, I show that individuals' online behavior change in a far shorter time frames than observed before. The eect can be observed in both quantity and quality of the content produced and consumed by individuals. More- over, I will show that understanding these eects will help us to make better predictions of human behavior in online social networks. 1.5 Contributions and Overview As explained above, this dissertation includes case studies on a variety of plat- forms and social networking websites. Here, I list more detailed contributions and ndings from each platform. On Twitter, a data set of 260M tweets posted by 2M users is analyzed and the key ndings are as follows: I present a detailed analysis of user activity sessions on Twitter. I show that most of the sessions are very short; however, while large fraction of sessions include only one type of tweet, most of the sessions are mixture of dierent types of tweets (e.g., normal tweets, replies, and retweets). I show that people tend to perform easier interactions later in a session, such as replying or retweeting instead of composing original tweets. Also, they tend to compose shorter tweets later in a session. I divide users based on their characteristics, such as position in the follower graph or activity, and show that people with higher activity or more friends behave dierently. 10 On Facebook, I study tens of millions of users and billions of interactions and focus on information consumption. The main contributions of studying Facebook are as follows: I demonstrate short-term changes in activity, with people spending less time on each story in the News Feed over the course of a session. The rate of these changes varies for dierent demographic segments. I show that as the session progresses, people change their patterns of content consumption, e.g., they spend more time viewing photos rather than textual posts. I predict the length of an individual's session using only the activity during the rst minute of the session, more accurately than competitive baselines. I also predict how many stories an individual will consume and the time he or she will return to Facebook. I characterize some of the variety of session types, including sessions where people are more likely to comment on stories, sessions where they prefer to \like" stories, and sessions where they mostly read the News Feed. I studied performance deterioration on Reddit user sessions quantied by study- ing a massive data set containing over 55 million comments. After constructing sessions of activity, I observe a general decrease in the quality of comments pro- duced by users over the course of the sessions. I propose mixed-eects models that capture the impact of session intensity on comments, including their length, quality, and the responses they generate from the community. 
My findings suggest performance deterioration: sessions of increasing intensity are associated with the production of shorter, progressively less complex comments, which receive declining quality scores (as rated by other users) and are less and less engaging (i.e., they attract fewer responses).

Moreover, I conduct the largest study of email to date, covering 16B emails sent to or from 5 million users. The key contributions and findings of my analysis of emails are:

- I empirically characterize email replying behavior, focusing on reply time, length of the reply, and the correlation between them. I quantify how different factors, including the day and time the message was received, the device used, the number of attachments in the email, and user demographics, affect replying.
- I show that email overload is evident in email usage and has adverse effects, resulting in users replying to a smaller fraction of received emails. Users tend to send shorter replies, but with shorter delays, when receiving many emails.
- I find that different age groups cope with overload differently: younger users shorten their replies, while older users reply to a smaller fraction of received messages.
- I find evidence of synchronization in dyadic interactions within a thread: users become more similar in terms of reply time and reply length until the middle of a thread, and start acting more independently after that.
- I can predict reply time and length, and the last email in a thread, with much higher accuracy than competitive baselines. This has important implications for designing future email clients.

This dissertation reviews earlier work on temporal patterns of people's behavior at daily, weekly, and seasonal resolutions in Section 2.1. In contrast, my work focuses on a much smaller time scale of minutes, and patterns of activity are shown in this short time frame. I also cover the earlier work that studied sessions of online activity (Section 2.2). The earlier analyses mostly considered sessions of web browsing and sessions of search, and to the best of my knowledge there has not been any work studying the dynamics of sessions in online social networks at such a large scale. Finally, I discuss the literature on fatigue and cognitive depletion (Section 2.3). These studies have been conducted mainly at a small scale, and most of them are laboratory or survey-based. Large-scale data from online activity provides the opportunity to test theories that were developed in the small settings of laboratory experiments and survey studies.

I study the changes in the types of actions individuals take in a session and present similar trends across different populations (Chapter 3). I show that as people spend more time on Twitter, they tend to perform easier actions, such as retweeting, rather than posting original tweets. In Chapter 4, I leverage my earlier findings to predict different user behaviors. For email, I predict the time and length of the next reply, and whether an email is going to get a reply at all. For Facebook, I predict the length of a person's session, the amount of content consumed in a session, and the return time. In Chapter 5, I consider more general behavioral changes on online platforms, in particular consumer behavior, and focus on how to handle heterogeneous populations. In Chapter 6, I discuss the importance and implications of my findings. Finally, I conclude and summarize all the studies in Chapter 7.
Chapter 2
Background and Related Work

2.1 Cyclic Patterns

With the abundance of data from online activity, scientists have studied patterns in online behavior extensively in recent years. Multiple studies have shown daily, weekly, monthly, and yearly patterns of activity in the online world that are similar to trends found in the offline world. In the offline world, it has been shown that people are more likely to make donations in the morning [Kouchaki and Smith, 2013]. In the online world, Grinberg et al. showed daily and weekly patterns of eating, drinking, shopping, and nightlife using Foursquare check-ins [Grinberg et al., 2013]. Golder et al. found consistent weekly and seasonal patterns of social interaction among college students on Facebook [Golder et al., 2007]. Later, Golder and Macy drew a connection between the sentiment of Twitter posts and diurnal and seasonal cycles, with people expressing more positive sentiment during evenings and the warmer months of the year [Golder and Macy, 2011]. Moreover, Naaman et al. studied diurnal patterns in keyword use on Twitter and assessed their robustness across geographical locations [Naaman et al., 2012]. Leskovec et al. developed a framework for tracking variants of short textual phrases over time [Leskovec et al., 2009] and found prototypical temporal patterns in the spread of news stories [Yang and Leskovec, 2011].

2.2 Analysis of Sessions

Sessions of activity have been shown to be an effective way to characterize people's online behavior, by segmenting an individual's activity into meaningful, smaller sections that are easier to study and analyze [Smith et al., 2005, Rose and Levinson, 2004, Daoud et al., 2009]. In the research community, sessions are usually constructed in two ways: as a series of actions that serve a single intent [Eickhoff et al., 2014, Jones and Klinkner, 2008], or, more commonly, as a period of time without a break longer than a given threshold [Spiliopoulou et al., 2003, Goševa-Popstojanova et al., 2006], which is the definition of session used here.

Sessions have been studied extensively in the context of browsing and search behavior. Jones and Klinkner studied manually labeled search query logs and showed that constructing sessions using a time threshold does not achieve high accuracy for detecting tasks [Jones and Klinkner, 2008]. While this is true in the case of task detection, in my study, since I am interested in finding the effects of cognitive constraints, constructing sessions using a time threshold is more reasonable, because the user's energy recovers while they are not using the system. Huang and Efthimiadis studied search query logs to understand users' reformulation of queries to achieve better results [Huang and Efthimiadis, 2009]. Moreover, Kumar and Tomkins used activity log sessions to characterize users' browsing behavior, finding that half of online pageviews are content, one third are communications, and the remaining one sixth are search [Kumar and Tomkins, 2010]. In another context, Kapoor et al. recently proposed a hidden semi-Markov model to predict which song to recommend to a user, based on their sessionized music listening history [Kapoor et al., 2015].

In recent years, sessions of activity have also been used to understand users' behavior in online social networks. Benevenuto et al. created sessions of activity from a social network aggregator to understand users' behavior at a high level, e.g., how frequently and for how long social networks are used [Jin et al., 2013].
On Twitter, Teevan et al. studied sessions to compare Twitter search with web search [Teevan et al., 2011]. More recently, on Facebook, Grinberg et al. studied the effect of content production on the length and number of sessions [Grinberg et al., 2016].

2.3 Fatigue and Cognitive Depletion

Psychologists have shown that cognitive performance has a temporal component: exerting mental effort makes it more difficult for people to exert mental effort at a later time, whether to exercise self-control [Muraven et al., 1998, Gailliot et al., 2007], decide between multiple options [Baumeister et al., 2008], or accurately solve problems [Healy et al., 2004]. This phenomenon is generally referred to as "ego depletion" [Baumeister and Vohs, 2007]. Although various mechanisms for ego depletion have been proposed and are still debated, the fact remains that cognitive performance declines over a period of sustained mental effort.

The changes in users' behavior over the course of a session may be explained by fatigue or cognitive depletion. These concepts have been studied extensively in the offline world by psychologists, and different effects of fatigue on performance have been found. Fatigue has multiple, and sometimes even opposite, effects on people's behavior in different tasks. Multiple studies suggest that when a person is cognitively depleted, she fails to make progress on a task [van der Linden et al., 2003, Butler et al., Mariakakis et al., 2015]. Van der Linden et al. conducted a controlled experiment where a group of participants was made mentally tired by working for two hours on a cognitively demanding task; the participants' problem solving and planning were then measured. The experiments showed that the fatigued group had higher perseveration and slower planning time on complicated tasks [van der Linden et al., 2003], though no difference was observed in performing a simple task. Moreover, Butler et al. proposed a model for creating game levels that accounts for fatigue causing players to get stuck on a level [Butler et al.]. These studies suggest a "lack of advancement" for cognitively depleted people. On the other hand, multiple studies have suggested that fatigue causes task rushing, where people do the task faster and miss or ignore many details [Cheng et al., 2015, Lim et al., 2010, Rzeszotarski and Kittur, 2011]. It is not clear which of these scenarios applies to a social networking site like Facebook: will users read their feed at a slower rate, as suggested by the lack-of-advancement studies, or will they go through their feed faster due to task rushing? In this work, we show the effect of fatigue on individuals' behavior in consuming content in social media.

2.4 Online Shopping

We also study more general behavioral changes in the context of online spending. Most of the previous research on shopping behavior and the characterization of shoppers has been conducted through interviews and questionnaires administered to groups of volunteers of at most a few hundred members.

Offline shopping in physical stores has been studied in terms of the role of demographic factors in attitudes towards shopping. The customer's gender predicts to some extent the type of purchased goods, with men shopping more for groceries and electronics, and women more for clothing [Dholakia, 1999a, Hayhoe et al., 2000]. Gender is also a discriminant factor with respect to attitudes towards financial practices, financial stress, and credit, and it can be a quite good predictor of spending [Hayhoe et al., 2000].
Many shoppers express the need to alternate between the experiences of online and offline shopping [Wolfinbarger and Gilly, 2001, Tabatabaei, 2009], and it has been found that there is an engagement spiral between online and offline shopping: searching for products online positively affects the frequency of shopping trips, which in turn positively influences buying online [Farag et al., 2007].

Online shopping has been investigated since the early stages of the Web. Many studies tried to draw the profile of the typical online shopper. Online shoppers are younger, wealthier, and more educated than the average Internet user. In addition, they tend to be computer literate and to live in urban areas [Zaman and Meng, 2002, Swinyard and Smith, 2003, 2011, Farag et al., 2007]. Their trust of e-commerce sites and their understanding of the security risks associated with online payments correlate positively with their household income and education level [Horrigan, 2008, Hui and Wan, 2007], and tend to be stronger in males [Garbarino and Strahilevitz, 2004]. The perception of risk in online transactions leads shoppers to purchase small, cheap items rather than expensive objects [Bhatnagar et al., 2000]. Customers of online stores tend to value the convenience of online shopping in terms of ease of use, usefulness, enjoyment, and saving of time and effort [Perea y Monsuwé et al., 2004]. Their shopping experience is deeply influenced by their personal traits (e.g., previous online shopping experiences, trust in online shopping) as well as other exogenous factors such as situational factors or product characteristics [Perea y Monsuwé et al., 2004].

Demographic factors can influence shopping behavior and the perception of the online shopping experience. Men value the practical advantages of online shopping more, and consider a detailed product description and fair pricing significantly more important than women do. In contrast, some surveys have found that women, despite the ease of use of e-commerce sites, dislike the lack of a physical shop experience more than men do, and value the visibility of wide selections of items more than accurate product specifications [Van Slyke et al., 2002, van, 2005, Ulbrich et al., 2011, Hui and Wan, 2007]. Unlike gender, the effect of age on purchase behavior seems to be minor, with older people searching less for online items to buy but not exhibiting lower purchase frequency [Sorce et al., 2005]. With extensive evidence from a large-scale data set, we find that age greatly impacts the amount of money spent online and the number of items purchased.

The social network is also a crucial factor that steers customer behavior during online shopping. Often, social media is used to communicate purchase intents, which can be automatically detected with text analysis [Gupta et al., 2014]. Also, social ties allow for the propagation of information about effective shopping practices, such as finding the most convenient store to buy from [Guo et al., 2011] or recommending what to buy next [Leskovec et al., 2007]. Survey-based studies have found that shopping recommendations can increase the willingness to buy among women more than among men [Garbarino and Strahilevitz, 2004]. Factors leading to purchases in offline stores have been extensively investigated, as they have direct consequences on the revenue potential of retailers and advertisers.
Survey-based studies attempted to isolate the factors that lead a customer to buy an item or, in other words, to understand what the strongest predictors of a purchase are. Although the mere amount of a customer's online activity can predict to some extent the occurrence of a future purchase [Bellman et al., 1999], multifaceted predictive models have been proposed in the past. Features related to the phase of information gathering (access to search features, prior trust of the website) and to the purchase potential (monetary resources, product value) can often predict whether a given item will be purchased or not [Hansen et al., 2004, Pavlou and Fygenson, 2006].

Prediction of online purchases has been addressed through data-driven studies, mostly using click and activity logs. User purchase history is extensively used by e-commerce websites to recommend relevant products to their users [Linden et al., 2003]. Features derived from user events collected by publishers and shopping sites are often used in predicting a user's propensity to click or purchase [Djuric et al., 2014]. For example, clickstream data have been used to predict the next purchased item [Van den Poel and Buckinx, 2005, Senecal et al., 2005]; click models predict online buying by linking the purchase decision to what users are exposed to and what actions they perform while on the site [Montgomery et al., 2002, Sismeiro and Bucklin, 2004]. Besides user click and purchase events, one can leverage product reviews and ratings to find relationships between different products [McAuley et al., 2015]. Email is also a valuable source of information for analyzing and predicting user shopping behavior [Grbovic et al., 2015]. Click and browsing features represent only a weak proxy of a user's purchase intent, while email purchase receipts convey a much stronger intent signal that can enable advertisers to reach their audience. The value of commercial email data has recently been explored for the task of clustering commercial domains [Grbovic and Vucetic, 2014]. Signals to predict purchases can be strengthened by demographic features [Kim et al., 2003]. Also, information extracted from customers' profiles in social media, in combination with information from their social circles, can help predict the category of product that will be purchased next [Zhang and Pennacchiotti, 2013].

2.5 Information Overload

Several studies investigated the impact that the number of social ties of a social media user has on the way she interacts or exchanges information with her friends, followees, or contacts [Backstrom et al., 2011, Hodas and Lerman, 2012, Miritello et al., 2013]. Backstrom et al. measured the way in which an individual divides his or her attention across contacts by analyzing Facebook data. Their analysis suggests that some people focus most of their attention on a small circle of close friends, while others disperse their attention more broadly over a large set. Hodas and Lerman quantified how a user's limited attention is divided among information sources (or followees) by tracking URLs as markers of information on Twitter. They provide empirical evidence that highly connected individuals are less likely to propagate an arbitrary tweet. Miritello et al. analyzed mobile phone call data and noted that individuals exhibit a finite communication capacity, which limits the number of ties they can maintain actively.
The common theme is to investigate whether there is a limit on the number of ties (e.g., friends, followees, or phone contacts) people can maintain, and how people distribute attention across them. Recently, there have been attempts to analyze and model information propagation assuming competition and cooperation between contagions [Goyal et al., 2014, Myers and Leskovec, 2012, Weng et al., 2012]. Finally, a recent study provided empirical evidence of information processing limits for social media users and the prevalence of information overload [Rodriguez et al., 2014]. The study shows that the most active and popular social media users are often the ones that are overloaded. Moreover, the authors found that the rate at which users receive information impacts their processing behavior, including how they prioritize information from different sources, how much information they process, and how quickly they process information. Finally, the susceptibility of a social media user to social contagions depends crucially on the rate at which she receives information. An exposure to a piece of information, be it an idea, a convention, or a product, is much less effective for users who receive information at higher rates, meaning they need more exposures to adopt a particular contagion.

Chapter 3
Short-term Behavioral Changes

In this chapter, we present four case studies on different platforms: Twitter, Facebook, Reddit, and email. We show similarities in user behavior that exist across these different platforms. For each platform, we use large-scale data sets that include tens of millions of users and billions of interactions.

Several mechanisms could explain our observations. First, deterioration of performance following a period of sustained mental effort has been documented in a variety of settings, including data entry [Healy et al., 2004] and exerting self-control [Muraven and Baumeister, 2000], and led researchers to postulate cognitive fatigue [Baumeister et al., 1998] as the explanation. On Twitter, as people become tired over the course of a session, they may switch to easier tasks that require less cognitive effort, such as retweeting instead of composing original tweets. Alternately, our observations could be explained by growing boredom or loss of motivation. It is plausible that social interactions are highly motivating, and the fact that users continue to reply to others, even when they are less likely to compose original tweets, appears to indicate that they shift their effort to more engaging tasks, such as social interactions. Still other explanations are possible, such as users' choice to strategically shift their attention to other tasks. While our work does not address the causes of these behavioral changes, our findings are significant in that they can be used to predict users' future actions, which could, in turn, be leveraged to improve users' online experience on social platforms.

Activity Sessions: User online activity can be segmented into sessions, usually characterized by a single intent in search query research [Jones and Klinkner, 2008, Huang and Efthimiadis, 2009]. More commonly, sessions are defined as a series of actions without a long break. We apply a similar idea to our data sets. We examine the time interval between successive actions and consider a break between sessions to be a time interval greater than some threshold. Following [Jones and Klinkner, 2008], we use a 10-minute threshold for Twitter and Facebook. Thus, on Twitter, all tweets posted within 10 minutes of a previous tweet by the same user are considered to be in the same session, and the first tweet following a time period longer than 10 minutes denotes the start of a new session (Figure 3.1).

Figure 3.1: Timeline of user activity on Twitter segmented into sessions. The timeline is a time series of tweets, including normal tweets, retweets, and replies. These activities fall into sessions. A period between consecutive tweets lasting longer than 10 minutes indicates a break between sessions.
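As a minimal sketch of this segmentation, assuming each user's tweets are available as a list of Unix timestamps (the function name and the example are illustrative):

    def sessionize(timestamps, gap=600):
        # Split a user's event timestamps (seconds) into sessions: a new
        # session starts whenever the gap to the previous event exceeds
        # `gap` seconds (600 s = the 10-minute threshold used here).
        sessions = []
        for t in sorted(timestamps):
            if sessions and t - sessions[-1][-1] <= gap:
                sessions[-1].append(t)   # within the gap: same session
            else:
                sessions.append([t])     # gap exceeded: start a new session
        return sessions

    # e.g., sessionize([0, 60, 120, 2000, 2100]) -> [[0, 60, 120], [2000, 2100]]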
We experimented with different time thresholds, and the results remain robust. Due to the heavy-tailed distribution of the time interval between tweets, increasing the threshold only merges a very small fraction of sessions. Similarly, on Facebook we use a 10-minute time threshold, but due to the slower pace of commenting on Reddit, we use a 60-minute time threshold for Reddit.

Moreover, one of the challenges of working with large-scale observational data is that human behavior is highly heterogeneous. For Facebook users, this translates into large variation in preferences about how they read the News Feed (on a mobile device or a web browser), or how much of the News Feed they read. Averaging behaviors over such heterogeneous populations could produce spurious correlations. Activity sessions also help to partially account for the heterogeneity among users: we separate sessions by their length and analyze the behavior of a more homogeneous population of people who have sessions of a specific length (e.g., sessions that are 10 minutes long).

3.1 Twitter

First, we carry out a study of user activity sessions on Twitter to demonstrate short-term behavioral changes occurring over the course of a single session. Similar to earlier studies of web search, we segment the time series of an individual's activity on Twitter into sessions. We define a session as a series of consecutive interactions (tweeting, retweeting, or replying) without a break longer than a specified threshold. (We experimented with different ways of defining sessions and different thresholds, and our findings are qualitatively very similar across definitions.) We find that most user sessions are short but, consistent with the heavy-tailed distribution of human activity, there is a considerable number of sessions that span hours. Despite their short duration, we find that significant behavioral changes occur over the course of a session, with people preferring easier interactions later in a session. Specifically, people tend to compose longer tweets at the beginning of a session, and post more replies and retweets later in the session, and also when there is a short time period between consecutive interactions. While the Twitter population is highly heterogeneous, these patterns hold across different subsets of the population, e.g., for both highly connected and poorly connected users, as well as for highly active and less active users.

Our Twitter data set includes more than 260M tweets posted by 1.9M users. The data set includes all the tweets of randomly selected users. Twitter is known to include many spammers and bots, i.e., computer programs that tweet. To eliminate non-humans from our data set, we took the approach of [Ghosh et al., 2011] and classified users as spammers or bots based on the entropy of the content they generate and the entropy of the time intervals between their tweets (spammers and bots tend to have low content entropy, and they tweet at regular time intervals).
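As a rough sketch of the timing half of this filter, one can bin inter-tweet gaps and compute their Shannon entropy; accounts posting on a fixed schedule score low. The bin size and cutoff below are illustrative placeholders, not values from [Ghosh et al., 2011].

    import math
    from collections import Counter

    def interval_entropy(timestamps, bin_size=60):
        # Shannon entropy (bits) of inter-tweet intervals, binned to
        # `bin_size` seconds; assumes `timestamps` is sorted.
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        counts = Counter(g // bin_size for g in gaps)
        n = sum(counts.values())
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def looks_automated(timestamps, cutoff=1.0):
        # Bots tweeting on a regular schedule concentrate their gaps in
        # a few bins, so their timing entropy is low.
        return len(timestamps) > 2 and interval_entropy(timestamps) < cutoff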
As discussed earlier, we create sessions of activity on Twitter using a 10-minute threshold. Figure 3.2 shows the probability (PDF) and cumulative (CDF) distributions of the time between consecutive tweets. This distribution is very similar to the distribution of time between the phone calls a person makes [Saramäki and Moro, 2015]. There is no clear cut-off, and the plot drops gradually. The figure also shows that increasing the 10-minute threshold to 30 minutes affects only 6% of sessions.

Figure 3.2: Distribution of the time interval between consecutive tweets: (a) PDF, (b) CDF.

Figure 3.3: Distribution of the number of tweets in a session: (a) PDF, (b) CDF.

Segmenting user activity into sessions is the first step for dealing with user heterogeneity and understanding short-term behavioral changes. To understand sessions, we look at the distribution of session length (the time interval between the first and last tweet of a session) and the number of tweets posted in a session. While these distributions change slightly when a different time threshold is used to segment sessions, as explained above, the change is not significant. Figure 3.3 shows that most sessions include only a few tweets: 64% of sessions include only two tweets, and only 1% of sessions include 12 or more tweets. Moreover, sessions tend to be very short: 99% of sessions are only one minute long, and even if we only consider sessions that include 5 tweets or more, 98% of them are still only one minute long.

To analyze the types of tweets that are posted in a session, we classify tweets into three main types:

- reply: a message directed to another user, usually starting with an @mention, as part of a conversation.
- retweet: an existing message that is re-shared by the user, sometimes preceded by an "RT".
- normal: all other tweets; typically composed tweets, which may include URLs and hashtags.

If we consider all sessions, 59% of sessions include only one type of tweet. This percentage is very high because a large fraction of sessions includes only two tweets, so there is a very low probability of diversity. If we consider only the sessions that include more than five tweets, then 35% of the sessions include one type of tweet, 41% include two types of tweets, and the remaining 24% include all three types of tweets.
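For illustration, the three types can be roughly approximated from the surface markers above; the sketch below is a simplified heuristic, whereas the actual analysis relies on tweet metadata rather than text alone.

    def tweet_type(text):
        # Heuristic three-way split based on surface text only.
        stripped = text.lstrip()
        if stripped.startswith('RT @'):
            return 'retweet'   # re-shared message, conventionally prefixed 'RT'
        if stripped.startswith('@'):
            return 'reply'     # message directed at another user
        return 'normal'        # everything else: composed tweets

    # e.g., tweet_type('RT @alice: hi') -> 'retweet'
    #       tweet_type('@bob thanks!')  -> 'reply'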
To better understand the diversity of sessions, we consider sessions that include 10 tweets and cluster them based on the fractions of normal tweets, replies, and retweets. The X-means algorithm from Weka (http://weka.sourceforge.net/doc.packages/XMeans/weka/clusterers/XMeans.html), which automatically detects the number of clusters, creates three clusters, in each of which one type of tweet is dominant. 44% of sessions belong to the cluster where the majority of tweets are normal, 31% are sessions with many replies, and 25% of the sessions include mostly retweets. Figure 3.4 shows a visualization of the sessions, with each color representing a cluster and the size of the dots representing the number of sessions with those fractions of tweet types. The x-axis shows the fraction of normal tweets in the session, and the y-axis shows the fraction of replies. Each cluster can be identified in the plot by considering the fractions; e.g., the red circles belong to the reply cluster, because they have a high fraction of replies, and the green circles belong to the retweet cluster, because they have low fractions of normal tweets and replies. As shown in the figure, these clusters are not clearly separated, and there is a spectrum of sessions with different fractions of tweet types. This means there are no clear user or session types that have a particular purpose, and most sessions include a mixture of different types of tweets.

Figure 3.4: Visualization of the clustering of sessions using the fractions of normal tweets, replies, and retweets.

3.1.1 Session-level behavioral changes

In this section, we present evidence of changes in user behavior over the course of a single session on Twitter. We focus on three types of behaviors: (i) the type of a message (tweet) a user posts on Twitter, (ii) the length of the message the user composes, and (iii) the number of spelling errors the user makes. Since sessions are typically short, with the vast majority lasting only a few minutes, the demonstrated behavioral changes take place on far faster time scales than those previously reported in the literature (e.g., diurnal and seasonal changes).

Time to next tweet

The type of a tweet a user posts depends on how much time has elapsed since the user's previous activity or interaction on Twitter. As shown in Figure 3.5, 30% of the tweets posted 10 seconds after another tweet are normal tweets, whereas more than 50% of tweets posted two minutes or more after a previous tweet are normal tweets. In general, the longer the period of time since a user's last action on Twitter, the more likely the new tweet is to be a normal tweet. Note that we excluded tweets posted within 10 seconds of the previous tweet, because they are likely to have been automatically generated, e.g., by a Twitter bot. Despite the filtering, our data still contains some machine-generated activity, as evidenced by spikes at 60 seconds, 120 seconds, etc. Interestingly, the shorter the time delay since the previous tweet, the more likely the tweet is to be a retweet. Replies are initially similar to normal tweets: the more time elapsed since the previous tweet, the more likely the new tweet is to be a reply, but unlike normal tweets, their probability saturates and even decreases with longer delays.

To understand these temporal patterns, we segment a user's activity into sessions, as described in the previous section. We can characterize sessions along two dimensions: a) the number of tweets produced during the session, and b) the length of the session in seconds or minutes, i.e., the time period between the first and last tweet of the session. Each of these dimensions plays an important role in the types of tweets that are produced during the session. For example, short sessions with many tweets are very intense, and the user may not have enough time to compose original tweets; hence, the tweets are likely to be replies.
On the other hand, a long session with few tweets is more likely to include more normal tweets, because the user has had enough time to compose them.

Figure 3.5: Fraction of different tweet types given the time from the user's last tweet.

The fraction of tweets that are replies is shown in Figure 3.6, which shows these trends: users are more likely to reply as sessions become longer (in time), or when fewer tweets are posted during sessions of a given duration.

We can study the behavioral change with respect to either the position of the tweet in the session or the time elapsed since the beginning of the session. Our preliminary analysis showed that the number of tweets in a session plays a more significant role than the time since the first tweet of the session. Hence, in the following analyses, we study changes with respect to the position of the tweet within a session and not with respect to the time since the first tweet. In general, the trends are similar but weaker if we consider the time since the first tweet of the session.

Figure 3.6: Fraction of tweets that are replies posted during sessions of a given length in time and number of tweets in the session. The data was binned, and only bins with more than 100 sessions are included.

Changes in tweet type

Next, we study the types of tweets that are posted at different times during the session. Since user behavior during longer sessions could be systematically different from behavior during shorter sessions, we aggregate sessions by their length, which we take to be the number of tweets posted. Then, for each tweet position within a session, we calculate the fraction of tweets that belong to each of our three types. Figure 3.7 shows that tweets are more likely to be normal tweets early in a session, and users prefer cognitively easier (i.e., retweet) or socially more rewarding (i.e., reply) interactions later in a session.

Since the user population on Twitter is highly heterogeneous, these results could arise from non-homogeneous mixing of different classes of user populations. [Kooti et al., 2016] shows an example of this, where a specific class of users is over-represented on one side of the plot (e.g., early during a session), producing a trend that does not actually exist. One way to test for this effect is through a shuffle test. In a shuffle test, we randomize the data and conduct the analysis on the randomized (i.e., shuffled) data. If the analysis of the shuffled data yields a similar result to that of the original data, then the trend is simply an artifact of the analysis and does not exist in the data. If trends disappear completely, it suggests that the original analysis is meaningful. To shuffle the data, we reorder the tweets within each session, keeping the time intervals between them the same.
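A minimal sketch of this shuffle test, assuming each session is represented as a list of tweet-type labels (reordering the labels within a session is equivalent, for this analysis, to reordering the tweets while keeping the intervals fixed; the function names are illustrative):

    import random

    def shuffle_within_sessions(sessions, seed=0):
        # Return a copy of `sessions` (lists of tweet-type labels) with the
        # labels reordered within each session, destroying any position
        # effect while preserving each session's composition.
        rng = random.Random(seed)
        shuffled = []
        for session in sessions:
            s = list(session)
            rng.shuffle(s)
            shuffled.append(s)
        return shuffled

    def fraction_at_position(sessions, label, pos):
        # Fraction of sessions whose tweet at position `pos` (0-based) has
        # the given label, among sessions long enough to have that position.
        eligible = [s for s in sessions if len(s) > pos]
        return sum(s[pos] == label for s in eligible) / len(eligible)

Comparing fraction_at_position on the original and shuffled sessions, position by position, gives the kind of baseline comparison used in the figures that follow.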
Figure 3.7: Change in the fraction of tweets of each type over the course of sessions in which users posted 10, 20, or 30 tweets.

Figure 3.8: Change in the fraction of tweets of each type over the course of sessions of length 10 in shuffled data.

Figure 3.8 shows the results of the analysis on the shuffled data. Flat lines indicate that the fractions of all tweet types do not change over the course of the shuffled session. This suggests that the trends observed in the original data have a behavioral origin.

Using the values in Figure 3.8 as baseline probabilities for a tweet to belong to each of the three tweet types, we repeat the analysis to see how the fraction of each type changes with respect to the baseline. This allows us to observe session-level behavioral changes more clearly, compared to Figure 3.7, which shows the change in absolute values. Figure 3.9 clearly shows that the first tweets of a session are up to 30% more likely to be normal tweets, and 10-20% less likely to be replies or retweets. Interestingly, the position at which the normal tweet becomes less likely than the baseline (the red line crossing zero) increases as we consider longer sessions, and it happens after about 30% of the tweets are posted, i.e., at the 3rd position for sessions with 10 tweets, at the 5th position for sessions with 20 tweets, and at the 10th position for sessions with 30 tweets.

What explains the observed trends? To partially address this question, we focus on the fraction of replies. As explained above, users are more likely to post a reply later in a session instead of composing a normal tweet. This may be due to some sessions becoming longer because of the ongoing conversations the user has with others. To test this hypothesis, we calculate what fraction of replies at each position within the session are in reply to a tweet that was posted since the start of that session. In other words, we calculate the fraction of replies that belong to a conversation initiated during that session. Figure 3.10 shows this fraction: replies that are posted later in the session are much more likely to be the result of an ongoing conversation. This means that part of the trend found above can be explained by users extending their online sessions to interact with others on Twitter.

Figure 3.9: Relative change in the fraction of tweets of each type over the course of sessions with 10, 20, or 30 tweets.

Change in tweet length

One way cognitive depletion, whether through fatigue, boredom, or loss of attention, could manifest itself is through users writing shorter messages. To investigate this, we study the change in the length of tweets posted over the course of a session. We exclude retweets from this analysis, because the length of a retweet does not represent the effort needed to compose it. First, we calculated the average length of the tweet at each position in the session, but there is too much variation in tweet length to produce any statistically significant trends.
Instead, we divide tweets into long tweets (longer than 130 characters) and short tweets (shorter than 130 characters), and measure the fraction of long tweets over the course of the session. We find a statistically significant trend, wherein tweets posted later in the session are more likely to be short, compared to tweets posted earlier in the session (Figure 3.11).

Figure 3.10: Fraction of tweets that are replies to tweets posted since the beginning of the same session (for sessions with 10 tweets).

Figure 3.11: Fraction of long tweets posted over the course of sessions of a given length (10 tweets). Long tweets are defined as non-reply tweets that are longer than 130 characters.

We choose a high threshold for the long tweets because, when a user is reaching the 140-character limit imposed by Twitter, they usually have to make an effort to shorten their tweet by rephrasing and abbreviating the message. We believe that this results in a stronger signal for analysis, compared to the situation where the user just types a few more characters, e.g., 30 characters vs. 35 characters. To ensure that the drop in the fraction of long tweets is a real trend, we perform the shuffle test and obtain a flat line. This suggests that later in a session users are less likely to devote the effort to composing a long tweet. We exclude tweets that include URLs and repeat the analysis, and we obtain very similar results. Moreover, the same trend exists if we consider replies or normal tweets individually.

Change in the number of spelling mistakes

Yet another manifestation of cognitive depletion is people paying less attention and making more mistakes. To verify this with our data, we consider the percentage of words that are spelled incorrectly in a tweet. Earlier studies have shown that when people are tired, their judgment is impaired [Baumeister et al., 2008], and it is harder for them to solve problems correctly [Healy et al., 2004]. We hypothesize that we can observe this effect in the number of spelling errors users make. To this end, for each tweet we calculate the percentage of words that are spelled incorrectly (i.e., typos) and compute the average percentage of typos at each tweet position in a session. We exclude non-English tweets and punctuation, and use a dictionary that includes all forms of a word, e.g., the past tenses of verbs and the plurals of nouns.

Figure 3.12: Percentage change of spelling errors made in tweets over the course of a session, relative to shuffled data.

Figure 3.12 shows that there is a small but statistically significant increase in the percentage of typos made in tweets over the course of a session. This percentage rises quickly initially, but saturates later in the session. Overall, there is a 3% relative increase in the probability of making a spelling mistake later in the session, compared to the first tweets of the session.
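A minimal sketch of this per-tweet measurement, assuming the word list is loaded as a Python set; the tokenization here is a simplified stand-in for the actual pipeline:

    def typo_rate(words, dictionary):
        # Percentage of alphabetic words not found in `dictionary` (a set
        # that already contains inflected forms, e.g., plurals and past tenses).
        words = [w for w in words if w.isalpha()]   # drop punctuation, URLs, etc.
        if not words:
            return 0.0
        misspelled = sum(w.lower() not in dictionary for w in words)
        return 100.0 * misspelled / len(words)

    # e.g., typo_rate('I relly liked it'.split(), {'i', 'really', 'liked', 'it'})
    # -> 25.0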
3.1.2 Role of user characteristics

In this section, we investigate how differences between users may contribute to behavioral changes. We split users based on their characteristics and carry out the analysis described in the previous section within each subset of users.

User connectivity

One of the main characteristics of Twitter users is the number of friends they have, i.e., the number of other Twitter users they follow. This number is highly correlated with the amount of information users receive and the number of interactions they have with other users. We rank users based on the number of others they follow (i.e., the number of friends) and compare the bottom 20% with the top 20% to see session-level behavioral differences. In both cases, we measure how the fraction of tweet types relative to the baseline changes over the course of a session.

Figure 3.13: Relative change in the tweet type throughout a session for users with (a) few friends (< 46 friends) and (b) many friends (> 1,321 friends). The change is relative to shuffled sessions with 10 tweets.

Figure 3.13 shows that users with many friends retweet significantly more than users who follow few others. This is probably due to the fact that these users receive many more tweets, so they have more opportunities for retweeting. These users also tend to be very active, and as users become more active, they tend to perform easier tasks, such as retweeting. However, even though the fraction of tweet types differs between the two groups, the change over the course of a session is very similar. Therefore, we conclude that users with different numbers of friends act differently in general, but their behavior changes in the same way over the course of a session. We verify that the results are not an artifact of the analysis by performing the shuffle test.

User activity

Next, we divide users into different classes based on their activity, i.e., their rate of tweeting. We order users based on the average number of tweets per month, and compare the top 20% most active users to the bottom 20%. We find that the less active users tend to compose original (normal) tweets, and are more likely to do so than the most active users. In contrast, the more active users produce many more retweets and replies, compared to users with lower levels of activity (Figure 3.14). And, unlike the previous analysis, where we divided users based on the number of friends, the change in the fraction of replies shows a larger increase for the more active users. We again conduct a shuffle test to ensure that the observed effect is real.

We conclude that part of what makes users active is their willingness to engage in social interactions on Twitter. Users extend their sessions to carry on conversations with others. People appear to prioritize their online activity on Twitter, and social interactions appear to be preferable to users, especially more active users, later in the session.

Figure 3.14: Relative change in the tweet type throughout a session for users with (a) low activity (< 0.93 tweets per month) and (b) high activity (> 52.5 tweets per month).

Summary: We analyzed user behavior during Twitter activity sessions. We found that most Twitter sessions are short, on the order of minutes, and include only a few tweets, although the tweets tend to be diverse, including composed messages, retweets of others' messages, and replies to other users. Despite their short duration, we showed that user behavior changes over the course of an activity session. As people spend more time in the session, they prefer to perform easier or more socially engaging tasks, such as retweeting and replying, rather than harder tasks such as composing an original tweet.
At the beginning of the session, tweets are up to 25% more likely to be an original tweet than near the end of the session. In addition, tweets tend to get shorter later in the session, and people tend to make more spelling mistakes. All of this could be explained by people becoming mentally fatigued, or perhaps careless due to loss of motivation. If we divide users into classes based on the number of friends they follow, or their activity level (i.e., the number of tweets they posted), we find that while these user classes behave differently in general, in terms of the types of tweets they tend to post, all classes manifest similar behavioral changes over the course of the session. Understanding the dynamics of user behavior over the course of such activity sessions could help better predict behavior and adjust the system accordingly to enhance the user experience.

3.2 Facebook

Next, we conduct a study using data from the popular social networking service Facebook, which is used daily by more than a billion people worldwide to stay in touch with family and friends, to connect with communities and interests, to be informed about current events, and to be entertained. Like many other social networking services, Facebook compiles stories shared by friends, pages, and groups, which include status updates, photos, videos, links to other online content, etc., and presents them as a list (i.e., the News Feed). A person browses this list to read status updates from friends or watch photos and videos they shared.

As discussed earlier, we create sessions of activity by using a 10-minute threshold. We compare sessions of the same length and find that individuals spend less time viewing each story as the session progresses. In addition, we find that people preferentially shift attention to some types of content, such as photos, over the course of a session. These trends are more pronounced in the older population and also in people who have fewer friends on Facebook. Moreover, longer sessions have markedly different patterns of activity than shorter sessions. This distinction is so strong that we can use the first minute of a person's activity to predict how long his or her session will be. We can also predict how much content the person will consume over the course of a session, and when he or she will return to Facebook.

Although this study does not resolve the origins of these behavioral changes, quantifying them, and moreover using them to predict behavior, can potentially allow for improvements in the online user experience. For example, content could be ranked dynamically to account for these behavioral changes by shifting photos to later in the session. The ability to predict session length and activity could be particularly useful for caching content on mobile devices. In developing countries and emerging markets, there are hundreds of millions of users with outdated mobile devices and poor internet connectivity.
Correctly caching content on such devices, based on the predicted session activity, and at the right time (just before the user logs in, based on predicted return times), can potentially improve the overall user experience by minimizing the network latency of delivering fresh content.

Data

A primary activity on Facebook is browsing the News Feed to consume stories shared by friends, which include friends' status updates in the form of textual posts, videos, or photos they shared. By default, the News Feed ranks all the friends' stories by their predicted relevance and interest to the user. Since we are interested in short-term behavioral changes, such as those occurring over the course of a session, the News Feed ranking algorithm may introduce a substantial confounding factor, for example, by putting more interesting stories higher in the News Feed, so that a person sees them earlier in a session. Facebook also allows users to rank the stories in chronological order, with the most recently shared story at the top of the list. This option is called "most recent" ranking, and although just a small fraction of people use it, they represent a large enough sample to test our hypotheses. For our study, we considered only the people who chose the "most recent" option for ranking stories in their News Feed. As a result, the stories they saw on Facebook were ordered by the time of posting, rather than relevance, so that any observed differences in engagement with stories would be due to factors such as time spent in the session, rather than changes in the properties of the stories. On average, the population using most recent ranking is more active than the general Facebook population, but they are broadly distributed across different demographic segments such as age, location, and number of friends.

We consider all activities for a random sample of these people over the course of one month (June 2015). In addition, we focus only on people who used Facebook via the web, an iOS device, or an Android device. This sample of millions of users performed billions of interactions on each of the platforms considered. Here, an action is any type of activity that a person can perform on Facebook, such as reading a story, liking or commenting on it, or creating a status update (including links, pictures, text, etc.). All analysis was performed in aggregate on de-identified data.

We first consider the consumption of information, comprised of: 1) reading a story, which includes reading textual stories and comments, and viewing photos, and 2) watching videos, which includes both videos originally shared on Facebook and videos shared from external websites. Next, we calculate the time people spent on different activities, such as reading posts and watching videos in their News Feed. We used the logged data to calculate the time that a person spends on each story. This is achieved by considering all stories that are visible to the user in the News Feed, and dividing the viewing time between the stories based on the proportion of the screen that each story occupies. For example, if there are only three stories visible on the screen, and the first one occupies half of the screen, the second one 20%, and the third one 30%, and the user spends 10 seconds reading these stories, then we give an approximate allocation of 5 seconds to the first story, 2 seconds to the second story, and 3 seconds to the last story.
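A minimal sketch of this proportional split (the names are illustrative, and the production logging pipeline is more involved):

    def allocate_dwell_time(elapsed_seconds, visible_fractions):
        # Split `elapsed_seconds` of viewport time across the visible
        # stories in proportion to the share of the screen each occupies.
        # `visible_fractions` maps story id -> fraction of the screen,
        # and is assumed to be non-empty.
        total = sum(visible_fractions.values())
        return {story: elapsed_seconds * frac / total
                for story, frac in visible_fractions.items()}

    # Reproducing the example above: three stories occupying 50%, 20%,
    # and 30% of the screen, viewed for 10 seconds.
    # allocate_dwell_time(10, {'a': 0.5, 'b': 0.2, 'c': 0.3})
    # -> {'a': 5.0, 'b': 2.0, 'c': 3.0}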
For videos, we use the amount of time a video has been played in the News Feed. By default, videos in the News Feed play automatically, and we count them as watched if the viewer switches to full-screen, un-mutes the sound, or stays on the video for at least 75% of its length. Other thresholds and criteria yielded similar results. We study how people allocate their time to reading stories and watching videos, and how this allocation changes over the course of an activity session.

Activity Sessions

To study changes in behavior over the course of a session, we have to segment the time series of user interaction data to identify sessions. One option is to use the actual sessions; that is, the time beginning when a person navigates to Facebook or opens the Facebook application until the time the web page or application is closed. However, this means that when a person closes the page and then opens it a few seconds later, two sessions would be counted, while someone who leaves the page open all day would have only one session counted. Since we are interested in continuous periods of active engagement with Facebook, we need a different definition of a session. Hence, we define a session as a series of consecutive interactions without a break longer than 10 minutes. In other words, a session consists of all interactions that are within 10 minutes of the previous interaction, similar to the definition of Twitter sessions explained above. While 10 minutes is an arbitrary threshold, using different values left the substantive results of this work unchanged.

3.2.1 Behavioral Changes during the Session

We demonstrate that people change their content consumption behavior over the course of a session. We explore how different factors, including age and content type, affect these behavioral changes. In the first subsection, we focus on all types of content that appear in the News Feed except videos, which we consider separately because the time spent on videos is measured differently.

Reading Stories

How much time do people spend reading stories in their News Feed, and how does this behavior change over the course of a session? To answer these questions, we calculate the average time users view stories as a function of the time since the beginning of the session. A potential source of bias is our definition of a session as a period of activity without a 10-minute (or longer) break, which gives rise to a data censoring problem. A person who starts reading a story one minute before the end of the session will by definition spend at most one minute on the story, while a person who starts reading the story at the very beginning of the session could spend up to 10 minutes reading it. As a result, we would observe the average time spent consuming content decrease towards the end of the session. For a fair comparison, we exclude stories that take longer than one minute to read from the average time spent on stories. These stories are a small portion (7%) of the entire data set.

Figure 3.15(a) shows how the average time people spend reading stories varies as a function of the time since the beginning of the session (time spent is normalized by the maximum time spent across all sessions). These data are for people accessing their News Feed through a web browser. The plot includes sessions of length 10, 20, 30, and 40 minutes, which all show a similar trend: people read stories in their feed faster as the session progresses. As can be seen in Figures 3.15(b) and 3.15(c), people who read News Feed stories on mobile devices, such as iOS and Android devices, show a very similar pattern of behavior.
As can be seen in Figures 3.15(b) and 3.15(c), people who read News Feed stories on mobile devices, such as iOS and Android devices, have a very similar pattern of behavior. There are some spikes in 46 Figure 3.15(b) that are due to the batching that happens during data logging on iOS devices, but the overall drop in the time spent reading stories is still visible. 4.25 4.50 4.75 5.00 5.25 0 10 20 30 40 Time in the session (minutes) − www Average time spent (seconds) Session length 10 20 30 40 (a) Web 4.25 4.50 4.75 5.00 5.25 0 10 20 30 40 Time in the session (minutes) − ios Average time spent (seconds) Session length 10 20 30 40 (b) iOS 3.8 4.0 4.2 4.4 0 10 20 30 40 Time in the session (minutes) − Android Average time spent (seconds) Session length 10 20 30 40 (c) Android Figure 3.15: Change in time spent per story given the time in the session along with the 95% condence interval. It is unlikely that behavioral changes occurring over the length of a session are the result of dierences in content relevance. Because we restrict our analysis to 47 people who view stories in a (reverse) chronological order of the time they were posted on Facebook, it is unlikely that the length or interestingness of stories is correlated with their position in the feed. In the context of reading the News Feed, the ndings suggest that as people consume content, they devote less and less time to each item. One explanation for the decrease in time spent on stories over the course of a session is that as people get closer to the end of the session, they are more likely to see a story they have seen before; hence, they spend less time on it (Figure 3.16). In addition to the drop in time per story over the course of a session, we also observe that people read stories faster during shorter sessions, already starting from the beginning of that session. This suggests that the activities taking place during the rst minute of a session are some of the more predictive features of session length. The gures also show a precipitous drop at the very end of the session. We speculate that this pattern is common to people consuming the News Feed in the \most recent" conguration, where they reach the point where they encounter content they have previously consumed, rapidly scroll through a few more stories to ensure they have really reached the end of new content, and then end their session. Next we examine the impact of dierent factors, such as content type, on session-level behavioral changes. We only present results for 30 minute sessions on web browsers, but the trends for other session lengths and interfaces for reading the News Feed are qualitatively similar. Content type We start by considering the type of stories people consume, dier- entiating between photos, links to external content, and textual posts. Intuitively, 48 0.15 0.20 0.25 0.30 0.35 0 10 20 30 40 Time in the session (minutes) Fraction of already−seen posts Session length 10 20 30 40 Figure 3.16: Change in fraction of stories that have been viewed earlier and are not new for the web users. 70 80 90 100 0 10 20 30 Time in the session (minutes) − www Average time spent (normalized) Content type link photo post Figure 3.17: Change in time spent on dierent content types given the time in the session (web users) along with the 95% condence interval. dierent types of content require dierent amounts of mental eort to consume: e.g., most people nd it easier to look at a photo than read a text post. 
This may cause the consumption of some types of content to be less affected by depletion than others. As Figure 3.17 shows, in the first minute of a session, people spend almost the same amount of time viewing photos as reading textual posts, but in the last minute of the session they spend 9% more time on photos compared to textual posts. Links to external content show a smaller drop over the course of the session.

[Figure omitted: x-axis: time in the session (minutes); y-axis: average time spent (normalized); content types link, photo, post.] Figure 3.17: Change in time spent on different content types given the time in the session (web users), along with the 95% confidence interval.

Age

Next, we examine how age relates to session-level behavioral changes. As Figure 3.18 shows, age has a striking effect on the average time spent reading each story. First, older people read stories more slowly. The relative difference is as high as 80% between 15-20 year olds and 55-60 year olds. Second, and more interestingly, the behavior of older people changes more over the course of a session than for younger people, with the time spent per story experiencing a sharper drop. For the youngest age group, the average time spent reading stories remains nearly constant over a session, though much shorter than for the older age groups. A similar trend has been found in email behavior, where older people take much longer to reply to an email compared to teenagers [Kooti et al., 2015].

[Figure omitted: x-axis: time in the session (minutes); y-axis: average time spent (normalized); age groups 15-20 through 55-60.] Figure 3.18: Change in time spent on stories for people of different ages given the time in the session (web users), along with the 95% confidence interval.

Number of friends

Figure 3.19 compares content consumption patterns of people with many friends to those with fewer friends. People with fewer friends spend more time reading each story in general, compared to people with many friends. The slower rate of content consumption may be due to the fact that they have fewer new stories in the News Feed, so they do not need to rush through their feed to read all the stories in their limited time. Alternatively, the people with fewer friends may belong to a different population that is less familiar with the interface or generally consumes content at a slower rate. Our second observation is that people with fewer friends experience a bigger behavioral change over the course of a session compared to people with more friends: those with few (50-100) friends experience a 14% speed-up in their content consumption rate between the beginning and end of a (30-minute) session, while those with many (450-500) friends experience a 9% speed-up. Highly connected people interact with the larger volume of content they receive from their friends by spending less time on each story. They also do not change their behavior as much as people with fewer friends.

[Figure omitted: x-axis: time in the session (minutes); y-axis: average time spent (normalized); friend-count groups 50-100 through 450-500.] Figure 3.19: Change in time spent on stories for people with different numbers of friends given the time in the session (web users), along with the 95% confidence interval.

[Figure omitted: x-axis: time in the session (minutes); y-axis: average time spent (normalized); session start times 8 am, 12 pm, 4 pm, 10 pm.] Figure 3.20: Change in time spent on stories given the time in the session during different times of the day (web users), along with the 95% confidence interval.

Time of day

Finally, we consider the effect of the time of day on content consumption.
Earlier research has shown that people's behavior changes over the course of the day, and that this is best explained by people having higher levels of energy in the morning. For example, the "morning morality effect" exists because people have higher moral awareness and self-control in the morning [Kouchaki and Smith, 2013]. In the online world, people reply to a higher fraction of emails and reply to emails faster in the morning than in the evening [Kooti et al., 2015]. To test the time-of-day effect, we consider sessions that started at different times of the day (8 am, 12 pm, 4 pm, and 10 pm). Interestingly, people spend relatively more time reading posts in the morning (8 am) and late at night (10 pm) compared to noon (12 pm) and late afternoon (4 pm). Figure 3.20 shows that there is little difference in the rate at which behavior changes over the session.

In summary, we demonstrated changes in behavior over the course of a session, with people spending less time on each story as they go through the feed faster and faster. This effect is more significant for some types of content: for example, textual posts, which presumably require a greater effort, show a larger decline than other content, such as photos. Age plays a considerable role in the observed effect: spending more time continuously in a session has a stronger effect on the time spent per story in older populations compared to younger ones.

Viewing Videos

Next, we analyze video viewing behavior during a session. Since video viewing time is measured as the duration of time the video plays, it gives a similar but more refined view of how people allocate their time to different types of content over the course of a session.

We observe changes in the time spent viewing videos during a session. Following the analysis described in the previous section, we group together sessions of the same length and calculate the average amount of time spent watching videos at any minute during the session. Figure 3.21 shows that people spend less and less time watching videos over the course of a session. However, the drop is about 5% smaller than the drop for reading stories (Figures 3.15 and 3.17). Therefore, video viewing behavior changes less during a session compared to other kinds of content, and as a result, people tend to watch relatively more videos later in the sessions.

[Figure omitted: x-axis: time in the session (minutes); y-axis: average video time (normalized); session lengths 10, 20, 30, 40.] Figure 3.21: Change in time spent viewing videos given the time in the session (web users).

Summary: We analyzed a large data set of user activities on Facebook, comprised of the interactions people have with the content their friends shared. These interactions can be divided into sessions, or periods of activity without a break longer than 10 minutes. Once segmented into sessions, content consumption shows strong regularities with predictable changes over the course of a session for many people. Regardless of the platform they use to consume Facebook content (web browser or mobile device), their demographic attributes, social connectivity, or the time of day they are active, people manifest similar behavioral changes: as the session progresses, they spend relatively less time viewing a story or video, and preferentially shift their attention away from reading stories and more towards viewing photos and videos. There were also strong differences between short and long sessions.
People spent less time consuming content during shorter sessions, a pattern that was already evident at the start of the session. While our work does not address the origins of these behavioral changes, the fact that we see them in almost all user populations suggests a fruitful area of future research that delves into the factors affecting differences in people's content consumption and interaction between and within sessions.

3.3 Reddit

Next, we study online performance on Reddit, a popular peer production and social news platform. We measure online peer production performance as the quality of comments produced by Reddit users over the course of a session, defined as a period of activity without a prolonged break. The data set we study contains over 55 million comments posted on Reddit in April 2015, and includes a variety of related meta-data, such as time stamps, information about the users, and the score attributed by others to each comment. We segment user activity into sessions, defined as periods of commenting without a break longer than 60 minutes, as suggested in [Halfaker et al., 2015] (Figure 3.24). We link an individual's commenting performance over the course of a session to different proxy measures for a comment's quality, such as its length, readability, the score it receives from others, and the number of responses it triggers.

[Figure omitted: x-axis: comment index difference; y-axis: change in score; lines for original session data and randomized index data.] Figure 3.22: Performance of comments within sessions. We show the average Reddit score for comments in sessions of length 10 (original session data, blue solid line). The average rating of each comment decreases starkly, by about 0.3 points for each comment after the first one in the session. This suggests the presence of (superlinear) performance deterioration throughout user sessions. The effect disappears in randomized data in which comments are shuffled within sessions (red dashed line).

Our analyses uncover deteriorating online performance over the course of user sessions, with a decline in the quality of subsequent comments across different proxy measures. Figure 3.22 illustrates the decline in the average score received by comments posted during sessions with ten comments: the data shows that each subsequent comment receives a rating that is on average 0.3 points lower than the preceding one. The size of this effect is quite large: it is equivalent to a 30% increase in the probability of a comment receiving a down-vote, for each extra comment posted after the first one in the session. To statistically study this effect, we design and implement mixed-effects models, which allow the incorporation of heterogeneous behavioral differences, to model the effect of session duration on the deterioration of online performance.

Our findings may be linked to effects of cognitive depletion: exerting mental effort to compose a comment may diminish an individual's capacity to continue producing quality comments, whether through the loss of attention, mental fatigue, or simply the onset of boredom.

3.3.1 Data and methodology

Data

For studying performance deterioration we utilized a publicly available data set (https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment) containing all comments (nearly 1.7 billion) ever written on Reddit, starting from the first one on October 17, 2007 to the last one at the end of May 2015. For our experiments, we extracted a smaller sample that limits the data to all comments posted in April 2015.
An advantage of this limited data is that we do not need to additionally account for changes in Reddit's platform, not only in its interface, but also in its voting mechanisms as well as the general usage patterns of users on the site [Singer et al., 2014]. Our results are robust to samples from other months, which show similar observations.

Quality features

To measure online performance, we studied the following comment quality features.

Text length. This feature counts the number of characters in a comment and is an indicator of its textual length. Each URL in a comment accounts for one additional character. The overall mean of text lengths is μ = 168.08, the median is m = 86.00, and the standard deviation is σ = 281.88.

Score. The score is a measure of the perception of other users and is the difference between their up- and downvotes (the starting score is 1). All ratings can be summarized by the mean μ = 6.05, the median m = 1.00, and the standard deviation σ = 51.57.

Number of responses. We see the number of replies a comment triggers as a proxy for engagement and a comment's success. We only count direct replies in the comment hierarchy. The mean number of responses is μ = 0.61, the median is m = 0.00, and the standard deviation is σ = 1.44.

Readability. George Klare provided the original definition of readability [Klare, 1963] as "the ease of understanding or comprehension due to the style of writing". For measuring the readability of Reddit comments, we use the so-called Flesch-Kincaid grade level [Kincaid et al., 1975], representing the readability of a piece of text by the number of years of education needed to understand the text upon first reading; it contrasts the number of words, sentences, and syllables. It is defined as follows:

    0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59

The lowest possible grade is -3.4, which emerges, e.g., for comments that only contain a single one-syllable word such as "OK", only a single URL, or only emoticons. We set the maximum Flesch-Kincaid grade to be 22. Simply put, a higher Flesch-Kincaid grade indicates higher reading complexity of a given comment. The overall mean of the Flesch-Kincaid grade is μ = 5.12, the median is m = 4.91, and the standard deviation is σ = 4.61.
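A sketch of this computation is given below; the syllable counter is a crude vowel-group heuristic standing in for a proper dictionary-based counter, and the clipping follows the bounds stated above.

    import re

    def count_syllables(word):
        # Crude vowel-group heuristic; a stand-in for a dictionary-based
        # syllable counter.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        """Flesch-Kincaid grade level:
        0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59,
        clipped to the range [-3.4, 22] used in this study."""
        sentences = len(re.findall(r"[.!?]+", text)) or 1
        words = re.findall(r"[A-Za-z']+", text)
        if not words:
            return -3.4  # e.g., a comment that is only a URL or emoticons
        syllables = sum(count_syllables(w) for w in words)
        grade = (0.39 * len(words) / sentences
                 + 11.8 * syllables / len(words) - 15.59)
        return min(22.0, max(-3.4, grade))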
Correlation of features. Most of the features are not strongly correlated (Pearson correlation) with each other; however, we can identify two special cases (Table 3.1). First, readability and text length have a correlation of r = 0.296, which is not surprising given that shorter texts are easier to read, which is accounted for in the Flesch-Kincaid grade level formula. Second, the two success features, score and number of responses, have a correlation of r = 0.558, meaning that comments that get a high score also tend to receive more replies. However, overall, these correlation results indicate that each feature represents interesting aspects on its own.

Table 3.1: Pearson correlation between features.

                 text length   readability   responses   score
    text length  1.000         0.296         0.072       0.005
    readability  0.296         1.000         0.043       0.005
    responses    0.072         0.043         1.000       0.558
    score        0.005         0.005         0.558       1.000

[Figure omitted: log-scaled histogram of time differences between consecutive comments; x-axis from 5 sec. to a month; y-axis: frequency.] Figure 3.23: Time differences between consecutive comments of users on Reddit. The log-scaled histogram shows a peak for very short time scales (minutes) and very long ones (1 day), suggesting daily routines. A natural valley emerges between both peaks, arguing for the choice of a one-hour break between comments for sessionizing.

Sessions

We decided to take the time differences between consecutive comments as session indicators. To this end, we followed the approach advised in [Halfaker et al., 2015], where a strong regularity in how social media users initiate events across several different platforms was identified. The authors argue that a good rule of thumb is an inactivity threshold of 60 minutes to separate sessions. However, as postulated, we first visually and analytically inspect the log-scaled histogram of time differences between consecutive comments (after cleaning comments, before filtering sessions), as depicted in Figure 3.23. Similar to the results presented for other platforms [Halfaker et al., 2015, Geiger and Halfaker, 2013], there is a peak for very short time scales (minutes) and a peak for time differences of one day, suggesting a daily routine. By fitting a Gaussian Mixture Model (using the EM algorithm, log-normal mixture) with two components to the log-transformed data, we end up with the two means μ1 = 6.85 min and μ2 = 794 min. A natural valley is visible between the two peaks and thus, combined with the results from the log-normal mixture fitting, we follow the rule of thumb of [Halfaker et al., 2015] and pick a time difference t(i,j) of one hour between consecutive comments C_i and C_j to separate sessions. Note that other (similar) choices of break time (e.g., 30 or 90 minutes) produce similar inference.

Data pre-processing

We took several steps for pre-processing and cleaning the data. First, we removed accounts in the removedlist (https://www.reddit.com/r/autowikibot/wiki/redditbots) and the accounts that have been deleted later; this accounts for around 4.5M comments. Second, we deleted all sessions containing at least one comment (i) that has been deleted, (ii) that is completely empty, or (iii) that contains characters that are not in the ASCII character set (e.g., Chinese characters), accounting for an additional 3M comments. Finally, we removed all sessions containing more than 10 comments, accounting for around 7.25M comments, which allows for easier experimental tractability and removes further bot accounts. Note, though, that including these sessions in the experiments does not change the main observations of this paper. Our final data set contains 40,064,930 comments produced by 2,669,969 different users and posted in 47,462 different subreddits.

Randomizing sessions

For comparison, we created two randomized data sets to which we applied our analysis. The first baseline, which we call the randomized session data set, attempts to preserve as much information as possible while randomizing the process of sessionizing user commenting behavior. To do so, we shuffled the time differences t(i,j) between consecutive comments made by each user, but preserved all other features, including the temporal order of comments. Then, we simply sessionized user activity based on the shuffled times. An example is provided in Figure 3.24 (middle row). This baseline data set is very conservative in terms of randomization and retains many original sessions. For example, many parts of a session stay intact, as only the short time differences are potentially swapped, which does not alter the sessions. The second baseline, which we call the randomized index data set, keeps the sessions intact, but randomizes the order of comments inside each session (e.g., exchanging C_1 with C_3). Thus, it does not preserve the original order of comments; see Figure 3.24 (bottom row). Multiple randomization iterations did not alter the results.

Figure 3.24: Sessions and randomization. Circles represent comments C_i and arrows depict the time difference t(i,j) between subsequent comments C_i and C_j. Sessions are derived by breaking at time differences exceeding 60 min. Original data sessions are shown in the first row. The middle row shows randomized sessions, where time differences between comments are swapped for deriving new sessions while retaining the original order of comments. The bottom row depicts the randomized index data, where sessions are retained but the order of comments within sessions is swapped.
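The two baselines can be sketched as follows; representing a user's history as a list of comments plus the list of gaps (in minutes) between consecutive comments is an illustrative assumption, not the exact representation used in our pipeline.

    import random

    def randomize_session_data(comments, gaps):
        """Baseline 1 (randomized session data): keep the order of a
        user's comments, shuffle the inter-comment gaps (in minutes),
        then re-sessionize with the 60-minute rule. Assumes a non-empty
        comment list, with len(gaps) == len(comments) - 1."""
        shuffled = gaps[:]
        random.shuffle(shuffled)
        sessions, current = [], [comments[0]]
        for comment, gap_minutes in zip(comments[1:], shuffled):
            if gap_minutes > 60:
                sessions.append(current)
                current = []
            current.append(comment)
        sessions.append(current)
        return sessions

    def randomize_index_data(sessions):
        """Baseline 2 (randomized index data): keep the sessions intact,
        shuffle the order of comments inside each session."""
        return [random.sample(s, len(s)) for s in sessions]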
Mixed-effects models

For statistically modeling performance deterioration, we utilized mixed-effects models, which allow for the incorporation of heterogeneous effects and behavioral differences and account for the non-independent nature of the longitudinal data at hand. Mixed-effects models include both fixed and random effects; following [Gelman, 2005], we refer to fixed effects as effects that are constant across levels (e.g., individuals) and random effects as those varying between different levels. An overview of mixed-effects models can be found in [Pinheiro and Bates, 2006].

In our setting, the introduction of random effects enabled us to consider variations between different levels, the most important level being different users, accounting for the inherent differences between individual Reddit users (e.g., the average quality of their comments). As highlighted in [Baayen et al., 2008], mixed-effects models have further advantages, such as flexibility in handling (i) missing data and (ii) continuous and categorical responses, as well as (iii) the capability of modeling heteroskedasticity.

For simplicity, let us specify mixed-effects model equations using the following syntax [Bates et al., 2015]:

    outcome ~ 1 + fixed effect + (random effect | level)    (3.1)

This specification describes a model where an outcome (dependent variable) is explained by an intercept 1, one or more fixed effect(s), as well as one or more random effects allowing for variations between levels. For all our experiments, we utilize the lme4 R package [Bates et al., 2015] and fit the models with maximum likelihood. Examples of model specifications can be found online (http://glmm.wikidot.com/faq#toc27).
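To illustrate the specification in (3.1), the sketch below fits a random-intercept model in Python with statsmodels as a freely available stand-in for lme4; our analyses used lme4 in R, including generalized variants (e.g., Poisson glmer) that statsmodels' linear MixedLM does not cover, and the file and column names here are assumed.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed layout: one row per comment, with columns 'score',
    # 'session_index', 'session_length', and 'user'.
    df = pd.read_csv("comments.csv")  # hypothetical input file

    # score ~ 1 + session_index + session_length + (1 | user)
    model = smf.mixedlm("score ~ session_index + session_length",
                        data=df, groups=df["user"])
    result = model.fit(reml=False)  # maximum likelihood, as in the text
    print(result.summary())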
As each of our experiments is conducted on one of our four different features, which all exhibit different properties (e.g., count data for text length vs. continuous data for readability), we performed extensive model analytics to find the most suitable model for each problem setting. To this end, we focused not only on applying simple linear mixed-effects models, but also on (i) various transformations, (ii) generalized mixed-effects models, such as Poisson or negative binomial regression, (iii) model diagnostics, (iv) further refinements that allowed us to compensate for problematic data structures leading to, e.g., overdispersion or zero-inflation, and (v) checking for potential problems like multicollinearity, outlier bias, as well as convergence problems. For judging the significance of fixed and random effects, we followed an incremental modeling approach, starting with the simplest model, explaining the outcome only by the intercept, and then subsequently adding effects to the model. For comparing the relative fits of these models, we used the Bayesian Information Criterion (BIC) [Schwarz et al., 1978], which balances the likelihood of a model with its complexity. All reported effects in this paper are significant, except where mentioned (randomized baseline data). For completeness, we also conducted additional significance tests for the fixed effects, such as t-tests or F-tests, confirming our BIC diagnostics. Instead of reporting the individual analytical steps for each experiment, we focus on reporting fixed effect coefficients, as those are the main effects we are interested in.

3.3.2 Performance deterioration

Next, we present our findings on the effects of session dynamics on online performance, focusing on (i) empirical observations, as well as, by utilizing mixed-effects models, (ii) performance at session start and (iii) performance over the course of sessions. We study, after pre-processing, around 40 million Reddit comments posted in April 2015. We sessionize user behavior by periods of commenting activity without breaks longer than 60 minutes, as suggested in [Halfaker et al., 2015]. We experimented with other thresholds and the trends remained the same. For measuring performance, we look at four proxies of comment quality: text length, readability, the score a comment receives from others, and the number of responses it triggers. For comparison, we also study effects on two randomized session data sets, as described in Figure 3.24.

Figure 3.25 visualizes changes in online performance over the course of user sessions with respect to our quality features (comment text length, number of responses, score, and readability). Different colors and markers distinguish sessions of distinct length (i.e., the number of comments written during the session) of up to a length of 5. The x-axis shows the session index of a comment, the y-axis shows the (population-wide) average of the respective feature (with error bars). For example, in the first plot of Figure 3.25(a), the red triangle at x = 2 refers to the average text length of all comments written in second position in all sessions of length 3.

Figure 3.25(a) depicts the original session data of interest and suggests interesting dynamics in user behavior. First, all lines are stacked: the first comment of a longer session also starts out with a longer text, a higher score, more responses, and more complex text (evidenced by a higher readability score). Second, all feature values decline throughout the course of a session, hinting towards some form of performance deterioration. On average, the last comment of a session is shorter, receives a lower score and fewer responses, and is easier to read.

In contrast, these trends largely disappear in our randomized data, i.e., the randomized session data shown in Figure 3.25(b) and the randomized index data shown in Figure 3.25(c). There is no clear decline in the feature values of later comments in comparison to earlier comments in sessions.

[Figure omitted: three rows of four panels (text length, score, num. responses, readability), each plotting the average feature value against comment position for session lengths 1 to 5; rows correspond to (a) original session data, (b) randomized session data, (c) randomized index data.] Figure 3.25: Empirical observations. This figure visualizes the average of all four quality features of interest at their respective position in a session. The colors (different markers) indicate different session lengths (number of comments written in a session, 1 up to a length of 5). The x-axis depicts a comment's index within the session, and the y-axis gives the average feature value (with error bars). The first row (a) depicts the original session data, while the second (randomized session data) and third row (randomized index data) visualize results for the randomized data. The results indicate that earlier comments in a session tend to be of higher quality than later ones. Additionally, there appears to be a relation between the session length and the performance of the first comment in a session (stacking of lines). These clear patterns for the original data (a) mostly disappear for both of our randomized data sets (b, c).
The reason why some lines (e.g., number of responses) in Figures 3.25(b) and 3.25(c) are still slightly stacked can be explained by our way of randomizing; see the middle and bottom rows of Figure 3.24. Some sessions (especially in the randomized session data) still stay partly, or sometimes even fully, intact, preserving the original session data. However, the effects are much reduced; for example, the average number of responses in the original data ranges between 0.55 and 0.77, while in the two randomized sets it ranges in the intervals [0.58, 0.65] and [0.56, 0.66], respectively.

Several considerations limit the conclusions we can draw from these empirical results. First, the population-wide average feature value may not be fully indicative of user performance, because some distributions (length, score, responses) are heavy-tailed. Second, we have only visualized sessions up to a length of 5. While visualizations including all lengths up to 10 show similar trends (not shown here), more detailed analyses are necessary. Third, and most importantly, we have ignored the fact that our samples are not independent of each other, as we repeatedly measure comments for individual redditors. Each user's behavior may be different; for example, one user may tend to write very long comments, while another one may prefer making shorter ones. Mixing these different behavioral aspects in one analysis does not allow for specific inference about performance deterioration.

We resolve some of these issues by using mixed-effects models incorporating individual differences. We start with (1) an analysis of the performance of the first comment in sessions, based on our observation of a potential stacking effect, and continue with (2) experiments on performance deterioration over the course of sessions with respect to a potential decline in quality.
Performance at session start

We hypothesized a relation between the length of sessions and their comments' respective quality, readily apparent in the stacking of lines in Figure 3.25(a). We now statistically study this relation by focusing on the simplified question of whether the length (number of comments) of a session has an effect on the performance of the very first comment in the session. We model the data with mixed-effects models specified as: feature ~ 1 + session length + (1 | user). The outcome (dependent) variable refers to one of our four quality features. The session length is the main fixed effect of interest. Additionally, we vary the intercept between users (random effect). For this analysis, we limit our data to only consider the very first comment in each session (around 23.5M comments). The results (fixed session length effects) are summarized in Table 3.2(a).

Table 3.2: Mixed-effects model results. In (a), the models study the effect of session length on the quality of the first comment C_1 in a session; i.e., the data only contains the first session comments. In (b), the models investigate the effect of the session index i on the quality of the respective comment C_i; the data includes all comments in sessions with more than a single comment. Each table highlights the most appropriate model for each quality feature based on extensive model analytics; lmer refers to linear mixed-effects models, while glmer refers to generalized linear mixed-effects models. All coefficients are strongly significant, as derived from model comparisons based on BIC statistics.

    (a) Performance at session start
    feature         best model                                          coeff. (session length)
    text length     lmer (log-transform)                                +0.0342
    num. responses  glmer (Poisson, log link)                           +0.0685
    score           glmer (Poisson, log link, constant added for pos.)  +0.00015
    readability     lmer                                                +0.0478

    (b) Performance over the course of sessions
    feature         best model                                          coeff. (session index)
    text length     lmer (log-transform)                                -0.0205
    num. responses  glmer (Poisson, log link)                           -0.0640
    score           glmer (Poisson, log link, constant added for pos.)  -0.00028
    readability     lmer                                                -0.0410

As hypothesized from our empirical population-wide observations, the results indicate that there is a positive relation between the length of sessions (i.e., the number of comments) and their first comment's quality. This is evident from the resulting positive fixed-effect coefficients, meaning that an increase in session length leads, on average, to an increase in the first comment's text length, the number of responses it triggers, the score it receives, and its Flesch-Kincaid grade level, which corresponds to a higher complexity of the written text.

A potential explanation for the observed effect is that users start with different capacities to make quality contributions depending on how many more comments they plan to compose. Another (opposite) explanation could be that a higher performance of the first comment encourages users to produce more comments, leading to longer sessions. While we believe the first explanation is more plausible (text length and readability are not based on external success measures, and responses accumulate at a somewhat longer time scale), future studies should aim at answering these causal questions. Without resolving the nature of causality, the identified relation between session length and quality of the first comment has implications for the experiments we report below that model the dynamics of user performance during the sessions. We have now shown, empirically and statistically, a high heterogeneity between different sessions with respect to their length.
Accounting for this (e.g., as a nuisance effect) in our models is thus necessary.

Performance over the course of sessions

We now turn our attention to the dynamics of user performance in sessions on Reddit. Our empirical insights so far have suggested a performance decline throughout the course of a session. We statistically study this hypothesis by investigating whether the index of a comment (its relative position in the session) has an effect on the quality of the respective comment. To this end, we apply mixed-effects models specified as: feature ~ 1 + session index + session length + (1 | user). Again, the dependent variable refers to one of our four quality features. The session index is the main fixed effect that we are interested in for studying performance declines. Our models include an additional nuisance effect controlling for individual session lengths, as suggested by our previous experiments; model analytics confirm the importance of this factor. An additional random effect models the variation of the intercept between different authors. For this analysis, we consider all data for sessions having more than a single comment (around 24.5M comments).

We summarize the main results in Table 3.2(b) and again focus on the fixed session index effect. The results now indicate a negative effect of the session index on our respective quality features, indicated by the four negative coefficients. This means that as a session goes on, the quality of comments decreases on average. The next comment in a session has a shorter text, triggers fewer responses, receives a lower score, and has a lower Flesch-Kincaid grade level, indicating less complex written text. This argues for performance deterioration throughout the course of user sessions on Reddit.

To further confirm the observed effect, we repeated the above experiments on the randomized data. For both the randomized session and index data sets, the session index effect is not significant for any of the features of interest, indicating no performance depletion effect in the randomized data. This is in contrast to the real session data analyzed above and also confirms that the effects do not simply arise as a result of the order in which comments are made, but from their order within a session.

Summary: This study presents novel evidence of performance deterioration during prolonged online activity. Our study was conducted on Reddit, an online social network platform that attracts millions of users. We segmented user activity on Reddit into sessions and separated them by intensity (the number of posts produced during that session). We found that sessions with more activity are significantly associated with the production of lower quality content, as measured by four proxy variables: content length, its readability score, its average rating, and the number of responses. In light of these findings, we developed a mixed-effects model that captures online performance deterioration.

3.4 Email

Finally, we study email communications. Email remains an essential tool for social interactions and a popular platform for computer-mediated communication. It is used within organizations to exchange information and coordinate action, and also by ordinary people to converse with friends. Patterns of email interactions reveal circadian rhythms [Malmgren et al., 2008], bursty dynamics of human activity [Barabasi, 2005], and the structure of evolving conversations [Eckmann et al., 2004].
Understanding how these patterns shape email use is necessary for designing the next generation of interaction tools that will improve the efficiency of communication and coordination in social groups. We address these questions with the largest study to date of email conversations (16B emails). We focus our analyses on the replying behavior within dyadic interactions, i.e., conversations between pairs of users. Specifically, we measure the time a user takes to reply to a message, the length of the reply, as well as the fraction of messages a user replies to.

First, we empirically characterize replying behavior in email conversations in a large population of users, and also how these behaviors vary by gender and age. Although we find no significant variation due to gender, we find that younger email users reply faster and write shorter replies than older users. Next, we study how email load, measured by the number of received email messages, affects replying behavior. We find that while users attempt to adapt to the rising information load by replying to more emails, they do not adequately compensate. As email load increases, they reply faster but to a decreasing fraction of incoming emails. These findings suggest that email overload is a problem, with users generally unable to keep up with the rising load. We also study how replying behavior evolves over the course of a conversation. We find that users initially synchronize their behaviors, with reply times and lengths of replies becoming more similar in the first half of the conversation. However, they become desynchronized in the second half of the conversation. In contrast, users continue to coordinate their linguistic styles, as messages exchanged over the course of the entire conversation become more similar in content and style.

The key contributions of our work are:

1. We empirically characterize the email replying behavior of users, focusing on reply time, length of the reply, and the correlation between them. We quantify how different factors, including the day and time the message was received, the device used, the number of attachments in the email, and user demographics, affect replying.

2. We show that email overload is evident in email usage and has adverse effects, resulting in users replying to a smaller fraction of received emails. Users tend to send shorter replies, but with shorter delays, when receiving many emails. We find that different age groups cope with overload differently: younger users shorten their replies, while older users reply to a smaller fraction of received messages.

3. We find evidence of synchronization in dyadic interactions within a thread: users become more similar in terms of reply time and length of replies until the middle of a thread, and start acting more independently after that.

4. We can predict reply time and length, and the last email in a thread, with a much higher accuracy than the baseline. This has important implications for designing future email clients.

Yahoo Mail is one of the largest email providers in the world, with over 300M users (according to ComScore, http://www.comscore.com), who generate an extremely high volume of messages. Obviously, not all email addresses are associated with real people; some are used instead by organizations, or possibly bots, to generate emails for commercial promotions and spam [Jagatic et al., 2007, Zhuang et al., 2008, Grbovic et al., 2014]. To meet the goal of studying social interactions, it is necessary to subsample the data to a set of interactions that are likely occurring between real people.
For this reason, we apply a conservative filtering strategy and focus our study on user pairs (or dyads) that exhibit reciprocal interaction (i.e., bi-directionality of emails sent) and exchange some minimum number of messages. This ensures that (i) all the email addresses of the dyadic endpoints are likely associated with human users, and that (ii) the emails sent between them are not automatically generated. Accordingly, we selected a random subsample E of 1.3M dyads of Yahoo Mail users worldwide who have significant interactions with each other, sending at least five replies in each direction in a time span of an undisclosed number of months. For privacy and policy reasons, the data set includes messages belonging exclusively to users who voluntarily opted in for such studies. Consequently, for a pair of users who exchanged more than 5 emails, both users need to be opt-in users to be included in the study. These pairs comprise a set N containing 2M unique users exchanging 187M emails overall. We refer to the full sequence of emails flowing within the dyad as a conversation.

Next, we gathered all incoming and outgoing emails of users in N over the same time period, a total of 16B emails. Note that these emails included only emails from commercial domains and emails to and from other opt-in Yahoo users. Due to Yahoo policy, the study did not include personal email messages between Yahoo users and other email clients. In addition, we excluded notifications from social network sites, such as Facebook or Twitter, which represent a considerable portion of commercial, automatically-generated emails.

We conduct our study at two different levels: at the dyadic level, we consider only the emails flowing between dyad endpoints; at the global level, we consider the entire data set. In the latter case, commercial and spam messages will likely be part of the incoming emails directed to users in N. As one of the goals of this work is to study information overload in all its facets, we do not apply any a priori filter. Then, depending on the target of each part of the analysis, we apply ad hoc filters to fit our specific goals.

Each email included the sender ID, receiver ID, time sent, subject, body of the email, and number of attachments. All data were anonymized to preserve user privacy, and we worked at the level of anonymized user ids and email ids. Given the data sensitivity, even with opt-in users, email bodies were not editorially inspected by humans. To extract statistics from email bodies (length, number of articles, email vectors, email ids in a thread, etc.), we made a MapReduce call with a specified function to a protected cluster.

To make sense of the conversation flow, we need to break down the dyadic conversations into threads. A thread is an ordered sequence of one or more emails, including the initial email and the list of succeeding replies to it (if any). As our data set does not keep track of the thread structure, we need to reconstruct it. We compose the threads using the subject line and the time each email was sent. Yahoo Mail automatically adds the token "Re: " to the subject line of all replies. For each pair of users, we group all the exchanged emails that have the same subject line. If all the emails in the group start with "Re:", we also add the email that has the exact subject without the "Re:"; this email would be the first email of the thread. Then, in each group we order the emails based on the sent time, and the reply times can be calculated as the delay between an email from one user and the reply of the other user. In case of consecutive replies in one thread from one of the users, we consider only the first one (Figure 3.26).
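A minimal sketch of this grouping follows, under the assumption that a dyad's emails are available as (sent_time, sender, subject) tuples; the repeated stripping of the "Re: " prefix is a simplification of the exact-subject matching described above.

    from collections import defaultdict

    def build_threads(emails):
        """Group a dyad's emails into threads by subject line: strip the
        "Re: " token that Yahoo Mail adds to replies, group emails that
        share the normalized subject, and order each group by sent time.
        `emails` is an iterable of (sent_time, sender, subject) tuples."""
        threads = defaultdict(list)
        for sent_time, sender, subject in emails:
            base = subject
            while base.startswith("Re: "):
                base = base[len("Re: "):]
            threads[base].append((sent_time, sender, subject))
        # Chronological order within each thread; the first email is the
        # one whose subject carries no "Re: " prefix (when present).
        return {base: sorted(msgs) for base, msgs in threads.items()}

Reply times then follow from the ordered tuples: the delay between an email from one user and the next email in the thread sent by the other user.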
Like most other email service providers, Yahoo Mail quotes the body of the original message at the end of the reply, unless it is manually removed. To get only the text of the last email sent, we search for standard string templates that occur before the quoted message (e.g., "On Thursday May 1, 2014 a@yahoo.com wrote") and exclude any text from that point. We also looked for common mobile device signatures, such as "Sent from my iPhone", and excluded the text from the signature onwards.

Spam. One main concern with studying emails is spam. To minimize the effect of spam, we conduct most of our analyses on dyadic email exchanges that occur between two users who have exchanged at least five emails with each other. We believe that the fact that two users have exchanged more than five emails means that neither of them is a spammer. In the analysis of the information load on users, we have to consider all the emails those users received, and many of them could be spam and should not be considered. To deal with this problem, we conduct the analysis once for all the received emails and once for emails received from contacts, i.e., others with whom the user had exchanged at least one email. Again, filtering out users who have not received any replies from a user is a very conservative approach to eliminating spammers.

We characterize the replying behavior of email users involved in dyadic interactions, focusing on reply time and the length of replies. Reply time is the period between the time the sender sends a message (e.g., "user A" in Figure 3.26) and the time the receiver ("user B") replies to it. When the receiver replies after multiple consecutive emails sent by the sender within the same thread, the reply time is calculated from the time of the first message to the time the receiver replied. We experimented with different definitions of reply time, e.g., from the time of the last message in a series of emails in a thread, but this did not significantly change the key properties of replying behavior, resulting only in slightly faster replies.

We investigate the effect of the position of the reply within a conversation (i.e., the step in a thread) on the reply time and length. Figure 3.27(a) shows how reply time changes as a function of thread step, for threads of different lengths. Replies become faster as the conversation progresses, but the last reply is much slower than the previous replies. The long delay in a reply could be considered a signal for the end of the conversation. Figure 3.27(b) shows the effect of thread step on the length of the reply. Replies get slightly longer as the conversation progresses, although the last reply in a thread is much shorter than the previous replies. Moreover, we see that longer conversations (threads) have faster reply times and shorter reply lengths.

Figure 3.26: Illustration of an email thread.

[Figure omitted: two panels, (a) median reply time and (b) median reply length vs. step of the thread, for thread lengths 5, 10, 20, and 50.] Figure 3.27: (a) Median reply time for different steps of threads of a given thread length. Replies become faster, except for the very last reply, which is much slower. (b) Median length of reply for different steps of threads of a given thread length. Calculated on dyadic conversations.
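For reference, the per-step medians behind Figure 3.27 reduce to a simple grouped aggregation; this is a sketch, and the file and column names (thread_length, step, reply_minutes) are assumed, not the names used on the protected cluster.

    import pandas as pd

    # Assumed layout: one row per reply, with the thread's total length,
    # the step (position) of the reply in the thread, and the reply
    # delay in minutes.
    replies = pd.read_csv("dyadic_replies.csv")  # hypothetical input

    # Median reply time at each step, separately per thread length,
    # mirroring Figure 3.27(a).
    per_step = (replies.groupby(["thread_length", "step"])["reply_minutes"]
                .median())
    print(per_step.loc[10])  # e.g., the curve for threads of length 10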
To better quantify this effect, we calculate the median reply time and length for different thread lengths. We find that both reply times and lengths of replies are smaller in longer conversations (Figure 3.28). This would be expected if the data covered a very short time period, because the reply times would have to be small to fit a long conversation into a short period of time. But this is not the case here, since we are covering several months.

[Figure omitted: two panels, (a) median reply time and (b) median reply length vs. number of emails in a thread.] Figure 3.28: Reply time and length as a function of the length of a conversation for dyadic interactions with less than 50 steps in a thread, which are 99.7% of all threads. Each plot shows the median, 25th, and 75th percentile of the measure vs. the number of messages in a thread. Longer threads have shorter reply delays and lengths.

[Figure omitted: median reply time (minutes) vs. email length (number of words), for sent and received emails.] Figure 3.29: Correlation between time to reply and length of reply for outgoing and incoming emails for dyadic conversations. There is a strong correlation up to a length of 200 words (more than 83% of all replies).

Figure 3.29 shows how the time to reply to an incoming (received) message varies as a function of the length of the received message and the length of the reply. There is a strong correlation between reply time and length, showing that longer replies take longer to compose. However, replies longer than 200 words are slightly faster. This could be due to a number of reasons. First, we may not properly account for message length due to copy-and-pasted emails or missed quoted messages. Second, there could be systematic differences in the population of users who write replies longer than 200 words; e.g., such users may be more adept at writing. There is also a strong correlation between reply time and the length of received emails, showing that the longer the messages users receive, the longer it takes them to reply. The slight decrease in reply time for messages longer than 200 words could be explained as above. Figure 3.29 also shows that just a couple of minutes' difference in the time needed to reply to an email corresponds to a significantly different length of the email.

Chapter 4

Predicting Online Behavior

In this chapter, we show that the patterns found in the previous chapter can be used for the prediction of users' behavior. We have already presented models for Twitter and Reddit, so in this section we focus on the Facebook and Email data sets.

4.1 Facebook

On Facebook, we predict session length, the number of stories consumed, and return time, using past behavior of the users. We use one month of data, with the first three weeks of the data for training and the last week for testing. In this way, we do not use any future information in our predictions.

4.1.1 Session Length

We show that user activity during just the first minute of a session helps predict the ultimate session length. We frame the prediction task as a classification problem with three classes: short sessions between (1,5] minutes, medium sessions between (5,15] minutes, and long sessions, which are longer than 15 minutes.
Short sessions include about 36% of all sessions, medium sessions include 37%, and long sessions account for the remaining 26% of the sessions. Note that using only the first minute of the session to predict session length is a hard task compared to other framings of the problem, such as predicting whether the session time will double. We include a variety of features in the prediction task (a sketch of the resulting prediction setup follows the list):

User characteristics: age, gender, location, number of friends, number of days on Facebook, language, number of days active in the last 7 and 30 days, number of subscribers and subscriptions (10 features). Subscribers are users who follow updates from another user (usually celebrities); subscribing does not need the approval of the other user, so it is a directed relationship.

Session activity: features from the activity of the user in the first minute
- Mean, median, maximum, standard deviation, 10th, 25th, 75th, and 90th percentiles of time spent on reading posts, watching videos, and creating a post in the first minute of a session (24 features).
- Number of different interactions during the first minute of a session: number of likes, comments, stories viewed, video playbacks, posts removed, shares, clicks, and reshares (8 features).
- Time of the day (1 feature)
- Time since the previous session (1 feature)
- Number of notifications at the beginning of the session (1 feature)

User history: features from the activity of the user in the first three weeks of the data
- Mean, median, maximum, standard deviation, 10th, 25th, 75th, and 90th percentiles of the length of the session in minutes, number of stories viewed, and return times in the training data (24 features).
- Counts of short, medium, and long session lengths; small, medium, and large numbers of stories viewed; and short, medium, and long return times in the training data (9 features).
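The sketch below shows the shape of this setup on synthetic stand-in data; scikit-learn's decision tree is used as a freely available substitute for the C5.0 classifier adopted in this work, and the toy feature matrix merely stands in for the features listed above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, f1_score

    def session_class(minutes):
        """Map a session length in minutes to the three classes above:
        0 = short (1,5], 1 = medium (5,15], 2 = long (>15)."""
        if minutes <= 5:
            return 0
        return 1 if minutes <= 15 else 2

    # Synthetic stand-in features: in the real task each row would hold
    # the user-characteristic, first-minute, and user-history features.
    rng = np.random.default_rng(0)
    lengths = rng.exponential(scale=10, size=5000) + 1
    X = np.column_stack([lengths + rng.normal(0, 5, 5000),  # noisy proxy
                         rng.normal(0, 1, 5000)])           # uninformative
    y = np.array([session_class(m) for m in lengths])

    split = int(0.75 * len(y))  # stand-in for the temporal split
    clf = DecisionTreeClassifier(min_samples_leaf=50)  # C5.0 substitute
    clf.fit(X[:split], y[:split])
    pred = clf.predict(X[split:])
    print("accuracy:", accuracy_score(y[split:], pred))
    print("macro F1:", f1_score(y[split:], pred, average="macro"))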
To better understand the predictive power of dierent features, we divide our features into three groups and use the features in each group alone to predict session length. The rst group includes features that are characteristics of the users, such as age and gender. These features will have the same value for dierent sessions of the same user. The next group includes all features that are extracted from user activity in that particular session, e.g., mean time spent on stories. The 81 Table 4.1: Top 5 features for predicting the length of the session and their infor- mation gain. Rank Feature Info. gain 1 Mean time spent reading stories 0.070 2 Mean session length 0.048 3 Mean time spent on creating a post 0.042 4 STD of session lengths 0.041 5 Number of long sessions 0.035 third group includes features that are related to the earlier behavior of the user, e.g., the mean session length. Table 4.3(a) shows the accuracy of the prediction using the features in each of these groups. The user characteristics features are the least predictive, while the person's earlier Facebook browsing and their behavior at the start of the session are more predictive. To calculate the importance of each feature individually, we rst remove the correlated features (Figure 4.1) and then run a logistic regression over the data after normalization. Table 4.2 shows the ranking of the variables that have statistically signicant coecients. We nd that the number of actions taken in the rst minute is the top feature in ranking according to logistic regression and suggests that the number of actions users take in the rst minute re ects over the period of the session. Next features are number of friends, and number of sessions that the user had in the training period. Due to privacy concerns, we are not able to share the coecients for each feature. 4.1.2 Number of Stories Viewed Next, we predict how many stories the person will view in the current session. We use the same approach and features as in the previous prediction, using the 82 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 stories_count like_count comm_count video_count click_count action_count break_time mean_action_time age gender days_on_FB subscription_cnt subscriber_cnt friend_count active_last_month session_count mean_length mean_break start_hour mean_story_time stories_count like_count comm_count video_count click_count action_count break_time mean_action_time age gender days_on_FB subscription_cnt subscriber_cnt friend_count active_last_month session_count mean_length mean_break start_hour mean_story_time Figure 4.1: Correlation coecient between the features. Boxes with the cross have statically signicant correlation. 
Table 4.2: Result of logistic regression on the independent variables for the length of the session. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05

    Rank  Variable                     p-val
    1     Total # of interactions      ***
    2     # of friends                 *
    3     # of sessions                ***
    4     Mean length of sessions      ***
    5     Previous break length        ***
    6     Mean interaction time        ***
    7     Time of day session started  ***

Table 4.3: Prediction accuracy using different sets of features.

    Prediction         Feature Group         Accuracy
    Session Length     User characteristics  37.4%
                       Session activity      42.5%
                       User history          43.1%
    Number of Stories  User characteristics  37.6%
    Viewed             Session activity      39.4%
                       User history          46.6%
    Return Time        User characteristics  36.5%
                       Session activity      82.8%
                       User history          42.8%

In this case we create three classes for the number of stories viewed in the session using thresholds that result in roughly balanced classes (the classes include 35.1%, 32.6%, and 32.3% of the data). For predicting the number of stories viewed in a session, our classifier achieves an F1 score of 0.48 and an accuracy of 49.7%. This is a 41.6% improvement over the majority vote baseline, which has an accuracy of 35.1%, and a 42.2% improvement over the prediction using the same probabilities that the user had in the training data, which has an accuracy of 35.3%. Our classifier is able to predict how many stories the person will consume in a session, which could be used to determine the amount of content to cache prior to the user's session.

The most predictive features in this prediction problem (Table 4.4) are all from the earlier behavior of the person in the training data (Table 4.3(b)), e.g., the mean number of stories viewed per session. Similar to the session length prediction problem, we run a logistic regression on the independent variables to find the role of each feature in the prediction. Table 4.5 shows the result of the regression: the mean time spent on different interactions, the number of sessions, and the number of days that the user has been on Facebook are the top three features for the prediction of the number of stories read in the session.

Table 4.4: Top 5 features for predicting the number of stories viewed in a session and their information gain.

    Rank  Feature                            Info. gain
    1     Mean # of stories viewed           0.137
    2     75th perc. of # of stories viewed  0.137
    3     90th perc. of # of stories viewed  0.128
    4     Median # of stories viewed         0.120
    5     STD of # of stories viewed         0.107

Table 4.5: Result of logistic regression on the independent variables for the number of stories read in the session. *** p-value < 0.001

    Rank  Variable                 p-val
    1     Mean interaction time    ***
    2     # of sessions            ***
    3     # days on Facebook       ***
    4     Mean session length      ***
    5     Age                      ***
    6     Previous break length    ***
    7     Total # of interactions  ***

4.1.3 Return Time

In addition to predicting the length of the session and the number of stories consumed, we can also predict when the person will return to Facebook. Similar to the previous predictions, we use two thresholds that result in roughly equal-sized classes (each class includes 33.3% of the data). We use the same features, and at the end of each session we predict the time to the next session.
4.1.3 Return Time

In addition to predicting the length of the session and the number of stories consumed, we can also predict when the person will return to Facebook. Similar to the previous predictions, we use two thresholds that result in roughly equal-sized classes (each class includes 33.3% of the data). We use the same features, and at the end of each session we predict the time to the next session. Using the top features, our classifier achieves an F1 score of 0.79 and a very high accuracy of 79.0%, which is significantly higher than for the other predictions and is a 137.2% relative improvement over the majority vote baseline. Interestingly, this accuracy is achieved using only four features from the feature selection algorithm: the length of the session that just ended, the mean time spent on interactions, the median return time, and the number of posts the person has modified (such as modifying the caption on a photo).

Also, using only the length of the session that just ended achieves 73.5% accuracy, which is considerably higher than the baseline. If we use the more complex baseline, which considers the history of the user, we achieve a much lower accuracy of only 36.7%. Table 4.6 shows the top five most predictive features, with the length of the session being the most predictive feature by far. It is surprising that the length of the session is such a strong predictor of the length of the break, since the length of the break is not predictive of the length of the session. In other words, people who stay longer on Facebook tend to take longer to return, but people who return to Facebook after a long break do not necessarily stay on Facebook longer. Finally, if we group the features, we observe that session activity has more predictive power than features extracted from the user's earlier behavior (Table 4.3(c)).

Table 4.6: Top 5 features for predicting the break time and their information gain.
Rank  Feature                           Info. gain
1     Session length                    0.800
2     Mean time spent on interactions   0.418
3     Number of interactions            0.138
4     Number of clicks                  0.121
5     Number of stories viewed          0.104

Since our classifier uses historical information from users' earlier behavior, we cannot predict new users' behavior with the same accuracy. This problem, i.e., the cold start problem, is common in recommender systems. One way to mitigate it is to use information from users with the same characteristics to replace the missing features.

Predicting return time with such high accuracy can be extremely useful for caching content on a user's mobile phone, by having the data ready for browsing before the user starts using the application. This could greatly improve the user experience, especially in areas with poor network connectivity.

We also run a logistic regression to find the role of each feature. Gender, age, and the mean session length are the top three features in the prediction of break length (Table 4.7).

Table 4.7: Result of logistic regression on the independent variables for the break length. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
Rank  Variable                       p-val
1     Gender                         *
2     Age                            ***
3     Mean session length            ***
4     # of sessions                  ***
5     Time of day session started    ***
6     Total # of interactions        ***
7     # active days in last 30 days  ***
8     Mean interaction time          ***

Table 4.8 summarizes all the prediction results.

Table 4.8: Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, or predicting randomly (same group sizes).
Prediction      Majority vote  Probabilistic baseline  Our classifier  Absolute imprv.  Relative imprv.  F1
Session length  37.0%          29.3%                   48.3%           11.3%            30.5%            0.44
# of stories    35.1%          35.2%                   49.7%           14.6%            41.6%            0.48
Return time     33.3%          36.7%                   79.0%           45.7%            137.2%           0.79
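A minimal sketch of the cold-start mitigation described above: replacing a new user's missing history features with the average of users who share the same demographic characteristics. The DataFrame and column names are hypothetical.

# Minimal sketch: fill missing history features of new users with the mean
# value of users in the same demographic bucket (age group and gender).
import pandas as pd

users = pd.DataFrame({
    "age_group": ["18-24", "18-24", "25-34", "25-34", "18-24"],
    "gender":    ["f", "m", "f", "f", "f"],
    "mean_session_length": [12.0, 8.0, 20.0, 16.0, None],  # None = new user
})
users["mean_session_length"] = (
    users.groupby(["age_group", "gender"])["mean_session_length"]
         .transform(lambda s: s.fillna(s.mean()))
)
print(users)  # the new user's feature is filled with the (18-24, f) group mean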
4.2 Email

For email, we test whether the features studied in the previous chapter are suitable for predicting the behavior of users. In particular, we try to predict the time that a user will take to reply to a message, the length of the reply, and whether a message will end the thread. More than just measuring the performance in terms of accuracy, prediction allows us to quantify the importance of features in describing emailing behavior.

We use the same features for all three analyses. The features include: mean, median, and earlier reply times and reply lengths between the pair of users (20 features); age and gender of the sender and receiver (4 features); step of the thread (1 feature); statistics on the number of received, sent, and replied emails for the sender and receiver (18 features); statistics on the number of contacts of the sender and receiver (18 features); statistics on the length of all emails sent and received by the sender and receiver (18 features); the time of day and day of the week that the email was received (2 features); the number of attachments (1 feature); and whether the user has used a phone or tablet earlier (1 feature). Overall, we consider 83 features.

4.2.1 Predicting Reply Time

We start by predicting the reply time of emails within dyadic conversations. For each pair, we use the first 75% of the replies for training and the last 25% for testing, so that we do not use any future emails when predicting the current reply time. Predicting the exact reply time is a hard problem, so we simplified it by considering classes of replies; in practice, knowing whether a reply will arrive shortly or after a long delay is very useful even without the exact reply time. We consider three balanced classes of replies: immediate replies that happen within 15 minutes (33.5% of the data), fast replies that happen after 15 minutes but before 164 minutes (33.1% of the data), and slow replies that take longer than 164 minutes (33.4% of the data). Here our baseline is the largest class (majority vote), which contains 33.4% of the training data. We experiment with a variety of machine learning algorithms, and the bagging algorithm yields the best results, with a Root Mean Squared Error (RMSE) of 0.420 and an accuracy of 55.8%, a 22.4% absolute improvement and a 67.1% relative improvement over the baseline. If an application only needs to distinguish between immediate and slow replies, we can eliminate the middle class and predict only the two classes of immediate and slow replies. In this case, we achieve a much higher accuracy of 79.5%, compared to the baseline of 50.1%.

We also rank the features by their predictive power, computing the value of the χ² statistic with respect to the class. Table 4.9 shows the top 5 features with the highest predictive power; all of them come from the history of the reply times between the users. The feature with the highest predictive power is the median of the replier's earlier reply times. So, if we want to use only one feature to guess the reply time of a message, the typical reply time of the replier would be the most useful one, which makes intuitive sense. Interestingly, if we select from all 83 features only 7 that have high predictive power and low overlap, the accuracy is 58.3%, comparable to the accuracy obtained with all the features. All 7 features represent the earlier history of reply times between the pair of users.

Table 4.9: Top 5 most predictive features for predicting reply time and their χ² value.
Rank  Feature                            χ² value
1     Replier's median reply time        6,374
2     Receiver's median reply time       4,839
3     Replier's last reply time          4,528
4     Receiver's last reply time         4,157
5     Replier's 2nd to last reply time   3,259
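A minimal sketch of this setup, assuming the reply-history features have already been computed: a bagging classifier trained on the earlier portion of the data, plus a χ² ranking of the (non-negative) features. The data are randomly generated stand-ins, not the actual email corpus.

# Minimal sketch: bagging classifier for the three reply-time classes, with
# a chi-squared ranking of the features; data are illustrative stand-ins.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.integers(0, 100, size=(2000, 5)).astype(float)   # reply-history features
y = np.digitize(X[:, 0] + rng.normal(scale=20.0, size=2000), [33, 66])  # 3 classes

# First 75% for training, last 25% for testing (no shuffling, as in the text).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, shuffle=False)
clf = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))

scores, _ = chi2(X_tr, y_tr)            # chi2 requires non-negative features
print("features ranked by chi2:", np.argsort(scores)[::-1])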
4.2.2 Predicting Reply Length

Next, we take the same approach to predict the length of the reply that is going to be sent. We use the same set of features as in the previous section and again use the first threads of emails between a pair of users for training and the rest for testing. Again, we divide our data into three balanced classes: short replies of 21 words or fewer (33.1% of the data), medium-length replies longer than 21 words but no longer than 88 words (33.6% of the data), and long replies of more than 88 words (33.3%). A naive classifier that always predicts the largest class would have 33.6% accuracy, which is our baseline. Using the bagging classifier we achieve an accuracy of 71.8%, much higher than for the prediction of reply time. Our classifier has a 38.2% absolute improvement and a 113.7% relative improvement over the baseline. As with reply times, we eliminate the middle class to calculate the accuracy for distinguishing short and long replies. Our classifier assigns the correct class in 89.5% of cases, well above the 50.2% baseline.

We use the χ² statistic to rank the features by their predictive power (Table 4.10). All of the top 5 features come from the earlier reply lengths of the replier. Unlike for reply time, no feature related to the receiver's activity appears in the top 5. This suggests that the length of the other person's replies has a weaker effect on the length of the outgoing reply than the other party's reply time has on the reply time of the replier. We also try the 8 features that have high predictive power; these top 8 features alone are only slightly less predictive than all the features (by 0.2%).

Table 4.10: Top 5 most predictive features for predicting length of reply and their χ² value.
Rank  Feature                               χ² value
1     Replier's average reply length        12,953
2     Replier's last reply length           12,509
3     Replier's median reply length         11,558
4     Replier's 2nd to last reply length    9,476
5     Replier's 3rd to last reply length    7,595

4.2.3 Predicting the End of the Thread

Finally, we use the same approach to predict whether a reply is the last reply in a thread. The baseline for this prediction problem is 50.6%, and bagging classification yields an accuracy of 65.9%, i.e., a 15.3% absolute improvement and a 30.2% relative improvement over the baseline. Table 4.11 shows the top 5 predictive features for predicting the last email in a thread; interestingly, all of the top features relate to the information load, in terms of the number of words, on the replier or receiver.

Table 4.11: Top 5 most predictive features for predicting the last email in a thread and their χ² value.
Rank  Feature                                      χ² value
1     Receiver's avg # of words received/day       2,160
2     Replier's avg # of words received/day        1,981
3     Receiver's median # of words received/day    1,935
4     Receiver's avg # of words sent/day           1,884
5     Replier's median # of words received/day     1,872
Table 4.12 summarizes our results for the three prediction problems. Besides the majority vote baseline, we also considered the last reply and the most used reply time and length as additional baselines. These baselines perform better than the majority vote, but our classifier outperforms all three: the relative improvement is 17.1% over the last reply time and 5.3% over the last reply length, and over the most used baseline the relative improvement is 30.4% for reply time and 58.9% for reply length.

Table 4.12: Summary of the prediction results. Accuracy: percentage of correctly classified samples. AUC: weighted average of the Area Under the Curve for the classes. RMSE: Root Mean Square Error. The improvements are reported over the majority vote baseline.
Prediction    Majority vote  Last reply  Most used  Our classifier  Absolute imprv.  Relative imprv.  AUC
Reply time    33.4%          50.2%       45.1%      58.8%           22.4%            67.1%            0.715
Reply length  33.6%          68.2%       45.2%      71.8%           38.2%            113.7%           0.865
Last email    50.6%          -           -          65.9%           15.3%            30.2%            0.761

Chapter 5
Modeling Online Behavior Using Digital Traces

In this chapter, we consider a broader scope than short-term behavioral changes and study behavioral changes in large-scale data. In particular, we focus on consumer behavior extracted from receipts that have been sent to users via email. We start by analyzing online shopping for goods on websites such as Amazon, eBay, and Walmart. Then, we focus on the ride-sharing service Uber and study different factors that affect people's engagement. Finally, we study spending on the iPhone's digital market, i.e., money spent on buying apps and songs. In all these data sets, significant heterogeneity exists, and we show how failing to handle it correctly can result in finding false trends. We then explain an approach for overcoming this problem.

5.1 Online Shopping

Consumer spending is an integral component of economic activity. In 2013, it accounted for 71% of the US gross domestic product (GDP) [1], a measure often used to quantify economic output and general prosperity of a country. Given its importance, many studies have focused on understanding and characterizing consumer behavior. Researchers examined gender differences and motivations in shopping [Dholakia, 1999b, Hayhoe et al., 2000], as well as spending patterns across urban areas [Sobolevsky et al., 2015].

In recent years, shopping has increasingly moved online. Consumers use the internet to research product features, compare prices, and then purchase products from online merchants, such as Amazon and Walmart. Moreover, platforms like eBay allow people to directly sell products to one another. While there exist concerns about the risks and security of online shopping [Bhatnagar et al., 2000, Perea y Monsuwé et al., 2004, Teo, 2002], large numbers of people, especially younger and wealthier ones [Horrigan, 2008, Swinyard and Smith, 2003], choose online shopping even when similar products can be purchased offline [Farag et al., 2007]. In fact, online shopping has grown significantly, with an estimated $1,471 billion spent online in 2014 in the United States alone by 191 million online shoppers [2].

Most of these online purchases result in a confirmation or shipment email sent to the shopper by the merchant. These emails provide a rich source of evidence to study online consumer behavior across different shopping websites. Unlike previous studies [Pavlou and Fygenson, 2006], which were based on surveys and thus limited to relatively small populations, we used email data to perform a large-scale study of online shopping. Specifically, we extracted information about 121 million purchases amounting to 5.1B dollars made by 20.1 million shoppers, who are also Yahoo Mail users.

[1] https://research.stlouisfed.org/fred2/series/PCE/
[2] http://www.statista.com/topics/871/online-shopping/
The information we extracted included the names of purchased products, their prices, and purchase timestamps. We used the email user profile to link this information to demographic data, such as gender, age, and zip code. This information enabled us to characterize patterns of online shopping activity and their dependence on demographic and socio-economic factors. We found that, for example, men generally make more purchases and spend more on online purchases. Moreover, online shopping appears to be widely adopted by all ages and economic classes, although shoppers from more affluent areas generally buy more expensive products than less affluent shoppers.

Looking at temporal factors affecting online shopping, we found patterns similar to other online activity [Kooti et al., 2015]. Not surprisingly, online shopping has daily and weekly cycles, showing that people fit online shopping routines into their everyday life. However, purchasing decisions appear to be correlated: the more expensive a purchase, the longer the shopper has to wait since the previous purchase to buy it. This can be explained by the fact that most shoppers have a finite budget and have to wait longer between purchases to buy more expensive items.

In addition to temporal and demographic factors, social networks are believed to play an important role in shaping consumer behavior, for instance by spreading information about products through the word-of-mouth effect [Rodrigues et al., 2011]. Previous studies examined how consumers use their online social networks to gather product information and recommendations [Guo et al., 2011, Gupta et al., 2014], although the direct effect of recommendations on purchases was found to be weak [Leskovec et al., 2007]. In addition, people who are socially connected are generally more similar to one another than unconnected people [McPherson et al., 2001], and hence, they are more likely to be interested in similar products. Our analysis confirmed that shoppers who are socially connected (because they email each other) tend to purchase more similar products than unconnected shoppers.

Once we understand the factors affecting consumer behavior, we are then able to predict it. Given users' purchase history and demographic data, we attempt to predict the time of their next purchase, as well as its price. Our method attains a relative improvement of at least 49.8% over the random baseline for predicting the price of the next purchase and a 36.4% relative improvement over the random baseline for predicting the time of the next purchase. Interestingly, demographic features were the least useful in the prediction task, while temporal features carried the most discriminative information. The contributions of the paper are summarized below:

- Introduction of a unique and very rich data set about consumer behavior, extracted from the purchase confirmations merchants send to buyers;
- A quantitative analysis of the impact of demographic, temporal, and network factors on consumer behavior;
- Prediction of consumer behavior, specifically predicting when the next purchase will occur and how much money will be spent.

A better understanding of consumer behavior can benefit consumers and merchants, as well as advertisers. Knowing when consumers are ready to spend money and how much they are willing to spend can improve the effectiveness of advertising campaigns, and prevent consumers from wasting their resources on unnecessary purchases.
Understanding these patterns can help make the online shopping experience more efficient for consumers. Considering that consumer spending is such a large portion of the economy, even a small efficiency gain can have dramatic consequences for overall economic activity.

Yahoo Mail is one of the world's largest email providers, with more than 300M users [3], and many online shoppers use Yahoo Mail for receiving purchase confirmations. We select these emails by using a list of merchants' email addresses. Applying a set of manually written templates to the email body, we extract the list of purchased item names and the price of each item. The item name was used as input to a classifier that predicts the item category. We used a 3-level-deep, 1,733-node Pricegrabber taxonomy [4] to categorize the items; the details of the categorization are beyond the scope of this paper. In the case of multiple items purchased in a single order, we consider them as individual purchases occurring at the same time. Therefore, throughout the paper the expression "purchase" refers to a purchase of a single item. We limit our study to a random subset of Yahoo Mail users in the US; for all of these users, age, gender, and zip code information is also available from the Yahoo user database. We excluded users who made more than 1,000 purchases (less than 0.01% of the sample), because these accounts probably belong to stores and not to individuals. Overall, our data set contains information on 20.1M users, who collectively made 121M purchases from several top retailers between February and September of 2014, amounting to a total spending of 5.1B dollars. The data set includes messages belonging exclusively to users who voluntarily opted in for such studies. All the analysis has been performed in aggregate and on anonymized data.

[3] http://www.comscore.com/
[4] http://www.pricegrabber.com
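As a concrete illustration of the template-based extraction described above, the sketch below pulls item names and prices out of a purchase-confirmation email with a regular expression. The email format and the pattern are hypothetical; real merchant templates differ, and each needs its own pattern.

# Minimal sketch: extract (item, price) pairs from a hypothetical purchase-
# confirmation email body using a hand-written template (regular expression).
import re

email_body = """Thank you for your order!
Item: Wireless Mouse  Price: $24.99
Item: USB-C Cable  Price: $8.50
"""

pattern = re.compile(r"Item:\s+(?P<name>.+?)\s+Price:\s+\$(?P<price>\d+\.\d{2})")
purchases = [(m["name"], float(m["price"])) for m in pattern.finditer(email_body)]
print(purchases)  # [('Wireless Mouse', 24.99), ('USB-C Cable', 8.5)]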
5.1.1 Purchase pattern analysis

We present a quantitative analysis of factors affecting online purchases. We examine the role of demographic, temporal, and social factors, including gender and age, daily and weekly patterns, frequency of shopping, tendency toward recurrent purchases, and budget constraints.

[Figure 5.1: Demographic analysis. (a) Percentage of online shoppers, (b) number of items purchased, (c) average price of products purchased, and (d) total spent by men and women, broken down by age.]

Demographic Factors

We measure how gender, age, and location (zip code) affect purchasing behavior. First, we measure the fraction of all email users that made an online purchase. We find that a higher fraction of women make online purchases compared to men (Figure 5.1(a)), but men make slightly more purchases per person (Figure 5.1(b)), and they spend more money, on average, on online purchases (Figure 5.1(c)). As a result, men spend much more money in total (Figure 5.1(d)). The same patterns hold across different age groups.

Table 5.1: Differences in the categories of products purchased by women and men.
Rank  Top categories  Distinctive women  Distinctive men
1     Android         Books              Games
2     Accessories     Dresses            Flash memory
3     Books           Diapering          Light bulbs
4     Vitamins        Wallets            Accessories
5     Shirts          Bracelets          Batteries

All these plots, except Figure 5.1(a), back up findings from earlier consumer surveys, which revealed that men have a higher perceived advantage of online shopping [van, 2005] and that women have a higher concern about the negative consequences of online purchasing [Garbarino and Strahilevitz, 2004], resulting in a higher number of purchases made by men. With respect to age, spending ability increases as people get older, peaking among the population between ages 30 and 50 and declining afterwards. The same pattern holds for the number of purchases made, average item price, and total money spent (Figures 5.1(b), 5.1(d), 5.1(c)).

We also measure the impact of economic factors on online shopping behavior. We use US Census data to retrieve the median income associated with each zip code [5]. The inferred income for a user is an aggregated estimate but, given the large size of this data set, this coarse appraisal is enough to observe clear trends. The number of purchases, average product price, and total money spent (Figures A.6(a), A.6(b), and A.6(c), respectively) are all positively correlated with income. While users living in high-income zip codes do not buy substantially more expensive products, they make many more purchases, spending more money in total than users from lower-income zip codes. Although the factors leading lower-income households to spend less online are multiple and entangled, part of this effect can be explained by the reluctance of people who are concerned with their financial safety to trust and make full use of online shopping, as has been pointed out by previous studies [Horrigan, 2008].

[5] http://www.boutell.com/zipcodes/zipcode.zip

Finally, we study the dynamics of individual purchasing behavior. Figure 5.2 shows the distribution of the number of days between purchases. The distribution is heavy-tailed, indicating bursty dynamics. The most likely time between purchases is one day, and there are local maxima around multiples of 7 days, consistent with the weekly cycles we observed.

[Figure 5.2: Distribution of number of days between purchases.]

5.1.2 Predicting purchases

Predicting the behavior of online shoppers can help e-commerce sites, on the one hand, to improve the shopping experience by means of personalized recommendations and, on the other hand, to better meet merchants' needs by delivering targeted advertisements. In a recent study, Grbovic et al. addressed the problem of predicting the next item a user is going to purchase using a variety of features [Grbovic et al., 2015]. In this work, we consider the complementary problems of predicting i) the time of the next purchase and ii) the amount that will be spent on that purchase. Predicting the exact time and price of a purchase (e.g., using regression) is a very hard problem, so we focused on the simpler classification task of predicting the class of the purchase among a finite number of predefined price or time intervals. We experimented with different classification algorithms, and Bayesian network classification yielded the highest accuracy. To estimate the conditional probability distributions, we used direct estimates of the conditional probability with α = 0.5. The classifier was trained on the first six months of purchase data and evaluated on the last two.
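The exact Bayesian network setup is not reproduced here; as a rough, simplified stand-in, the sketch below trains a categorical naive Bayes classifier with the same smoothing value (α = 0.5) on discretized purchase-history features. All data are randomly generated for illustration.

# Minimal sketch: a smoothed probabilistic classifier (alpha = 0.5) over
# discretized purchase-history features, as a simplified stand-in for the
# Bayesian network classifier used in the text; data are illustrative.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(2)
X = rng.integers(0, 5, size=(5000, 4))               # e.g., recent price classes
y = (X[:, 0] + rng.integers(0, 2, size=5000)) % 5    # noisy stand-in target

split = int(0.75 * len(X))                            # earlier months for training
clf = CategoricalNB(alpha=0.5).fit(X[:split], y[:split])
print("accuracy:", clf.score(X[split:], y[split:]))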
From each entry we extracted 55 features belonging to a variety of categories:

- Demographics of online shoppers (4 features): gender, age, location (zip code), and income (based on zip code).
- Purchase price history (19 features): price of the last three purchases, price category of the last three purchases, number of purchases, mean price of purchased items, median price of purchased items, total amount of money spent, standard deviation of item prices, number of earlier purchases in each price group (5 groups), the price group with the most purchases and its count, and the total number of purchases until that point.
- Purchase time history (13 features): time of the last three purchases, mean time between purchases, median time between purchases, standard deviation of times between purchases, number of earlier purchases in each time group, and the time group with the most purchases and its count.
- Purchase history of products (4 features): the last three categories of products purchased, and the most purchased category.
- Time or price of the next purchase (1 feature): we also assume that we know when the next purchase is going to happen. This seems unrealistic at first, but we include this feature because the system is going to make recommendations at a given time, and we assume that the buyer is going to make the decision at that time. To have a symmetrical problem, we also consider the price of the next purchase, which is similar to knowing the budget of the user.
- Contacts (14 features): mean, median, standard deviation, minimum, maximum, and 10th and 90th percentiles of the prices and times of the purchases made by the user's contacts.

For the aggregated features, such as the average price of items purchased, we used only purchases in the training period and did not consider future information. We compared the results of our classifier to three baselines:

- Random prediction.
- The price or time class of the previous purchase.
- The most popular price or time class of the target user's earlier purchases.

Price of the Next Purchase

We partition prices into five classes using $6, $12, $20, and $40 as price thresholds to obtain equally-sized partitions. These thresholds represent (a) very cheap products that cost less than $6 (20.7% of the data), (b) cheap products between $6 and $12 (20.3%), (c) medium-priced products between $12 and $20 (19.3%), (d) expensive products that cost more than $20 but less than $40 (19.9%), and finally (e) very expensive products worth more than $40 (19.8%). Our classifier achieves an accuracy of 31.0%, a +49.8% relative improvement over the 20.7% accuracy of the random classifier (i.e., the relative size of the largest class). The category of the last purchase and the most frequent purchase category turn out to be quite strong predictors, achieving accuracies of 29.3% and 29.8% on their own. The supervised approach outperforms them, but with only a +5.8% and +4.0% relative improvement, respectively.

When measuring the predictive power of the features with the χ² statistic (Table 5.2), we find that the feature with the highest predictive power is, by far, the most frequent class of earlier purchases. This suggests that users tend to buy mostly items in the same price bracket. The second feature in the ranking is the number of purchases from the very cheap category, followed by the median and mean of earlier prices. In general, all of the top 16 most informative features are related to the price of earlier purchases. After those, the median time between purchases and the time delay before the last purchase are the most predictive features. The relatively high position of the last time delay in the feature ranking suggests that a recommender system should consider the time that has passed since the user's last purchase and change its suggestions dynamically. In other words, if the user has made a purchase recently, cheaper products should be favored over more expensive ones, whereas if a long period of time has passed since the last purchase, more expensive products should be advertised to the user, as they are more likely to be purchased. All of the demographic features have limited predictive power and are ranked last (though demographics may affect the purchase history), with income being the most important among them.

Table 5.2: Top predictive features for prediction of the price of the next item and their χ² value.
Rank  Feature                             χ² value
1     Most used class earlier             214,996
2     Number of under-$6 purchases        115,560
3     Median price of earlier purchases   106,876
4     Mean price of earlier purchases     91,409
5     Number of over-$40 purchases        84,743
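For reference, the two history-based baselines described above are trivial to compute from a user's earlier purchases. A minimal sketch, with a made-up purchase history:

# Minimal sketch: the "last used" and "most used" baselines for one user,
# given the price classes of their earlier purchases (illustrative data).
from collections import Counter

history = [0, 2, 2, 1, 2, 0, 2]                    # earlier price classes

last_used = history[-1]                            # class of the previous purchase
most_used = Counter(history).most_common(1)[0][0]  # most frequent class so far
print(last_used, most_used)                        # 2 2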
Time of the Next Purchase

Similarly to purchase price, prediction of purchase time could be leveraged to make better use of the advertisement space. If the user is unlikely to purchase anything for a certain period of time, ads can be momentarily suspended or replaced with ads that are not related to consumer goods. For creating the categories, we choose thresholds of 1, 5, 14, and 33 days: very short delays are within a day (22.8% of our data), short delays between 1 and 5 days (20.9%), medium delays between 5 and 14 days (19.6%), long delays between 14 and 33 days (18.2%), and very long delays exceed 33 days (18.5%). Training a Bayesian network on all the features yields an accuracy of 31.1%, a +36.4% relative improvement over the 22.8% accuracy of the random prediction baseline. The accuracy of our classifier is also +24.9% relatively higher than the baseline of predicting the last purchase delay, which has an accuracy of 24.9%. Finally, the most frequent delay class has an accuracy of 22.2%, which our classifier outperforms by +40.1% relatively.

Table 5.3: Top predictive features for prediction of the time of the next purchase and their χ² value.
Rank  Feature                         χ² value
1     Number of earlier purchases     48,719
2     Median time between purchases   35,558
3     Time since the first purchase   30,741
4     Previous time delay             30,692
5     Class of previous time delay    22,710

Ranking features by their χ² (Table 5.3), we find that the most informative feature is the number of earlier purchases the user has made so far, followed by the median time delay, the previous purchase delay, the time since the first purchase, and the class of the previous purchase delay. To summarize, we trained two classifiers for predicting the price and the time of the next purchase. Our algorithm outperformed the baselines in both prediction tasks, by a higher margin in the case of predicting the time. Table 5.4 summarizes all of our results, showing a relative improvement of at least 49.8% for predicting the price of the next item purchased and 36.4% for predicting the time of the next purchase over the majority vote baseline. Interestingly, user demographics were not particularly helpful for making any prediction, and the correlations observed in earlier sections of the paper are masked by other features, such as the history of prior purchases.
Conclusion

Studying online consumer behavior as recorded by email traces allows us to overcome the limitations of previous studies, which focused either on small-scale surveys or on purchase logs from individual vendors. In this work, we provide the first very large-scale analysis of user shopping profiles across several vendors and over a long time span.

Table 5.4: Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, or predicting randomly. Most used: the group the user had the most in earlier purchases. AUC: weighted average of the Area Under the Curve for the classes. RMSE: Root Mean Square Error. The improvements are reported over the majority vote baseline.
Prediction     Majority vote  Last used  Most used  Our classifier  Absolute imprv.  Relative imprv.  AUC
Item price     20.7%          29.3%      29.8%      31.0%           10.3%            49.8%            0.611
Purchase time  22.8%          24.9%      22.2%      31.1%           8.3%             36.4%            0.634

We measured the effect of age and gender, finding that spending ability goes up with age until about 30, stabilizes, and then starts dropping in the early 60s. Regarding gender, a female email user is more likely to be an online shopper than an average male email user. On the other hand, men make more purchases, buy more expensive products on average, and spend more money. Younger users tend to buy more phone accessories compared to older users, whereas older users buy TV shows and vitamins & supplements more frequently. Using user location, we show a clear correlation between income and the number of purchases users make, the average price of the products purchased, and the total money spent. Moreover, we study the cyclic behavior of users, finding weekly patterns where purchases are more likely to occur early in the week and much less frequently on weekends. Also, most purchases happen during work hours, from morning until early afternoon. We complement the purchase activity with the network of email communication between users. Using the network, we test whether users who communicate with each other make more similar purchases than a random set of users, and we find that this is indeed the case. We also consider the gender of the users and find that woman-woman pairs are the most similar, followed by man-man pairs, and then woman-man pairs. Finally, we use our findings to build a classifier to predict the price and the time of the next purchase. Our classifier outperforms the baselines, especially for the prediction of the time of the next purchase, and can be used to make better recommendations to users.

Our study comes with a few limitations as well. First, we can only capture purchases for which a confirmation email has been delivered; we believe this is the case for most online purchases nowadays. Second, if users use different email addresses for their purchases, we would not have their full purchase history. Similarly, people can share a purchasing account to enjoy some benefits (e.g., an Amazon Prime account shared between multiple people), but that occurs rarely, as suggested by the fact that less than 0.01% of the users have goods shipped to more than one zip code. Third, the social network that we considered, albeit big, is not complete. However, the network is large enough to observe statistically significant results. Lastly, we considered items that were purchased together as separate purchases; it would be interesting to see which items are usually bought together in the same transaction.
5.2 Uber

The rapid growth of the sharing economy, exemplified by the ride-sharing platforms Uber and Lyft and the home-sharing platforms Airbnb and Couchsurfing, is changing the patterns of ownership and consumption of goods and services. In a sharing economy, consumers exchange services in a peer-to-peer fashion, through matching markets facilitated by social networks and online applications. Instead of owning a car or hailing a taxi, ride-sharing services enable consumers to request rides from other people who own private vehicles, or in turn, become drivers offering rides to others. Similarly, home-sharing services enable consumers to stay in private homes while traveling, or to offer rooms in their own homes as short-term rentals to others. The various benefits provided to consumers, such as convenience, cost savings, the possibility of extra income, and new social interactions, have fueled the sharing economy's dramatic growth [Hamari et al., 2015].

Arguably, Uber, along with Airbnb, is one of the most successful sharing economy markets. Founded in 2009, Uber is an online marketplace for riders and drivers. Riders use a smartphone app to request rides. Ride requests are assigned to Uber drivers, who use their own vehicles to provide the rides. Lower prices, short wait times, and the convenience of easy ride requests and payment are considered the main reasons for Uber's popularity among riders [Horpedahl, 2015], while the flexibility of the work schedule and higher compensation rates are among the main reasons making Uber attractive to drivers [Hall and Krueger, 2015]. Uber has grown wildly popular, providing more than a million daily rides as of December 2014 [6], and is the most valued venture-backed company, with a valuation of $62.5B as of December 2015 [7].

Uber's popularity makes it attractive for studies aimed at understanding participation in the sharing economy. But the system is still not well understood. Specifically, what are the characteristics of Uber riders and drivers? What effect do different factors, such as promotions, rider-driver matching, and dynamic (or surge) pricing, have on user participation and retention? Can these factors and characteristics be used to accurately predict users' behavior on Uber, particularly whether a new user will become an active user?

In this work, we study Uber data that contains information about 59M rides taken by 4.1M people over a seven-month period, along with data about 222K drivers over the same time period. This information is extracted from the email confirmation messages sent by Uber to riders after each ride, as well as from weekly reports sent to drivers. The ride email receipts include information about rides, such as the fare, trip length, pick-up and drop-off times and locations, and the driver's name. The weekly driver reports include the driver's earnings, the number of rides given in that week, and ratings. By analyzing usage and demographics of the population of Uber users, we find that an average active Uber rider is a young mid-20s individual with an above-average income. In addition, various demographic groups exhibit differences in their behavior: e.g., younger riders tend to take more rides, older riders take longer and more expensive rides, and more affluent riders take more rides and are more likely to use the more expensive types of cars, such as Uber Black.

[6] newsroom.uber.com/our-commitment-to-safety
[7] nyti.ms/1XD9cdT
We present a detailed demographic analysis of Uber riders and drivers, in terms of age, gender, race, income, and the times of the rides. Our main findings are as follows:

- Uber is not an "all-serve-all" market. Riders have higher income than drivers and differ along racial and gender lines.
- Rider and driver attrition is very high, but the influx of newcomers leads to an overall growth in the number of rides.
- We identify characteristics of riders and drivers who become active users.
- Better matches of riders to drivers result in higher ratings.
- Surge pricing does not favor more affluent riders, but mostly affects younger riders (who use the service during peak times, including weekend nights). Drivers with many surge rides receive lower ratings, on average, suggesting the riders' dislike of surge pricing.
- Using users' initial activity on Uber, we can predict whether a rider or driver will become active or leave Uber.

This work presents an in-depth analysis of the ride-sharing market from large-scale Uber data covering both riders and drivers. Our analysis reveals the demographic and socioeconomic factors that affect participation in the ride-sharing market, and enables us to predict who will become an active market participant. Since consumer retention is generally much cheaper than consumer acquisition [Reichheld and Sasser, 1990], detecting customers who are likely to stop using Uber could help improve consumer retention. For example, by offering promotions to people who are likely to drop out, Uber could stimulate them to remain active users of its services.

Every time a rider takes a ride, Uber emails a receipt shortly thereafter. This email has a specific format, making it easy to parse. The email includes the following information: pick-up and drop-off times, origin and destination addresses, duration of the ride, distance traveled, type of the car (UberX, UberBlack, etc.), the driver's first name, and the fare, along with a breakdown of the price, including whether or not a promotion code was used and whether the surge multiplier was applied (during peak hours the fare is multiplied by a value called the surge). We obtained information about the Uber rides of Yahoo Mail users using an automated extraction pipeline that preserves the anonymity of the rider and the driver. In total, we study data on over 59M rides taken by 4.1M users over a period from October 2015 to May 2016.

[Figure 5.3: Daily number of rides in our data set.]

In Figure 5.3, we show the number of rides taken on each day. There is a strong weekly pattern, with many rides taking place during the weekends. Also, some holidays, such as New Year's Eve and Halloween, result in large peaks in the number of rides, while other holidays, like Christmas, result in a drop in the number of rides.

Drivers receive two separate weekly emails. One email includes the hours they worked each day of that week, the percentage of busy hours worked, riders' textual feedback (if any), their average rating over the week, and whether that rating is higher or lower than the driver average. The other email includes the money earned each day of the week. Both emails refer to the drivers by their first name. These emails also have consistent formats, making this information easy to extract. Our data set includes more than 1.9M weekly summaries for 222K drivers, which were extracted while preserving the anonymity of the driver.
Moreover, whenever a person joins Uber, they receive a welcome email. So, besides the ride information, we know when a user joined Uber, either as a rider or a driver.

In addition to Uber emails, we relied upon the Yahoo Mail network graph over the same period of time. The email graph G consists of pairs of hashed user ids that communicated with each other. For the present analysis, we consider only the subgraph of G induced by the two-hop neighborhood of the users who are Uber riders and/or drivers. Finally, since both riders and drivers are Yahoo Mail users, we also have access to their demographic information, including age, gender, and location at the zip code level. In this study, we conducted our analysis only on users from the US, unless otherwise stated. Further, only for the purposes of this study, we produced income and race estimates for all riders and drivers. Since Yahoo does not collect declared income or race information during the sign-up process, we derived estimates using publicly available US census data that contains race and income distributions for each zip code. Specifically, all drivers and riders from a given zip code were assigned the median income and race associated with that zip code. The inferred income and race for a user are aggregated estimates (we do not know the ground truth for any specific user) but, given the large size of this data set, this coarse appraisal is enough to observe clear trends.

5.2.1 Understanding Riders

In this section, we examine the relationship between the demographics and characteristics of Uber riders and their activity using the service.

Demographics and number of rides

[Figure 5.4: Distribution of rider age given their gender.]

Figure 5.4 shows the distribution of the number of riders, given their age and gender. A typical Uber rider is young (38% of riders are 18-27 years old), and slightly more likely to be a woman (51% are women). Female riders are somewhat younger than male riders (the mean age is 34.6 years for men vs. 33.1 years for women). The vast majority of riders are white (80.5%), followed by Hispanic (8.5%), African-American (8.2%), and Asian-American (2.8%) riders. Table 5.5 breaks down riders by race, age, and gender. Hispanic and African-American riders are younger than white and Asian-American riders, but the median number of rides is 3 for all races.

Table 5.5: Comparison of riders by race, age, and gender.
Race              % riders  % women  Avg. age  Med. fare
White             80.5%     49.6%    36.1      $12.8
Hispanic          8.5%      52.2%    31.1      $11.0
African-American  8.2%      61.1%    32.4      $11.5
Asian-American    2.8%      50.4%    35.5      $12.8

We consider the average number of rides per week as a measure of riders' activity. We found that, in general, older riders use the service less frequently; e.g., 30-year-old men use Uber 20% more than 50-year-old men (Figure B.3(a)). Although young men and women (aged less than 25 years) use Uber at about the same rate, older men use it slightly more than older women. The values shown in the figure are the averages for a given age and gender. The frequency of rides has a heavy-tailed distribution: most riders have very low activity, while a few riders are very active. The median number of rides overall is only 0.2 rides per week, and the top 10% most active riders take 1.3 rides or more per week.
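Returning to the email graph described above, the following is a minimal sketch of keeping only the two-hop neighborhood of Uber users, using networkx on a toy edge list; the graph and user ids are illustrative, not the actual Yahoo Mail graph.

# Minimal sketch: restrict an email graph to the two-hop neighborhood of a
# set of Uber riders/drivers; the edge list and ids are illustrative.
import networkx as nx

G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")])
uber_users = {"a"}

keep = set()
for u in uber_users:
    # Nodes within shortest-path distance 2 of an Uber user.
    keep |= set(nx.single_source_shortest_path_length(G, u, cutoff=2))
subgraph = G.subgraph(keep)
print(sorted(subgraph.nodes()))  # ['a', 'b', 'c']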
Income, surge, and car type

Next, we examine the impact of rider income, surge pricing, and car type on rider activity. First, we are interested in finding who is most affected by surge pricing: lower-income riders, who may be priced out by the increase in fares during peak hours, or the more affluent riders, who are willing to pay more for rides during times of high demand. Yahoo Mail users do not report their household income; instead, we estimate their income based on the zip code of their self-declared home location, using the median income for that zip code. Figure B.3(d) shows the average income of riders of a given age and gender. Older riders have higher income than younger riders. However, the percentage of rides with surge pricing for a given age and gender shows exactly the opposite trend: older riders are less likely to pay the surge price (Figure 5.5(a)). This might be due to younger riders using Uber on weekend nights, when there is surge pricing due to high demand. Even though these trends seem to suggest that riders paying surge prices should have lower income than riders not paying them, if we divide riders based on whether they had a ride with surge pricing or not, we find that riders with at least one surge ride have slightly higher income than the rest (Figure 5.5(b)). These plots seem to be conflicting, but they can be explained simply by the large heterogeneity among users. In short, people with higher income are more likely to take rides with surge pricing, but age plays a much more significant role.

[Figure 5.5: Riders and surge pricing. (a) Percentage of rides with surge pricing as a function of rider age and gender. (b) Comparison of income of riders who had at least one ride with a surge fare and the rest of the riders.]

Uber offers different service options: budget options, such as UberX and UberXL, and more expensive luxury options, such as Black, Select, SUV, and Luxury. Finally, the Pool ride is the cheapest option, as it allows the rider to split the trip cost with another person headed in the same direction. Figure 5.6 compares the types of Uber cars requested by riders with different incomes. There is a clear trend, with more affluent riders requesting more expensive cars: e.g., people with an annual income of $100k are 84% relatively more likely to take an Uber Black than users with an annual income of $50k.

[Figure 5.6: Type of Uber car requested by riders given their income.]

Rider attrition

What happens after a rider's first ride, whether or not a promotion was used? Does the rider become an active Uber user? Or does the rider stop using Uber and revert to his or her previous transportation options? Given the high costs of attracting new customers (advertising, promotions), retaining them is an economic priority for businesses. To measure rider attrition, we focus on riders who took their first ride during our data collection period and measure changes in their engagement levels over time. Recognizing new riders is feasible due to the welcome email they receive from Uber upon signing up. We exclude riders who took their first ride during the last four months of our data collection period, to ensure that we have at least four months of activity records for the new riders. We also exclude riders who took only one ride during this period (11.5% of riders), because low activity rates could bias results. After filtering, we still have a large number of riders (295K), thanks to the large size of our data set.

Next, we characterize each rider with a vector containing the number of rides taken in each month following their first ride. To identify groups of riders with similar behavior, we run a k-means clustering algorithm over these vectors, as sketched below. To find the optimal number of clusters, we perform a parameter sweep from k = 2 to k = 15. The mean squared error (i.e., distance from the cluster centers) gradually decreases as k increases, but with diminishing returns; after k = 3 the error reduction becomes significantly smaller. We chose k = 3 to balance the compactness of the model against the quality of the clustering.
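A minimal sketch of the parameter sweep just described, with randomly generated stand-ins for the monthly ride-count vectors; the within-cluster squared distance (inertia) plays the role of the mean squared error.

# Minimal sketch: sweep k for k-means over monthly ride-count vectors and
# inspect the within-cluster squared distance; data are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
monthly_rides = np.vstack([
    rng.poisson(2, size=(900, 4)),     # mostly inactive riders
    rng.poisson(7, size=(80, 4)),      # low-activity riders
    rng.poisson(20, size=(20, 4)),     # highly active riders
])

for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(monthly_rides)
    print(k, round(km.inertia_, 1))    # look for the "elbow", here near k = 3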
Table 5.6 shows the number of riders belonging to each cluster, as well as the centers of the three clusters. We see that the vast majority of riders (90.9%) belong to the cluster that has almost no rides after the first month (labeled Inactive). The second cluster of riders (8.0%) has a medium level of activity, almost one ride a week (Low activity). Finally, the remaining riders (1.1%) are highly active and maintain high levels of activity over time (High activity).

Table 5.6: Size and centers of the clusters of riders from their monthly number of rides.
Clusters       % riders  Month 1  Month 2  Month 3  Month 4
Inactive       90.9%     2.1      0.4      0.4      0.5
Low activity   8.0%      8.5      5.8      5.6      5.6
High activity  1.1%      18.0     21.6     23.3     22.1

Table 5.7: Demographics of riders in each cluster.
Clusters       Avg. age  Women  White  Hispanic  African-Amer.  Asian-Amer.
Inactive       35.1      53.3%  80.3%  8.8%      8.3%           2.6%
Low activity   31.9      51.3%  69.7%  11.8%     15.8%          2.7%
High activity  31.2      52.1%  60.3%  13.4%     24.1%          2.3%

The Inactive cluster includes riders who abandon the service quickly, while the remaining two clusters include more active riders, who use Uber more frequently. To characterize these riders, we break down each cluster by demographics in Table 5.7, which shows that the more active riders are younger, less likely to be white, and more likely to be Hispanic or African-American than riders who eventually leave Uber. We find no significant difference between the groups in their gender composition.

5.2.2 Understanding Drivers

In this section, we conduct an analysis of Uber drivers, focusing on their demographics and earnings, and identify factors that affect driver retention.

Demographics

[Figure 5.7: Number of drivers for a given age and gender.]

Figure 5.7 shows the number of drivers of a given age and gender. The figure shows a significant difference between the number of male and female drivers, with 76% of drivers being male and typically in their 30s. These results apply only to drivers from the US, and other countries differ widely with respect to driver gender. The US has the highest percentage of women drivers (24.0%), followed by Malaysia (10.1%), Singapore (9.9%), and Canada (9.4%). Surprisingly, the UK has a much lower fraction: only 4.3% of all UK drivers are women.

Table 5.8: Comparison of drivers of different races.
Race           % of drivers  Women  Avg. age  Avg. hrs worked  Avg. earning  Above avg. rating  % hrs surged
White          60.0%         21.9%  41.9      15.4 hrs         $355          62.7%              20.1%
African-Amer.  21.6%         36.5%  40.8      14.8 hrs         $341          57.2%              23.9%
Hispanic       13.7%         23.9%  38.5      15.2 hrs         $378          59.8%              21.5%
Asian-Amer.    4.7%          16.4%  41.6      18.2 hrs         $511          58.4%              23.1%
Table 5.8 shows the breakdown of US drivers by race. Compared to US riders (Table 5.5), there are significant disparities in the racial composition of drivers and riders. For example, while the majority of drivers are white (60%), this is much smaller than the percentage of white riders (81%). With regard to gender, women of all races participate significantly less as drivers than as riders. This is in contrast to conventional wisdom, which suggests that the flexibility of driving for Uber part time should be attractive to women. All races have a similar average age, except for Hispanic drivers, who are two or three years younger. Moreover, all races drive about the same number of hours and earn about the same amount of money, except for Asian-American drivers, who work and earn 23%-43% more than drivers of other races, on average.

Driver retention

We study factors that correlate with driver activity. Similar to our analysis of riders, we cluster drivers based on the number of hours worked each month since joining Uber. With k = 3 clusters, a large fraction of drivers stop working almost completely, driving fewer than five hours over a period of a month. However, this fraction (73.3%) is much lower than the fraction of riders who stop using Uber. The lower rate of driver attrition could be due to the higher effort required to become an Uber driver compared to an Uber rider: the sign-up process for drivers includes a background check, submission of documentation, and completion of a city-knowledge test [Hall and Krueger, 2015]. About 21.0% of drivers drive at least half an hour per day on average over the four months, and the remaining 5.7% of drivers are very active, working longer than three hours per day on average. The number of hours that drivers work drops across all three clusters.

Table 5.9: Size and centers of the clusters of drivers from their monthly hours worked.
Clusters       % drivers  Month 1  Month 2  Month 3  Month 4
Inactive       73.3%      20.1     4.7      3.0      2.2
Low activity   21.0%      89.9     45.1     26.3     16.4
High activity  5.7%       150.3    133.8    126.8    94.1

Table 5.10: Demographics of drivers in each cluster.
Clusters       Avg. age  Women  White  Hispanic  African-Amer.  Asian-Amer.
Inactive       37.1      40.3%  62.6%  26.1%     3.4%           7.9%
Low activity   43.2      31.3%  67.1%  22.7%     2.3%           7.8%
High activity  43.6      25.3%  76.2%  13.3%     1.0%           9.5%

We also characterize the drivers in each cluster in Table 5.10. The first cluster (labeled Inactive) includes drivers with the lowest engagement levels, who eventually stop driving for Uber, while the other two clusters contain active drivers with different engagement levels. Active drivers tend to be older, and more likely to be men, white, or Asian-American, than the Inactive drivers.

[Figure 5.8: Comparison of income of riders and drivers.]

5.2.3 Rider vs. Driver

In this section, we answer questions that involve both riders and drivers at the same time, including comparing their demographics and studying the effect of matching on ratings.

Demographic comparison

First, we are interested to see whether Uber has an "all-serve-all" economy, or whether the riders have different demographics or higher income. As shown in Figure 5.8, riders have higher income than drivers: the median income for riders is $62.4k, while the median income for drivers is $55.3k. Also, drivers are 51% more likely to be men, and riders are 7.3 years younger than drivers on average.
If we pick a random rider and a random driver, the rider is 34% more likely to be white, and the driver is 5 times more likely to be African-American. Even though riders and drivers differ substantially, a considerable 17.4% of drivers are also riders.

5.2.4 Prediction

We use the findings presented in the earlier sections to predict whether a new rider or driver will become an active Uber user.

Riders

We define the prediction problem as follows: given all the information about a rider and his or her Uber activity during the first two weeks since joining Uber, will that person become an active rider or not? We define active riders as those who take six or more rides in weeks 3-8 of using the service (i.e., at least one ride a week on average). To make this prediction, we use the following sets of features:

- User characteristics: age, gender, location, income, race, education, and zip code.
- Ride features: # of rides, average distance, average price, average duration, fraction of rides in the second week, percentage of rides with surge pricing, number of cities in which Uber was used, fractions of rides on the weekend and on weekdays, fractions of rides in the morning, afternoon, or at night, fraction of rides with a promotion, and number of distinct origins and destinations.
- Driver features (for the rides that have been matched): driver demographics, driver ratings, and the age, gender, and race differences between riders and drivers.
- Social features: number of Uber rider and Uber driver friends, based on the email network graph.

We extract all of the above features and balance the classes by under-sampling the larger class. This results in half of the users in the data set being active users, which leads to a 50% baseline for random prediction. Then, we select a random set of 80% of the users for training and use the remaining 20% for testing. We use the C5.0 classifier [Quinlan, 2004] for our predictions and achieve an accuracy of 75.2%, a 50.4% relative improvement over the baseline. The precision is 0.786 and the recall is 0.687. We also define a more intelligent baseline, where we predict that a user is going to be an active user if the user had more than two rides in the first two weeks (the median number of rides taken by all riders in the first two weeks). Surprisingly, this baseline performs much better than the random baseline and achieves an accuracy of 74.3%, which is still slightly lower than our classifier. This simple baseline indicates that the signal from users' early activity is strong enough to be an accurate predictor.

We use logistic regression to quantify the importance of the features. Since regression is sensitive to collinearities in the data, we first eliminate correlated features: we calculate pairwise correlation coefficients (Figure 5.9) and, for each pair with high correlation (> 0.7 or < -0.7), remove one of the two features. Table 5.11 shows the results of the logistic regression on the remaining 12 independent variables, which we normalize first. Older users are less likely to become active Uber riders. Men and riders with more trips in the first two weeks are more likely to become active riders, but those who had more expensive rides, used more expensive car types (such as Uber Black), or had a higher fraction of rides on weekends are less likely to become active Uber riders in the future.

[Figure 5.9: Correlation between the features of the riders (age, total_count, frac_2nd_week, avg_duration, avg_dist, avg_price, frac_morning, frac_afternoon, frac_night, frac_uberx, frac_expensive_car, frac_weekend, frac_weekday, cities_count, frac_out_state, frac_surge, frac_promotion, income, rider_friend, driver_friend). Pairs without statistically significant correlation are crossed (p-value < 0.05).]

Table 5.11: Results of logistic regression on the independent variables for the riders. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
Variable                            Coeff.
Total # of rides                    0.340***
% of out-of-state rides             0.330
# of different cities               0.113***
Gender (men)                        0.054***
# of rider friends                  0.006
Average fare                        -0.009***
Age                                 -0.020***
% of rides at night                 -0.098***
# of driver friends                 -0.162
% of rides on weekends              -0.382***
% of rides in expensive car types   -0.645***
% of surge rides                    -0.837***
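A minimal sketch of the correlation-based pruning used before the regression: for each pair of features with absolute correlation above 0.7, one of the two is dropped. The DataFrame is an illustrative stand-in for the rider feature matrix.

# Minimal sketch: drop one feature from every pair whose absolute pairwise
# correlation exceeds 0.7; the features are illustrative stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({"total_count": rng.normal(size=200)})
df["frac_2nd_week"] = 0.9 * df["total_count"] + 0.1 * rng.normal(size=200)
df["avg_price"] = rng.normal(size=200)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.7).any()]
print("dropped:", to_drop)            # e.g., ['frac_2nd_week']
pruned = df.drop(columns=to_drop)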
Drivers

We conduct a similar prediction task for the drivers, based on hours worked instead of the number of rides. We define active drivers as those who worked 10 hours or more per week in weeks 3-8 (since joining Uber); those who worked less than 10 hours a week in weeks 3-8 are inactive drivers. We consider the following features for driver prediction:

User characteristics: age, gender, location, income, race, education, and zip code.
Drive features: number of days worked, number of hours worked, number of rides given, ratings, rate of earning, percentage of busy hours worked, acceptance rate, and missed earnings, each for week 1 and week 2 separately.
Rider features: for the rides that have been matched: rider demographics, and the age, gender, and race differences between rider and driver.
Social features: number of Uber rider and Uber driver friends.

With the same setup as above and a 50% baseline, we achieve 83.1% accuracy, a 66.2% relative improvement over the baseline. Precision is 0.775 and recall is 0.689. If we define a baseline similar to the one for riders, using the median hours worked in the first two weeks, that baseline achieves an accuracy of 81.9%, which is significantly higher than the random baseline but again slightly lower than our classifier. As with riders, simple rules derived from drivers' early behavior are a very strong signal of future usage. We also remove the correlated features (using Figure B.14) and carry out logistic regression over the non-correlated features (a sketch of this pruning step follows).
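The correlation-based pruning used before both regressions can be sketched as below. The text does not specify which feature of a correlated pair is dropped, so this sketch greedily drops the second one encountered; the threshold of 0.7 follows the text.

```python
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    # Pairwise Pearson correlations; |r| > threshold marks a pair as collinear.
    corr = features.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        if cols[i] in to_drop:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in to_drop and corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    return features.drop(columns=sorted(to_drop))

# The remaining columns would then be normalized and passed to a logistic
# regression (e.g., statsmodels' Logit or sklearn's LogisticRegression).
```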
Table B.10 shows that older users, men, and drivers who worked more and had higher earning rates are more likely to become active drivers, but drivers with a lower acceptance rate (the percentage of rides they accepted to deliver) are less likely to become active drivers.

Figure 5.10: Correlation between the features of the drivers. Pairs without statistically significant correlation are crossed (p-value < 0.05). The suffixes 1 and 2 denote values from the first or second week.

Table 5.12: Results of logistic regression on the independent variables for the drivers. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
Variable                     Coeff.
Gender (men)                 0.371**
Driver's rating              0.227
# hours drove                0.157***
Age                          0.037***
Earning rate                 0.029*
Money missed in the week     -0.002
Acceptance rate              -0.015*
# of busy hours worked       -1.479

Conclusion

This work characterizes Uber's riders and drivers. We consider age, gender, and race, and show how different populations behave differently. For example, younger riders use Uber more frequently than older riders, but they take shorter rides. Considering gender, while riders have a balanced gender split, drivers have a very unbalanced one, with 76% of drivers being men. We also show that riders have about $12k higher annual income than drivers. Our study of surge pricing shows that drivers who take advantage of busy hours can earn on average 60% more while working the same number of hours.

We also study the ratings given to drivers by riders. We find that older drivers tend to get lower ratings, while women drivers who are 30-50 years old tend to get higher ratings. Interestingly, the matching of riders and drivers has an effect on the ratings: riders and drivers with a smaller age difference, or of the same race, produce higher ratings, and women drivers tend to receive higher ratings when a smaller fraction of their riders are women. These findings could be used to perform better matching and improve users' experience. Finally, we focus on users' engagement levels and show that the vast majority of users become less active and drop out after just a few weeks. By leveraging our findings, we are able to predict which users will become active riders or drivers with high accuracy. Predicting user attrition or abandonment can help Uber focus on these users, as keeping existing users is much less expensive than acquiring new ones.

5.3 iPhone Digital Spending

Consumer spending is an integral component of economic activity. In 2013, it accounted for 71% of the US gross domestic product (GDP)^8, a measure often used to quantify economic output and general prosperity of a country.
Given its importance, many studies have focused on understanding and characterizing consumer behavior. Researchers have examined gender differences and motivations in shopping [Dholakia, 1999b, Hayhoe et al., 2000], as well as spending patterns across urban areas [Sobolevsky et al., 2015]. In this work, we study a large-scale data set extracted from email receipts of digital purchases on iPhones, iPads, and other iOS devices (for short, we call all of these purchases iPhone purchases). People spend money on a variety of items on iPhones, such as an app, a song, or a bonus or upgrade in a game. We use this detailed data set to characterize the spending behavior of users. First, we show that users of different age, gender, and country of origin have significantly different spending ability and purchase different types of items. Then, we study how users distribute their budget across different types of purchases, and we find that in-app purchases account for more than 60% of all purchases. Next, we analyze the spending of individual users and find that most of the money spent on in-app purchases comes from a small minority of users. Moreover, we characterize this set of big spenders, or "avid gamers", finding that they are more likely to be men, older, and not from the US.

We use all of our findings to predict user behavior. First, we use only the demographics of a user to predict whether the user is one of the avid gamers. Second, we predict when a user is going to stop making purchases from an app. This prediction is in the direction of the churn or attrition problem, which has great value for companies, because detecting churn early can help companies prevent it, e.g., here by offering bonuses and promotions. In the last prediction problem, we predict the income of an app using only its initial sales data.

We also model user behavior more precisely and build a model that explains all the purchases made by a user. First, we show that the time between purchases is best modeled by a Pareto distribution. Second, we build a classifier that predicts whether a user is going to make a purchase from a new app or from an app that the user has purchased from earlier; this classifier achieves a very high accuracy. Finally, based on the outcome of the previous step, we predict the app that the user is going to make a purchase from. Our main findings are as follows:

- 61% of all money spent on iPhones goes to in-app purchases, followed by 23% to songs and 7% to app purchases.
- The spending is highly heterogeneous, with the top 1% of users accounting for 59% of all money spent on in-app purchases; the Gini coefficient is 0.884.
- Avid buyers tend to be 3-8 years older, are 23% more likely to be men, and are 31% less likely to be from the US compared to normal users. Interestingly, their income is comparable.
- Avid buyers become slower in making purchases as time passes, but their rate of spending initially increases before decreasing again.
- We are able to predict big spenders, app abandonment, and apps' earnings with much higher accuracy than basic baselines.
- Our findings can be leveraged for better app recommendations and for extending users' engagement with apps.

^8 https://research.stlouisfed.org/fred2/series/PCE/
Shortly after each purchase on an iPhone (or any other iOS device), the user receives an email with the details of the purchase. These emails contain information about the purchase, including the amount of money spent and the type of purchase, and they have a specific format that makes them easy to parse automatically. We obtained information about the iPhone purchases of Yahoo Mail users using an automated pipeline that hashes the names and IDs of users to preserve anonymity. All analyses were performed in aggregate and on unidentifiable data. We gathered data covering 15 months, from March 2014 to June 2015. In total, our data set includes 26M users who made more than 776M purchases summing to $4.6B over the 15 months.

There are six main categories of purchases on iPhone: applications, songs, movies, TV shows, books, and in-app purchases (purchases within an app, e.g., bonuses or coins in games). These categories have vastly different numbers of purchasing users: 16M users purchased at least one song, but only 671K users purchased a TV show. The number of purchases in each category varies greatly as well: there are 430M song purchases and 255M in-app purchases, while movies, books, and TV shows have fewer than 40M purchases all together. More interestingly, the total money spent has the most disproportionate split across categories. Surprisingly, people spent $2.8B, or 61% of all the money, on in-app purchases; 23% of the money was spent on songs, 7% on app purchases (purchasing the app itself, not purchases within the app), 6% on movies, 2% on books, and only 0.7% on TV shows. Even though there are considerably fewer in-app purchases than song purchases (60% fewer), the money spent on in-app purchases is 2.7 times higher, showing that each in-app purchase is much more expensive than a song purchase. Figure C.1 shows the number of users, the number of purchases, and the money spent in each of these six categories.

Figure 5.11: Percentage of users, purchases, and money spent in each category (App, Book, In-app, Movie, Song, TV).

Our data set also includes the age, gender, and zip code of users, provided at sign-up. Demographics play a significant role in users' spending. Figure C.2(a) shows the cumulative distribution function (CDF) of spending for men and women. Men spend more money on purchases than women: the median spending for women is $31.1, and for men it is $36.2, which is 17% higher. The age of users also affects their spending. Figure C.2(b) shows that as users get older they spend more money on phone purchases until their mid 30s, after which spending decreases quickly. Surprisingly, income only affects the spending of people with less than $40K annual income (Figure C.3). This is in contrast to online shopping, where users with higher income spend increasingly more money on online purchases [Kooti et al., 2016].

Figure 5.12: Effect of gender and age on users' spending on phone purchases. (a) Spending and gender; (b) spending and age.

Figure 5.13: Effect of income on spending. There are more than 10k users at each income level shown.

Moreover, the country that the user comes from plays a considerable role.
Figure C.4(a) shows that, interestingly, European countries, especially Eastern European and Scandinavian countries, have the highest spending per capita, even higher than the US. Finally, we consider the zip code of users in the US and use the median income in that zip code as an estimate of the user's income.

Figure 5.14: Heatmap of the median amount of money spent by users (a) in each country and (b) across the US.

There are also more than 154k applications in our data set from which users made a purchase. As shown above, the earnings from in-app purchases are considerably higher than those from app purchases themselves (almost 9 times higher).

5.3.1 Avid Gamers

In this work we focus on in-app purchases, due to the considerably higher spending on them. In this section, we first show that a small group of users drives the majority of spending on in-app purchases. Then, we characterize these users by studying their demographics, including age, gender, country of origin, and income. Finally, we focus on how these users start spending a lot of money on an app, and how they stop making purchases from apps.

To better observe the disparity in people's spending, we plot the Lorenz curve of user spending, which shows the percentage of total spending accounted for by various portions of the population when the population is ordered by spending. Figure C.6 shows the Lorenz curve of the spending. The diagonal line represents perfect equality between users (i.e., all of them spending the same amount of money), and a larger distance from the diagonal means bigger inequality in spending. The figure shows very high inequality: the bottom half of users spend less than 2% of all the money, the top 10% is responsible for 84% of all spending, and the top 1% alone is responsible for 59% of all money spent on in-app purchases. This high inequality can be captured by the Gini coefficient, which summarizes the plot in a single number (a computational sketch appears below). The Gini coefficient is 0.884, which represents extremely high inequality. Interestingly, if we consider the earnings of the apps, the inequality is even higher, with a Gini coefficient of 0.989 and 0.1% of apps earning 71% of all the income from in-app purchases. As a comparison, the Gini coefficient for the income of the US population is 0.469, which is the highest among Western industrialized nations (from census data).

Figure 5.15: Lorenz curve of users' spending on in-app purchases, showing high disparity among users.

As mentioned above, only 1% of the users, or 154K users, are responsible for the majority of in-app purchases. In the rest of this section, we focus on this set of users, and since most of the apps with high earnings were games, we call this set of users avid gamers. We also calculated the top 1% of spenders for each month separately. Avid gamers, who are top spenders over the whole period, are typically top spenders in only some of the months: 68.4% of them are top spenders in half of the months or fewer. This shows that there are bursts in spending, and avid gamers are not always spending a lot of money.
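A minimal sketch of the Lorenz-curve/Gini computation, assuming a vector of per-user spending; the toy Pareto sample is illustrative and will not reproduce the reported 0.884.

```python
import numpy as np

def gini(spending: np.ndarray) -> float:
    # Gini coefficient = 1 - 2 * (area under the Lorenz curve).
    x = np.sort(spending)
    lorenz = np.cumsum(x) / x.sum()             # cumulative share of spending
    n = len(x)
    area = (lorenz.sum() - lorenz[-1] / 2) / n  # trapezoidal rule, L(0) = 0
    return 1 - 2 * area

# Heavy-tailed toy sample; the dissertation reports 0.884 for in-app spending.
rng = np.random.default_rng(0)
print(round(gini(rng.pareto(1.2, size=100_000)), 3))
```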
Characteristics of avid gamers

We start by comparing the demographics of the avid gamers with the rest of the users. Understanding the differences in demographics could be useful for advertisers for better targeting the population that is more likely to contain big spenders. Avid gamers are 23% more likely to be men, with 59% of them being men, compared to 48% of the rest of the users. Both men and women avid gamers tend to be older than normal users. Men avid gamers have a median age of 37 years, while normal users have a median age of 34 years; the difference is even larger among women, 43 vs. 35 years. Moreover, there are considerable differences in the countries of origin of avid gamers and normal users. For some countries, like the US, a random user is less likely to be an avid gamer (31% less likely for the US), but users from other countries are much more likely to be avid gamers: a random Greek user, for example, is 50 times more likely to be an avid gamer. Similarly, users from Turkey are 33 times, and users from Romania 29 times, more likely to be avid gamers compared to the general population.

We also consider the role of income for users from the US by calculating the fraction of users with a given income who are avid gamers. Figure C.8 shows that income has a very small effect on being an avid gamer, except for users with around $20k or $140k annual income. Note that the percentage is almost always smaller than the expected 1% (the definition of avid gamers), because this analysis is conducted only on users from the US, and avid gamers are less likely to be from the US.

Figure 5.16: Fraction of avid gamers, given the income of the users.

Adoption and abandonment of apps

To better understand avid gamers' behavior, we focus on how they adopt and start using an app, and how they abandon and stop using it. We consider pairs of avid gamers and apps with more than 50 purchases, to capture behavior for apps that the user used frequently and for a long time. These apps are responsible for most of the spending on in-app purchases. Then, since we want to understand the start and the end of usage, we have to eliminate cases where usage started before, or ended after, the period of data collection. To this end, we only consider pairs where the first purchase happened after the first month and the last purchase happened before the last month; since these users have many purchases in an app, it is very unlikely for them to have a gap of longer than one month between purchases.

We start by analyzing the delay between consecutive purchases. In this analysis, the values have to be normalized for each user individually, to account for the large heterogeneity across users. Figure C.9(a) shows the 2nd to the 9th delays, normalized by the first delay. Overall, users become slightly slower in their purchases and their delays get larger. On the other hand, interestingly, users spend more money per day over their first 10 days of purchases, normalized by their average daily spending (Figure C.9(b)). This means that even though users become less frequent customers, they spend more money after a couple of transactions. Since a considerable fraction of the spending is on bonuses and coins in games, our finding might be explained by users starting with small packages of bonuses and coins and moving to larger ones; since the purchase is larger, they will need more bonuses/coins later than when they buy smaller packages. Similarly, if we focus on the last 10 days of purchases, we find that users' delays still get larger, but now at a much higher rate; the very last delay is six times larger than the first delay, on average (Figure C.10(a)). This long delay is a strong signal of app abandonment, which could be used for predictions.
Similarly, if we focus on the last 10 days of purchases, we nd that users' delays still get larger, but now with much higher rate, especially the very last delay is six times larger than the rst delay, on average (Figure C.10(a)). This long delay is a strong signal for abandonment of an app, which could be used for predictions. And considering the spending on the last 10 days of the purchases, Figure C.10(b) 137 2.0 2.1 2.2 2.3 2.4 2 3 4 5 6 7 8 9 10 Kth delay delay / first delay (a) Change in delay 0.90 0.95 1.00 1 2 3 4 5 6 7 8 9 10 Kth purchase spent / avg day spent (b) Change in spending Figure 5.17: Change in delay and spending in the rst 10 days of purchases from an app. shows that as users get closer to their last purchase they start spending less money each day. 3 4 5 6 −8 −7 −6 −5 −4 −3 −2 −1 0 Kth delay delay / first delay (a) Change in delay 0.85 0.90 0.95 1.00 −8 −7 −6 −5 −4 −3 −2 −1 0 Kth purchase spent / avg day spent (b) Change in spending Figure 5.18: Change in delay and spending in the last 10 days of purchases from an app. 138 5.3.2 Modeling In this section, we model the sequence of purchases that the users make. Our model has three main steps similar to steps taken by Benson et al. [Benson et al., 2016]: (i) Modeling time between purchases (ii) Predicting if a new item is going to be purchased or the user is going to make a purchase from an app that the user has already made a purchase from, and nally (iii) Predicting the exact app that the user is going to make a purchase from, given the output of the previous step. It is possible to model the purchases in dierent orders, but we select this sequence to build on the approach taken in [Benson et al., 2016, Anderson et al., 2014] and output of each model is used in the next steps; the estimated time interval is one of the main indicators for predicting if the user is going to make purchase from a new app, and also we can predict the app much more accurately if we know the app is a new app or a re-purchase. Temporal model First, we investigate a set of IID distributions to see which one ts the distribution on inter-purchase times better. We try Weibull, Gamma, Log normal, and Pareto and the best t for Pareto matches the data the best. We use AIC and the plots in Figure C.11 (shown for Pareto) to compare the distributions. The plot shows that Pareto distribution with shape = 3:21 and scale = 20:17 matches the data fairly well. Novelty prediction Next, we predict if the user is going to make purchase from an app that she has already made a purchase from or not. We approach the problem as supervised learning at the time that the user is going to make the purchase and include 139 Figure 5.19: Results of tting the time between purchases to a Pareto distribution. the following features: age, gender, time since previous purchase, average time between purchases, average time between re-purchases, total number of purchases, day of the current purchase, percentage of times of purchases that are re-purchases, whether last three purchases are from new app, and number of apps the user has made a purchase from. We use the rst year of the data for training and the last 140 three months for testing, so that we do not use any future information for our predictions. We try a set of dierent classication algorithms and the C5.0 algorithm in R gives us the best result [Quinlan, 2004]. Our classier achieves a high accuracy with predicting the right class in 84.5% of the cases with precision of 0.862, recall of 0.965, and F-score of 0.964. 
Novelty prediction

Next, we predict whether the user is going to make a purchase from an app that she has already purchased from or not. We approach the problem as supervised learning at the time the user is about to make a purchase, and include the following features: age, gender, time since previous purchase, average time between purchases, average time between re-purchases, total number of purchases, day of the current purchase, percentage of purchases that are re-purchases, whether the last three purchases were from new apps, and the number of apps the user has made a purchase from. We use the first year of the data for training and the last three months for testing, so that we do not use any future information in our predictions. We try a set of different classification algorithms, and the C5.0 algorithm in R gives us the best result [Quinlan, 2004]. Our classifier achieves a high accuracy, predicting the right class in 84.5% of cases, with a precision of 0.862, a recall of 0.965, and an F-score of 0.964. This accuracy is slightly higher than the results reported for the similar problem of re-consumption of music and videos in [Benson et al., 2016].

To better understand the importance of each feature, we also perform a logistic regression on the data after removing the correlated features. Figure C.12 shows the pairwise correlation coefficients among the features; we removed one feature from every pair with a correlation coefficient higher than 0.7. Table C.5 shows the result of the logistic regression: the percentage of re-purchases that the user has made is the most important feature, capturing the tendency of the user to re-purchase rather than purchase from a new app. The next three most important features capture the user's recent purchase history and whether those purchases were re-purchases. These are followed by gender, with a positive coefficient, showing that men are more likely to re-purchase. This is in tune with our earlier finding that men are more likely to be avid gamers, and avid gamers make many purchases from the same app.

Figure 5.20: Pairwise correlation coefficients among the features for predicting purchases from new apps.

Table 5.13: Results of logistic regression on the independent variables for re-purchase prediction. *** p-value < 0.001.
Variable                             Coeff.
% of re-purchases                    6.236e+00***
Previous class (re-purchase)         2.878e-01***
2nd-to-last class (re-purchase)      1.624e-01***
3rd-to-last class (re-purchase)      7.878e-02***
Gender (men)                         7.375e-02***
Mean inter-purchase time             4.764e-02***
Time since last purchase             2.782e-02***
Total number of re-purchases         2.232e-02***
Day of the purchase                  1.236e-03***
Age                                  1.069e-03***

App prediction

In the two previous steps, we modeled the time between purchases and whether the user is going to make a purchase from a new app or from an app she has already purchased from. If the outcome of the model is a new app, we predict the app based on the apps that the user has already made purchases from, and if the outcome of the classifier is a purchase from previous apps, we use the sequence of all previous purchases to select one of the earlier apps.

New app: Here, the problem is to pick the new app that the user is most likely to make a purchase from. Similar problems have been studied extensively in the area of recommender systems. We take a novel approach to the recommendation by using the idea of word2vec, treating each user's sequence of app purchases as a sentence whose words are apps and learning a vector representation for each app (a training sketch follows the table below). Table C.6 shows some randomly picked popular apps and their top 5 closest apps, along with the cosine similarity of the vectors representing each app. The table shows that our approach can capture the similarity between apps very accurately.

Table 5.14: Top 5 closest apps based on cosine similarity for 3 apps.
Kim Kardashian West Official
  Top 5 closest apps                    Cosine similarity
  Khloé Kardashian Official             0.907495
  Kourtney Kardashian Official          0.865857
  Kylie Jenner Official                 0.862917
  Kendall Jenner Official               0.804534
  kimoji                                0.732694
Homework
  Top 5 closest apps                    Cosine similarity
  Smart Studies                         0.502183
  iStudy Pro                            0.500390
  Barrons Hot Words                     0.491156
  Physics 101                           0.480380
  PSAT Preliminary SAT Test Prep        0.479376
Checkbook Pro
  Top 5 closest apps                    Cosine similarity
  Accounts 2 Checkbook                  0.677532
  Checkbook Spending                    0.657431
  Checkbook HD Personal Finance         0.641007
  My Check Register                     0.616790
  My Checkbook                          0.607814
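A sketch of learning such app embeddings, assuming (as described above) that each user's purchase history is treated as a sentence of app identifiers. The gensim hyperparameters and toy sequences are assumptions; the dissertation does not report the settings it used.

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's chronological sequence of app identifiers;
# toy sequences below. vector_size, window, etc. are illustrative choices.
sequences = [
    ["game_a", "game_b", "game_a", "finance_x"],
    ["game_a", "game_c", "game_b"],
    ["finance_x", "finance_y", "finance_x"],
] * 500

model = Word2Vec(sequences, vector_size=64, window=5, min_count=5, sg=1, epochs=5)

# Nearest neighbours by cosine similarity, as in Table 5.14.
print(model.wv.most_similar("game_a", topn=5))
```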
After we represent each app by a vector, we predict new apps by using the cosine similarity between the apps that the user has already purchased from and the remaining apps. We test our approach by using the apps installed in the first year of the data to predict the apps that the user is going to make a purchase from in the remaining three months. We make the predictions per user, and we scale the number of guesses by the number of apps the user purchased from in the first year. More precisely, if the user made purchases from k apps in the first year, we use these k apps and guess k/4 apps, because the test period is one fourth of the training period. We pick the k/4 items by considering the most similar app to each of the k apps and randomly picking one fourth of them. Our predictions are correct in 1.9% of cases. This seems very low, but considering that there are more than 270K apps the user can choose from, and in the context of recommender systems, the approach works considerably well.

Re-purchase: Finally, in the case of a re-purchase, we use both the frequency and the recency of app usage to pick one of the earlier applications. It might seem that the problem is easy and that users almost always purchase from the last app they purchased from, or from the app they purchased from most often; but in the case of a re-purchase, users buy from the latest app in only 46.5% of cases, and from the most frequently purchased app in only 45.3% of cases. So a more detailed model is required. We take an approach similar to [Benson et al., 2016, Anderson et al., 2014] and consider both the recency and the popularity of the old items to pick the best one. This is done with a weight function and a time function that map the frequency of usage and the time since previous usage to learned values.

Conclusion

Mobile devices are becoming increasingly popular, and people spend more and more money purchasing digital goods on their phones. Despite this increasing popularity, there has not been any large-scale study of people's spending on the phone's digital market. In this work, we study a large data set of more than 776M purchases on iPhones, including songs, apps, and in-app purchases. We find that, surprisingly, 61% of all the money spent goes to in-app purchases, and a small group of users is responsible for most of this spending; the top 1% of users spend 59% of all the money on in-app purchases. We characterize these users, showing that they are more likely to be men, tend to be older, and are less likely to be from the US.
Then, we focus on how these users start and stop making purchases from apps, finding that users gradually lose interest: the delay between purchases increases, while the amount of money spent per day initially increases and then decreases, with a sharp drop towards the end. Finally, we model the purchasing behavior of users by breaking the behavior down into three steps. First, the time between purchases is modeled by testing a variety of distributions, and we find that Pareto fits the data most accurately. Second, we take a supervised learning approach to predict whether the user is going to make a purchase from a new app or not. Finally, if the purchase is from a new app, we use a novel approach to find the new app based on the apps that the user has already purchased from, and if the purchase is from an app that the user has already purchased from, we combine the earlier frequency of purchases and the time since those purchases to predict the exact app. Each step of the modeling achieves much higher accuracy than competitive baselines.

5.4 Handling Heterogeneity in Large Data Sets

As shown above, there is considerable heterogeneity in large data sets. In the online shopping data set, the bottom 50% of users with the lowest spending are responsible for only 4.8% of the total money spent, while the top 1% alone is responsible for 38.9% of all the money spent. Similarly, on the Apple Store, the top 1% of users spend 58.9% of all the money spent on iPhone purchases, while the bottom 50% spend virtually nothing (only 1.8%). In many cases, aggregation over such a drastically heterogeneous population will produce false trends, similar to the problems caused by Simpson's paradox. This paradox arises when a trend exists in different groups of data but disappears or reverses when the groups are combined. Table 5.15 shows an example of this paradox, where one treatment works better for both subgroups of cases, but the other treatment falsely appears to work better in aggregate. The paradox here is explained by the vast difference in the size of the groups under each treatment.

Table 5.15: An example of Simpson's paradox, where Treatment A works better for both small stones and large stones separately, but in aggregate Treatment B falsely seems to work better.
                Treatment A       Treatment B
Small stones    93% (81/87)       87% (234/270)
Large stones    73% (192/263)     69% (55/80)
Both            78% (273/350)     83% (289/350)

A problem similar to Simpson's paradox occurs in real-world data sets. In Chapter 3, we showed that on Facebook, people spend less time on each post later in a session. We conducted the analyses by analyzing sessions of different lengths individually. But if we consider all the sessions together, we falsely find the exact opposite trend, as shown in Figure 5.21. The upward trend can be explained by the observation that people tend to spend less time per post in shorter sessions; when we average over all sessions of different lengths, the shorter sessions are over-represented on the left side of the plot, and the average becomes smaller.

Figure 5.21: Average time spent reading posts at a given minute in the session, for all Facebook sessions.
Similarly, in online shopping we investigated the time between two consecutive purchases and the price of the second purchase. Our hypothesis is that larger purchases should be preceded by longer delays. Since different users have different spending power, we consider the normalized change in the price given the number of days since the last purchase. In other words, we compute how users divide their personal spending across different purchases, given the time delay between purchases. We then average the normalized values over all users and report the change for each time delay. Figure A.10 shows that as the time delay gets longer, users spend a higher fraction of their budget, which supports our hypothesis.

Figure 5.22: Relationship between purchase price and time to next purchase. The 0.95 confidence interval is also shown, but it is too small to be observed.

To test that our analysis does not have any bias in the way users are grouped, we perform a shuffle test, randomly swapping the prices of the products purchased by users (a sketch of this test appears at the end of this section). This destroys the correlation between the time delay and product price. We then repeat the same analysis with the shuffled data, where we expect to see a flat line. However, the same increase also exists in the shuffled data, indicating a bias in the methodology. This is due to the heterogeneity of the underlying population: we are mixing users with different numbers of purchases. Users making more purchases have lower normalized prices and also shorter time delays, so they are systematically overrepresented on the left side of the plot, even in the shuffled data.

To partially account for heterogeneity, we grouped users by the number of purchases: e.g., those who made exactly 5 purchases, those with 9-11 purchases, and those with 28-32 purchases. Even within each group there is variation, as total spending differs significantly across users, which we address by normalizing the product price by the total amount of money spent by the user, as explained above. If our hypothesis is correct, there should be a refractory period after a purchase, with users waiting longer to make a larger purchase. We clearly observe a positive relationship between the (normalized) purchase price and the time (in days) since the last purchase (Figure A.11), but not in the shuffled data, which produces a horizontal line. We conclude that the relationship between time delay and purchase price arises due to behavioral factors, such as a limited budget.

Figure 5.23: Relationship between purchase price and time to next purchase, with 0.95 confidence intervals, for (a) users with 5 purchases, (b) users with 9-11 purchases, and (c) users with 28-32 purchases. Normal and shuffled data are shown.

In short, considering all users or sessions together may produce a false trend. We can test the correctness of an analysis with a shuffle test, in which we eliminate the correlation that we are looking for and repeat the analysis. If the trend still exists, then the analysis is not conducted correctly. The problem can be solved by correctly grouping users or sessions and conducting the analysis within each group.
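A minimal sketch of the shuffle test, assuming a table with user, price, and an implicit purchase order. The text does not state whether prices are permuted within or across users; this sketch permutes within each user, which preserves each user's own price distribution (and hence the population heterogeneity).

```python
import numpy as np
import pandas as pd

def shuffle_prices(purchases: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    # Permute prices within each user, destroying any relation between price
    # and the delay since the previous purchase while keeping everything
    # else (users, timestamps, purchase counts) intact.
    rng = np.random.default_rng(seed)
    out = purchases.copy()
    out["price"] = purchases.groupby("user")["price"].transform(
        lambda s: rng.permutation(s.to_numpy()))
    return out

# If the delay-vs-price trend survives on the shuffled table, the trend is
# an artifact of how users are pooled, not a behavioral effect.
```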
Chapter 6

Discussion

The observed short-term behavioral changes could be explained by cognitive constraints. In recent years, one of the main causes of increased cognitive load on individuals has been information overload. In this section, we explain how network paradoxes and the heterogeneity of users can result in fast growth of the amount of incoming information, which can in turn produce information overload.

6.1 Network Paradoxes on Twitter

The friendship paradox, as formulated by Feld, applies to offline relationships, which are undirected, and it has also been observed in the undirected social network of Facebook [Ugander et al., 2011]. We demonstrate empirically that the friendship paradox also exists on Twitter. Unlike the friendship relations of the offline world and Facebook, relations on Twitter are directed. When user a follows the activity of user b, he or she can see the posts tweeted by b, but not vice versa. We refer to user a as the follower of b, and to b as a friend or followee of a. Figure 6.1 illustrates a directed social network of a social media site such as Twitter. The user receives information from friends and, in turn, posts information to her or his followers. The friends may themselves receive broadcasts from their friends, whom we call friends-of-friends, and post tweets to their own followers, whom we call followers-of-friends.

Figure 6.1: An example of a directed network of a social media site. Users receive information from their friends and broadcast information to their followers.

6.1.1 Friendship Paradox on Twitter

The friendship paradox can be stated in four different ways on a directed graph:
i) On average, your friends have more friends than you do.
ii) On average, your followers have more friends than you do.
iii) On average, your friends have more followers than you do.
iv) On average, your followers have more followers than you do.

We empirically validate each statement above. The first statement says that, on average, a user's friends are better connected than he or she is, i.e., they follow more people than he or she does. To validate this statement, for each user in the data set we count how many friends she has, i.e., how many other users she follows. Then, for each friend, we count how many other users the friend follows, and average over all friends. Finally, we calculate the ratio of the average number of friends-of-friends to the number of friends of each user. The probability density function (PDF) of this ratio, the average friend's connectivity to the user's connectivity, shown in Figure 6.2(a), is greater than 1 for 98% of the users, peaking around 10. In other words, in the Twitter follower graph, a typical friend of a user is ten times better connected than the user. Not only are a user's friends better connected, but so are the user's followers.

Figure 6.2: Variants of the friendship paradox on Twitter, showing that your (a) friends and (b) followers are better connected than you are (i.e., have more friends on average), and that your (c) friends and (d) followers are more popular than you are (i.e., have more followers on average).
Figure 6.2(b) shows the PDF of the ratio of the average number of friends-of-followers to the user's number of friends. Again, for 98% of users this ratio is above one, indicating that the average follower is better connected than the user. In fact, a typical follower is almost 20 times better connected than the user.

The last two variants of the friendship paradox deal with a user's popularity, i.e., the number of followers he or she has. It appears that on Twitter, both a user's friends and followers are more popular than the user himself or herself. In our data set, 99% and 98% of users were less popular than their friends and followers, respectively. While a typical follower is about 10 times more popular than the user (Fig. 6.2(d)), the ratio of the friends' average popularity to the user's popularity shows a bimodal distribution (Fig. 6.2(c)). While some of a user's friends are ten times more popular, others are about 10,000 times more popular, showing a tendency of Twitter users to follow highly popular celebrities.

6.1.2 Friend Activity Paradox

In addition to the connectivity and popularity paradoxes, we also demonstrate a novel activity paradox on Twitter.

Friend activity paradox: On average, your friends are more active than you are.

To empirically validate this paradox, we measure user activity, i.e., the number of tweets posted by a user during a given time period; we exclude users who joined Twitter after the start of the time period. After windowing by a two-month time period, we are left with 37M tweets from 3.4M users and 144.5M links among these users. Note that the data set contains a random sample of all tweets; therefore, the number of tweets posted by a user in our sample is an unbiased measure of his or her overall activity. At the same time, we measure the number of sampled tweets posted by the user's friends during the same time interval.

Figure 6.3: Comparison of a user's activity and the average activity of his or her friends, measured by the number of tweets they post. (a) Average number of tweets posted by a user's friends vs. the number of tweets posted by the user; (b) PDF of the ratio of tweets posted per friend to tweets posted by the user. Most (88%) of the users are less active than their friends on average.

Figure 6.3(a) shows the average activity (number of posted tweets) per friend of users who each have the same level of activity, i.e., the mean average friend activity as a function of user activity. The unit-slope y = x line is shown for comparison. 88% of all users are less active than their typical friend. Figure 6.3(b) shows the probability distribution of the ratio of the average per-friend activity to user activity. For the vast majority of users, the friend activity paradox holds: their friends are more active than they are. (A sketch of this computation appears at the end of this subsection.)

It is known that some users become inactive after some time. To ensure that our results are not affected by inactive users, we checked the same paradox over a shorter time period of one week, during which fewer users may have become inactive. The activity paradox still holds; in fact, a much larger fraction of users are in the paradox regime: 99% of users are less active than their friends. Also, note that in all analyses comparing users with their friends (followers), we exclude users who do not have any friends (followers), because there is no one for the comparison.
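A minimal sketch of checking the friend activity paradox on a directed edge list, assuming tweet counts per user; users without friends are excluded, as in the text.

```python
from collections import defaultdict

def friend_activity_paradox(edges, tweets):
    # edges: (follower, friend) pairs, i.e. `follower` follows `friend`.
    # tweets: dict user -> number of tweets posted in the window.
    friends = defaultdict(list)
    for follower, friend in edges:
        friends[follower].append(friend)
    in_paradox = total = 0
    for user, fr in friends.items():   # users without friends never appear here
        mean_friend_activity = sum(tweets.get(f, 0) for f in fr) / len(fr)
        total += 1
        in_paradox += tweets.get(user, 0) < mean_friend_activity
    return in_paradox / total

# Toy example: fraction of users whose friends are, on average, more active.
edges = [("a", "b"), ("a", "c"), ("b", "c")]
tweets = {"a": 1, "b": 5, "c": 20}
print(friend_activity_paradox(edges, tweets))
```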
6.1.3 Virality Paradox

Your friends' superior social connectivity puts them in a better position to monitor, in aggregate, the flow of information, thereby mediating the information you receive via the social network. Perhaps this also puts them in a position to receive higher-quality content. As a measure of quality, we investigate the virality of URLs tweeted by users, i.e., the number of times a URL was posted by any user over some time period.

Virality paradox: On average, your friends spread more highly viral content than you do.

To confirm this paradox, we calculate the average size of posted URL cascades for each user and compare this value with the average size of the cascades posted by the user's friends. We observe that 32% of users have not posted any URLs (an average cascade size of 0) while their friends did; these inactive users have therefore posted less viral cascades than their friends. For the remaining 68% of users, Figure 6.4(a) shows the probability distribution of the ratio of the average size of cascades posted by friends to the average size of cascades posted by the user. We find that 79% of these users have a ratio greater than 1, which means that their friends have posted more viral content. Including the users who have not posted any URLs, 86% of all users have posted less viral content than their friends.

Figure 6.4: Comparison of the average size of posted and received cascades of users with their friends. (a) PDF of (size of posted cascades per friend) / (size of posted cascades); (b) PDF of (size of received cascades per friend) / (size of received cascades). For the vast majority of users, their friends both receive and post URLs with a higher average cascade size, indicating a virality paradox.

Users not only post less popular URLs than their friends, but also receive less viral content than their friends do, on average. Figure 6.4(b) shows the probability distribution of the ratio of the average size of cascades friends receive to the average size of cascades received by the user. Here again, 76% of users receive smaller (less viral) cascades than their friends (15% of users receive URLs with the same level of virality as their friends).

The friend activity paradox in directed social networks of online social media is not a mere statistical curiosity: it has surprising implications for how social media users process information. As social media users become more active on the site, they may want to grow their social networks to receive more novel information. Clearly, adding more friends will increase the amount of information a user has to process. However, according to the friend activity paradox, an average new friend is more active than the user herself; therefore, the volume of new information in a user's stream will grow super-linearly as new connections are added. Sometimes the volume of new information will exceed the user's ability to process it, pushing the user into the information overload regime. Overloaded users are less sensitive detectors of information.

Figure 6.5: Growth in the volume of incoming information as a function of a user's connectivity and the user activity it stimulates. (a) Average number of tweets received by users with a given number of friends; lines show the best power-law and linear fits. (b) Average number of tweets posted by a user vs. the number of received tweets.
6.2 Information Overload

We study how the volume of incoming information, measured by the number of tweets received by a user, grows with the size of a user's social network. Figure 6.5(a) shows the average number of tweets received by users who follow a given number of friends. The data is shown for users with up to 2000 friends, and has surprisingly low dispersion. This data is best fit by a power-law function with exponent 1.14 (R^2 = 0.9865); the best linear fit has a slope of 71 (R^2 = 0.8915), while the best quadratic fit has a slope of 60 (R^2 = 0.8930). The lines in Figure 6.5(a) show the best power-law and linear fits, where the linear fit was shifted down vertically for clarity. These data show that the average volume of information received by a user grows super-linearly with the number of friends. Regardless of the precise functional form, the volume of incoming information increases quickly with a user's connectivity: for every new friend, users receive hundreds of new posts in their stream^1 (a sketch of the power-law fit appears below).

^1 This total is over the course of two months. Our data set is a 20% sample, so the total numbers should be scaled accordingly.

Users can compensate for the increased volume of incoming information by increasing their own activity, e.g., visiting Twitter more frequently. While we cannot directly observe when a user visits Twitter to read friends' posts, we can indirectly estimate user activity by counting the number of tweets he or she posts within the time period. Figure 6.5(b) shows that users who receive more information are also more active, though after about 500 posted tweets (over a two-month period) the relationship between the incoming volume of information and user activity becomes very noisy. These extremely active users (posting 50 or more tweets a day on average, accounting for our 20% sample), who are not limiting how much information they receive, could be spammers. We include these users in our analyses, because their activity impacts the information load of people who choose to follow them.
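The power-law fit reported above can be estimated by ordinary least squares in log-log space. The sketch below uses synthetic data generated around the reported exponent; the dissertation's actual fitting procedure is not described, so this is only one plausible approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# received[i]: average number of tweets received by users with i friends.
# Synthetic data generated around the reported exponent of 1.14.
n_friends = np.arange(1, 2001)
received = 70 * n_friends**1.14 * np.exp(rng.normal(0, 0.05, n_friends.size))

# A power law y = c * x^alpha is linear in log-log space, so alpha can be
# estimated with a least-squares fit on the logs.
alpha, log_c = np.polyfit(np.log(n_friends), np.log(received), deg=1)

pred = alpha * np.log(n_friends) + log_c
r2 = 1 - np.var(np.log(received) - pred) / np.var(np.log(received))
print(f"exponent = {alpha:.2f}, R^2 = {r2:.4f}")
```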
Moreover, we investigate the correlation between user activity and the number of friends and followers. Figure 6.6 shows user activity, measured by the number of tweets posted during the time interval, as a function of the number of followers and friends the user has. There appears to be a significant correlation between a user's activity, connectivity, and popularity. The correlation between user activity and the number of followers appears especially strong. This correlation could, in fact, explain the friend activity paradox, because highly active users contribute to the average friend activity of their many followers, causing overrepresentation when averaging over friends' activity. The detailed mechanism behind this correlation is not yet clear. It is conceivable that as a user becomes more active, she begins to follow more and more people. Being active also leads her to acquire new followers as her posts become visible to others, for example, by being retweeted. This would lead to a correlation between the number of friends and followers that goes beyond simple reciprocation of links. We leave these questions for future research.

Figure 6.6: User activity as a function of the number of followers and friends the user has. (a) Average number of posted tweets vs. number of followers; (b) average number of posted tweets vs. number of friends.

As explained above, the volume of incoming information in a user's stream quickly increases with the number of friends the user follows. While the user may attempt to compensate for this growth by increasing her own activity, this may not be enough. As a result, the user may receive more posts than she can read or otherwise process. We say that such users are in the information overload regime. In this section, we compare the behavior of users who are overloaded with those who are not.

We consider the number of tweets posted by users during some time period (here, the first two months of the data set) as a measure of the amount of effort they are willing to allocate to their Twitter activities, and categorize users into four classes based on this measure. We only consider users who joined Twitter before June 2009, so that the duration of potential activity is identical for all users. The four classes are as follows: users who posted (i) fewer than five tweets, (ii) 5-19 tweets, (iii) 20-59 tweets, and (iv) 60 or more tweets (an average of one tweet per day in the sample). Then, in each group, we ranked users based on the number of tweets they received. We consider the top one third of users who received the most tweets to be information overloaded, and take the bottom one third as underloaded users.

We compare the average size of cascades that are posted and received by overloaded and underloaded users. Each cascade is associated with a unique URL, and its size is simply the number of times that URL was posted or retweeted in our data sample during the two-month period. Figure 6.7 compares the average size of the posted cascades of overloaded and underloaded users. (If the user receives the same URL multiple times, we take all appearances of that cascade into account during averaging.) The average cascade size of URLs posted by overloaded users is slightly larger than those tweeted by underloaded users. Across all four groups, overloaded users tweeted cascades of larger mean size, suggesting that overloaded users participate in viral cascades more frequently than underloaded users.

Figure 6.7: Comparison of the size of posted cascades of overloaded and underloaded users, grouped by their activity: (a) fewer than 5 posted tweets, (b) 5-19, (c) 20-59, (d) 60+.
Figure 6.8: Comparison of the size of received cascades of overloaded and underloaded users, grouped by their activity: (a) fewer than 5 posted tweets, (b) 5-19, (c) 20-59, (d) 60+.

Figure 6.8 shows the difference in the average size of URL cascades received by overloaded and underloaded users. Across all four groups, overloaded users receive larger cascades on average, as shown in Table 6.1, but overloaded users see far fewer small cascades. In other words, overloaded users will be poor detectors of small, developing cascades: they seem to learn about the information spreading in a cascade only when everyone else in their social network knows about it. Surprisingly, overloaded users are also less likely to have their stream dominated by viral cascades than underloaded users. This could happen because globally popular URLs tend to be less popular within a user's local network [Lerman and Galstyan, 2008], so that their few occurrences in the user's stream are drowned out by other tweets. No matter the explanation, it appears that overloaded users are only good detectors of information of mid-range interestingness, most likely the information that their friends already know.

Table 6.1: Mean of the average size of received cascades for under- and overloaded users. Overloaded users have a larger mean across all four groups, posting, respectively, 1) fewer than 5 tweets, 2) 5-19, 3) 20-59, and 4) 60+ tweets.
Category   Underloaded   Overloaded
Group 1    12.56         104.96
Group 2    40.78         132.94
Group 3    119.75        160.99
Group 4    145.44        202.86

6.3 Origins of the Paradoxes

6.3.1 Statistical Origins of Paradoxes

The friendship paradox is thought to be rooted in the heterogeneous distribution of node attributes, such as node degree [Feld, 1991]. Such distributions are characterized by a "heavy tail", where extremely large values, e.g., high-degree nodes, appear much more frequently than expected compared to a normal or exponential distribution. These large values skew the "average", giving rise to a large difference between the mean and the median. In this section, we show that randomly sampling from a heavy-tailed distribution can produce a paradox when using the mean, but not the median. Therefore, if the friendship paradox also exists for the median, it cannot be purely statistical in nature, and must have a behavioral origin. Conversely, a paradox using the mean may arise simply from the statistics of the attribute distribution, without any behavioral component or correlation between users.

To be more precise, consider the definition of mean vs. median for continuous, non-negative distributions.^2 The mean is defined as \mu = \int_0^\infty x P(x)\,dx. The median, m, is defined as the solution to 1/2 = \int_0^m P(x)\,dx. Given a sample consisting of a single random instance drawn from P(x), there is an equal chance that it will be larger or smaller than m.
Thus, the guess that minimizes the mean absolute error between the guess and the mean of a sample of size 1 is simply the median, i.e., m. At the other extreme, the value that minimizes the mean absolute error of the mean of an infinitely large sample is μ, by definition. Social behavior is often characterized by unimodal distributions with heavy tails, so μ > m.

Figure 6.9(a) provides an illustration of this behavior. We randomly sample values from three example distributions, exponential (2e^{-2x}), log-normal (μ = -0.3, σ = 1.5), and Pareto (x^{-1.2}), and calculate the mean and median of samples of varying sizes. The true mean and median for these distributions are (μ = 1/2, m = (1/2) log 2), (μ = 2.28, m = 0.74), and (μ = 6, m = 1.78), respectively. As explained above, the estimated sample mean changes monotonically from m to μ as the sample size increases from 1, as shown in Fig. 6.9(a). However, the sample median does not vary with sample size, because half of the numbers in the sample are below the population median and half above. Thus, considering the fraction of users in the paradox condition, shown in Figure 6.9(b), when user attributes are drawn i.i.d. from the previous Pareto distribution, the more friends a user has, the more likely the mean of their friends' attributes exceeds their own; but when using the median, no paradox is observed.

² The conclusions in the present work hold for any distribution where the mean is greater than the median. This is almost always the case for heavy-tailed distributions, but may be violated for small-support discrete distributions. When the median is greater than the mean, the present conclusions are simply reversed; we would expect "anti-paradoxes" when considering the mean.

Thus, consider the following explanation of how network paradoxes arise and how they depend on what you mean by "average." Assume we are measuring some empirical quantity, a user attribute x, in a purely random network where each user's x is an independent, identically distributed (iid) variable. The x for a user is compared to the "average" x of the user's network neighbors. The MLE of x for the user is m_x, because it is a sample of size 1. On the other hand, the number of a user's neighbors is at least one, meaning that the MLE for the mean of the neighbors' x is at least m_x. Therefore, even in a purely random network, as long as the mean of x is greater than the median of x, one would be led to the conclusion that "your x is smaller than the mean of your neighbors' x." But if you consider the median, both you and your neighbors will have the same median, and no paradox will be observed. Observation of paradoxes utilizing the median provides a test of the origin of such paradoxes: do they arise simply from heavy-tailed distributions, or are they due to humans positioning themselves in the network according to some nontrivial behavioral mechanism?
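This sampling effect is easy to reproduce numerically. The NumPy sketch below draws repeated samples from the three distributions above and reports the typical (median-over-trials) sample mean and sample median; parameter choices follow the values given above, and the script is an illustration rather than the original analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samplers matching the three distributions above; the Pareto draw has
# tail P(X > x) = x**-1.2 (shape 1.2, minimum value 1).
dists = {
    "exponential": lambda n: rng.exponential(scale=0.5, size=n),   # 2 e^{-2x}
    "log-normal":  lambda n: rng.lognormal(mean=-0.3, sigma=1.5, size=n),
    "pareto":      lambda n: 1 + rng.pareto(1.2, size=n),
}

trials = 2000
for name, draw in dists.items():
    for n in (1, 10, 100, 1000):
        samples = np.array([draw(n) for _ in range(trials)])
        # The *typical* sample mean (median over trials) climbs from the
        # population median toward the population mean as n grows, while
        # the typical sample median stays near m.
        typical_mean = np.median(samples.mean(axis=1))
        typical_median = np.median(np.median(samples, axis=1))
        print(f"{name:11s} n={n:4d}  "
              f"typical mean={typical_mean:6.2f}  "
              f"typical median={typical_median:6.2f}")
```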
6.3.2 Behavioral Origins of Paradoxes

As argued above, network paradoxes for the median do not arise purely due to statistical properties of the distribution of user-attribute values; they must be due to some behavioral mechanism. We test two potential sources of the behavioral mechanisms. The first source is the correlation between a user's degree and her own attributes. We call this "within-node correlation". We use Pearson's correlation coefficient to measure the within-node correlation between a node's number of friends and its attribute. We use the number of friends as the degree, and not the number of followers, because only the former characteristic is under the user's control. The second potential source of the paradoxes is the correlation between an attribute of a node and the attributes of its neighbors. This correlation is at the link level, and we call it "between-node correlation". We use assortativity to measure this correlation.

Table 6.2 reports the empirical values (Emp.) of assortativity and correlation for a variety of attributes in the Twitter and Digg networks. Note that the follower graphs of Twitter and Digg have a slight degree disassortativity, as found by a previous study of Twitter [Kwak et al., 2010]. Other attributes, on the other hand, are somewhat assortative. Within-node correlations are higher, as observed also in co-authorship networks [Eom and Jo, 2014].

We use the shuffle test to probe the behavioral explanation of the paradoxes. The shuffle test randomizes node attribute values, destroying the within-node and/or between-node correlations. We then measure the network paradox for the attribute in the shuffled network. If the paradox disappears, because most of the users are no longer in the paradox regime, we conclude that the correlation is the root cause of the paradox. First, we destroy both correlations and observe that the strong form of the paradoxes disappears, so the paradoxes are caused by these correlations. Then, we perform a controlled shuffle to differentiate between within-node and between-node correlations.

Table 6.2: Network properties. (Top) Assortativity of attributes of connected users and (bottom) within-node correlations of the attribute with degree, in the empirical data (Emp.) and in the shuffled networks after a controlled (Contr.) and full shuffle (Shuffle) of attributes.

Assortativity of the attribute
                       Twitter                      Digg
Attribute           Emp.    Contr.  Shuffle     Emp.    Contr.  Shuffle
num. friends        0.015   n/a     0.000      -0.040   n/a     0.001
num. followers     -0.047   n/a     0.000      -0.157   n/a     0.001
activity            0.037   0.016   0.000       0.152   0.005   0.000
diversity           0.055   0.012   0.000      -0.041   0.022  -0.001
posted virality     0.030   0.000   0.000       0.061   0.000   0.000
received virality   0.191   0.001   0.000       0.105   0.010   0.000

Correlation of node's attribute with num. friends
                       Twitter                      Digg
Attribute           Emp.    Contr.  Shuffle     Emp.    Contr.  Shuffle
activity            0.191   0.138  -0.001       0.097   0.108  -0.002
diversity           0.895   0.867   0.001       0.999   0.690   0.005
posted virality    -0.001  -0.001   0.000      -0.019   0.040   0.003
received virality   0.000   0.000   0.000       0.287   0.281   0.001

Shuffle Test: We start by examining the number-of-friends attribute. As explained earlier, in directed networks there are four variants of the friendship paradox, which compare the number of friends or followers a user has with the average number of friends or followers of her (i) friends and (ii) followers. We shuffle the network to destroy the correlation between connected nodes as follows. We keep the links between users as is, preserving the network structure, but assign to each user a new number of friends, randomly drawn from another network node. Shuffling the number of friends does not change its distribution, but it eliminates any correlation between the numbers of friends of connected users. Table 6.2 (column Shuffle) shows that in all cases degree assortativity disappears in the shuffled network.
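A minimal sketch of the full shuffle test on a toy directed graph follows; the adjacency-list representation, function names, and toy data are hypothetical, but the logic (permute attribute values across users, then re-measure the fraction of users in the paradox regime) follows the procedure described above.

```python
import numpy as np

def paradox_fraction(adj, attr, use_median=False):
    """Fraction of users whose own attribute value is smaller than the
    mean (or median) of their friends' values."""
    summarize = np.median if use_median else np.mean
    in_paradox = total = 0
    for u, friends in adj.items():
        if not friends:
            continue
        total += 1
        if attr[u] < summarize([attr[v] for v in friends]):
            in_paradox += 1
    return in_paradox / total

def full_shuffle(attr, rng):
    """Permute attribute values across all users, destroying both
    within-node and between-node correlations."""
    keys = list(attr)
    vals = rng.permutation([attr[k] for k in keys])
    return dict(zip(keys, vals))

# Toy example: random adjacency lists and Pareto-distributed attributes.
rng = np.random.default_rng(1)
adj = {u: list(rng.choice(100, size=rng.integers(1, 10), replace=False))
       for u in range(100)}
attr = {u: 1 + rng.pareto(1.2) for u in range(100)}

print("empirical, mean:", paradox_fraction(adj, attr))
print("empirical, median:", paradox_fraction(adj, attr, use_median=True))
print("shuffled, mean:", paradox_fraction(adj, full_shuffle(attr, rng)))
```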
Figure 6.10: Percentage of users in the paradox regime on Twitter and Digg after shuffling the number of friends (top row) and the number of followers (bottom row): (a) friends of friends, (b) friends of followers, (c) followers of friends, (d) followers of followers. Error bars show the 0.95 confidence interval.

Figures 6.10(a) and 6.10(b) show that the friendship paradox still holds in the shuffled network (though weaker) for the mean. However, the paradox no longer holds for the median, since fewer than 50% of users are in the paradox regime in the shuffled network. Next, we consider the paradoxes involving the number of followers. We shuffle the number of followers by assigning to each user the number of followers of a randomly drawn user. The two paradoxes still hold for more than 60% of users for the mean, but only about 50% of users are in the paradox regime for the median on Twitter and Digg (Figs. 6.10(c)–6.10(d)).

The empirically observed degree disassortativity is an outcome of the mechanisms people use to select whom to follow in online social networks. Disassortativity appears in a network where below-average users follow above-average users. This seems to be the case on Twitter (and Digg), where a large fraction of follow links go from ordinary users to celebrities (or top users on Digg) with orders of magnitude more followers. Friendship paradoxes in the online social networks of Digg, and to some extent Twitter, appear to be related to degree disassortativity.

The remaining paradoxes are similar because each compares a user's attribute with the average value of this attribute among her friends. In each shuffle test, we shuffle the values of the attribute among all users. This eliminates both within-node and between-node correlations. Table 6.2 (column Shuffle) shows that none of the correlations exist in the shuffled network. Figure 6.11 measures the network paradoxes for the four attributes in the shuffled Twitter and Digg networks. In almost all cases, the paradoxes still hold for the mean. The only exception is the received-virality paradox on Digg, which does not hold because the virality of stories does not have a fat-tailed distribution on Digg, as mentioned earlier. When comparing a user with her friends using the median, the paradoxes mostly disappear. One ambiguous case is the content diversity paradox on Twitter, which has 60% of the users in the paradox regime, a statistically significant but small paradox. We conclude that the empirical observations of the paradox for the median cannot be explained purely by statistical sampling and imply a socio-behavioral dimension. The origin of these strong network paradoxes appears to be in the within-node and between-node correlations.

Figure 6.11: Percentage of users in the paradox regime on Twitter and Digg after shuffling the user attribute: (a) activity, (b) diversity, (c) posted URL virality, (d) received URL virality. Error bars show the 0.95 confidence interval.
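The two correlations reported in Table 6.2 can be computed directly from an edge list and per-user attributes. A minimal sketch follows (function and variable names are hypothetical):

```python
import numpy as np

def assortativity(edges, attr):
    """Between-node correlation: Pearson correlation of the attribute
    across the endpoints of each (directed) edge."""
    x = np.array([attr[u] for u, v in edges], dtype=float)
    y = np.array([attr[v] for u, v in edges], dtype=float)
    return np.corrcoef(x, y)[0, 1]

def within_node_correlation(num_friends, attr):
    """Within-node correlation: Pearson correlation between each user's
    number of friends and her own attribute value."""
    users = list(attr)
    d = np.array([num_friends[u] for u in users], dtype=float)
    a = np.array([attr[u] for u in users], dtype=float)
    return np.corrcoef(d, a)[0, 1]
```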
Controlled Shuffle Test: Eom and Jo [Eom and Jo, 2014] argued that the within-node correlation between attribute and degree results in the observed paradox in the mean for the attribute. Unfortunately, the shuffle test does not allow us to distinguish whether within-node or between-node correlation (assortativity) is responsible for the paradox. In this section, we disentangle these effects through a controlled shuffle test, which attempts to eliminate the between-node correlation while preserving the within-node correlation. We achieve this by grouping together users with the same degree (number of friends) and shuffling the attribute values within each group. Thus, the reassigned attribute is still correlated with degree, because it comes from another user with the same degree. We log-bin the data to deal with degree sparseness at high values. Table 6.2 shows that the within-node correlation has not changed significantly, but the between-node correlation is greatly reduced in the shuffled network (column Contr.).

Figure 6.12 shows the result of the controlled shuffle test. No single type of correlation is responsible for all paradoxes. The activity paradox is greatly reduced by the controlled shuffle of the Twitter network and disappears on Digg, suggesting that between-node correlation (here, activity assortativity) is largely responsible for this paradox. This means that the paradox arises because active users preferentially link to other active users. The posted-URL virality paradox is similar in that it is largely reduced by controlled shuffling. We conclude that it mostly arises due to assortativity of this attribute, not within-node correlation. The diversity paradox, however, appears to be unaffected by controlled shuffling on both Digg and Twitter. This suggests that the diversity paradox is not caused by assortativity of diversity. Instead, it is due to the within-node degree–diversity correlation. The received-URL virality paradox is similar to diversity in that it is largely unaffected by controlled shuffling. Hence, we conclude that the degree–virality correlation plays a key role in creating the paradox, but it is not simply due to users selectively seeking out interesting users.

Figure 6.12: Percentage of users in the paradox regime for the shuffled attribute, keeping the attribute-connectivity correlation (controlled shuffling): (a) activity, (b) diversity, (c) posted URL virality, (d) received URL virality.

There are a few plausible behavioral mechanisms that could lead to these correlations. First, some of the correlations arise simply from the nature of the attribute. For example, the within-node correlation between the number of friends and diversity (as measured by the number of distinct URLs) exists because, as users add more friends, they will eventually begin connecting with users outside of their immediate interests. Between-node correlations arise as users position themselves near people with certain characteristics. For example, active users are generally more engaged with the social networking site, consuming more information. To increase the amount of content they receive, they could seek out new friends (degree–activity correlation) or seek out more active friends (activity assortativity). Regardless of which correlation may be indicated as the source of the strong paradox for a particular attribute, the unifying theme is that they arise from decisions made by the users and not from statistical artifacts of averaging.
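For concreteness, the controlled shuffle described above can be sketched as follows, again with hypothetical names; combined with the paradox-measurement sketch shown earlier, it reproduces the spirit of the Contr. columns of Table 6.2.

```python
import numpy as np
from collections import defaultdict

def controlled_shuffle(num_friends, attr, rng, n_bins=20):
    """Shuffle attribute values only among users in the same (log-binned)
    degree group: between-node correlation is destroyed, while the
    within-node degree-attribute correlation is preserved."""
    users = list(attr)
    degs = np.array([max(num_friends[u], 1) for u in users], dtype=float)

    # Log-bin degrees to cope with sparseness at high degree.
    edges = np.logspace(0, np.log10(degs.max() + 1), n_bins)
    groups = defaultdict(list)
    for u, b in zip(users, np.digitize(degs, edges)):
        groups[b].append(u)

    shuffled = {}
    for members in groups.values():
        vals = rng.permutation([attr[u] for u in members])
        shuffled.update(zip(members, vals))
    return shuffled
```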
Summary: In short, the short-term behavioral changes shown in this document can be explained by cognitive constraints and fatigue, which result in people performing simpler tasks or generating lower-quality content. In a large population there is significant heterogeneity; thus, some users can handle much more information and have more friends in their social network. This discrepancy in the number of friends results in the friendship paradox, as explained above, and the paradoxes cause people's incoming information to grow much faster than they expect. This fast growth causes users to experience information overload, which in turn results in stronger short-term behavioral changes. So there is a cycle, with different elements playing a role in people becoming overloaded. Further analysis is needed to address the problem of information overload.

Chapter 7

Conclusion and Future Work

This document presented evidence for behavioral changes observed at much shorter time scales than those discovered before. These behavioral changes occur on the order of a few minutes, rather than the days or months shown in earlier studies. Understanding these behavioral changes can help us design systems that improve user experience. For example, we observed that Facebook users prefer looking at photos and videos later in the session to reading textual posts. The feed ranking algorithm could be adjusted to take this into account and show more photos and videos to users later in the session. Our work included multiple studies on different platforms. Below, we summarize the findings from each platform.

On Twitter, we analyzed user behavior during activity sessions. We found that most Twitter sessions are short, on the order of minutes, and include only a few tweets, although those tweets tend to be diverse, including composed messages, retweets of others' messages, and replies to other users. Despite their short duration, we showed that user behavior changes over the course of an activity session. As people spend more time in the session, they prefer to perform easier or more socially engaging tasks, such as retweeting and replying, rather than harder tasks such as composing an original tweet. At the beginning of the session, tweets are up to 25% more likely to be an original tweet than near the end of the session. In addition, tweets tend to get shorter later in the session, and people tend to make more spelling mistakes. All of this could be explained by people becoming mentally fatigued, or perhaps careless due to loss of motivation. If we divide users into classes based on the number of friends they follow, or their activity level (i.e., the number of tweets they posted), we find that while these user classes behave differently in general, in terms of the types of tweets they tend to post, all classes manifest similar behavioral changes over the course of the session. Understanding the dynamics of user behavior could help better predict behavior and adjust the system accordingly to enhance user experience.

On Facebook, we analyzed a large data set of user activities, comprised of the interactions people have with the content their friends shared. As in our Twitter work, we segmented these interactions into activity sessions. Once segmented into sessions, content consumption shows strong regularities, with predictable changes over the course of a session for many people.
Regardless of the platform they use to consume Facebook content (web browser or mobile device), their demographic attributes, social connectivity, or the time of day they are active, people manifest similar behavioral changes: as the session progresses, they spend relatively less time viewing a story or video, and preferentially shift their attention away from reading stories and toward viewing photos and videos. There were also strong differences between short and long sessions. People spent less time consuming content during shorter sessions, a pattern that was already evident at the start of the session.

We leveraged the observed behavioral regularities to predict the length of a session, how much content people will consume over the course of a session, and when they will return to Facebook. While a person's past Facebook usage offers good indicators for predicting future behavior, surprisingly, the first minute of activity was also a very good predictor of session behavior. In fact, it was the most predictive feature in our models, followed by historical features, which included the average session length from the previous day and the average time spent on all interactions. These kinds of predictions could potentially be used to improve user experience, e.g., by caching content based on when and how much an individual is likely to consume, especially in areas where internet connectivity may be poor.

On Reddit, we presented novel evidence of performance deterioration during prolonged online activity. We showed that sessions with more activity are significantly associated with the production of lower-quality content, as measured by the length of the comment posted, its readability score, its average score, and the number of responses it receives. In light of these findings, we developed a mixed-effects model that captures online performance deterioration. Our analysis can be expanded in several directions. For example, we have only accounted for the basic differences between distinct Reddit users in the mixed-effects models. Yet a much more nuanced analysis of heterogeneous effects of online performance deterioration would be warranted. One interesting direction involves understanding whether all individuals exhibit the same levels of performance deterioration, or whether these effects vary from user to user. For example, we might find that all users consistently exhibit deterioration, or that different subgroups of users exist, where some users might even show improvements in performance over time. Neuroscience studies have found individual differences in working memory and other cognitive activities in the human brain [Vogel and Machizawa, 2004]. However, it remains unclear from a physiological standpoint whether the capacity to process or produce information varies from person to person [Marois and Ivanoff, 2005]. Online performance deterioration may also depend on acquired experience (as a form of cognitive dexterity) with a system. A new, and thus unfamiliar, user of a system may experience faster performance deterioration than an experienced user, because, e.g., the cognitive or attention cost associated with the same operations may be experience-dependent (this is particularly true for information discovery and content production activities). A computational study of online performance in this direction could be very valuable.

Additionally, other hypotheses can be studied, such as that performance deterioration depends on the topic (politics vs. funny images), the time of day, or the intensity of sessions (shorter average time differences between comments).
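Such hypotheses can be probed with the same mixed-effects machinery. The following is a minimal Python sketch, not the models actually fitted in the dissertation, using statsmodels and hypothetical column names; the random slope lets the deterioration rate vary by user.

```python
import pandas as pd
import statsmodels.formula.api as smf

# comments: one row per comment, with hypothetical columns 'user',
# 'position' (index of the comment within its session), and 'quality'
# (e.g., comment length or readability score).
comments = pd.read_csv("comments.csv")

# Random intercept per user captures baseline differences between users;
# the random slope on 'position' lets the deterioration rate itself vary
# from user to user, as suggested above.
model = smf.mixedlm("quality ~ position", comments,
                    groups=comments["user"], re_formula="~position")
print(model.fit().summary())
```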
A further aspect to consider is that we have treated all comments posted to Reddit as equal, meaning that we did not distinguish between comments posted at the root of a comment hierarchy and those posted further down the hierarchy. Future research in that direction is necessary to better understand the observed deterioration effect. For example, top-level comments might generally be of higher quality than low-level comments, or performance deterioration might be stronger for successive posts in the same submission thread than for comments across submissions. These and similar questions can be studied with our proposed models. They are highly adaptable, and fixed and random effects can be utilized to model these potential heterogeneous effects; for example, including a random effect allowing the deteriorating effects to vary between users would already allow us to make further inferences about individual differences.

The findings across different platforms are all compatible; thus, our studies strongly suggest that these short-term behavioral changes are the result of an underlying human behavior and not just the result of a particular platform. The review of the psychology literature suggests that these behavioral changes could be explained by ego depletion or fatigue, but we cannot infer causality from our studies, which all used observational data. Finding the cause requires a controlled experiment. One way of conducting the experiment is to consider the activity of two groups of users on social media, but to induce fatigue in one group before they start by asking them to perform some mentally demanding tasks. If there are differences between the two groups and the differences are similar to the changes observed in our studies, then we can conclude that the reason for the behavior change is fatigue. This area needs further study.

Reference List

People spent an insane amount of money on apps this year. http://time.com/4169153/apple-app-store-stats-2015, a. Accessed: 2016-07-11.
Annual apple app store revenue in 2013 and 2015 (in billion u.s. dollars). http://www.statista.com/statistics/296226/annual-apple-app-store-revenue/, b. Accessed: 2016-07-11.
Apple's cut of 2015 app store revenue tops $6b. http://www.computerworld.com/article/3019716/apple-ios/apples-cut-of-2015-app-store-revenue-tops-6b.html, c. Accessed: 2016-07-11.
Online shopping is still a pathetically tiny fraction of all shopping. http://goo.gl/ecTKxT. Accessed: 2016-07-11.
Personal consumption expenditures. https://research.stlouisfed.org/fred2/series/PCE/. Accessed: 2016-07-11.
This is your brain on uber. http://www.npr.org/2016/05/17/478266839/this-is-your-brain-on-uber. Accessed: 2016-05-27.
Understanding gender-based differences in consumer e-commerce adoption. Communications of the Association for Information Systems, 26, 2005.
Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
Luca Maria Aiello, Alain Barrat, Ciro Cattuto, Giancarlo Ruffo, and Rossano Schifanella. Link creation and profile alignment in the aNobii social network. In SocialCom '10: Proceedings of the Second IEEE International Conference on Social Computing, 2010.
Luca Maria Aiello, Alain Barrat, Rossano Schifanella, Ciro Cattuto, Benjamin Markines, and Filippo Menczer.
Friendship prediction and homophily in social media. ACM Trans. Web, 6(2):9:1–9:33, June 2012. ISSN 1559-1131. doi: 10.1145/2180861.2180866.
Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
Aris Anagnostopoulos, Ravi Kumar, and Mohammad Mahdian. Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 7–15, New York, NY, USA, 2008. ACM. doi: 10.1145/1401890.1401897.
Ashton Anderson, Ravi Kumar, Andrew Tomkins, and Sergei Vassilvitskii. The dynamics of repeat consumption. In Proceedings of the 23rd International Conference on World Wide Web, pages 419–430. ACM, 2014.
R. Harald Baayen, Douglas J. Davidson, and Douglas M. Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4):390–412, 2008.
Lars Backstrom, Eytan Bakshy, Jon M. Kleinberg, Thomas M. Lento, and Itamar Rosenn. Center of attention: How Facebook users allocate attention across friends. ICWSM, 11:23, 2011.
Ricardo Baeza-Yates, Di Jiang, Fabrizio Silvestri, and Beverly Harrison. Predicting the next app that you are going to use. In WSDM '15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 285–294, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3317-7. doi: 10.1145/2684822.2685302.
A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207–211, 2005.
Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v067.i01.
Roy F. Baumeister and Kathleen D. Vohs. Self-regulation, ego depletion, and motivation. Social and Personality Psychology Compass, 1(1):115–128, November 2007. ISSN 1751-9004. doi: 10.1111/j.1751-9004.2007.00001.x.
Roy F. Baumeister, Ellen Bratslavsky, Mark Muraven, and Dianne M. Tice. Ego depletion: Is the active self a limited resource? Journal of Personality and Social Psychology, 74(5):1252, 1998.
Roy F. Baumeister, Erin A. Sparks, Tyler F. Stillman, and Kathleen D. Vohs. Free will in consumer behavior: Self-control, ego depletion, and choice. Journal of Consumer Psychology, 18(1):4–13, 2008.
Steven Bellman, Gerald L. Lohse, and Eric J. Johnson. Predictors of online buying behavior. Communications of the ACM, 42(12):32–38, 1999.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins. Modeling user consumption sequences. In Proceedings of the 25th International Conference on World Wide Web, pages 519–529. International World Wide Web Conferences Steering Committee, 2016.
Amit Bhatnagar, Sanjog Misra, and H. Raghav Rao. On risk, convenience, and internet shopping behavior. Commun. ACM, 43(11):98–105, November 2000. ISSN 0001-0782. doi: 10.1145/353360.353371.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 2003.
Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.
Michael Braun and David A. Schweidel. Modeling customer lifetimes with multiple causes of churn.
Marketing Science, 30(5):881–902, 2011.
Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011.
Eric Butler, Erik Andersen, Adam M. Smith, Sumit Gulwani, Zoran Popovic, and WA Redmond. Automatic game progression design through analysis of solution features.
Le Chen, Alan Mislove, and Christo Wilson. Peeking beneath the hood of Uber. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference, IMC '15, pages 495–508, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3848-6. doi: 10.1145/2815675.2815681.
Justin Cheng, Jaime Teevan, and Michael S. Bernstein. Measuring crowdsourcing effort with error-time curves. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 1365–1374. ACM, 2015.
Jyh-Shen Chiou and Chien-Chien Ting. Will you spend more money and time on internet shopping when the product and situation are right? Computers in Human Behavior, 27(1):203–208, 2011. ISSN 0747-5632. doi: 10.1016/j.chb.2010.07.037.
Judd Cramer and Alan B. Krueger. Disruptive change in the taxi business: The case of Uber. 2015.
David Crandall, Dan Cosley, Daniel Huttenlocher, Jon Kleinberg, and Siddharth Suri. Feedback effects between similarity and social influence in online communities. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 160–168, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4. doi: 10.1145/1401890.1401914.
Mariam Daoud, Lynda Tamine-Lechani, Mohand Boughanem, and Bilal Chebaro. A session based personalized search using an ontological user profile. In Proceedings of the 2009 ACM Symposium on Applied Computing, pages 1732–1736. ACM, 2009.
Ruby Roy Dholakia. Going shopping: Key determinants of shopping behaviors and motivations. International Journal of Retail and Distribution Management, 27:154–165, 1999a.
Ruby Roy Dholakia. Going shopping: Key determinants of shopping behaviors and motivations. International Journal of Retail & Distribution Management, 27(4):154–165, 1999b. doi: 10.1108/09590559910268499.
N. Djuric, V. Radosavljevic, M. Grbovic, and N. Bhamidipati. Hidden conditional random fields with deep user embeddings for ad targeting. In Data Mining (ICDM), 2014 IEEE International Conference on, pages 779–784, December 2014.
Rex Y. Du and Wagner A. Kamakura. Where did all that money go? Understanding how consumers allocate their consumption budget. Journal of Marketing, 72(6):109–131, 2008. doi: 10.1509/jmkg.72.6.109.
Jean-Pierre Eckmann, Elisha Moses, and Danilo Sergi. Entropy of dialogues creates coherent structures in e-mail traffic. Proceedings of the National Academy of Sciences of the United States of America, 101(40):14333–14337, October 2004.
Benjamin G. Edelman and Damien Geradin. Efficiencies and regulatory shortcuts: How should we regulate companies like Airbnb and Uber? Harvard Business School NOM Unit Working Paper, (16-026), 2015.
Carsten Eickhoff, Jaime Teevan, Ryen White, and Susan Dumais.
Lessons from the journey: A query log analysis of within-session learning. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 223–232. ACM, 2014.
Young-Ho Eom and Hang-Hyun Jo. Generalized friendship paradox in complex networks, January 2014. URL http://arxiv.org/abs/1401.1458.
William N. Evans and Timothy J. Moore. Liquidity, economic activity, and mortality. Review of Economics and Statistics, 94(2):400–418, January 2011. doi: 10.1162/REST_a_00184.
Sendy Farag, Tim Schwanen, Martin Dijst, and Jan Faber. Shopping online and/or in-store? A structural equation model of the relationships between e-shopping and in-store shopping. Transportation Research Part A: Policy and Practice, 41(2):125–141, 2007. ISSN 0965-8564. doi: 10.1016/j.tra.2006.02.003.
Scott L. Feld. The focused organization of social ties. The American Journal of Sociology, 86(5):1015–1035, 1981.
Scott L. Feld. Why your friends have more friends than you do. American Journal of Sociology, 96(6):1464–1477, May 1991. URL http://www.jstor.org/stable/2781907.
Matthew T. Gailliot, Roy F. Baumeister, C. Nathan DeWall, Jon K. Maner, E. Ashby Plant, Dianne M. Tice, Lauren E. Brewer, and Brandon J. Schmeichel. Self-control relies on glucose as a limited energy source: Willpower is more than a metaphor. Journal of Personality and Social Psychology, 92(2):325, 2007.
Ellen Garbarino and Michal Strahilevitz. Gender differences in the perceived risk of buying online and the effects of receiving a site recommendation. Journal of Business Research, 57(7):768–775, 2004. ISSN 0148-2963. doi: 10.1016/S0148-2963(02)00363-6.
R. Stuart Geiger and Aaron Halfaker. Using edit sessions to measure participation in Wikipedia. In Conference on Computer Supported Cooperative Work, pages 861–870, 2013.
Andrew Gelman. Analysis of variance: Why it is more important than ever. The Annals of Statistics, 33(1):1–53, 2005.
Rumi Ghosh, Tawan Surachawala, and Kristina Lerman. Entropy-based classification of retweeting activity on Twitter. In Proceedings of the KDD Workshop on Social Network Analysis (SNA-KDD), August 2011.
Scott A. Golder and Michael W. Macy. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science, 333(6051):1878–1881, 2011.
Scott A. Golder, Dennis M. Wilkinson, and Bernardo A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In Communities and Technologies 2007, pages 41–66. Springer, 2007.
Michael F. Goodchild. Citizens as sensors: The world of volunteered geography. GeoJournal, 69(4):211–221, 2007. ISSN 1572-9893. doi: 10.1007/s10708-007-9111-y.
Katerina Goševa-Popstojanova, Ajay Deep Singh, Sunil Mazimdar, and Fengbin Li. Empirical characterization of session-based workload and reliability for web servers. Empirical Software Engineering, 11(1):71–117, 2006.
Sanjeev Goyal, Hoda Heidari, and Michael Kearns. Competitive contagion in networks. Games and Economic Behavior, 2014.
Mihajlo Grbovic and Slobodan Vucetic. Generating ad targeting rules using sparse principal component analysis with constraints.
In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, WWW Companion '14, pages 283–284, Republic and Canton of Geneva, Switzerland, 2014. International World Wide Web Conferences Steering Committee. ISBN 978-1-4503-2745-9. doi: 10.1145/2567948.2577351.
Mihajlo Grbovic, Guy Halawi, Zohar Karnin, and Yoelle Maarek. How many folders do you really need? Classifying email into a handful of categories. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 869–878. ACM, 2014.
Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1809–1818. ACM, 2015.
Nir Grinberg, Mor Naaman, Blake Shaw, and Gilad Lotan. Extracting diurnal patterns of real world activity from social media. In ICWSM, 2013.
Nir Grinberg, P. A. Dow, L. A. Adamic, and Mor Naaman. Extracting diurnal patterns of real world activity from social media. In CHI, 2016.
Stephen Guo, Mengqiu Wang, and Jure Leskovec. The role of social networks in online shopping: Information passing, price of trust, and consumer choice. In Proceedings of the 12th ACM Conference on Electronic Commerce, EC '11, pages 157–166, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0261-6. doi: 10.1145/1993574.1993598.
Vineet Gupta, Devesh Varshney, Harsh Jhamtani, Deepam Kedia, and Shweta Karwa. Identifying purchase intent from social posts. In Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014. URL http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8037.
Aaron Halfaker, Oliver Keyes, Daniel Kluver, Jacob Thebault-Spieker, Tien Nguyen, Kenneth Shores, Anuradha Uduwage, and Morten Warncke-Wang. User session identification based on strong regularities in inter-activity time. In International Conference on World Wide Web, 2015.
Jonathan V. Hall and Alan B. Krueger. An analysis of the labor market for Uber's driver-partners in the United States. Princeton University Industrial Relations Section Working Paper, 587, 2015.
Juho Hamari, Mimmi Sjöklint, and Antti Ukkonen. The sharing economy: Why people participate in collaborative consumption. Journal of the Association for Information Science and Technology, 2015.
Torben Hansen, Jan Moller Jensen, and Hans Stubbe Solgaard. Predicting online grocery buying intention: A comparison of the theory of reasoned action and the theory of planned behavior. International Journal of Information Management, 24(6):539–550, 2004. ISSN 0268-4012. doi: 10.1016/j.ijinfomgt.2004.08.004.
Celia Ray Hayhoe, Lauren J. Leach, Pamela R. Turner, Marilyn J. Bruin, and Frances C. Lawrence. Differences in spending habits and credit use of college students. Journal of Consumer Affairs, 34(1):113–133, 2000. ISSN 1745-6606. doi: 10.1111/j.1745-6606.2000.tb00087.x.
Alice F. Healy, James A. Kole, Carolyn J. Buck-Gengle, and Lyle E. Bourne. Effects of prolonged work on data entry speed and accuracy.
Journal of Experimental Psychology: Applied, 10(3):188–199, 2004.
Nathan O. Hodas and Kristina Lerman. How limited visibility and divided attention constrain social contagion. In SocialCom, 2012.
Nathan O. Hodas, Farshad Kooti, and Kristina Lerman. Friendship paradox redux: Your friends are more interesting than you. In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media (ICWSM), 2013.
L. Hong and B. D. Davison. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, 2010.
Jeremy Horpedahl. Ideology über alles? Economics bloggers on Uber, Lyft, and other transportation network companies. Econ Journal Watch, 12(3):360–374, 2015.
John A. Horrigan. Online shopping. Pew Internet & American Life Project Report, 36, 2008.
Jeff Huang and Efthimis N. Efthimiadis. Analyzing and evaluating query reformulation strategies in web search logs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 77–86. ACM, 2009.
Tak-Kee Hui and David Wan. Factors affecting internet shopping behaviour in Singapore: Gender and educational issues. International Journal of Consumer Studies, 31(3):310–316, 2007. ISSN 1470-6431. doi: 10.1111/j.1470-6431.2006.00554.x.
Tapio Ikkala and Airi Lampinen. Monetizing network hospitality: Hospitality and sociability in the context of Airbnb. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '15, pages 1033–1044, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-2922-4. doi: 10.1145/2675133.2675274.
Tom N. Jagatic, Nathaniel A. Johnson, Markus Jakobsson, and Filippo Menczer. Social phishing. Commun. ACM, 50(10):94–100, October 2007. ISSN 0001-0782. doi: 10.1145/1290958.1290968.
Long Jin, Yang Chen, Tianyi Wang, Pan Hui, and Athanasios V. Vasilakos. Understanding user behavior in online social networks: A survey. Communications Magazine, IEEE, 51(9):144–150, 2013.
Rosie Jones and Kristina Lisa Klinkner. Beyond the session timeout: Automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 699–708. ACM, 2008.
Jeon-Hyung Kang and Kristina Lerman. Using lists to measure homophily on Twitter. In AAAI Workshop on Intelligent Techniques for Web Personalization and Recommendation, July 2012.
Jeon-Hyung Kang and Kristina Lerman. User effort and network structure mediate access to information in networks. arXiv preprint arXiv:1504.01760, 2015.
Komal Kapoor, Karthik Subbian, Jaideep Srivastava, and Paul Schrater. Just in time recommendations: Modeling the dynamics of boredom in activity streams. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 233–242. ACM, 2015.
Eunju Kim, Wooju Kim, and Yillbyung Lee. Combination of multiple classifiers for the customer's purchase behavior prediction. Decision Support Systems, 34(2):167–175, 2003.
J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Technical report, DTIC Document, 1975.
Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk.
In CHI: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 453–456, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-011-1. doi: 10.1145/1357054.1357127.
George Roger Klare. Measurement of readability. 1963.
Isabel Kloumann, Lada Adamic, Jon Kleinberg, and Shaomei Wu. The lifecycles of apps in a social ecosystem. In Proceedings of the 24th International Conference on World Wide Web, pages 581–591. ACM, 2015.
Farshad Kooti, Winter A. Mason, Krishna P. Gummadi, and Meeyoung Cha. Predicting emerging social conventions in online social networks. In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM), a.
Farshad Kooti, Haeryun Yang, Meeyoung Cha, Krishna P. Gummadi, and Winter A. Mason. The emergence of conventions in online social networks. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), b.
Farshad Kooti, Nathan O. Hodas, and Kristina Lerman. Network weirdness: Exploring the origins of network paradoxes. In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM), 2014.
Farshad Kooti, Luca Maria Aiello, Mihajlo Grbovic, Kristina Lerman, and Amin Mantrach. Evolution of conversations in the age of email overload. In Proceedings of the 24th International World Wide Web Conference (WWW '15), Florence, Italy, May 2015.
Farshad Kooti, Kristina Lerman, Luca Maria Aiello, Mihajlo Grbovic, Nemanja Djuric, and Vladan Radosavljevic. Portrait of an online shopper: Understanding and predicting consumer behavior. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM '16), San Francisco, USA, February 2016.
Maryam Kouchaki and Isaac H. Smith. The morning morality effect: The influence of time of day on unethical behavior. Psychological Science, page 0956797613498099, 2013.
Ravi Kumar and Andrew Tomkins. A characterization of online browsing behavior. In Proceedings of the 19th International Conference on World Wide Web, pages 561–570. ACM, 2010.
H. Kwak, H. Chun, and S. Moon. Fragile online relationship: A first look at unfollow dynamics in Twitter. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, pages 1091–1100. ACM, 2011.
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proc. 19th Int. Conference on World Wide Web, pages 591–600, 2010. ISBN 978-1-60558-799-8. doi: 10.1145/1772690.1772751.
Min Kyung Lee, Daniel Kusbit, Evan Metsky, and Laura Dabbish. Working with machines: The impact of algorithmic and data-driven management on human workers. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI '15, pages 1603–1612, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3145-6. doi: 10.1145/2702123.2702548.
Kristina Lerman and Aram Galstyan. Analysis of social voting patterns on Digg. In Proc. 1st ACM SIGCOMM Workshop on Online Social Networks, 2008.
Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Trans. Web, 1(1), May 2007. ISSN 1559-1131. doi: 10.1145/1232722.1232727.
Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle.
In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506. ACM, 2009.
Julian Lim, Wen-Chau C. Wu, Jiongjiong Wang, John A. Detre, David F. Dinges, and Hengyi Rao. Imaging brain fatigue from sustained mental workload: An ASL perfusion study of the time-on-task effect. NeuroImage, 49(4):3426–3435, 2010.
Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, January 2003. ISSN 1089-7801. doi: 10.1109/MIC.2003.1167344.
Janne Lindqvist, Justin Cranshaw, Jason Wiese, Jason Hong, and John Zimmerman. I'm the mayor of my house: Examining why people use Foursquare, a social-driven location sharing application. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, pages 2409–2418, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0228-9. doi: 10.1145/1978942.1979295.
R. Dean Malmgren, Daniel B. Stouffer, Adilson E. Motter, and Luís A. N. Amaral. A Poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences, 105(47):18153–18158, 2008.
Alex Mariakakis, Mayank Goel, Md Tanvir Islam Aumi, Shwetak N. Patel, and Jacob O. Wobbrock. SwitchBack: Using focus and saccade tracking to guide users' attention for mobile task resumption. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 2953–2962. ACM, 2015.
René Marois and Jason Ivanoff. Capacity limits of information processing in the brain. Trends in Cognitive Sciences, 9(6):296–305, 2005.
Winter Mason and Siddharth Suri. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1–23, 2011. ISSN 1554-3528. doi: 10.3758/s13428-011-0124-6.
Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2015.
Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001. doi: 10.1146/annurev.soc.27.1.415.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
Stanley Milgram. The small world problem. Psychology Today, 2(1):60–67, 1967.
Giovanna Miritello, Rubén Lara, Manuel Cebrian, and Esteban Moro. Limited communication capacity unveils strategies for human interaction. Scientific Reports, 3, 2013.
Alan L. Montgomery, Shibo Li, Kannan Srinivasan, and John C. Liechty. Predicting online purchase conversion using web path analysis. Technical report, 2002.
M. Muraven, D. M. Tice, and R. F. Baumeister. Self-control as a limited resource: Regulatory depletion patterns. Journal of Personality and Social Psychology, 74(3):774, 1998.
Mark Muraven and Roy F. Baumeister. Self-regulation and depletion of limited resources: Does self-control resemble a muscle? Psychological Bulletin, 126(2):247, 2000.
Seth A. Myers and Jure Leskovec. Clash of the contagions: Cooperation and competition in information diffusion. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 539–548.
IEEE, 2012.
Mor Naaman, Amy Xian Zhang, Samuel Brody, and Gilad Lotan. On the study of diurnal urban routines on Twitter. In ICWSM, 2012.
Anastasios Noulas, Vsevolod Salnikov, Renaud Lambiotte, and Cecilia Mascolo. Mining open datasets for transparency in taxi transport in metropolitan environments. EPJ Data Science, 4(1):1–19, 2015. ISSN 2193-1127. doi: 10.1140/epjds/s13688-015-0060-2.
Paul A. Pavlou and Mendel Fygenson. Understanding and predicting electronic commerce adoption: An extension of the theory of planned behavior. MIS Quarterly, pages 115–143, 2006.
Toñita Perea y Monsuwé, Benedict G. C. Dellaert, and Ko De Ruyter. What drives consumers to shop online? A literature review. International Journal of Service Industry Management, 15(1):102–121, 2004.
Jose Pinheiro and Douglas Bates. Mixed-Effects Models in S and S-PLUS. Springer Science & Business Media, 2006.
Giovanni Quattrone, Davide Proserpio, Daniele Quercia, Licia Capra, and Mirco Musolesi. Who benefits from the sharing economy of Airbnb? In Proceedings of the 25th International Conference on World Wide Web, pages 1385–1394. International World Wide Web Conferences Steering Committee, 2016.
Ross Quinlan. Data mining tools See5 and C5.0. 2004.
Lisa Rayle, Danielle Dai, Nelson Chan, Robert Cervero, and Susan Shaheen. Just a better taxi? A survey-based comparison of taxis, transit, and ridesourcing services in San Francisco. Transport Policy, 45:168–178, 2016.
F. Reichheld and W. Sasser. Zero defections: Quality comes to service. Harvard Business Review, 68(5):105–111, 1990.
Yossi Richter, Elad Yom-Tov, and Noam Slonim. Predicting customer churn in mobile networks through analysis of social groups. In SDM, volume 2010, pages 732–741. SIAM, 2010.
Tiago Rodrigues, Fabrício Benevenuto, Meeyoung Cha, Krishna Gummadi, and Virgílio Almeida. On word-of-mouth based discovery of the web. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC '11, pages 381–396, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-1013-0. doi: 10.1145/2068816.2068852.
Manuel Gomez Rodriguez, Krishna Gummadi, and Bernhard Schoelkopf. Quantifying information overload in social media and its impact on social contagions. arXiv preprint arXiv:1403.6838, 2014.
Brishen Rogers. The social costs of Uber. University of Chicago Law Review Dialogue, 82, 2015.
Daniel E. Rose and Danny Levinson. Understanding user goals in web search. In Proceedings of the 13th International Conference on World Wide Web, pages 13–19. ACM, 2004.
Julian Runge, Peng Gao, Florent Garcin, and Boi Faltings. Churn prediction for high-value players in casual social games. In 2014 IEEE Conference on Computational Intelligence and Games, pages 1–8. IEEE, 2014.
Jeffrey M. Rzeszotarski and Aniket Kittur. Instrumenting the crowd: Using implicit behavioral measures to predict task performance. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, pages 13–22. ACM, 2011.
Vsevolod Salnikov, Renaud Lambiotte, Anastasios Noulas, and Cecilia Mascolo. OpenStreetCab: Exploiting taxi mobility patterns in New York City to reduce commuter costs. arXiv preprint arXiv:1503.03021, 2015.
Jari Saramäki and Esteban Moro. From seconds to months: An overview of multi-scale dynamics of mobile telephone calls. The European Physical Journal B, 88(6):1–10, 2015.
Christel Schoger. How the most successful apps monetize globally. Retrieved January, 15:2015, 2014.
Gideon Schwarz et al.
Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
Sylvain Senecal, Pawel J. Kalczynski, and Jacques Nantel. Consumers' decision-making process and their online shopping behavior: A clickstream analysis. Journal of Business Research, 58(11):1599–1608, 2005. ISSN 0148-2963. doi: 10.1016/j.jbusres.2004.06.003.
Cosma Rohilla Shalizi and Andrew C. Thomas. Homophily and contagion are generically confounded in observational social network studies. Sociological Methods and Research, 40(2):211–239, 2011. doi: 10.1177/0049124111404820.
S. Andrew Sheppard, Andrea Wiggins, and Loren Terveen. Capturing quality: Retaining provenance for curated volunteer monitoring data. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '14, pages 1234–1245, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2540-0. doi: 10.1145/2531602.2531689.
Rafet Sifa, Fabian Hadiji, Julian Runge, Anders Drachen, Kristian Kersting, and Christian Bauckhage. Predicting purchase decisions in mobile free-to-play games. In Eleventh Artificial Intelligence and Interactive Digital Entertainment Conference, 2015.
Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B (Methodological), pages 238–241, 1951.
Philipp Singer, Fabian Flöck, Clemens Meinhart, Elias Zeitfogel, and Markus Strohmaier. Evolution of Reddit: From the front page of the internet to a self-referential community? In International Conference on World Wide Web Companion, 2014.
Catarina Sismeiro and Randolph E. Bucklin. Modeling purchase behavior at an e-commerce web site: A task-completion approach. Journal of Marketing Research, 41(3):306–323, 2004. ISSN 0022-2437. URL http://www.jstor.org/stable/30162341.
Brent R. Smith, Gregory D. Linden, and Nida K. Zada. Content personalization based on actions performed during a current browsing session, February 8, 2005. US Patent 6,853,982.
Stanislav Sobolevsky, Izabela Sitko, Remi Tachet des Combes, Bartosz Hawelka, Juan Murillo Arias, and Carlo Ratti. Cities through the prism of people's spending behavior. arXiv, 2015. URL arxiv.org/abs/1505.03854.
Patricia Sorce, Victor Perotti, and Stanley Widrick. Attitude and age differences in online buying. International Journal of Retail & Distribution Management, 33(2):122–132, 2005.
Myra Spiliopoulou, Bamshad Mobasher, Bettina Berendt, and Miki Nakagawa. A framework for the evaluation of session reconstruction heuristics in web-usage analysis. INFORMS Journal on Computing, 15(2):171–190, 2003.
William R. Swinyard and Scott M. Smith. Why people (don't) shop online: A lifestyle study of the internet consumer. Psychology & Marketing, 20(7):567, 2003.
William R. Swinyard and Scott M. Smith. Activities, interests, and opinions of online shoppers and non-shoppers. International Business and Economics Research Journal, 3(4):37–48, 2011.
Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Commun. ACM, 53(8):80–88, August 2010. ISSN 0001-0782. doi: 10.1145/1787234.1787254.
Manouchehr Tabatabaei. Online shopping perceptions of offline shoppers.
Issues in Information Systems, 10(2):22–26, 2009.
Jaime Teevan, Daniel Ramage, and Merredith Ringel Morris. #TwitterSearch: A comparison of microblog search and web search. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 35–44. ACM, 2011.
Thompson S. H. Teo. Attitudes toward online shopping and the internet. Behaviour & Information Technology, 21(4):259–271, 2002.
Rannie Teodoro, Pinar Ozturk, Mor Naaman, Winter Mason, and Janne Lindqvist. The motivations and experiences of the on-demand mobile workforce. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '14, pages 236–247, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2540-0. doi: 10.1145/2531602.2531680.
Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The anatomy of the Facebook social graph, November 2011. URL http://arxiv.org/abs/1111.4503.
Frank Ulbrich, Tina Christensen, and Linda Stankus. Gender-specific on-line shopping preferences. Electronic Commerce Research, 11(2):181–199, 2011. ISSN 1389-5753. doi: 10.1007/s10660-010-9073-x.
Dirk Van den Poel and Wouter Buckinx. Predicting online-purchasing behaviour. European Journal of Operational Research, 166(2):557–575, 2005. ISSN 0377-2217. doi: 10.1016/j.ejor.2004.04.022.
Dimitri van der Linden, Michael Frese, and Theo F. Meijman. Mental fatigue and the control of cognitive processes: Effects on perseveration and planning. Acta Psychologica, 113(1):45–65, 2003.
Craig Van Slyke, Christie L. Comunale, and France Belanger. Gender differences in perceptions of web-based shopping. Commun. ACM, 45(8):82–86, August 2002. ISSN 0001-0782. doi: 10.1145/545151.545155.
Edward K. Vogel and Maro G. Machizawa. Neural activity predicts individual differences in visual working memory capacity. Nature, 428(6984):748–751, 2004.
Lillian Weng, Alessandro Flammini, Alessandro Vespignani, and Filippo Menczer. Competition among memes in a world with limited attention. Scientific Reports, 2(335), 2012.
Mary Wolfinbarger and Mary C. Gilly. Shopping online for freedom, control, and fun. California Management Review, 43(2):34–55, 2001.
Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 177–186. ACM, 2011.
Manir Zaman and Malvin Yeo Wei Meng. Internet shopping adoption: A comparative study on city and regional consumers. In ANZMAC 2002, pages 2421–2428. Deakin University, 2002.
Georgios Zervas, Davide Proserpio, and John Byers. The rise of the sharing economy: Estimating the impact of Airbnb on the hotel industry. Boston U. School of Management Research Paper, (2013-16), 2015a.
Georgios Zervas, Davide Proserpio, and John W. Byers. The impact of the sharing economy on the hotel industry: Evidence from Airbnb's entry into the Texas market. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC '15, pages 637–637, New York, NY, USA, 2015b. ACM. ISBN 978-1-4503-3410-5. doi: 10.1145/2764468.2764524.
Yongzheng Zhang and Marco Pennacchiotti. Predicting purchase behaviors from social media. In Proceedings of the 22nd International Conference on World Wide Web, pages 1521–1532.
ACM, 2013.

Li Zhuang, John Dunagan, Daniel R. Simon, Helen J. Wang, and J. D. Tygar. Characterizing botnets from email spam records. In LEET, pages 2:1–2:9, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1387709.1387711.

E. W. Zuckerman and J. T. Jost. What makes you think you're so popular? Self-evaluation maintenance and the subjective side of the "friendship paradox". Social Psychology Quarterly, pages 207–223, 2001.

Appendix A

Online Shopping

A.1 Introduction

Consumer spending is an integral component of economic activity. In 2013, it accounted for 71% of the US gross domestic product (GDP) [1], a measure often used to quantify economic output and the general prosperity of a country. Given its importance, many studies have focused on understanding and characterizing consumer behavior. Researchers examined gender differences and motivations in shopping [Dholakia, 1999b, Hayhoe et al., 2000], as well as spending patterns across urban areas [Sobolevsky et al., 2015].

In recent years, shopping has increasingly transitioned from an in-store to an online experience. Consumers use the internet to research product features, compare prices, and then purchase products from online merchants, such as Amazon or Walmart. Moreover, platforms like eBay allow people to sell products directly to one another. While there exist concerns about the risks and security of online shopping [Bhatnagar et al., 2000, Perea y Monsuwé et al., 2004, Teo, 2002], large numbers of people, especially younger and wealthier ones [Horrigan, 2008, Swinyard and Smith, 2003], choose online shopping even when similar products can be purchased offline [Farag et al., 2007]. The new habits of customers have had a tremendous economic impact on the online market, with an estimated $1,471 billion spent by 191 million shoppers in 2014 in the United States alone [2].

[1] https://research.stlouisfed.org/fred2/series/PCE/

Most online purchases result in a confirmation or shipment email sent to the shopper by the merchant. These emails provide a rich source of evidence for studying online consumer behavior across different shopping websites. Unlike previous studies [Pavlou and Fygenson, 2006], which were based on surveys and thus limited to relatively small populations, we used large-scale email data to perform an in-depth study of online shopping. More specifically, we extracted online receipts from 20.1M Yahoo Mail users, amounting to 121 million purchases worth 5.1B dollars. The extracted information included the names of purchased products, their prices, and purchase timestamps. We used the email user profile to link this information to demographic and geolocation data, such as gender, age, and zip code.

This information enabled us to characterize patterns of online shopping activity and their dependence on demographic and socio-economic factors. We found, for example, that men on average make more purchases and spend more money on online purchases. Moreover, online shopping appears to be widely adopted by all ages and economic classes, although shoppers from high-income areas buy more expensive products than less wealthy shoppers.

Looking at temporal factors affecting online shopping, we found patterns common to other online activities as well [Kooti et al., 2015]. Not surprisingly, online shopping has daily and weekly cycles, showing that people fit online shopping routines into their everyday life. Furthermore, purchasing decisions appear to be correlated.
The more expensive a previous purchase was, the longer the shopper has to wait until the next purchase. This can be explained by the fact that most shoppers have a finite budget and need to wait longer between purchases to buy more expensive items.

[2] http://www.statista.com/topics/871/online-shopping/

In addition to temporal and demographic factors, social networks are believed to play an important role in shaping consumer behavior (e.g., by spreading information about products through word-of-mouth [Rodrigues et al., 2011]). Previous studies examined how consumers use their online social networks to gather product information and recommendations [Guo et al., 2011, Gupta et al., 2014], although the direct effect of recommendations on purchases was found to be weak [Leskovec et al., 2007]. In addition, people who are socially connected are generally more similar to one another than unconnected people [McPherson et al., 2001], and hence are more likely to be interested in similar products. Our analysis confirmed that shoppers who are socially connected and email each other tend to purchase more similar products than unconnected shoppers.

Once we understand the factors affecting consumer behavior, we can use this knowledge to predict future purchases. Given users' purchase history and demographic data, we address the problem of predicting the time and price of their next purchase. Our method attains a relative improvement of 108.7% over the random baseline for predicting the price of the next purchase, and a 36.4% relative improvement over the random baseline for predicting the time of the next purchase. Interestingly, demographic features were shown to be the least useful in these prediction tasks, while temporal features carried the most discriminative information.

The contributions of the paper are summarized below:
- An in-depth analysis of a unique and very rich data set describing consumer behavior, extracted from the purchase confirmations merchants send to shoppers;
- A quantitative analysis of the impact of demographic, temporal, and social factors on consumer behavior;
- Prediction of consumer behavior, where we predict the time of the next purchase and how much money will be spent.

A better understanding of consumer behavior can benefit both consumers and advertisers. Knowing when consumers are ready to make a purchase and how much they are willing to spend can improve the effectiveness of advertising campaigns, in terms of optimizing ad impressions and budget spend. Understanding these patterns can also make the online shopping experience more efficient for consumers. Considering that consumer spending represents such a large portion of the economy, even a small efficiency gain can have a significant impact on overall economic activity.

A.2 Data set

Most online purchases result in a confirmation email being sent by the merchant to the shopper. These emails provide a unique opportunity to study the shopping behavior of people across different online retail stores, such as Amazon, eBay, or Walmart.

Yahoo Mail is one of the world's largest email providers, with more than 300M users [3], and many online shoppers use this email service for receiving purchase confirmations. We selected these emails by using a precompiled list of email addresses of popular merchants. Applying a set of manually specified extraction rules to the email body, we extracted the list of purchased item names and the price of each item.
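To make the extraction step concrete, the sketch below shows what rule-based parsing of a receipt body might look like. The line format and regular expression are hypothetical illustrations, not the actual extraction rules used in this study:

    import re

    # Hypothetical receipt line format, e.g.: "1 x Frozen (DVD)  $14.99"
    ITEM_PATTERN = re.compile(r"^\s*(?P<qty>\d+)\s*x\s*(?P<name>.+?)\s+\$(?P<price>\d+\.\d{2})\s*$")

    def parse_receipt(body):
        """Return one (item_name, price) record per purchased item.

        Multiple items in one order are treated as individual purchases
        occurring at the same time, as described in the text below.
        """
        purchases = []
        for line in body.splitlines():
            match = ITEM_PATTERN.match(line)
            if match:
                # Repeat the record once per unit so each item counts as one purchase.
                for _ in range(int(match.group("qty"))):
                    purchases.append((match.group("name"), float(match.group("price"))))
        return purchases

    receipt = "1 x Frozen (DVD)  $14.99\n2 x HDMI cable  $6.50"
    print(parse_receipt(receipt))  # [('Frozen (DVD)', 14.99), ('HDMI cable', 6.5), ('HDMI cable', 6.5)]

In practice, one such rule set would be maintained per merchant, keyed on the sender address.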
In the case of multiple items purchased in a single order, we considered them as individual purchases occurring at the same time. Therefore, throughout the paper the expression "purchase" will refer to the purchase of a single item. In order to analyze purchases by category (e.g., electronics, books, handbags), we developed a categorization module based on the purchased item names. Specifically, an item name was used as input to a classifier that predicts the item category. We used a three-level, 1,733-node Pricegrabber taxonomy [4] to categorize the items. The details of the categorization are beyond the scope of this paper.

[3] http://www.comscore.com/
[4] http://www.pricegrabber.com

Figure A.1: Distribution of the number of purchases made by users: (a) PDF; (b) CDF.

We limited our study to a random subset of Yahoo Mail users in the US. Our data set contains information on 20.1M users, who collectively made 121M purchases from several top retailers between February and September 2014, amounting to a total spending of 5.1B dollars. For each user we extracted age, gender, and zip code information from the Yahoo user database. We excluded users who made more than 1,000 purchases (amounting to less than 0.01% of the sample), as these accounts are likely to belong to stores rather than individual users. The analysis of the data set was performed in aggregate and on anonymized data.

Figure A.2: Distribution of total money spent by users: (a) PDF; (b) CDF.

In order to examine social aspects of shopping behavior, we used the Yahoo email network data set in addition to the data set of purchases. The email network can be represented as a directed graph G, with edges denoted by (i, j, N_ij) signifying that user i sent N_ij emails to user j. For our analysis we retained only edges with a minimum of 5 exchanged messages, and considered only a subgraph C of G induced by the two-hop neighborhood of the users who made purchases (i.e., their immediate contacts and the contacts of their contacts). The subgraph C was used to construct a list of 1st-level and 2nd-level contacts for each online shopper.

Let us take a closer look at the data set characteristics. In Figure A.1(a) we show the distribution of the number of purchases per user, which reveals the expected heavy-tailed characteristic. Figure A.1(b) shows that only 5% of the users made more than 20 purchases. In contrast, the distribution of total money spent per user peaks at around 10 dollars, decreasing sharply for smaller amounts (Figure A.2(a)). Also, there is a non-negligible minority of people who spend a substantial amount of money on online shopping; for example, 5% of the users spent more than 1,000 dollars (Figure A.2(b)).

Figure A.3: Number of times different items have been purchased.

Figure A.4: Number of item purchases as a function of price.

Items also have drastically different levels of popularity, as shown by the distribution in Figure A.3. Disney's Frozen DVD, the most popular item in the data set, was purchased more than 200,000 times, whereas the vast majority of items were purchased fewer than 10 times.
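A minimal sketch of how the per-user statistics behind Figures A.1 and A.2, and the price–popularity relationship in Figure A.4, can be computed from a table of parsed purchases. The pandas column names and the toy rows are hypothetical placeholders for the extracted fields:

    import pandas as pd

    # One row per item bought; columns are hypothetical.
    purchases = pd.DataFrame({
        "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
        "item":    ["Frozen (DVD)", "HDMI cable", "Frozen (DVD)", "Pampers", "Pampers", "Kindle"],
        "price":   [14.99, 6.50, 14.99, 24.99, 24.99, 79.99],
    })

    # Per-user activity: number of purchases and total money spent (Figures A.1 and A.2).
    per_user = purchases.groupby("user_id").agg(
        n_purchases=("item", "size"), total_spent=("price", "sum"))

    # Per-item popularity (Figure A.3) and its relationship with price (Figure A.4).
    per_item = purchases.groupby("item").agg(
        times_purchased=("price", "size"), price=("price", "first"))
    print(per_item["times_purchased"].corr(per_item["price"], method="spearman"))  # negative on real data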
Table A.1 lists the top 5 most frequently purchased items. Intuitively, the set of items that users have spent the most money on is a different set, because a single purchase of an expensive item accounts for the same amount of money as several purchases of cheaper items (Table A.2). In fact, the number of times an item is purchased correlates negatively with the price of that item (Figure A.4). This is in line with previous survey-based studies [Bhatnagar et al., 2000], which found that the vast majority of items purchased online are worth at most a few tens of dollars.

A.3 Purchase Pattern Analysis

In this section we present a quantitative analysis of the factors affecting online purchases. We examine the role of demographic, temporal, and social factors, including gender and age, daily and weekly patterns, frequency of shopping, tendency towards recurring purchases, and budget constraints.

A.3.1 Demographic Factors

Let us consider how gender, age, and location (i.e., zip code) affect the purchasing behavior of customers. First, we measured the fraction of all email users who made an online purchase. We found that a higher fraction of women make online purchases compared to men (Figure A.5(a)), albeit men make slightly more purchases (Figure A.5(b)) and spend more money on average (Figure A.5(c)). It is interesting that, as a result, men spend much more money in total (Figure A.5(d)). The same patterns hold across different age groups. These results back up findings from earlier consumer surveys, which revealed that men have a higher perceived advantage of online shopping [van, 2005], while women have a higher concern for the negative consequences of online purchasing [Garbarino and Strahilevitz, 2004], resulting in a higher number of purchases made by men.

Table A.1: Top 5 most purchased products
Rank  Product name                    # of purchases
1     Frozen (DVD)                    202,103
2     Cards Against Humanity (Cards)  110,032
3     Google Chromecast               59,548
4     HDMI cable                      54,402
5     Pampers                         51,044

Table A.2: Top 5 products with the most money spent on them
Rank  Product name            Money spent on product
1     PlayStation 4           $7.0M
2     Frozen (DVD)            $3.8M
3     Kindle                  $3.2M
4     Samsung Galaxy Tab      $2.7M
5     Cards Against Humanity  $2.5M

Table A.3: Top product categories purchased by women and men
Rank  Top categories  Distinctive women  Distinctive men
1     Android         Books              Games
2     Accessories     Dresses            Flash memory
3     Books           Diapering          Light bulbs
4     Vitamins        Wallets            Accessories
5     Shirts          Bracelets          Batteries

Table A.4: Top products purchased by women and men
Rank  Top women                                 Top men                  Distinctive women                         Distinctive men
1     Frozen                                    Frozen                   Frozen                                    Chromecast
2     iPhone screen protector                   Game of Thrones          iPhone screen protector                   Game of Thrones
3     Cards Against Humanity                    Chromecast               iPhone screen protector                   Titanfall Xbox One
4     iPhone screen protector (another brand)   Cards Against Humanity   iPhone case                               PlayStation 4
5     Game of Thrones                           iPhone screen protector  iPhone case                               Godfather collection

With respect to age, spending ability increases as people get older, peaking for the population between ages 30 and 50 and declining afterwards.
Figure A.5: Demographic analysis broken down by age: (a) percentage of online shoppers; (b) number of items purchased; (c) average price of products purchased; and (d) total spent by men and women.

Table A.5: Differences in the products purchased by younger (18-22 yo) and older (60-70 yo) users
Rank  Top younger users        Top older users  Distinctive younger users                 Distinctive older users
1     iPhone screen protector  Frozen           iPhone screen protector                   Frozen
2     iPhone screen protector  Game of Thrones  iPhone screen protector (another brand)   Game of Thrones
3     Cards Against Humanity   Chromecast       Cards Against Humanity                    Downton Abbey
4     iPhone case              Downton Abbey    iPhone case                               Blood sugar medicine
5     Frozen                   Hunger Games     iPhone case                               TurboTax Package

Figure A.6: Effect of income on purchasing behavior: (a) number of purchases; (b) average item price; (c) total money spent.

The same pattern holds for the number of purchases made, the average item price, and the total money spent (Figures A.5(b), A.5(c), and A.5(d)). Different generations also purchase different types of products online (Table A.5). Younger shoppers (18-22 years old) purchase more phone accessories and games, whereas older shoppers (60-70 years old) are much more interested in buying TV shows. Also, blood sugar medicine is purchased more by older users, which is expected.

Differences exist across genders as well. Table A.3 shows the top five categories of purchased products for male and female customers. Even though the ranking of the top products is the same, each product accounts for a different fraction of all purchases within the same gender. To find the most distinctive categories, we compare the fraction of all items bought by each gender, and consider the categories with the largest differences. Books, dresses, and diapering are the categories disproportionately bought by women, whereas games, flash memory sticks, and accessories (e.g., headphones) are the categories purchased more by men. The largest differences range from only 0.5% to 1%, but are statistically significant. This result is aligned with previous research on offline shopping that found men more keen on buying electronics and entertainment products, while women are more inclined to buy clothes [Dholakia, 1999a, Hayhoe et al., 2000]. We repeated the same gender analysis at the product level (Table A.4). Consistently, there is a high overlap between the most purchased products.

In the following, we measure the impact of economic factors on online shopping behavior. We use US census data to retrieve the median income associated with each zip code, making the inferred income for a user an aggregated estimate. Nevertheless, given the large size of this data set, this coarse approach was enough to observe clear trends. The number of purchases, average product price, and total money spent are all positively correlated with income (Figures A.6(a), A.6(b), and A.6(c), respectively).
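A sketch of the zip-code-level income attribution just described: each user inherits the census median income of their zip code, and activity is then correlated with that estimate. The column names and values below are hypothetical placeholders:

    import pandas as pd

    # Hypothetical per-user zip codes and a toy census table of median income by zip code.
    users = pd.DataFrame({
        "user_id": ["u1", "u2", "u3"],
        "zip": ["90089", "10001", "60601"],
        "total_spent": [120.0, 45.0, 310.0],
    })
    census = pd.DataFrame({
        "zip": ["90089", "10001", "60601"],
        "median_income": [38000, 81000, 95000],
    })

    # Every user from a given zip code inherits that zip code's median income.
    users = users.merge(census, on="zip", how="left")
    print(users["median_income"].corr(users["total_spent"]))  # positive on the real data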
While users living in high-income zip codes do not buy substantially more expensive products, they do make more purchases, spending more money in total than users from lower-income zip codes. Although the factors that lead to lower-income households spending less online are multiple and complex, part of this effect can be explained by the reluctance of people who are concerned with their financial safety to trust and make full use of online shopping, as pointed out by previous studies [Horrigan, 2008].

A.3.2 Temporal Factors

Our data set spans a period of eight months, giving us the opportunity to investigate the temporal dynamics of purchasing behavior and the factors affecting it. Besides daily and weekly cycles and periodic purchasing, we observed temporal variations that we associate with financial depletion: users wait longer to buy more expensive items, waiting for their budget to recover from previous purchases.

Figure A.7: Number of purchases per day, with the weekly average.

Figure A.8: Number of purchases in each hour of the day.

Daily and Weekly Cycles

Figure A.7 shows the daily number of purchases over a period of two months. The figure indicates a clear weekly shopping pattern, with more purchases taking place in the first days of the week and fewer purchases on the weekends. On average, there are 32.6% more purchases on Mondays than on Sundays. There also exist strong diurnal patterns in shopping behavior (Figure A.8). Interestingly, most of the purchases occur during working hours (i.e., in the morning and early afternoon). Note that for this analysis we inferred the time zone from the user's zip code, which might differ from the shipping zip code of a purchase.

Researchers have also reported monthly effects, where people spend more money at the beginning of the month, when they receive their paychecks, compared to the end of the month [Evans and Moore, 2011]. To test this first-of-the-month phenomenon, we compared spending on the first Monday of the month with spending on the last Monday of the month. We considered the first and last Mondays, rather than the first and last days, because the strong weekly patterns would result in an unfair comparison if the first and last days of the month are not the same day of the week. Our data does not support the earlier findings: there are months in which the last Monday of the month includes more activity than the first Monday.

Recurring Purchases

Some products are purchased periodically, such as printer cartridges, water filters, or toilet paper. Finding these items and their typical cycle would help predict purchasing behavior. We do this by counting the number of times each item has been purchased by each user, then eliminating from each user's counts those products purchased only once, and lastly aggregating the number of purchases per item. Table A.6 shows the top five such products, along with the median number of days between purchases.

Table A.6: Top 5 items with the most repurchases
Rank  Item name          Median purchase delay (days)
1     Pampers 448 count  42
2     Bath tissue        62
3     Pampers 162 count  30
4     Pampers 152 count  31
5     Frozen             12
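A sketch of the repurchase-cycle computation just described: count how often each user buys each item, keep only items bought at least twice by the same user, and take the median number of days between consecutive purchases. Column names and rows are hypothetical placeholders:

    import pandas as pd

    # Hypothetical purchase log: one row per (user, item) purchase with a timestamp.
    log = pd.DataFrame({
        "user_id": ["u1", "u1", "u1", "u2", "u2"],
        "item":    ["Pampers 152 count", "Pampers 152 count", "HDMI cable", "Bath tissue", "Bath tissue"],
        "time":    pd.to_datetime(["2014-02-01", "2014-03-04", "2014-03-10", "2014-02-15", "2014-04-18"]),
    })

    log = log.sort_values("time")
    # Days between consecutive purchases of the same item by the same user.
    log["delay"] = log.groupby(["user_id", "item"])["time"].diff().dt.days

    repurchases = log.dropna(subset=["delay"])  # items a user bought only once have no delay
    print(repurchases.groupby("item")["delay"].median())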
Out of the top 20 products, only four are neither toilet paper nor tissue (Frozen, an Amazon gift card, chocolate chip cookie dough, and single-serve coffees). In the top 20 list, the only unexpected item is the Frozen DVD, which probably made the list due to users buying additional copies as gifts, or due to purchases by small stores that were not eliminated by our removal criterion of a maximum of 1,000 purchases. Interestingly, the number of days between purchases for most of the top 20 items is close to 1 or 2 months, which might be due to the automatic recurring purchasing that users can set up.

Figure A.9: Distribution of the number of days between purchases.

Figure A.10: Relationship between purchase price and time to next purchase (0.95 confidence intervals are shown, yet too small to be observed).

Finite Budget Effects

Finally, we study the dynamics of individual purchasing behavior. Figure A.9 shows the distribution of the number of days between purchases. The distribution is heavy-tailed, indicating bursty dynamics. The most likely time between purchases is one day, and there are local maxima around multiples of 7 days, consistent with the weekly cycles we observed.

An individual's purchasing decisions are not independent, but constrained by their finances. Budgetary constraints introduce temporal dependencies between the purchases made by a user: after buying a product, the user has to wait some time to accumulate money to make another purchase. Previous work in economics studied models of budget allocation of households across different types of goods to maximize a utility function [Du and Kamakura, 2008] and analyzed conditions under which people are willing to break their budget cap [Chiou and Ting, 2011]. However, we are not aware of any study aimed at supporting the hypothesis that the time of purchase is partly driven by an underlying cyclic process of budget depletion and replenishment.

Figure A.11: Relationship between purchase price and time to next purchase, with 0.95 confidence intervals: (a) users with 5 purchases; (b) users with 9-11 purchases; (c) users with 28-32 purchases.

To test this hypothesis, we examined the relationship between purchase price and the time period since the last purchase. Since different users have different spending power, we considered the normalized change in price given the number of days from the last purchase. In other words, we computed how users divide their personal spending across different purchases, given the time delay between purchases. We then averaged the normalized values over all users, and report the change for each time delay. Figure A.10 shows that as the time delay gets longer, users spend a higher fraction of their budgets, which supports our hypothesis.

To test that our analysis does not have any bias in the way the users are grouped, we performed a shuffle test by randomly swapping the prices of the products purchased by users. This destroys the correlation between time delay and product price. We then repeated the same analysis with the shuffled data, expecting to see a flat line. However, the same increase also exists in the shuffled data, indicating a bias in the methodology.
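A simplified sketch of the shuffle test, using a global permutation of prices in place of the per-user swap described above. Permuting prices destroys any real dependence on the preceding time delay, so any trend that survives the shuffle must come from the methodology rather than from behavior. The arrays are toy placeholders:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy per-purchase records: time delay since the user's previous purchase, and normalized price.
    delays = np.array([1, 3, 7, 14, 30, 60, 2, 90, 5, 21])
    prices = np.array([0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.06, 0.50, 0.09, 0.18])

    def trend(delays, prices):
        """Correlation between time delay and (normalized) price."""
        return np.corrcoef(delays, prices)[0, 1]

    print(trend(delays, prices))                   # observed relationship
    print(trend(delays, rng.permutation(prices)))  # shuffled: near zero if the analysis is unbiased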
This is due to the heterogeneity of the underlying population: we are mixing users with different numbers of purchases. Users making more purchases have lower normalized prices and also shorter time delays, and are systematically overrepresented on the left side of the plot, even in the shuffled data. To partially account for this heterogeneity, we grouped users by the number of purchases (i.e., those who made exactly 5 purchases, those with 9-11 purchases, and those with 28-32 purchases). Even within each group there is variation, as total spending differs significantly across users, which we address by normalizing the product price by the total amount of money spent by the user, as explained above. If our hypothesis is correct, there should be a refractory period after a purchase, with users waiting longer to make a larger purchase. We clearly observe a positive relationship between (normalized) purchase price and the time (in days) since the last purchase (Figure A.11), but not in the shuffled data, which produces a horizontal line. We conclude that the relationship between time delay and purchase price arises due to behavioral factors, stemming from the limited budget of customers.

A.3.3 Social Factors

An individual's behavior is often correlated with that of his or her social contacts (or friends). In online shopping, this would result in users purchasing products that are similar to those purchased by their friends. Distinct social mechanisms give rise to this correlation [Anagnostopoulos et al., 2008]. First, a friend could influence the user to buy the same product by highly recommending it. This is the basis for social contagion in general, and "word-of-mouth" marketing in particular, although empirical evidence suggests that influence has a limited effect on shopping behavior [Leskovec et al., 2007]. Alternatively, users could have bought the same products as their friends because people tend to be similar to their friends, and therefore have similar needs. The tendency of socially connected individuals to be similar is called homophily, and it is a strong organizing principle of social networks. Studies have found that people tend to interact with others who belong to a similar socio-economic class [Feld, 1981, McPherson et al., 2001] or share similar interests [Kang and Lerman, 2012, Aiello et al., 2012]. Finally, a user's and their friend's behavior may be spuriously correlated because both depend on some other external factor, such as geographic proximity. In reality, all these effects are interconnected [Crandall et al., 2008, Aiello et al., 2010] and are difficult to disentangle from observational data [Shalizi and Thomas, 2011]. For example, homophily often results in selective exposure that may amplify social influence and word-of-mouth effects. We investigate whether social correlations exist, although we do not resolve the source of the correlation.

Specifically, we study whether users who are connected to each other via email interactions tend to purchase more similar products than users who are not connected. To measure the similarity of purchases between two users, we first describe the purchases made by each user with a vector of products, each entry containing the frequency of purchase. This approach results in large and sparse vectors due to the large number of unique products in our data set. To address this challenge, we use vectors of product categories instead of product names. There are three levels of product categories, and we perform our experiments at all levels.
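A minimal sketch of the pairwise comparison performed next: each user is represented by a vector of category purchase counts, and similarity is measured as the cosine of the angle between two vectors. The category vocabulary and counts are hypothetical:

    import numpy as np

    def cosine_similarity(u, v):
        """Cosine similarity between two category-count vectors."""
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Toy category-count vectors over the same vocabulary, e.g. [books, games, diapering].
    user_a = np.array([3, 0, 2])
    user_b = np.array([2, 1, 2])  # a contact of user_a
    user_c = np.array([0, 4, 0])  # a randomly chosen, unconnected user

    print(cosine_similarity(user_a, user_b))  # higher for connected pairs on the real data
    print(cosine_similarity(user_a, user_c))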
We compare the similarity of the category vectors of pairs of users who are directly connected in the email network (104K pairs of users) with the same number of pairs of randomly chosen users (who are not directly connected). We use cosine similarity to measure the similarity of two vectors. Using top-level categories to describe user purchases gives an average similarity of 0.420 between connected pairs of users, whereas random pairs have an average similarity of 0.377 (connected pairs are 11% more similar in relative terms). Using the more detailed level-2 categories to describe purchases gives an average similarity of 0.215 for connected versus 0.170 for random pairs of users (a 26% relative increase). Finally, using the most detailed, level-3 categories results in an average similarity of 0.188 for connected vs. 0.145 for random pairs of users (a 30% relative increase). Although the absolute similarity decreases as a more detailed product vector is used, shoppers who communicate by email are always more similar than random shoppers who are not directly connected.

Gender also plays an important role when measuring purchase similarity between user pairs. To quantify this effect, we calculate the cosine similarity between the vectors of purchase counts at the detailed category level (level 3), and take the average of the cosine similarities. Instead of averaging over all connected pairs, we separate the pairs based on the gender of the users in the pair: woman-woman, man-man, and woman-man. The woman-woman pairs have the highest average cosine similarity, with 0.192, followed by man-man pairs with an average similarity of 0.186. Heterogeneous pairs are the least similar, with an average cosine similarity of 0.182. All of these similarities are still greater than those of random pairs of users, which have a similarity of 0.145. The woman-man pairs having the smallest similarity primarily supports our earlier finding about a sensible difference in the types of goods that attract the interest of the two genders. Previous work also found that receiving a shopping recommendation from a friend has a greater positive effect on willingness to purchase online among women than among men [Garbarino and Strahilevitz, 2004]. The highest similarity of female-female pairs might be partly explained by that effect.

A.4 Predicting Purchases

Predicting the behavior of online shoppers can help e-commerce sites to improve the shopping experience by means of personalized recommendations on the one hand, and on the other to better meet merchants' needs by delivering targeted advertisements. In a recent study, Grbovic et al. addressed the problem of predicting the next item a user is going to purchase using a variety of features [Grbovic et al., 2015]. In this work, we consider the complementary problems of predicting (i) the time of the next purchase, and (ii) the amount that will be spent on that purchase. Predicting the exact time and price of a purchase (e.g., using regression) is a hard problem; therefore, we focused on the simpler classification task of predicting the class of the purchase among a finite number of predefined price or time intervals. We experimented with different classification algorithms, and Bayesian network classification yielded the highest accuracy.
From each entry we extracted 55 features belonging to a variety of categories: Demographics of online shoppers (4 features): Gender, age, location (zip code), and income (based on zip code). Purchase price history (19 features): Price of the last three purchases, price category of the last three purchases, number of purchases, mean price of purchased item, median price of purchased items, total amount of money spent, standard deviation in item prices, number of earlier purchases in each price group (5 groups), price group with the most number of purchases and the count for it, and total number of purchases until that point. Purchase time history (13 features): Time of last three purchases, mean time between purchases, median time between purchased, standard deviation in times between purchases, number of earlier purchases in each time group, and time group with the most number of purchases and the count for it. Purchase history of products (4): Last three categories of products purchased, most purchased category. Time or price of the next purchase (1 feature): We also assume that we know when the next purchase is going to happen. This seems unrealistic at rst, but we include this feature because the system is going to make recommendations at a given time, and we assume that the shopper is going to make the decision at that time. For having a symmetrical problem we also consider the price of the next purchase, which would be similar as knowing the budget of the user. 218 Table A.7: Top predictive features for prediction of the price of the next item and their 2 values Rank Feature 2 value 1 Most used class earlier 214,996 2 Number of under $6 purchases 115,560 3 Median price of earlier purchases 106,876 4 Mean price of earlier purchases 91,409 5 Number of over $40 purchases 84,743 Contacts (14 features): Mean, median, standard deviation, minimum, maxi- mum, and 10th and 90th percentile of price and time of the purchases of the contacts of the users. For the aggregated features such as average price of item purchased, we used only purchases in the training period and did not consider future information. To evaluate the proposed approach, we compared results of our classier to three baselines: Random prediction; Price or purchase delay of the previous purchase; Most popular price or purchase delay of the purchases a target user made in the past. A.4.1 Price of the Next Purchase We partition prices in ve classes using $6, $12, $20, and $40 as price thresholds to obtain equally-sized partitions. These thresholds represent (a) very cheap products that cost less than $6 (20.7% of the data); (b) cheap products between $6 and $12 (20.3%); (c) medium-priced products between $12 and $20 (19.3%); (d) expensive 219 products that cost more than $20, but less than $40 (19.9%); and nally (e) very expensive products worth more than $40 (19.8%). Our classier achieves an accu- racy of 43.2% with a +108.7% relative improvement over the 20.7% accuracy of the random classier (i.e., relative size of the largest class). A category of the last purchase and the most frequent purchase category turn out to be quite strong predictors, achieving accuracy of 29.3% and 29.8% by them- selves, respectively. The supervised approach outperforms them with a +47.4% and +45.0% relative improvement, respectively. When measuring the predictive power of the features with the 2 statistics (Table A.7) we nd that the highest pre- dictive power is the most frequent class of earlier purchases, by far. 
When measuring the predictive power of the features with the χ² statistic (Table A.7), we find that the feature with by far the highest predictive power is the most frequent price class of earlier purchases. This suggests that users tend to buy mostly items in the same price bracket. The second feature in the ranking is the number of purchases from the very cheap category, followed by the median and mean of earlier prices. In general, all of the top 16 most informative features are related to the price of earlier purchases. After those, the median time between purchases and the time delay before the last purchase are the most predictive features. The relatively high rank of the last time delay suggests that a recommender system should consider the time that has passed since the user's last purchase and change its suggestions dynamically. In other words, if the user has made a purchase recently, cheaper products should be favored over more expensive ones, whereas if a long period of time has passed since the last purchase, more expensive products should be advertised, as they are more likely to be purchased. All of the demographic features have limited predictive power and are ranked last (even though demographics might affect the purchase history), with income being the most important among them.

A.4.2 Time of the Next Purchase

Similarly to the purchase price, a prediction of the purchase time could be leveraged to make better use of advertisement space. If the user is unlikely to purchase anything for a certain period of time, ads can be momentarily suspended or replaced with ads that are not related to consumer goods. For creating the categories, we chose thresholds of 1, 5, 14, and 33 days. Very short delays are within a day (22.8% of our data), short delays between 1 and 5 days (20.9%), medium delays between 5 and 14 days (19.6%), long delays between 14 and 33 days (18.2%), and very long delays exceed 33 days (18.5%). Training a Bayesian network on all the features yields an accuracy of 31.1%, a +36.4% relative improvement over the 22.8% accuracy of the random prediction baseline. The accuracy of our classifier is also +24.9% higher, in relative terms, than the baseline of predicting the last purchase delay, which has an accuracy of 24.9%. Finally, predicting the user's most frequent purchase delay has an accuracy of 22.2%, which our classifier outperforms by +40.1% in relative terms. Ranking the features by their χ² values (Table A.8), we find that the most informative feature is the number of earlier purchases the user has made so far, followed by the median time delay, the previous purchase delay, the time since the first purchase, and the class of the previous purchase delay.
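A sketch of the kind of χ² feature ranking behind Tables A.7 and A.8. Feature names and values are toy placeholders, and scikit-learn's chi2 scorer is used as a stand-in for whichever implementation the dissertation relied on:

    import numpy as np
    from sklearn.feature_selection import chi2

    # Toy non-negative feature matrix: [n_earlier_purchases, median_delay_days, last_delay_days]
    X = np.array([[2, 30, 25], [40, 2, 1], [5, 14, 20], [60, 1, 3], [8, 10, 12], [35, 3, 2]])
    y = np.array([4, 0, 2, 0, 2, 1])  # time-delay class of the next purchase

    scores, p_values = chi2(X, y)
    ranked = sorted(zip(["n_purchases", "median_delay", "last_delay"], scores), key=lambda t: -t[1])
    for name, score in ranked:
        print(f"{name}: chi2 = {score:.1f}")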
To summarize, we trained two classifiers, for predicting the price and the time of the next purchase. Our algorithm outperformed the baselines in both prediction tasks, by a larger margin in the case of predicting the price. Table A.9 summarizes all of our results, showing a relative improvement of 108.7% for predicting the price of the next item purchased and 36.4% for predicting the time of the next purchase, both over the majority vote baseline. Interestingly, user demographics were not particularly helpful for either prediction; the correlations observed in earlier sections of the paper are masked by other features, such as the history of prior purchases.

Table A.8: Top predictive features for predicting the time of the next purchase, and their χ² values
Rank  Feature                        χ² value
1     Number of earlier purchases    48,719
2     Median time between purchases  35,558
3     Time since the first purchase  30,741
4     Previous time delay            30,692
5     Class of previous time delay   22,710

Table A.9: Summary of the prediction results. Accuracy: percentage of correctly classified samples. Majority vote: always predicting the largest group, equivalent to predicting randomly. Last used: the price or delay of the user's previous purchase. Most used: the group the user had the most in earlier purchases. AUC: weighted average of the area under the curve across classes. RMSE: root mean square error. The improvements are reported over the majority vote.
Prediction     Majority vote  Last used  Most used  Our classifier  Abs. improvement  Rel. improvement  AUC    RMSE
Item price     20.7%          29.3%      29.8%      43.2%           22.5%             108.7%            0.676  0.3806
Purchase time  22.8%          24.9%      22.2%      31.1%           8.3%              36.4%             0.634  0.4272

A.5 Related Work

Most previous research on shopping behavior and the characterization of shoppers has been conducted through interviews and questionnaires administered to groups of volunteers composed of at most a few hundred members.

Offline shopping in physical stores has been studied in terms of the role of demographic factors in attitudes towards shopping. The customer's gender predicts to some extent the type of purchased goods, with men shopping more for groceries and electronics, and women more for clothing [Dholakia, 1999a, Hayhoe et al., 2000]. Gender is also a discriminant factor with respect to attitudes towards financial practices, financial stress, and credit, and it can be a fairly good predictor of spending [Hayhoe et al., 2000]. Many shoppers express the need to alternate between the experiences of online and offline shopping [Wolfinbarger and Gilly, 2001, Tabatabaei, 2009], and it has been found that there is an engagement spiral between online and offline shopping: searching for products online positively affects the frequency of shopping trips, which in turn positively influences buying online [Farag et al., 2007].

Online shopping has been investigated since the early stages of the Web. Many studies have tried to draw the profile of the typical online shopper. Online shoppers are younger, wealthier, and more educated than the average Internet user. In addition, they tend to be computer literate and to live in urban areas [Zaman and Meng, 2002, Swinyard and Smith, 2003, 2011, Farag et al., 2007]. Their trust of e-commerce sites and their understanding of the security risks associated with online payments positively correlate with their household income and education level [Horrigan, 2008, Hui and Wan, 2007], and tend to be stronger in males [Garbarino and Strahilevitz, 2004]. The perception of risk in online transactions influences shoppers to purchase small, cheap items rather than expensive objects [Bhatnagar et al., 2000]. Customers of online stores tend to value the convenience of online shopping in terms of ease of use, usefulness, enjoyment, and savings of time and effort [Perea y Monsuwé et al., 2004]. Their shopping experience is deeply influenced by their personal traits (e.g., previous online shopping experiences, trust in online shopping) as well as other exogenous factors, such as situational factors or product characteristics [Perea y Monsuwé et al., 2004].

Demographic factors can influence shopping behavior and the perception of the online shopping experience. Men value the practical advantages of online shopping more, and consider a detailed product description and fair pricing significantly more important than women do.
In contrast, some surveys have found that women, despite the ease of use of e-commerce sites, dislike the lack of a physical shop experience more than men do, and value the visibility of wide selections of items more than accurate product specifications [Van Slyke et al., 2002, van, 2005, Ulbrich et al., 2011, Hui and Wan, 2007]. Unlike gender, the effect of age on purchase behavior seems to be minor, with older people searching less for online items to buy but not exhibiting a lower purchase frequency [Sorce et al., 2005]. With extensive evidence from a large-scale data set, we find that age greatly impacts the amount of money spent online and the number of items purchased.

The role of the social network is also a crucial factor that steers customer behavior during online shopping. Often, social media is used to communicate purchase intents, which can be automatically detected with text analysis [Gupta et al., 2014]. Also, social ties allow for the propagation of information about effective shopping practices, such as finding the most convenient store to buy from [Guo et al., 2011] or recommending what to buy next [Leskovec et al., 2007]. Survey-based studies have found that shopping recommendations increase the willingness to buy more among women than among men [Garbarino and Strahilevitz, 2004].

The factors leading to purchases in offline stores have been extensively investigated, as they have direct consequences for the revenue potential of retailers and advertisers. Survey-based studies have attempted to isolate the factors that lead a customer to buy an item or, in other words, to understand what the strongest predictors of a purchase are. Although the mere amount of online activity of a customer can predict to some extent the occurrence of a future purchase [Bellman et al., 1999], multifaceted predictive models have been proposed in the past. Features related to the phase of information gathering (access to search features, prior trust of the website) and to the purchase potential (monetary resources, product value) can often predict whether a given item will be purchased or not [Hansen et al., 2004, Pavlou and Fygenson, 2006].

Prediction of purchases in online shopping is a task that has been addressed through data-driven studies, mostly on click and activity logs. User purchase history is extensively used by e-commerce websites to recommend relevant products to their users [Linden et al., 2003]. Features derived from user events collected by publishers and shopping sites are often used to predict a user's propensity to click or purchase [Djuric et al., 2014]. For example, clickstream data have been used to predict the next purchased item [Van den Poel and Buckinx, 2005, Senecal et al., 2005]; click models predict online buying by linking the purchase decision to what users are exposed to while on the site and what actions they perform there [Montgomery et al., 2002, Sismeiro and Bucklin, 2004]. Besides user click and purchase events, one can leverage product reviews and ratings to find relationships between different products [McAuley et al., 2015]. Email is also a valuable source of information for analyzing and predicting user shopping behavior [Grbovic et al., 2015]. Click and browsing features represent only a weak proxy of a user's purchase intent, while email purchase receipts convey a much stronger intent signal that can enable advertisers to reach their audience.
The value of commercial email data has recently been explored for the task of clustering commercial domains [Grbovic and Vucetic, 2014]. Signals for predicting purchases can be strengthened by demographic features [Kim et al., 2003]. Also, information extracted from customers' profiles in social media, in combination with information about their social circles, can help with predicting the product category that will be purchased next [Zhang and Pennacchiotti, 2013].

A.6 Conclusion

Studying online consumer behavior as recorded by email traces allows us to overcome the limitations of previous studies, which focused either on small-scale surveys or on purchase logs from individual vendors. In this work, we provide the first very large-scale analysis of user shopping profiles across several vendors and over a long time span.

We measured the effect of age and gender, finding that spending ability goes up with age until about 30, remains stable until the early 60s, and then starts dropping. Regarding gender, a female email user is more likely to be an online shopper than an average male email user. On the other hand, men make more purchases, buy more expensive products on average, and spend more money. Younger users tend to buy more phone accessories compared to older users, whereas older users buy TV shows and vitamins & supplements more frequently. Using user location, we show a clear correlation between income and the number of purchases users make, the average price of products purchased, and the total money spent. Moreover, we study the cyclic behavior of users, finding weekly patterns where purchases are more likely to occur early in the week and much less frequently on weekends. Also, most purchases happen during work hours, from morning until early afternoon.

We complement the purchase activity with the network of email communication between users. Using the network, we test whether users who communicate with each other make more similar purchases than a random set of users, and we find that this is indeed the case. We also consider the gender of the users and find that woman-woman pairs are more similar than man-man pairs, which in turn are more similar than woman-man pairs.

Finally, we use our findings to build a classifier to predict the price and the time of the next purchase. Our classifier outperforms the baselines, especially for the prediction of the price of the next purchase. This classifier can be used to make better recommendations to users.

Our study comes with a few limitations. First, we can only capture purchases for which a confirmation email has been delivered; we believe this is the case for most online purchases nowadays. Second, if users use different email addresses for their purchases, we would not have their full purchase history. Similarly, people can share a purchasing account to enjoy certain benefits (e.g., an Amazon Prime account shared between multiple people); however, as suggested by the fact that less than 0.01% of the users have goods shipped to more than one zip code, this occurs rarely in our data set. Third, the social network that we considered, albeit big, is not complete; however, the network is large enough to observe statistically significant results. Lastly, we considered items that were purchased together as separate purchases; it would be interesting to see which items are usually bought together in the same transaction.
Appendix B

Uber

B.1 Introduction

The rapid growth of the sharing economy, exemplified by the ride-sharing platforms Uber and Lyft and the home-sharing platforms Airbnb and Couchsurfing, is changing the patterns of ownership and consumption of goods and services. In a sharing economy, consumers exchange services in a peer-to-peer fashion, through matching markets facilitated by social networks and online applications. Instead of owning a car or hailing a taxi, ride-sharing services enable consumers to request rides from other people who own private vehicles, or, in turn, to become drivers offering rides to others. Similarly, home-sharing services enable consumers to stay in private homes while traveling, or to offer rooms in their homes as short-term rentals to others. The various benefits provided to consumers, such as convenience, cost savings, the possibility of extra income, and new social interactions, have fueled the sharing economy's dramatic growth [Hamari et al., 2015].

Arguably, Uber, along with Airbnb, is one of the most successful sharing-economy markets. Founded in 2009, Uber is an online marketplace for riders and drivers. Riders use a smartphone app to request rides. Ride requests are assigned to Uber drivers, who use their own vehicles to provide the rides. Lower prices, short wait times, and the convenience of easy ride requests and payment are considered the main reasons contributing to Uber's popularity among riders [Horpedahl, 2015], while the flexibility of the work schedule and higher compensation rates are among the main reasons making Uber attractive to drivers [Hall and Krueger, 2015]. Uber has grown wildly popular, providing more than a million daily rides as of December 2014 [1], and is the most valued venture-backed company, with a valuation of $62.5B as of December 2015 [2].

[1] newsroom.uber.com/our-commitment-to-safety
[2] nyti.ms/1XD9cdT

Uber's popularity makes it attractive for studies aimed at understanding participation in the sharing economy. But the system is still not well understood. Specifically, what are the characteristics of Uber riders and drivers? What effect do different factors, such as promotions, rider-driver matching, and dynamic (or surge) pricing, have on user participation and retention? Can these factors and characteristics be used to accurately predict users' behavior on Uber, particularly whether a new user will become an active user?

In this work, we study Uber data that contains information about 59M rides taken by 4.1M people over a seven-month period, along with data about 222K drivers over the same time period. This information is extracted from the email confirmation messages sent by Uber to riders after each ride, as well as from weekly reports sent to drivers. The ride email receipts include information regarding the rides, such as the fare, trip length, pick-up and drop-off times and locations, as well as the driver's name. The weekly driver reports include the driver's earnings, the number of rides given in that week, and ratings. By analyzing the usage and demographics of the population of Uber users, we find that an average active Uber rider is a young mid-20s individual with an above-average income. In addition, various demographic groups exhibit differences in their behavior: e.g., younger riders tend to take more rides, older riders take longer and more expensive rides, and more affluent riders take more rides and are more likely to use the more expensive types of cars, such as Uber Black.
We present a detailed demographic analysis of Uber riders and drivers, in terms of age, gender, race, income, and the times of the rides. Our main findings are as follows:

- Uber is not an "all-serve-all" market. Riders have higher income than drivers, and the two groups differ along racial and gender lines.
- Rider and driver attrition is very high, but the influx of newcomers leads to an overall growth in the number of rides.
- We identify characteristics of riders and drivers who become active users.
- Better matches of riders to drivers result in higher ratings.
- Surge pricing does not favor more affluent riders, but mostly affects younger riders (who use the service during peak times, including weekend nights).
- Drivers with many surge rides receive lower ratings, on average, suggesting riders' dislike of surge pricing.
- Using users' initial activity on Uber, we can predict whether a rider or driver will become active or leave Uber.

This work presents an in-depth analysis of the ride-sharing market based on large-scale Uber data covering both riders and drivers. Our analysis reveals the demographic and socioeconomic factors that affect participation in the ride-sharing market, and enables us to predict who will become an active market participant. Since consumer retention is generally much cheaper than consumer acquisition [Rechinhheld and Sasser, 1990], detecting customers who are likely to stop using Uber could help improve consumer retention. For example, by offering promotions to people who are likely to drop out, Uber could stimulate them to remain active users of its services.

B.2 Related Work

Crowdsourcing platforms have emerged as solutions to cheaply execute large amounts of independent micro-tasks that are easy to solve for human workers [Kittur et al., 2008]. As opposed to virtual crowd-markets, in which tasks are executed fully online, more recently a number of platforms that support mobile crowdsourcing have sprung up (e.g., TaskRabbit, GigWalk, OpenStreetMap, Uber). These services have inherently different motivations and structures, as they specifically address tasks that need physical presence in a place, either to create geographic information or to provide a service in a specific location [Goodchild, 2007, Sheppard et al., 2014, Teodoro et al., 2014]. Drawing from the existing literature on characterizing crowdworkers [Mason and Suri, 2011, Buhrmester et al., 2011], here we focus on the ride-sharing service Uber, studying the factors that are linked with the rate of participation.

As ride-sharing services continue to gain popularity, the interest of the scientific community in analyzing their success factors grows as well. Even though we present the first study of Uber data at this scale for both riders and drivers, several studies have been conducted on Uber, mostly at a much smaller scale. For example, a recent PEW study [3] surveyed 4.7K Americans and found that 15% of the population have used ride-sharing applications, compared to 29% of college graduates. Hall and Krueger [Hall and Krueger, 2015] studied Uber drivers' activity along with the results of 601 surveys administered to drivers. They found that the age distribution of the drivers is more similar to that of the general workforce than to that of taxi drivers and chauffeurs. They also studied the growth in the number of Uber drivers in different cities, and showed that Uber becomes popular much faster in certain cities, such as Miami, Austin, and Houston.

[3] http://www.pewinternet.org/2016/05/19/on-demand-ride-hailing-apps/
Even though this study addresses questions similar to ours, the authors only considered drivers, and did not study more complex matters, such as driver-rider matching or usage prediction. Other studies used Uber data along with data from taxi meters to compare Uber rides to taxi rides [Cramer and Krueger, 2015]. Analyzing data from the mobile application OpenStreetCab [Salnikov et al., 2015], Noulas and colleagues compared the cost of Uber rides to that of Yellow Cab rides and found that Uber effectively charges higher fares on average, especially for short but popular routes [Noulas et al., 2015]. Uber drivers drive with a passenger for a higher fraction of their time, compared to taxi drivers; this might be explained by easier ride requests via the mobile app, better matching, or the larger scale of Uber.

Finally, researchers analyzed the effects of Uber's dynamic pricing (also known as surge pricing), which adjusts the ride cost to the demand. Lee et al. interviewed 21 Uber and Lyft drivers [Lee et al., 2015] to learn more about the effect of dynamic pricing on drivers' behavior in ride-sharing services, finding that drivers were not influenced by surge pricing information. Even though surge pricing is a completely opaque mechanism that raises concerns about fairness and may cause frustration for riders [Chen et al., 2015], in general it helps the ride-sharing marketplace [Horpedahl, 2015].

Several studies measured the overall impact of the sharing economy on traditional markets. For example, Zervas et al. [Zervas et al., 2015a,b] showed that 8-10% of the hotels' revenue has been impacted by Airbnb. Moreover, Rayle et al. conducted a survey [Rayle et al., 2016] aimed at finding the reasons why people use ride-sharing services, as opposed to alternatives such as taxis and public transportation. They found that ease of payment, short wait times, and fast service are the top three reasons for using ride-sharing services. In addition, 39% of the survey participants stated that a taxi would be their first alternative if Uber did not exist, and 33% would have used public transport, indicating that Uber is impacting the economy of public transportation services, such as buses and the subway, as well. The clash between the traditional economy and the sharing economy has sparked a polarizing debate about the social cost of the introduction of new sharing-economy services [Rogers, 2015] and about the need for additional policies to regulate such emerging markets [Edelman and Geradin, 2015]. A recent study on the economy of Airbnb suggests that such regulations should be responsive to real-time demands [Quattrone et al., 2016], which in turn calls for data-driven analysis of the sharing platforms.

We also study changes in users' engagement level on Uber. In the context of crowdsourcing platforms, previous work has studied the main incentives that lead to user participation [Lindqvist et al., 2011]; e.g., Airbnb users are motivated to monetize hospitality for a mixture of financial and social reasons [Ikkala and Lampinen, 2015]. Similarly, by looking at the relationship between the activity of drivers and riders and demographic and economic indicators, we aim to find evidence that specific segments of the population may have different incentives to take part in the ride-sharing economy. Engagement also has a direct impact on attrition, or consumer churn, which is one of the key metrics measuring the success of a business, representing the rate at which consumers stop using the service.
The importance of consumer attrition analysis is driven by the fact that retaining an existing consumer is much less expensive than acquiring a new one [Reichheld and Sasser, 1990]. Thus, prediction of consumer churn is extremely valuable for companies and has been the topic of multiple studies. E.g., Richter et al. exploit the information from the users' social network to predict consumer churn in mobile networks [Richter et al., 2010]. In this work, we study both the consumer's (rider's) and the provider's (driver's) attrition.

B.3 Data set

Every time a rider takes a ride, Uber emails a receipt shortly thereafter. This email has a specific format, making it easy to parse. The email includes the following information: pick-up and drop-off times, origin and destination addresses, duration of the ride, distance traveled, type of the car (UberX, UberBlack, etc.), driver's first name, and the fare, along with a breakdown of the price, including whether or not a promotion code was used and whether the surge multiplier was applied (during peak hours the fare is multiplied by a value called the surge). We obtained information about Uber rides of Yahoo Mail users using an automated extraction pipeline that preserves the anonymity of the rider and the driver. In total, we study data on over 59M rides taken by 4.1M users over the period from October 2015 to May 2016. In Figure B.1, we show the number of rides taken on each day. There is a strong weekly pattern, with many rides taking place during the weekends. Also, some holidays, such as New Year's Eve and Halloween, result in large peaks in the number of rides, while other holidays, like Christmas, result in a drop in the number of rides.

Figure B.1: Daily number of rides in our data set.

Drivers receive two separate weekly emails. One email includes the hours they worked each day of that week, the percent of busy hours worked, riders' textual feedback (if any), their average rating over the week, and whether the rating is higher or lower than the average of drivers. The other email includes the money earned each day of the week. Both emails refer to the drivers by their first name. These emails also have consistent formats, making it easy for this information to be extracted. Our data set includes more than 1.9M weekly summaries for 222K drivers, which were extracted while preserving the anonymity of the driver. Moreover, whenever a person joins Uber they receive a welcome email, so, besides the ride information, we know when a user joined Uber, either as a rider or a driver.

In addition to Uber emails, we relied upon the Yahoo Mail network graph during the same period of time. The email graph G consists of pairs of hashed user ids that communicated with each other. For the present analysis, we consider only a subgraph of G induced by the two-hop neighborhood of the users who are Uber riders and/or drivers. Finally, since both riders and drivers are Yahoo Mail users, we also have access to their demographic information, including age, gender, and location at the zip code level. In this study, we conducted our analysis only on users from the US, unless otherwise stated. Further, only for the purposes of this study, we produced income and race estimates for all riders and drivers. Since Yahoo does not collect declared income or race information during the sign-up process, we derived estimates using publicly available US census data that contains race and income distributions for each zip code. Specifically, all drivers and riders from a specific zip code were assigned the median income and race associated with that zip code. The inferred income and race for the user are aggregated estimates (we do not know the ground truth for any specific user) but, given the large size of this data set, this coarse appraisal is enough to observe clear trends.
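To make this estimation step concrete, the following is a minimal sketch, not the actual pipeline; the census table, file names, and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical inputs: census aggregates per zip code, and one row per user.
census = pd.read_csv("census_by_zip.csv")  # columns: zip, median_income, majority_race
users = pd.read_csv("users.csv")           # columns: user_id, zip, age, gender

# Assign every user the aggregate statistics of their home zip code.
# These are coarse, zip-level estimates, not individual ground truth.
users = users.merge(census, on="zip", how="left").rename(
    columns={"median_income": "est_income", "majority_race": "est_race"}
)
```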
Data anonymization: All the analyses have been performed in aggregate and on data which was anonymized with hashed user ids. In addition, to ensure that user privacy is always preserved, our automatic email extraction pipeline hashes any personally identifiable information in the email content. For example, Uber ride receipts contain a message with the driver's first name: "Thank you for driving with David". Our pipeline detects and encrypts the first name of the driver, i.e., replaces it with a hashed value. The same procedure is applied to the first name of the rider. Additional data sources, such as income and race, are aggregated estimates inferred from the zip codes, and are not known for any specific user.
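As an illustration of this redaction step, here is a minimal sketch; the regular expression, salt handling, and hash truncation are assumptions for illustration, not the production code:

```python
import hashlib
import re

SALT = "secret-salt"  # assumption: a real pipeline would manage this secret securely

def hash_token(token: str) -> str:
    """Replace a personally identifiable token with a salted hash."""
    return hashlib.sha256((SALT + token).encode("utf-8")).hexdigest()[:16]

def redact_driver_name(receipt: str) -> str:
    """Detect the driver's first name in a ride receipt and replace it with its hash."""
    pattern = re.compile(r"Thank you for driving with (\w+)")
    return pattern.sub(lambda m: f"Thank you for driving with {hash_token(m.group(1))}", receipt)

print(redact_driver_name("Thank you for driving with David"))
```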
Limitations: Our data set has several limitations. First, our data only includes Uber users who are also Yahoo Mail users, and there may be a selection bias in the subset of users being studied. While this might happen to some extent, given the popularity of Yahoo Mail, with over 300M users (www.comscore.com), we believe the Uber population is large enough to be a representative sample, and the findings could be generalized. Second, our data does not include ratings of individual rides. This information is shared neither with the riders nor with the drivers, in the emails or even on their private profiles. To answer questions regarding the ratings, we take multiple steps to find the subset of drivers the vast majority of whose rides are included in our data. We explain this in more detail later in the paper.

B.4 Understanding Riders

In this section, we examine the relationship between the demographics and characteristics of Uber riders and their activity using the service.

B.4.1 Demographics and number of rides

Figure B.2 shows the distribution of the number of riders, given their age and gender. A typical Uber rider is young (38% of riders are 18-27 years old), and slightly more likely to be a woman (51% are women). Female riders are somewhat younger than males (mean age of men is 34.6 years vs. 33.1 years for women). The vast majority of riders are white (80.5%), followed by Hispanic (8.5%), African-American (8.2%), and Asian-American (2.8%). Table B.1 breaks down riders by race, age, and gender. Hispanic and African-American riders are younger than white and Asian-American riders, but the median number of rides is 3 for all races.

Table B.1: Comparison of riders by race, age, and gender.
Race              % riders  % women  Avg. age  Med. fare
White             80.5%     49.6%    36.1      $12.8
Hispanic          8.5%      52.2%    31.1      $11.0
African-American  8.2%      61.1%    32.4      $11.5
Asian-American    2.8%      50.4%    35.5      $12.8

Figure B.2: Distribution of rider age given their gender.

We consider the average number of rides per week as a measure of riders' activity. We found that, in general, older riders use the service less frequently; e.g., 30-year-old men use Uber 20% more than 50-year-old men (Figure B.3(a)). Although young men and women (aged less than 25 years) use Uber at about the same rate, older men use it slightly more than older women. The values shown in the figure are the averages for a given age and gender. The frequency of rides has a heavy-tailed distribution: most riders have very low activity, while a few riders are very active. The median number of rides overall is only 0.2 rides per week, and the top 10% most active riders take 1.3 rides or more per week.

B.4.2 Duration, length, and cost of rides

The duration, cost, and length of rides are all significantly correlated, but studying each individually helps us understand the types of rides taken by users in different demographic segments. Figure B.3(b) shows the average duration of rides taken by riders of a given age and gender. In general, rides tend to be relatively short (the median duration is 14 minutes). Older riders take longer trips on average (the average length of rides of 60-year-old men is 30% longer than those of 20-year-old men). Women and men do not vary significantly with respect to the length of the rides. The distance traveled shows a similar trend. Most of the rides are relatively short, with half shorter than 4 miles, but 10% are longer than 36 miles.

Figure B.3: Rider activity as a function of age and gender. (a) Number of weekly trips; (b) duration; (c) fare; (d) income.

Since travel time largely determines the fare, the trend in the cost of rides is similar to that of the duration of the rides. The average fare ranges from $13-$21, with older riders spending more per ride. Men and women are very similar, except that middle-aged men spend significantly more than women on rides (Figure B.3(c)). The median fare of all rides is $10, and the total money spent by each rider (not per ride) has a heavy-tailed distribution, with a small fraction of the riders responsible for a considerable fraction of the total spending. Specifically, the top 1% of riders account for 18.8% of all money spent on fares in our data set. The Gini coefficient, which measures the inequality of a distribution on a scale from 0 (uniformity) to 1 (maximum inequality), is 0.785 for total fares, showing very high heterogeneity in spending among riders.
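For reference, a minimal sketch of how the Gini coefficient can be computed over per-rider spending totals (standard formula; the file and example values are illustrative):

```python
import numpy as np

def gini(values):
    """Gini coefficient of a 1-D array of non-negative spending totals."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # Standard formula over 1-based ranks i of the sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n

# Example: highly unequal spending yields a Gini close to 1.
print(gini([1, 1, 1, 1, 96]))  # ~0.76
```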
B.4.3 Income, surge, and car type

Next, we examine the impact of rider income, surge pricing, and car type on rider activity. First, we are interested in finding who is most affected by surge pricing: lower-income riders, who may be priced out by the increase in fares during peak hours, or the more affluent riders, who are willing to pay more for rides during times of high demand. Yahoo Mail users do not report their household income; instead, we estimate their income based on the zip code of their self-declared home location. We use the median income for a given zip code as the user's estimated income. Figure B.3(d) shows the average income of riders of a given age and gender. Surprisingly, older riders have higher income compared to younger riders. However, the percentage of rides with surge pricing for a given age and gender has exactly the opposite trend: older riders are less likely to pay the surge price (Figure B.4(a)). This might be due to younger riders using Uber on weekend nights, when there is surge pricing due to high demand. Even though these trends seem to suggest that riders paying surge prices should have lower income than riders not paying surge prices, if we divide riders based on whether they had a ride with surge pricing or not, we find that riders with at least one surge ride have slightly higher income than riders not paying surge pricing (Figure B.4(b)). These plots seem to be conflicting, but they can be explained simply by the large heterogeneity among users. In short, people with higher income are more likely to take rides with surge pricing, but age plays a much more significant role.

Figure B.4: Riders and surge pricing. (a) Percentage of rides with surge pricing as a function of rider age and gender. (b) Comparison of income of riders who had at least one ride with a surge fare and the rest of the riders.

Uber offers different service options: budget options, such as UberX and UberXL, and more expensive luxury options, such as Black, Select, SUV, and Luxury. Finally, the Pool ride is the cheapest option, as it allows the rider to split the trip cost with another person headed in the same direction. Figure B.5 compares the types of Uber cars requested by riders with different incomes. There is a clear trend, with more affluent riders requesting more expensive cars: e.g., people with an annual income of $100k are 84% relatively more likely to take an Uber Black compared to users with an annual income of $50k.

Figure B.5: Type of Uber car requested by riders given their income.

B.4.4 Ride dynamics

In this section, we try to understand the types of rides that are taken by the users. We start by examining diurnal variations in the number of rides taken. Human behavior generally exhibits very strong daily and weekly trends, and this can be seen in Uber activity (Figure B.6). Weekday morning and afternoon peaks represent riders who use Uber to commute to work. Also, we see peaks during the lunch hour, showing increased usage of Uber for going to restaurants.

Figure B.6: Number of rides at different times of the day and days of the week.

Next, we study round-trip rides. Knowing which rides result in a return trip would be helpful for the system in predicting the needs of the rider and scheduling drivers accordingly. Overall, 9.1% of all rides have another ride from the exact same location back to the previous ride's origin. About 2.5% of riders make round-trip rides. If we consider the time the first ride took place, then 9.9% of rides starting between 5am and 12pm have a return ride, as do 10.2% of rides between 12pm and 7pm and 7.6% of rides between 7pm and 5am, showing that rides in the afternoon are more likely to have a return ride.
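A minimal sketch of how such return trips can be flagged from ride records (pandas; the column names are assumptions, and for simplicity this version only checks the ride immediately following each ride):

```python
import pandas as pd

# Hypothetical ride log: one row per ride.
rides = pd.read_csv("rides.csv")  # columns: rider_id, pickup_time, origin, destination

def has_return_trip(group: pd.DataFrame) -> pd.Series:
    """Flag rides whose next ride by the same rider reverses origin and destination."""
    next_origin = group["origin"].shift(-1)
    next_destination = group["destination"].shift(-1)
    return (next_origin == group["destination"]) & (next_destination == group["origin"])

rides = rides.sort_values(["rider_id", "pickup_time"])
rides["round_trip"] = rides.groupby("rider_id", group_keys=False).apply(has_return_trip)
print(rides["round_trip"].mean())  # fraction of rides followed by a return ride
```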
B.4.5 Promotions

Uber uses promotions to attract new riders, for example, offering them a free first ride (as long as it does not exceed a certain price). To understand the impact of promotions on rider behavior, we extract all rides in which a promotion was used, and compare the characteristics of riders using promotions to the rest of the riders. While we cannot make any causal claims, any uncovered trends could suggest whether promotions are in fact a useful tool for attracting new riders. Table B.2 compares the characteristics of riders who used promotions to the rest of the population. Riders who used promotions are younger, less active, and have lower income. Interestingly, they are also more likely to stop using Uber early on and drop out, if we define dropping out as not taking any rides after the first week.

Table B.2: Comparison between riders who used promotions and those who did not.
                   Promotion  No promotion
% men              44.3%      46.1%
Average age        31.5       34.0
Median # of rides  2          3
Median income      $50.0K     $62.5K
% drop out         59.9%      55.6%

B.4.6 Rider attrition

What happens after a rider's first ride, whether or not a promotion was used? Does the rider become an active Uber user? Or does the rider stop using Uber and revert to his or her previous transportation options? Given the high costs of attracting new customers (advertising, promotions), retaining them is an economic priority for businesses.

To measure rider attrition, we focus on riders who took their first ride during our data collection period and measure changes in their engagement levels over time. Recognizing new riders is feasible due to the welcome email they receive from Uber upon signing up. We exclude riders who took their first ride during the last four months of our data collection period, to ensure that we have at least four months of activity records for the new riders. We also exclude riders who took only one ride during this period (11.5% of riders), because low activity rates could bias results. After filtering, we still have a large number of riders, 295K, due to the large size of our data set.

Next, we characterize each rider with a vector containing the number of rides taken in each month following their first ride. To identify different groups of riders who have similar behavior, we ran a k-means clustering algorithm over the riders. To find the optimal number of clusters, we performed a parameter sweep from k = 2 to k = 15. The mean square error (i.e., the distance from the centers of the clusters) gradually decreases as k increases, but with diminishing returns; after k = 3 the error reduction becomes significantly smaller. We chose k = 3 to balance between the compactness of the model and the quality of clustering.
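A minimal sketch of this clustering step (using scikit-learn as an illustrative stand-in; the feature matrix construction and file name are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical matrix: one row per rider, one column per month since the first
# ride, each entry the number of rides taken in that month.
X = np.load("monthly_ride_counts.npy")

# Sweep k and record the within-cluster sum of squares (the "mean square error").
inertia = {}
for k in range(2, 16):
    inertia[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
# The curve flattens after k = 3, which is the elbow we select.

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```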
Table B.3 shows the number of riders belonging to each cluster, as well as the centers of the three clusters. We see that the vast majority of riders (90.9%) belong to the cluster that has almost no rides after the first month (labeled Inactive). The second cluster of riders (8.0%) has a medium level of activity, almost one ride a week (Low activity). Finally, the remaining riders (1.1%) are highly active and maintain high levels of activity over time.

Table B.3: Size and centers of the clusters of riders, based on their monthly number of rides.
Clusters       % riders  Month 1  Month 2  Month 3  Month 4
Inactive       90.9%     2.1      0.4      0.4      0.5
Low activity   8.0%      8.5      5.8      5.6      5.6
High activity  1.1%      18.0     21.6     23.3     22.1

The Inactive cluster includes riders who abandon the service quickly, while the remaining two clusters include more active riders, who use Uber more frequently. To characterize these riders, we break down each cluster by demographics in Table B.4, showing that the more active riders are younger, less likely to be white, and more likely to be Hispanic or African-American than riders who eventually leave Uber. We find no significant difference between the groups in their gender composition.

Table B.4: Demographics of riders in each cluster.
Clusters       Avg. age  Women  White  Hispanic  African-Amer.  Asian-Amer.
Inactive       35.1      53.3%  80.3%  8.8%      8.3%           2.6%
Low activity   31.9      51.3%  69.7%  11.8%     15.8%          2.7%
High activity  31.2      52.1%  60.3%  13.4%     24.1%          2.3%

B.5 Understanding Drivers

In this section, we conduct an analysis of Uber drivers, focusing on their demographics and earnings, and identify factors that affect driver retention.

B.5.1 Demographics

Figure B.7 shows the number of drivers of a given age and gender. The figure shows a significant difference between the numbers of male and female drivers, with 76% of the drivers being male and typically in their 30s. These results apply only to drivers from the US, and other countries differ widely with respect to driver gender. The US has the highest percentage of women drivers (24.0%), followed by Malaysia (10.1%), Singapore (9.9%), and Canada (9.4%). Surprisingly, the UK has a much lower fraction: only 4.3% of all UK drivers are women.

Figure B.7: Number of drivers for a given age and gender.

Table B.5 shows the breakdown of US drivers by race. Compared to US riders (Table B.1), there are significant disparities in the racial composition of drivers and riders. For example, while the majority of drivers are white (60%), this is much smaller than the percentage of white riders (81%). With regard to gender, women of all races participate significantly less as drivers than as riders. This is in contrast to conventional wisdom, which suggests that the flexibility of driving for Uber part-time would be attractive to women. All races have a similar average age, except for Hispanic drivers, who are two or three years younger. Moreover, all races drive the same number of hours and earn the same amount of money, except for the Asian-American drivers, who work and earn 23%-43% more than drivers of other races, on average.

Table B.5: Comparison of drivers of different races.
Race           % of drivers  Women  Avg. age  Avg. hrs worked  Avg. earning  Above avg. rating  % hrs surged
White          60.0%         21.9%  41.9      15.4 hrs         $355          62.7%              20.1%
African-Amer.  21.6%         36.5%  40.8      14.8 hrs         $341          57.2%              23.9%
Hispanic       13.7%         23.9%  38.5      15.2 hrs         $378          59.8%              21.5%
Asian-Amer.    4.7%          16.4%  41.6      18.2 hrs         $511          58.4%              23.1%

B.5.2 Hours, income, and rating

Next, we examine the weekly number of hours drivers worked, their income, and their ratings. Figure B.8(a) shows the average number of weekly hours worked by drivers of a given age and gender. Interestingly, older drivers work longer hours and are more likely to be full-time Uber drivers. The majority of drivers worked part-time, and only 19% worked 40 hours or longer in a week. The weekly hours worked is the length of time the driver was active and received ride requests, not necessarily the hours the driver spent driving.

Figure B.8: Average number of hours worked and weekly earnings of drivers, given their age and gender. (a) Hours worked; (b) earning.

Our data set also includes the rate the driver was paid that week. Figure B.8(b) shows the weekly earnings for drivers of a given age and gender. The main factor affecting the earnings is surge pricing, and drivers working during peak hours earn much more per hour. Younger drivers earn more, and men earn slightly more than women.
Considering all the weeks drivers worked, 25% of drivers were paid $21-25 per hour, and 1.5% of drivers had a rate lower than $7.25 per hour, the federal minimum wage in the US. A typical taxi driver salary is $30k a year, or $14.4/hour, which is lower than the hourly rate of 86% of all Uber drivers.

Figure B.9(a) compares the number of hours worked by drivers with at least one surge ride with the rest of the drivers, and Figure B.9(b) shows the earnings of these drivers. Drivers who have at least one surge ride work almost as much as the rest of the drivers, but they earn significantly more: the median weekly earning for drivers with a surge ride is $180, while the rest of the drivers have a median of only $80. On average, drivers who have at least one surge ride earn 60% more than the rest of the drivers, while working the same number of hours.

Figure B.9: Comparison of drivers with at least one surge ride to the rest of the drivers. (a) Hours worked; (b) earning.

Some studies and articles suggest that surge pricing is frustrating for riders [Chen et al., 2015, sur]. To verify this, we compare the ratings for the weeks in which a driver had many surge rides with the weeks in which the driver had fewer rides with surge pricing. More specifically, we group the weeks based on the percentage of earnings from surge, as an estimate of the percentage of surge rides. Then, for each group, we calculate the percentage of weeks that have an above-average rating. Figure B.10 shows that initially the rating increases slightly as drivers serve more surge rides, but drivers who have many surge rides receive worse ratings. This is in line with the earlier studies and suggests that earning more money could come at the expense of worse ratings.

Figure B.10: Percentage of above-average ratings given the percentage of surge earnings in a week.
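A minimal sketch of this grouping (pandas; the weekly summary layout and column names are assumptions):

```python
import pandas as pd

# Hypothetical weekly driver summaries.
weeks = pd.read_csv("driver_weeks.csv")
# columns: driver_id, week, earnings, surge_earnings, above_avg_rating (bool)

weeks["surge_share"] = 100 * weeks["surge_earnings"] / weeks["earnings"]
weeks["surge_bin"] = pd.cut(weeks["surge_share"], bins=range(0, 101, 10), include_lowest=True)

# Percentage of weeks with an above-average rating in each surge-share bin.
rating_by_surge = 100 * weeks.groupby("surge_bin", observed=True)["above_avg_rating"].mean()
print(rating_by_surge)
```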
Finally, we look at the ratings of drivers with respect to their age and gender. Drivers are rated by riders after each ride. Figure B.11 shows the fraction of weeks in which a driver of a given age and gender had an above-average rating. Generally, older drivers tend to receive lower ratings and are more likely to receive a below-average rating. Also, among 30-50 year old drivers, women tend to receive higher ratings.

Figure B.11: Percentage of above-average weeks for a given age and gender.

B.5.3 Driver retention

We study factors that correlate with driver activity. Similar to our analysis of riders, we cluster drivers based on the number of hours worked each month since joining Uber (Table B.6). With k = 3 clusters, a large fraction of drivers stop working almost completely, driving fewer than five hours during a period of a month. However, this fraction (73.3%) is much lower than the fraction of riders who stop using Uber. The lower rate of driver attrition could be due to the higher effort required to become an Uber driver compared to an Uber rider: the sign-up process for drivers includes a background check, submission of documentation, and completion of a city-knowledge test [Hall and Krueger, 2015]. About 21.0% of drivers average at least half an hour of driving per day over the four months, and the remaining 5.7% of drivers are very active, working longer than three hours per day on average. The number of hours that the drivers worked drops across all three clusters.

Table B.6: Size and centers of the clusters of drivers, based on their monthly hours worked.
Clusters       % drivers  Month 1  Month 2  Month 3  Month 4
Inactive       73.3%      20.1     4.7      3.0      2.2
Low activity   21.0%      89.9     45.1     26.3     16.4
High activity  5.7%       150.3    133.8    126.8    94.1

We also characterize the drivers in each cluster in Table B.7. The first cluster (labeled Inactive) includes drivers with the lowest engagement levels, who eventually stop driving for Uber, while the other two clusters contain active drivers with different engagement levels. Active drivers tend to be older, and more likely to be men, white, or Asian-American, than the Inactive drivers.

Table B.7: Demographics of drivers in each cluster.
Clusters       Avg. age  Women  White  Hisp.  African-Amer.  Asian-Amer.
Inactive       37.1      40.3%  62.6%  26.1%  3.4%           7.9%
Low activity   43.2      31.3%  67.1%  22.7%  2.3%           7.8%
High activity  43.6      25.3%  76.2%  13.3%  1.0%           9.5%

B.6 Rider vs Driver

In this section, we answer questions that involve both riders and drivers at the same time, including comparing their demographics and studying the effect of matching on ratings.

B.6.1 Demographic comparison

First, we are interested in seeing whether Uber has an "all-serve-all" economy, or whether the riders have different demographics or higher income. As shown in Figure B.12, riders have higher income compared to drivers: the median income for a rider is $62.4k, while the median income of drivers is $55.3k. Also, riders are 51% more likely to be men, and are 7.3 years younger than drivers on average. If we pick a random rider and driver, the rider is 34% more likely to be white and the driver is 5 times more likely to be African-American. Even though the riders and drivers have big differences, a considerable 17.4% of drivers are also riders.

Figure B.12: Comparison of income of riders and drivers.

B.6.2 Effect of matching

We consider the age, gender, and race of riders and drivers to see if there is any pattern in the ratings with respect to the match. First, we take the following steps to match a rider and a driver:

1. Retrieve the driver's first name hash and the date of the ride from the ride email receipts.
2. Match to all drivers with the same first name hash from the weekly summary emails.
3. Eliminate the drivers who did not earn more than the fare of the ride on that day.
4. Eliminate drivers who are in a different state than the rider.
5. Consider it a match if there is only one driver left.
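A minimal sketch of these matching steps (pandas; the table layouts and column names are illustrative assumptions):

```python
import pandas as pd

rides = pd.read_csv("rides.csv")          # rider_id, date, state, fare, driver_name_hash
drivers = pd.read_csv("driver_days.csv")  # driver_id, date, state, daily_earnings, name_hash

def match_ride(ride):
    """Return the unique driver consistent with a ride receipt, or None."""
    candidates = drivers[
        (drivers["name_hash"] == ride["driver_name_hash"])  # step 2: same first-name hash
        & (drivers["date"] == ride["date"])
        & (drivers["daily_earnings"] > ride["fare"])        # step 3: earned more than this fare that day
        & (drivers["state"] == ride["state"])               # step 4: same state as the rider
    ]
    # Step 5: accept the match only if exactly one candidate remains.
    return candidates["driver_id"].iloc[0] if len(candidates) == 1 else None

rides["driver_id"] = rides.apply(match_ride, axis=1)
```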
Next, we need the rating for each ride, but that is shared with neither the riders nor the drivers; we can only get the weekly summary of ratings for drivers. So, we find the drivers that have a sufficient number of rides in a week (at least 10) and a large enough fraction of their rides matched with a rider (at least 75%). Then, we consider only these driver-weeks and compare the weeks in which the rating is above average with the weeks in which it is below average, given the age, gender, or race difference between the riders and the driver. For quantifying the effect of age, we compare the average age difference between riders and drivers for the weeks with an above-average rating with the average age difference for the weeks in which the ratings were below average.

We find that age plays a considerable role: the average age difference is 1.7 years smaller for the weeks that have an above-average rating, 11.4 (±0.47, 95% confidence interval) years vs. 13.1 (±0.56) years. Similarly for race, the weeks that had higher-than-average ratings had a higher percentage of riders and drivers of the same race. For above-average weeks, the rider and driver had the same race in 41.5% (±0.2) of cases, while for below-average weeks the rider and driver had the same race in 39.5% (±0.2) of cases.

To measure the effect of gender, we consider women and men drivers separately, and compare the percentage of above-average weeks given the fraction of men/women riders. Gender plays an interesting role for women drivers, where a lower percentage of women riders resulted in higher ratings. The weeks in which the majority of the riders were men were 12.4% more likely to have an above-average rating (Table B.8). The trend is more complicated for men drivers, where having 0%-45% or 55%-100% men riders results in higher ratings compared to the cases with 45%-55% men riders. This filtering and these statistically significant results were only possible due to the large size of our data set.

Table B.8: Percentage of above-average weeks for women and men drivers, given the percentage of women or men riders.
Women drivers:
% women riders  % above average weeks  Standard error
0%-45%          62.6%                  3.0
45%-55%         53.4%                  3.2
55%-100%        50.2%                  3.2
Men drivers:
% men riders    % above average weeks  Standard error
0%-45%          60.2%                  1.6
45%-55%         57.2%                  1.4
55%-100%        61.9%                  1.2

B.7 Prediction

We use the findings presented in the earlier sections to predict whether a new rider or driver will become an active Uber user.

B.7.1 Riders

We define the prediction problem as follows: given all the information about a rider and his or her Uber activity during the first two weeks since joining Uber, will that person become an active rider or not? We define as active riders those who take six or more rides in weeks 3-8 of using the service (i.e., at least one ride a week on average). To make this prediction, we use the following sets of features:

User characteristics: age, gender, location, income, race, education, and zip code.
Ride features: # of rides, average distance, average price, average duration, fraction of rides in the second week, percentage of rides with surge pricing, number of cities Uber was used in, fraction of rides on the weekend and on weekdays, fraction of rides in the morning, afternoon, or at night, fraction of rides with a promotion, and number of distinct origins and destinations.
Driver features: for the rides that have been matched: driver demographics, driver ratings, and the age, gender, and race difference of riders and drivers.
Social features: number of Uber rider and Uber driver friends, based on the email network graph.

We extract all of the above features and balance the classes by under-sampling the larger class. This results in half of the users in the data set being active users, which leads to a 50% baseline for random prediction. Then, we select a random set of 80% of the users for training and use the remaining 20% for testing. We use the C5.0 classifier [Quinlan, 2004] for our predictions and achieve an accuracy of 75.2%, which is a 50.4% relative improvement over the baseline. The precision is 0.786 and the recall is 0.687. We also define a more intelligent baseline, where we predict that a user is going to be an active user if the user had more than two rides in the first two weeks (the median number of rides taken by all riders in the first two weeks).
Surprisingly, this baseline performs much better than the random baseline and achieves an accuracy of 74.3%, which is still slightly lower than that of our classifier. This simple baseline indicates that the signal from the early activity of the users is strong enough to be an accurate predictor.

We use logistic regression to quantify the importance of the features. Since regression is sensitive to collinearities in the data, we first eliminate correlated features by calculating the pairwise correlation coefficients (shown in Figure B.13) and removing one feature from each pair with high correlation (> 0.7 or < -0.7). Table B.9 shows the results of the logistic regression on the remaining 12 independent variables, which we normalize first. Older users are less likely to become active Uber riders. Men and riders with more trips in the first two weeks are more likely to become active riders, but those who had more expensive rides, used more expensive car types (such as Uber Black), or had a higher fraction of rides on weekends are less likely to become active Uber riders in the future.

Figure B.13: Correlation between the features of the riders. Pairs without a statistically significant correlation are crossed (p-value < 0.05).

Table B.9: Results of logistic regression on the independent variables for the riders. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
Variable                           Coeff.
Total # of rides                   0.340***
% of out of state rides            0.330
# of different cities              0.113***
Gender (men)                       0.054***
# of rider friends                 0.006
Average fare                       -0.009***
Age                                -0.020***
% of rides at night                -0.098***
# of driver friends                -0.162
% of rides in weekends             -0.382***
% of rides in expensive car types  -0.645***
% surge rides                      -0.837***
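A minimal sketch of this pipeline, using scikit-learn's decision tree as an illustrative stand-in for the C5.0 classifier (the feature file and column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

data = pd.read_csv("rider_features.csv")  # hypothetical: one row per new rider, label "active"

# Balance the classes by under-sampling the larger class (50% random baseline).
n = data["active"].value_counts().min()
balanced = data.groupby("active", group_keys=False).sample(n=n, random_state=0)

X, y = balanced.drop(columns=["active"]), balanced["active"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred), precision_score(y_test, pred), recall_score(y_test, pred))

# For the regression analysis, drop one feature from each pair with |r| > 0.7.
corr = X_train.corr().abs()
drop = {c2 for i, c1 in enumerate(corr.columns)
        for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.7}
X_reduced = X_train.drop(columns=list(drop))
```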
B.7.2 Drivers

We conduct a similar prediction task for the drivers, based on the hours worked instead of the number of rides. We define active drivers as those who worked 10 hours or more per week in weeks 3-8 (since joining Uber). Those who worked less than 10 hours a week in weeks 3-8 are inactive drivers. We consider the following features for driver prediction:

User characteristics: age, gender, location, income, race, education, and zip code.
Drive features: # of days worked, # of hours worked, # of rides given, ratings, rate of earning, % of busy hours worked, acceptance rate, and missed earnings, for weeks 1 and 2 separately.
Rider features: for the rides that have been matched: rider demographics, and the age, gender, and race difference of riders and drivers.
Social features: number of Uber rider and Uber driver friends.

With the same setup as above, and a 50% random baseline, we achieve 83.1% accuracy, which is a 66.2% relative improvement over the baseline. Precision is 0.775 and recall is 0.689. If we define another baseline similar to the one for riders, using the median hours worked in the first two weeks, that baseline achieves an accuracy of 81.9%, which is significantly higher than the random baseline, but again slightly lower than that of our classifier. As with riders, simple rules derived from the drivers' early behavior are a very strong signal of future usage.

Figure B.14: Correlation between the features of the drivers. Pairs without a statistically significant correlation are crossed (p-value < 0.05). The suffixes 1 and 2 denote values from the first or second week.

We also remove the correlated features (using Figure B.14) and carry out logistic regression over the non-correlated features. Table B.10 shows that older users, men, and drivers who worked more and had higher earning rates are more likely to become active drivers, but the drivers who had a lower acceptance rate (% of rides they accepted to deliver) are less likely to become active drivers.

Table B.10: Results of logistic regression on the independent variables for the drivers. *** p-value < 0.001, ** p-value < 0.01, * p-value < 0.05
Variable                  Coeff.
Gender (men)              0.371**
Driver's rating           0.227
# hours drove             0.157***
Age                       0.037***
Earning rate              0.029*
Money missed in the week  -0.002
Acceptance rate           -0.015*
# of busy hours worked    -1.479

B.8 Conclusion

This work characterizes Uber's riders and drivers. We consider age, gender, and race and show how different populations behave differently. For example, younger riders use Uber more frequently than older riders, but they take shorter rides. Considering gender, while the riders have a balanced gender split, drivers have a very unbalanced split, with 76% of drivers being men.
We also show that riders have about $12k higher annual income than drivers. Our study of surge pricing shows that drivers who take advantage of busy hours can earn on average 60% more, while working the same number of hours. We also study the ratings given to the drivers by riders. We find that older drivers tend to get lower ratings, and that women drivers who are 30-50 years old tend to get higher ratings. Interestingly, the matching of riders and drivers has an effect on the ratings: riders and drivers having a smaller age difference or having the same race results in a higher rating, and for women drivers, when a smaller fraction of riders are women, the rating tends to be higher. These findings could be used to perform better matching and improve users' experience.

Finally, we focus on users' engagement levels and show that the vast majority of users become less active and drop out after just a few weeks. By leveraging our findings, we are able to predict the users who become active riders or drivers with high accuracy. Prediction of user attrition or abandonment can be helpful for Uber to focus on these users, as keeping users is much less expensive than acquiring new ones.

Appendix C

iPhone's Digital Marketplace

C.1 Introduction

Consumer spending is an essential component of the economy, accounting for 71% of the total US gross domestic product (GDP) in 2013 [spe]. In spite of representing just over 10% of all consumer spending [onl], online shopping is rapidly growing as people become more comfortable with online payment systems, security, and delivery of the purchased goods. Over the last three years, online sales grew over 45% [Bhatnagar et al., 2000] and are showing signs of exponential growth.

One of the largest and fastest-growing online markets is Apple's iOS market, where people can make digital purchases from different categories. Apple's revenue from digital purchases surpassed $20 billion in 2015 [app, a] and more than doubled in just three years [app, b,c]. Mobile digital markets offer a wide variety of digital goods, including apps, songs, movies, and digital books. They also include items that users can buy within an app, called in-app purchases, such as virtual currencies, bonuses, extra game lives and levels, etc. Despite the popularity of the iOS market, there has not been a large-scale study characterizing users' spending on different types of content and apps. For example, it is not known how much is spent on in-app purchases compared to songs, or how disproportionate the spending is. Learning how people spend their money in this context has direct practical implications for the business of several stakeholders, including app developers and managers of online app stores, but it also has important theoretical implications for the understanding of consumer behavior in an emerging market whose dynamics are still poorly known.

We study a longitudinal data set extracted from hundreds of millions of email receipts for digital purchases on iPhones, iPads, and other iOS devices (which we refer to as "iPhone purchases" for brevity). Besides its large scale, our data has two unprecedented advantages over other data collected in the past. First, besides recording the purchase history, it also includes demographic information, such as people's age, gender, and country of residence. This enables us not only to characterize the consumer population, but also to study how spending differs by income and location.
For example, we find that the average spending is not correlated with income but strongly depends on the country of residence. Second, in contrast to the application-centric view of previous works, our data is user-centric and allows for the observation of user spending behavior across multiple apps. This allows us to study how users abandon an app and start using a new one, and to compare the behavior of users who make purchases from a single app with users who make purchases from multiple apps simultaneously.

We analyze how users spend money across different categories and show that a small fraction of users are responsible for the majority of spending. We call these users big spenders. Moreover, our analysis informs predictive models of digital spending behavior. We first show that the time between purchases is best described by a Pareto distribution. We then build a supervised classifier that accurately predicts whether a user will make a purchase from a new app or from an existing app that he/she previously consumed. Finally, based on the outcome of the previous step, we predict the exact app from which the user will purchase.

In summary, our main findings and contributions are as follows:

- In-app purchases account for 61% of all money spent on digital purchases on iPhones, followed by songs (23%) and app purchases (7%).
- The spending is highly heterogeneous: the top 1% of spenders account for 59% of all money spent on in-app purchases.
- Big spenders tend to be 3-8 years older, 23% more likely to be men, and 31% less likely to be from the US, compared to the typical spender. Interestingly, income is comparable between the big spenders and the rest of the users.
- Big spenders become slower to repurchase from an app as time passes, but their rate of spending within an app initially increases, then decreases.
- From the perspective of app developers and ad networks, big spenders are the most valuable targeting segment. Even if they abandon an app they are frequently buying from, they are 4.5x more likely to be a big spender in a new app compared to a random user.
- We model the entire purchasing process in several steps: modeling the time between purchases, predicting a purchase from a new app or an app from which the user has already purchased, and predicting the exact app of an in-app purchase.

Both consumers and producers in the mobile app market might benefit from this study. Our results can inform the deployment of better app recommendation systems that can lead people to download apps they are more likely to enjoy, thus creating higher revenue opportunities for app developers.

C.2 Data set and Marketplace

Shortly after each digital purchase on an iPhone (or any other iOS device), the user receives a confirmation email with details of the purchase. This email contains information about the purchase, including the amount of money spent and the type of purchase. The email has a specific format, making it easy to parse automatically. We obtained information about digital purchases of Yahoo Mail users, using an automated pipeline that hashes the names and IDs of users to preserve anonymity. All the analyses have been performed in aggregate and on non-personally identifiable data. We gathered data covering 15 months, from March 2014 to June 2015. Our data set includes 26M users who together made more than 776M purchases totaling $4.6B.
There are six main categories of iPhone purchases: applications (apps), songs, movies, TV shows, books, and in-app purchases (purchases within an app, e.g., bonuses or coins in games). These categories differ vastly in the numbers of purchasers: 16M people purchased at least one song, but only 671K people purchased a TV show. The number of purchases in each category varies greatly as well: there are 430M song purchases and 255M in-app purchases, while movies, books, and TV shows have fewer than 40M purchases all together. The total money spent in each category varies even more: in-app purchases account for $2.8B, or 61%, of all money spent on digital purchases; 23% of the money is spent on songs, 7% on app purchases (purchasing the apps themselves, not the purchases within apps), 6% on movies, 2% on books, and only 0.7% on TV shows. Even though there are considerably fewer total in-app purchases compared to songs (60% fewer), the money spent on in-app purchases is 2.7 times more than on songs, showing that the average in-app purchase is much more expensive than a song purchase. Figure C.1 shows the number of users, purchases, and the money spent on each of these six categories.

Figure C.1: Percentage of users, purchases, and money spent on each category.

Our data set also includes user age, gender, and zip code, as provided by the users at the time of sign-up. Spending varies significantly based on demographics. Figure C.2(a) shows the cumulative distribution function (CDF) of the spending for men and women. Men spend more money on purchases than women; the median spending for women is $31.1, and for men it is $36.2, which is 17% higher. Age also affects spending: Figure C.2(b) shows that the peak age for iPhone spending is the mid-30s, and after that the spending level decreases quickly.

Figure C.2: Effect of gender and age on iPhone purchases. (a) Spending and gender; (b) spending and age.

For US residents, we use the median income of the zip code they declared as their residence as an estimate of their income. Surprisingly, income only affects the spending of people with less than $40K annual income (Figure C.3). This is in contrast with online shopping, where users with higher income tend to spend more money shopping online [Kooti et al., 2016].

Figure C.3: Effect of income on spending. There are more than 10k users for each income category.

Spending on iPhones varies considerably by geography. Figure C.4(a) shows that European countries, especially Eastern European and Scandinavian countries, have the highest spending per Yahoo Mail user. Canada, Mexico, and Australia have higher spending per person than the US, while most African and Asian countries have lower levels of spending.

Figure C.4: Heatmap of the median amount of money spent by users in each country and in the US. (a) World; (b) US.

There are also more than 154K applications in our data set with at least one app purchase or in-app purchase. As shown above, the earnings from in-app purchases are considerably higher than from app purchases themselves (almost 9 times higher).
Table C.1 shows the top 10 apps by in-app earnings, along with their num- bers of purchases, average purchaser age, and percentage of female purchasers. We make some observations about these data. First, there are considerable dierences among the earnings of top apps. Second, the average price of in-app purchases varies widely across dierent apps: while there are more than 3.3 times as many purchases in Candy Crush compared to Clash of Clans, the earnings for Clash of Clans is 2.1 times higher than for Candy Crush. Third, there is a signicant dier- ence in the demographics of purchasers of dierent apps. The average age of buyers is 33 years for Clash of Clans and 49.4 years old for DoubleDown Casino and 81% of the users making purchases from Farm Heroes Saga are women, compared to only 18% for Boom Beach. Other apps, such as Pandora Radio and Net ix (not shown in the table), have a balanced audience. Knowing the audience of an appli- cation could be useful for both the advertisers and game designers. Advertisers can target the particular population and the game designers can make changes 268 Table C.1: Top 10 apps by in-app earnings, with demographics. App Name Earnings # of Purchases Avg. Age % Women Clash of Clans $356.5M 13.6M 33.0 29% Candy Crush Saga $168.0M 45.3M 40.3 70% Game of War $159.8M 1.9M 35.0 25% Boom Beach $60.9M 1.7M 34.4 18% Hay Day $52.0M 3.2M 36.8 67% Farm Heroes Saga $40.4M 6.8M 42.3 81% Candy Crush Soda $37.5M 8.8M 41.0 77% Big Fish Casino $32.9M 1.1M 44.5 57% DoubleDown Casino $27.2M 1.1M 49.4 65% Pandora Radio $24.3M 5.7M 37.8 54% to their apps to make them more appealing to the audience that is not currently engaged. We also collected the category information for each application, e.g., puzzle game or travel, from Apple's iTunes. Then, for each category, we calculated the percentage of people who purchased an app or made an in-app purchase from an app belonging to that category. The top ve categories by gender are shown in Table C.2. Men are more likely to make a purchase from apps relating to sports, and women prefer games, especially brain games. Similarly, we found the top 5 categories with the youngest and oldest average age. Younger buyers are interested in photo and video apps, racing games, and social networking applications, while older buyers are interested in more general applications for food, weather, business, and travel (Table C.3). Limitations. Our data set only includes purchases from people who are Yahoo Mail users, there may be a selection bias in the subset of users being studied. While this might occur to some extent, but given the popularity of Yahoo Mail (over 300M 269 Table C.2: Top 5 gender-biased categories. Top Categories for Men % Men Top Categories for Women % Women Sport magazines 84.6% Board games 70.5% Sports 74.9% Word games 64.9% Racing games 69.9% Puzzle games 63.6% Navigation 68.1% Family games 62.1% Sports games 67.4% Educational games 61.5% Table C.3: Top 5 gender-biased categories. Categories for Youth Avg. Age Categories for Older Users Avg. Age Photo & Video 32.2 Food & Drink 47.7 Strategy games 33.2 Weather 45.4 Racing games 33.6 Travel magazines 45.2 Trivia games 33.6 Board games 44.7 Social networking 34.1 Business 42.9 users 1 ), we believe our data includes a somewhat representative sample and the ndings can be generalized to other users. 
Moreover, the email receipts are sent within a day after the purchase, so our data set does not include the exact time of the purchase, and we have to conduct our analysis at the granularity of a day.

C.3 Big spenders

In this section we focus on the category of in-app purchases, since it is the largest spending category in the digital marketplace. We first show that a small number of users are responsible for the majority of spending. Then, we characterize these users demographically, by age, gender, country of residence, and income. Finally, we study how these buyers discover a specific app, start spending money within it, and how they stop making purchases within it as their interest in that app diminishes.

In-app spending patterns vary significantly across different users. In Figure C.5 we show the PDF and CDF of spending on in-app purchases. It demonstrates that the spending has a heavy-tailed distribution, with most users spending $20 or less, and a small minority (2.4% of users) spending more than $1000 over the studied 15 months.

Figure C.5: PDF and CDF of users' spending on in-app purchases. (a) PDF; (b) CDF.

To better demonstrate the disparities in spending, we plot the Lorenz curve, which shows the percentage of the total spending accounted for by different percentiles of the population when ordered by spending (Figure C.6). The diagonal line represents perfect equality of spending (i.e., when each person spends the same amount). The larger the distance from the diagonal, the larger the inequality in spending. The figure shows very high inequality: the bottom half of buyers spend less than 2% of the total amount of money, while the top 10% are responsible for 84% of all spending. In fact, just the top 1% is responsible for 59% of all the money spent on in-app purchases. This inequality can be captured by the Gini coefficient, which summarizes the distance from equality in a single number; it turns out to be 0.884, representing extremely high inequality. Interestingly, if we consider the earnings of the apps, the inequality is even higher, with a Gini coefficient of 0.989, and 0.1% of the apps earning 71% of all the in-app purchase income (Figure C.7). As a comparison, the Gini coefficient for the income of the US population is 0.469, which is the highest among Western industrialized nations (according to census data).

Figure C.6: Lorenz curve of the spending of the users on in-app purchases, showing high disparity among users (Gini = 0.884).

Figure C.7: Lorenz curve of the earnings of the apps, showing extremely high inequality in the earnings of the apps (Gini = 0.989).
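A minimal sketch of how such a Lorenz curve can be computed from per-user totals (numpy; the input file is an assumption, and this complements the Gini sketch in Appendix B):

```python
import numpy as np

def lorenz_curve(values):
    """Return (population share, cumulative spending share) along the Lorenz curve."""
    x = np.sort(np.asarray(values, dtype=float))
    spend_share = np.cumsum(x) / x.sum()
    pop_share = np.arange(1, x.size + 1) / x.size
    return pop_share, spend_share

totals = np.loadtxt("in_app_totals_per_user.txt")  # hypothetical per-user spending totals
pop, share = lorenz_curve(totals)
# Spending share of the bottom half of buyers:
print(share[np.searchsorted(pop, 0.5)])
```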
As mentioned above, the top 1% of buyers, representing 154K users, are responsible for the majority of in-app purchases. In the rest of this section, we focus on this set of users, and since most of the apps with high earnings were games, we call this set of users big spenders. We also calculated the top 1% of spenders in each month separately. Among the big spenders, who are the top spenders over the entire 15-month period, 68.4% are in the top 1% of spenders for half of these months or fewer. This shows that there are bursts and pauses in the individual spending levels of each big spender.

C.3.1 Characteristics of big spenders

We start by comparing the demographics of big spenders with the rest of the users in our data set. Understanding the differences in demographics could be useful for advertisers and app stores to better target the populations that are more likely to be big spenders.

Big spenders are 23% relatively more likely to be men (59% of big spenders are men, vs. 48% of all users). Regardless of gender, big spenders tend to be older. Men who are big spenders have a median age of 37 years, while the median age in our data set is 34 years. The difference is even larger among women: 43 vs. 35 years. Moreover, there are considerable differences in the country-of-residence statistics for big spenders compared to the typical user. For some countries, like the US, a random user is less likely than average to be a big spender, but for other countries users are much more likely to be big spenders. For example, Greek, Turkish, and Romanian users are respectively 50, 33, and 29 times more likely to be big spenders than users from our study population at large.

We also consider the role of income for people from the US, by calculating the fraction of people with a given income who are big spenders. Figure C.8 shows that income has a very small effect on whether users are big spenders, except for users with less than $20K or more than $140K annual income. Note that the percentage is almost always smaller than the expected 1% (which is how we define big spenders) because this analysis is conducted only on users from the US, and big spenders are less prevalent in the US.

Figure C.8: Fraction of big spenders, given the income of the users.

C.3.2 App adoption and abandonment

To better understand the behavior of big spenders, we focus on how they start making in-app purchases within apps, and how they abandon them. In order to analyze the behavior of big spenders over a long period of time within apps they use frequently, we only considered pairs of big spenders and apps that had more than 50 in-app purchases. Furthermore, we were interested in the entire time span of the user's app usage, i.e., from the first time they make a purchase to the last time they make a purchase, so we filtered out the cases in which the usage started before or ended after the period of data collection. This was done by considering only the (user, app) pairs for which the first purchase happened after the first month in our data set, and the last purchase was before the last month.

We start our analysis by looking at the time delay between consecutive purchases. Because our data has one-day granularity, we count multiple purchases by a user in one day as a single, more expensive, purchase. To account for the large heterogeneity in time delays between purchases by different users, we normalize the values for each user individually. Figure C.9(a) shows the 2nd to 9th delays, normalized by the first delay. On average, both the time between purchases and the spending per purchase increase (Figure C.9(b)). This means that even though people make fewer in-app purchases, they spend more money after a couple of transactions. Since a considerable fraction of the spending is on virtual game coins and bonuses, this suggests that users start by buying small packages of coins and bonuses and move on to the larger ones as they progress in the game.
Similarly, when focusing on the user's last 10 purchases within an app, we find that users' delays still get longer, but now at a much higher rate. The very last delay is six times longer than the first delay, on average (Figure C.10(a)). This long delay is a strong indicator of app abandonment. Finally, in Figure C.10(b) we show that as users get closer to their last purchase, they spend less and less money on their daily purchases.

[Figure C.10: Change in delay (a) and spending (b) over the last 10 days of purchases from an app.]

Switching to other apps. Next, we investigate what big spenders do after they abandon an app. More precisely, we want to find out what fraction of those users switch to another app in which they again become a frequent buyer. We conduct this analysis on the same data as above, i.e., user-app pairs with more than 50 purchases. We find that 8.6% of big spenders who stop making purchases from an app will start making purchases from another app and become a big spender in the new app (i.e., make more than 50 purchases). This number may seem small, but consider that big spenders who abandon an app are 2.1x more likely to become a big spender in another app than a random user from our entire data set (because only 4.1% of users make 50 or more purchases from at least one app). Consequently, from a marketing perspective, it makes much more sense to advertise new apps to the big spenders of existing apps. Furthermore, if we consider a more restrictive definition of big spender and only examine user-app pairs with more than 100 purchases, the difference becomes even larger: big spenders who have abandoned an app are 4.5x more likely to become a big spender in another app.

C.4 Purchase Model

In this section, we model the sequence of purchases people make in order to better understand purchasing behavior. Insights into future purchasing behavior could be used by both gaming companies and the app store to increase user engagement and provide better app recommendations. Following prior work on user consumption sequences [Benson et al., 2016, Anderson et al., 2014], we model in-app purchases in three main steps: 1) modeling the time between purchases, 2) predicting whether the next in-app purchase will come from an app that the user already purchased from in the past or from a completely new app, and 3) predicting the exact app that the user will purchase from, given the output of the previous step. The output of each step is used in the next step: the estimated time interval is one of the main indicators for predicting whether the user will purchase from a new app, and we can predict the next app to be consumed much more accurately if we know whether that app is new for the user.
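A minimal sketch of how these three stages might be chained is shown below; the three model objects and their methods are hypothetical placeholders for the components developed in the following subsections, not part of the original system.

```python
from dataclasses import dataclass

@dataclass
class NextPurchase:
    delay_days: float   # step 1: estimated inter-purchase time
    is_new_app: bool    # step 2: new app vs. previously used app
    app_id: str         # step 3: the predicted app

def predict_next_purchase(history, temporal_model, novelty_clf, app_ranker):
    """Chain the three stages; each stage conditions on the previous one."""
    # Step 1: expected time until the next purchase (e.g., from a fitted Pareto).
    delay = temporal_model.expected_delay(history)
    # Step 2: will the purchase come from a new app? The estimated delay is
    # itself one of the classifier's features, as described above.
    is_new = novelty_clf.predict(history, delay)
    # Step 3: rank candidate apps, restricted to new or previously used apps.
    app = app_ranker.top_app(history, new_only=is_new)
    return NextPurchase(delay, is_new, app)
```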
C.4.1 Temporal model

First, we investigate a set of parametrized distributions to see which one best describes the distribution of inter-purchase times. We considered the Weibull, Gamma, Log-normal, and Pareto distributions and find that the Pareto best fits the data. We used the Akaike Information Criterion (AIC) [Akaike, 1998] and P-P and Q-Q plots, such as the ones shown in Figure C.11 for the Pareto, to compare the distributions. The AIC values were fairly close for all distributions, as shown in Table C.4, but the plots showed that the Pareto distribution with shape = 3.21 and scale = 20.17 matches the data better than the other distributions. Figures C.11(a) and C.11(b) show that the modeled distribution fits the probability density function and cumulative density function very well. There are some deviations in the Q-Q plot (Figure C.11(c)), where the empirical and theoretical quantiles are matched, showing that the distribution fails to capture the very largest values in the data. However, the Pareto distribution is still the best fit for our data, considering both the AIC and the density plots.

[Figure C.11: Results of fitting the time between purchases to a Pareto distribution. (a) Theoretical density over the empirical PDF; (b) fitted CDF over the empirical CDF; (c) empirical vs. theoretical quantiles (Q-Q); (d) empirical vs. theoretical percentiles (P-P).]

Table C.4: AIC for different distributions. Lower AIC scores are preferred.

    Distribution   AIC
    Pareto         59.55M
    Log Normal     60.86M
    Weibull        61.87M
    Gamma          62.22M
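A rough sketch of this model-selection step is shown below, assuming SciPy's parametrizations of the candidate families; the delays array is synthetic stand-in data rather than the real inter-purchase times.

```python
import numpy as np
from scipy import stats

# Hypothetical inter-purchase delays (in days), standing in for the real data.
rng = np.random.default_rng(1)
delays = stats.pareto.rvs(b=3.21, scale=20.17, size=50_000, random_state=rng)

# Candidate families compared in the text, in SciPy's parametrizations.
candidates = {
    "Pareto": stats.pareto,
    "Log Normal": stats.lognorm,
    "Weibull": stats.weibull_min,
    "Gamma": stats.gamma,
}

for name, dist in candidates.items():
    params = dist.fit(delays, floc=0)        # hold the location parameter at 0
    loglik = dist.logpdf(delays, *params).sum()
    k = len(params) - 1                      # loc was fixed, so it is not counted
    aic = 2 * k - 2 * loglik                 # Akaike Information Criterion
    print(f"{name:10s} AIC = {aic:,.0f}")
```

In practice one would also inspect the P-P and Q-Q plots, as done above, since AIC values alone were close across the candidates.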
C.4.2 Novelty prediction

Next, we predict whether the user will purchase from a new app or from an app he or she has purchased from in the past. We approach the problem as supervised learning at the time the user makes the purchase and use the following features: age, gender, time since previous purchase, average time between purchases, average time between re-purchases, total number of purchases, day of the current purchase, percentage of purchases from new vs. past apps, whether the last three purchases were from a new app, and the number of apps from which the user has purchased in the past. We use the first year of our data set for training and the last three months for testing, so that we do not use any future information in our predictions. We tested a collection of different classification algorithms, including several types of decision tree algorithms and SVMs; the C5.0 algorithm in R achieved the best result [Quinlan, 2004]. Our classifier achieves a high accuracy, predicting the right class in 84.5% of cases, with a precision of 0.862, recall of 0.965, and F-score of 0.964. This accuracy is slightly higher than the result reported in [Benson et al., 2016] for a similar problem of music and video re-consumption.

To better understand the importance of each feature, we also fit a logistic regression model to our data set after removing correlated features. Figure C.12 shows the pairwise correlation coefficients between the features; we removed one feature from each pair with a correlation coefficient higher than 0.7. Table C.5 shows the result of the logistic regression: the percentage of re-purchases that the user has made is the most important feature, as it captures the tendency of the user to re-purchase from an app. The next three most important features capture the user's recent history of re-purchases and purchases from new apps. These are followed by gender, with a positive coefficient, showing that men are more likely to make a re-purchase. This is in tune with our earlier findings that men are more likely to be big spenders, and big spenders make many purchases from the same app.

[Figure C.12: Pairwise correlation coefficients among the features for predicting purchases from new apps (day, last_delay, avg_delay, tot_purch, age, gender, repurch_frac, last_class, last_class2, last_class3, repurch_delay, repurch_count).]

Table C.5: Results of logistic regression on the independent variables for abandonment prediction. *** p-value < 0.001.

    Variable                           Coeff.
    % of re-purchases                  6.236e+00***
    Previous class (re-purchase)       2.878e-01***
    2nd to last class (re-purchase)    1.624e-01***
    3rd to last class (re-purchase)    7.878e-02***
    Gender (m)                         7.375e-02***
    Mean inter-purchase time           4.764e-02***
    Time since last purchase           2.782e-02***
    Total number of re-purchases       2.232e-02***
    Day of the purchase                1.236e-03***
    Age                                1.069e-03***
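Since C5.0 is an R-specific implementation, the sketch below uses a gradient-boosted tree from scikit-learn as a rough stand-in; the feature table and labels are synthetic and only illustrate the train-on-early, test-on-late protocol described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical feature table: one row per purchase event, labeled 1 if that
# purchase came from a new app and 0 if it was a re-purchase.
rng = np.random.default_rng(2)
n = 10_000
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "gender_m": rng.integers(0, 2, n),
    "time_since_last": rng.exponential(5.0, n),
    "avg_delay": rng.exponential(4.0, n),
    "total_purchases": rng.integers(1, 200, n),
    "frac_new_apps": rng.random(n),
    "last3_new": rng.integers(0, 2, n),
    "n_past_apps": rng.integers(1, 30, n),
})
y = (rng.random(n) < X["frac_new_apps"]).astype(int)  # toy label for illustration

# Train on the earlier part of the data, test on the later part, mirroring
# the first-12-months / last-3-months split used in the text.
split = int(0.8 * n)
clf = GradientBoostingClassifier().fit(X[:split], y[:split])
pred = clf.predict(X[split:])
p, r, f, _ = precision_recall_fscore_support(y[split:], pred, average="binary")
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```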
C.4.3 App prediction

In the two previous steps, we modeled the time between purchases and whether the user's next purchase will come from a new app, with no previous purchases by that user, or from an existing app, with past purchases by that user. If the outcome of the model indicates that the purchase will come from a new app, then our task is to predict the most likely new app, given the previous apps that the user made purchases from. If, on the other hand, the outcome of the classifier indicates that the purchase is from an existing app, then we use the sequence of all previous purchases to select the most likely existing app.

New app prediction. Given the apps that the user purchased from in the past, our goal is to predict the most likely new app the user will purchase from next. Similar problems have been studied extensively in the area of recommender systems [Adomavicius and Tuzhilin, 2005, Bobadilla et al., 2013].

Motivated by the recent success of embedding models in a number of natural language processing tasks [Mikolov et al., 2013], we propose to use a language model to learn app vectors in a low-dimensional space, trained on sequences of user in-app purchases, such that apps that appear in similar contexts reside nearby in the embedding space. Following the embedding step, we use a k-nearest-neighbor approach in the learned vector space to predict the most likely new app, given the existing apps consumed by the user.

More formally, let us assume we are given a set of apps A = {a_j | j = 1, ..., M}, each identified by a unique identifier a_j. In addition, the in-app purchase times for N users over a time period T from our data set are also known. For the n-th user we collect data in the form d_n = {(a_j, t_i), i = 1, ..., K_i, t_1 < t_2 < ... < t_{K_i}}, where d_n denotes the user's in-app purchase sequence, K_i is the total number of in-app purchases the user made, and t_i is the time of the i-th in-app purchase from app a_j.

Given a set D of N user in-app purchase sequences, where a sequence d_n in D is defined as in-app purchases from K apps, the objective is to maximize the log-likelihood of the training data D,

    L = \frac{1}{N} \sum_{d \in D} \sum_{a_j \in d} \sum_{-b \le i \le b,\; i \ne 0} \log p(a_{j+i} \mid a_j),    (C.1)

where b is the context width for in-app purchase sequences, and the probability p(a_{j+i} | a_j) of observing a neighboring in-app purchase given the current in-app purchase is defined using a softmax function [Mikolov et al., 2013] expressed in terms of the app vectors.

Once we learn a vector representation for each app, we can use cosine similarities between the vectors to calculate similarities between apps. In Table C.6 we show the 5 nearest-neighbor apps for three randomly picked apps, along with the cosine similarities between the corresponding app vectors. As the table demonstrates, the proposed approach accurately captures similarities between apps.

Table C.6: Top 5 closest apps by cosine similarity for 3 apps.

    Kim Kardashian West Official
    Top 5 closest apps               Cosine similarity
    Khloe Kardashian Official        0.907
    Kourtney Kardashian Official     0.866
    Kylie Jenner Official            0.863
    Kendall Jenner Official          0.805
    kimoji                           0.733

    Homework
    Top 5 closest apps               Cosine similarity
    Smart Studies                    0.502
    iStudy Pro                       0.500
    Barrons Hot Words                0.491
    Physics 101                      0.480
    PSAT Preliminary SAT Test Prep   0.479

    Checkbook Pro
    Top 5 closest apps               Cosine similarity
    Accounts 2 Checkbook             0.678
    Checkbook Spending               0.657
    Checkbook HD Personal Finance    0.641
    My Check Register                0.617
    My Checkbook                     0.608

We used the first year of our data set to train the app embeddings and leveraged them to predict the new apps a user will purchase from in the remaining three months. The prediction was done using cosine similarity between the apps that the user has already purchased from and all the remaining apps. All predictions were made per user. Specifically, if a user made purchases from k apps in the first year, we used these k apps to predict the k/4 new apps the user will likely purchase from in the next three months (k/4 was chosen because the test period is one fourth of the training period). The k/4 apps were predicted by considering the most similar app to each of the k previous apps and picking one fourth of them at random.

Our predictions were correct in 4.7% of cases. This may seem very low, but considering that there are more than 216K apps the user can choose from, the approach works considerably well in the context of recommender systems. To further quantify the accuracy of the proposed embedding approach, we compared our recommendation strategy to several baseline models: 1) a non-negative matrix factorization (NMF) approach trained on the matrix of in-app purchases formed from D; 2) LDA [Blei et al., 2003] applied to the app description text; 3) ranking the apps by popularity and always predicting the top k/4 apps the user has not yet purchased from.

Table C.7: New app prediction accuracy.

    Method           Accuracy
    App embeddings   4.7%
    NMF              4.1%
    Top apps         2.2%
    LDA              1.7%

The results, presented in Table C.7, show that the app embedding approach outperforms the considered baselines. The poor performance of LDA agrees with previous research [Hong and Davison, 2010] that found that this method performs poorly when trained on short text documents. The better performance of the app embeddings over the NMF approach can be explained by the fact that the NMF model loses the notion of time and sequence order once it transforms the data set D into a matrix.
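A minimal sketch of this two-stage approach might look as follows, using gensim's word2vec implementation as a stand-in for the embedding model, toy purchase sequences in place of the real data, and the k/4 selection rule described above.

```python
import random
from gensim.models import Word2Vec

# Hypothetical purchase sequences: one "sentence" of app ids per user,
# ordered by purchase time (first year of data only). Repeated so the toy
# corpus is large enough to train on.
sequences = [
    ["game_a", "game_a", "game_b", "finance_x"],
    ["game_a", "game_b", "game_c"],
    ["finance_x", "finance_y", "game_c"],
] * 1000

# Skip-gram model: apps purchased in similar contexts get nearby vectors.
model = Word2Vec(sequences, vector_size=64, window=5, min_count=5, sg=1, epochs=5)

def predict_new_apps(user_apps, k_frac=4):
    """For each app the user bought from, take its nearest not-yet-purchased
    neighbor by cosine similarity, then keep a 1/k_frac random sample."""
    neighbors = []
    for app in user_apps:
        for cand, _sim in model.wv.most_similar(app, topn=5):
            if cand not in user_apps:
                neighbors.append(cand)
                break
    neighbors = list(dict.fromkeys(neighbors))  # drop duplicate candidates
    random.shuffle(neighbors)
    return neighbors[: max(1, len(user_apps) // k_frac)]

print(predict_new_apps(["game_a", "game_b", "finance_x", "game_c"]))
```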
Existing app in-app re-purchase. In the case that the novelty model from Section C.4.2 predicts that a re-purchase is most likely to happen, we use the frequency and recency of the user's previous app consumption to predict the app from which the re-purchase will occur. This may appear to be an easy prediction problem, as one might think that users almost always purchase from the last app they purchased from, or from the app from which they made the majority of their purchases. However, in the case of in-app re-purchases, only 46.5% of the time does the re-purchase come from the latest app the user purchased from, and only 45.3% come from the app from which the user made most of their purchases. This justifies the need for a more involved re-purchase model.

We follow an approach similar to that of [Benson et al., 2016, Anderson et al., 2014] and use both the recency and the popularity of previous apps to predict from which app the user's next in-app purchase will come. We use a weight function that maps the frequency of usage, and a time function that maps the time since previous usage, to learned values. This repeat consumption model is used for the i-th consumption:

    p(x_i = e) = \frac{\sum_{j<i} I(x_j = e)\, s(x_j)\, T(t_i - t_j)}{\sum_{j<i} s(x_j)\, T(t_i - t_j)}    (C.2)

In this equation, the function s represents the frequency of purchases from the app, and the function T represents the time between purchases. These functions are optimized jointly by computing the negative log-likelihood of the observed sequences under the model. The negative log-likelihood is not convex in s and T jointly, but it is convex in each function when the other one is fixed. Thus, we use standard gradient descent to maximize the likelihood with respect to s and T separately. After learning the weight functions, we are able to predict the correct app from which the user is going to make a purchase with 54.8% accuracy, which is considerably higher than the baselines mentioned above, i.e., the 46.5% and 45.3% accuracy obtained by always predicting the latest or the most consumed app, respectively.
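The following sketch scores candidate apps in the form of Equation (C.2); instead of the jointly learned s and T, it substitutes fixed toy choices (log-frequency and exponential decay), so it illustrates the structure of the model rather than the fitted version, and the history values are hypothetical.

```python
import numpy as np
from collections import defaultdict

def repurchase_scores(history, t_now, decay=0.05):
    """Score candidate apps for the next re-purchase. Following Eq. (C.2),
    each past purchase of app e contributes s(x_j) * T(t_now - t_j); here
    s is a log-frequency weight and T an exponential decay, standing in for
    the functions learned by gradient descent in the text."""
    counts = defaultdict(int)
    for app, _t in history:
        counts[app] += 1
    scores = defaultdict(float)
    for app, t in history:
        s = np.log1p(counts[app])           # frequency weight s(x_j)
        T = np.exp(-decay * (t_now - t))    # recency weight T(t_i - t_j)
        scores[app] += s * T
    total = sum(scores.values())
    return {app: v / total for app, v in scores.items()}

# Hypothetical purchase history: (app, day) pairs for one user.
history = [("game_a", 0), ("game_a", 3), ("game_b", 10),
           ("game_a", 12), ("game_b", 30)]
probs = repurchase_scores(history, t_now=33)
print(max(probs, key=probs.get), probs)
```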
In this 2 http://fortune.com/2014/06/27/apples-users-spend-4x-as-much-as-googles/ 287 work, we show that most of the money is spent on in-app purchases, and we present a demographic and prediction analysis of spending. Usage and purchases from apps have been the subject of a few studies. Sifa et al. studied the purchase decisions in free-to-play mobile games [Sifa et al., 2015]. They built a classier that predicts whether a user is going to make any purchase in the future and also build a regression model to estimate the amount of money that will be spent by each user. The models are moderately accurate. Schoger studied the monetization of popular apps in the global market, identifying growing markets and that in-app purchases are increasingly accounting for larger fraction of total purchases [Schoger, 2014]. Our study, unlike those studies, includes the full history of iPhone purchases by the users and considers that many users make purchases from multiple apps. Moreover, the large scale of the data set allows us have enough big spenders to analyze their behavior accurately. We also study changes in user purchases over time, how users becomes fre- quent buyers in a particular app, and how their purchases evolve over time. The abandonment of a service is called consumer attrition or churn. The importance of consumer attrition analysis is driven by the fact that retaining an existing con- sumer is much less expensive than acquiring a new consumer [Rechinhheld and Sasser, 1990]. Thus, prediction of consumer churn is of great interest for compa- nies, and has been studied extensively. For example., Ritcher et al. exploit the information from users' social networks to predict consumer churn in mobile net- works [Richter et al., 2010]. Braun and Schweidel focus on the causes of churn rather than when churn will occur [Braun and Schweidel, 2011]. They nd that a considerable fraction of churn in the service they studied happens due to reasons outside of the companies' control, e.g., the consumer moving to another state. In context of mobile games, Runge et al. study user churn for two mobile games and 288 predict it using various machine learning algorithms [Runge et al., 2014]. They also implement an A/B test and oer players bonuses before the predicted churn. They nd that the bonuses do not result in longer usage or spending by the users. Kloumann et al. study the usage of apps by users who use a Facebook login and model the lifetime of apps using the popularity and sociality of apps, showing that both of these aect the lifetime of the app [Kloumann et al., 2015]. Baeza-Yates et al. addressed the problem of predicting the next app the user is going to open through a supervised approach [Baeza-Yates et al., 2015]. In our work, we model the whole sequence of purchases that users make, including adoption, churn, and prediction of the next app. Our work is the rst work that studies the details of all iPhone purchases made by a large number of users. This allows us to better understand the inter- play between usage of multiple apps that are competing for the same users, their attention and purchasing power. C.6 Conclusion Mobile devices have grown wildly in popularity and people are spending more money purchasing digital products on their devices. To better understand this digital marketplace, we studied a large data set of more than 776M purchases made on iPhones, including songs, apps, and in-app purchases. 
We find that, surprisingly, 61% of all the money spent is on in-app purchases, and a small group of users is responsible for most of this spending: the top 1% of users account for 59% of all spending on in-app purchases. We characterize these users, showing that they are more likely to be men, older, and less likely to be from the US. Then, we focus on how these big spenders start and stop making purchases from apps, finding that as users gradually lose interest, the delay between purchases increases. The amount of money spent per day on purchases initially increases, then decreases, with a sharp drop before abandonment. Nevertheless, from the perspective of app developers, these big spenders are a valuable user segment, as they are 4.5x more likely to become a big spender in a new app than a random app user.

In the last part of our study, we model the purchasing behavior of users by breaking it down into three steps. First, we model the time between purchases by testing a variety of distributions, and we find that the Pareto distribution fits the data most accurately. Second, we take a supervised learning approach to predict whether a user is going to make a purchase from a new app. Finally, if the purchase is from a new app, we use a novel approach to predict the new app based on the previous in-app purchases; if the purchase is from an app that the user purchased from in the past, we combine the frequency of earlier purchases and the time between purchases to predict from which app the re-purchase will come. The findings and the models proposed in our study can be leveraged by app developers, app stores, and ad networks to better target apps to the corresponding users.
Abstract
People are increasingly spending more time online. Understanding how this time is spent and what patterns exist in online behavior is essential for improving systems and user experience. One of the main characteristics of online activity is its diurnal, weekly, and monthly patterns, reflecting human circadian rhythms, sleep cycles, and work and leisure schedules. These patterns range from mood changes reflected on Twitter at different times of the day and days of the week to reading stories on news aggregator websites. Using large-scale data from multiple online social networks, we uncover temporal patterns that take place at far shorter time scales. Specifically, we demonstrate short-term, within-session behavioral changes, where a session is defined as a period of time during which a person engages continuously with the online social network without a long break. On Twitter, we show that people prefer easier tasks, such as retweeting, over more complicated tasks, such as posting an original tweet, later in a session. Also, tweets posted later in a session are shorter and are more likely to contain a spelling mistake. We focus on information consumption on Facebook and show that people spend less time reading a story the longer they have been in a session. More interestingly, the rate of this change depends on the type of content: people are more likely to spend time on photos and videos later in a session, compared to textual posts. We also examined the quality of the content generated on Reddit and found that comments posted later in a session get lower scores from other users, receive fewer replies, and have lower readability. All these findings are evidence of short-term behavioral changes in the type of activity that users perform. Moreover, we identify the factors that affect these short-term behavioral changes.