TOWARD BETTER UNDERSTANDING AND IMPROVING USER-DEVELOPER COMMUNICATIONS ON MOBILE APP STORES

by

Kamonphop Srisopha

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2022

Copyright 2022 Kamonphop Srisopha

Dedication

This dissertation is dedicated to my parents, Pipop and Prapai, and my grandmother, Phasuk.

Acknowledgments

This dissertation marks a significant period in my academic path. This journey would not have been possible without the support and encouragement of the people around me. I want to take this space to thank everyone who has guided and supported me along the way.

First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Barry Boehm, who entrusted me with the opportunity and freedom to explore and pursue my research interests. He has been an incredible advisor and has always been there when I needed help. I will forever be grateful to Julie Sanchez and Daniel Link for their help throughout the years in ways that would take pages to list. To all my other amazing colleagues at the USC Center for Systems and Software Engineering (CSSE) – Elaine Venson, Reem Alfayez, Pooyan Behnamghader, Bo Wang, and Kan Qi – I am truly honored and delighted to have had the privilege to get to know and work with you all; thank you.

I would also like to thank Dr. Supannika Koolmanojwong for taking me under her wing and allowing me to give occasional lectures for the CS577 Software Engineering I and II courses. Similarly, I would like to give a special mention to Dr. Andrew Goodney, who trusted me to create several new challenging assignments (programming assignments and labs) and manage existing assignments for CS103L, Introduction to Programming. Being a teaching assistant for these courses has taught me immensely and given me invaluable and enjoyable experience as a Ph.D. student.

Last but not least, I would like to thank my dissertation committee members, Dr. Aiichiro Nakano and Dr. Emilio Ferrara, for their words of encouragement, constructive feedback, availability, and support. I would also like to extend my gratitude to two other professors who were part of my qualifying and proposal exam committees: Dr. Xiang Ren and Dr. Berok Khoshnevis.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Contributions and Organization
  1.2 Peer-Reviewed Publications
Chapter 2: Background
  2.1 Context
    2.1.1 Apps and App Stores
    2.1.2 User Review and Developer Response
  2.2 Foundations of Analyzing User Reviews and Developer Responses
    2.2.1 Feature Extraction
    2.2.2 Statistical Analysis
    2.2.3 Machine Learning for Text Classification
    2.2.4 Explainable Machine Learning
Chapter 3: Features that Predict Developer Responses
  3.1 Introduction
  3.2 Background
    3.2.1 Definitions and Examples
    3.2.2 Review and Response Mechanism
  3.3 Research Methodology
    3.3.1 Task and Research Questions
    3.3.2 Dataset: Case-Study Apps
    3.3.3 Feature Extraction
    3.3.4 Model Construction and Evaluation
    3.3.5 Model Interpretation
  3.4 Experiment Results
    3.4.1 RQ1: How effective is the random forest algorithm at predicting developer responses to user reviews?
    3.4.2 RQ2: Which individual features and groups of features are most important in predicting developer responses?
    3.4.3 RQ3: How do the important features affect prediction?
  3.5 Implications
    3.5.1 User-Side
    3.5.2 Developer-Side
  3.6 Threats to Validity
    3.6.1 Completeness of Feature Set
    3.6.2 Tool and Method Reliability
    3.6.3 Developer Responses
    3.6.4 External Validity
  3.7 Related Work
    3.7.1 User Review
    3.7.2 Developer Response
  3.8 Chapter Summary
Chapter 4: How Developers Should Respond to Reviews for Success
  4.1 Introduction
  4.2 Background
    4.2.1 Terms
  4.3 Research Setup
    4.3.1 Research Questions
    4.3.2 Data Collection
    4.3.3 Feature Extraction
    4.3.4 Model Construction and Evaluation
    4.3.5 Model Interpretation
  4.4 Results
    4.4.1 RQ1: How often do users change their original ratings with and without a developer response?
    4.4.2 RQ2: Which review intention has the highest chance of success?
    4.4.3 RQ3: Can we effectively predict which developer responses will be successful?
    4.4.4 RQ4: Which features play roles in predicting the success of developer responses, and how?
  4.5 Implications
    4.5.1 Evidence-Based Recommendations
    4.5.2 Response Writing Assistant
  4.6 Limitations and Threats to Validity
  4.7 Related Work
    4.7.1 Analysis of App User Reviews
    4.7.2 Analysis of Developer Responses
  4.8 Chapter Summary
Chapter 5: User Reviews of the Same Apps Between the US and Other Countries
  5.1 Introduction
  5.2 Study Setup
    5.2.1 Research Questions
    5.2.2 Data Collection
    5.2.3 Manual Annotation Process
  5.3 Empirical Study Results
    5.3.1 RQ1: Do the contents of reviews from the US differ from the other countries regarding the software quality and improvement factors?
    5.3.2 RQ2: What factors are discriminant in classifying the reviews of the US and the other countries?
    5.3.3 RQ3: How effective is the NLP approach in identifying the specific needs and priorities of users from a pair of countries?
  5.4 Discussions
    5.4.1 Challenges
  5.5 Threats to Validity and Limitations
    5.5.1 Different App Versions
    5.5.2 Users and Their Respective Countries
    5.5.3 Software Quality and Improvement Factors
    5.5.4 Subjectivity in Manual Classification
    5.5.5 English Reviews
    5.5.6 App Sampling Problem
    5.5.7 The NLP Approach
    5.5.8 External Validity
  5.6 Related Work
  5.7 Chapter Summary
Chapter 6: User Reviews and the Apps' Internal Quality Attributes
  6.1 Introduction
  6.2 Research Design
    6.2.1 Research Questions
    6.2.2 Data Collection
    6.2.3 Manual Review Annotation
    6.2.4 Static Analysis
  6.3 Procedures and Results
    6.3.1 RQ: To what extent do different review sentence types correlate with the apps' quality attributes?
  6.4 Threats to Validity
  6.5 Chapter Summary
Chapter 7: Conclusions and Future Directions
  7.1 Conclusions
  7.2 Future Directions
References

List of Tables

3.1 The study apps
3.2 Review features that potentially affect developer responses along two dimensions: Non-text and Text
3.3 Example results from SentiStrength
3.4 Example results from ARdoc
3.5 The chosen hyperparameters for each app using RandomizedSearchCV
3.6 Top 10 most important features in predicting developer responses for each app (IV = Importance Value)
3.7 The relative importance of each group of features in response prediction (IV = Importance Value)
4.1 The changes in ratings with and without a response
4.2 The importance of the three categories of features
4.3 The importance of features in the Presentation category
4.4 The importance of features in the Tone category
4.5 The importance of features in the Time category
5.1 The overview of the dataset
5.2 The 13 software quality and improvement factors
5.3 The list of all factors for each country that are proportionally inconsistent with the US (proportion z-test*)
5.4 The lists of discriminant factors
5.5 Examples of feature request reviews
5.6 The classification performance using 5-fold cross-validation
5.7 The percentage of seeded reviews extracted
6.1 The characteristics of the study apps
6.2 Examples of sentences and their categories along the three dimensions
6.3 The eleven review sentence types
6.4 The apps' internal quality attributes
6.5 The correlation coefficients (r) and the p-values

List of Figures

3.1 Examples of user reviews with a developer response
3.2 An overview of the approach
3.3 The ROC curves of the random forest algorithm that uses the selected features for all case study apps and the baselines
3.4 The effects of the Length features
3.5 The effects of the Style features
3.6 A mockup of a developer response prediction system integrated into the App Store
4.1 An example of a successful developer response
4.2 The probability of success for each review intention
4.3 The ROC curves of the model and the baseline
4.4 PDPs of the top two features in the Presentation category
4.5 PDPs of the top two features in the Tone category
4.6 PDPs of the top features in the Time category
4.7 A UI prototype of a response writing assistant tool
5.1 The review annotation tool
5.2 The proportion of all factors for each country
5.3 The overview of the approach used in our experiment
5.4 Preliminary experimental results
6.1 Results from applying the three static analysis tools on Twidere releases

Abstract

Recent studies and surveys have shown that mobile app user reviews are valuable for software engineering activities, such as maintenance or requirement elicitation. Potential new users base their download decisions on reviews and ratings left by other users. Hence, achieving positive star ratings and user reviews is of great importance for app success and app developers. Mobile app stores, such as the Apple App Store and the Google Play Store, have introduced the ability for developers to respond to reviews, thus creating a two-way communication channel between users and developers directly on the app stores. However, establishing effective communication between the two parties can be challenging. In addition, the rapid growth of smartphone users worldwide has called for developers to make their apps accessible to global audiences and markets. Analyzing reviews of users from other countries could help developers better understand their users and give developers more perspectives, which may lead to changes in requirements or new requirements.

Motivated by these observations, for this dissertation, we conduct four empirical studies utilizing a suite of machine learning, data mining, and analysis techniques, and a number of datasets of mobile app user reviews and developer responses from app stores. Specifically, first, we study a wide range of features within user reviews and their relation to developer responses. Second, we study how developers should respond to reviews for success. Third, we study how users from different countries perceive the software quality of the same apps compared to US users. Lastly, we compare the users' perception of the software quality of apps to the apps' internal quality characteristics.

The primary objective is to derive understandings and uncover insights for a broad range of practitioners to be used for improving communications between users and developers on mobile app stores.

Chapter 1: Introduction
Mobile application stores (or app stores), such as the Google Play Store and the Apple App Store, have become a critical medium for app developers of all sizes to distribute mobile apps to users around the world and for users to browse and download apps for their mobile devices. Currently, app stores allow developers to distribute their apps to over 150 countries and regions around the world [6, 43]. The mobile app market is also increasingly competitive. As of the first quarter of 2021, the Google Play Store and the Apple App Store hosted a combined number of over 5.5 million apps [20]. This number is steadily increasing every year. In terms of economic profit, global users spent over 133 billion US dollars on mobile apps in 2021, and this number is estimated to reach over 270 billion US dollars in 2025 [100].

App stores are also a place for users to review and rate their downloaded apps. Users can share their overall experience or assessment of the app by writing a user review and giving a star rating on a scale of one to five, where one star indicates extreme dissatisfaction and five stars extreme satisfaction. While users can use this review mechanism to inform other users about their experiences with the app, they can also use it to communicate with app developers. For example, they can report unexpected app behaviors, specify areas for app improvement, or ask for specific information about the app, so that developers can improve the app over time. Recent studies and surveys have found that potential new app users base their download decisions on reviews and ratings left by other app users [50, 70, 99].

There are several research works in the past decade that study the importance of user reviews in the software development lifecycle [27]. These studies have found that user reviews contain a wealth of helpful information for developers and that developers leverage the information in reviews for various software engineering tasks, such as maintenance and requirement elicitation [63, 85]. Some even found that developers prioritize issues reported in user reviews on app stores over reports from other channels and that developers often deviate from their usual apps' release schedule to accommodate user reviews [3, 79]. Therefore, in such a competitive market, it is clear that achieving high star ratings and good user reviews is crucial for developers, app success, and app survivability.

In addition to being a vital gateway for developers to distribute their apps, app stores also serve as a channel for developers to communicate with their users. Developers can post a public response to a user review directly on the app stores. Reviews with negative ratings present great opportunities for developers: responding to them can potentially increase ratings and alleviate dissatisfaction. Recent studies have shown that responding to reviews positively affects app ratings [51, 76, 103]. Users are three times as likely to increase their initial rating with a developer response than without one [103]. However, studies have also shown that some developers do not respond to any reviews of their apps. When developers do respond, not all reviews receive a response, and the response strategies can vary considerably [105].

Despite much attention from researchers in this area, many knowledge gaps and limitations exist, which can be summarized as follows.

• The majority of research efforts have been expended towards understanding how users leave ratings and write reviews for an app and creating different approaches and techniques that automatically classify and summarize reviews for developers [27]. However, so far, most approaches and techniques use features based on intuition or assumptions about what developers look for in user reviews. This emphasizes the need to investigate in depth the user reviews that receive a developer response and compare them to those that do not. The benefit of understanding the key characteristics within user reviews that influence developers' response decisions is at least twofold. First, researchers can build review or response prioritization tools that are not based on assumptions or intuitions about developers' behavior. Second, users will know how to shape and place a review to increase the likelihood of receiving a response from developers.
• Existing work on app developer response analysis has reached the same conclusion that responding to reviews can positively impact app ratings, though not all responses will increase ratings [51, 76]. However, these studies paid less attention to examining how the composition of a developer's response could affect its outcome, that is, how features within a developer response that developers can control at composition time influence whether users will increase their ratings after receiving it. Uncovering this insight is critical and valuable for generating recommendations that developers can follow to optimize the benefits of responding to reviews, and for research works that focus on creating tools to assist developers in writing a response.

• The majority of existing work on analyzing app user reviews has mainly focused on examining user reviews either from the Google Play Store, which separates reviews by language and not by the country of origin of the users, or from the US Apple App Store. However, little effort has been devoted to understanding how users from the US and other countries perceive the quality of the same apps and to investigating techniques that can automatically identify differences in the opinions of users from different countries. Analyzing reviews of users from other countries could provide unique insights for app developers, which may lead to changes in requirements or new requirements, thus increasing the competitiveness and the likelihood of app survival in global markets.

• Static analysis tools have been the vehicle for measuring and assessing software product quality for decades. They provide numerous quality metrics (or attributes) that can give an insight into the quality of the code, such as code complexity and code smells. Existing work on app user review analysis has shown that user reviews also provide a wealth of insight into the quality of an app and how the app should evolve and be maintained [85]. However, research has devoted less attention to investigating whether the users' perception of the quality of apps correlates with the apps' internal quality attributes.

The primary purpose of this dissertation is to address these gaps and limitations. To achieve this, we conduct the following four empirical investigations. First, we study the characteristics of user reviews that receive a developer response and how users should write reviews that will incite a developer response. Second, we examine the characteristics of a successful developer response and how developers can respond to user reviews to optimize their chances of success. Third, we explore whether the users' perception of the software quality of the same apps differs between the US and other countries. Fourth, we examine how user reviews correlate with the apps' internal quality attributes.

Our studies utilize a suite of machine learning and data mining techniques and several large-scale datasets of user reviews and developer responses gathered from the Google Play Store and the Apple App Store. The primary objective is to derive understandings and uncover insights for a broad range of practitioners to be used for improving communications between users and developers on mobile app stores.
1.1 Contributions and Organization

This section summarizes our contributions to the area of software engineering for each chapter in this dissertation. Additionally, we outline the chapters and indicate the relation between parts of this dissertation and our peer-reviewed publications listed in Section 1.2.

• Chapter 2: Background. In this chapter, we provide background information about mobile applications, app stores, and the review and response mechanisms that exist in them. Additionally, we discuss the foundations of analyzing user reviews and developer responses.

• Chapter 3: Features that Predict Developer Responses. In this chapter, we investigate a wide range of features that can be extracted from user reviews of iOS apps and that are likely to be associated with developers' response behavior. We study many novel features such as writing style (e.g., readability, the ratio of misspelled words), vote, timing (e.g., posting time, posting day), sentiment (e.g., the proportion of positive sentiment words), and intention (i.e., software engineering-oriented features: bug report and feature request). We extract seven different groups of features and use the random forest algorithm to build models to predict developer responses. By applying model interpretation techniques, we identify the relative importance of each feature and group of features and uncover the relationships between the key features and developer responses. The analysis is conducted using over 112,000 user reviews and over 20,000 developer responses from eight iOS applications, sampled from a broad range of application domains.

Our analysis demonstrates a potential application of feature engineering and machine learning to aid in the automatic prioritization of developer responses or user reviews. The insights discovered in this chapter may serve as evidence-based guidelines for users on how to provide the kind of feedback that will have a high likelihood of receiving a developer response.

This chapter is based on our works in [105] and [108].

• Chapter 4: How Developers Should Respond to Reviews for Success. In this chapter, we track changes in user reviews and developer responses of the 1,600 top free apps over ten weeks to determine what makes developer responses successful. We consider a response a success if, after receiving it, a user increases his or her initial rating. The empirical analysis is conducted using a large-scale dataset involving over 6 million user reviews and 320,000 developer responses.

The work in this chapter has two objectives. The first objective is to investigate the potential to predict early whether a developer's response is likely to be successful. The second objective is to pinpoint which features that developers can control at the time of writing a response influence the success of the response, and how.

We extract three groups of features, namely Time, Presentation, and Tone. We utilize the extreme gradient boosting algorithm (XGBoost) to model the success of developer responses from these features. A ranking of features by their importance and a set of important features are then extracted from the model. Partial dependence plots are generated for the important features to uncover how the probability of success changes with these features.

Through our analysis, we derive a set of evidence-based recommendations that developers can follow to write effective responses and increase the chance of success of the responses.
For those investigating an automatic response generation approach, our insights could help them generate developer responses with an increased chance of success.

This chapter is based on our work in [103].

• Chapter 5: User Reviews of the Same Apps Between the US and Other Countries. In this chapter, we investigate whether the users' perception of the same apps differs between the US and other countries. The analysis is conducted using a large-scale dataset comprising 300,643 user reviews of the 15 most popular iOS apps of 2018 from nine English-speaking and culturally diverse countries around the world.

The chapter involves two phases. In the first phase, we manually classify 3,358 user reviews into 13 software quality and improvement factors to perform in-depth content analysis. Additionally, through explainable machine learning techniques, we determine factors that can discriminate between the reviews of the US and other countries.

Building on top of the insights uncovered in the first phase, in the second phase, we present a preliminary approach aimed at detecting differences in the contents of user reviews between a pair of countries. We then evaluate the performance of the approach through three case study apps.

Lastly, we discuss several key challenges associated with analyzing and utilizing app user reviews for cross-country or cross-cultural studies.

This chapter is based on our works in [106] and [107].

• Chapter 6: User Reviews and the Apps' Internal Quality Attributes. In this chapter, we explore whether there exists a relationship between the users' perception of the quality of apps and the apps' internal quality attributes.

We retrieve several apps' internal quality attributes from 46 releases of three Android open-source software (OSS) apps utilizing three widely adopted static analysis tools: PMD, FindBugs, and SonarQube. We gather the complete reviews from each release, totaling 1,004 reviews, and manually analyze each sentence of these reviews along three dimensions: intention, software quality, and sentiment. We then use correlation to find relationships between user reviews and the apps' internal quality attributes.

This chapter is based on our work in [102].

• Chapter 7: Conclusions and Future Directions. In this chapter, we conclude the dissertation and discuss opportunities for future research.

1.2 Peer-Reviewed Publications

Parts of the work in this dissertation have been previously peer-reviewed and accepted for publication in the following conferences in the area of software engineering:

• Chapter 3

(i) Srisopha, K., Link, D., Swami, D., Boehm, B. (2020). "Learning Features that Predict Developer Responses for iOS App Store Reviews." In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). [105]

(ii) Srisopha, K., Swami, D., Link, D., Boehm, B. (2020). "How Features in iOS App Store Reviews can Predict Developer Responses." In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE). [108]

• Chapter 4

(i) Srisopha, K., Link, D., Boehm, B. (2021). "How Should Developers Respond to App Reviews? Features Predicting the Success of Developer Responses." In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering (EASE). [103]

• Chapter 5
(i) Srisopha, K., Phonsom, C., Lin, K., and Boehm, B. (2019). "Same App, Different Countries: A Preliminary User Reviews Study on Most Downloaded iOS Apps." In Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME). [107]

(ii) Srisopha, K., Phonsom, C., Li, M., Link, D., Boehm, B. (2020). "On Building an Automatic Identification of Country-Specific Feature Requests in Mobile App Reviews: Possibilities and Challenges." In Proceedings of the 42nd IEEE/ACM International Conference on Software Engineering Workshops (ICSE-W). [106]
In contrast, on the Google Play Store, user reviews are separated by language, not by country. This means that user reviews written in English are combined into one shared pool of English user reviews, regardless of the country origin of the reviews. Nonetheless, both app stores require that users be in the chosen country or have a payment method and billing address from the chosen country in order to download and install apps from the app stores of that country. 2.1.2 UserReviewandDeveloperResponse There are a number of research works that study the importance of user reviews in the software develop- ment lifecycle and how the information in user reviews can be extracted and used by various stakeholders [27]. For example, Pagano et al. [85] found that one-third of reviews contain valuable information for software evolution and maintenance (such as bug reports and feature requests). Khalid et al. [63] found 12 types of user complaints in iOS app user reviews. They also studied the impact each complaint type has on 11 rating. Palomba et al. [86] asked developers how much they incorporated user reviews into their workflow and whether addressing user reviews in subsequent releases benefitted users. The authors reported that 75% of interviewed developers stated that they have used and incorporated reviews for software main- tenance and evolution and that they can observe an increase in user satisfaction and app ratings when doing so. Nayebi et al. [79] found that the most important driver for app developers to deviate from their usual apps’ release schedule is accommodating user feedback. Similarly, Al-Subaihin et al. [3] interviewed app developers and found that they prioritize issues reported in user reviews on app stores over reports from other channels. Additionally, they found that developers frequently make use of user reviews for app feature enhancements. Furthermore, a user survey study by Lim et al. [70] found that reviews and ratings are among the most important factors people consider when choosing apps to download. They are also a crucial factor influencing how an app ranks in search results [9]. More specifically, the more positive ratings and reviews an app have, the higher it will rank in search results and the more visible it is to potential users. Research also showed that ratings and reviews highly correlate with app download counts [50] and that developers use app store metrics (such as the number of downloads or installs, or ratings) to evaluate the success of their apps [3]. Recently, app stores have begun to allow developers to post a public response to a user review. If developers respond to a review, an email, including an option to edit the review, is sent out to the user who wrote the review. The ability for users to write reviews and for developers to respond to reviews cre- ates a two-way communication channel between users and developers. Developers can use this response mechanism to communicate with the user, thank the user, ask for more details regarding the problem the user is having, provide troubleshooting steps, or inform the user that they are taking the suggestion into account [51]. 
While developers can also respond to reviews indirectly, such as by updating the app based on cues taken from the reviews with or without communicating those changes to the users (e.g., through 12 mentioning in the app release notes) [10, 81], in the context of this dissertation, we refer to a developer response as a written response a developer posts in an app store in reply to a user review. Research shows that responding to reviews can improve overall ratings and user satisfaction [51, 76, 103]. Research has found that users are three times as likely to increase their ratings with a developer than without[103]. Developers can solve 34% of users’ issues by responding to reviews alone, without needing to update or deploy a new version [51]. Hence, user support and responding to reviews should be considered important activities for app developers to perform as part of the software lifecycle. Nonetheless, both of these activities have not received much scholarly attention, according to the recent systematic literature survey of Dąbrowski et al. [27]. 2.2 FoundationsofAnalyzingUserReviewsandDeveloperResponses This section provides the foundations of techniques we use to analyze user reviews and developer re- sponses in this dissertation. The techniques include feature extraction, statistical analysis, machine learn- ing for text classification, and explainable machine learning. 2.2.1 FeatureExtraction In this dissertation, feature extraction refers to the process of transforming user reviews or developer responses into a feature set that can be analyzed or utilized by machine learning algorithms to uncover insights, identify patterns, generate recommendations, or predict outcomes. Two main methods to extract features from user reviews and developer responses are used in literature: manual and automated. In the following, we will discuss each extraction method in more detail. 13 UsingManualFeatureExtraction Manual feature extraction involves manually annotating reviews to one or more groups of features that are meaningful to the study. These groups of features are typically not trivial to extract and cannot yet be extracted trivially and accurately by automated means. For example, the features might be the differ- ent types of user complaints, such as app crashing, hidden cost, network problems, privacy issues, and unresponsive app [63]. They could be the different types of software quality issues, such as usability, re- liability, compatibility, and security [58, 102]. They could be the different sentiments (positive, negative, or neutral) [47] or emotions (love, joy, sadness, anger, fear, and surprise) that users expressed [18]. They could also be the different review types (bug reports, feature requests, user experiences, and ratings) [72] or the different categories for maintenance perspective (feature requests, problem discovery, information seeking, and information giving) [88]. Some even attempted to extract the gender of the users based on their username [48, 49]. Due to the large amount of data that can be collected, manually annotating the entire data set becomes burdensome, time-consuming, and impracticable. Hence, annotation often performs on a statistically rep- resentative sample of the data set, typically with a confidence level of 95% and a confidence interval of either 5% or 10%. To do this, one of the methods of sampling, such as simple random sampling or stratified random sampling, is used. Additionally, manual annotation is prone to human error. 
Hence, to ensure that the annotation pro- cess is reliable and scientifically sound, it is often performed methodologically, with detailed annotation guidelines. For example, before actual annotations occur, a pilot run is conducted to introduce annotators to the task and terminology, verify that all annotators fully understand the process, and reduce subjec- tivity and bias. The annotation guidelines may subsequently be revised to take into account issues raised during the pilot run. An annotation tool may also be created to facilitate the annotation process and help reduce incorrect annotations due to human error. The guidelines may instruct annotators not to annotate 14 more than a certain amount of reviews in one sitting. This helps eliminate annotators’ cognitive load and avoid errors due to fatigue. In addition, oftentimes, annotators can produce conflicting annotations. As a result, the guidelines should include procedures for resolving these disagreements. For instance, one can consider only reviews for which annotators give the same annotations [72], or add an additional annotator to reconcile the conflicting annotations [47, 107]. In the case of an odd number of annotators, a majority vote can also be used to determine the correct annotation of a given review. Several crowdsourcing platforms (such as Amazon Mechanical Turk 1 ) have emerged in the past few years to allow researchers to outsource their labor-intensive manual tasks to the public. For example, Nayebi et al. [80] leveraged crowd workers on Amazon Mechanical Turk to judge the similarity and dis- similarity between topics in user reviews and Twitter’s tweets produced by Latent Dirichlet Allocation (LDA), a topic modeling technique. Researchers can also create and send annotation tasks to crowd work- ers on such a platform. While doing so would enable a significant quantity of data to be collected, ensuring the quality of annotations becomes the main challenge. Once the manual feature extraction is completed, researchers either empirically analyze the annotated data set or use it as a ground trust set for training and evaluating machine learning algorithms [27]. As a result, manual feature extraction is among the first steps that give rise to techniques for automatic feature extraction. In Chapter 5 and 6, we manually extract software quality features from user reviews. UsingAutomaticFeatureExtraction Automatic feature extraction involves using natural language processing (NLP) or machine learning tech- niques to extract a feature set from user reviews and developer responses. These techniques thus enable a great number of features to be extracted and then analyzed. 1 https://www.mturk.com/ 15 While user reviews and developer responses mainly take the form of text, they also contain important non-text or meta information that can be extracted automatically, such as the posting time, the rating, the number of votes, and the concerned app release version number. Although sentiment extraction is currently an active research topic, there exist several tools that can identify sentiments in text. SentiStrength [113], a lexical-based domain-independent sentiment analysis tool, is among such tools. We employ the tool in this work because it has been shown to work well for short and unstructured text and has been widely adopted in several user review studies (e.g., Bailey et al. [10], Guzman and Maalej [47], Hassan et al. [51]). Additionally, we utilize EmoTxt by Calefato et al. 
[18], an open-source toolkit for emotion recognition, to extract emotions in text. The tool contains six binary classifiers where each classifier detects a specific emotion in text: Anger, Fear, Joy, Love, Sadness, and Surprise. Danescu et al. [28] and Yeomans et al. [123] proposed methods to evaluate the politeness conveyed in a text by taking into account different syntactic and social markers presented in it. These markers include, for example, the presence of “please” at the start of a sentence, the use of indirect requests (e.g., “could you”), the use of greeting (e.g., “hi”, “hello’), or the show of gratitude (e.g., “thank you”, “I appreciate”). We leverage these methods to detect politeness in developer responses as it has shown to be important in handling complaints [29, 77]. There exist tools that can extract software engineering-oriented features from user reviews. These tools automatically classify user reviews according to a predefined set of categories or topics in a super- vised manner. For example, ARdoc (App Reviews Development Oriented Classifier) by Panichella et al. [88] is a review classifier that assigns each sentence in a review into four user intention categories that are relevant to developers: problem discovery (bug report), feature request, information giving, and in- formation seeking. Several published works have since incorporated ARdoc in their tool’s pipeline. For example, SURF (Summarize of User Reviews Feedback) by Sorbo et al. [31] further groups the output of 16 ARdoc into topic clusters (e.g., pricing, content, UI, functionality). ChangeAdvisor by Palomba et al. [87] relies onARdoc to classify user reviews as change requests and recommend source code changes based on them. In this dissertation, we utilize ARdoc and SURF to classify the intents and topics in user reviews. Furthermore, many studies in this area have utilized Latent Dirichlet Allocation (LDA) [15], an unsu- pervised topic modeling method, to help extract and analyze the contents of user reviews. For example, Chen et al. [22] introduced AR-miner, a tool that uses LDA to extract informative topics for developers from user reviews. Guzman and Maalej [47] applied sentiment analysis and LDA to extract feature-related information from user reviews for requirement evolution and software maintenance. LDA assumes that each document is composed of a mixture of topics, where each topic is a distribution of words. Similar sets of words that occur together repeatedly may indicate topics. For instance, a set of words {color, but- ton, banner, layout} may indicate a topic about the user interface (UI), and a document that contains one or more of the words in this set will have some underlying probability to belong to the UI topic. In this dissertation, we propose a method using LDA as part of the pipeline to detect differences in the contents of user reviews between a pair of countries in Chapter 5. 2.2.2 StatisticalAnalysis Statistical analysis is employed in many studies in this area for several tasks [27]. Such studies use sta- tistical analysis to report summary statistics in the dataset, test significance and hypotheses, examine the relationship between features within reviews, and determine the sample size for the manual extraction process described earlier. In this dissertation, we use statistical analysis to report summarized statistics of features in the dataset and employ hypothesis tests such as the two-proportion z-test. 
17 2.2.3 MachineLearningforTextClassification Many studies in this area have used supervised and unsupervised machine learning algorithms to facilitate user review analysis [27]. In supervised machine learning, a set of features and their associated labels or categories are provided to the model, whereas in unsupervised machine learning, a dataset is provided without labels. Text classification is one of the fundamental tasks in machine learning applications. It aims to categorize text documents into one or more categories. Due to its usefulness, it finds applications in a wide variety of tasks through both supervised and unsupervised approaches. For example, it helps identify emotions in text [18], predict best answers in a Q&A forum [84], detect abusive language [83], and identify topics discussed in a Q&A forum [11]. Recent surveys have reported the top 10 popular machine learning algorithms for user review analy- sis: Naive Bayes, Support Vector Machines, Decision Tree, Logistic Regression, Random Forest [17], Neural Network, Linear Regression and K-Nearest Neighbor, K-means, and LDA [27]. We employ some of these al- gorithms, including Naive Bayes, Support Vector Machines, Neural Networks, Random Forest, and Extreme Gradient Boosting [23], and LDA, in our work. We choose these algorithms based on their performance and adoption for a text classification task. We study the performance of these algorithms and use them to make predictions and uncover insights from the data. Additionally, for supervised machine learning algorithms, hyperparameters play a critical role in a model’s performance [92, 111]. Each supervised machine learning algorithm has a set of hyperparameters that affect the performance of the learned model. Hence, these hyperparameters need to be tuned to opti- mize prediction performance. Ideally, one should experiment with every combination of hyperparameter settings and choose the best-performing model as the final model. This brute force method is called Grid Search. However, this can be time-consuming and computationally expensive to try every combination. Random Search [13], which selects a hyperparameter setting randomly, and Bayesian Optimization [14], which selects a hyperparameter setting based on the model’s performance of the previous setting, are two 18 alternative hyperparameter optimization techniques. In our work, we apply the latter two techniques for hyperparameter optimization. In the case of a dataset with an unequal class distribution, a standard procedure to evaluate the gen- eralizability and the performance of a supervised machine learning algorithm for a given hyperparameter setting is the stratified k-fold cross-validation. The data set is split into k folds, where each fold is made by preserving the class distribution in the dataset, and wherek-1 folds are used for training and the remaining fold for evaluating. This process is repeated k times, rotating the training and testing folds. The model’s performance for the hyperparameter setting is calculated as the mean of all k runs. 2.2.4 ExplainableMachineLearning Explainable machine learning refers to methods that make the decision or prediction of a machine learning model understandable for humans [19]. These methods order the input features by their contribution (or importance) from the most important to the least important toward the model’s decision. 
Additionally, by analyzing the top features with the highest importance (e.g., through visualization), explanations and insights into the data can be obtained. The common explainable machine learning methods are Permutation Feature Importance (PFI) and Partial Dependence Plot (PDP) [38]. Permutation feature importance determines the importance of a fea- ture by observing the drop in the model’s performance after permuting or random shuffling the feature’s values. The higher the drop, the higher the importance. A partial dependence plot (PDP) is a visualization method that plots the change in the average prediction probability for a class as a feature varies over its marginal distribution [38]. PDP is often used after PFI to understand how the important features affect the prediction. Carvalho et al. [19] provide a more comprehensive review of explainable machine learning. Explainable machine learning methods have been adopted to uncover insights in many research areas on a variety of problems, such as in medicine [114, 122], in geoscience [71, 124], and in communication 19 [121]. In the software engineering domain, the methods have also been used to understand important features for a wide range of tasks, such as defect prediction [59], acceptability prediction on the Stack Overflow forum [84], and software security tactics identification [115]. In this dissertation, we employ these explainable machine learning techniques to understand what and how the features identified from the feature extraction methods affect the prediction. We choose these methods because of their adoption and human-interpretability of the results. Additionally, we follow the guidelines by Tantithamthavorn et al. [111] to avoid common pitfalls in utilizing explainable machine learning methods. 20 Chapter3 FeaturesthatPredictDeveloperResponses Which aspects of an iOS App Store user review motivate developers to respond? Numerous studies have been conducted to extract useful information from reviews, but little effort has been expended to answer this question. The work in this chapter aims to investigate the potential of using a machine learning algorithm and the features that can be extracted from user reviews to model developers’ response behavior. Through this process, we want to uncover the learned relationship between these features and developer responses. For our prediction, we run a random forest algorithm over the derived features. We then perform a feature importance analysis to understand the relative importance of each feature and groups thereof. We show features that predict developer responses through a case study of eight popular apps. Our results demonstrate that not only rating and review length are among the most important, but timing, sentiment, and the writing style also play an important role in the response prediction. Additionally, the variation in feature importance ranking implies that different app developers use different feature weights when prioritizing responses. Our results may provide guidance for those building review or response prioritization tools and devel- opers wishing to prioritize their responses effectively. This chapter is based on our works in [105] and [108]. 21 3.1 Introduction The ever-tightening web of social interactions has laid the groundwork for closer communications between the developers and users of applications. 
In 2017, the iOS App Store had, like many other vendors, taken advantage of this to empower both sides by adding the option of following up on reviews with responses such as advice or clarifications 1 . While users and developers are positioned at the opposite ends of the exchange of information, both are interested in optimizing their input/output ratio. Consequently, users would like to gain the atten- tion of the developers and receive a response that addresses their concerns; developers value having their app ratings raised after posting an explanation or, possibly, a note that the app has been updated based on the feedback. This requires developers to show a commitment to listening to the community of users and making improvements accordingly. Since, typically, their resources, such as time and available per- sonnel, are limited, they have to prioritize their responses, as evident by the fact that a small fraction of reviews received a response [51]. Nevertheless, responding to reviews can improve overall rating and user satisfaction [51, 76, 103]. From the users’ perspective, a major point of interest is how to shape and place a review to garner a response from the developers. Previously, there have been efforts toward understanding how users review apps and different approaches that automatically classify and summarize reviews to help developers accelerate software maintenance and evolution tasks [27, 75]. On the other hand, less effort has been spent investigating which features out of many in an app user review prompt a developer’s response. Understanding the key features that influence developers’ response decisions could provide guidance for those building review or response prioritization tools and meaningful direction to developers wishing to prioritize their responses effectively. Generally, better response strategies could lead to increased user satisfaction and improved communication between users and developers. 1 https://developer.apple.com/library/archive/releasenotes/General/WhatsNewIniOS/Articles/iOS10_ 3.html 22 Motivated by the above observations, in this chapter, we investigate a wide range of features extracted from user reviews and apply a random forest to study how these features contribute to whether develop- ers will respond to a review. A case study of eight popular free-to-download apps shows that with the identified features as input, the random forest algorithm can differentiate between reviews that receive a developer response from those that do not, outperforming all baselines with considerable gains. This outcome suggests that it is possible to predict developers’ response behavior. Additionally, we perform a feature importance analysis to understand the relative importance of each feature and group of features in predicting a developer response. Insights discovered in this chapter could be of practical significance to a broad range of practitioners. Our main contributions of the work in this chapter can be summarized as follows: • We investigate a wide range of features that can be extracted from user reviews on the iOS App Store that are likely to be associated with developers’ response behavior. Different from previous work, we study many novel features such as writing style, vote, and time. In addition, we include software engineering-oriented features in our analysis, such as bug reports, feature requests, and inquiries. • We extract seven different groups of features and build models to predict developer responses. 
We identify the relative importance of each feature and groups of features. We provide insights into the relationships between the key features and developer responses. • We demonstrate a potential application to apply feature engineering and machine learning to aid in the automatic prioritization of developer responses or user reviews. This chapter is structured as follows. Section 3.2 provides background information of the work. Section 3.3 describes the research setup. Section 3.4 presents the experiment results. Section 3.5 explores the implications for practical use of the findings. Section 3.6 identifies limitations and threats to validity. Section 3.7 provides a summary of the related literature. Section 3.8 summarizes the chapter. 23 3.2 Background 3.2.1 DefinitionsandExamples Developer reaction refers to what the developer(s) of an app will do after seeing a user’s review. They could (1) do nothing, (2) write a public response on the app store, or (3) update the app based on cues taken from the review without communicating those changes to the users, or both of the former options. In this study, we focus only on the second type of reaction, i.e., an instance where developers provide a written response to a review on the App Store. We call thisadeveloperresponse. Figure 3.1 shows three examples of actual user reviews that received developer responses on the App Store. In the review example on the left, a reviewer named TTVBOI911 reported an unexpected app be- havior. He gave the title of the review “Fix” and a one-star rating. A couple of days later, he received a developer response providing a troubleshooting step to solve his issue. In the review example on the top right, a reviewer namedMaggieSteward suggested that developers add a new feature. Shortly after, a developer responded that an identical feature had already been implemented in the version of the app that she has. Lastly, we chose the review example on the bottom right to show an instance where a user edits his or her review after receiving a developer response. The original content of the review is unknown as it was replaced by the edited version, in which the user wrote, “Dark theme is finally here!”. In such a case, theedited indicator will appear next to the posted date. It can also be identified by the fact that the review posted date comes after the developer response posted date. 3.2.2 ReviewandResponseMechanism After users download an app from the iOS App Store, they can share their experience and opinion about the app by leaving a public review, consisting of a text review and a star rating. The star rating ranges 24 Figure 3.1: Examples of user reviews with a developer response (1) The title (2) The star rating (3) The name of the user (4) The review posted date (5) The body (6) The developer response block (7) The response posted date (8) The edited indicator from one to five, where one star indicates extreme dissatisfaction and five stars extreme satisfaction. Users cannot write multiple reviews for an app but can edit or delete their reviews and ratings. Users can also upvote or downvote reviews they find helpful or not helpful. On the other hand, developers see all user reviews through the app management portal 2 . They can filter reviews based on their ratings, countries of origin, and the latest or all versions of the app. They can also sort reviews, for example, by recency or helpfulness (reviews that have received the most upvotes from other users). 
They cannot delete reviews, but they can give public responses to them. Each review can only have one response associated with it. If developers respond to a review, an email, including an option to edit the review, is sent out to the user who wrote the review. 2 https://itunesconnect.apple.com/ 25 Users and developers can edit their reviews or responses at any time. If they do so, the current reviews or responses will be replaced by the edited version. In other words, the App Store only saves and shows the latest version of a review or a response. 3.3 ResearchMethodology 3.3.1 TaskandResearchQuestions The aim of this work is to investigate the potential of using a machine learning algorithm and the features extracted from user reviews to model developers’ response behavior. While doing so, we want to uncover the learned relationship between these features and developer responses. Our overarching goal is to pro- vide insights into the subject or guidance to those building review or response prioritization tools. We formally define our task as follows. Task: DeveloperResponsePrediction • Given a reviewR • Predict whether developers will provide a written response toR. We then set out to answer the following research questions. • RQ1: How effective is the random forest algorithm at predicting developer responses to userreviews? • RQ2: Which individual features and groups of features are most important in predicting developerresponses? • RQ3: Howdotheimportantfeaturesaffectprediction? 26 Figure 3.2: An overview of the approach In the following, we describe our research method, including the data collection, feature extraction, model construction and evaluation, and model interpretation. Figure 3.2 depicts the overview of our re- search method. 3.3.2 Dataset: Case-StudyApps We obtained a list of 50 popular free-to-download iOS apps in the US App Store based on the ranking by AppAnnie 3 , a market research company that monitors app downloads in the iOS App Store. For each app on the list, we used a web crawling tool we developed to retrieve all available reviews by US-based users that were posted within a period from October 1st, 2018, to March 2nd, 2020 (UTC). The crawler acts like an iOS device and utilizes the iTunes APIs to collect data about an app. The length of our study period should allow enough time for these apps to go through many releases and gather user reviews and developer responses. The decision to select only popular apps is based on two reasons. First, such apps have more users than unpopular ones and should therefore have more user review data to analyze. Second, we believe that their developers care about maintaining their popularity by motivating their users to keep using them, which requires good support, including responses. We collected the following data for each app: 3 https://www.appannie.com/en/ 27 1. GeneralAppinfo: • ID, name, and category. 2. Userreviews: • username, post date and time, star rating, body, title, edited, concerned app release number, developer response date and time, and developer response body. 3. Appreleases: • release number, release date and time, and release note. After retrieving the data, we filtered out reviews that users edited after receiving a developer response. That is because we did not have the original data of these reviews before the users edited them, which developers responded to in the first place. Notably, only a total of 5.95% of user reviews with responses in our dataset were filtered out. 
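The filtering step itself is simple once the crawled reviews are in tabular form. The sketch below assumes a pandas DataFrame with an illustrative file name and column names (not the crawler's actual schema): a review is dropped when it has a developer response and its latest posted date falls after the response date, meaning the user edited the review after the reply.

```python
# Sketch of the filtering step: drop reviews that were edited after receiving a
# developer response, since their original (responded-to) content is gone.
# File and column names are illustrative placeholders.
import pandas as pd

reviews = pd.read_json("reviews_dump.json")  # hypothetical crawler output

reviews["review_date"] = pd.to_datetime(reviews["review_date"], utc=True)
reviews["response_date"] = pd.to_datetime(reviews["response_date"], utc=True)

has_response = reviews["response_date"].notna()
# The App Store keeps only the latest version of a review, so a review date
# later than the response date means the user edited it after the reply.
edited_after_response = has_response & (reviews["review_date"] > reviews["response_date"])

filtered = reviews[~edited_after_response].copy()
print(f"Removed {edited_after_response.sum()} of {has_response.sum()} "
      f"responded reviews ({edited_after_response.sum() / has_response.sum():.2%}).")
```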
We selected eight apps that have a response rate 4 between 5% and 50% from the list for our case studies. As a result, our findings may hold true only for apps within this response range. Table 3.1 summarizes the information, including the name, category, number of reviews, number of developer responses, number of edited reviews, and response rate for each app in our case study. 3.3.3 FeatureExtraction In this section, we present the features we considered in this study. The choice of the features we selected is mainly influenced by: 1. Prior work – Rather than starting from scratch, we chose to leverage the wealth of knowledge from prior work to focus on features that have been shown to associate with response decisions in other areas of study (e.g., in Q&A forums [84] and in email communications [121]). 4 The response rate is the percentage of all reviews with a developer response over all reviews. 28 Table 3.1: The study apps AppName Category Reviews Responses Edited ResponseRate Bank of America Finance 14754 2833 245 19.2% Discord Social Network 8707 1837 186 21.10% Mercari Shopping 9047 1540 261 17.02% PayPal Finance 19597 2163 76 11.03% Realtor Lifestyle 18779 6508 25 32.66% Reflectly Fitness 17010 1435 83 8.44% StockX Shopping 16805 4536 324 26.99% WeatherBug Weather 8183 2320 180 28.35% 2. The collection method – Due to user reviews’ unstructured and informal nature, some features are not trivial to extract. Hence, we focus on features that can be extracted easily or using tools that have been employed successfully in literature. Table 3.2 shows the summary of features considered in this study. We categorized our selected features into7 groups based on the characteristics of a feature within the group. Groups 1 to 3 are non-text features, and groups 4-7 are text features. We believe that the short descriptions in Table 3.2 are explanatory enough for some features. Time In email communications, the sent date and time affect email reply behavior [64, 121]. We hypothesize that reviews’ posted date and time may also affect developer response decisions as developers may be more active during the day and on weekdays. In other words, reviews posted during the day and on weekdays may be more likely to get a developer response than those posted at night or on weekends. Hence, in this group, we derived features from the posted date and time of reviews. We consider the following features: TimeOfDay, DayOfWeek, and IsWeekEnd. 29 Table 3.2: Review features potentially affect developer responses along two dimensions: Non-text and Text # Group Feature Description 1 Time TimeOfDay The time of the day that the review is posted. DayOfWeek The day of week that the review is posted. IsWeekend Whether the posted date is a weekday or a weekend. TimeAfterRelease The time, in hour, between the release of a version and the submission of review for the version. 2 Rating Rating The number of stars in the rating. 3 Vote VoteHelpful The number of users who vote the review as helpful to them. VoteTotal The total number of votes the review receives. 4 Style† Readability The score obtained from Flesch reading-ease test [36]. MisspelledWords The proportion of spelling errors in the text. ModalVerbs The proportion of modal verbs in the text. UniqueWords The proportion of unique words in the text. ContainsDigits Whether the text contain any numerical values. FirstCap Whether the text begins with a capital letter. AllCap Whether the text is written in all capital letters. 
5 Length† NumChars The number of characters in the review. 6 Sentiment† Sentiment The overall sentiment polarity of the review. PositiveWords The proportion of positive sentiment words in the text. NegativeWords The proportion of negative sentiment words in the text. 7 Intention† ProblemDiscovery (PD) Whether the review contains a sentence describing issues or unexpected behaviors of the app. FeatureRequest (FR) Whether the review contains a sentence expressing ideas or suggestions for app improvement. InformationSeeking (IS) Whether the review contains a sentence inquiring information or help from developers. InformationGiving (IG) Whether the review contains a sentence informing developers about an aspect of the app. †: Applicable to both review title and body. Non-text (#1–#3) and Text (#4–#7) 30 Following Yang et al. [121], we partitioned the time of day into four sections: night (0/24− 6), morning (6− 12), afternoon (12− 18), and evening (18− 24/0) [121]. We considered Monday through Friday as weekdays and Saturday and Sunday as weekends. Rating In addition to writing text reviews, users can leave star ratings for an app on a scale of one to five stars, where one star indicates extreme dissatisfaction and five stars extreme satisfaction. Star ratings are, there- fore, an meaningful, at-a-glance representation of how users feel about an app. They are also a crucial factor influencing how an app ranks in search results and deter or encourage app downloads. It is reason- able to assume that developers will prioritize responding to reviews with lower star ratings to get users to change their low ratings into higher ones. Hassan et al. [51] found that if users edited their reviews after receiving a developer response, they tended to increase the rating. They also found that the likelihood that developers will respond to a review increases as the review rating decreases. These results are consistent with previous studies, which found lower rating (1 and 2-star) reviews to be more valuable to developers than higher rating reviews [63, 73]. Nonetheless, the importance of review ratings to developer responses is still unclear when considered together with many other features. Consequently, we selected Rating as one of the features. Vote Users can upvote or downvote reviews they find helpful or not helpful, respectively. Each review, there- fore, has two numbers representing the number of up (helpful) votes and the total number of votes it receives, though these numbers are not shown in the current version of the App Store. In the past, the App Store showed them in the form “<NUM_HELPFUL> out of <NUM_TOTAL> customers found this review helpful.” By default, the App Store displays the most helpful reviews to users first. Intuitively, by 31 seeing developer responses to such reviews, users can quickly gauge how well the app is supported, which may be one of their criteria for app download decisions. Reviews that many users vote helpful can signal to developers that many users agree with the reviewer’s assessment of the app, have similar concerns, or experience the same issues. Additionally, reviews not considered helpful will also appear toward the bottom of the list; thus, such reviews may not be great candidates for developers to respond to. Hence, it is reasonable to assume that developers want to prioritize responding to helpful reviews. Note that de- velopers also have an option to sort reviews by helpfulness in the app management portal. 
Therefore, we included VoteTotal and VoteHelpful in our list of features. Style Several features of the writing style of the reviewer were found to influence the perceived value of reviews. We hypothesize that they may also affect a developer’s perception of a review and thus affect their reaction. We consider the following features: Readability,MisspelledWords,ModalVerbs,UniqueWords,ContainDigits, FirstCap, and AllCap. Note that these features are applicable to both review title and body. Fang et al. [34] and Korfiatis et al. [65] found that reviews that are easy to understand strongly cor- relate with high-value reviews. Schindler et al. [97] noted that reviews with greater use of negative style characteristics (e.g., spelling errors) were perceived as less valuable. Modal verbs were found to be relevant to user review classification [72]. Unique words were found to be important for review helpfulness predic- tion [101]. Reviews that contain digits could mention specificity, for example, in-app purchase pricing (e.g., $3.99), app version (e.g., v12.3), or OS version (e.g., iOS 10.3). Lastly, all capital letter words are typically used to indicate emphasis, attract attention, or convey intense emotion, which may all affect developer responses. We used the textstat 5 package to calculate the Flesch’s reading ease score [36]. To detect capital letters, we used the built-in string function provided directly by Python (isUpper() function). For the rest of the 5 https://pypi.org/project/textstat/ 32 features in this group, we applied common pre-processing steps to reviews. Specifically, we converted them to lower case and used the Treebank tokenizer 6 to split a given review into words or tokens. Note that some style features may call for additional steps, which will be discussed in detail below. Furthermore, we obtained the proportion of each feature by normalizing its count by the number of words in the review. To detect spelling errors, we checked each word in a review against a dictionary consisting of over 497,000 English words 7 . To count modal verbs, we compared each word in a review against the following common modal verbs: can, could, may, might, must, will, would, shall, should, and ought. We also included their negative and contraction forms (e.g., won’t and can’t). We also added the name of the application to the dictionary. To count unique words in reviews we added the widely adopted Porter Stemmer al- gorithm [91] to the pre-processing steps. This is done to reduce complexity and semantic duplicates in reviews by transforming the word back to its root form (e.g., from shopping to shop). Finally, to derive the ContainsDigits feature, we checked each character in a review to see if it is a digit. Length Review length has been studied extensively in many studies and was consistently found to be directly associated positively with its perceived value [68, 85, 97]. In another area of research, Yang et al. [121] found that the length of the email subject and body were two of the most important features to predict a reply. The result is also supported by Hassan et al. [51], which found that review length affected developers’ response decisions as developers were more likely to respond to longer reviews than shorter ones. Intuitively, long reviews stand out more to developers because, typically, most reviews are short [85]. They are also more valuable to developers because they can hold more information than shorter ones. 
In fact, longer reviews were targeted as being of greater 6 https://www.nltk.org/_modules/nltk/tokenize/treebank.html 7 https://github.com/dwyl/english-words 33 interest to developers than shorter ones [73]. Hence, we considered the following two features for this group: NumCharsBody and NumCharsTitle. Sentiment The sentiment of reviews could be positive, negative, or neutral. Intuitively, developers will prioritize responding to negative reviews because such reviews should most likely indicate severe problems with the app or highlight users’ dissatisfaction with the app. In other words, they are candidates for developers to increase users’ satisfaction. Sentiment can help indicate the purpose of the review. Recent studies have shown that incorporating sentiment improves the performance of app review classification approaches [72, 89]. To calculate the sentiment expressed in user reviews, we use SentiStrength [113], a lexicon-based sen- timent analysis technique for short texts. SentiStength was chosen because it has been widely adopted in literature and used successfully in several mobile app user reviews studies (e.g., Bailey et al. [10], Guzman and Maalej [47], Hassan et al. [51], and Maalej and Nabil [72]). SentiStrength assigns a negative and a positive score to each pre-defined token or word in its knowledge base. The scores assume integral val- ues between± 5 and± 1, where a higher magnitude implies a stronger emotion. For example, “happy” is assigned a score of [+2,-1] and “sad” a [+1,-4]. In addition, it accounts for capital letters and repeated punc- tuation symbols, which text writers often use to express emotion degree as well as emoticons (e.g., :-)). For a more detailed discussion on the inner work of SentiStrength, we refer to its official documentation 8 . Table 3.3 shows the output of theSentiStrength tool when applied it to find sentiment of a review. The examples show that SentiStrength takes into account emoticons and provides additional emphasis when encountering capital letters and punctuation to magnify the underlying sentiment. Additionally, the use or absence of positive and negative words in reviews was found to be associated with the perceived value and whether or not a review will be read [95]. To identify positive and negative 8 http://sentistrength.wlv.ac.uk/documentation/ 34 Table 3.3: Example results from SentiStrength # Review SentiStrengthExplanation Sentiment 1 THIS APP IS A SCAM THIS APP IS A SCAM[-4] [-1 CAPITALS] [sentence: 1,-5] [result = average (1 and -5) of 1 sentences] − 1 2 Used to be great. New update ruined everything. Used [proper noun] to be great[3].[sentence: 3,-1] New update ruined[-2] everything [sentence: 1,-2] [result = average (4 and -3) of 2 sentences] − 1 3 I <3 Mercari ...prob TOO much!! I <3 [1 emoticon] Mercari [proper noun] ...[sentence: 2,-1] prob TOO much !!’[+1 punctuation mood emphasis] [sentence: 2,-1] [result = average (4 and -2) of 2 sentences] +1 words in reviews, we used the opinion lexicon corpus provided by Hu and Liu [54]. Words that do not appear in the lexicon corpus are considered neutral and do not get counted towards the positive and negative words feature. The count for each is then normalized by the number of words in the review. Overall, we considered the sentiment of review body and title. Hence, we extracted SentimentBody, SentimentTitle, PositiveWordsBody, PositiveWordsTitle, NegativeWordsBody, and NegativeWordsTitle. Intention Each review can consist of multiple sentences, each with its own intention. 
For example, one sentence may specify which aspect of the app needs to be fixed, while another might specify the improvements needed. A third may specify an action suggested to the developer, simply state a fact about the app, or pose a question to the developer. Out of these examples, some intentions are actionable, while the others may not be. More precisely, developers may or may not feel obliged to respond to reviews with particular intentions. Panichella et al. [89] grouped 17 common topics present in user reviews, as identified by Pagano et al. [85], into the following four user intention categories that are relevant to developers (i.e., software engineering-oriented intentions). • ProblemDiscovery (PD) – to describe issues or unexpected behaviors of the app. 35 • FeatureRequest (FR) – to express ideas or suggestions for app improvement. • InformationGiving (IG) – to inform developers or other users about an aspect of the app. • InformationSeeking (IS) – to inquire information or help from developers. To detect intention in user reviews, we used the ARdoc tool, a user review classifier proposed by Panichella et al. [88]. The tool assigns each sentence in a review to one of the above user intention categories. We chose the tool because it is fine-grained as it operates in a sentence-level manner and has been successfully incorporated in the pipeline of other tools, such as in ChangeAdvisor by Palomba et al. [87] and inSURF by Sorbo et al. [31]. We directly used the original Java implementation of the tool, which is publicly available on the authors’ website 9 . To capture the overall intention of a given review, if at least one of its sentences was assigned to one of the above intentions, we treated the review as having that intention. As an example, Table 3.4 shows the results of applying ARdoc to actual user reviews. Table 3.4: Example results from ARdoc Review Intention Can not access apps, if I do I continuously get kicked out! PD This has been my go to app & I love to search by zip code. Unfortunately after the recent update it’s not working. Please help!! IG, PD Needs to have more pictures of the properties. FR Why are the photos in EVERY listing blurry all of a sudden? IS 3.3.4 ModelConstructionandEvaluation Our goal for building a machine learning model is to use the model to gain insight into our subject through examining the learned relationship between the features and developer responses. 9 https://www.ifi.uzh.ch/en/seal/people/panichella/tools/ARdoc.html 36 With this goal in mind, we experimented with the random forest algorithm [17], one of the most suc- cessful non-linear machine learning algorithms suitable for this type of task. The random forest algorithm is an ensemble algorithm made up of multiple decision trees. Each tree in the forest is trained on a random sample of the trained data drawn with replacement (known as bootstrapping). It also randomly selects subsets of features when splitting nodes and makes the decisions by averaging the predictions from each decision tree. As a result, the random forest algorithm is less prone to over-fitting. In addition, the use of an internal error estimation through Out-of-Bag samples removes the need for a set-aside validation or test set. The algorithm is flexible as it can take numerical or categorical variables as input and does not require feature scaling. For these reasons, we chose the random forest algorithm for our study. 
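The out-of-bag estimate mentioned above is easy to obtain in practice. The following minimal sketch, which uses scikit-learn and synthetic data purely for illustration, fits a forest with bootstrapping enabled and reads back the internal out-of-bag score.

```python
# Minimal illustration of the random forest's out-of-bag (OOB) error estimate:
# each tree is scored on the training rows left out of its bootstrap sample,
# giving a validation-like estimate without setting aside a separate test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.8, 0.2], random_state=0)

forest = RandomForestClassifier(n_estimators=500, bootstrap=True, oob_score=True,
                                class_weight="balanced", random_state=0)
forest.fit(X, y)

print(f"Out-of-bag accuracy estimate: {forest.oob_score_:.3f}")
```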
We used Scikit-learn 10 , a machine learning library for Python, to implement the random forest algo- rithm. The dataset of each app is split into 80% training and 20% test data using stratified sampling to ensure similar label distribution across the test and train sets. Hyperparametersoptimization We have constructed a grid search space and performed randomized search over uniformly drawn samples from it. The range of values used for different hyperparameters to create this grid search space are as follows: • n_estimators: 100 to 1000 with step size of 100. • max_features: integral multiple of √ #features to #features. • max_depth: 10 to 100 with step size of 10. • min_samples_split: 0.01, 0.03, and 0.05. • min_samples_leaf : 1, 3, and 5. 10 https://scikit-learn.org/ 37 • bootstrap: always set to True. • class_weights: always set to balanced. The above set of values leads to a grid search space of5400 (10· 6· 10· 3· 3). We then performed random sampling using RandomizedSearchCV (from scikit-learn) to try out a broad spectrum of values from the above search space. We have used 100 iterations through 4− fold cross-validation to identify the optimal hyperparameters. We report the hyperparameters obtained from the procedure for each app in Table 3.5. Table 3.5: The chosen hyperparameters for each app using RandomizedSearchCV Parameter BankofAmerica Discord Mercari Paypal n_estimators 100 1000 500 500 max_features 41 42 34 34 max_depth 90 20 20 20 min_samples_split 0.01 0.01 0.01 0.01 min_samples_leaf 3 1 1 1 Parameter Realtor Reflectly StockX WeatherBug n_estimators 900 700 400 600 max_features 20 6 34 34 max_depth 10 40 40 90 min_samples_splits 0.01 0.01 0.01 0.01 min_samples_leaf 3 1 3 3 Baselines To evaluate the effectiveness, we compare the performance of the random forest algorithm described earlier to the following two standard baseline approaches. • Random, we randomly assign a class to each review in the test set. • MajorityClass, we assign the majority class in the training set to the testing set. Generally, suppose the performance of our model is too close to these baselines. In that case, it implies that the selected features have no association with developer responses or that more features are needed. 38 PerformanceMetrics Table 3.1 shows that our dataset is imbalanced, with most reviews not receiving a developer response. The average response rate for all case study apps is 20%. Therefore, for our task, Accuracy is not a suitable evaluation metric. That is because we can predict that developers will not respond to any reviews and achieve an accuracy of 80%, but we gain no knowledge of our research subject. Instead, we selected the Area Under the Receiver Operating Characteristic (ROC) Curve to evaluate the performance of our task. We use the acronym AUC to denote this metric. The ROC curve shows how the false positive rate and true positive rate relationship change as the model’s threshold for identifying positives changes. It plots false positive rate (FPR), i.e., false alarm, on the x-axis (eq - 3.1) and and true positive rate (TPR), i.e., recall, on the y-axis (eq - 3.2). FPR = false positives false positives + true negatives (3.1) TPR = true positives true positives + false negatives (3.2) The area under this curve (AUC), therefore, indicates the performance of a model at separating classes. A perfect classification model would yield an AUC value of 1.00, whereas a random model would yield an AUC value of around 0.50. 
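Putting the pieces of this section together, the sketch below mirrors the described setup: a stratified 80/20 split, 100 randomly sampled hyperparameter settings evaluated with 4-fold cross-validation, and AUC as the score. It uses scikit-learn with synthetic placeholder data and is an illustrative approximation of the study scripts rather than the scripts themselves; in particular, the max_features values are one reasonable reading of "integral multiples of √#features up to #features".

```python
# Sketch of the model construction and evaluation described above: stratified
# 80/20 split, randomized search over the stated grid with 4-fold CV, and AUC
# on the held-out test set. X and y are synthetic stand-ins for the extracted
# review features and the has-response labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=8000, n_features=45, n_informative=12,
                           weights=[0.8, 0.2], random_state=42)
n_features = X.shape[1]
sqrt_f = int(np.sqrt(n_features))

param_grid = {
    "n_estimators": list(range(100, 1001, 100)),
    "max_features": [m * sqrt_f for m in range(1, n_features // sqrt_f + 1)] + [n_features],
    "max_depth": list(range(10, 101, 10)),
    "min_samples_split": [0.01, 0.03, 0.05],
    "min_samples_leaf": [1, 3, 5],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(bootstrap=True, class_weight="balanced", random_state=42),
    param_distributions=param_grid,
    n_iter=100, cv=4, scoring="roc_auc", random_state=42, n_jobs=-1)
search.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print("Best hyperparameters:", search.best_params_)
print(f"Held-out AUC: {test_auc:.3f}")
```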
Generally, the higher the AUC value, the better the model is at distinguishing between the two classes. 3.3.5 ModelInterpretation We employed a model-agnostic technique called “permutation feature importance,” which was introduced by Breiman [17] to measure the importance of an individual feature and a group of features. 39 The key idea of this technique is that if a random permutation of a feature leads to a substantial increase in the classification error, the model heavily relies on that feature for prediction. We refer to this increase as theImportanceValue(IV). Hence, the higher the increase in the classification error, the more important the feature is to the model. We used the training data to compute importance. By using this same principle, we can compute the importance of a group of features by permuting features in the same group together (i.e., as a single meta-feature). The increase in the classification error is the relative importance for that group. Nonetheless, Strobl et al. [110] noted that when two or more individual features are highly correlated, the technique can lead to incorrect identification of feature importance because the model can use the same information from a correlated feature. Hence, one strategy to mitigate this is to identify and discard highly correlated features before training the model. For this, Spearman’s correlation is used to identify correlated features due to its ability to model complex monotonic relationships among features rather than just linear as modeled by the Pearson correlation method. Following Omondiagbe et al. [84], feature pairs with an absolute correlation value of 0.7 or higher are marked as highly correlated. Among these identified pairs, the ones with a lower mutual information [66] are subsequently removed. This approach ensures that our models are trained on features that are not significantly correlated with one another. Consequently, we can safely employ the permutation feature importance technique. 40 3.4 ExperimentResults 3.4.1 RQ1:Howeffectiveistherandomforestalgorithmatpredictingdeveloperresponses touserreviews? Figure 3.3: The ROC curves of the random forest algorithm that uses the selected features for all case study apps and the baselines Figure 3.3 shows the ROC curves of the baselines and the random forest algorithm, which uses the identified features as input for each case study app. We observed that the random forest algorithms that use the identified features as input performed significantly better than the two baselines at distinguishing reviews that received a developer response from those that did not, for all eight apps. The average AUC for the random forest algorithms is 0.84. Regarding the two baselines, since the MajorityClass baseline always predicts ‘No’ or0, its true positive rate is 1.0 because reviews with no developer responses were correctly classified. The false positive rate is 1.0 since reviews with developer responses were incorrectly classified. 41 This results in an average AUC score of 0.500 across all eight apps. Similarly, Random achieved expected AUC scores of around 0.500 due to their randomness. For the app WeatherBug, the model can achieve an AUC score of 0.99, which is considered almost perfect. This indicates that we have identified features that developers for this app use to prioritize their responses. However, for the app StockX, the random forest cannot achieve as high an AUC score as other apps, although higher than the two baselines. 
This may indicate that we may still be missing crucial features that developers for this app use to decide whether a review should get a response or that developers do not have a concrete response strategy. Overall, we observed from Figure 3.3 that the random forest algorithms that use the identified features as input performed significantly better than the three baselines at distinguishing reviews that received a developer response from those that did not, for all eight apps. The average AUC for the random forest algorithms is 0.84. These high AUC scores indicate that our models have high explanatory power in pre- dicting developer responses. Summaryoffindings: When compared to the baselines, the high performance of the random forest algorithm that uses the studied features suggests that the features have some dependent relationships with developer responses. The findings implicate a potential application of applying feature engineer- ing and machine learning to help developers automatically identify reviews they are most likely to respond to and focus on. 3.4.2 RQ2: Which individual features and groups of features are most important in predictingdeveloperresponses? Table 3.6 shows the top 10 most important features in predicting developer responses for each app. We also observe thatRating is among the most important features across most apps, except for the appsStockX and Reflectly . This is expected since the rating is an important representation of how users feel about an app. 42 Table 3.6: Top 10 most important features in predicting developer responses for each app (IV = Importance Value) BankofAmerica Discord Mercari Paypal Feature IV Feature IV Feature IV Feature IV Rating 0.191 Rating 0.265 Rating 0.170 Rating 0.181 TimeAfterRelease 0.052 PositiveWordsTitle 0.012 TimeAfterRelease 0.146 PositiveWordsBody 0.046 IsWeekend 0.010 PositiveWordsBody 0.008 NumCharsBody 0.072 TimeAfterRelease 0.040 PositiveWordsBody 0.010 NumCharsTitle 0.006 PositiveWordsBody 0.060 PDBody 0.034 NegativeWordsTitle 0.009 PDBody 0.006 NumCharsTitle 0.041 MisspelledWordsBody 0.029 IsMorning 0.007 SentimentBody 0.005 NegativeWordsBody 0.033 NumCharsBody 0.028 NumCharsBody 0.006 NumCharsBody 0.005 MispelledWordsBody 0.029 NegativeWordsBody 0.011 IsWednesday 0.005 TimeAfterRelease 0.005 ModalVerbsBody 0.026 PositiveWordsTitle 0.009 IsTuesday 0.004 NegativeWordsBody 0.004 PositiveWordsTitle 0.017 ModalVerbsBody 0.008 NumCharsTitle 0.003 SentimentTitle 0.004 ReadabilityBody 0.015 NegativeWordsTitle 0.003 Realtor Reflectly StockX WeatherBug Feature IV Feature IV Feature IV Feature IV Rating 0.180 TimeAfterRelease 0.161 TimeAfterRelease 0.157 Rating 0.279 NumCharsBody 0.066 Rating 0.058 NumCharsBody 0.153 TimeAfterRelease 0.008 TimeAfterRelease 0.052 NumCharsBody 0.033 UniqueWordsBody 0.016 IsThursday 0.004 PositiveWordsBody 0.039 PositiveWordsBody 0.031 NumCharsTitle 0.016 PositiveWordsBody 0.003 NumCharsTitle 0.024 IsTuesday 0.028 PositiveWordsBody 0.015 NegativeWordsTitle 0.002 NegativeWordsBody 0.014 IsThursday 0.024 IsSaturday 0.015 NumCharsBody 0.002 PositiveWordsTitle 0.013 NumCharsTitle 0.020 Rating 0.013 PDBody 0.001 PDBody 0.012 PositiveWordsTitle 0.020 MisspelledWordsBody 0.009 NumCharsTitle 0.001 FRBody 0.010 NegativeWordsBody 0.017 IsNight 0.008 MisspelledWordsBody 0.001 ReadabilityBody 0.009 ReadabilityBody 0.014 ReadabilityBody 0.007 SentimentBody 0.001 Developers can quickly gauge how satisfied their users are with an app by looking at the rating without having to read the review contents. 
This result is congruent with that of previous work of Hassan et al. [51] and McIlroy et al. [76], which found the rating to play a role in deciding whether a developer will respond to a review. Another feature that appears within the top five most important features of most apps is NumChars- Body, the feature that has consistently been found to be positively correlated with the perceived value of a text [85, 97]. Here, we also observed that the random forest algorithms relied on this feature for review title and review body to predict developer responses. For theSentiment feature group,PositiveWords plays a significant role in predicting developer responses as it consistently appears within the top five most important features for most apps. This may indicate 43 that a review that appears to be overly emotional or in a neutral tone can incite developer responses. Interestingly, our results suggest that SentimentBody and SentimentTitle are less important for prediction than other features in the same group. The Time features are also important in predicting developer responses as TimeAfterRelease appears within the top three most important features for developer response prediction for all apps. Likewise, DayOfWeek features (such as IsThursday) and TimeOfDay features (such as IsMorning) can also help with the prediction. The results strongly suggest that the posted date and time of review can increase the likelihood of it getting a developer response. In the case of the Intention feature group, we observed that, for four apps (Discord, PayPal, Realtor, and WeatherBug), the random forest algorithms relied on the Problem Discovery feature for review body to help predict developer responses. This suggests that developers for these apps often respond to reviews that report unexpected behaviors of the apps. Interestingly, the FeatureRequest feature for review body is important for the developer responses prediction of only two apps (Realtor and WeatherBug). The Infor- mationGiving and theinformationSeeking features provide a weak contribution to the prediction, despite the latter being more actionable. We note that the intentions for the review title also provide a weak con- tribution to the prediction. Only the app Discord has the Problem Discovery feature for review title in its most important features list. Although not allStyle features appear in Table 3.6 for most apps, someStyle features such asReadabil- ity, MisspelledWords, ModalVerbs, and UniqueWords did play some role in predicting developer responses. This suggests that the quality of reviews does matter to developers to some extent. Interestingly, whether a review contains a numeric value (ContainsDigits), is written in all capital letters (AllCap) or begins with a capital letter (FirstCap) does not contribute to the prediction. Similarly, in all apps, the random forest algorithms did not rely on the Vote feature group (VoteHelpful and VoteTotal). 
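Table 3.7, shown next, reports the corresponding group-level importances. For reference, the group importance described in Section 3.3.5 can be computed by shuffling all features of a group with the same random permutation and measuring the drop in performance. The sketch below illustrates this on toy data; the column names, toy labels, and group definition are placeholders rather than the study's actual feature table.

```python
# Sketch: group-level permutation importance, obtained by shuffling all columns
# of one feature group together and measuring the drop in AUC.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Toy stand-in data; in the study these would be the extracted review features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Rating": rng.integers(1, 6, 4000),
    "NumCharsBody": rng.integers(5, 600, 4000),
    "SentimentBody": rng.integers(-4, 5, 4000),
    "PositiveWordsBody": rng.random(4000),
    "NegativeWordsBody": rng.random(4000),
})
y = (X["Rating"] <= 2).astype(int)  # toy label standing in for "has a response"
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def group_importance(model, X, y, group_cols, n_repeats=10, seed=0):
    """Mean drop in AUC after jointly permuting all columns of one group."""
    r = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        idx = r.permutation(len(X_perm))
        # Apply the same row shuffle to every column in the group.
        X_perm[group_cols] = X_perm[group_cols].to_numpy()[idx]
        drops.append(baseline - roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
    return float(np.mean(drops))

print("Sentiment group:", group_importance(model, X, y,
      ["SentimentBody", "PositiveWordsBody", "NegativeWordsBody"]))
print("Rating group:   ", group_importance(model, X, y, ["Rating"]))
```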
44 Table 3.7: The relative importance of each group of features in response prediction (IV = Importance Value) BankofAmerica Discord Mercari Paypal Feature IV Feature IV Feature IV Feature IV Rating 0.179 Rating 0.350 Time 0.184 Rating 0.157 Time 0.088 Sentiment 0.031 Rating 0.183 Sentiment 0.081 Sentiment 0.035 Time 0.018 Sentiment 0.133 Style 0.045 Style 0.012 Style 0.014 Length 0.109 Time 0.038 Length 0.007 Length 0.011 Style 0.088 Intention 0.032 Intention 0.005 Intention 0.009 Intention 0.008 Length 0.031 Vote 0.002 Vote 0.002 Vote 0.008 Vote 0.000 Non-text 0.269 Non-text 0.363 Non-text 0.313 Non-text 0.181 Text 0.063 Text 0.056 Text 0.258 Text 0.158 Realtor Reflectly StockX WeatherBug Feature IV Feature IV Feature IV Feature IV Rating 0.163 Time 0.256 Time 0.196 Rating 0.282 Length 0.037 Sentiment 0.089 Length 0.170 Time 0.012 Sentiment 0.036 Style 0.068 Sentiment 0.040 Sentiment 0.002 Time 0.027 Length 0.051 Style 0.037 Intention 0.001 Style 0.018 Rating 0.050 Rating 0.006 Style 0.001 Intention 0.013 Intention 0.011 Intention 0.002 Length 0.000 Vote 0.000 Vote 0.000 Vote 0.002 Vote 0.000 Non-text 0.181 Non-text 0.310 Text 0.212 Non-text 0.347 Text 0.110 Text 0.185 Non-text 0.200 Text 0.008 By investigating the results of the feature group analysis in Table 3.7, we observed that for most apps, Rating is still the most important feature. However, for three apps (Mercari, Reflectly , and StockX), the Time feature group is more important than Rating. Interestingly, the Intention group and Vote group are among the least important feature groups in predicting developer response for most apps. These results are also consistent with the results in the individual feature analysis from earlier. By grouping features into Text and Non-text features, we find that Non-text features are more important than Text features for the task for all almost all apps, except the app StockX. However, the importance value between Text and Non-text for the app indicates that both groups of features are equally important to the prediction. 45 Summaryoffindings: Rating, Time, Length, Sentiment, and some of the features for Style are consis- tently found to be among the top important features in developer response prediction. Vote features do not contribute to the prediction. Non-text features are more important than text features. Lastly, we ob- served some variation in how important features rank, suggesting that a one-size-fits-all prioritization tool may not be suitable for ranking reviews for developers on the App Store. 3.4.3 RQ3: Howdotheimportantfeaturesaffectprediction? While feature importance analysis (in RQ2) can reveal the relative importance of each feature in predict- ing developer responses, it does not determine the direction of influence or the relationship between the features and developer responses. In this RQ, we explore the relationship between the key features and developer responses to provide guidance for those building review or response prioritization tools. How- ever, we note that this is a univariate analysis which has the shortcoming of not considering multivariate interactions. Additionally, we note that we present and discuss only the subset of key features (RQ2). For interpretability, we use a quantile-based discretization function to bin the values for features with contin- uous values. Otherwise, features are binned by ratio increments of 10%. 
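As a minimal sketch of this step, assuming pandas and illustrative column names (a continuous NumCharsBody feature and a binary has_response label), the binning and the per-bin response rate can be computed as follows.

```python
# Sketch of the RQ3 analysis step: quantile-bin a continuous feature and
# compute the developer-response rate within each bin.
# Column names and the random data are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "NumCharsBody": rng.integers(5, 1200, 5000),
    "has_response": rng.integers(0, 2, 5000),
})

# Quantile-based discretization for a continuous feature.
df["length_bin"] = pd.qcut(df["NumCharsBody"], q=10, duplicates="drop")

# Percentage of reviews in each bin that received a developer response.
response_rate = df.groupby("length_bin", observed=True)["has_response"].mean().mul(100)
print(response_rate.round(1))
```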
The response rate for each feature is the percentage of reviews with the feature that receive a developer response over all reviews with the feature.

We found that for the majority of apps, reviews that are posted in the morning (UTC), during weekdays (UTC), and immediately after an app release are more likely to get developer responses. This makes sense since developers should be more active during the day and on weekdays, as these are business hours. A possible explanation for why developers are more likely to respond to reviews posted shortly after a release is that developers are more active during that time and may still be working on the first round of severe errors.

We observed that reviews with lower ratings (3-star and below) are more likely to get developer responses than reviews with higher ratings (4-star and 5-star reviews). This makes sense since developers may view such reviews as great candidates to get users to change their low rating into a higher one. This finding is consistent with previous studies, which found reviews with lower ratings to be more valuable to developers than reviews with higher ratings [51, 63].

Figure 3.4: The effects of the Length features

We observed from Figure 3.4 that reviews with longer bodies and titles have a greater likelihood of getting developer responses. This is reasonable because longer review bodies and titles can hold more information, and because most reviews are short, longer reviews stand out more. Interestingly, we observed that the above statement holds only up to a certain length threshold in some apps. This may be due to developers discarding very long reviews.

We found that reviews with a higher use of positive words are less likely to get developer responses. Likewise, for most apps, reviews with negative sentiment are more likely to get developer responses than reviews with neutral and positive sentiment. However, we observed that the sentiment feature (e.g., SentimentBody) does not seem to be a strong predictor, which is also evident from the outcomes of our feature importance analysis in RQ2.

We observed that reviews with the intention of reporting unexpected app behaviors (ProblemDiscovery) have the highest chance of receiving developer responses over reviews with other intentions. For most apps, the likelihood of getting a developer response for reviews with the intention of seeking information (InformationSeeking) is higher than for reviews that express ideas or suggest app improvements (FeatureRequest) and for reviews with the intention of informing developers or other users about some aspect of the app (InformationGiving). However, the likelihood differences are not pronounced. Hence, these features provide weak predictive power (Table 3.6). We also used the SURF tool [31] to understand further which topics within each intention have the highest chance of receiving developer responses. For the problem discovery intention, the update/version and download topics have the highest chance of receiving a response. For the information seeking intention, the resources and company topics have the highest chance. For the feature request intention, the update/version and GUI topics have the highest chance. Lastly, for the information giving intention, the update/version and download topics have the highest chance of receiving a response.

From Figure 3.5, we observed that the higher the percentage of misspelled words a review has, the less likely it is to receive a developer response.
Our observation is consistent with the outcomes of Schindler et al. [97], which stated that reviews with a greater use of negative style characteristics (e.g., spelling errors) were perceived as less valuable and may not be read. Similarly, from Figure 3.5, we observed that reviews with a high readability score, according to the Flesch reading-ease test [36], are more likely to get a developer response than reviews with a low readability score. This observation converges with Fang et al. [34] and Korfiatis et al. [65], which found that reviews that are easy to understand strongly correlate with high-value reviews. Interestingly, our result also suggests that the response likelihood drops significantly for reviews that are too easy to read. When we investigated reviews in this category, we found that they are primarily reviews with short sentences consisting of one or two words (e.g., "Bad app").

Figure 3.5: The effects of the Style features

3.5 Implications

The findings of the work in this chapter have several implications that could be of practical significance to a broad range of practitioners.

3.5.1 User-Side

Figure 3.6: A mockup of a developer response prediction system integrated into the App Store

The mockup in Figure 3.6 shows a motivating example of a developer response prediction system integrated into the app store. Based on multiple features that can be extracted during review composition, the system determines whether the user needs a developer response, predicts the chance of getting the response, and suggests steps to increase that chance. To prevent users from exploiting this knowledge to elevate the priority of their otherwise mostly irrelevant reviews, the system may not reveal or suggest all highly weighted features to the users but only apply the features on the developers' side when prioritizing reviews. As shown in Figure 3.6a, if it detects that the user is reporting an unexpected behavior of the app but only describing the bug in one short sentence, it will suggest the user increase the length by adding more relevant information or further clarifications, both of which should increase the chance of getting a developer response. Similarly, in Figure 3.6b, if it detects that the user is asking for help, it suggests submitting the review on weekday mornings to increase the chance of eliciting a developer response.

Therefore, this suggestion system would serve as a collection of evidence-based guidelines on providing the kind of feedback that will have a high chance of receiving a developer response. It is easy to see that, indirectly, this has the potential to raise the quality of user feedback on the app store by making it in users' interest to post focused, clear, and actionable reviews instead of rants.

3.5.2 Developer-Side

This chapter has identified several important features for developer response prediction. Based on our findings, researchers could develop tools that automatically prioritize reviews based on features or criteria corresponding to developers' actual response behavior (i.e., what they look for in a review) and not on intuitions or assumptions about their behavior. Table 3.6 shows that each app has its own individual feature importance ranking, suggesting that one-size-fits-all prioritization tools may not be the most effective. Therefore, the tools should also allow developers to calibrate by adjusting feature weights to their specific needs.
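As a purely illustrative sketch of such a prioritization step, and not a tool developed in this dissertation, unanswered reviews could be scored by a trained response model's predicted probability and surfaced in descending order; the model, data, and column names below are toy stand-ins.

```python
# Illustrative only: rank unanswered reviews by a trained model's predicted
# probability of receiving a developer response. Toy data stands in for real
# feature vectors; a deployed tool could additionally expose per-feature
# weights so a team can calibrate the ranking to its own needs.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.8, 0.2], random_state=3)
model = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                               random_state=3).fit(X[:2500], y[:2500])

unanswered = pd.DataFrame(X[2500:], columns=[f"feature_{i}" for i in range(10)])
unanswered["response_score"] = model.predict_proba(X[2500:])[:, 1]

# Reviews most likely to warrant a response appear at the top of the queue.
print(unanswered.sort_values("response_score", ascending=False).head(10))
```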
Knowing which reviews to prioritize will let developers focus their resources and efforts optimally. Better response strategies should lead to increased user satisfaction and improved communication between users and developers. 3.6 ThreatstoValidity The study in this chapter has several factors that may affect the validity of the results and conclusions. In this section, we identify several threats to validity and limitations of our study. 3.6.1 CompletenessofFeatureSet Our results and conclusions were affected by the features we have selected and considered. This means that others employing different sets of features and categorizing them differently may arrive at different research outcomes. The number of features possible is only limited by human imagination. Examples of features that we have considered for our prediction but could not incorporate yet are country of origin and the discussion topic. Praise could also be important in building relationships between developers and their users [85]. 3.6.2 ToolandMethodReliability While the tools we employed to extract features have been successfully integrated and used in related research and have been shown to be effective in literature, it is still possible that inaccuracies exist in them that we did not take into account. In addition, since the existing tools are trained on review bodies, they may not be able to categorize the review titles well. We acknowledge that this may affect the results. Concerning our methods, we note that the Time feature may introduce a flaw due to the difficulty of 51 correctly pinpointing a responding developer’s actual location. This is because developer teams may be distributed over various countries and time zones. 3.6.3 Developerresponses Similar to previous works [10, 51, 76], understating features of user reviews that drive developers to re- spond to the reviews can be done in at least three ways. One possible way is to interview and survey developers. Another way is to investigate developer responses through the release notes [10]. We in- vestigated an instance where developers provide a written response to a review. We encourage future researchers to cross-reference our findings through developer surveys and release notes. Moreover, response strategies may vary considerably from app to app. While some apps may only have a sole developer who struggles to respond to user reviews, others might have a dedicated customer service team that responds to most reviews regardless of content. We acknowledge that this may affect our conclusions. 3.6.4 Externalvalidity We acknowledge several limitations that may influence the generalizability of our findings. First, the number of our studied apps is limited and may not represent all apps in the iOS App Store. Thus, our data is subject to the App Sampling Problem [74]. However, we selected apps that vary along multiple dimensions such as category, owner, number of releases, age, and number of reviews and responses received. Second, our app selection was biased towards top-ranked free apps. It is possible that analyzing paid or unpopular apps may yield different results. Third, we only studied iOS apps. In other words, apps from the Apple App Store. Research has found that users have different reviewing cultures on different app stores [52]. Additionally, some features, such as the review title, are available on one app store but not on the other. Therefore, we limit our study by 52 focusing on only apps in one app store. 
In addition, the iOS App store separates reviews by their country of origin, while the Google Play Store separates reviews by language. Previous studies have suggested cultural and country differences exist in how users from different countries provide app reviews [48, 107]. In order to protect our study from such bias, we only studied US-based users’ reviews. As a result, our findings may not hold true for apps on other app stores, such as the Google Play Store. Fourth, we only experimented with one machine learning algorithm, the random forest algorithm. Therefore, we cannot claim that our results will apply to other algorithms. Hence, additional evaluation for different machine learning algorithms is needed. Further study is still necessary to investigate whether our findings are valid for apps and reviews that do not share the same characteristics as those in our study. For reproducibility, the dataset and script used in this study are available at [104] and the repository 11 . 3.7 RelatedWork 3.7.1 UserReview Pagano et al. [85] conducted a large-scale user reviews study to understand user reviewing behaviors (e.g., how, why, and when users submit reviews). Their dataset consisted of 1.1 million user reviews from the iOS App Store. They did not specify the country of origin of these reviews. They discovered that most users submitted reviews shortly after new app releases, that there was an association between highly downloaded apps and positive reviews, and that apps with many complaints were usually less downloaded apps. They also found that popular apps could receive thousands of reviews daily, emphasizing the need for automated tools to classify and prioritize reviews. By manually analyzing 1,100 reviews from the dataset, they found that only about one-third of user reviews were technically relevant and informative to mobile app developers. 11 https://github.com/Kamonphop/ESEM20-Replication 53 Khalid et al. [63] manually analyzed 6,390 user reviews of 20 free iOS apps. They did not specify the country of origin of these reviews. They identified 12 different issues that users complained about in user reviews and calculated the frequency and impact of each complaint type. They observed that users complained the most about functional errors and usually gave the issue a one-star rating. Palomba et al. [86] asked developers how much they incorporated user reviews into their workflow and whether addressing user reviews in subsequent releases benefitted users. The authors reported that 75% of interviewed developers stated that they have used and incorporated reviews for software maintenance and evolution. They can observe an increase in user satisfaction and app ratings when doing so. Harman et al. [50] found a strong correlation between the number of downloads for an app and its ratings. Similarly, Apple stated that user reviews and ratings are used as metrics to rank apps in the search results [9]. While it is clear from these works that user reviews contain a wealth of information and greatly impact an app’s success, popular apps can receive thousands of reviews a day. Reading such massive amounts of user reviews and extracting data from each one is impracticable for humans. Therefore, during the past years, many researchers have put much effort into creating models to classify and summarize reviews to support developers in various software engineering activities. However, some of these models use criteria based on intuitions or assumptions about developers’ actual behavior. 
For example, Gao et al. [40] only used rating and length to prioritize user reviews. Hence, our study uses a data-driven approach to understand what developers look for in user reviews that incite their responses. We refer to Martin et al. [75] and Dąbrowski et al. [27] for a comprehensive survey in this area. Chen et al. [22] proposed the AR-Miner tool, which uses a semi-supervised learning algorithm to classify reviews as informative and non-informative. Guzman and Maalej [47] used SentiStrength and topic modeling to extract sentiment of app features from user reviews. Similarly, Maalej and Nabil [72] introduced and evaluated several techniques (from SentiStrength) to classify reviews into four types: bug 54 reports, feature requests, user experience, and ratings. They also leveraged theSentiStrength tool to capture a sentiment in reviews. Panichella et al. [88] grouped 17 common topics in reviews identified by Pagano et al. [85] into four user review intentions that they claimed were relevant to developers. They then proposed the ARdoc tool, a user review classifier that integrates natural language processing, sentiment analysis, and text analysis. Using the tool from the previous work, Sorbo et al. [31] created the SURF tool that summarizes and classifies user reviews into common topic clusters, such as GUI, Pricing, and Company. Similarly, Palomba et al. [87] used the ARdoc tool for intention classification and a clustering algorithm to create the ChangeAdvisor tool, which not only can classify reviews but can also recommend changes to software artifacts. Villarroel et al. [116] developed aCLAP tool, an approach that uses a random forest algorithm to classify and uses a clustering algorithm to group user reviews based on the content in them. Gao et al. [40] utilized a topic modeling approach to detect emerging issues in reviews – issues that have been discussed heavily in one time slice but not in another. 3.7.2 DeveloperResponse Until now, few studies have started investigating the nature of the responses provided by developers on the app stores. One of the earlier works in this area is by McIlroy et al. [76]. They analyzed review and response pairs from 10,713 top apps on the Google Play Store. They found that for only 13.8% of these apps, developers responded to reviews. They also found that users are more likely to increase their ratings after receiving responses. By manually labeling a representative sample of user reviews and developer responses from these apps, they found ten common topics in developer responses. The results are later supported by Hassan et al. [51] who conducted a similar study. They analyzed user reviews from 2,328 top free-to-download apps on the Google Play Store. They also came to the same conclusion as the previous work: developers for most apps do not respond to reviews and that responding to reviews positively affects ratings. More importantly, they investigated how features of reviews such 55 as length and rating affect developer reactions and found that developers are more likely to respond to lower ratings and longer reviews. However, they did not attempt response prediction, and the list of study features and insights discovered were limited. While they focused their analysis on reviews of Android apps, we focused our analysis on reviews of iOS apps. Research has shown that users have different reviewing cultures on different app stores [52]. The objectives of their study also differ from ours. Savarimuthu et al. 
[96] also conducted a similar study from a social norm perspective. They studied 24,407 reviews and 2,668 responses from the top 20 apps on the Google Play Store. They identified 12 norms in these developer responses. They found, for example, that developers of 70% of the apps are aware of the greeting norm as their responses contain appropriate salutation (e.g., Hi) and of 65% of the apps are aware of the personalization norms as their responses contain the name of the reviewer. Additionally, they noticed that when users edited their reviews after getting a developer response, they tended to increase the rating. This result is consistent with the results from the previously mentioned studies. Our work is primarily motivated by the above studies. However, while these works have investigated developers’ response behavior to some extent, little effort has been put into studying the wide range of features, including those that are software engineering-oriented, of user reviews that may affect whether or not a review will get a developer response. Consequently, we took a step toward investigating this issue as we believe that our findings could reveal developers’ actual response behavior and guide research and the development of prioritization tools to be more in line with developers’ actual behavior on the App Store. 3.8 ChapterSummary In this chapter, we identify features of user reviews that drive developers to respond to the reviews. We extract seven different groups of features from user reviews: time, sentiment, rating, length, vote, intention, and writing style, and build models using the random forest algorithm to predict developer responses. 56 Through a case study of eight popular free-to-download apps on the iOS App Store, we show the efficacy of the random forest algorithm in predicting developer responses using the extracted features. We achieve an average AUC of 0.84, outperforming all baselines by large margins. The findings indicate a potential application of feature engineering and machine learning to help developers automatically identify reviews they are most likely to respond to and focus on. In addition, we determine the relative importance of each feature and group of features in predicting developer responses. We find that rating, review length, posted date and time, and the proportions of positive words are among the most important features in predicting developer responses. We also uncover that non-text features are more important than text features. More importantly, we observe some variation in the rankings of important features, suggesting that a one-size- fits-all prioritization tool or review ranking tool may not apply to all developers. Lastly, we provide insights into the relationships between the key features and developer responses. The findings of our work have several implications that could be of practical significance to a broad range of practitioners. In the next chapter, using a similar approach, we investigate features that predict the success of devel- oper responses. We consider a response to be successful if, after receiving it, a user increases his or her rating. We seek to derive a set of evidence-based guidelines that developers can follow to write effective responses and increase the chance of success of the responses. 57 Chapter4 HowDeveloperShouldRespondtoReviewsforSuccess The Google Play Store allows app developers to respond to user reviews. Existing research shows that response strategies vary considerably. 
In addition, while responding to reviews can lead to several types of favorable outcomes, not every response leads to success, which we define as increased user ratings. The work in this chapter has two objectives. The first is to investigate the potential to predict early whether a developer’s response to a review is likely to be successful. The second is to pinpoint how developers can increase the chance of their responses to achieve success. We track changes in user reviews and developer responses of the 1,600 top free apps over ten weeks and find that the rating increase after a response in 11,034 out of 228,274 one- to four-star reviews. We extract three groups of features, namely time, presentation, and tone, from the responses given to these reviews. We apply the extreme gradient boosting (XGBoost) algorithm to model the success of developer responses using these features. We employ model interpretation techniques to derive insights from the model. Our model can achieve an AUC of 0.69, thus demonstrating that feature engineering and machine learning have the potential to enable developers to estimate the probability of success of their responses at composition time. We learn from it that the ratio between the length of the review and response, the textual similarity between the review and response, and the timeliness and the politeness of the response have the highest predictive power for distinguishing successful and unsuccessful developer responses. 58 Based on our findings, we provide recommendations that developers can follow to increase the chance of success of their responses. Tools may also leverage our findings to support developers in writing more effective responses to reviews on the app store. This chapter is based on our work in [103]. 4.1 Introduction Since 2013, the Google Play Store has allowed developers to respond to reviews and establish two-way communications with their users. However, only a small fraction of reviews are read and garner a response from developers, and there is a considerable variation in response strategies [51, 96, 105]. Despite the sizable body of literature on mobile app reviews [75], there have been no findings in literature on how developers should respond to app reviews. Nonetheless, recent studies agree that developers are more likely to respond to reviews with negative ratings than those with positive ones. One reason mobile apps are particularly vulnerable to negative rating reviews is that the more positive ratings and reviews an app has, the higher it will rank in the search results, making it more visible and encouraging downloads [50, 70]. For this reason, reviews with negative ratings present great opportunities for developers to respond for potential ratings increase [51, 105]. Additionally, most research in this area has treated user reviews and developer responses as static when they can change over time. This inspired us to compare the attributes of a review before and immediately after a response to determine the effectiveness of the response. In this chapter, we are interested in two objectives. The first is investigating the potential to predict the probability of a successful response early (manifested in a higher rating). The second is to pinpoint how developers can increase that probability. To achieve our objectives, we track changes to user review ratings and developer responses of the 1,600 top free-to-download apps in the Google Play Store for ten weeks. 
We then identify the features of successful developer responses and explore how they differ from the unsuccessful ones. Specifically, we focus on features that developers can control when composing a response. We extract three groups of features, namely Time, Presentation, and Tone. We then utilize the extreme gradient boosting algorithm (XGBoost) to model the success of developer responses from these features. A ranking of features by their importance and a set of important features are extracted from the model. Partial dependence plots are generated for the important features to uncover how the probability of success of responses changes with these features. We believe that our findings could be of practical significance to a broad range of practitioners.

In this chapter, our primary contributions can be summarized as follows:
• We conduct an empirical investigation of features that predict the success of developer responses.
• We demonstrate that feature engineering and machine learning have the potential to enable developers to estimate the probability of success of their responses.
• We provide a set of evidence-based recommendations that developers can follow to write effective responses and increase the chance of success of the responses.

This chapter is structured as follows. Section 4.2 provides background information. Section 4.3 describes the research setup. Section 4.4 presents the experiment results. Section 4.5 explores the implications for practical use of the findings. Section 4.6 identifies limitations and threats to validity. Section 4.7 provides a summary of the related literature. Section 4.8 summarizes the chapter.

4.2 Background

4.2.1 Terms

While developers can respond to reviews indirectly, such as through the deletion of features [81], our focus is on how their written responses impact the behavior of individual users. Similar to the previous chapter, a developer response refers to a written response a developer posts in an app store in reply to a user review in the same app store. In reaction to the response, a user can do nothing. They can also change their rating or review, or both. We are particularly interested in a developer response after which a user increases his or her initial rating. We consider such a response successful. Figure 4.1 shows an example of a successful developer response.

Figure 4.1: An example of a successful developer response

4.3 Research Setup

The work in this chapter has two objectives. The first is to investigate the potential to predict early whether a developer's response is likely to be successful. The second is to pinpoint what features that developers can control influence the success of the responses, and how. Through our analysis, we seek to derive a set of recommendations that developers can follow to increase the probability of success of their responses. We can formulate this problem as a binary classification task as follows.

Task: Developer Response Success Prediction
• Given a response R and a user review U
• Predict whether the user who wrote U will increase his or her rating after receiving R.

4.3.1 Research Questions

We break the aforementioned high-level objectives into four concrete research questions.
• RQ1: How often do users change their original ratings with and without a developer response?
• RQ2: Which review intention has the highest chance of success?
• RQ3: Can we effectively predict which developer responses will be successful?
• RQ4: Which features play roles in predicting the success of developer responses, and how?
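In terms of data, the task boxed above reduces to a simple labeling rule over consecutive snapshots of a review, as tracked by the crawler described in the next subsection. The sketch below is a minimal illustration only; the ReviewSnapshot fields are hypothetical stand-ins for the actual crawl schema rather than the study's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewSnapshot:
    review_id: str
    rating: int         # 1-5 stars at this point in time
    has_response: bool  # whether a developer response is attached at this point

def label_success(before: ReviewSnapshot, after: ReviewSnapshot) -> Optional[int]:
    """Return 1 if the rating increased after the review received a developer response,
    0 if it did not, and None if the pair is out of scope (no response was given, or the
    review already had 5 stars and cannot be increased further)."""
    if not after.has_response:
        return None
    if before.rating == 5:
        return None
    return int(after.rating > before.rating)

# A 2-star review raised to 4 stars after a response is labeled successful (1).
print(label_success(ReviewSnapshot("r1", 2, False), ReviewSnapshot("r1", 4, True)))
```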
The research method we employed in this chapter is similar to the previous chapter (Chapter 3 - see Figure 3.2 for the overview of the method), which includes four primary steps: data collection, feature extraction, model construction and evaluation, and model interpretation. 4.3.2 DataCollection We based our app selection on the following three criteria. • Popularity: We focused on top free-to-download apps as we believe that developers would want to maintain the popularity of their apps by providing support, including responses, to their users. We did not consider paid apps because we wanted to avoid the influence of app prices on our results. 62 • Category: We focused on apps across all app categories in the Google Play Store to avoid selection bias and masking disparities that different app categories may exhibit. There are 32 categories for apps in the Google Play Store 1 . • Maturity: We focused on apps that had been in the app store for at least three months before our study period. This decision is to avoid an influx of reviews for newly released apps [85] and developers being more active than otherwise (e.g., since they would still be trying to establish their apps). We selected the top 50 apps in each category on July 2, 2020. Consequently, we collected data from 1600 top apps in the Google Play Store. The app store does not provide an easy way to collect the change histories of reviews, and such data is only available to the developers of apps. Therefore, to circumvent this and avoid overwhelming the app store with requests, for each app, we set up cron jobs (timed program execution) to run our crawler and crawl the store every hour. The crawler collects new data and detects changes to the already crawled data. The changes include a change to review contents, ratings, and developer responses. The crawler was programmed to act like a smartphone device and interface with the Google Play Store’s APIs. The crawler ran every hour for ten weeks, starting on October 4, 2020, and ending on December 13, 2020. Consequently, our dataset contains each app’s complete change history of user reviews during that period. We collected 9,352,326 reviews in total, of which 676,311 reviews (7.23%) had changed at least once. Among these reviews, the changes in 494,298 reviews include receiving a developer response (73.09%). 4.3.3 FeatureExtraction We purposely focused only on features that developers can control when composing a response because our ultimate goal is to derive a set of recommendations for writing effective responses that developers 1 https://support.google.com/googleplay/android-developer/answer/113475 63 can follow. Previous work on effective complaint handling, communications on public forums, and mobile app review analysis have pointed to several textual and non-textual features which we anticipated collec- tively distinguish successful from not successful responses. We extracted three categories of features from developer responses. The three categories are Time, Presentation, and Tone. Time Upon a developer’s response to their review, a user who wrote the review will receive an email and a push notification. Research shows that the timing aspects affect recipients’ responses to such notifications [121]. The work in the previous chapter has shown that the timing aspects also have contributed to the successes of users with their requests for responses from developers on the iOS store [105]. For this category, we derived features from the posting time of a response and a review. 
The times are denoted in Greenwich Mean Time (GMT). 1. Timeliness – The time difference between the posting time of the review and the response. 2. Day of Week – The posting day of the week of the response (e.g., Monday or Tuesday) and whether it is the same as that of the review. 3. WeekdayWeekend – Whether the posting of the response occurred on a weekday or a weekend, and whether the same applies to the review. 4. Time of Day — The posting time of day of the response and whether it is the same as that of the review. We partitioned the time into four slices: Morning [6-12), Afternoon [12-18), Evening [18- 24/0), and Night [24/0-6). 64 Presentation Research shows that several features of text, such as its readability and length, can influence its perceived value [65, 97]. The use of the user’s name in a text has important implications for organizational com- munication [117]. The presence of URLs and email addresses inform users that additional help resources are available. High similarity between the review and response may indicate that developers have thor- oughly read and understood the review. These features can collectively make responses less generic and more personal, which are important factors in effective complaints handling [29, 77]. We extracted the following text features from a response. 1. Length — The number of characters. 2. Length Ratio — The ratio between the number of characters of the review and that of the response. 3. Readability — The readability score obtained from the Flesch Reading Ease test [36]. 4. Reviewer’s Name – Whether the response includes the reviewer’s name. 5. URL – The count of links. 6. Email Address – The count of email addresses. 7. Phone Number – The count of phone numbers. 8. Number – The count of numbers, excluding phone numbers. 9. Capitalized Word – The percentage of capitalized words. 10. Non-AN Char – The percentage of non-alphanumeric characters, excluding white spaces. 11. Mention Rating – Whether the response includes words about star rating (e.g., stars, ratings). 12. Shannon’s Entropy – Shannon’s entropy of the response. 65 13. RR Similarity — The degree of similarity between the review and the response. We utilized a pre- trained Universal Sentence Encoder (USE) model [21] to generate dense vector representations for each review and response pair. The model has shown to be efficient and achieve state-of-the-art performance on the semantic textual similarity task. Although the model does not require pre- processing input texts, we removed URLs, emails, numbers, and phone numbers to reduce noise in the responses. We then used cosine similarity to measure the similarity between the vector repre- sentations of the pair. Tone The group of features that make up the tone of a developer’s response is perhaps the one that is most intuitively matched with the success of that response. When actual or perceived defects of an app have led to negative emotions on the user’s part, it is essential for the developer to mitigate them, so the user does not perceive the interaction as negative and learns this as a negative impression. In line with this, research shows that the manner in which a complaint is handled is crucial for an organization to generate a successful service recovery, noting that responses should be polite and empathetic [29, 77]. We extracted the following features from a response: 1. 
Sentiment — The sentiment polarity (positive, negative, or neutral), computed using SentiStrength [113], a publicly available tool for sentiment analysis.
2. Emotion — The six different emotions (anger, fear, joy, love, sadness, and surprise), obtained from the emotion module in the EmoTxt toolkit [18]. The module contains six binary classifiers, each of which detects a specific emotion in text.
3. Persistence — The number of times developers have edited their response before the current state.
4. Mood and Modality – The response's mood and modality, measured by the Pattern tool of De Smedt et al. [30].
5. Politeness — The politeness score, measured using a tool by Danescu et al. [28], and the count of 36 syntactic and social markers of politeness, extracted using the politeness R package by Yeomans et al. [123]. Examples of these markers include request, greeting, gratitude, and apology.

4.3.4 Model Construction and Evaluation

The two primary goals for building a machine learning model are (1) to investigate how well we can determine the success of a developer response from its features and (2) to assess which features play roles in determining the success of the response and how they do. Here we formalize our task as a binary classification problem, i.e., given a developer response, predict whether a user will increase their rating after the response.

To achieve our goals, we experimented with the XGBoost (eXtreme Gradient Boosting) model [23], a widely used tree-based ensemble machine learning model for this type of task [82]. The model is based on gradient tree boosting: it trains many decision trees iteratively, where each new tree is built to reduce the errors of the previous trees. The model can handle heterogeneous feature types, does not require feature scaling, and has been shown to work well on imbalanced datasets [78]. More importantly, it has several measures to combat over-fitting, mainly accomplished through hyperparameter optimization. For these reasons, we chose to experiment with XGBoost. To build the model, we used the XGBoost Python package [23]. The dataset was split into 90% training and 10% testing (hold-out) sets using stratified sampling. We used the training set to tune the model's hyperparameters.

Hyperparameter Optimization

Hyperparameters play a critical role in a model's performance [92, 111]. We therefore constructed a hyperparameter search space for the model to sample from. The ranges of values used for the different hyperparameters in the search space are as follows:
• max_depth — 6 to 12 with a step size of 1.
• learning_rate — drawn uniformly between 0.01 and 0.1.
• gamma — drawn uniformly between 0 and 0.5.
• min_child_weight — drawn uniformly between 1 and 15.
• subsample — drawn uniformly between 0.1 and 1.0.
• colsample_bytree — drawn uniformly between 0.1 and 1.0.
• n_estimators — always set to 10000.
• early_stopping_rounds — always set to 50.
We refer to the official documentation of XGBoost (https://xgboost.readthedocs.io/en/latest/parameter.html) for the description of each parameter. As it is too computationally expensive to try every hyperparameter setting, we adopted a Bayesian optimization approach [14] to try out a total of 100 settings from the search space with the goal of maximizing the model's performance. We applied stratified 10-fold cross-validation for each setting to ensure performance stability. We took the hyperparameter setting from the best-performing model to train the final model on the entire training set.
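As a rough illustration of this tuning loop, the sketch below pairs the XGBoost package's cross-validation routine with the hyperopt library for the Bayesian (TPE) search. The choice of hyperopt and the synthetic X_train/y_train are assumptions made for the sake of a runnable example; this is not the exact script used in the study.

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Synthetic stand-in data; in the study these would be the extracted response features and labels.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 20))
y_train = rng.integers(0, 2, 1000)
dtrain = xgb.DMatrix(X_train, label=y_train)

space = {
    "max_depth": hp.quniform("max_depth", 6, 12, 1),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.1),
    "gamma": hp.uniform("gamma", 0.0, 0.5),
    "min_child_weight": hp.uniform("min_child_weight", 1, 15),
    "subsample": hp.uniform("subsample", 0.1, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.1, 1.0),
}

def objective(params):
    params = dict(params, max_depth=int(params["max_depth"]),
                  objective="binary:logistic", eval_metric="auc")
    # Stratified 10-fold cross-validation, stopping each setting after 50 rounds without improvement.
    cv = xgb.cv(params, dtrain, num_boost_round=10000, nfold=10, stratified=True,
                metrics="auc", early_stopping_rounds=50, seed=42)
    return {"loss": -cv["test-auc-mean"].max(), "status": STATUS_OK}

# Try 100 settings from the search space, maximizing cross-validated AUC.
best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=Trials())
print(best)
```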
We then evaluated the final model on the hold-out set to assess its performance on unseen new data.

Performance Measure

We selected the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) as the performance measure. The ROC curve plots the false positive rate (false alarm) against the true positive rate (recall) at different classification thresholds. The area under this curve (AUC) provides a scalar performance measure regardless of the classification threshold. AUC values lie between 0.50 and 1.00. A classifier that predicts a random or a constant class has an AUC of 0.50, while a perfect classifier has an AUC of 1.00. Generally, the higher the AUC, the better the model is at distinguishing between classes.

4.3.5 Model Interpretation

To derive insights from the model, we adopted two widely used model-agnostic interpretation methods: permutation feature importance and the partial dependence plot.

Permutation Feature Importance

To determine which features are predictive of success, we adopted a method called permutation feature importance [17]. The fundamental intuition is that a feature is important if permuting or shuffling its values drops the classification performance. We refer to this drop in performance as the Importance. Hence, the greater the drop, the more important the feature is. The method has several advantages. For example, it takes feature interactions into account, can be used on a single feature or a group of features (by permuting features in the same group together), and does not require retraining the model. More importantly, the results are intuitive and easy to interpret. Nonetheless, two pitfalls need to be addressed to use the method properly. First, in the presence of highly correlated features, the method can decrease the importance of such features [110]. One way to handle this is to identify and discard correlated features before training the model. To do so, we generated a correlation matrix of features based on Spearman's correlation. We used a correlation (|ρ|) threshold of 0.70 to mark a pair of features as highly correlated. For each pair, we kept the feature with the highest mutual information [66] and discarded the other. Second, due to the stochastic nature of the method, different permutations of the same feature can give slightly different importance estimates. To obtain a more reliable estimate for each feature, we ran the method 20 times using different permutations. The mean and the standard deviation across these runs are reported as the importance estimate of a given feature.

Partial Dependence Plot

While permutation feature importance reveals which features are predictive of success, it does not determine how these important features affect the predictions. We turned to a partial dependence plot (PDP) to answer this. A PDP is a visualization method that plots the change in the average prediction probability for a class as an independent variable varies over its marginal distribution [38]. More specifically, a PDP shows how the average prediction probability that a developer response will be successful varies across a range of possible values of a feature of interest while keeping other features constant. Thus, insights from PDPs can shed light on how developers can increase the chance of success of their responses.

4.4 Results

4.4.1 RQ1: How often do users change their original ratings with and without a developer response?
First, it is important to make explicit how worthwhile responding to reviews is according to our dataset. Answering this question correctly required us to filter out the developer responses given to reviews, as well as the reviews without a response, that were posted within the 24 days prior to the final data collection date. This step was necessary to avoid introducing bias because most users may not have had enough time to change their ratings before that date. We specifically chose 24 days because we observed that 90% of users changed their ratings within 24 days after a response.

After filtering the data this way, the rating changed after a response in 13,647 out of 320,274 reviews (4.26%). Without a response, the changes in ratings happened in 68,382 out of 5,843,802 reviews (1.17%). Tables 4.1A and 4.1B break down the changes in ratings with and without a response. The diagonal values indicate that the ratings remained unchanged. The values above and below the diagonal indicate that the ratings increased and decreased, respectively. In cases where a user changed their ratings multiple times after a response, we only considered the earliest change because it is more likely to have been caused by the response than the later ones. Similarly, when multiple rounds of a developer response followed by a rating change occurred, we only considered the earliest instance because the earlier rounds may confound the changes in later rounds.

For our comparison of rating increases and decreases, we only considered those cases in which increases or decreases are possible and consequently excluded 1-star ratings for decreases and 5-star ratings for increases, causing us to start from a different base for each respective comparison. We found that the rating increased after a response in 11,034 out of 228,274 1-4 star reviews (4.83%). On the other hand, the rating increased without a response in 32,231 out of 1,914,154 such reviews (1.68%). This indicates that users who left a 1-4 star rating review are 2.9 times as likely to increase their ratings after a response as without one. In the case of rating decreases, we found that this happened in 2,613 out of 191,934 reviews with a 2-5 star rating (1.36%). On the other hand, the rating decreased without a response in 36,151 out of 4,991,131 such reviews (0.72%). This indicates that users who left a 2-5 star rating review are 1.9 times as likely to decrease their ratings after a developer response as without one.

Table 4.1: The changes in ratings with and without a response

(A) Reviews with a developer response (rows: rating before; columns: rating after)
Rating before        1        2        3        4        5
1              122,211      686    1,109    1,336    2,998
2                1,091   30,487      448      569      834
3                  391      378   34,333      769    1,220
4                   95       59      137   28,058    1,072
5                  170       57      100      135   91,538

(B) Reviews without a developer response (rows: start rating; columns: end rating)
Start rating         1        2        3        4        5
1              835,389    1,885    2,150    2,421   10,826
2                5,586  185,131    1,060      993    2,406
3                3,394    2,125  285,186    1,808    3,175
4                1,984    1,121    1,895  560,112    5,507
5                9,963    2,360    3,203    4,520  3,909,602

The findings suggest that responding to reviews is worthwhile and can have a favorable outcome, although not every response will increase user ratings. Our results confirm previous studies [51, 76], which found that responding to reviews is beneficial, although we did not find the benefit to be as high. Our next step is to employ a modeling technique to identify features within a response that influence users to increase their ratings.
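The breakdown in Table 4.1 and the increase rates above can be reproduced from the tracked snapshots with a few lines of pandas. The sketch below uses a small synthetic DataFrame whose column names (rating_before, rating_after, has_response) are illustrative rather than the study's actual schema.

```python
import pandas as pd

# Synthetic stand-in for the tracked review snapshots.
df = pd.DataFrame({
    "rating_before": [1, 2, 3, 4, 5, 1, 2, 3],
    "rating_after":  [4, 2, 3, 5, 5, 1, 3, 3],
    "has_response":  [True, True, True, True, False, False, False, False],
})

for responded, group in df.groupby("has_response"):
    label = "with" if responded else "without"
    # Transition matrix analogous to Table 4.1 (rows: rating before, columns: rating after).
    matrix = pd.crosstab(group["rating_before"], group["rating_after"])
    print(f"Rating transitions {label} a developer response:\n{matrix}\n")

    # Increase rate, excluding 5-star reviews (they cannot be increased further).
    eligible = group[group["rating_before"] < 5]
    increased = (eligible["rating_after"] > eligible["rating_before"]).mean()
    print(f"Share of 1-4 star reviews whose rating increased ({label} response): {increased:.2%}\n")
```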
Consequently, the dataset for the later parts ended up containing 228,274 developer responses given to 1-4 star reviews, 11,034 of which were successful.

4.4.2 RQ2: Which review intention has the highest chance of success?

Figure 4.2 shows the probability of success for each review intention. Note that we used SURF [31] to extract intentions from each review. We observed that responding to reviews with information seeking or problem discovery intent has the highest chance of success, while responding to information giving has the lowest chance of success. Additionally, responding to reviews with these intentions increases the chance by 30% versus responding to reviews with information giving intent.

Figure 4.2: The probability of success for each review intention

4.4.3 RQ3: Can we effectively predict which developer responses will be successful?

Figure 4.3 shows the ROC curve of our model (the green line) and the ROC curve of a classifier that predicts a random or a constant class, i.e., a baseline classifier (the dashed red line). The hyperparameter setting we used for the model is as follows: learning_rate = 0.02, gamma = 0.27, max_depth = 11, min_child_weight = 9.0, subsample = 0.90, and colsample_bytree = 0.30. The area between the two curves measures how well our model performs over a baseline classifier. The model is able to achieve an AUC of 0.69. Although the model is far from perfect (AUC = 1.00), it outperforms a baseline classifier (AUC = 0.50) by a substantial margin. This indicates the feasibility of predicting early the responses that are more likely to be successful using only features that developers can control during response composition. It also suggests that the selected features are associated with the success of responses. Hence, we demonstrated that feature engineering and machine learning have the potential to enable developers to estimate the probability of success of their responses at composition time.

Figure 4.3: The ROC curves of the model and the baseline

It is important to note that due to the imbalanced nature of the dataset, one may apply a technique to balance the dataset, such as random under-sampling. However, when the purpose of building a model is to derive insights and understanding, one should avoid using such a technique [112]. We note that the model's performance is not as high. This could be because users may change or keep their reviews for many reasons that are not necessarily reactions to the responses they received. For example, they may simply forget to update the ratings.

4.4.4 RQ4: Which features play roles in predicting the success of developer responses, and how?

To answer this question, we apply the two model-agnostic interpretation methods described in Section 4.3.5 on the hold-out set to derive insights from the model regarding which features contribute to the generalization power of the model and how. Table 4.2 shows the importance of the three categories of features: Time, Presentation, and Tone. We observed that the Presentation category is the most important in predicting the success of developer responses, followed by the Tone category and then the Time category.

Table 4.2: The importance of the three categories of features
Feature         Importance
Presentation    0.1004 ± 0.0072
Tone            0.0802 ± 0.0046
Time            0.0183 ± 0.0034

The Presentation category: Table 4.3 shows the importance value of each feature in the Presentation category.
Our analysis shows that the ratio between the number of characters of the review and that of the response is one of the most dominant features that distinguish successful from unsuccessful responses. When generating a PDP for the feature (Fig 4.4a), we observed that responses that are shorter than reviews are more likely to be successful. However, this could also mean that users who posted a longer review are more likely to increase their ratings after a response than those who posted a shorter review. This is why the length feature, though important for prediction, ranks lower than the length ratio feature.

Our results show that the textual similarity between the review and response pairs helps distinguish successful developer responses. The PDP of the feature (Fig 4.4b) depicts that the probability of success increases as the similarity between the review and response pair increases. This finding is congruent with research that says that less generic and more personal responses are critical to generating a successful service recovery, noting that high similarity between the complaint and response demonstrates a grasp of the customer's situation [46, 77].

Another important predictor is the number of email addresses included in the response. However, we observed that responses that do not include an email address are more likely to be successful. This may indicate that users want to have their issues resolved on the app store and do not want to contact developers through additional means of communication.

The next important feature for prediction in this category worth noting is Mention Rating. We observed that responses are more likely to be successful if developers mention star ratings in the response. Below are actual examples of such a response:
• "...Is there something else that we could do to deserve a better rating from you? ..."
• "...We hope you'll reconsider the rating once we fix the issue. ..."
Interestingly, whether developers include the user's name, numbers, phone numbers, or URLs in the response does not strongly influence developer responses' success, as these features are among the least important features in this group. Although of low importance, we found that responses containing URLs and phone numbers have a slightly lower probability of success.

Table 4.3: The importance of features in the Presentation category
Feature             Importance
Length Ratio        0.0266 ± 0.0023
RR Similarity       0.0163 ± 0.0020
Email               0.0051 ± 0.0011
Mention Rating      0.0032 ± 0.0010
Length              0.0031 ± 0.0013
Non-AN Char         0.0023 ± 0.0013
Readability         0.0018 ± 0.0011
Shannon's Entropy   0.0017 ± 0.0017
Capitalized Word    0.0012 ± 0.0010
URL                 0.0012 ± 0.0005
Reviewer's Name     0.0004 ± 0.0004
Number              0.0003 ± 0.0011
Phone               0.0000 ± 0.0001

(a) Length Ratio  (b) RR Similarity
Figure 4.4: PDPs of the top two features in the Presentation category

The Tone category: Table 4.4 lists the importance of each feature in the Tone category. It shows that politeness is one of the most important features in distinguishing successful from unsuccessful responses. Figure 4.5a shows the PDP of the politeness score, as measured by the tool of Danescu et al. [28]. It shows that the more polite the responses are, the more likely they are to be successful. Our results are congruent with research that shows that the manner in which a complaint is handled is crucial for an organization to generate a successful service recovery, noting that responses should be polite and empathetic (part of the politeness markers) [29, 77].
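For readers who wish to reproduce this kind of analysis, importance estimates and PDPs of the sort reported here could be approximated with scikit-learn's model-agnostic inspection utilities. The sketch below is illustrative only: the model, data, and feature names are synthetic placeholders rather than the study's actual artifacts, and grouped importances would additionally require shuffling all columns of a group together.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Synthetic stand-in data; "length_ratio", "politeness", and "timeliness_hours" are illustrative names.
rng = np.random.default_rng(42)
X = pd.DataFrame({"length_ratio": rng.uniform(0, 5, 1000),
                  "politeness": rng.uniform(0, 1, 1000),
                  "timeliness_hours": rng.uniform(0, 240, 1000)})
y = (rng.uniform(size=1000) < 0.05 + 0.1 * X["politeness"]).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# Permutation importance on held-out data, repeated 20 times and scored by AUC.
result = permutation_importance(model, X, y, scoring="roc_auc", n_repeats=20, random_state=0)
for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} ± {std:.4f}")

# PDP of the predicted success probability for one feature of interest.
PartialDependenceDisplay.from_estimator(model, X, features=["length_ratio"])
plt.show()
```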
Regarding the 36 syntactic and social markers of politeness, we did not find them to be important when we observed each in isolation. However, when combined, these markers did interact and were found to be important for the prediction. Among these 36 markers, the use of plural first-person pronouns (e.g., we, us, our), showing gratitude (e.g., "thanks for reaching out"), the absence of negation (e.g., never, not), the use of "please," and the use of indirect requests (e.g., "could you") are the top 5 markers for the prediction.

Our results show that persistence enhances responses' success. More specifically, the more times developers edit their responses, the more likely the responses are to be successful. As shown in Fig 4.5b, keeping all other features constant, the probability of success of responses increases as the number of edits increases. Interestingly, we found that sentiment and the group of features for emotions (anger, fear, joy, love, sadness, and surprise) as a whole did not contribute to the prediction.

Table 4.4: The importance of features in the Tone category
Feature             Importance
Politeness          0.0365 ± 0.0032
Persistence         0.0118 ± 0.0015
Mood and Modality   0.0052 ± 0.0026
Sentiment           0.0009 ± 0.0006
Emotion             0.0005 ± 0.0009

(a) Politeness Score  (b) Persistence
Figure 4.5: PDPs of the top two features in the Tone category

The Time category: Table 4.5 shows the importance of each feature in the Time category. Timeliness has been shown to be the most important feature in the Time category. We found that responses posted too quickly or too slowly are less likely to succeed (Fig 4.6a). We also found that responses posted during some time windows are more successful than others. We observed that responses posted on Friday, Saturday, and Sunday, and in the evening and night (GMT), which correspond to the morning and afternoon in US Central Time, are more likely to be successful than responses posted on the other days of the week and times of day (Fig 4.6b). Regarding whether developers should respond to reviews at the same time of day, day of the week, or weekday/weekend as the reviews, our results indicate that this does not affect the success of the responses.

Table 4.5: The importance of features in the Time category
Feature             Importance
Timeliness          0.0129 ± 0.0021
DayOfWeek           0.0019 ± 0.0012
TimeOfDay           0.0017 ± 0.0011
Weekday Weekend     0.0000 ± 0.0005

(a) Timeliness  (b) TimeOfDay vs. DayOfWeek
Figure 4.6: PDPs of the top features in the Time category

4.5 Implications

This section outlines the most promising prospects and practical implications of our work.

4.5.1 Evidence-Based Recommendations

We believe that our findings are of interest as we have examined the predictive power of a wide range of features that developers can control in terms of distinguishing successful from unsuccessful developer responses. Therefore, based on our findings, in responding to reviews with a one- to four-star rating, we recommend that developers do the following to increase the probability of success of their responses:
• Include a paraphrase of the review. This demonstrates genuine attention by restating the users' concerns. The response will be less generic and more personal.
• Be polite and include different syntactic and social markers of politeness (e.g., show gratitude, use indirect requests). Polite responses can ease customers' anger and dissatisfaction.
• Write a response that is long but, at the same time, shorter than the review.
• Try to resolve the issues within the app store and refrain from asking users to contact through an additional communication channel. • Include a sentence mentioning the star rating. This is to remind users about their ratings or suggest they reconsider them. • Post in a timely fashion, but not too soon or too late. Responses that come in too fast may indicate that developers have not thoroughly read and understood the review. As a result, such responses may be viewed as automatic and not be as valuable as those that come later. However, at the same time, 80 delayed responses may indicate a lack of concern for the users. Still, legitimately urgent requests need to be processed with urgency. • Utilize the more successful posting time windows. In other words, responses should be posted in the evening and the night (GMT), which correspond to the morning and afternoon in US Central Time, on Fridays, Saturdays, and Sundays. • Be persistent by following up with the users if they have not yet reacted to the responses, for example, by editing the responses. This increases the chance of users to react to the response. • Prioritize responding to reviews with information seeking or problem discovery intention. 4.5.2 ResponseWritingAssistant By modeling the success of responses, we can integrate machine intelligence into a developers’ app man- agement portal (e.g., the Google Play Console) to provide value for developers. Figure 4.7 shows a user interface (UI) prototype of a response writing assistant tool. In the scenarios we envisage, the tool is not intended to write or automatically generate a response for developers but rather to assist developers when they write a response. More specifically, while developers compose their response to a review, the tool 1) collects metadata, 2) scans the response body for the presence and counts of important features, 3) evaluates the probability of success of the response in its current state, and 4) makes recommendations for developers to increase that probability. For example, in Figure 4.7, the tool has determined that the response is unlikely to be successful and shows that the response’s similarity level is not in the optimal spot. It then recommends that the developers modify their response to address the user’s concern(s) more. It is easy to see that, indirectly through its recommendations, such a tool has the potential to increase the quality and the effectiveness of developer responses on the app store. Hence, better response strategies could also lead to improving user-developer communication efficiency. 81 Additionally, we investigated whether we could build a lightweight model using only the top features identified in RQ4. For this, we repeated the entire model construction process and found that the final model achieved a lower performance than what we could achieve using all features. This indicates that the interaction of multiple important and unimportant features reflects the success of developer responses. Hence, in practice, we recommend practitioners consider all features when building a prediction model for a response writing assistant tool to achieve the best performance. Figure 4.7: A UI prototype of a response writing assistant tool 4.6 LimitationsandThreatstoValidity The study in this chapter has several factors that may affect the validity of the results and conclusions. In this section, we identify several threats to validity and limitations of our study. 
An investigation of features influencing the success of developer responses can be conducted in at least one other way, that is, through a user survey. However, this may require the participants to have 82 received at least one developer response for the survey to yield reliable findings. Due to the limited time and accessibility to such a population of users, we did not pursue this route. Furthermore, we consider a response to be successful if a user increases their rating after receiving it. This definition of success may be too strict for other researchers or app developers. For example, the success of a response could be measured by the change in the sentiment of a review (from negative to positive sentiment). It could even be argued that editing a review while keeping the rating and sentiment unchanged constitutes some success because the response intensifies the user’s engagement. Additionally, users may or may not change their reviews for many reasons that do not necessarily react to the responses they received. For instance, users may simply forget to update their ratings or decrease them because they find new issues with the app and become more dissatisfied. We acknowledge that this may affect the conclusions of our study and encourage future work to expand the definition of success and triangulate our findings through user surveys or interviews. While the tools we employed to extract the tone and text similarity features have been shown to be effective in literature, it is possible that the intrinsic inaccuracies in them can affect our results. We also experimented with sentiment analysis tools built specifically for the software engineering domain, such as SentistrengthSE [57], and observed a comparable performance for our model. Next, the method we used to extract each presentation feature may not be entirely accurate due to the informal and unstructured nature of the text data. Furthermore, we cannot claim that our set of features is exhaustive as we might have overlooked other important internal and external features to developers that may play a role in the success of responses. These features include the starting star rating, the type or intention of review, or the number of times the user has edited their review before receiving a response. Note that this work only focuses on features that are internal to developers (i.e., features that developers can control when composing a response). 83 In line with our objectives, we purposely focused only on features that developers can control because our goal was to derive a set of recommendations for writing a successful response. We note that evidence in our analysis has led us to believe that incorporating more features from the reviews may further help improve the model’s performance. Lastly, regarding the external validity, it is still unclear whether our findings hold true for apps with different characteristics (e.g., other top apps, paid apps, unpopular apps, or apps from other app stores) or for a study in a different time frame. As a result, our study is subject to the app sampling problem [74]. Furthermore, we only experimented with one machine learning model. Different models can lead to different results [111]; thus, we cannot claim that our findings will hold true for other modeling algorithms. 4.7 RelatedWork This section summarizes related work dealing with user reviews and developer responses on the app stores. 
4.7.1 AnalysisofAppUserReviews App reviews contain a rich source of information [85] that can be used for various software engineering activities [3]. For instance, in maintenance and requirement engineering, app reviews help developers elicit new requirements and identify unexpected app behaviors [40, 55, 63, 72]. In addition, research has also pointed out that users’ perceptions of the same apps can vary across different platforms [2] and countries [107], and cultures [48]. Nonetheless, apps can receive thousands of reviews in a day. Manually reading and extracting infor- mation from each review is impracticable. Hence, a sizeable body of literature has been devoted to finding ways to extract useful information from user reviews automatically. For example, Iacob and Harrison [55] proposed MARA, a framework for retrieving app feature requests based on linguistic rules. Gao et al. [40] adapted the topic modeling technique to detect emerging issues across different time slices. Maalej 84 and Nabil [72] evaluated several techniques to classify reviews. Srisopha et al. [106] detailed a method to identifying country-specific app feature requests. Sorbo et al. [31] proposed a method to categorize reviews into different intentions and topics. Palomba et al. [87] proposed a tool that recommends changes to software artifacts. Different from these existing analysis of review research, our work in this chapter contribute toward improving the effectiveness of the communications between users and developers on the app store instead of focusing only on extracting information from user reviews. 4.7.2 AnalysisofDeveloperResponses Lim et al. [70] have shown that reviews and ratings are among the most important factors people consider when choosing apps to download. Additionally, the more positive ratings and reviews an app have, the higher it will rank in search results and the more visible it is to potential users. Research shows that responding to reviews positively affects app ratings. McIlroy et al. [76] analyzed review and response pairs from 10,713 top apps on the Google Play Store. They found evidence that responding to reviews positively impacts user ratings. Hassan et al. [51] further studied the topic and found the chance of ratings increase with a developer response to be six times as high as without. While we did not observe a similar likelihood, we argue that our findings provide a more realistic estimate because we avoided bias by filtering out reviews with a 5-star rating (as users cannot increase their rating further than five stars) and reviews that were posted towards the end of our data collection period. Due to the large number of user reviews an app can receive, many studies have also started investi- gating ways to generate developer responses automatically. Gao et al. [41] proposed RRGen, an attention- based sequence-to-sequence neural model to generate responses. RRGen has been shown to create sat- isfactory responses to reviews that are common among many apps but did not create responses that are 85 app-specific well. Farooq et al. [35] attempted to solve this problem. They proposed AARSYNTH, a method that augmented the similar technique used in RRGen with app-specific information, such as app descrip- tion. However, both proposed methods still have the shortcoming of not being as app-specific, fluent, and correct in addressing users’ concerns as manually written responses. 
Nonetheless, these studies could use insights from this chapter to generate a response with an increased chance of success.

Despite the sizable and growing body of literature on the analysis of app reviews and developer responses, there have been no findings on how developers should respond to app reviews. Hence, we take a step toward filling this gap in research, as we believe that examining the predictive power of a wide range of features within a response in terms of distinguishing successful from unsuccessful responses carries practical and valuable implications.

4.8 Chapter Summary

While responding to reviews can have a favorable outcome, not every response leads to the increased user ratings we define as success. By focusing on the state of a review before and immediately after a response, we can determine whether the response is successful. This chapter has two objectives. The first is to investigate the potential to predict early the success of developer responses. The second is to pinpoint which features in a response influence its success and how. Three categories of features are considered: Time, Presentation, and Tone. The XGBoost algorithm is adopted to model the success of developer responses based on these features. We achieve an AUC of 0.69, outperforming a baseline classifier. The use of two model-agnostic interpretation techniques furthers the understanding of which features influence the success of developer responses and how. We believe the findings in this chapter could be of practical significance to a broad range of stakeholders.

In the next chapter, we analyze user reviews of the same apps between the US and other countries. The study could help developers better understand their users and give them more perspectives and insights, which may lead to changes in requirements or new requirements.

Chapter 5
User Reviews of the Same Apps Between the US and Other Countries

Mobile app stores are available in over 150 countries, allowing users from all over the world to leave public reviews of downloaded apps. Prior work on mobile app reviews has demonstrated that user reviews contain a wealth of information and are seen as a potential source of requirements. However, most of the studies in this area have mainly focused on mining and analyzing user reviews from the US App Store, leaving reviews of users from other countries unexplored. This chapter has two aims. First, we seek to understand, through analyzing user reviews, whether the perception of the same apps differs between users from other countries and users from the US. Second, we seek to investigate whether it is possible to automatically identify distinctive, country-specific needs of users between any pair of countries through a natural language processing (NLP) approach.

We retrieve 300,643 user reviews of the 15 most popular iOS apps of 2018, a list published directly by Apple, from nine English-speaking countries over five months. We manually classify 3,358 reviews into 13 software quality and improvement factors. We leverage a random forest-based algorithm to identify factors that can be used to differentiate reviews between the US and other countries. Additionally, we present a preliminary NLP-based approach to automatically extract differences in reviews from any pair of countries and discuss some of the challenges in using NLP for this task. This chapter is based on our works in [106] and [107].
88 5.1 Introduction The rapid growth in smartphone users worldwide has prompted developers to make their apps more ac- cessible to global audiences and markets. Developers use mobile app distribution platforms such as Google Play Store and the Apple App Store, which are available in over 150 countries [7, 44], to reach these global users and markets. Besides allowing users from all over the world to download apps conveniently, the plat- forms also allow users to share their experience of using the downloaded apps through writing reviews and ratings. Reviews help other users decide which app to download from many similar apps on the market. They also serve as a channel for users to communicate with the developers directly [3, 51, 85]. Prior works (e.g., [39, 45, 85, 89]) have shown that these reviews contain a wealth of information and are seen as a potential source of requirements that can help the developers improve the app’s quality and better meet users’ needs and expectations. In recent years, a trend in software engineering has emerged to formally study how users provide reviews on the app stores [85] and how reviews could be utilized to improve the quality of software [89]. However, one fundamental question related to the study of user reviews, which has not received much scholarly attention, is whether the users’ perception of the software quality of the same apps in different countries differs. Unfortunately, researchers in this area mainly focused on analyzing user reviews from the Google Play Store (which separates reviews by language and not by the country of origin of the users), the App Stores of unspecified countries, or only the US App Store. Nonetheless, prior studies in the area of mobile app user review [48] and many other areas of research [62, 70, 94, 118] have hinted that country differences may exist in user reviews. Without recognizing that users from different countries may have different needs and expectations, mobile app developers may find it challenging to replicate the same success from one country to another [70]. In other words, app success in international markets may hinge on whether developers can identify 89 such differences. Meeting users’ needs and expectations for local and global users has been beneficial. As a specific example, Evernote, a popular note-taking app, noticed a spike in its user base in Japan [8]. This was interesting to its parent company for several reasons. One, the company did not expend any marketing efforts toward Japanese users, and two, Japanese users were using the app in English. However, Japan was categorized as having low proficiency in English [32]. Evernote then flew a team to Japan to study its users. They uncovered that Japanese users valued different features than US-based users. Once they highlighted and updated their apps to reflect those features, they noticed a significant increase in the subscription conversion rate from Japanese users. Suppose developers want to learn more about users’ different needs and expectations (if at all) in every country where the app is available. In that case, this manual information gathering process becomes impracticable as it requires considerable resources, can take much time to gather, and does not scale well. We hypothesize that app reviews from other countries are also a source of information that app de- velopers can leverage to identify country-specific expectations and to improve existing products for the distinctive needs of users in that country. 
It may also be possible to gather information about what bothers the users and makes them likely to give poor reviews specific to that country. Recent work has shown that the characteristics of reviews, such as rating, sentiment, and length, significantly differ at the country-level [48]. However, to that extent, a more fine-grained analysis in the review content focusing on the software quality aspects (not just bug reports or feature requests) has not yet been performed. Additionally, other aspects of reviews from other countries (e.g., app-level and version-level) have not yet been investigated in greater detail. In this chapter, we retrieve 300,643 user reviews of the 15 most popular iOS apps of 2018 from nine English-speaking and culturally diverse countries worldwide. These countries are the United States of America, the United Kingdom, Canada, Australia, Singapore, the Philippines, South Africa, India, and Malaysia. The reviews are written from September 1st, 2018, to January 31st, 2019 (5 months). We then 90 manually classify 3,358 reviews written in English into different software quality and improvement factors. We utilize a random forest algorithm to identify factors that can be used to differentiate reviews between the US and other countries. Additionally, we present a preliminary NLP-based approach that enables app developers to compare reviews of users from any pair of countries. We then discuss key challenges in using NLP for this task. Our primary contributions in this chapter can be summarized as follows: • We conduct an empirical investigation to whether the users’ perception of the same apps differs between the US and other countries. Unlike previous work, we focus our effort on the software quality aspects of apps. • We manually classify 3,358 reviews into different software quality and improvement factors. Using statistical analysis and machine learning, we identify factors that are essential or discriminant in classifying the reviews of the US and other countries. • We present a preliminary NLP approach that can automatically extract differences in reviews from any pair of countries and discuss key challenges in using NLP for this task. The rest of the chapter is organized as follows. Section 5.2 describes the research setup. Section 5.3 presents the experiment results. Section 5.4 introduces challenges. Section 5.5 identifies limitations and threats to validity. Section 5.6 provides a summary of the related literature. Section 5.7 summarizes the chapter. 5.2 StudySetup In this section, we outline our research questions and the overall methodology of our empirical study. 91 5.2.1 ResearchQuestions We define the following research questions using the US App Store user reviews as a baseline. • RQ1: DothecontentsofreviewsfromtheUSdifferfromtheothercountriesregardingthe softwarequalityandimprovementfactors? • RQ2: What factors are discriminant in classifying the reviews of the US and the other countries? • RQ3: HoweffectiveistheNLPapproachinidentifyingthespecificneedsandprioritiesof usersfromapairofcountries? 5.2.2 DataCollection We decided to investigate reviews from a similar set of countries that Guzman et al. [48] had in their work since they already verified that their selected countries are culturally diverse according to Hofstede’s model [56]. However, due to the limited number of reviews from Hong Kong for more than one app on our list, we decided to exclude Hong Kong. 
Instead, we included Malaysia and the Philippines to balance the number of western and eastern countries. All countries in our study use English as a first or second language and are culturally diverse [56]. The list of countries is shown in Table 5.1. We used the list of the top 20 most popular apps of 2018 [5] published by Apple as a starting point. We then excluded apps that do not have services in multiple countries, such as Amazon, and apps that do not have a sufficient number of reviews ( < 10) in more than three countries, such as Bitmoji and Tiktok. Consequently, we selected 15 apps from the list. We collected reviews that were written in the period from September 1st, 2018, to January 31st, 2019 (5 months) for all apps and all countries through a web crawling tool we developed. The crawler simulates 92 Table 5.1: The overview of the dataset App Domain US 1 AU 2 CA 3 UK 4 IN 5 MA 6 PH 7 SA 8 SG 9 Total Facebook Social Networking 16929 1296 1585 2622 1048 601 949 149 220 25399 Gmail Productivity 1420 98 162 190 144 14 26 9 21 2084 Google Chrome Utilities 2253 138 242 257 182 32 27 10 24 3165 Google Maps Navigation 3401 331 438 579 960 117 36 48 62 5972 Google Photos Photo & Video 16435 794 1525 2127 1830 269 154 162 149 23445 Instagram Photo & Video 43365 3297 5123 7211 3269 2537 847 485 802 66936 Messenger Social Networking 6926 758 1127 1274 496 143 3986 71 78 14859 Netflix Entertainment 4703 528 954 690 396 96 177 55 87 7686 Snapchat Photo & Video 18985 1013 1819 2550 517 87 42 29 23 25065 Spotify Music 22011 2472 2753 4230 4 353 571 208 259 32861 Twitter News 3180 153 311 525 203 165 176 41 38 4792 Uber Travel 10202 521 897 1161 1164 9 15 167 11 14147 WhatsApp Social Networking 5154 344 625 1880 3694 985 53 524 356 13615 Wish Shopping 13330 1220 2109 2595 193 26 18 116 69 19676 Youtube Photo & Video 27396 1884 3107 4383 2420 576 566 279 330 40941 Total 195690 14847 22777 32274 16520 6010 7643 2353 2529 300643 Sample size 10 392 383 387 389 382 369 374 338 344 3358 1 USA, 2 Australia, 3 Canada, 4 United Kingdom, 5 India, 6 Malaysia, 7 The Philippines, 8 South Africa, 9 Singapore, 10 Confidence level of 95% and confidence interval of 5%, then applied stratified random sample by country. a mobile device and interfaces the iTunes APIs to collect an app’s data for a specified country. For each app, we gathered the following data. • Metadata: app name, app category • Reviews: reviewer name, posted datetime, star rating, review body, review title, country of origin, review app version. In total, we collected 300,643 reviews. Table shows 5.1 the list of apps, their respective app domain, and the number of reviews breakdown for each country and app. 5.2.3 ManualAnnotationProcess To perform the content analysis, we determined a sample size that statistically represents all reviews of a country using a confidence level of 95% and a confidence interval of 5%. We then selected the appropriate 93 number of reviews for each app in each country using stratified random sampling. The sample size for each country is shown in the last row of Table 5.1. Overall, our sample set has 3,358 reviews. Table 5.2: The 13 Software quality and Improvement factors # Name Description Example 1 Usability The app is not easy to operate or the interface is not up to user’s expectation. “The new update is horrible like why would you ever want to swipe sideways for posts the way it was was so much better” 2 Reliability The app cannot be used, opened, or crashes. “This is about 5th time my youtube crash. 
to be honest this update fail man. please fix this asap” 3 Performance The app’s response and processing time is longer than expected. “Whenever we used to go back from chat the bottom taskbar will choke up and we have to wait for it being activated. So it will take time to view story and chat on the same time” 4 Compatibility The app does not work as expected on a particular OS version or device. “With the latest iOS update I can no longer save my videos and animations as a Video to my iPhone. It keeps coming back as Oops Something went wrong.” 5 Functional correctness A particular feature does not work as expected. “Recommended videos don’t work properly, posting comments twice, subscribed channel videos not popping up in the subscriber feed.” 6 Resource utilization The app uses an excessive amount of resources. “Latest app size is more than 300mb for iOS. Reduce the app size” 7 Security and privacy The app lacks security, has security problems, inappropritates, or violates user privacy. “Can you make improvement for security. I hope there is security when someone else try to open your Whatsapp they will need to enter the password first.” 8 Content The content of the app is lacking or prevents users from enjoying the full experience. “There are soooooo many ads now, I’m getting a bit overwhelmed.” 9 Pricing The app incurs hidden cost or requires users to pay to access its full service. “Without premium it’s kinda annoying but still a good app with lots of artists.” 10 Not specific Regarding the app’s quality overall. “Crap. Enough said.” 11 Feature completeness The app lacks a function or content, or the existing function does not cover all users’ needs. “Why can I no longer archive messages in my app? Very frustrating. Used to be hold down message and then archive. Now it no longer seems to even be an option.” 12 Feature request Suggestion on functionality or content to be added. “Please provide delete contact feature to delete unwanted contact.” 13 Feature removal Suggestion on functionality or content to be removed. “Please remove the blue lines on street view! They ruin the ability to move around! Double tapping to the next location is all we need!” Note: We consider Factor 1 to 11 as Software Quality factors and Factor 11 to 13 as Software Improvement factors. Our categories (see Table 5.2) are based initially on the categories found by Khalid et al. [63]. However, we are interested in the software quality aspects of apps. As a result, we redefined and added new cate- gories through an iterative process, similar to the open coding method [67], until we could not find another category that best fits the contents of reviews during our pilot run. We also consulted the Software Prod- uct Quality model defined in ISO/IEC 25010 [58]. Overall, our software quality factors include Usability, Reliability, Compatibility, Performance, Functional correctness, Resource utilization, Security and privacy, Content, Pricing, Not specific, and Feature completeness. (Table 5.2 factor 1-11), and our improvement factors include Feature completeness, Feature request, and Feature removal (Table 5.2 factor 11-13). Note that Feature completeness applies to both the software quality and improvement factors because it is part 94 of the Software Product Quality model and indicates that the app’s current set of functions does not cover all users’ needs. Table 5.2 lists all categories. It also contains a description and an example of review for each category. To label each review, a team comprising one Ph.D. 
student and three master’s level CS students an- notated the reviews. Our annotation process is adapted from [48]. Precisely, we divided 3,358 reviews into four sets, each annotator taking two sets. For each set, two annotators independently annotated re- views. Note that many categories could be assigned to a single review. Oftentimes, annotators can produce conflicting annotations. When the first two annotators finished the set, the third annotator who did not annotate the set cross-validated the results. The third annotator reconciled any disagreements between the first two annotators. To avoid bias, annotators were unaware of the country where the reviews were from. We also instructed each annotator to annotate no more than 100 reviews per day to avoid errors due to fatigue. In addition, before we started the annotation process, we held several meetings to ensure that all an- notators had mutual concepts regarding the definition of each factor. We also developed a tool to facilitate the annotating and cross-validating processes to reduce potential human errors. Figure 5.1 shows our tool used for annotating. Figure 5.1A shows the tool in the annotation mode. The tool allows the annotators to easily look up the definition and an example of review for each factor by hovering over the question mark icon at the end of each factor. The annotators can also flag a review and leave a comment if unsure so that the third annotation will be more informed when performing disagreement resolution or cross-validating the annotation results. Figure 5.1B shows the tool in the disagreement resolution mode. Reviews that the two annotators do not completely agree on will be shown in this mode. In Figure 5.1B, we can see that Annotator A associated the review with “Reliability”, whereas Annotator B did not. The mode highlights “Reliability” in red for clarity and the right panels where the disagreements occur in red; in this case, a disagreement occurs in the “Software Quality" section. If no disagreements are found, the tool highlights 95 the annotators’ results in the left panel in gray and the sections on the right panel in green. During the annotation process, annotators can mark a review as written in English or not (see the “In English” check- box). Reviews in the initial sample set that are not written in English get replaced by reviews of the same app selected randomly from the dataset. We keep repeating this step until our set contains all English reviews and no duplication. We considered only English reviews because it is a common language that all our annotators can read and understand. 96 (a) Annotation Mode 1. Review basic information; 2. Checkboxes for annotation; 3. A comment box; 4. A checkbox to indicate an English review. (b) Disagreement Resolution Mode 1. Annotator A’s annotation result; 2. Annotator B’s annotation result; 3. Red indicates disagreement; 4. Green indicates agreement. Figure 5.1: The review annotation tool 97 5.3 EmpiricalStudyResults 5.3.1 RQ1: Do the contents of reviews from the US differ from the other countries regardingthesoftwarequalityandimprovementfactors? Motivation We want to identify what specific software quality and improvement factors users from the US are con- cerned or care about differently from the other countries. The identified factors deserve further investiga- tion for root causes. 
For example, a disproportionate factor “Feature Completeness” in Singapore suggests that the app lacks some key features that Singapore users may want, while the US users may not.

Procedure

First, we manually annotated statistically representative samples of reviews in each country into the factors in Table 5.2. The manual annotation process is described earlier in Section 5.2.3. After we obtained the results, we computed the frequency, or proportion, of each factor in each country. Figure 5.2 depicts the proportion of factors for each country. In addition to the naive frequency, we also used a standard two-sided two-proportion z-test (a minimal sketch of this test is given below) to check whether the proportion of each factor in each country is significantly different from the US. We assumed that a user mentions each factor with some underlying probability (i.e., a binomial distribution). For convenience, the labels of the factors are 1-13, as shown in Table 5.2. Table 5.3 shows the results from the two-sided two-proportion z-test.

Figure 5.2: The proportion of all factors for each country

Results

From Figure 5.2 and Table 5.3, we draw several interesting observations as follows.
• SG, the PH, and MA have the highest number of factors (1-13) per review. This implies that, on average, users from these countries report more software quality problems, consistent with their low average star ratings: 2.84, 2.53, and 2.76, respectively (other countries have an average rating of 3.05 or higher).
• Users from IN and the PH care proportionally less about the app’s “Content” (8) than the US users.
• Although the factor “Functional Correctness” (5) appears most frequently in every country, only MA, the PH, and SG are proportionally inconsistent with the US on this factor.
• The second most frequent factor in every country is “Feature Completeness” (11). This factor in IN, MA, the PH, and SG is proportionally inconsistent with the US.
• Users from AU, SG, and the UK care proportionally more about the app’s “Usability” than the US users.
• SG has the largest number of factors that are proportionally inconsistent with the US.
• AU and the UK have only one factor, “Usability”, which is proportionally inconsistent with the US.
• CA users care proportionally less about the “Compatibility” issue than the US users.

Table 5.3: The list of all factors for each country that are proportionally inconsistent with the US (proportion z-test*)
Country   Factors
AU        1
CA        4, 10
IN        8, 11
MA        5, 11
PH        5, 7, 8, 11
SA        2, 10
SG        1, 3, 5, 11, 12, 13
UK        1
*H0: “There is no difference between the two population proportions”.

Summary of findings: All countries have at least one software quality and improvement factor that is proportionally inconsistent with the US.

5.3.2 RQ2: What factors are discriminant in classifying the reviews of the US and the other countries?

Motivation

Although the naive frequency and univariate comparison of proportions in RQ1 produce some meaningful insights, some of our factors are correlated; for example, “Feature Request” should imply “Feature Completeness”. More precisely, users suggest that developers include a feature (factor: feature request) that has not yet been implemented in the current version (factor: feature completeness). However, oftentimes users just mention a missing feature without expressing the need to have the feature. Hence, “Feature Completeness” and “Feature Request” are correlated but not equal. For this reason, we should take into account the co-dependence of factors.
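Before turning to the random-forest analysis, the per-factor comparison used in RQ1 (Table 5.3) can be illustrated with a minimal sketch of the two-sided two-proportion z-test. The sketch assumes the statsmodels library is available; the counts are hypothetical and not taken from our data, although the sample sizes follow Table 5.1.

# Minimal sketch: two-sided two-proportion z-test for one factor, one country vs. the US.
# Assumes statsmodels; counts are hypothetical, sample sizes follow Table 5.1 (US 392, SG 344).
from statsmodels.stats.proportion import proportions_ztest

count = [120, 140]  # sampled reviews mentioning the factor in the US and in SG (hypothetical)
nobs = [392, 344]   # number of sampled reviews per country
z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
# Reject H0 ("there is no difference between the two population proportions") when p < 0.05.

Repeating this test for every factor and every country against the US yields the entries reported in Table 5.3.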
Hence, in this RQ, we implemented a random forest-based algorithm to detect discriminant factors in each country. Results from the algorithm should confirm important factors, filter out artificially important factors, or identify missing ones in RQ1. We chose a random forest algorithm because of its superior performance in classification problems with the presence of correlated factors or variables [17, 69]. In this RQ, we included the star rating (“Rating”) as “factor 0” because the random forest algorithm can accept both numerical and categorical inputs. More importantly, “Rating” is correlated to many other factors and differs significantly at the country-level [48]. Procedure We leveraged a random forest-based algorithm introduced by Strobl et al. [110] to identify discriminant factors. The algorithm measures the importance of a factor by an average decrease in out-of-bag prediction accuracy of decision trees when a factor is permuted among samples. The key idea is that if a factor is im- portant for classifications, a random permutation of this factor will eliminate the association between the factor and the response (predicted class) and thus suppress the factor’s contribution. A random permuta- tion, therefore, should lead to a reduction in the prediction accuracy on out-of-bag samples. We will refer to this decrease as “Importance Value” or just “Importance”. Hence, a substantial decrease in out-of-bag accuracy corresponds to a substantial Importance Value. In [110], the authors argued that, when there are correlated factors, it is fruitful to consider conditional permutations – only permuting a factor within the group of fixed values of correlated factors – instead of the total permutation. More precisely, conditional permutations prevent false identification of Impor- tance due to correlation or co-dependence of factors. A more rigorous discussion regarding conditional permutations is described in [110]. 101 In our study, we created 1,000 bootstrap samples to construct trees using the DecisionTreeClassifier module from Scikit-learn 1 , an open-source machine learning library in Python. We used the “sqrt” param- eter for the number of factors considered at each split and a default value for other parameters. In the conditional permutation step, we specified 0.1 as a cut-off (i.e., we permuted each factor within samples with fixed values of other factors to which the factor has an absolute correlation of at least 0.1). Due to the large sample size, it is reasonable to assume that the mean of Importance will have an approximately normal distribution, according to the Central Limit Theorem. To determine if a factor is discriminant, we, therefore, used the t-test with a 0.05 significance level for the mean of Importance Values, computed from the 1000 decision trees, and tested against the following null hypothesis. H 0 : the true mean of Importance is less than 0. RejectingH 0 implies that the factor is discriminant. To derive more definitive discriminant factors, we also adopted a stricter null hypothesis (H 0 : the true mean is less than 0.005). In other words, the strength of a factor on the classification accuracy is higher than 0.5%. Although 0.5% seems small, it is a substantial decrease in accuracy caused by one factor because there are 14 factors overall, and the initial accuracy is low (≤ 0.606, see Table 5.4). Note that discriminant factors warrant further investigation into the root cause. 
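The permutation-importance computation described above can be sketched as follows. For brevity, this is the unconditional variant: each factor is permuted over all out-of-bag samples rather than only within groups of fixed values of its correlated factors, as the conditional approach of Strobl et al. [110] prescribes. The sketch assumes NumPy and scikit-learn, and the toy matrix at the end is a placeholder for our annotated factor matrix (the rating plus factors 1-13).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, n_trees=1000, seed=0):
    # X: (reviews x factors) matrix; y: 1 for US reviews, 0 for the other country.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    importances = np.zeros((n_trees, p))
    for t in range(n_trees):
        boot = rng.integers(0, n, n)             # bootstrap sample for this tree
        oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag samples
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
        tree.fit(X[boot], y[boot])
        base_acc = tree.score(X[oob], y[oob])
        for j in range(p):                       # permute one factor at a time
            X_perm = X[oob].copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[t, j] = base_acc - tree.score(X_perm, y[oob])
    return importances  # one row of Importance Values per tree

# Toy example (random placeholder data, not our annotated reviews):
rng = np.random.default_rng(1)
X_toy = rng.integers(0, 2, size=(200, 14)).astype(float)
y_toy = rng.integers(0, 2, size=200)
imp = oob_permutation_importance(X_toy, y_toy, n_trees=50)
print(imp.mean(axis=0))  # mean Importance Value per factor

A one-sample t-test (e.g., scipy.stats.ttest_1samp) over each column of the returned matrix would then correspond to the two hypothesis tests described above.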
For example, one may perform a topic modeling analysis to extract topics associated with the factor ‘Functional Correctness’ in the PH and the US. The topics unique to the PH can indicate specific issues or bugs that bother the PH users more than the US users. We shall refer to statistically significant factors, measured by their Importance toward the accuracy of a random forest described earlier, as discriminant. This chapter focused only on comparisons between the US and other countries. Thus, we performed binary classifications, one pair of countries at a time. An alternative option is direct multivariate hypothesis testing, such as comparing the proportions of all factors simultaneously with the possibility of correlations between factors. Unfortunately, this type 1 https://scikit-learn.org/stable/ 102 of test requires rigorous mathematical justification, which either has not been well-established or is not readily available due to high-level technicalities. Consequently, we did not pursue this kind of analysis in this chapter. Results Table 5.4 list all discriminant factors based on the two null hypotheses according to t-test with 0.05 signif- icance level. We also reported out-of-bag accuracy of binary classifications before random permutations in the last column. We can draw several conclusions from Table 5.4. Table 5.4: The lists of discriminant factors Country DiscriminantFactors(>0%) † DiscriminantFactors(>0.5%) ‡ Out-of-bag Accuracy AU 0, 1, 4, 9, 10, 11, 12 1,11 0.508 CA 2, 4, 5, 10 5 0.492 IN 2, 5, 6, 10, 11, 12 - 0.504 MA 1, 2, 3, 5, 6, 11, 12 2, 5, 11, 12 0.582 PH 2, 3, 4, 5, 6, 7, 8, 11, 12, 13 2, 5, 11 0.606 SA 2, 5, 7, 8, 10, 11, 12, 13 2, 10, 11 0.517 SG 2, 3, 5, 6, 10, 11, 12 2, 10, 11, 12 0.556 UK 1, 3, 5, 10, 13 - 0.506 † Null hypothesisH0: The true mean of importance is less than 0 ‡ Null hypothesisH0: The true mean of importance is less than 0.005 First, the average accuracy of binary classifications between the US and each of AU, CA, IN, and the UK has poor initial (before permutations) out-of-bag accuracy, i.e., the classifications are almost as bad as simply random guessing. This observation suggests that either we are missing essential factors that can differentiate reviews better or simply reviews in AU, CA, IN, and the UK are mildly different in terms of the presence or absence of factors 1-13 (if different at all) from reviews (along with factor 0 or “Rating”) in the US. In contrast, the accuracy of binary classifications of MA, the PH, SA, and SG indicates that the factors 0-13 have some ability to differentiate reviews. 103 Second, under the strict test (>0.5%), AU, CA, IN, and the UK have 2, 1, 0, and 0 discriminant factors, respectively, while MA, the PH, SA, and SG have 4, 3, 3, and 4 discriminant factors, respectively. Finally, under the strict test (>0.5%), we can observe that the discriminant factors are related to the factors identified in RQ1 (see: Table 5.3). To give specific examples and possible implications, SG has discriminant factors 2, 10, 11, and 12 while has disproportionate factors 1, 3, 5, 11, 12, and 13. In this case, we conjectured that the algorithm filters out artificially discriminant factors 1, 3, 5, and 13 and identifies missing factors 2 and 10. The random forest algorithm also affirms the discriminant factors 11 and 12. If a developer needs to prioritize which factors they should work on for SG, they can first focus on aspects of the app that are relevant to these discriminant factors. 
However, due to the scope of this work, we did not verify this implication. 5.3.3 RQ3: How effective is the NLP approach in identifying the specific needs and prioritiesofusersfromapairofcountries? Motivation App developers can use at least two methods to identify country-specific information for software evolu- tion and maintenance. One method is by conducting a survey or interviewing users from a country directly, as discussed earlier in the case of the Evernote app. Another method is by manually analyzing reviews of users from that country. The ability to quickly identify different opinions of users from different countries could give developers more perspectives and insights, which may lead to changes in requirements or new requirements. However, it is easy to see that both methods require a considerable amount of resources and do not scale well if developers want to identify such information from every country where the app is available. However, there may be a possibility that the second method can be accomplished automatically 104 using NLP techniques. Consequently, we propose a preliminary NLP approach aimed at detecting differ- ences in the contents of user reviews between a pair of countries. We also investigate the effectiveness of the proposed approach. Annotated Dataset Train Entire Dataset LSTM Classifier Model Topics (LDA) Review Topics Distributions Analyze Feature Importance Extract High Confi‐ dence Reviews Reviews by Country Predict Important Topics Combined Corpus Random Forest Step 1 Step 4 Step 2 Step 3 Figure 5.3: The overview of the approach used in our experiment Approach Our approach consists of four primary steps: feature request classification, discussion topic extraction, feature important analysis, and high confidence reviews extraction. Figure 5.3 shows the overall approach. Step 1: Feature Request Classification The goal of this step is to form a corpus of feature request reviews for a particular app and a country for further analysis. Feature request reviews are reviews that contain sentences that suggest to developers on things the app can be improved upon. Hence, reviews that contain improvement factors (i.e., factors 11-13 in Table 5.2) are considered “Feature request reviews”. Table 5.5 shows more examples of feature request reviews. To extract feature request reviews for our experiment, we trained a binary classification model based on the 3,358 manually labeled user reviews described in Section 5.2.3. For this task, we experimented with three widely used supervised machine learning algorithms for text classification: Naive Bayes, Support Vector Machines (SVM), and Long Short-Term Memory networks (LSTM). For each review, we removed punctuations and numbers, tokenized each review into words, and converted words to lower case. We 105 Table 5.5: Examples of Feature Request Reviews FeatureRequestReviews “Please provide delete contact feature to delete unwanted contact.” “I can’t put music on insta story. That’s dumb” “Please remove the blue lines on street view! They ruin the ability to move around!” “When you’ll add the speed cameras on app? Thanks.” applied these steps before training all three algorithms. We chose the LSTM network for this task based on the best F1-score of 0.71, shown in Table 5.6. 
Table 5.6: The classification performance using 5-fold cross-validation Metric NaiveBayes SVM LSTM Precision 0.54 0.79 0.67 Recall 0.74 0.55 0.75 F1 0.63 0.65 0.71 Step2: DiscussionTopicsExtraction To find out feature request topics that users from one country and the other discuss for an app, we used Latent Dirichlet Allocation (LDA) [15], a statistical topic modeling technique, to automatically extract the discussion topics in the reviews. The intuition is that the emphasis should be put on improving the app based on whether a significant portion of users mentions rather than on what only a small number of users mention. Hence, LDA is suitable as it can help detect topics mentioned by a significant proportion of user reviews in a corpus. Note that LDA, along with its variants, has been widely adopted in many research for a similar task [11, 40, 53]. Since LDA is probabilistic and unsupervised, comparing topics from different LDA runs is challenging. To compare topics between any two corpora (i.e., compare all feature request reviews of one country against those of the other), we did the following. First, we combined two corpora into one corpus. Second, we ran LDA on the combined corpus. Third, we investigated how the topics were distributed among the 106 original two corpora. The pre-processing steps for LDA are similar to those in Step 1. Additionally, we removed stop words to reduce semantic duplicates and applied stemming to each word. We adapted this approach from Hu et al. [53]. We experimented with several number of topics (K) ranging between 20 and 30. Based on human judgment on the interpretability and uniqueness of topics, we setK = 30 for all LDA runs. Step3-FeatureImportanceAnalysis: The idea for this step is to identify key discussion topics (features) that have a significant impact on the classification accuracy of user reviews between countries. Discussion topics that can differentiate reviews between two countries are topics that may be discussed exclusively and in a greater quantity in one country over the other. Hence such topics are our primary candidates for identifying country-specific feature requests. Hence, for each LDA run, we extracted review-topic distri- butions for all reviews in the combined corpus while keeping track of their original corpus. A review-topic distribution is a topic composition of a review given by LDA, which assumes that reviews are composed of a mixture of topics. We then used the review-topic distributions as features for a supervised classification model for country classification. In this step, we implemented a random forest algorithm as it is less prone to over-fitting, does not require a set-aside validation or test set, and, most importantly, is often used for feature importance anal- ysis. Due to the possible correlation in the review-topic distributions, we leveraged conditional variable importance for the random forests technique introduced by Strobl et al. [110] for this task. In our experiment, we created, for each comparison between countries, an ensemble of 1,000 decision trees using the DecisionTreeClassifier module from Scikit-learn. Since an app can have an imbalanced number of reviews between the corpus and other countries’ corpus, we applied random under-sampling to the majority corpus for each tree. We used bootstrap sampling with replacement to create the training and OOB sets. In the conditional permutation step, we found a pairwise correlation of features and specified 0.1 as a cut-off. 
107 By examining how the important topics are distributed among reviews of users between two countries, we can indicate which country has a higher concentration of the topic. Step4-HighConfidenceReviewsExtraction . Using the same random forest implementation from the previous step, we employed the following procedures to pinpoint feature request reviews that are unique or important for a specific country in the comparison. (a) For each tree in the forest, we recorded for each review sample (1) whether it is chosen as an OOB sample and (2) if yes, whether the tree predicts its country of origin accurately. (b) For each review sample, we used the records in the first step to find the proportion of times that the tree accurately classified the review to the correct country. Next, a user-specified cut-off parameter is introduced to obtain a set of accurately classified reviews with the likelihood above the cut-off. Through manual examination of different cut-off parameters, we chose 0.6. In other words, reviews classified accurately to a particular country above 0.6 (or 60% of the time) were considered in a high confidence review set for that particular country. (c) We assigned to each review in the set a single topic based on the highest topic probability from its review-topics distribution given out by LDA. (d) We extracted reviews that are both in the set and related to the most important topic from Step 3. PreliminaryResultsandEvaluation Below, we present small, handpicked feature request reviews that are unique or important for a specific app and country over the US extracted by our approach. The data gathering process is described earlier in Section 5.2.2). The following reviews show that users in Malaysia asked WhatsApp to add a dark mode feature to the app over users in the US (country classification accuracy = 0.59): 108 • “Dark mode please... It’s a bit glaring when using Whatsapp at night.” • “Please build a night mode version.” • “Whatsapp never let me down, but hope that it will have the night mode version. I think it will be great.” • “It’d be great if Whatsapp allows us to have night mode and daylight mode so we could just set as we like. Please consider this and please improve the interface like telegram. Thankyouuu.” • “When will you be having a night mode? I would love to rate 5 if you have it.” Users in India askedWhatsApp to add asecurityandprivacylayer to the app over users in the US (country prediction accuracy = 0.59): • “Please add password or fingerprint lock option..!” • “I am humble request WhatsApp developer teams to please give password protection in WhatsApp to lock the WhatsApp and control our private messages, videos, photos, etc.” • “Please provide a provision for setting password to WhatsApp.” • “We want a special lock facility in whatsapp because we don’t have any lock apps for whats app in iphone.” In our experiment, many comparison instances yielded low country classification accuracy, especially when we compared countries like the US, the UK, and Canada. We hypothesize that feature request reviews of users between any of the two countries mentioned are either closely similar or equally diverse. In other words, the approach cannot effectively discern feature request reviews of users between the two countries. We specifically investigated such instances. In such an instance, we injected a feature request topic to one of the country feature request corpora (i.e., inject to non-US feature request corpus). 
We experimented with several reviews for the seeded topic 109 Note: the dotted line indicates the number of seeded topic reviews needed to detect that the seeded topic is the most important topic. Figure 5.4: Preliminary experimental results ranging from 5 to 40. We examined with three apps (compare against the US feature request corpus): Spotify for Australia, Gmail for the UK, and Messenger for Canada. For Spotify, we added reviews discussing Siri support. For Gmail, we added reviews about thedarkmode feature. For Messenger, we added reviews about the security feature. The seeded topic was gathered from reviews of actual users who discussed that they wanted a certain feature to be implemented from another app and minimally modified if needed to suit the context of our concerned apps. For example, in the case of the Messenger app, we modified “WhatsApp please us give some privacy and security option” to “Messenger please give us some privacy and security option”. This ensures that the reviews reflect the real words and needs of actual users. Figure 5.4 shows our experimentation results. As expected, the country classification accuracy for each comparison increased when we added more seeded topic reviews to the corpus (again, added to a non-US corpus). The dotted lines indicate the number of seeded topic reviews needed for our approach to detect that the seeded topic is the most important topic for classification and to confidently extract at least one of the reviews from the topic. Gmail, Messenger, and Spotify require 10, 30, and 15 seeded topic reviews, which are 15%, 10%, and 3.3% the size of the original data, respectively. As shown in Table 5.7, the approach needed 10, 30, and 25 seeded topic reviews which are 15%, 10%, and 5.65% of the original data for Gmail, Messenger, and Spotify, respectively, to achieve a 50% recovery rate of all seeded topic reviews. This suggests that the approach only requires a small number of users from one country to discuss a 110 certain topic over users from another country before it can detect at least half of the reviews on that topic. Note that some users in the US may have already expressed similar needs, which are then picked up by random under-sampling to compare against our seeded topic reviews. This may explain some fluctuation we observed in the country prediction accuracy. Besides, some fluctuations in the number of seeded topic reviews extracted (in Table 5.7) could suggest that some reviews have a higher probability of belonging to another topic, thus overshadowing the seeded topic. This may suggest that assigning a single topic to a review in Step 4 may also be too strict. Table 5.7: The percentage of seeded reviews extracted Numberseeded 0 5 10 15 20 25 30 35 40 Gmail UK 0 0 50.0% 13.3% 40.0% 52.0% 63.3% 74.3% 67.5% Messenger CA 0 0 0 0 0 0 60.0% 17.1% 57.5% Spotify AU 0 0 0 26.7% 35.0% 68.0% 86.7% 80% 80% Summaryoffindings: The proposed preliminary approach aimed at detecting differences in the con- tents of user reviews between a pair of countries shows encouraging results. However, a more rigorous evaluation involving a larger number of apps and user reviews from multiple countries and a larger number of manually annotated reviews is still needed. 5.4 Discussions 5.4.1 Challenges In this section, we identify three key points that may pose challenges in further studying the subject and building an automated approach for the task. 
IdentificationofthecountryoforiginofthereviewersisonlypossibleontheiOSAppStore In the iOS App Store, once users in different countries leave a public review for an app, the review data will be stored separately based on country, allowing easy identification of the origin of a given review. In 111 contrast, on the Google Play Store, user reviews are separated by language, not by country. In this case, we can see reviews in English, for example, but we cannot tell if the reviews were written by British or Australian users. This makes incorporating the nationality of users in a user review study impracticable for Android apps. However, cross-lingual studies are still possible. ReviewsarenotonlywritteninEnglish This is true even in countries that use English as a de facto official language, such as the US. We found that some quantity of the reviews from the US were written in Spanish. In addition, a large portion of user reviews from users in Canada was also written in French. Therefore, if only English reviews are analyzed, bias in the results may exist. Additionally, in some countries that speak English as a second language, English may be spoken only by people from particular social classes, which could affect their review contents, needs, and priorities. We believe that the most challenging task in applying AI to solve this problem is the existence of multiple languages in user reviews. Thoroughevaluationofcurrentstate-of-the-artuserreviewclassifiersisneeded To use the current state-of-the-art user review classifiers for this task, such as ARdoc of Panichella et al. [88], one must make sure that such tool contains no bias. Guzman et al. [48] noted that most user review classifiers were trained and evaluated using reviews exclusively from the US. Our study found that the users’ perception of the software quality of the same apps between the US and other countries can differ. As a result, such classifiers could potentially include algorithm bias if they did not account for the diversity in user reviews. However, a large number of manually annotated reviews from multiple countries are required to evaluate such classifiers rigorously. 112 5.5 ThreatstoValidityandLimitations The study in this chapter has several factors that may affect the validity of the results and conclusions. In this section, we identify several threats to validity of our study. 5.5.1 DifferentAppVersions According to Apple’s official guidelines (Section 4.3 in [4]), developers can only deploy one version of an app in the App Store. By default, the app will be available in all countries the store currently supports. However, developers can restrict or offer content specific to a particular country within this same app version, e.g., due to legal requirements, which may contribute to the varying expectations of the users from each country. For example, some content on Netflix is unavailable in some regions due to licensing rights or local laws. Nonetheless, in cross-platform app development, previous work showed that developers make an effort to ensure their app behaves similarly across platforms [61]. Therefore, it is reasonable to assume that developers also want to offer a consistent user experience of the same version of an app to users across different countries on the same platform. 5.5.2 UsersandTheirRespectiveCountries Similar to Guzman et al. [48], we assumed throughout this chapter that reviews from a specific country’s App Store were written by users who currently reside in that country. 
However, this may not be the case. For example, users can use a virtual private network (VPN) to change their location to access another country’s App Store. Nonetheless, the App Store requires that users have a payment method and billing address from the country to download and install apps from that country’s App Store. Hence, there should be a strong connection between the users and the country they submit their reviews to. 113 5.5.3 SoftwareQualityandImprovementFactors This threat concerns the validity of the category we use to categorize reviews (Table 5.2). To address this threat, we based our categories initially on Khalid et al. [63]. To tailor to our needs, studying software qual- ity complaints and improvement request factors, we redefined the definition and modified the categories through a method similar to the open coding method [67]. We also consulted the Software Product Quality model defined in ISO/IEC 25010 [58]. Nonetheless, other research employing different sets of categories and definitions might arrive at different results and conclusions. 5.5.4 SubjectivityinManualClassification It is not uncommon to include humans in this type of study (e.g., [63, 80]). To mitigate the threat, we em- ployed multiple measures mentioned in Section 5.2.3. Specifically, we held several meetings and ensured that all annotators understood the definition of each factor. We also developed an annotation tool to facil- itate the annotation process and help eliminate potential human errors (e.g., due to typing). Nevertheless, we cannot claim that our data is free from error as some bias or error may remain. 5.5.5 EnglishReviews In this work, we manually analyzed user reviews written in English. This is because our annotators are commonly fluent in English and can read and understand reviews written in English. However, in some countries that speak English as a second language, it may be spoken or used only by people from certain social classes, which may not represent all users within the country. We acknowledge that this may affect the findings. There is at least one way to circumvent this limitation. For example, we could hire native speakers from the study countries to manually annotate reviews written in the country’s official language. However, due to the limited time and accessibility to such resources, we did not pursue this route. An- other way is by leveraging a machine translation model [109, 119]. However, assessing the quality of the 114 automatic translation becomes the main challenge. Future work should incorporate user reviews written in the country’s native language in the analysis. 5.5.6 AppSamplingProblem Martin et al. [74] found that sampling bias may exist if partial subsets (small number of apps) of data are selected to represent the whole for app review analysis. Therefore, our data is also subject to this app sampling problem. To mitigate this threat, we do not claim that our findings can be extended to those not part of the top most popular apps of 2018 and not part of the five-month study period. Thus, our findings do not suffer from the App Sampling Problem. Any attempt to extend and generalize the findings to different app domains and time frames would require great care to mitigate this potential threat. 5.5.7 TheNLPapproach We could improve the preliminary approach described in RQ3 and strengthen the evaluation method in several directions. For example, in Step 1, we trained our own binary classifier and did not leverage mod- els proposed by past studies. 
Besides, adding more data and applying feature selection techniques could improve the performance of our classifier. In Step 2, we configured LDA with the same setting for com- parisons, i.e., we setK = 30. Using a different number of topics may produce different results. Different topic modeling techniques can be used to discover discussion topics in a text corpus, such as Biterm [120] or Twitter-LDA [125], which are claimed to produce better results on short texts than LDA. In Step 4, we assigned each review to a single topic. This could result in a loss of information since LDA assumes that each review has a mixture of topics. Besides, we selected reviews based on a threshold. For this, we ar- gued that we are only interested in reviews that belong confidently to a particular topic and are classified correctly to its country of origin. These decisions may affect the number of reviews extracted. However, we believe it should not significantly affect the characteristics of the extracted reviews. 115 5.5.8 ExternalValidity This threat concerns how generalizable our findings are. First, we select apps from the list of top down- loaded apps released directly from Apple. Second, 10 out of 15 apps in our study are from different app domains. Third, all nine countries we investigate are culturally diverse. Nonetheless, we do not claim that our findings will generalize to apps in other categories (e.g., paid apps), app reviews from other mobile distribution platforms (e.g., Google Play Store), or reviews written in another language. Any attempt to generalize our findings to other apps would be vulnerable to the App Sampling Problem [74], as mentioned earlier. However, our findings highlight that users from other countries perceive the software quality of the same apps differently, and unique information that can improve apps exists. Additionally, the method- ologies used in our study are generic. They can be applied to any reviews of apps from other categories, in other app stores, and other languages, given that we can read and understand such reviews. Nevertheless, we encourage researchers to investigate this further. 5.6 RelatedWork Over the past decades, analyzing user reviews has gained much attention from researchers. However, in most literature, researchers conducted their studies using user reviews only from the US or with no clear separation of the country origin of the reviews. Nonetheless, the closest work to ours is the work by Guzman et al. [48]. They analyzed 2,560 app reviews of seven apps from the App Store of eight countries. They concluded that, at the country level, the characteristics of reviews such as sentiment, rating, content, and length significantly differ. While their annotators could assign more than one category to a review, they reduced the complexity of their analysis by assigning each review to a single category as either a bug report, feature request, or other. However, user reviews often contain one or more sentences; each has an intention. In contrast, our content analysis 116 categories are more fine-grained where we focused on 13 software quality and improvement factors, and each review could be assigned to multiple factors. Additionally, we utilized the random forest algorithm to detect factors that can differentiate app reviews from the US and other countries, taking into account the co-dependence of factors. 
We further proposed a preliminary NLP approach aimed at detecting differences in the contents of user reviews between a pair of countries, which could help developers identify country- specific information for software evolution. A user survey has also been used as another source of data to conduct a cross-country study on mobile app stores. Lim et al. [70] investigated users’ needs, adoption of the app store, and rationale for selecting or abandoning an app by conducting a survey involving 4,824 mobile app users from 15 countries, includ- ing the US, Japan, China, Germany, France, Brazil, the UK, Italy, Russia, India, Canada, Spain, Australia, Mexico, and South Korea. They provided evidence that users are likely to stop using an app if it contains a bug or does not have a desired feature. Their analysis revealed significant differences in user behaviors across countries in many cases. For example, users’ likelihood of abandoning the app due to crashing was higher than average in Brazil and Spain and lower than average in South Korea and Japan. In addition, they found that app users in Russia, Mexico, China, and India were more likely to spend money on apps. Although there is still a dearth of cross-country research in user reviews analysis, few studies have al- ready conducted a cross-country study in other software engineering domains. For example, Reinecke and Gajos [94] studied people’s aesthetic preferences for the design of websites among different demographic groups, including age, gender, education, and nationality. The results showed significant differences in preference within these groups. For example, older people preferred less complicated and colorful web- sites than younger people, people with lower education levels appealed to more colorful and complex websites than people with higher education levels, and females liked more colorful websites than males. They also found that countries in close proximity tended to share similar preferences. For example, peo- ple from Finland and Russia appealed to less colorful websites. Additionally, people from the Northern 117 European countries (e.g., Denmark and Sweden) appealed to less colorful websites than those from the Southern European countries (e.g., Italy and Greece). Reinecke and Bernstein [93] also proposed a sys- tem that automatically generated personalized interfaces that corresponded to users’ cultural preferences. They found that users with a culturally adjusted interface performed tasks faster and more accurately. 5.7 ChapterSummary In this chapter, we analyze user reviews of the top 15 popular apps of 2018 from English-speaking and cul- turally diverse countries over five months. By applying content analysis and statistical tests, we empirically demonstrate that all countries have some software quality and improvement factors that are proportionally inconsistent with the US. In addition, we identify factors that can differentiate the reviews of the US and other countries through an explainable machine learning approach via a random forest algorithm, taking into account the correlation of factors. We also propose a preliminary approach aimed at detecting differ- ences in the contents of user reviews between a pair of countries. Our findings suggest that it is possible to gather country-specific information for software evolution and maintenance and about what bothers the users and makes them likely to give poor reviews specific to the country by analyzing reviews of users in that country. 
Furthermore, we discuss several key challenges associated with conducting a cross-country study using user reviews on app stores. We hope our findings bring attention to app developers that they should not solely rely on the US users, hoping that the US users can identify the needs and expectations of the users from other countries. In other words, analyzing and mining reviews from only the US App Store for software evolution and maintenance may not be enough to ensure user satisfaction and app success in the global markets. Nonetheless, to identify users’ country-specific needs, developers should consider mining reviews of users from other countries as a more cost-effective alternative to conducting a survey or directly interviewing users from 118 that country. This could help developers better understand their users and give them more perspectives and insights, which may lead to changes in requirements or new requirements. 119 Chapter6 UserReviewsandtheApp’sInternalQualityAttributes Source code analysis tools have been the vehicle for measuring and assessing software product quality for decades. However, recently many studies have shown that post-deployment user reviews provide a wealth of insight into the quality of a software product and how it should evolve and be maintained. For example, user reviews help identify missing features or inform developers about incorrect or unexpected software behavior. This chapter presents a preliminary analysis aimed at investigating whether both methods correlate with one another. In other words, we explore if there exists a relationship between the user-perceived software quality of apps and the apps’ internal quality attributes. In this chapter, we analyze 46 actual app releases of three Android open-source software (OSS) apps from the Google Play Store. For each app release, we employ multiple static analysis tools to assess several apps’ internal quality attributes. Additionally, we retrieve and manually analyze the complete reviews after each release for each app, totaling 1,004 reviews. We believe that analyzing user reviews and utilizing analysis tools is a crucial step toward understand- ing the complete picture of the quality of a software product and toward reasoning about its evolutionary history. This chapter is based on our work in [102]. 120 6.1 Introduction Software evolves over time, whether to adapt to a new hardware environment, address its users’ evolving needs, or repair defects. The abundance of software products in the market calls for software companies to thrive on making these changes and releasing new versions of software faster. This may force software developers to put less emphasis on managing software quality and more emphasis on delivering software on time. Consequently, software quality gradually deteriorates, leading to higher code complexity, lower code reusability, and higher maintenance cost [33, 90]. In an effort to improve software quality, many source code analysis tools have been developed (e.g., PMD, FindBugs, and SonarQube). Source code analysis, often called static code analysis, is the software analysis performed on source code without executing the program. The analysis can reveal possible vulner- abilities, defects, or design issues at an early stage in the development phase. Static analysis tools provide numerous software metrics that can give an insight into the quality of the code, such as code complexity, code smell, and comment density. 
For example, software with a high number of code smells indicates low code quality [37]. Analyzing change in the software quality over time among different releases reveals how the software evolves and whether or not developers put emphasis on maintaining code quality [12]. While it is vital for software to be developed with high code quality and within budget and schedule, successfully achieving them is still not enough to ensure a successful software product if the software does not meet user needs or expectations [16, 60]. Fortunately, app stores (e.g., the Google Play Store or the Apple App Store) provide an outlet for users to share their experiences and assessments of the downloaded apps. Users can submit a written review and give a star rating on a scale of one to five. Such reviews contain a wealth of information that can help requirement engineers better meet user needs (i.e., crowdsourcing for requirement elicitation), notify software maintainers of unexpected behavior, and inform other users about their experiences with the app. In fact, mining app store reviews to improve the quality of software has already gained the attention of researchers [27]. An empirical study by Pagano et al. [85] found that 121 the app store serves as a communication channel among users and developers. They also found that most reviews are provided shortly after new app releases and that new releases trigger user reviews. We believe that utilizing static code analysis tools and analyzing reviews are complementing steps in assessing the software quality of a system since each approach reveals some unique characteristics of the app. To conduct our preliminary analysis, we retrieve several apps’ internal attributes of 46 releases of three Android OSS apps utilizing three widely adopted static analysis tools: PMD 1 , FindBugs 2 , and SonarQube 3 . We gather the complete reviews for each release from each app’s store page and manually classify these reviews based on three dimensions: Intention, Sentiment, and Software quality. We then use correlation to find relationships between user reviews and the app’s internal quality attributes. The rest of this chapter is structured as follows. Section 6.2 describes the research setup. Section 6.3 presents the preliminary results and discusses the main findings. Section 6.4 identifies threats to validity. Section 6.5 summarizes the chapter. 6.2 ResearchDesign 6.2.1 ResearchQuestions The main objective of this chapter is to investigate whether there exists a relationship between the user- perceived software quality of apps and the apps’ internal quality attributes. To guide our research, we formulate the following research question: • RQ: To what extent do different review sentence types correlate with the apps’ quality attributes? 1 https://pmd.github.io/ 2 http://findbugs.sourceforge.net/ 3 https://www.sonarqube.org/ 122 6.2.2 DataCollection In order to answer these research questions, we analyzed a total of 46 releases of three Android OSS apps and a total of 1,004 reviews for these releases. In this section, we present our data collection method. We selected Android apps from the Google Play Store based on the following criteria. First, the app should be open-source. Obtaining the source code of a closed source app through decompiling can affect the outcome of the static analysis tools if its source code is protected by code obfuscation methods [98]. Second, the app should have good popularity among developers. 
Table 6.1: The characteristics of the study apps

App | Category | Stars | NDL | NRev | NRL | Time Span
Omni-Notes | Productivity | 1,288 | >100K | 295 | 22 | 08/15 - 01/18
QKSMS | Communication | 1,733 | >500K | 507 | 10 | 12/15 - 07/16
Twidere | Social | 1,555 | >100K | 202 | 14 | 01/17 - 11/17

NDL: Number of downloads; NRev: Number of reviews; NRL: Number of releases; Time Span: Month/Year - Month/Year

We retrieved the source files for each app release from each app's GitHub releases page. Then we collected all user reviews of each release using a web crawling tool we developed. Table 6.1 shows the three Android OSS applications we considered, along with their app category, number of repository stars, approximate download numbers (NDL), the total number of reviews used in this study (NRev), the number of releases analyzed (NRL), and the time span between the earliest and latest release version analyzed. Reviews that did not fall into the study period were discarded. It is important to note that the release dates on GitHub are not necessarily congruent with the release dates on the store page. We used the official release dates on the apps' store page, not their GitHub releases page.
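The sketch below illustrates how reviews could be mapped to release windows using the official store release dates, with reviews posted before the study period discarded. It is a simplified illustration; the data frames, column names, and dates are hypothetical rather than the output of our crawler.

```python
import pandas as pd

# Hypothetical inputs: official store release dates and crawled reviews.
releases = pd.DataFrame({
    "version": ["v1.0", "v1.1", "v1.2"],
    "released": pd.to_datetime(["2017-01-10", "2017-03-02", "2017-05-20"]),
}).sort_values("released")

reviews = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "posted": pd.to_datetime(["2016-12-30", "2017-01-15", "2017-03-10", "2017-06-01"]),
}).sort_values("posted")  # merge_asof requires both frames sorted by the key

# Each review belongs to the most recent release published on or before its post date.
assigned = pd.merge_asof(
    reviews, releases, left_on="posted", right_on="released", direction="backward"
)

# Reviews posted before the first analyzed release fall outside the study period.
assigned = assigned.dropna(subset=["version"])
print(assigned[["review_id", "posted", "version"]])
```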
6.2.3 Manual Review Annotation

Each user review in the dataset was manually examined at the sentence-level granularity by two annotators: one CS Ph.D. student and one CS Master's student. The two annotators worked on each review sentence independently. Conflicting annotations were then resolved by discussion. The annotators were instructed not to annotate more than 200 sentences in a day to reduce annotation errors due to fatigue. Each sentence was classified based on three dimensions: Intention, Sentiment, and Software quality. In the following, we describe each dimension in detail.

Intention

This dimension concerns the purpose or the underlying goal of a given sentence. Each sentence was classified into one of four intentions based on how relevant the sentence is to developers performing software maintenance and evolution tasks. The four intentions in this dimension, adapted from Panichella et al. [89], are as follows:

• Problem Report: sentences describing issues or unexpected behaviors (e.g., "Pages load, display for 1 or 2 seconds then go completely white.")

• Improvement Request: sentences suggesting or expressing needs or ways to improve or enhance the application (e.g., "Wish it had the option to add new contact from messages.")

• Inquiry: sentences attempting to inquire information from the developers or other users (e.g., "How do you create tags?")

• Other: sentences that are irrelevant or uninformative (e.g., "Thank you developer.")

Software Quality

This dimension concerns the software quality aspects of the app. Each sentence was classified with respect to the ISO/IEC 25010 software product quality definitions. ISO/IEC 25010 defines the quality of a system as the extent to which it satisfies the stated and implied needs of its various stakeholders [58]. In our case, stakeholders refer to the users. The model has eight attributes: functional suitability, performance efficiency, usability, portability, compatibility, reliability, maintainability, and security. Note that a sentence can be categorized into one or more software quality characteristics. For example, "Fast, responsive, and highly customizable." implies that the user was satisfied with the performance (time behavior - performance efficiency) and with the set of functions the app offers (functional completeness - functional suitability).

Sentiment

This dimension concerns the tone of a sentence. Each sentence was classified into one of three levels of sentiment: positive, negative, or neutral.

Table 6.2: Examples of sentences and their categories along the three dimensions

Sentence | Intention | Software quality | Sentiment
"Unbearably slow to open, deleted it to save my patience." | Problem Report | Performance efficiency | Negative
"The UI is superb, easy navigable and awesome to use." | Other | Usability | Positive
"This app is by far the biggest battery drain app on my phone." | Problem Report | Performance efficiency | Negative
"Wish we could see notifications like who faved or rted our tweets." | Improvement Request | Functional Suitability | Neutral

Table 6.2 shows examples of sentences and their corresponding categories based on the three dimensions previously discussed. Table 6.3 lists the eleven types of review sentences we were mainly interested in and considered in this study. In particular, we selected "Problem report" from the intention dimension. For the software quality dimension, we mainly focused on sentences mentioning the software quality attributes with a negative sentiment. In other words, we focused on sentences where users express dissatisfaction with the software quality of the app. For the sentiment dimension, we selected the "Negative" and "Positive" sentiments. We included the latter because we hypothesized that it might show a negative correlation with the results given by the static analysis tools.

Since review sentence types were collected as counts and the duration between releases could increase the number of counts, we normalized the count of each sentence type by the total number of sentences for the corresponding release. We used these normalized values in our analysis.

Table 6.3: The eleven review sentence types

Dimension | Sentence Type
Intention | Problem report
Sentiment | Negative, Positive
Software Quality | Functional suitability, Performance efficiency, Compatibility, Usability, Reliability, Security, Maintainability, Portability
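As a small illustration of the normalization described above, the following sketch converts hypothetical per-release sentence-type counts into proportions of all annotated sentences for the corresponding release; the counts shown are placeholders, not the study data.

```python
import pandas as pd

# Hypothetical per-release counts of annotated sentence types.
counts = pd.DataFrame({
    "release": ["v1.0", "v1.1"],
    "total_sentences": [120, 80],
    "problem_report": [18, 6],
    "negative": [30, 12],
    "reliability_negative": [9, 2],
})

sentence_types = ["problem_report", "negative", "reliability_negative"]

# Normalize each sentence-type count by the total number of sentences in that
# release, so that releases with longer review periods remain comparable.
normalized = counts.copy()
normalized[sentence_types] = counts[sentence_types].div(counts["total_sentences"], axis=0)
print(normalized)
```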
6.2.4 Static Analysis

We executed PMD, SonarQube, and FindBugs on each release's source files to extract the app's internal quality attributes. We selected these static analysis tools because they support analyzing code written in a programming language, such as Java, used in Android app development. We used each tool's default rulesets. Table 6.4 lists the quality attributes we considered in this study. Code Smells (CS) is a maintainability-related issue in the code reported by SonarQube. PD is a set of quality attributes reported by PMD. FB is a set of quality attributes reported by FindBugs.

Table 6.4: The apps' internal quality attributes

Abbr. | Tool | Attribute
CS | SonarQube | Number of code smells
PD | PMD | Empty code, Naming, Braces, Import statements, Coupling, Unused Code, Unnecessary, Design, Optimization, String and StringBuffer
FB | FindBugs | Dodgy code, Bad practice, Malicious code, Performance, Correctness, Security, Multithreaded correctness, Internationalization

Since the size of each application was different, which would affect the total number of quality attributes reported by the tools, we normalized the values of each characteristic by the number of lines of code. Figure 6.1 shows the results of applying the three static analysis tools on the Twidere app. The x-axis denotes each app version, while the y-axis denotes the percentage of the quality characteristic per line of code. We can observe from the figure that Twidere may have begun incorporating FindBugs to measure its quality attributes starting from release v3.6.29. In fact, we found the FindBugs plugin inside the build file's dependencies block in release v3.6.29 but not in release v3.6.24.

Figure 6.1: Results from applying the three static analysis tools on Twidere releases

6.3 Procedures and Results

This section explains the procedure for answering the research question, presents the preliminary experimental results, and discusses the findings.

6.3.1 RQ: To what extent do different review sentence types correlate with the apps' quality attributes?

Procedure

We first paired each review sentence type with each quality attribute, resulting in 33 pairs. Then, we used the Pearson correlation coefficient (r) to measure the strength of the linear relationship between each pair. The correlation coefficient value lies between -1 and 1, where 0 implies no correlation. A positive correlation indicates that both variables increase or decrease together, whereas a negative correlation indicates that the variables move in opposite directions. Note that there are different interpretations of the correlation coefficient strength [1]. In this chapter, we used Cohen's guideline [25, 26] to interpret the strength of the correlation coefficient:

• Negligible (N), if |r| < 0.1.
• Small (S), if 0.1 ≤ |r| < 0.3.
• Medium (M), if 0.3 ≤ |r| < 0.5.
• Large (L), if |r| ≥ 0.5.

We also calculated the statistical significance (p-value) for each correlation coefficient. This p-value denotes the probability that the obtained correlation may occur by chance. For example, if the correlation coefficient is 0.8 and the p-value is 0.01, there is only a 1% probability that a correlation of that size would be observed by chance if no true relationship existed between the two variables. To indicate whether the correlation is statistically significant, we set the significance level (α) to 0.05. A short code sketch illustrating this procedure follows Table 6.5.

Table 6.5: The correlation coefficients (r) and the p-values

Review sentence type | CS (r, p) | PD (r, p) | FB (r, p)
Problem Report | 0.0486, 7.5e-01 | 0.3590, 1.4e-02 | -0.0039, 9.8e-01
Negative Sentiment | 0.1872, 2.1e-01 | 0.4462, 1.9e-03 | -0.1046, 4.9e-01
Positive Sentiment | -0.1693, 2.6e-01 | -0.2381, 1.1e-01 | 0.2668, 7.3e-02
Functional Suitability | 0.1005, 5.1e-01 | 0.3454, 1.9e-02 | -0.0731, 6.3e-01
Performance Efficiency | 0.3162, 3.2e-02 | 0.6056, 8.2e-06 | -0.0786, 6.0e-01
Compatibility | 0.0182, 9.0e-01 | 0.0628, 6.8e-01 | -0.0564, 7.1e-01
Usability | 0.1086, 4.7e-01 | -0.0454, 7.6e-01 | 0.0034, 9.8e-01
Reliability | 0.5832, 2.1e-05 | 0.5459, 8.7e-05 | -0.1513, 3.2e-01
Maintainability | -0.0832, 5.8e-01 | 0.1871, 2.1e-01 | 0.2479, 9.7e-02
Security | -0.1467, 3.3e-01 | -0.0851, 5.7e-01 | -0.1822, 2.3e-01
Portability | 0.0182, 9.0e-01 | 0.0628, 6.8e-01 | -0.0564, 7.1e-01

A p-value of at most 0.05 indicates statistical significance; the strength of each correlation is interpreted as negligible, small, medium, or large per Cohen's guideline.
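The following sketch illustrates the kind of computation behind Table 6.5: given per-release normalized values for one review sentence type and one tool's attribute counts, it computes the Pearson correlation coefficient and p-value with SciPy and labels the strength using Cohen's guideline. The input values are hypothetical placeholders, not the study data.

```python
from scipy.stats import pearsonr

def cohen_strength(r: float) -> str:
    """Interpret |r| using Cohen's guideline as applied in this chapter."""
    r = abs(r)
    if r < 0.1:
        return "negligible"
    if r < 0.3:
        return "small"
    if r < 0.5:
        return "medium"
    return "large"

# Hypothetical per-release values: a normalized review sentence type (e.g., the
# proportion of negative "Reliability" sentences) and a normalized tool output
# (e.g., SonarQube code smells per line of code), one value per analyzed release.
review_type_per_release = [0.05, 0.12, 0.08, 0.20, 0.15, 0.11]
tool_attribute_per_release = [0.010, 0.018, 0.012, 0.030, 0.024, 0.017]

r, p = pearsonr(review_type_per_release, tool_attribute_per_release)
significant = p <= 0.05  # significance level alpha = 0.05
print(f"r = {r:.4f}, p = {p:.2e}, strength = {cohen_strength(r)}, significant = {significant}")
```

In practice, this computation would be repeated for each of the 33 pairs, with each p-value compared against the 0.05 significance level as described above.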
Results and Discussion

Table 6.5 shows the correlation coefficient (r) and the p-value of each comparison pair. We observed that the correlation between different review types and the app's internal attributes is negligible in 14 pairs, small in 12, medium in four, and large in three. The correlation of the seven pairs with medium or large correlation is statistically significant. This means that we have strong evidence suggesting that the correlation for these pairs exists and is not likely to have occurred by chance, and that the strength of the correlation of these pairs is medium or large. Five review types ("Problem Report," "Negative sentiment," "Functional suitability," "Performance efficiency," and "Reliability") positively correlate with PD with statistical significance. This implies that the more quality attributes PMD reports for a release, the more of these review sentences the release receives. Similarly, the review types "Reliability" and "Performance Efficiency" positively correlate with CS with statistical significance. This implies that the more code smells SonarQube reports for a release, the more sentences expressing dissatisfaction with the release's performance and reliability the release receives. We found no statistically significant correlation between any review sentence type and FB, though five out of eleven correlations are small.

Our findings are congruent with Chulani et al. [24], who also found a strong correlation between user satisfaction with reliability and a certain type of software defect. Based on these results, it may be fruitful for developers to prioritize reducing the internal quality issues reported by PMD and SonarQube to alleviate users' dissatisfaction with, for example, the Performance Efficiency or Reliability aspects of their apps. However, due to the scope of this work, we did not verify this implication.

Additionally, we found that some software quality attributes, such as Usability, can be subjective. For example, one user may feel that an app is easy to use or aesthetically pleasing, while others may not. As a specific example, in v2.7.0 of QKSMS, the developers changed the look and feel of the app (e.g., its icon), resulting in mixed reviews between those who liked the new design and those who did not. Conversely, we observed less diverse opinions when users experienced response-time degradation, crashes, or unexpected behaviors after updating the app.

Moreover, static analysis tools do not have the capability to measure, for example, whether the application "can replace another specified software product for the same purpose in the same environment," which is part of the Portability attribute described in ISO/IEC 25010 [58]. As a result, negligible correlations were found. Furthermore, we observed that users seldom discussed the apps' Security, Maintainability, Compatibility, and Portability aspects. This indicates that these software quality attributes may be difficult for users to assess. However, these types of quality attributes can be assessed by static analysis tools.

6.4 Threats to Validity

In this section, we identify several threats to validity and limitations of our study.

We directly adopted the categories and definitions of software quality attributes from ISO/IEC 25010 [58]. Other research employing different categories and definitions might arrive at different conclusions.
In addition, we cannot claim that the set of apps' internal quality attributes we extracted from static analysis tools is exhaustive. Future work should include other interesting internal quality attributes, such as technical debt, in the analysis.

Although multiple measures were taken to reduce potential annotation errors, we cannot claim that our annotation results are error-free.

We used three widely adopted static analysis tools to extract the apps' internal quality attributes: PMD, FindBugs, and SonarQube. While these tools support analyzing code written in programming languages used in typical Android app development, they are not specifically developed to analyze Android apps. Future work should conduct a similar study using static analysis tools built specifically for Android app development.

Regarding the generalizability of the findings, the number of study apps is limited and cannot represent all apps in the Google Play Store. Hence, any attempt to generalize our findings to other apps or to apps in the Apple App Store would be vulnerable to the App Sampling Problem [74].

6.5 Chapter Summary

In this chapter, we investigate the extent to which the users' perception of the software quality of apps extracted from user reviews correlates with the apps' internal quality attributes extracted from static analysis tools. We utilize three widely adopted static analysis tools, PMD, FindBugs, and SonarQube, to extract several internal quality attributes from 46 releases of three Android OSS apps. We also retrieve user reviews of these releases from their store pages on the Google Play Store, totaling 1,004 reviews. We split these reviews into sentences and manually categorize each sentence along three dimensions: Intention, Sentiment, and Software Quality. We then use correlation to find relationships between user reviews and the apps' internal quality attributes.

While some user-perceived quality attributes of apps, such as Performance Efficiency and Reliability, significantly correlate with the apps' internal quality attributes extracted using PMD and SonarQube, most comparisons result in negligible correlation. This indicates that having high or low internal quality attributes, as reported by the tools, does not guarantee overall user satisfaction. The work in this chapter can be described as a preliminary proof of technical value that analyzing user reviews and utilizing static analysis tools are complementary methods for gaining insight into the quality of an app.

Chapter 7: Conclusions and Future Directions

This chapter concludes the dissertation and discusses opportunities for future research.

7.1 Conclusions

To better understand and help improve user-developer communications on app stores, we set out to answer the following overarching questions.

• What are the characteristics of user reviews that receive a developer response, and how should users write reviews that will incite a developer response?

• What are the characteristics of a successful developer response, and how should developers respond to user reviews to optimize their chances of success?

• How does the users' perception of the software quality of the same apps differ between the US and other countries?

• How does the users' perception of the software quality of apps correlate with the apps' internal quality attributes?

In particular, we study a wide range of features within user reviews and their relation to developer responses.
We derive evidence-based guidelines for users on how they should write reviews to incite developer responses. Second, we study how developers should respond to app reviews optimally for success. We consider a response successful if users increase their initial rating after receiving it. We derive evidence-based guidelines for developers so that they can write a developer response with an increased chance of success. Additionally, we investigate whether users from the US and other countries perceive the software quality of the same apps differently. We uncover that analyzing and mining reviews from only the US App Store for software evolution and maintenance may not be enough to ensure user satisfaction and app success in the global markets. We find that it is also possible to understand what bothers users in a specific country, and makes them likely to give poor reviews, by analyzing reviews of users in that country. It is also possible to gather country-specific information for app improvement using an NLP technique. Lastly, we examine the relationship between the users' perception of the quality of apps extracted from user reviews and the apps' internal quality attributes extracted from static analysis tools.

Our studies are conducted using a suite of machine learning and data mining techniques and multiple large-scale datasets of mobile app user reviews and developer responses from mobile application stores. The insights uncovered in this dissertation have several implications that could be of practical significance to a broad range of practitioners, including app users, developers, researchers, and app store owners.

7.2 Future Directions

The research described in this dissertation offers several directions and opportunities for further research to be conducted. In the following, we describe possible opportunities and directions.

• Response prioritization system: To develop a full-fledged response prioritization system, we envisage that it requires two phases. The first phase is to determine which reviews developers should respond to. This has been studied in Chapter 3. The second phase is to determine how soon the response should be provided. Hence, one of the future directions will be determining the response time (i.e., whether developers should respond to the review sooner or later). Knowing how soon reviews should be responded to has various advantages for multiple stakeholders. For instance, because users can only write one review per app, users can gauge whether they want to wait for a response or report new issues. For developers, knowing which reviews to respond to first will let them focus their resources and efforts optimally. Better response strategies should lead to increased user satisfaction and improved user-developer communications. A response prioritization system should also suggest ways for users to write a review with a higher chance of receiving a fast response. Indirectly, it is easy to see that this also has the potential to raise the quality of user feedback on app stores. A mockup UI of such a system is shown in Figure 3.6 of Section 3.5. Additionally, Section 3.4.2 demonstrates that each app has a different ranking of important features. This suggests that a one-size-fits-all response prioritization tool may not be effective. Hence, a full-fledged response prioritization tool should allow developers to calibrate it by adjusting feature weights to their specific needs.
• Developer response writing assistant: In Chapter 4, we employed explainable machine learning techniques to uncover the predictive power of a wide range of features that developers can control in terms of distinguishing successful from unsuccessful responses. One exciting avenue for future research is to develop a functional prototype of the response writing assistant tool. Section 4.5.2 describes a set of functionalities the tool should have, and Figure 4.7 shows a mockup UI of such a tool. Ideally, the tool should also be able to adapt and adjust its suggestions to help developers respond to reviews based on different types of reviews or as time goes by. Building upon the tool, another direction would be to conduct a long-term study with app developers to assess the utility or usefulness of the tool in practice. Additionally, researchers who are investigating ways to generate developer responses automatically, such as Gao et al. [41] and Farooq et al. [35], could explore the possibility of leveraging the insights discovered in Chapter 4 to help generate developer responses with an increased chance of success.

Recommendations for further studies include investigating how developers should respond based on the rating or type of review, examining what types of reviews are most likely to be modified with and without a response, exploring whether the first interaction between developer and user could determine the outcome of the conversation, and expanding the definition of success to include, for example, the change in the sentiment of a review from negative to positive after a response.

• Approach for discovering unique, helpful information for software evolution in reviews of users from different countries: While the proposed preliminary approach for detecting differences in the contents of user reviews between a pair of countries described in Chapter 5 shows encouraging results, the approach still contains several limitations as described earlier. For example, it works only for reviews that are written in English. Hence, one of the future directions could be to focus on improving the proposed approach. For instance, to deal with multilingual problems in user reviews from different countries, one may incorporate a machine translation model [109, 119] in the pipeline of the approach. However, user reviews are typically short, informal, and unstructured [85]. In addition, users usually express their opinions with slang, jargon, and abbreviations. Consequently, the main challenge that one might face is assessing the quality of the translation in the presence of these text characteristics. Lastly, another potential direction could be conducting a large-scale study with app developers from various application domains to assess the value of the approach by investigating its utility and application in practice over a period of time.

References

[1] H. Akoglu. User's guide to correlation coefficients. Turkish journal of emergency medicine, 18(3):91–93, 2018. [2] M. Ali, M. E. Joorabchi, and A. Mesbah. Same app, different app stores: a comparative study. In 2017 IEEE/ACM 4th International Conference on Mobile Software Engineering and Systems (MOBILESoft), pages 79–90. IEEE, 2017. [3] A. AlSubaihin, F. Sarro, S. Black, L. Capra, and M. Harman. App store effects on software engineering practices. IEEE Transactions on Software Engineering, 2019. [4] Apple. App store review guidelines. https://developer.apple.com/app-store/review/guidelines. Accessed: 2019-06-05. [5] Apple.
Apple presents the best of 2018. https://www.apple.com/newsroom/2018/12/apple-presents-the-best-of-2018/. Accessed: 2019-04-20. [6] Apple. Availability of apple media services. https://support.apple.com/en-us/HT204411. Accessed: 2022-03-23. [7] Apple. Build apps for the world. https://developer.apple.com/internationalization/. Accessed: 2019-03-27. [8] Apple. Developer insight - evernote. https://developer.apple.com/app-store/evernote/. Accessed: 2019-06-05. [9] Apple. Ratings, reviews, and responses. https://developer.apple.com/app-store/ratings-and-reviews/. Accessed: 2019-06-04. [10] K. Bailey, M. Nagappan, and D. Dig. Examining user-developer feedback loops in the ios app store. In Proceedings of the 52nd Hawaii International Conference on System Sciences, 2019. [11] A. Barua, S. W. Thomas, and A. E. Hassan. What are developers talking about? an analysis of topics and trends in stack overflow. Empirical Software Engineering, 19(3):619–654, 2014. [12] P. Behnamghader, R. Alfayez, K. Srisopha, and B. Boehm. Towards better understanding of software quality evolution through commit-impact analysis. In 2017 IEEE International conference on software quality, reliability and security (QRS), pages 251–262. IEEE, 2017. 137 [13] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012. [14] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox. Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1):014008, July 2015.doi: 10.1088/1749-4699/8/1/014008. [15] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003. [16] B. Boehm, J. Lane, S. Koolmanojwong, and R. Turner. The incremental commitment spiral model, 2014. [17] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [18] F. Calefato, F. Lanubile, and N. Novielli. Emotxt: a toolkit for emotion recognition from text. In Proc. of 7th Int’l Conf. on Affective Computing and Intelligent Interaction Workshops and Demos , ACII ’17, pages 79–80, San Antonio, TX, USA, 2017.isbn: 978-1-5386-0563-9.doi: 10.1109/ACIIW.2017.8272591. [19] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso. Machine learning interpretability: a survey on methods and metrics. Electronics, 8(8):832, 2019. [20] L. Ceci. Number of apps available in leading app stores as of 1st quarter 2021. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/. Accessed: 2022-04-10. [21] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, and R. Kurzweil. Universal sentence encoder, 2018. arXiv: 1803.11175[cs.CL]. [22] N. Chen, J. Lin, S. C. Hoi, X. Xiao, and B. Zhang. Ar-miner: mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering, pages 767–778. ACM, 2014. [23] T. Chen and C. Guestrin. Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, San Francisco, California, USA. Association for Computing Machinery, 2016.isbn: 9781450342322.doi: 10.1145/2939672.2939785. [24] S. Chulani, P. Santhanam, D. Moore, B. Leszkowicz, and G. Davidson. Deriving a software quality view from customer satisfaction and service data. 
In European Conference on Metrics and Measurement. Citeseer, 2001. [25] J. Cohen. A power primer. Psychological bulletin, 112(1):155, 1992. [26] J. Cohen. Statistical power analysis for the behavioral sciences. Routledge, 2013. 138 [27] J. Dąbrowski, E. Letier, A. Perini, and A. Susi. Analysing app reviews for software engineering: a systematic literature review. Empirical Software Engineering, 27(2):1–63, 2022. [28] C. Danescu-Niculescu-Mizil, M. Sudhof, D. Jurafsky, J. Leskovec, and C. Potts. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259, Sofia, Bulgaria. Association for Computational Linguistics, Aug. 2013.url: https://www.aclweb.org/anthology/P13-1025. [29] M. Davidow. Organizational responses to customer complaints: what works and what doesn’t. Journal of service research, 5(3):225–250, 2003. [30] T. De Smedt and W. Daelemans. Pattern for python. The Journal of Machine Learning Research, 13(1):2063–2067, 2012. [31] A. Di Sorbo, S. Panichella, C. V. Alexandru, J. Shimagaki, C. A. Visaggio, G. Canfora, and H. C. Gall. What would users change in my app? summarizing app reviews for recommending software changes. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 499–510. ACM, 2016. [32] EF. Ef english proficiency index. https://www.ef.edu/epi/. Accessed: 2019-04-20. [33] S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus. Does code decay? assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1):1–12, 2001. [34] B. Fang, Q. Ye, D. Kucukusta, and R. Law. Analysis of the perceived value of online tourism reviews: influence of readability and reviewer characteristics. Tourism Management, 52:498–506, 2016. [35] U. Farooq, A. Siddique, F. Jamour, Z. Zhao, and V. Hristidis. App-aware response synthesis for user reviews. In 2020 IEEE International Conference on Big Data (Big Data), pages 699–708, Los Alamitos, CA, USA. IEEE Computer Society, Dec. 2020.doi: 10.1109/BigData50022.2020.9377983. [36] R. Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948. [37] M. Fowler and K. Beck. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999. [38] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.doi: 10.1214/aos/1013203451. [39] L. V. Galvis Carreño and K. Winbladh. Analysis of user comments: an approach for software requirements evolution. In Proceedings of the 2013 International Conference on Software Engineering, pages 582–591. IEEE Press, 2013. [40] C. Gao, J. Zeng, M. R. Lyu, and I. King. Online app review analysis for identifying emerging issues. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 48–58. IEEE, 2018. 139 [41] C. Gao, J. Zeng, X. Xia, D. Lo, M. R. Lyu, and I. King. Automating app review response generation. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, ASE ’19, pages 163–175, San Diego, California. IEEE Press, 2019.isbn: 9781728125084.doi: 10.1109/ASE.2019.00025. [42] Google. Distribute app releases to specific countries. https://support.google.com/googleplay/android-developer/answer/7550024?hl=en. Accessed: 2022-03-23. [43] Google. Supported locations for distribution to google play users. 
https://support.google.com/googleplay/android-developer/answer/10532353?hl=en. Accessed: 2022-04-10. [44] Google. Supported locations for distribution to google play users. https://support.google.com/googleplay/android-developer/table/3541286. Accessed: 2019-03-27. [45] E. C. Groen, S. Kopczyńska, M. P. Hauer, T. D. Krafft, and J. Doerr. Users the hidden software product quality experts?: a study on how app users report quality aspects in online reviews. In 2017 IEEE 25th International Requirements Engineering Conference (RE), pages 80–89. IEEE, 2017. [46] T. Gruber. I want to believe they really care: how complaining customers want to be treated by frontline employees. Journal of Service Management, 22:85–110, 2011. [47] E. Guzman and W. Maalej. How do users like this feature? a fine grained sentiment analysis of app reviews. In 2014 IEEE 22nd international requirements engineering conference (RE), pages 153–162. IEEE, 2014. [48] E. Guzman, L. Oliveira, Y. Steiner, L. C. Wagner, and M. Glinz. User feedback in the app store: a cross-cultural study. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), pages 13–22. IEEE, 2018. [49] E. Guzman and A. P. Rojas. Gender and user feedback: an exploratory study. In 2019 IEEE 27th international requirements engineering conference (RE), pages 381–385. IEEE, 2019. [50] M. Harman, Y. Jia, and Y. Zhang. App store mining and analysis: msr for app stores. InProceedings of the 9th IEEE Working Conference on Mining Software Repositories, pages 108–111. IEEE Press, 2012. [51] S. Hassan, C. Tantithamthavorn, C.-P. Bezemer, and A. E. Hassan. Studying the dialogue between users and developers of free apps in the google play store. Empirical Software Engineering, 23(3):1275–1312, 2018. [52] H. Hu, C.-P. Bezemer, and A. E. Hassan. Studying the consistency of star ratings and the complaints in 1 & 2-star user reviews for top free cross-platform android and ios apps. Empirical Software Engineering:1–34, 2018. [53] H. Hu, S. Wang, C.-P. Bezemer, and A. E. Hassan. Studying the consistency of star ratings and reviews of popular free hybrid android and ios apps. Empirical Software Engineering, 24(1):7–32, 2019. 140 [54] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM, 2004. [55] C. Iacob and R. Harrison. Retrieving and analyzing mobile apps feature requests from online reviews. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 41–44. IEEE Press, 2013. [56] H. Insights. Country comparsion. https://www.hofstede-insights.com/country-comparison/. Accessed: 2019-04-20. [57] M. R. Islam and M. F. Zibran. Sentistrength-se: exploiting domain specificity for improved sentiment analysis in software engineering text. Journal of Systems and Software, 145:125–146, 2018. [58] ISO/IEC 25010:2011: Systems and software engineering – Systems and software Quality Requirements and Evaluation (SQuaRE) – System and software quality models. Standard, ISO, 2011. [59] J. Jiarpakdee, C. Tantithamthavorn, H. K. Dam, and J. Grundy. An empirical study of model-agnostic techniques for defect prediction models. IEEE Transactions on Software Engineering, 2020. [60] C. Jones. Applied software measurement: global analysis of productivity and quality. McGraw-Hill Education Group, 2008. [61] M. E. Joorabchi, A. Mesbah, and P. Kruchten. 
Real challenges in mobile app development. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 15–24. IEEE, 2013. [62] M. Kemmelmeier. Cultural differences in survey responding: issues and insights in the study of response biases. International Journal of Psychology, 51(6):439–444, 2016. [63] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan. What do mobile app users complain about? IEEE Software, 32(3):70–77, 2015. [64] F. Kooti, L. M. Aiello, M. Grbovic, K. Lerman, and A. Mantrach. Evolution of conversations in the age of email overload. In Proceedings of the 24th International Conference on World Wide Web, pages 603–613. International World Wide Web Conferences Steering Committee, 2015. [65] N. Korfiatis, E. García-Bariocanal, and S. Sánchez-Alonso. Evaluating content quality and helpfulness of online product reviews: the interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications, 11(3):205–217, 2012. [66] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004. [67] B. L BERG. Qualitative research methods for the social sciences, 2001. 141 [68] S. Lee and J. Y. Choeh. Predicting the helpfulness of online reviews using multilayer perceptron neural networks. Expert Systems with Applications, 41(6):3041–3046, 2014. [69] A. Liaw, M. Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002. [70] S. L. Lim, P. J. Bentley, N. Kanakam, F. Ishikawa, and S. Honiden. Investigating country differences in mobile app user behavior and challenges for software engineering. IEEE Transactions on Software Engineering, 41(1):40–64, 2014. [71] D. Ljubobratović, M. Vuković, M. Brkić Bakarić, T. Jemrić, and M. Matetić. Utilization of explainable machine learning algorithms for determination of important features in ‘suncrest’peach maturity prediction. Electronics, 10(24):3115, 2021. [72] W. Maalej and H. Nabil. Bug report, feature request, or simply praise? on automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE), pages 116–125. IEEE, 2015. [73] Y. Man, C. Gao, M. R. Lyu, and J. Jiang. Experience report: understanding cross-platform app issues from user reviews. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pages 138–149. IEEE, 2016. [74] W. Martin, M. Harman, Y. Jia, F. Sarro, and Y. Zhang. The app sampling problem for app store mining. In Proceedings of the 12th Working Conference on Mining Software Repositories, pages 123–133. IEEE Press, 2015. [75] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman. A survey of app store analysis for software engineering. IEEE transactions on software engineering, 43(9):817–847, 2016. [76] S. McIlroy, W. Shang, N. Ali, and A. E. Hassan. Is it worth responding to reviews? studying the top free apps in google play. IEEE Software, 34(3):64–71, 2015. [77] H. Min, Y. Lim, and V. P. Magnini. Factors affecting customer satisfaction in responses to negative online hotel reviews: the impact of empathy, paraphrasing, and speed. Cornell Hospitality Quarterly, 56(2):223–231, 2015. [78] N. Moniz, P. Branco, and L. Torgo. Evaluation of ensemble methods in imbalanced regression tasks. In First International Workshop on Learning with Imbalanced Domains: Theory and Applications, pages 129–140. PMLR, 2017. [79] M. Nayebi, B. Adams, and G. Ruhe. Release practices for mobile apps–what do users and developers think? 
In 2016 ieee 23rd international conference on software analysis, evolution, and reengineering (saner), volume 1, pages 552–562. IEEE, 2016. [80] M. Nayebi, H. Cho, and G. Ruhe. App store mining is not enough for app improvement. Empirical Software Engineering:1–31, 2018. 142 [81] M. Nayebi, K. Kuznetsov, P. Chen, A. Zeller, and G. Ruhe. Anatomy of functionality deletion: an exploratory study on mobile apps. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, pages 243–253, Gothenburg, Sweden. Association for Computing Machinery, 2018.isbn: 9781450357166.doi: 10.1145/3196398.3196410. [82] D. Nielsen. Tree boosting with xgboost-why does xgboost win" every" machine learning competition? Master’s thesis, NTNU, 2016. [83] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang. Abusive language detection in online user content. In Proceedings of the 25th international conference on world wide web, pages 145–153, 2016. [84] O. P. Omondiagbe, S. A. Licorish, and S. G. MacDonell. Features that predict the acceptability of java and javascript answers on stack overflow. In Proceedings of the Evaluation and Assessment on Software Engineering. ACM, 2019. [85] D. Pagano and W. Maalej. User feedback in the appstore: an empirical study. In 2013 21st IEEE international requirements engineering conference (RE), pages 125–134. IEEE, 2013. [86] F. Palomba, M. Linares-Vásquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia. Crowdsourcing user reviews to support the evolution of mobile apps. Journal of Systems and Software, 137:143–162, 2018. [87] F. Palomba, P. Salza, A. Ciurumelea, S. Panichella, H. Gall, F. Ferrucci, and A. De Lucia. Recommending and localizing change requests for mobile apps based on user reviews. In Proceedingsofthe39thinternationalconferenceonsoftwareengineering, pages 106–117. IEEE Press, 2017. [88] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall. Ardoc: app reviews development oriented classifier. In Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, pages 1023–1027. ACM, 2016. [89] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall. How can i improve my app? classifying user reviews for software maintenance and evolution. In 2015 IEEE international conference on software maintenance and evolution (ICSME), pages 281–290. IEEE, 2015. [90] D. L. Parnas. Software aging. In Software Engineering, 1994. Proceedings. ICSE-16., 16th International Conference on, pages 279–287. IEEE, 1994. [91] M. F. Porter. An algorithm for suffix stripping. program, 14(3):130–137, 1980. [92] P. Probst, A.-L. Boulesteix, and B. Bischl. Tunability: importance of hyperparameters of machine learning algorithms. The Journal of Machine Learning Research, 20(53):1–32, 2019. [93] K. Reinecke and A. Bernstein. Improving performance, perceived usability, and aesthetics with culturally adaptive user interfaces. ACM Transactions on Computer-Human Interaction (TOCHI), 18(2):8, 2011. 143 [94] K. Reinecke and K. Z. Gajos. Quantifying visual preferences around the world. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 11–20. ACM, 2014. [95] M. Salehan and D. J. Kim. Predicting the performance of online consumer reviews: a sentiment mining approach to big data analytics. Decision Support Systems, 81:30–40, 2016. [96] B. Savarimuthu, S. Licorish, M. Devananda, G. Greenheld, V. Dignum, and F. Dignum. 
Developers’ responses to app review feedback-a study of communication norms in app development. In Proc. of the COIN 2017 workshop@ AAMAS (2017), 2017. [97] R. M. Schindler and B. Bickart. Perceived helpfulness of online consumer reviews: the role of message content and style. Journal of Consumer Behaviour, 11(3):234–243, 2012. [98] P. Schulz. Code protection in android. Insititute of Computer Science, Rheinische Friedrich-Wilhelms-Universitgt Bonn, Germany, 110, 2012. [99] A. Sefferman. Mobile ratings: the good, the bad, and the ugly. https://www.apptentive.com/blog/2016/06/23/mobile-ratings-good-bad-ugly/. Accessed: 2022-04-10. [100] Sensor Tower. 5-year market forecast: app spending will climb to $270 billion by 2025. https://sensortower.com/blog/sensor-tower-app-market-forecast-2025. Accessed: 2022-04-10. [101] J. P. Singh, S. Irani, N. P. Rana, Y. K. Dwivedi, S. Saumya, and P. K. Roy. Predicting the “helpfulness” of online consumer reviews. Journal of Business Research, 70:346–355, 2017. [102] K. Srisopha and R. Alfayez. Software quality through the eyes of the end-user and static analysis tools: a study on android oss applications. In 2018 IEEE/ACM 1st International Workshop on Software Qualities and their Dependencies (SQUADE), pages 1–4. IEEE, 2018. [103] K. Srisopha, D. Link, and B. Boehm. How should developers respond to app reviews? features predicting the success of developer responses. In Evaluation and Assessment in Software Engineering, pages 119–128. 2021. [104] K. Srisopha, D. Link, D. Swami, and B. Boehm. A Replication Package of Learning Features that Predict Developer Responses for iOS App Store Reviews. https://doi.org/10.5281/zenodo.3960965. [105] K. Srisopha, D. Link, D. Swami, and B. Boehm. Learning features that predict developer responses for ios app store reviews. In Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), ESEM ’20, Bari, Italy. Association for Computing Machinery, 2020.isbn: 9781450375801.doi: 10.1145/3382494.3410686. [106] K. Srisopha, C. Phonsom, M. Li, D. Link, and B. Boehm. On building an automatic identification of country-specific feature requests in mobile app reviews: possibilities and challenges. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, pages 494–498, 2020. 144 [107] K. Srisopha, C. Phonsom, K. Lin, and B. Boehm. Same app, different countries: a preliminary user reviews study on most downloaded ios apps. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 76–80. IEEE, 2019. [108] K. Srisopha, D. Swami, D. Link, and B. Boehm. How features in ios app store reviews can predict developer responses. In Proceedings of the Evaluation and Assessment in Software Engineering, pages 336–341, 2020. [109] F. Stahlberg. Neural machine translation: a review. Journal of Artificial Intelligence Research , 69:343–418, 2020. [110] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC bioinformatics, 9(1):307, 2008. [111] C. Tantithamthavorn and A. E. Hassan. An experience report on defect modelling in practice: pitfalls and challenges. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP ’18, pages 286–295, Gothenburg, Sweden. Association for Computing Machinery, 2018.isbn: 9781450356596.doi: 10.1145/3183519.3183547. [112] C. Tantithamthavorn, A. E. Hassan, and K. 
Matsumoto. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Transactions on Software Engineering, 46(11):1200–1219, 2020.doi: 10.1109/TSE.2018.2876537. [113] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558, 2010. [114] A. Vaid, S. Somani, A. J. Russak, J. K. De Freitas, F. F. Chaudhry, I. Paranjpe, K. W. Johnson, S. J. Lee, R. Miotto, F. Richter, et al. Machine learning to predict mortality and critical events in a cohort of patients with covid-19 in new york city: model development and validation. Journal of medical Internet research, 22(11):e24018, 2020. [115] E. Venson, T. F. Lam, B. Clark, and B. Boehm. Analyzing software security-related size and its relationship with vulnerabilities in oss. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), pages 956–965. IEEE, 2021. [116] L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta. Release planning of mobile apps based on user reviews. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pages 14–24. IEEE, 2016. [117] J. Waldvogel. Greetings and closings in workplace email. Journal of Computer-Mediated Communication, 12(2):456–477, 2007. [118] T. Walsh, P. Nurkka, and R. Walsh. Cultural differences in smartphone user experience evaluation. In Proceedings of the 9th International Conference on Mobile and Ubiquitous Multimedia, page 24. ACM, 2010. 145 [119] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. [120] X. Yan, J. Guo, Y. Lan, and X. Cheng. A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web, pages 1445–1456, 2013. [121] L. Yang, S. T. Dumais, P. N. Bennett, and A. H. Awadallah. Characterizing and predicting enterprise email reply behavior. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 235–244. ACM, 2017. [122] S. W. Yang, Y. K. Hyon, H. S. Na, L. Jin, J. G. Lee, J. M. Park, J. Y. Lee, J. H. Shin, J. S. Lim, Y. G. Na, et al. Machine learning prediction of stone-free success in patients with urinary stone after treatment of shock wave lithotripsy. BMC urology, 20(1):1–8, 2020. [123] M. Yeomans, A. Kantor, and D. Tingley. The politeness package: detecting politeness in natural language. R Journal, 10(2):489–502, 2018. [124] Y. Zhang, J. Li, and Q. Qin. Identification of factors influencing locations of tree cover loss and gain and their spatio-temporally-variant importance in the li river basin, china. Remote sensing, 8(3):201, 2016. [125] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In European conference on information retrieval, pages 338–349. Springer, 2011. 146
Abstract
Recent studies and surveys have shown that mobile app user reviews are valuable for software engineering activities, such as maintenance or requirement elicitation. Potential new users base their download decision on reviews and ratings left by other users. Hence, achieving positive star ratings and user reviews is greatly important for app success and app developers.
Mobile app stores, such as the Apple App Store and the Google Play Store, have launched the ability for developers to respond to reviews, thus creating a two-way communication channel between users and developers directly on the app stores. However, establishing effective communication between the two parties can be challenging. In addition, the rapid growth of smartphone users worldwide has called for developers to make their apps accessible to global audiences and markets. Analyzing reviews of users from other countries could help developers better understand their users and give developers more perspectives, which may lead to changes in requirements or new requirements.
Motivated by these observations, for this dissertation, we conduct four empirical studies utilizing a suite of machine learning, data mining and analysis techniques, and a number of datasets of mobile app user reviews and developer responses from app stores. Specifically, first, we study a wide range of features within user reviews and their relation to developer responses. Second, we study how developers should respond to reviews for success. Third, we study how the users from different countries perceive the software quality of the same apps compared to the US users. Lastly, we compare the users' perception of the software quality of apps to the apps' internal quality characteristics.
The primary objective is to derive understandings and uncover insights for a broad range of practitioners to be used for improving communications between users and developers on mobile app stores.