TOWARD UNDERSTANDING MOBILE APPS AT SCALE

by

Shuai Hao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

June 2014

Copyright 2014 Shuai Hao

Acknowledgements

Six years is not short. Many people guided and helped me during my Ph.D. study, and I could not have finished this journey without them.

First, I want to thank my advisor, Professor Ramesh Govindan. Ramesh is the best mentor I have ever met and has since become my lifelong role model. He trained me well in doing research: how to pick problems that are interesting, challenging, and have real-world impact; how to describe ideas in formal writing; and how to convey them to different groups of people. I also want to thank my co-advisor, Professor William G.J. Halfond. GJ first introduced me to the field of program analysis in his class and later guided me through all of my dissertation projects. Thanks, Ramesh and GJ!

Second, I want to thank all my collaborators in my dissertation work: Ding Li on eLens (Chapter 2) and SIF (Chapter 3), and Bin Liu and Suman Nath on PUMA (Chapter 4). I am also very grateful to the other labmates in the former ENL and the current NSL for being good friends, helping me out on technical problems, proofreading my paper drafts before deadlines, and going out for dinners and hikes on weekends. They are Tobias Flach, Omprakash Gnawali, Ki-Young Jang, Nupur Kothari, Bin Liu, Nilesh Mishra, Jeongyeup Paek, Luis Pedrosa, Moo-Ryong Ra, Sumit Rangwala, Abhishek Sharma, and Marcos Vieira.

Finally, I want to thank USC and the Annenberg Fellowship program for their financial support throughout my study, and my parents, my sister, and my wife, Jane, for their continuous love and support. The arrival of our son, Ethan, made my last one and a half years busy, exciting, and joyful!

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Dissertation Overview and Contributions
  1.2 Dissertation Outline
Chapter 2: eLens: Estimating Mobile App Energy Consumption using Program Analysis
  2.1 Introduction
  2.2 Our Approach for Energy Estimation
    2.2.1 Generating the Workload
    2.2.2 Estimating Energy Consumption
    2.2.3 Energy Annotations
  2.3 Software Environment Energy Profile
    2.3.1 Path-Independent Cost Functions
    2.3.2 Path-Dependent Instruction Costs
  2.4 Evaluation
    2.4.1 Methodology
    2.4.2 Accuracy of eLens
    2.4.3 Why Do We Need Energy Profilers?
    2.4.4 Analysis Time
    2.4.5 Using eLens to Compare Applications
  2.5 Conclusion
Chapter 3: SIF: A Selective Instrumentation Framework for Mobile Apps
  3.1 Introduction
  3.2 Background and Motivation
  3.3 A Selective Instrumentation Framework
    3.3.1 Overview of SIF
    3.3.2 The SIFScript Language
      3.3.2.1 Codepoint Sets
      3.3.2.2 Path Sets
    3.3.3 SIF Component Design
      3.3.3.1 Preliminaries
      3.3.3.2 Realizing the Codepoint Set Abstraction
      3.3.3.3 Realizing the Path Set Abstraction
      3.3.3.4 Overhead Feedback
    3.3.4 Implementation of SIF for Android
  3.4 Evaluation
    3.4.1 Expressivity of SIF
    3.4.2 Efficiency of SIF
  3.5 Conclusion
Chapter 4: PUMA: Programmable UI-Automation for Large-Scale Dynamic Analysis of Mobile Apps
  4.1 Introduction
  4.2 Background and Motivation
    4.2.1 Dynamic Analysis of Mobile Apps
    4.2.2 Related Work on Dynamic Analysis of Mobile Apps
    4.2.3 Framework Requirements
  4.3 Programmable UI-Automation
    4.3.1 PUMA Overview and Workflow
    4.3.2 The PUMAScript Language
    4.3.3 PUMA Design
    4.3.4 Implementation of PUMA for Android
  4.4 Evaluation
    4.4.1 Methodology
    4.4.2 PUMA Scalability and Expressivity
    4.4.3 Analysis 1: Accessibility Violation Detection
    4.4.4 Analysis 2: Content-based App Search
    4.4.5 Analysis 3: UI Structure Classifier
    4.4.6 Analysis 4: Ad Fraud Detection
    4.4.7 Analysis 5: Network Usage Profiler
    4.4.8 Analysis 6: Permission Usage Profiler
    4.4.9 Analysis 7: Stress Testing
  4.5 Conclusion
Chapter 5: Literature Review
  5.1 Energy Profiling
  5.2 Instrumentation Framework
  5.3 Dynamic Analysis of Mobile Apps
Chapter 6: Conclusion and Future Work
  6.1 Future Work
References
Appendix A: SIF Supplement
  A.1 SIFScript Program Codes
    A.1.1 Ad Cleaner
    A.1.2 Fine-grained Permission Control
    A.1.3 Privacy Leakage
Appendix B: PUMA Supplement
  B.1 PUMAScript Program Codes
    B.1.1 Accessibility Violation Detection
    B.1.2 Content-based App Search
    B.1.3 UI Structure Classifier
    B.1.4 Ad Fraud Detection
    B.1.5 Permission Usage Profiler
    B.1.6 Stress Testing

List of Tables

2.1 Subject applications
2.2 Component-level accuracy
2.3 Time vs. Energy
2.4 Analysis time
3.1 Implemented instrumentation tasks
3.2 Implemented SIF tasks (*Line Of Code)
3.3 Native methods invoked during a game run
3.4 Analytics results collected from 3 users
3.5 Time to instrument SIF tasks
3.6 Runtime overhead of SIF
4.1 Recent work that has used a monkey tool for dynamic analysis
4.2 List of analyses implemented with PUMA
4.3 Accessibility violation results
4.4 Search results
4.5 Ad fraud results

List of Figures

1.1 My contributions
2.1 Overview of eLens
2.2 Source line visualization provided by eLens
2.3 Whole program accuracy
2.4 Method-level accuracy
2.5 Top five energy hotspots
3.1 Overview of SIF
3.2 Operations on SIF abstractions
3.3 Timing profiler
3.4 Call graph profiler
3.5 Flurry-like analytics
3.6 SIFScript descriptions for some tasks
3.7 Dialog asking for user's choice
3.8 Screenshot before and after AdCleaner
3.9 Photos taken and uploaded by instrumented app
3.10 FreeMarket attacker
3.11 Accuracy of SIF's overhead estimates
4.1 Overview of PUMA
4.2 App clustering for UI structure classification
4.3 Cluster size for r_spatial = 3
4.4 An app clone example (one app per rectangle)
4.5 Network traffic usage
4.6 Permission usage: granted vs used

Abstract

The mobile app ecosystem has experienced tremendous growth in the last decade. This has triggered active research on dynamic analysis of the energy, performance, and security properties of mobile apps. There is, however, a lack of tools that can accelerate and scale these studies to the size of an entire app marketplace. In this dissertation, we present three pieces of work that help researchers and developers move in this direction.

First, we present a new approach that provides fine-grained estimates of mobile app energy consumption. We achieve this using a novel combination of program analysis and per-instruction energy modeling. Our Android prototype, called eLens, shows that our approach is both accurate and lightweight. We believe that eLens will accelerate the development of energy-efficient mobile apps.
Then, we introduce a framework, called SIF, for selective app instrumentation. SIF provides two high-level programming abstractions: codepoint sets and path sets. Additionally, SIF gives users overhead estimates for their specified instrumentation tasks. By implementing a diverse set of tasks, we show that SIF's abstractions are compact and precise and that its overhead estimates are accurate. We expect that the release of SIF will accelerate studies of the mobile app ecosystem.

Last, we focus on a programming framework for dynamic analysis of mobile apps. This is motivated by the fact that existing research has largely developed analysis-specific UI automation techniques, in which the logic for exploring app execution is intertwined with the logic for analyzing app properties. PUMA is a programmable framework that separates these two concerns. It contains a generic UI-Automation capability and exposes high-level events for which users can define handlers. We demonstrate the capabilities of PUMA by analyzing seven distinct performance, security, and correctness properties over 3,600 marketplace apps.

Chapter 1: Introduction

Over the last decade, the mobile computing field has evolved into an ecosystem consisting of millions of applications (apps), billions of end users, and a huge community of app developers and researchers. While this has transformed many aspects of our lives, there is still much to learn about the ecosystem. For example, when end users complain about abnormal energy behavior in a particular app, the developers usually have little idea what causes the problem or which portion of the code is responsible. Security researchers may want to record every Internet access an app performs at runtime. An ad network provider may be interested in identifying apps that violate its ad policies. These scenarios pose several challenges that need to be tackled to better understand the mobile ecosystem:

Fine-grained Visibility into App Properties. This is important for understanding most app properties. For example, in the abnormal-energy-behavior scenario, fine-grained visibility into the energy consumption inside an app can help developers quickly pinpoint the energy hotspots and proceed to the next phase of debugging or optimization.

Instrumentation Capability. This arises in most app studies that require customized modifications or changes, termed instrumentation. Instrumentation can take place in the OS, in the app, or in a combination of both. Instrumenting the OS provides access to more information, but has the disadvantage of requiring privileged access to the device and reflashing the OS. Instrumenting the app avoids these hurdles but has relatively limited access to information.

Efficient and Automated Analysis. While extending app studies to the scale of an entire app marketplace yields more insight into the ecosystem, doing so requires efficient and automated analysis methods, since resource usage is usually proportional to the number of apps and keeping users in the loop is expensive and does not scale well.

Figure 1.1: My contributions

Given these challenges, my research goal is to develop efficient analysis methods that help marketplaces and developers characterize the properties of a large number of mobile apps. Throughout this dissertation, we advance the state of the art as described below.
1.1 Dissertation Overview and Contributions

This dissertation makes three major contributions to the field of mobile computing, as shown in Figure 1.1.

Fine-grained Energy Profiling (Chapter 2). To measure is to know. Fine-grained visibility into app properties, such as energy consumption, can greatly help developers understand the energy implications of their implementation choices. Before eLens, however, the state of the art supported energy profiling only at the whole-app or method level, which is of little help when a method grows to hundreds of lines of code. eLens pushes profiling to the source-line level. It achieves this using a novel combination of two techniques: program analysis and per-instruction energy modeling. With program analysis, we track runtime execution statistics for each path. With per-instruction energy modeling, we calculate the energy cost of each instruction. Combining the runtime path statistics with the mapping between source lines and low-level instructions, we aggregate the energy cost of each source line. We have implemented eLens for the Android platform and evaluated its accuracy on real marketplace apps. Our evaluation shows that eLens' energy estimates are within 10% of ground truth and that it is fast and lightweight.

Selective App Instrumentation (Chapter 3). In current mobile ecosystems, instrumentation of mobile apps is a much-needed capability for researchers and developers who want to understand these apps. However, app instrumentation is a labor-intensive and error-prone process. Before SIF, no existing tools helped people perform instrumentation: they usually had to build their own tools for their specific purposes from scratch, which can slow down their studies significantly. With SIF, we provide programming support that lets people write high-level scripts to perform their instrumentation tasks. We derived the requirements for SIF by surveying research projects that used app instrumentation. SIF contains two high-level programming abstractions: codepoint sets and path sets. Codepoint sets allow users to specify points of interest and attach their own instrumentation logic. Path sets report runtime paths between two user-specified codepoint sets. We have implemented SIF for the Android platform and evaluated the expressivity of the two abstractions and the usefulness of SIF.

Programmable UI-Automation (Chapter 4). UI automation is a common practice for dynamic analysis of mobile apps, in which users must develop two pieces of logic: logic for exploring app execution and logic for analyzing dynamic app properties. However, our survey of UI-automation-based app analyses indicates that these two pieces of logic are usually intertwined, so an analysis tool built by one research group is very hard for others to reuse. This can slow down research on dynamic analysis significantly. PUMA is a programmable UI-automation framework that contains a generic UI-automation capability and exposes high-level events for which users can define handlers. With PUMA, users write compact customization code (handlers) for UI automation and focus on the app analysis logic. We have implemented PUMA for the Android platform and evaluated the expressivity of its abstractions and the usefulness of PUMA.

1.2 Dissertation Outline

This dissertation is organized as follows. In Chapter 2, we present eLens, a fine-grained energy profiler.
We first describe the overall design and then cover the details of the two underlying techniques: program analysis and per-instruction energy modeling. Last, we present the evaluation results of eLens on marketplace apps. In Chapter 3, we present SIF, a selective instrumentation framework. We first describe the overall design and then cover the details of the two high-level programming abstractions: codepoint sets and path sets. We evaluate the expressivity of SIF's abstractions and the accuracy of its overhead estimates. In Chapter 4, we present PUMA, a programmable UI-Automation framework. We first describe the rationale behind its design and then cover the details of its high-level events. We evaluate the capabilities of PUMA and show how it can help large-scale app analysis. In Chapter 5, we survey the literature related to each of the projects above. Lastly, we conclude the dissertation and list future work in Chapter 6.

Chapter 2: eLens: Estimating Mobile App Energy Consumption using Program Analysis

Optimizing the energy efficiency of mobile applications can greatly increase user satisfaction. However, developers lack viable techniques for estimating the energy consumption of their applications. In this work, we propose a new approach that is lightweight in terms of its developer requirements and provides fine-grained estimates of energy consumption at the code level. It achieves this using a novel combination of program analysis and per-instruction energy modeling. In our evaluation, the approach estimates energy consumption to within 10% of ground truth for a set of mobile applications from the Google Play store. Additionally, it provides useful and meaningful feedback that helps developers understand the energy consumption behavior of their applications.

2.1 Introduction

Smartphones and tablets allow people to carry around more computational power in their hands than most had on their desktops just a few years ago. However, the usability of these devices is strongly defined by the energy consumption of mobile applications, and user reviews of applications reveal many customer complaints related to energy usage.

Research in estimating the energy usage of mobile devices has explored a wide variety of techniques, ranging from specialized hardware, cycle-accurate simulators, and operating-system-level instrumentation to carefully calibrated software-based energy profilers that provide coarse-grained energy estimates. From the perspective of a developer wishing to optimize the energy consumption of an application, each of these approaches has one or more shortcomings: specialized hardware can be expensive, cycle-accurate simulators and operating-system-level instrumentation can slow down a mobile app beyond the point of usability, and coarse-grained energy estimates may not be able to pinpoint hotspots within an app.

To address these shortcomings, we explore a novel approach, called eLens, that combines two ideas that have not previously been explored together: program analysis, to determine the paths traversed and track energy-related information during an execution, and per-instruction energy modeling, which enables eLens to obtain fine-grained estimates of application energy. eLens does not require the developer to possess specialized hardware or to instrument the operating system, does not impact the usability of applications, and can measure energy usage at method, path, or line-of-source granularity. These two ideas work in concert as follows.
When a developer wishes to obtain an energy estimate for specific use cases of a mobile app, eLens uses instrumentation to identify the corresponding paths of the application that will be executed and to record runtime information needed by the energy models (Section 2.2). To compute the energy estimate, eLens analyzes the recorded paths and runtime information to extract the energy-relevant information, and uses it to drive the energy models and estimate the energy consumption of each bytecode or API call for the hardware components of the system (e.g., CPU, memory, network, and GPS). The energy models are provided by a software environment energy profile (SEEP), whose design and development enables the per-instruction energy modeling (Section 2.3). eLens energy estimates can be computed at different levels of granularity (application, method, path, and line of source code) and integrated into a development environment, such as Eclipse, so that developers can visualize the energy usage of their application during development.

eLens has several desirable properties that distinguish it from prior work. By design, it is lightweight; eLens requires neither modifications to the mobile operating system nor expensive power-monitoring hardware. Moreover, eLens provides fine-grained visibility into the energy consumption of an application at multiple levels of granularity, down to an individual line of source code. Using experiments on popular mobile applications obtained from the Android marketplace, we demonstrate two other important properties (Section 2.4). eLens is accurate; it estimates the power consumption of real marketplace applications to within 10% of ground-truth measurements. Competing methods that are path-insensitive or that use coarse-grained energy models can be an order of magnitude more inaccurate. Finally, it is fast, allowing developers to easily analyze the energy behavior of multiple combinations of hardware and operating systems.

2.2 Our Approach for Energy Estimation

eLens analyzes the implementation of a mobile application and provides code-level estimates of the energy that it will consume at runtime. The results of this analysis are summarized at the granularity of the whole program, path, method, and source line to help the developer make informed implementation decisions for reducing energy consumption.

The inputs to the approach (Figure 2.1) are: (1) the software artifact; (2) the workload, which describes the way the software will be used at runtime; and (3) system profiles, which use per-instruction energy models to specify the power characteristics of the platforms for which the developer is targeting the implementation (Section 2.3). Within eLens, there are three components: the Workload Generator (Section 2.2.1) translates the workload into sets of paths through the software artifact; the Analyzer (Section 2.2.2) uses the paths and system profiles to compute an energy estimate; and the Source Code Annotator (Section 2.2.3) combines the paths and energy estimate to create an annotated version of the source code that is provided to the developer. The output of eLens is a visualization that shows the estimated energy consumption of the software at the path, method, source line, and whole-program granularity.
Figure 2.1: Overview of eLens

2.2.1 Generating the Workload

The Workload Generator is responsible for converting the user-level actions, for which the developer wants an estimation, to the path information used by the Analyzer and Source Code Annotator. Trivially, this information could be provided by assuming that every path in the artifact is executed n number of times. However, this would not be an accurate reflection of how the application would execute at runtime, and since energy consumption is not uniform over different paths, the resulting estimate would not be as helpful. For example, in a video viewing application, the paths traversed to watch a video will consume more power than the paths traversed to start or exit the application.

The inputs to the Workload Generator are the workload description, W, and the implementation of the application, S. The workload description is a specification of the behavior of the application for which the developer wants to estimate energy consumption and can be represented as a sequence of use cases ⟨u_1, u_2, ..., u_n⟩. The Workload Generator instruments S to create a version, S′, that will record the paths traversed during an execution. Next, the Workload Generator runs each u_i ∈ W on S′. The paths traversed for u_i are denoted as P_i. The set of all P_i, denoted 𝒫, is the output of the Workload Generator.

The workload description can be specified informally, where the developer simply interacts with the instrumented application, or formally, where the sequence of actions is explicitly listed and can be executed by automated Android testing tools, such as MonkeyRunner [10] and Robotium [11]. We only require that the specification mechanism must be able to execute the instrumented version of the application, so the paths traversed by the application are recorded. There is no adequacy criterion for the workload description except that it represents the set of actions that are of interest to the developer. For example, an informal workload description of a video player may specify that the user perform the following actions: (a) start the application, (b) search for a video using the term "eLens at ICSE", (c) play the first video found, (d) replay the video 100 times, and (e) exit the application.

The instrumentation inserted by the Workload Generator records the path traversed through each method of the application. This recording is based on an efficient path profiling technique proposed by Ball and Larus [17]. The Ball-Larus approach assigns weights to edges of a method's control-flow graph (CFG) such that the sum of the edge weights along each unique path through the CFG results in a unique path ID; a single instrumentation variable per method then suffices to record the traversed path for one method invocation. Therefore, each P_i is comprised of a sequence of sub-path tuples, each denoted by ⟨m, id⟩, where m is the method and id is the ID of the path traversed. Our implementation extends the Ball-Larus approach to handle nested method calls, concurrency, and exceptions, as described in Section 2.4.1.

To illustrate the output of the Workload Generator, consider a software artifact with three methods a, b, and c. For this artifact, a possible 𝒫 is {⟨⟨a, 1⟩, ⟨b, 1⟩, ⟨a, 2⟩⟩, ⟨⟨a, 2⟩, ⟨c, 3⟩⟩}, which contains two paths. The first corresponds to a use case in which path 1 of method a is executed, followed by path 1 of method b and then path 2 of method a. The second corresponds to a use case in which path 2 of method a is executed, followed by path 3 of method c.
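To make the edge-weight assignment concrete, the following is a minimal Python sketch of Ball-Larus numbering on a small acyclic CFG. The graph, function names, and values are illustrative assumptions only; the eLens implementation works on Dalvik/Java bytecode and extends the algorithm to nested calls, concurrency, and exceptions as noted above.

```python
# Minimal sketch of Ball-Larus edge-weight assignment on an acyclic CFG.
# The CFG below and all names are hypothetical, for illustration only.

def ball_larus_weights(cfg, entry, exit_node):
    """cfg maps each node to an ordered list of successors (must be a DAG)."""
    num_paths = {}   # node -> number of acyclic paths from node to exit
    weights = {}     # (src, dst) -> edge weight

    def count(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in cfg[v]:
            weights[(v, w)] = total   # paths through earlier successors
            total += count(w)
        num_paths[v] = total
        return total

    count(entry)
    return weights, num_paths[entry]

# Diamond-shaped CFG: entry branches to "then" or "else", both reach exit.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
weights, n_paths = ball_larus_weights(cfg, "entry", "exit")
print(n_paths)   # 2 unique paths, with IDs 0 and 1

# Summing the weights of the edges actually taken yields the path's unique ID;
# eLens records it as a sub-path tuple such as ("a", 1).
path_id = weights[("entry", "else")] + weights[("else", "exit")]
print(path_id)   # 1
```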
2.2.2 Estimating Energy Consumption

The Analyzer computes energy estimates using the path information provided by the Workload Generator and the energy cost functions from a software environment energy profile (SEEP), as shown in Algorithm 1.

Algorithm 1: Estimate Energy Consumption
Input: H: set of hardware components, l: source line, m: method, and P: path
Output: Energy estimate in Joules
1: cost ← 0
2: ⟨π, M, L⟩ ← regenerate(P)
3: D ← propagateDT(π)
4: for all h ∈ H do
5:   for all i ∈ π do
6:     f_h ← powerstate(i, h)
7:     if L(i) = l ∧ M(i) = m then
8:       cost(h) += C(i, h, f_h, D(i))
9: return Σ_{h ∈ H} cost(h)

As input, the Analyzer takes a sequence of sub-paths P ∈ 𝒫, a method m, a source line number l, and the set of hardware components H to be accounted for in the estimation. For each instruction (we use "instruction" to denote both bytecodes and system API calls) specified by l and m, the Analyzer calculates the energy cost using the cost functions defined in the SEEP. The output of the analysis is the energy estimate E, in Joules, of the path, method, or source line. An energy estimate for the entire software artifact can be calculated by summing the estimates for each P_i ∈ 𝒫.

The SEEP defines energy cost functions C() for all instructions. Broadly speaking, there are two dimensions along which instruction energy costs may vary: power state and path-dependent information. Most components, such as the CPU, network, and GPS, have multiple power states: modern smartphone CPUs can operate at different frequencies, which consume different amounts of energy; the GPS may be on or off; and the network may be idle, transmitting/receiving, or have selectively enabled multiple antennas. Thus, C() can depend upon the power state of the corresponding component when a bytecode or API call is executed. Furthermore, some API invocations' energy consumption is based on path-dependent information. For example, sending data over the network incurs energy costs (roughly) proportional to the size of the data transfer. During workload generation, eLens tracks the power states of hardware components, as well as certain types of path-related data. This information is included as one or more arguments to the cost functions defined in the SEEP. Details and examples of path-dependent information are provided below and in Section 2.3.

The first operation performed by the Analyzer is to regenerate the instruction sequence represented by P (line 2 of Algorithm 1). As defined by the Ball-Larus algorithm, given a subpath ⟨m, id⟩, it is possible to regenerate the sequence of instructions that define that subpath. The regenerate function combines each subpath in P to define the complete path (entry to exit) taken during the execution. For nested method calls, where method A calls method B, regenerate calculates the sequence of instructions a_1, a_2, ..., a_n traversed in A and b_1, b_2, ..., b_m traversed in B. If a_k is an invocation of B, then the final sequence is a_1, a_2, ..., a_k, b_1, b_2, ..., b_m, a_{k+1}, ..., a_n. Identifying which subpaths in P were called by other subpaths is straightforward, since P is a sequence and each ⟨m, id⟩ is appended to P after m exits. The final output of regenerate is the tuple ⟨π, M, L⟩, where π is the complete path, M maps each instruction in π to its containing method, and L maps each instruction in π to its source code line number.
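The splicing that regenerate performs for a nested call can be illustrated with a short sketch. This is a hypothetical rendering, not the eLens code; instruction sequences are represented as plain strings for readability.

```python
# Hypothetical sketch of the splicing step in regenerate() for one nested call.

def splice_call(caller_instrs, call_index, callee_instrs):
    """Insert the callee's instruction sequence right after the invoke
    instruction a_k, yielding a_1..a_k, b_1..b_m, a_{k+1}..a_n."""
    k = call_index + 1   # keep the invoke itself, then splice in the callee
    return caller_instrs[:k] + callee_instrs + caller_instrs[k:]

# Path 1 of method A invokes method B at its second instruction.
a_path = ["a1", "a2:invoke B", "a3"]
b_path = ["b1", "b2"]
print(splice_call(a_path, 1, b_path))
# ['a1', 'a2:invoke B', 'b1', 'b2', 'a3']
```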
The energy cost of certain instructions is based on path-dependent information. For example, the cost of a network send instruction depends upon the amount of data sent, and the cost of opening an input stream depends on the stream type. Line 3 of the algorithm propagates argument and type data along π that can be used to identify this kind of information and initializes a function D that relates this information to each instruction. The propagateDT function implements this functionality by simulating data flow along the path π. The instrumentation introduced by the Workload Generator records information, such as the size of data operated on by an instruction or the class implementing an API call, at specific points of the execution. propagateDT then simulates the stack contents along π to track the types and relevant data attributes to the point where they are needed by the energy cost functions. Stack simulation works well in this context because it operates on only one complete path (π) and only a subset of the instructions on the path need to be simulated. In cases where values are manipulated by an uninstrumented function (e.g., a library call), an average value is used for the energy cost functions. More details on the specific types of information tracked are discussed in Section 2.3.

Many path-dependent APIs require specialization of the general approach described above. Due to space constraints, we provide two representative examples of the analysis employed by the propagateDT function. (1) Precisely identifying the class that extends InputStream is difficult because in Android applications these objects often originate from a factory class. To gather this information, the Workload Generator inserts a probe into S′ after calls to specific factory methods to record the implementing class type. This information is then used to annotate the data item in the stack simulation, so that when the InputStream API is called, the implementing class on the stack can be identified and the appropriate cost functions used. (2) Allocation instructions set aside memory space for an array. For example, network buffers may allocate a byte array and define its size using a hard-coded constant of 1024. The propagateDT function tracks loads of constants onto the stack so that when they are popped and used as arguments, the value of the constant can be identified. In the example above, the Analyzer would be able to determine that 1024 is the value of the constant on the stack used to supply the argument to the allocation statement.

As described above, many components have multiple power states. For example, modern CPUs conserve power by changing the CPU frequency in response to high or low utilization. An instruction's power consumption can therefore depend on the frequency of the CPU when it is executed. The function powerstate, used in line 6, maps each instruction i ∈ π to the power state f_h of component h (where h may be the CPU, WiFi, or another component with multiple power states) when i was executed. eLens computes f_h by tracking, during workload generation, when component power-state changes occur. The cost functions for each instruction take f_h and h as arguments. For example, assuming a CPU has two frequency levels, high and low, the CPU cost function for an ldc instruction would return two different energy cost values, e_high or e_low, depending on whether f_CPU reported the frequency as high or low for that instance of the instruction. Each instruction that satisfies the method and line number constraints is added to the total energy cost (lines 7–8).
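The accumulation in lines 4–9 of Algorithm 1 can be rendered in a few lines of Python. This is a simplified sketch under the assumption that the SEEP is available as a cost() callable and that powerstate() returns the recorded power state; all names and the flat cost used in the toy example are invented for illustration.

```python
# Simplified rendering of the accumulation in Algorithm 1.
# Instr, cost(), and powerstate() are illustrative, not the eLens API.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str      # bytecode or API name
    method: str      # containing method (the M map)
    line: int        # source line (the L map)
    data: float = 0  # path-dependent data from propagateDT, e.g. bytes sent

def estimate(pi, components, method, line, powerstate, cost):
    """Sum the per-component cost of every instruction on the complete
    path pi that belongs to the requested method and source line."""
    total = defaultdict(float)
    for i in pi:
        if i.method == method and i.line == line:
            for h in components:
                f_h = powerstate(i, h)             # power state when i ran
                total[h] += cost(i, h, f_h, i.data)
    return sum(total.values())                     # Joules over all components

# Toy usage with a flat hypothetical cost of 1e-6 J per instruction.
pi = [Instr("ldc", "a", 42), Instr("iadd", "a", 42), Instr("ldc", "b", 7)]
joules = estimate(pi, ["CPU"], "a", 42,
                  powerstate=lambda i, h: "high",
                  cost=lambda i, h, f, d: 1e-6)
print(joules)   # 2e-06
```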
The Analyzer calculates an instruction's energy cost for each component of the platform as a function of its type, path-dependent data, and component power state. The Analyzer can be configured to explore different sets of hardware components via the input H.

Figure 2.2: Source line visualization provided by eLens

2.2.3 Energy Annotations

The Source Code Annotator converts the path information and energy estimates into a graphical representation that allows developers to visualize the energy consumption of their software. The representations connect the estimated energy numbers to the implementation structure of the software artifact. The ability to visualize energy usage at the source-line level is a unique feature of eLens. Feedback at this level allows developers to iteratively refine implementation details, such as statements and loops, to improve the overall energy consumption of the software. We have implemented the visualization as an Eclipse plugin, which can present the energy consumption at four granularity levels: whole software, per method, per line of source code, and per path. Note that, for each granularity level, the energy consumption can be shown for some or all use cases in the workload. The mechanism for two of these representations is discussed below; the mechanisms for path and whole-software annotations follow from these.

Per Line of Source Code: For a given source file, the annotator ranks each source line according to its energy cost over all P ∈ 𝒫. The rankings are then mapped to a color spectrum, such as blue to red, and each line of source code is colored based on its position on the spectrum. This results in a SeeSoft-like visualization [25] of the power consumption of the software. Figure 2.2 shows a screenshot of a Java source file with the energy-based colorings.

Method Representation: For a given source file, the Source Code Annotator generates the call graph (CG) of the software artifact. The methods in the CG are ranked according to their energy consumption over all P ∈ 𝒫 and then assigned a color in a spectrum based on their relative use of energy. A method's assigned color and energy value are then used to annotate the corresponding node in the CG.

2.3 Software Environment Energy Profile

The Software Environment Energy Profile (SEEP) provides per-instruction energy cost functions for each component of the target platform. The use of the SEEP allows eLens to analyze energy consumption on multiple platforms simply by providing different SEEPs as input to the Analyzer. We anticipate that a SEEP will be developed and distributed by a platform's manufacturer as part of the platform's software development kit. This method of distribution makes it unnecessary for developers to use complicated or expensive energy-monitoring equipment. Currently, it is not common practice for manufacturers to provide SEEPs, so we discuss below the steps required to develop one.

For each distinct hardware component, the SEEP contains a function that estimates the energy cost of each instruction at each distinct power state of the hardware component. Thus, C_CPU(i, f) denotes the CPU energy cost of instruction i at frequency f, and C_WiFi(i) the WiFi cost. (The latest WiFi standard, 802.11n, supports multiple power states; we have left cost functions for 802.11n as future work.)
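As an illustration of what a SEEP might look like in memory, the sketch below encodes per-instruction cost functions keyed by component and power state. The structure and all joule values are hypothetical assumptions for illustration; an actual SEEP would be produced from measurements such as those described next.

```python
# One possible (hypothetical) in-memory encoding of a SEEP: a table of cost
# functions per hardware component, looked up by instruction and power state.
# All numbers are made up for illustration only.

CPU_COSTS_J = {            # per-instruction cost at each CPU frequency
    ("ldc", "high"): 2.1e-7, ("ldc", "low"): 1.4e-7,
    ("iadd", "high"): 1.8e-7, ("iadd", "low"): 1.2e-7,
}

def c_cpu(instr, freq):
    # Fall back to an average cost for instructions not profiled individually.
    return CPU_COSTS_J.get((instr, freq), 1.5e-7)

def c_wifi(instr, bytes_sent=0):
    # Fixed per-call overhead plus a size-proportional term for sends.
    return 3.0e-4 + 2.0e-8 * bytes_sent if instr == "send" else 0.0

SEEP = {"CPU": c_cpu, "WiFi": c_wifi}

print(SEEP["CPU"]("ldc", "low"))     # 1.4e-07
print(SEEP["WiFi"]("send", 1024))    # approx. 3.2e-04
```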
To estimate these cost functions, we used the LEAP power measurement device [63]. LEAP reports energy-consumption measurements at a fine granularity for each hardware component in the system; its analog-to-digital converter (DAQ) samples each component's current draw at 10 KHz. It contains an ATOM processor that runs Android version 3.2, so we can measure the energy consumption of Android applications. In addition, the LEAP provides Android applications with the ability to trigger a pulse that can be used to correlate application activity with DAQ readings. This capability allows us to time specific sections of profiling code by generating pulses at the beginning and end of the sections whose energy consumption we wish to measure. The DAQ readings are stored in a log and post-processed in order to estimate the total energy consumed between a pair of pulses. These measurements are performed by hardware external to the components being measured.

The energy cost functions for instructions can be broadly grouped into two categories: those with a path-dependent energy cost and those with a path-independent energy cost. The cost functions for the latter can be approximated for each hardware component by knowing only information local to the instructions. The path-dependent instructions require additional information that can only be identified by incorporating information from prior instructions in the executed path. The profiling for both types is explained in more detail below.

2.3.1 Path-Independent Cost Functions

Cost functions for instructions with path-independent costs can be calculated for each hardware component using the type of instruction and the power state (e.g., CPU frequency) of the hardware component at which the instruction was executed. We identified these instructions by analyzing the implementation of the Dalvik Virtual Machine and confirmed our analysis through empirical measurements. To profile individual instructions, we created a set of test scripts, each of which profiles an individual instruction by placing that instruction in a loop that executes 20 million times. This ensures a measurement of sufficiently long duration to exceed the DAQ sampling interval and reduce variance. Each loop is then run ten times on a quiescent system to minimize the impact of cold starts, cache effects, and background processes (such as garbage collection). We subtract the cost of the loop setup and LEAP pulse instructions from the measured cost. Some instructions require the results of executing other instructions. For example, an add instruction requires that two operands be pushed onto the stack. To handle this, a dependency is identified between the test scripts. For instance, if the ldc instruction is used to push operands onto the stack, then the test script for the add instruction includes these bytecodes. When profiling the add instruction, we subtract the cost of the two ldc instructions. We repeated this process for all of the Dalvik bytecodes we identified as having a local cost. Details for different categories of bytecodes are provided below. This list is not intended to be exhaustive, but to illustrate how different categories were handled:

Invocations and returns from application functions: When the target of an invocation is an application function, the target method will already be instrumented by the Workload Generator, so only the overhead of the invocation and return needs to be profiled. In the profiling scripts, it was not possible to isolate an invoke from a return; both are required to execute in pairs.
Therefore, we measured all combinations of invoke and return types and then calculated the cost of each instruction. Values for these instructions varied according to the type of invoke (e.g., virtual or static) and the type of value returned (e.g., void, integer, or float).

Invocations to fixed-cost APIs: Many invocations to system APIs performed operations, such as setting a property, that could be described with a fixed cost. These were profiled differently than regular instructions because of their relatively high execution time, both in terms of the functionality they implemented and argument initialization. We profiled those APIs that had a fixed cost over all hardware components and were called from within one of the applications used in our evaluation. This set included approximately 1,500 unique APIs. For these, we instrumented our application to obtain the energy consumption incurred by each API, and ran our applications several times to obtain a sufficient number of energy consumption samples for each API call. The average of these samples for each API was used as its fixed cost.

Other instructions: The energy costs for loads and stores varied according to the basic type on which they operated (e.g., integer or float). For arithmetic and logic instructions, the energy cost varied based on the operation performed and the basic type on which the operations were performed. Stack management instructions had a fixed cost. Finally, jumps and branches incurred a fixed cost regardless of the type of condition they check.

2.3.2 Path-Dependent Instruction Costs

The energy cost for path-dependent instructions is based not only on hardware power state and instruction type, but also on information that is provided by other instructions in the path, such as the size of the arguments to a network send instruction. In general, we found four categories of path-dependent instructions: allocation instructions, invocations of system APIs whose cost depends on the argument data, invocations of system APIs whose cost depends on the implementing class, and invocations whose cost depends on external data. These types are discussed in more depth below.

Allocation instructions, such as anewarray, cause memory to be set aside for an array of basic types. Our measurements indicate that the cost of allocation statements is linearly proportional to the amount of memory allocated. Therefore, the cost of an allocation instruction is a linear function with respect to the size of the array and the size of the basic type allocated. The size of a basic type (e.g., char, int, and object reference) is known ahead of time, and the runtime instrumentation can record an estimate of the array dimension at runtime, which is then propagated along the path by the propagateDT function described in Section 2.2. For allocation of objects, the new command has a simpler fixed cost function, as it only initializes a pointer; it is followed by an invocation of the object's constructor, which is handled as an invoke to a function.

The energy cost of data-dependent invocations is based on the size of the arguments to the invocation. This type of invocation is often used to access hardware components, such as the network, or to perform data manipulations, such as data sorting. For modeling invocations to hardware components, our profiling was informed by research work that investigated and modeled power consumption of various Android hardware components. We identified salient features of these models whose values could be provided
We identified salient features of these models whose values could be provided 17 via program analysis, and used empirical evaluation of the LEAP node to determine values for hardware- specific constants. Hardware component models were built for the CPU, RAM, WiFi, and GPS. It was not possible to model additional components because they required hardware modification to the LEAP to wire them into the DAQ, a necessary step to verify the accuracy of the models. As we show in Section 2.4, we were able to achieve high accuracy and believe that the process is straightforward to extend to other hardware components as the ability to measure them within the LEAP framework becomes available. To provide the argument data size to the energy cost model, the propagateDT function propagates the size of data structures to data-dependent invocations. The cost incurred by some invocations relies on data or conditions external to the application that cannot be identified using analysis of the path. For example, for an invocation that retrieves a web page or queries a database, there are two associated energy costs: the invocation that makes the request for data and the processing of the response. In our experimentation, we found that the former could be modeled accurately as a function of the response time of the external source of data and the latter as a function of the size of the response. To build the energy cost functions for these invocations, we instrumented all invocations that made external data requests to record the name of the data requested (e.g., URL, database query, or filename), response time, and response size. We used this information to build a map from the data name to its response attributes. Then, when computing the cost of an invocation that depends on external data, we used the map to look up the name of the requested data and provided the response time as an argument to the invocation’s energy cost function. When a subsequent invocation occurs that processes or iterates over the returned data, the size of the response is supplied as an argument to this invocation’s cost function. As with the previous types of invocations, the response attributes were propagated to the invocation using the propagateDT function. We used this methodology to estimate the cost of database calls to the Android SQLite database and invocations to theHttpMessage interface. Invocations of some methods can vary significantly based on the class that provides the method imple- mentation. For example, the cost of accessing the member functions of the abstract classInputStream 18 Table 2.1: Subject applications App C M BC Description BBC Reader 590 4923 293910 RSS reader for BBC news Bubble Blaster II 932 6060 398437 Game to blast bubbles Classic Alchemy 751 4434 467099 Game to combine chemical elements Location 428 3179 232898 Provide location with PL2303 dirver Skyfire 684 3976 274196 Web-browser Textgram 632 5315 244940 Text editor will depend on whether its implementation is provided by a network-based class, such asChunkedInput- Stream, or a file-based class, such asFileInputStream. To handle this type of invocation, we used manual analysis and empirical measurements to identify the methods whose energy varied due to differ- ences in their implementing class. We analyzed each such method and its implementation to determine whether a model could be built based on simply profiling the method or whether a more complex model, based on argument size or external data, needed to be constructed. 
2.4 Evaluation

In this section, we empirically evaluate eLens by measuring the accuracy of its energy estimates, illustrating the usefulness of our approach by evaluating whether time profiling would have been an effective substitute for energy profiling, and demonstrating the usability of eLens in an interactive development setting by measuring its run time. We conclude with a case study that demonstrates how eLens can be used to understand how applications consume energy.

2.4.1 Methodology

Our evaluation is based on an implementation of eLens that estimates the energy usage of unmodified Android applications from the Google Play store. As input, eLens takes the implementation of an application in Dalvik bytecode and uses the dex2jar tool (http://code.google.com/p/dex2jar/) to convert it into Java bytecode. The Java bytecode version is then provided as input to the Workload Generator and Source Code Annotator. The output of eLens is a visualization and reports on the estimated energy consumption of the application.

The Workload Generator uses the BCEL instrumentation library (http://commons.apache.org/bcel/) to add path-profiling instrumentation (as discussed in Section 2.2) and to collect the data and type information needed for the SEEP. In our implementation, we identify and discard paths that result in exceptions, since their catch blocks may cause control flow to jump outside of the method, a behavior for which the Ball-Larus algorithm is undefined. Overall, this resulted in less than 0.01% of paths being discarded. Concurrency is handled by using the thread's ID to identify the counter that must be updated to track the path ID. After instrumentation, the application is compiled from Java bytecode back to Dalvik bytecode using the standard Android dx tool and deployed to the LEAP platform. The use cases were run manually by interacting with the application while it was deployed on the LEAP platform.

The algorithms for the Analyzer and Source Code Annotator were implemented as discussed in Section 2.2.2 and Section 2.2.3. The Analyzer uses the SEEP that we built as specified in Section 2.3. The Source Code Annotator was built as an Eclipse plugin and can display energy consumption at the granularity of the whole application, method, path, or source line. Screenshots of the plugin are shown in Section 2.2.3.

Subject Applications: Table 2.1 shows the set of subject applications that we used in our empirical evaluation. For each application, the table shows the number of classes defined in the implementation (C), the total number of methods across all of those classes (M), the total number of bytecodes (BC), and a brief description of the application's functionality. We used total bytecode count instead of source lines because the apps were downloaded from the Google Play market and source code was not provided as part of the distribution.

Table 2.1: Subject applications

App               | C   | M    | BC     | Description
BBC Reader        | 590 | 4923 | 293910 | RSS reader for BBC news
Bubble Blaster II | 932 | 6060 | 398437 | Game to blast bubbles
Classic Alchemy   | 751 | 4434 | 467099 | Game to combine chemical elements
Location          | 428 | 3179 | 232898 | Provide location with PL2303 driver
Skyfire           | 684 | 3976 | 274196 | Web browser
Textgram          | 632 | 5315 | 244940 | Text editor

We selected the applications based on three criteria: (1) diversity of provided functionality, (2) ability to convert the Dalvik bytecode to Java bytecode, and (3) ability of the application to run on the LEAP platform. The last two criteria curtailed the number of applications available for us to experiment with. dex2jar is not yet fully mature and is sometimes unable to completely translate Dalvik to Java bytecode.
Furthermore, many applications use native libraries that are not available for the LEAP's x86 processor.

2.4.2 Accuracy of eLens

We first compare the accuracy of the estimates produced by eLens against the ground truth (GT) measured by the LEAP platform. To do this, we provide each application as input to eLens. This generates an instrumented version that is deployed to the LEAP platform, where we interact with each application to exercise its most prominent features. During the execution, the LEAP platform measures power consumption across all of the hardware components. After the execution, the Analyzer computes energy estimates, which are summarized and reported by the Source Code Annotator. We compared the measured GT against the eLens estimates at both the method and whole-software level. As explained below, it was not possible to calculate GT at the source-line level.

We also compare eLens to two other plausible strategies for approximating the energy consumption of mobile applications. The average-bytecode strategy does not, unlike eLens, use per-instruction energy cost functions, but assumes that each instruction has a uniform cost that can be calculated by averaging the cost of all instructions over all runs of the application. The no-path-sensitivity strategy does not, unlike eLens, account for the specific paths traversed in a method, but estimates energy based on the number of times a method is called multiplied by the cost of all of the method's bytecodes (using per-instruction energy costs). These strategies represent the results that could be achieved by a developer with a method profiler (e.g., gprof) and a power measurement device, but each lacks one important capability of eLens (per-instruction energy modeling and program analysis, respectively).

The GT is calculated by taking hardware-level measurements on the LEAP platform while the application is running. However, three challenges preclude a straightforward measurement and require a more complex process for establishing GT. First, the applications can pause while waiting for user input or a data response. Although the application is not executing, the device still consumes energy, which should not be counted toward an application's GT. Although this idle time is imperceptible to humans, our measurements show that it dominates the total execution time of an application and must be excluded so that the GT is not dominated by periods when the application is idle. Second, even when measuring application energy during periods when the app is not idle, the LEAP (or any power monitor) cannot distinguish whether the measured energy was expended by the application or by background processes. Third, the LEAP samples at 10 KHz, so in theory it can only capture the energy usage of methods whose execution time exceeds 0.1 ms. In practice, we found that reliable energy estimates can only be obtained for functions that run for at least 10 ms. This meant that GT for many methods could not be measured and that it was not possible to measure the energy consumed by a specific source line of code.

All three of these challenges were addressed by our experimental methodology. To address idle time, we identified and timestamped APIs where the applications could block and idle. Unless other threads in the application were executing during the blocked time, we counted this as idle time and subtracted it from the GT. To ensure that measured energy was accurately attributed, we performed the GT measurements on a quiescent system.
As mentioned in Section 2.3, our technique does not account for garbage collection or process switching, so we identified points during the execution when these occurred and excluded the energy consumed along the affected subpaths from both the energy calculation and the GT total. This represented only 0.05% of the total paths traversed in our experiments. Lastly, to account for sampling frequency, we only conducted accuracy experiments at the whole-program and method level. Only methods that ran for more than 10ms at a time were included in the evaluation. Note that this does not mean eLens cannot be used for source line estimates, only that the LEAP platform could not provide GT to evaluate the accuracy of these estimates.

eLens excludes waiting-time energy because it counts only instructions executed by the application, and it is able to estimate the energy usage of arbitrarily small functions because it uses profiled costs of bytecode energy usage. For the same reason, eLens can isolate the energy usage of application code. These are significant advantages of our approach.

Figure 2.3 shows the accuracy at the level of granularity of the whole application; our subject applications are shown along the X-axis. Figure 2.4 shows accuracy at the method level for those methods whose running time exceeded 10ms. Note that, across all applications, only six methods executed long enough to obtain an accurate GT measurement. For each application or method, the bars show on a logarithmic scale the average estimation error (compared against GT) of ten runs reported by eLens and the two reference techniques, average bytecode and no path sensitivity. Each bar also shows one standard deviation above and below the average estimation error.

[Figure 2.3: Whole program accuracy — estimation error (%) on a logarithmic scale for eLens, Average Bytecode, and No Path Sensitivity, per application (Textgram, Bubble, Alchemy, Skyfire, BBC, Location)]

[Figure 2.4: Method-level accuracy — estimation error (%) on a logarithmic scale for eLens, Average Bytecode, and No Path Sensitivity, per method (run, load, onCreate, a, onActionCycle)]

The results show that eLens is able to calculate energy estimates with high accuracy at both the whole-program and method level. For the subject applications, eLens' estimation error at the whole-program level was below 10% across all applications, with an overall average of 8.8% (std. deviation of 3%), and from 7.2% to 10% at the method level, with an overall average of 7.1% (std. deviation of 3.6%). Furthermore, eLens is able to accurately break down application energy usage by hardware component (Table 2.2); its energy estimation errors for all hardware components are within 12%. Note that Location was the only app that used GPS. This is highly encouraging, and suggests that eLens can be a viable approach for exploring the energy usage of mobile applications.

Table 2.2: Component-level accuracy (estimation error, %)
App                  CPU     RAM    WiFi    GPS
BBC Reader          -6.2     5.9    -6.8      -
Bubble Blaster II  -11.5     3.5   -11.6      -
Classic Alchemy     -7.9    -6.9    -4.4      -
Location            -7.8    -8.4       -    8.1
Skyfire             -7.9     0.9    -8.4      -
Textgram             5.2     4.6     4.6      -

In comparison, the two other plausible strategies are inaccurate by an order of magnitude or more. Specifically, the average estimation error for average-bytecode at the whole-program level was 133%, and for no-path-sensitivity it was 267%. To understand why average-bytecode is inaccurate, we plotted a distribution of bytecode energy costs.
This distribution, omitted for brevity, is highly skewed, with a small number of instructions using more than an order of magnitude more energy than the rest. Thus, an average bytecode cost can either inflate the energy estimate for a program that does not use these expensive instructions, or underestimate energy usage for programs that do. Moreover, our results also indicate that path sensitivity is crucial for capturing the energy usage of applications and methods. This is because our subject applications are large and have a large number of potential paths that may be explored during an execution. Overall, these results provide a compelling demonstration of the accuracy of eLens, of its ability to provide energy estimates at a granularity that is beyond the reach of hardware power monitors, and of the importance of the two pillars of its design (per-instruction energy modeling and path sensitivity).

2.4.3 Why Do We Need Energy Profilers?

Traditionally, developers use execution times obtained from a method profiler to identify the methods on which to focus their optimization effort. In this section, we show that execution time may not be a good proxy for identifying energy-inefficient segments of code and that specialized energy profilers like eLens are necessary.

To demonstrate this, we first calculated the execution time and energy estimate of each method of each application. To obtain the execution time, we profiled each method using timestamps, and we calculated the corresponding energy estimate for the method using eLens. We then compared the information in two different ways to evaluate whether time is, indeed, a reasonable proxy for energy cost.

Correlation: We first determined whether there is a linear correlation between the execution time of a method and its estimated total energy (across all components). We do this by calculating the Pearson correlation coefficient of the two series. Values of the coefficient closer to 1 or -1 indicate that the two series have a strong (positive or negative) linear relationship, and values closer to zero indicate that they are uncorrelated. The Pearson coefficients (r in Table 2.3) are nearly zero across all applications, indicating that there is almost no linear correlation between execution time and energy usage.

Table 2.3: Time vs. Energy
App                    r     cos
BBC Reader             0    0.21
Bubble Blaster II      0    0.01
Classic Alchemy        0    0.13
Location               0    0.17
Skyfire                0    0.69
Textgram               0    0.05

Ranking similarity: We then considered that, even if energy and time were not linearly related, the relative rankings by the two metrics might provide useful guidance. So we measured the similarity of the rankings by calculating their cosine similarity, a technique used to measure the similarity of two vectors in n-dimensional space. In this case, we defined two vectors v_0, v_1, ..., v_|methods|, where each v_i was defined for one vector as method i's energy rank and for the other as method i's execution time rank. The cosine similarity ranges from -1 to 1, with -1 denoting the exact opposite ranking, 0 denoting independent rankings, and 1 denoting the same ranking. Table 2.3 shows cosine similarity values closer to 0 than to -1 or 1 for all but one application, Skyfire. This strongly suggests that, for most applications, time and energy are almost independent.
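For concreteness, the sketch below shows one way to compute the two comparisons just described from per-method time and energy series. The class name, data layout, and example values are illustrative assumptions, not part of eLens; the text also does not specify the exact rank encoding, so the cosine computation here simply uses the raw ranks 1..n.

    import java.util.*;

    // Illustrative sketch: compares per-method execution time and energy
    // using the two metrics described above. Inputs are parallel arrays,
    // one entry per method; names and data are hypothetical.
    public class TimeVsEnergy {

        // Pearson correlation coefficient r of two equal-length series.
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double mx = Arrays.stream(x).average().orElse(0);
            double my = Arrays.stream(y).average().orElse(0);
            double cov = 0, vx = 0, vy = 0;
            for (int i = 0; i < n; i++) {
                cov += (x[i] - mx) * (y[i] - my);
                vx  += (x[i] - mx) * (x[i] - mx);
                vy  += (y[i] - my) * (y[i] - my);
            }
            return cov / Math.sqrt(vx * vy);
        }

        // Rank of each value within its series (1 = largest).
        static double[] ranks(double[] v) {
            Integer[] idx = new Integer[v.length];
            for (int i = 0; i < v.length; i++) idx[i] = i;
            Arrays.sort(idx, (a, b) -> Double.compare(v[b], v[a]));
            double[] r = new double[v.length];
            for (int pos = 0; pos < idx.length; pos++) r[idx[pos]] = pos + 1;
            return r;
        }

        // Cosine similarity of the two rank vectors.
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            double[] timeMs  = { 12.0, 3.5, 40.2, 7.8 };   // per-method execution time
            double[] energyJ = { 0.9, 2.4, 0.3, 1.1 };     // per-method energy estimate
            System.out.println("r   = " + pearson(timeMs, energyJ));
            System.out.println("cos = " + cosine(ranks(timeMs), ranks(energyJ)));
        }
    }

One caveat on this literal reading: with strictly positive rank values the cosine cannot actually reach -1, so realizing the full [-1, 1] range described above would require, for example, mean-centering the rank vectors first.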
There are at least two reasons why time and energy are uncorrelated. The first is that many hardware components have multiple power states. Two different methods can take the same time to execute on a CPU, but when one of them is executing the CPU may be at frequency f1, while when the other is executing it may be at f2. If f2 > f1, the energy consumed by the latter will be more than that consumed by the former. The second explanation lies in the asynchronous design of system and API calls. When an application sends data over the network, that data is buffered by the operating system, so the application may not be charged the time taken to transmit the data. However, eLens will accurately account for the energy cost of transmitting the data, since it profiles the send API call.

These results demonstrate the importance of an approach like eLens, which can guide energy optimizations more accurately than method profilers based on execution time.

2.4.4 Analysis Time

In this section, we evaluate one aspect of the usability of eLens, namely whether it is fast enough to be used during software development. We measured for each application, over a series of executions, the time the Workload Generator took to instrument the application (T_Inst) and the time needed by the Analyzer to calculate the energy estimate given the output of the Workload Generator (T_Est). Table 2.4 shows these two measures for each application.

Table 2.4: Analysis time
App                  T_Inst (sec)   T_Est (sec)
BBC Reader                344            16
Bubble Blaster II         450            17
Classic Alchemy           886            17
Location                  274            10
Skyfire                   258             8
Textgram                  269             6

As the results show, the instrumentation time ranged from about 4 to 14 minutes and the analysis time ranged from 6 to 17 seconds. In practice, instrumentation time would be much lower because, after each developer iteration, it would only be necessary to reinstrument the changed classes as opposed to all classes (our numbers report the latter). The analysis time is fairly low in comparison and would not hinder usability during development.

We did not measure the overhead introduced at runtime by our instrumentation because our method of executing use cases was to manually interact with the application, and we could not control for normal human variations when interacting with the instrumented and non-instrumented versions of the application. Anecdotally, the runtime overhead was imperceptible to users interacting with the instrumented application. Using the additional estimated energy consumed by our instrumentation as a proxy, we estimate that the runtime overhead ranged from 0.2% to 7.2% across the applications.

2.4.5 Using eLens to Compare Applications

We conclude by illustrating how eLens can be used to compare application energy usage. We have left a more complete study of application energy characteristics to future work, but this section begins to explore how eLens can be used to understand where energy is expended in an application, and how that differs across applications.

To study application energy usage, we calculated, for each application, the top 5 energy hotspots. The bar graph of Figure 2.5 plots the subject applications on the x-axis and, for each application, the fraction of total application energy consumed by the top 5 hotspots, in order from left to right. As this figure shows, applications vary widely in how energy usage is distributed. The fraction of total energy that can be attributed to the top 5 hotspots varies from about 10% for Bubble Blaster to an astounding 85% for Skyfire.

[Figure 2.5: Top five energy hotspots — fraction of total application energy (%) consumed by each application's top five hotspots]
Moreover, as the figure shows, even among the top 5 hotspots, energy consumption can be significantly skewed; in Skyfire, a single hotspot uses over 80% of application energy! Less evident from this figure is an understanding of the causes of high energy usage, but an exami- nation of the top 5 hotspots (listing omitted for space reasons) reveals interesting insights. In Skyfire, an Android library call for HTTP downloads consumes the most energy, but in the BBC Reader, the top 5 hotspots include API calls for downloading and rendering content as well as downloading advertisements. In Textgram as well as in the two games, appropriately enough, graphics computations dominate the top 5 hotspots. In these cases, an included package or library consumed a significant amount of energy; in response to this, an application developer might either choose to optimize the library implementation or re-implement the functionality provided by the library in a more energy-efficient manner. 2.5 Conclusion We have presented a new technique, eLens, for estimating energy consumption of applications written for Android mobile devices. eLens brings together two ideas, per-instruction energy modeling, and program analysis, in order to accurately, and without requiring power measurement hardware, estimate application energy usage at the level of granularity of the whole application, method, path, or source code line. An evaluation of eLens on six marketplace applications shows that its energy estimates are accurate to within 10%, and its run time is acceptable. Moreover, eLens can reveal insights about energy usage across dif- ferent applications, and its energy estimates are uncorrelated with execution time, suggesting that method profilers may not help in optimizing applications for energy use. Overall, the results of the evaluation were very positive and indicate that eLens is an accurate, fast, and useful technique for estimating energy consumption. 28 Chapter 3 SIF: A Selective Instrumentation Framework for Mobile Apps Mobile app ecosystems have experienced tremendous growth in the last six years. As researchers and developers turn their attention to understanding the ecosystem and its different apps, instrumentation of mobile apps is a much needed emerging capability. In this project, we explore a selective instrumentation capability that allows users to express instrumentation specifications at a high level of abstraction; these specifications are then used to automatically insert instrumentation into binaries. The challenge in our work is to develop expressive abstractions for instrumentation that can also be implemented efficiently. Designed using requirements derived from recent research that has used instrumented apps, our selec- tive instrumentation framework, SIF, contains abstractions that allow users to compactly express precisely which parts of the app need to be instrumented. It also contains a novel path inspection capability, and provides users feedback on the approximate overhead of the instrumentation specification. Using experi- ments on our SIF implementation for Android, we show that SIF can be used to compactly (in 20-30 lines of code in most cases) specify instrumentation tasks previously reported in the literature. SIF’s overhead is under 2% in most cases, and its instrumentation overhead feedback is within 15% in many cases. As such, we expect that SIF can accelerate studies of the mobile app ecosystem. 
29 3.1 Introduction Mobile app ecosystems, such as the iPhone App Store and Google Play, have experienced tremendous growth in the last six years. Relative to ecosystems for desktop applications, mobile device app ecosystems are fast growing and have a large number of users, an evolving base of smartphone and tablet platforms, a large number of contributors and developers, as well as a wide range of functionality made possible by ubiquitous Internet access and the availability of various kinds of sensors (GPS, cameras etc.). These factors, together with rapid growth in the use of mobile devices, have sparked an interest in understanding the properties of mobile apps. Recent research has developed methods to study performance properties [66, 84] and security properties [39, 58, 68, 86] of mobile apps. A common thread through this line of research is instrumentation: each work has developed customized ways to insert instrumentation for studying app behavior. More generally, app instrumentation is a crucial emerging capability that will facilitate future studies of the mobile app ecosystem. Traditionally, instrumentation frameworks for programming languages have permitted some degree of flexibility in instrumenting software (Section 5.2). However, these are insufficient for mobile apps which rely on concurrency, event handling, access to sensors, and (on some mobile platforms) resource usage permissions integrated with the app. These differences, together with the constraints of mobile devices, motivate the need for an instrumentation framework with qualitatively different requirements from that considered in prior work. A careful analysis of prior research that has used custom instrumentation reveals several interesting requirements of an instrumentation framework for mobile apps (Section 3.2). We find that the framework must permit selective instrumentation since the processing constraints on mobile devices preclude perva- sive instrumentation. Furthermore, this capability must permit arbitrary user-level instrumentation that can alter the functionality of the app and not just measure performance. Moreover, the instrumentation frame- work must permit path inspection between specified codepoints, a capability motivated by device access 30 control capabilities in some mobile OSs. Finally, because user-level instrumentation can add significant overhead, the framework must be able to accurately estimate the overhead of the specified instrumentation. We describe the design and implementation of a Selective Instrumentation Framework (SIF) for mobile apps that satisfies these novel requirements (Section 3.3). Our first contribution is identifying the smallest set of instrumentation primitives that permit expressivity while admitting efficient implementation. SIF allows users to specify instrumentation locations using codepoint sets (collections of locations in the code) that can be selected at various levels of granularity from class hierarchy specifications down to individual bytecodes, and then specify user-defined instrumentation for each set. It also defines a path set abstraction, which allows users to dynamically trace inter-procedural paths between two arbitrary codepoints in the app. This capability is novel in an instrumentation framework, and can be used to explore privacy leakage and permissions violations in mobile apps. Taken together, these two abstractions can be used to express all instrumentation tasks considered in the literature. 
A second contribution is SIF’s use of static and dynamic program analysis to derive instrumentation locations, minimize instrumentation overhead, and estimate instrumentation cost. In particular, implementing the path set abstraction requires sophisticated stitching of intra-procedural path segments to derive inter-procedural paths. SIF’s abstractions and implementation methods are, for the most part, independent of the underlying mobile app ecosystem, but we have implementedSIF for the Android platform. Using this implementation, we have evaluated SIF’s expressivity and efficiency (Section 3.4). We demonstrate that SIF’s abstractions can express many of the instrumentation tasks previously proposed in the literature, as well as other com- mon tasks. Moreover, the SIF specifications are compact, requiring fewer than 100 lines of code even for the most complicated instrumentation tasks. Finally, SIF can often reduce instrumentation cost signifi- cantly, requires less than half a minute to instrument binaries, and provides accurate (within 15%) overhead feedback in many cases. While much work remains (Section 3.5), we believe that SIF can accelerate studies of the mobile app ecosystem and lead to an improved understanding of app behavior and usage. 31 3.2 Background and Motivation In this section, we motivate the need for an instrumentation framework for mobile apps, and articulate the unique requirements posed by these apps. We then describe the challenges associated with satisfying these requirements; this discussion lays the groundwork for the design of SIF, described in Section 3.3. Instrumentation Frameworks. Instrumentation refers to the process of inserting code into an application, often by an entity (software or user) other than the original developer. An instrumentation framework is a software system that allows an entity to insert instrumentation at specific points in a program. In traditional software systems, instrumentation frameworks are widely used for a variety of tasks [42, 47], but, as we discuss in Section 5.2, these do not satisfy one or more of the requirements of mobile app instrumentation that we identify below. Instrumentation frameworks are generally based on one of three different mechanisms. The first mech- anism is to instrument the source code, an approach which requires source to be available to the instru- mentor. A more general mechanism instruments the runtime system responsible for program execution; for example, an instrumented operating system or virtual machine can record every executed method. The drawback to this is that a customized runtime system must be developed for every platform on which an entity will want to perform instrumentation. Furthermore, once developed, it can be very difficult to modify the runtime system instrumentation. The third mechanism, and the one we choose, is binary instrumentation, in which instrumentation code is directly inserted into the compiled binary or bytecodes. This does not require source code and is more portable and flexible than customized runtime systems. More broadly, the use of binary instrumentation also enables users to instrument and analyze apps after they have been released. This is an especially important capability in the mobile app ecosystem because its growth has spurred a number of independent efforts in understanding the performance and behavior of mobile apps. For example, AppInsight [66] de- veloped a way to instrument apps for a specific purpose, namely, critical path monitoring. 
The code for doing this instrumentation was done manually, and was targeted for the purpose of critical path monitoring. 32 In our work, we seek to provide a programming framework that can specify, at a high-level, the instrumen- tation required for AppInsight and other tasks, leaving the task of generating the low-level instrumentation to a compiler. Framework Requirements. We conducted a survey of recent research work that has developed cus- tomized instrumentation, and have used these to develop a set of requirements for a binary instrumentation framework. We discuss three specific examples; Section 3.4 presents a more comprehensive discussion of these pieces of research. Many mobile apps, in response to a user action, perform multiple concurrent operations, and the user- perceived latency is dominated by the critical path (the concurrent operation which takes the most time) through the code. AppInsight [66] has attempted to develop general methods to instrument apps for critical path analysis. Some existing mobile operating systems provide coarse-grain access control to sensors and other system facilities: e.g., Android requires app developers to explicitly require permission to access the network or GPS. Researchers [58] have developed methods to instrument apps to enable more fine-grained permissions checking: e.g., preventing third-party libraries, often used to develop appli- cations, from using the permitted resources. We have been developing a sensor auditing capability to instrument an app to understand what pro- cessing it performs on the sensor (e.g., GPS or camera), and whether, after acquiring location sensor readings, the app uploads the sensor readings to a website. Many of these studies are motivated by novel features of mobile app platforms: concurrent execution and event handling, per-app restrictions on resource usage, and the availability of novel sensors. These studies drive the requirements for our binary instrumentation framework. A strawman approach to solving these problems is to instrument every method call or execution path. However, this can incur significant overhead on modern smartphones, to the point where app usability can be impacted. In some preliminary experiments, we have observed up to 2.5x greater CPU usage when using Android’s Trace- view [13] to instrument every method and system API invocation. Accordingly, the first requirement of a 33 binary instrumentation framework for mobile apps is selectivity: users should be able to instrument only the code of interest to their study. To support AppInsight and the fine-grained permissions study, our framework needs to provide selec- tivity by allowing users to flexibly specify locations for inserting instrumentation. Specifically, in these studies, the authors inserted instrumentation at specific points in the program: event handlers, API calls with certain permission capabilities, etc. Many instrumentation frameworks permit selectivity of method calls or APIs. Our sensor auditing study motivates another requirement for a binary instrumentation framework that prior work lacks (Sec- tion 5.2), the ability to inspect dynamic execution paths. This capability would allow a user to determine which code paths were traversed between two points in the code, and examine what transformations might have been done on data along these paths. Instrumentation frameworks differ in the kinds of instrumentation they allow a user to insert. 
To support AppInsight, a binary instrumentation framework that provides basic instrumentation primitives (such as timing or counting procedure invocations) would suffice. However, the fine-grained permissions study alters the functionality of an app. To support instrumentation for functional modifications, a binary instrumentation framework must allow arbitrary user-specified instrumentation, since it cannot anticipate the kinds of instrumentation that might be needed. Finally, efficiency is an important requirement of instrumentation frameworks; the instrumentation overhead must be minimal and must preserve the usability of the mobile app. This is particularly difficult to achieve in a framework which permits user-specified instrumentation, since the framework has no control over the complexity of that instrumentation. Accordingly, we add one additional requirement for binary app instrumentation, overhead feedback. If the instrumentation framework can estimate the overhead of user-specified instrumentation, users can quickly adapt the instrumentation (e.g., by being more selective) in order to reduce the overhead, without actually needing to run the instrumented binary. These requirements raise significant research questions and challenges. What are the appropriate ab- stractions for specifying where instrumentation should be inserted? This is particularly challenging for 34 path inspection since some apps are highly complex and contain several million distinct paths (a static analysis of the code paths involved in composing email using the Gmail app reveals nearly 0.15 million path segments). Additionally, how do we provide a flexible mechanism for allowing the user to provide any instrumentation without introducing extra overhead and complications? Finally, how do we minimize and report guidance on overhead in a way that can help users? In particular, how do we accurately predict the overheard of arbitrary instrumentation? Our SIF instrumentation framework provides functionality to meet all of these challenges. SIF provides a domain-specific programming language and support libraries that allow users to selectively instrument an app with arbitrary user-specified code along any path or codepoint based location. The framework uses sophisticated program analysis techniques to introduce minimal overhead during instrumentation and provide overhead feedback for the user-specified instrumentation. We describe how SIF provides all of this functionality in the next section. 3.3 A Selective Instrumentation Framework In this section, we describe SIF, our binary instrumentation framework for mobile apps that satisfies the requirements listed in the previous section. We begin with an overview that describes how a user interacts with SIF and the instrumentation workflow within SIF. We then discuss the instrumentation specification language abstractions, and describe how we overcome some of the challenges listed in Section 3.2. We conclude the section by describing our implementation of SIF for Android. 3.3.1 Overview of SIF Figure 3.1 describes the overall workflow for SIF. A user provides three pieces of information as input to SIF. The first is the original app binary to be instrumented. The second is the user-specified instrumentation code, written in a language called SIFScript 1 . We say that a SIFScript codifies an instrumentation task. The 1 In what follows, we will useSIFScript to denote both the language and the specification program; the usage will be clear from the context. 
35 Figure 3.1: Overview of SIF third input to SIF is a workload description. Intuitively, a workload description captures the app use-cases that the user is interested in instrumenting. For example, in the critical path analysis example above, the user may be interested in knowing the user-perceived latency for posting to Facebook. This use- case (posting to Facebook) is encapsulated in a workload description obtained from a workload generator (Figure 3.1). We describe later how a user provides a workload descriptor. The workload description is used by SIF to provide accurate overhead feedback, as described in Section 3.3.3. In the first step of SIF’s workflow, the instrumenter component interprets the SIFScript specification and generates an instrumented version of the app. This instrumenter realizes the user-level specification and path inspection capabilities in SIF by inserting the user-specified instrumentation code at the appropriate locations. The instrumenter also outputs some additional metadata used in later stages. In our current instantiation of SIF, all instrumentation output is stored locally on the mobile device, then extracted for post-processing. In future work, we plan to explore automatic export of instrumentation output to a cloud server, a capability that can enable large-scale debugging and app analytics in the wild. The metadata generated by the instrumenter, together with the workload information input by the user, is fed to an overhead estimator. That component calculates the impact of the instrumentation for the given workload description. Impact may be measured in terms of the extra execution time or additional resource 36 usage (e.g., CPU cycles, memory, energy) incurred as a result of the instrumentation. If the estimated impact is unacceptable, users can refine their instrumentation specifications. When the instrumented application is run, the instrumentation outputs either log data generated by SIF defaults or the data collected by the user-specified instrumentation. An example of the latter is execution timings generated by user-specified instrumentation. In addition, SIF produces output whenever the user employs its path inspection capability. This output is an intermediate description of paths traversed; a SIF module called the path stitcher component is automatically invoked on this output to generate user readable path information. In the remainder of this section, we describe the components of SIF. Before we do so, a word about the potential users of SIF. In our view, SIF is an instrumentation tool at an intermediate level of abstraction. It is intended for an expert user, such as a researcher or a software engineer, who understands the app code and/or the mobile OS API well, and who might, without SIF, have manually instrumented apps for whatever task he/she is interested in (or developing custom software for this instrumentation). It can be made more broadly available to other users by adding front-end code that provides a higher-level of abstraction: for example, a security researcher can make available a web page which takes a binary and instruments it for some purpose (say to block ads), and users wishing an ad-free version of their app can upload a binary, retrieve the instrumented binary and run it. 
Finally, it is not unreasonable to expect developers to use SIF even when they have access to app source code: instrumentation is a programming concern that is separable from application logic, and a tool like SIF, which allows developers to treat instrumentation as a separate concern rather than having to weave instrumentation into application logic, might be helpful in many cases.

3.3.2 The SIFScript Language

Our first design choice for SIF was to either define a new domain-specific language for SIFScript or to realize SIFScript as an extension to an existing language. A new language is more general since it can be compiled to run on multiple mobile platforms, but it also incurs a higher learning curve. Instead, we chose to instantiate SIFScript as a Java extension. This design option has the advantage of familiarity, but may limit SIF's applicability to some mobile OS platforms. However, we emphasize that the abstractions and the underlying instrumentation methods based on program analysis are independent of specific mobile app programming platforms, and are extensible to multiple platforms.

The next design challenge for SIF was to identify abstractions that provided sufficient expressivity and enabled a variety of instrumentation tasks. In addressing this challenge, we were guided by the requirements identified in Section 3.2 and the instrumentation tasks described in Section 3.4. An instrumentation specification language should permit instrumenting code according to different attributes, such as method invocations, specific bytecodes, or classes. The language should also allow for combining these attributes in different ways to build up sophisticated instrumentation specifications.

To permit maximum flexibility and cover the use cases discussed in Section 3.2 and Section 3.4, SIFScript incorporates two qualitatively different instrumentation abstractions, codepoint sets and path sets. These abstractions specify where in a binary program the user wishes to insert instrumentation.

[Figure 3.2: Operations on SIF abstractions]

Codepoint Set: This abstraction encapsulates a set of instructions (e.g., bytecodes or invocations of arbitrary functions) in the binary program that share one or more attributes. For example, a user might define a codepoint set that consists of all invocations to a specified library (we discuss other attributes below).

 1  class TimingProfiler implements SIFTask {
 2    public void run() {
 3      CPFinder.setBytecode("invoke.*", ".*native.*");
 4      UserCode code;
 5      for (CP cp in CPFinder.apply()) {
 6        code = new UserCode("Logger", "start", CPARG);
 7        Instrumenter.place(code, BEFORE, cp);
 8        code = new UserCode("Logger", "end", CPARG);
 9        Instrumenter.place(code, AFTER, cp);
10      }
11    }
12  }
13  class Logger {
14    private static Map map = new HashMap();
15    public static void start(int mid, int pos) {
16      long id = Thread.currentThread().getId();
17      String k = mid + "," + pos + "," + id;
18      long start = System.nanoTime();
19      map.put(k, start);
20    }
21    public static void end(int mid, int pos) {
22      long end = System.nanoTime();
23      long id = Thread.currentThread().getId();
24      String k = mid + "," + pos + "," + id;
25      long start = map.get(k);
26      Log.v(TAG, k + "," + (end - start));
27    }
28  }
Listing 3.1: Timing profiler for native invokes

Path Set: This abstraction encapsulates the set of dynamically traversed paths that satisfy a user-specified constraint.
Currently, SIF supports two forms of constraints: paths traversing any codepoint in a codepoint set, or paths containing at least one codepoint from each of two or more codepoint sets. Figure 3.2 documents the operations on these abstractions, which are discussed in greater detail below.

3.3.2.1 Codepoint Sets

We now discuss the semantics of SIFScript abstractions using a simple example instrumentation task. Listing 3.1 shows the complete SIFScript listing of a timing profiler, which selectively profiles the execution time of native code invocations. Modern smartphone OSs (e.g., iOS and Android) permit apps to implement part of their functionality at a lower level in native code (usually C) for performance reasons. Native code is used in many apps such as browsers, video display, and gaming.

SIFScript allows users to specify instrumentation tasks by defining separate classes for each task; each instrumentation task inherits from a SIFTask base class. Users can select arbitrary codepoints, define user-specified instrumentation, and specify where to place instrumentation. Line 3 of Listing 3.1 is an example of a SIFScript construct for selecting codepoints. CPFinder is a class that provides methods to specify codepoint attributes and iterate over the identified codepoint sets. In line 3, the setBytecode() method selects all points in the binary that are invocations to native methods. More generally, setBytecode() takes as its first argument a regular expression that specifies the kind of bytecode (in this example, invocations), followed by an optional argument that specifies a regular expression matching the name (in this example, native invocations).

Although not shown in our example, SIFScript contains a hierarchy of attribute specifications, which users may use to progressively narrow codepoint selections. The setClass() method of CPFinder selects classes whose name or whose class hierarchy matches specified regular expressions. If this method is invoked, only codepoints within matching classes are considered for inclusion in a codepoint set. Within these classes, users may narrow the scope of codepoint selection by using setMethod(), which takes as an argument a regular expression for the method names; only codepoints within the matching methods are considered. Thereafter, users may invoke setBytecode() to specify codepoints inside the relevant classes and methods. Users may also use setPermissions() to refine the selection to those codepoints that require resource access permissions (e.g., network or location access), and setLoops(), which allows users to instrument loop edges.

If any of these attributes are not defined, the effect is equivalent to specifying no refinement. For example, if setClass() is omitted, all classes are considered when selecting codepoints. Thus, in Listing 3.1, line 3 selects native methods in all classes. The CPFinder class also contains two other methods: init() resets a selection, since a SIFScript might contain multiple instrumentation steps, with each step instrumenting a different selection of codepoints; and apply() (line 5) analyzes the specified attributes and computes the resulting codepoint set.
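To make this hierarchy of attribute specifications concrete, the sketch below follows the conventions of Listing 3.1 and narrows a selection from classes to methods to bytecodes before placing user code. The regular expressions, the task name, and the Counter user-code class are hypothetical, and the exact SIFScript signatures may differ slightly from this approximation.

    class NetworkCallProfiler implements SIFTask {
      public void run() {
        CPFinder.init();                              // reset any earlier selection
        CPFinder.setClass(".*DownloadManager.*");     // only classes matching this name
        CPFinder.setMethod("fetch.*");                // only methods inside those classes
        CPFinder.setBytecode("invoke.*", ".*Http.*"); // only HTTP-related invocations there
        for (CP cp in CPFinder.apply()) {
          // Count each selected invocation; Counter is hypothetical user code,
          // analogous to the Logger class in Listing 3.1.
          UserCode code = new UserCode("Counter", "tick", CPARG);
          Instrumenter.place(code, BEFORE, cp);
        }
      }
    }

Because an omitted attribute means no refinement, dropping the setClass() and setMethod() calls in this sketch would widen the selection to HTTP-related invocations anywhere in the app.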
Once a codepoint set has been defined (as in line 3), the next step in writing a SIFScript is to specify what instrumentation to insert and where to insert it. For the former, SIFScript defines a UserCode type, which declares a code block; a new code block can be specified in the constructor to UserCode, which takes as arguments a class name, a method name, and arguments to the method. Thus, an instance of UserCode effectively specifies arbitrary user-specified instrumentation. For example, in line 6 the SIFScript defines code to be the start method of the Logger class, and in line 8 the end method. Lines 13-27 provide the definitions of these methods. The start method generates and stores a timestamp along with a thread-specific key. The end method computes the invocation time and writes it out to a log.

To specify where to insert instrumentation, SIF provides an Instrumenter.place() method. This method takes three arguments: a UserCode instance, a location specification, and a codepoint. The semantics of place() are as follows: it places the UserCode code block instance at the specified codepoint. SIF currently supports three location specifications: BEFORE inserts the code block before a codepoint, AFTER inserts it after the codepoint, and AT replaces the codepoint. In Listing 3.1, line 7 shows the start method of Logger being inserted before the codepoint, and line 9 shows the end method being inserted after the codepoint. The Instrumenter also supports a placeLoops() method to instrument loop back-edges; this is discussed in Section 3.3.3.

3.3.2.2 Path Sets

In Section 3.2, we motivated the need for a path inspection capability in a mobile app instrumentation framework. To illustrate SIF's abstraction for path inspection, consider the context-aware app aroundme, which searches for points of interest near a mobile device's current location. To achieve this, the app requests access permissions for both location data and the Internet. But, since it displays advertisements in the free version, privacy-conscious users may be interested in knowing whether the app leaks their location information. To audit how their location information is used, users can write a location auditor instrumentation task in SIF using the path inspection abstraction, as shown in Listing 3.2. This auditor provides the user with a listing of all sub-paths that access location data and then access the Internet.

 1  class LocationAuditor implements SIFTask {
 2    public void run() {
 3      CPFinder.setPermission(LOCATION);
 4      Set<CP> X = CPFinder.apply();
 5      CPFinder.setPermission(INTERNET);
 6      Set<CP> Y = CPFinder.apply();
 7      PathFinder.sequence(X, Y);
 8      PathFinder.report();
 9    }
10  }
Listing 3.2: Location auditor

The basic abstraction for path inspection in SIF is the path set, provided by the PathFinder class. Conceptually, a path set consists of a collection of paths traversed by the app when it is executed. Thus, unlike the codepoint set abstraction, the set of paths belonging to a path set cannot be enumerated statically (i.e., before execution).

As with codepoint sets, path sets are specified by describing attributes of the paths of interest. SIF currently supports two forms of attribute specifications. The contains(C) method of PathFinder takes as an argument a codepoint set and returns all intra-procedural paths (i.e., paths that begin and terminate within the same procedure) that contain at least one of the codepoints in the argument. sequence(C_1, C_2, ..., C_n) specifies all inter-procedural paths that contain, in sequence, a member of each of the n codepoint sets. Thus, a path in this set contains a codepoint c_i ∈ C_1, followed by a c_j ∈ C_2, and eventually a c_k ∈ C_n. The path starts in the method containing c_i and ends in the method containing c_k.
SIF supports one action on path sets, report(), which logs all paths in a path set so that a human can inspect them. This log contains every instruction in the path, so a user can understand what operations are performed along a path.

In the location auditor (Listing 3.2), the user defines two codepoint sets: the first is all invocations (e.g., API calls) with permission to access location data, and the second is all invocations with permission to perform network operations. At runtime, the location auditor logs all paths between an invocation in the first codepoint set and an invocation in the second codepoint set. If the output is empty, the user knows that there is no direct leakage of location information for the tested use cases. If the output is non-empty, the user can examine the processing done on the location data before a network operation occurs, for example, to determine whether the location granularity was coarsened to the zip code level.

3.3.3 SIF Component Design

In this section, we describe how the various components of SIF are designed and how they collectively realize the abstractions described above. SIF's design borrows from program analysis techniques and abstractions; before we discuss the SIF design, we introduce some of these techniques and abstractions.

3.3.3.1 Preliminaries

A control flow graph (CFG) represents the flow of control (branching, looping, procedure calls) in a program or within a method. Nodes in the graph represent basic blocks of code and edges represent jumps or branches. CFGs are used in many static analysis applications.

A call graph captures the invocation relationships among methods within a program. A static call graph can be constructed by analyzing a program and relating callers to callees. A dynamic call graph depicts the sub-graph of the static call graph that is encountered during an execution, and can be constructed by instrumenting and logging invocations.

SIF uses a technique called efficient path profiling [17], proposed by Ball and Larus. This technique instruments programs to measure path execution statistics accurately but with minimal overhead. The Ball-Larus profiler assigns weights to the edges of a method's control-flow graph (CFG) such that the sum of the edge weights along each unique path through the CFG yields a unique path identifier; a single instrumentation counter per method then suffices to record the path traversed during each invocation of the method. When a program instrumented with these counters is executed, the output is a count, for each path, of the number of times the path is executed.

More precisely, the Ball-Larus profiler instruments path segments. For example, in a method with two branches, there are two path segments, the then and else branches. In a method with a single loop, there are four acyclic path segments: one that runs through the method without executing the loop body, a second that starts at the beginning of the method and terminates at the end of the loop body, a third containing the execution of the loop body to the end of the method, and a fourth containing only the loop body. Intuitively, any loop execution can be described using a linear combination of these path segments: an execution that does not execute the loop body results in the count vector ⟨1, 0, 0, 0⟩, one iteration results in ⟨0, 1, 1, 0⟩, and k iterations in ⟨0, 1, 1, k-1⟩. The complete path can be reconstructed post facto by correlating the outputs of the Ball-Larus profiler with the CFG.
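The core of the Ball-Larus numbering can be illustrated with a short sketch: on an acyclic CFG, compute for each node the number of paths from it to the exit, derive edge weights from those counts, and the sum of the weights along any entry-to-exit path is a unique identifier in the range [0, numPaths-1]. The code below is a simplified illustration on a small DAG, assuming the loop-edge transformation has already been applied; it is not SIF's implementation.

    import java.util.*;

    // Simplified illustration of Ball-Larus edge numbering on an acyclic CFG.
    // Node 0 is the entry, the last node is the exit; edges.get(v) lists v's successors.
    // The sketch omits the loop-edge transformation and all instrumentation machinery;
    // it only shows why summing edge weights yields a unique path identifier.
    public class BallLarusSketch {
        static int[] numPaths;                          // numPaths[v] = acyclic paths from v to exit
        static Map<Long, Integer> edgeVal = new HashMap<>();

        static long key(int v, int w) { return ((long) v << 32) | w; }

        static int assign(List<List<Integer>> edges, int v) {
            if (numPaths[v] != 0) return numPaths[v];   // already computed
            if (edges.get(v).isEmpty()) return numPaths[v] = 1;   // exit node
            int total = 0;
            for (int w : edges.get(v)) {
                edgeVal.put(key(v, w), total);          // weight of edge v -> w
                total += assign(edges, w);
            }
            return numPaths[v] = total;
        }

        public static void main(String[] args) {
            // Two branches in sequence: 0 -> {1,2} -> 3 -> {4,5} -> 6, so four paths.
            List<List<Integer>> edges = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(3),
                Arrays.asList(4, 5), Arrays.asList(6), Arrays.asList(6),
                Collections.<Integer>emptyList());
            numPaths = new int[edges.size()];
            System.out.println("paths from entry: " + assign(edges, 0));  // prints 4

            // Path ID = sum of edge weights along the path; each path gets a distinct ID in [0,3].
            int[][] somePaths = { {0, 2, 3, 5, 6}, {0, 1, 3, 4, 6} };
            for (int[] p : somePaths) {
                int id = 0;
                for (int i = 0; i + 1 < p.length; i++) id += edgeVal.get(key(p[i], p[i + 1]));
                System.out.println(Arrays.toString(p) + " -> path id " + id);
            }
        }
    }

At runtime, the real profiler simply adds the weight of each taken edge to a per-method counter and records the counter's value as the path (segment) identifier when the method exits.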
43 We extend the Ball-Larus profiler to handle nested method calls, exceptions, and concurrency. We identify and discard paths that result in exceptions, since their catch blocks may cause control-flow to jump outside of the method, a behavior for which the Ball-Larus profiler is undefined. Concurrency is handled by using the thread’s ID to identify the counter that must be updated to track the path ID. 3.3.3.2 Realizing the Codepoint Set Abstraction There are three distinct parts to realizing the codepoint set abstraction: finding target instrumentation positions, enabling access to local data variables, and inserting user-defined action code. Finding the target instrumentation positions is performed by the method CPFinder.apply(). This method first combines the regular expression based attribute specifications for class hierarchy, classes, methods, bytecodes and permissions into a set of constraints. Then, the method hierarchically applies these constraints during successive scans of the code, ultimately identifying the set of instructions that need to be instrumented. A challenge for SIF is to provide user-defined instrumentation with access to program state that is available at each instrumentation codepoint. For example, consider a codepoint invoke bar(x,y) inside method foo(a,b). User-specified instrumentation code should be able to access the method signature (foo(int,int)), method arguments (a,b), and operands of the instruction (i.e., the method reference to bar, invocation arguments (x,y), and the return value, if any). This type of access to the program state is necessary, for example, in code that tracks the values of arguments supplied to bar. For method signatures, SIF scans the binary to extract these signatures, then inserts instructions to load at runtime the corresponding signature at the appropriate locations, so that they will be accessible to user- specified instrumentation code. SIF also provides users with access to other information discussed above. To access this information, SIF makes special symbols available to users that notify SIF to instrument in such a way so that this data is provided to the inserted instrumentation code as arguments. When SIF encounters these special symbols, it inserts additional instrumentation to make this data available to the user-specified instrumentation code. 44 SIF inserts the user-specified instrumentation at all identified codepoints via theInstrumenter.apply() method. There are two approaches to this: one is to inline the instrumentation code at each codepoint, and the other is to insert an invocation to the instrumentation code. SIF employs the latter approach, which results in more compact instrumentation if there is more than one codepoint that must be instrumented with the same code. As an aside, we note that code obfuscation frameworks, which obfuscate binaries without affecting functionality, can limit the applicability ofSIF’s codepoint abstraction. These frameworks cannot obfuscate invocations to system APIs and methods, so SIF will still be able to instrument those codepoints. Supporting Distinguished Codepoints. Apps contain distinguished codepoints that correspond to higher- level program constructs of interest to users. These codepoints are method entry and exit, CFG branch edges, exception entry points, and loop back-edges. 
Our current instantiation of SIF supports only the sub- set required to achieve the instrumentation tasks discussed in Section 3.4, but the remainder are straight- forward to support using CFG analysis. To instrument method entry and exit, a user would first identify codepoints that define a method using the CPFinder then use the Instrumenter.place() method with the location specification of either ENTRY or EXIT. To identify codepoints related to loop “back-edges” (jumps back to the beginning of the loop), SIF provides the CPFinder.setLoops() method. This method uses a depth-first search of the CFG to identify back-edges. Then the Instrumenter’s placeLoops() function inserts instrumentation before a back-edge (at the loop exit), after a back-edge (at the loop entry), and at a back-edge. 3.3.3.3 Realizing the Path Set Abstraction In SIF, path sets support path inspection capabilities. The methods to support path inspection take code- point sets as arguments; the techniques discussed above can be used for identifying the relevant codepoints. The remainder of this section discusses how SIF realizes path inspection. In general, SIF provides path in- spection capabilities by appropriately adapting the Ball-Larus path profiling method discussed above, and 45 using a path stitcher as a post-processing step to aggregate the path segments produced by path profiling into method-level or inter-procedural paths. SIF’s adaptations reduce the overhead of path profiling. Implementing the contains(C) method of PathFinder is conceptually straightforward. One could simply instrument all path segments of all methods in an app using the Ball-Larus profiler, then report only those path segments containing the specified codepoints (these can be identified in a post-processing step). Instead, in SIF, we use eachc2C to identify the methods that containc, and only instrument those methods. Implementingsequence() is a little bit more involved. Extending Ball-Larus profiling to inter-procedural paths is known to be a hard problem, and it is not one we solve here. Our approach relies on the obser- vation that we know which inter-procedural paths are of interest, namely, the ones that traverse specified codepoint sets. We can use the Ball-Larus profiler, together with additional instrumentation, to find these inter-procedural paths. We discuss the algorithm for the case when the input to sequence() contains two codepoint sets C 1 andC 2 . The extension of this algorithm to multiple codepoint sets follows in a straightforward manner. For sequence(C 1 ;C 2 ), we wish to record all paths at runtime that execute a c i 2 C 1 followed by a c j 2 C 2 . We could instrument every method in the app using the Ball-Larus profiler, but this approach adds unnecessary instrumentation and does not give us enough information to determine inter-procedural paths between eachc i andc j . Instead, we use a more sophisticated program analysis to statically determine an over-approximation of all the methods invoked between the codepoints inC 1 andC 2 . To identify these methods that could be invoked between codepoints, we perform a standard reachability analysis (with some additional inputs to specify concurrency and event handling constraints) between all c i 2 C 1 and c j 2 C 2 . All intervening methods in the call graph are marked for instrumentation. We then add additional instrumentation beyond that required by the Ball-Larus profiler, described below, to instrument these methods. 
This approach trades off slightly higher instrumentation costs for a much more compact instrumented binary.

The additional instrumentation records the identifiers of the executed path segments and their ordering, so that the path stitcher can reconstruct each intra-procedural path and stitch these together into inter-procedural paths. For example, suppose that codepoint c_i is invoked in procedure A along path p_{l,A} (where l is an identifier for the path). One approach to obtaining the inter-procedural path is to record ⟨l, A⟩ and the timestamp at which that path was executed. Then, when codepoint c_j is invoked in procedure B along path p_{m,B}, ⟨m, B⟩ is also output by the instrumentation. If ⟨l, A⟩ and ⟨m, B⟩ occur successively in the output, then we can infer that B was called by A and that the corresponding inter-procedural path consists of l followed by m.

In SIF, a path stitcher performs this analysis of paths. Rather than output timestamps and path segment identifiers, we note that it suffices to simply output the sequence of path segment identifiers encountered during the execution. From this sequence, and by logging all call sites encountered during execution (2) and every method entry and exit, it is possible to stitch together inter-procedural paths. We optimize this logging by run-length encoding the path identifier sequence and compactly encoding the call site information.

Finally, the path stitcher performs a call stack simulation in order to determine the calling sequence of the path segments from the executed methods. To do this, it uses the logged records from path profiling and simulates the call stack encountered during execution. The output of this step is the set of all inter-procedural paths between each c_i ∈ C_1 and c_j ∈ C_2.

Dealing with Exceptions and Concurrency. Mobile apps may throw exceptions and contain concurrent threads. To handle exceptions, we conservatively include every exception handler when performing the reachability analysis. To handle concurrency, we log thread identifiers so that the path stitcher can separate path segments executed by different threads. Our path stitcher can also splice thread paths when one thread creates another (by searching for a thread fork call) and identify thread identifier reuse (by determining when the thread has exited).

2 Within a given method, another method m may be invoked at several points. Logging the call sites helps disambiguate these during path stitching.

3.3.3.4 Overhead Feedback

The final component of SIF is the overhead feedback estimator. In SIF, overhead can come from two sources: instructions inserted by SIF components (e.g., instructions to load method templates or perform path logging) and user-specified instrumentation code. As we show in Section 3.4, the former component is small. However, as in any programming framework that provides flexibility, users can insert instrumentation that adds significant processing overhead to an app, making the app unusable. The overhead estimator therefore gives the user approximate feedback on the overhead introduced by their instrumentation.

The overhead of instrumentation is difficult to determine statically (i.e., without running the app) since, in many cases, execution time depends on code structures that can execute a variable number of times. So, SIF provides users with a way to supply a workload as input. Intuitively, a workload captures the dynamic execution statistics for a given use of the app (e.g., playing one instance of a game or sending an email).
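Before describing how SIF obtains this workload, the sketch below illustrates the arithmetic such an estimator can perform once per-segment execution frequencies are known: each segment's count is multiplied by the estimated cost of the instructions inserted on it and by the profiled cost of any user-code invocations. The maps, names, and nanosecond cost model are assumptions for illustration, not SIF's actual estimator.

    import java.util.*;

    // Illustrative overhead estimate from a workload description.
    // freq            : how often each path segment executed in the recorded workload
    // insertedCostNs  : estimated cost (ns) of SIF-inserted instructions on each segment
    // userHookCount   : number of user-code invocations on each segment
    // userHookCostNs  : profiled cost (ns) of one invocation of the user code
    // All names and the cost model are assumptions for this sketch.
    public class OverheadEstimator {
        static long estimateNs(Map<Integer, Long> freq,
                               Map<Integer, Long> insertedCostNs,
                               Map<Integer, Integer> userHookCount,
                               long userHookCostNs) {
            long total = 0;
            for (Map.Entry<Integer, Long> e : freq.entrySet()) {
                int seg = e.getKey();
                long n = e.getValue();
                total += n * insertedCostNs.getOrDefault(seg, 0L);
                total += n * userHookCount.getOrDefault(seg, 0) * userHookCostNs;
            }
            return total;
        }

        public static void main(String[] args) {
            Map<Integer, Long> freq = Map.of(1, 1000L, 2, 50L);        // segment -> executions
            Map<Integer, Long> sifCost = Map.of(1, 120L, 2, 400L);     // segment -> inserted cost (ns)
            Map<Integer, Integer> hooks = Map.of(1, 2, 2, 0);          // segment -> user hooks
            System.out.println("estimated overhead: "
                + estimateNs(freq, sifCost, hooks, 800L) / 1e6 + " ms");
        }
    }

Because the estimate depends only on the recorded segment frequencies and static cost profiles, it can be recomputed each time the user refines the instrumentation specification, without re-running the instrumented app.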
To translate the workload into paths that can be analyzed, SIF provides the user with a version of the app instrumented only with the Ball-Larus profiler (i.e., the version does not contain the user-specified instrumentation). The user can execute the workload on this instrumented app and obtain the frequency of execution of every path segment in the app. This constitutes the workload. Based on this information, the overhead estimator knows precisely which instructions inserted by SIF are executed and how many times each is executed. In addition, the overhead estimator can determine how many times user-specified instrumentation is invoked, and can account for the overhead of this as well. This estimation works well if user-specified instrumentation’s CFG is acyclic; extending overhead estimation to more complex user-specified instrumentation is left to future work. The overhead estimator combines the instruction counts derived from the workflow description and profiled estimates of execution time for each instruction to provide an estimate of the total execution time incurred by instrumentation (feedback based on other measures of overhead, such as energy, is left for future work). With this estimate, users can refine their instrumentation specification to reduce overhead. 48 We emphasize that, to determine overhead, the overhead estimator runs a version of the app instru- mented using the Ball-Larus profiler (to generate the workload description), but does not need to run a version of the app that includes user-specified instrumentation. In this sense, the feedback provided to the user is an estimate. This is convenient because the user has to generate the workload description once, but may iteratively refine instrumentation several times. 3.3.4 Implementation of SIF for Android We have designed SIF to be broadly applicable to different mobile computing platforms. The abstractions on which SIF depends are general in that they are based on standard programming constructs (instructions, invocations and paths). Furthermore, the components use program analysis techniques common to imper- ative programming, but not specific to a given programming language. We have chosen to instantiate SIF for the Android platform because of the platform’s popularity and active research targeting the platform. That said, it should be conceptually straightforward to instantiate SIF on other platforms like Windows, iOS and we have left this to future work. We made two exceptions to the generalizability goal of SIF. The first is that, along with handling concurrent execution, we implemented path set abstractions to handle Android-specific asynchronous tasks that can be executed in the background. Second, permissions support is closely modeled on Android’s permissions model. The reason for this is that prior work [14] has identified, for a given permission capability, a list of APIs which match that capability, for example, the set of API calls that requireINTERNET permission to use the network. SIF uses this list to implement permissions-based codepoint selection. We have left it to future work to identify the permission mappings of other platforms. Although Android uses a dialect of Java, our specification language abstractions are implemented as an extension of Java. We do this because there are robust tools for Java bytecode manipulation. To manipulate Android programs, we first convert Dalvik bytecode to Java. 
To achieve this translation, we use apktool [3] to unpack and extract app binaries and resource files, and dex2jar [5] to convert Dalvik bytecode to Java bytecode. The BCEL [4] library is used for reading and modifying Java bytecode. Finally, Android SDK tools convert the Java bytecode back to Dalvik and repack the instrumented app. We have implemented EPP [17] in Java. The total implementation of SIF is about 5,000 lines of code.

Our implementation does not handle Java reflection and dynamically loaded code instantiated by the Java class loader. Program analysis techniques for handling reflection and dynamic class loading have not progressed to the point where it is possible to accurately analyze a broad range of code. Furthermore, SIF has no visibility inside native code, which is used by several applications, but it can instrument invocations to native methods. We have left these issues to future work. Finally, we note that SIF can also be used, with minor modifications, to instrument arbitrary Java programs, not just mobile apps. We have also left an exploration of this capability to future work.

3.4 Evaluation

The primary research question with SIF is its applicability for instrumentation. There are several aspects to this applicability: is SIF expressive enough to express a variety of instrumentation tasks, is its language compact enough to permit rapid instrumentation, is its framework efficient enough to be usable, and is its overhead feedback accurate enough for users to rely on its estimates? In this section, we address these aspects using our Android instantiation of SIF. All our experiments are conducted on Galaxy Nexus smartphones running Android 4.1.2.

3.4.1 Expressivity of SIF

To demonstrate the expressivity of SIF, we have implemented ten different instrumentation tasks, shown in Table 3.1, which demonstrate different facets of SIF. These tasks exhibit variety along several dimensions. First, they range from a simple timing profiler task that requires a single instrumentation step to sophisticated multi-step tasks, such as injecting privacy leaks and performing critical path analysis in the presence of concurrent events. Second, some of these tasks illustrate traditional uses of instrumentation, such as performance monitoring or dynamic tracing, while others are more specific to mobile apps, focusing on sensor usage and sensor data security. Third, while some of these instrumentation tasks are common, a majority of them have been motivated by recent research. Finally, these tasks use different combinations of the SIF abstractions: path sets or codepoint sets specified at different granularities (class hierarchy, method, bytecode); permissions; and the ability to instrument loops.

Table 3.1: Implemented instrumentation tasks

As Table 3.2 shows, the SIFScript for each instrumentation task is very compact. No SIFScript exceeds 100 lines; if we exclude FreeMarket, AlgoProf and AppInsight, the remaining tasks require less than 30 lines. This demonstrates the conciseness of the abstractions; as we shall discuss later, the larger tasks, AlgoProf and AppInsight, are approaches that use extensive instrumentation to study application behavior.

                              LOC* (SIF)   LOC (user)   LOC (total)
    Timing Profiler               12           16            28
    Call Graph Profiler           30           22            52
    Flurry-like Analytics         13           18            31
    Fine-grained Permission       20           44            64
    AdCleaner                     10            4            14
    Privacy Leakage               12           44            56
    FreeMarket                    22            -             -
    AlgoProf                      61            -             -
    AppInsight                    91            -             -
    Location Auditor              10            0            10

Table 3.2: Implemented SIF tasks (*Lines of Code)

Timing Profiler.
The Timing Profiler shown in Listing 3.1 profiles the timing of native method invocations. For space reasons, instead of showing SIF code for subsequent instrumentation tasks, we use a pictorial representation of these tasks. Figure 3.3 shows this representation for the timing profiler. Attribute specifications for codepoint sets (line 3 in Listing 3.1) are represented as boxes with a gray background. User-specified instrumentation (lines 13-27 in Listing 3.1) is represented as a blue box with a brief description of the instrumentation code. The relative positioning of these boxes indicates whether the user-specified instrumentation is inserted before (top left), after (bottom right), or replaces (middle) the corresponding codepoints. Double arrows between filter boxes indicate multiple instrumentation steps (the Timing Profiler has only one step, but subsequent instrumentation tasks have more than one).

Figure 3.3: Timing profiler

To demonstrate the timing profiler, we have applied it to the Angry Birds app and measured the app's native method invocations. Angry Birds uses native methods to optimize UI event handling. As an aside, we note that we cannot instrument entry and exit of the Java definitions of the native methods, since these methods always have an empty body. Instead, we have to select all invoke instruction types (Android permits several invocation types), refine this codepoint set to native method invocations, and then insert the timestamp logging code before and after each such invoke.

We ran the instrumented app with a common use case (for this application, playing a single game). The results show a mismatch between the static and dynamic view of native method usage in the app. While a static analysis of the binary shows 18 distinct native method invocations, only 7 are actually involved in our use case. As Table 3.3 shows, some of these are used significantly more than others but exhibit significant variability in execution time (e.g., update). Others are computationally expensive (e.g., init and pause) and infrequently invoked. The total size of the SIF implementation of the Timing Profiler is 28 lines of code. This illustrates the conciseness of SIFScript; even small and simple programs are able to provide meaningful and useful insights into app behavior.

                     #invokes   Avg (ms)   Min (ms)   Max (ms)
    setVideoReady         1       0.061      0.061      0.061
    nativeInit            1     507.996    507.996    507.996
    nativeInput         156       0.046      0.031      0.336
    nativeKeyInput       10       0.027      0.031      0.061
    nativePause           1     182.007    182.007    182.007
    nativeResize          1       0.031      0.031      0.031
    nativeUpdate       3696       5.605      0.366   3894.928

Table 3.3: Native methods invoked during a game run

Call Graph Profiler. Another common use of instrumentation is to log the dynamic call graph of an application. This task is different from the timing profiler in two ways. First, rather than instrumenting invocations, it instruments method entries and exits. Second, while the timing profiler uses a single instrumentation step, the call graph profiler uses multiple steps, where each step refers to instrumenting a distinct codepoint set or reporting a distinct path set.

Figure 3.4 illustrates the SIF steps to construct a call graph for all methods invoked between X.foo() and Y.bar(). The first step instruments every method entry to check a global variable that controls call graph generation. If this variable is set, the instrumentation increments a global variable that tracks the level of the current method in the call graph, and logs both the level and method identifier.
The second step performs analogous actions for the exit of every method. In these two steps, the instrumentation location is specified by indicating a bytecode position; by default, SIF applies this to every method in every class. Finally, the last two steps instrument the entry and exit points of the target methods to (respectively) enable and disable the global variable that controls call graph generation. From the sequence of log records and level indicators, it becomes easy to infer the dynamic call graph. The SIFScript for this instrumentation task is 52 lines of code.

Figure 3.4: Call graph profiler

We verified our call graph profiler on Angry Birds, instrumenting the app to identify the dynamic call graph generated between calls to the app's onCreate() and onDestroy() methods. This logs 100 distinct app methods out of a total of 1958 methods in the app. The call graph has over 22K edges and a maximum depth of 10, with 90% of the calls being 5-6 methods deep. This example demonstrates how one can use SIF to develop insights about app structure and complexity. An alternative implementation could use the path set sequence method to find all paths between X.foo() and Y.bar(), then post-process the returned paths to determine invocations and entry/exit pairs. The relative performance of these approaches depends on the app and the inputs; SIF's feedback estimator can be used to determine which approach might be better for a given workload.

Flurry-like App Usage Analytics. In mobile apps, it is common practice for developers to collect data about post-deployment app usage. In fact, there are several services, Flurry [6] being the most popular, which provide developers with an API for logging app usage information. When users use an app, these logs are uploaded to the service's site and made available to developers. This capability enables developers to refine their apps or introduce new features based on customer preferences, expertise, or other factors.

Figure 3.5: Flurry-like analytics

We now show how SIF can provide usage statistics for apps whose developers have not used the analytics API while developing the app. Our code is specific to helpout, a multi-level puzzle game app. In this game, users advance levels when they successfully complete a level, and can choose to move down a level or have the app display a solution. We develop instrumentation to count how many times a user advances levels, chooses to move down a level, or displays a solution. To do this in SIF requires only two steps (Figure 3.5) and 31 lines of code.

In the first step, the SIFScript loads a unique identifier for the game player from storage (generating one if necessary) when a game starts. The SIFScript identifies the game start codepoints by creating a class-hierarchy-based attribute specification that selects all app-defined classes derived from the Android Activity class and all onCreate methods of those classes. Then, it instruments these codepoints by inserting code for loading and generating the user identifier. This identifier is used to distinguish between multiple game users. In the second step, the SIFScript instruments the methods that implement the functionality discussed above: going up a level, going down a level, and asking for help. Whenever one of these methods is invoked, the user-specified instrumentation sends a message to a server that includes the method name and the user identifier.
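A sketch of the kind of user-specified instrumentation class such a SIFScript might insert is shown below. The class name (UsageAnalytics), the identifier file path, and the reporting URL are hypothetical, chosen only to illustrate the load-identifier and report-event actions described above; a production version would also move the network call off the UI thread.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.UUID;

    // Hypothetical user-specified instrumentation for the Flurry-like analytics task.
    public final class UsageAnalytics {
        // Placeholder locations; a real script would derive these from the app's data
        // directory and the analyst's collection server.
        private static final Path ID_FILE = Paths.get("/data/data/com.example.helpout/files/user_id");
        private static final String SERVER = "http://analytics.example.com/report";
        private static String userId;

        // Inserted at the onCreate() codepoints selected in step one: load the stored
        // identifier, or generate and persist a new one.
        public static synchronized void loadOrCreateUserId() {
            if (userId != null) return;
            try {
                if (Files.exists(ID_FILE)) {
                    userId = new String(Files.readAllBytes(ID_FILE), StandardCharsets.UTF_8).trim();
                } else {
                    userId = UUID.randomUUID().toString();
                    Files.write(ID_FILE, userId.getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                userId = "unknown";
            }
        }

        // Inserted at the codepoints selected in step two (gameLevelNext, gameLevelPrev,
        // showSolution): report the invoked method name together with the user identifier.
        public static void reportEvent(String methodName) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(SERVER).openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                byte[] body = ("user=" + userId + "&event=" + methodName)
                        .getBytes(StandardCharsets.UTF_8);
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body);
                }
                conn.getResponseCode();   // fire and forget; the response body is ignored
                conn.disconnect();
            } catch (IOException e) {
                // analytics are best-effort; ignore network failures
            }
        }
    }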
App usage analytics can help developers of helpout understand the distribution of expertise among their users. We had three users play an instrumented version of this game, and recorded their usage. As Table 3.4 clearly shows, user C has the least expertise in this game, going down a level twice and asking many times for the solution, while user B has the highest expertise, going up to level 22. Beyond illustrating SIF's ability to support application diagnostics, this instrumentation task demonstrates how SIF allows scripts to perform more sophisticated actions beyond simply counting or timing method usage (e.g., uploading data to a server).

              showSolution   gameLevelNext   gameLevelPrev
    user A          2              10               0
    user B          0              22               0
    user C          5               7               2

Table 3.4: Analytics results collected from 3 users

Fine-grained Permissions. In Android, permissions to access resources, such as sensors and devices, are granted and enforced at the granularity of an entire app. However, many apps are composed of multiple "packages" obtained from different developers. For example, aroundme is an app that returns context-sensitive results, and its free version uses two additional packages, ads and flurry, from other developers. The Android security model does not distinguish between the developer of the app and the developers of other packages, and treats them all as identical principals from a security perspective.

Motivated by this observation, some recent work [39,86] has proposed finer-grained permissions granting and enforcement. This work analyzes the app to infer the right set of permissions needed for each package, then instruments the app to enforce those permissions. We now show how SIF can be used to develop functionality analogous to fine-grained permissions enforcement. Our SIFScript code pops up a dialog box to obtain explicit permission the first time that a method in a given package invokes an API requiring a certain permission. In our example, the SIFScript checks methods in the flurry API that require INTERNET permission. Thereafter it remembers the user's choice and only invokes the API if the user has granted the necessary permission.

Our SIFScript for this task (Figure 3.6(a)) contains two steps. The first step of our SIFScript selects the onCreate method of the entry activity and stores a reference to Context in the user-specified instrumentation class. Most UI actions in Android, such as popping up a dialog box, require a pointer to a UI Context. The second step illustrates the use of codepoint set selection based on permissions capabilities and the ability of SIF to replace codepoints. In this step, the SIFScript replaces all API invokes that need INTERNET permission in the flurry module with another method that has the same signature as the replaced method, but displays a dialog box (Figure 3.7) that asks for the user's choice. Subsequent invocations to any method in flurry will result in network traffic only if the user has granted access.

We have instrumented the aroundme app with this capability. Our SIFScript is 64 lines of code, and can be easily extended to enforce fine-grained permissions for modules other than flurry or permissions other than network access. This would require a different codepoint set definition and additional code to track different types of permissions.

AdCleaner. Free versions of many mobile apps come with advertisement displays. In fact, this feature is so pervasive that there exist third-party libraries that app developers can use to include ad displays.
Ads can consume Internet bandwidth, affect energy usage, and take up valuable screen real estate. Analogous to Web-based ad blockers, mobile app users can add a blacklist of domain names for ad-hosting servers to their local name resolvers, but this requires root access. Recent research [62] has explored using library interposition techniques to block ads.

Figure 3.6: SIFScript descriptions for some tasks: (a) fine-grained permission control, (b) AdCleaner, (c) privacy leakage study

This capability is simple to implement in SIF, and requires only 14 lines of code and a single instrumentation step. Our implementation simply replaces the loadAd() method invocation of a popular ad library with a null method. Figure 3.8 shows the screenshots of aroundme before and after our instrumentation is applied.

Figure 3.8: Screenshots before and after AdCleaner

Privacy Leakage. As with many tools that have significant expressive power, SIF can also be used for malicious purposes. With this instrumentation task, we illustrate how easy it is in SIF to innocuously insert a significant privacy leak. In this example, this capability is achieved as a result of SIF's support for arbitrary user-specified instrumentation.

We have been able to instrument the Skype app, which is permitted to use both the camera and the network, to periodically take a picture with the camera and upload it to a website. This can, of course, significantly leak privacy by exposing the physical context of a given user. The SIFScript for this (Figure 3.6(c)) is 56 lines of code and requires only a single step, which instruments the onCreate method of the entry activity and stores a reference to the Context object. The inserted code starts a background task that periodically takes a photo and uploads it to a server controlled by the instrumenting user. Figure 3.9 shows an experiment in which a user first uses a phone outdoors, puts the phone in his pocket, then takes the phone out indoors. The photos clearly reveal these place transitions in addition to several details about the locations.

Figure 3.7: Dialog asking for user's choice
Figure 3.9: Photos taken and uploaded by instrumented app

FreeMarket. A recent study [68] has explored vulnerabilities in the process of in-app Android purchases. Specifically, the work proposes an attack on Google's In-App Billing service protocol, which allows users to pay for purchases from within an app. This attack instruments an app to modify its behavior to (a) bypass access to Google's billing servers, (b) redirect these calls to a local Android service (instantiated by instrumentation code) that always returns successfully, and (c) bypass a key verification step. These instrumentation steps have been documented in [68] and we have been able to devise SIF code that replicates them.

The SIFScript for this task consists of 22 lines of codepoint set specifications. The first step finds invokes of Android's bindService API in the Context class and replaces that binding with one to a local service. The second step replaces invocations of the signature verification Java API call with a null method that is always successful. A final step adds some instrumentation to deal with the case where in-app billing is invoked through Java reflection.
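The replacement-based tasks above (fine-grained permission control, AdCleaner, and FreeMarket) all rely on the same pattern: swapping an invocation for a user-supplied stub with a matching signature. A minimal sketch of what such stubs might look like follows; the class name, the signatures, and the dialog helper are illustrative assumptions, not the actual code used in our tasks.

    // Hypothetical replacement stubs. SIF rewrites the selected invoke instructions to
    // call these methods, which must have the same signatures as the methods they replace.
    public final class ReplacementStubs {

        // AdCleaner-style stub: stands in for an ad library's loadAd() call and does
        // nothing, so no ad is fetched or rendered. (The parameter type is illustrative.)
        public static void loadAdStub(Object adView) {
            // intentionally empty
        }

        // Fine-grained-permissions-style stub: gate a replaced network call on a user
        // decision that is remembered after the first prompt.
        private static volatile Boolean networkAllowed = null;

        public static void gatedNetworkCall(Runnable originalCall) {
            if (networkAllowed == null) {
                networkAllowed = askUserViaDialog();
            }
            if (networkAllowed) {
                originalCall.run();
            }
        }

        private static boolean askUserViaDialog() {
            // Placeholder for the dialog of Figure 3.7; a real implementation needs the
            // Context stored in step one and must marshal the result from the UI thread.
            return false;
        }

        // FreeMarket-style stub: a signature-verification replacement that always
        // reports success.
        public static boolean verifySignatureStub(byte[] content, byte[] signature) {
            return true;
        }
    }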
We have re-created the instrumentation steps necessary to mount the attack described in [68], but have not validated that attack because (a) the actions of the local service are not described in detail in the paper, and (b) the attack is successful only on some applications whose identities are not revealed in the paper (so we would have to search for a vulnerable app).

Figure 3.10: FreeMarket attacker

AlgoProf. Recent work [84] has explored methods to estimate an application's asymptotic complexity as a function of input size. Their prototype, AlgoProf, instruments Java bytecode in order to collect extensive profiling information, and then post-processes the output to produce a complexity estimate. Specifically, they instrument every method entry and exit, array access, field access, object allocation, and loop entry and exit. Although this work did not target mobile apps (its focus is on Java apps in general), it nicely demonstrates features of SIF that might be exploited in future instrumentation-based studies of mobile apps.

We have written a SIFScript to replicate the instrumentation required by AlgoProf. For space reasons, we omit a pictorial description of the code. The SIFScript has ten steps which require 61 lines of code (not including the user-specified instrumentation), and uses a wide range of SIF's codepoint set capabilities. Unlike previous examples that have only instrumented method entry and exit or invocations, this SIFScript also instruments Java bytecodes (e.g., for field accesses) and is the only one of our examples that also instruments loops. Although our SIFScript easily duplicated the instrumentation part of the approach in only 61 lines of code, we could not verify the results of executing the instrumented apps without also duplicating AlgoProf's complexity inference algorithms, which was beyond the scope of this project.

AppInsight. By far the most complex instrumentation task that we have applied SIF to is AppInsight [66], which analyzes latency-critical paths in mobile apps. Based on the observation that many UI actions require multiple concurrent operations, this work identifies upcalls and then adds instrumentation code to match upcalls with asynchronous callers.

The SIFScript for AppInsight requires twelve instrumentation steps and 91 lines of code (not including user-specified instrumentation). These steps run the gamut of location specifications and actions, instrumenting individual bytecodes and invocations, replacing invocations with user-defined instrumentation, and inserting new handlers for specified events. It was not possible to compare the output of our implementation against the original AppInsight because SIF works for Android platforms and AppInsight leverages certain opcodes and functions that are only present in Silverlight (Windows Phone). Nonetheless, our SIFScript implementation demonstrates that even complex, state-of-the-art, instrumentation-based approaches can be easily implemented in SIF.

Location Auditor. Finally, we have previously described another instrumentation task, location auditing (Listing 3.2) on the aroundme app. This task demonstrated the use of the path set abstractions. An experiment involving this instrumented app is discussed below.

3.4.2 Efficiency of SIF

The second major aspect of SIF's design is its efficiency, which we quantify in four distinct ways. First, SIF uses program analysis to minimize the instrumentation of path sets, and we quantify these savings.
Second, we demonstrate that SIF's user-perceived time to instrument a binary is moderate. Third, we quantify the runtime overhead due to SIF. Finally, we quantify the accuracy of our feedback estimator. With an accurate estimator, users can iteratively refine their instrumentation to achieve acceptable overhead.

Program Analysis for Path Sets. SIF analyzes the program binary in order to minimize the instrumentation for the sequence operation on path sets. To evaluate its performance, we instrumented the aroundme app with the location auditor task (Listing 3.2). CPFinder was able to find 10 codepoints with LOCATION permission and 12 with INTERNET permission. Using program analysis, SIF determined that only 25 methods (out of about 560 methods in the app) needed to be instrumented, or fewer than 5% of the total number of methods. This demonstrates the benefit of sophisticated program analysis for path sets.

We also ran a simple use case that searched for nearby gas stations and restaurants. SIF reported three suspicious paths: two from the aroundme package and one from ads. As expected, aroundme read the location and sent it over the network. Unexpectedly, we found that the ads package also appeared to send a user's location over the network, presumably for location-targeted advertising.

No path was reported for the flurry package, indicating that, at least for this use case, there were no paths that read the GPS sensor and then accessed the network. However, by analyzing network traffic, we found that flurry did leak location information. An analysis of the binary revealed that it read and stored the location in memory or storage, and later transmitted that location over the network. To detect such leaks, taint analysis or other forms of information flow analysis are necessary; SIF's path inspection capabilities can help narrow the search scope for leakage.

Time to Instrument Apps. In this section, we quantify the CPU time taken to instrument binaries for seven of the tasks presented above; in each case, we instrument the corresponding apps used to demonstrate the instrumentation tasks. Our measurements are performed on a ThinkPad T400 laptop with 3GB RAM. As shown in Table 3.5, the search for relevant codepoints is fast; CPFinder takes at most 4.6s to finish. The time to apply the instrumentation is at most 6s. Notice that the cost depends on the app binary size as well as the complexity of the relevant SIFScript. Instrumentation time is dominated by the cost of packing and unpacking the app. All tasks can finish within half a minute; we consider this to be reasonable, especially since SIF itself contributes only a small fraction of these times.

Table 3.5: Time to instrument SIF tasks

Runtime Overhead. There are two sources of overhead that affect the performance of an instrumented app: user-defined instrumentation, and instructions inserted by SIF. We now quantify the latter by replacing all user-defined code with empty stubs so that the functionality of the original app is unchanged; this allows us to measure the overhead attributable to SIF. We run the same workload on both the original and instrumented apps and compare their running times. Table 3.6 shows the duration of the original app and the overhead introduced by SIF. Overall, SIF introduces less than 2% overhead except for the call graph profiler, where the overhead is 4.41%.
This is because SIF overhead depends on the number of codepoints to be instrumented, and the call graph profiler has to instrument many more codepoints than the rest.

                              Original app (sec)   Overhead by SIF (sec)
    Timing Profiler                 59.473            0.452 (0.76%)
    Call Graph Profiler             59.729            2.637 (4.41%)
    Flurry-like Analytics          115.384            0.679 (0.59%)
    Fine-grained Permission         11.862            0.153 (1.29%)
    AdCleaner                       11.721            0.114 (0.97%)
    Privacy Leakage                 35.230            0.137 (0.39%)

Table 3.6: Runtime overhead of SIF

Accuracy of Overhead Feedback. We evaluate the accuracy of SIF's overhead estimates by comparing them with measured ground truth values. Our experiments measure execution times with and without instrumentation. The difference between these two numbers is the ground truth cost of the instrumentation. The experiments were averaged over ten runs and controlled for most types of non-deterministic behavior seen between successive runs³.

³ For example, we modified the Angry Birds binary to set its random number seed to a fixed value (by default, it uses the current local time).

Figure 3.11: Accuracy of SIF's overhead estimates

Figure 3.11 plots the SIF estimates and measured ground truth for the six tasks we have tested completely that also involve user-specified instrumentation. The y-axis represents the instrumentation overhead measured as a fraction of the total execution time. SIF's overhead estimate is very close to ground truth, within 15%, for all tasks but privacy leakage. For fine-grained permission control and AdCleaner, the measured ground truth has lower execution time after the instrumentation; that is because, in both cases, SIF replaces invocations with null methods. Our overhead estimator currently does not account for replaced invocations, so it over-estimates overhead. However, the estimate in these situations will always provide a conservative upper bound, which is appropriate, since the goal is to give the user an approximate indication of potential overhead. The one case that requires significant future work is the privacy leakage study. Its estimate is significantly off because its user-defined instrumentation is complex (the code fires a periodic timer which uploads a photo) and we have not developed analysis methods for such code.

3.5 Conclusion

We have described the design and implementation of SIF, a binary instrumentation framework for mobile apps whose codepoint set abstractions are able to specify instrumentation locations at different granularities and incorporate resource usage permissions. Its path set abstractions allow dynamic path inspection between arbitrary codepoints, and its program analysis techniques can reduce the overhead of instrumentation. SIF is expressive enough to incorporate a variety of instrumentation tasks previously proposed in the literature, and is quite efficient.

Much work remains, however, including validating SIF on other instrumentation tasks, porting SIF to other platforms and integrating their access permission methods into SIF, supporting advanced language features such as reflection in its path set abstractions, evaluating the effectiveness of path sets for studying privacy leakage and comparing path-sensitive analysis with more expensive information-flow style analyses, studying the usability of overhead feedback, and improving the accuracy of feedback estimation for advanced forms of user-specified instrumentation.
Chapter 4

PUMA: Programmable UI-Automation for Large-Scale Dynamic Analysis of Mobile Apps

Mobile app ecosystems have experienced tremendous growth in the last six years. This has triggered research on dynamic analysis of performance, security, and correctness properties of the mobile apps in the ecosystem. Exploration of app execution using automated UI actions has emerged as an important tool for this research. However, existing research has largely developed analysis-specific UI automation techniques, wherein the logic for exploring app execution is intertwined with the logic for analyzing app properties.

PUMA is a programmable framework that separates these two concerns. It contains a generic UI automation capability (often called a Monkey) that exposes high-level events for which users can define handlers. These handlers can flexibly direct the Monkey's exploration, and also specify app instrumentation for collecting dynamic state information or for triggering changes in the environment during app execution. Targeted towards operators of app marketplaces, PUMA incorporates mechanisms for scaling dynamic analysis to thousands of apps. We demonstrate the capabilities of PUMA by analyzing seven distinct performance, security, and correctness properties for 3,600 apps downloaded from the Google Play store.

4.1 Introduction

Today's smartphone app stores host large collections of apps. Most of the apps are created by unknown developers who have varying expertise and who may not always operate in the users' best interests. Such concerns have motivated researchers and app store operators to analyze various properties of the apps and to propose and evaluate new techniques to address the concerns. For such analyses to be useful, the analysis technique must be robust and scale well for large collections of apps.

Static analysis of app binaries, as used in prior work to identify privacy [49] and security [24, 27] problems, or app clones [20], etc., can scale to a large number of apps. However, static analysis can fail to capture runtime contexts, such as data dynamically downloaded from the cloud, objects created during runtime, configuration variables, and so on. Moreover, app binaries may be obfuscated to thwart static analysis, either intentionally or unintentionally (such as stripping symbol information to reduce the size of the app binary). Therefore, recent work has focused on dynamic analyses that execute apps and examine their runtime properties (Section 4.2). These analyses have been used for analyzing performance [34, 66, 67], bugs [46, 52, 65], privacy and security [26, 64], compliance [48], and correctness [44] of apps, some at a scale of thousands of apps.

One popular way to scale dynamic analysis to a large number of apps is to use a software automation tool called a "monkey" that can automatically launch and interact with an app (by tapping on buttons, typing text inputs, etc.) in order to navigate to various execution states (or pages) of the app. The monkey is augmented with code tailored to the target analysis; this code is systematically executed while the monkey visits various pages. For example, in DECAF [48], the analysis code algorithmically examines ads in the current page to check if their placement violates ad network policies.

Dynamic analysis of apps is a daunting task (Section 4.2).
At a high level, it consists of exploration logic that guides the monkey to explore various app states and analysis logic that analyzes the targeted runtime properties of the current app state. The exploration logic needs to be optimized for coverage— it should explore a significant portion of the useful app states, and for speed—it should analyze a large 66 collection of apps within a reasonable time. To achieve these goals, existing systems have developed a monkey from scratch and have tuned its exploration logic by leveraging properties of the analysis. For example, AMC [44] and DECAF [48] required analyzing one of each type of app page, and hence their monkey is tuned to explore only unique page types. On the other hand, SmartAds [57] crawled data from all pages, so its monkey is tuned to explore all unique pages. Similarly, the monkeys of VanarSena [65] and ConVirt [46] inject faults at specific execution points, while those of AMC and DECAF only read specific UI elements from app pages. Some systems even instrument app binaries to optimize the monkey [65] or to access app runtime state [57]. In summary, exploration logic and analysis logic are often intertwined and hence a system designed for one analysis cannot be readily used for another. The end effect is that many of the advances developed to handle large-scale studies are only utilizable in the context of the specific analysis and cannot currently be generalized to other analyses. Contributions. In this project we propose PUMA (Section 4.3), a dynamic analysis framework that can be instantiated for a large number of diverse dynamic analysis tasks that, in prior research, used systems built from scratch. PUMA enables analysis of a wide variety of app properties, allows its users to flexibly specify which app states to explore and how, provides programmatic access to the app’s runtime state for analysis, and supports dynamic runtime environment modification. It encapsulates the common components of ex- isting dynamic analysis systems and exposes a number of configurable hooks that can be programmed with a high level event-driven scripting language, called PUMAScript. This language cleanly separates analysis logic from exploration logic, allowing its users to (a) succinctly specify navigation hints for scalable app exploration and (b) separately specify the logic for analyzing the app properties. This design has two distinct advantages. First, it can simplify the analysis of different app proper- ties, since users do not need to develop the monkey, which is often the most challenging part of dynamic analysis. A related benefit is that the monkey can evolve independently of the analysis logic, so that mon- key scaling and coverage improvements can be made available to all users. Second, PUMA can multiplex dynamic analyses: it can concurrently run similar analyses, resulting in better scaling. 67 To validate the design of PUMA, we present the results of seven distinct analyses (many of which are presented in prior work) executed on 3,600 apps from Google Play (Section 4.4). The PUMAScripts for these analyses are each less than 100 lines of code; by contrast, DECAF [48] required over 4,000 lines of which over 70% was dedicated to app exploration. 
Our analyses are valuable in their own right, since they present fascinating insights into the app ecosystem: there appear to be a relatively small number (about 40) of common UI design patterns among Android apps; enabling content search for apps in the app store can increase the relevance of results and yield up to 50 additional results per query on average; over half of the apps violate accessibility guidelines; network usage requirements for apps vary by six orders of magnitude; and a quarter of all apps fail basic stress tests. PUMA can be used in various settings. An app store can usePUMA: the store’s app certification team can use it to verify that a newly submitted app does not violate any privacy and security policies, the advertising team can check if the app does not commit any ad fraud, the app store search engine can crawl app data for indexing, etc. Researchers interested in analyzing the app ecosystem can download PUMA and the apps of interest, customize PUMA for their target analysis, and conduct the analysis locally. A third-party can offer PUMA as a service where users can submit their analyses written in PUMAScript. 4.2 Background and Motivation In this section, we describe the unique requirements of large-scale studies of mobile apps and motivate the need for a programmable UI-based framework for supporting these studies. We also discuss the challenges associated with satisfying these requirements. In Section 4.3, we describe how PUMA addresses these challenges and requirements. 4.2.1 Dynamic Analysis of Mobile Apps Dynamic analysis of software is performed by executing the software, subjecting it to different inputs, and recording (and subsequently analyzing) its internal states and outputs. Mobile apps have a unique structure 68 that enables a novel form of dynamic analysis. By design, most mobile app actions are triggered by user interactions, such as clicks, swipes etc., through the user interface (UI). Mobile apps are also structured to enable such interactions: when the app is launched, a “home page” is shown that includes one or more UI elements (buttons, text boxes, other user interface elements). User interactions with these UI elements lead to other pages, which in turn may contain other UI elements. A user interaction may also result in local computation (e.g., updating game state), network communication (e.g., downloading ads or content), access to local sensors (e.g., GPS), and access to local storage (e.g., saving app state to storage). In the abstract, execution of a mobile app can be modeled as a transition graph where nodes represent various pages and edges represent transitions between pages. The goal of dynamic analysis is to navigate to all pages and to analyze apps’ internal states and outputs at each page. UI-Automation Frameworks. This commonality in the structure of mobile apps can be exploited to automatically analyze their dynamic properties. Recent research has done this using a UI automation framework, sometimes called a monkey, that systematically explores the app execution space. A monkey is a piece of software that runs on a mobile device or on an emulator, and extracts the user-interface structure of the current page (e.g., the home page). This UI structure, analogous to the DOM structure of web pages, contains information about UI elements (buttons and other widgets) on the current page. Using this information, the monkey can, in an automated fashion, click a UI element, causing the app to transition to a new page. 
If the monkey has not visited this (or a similar) page, it can interact with the page by clicking its UI elements. Otherwise, it can click the "back" button to return to the previous page, and click another UI element to reach a different page.¹ In the abstract, each page corresponds to a UI-state and clicking a clickable UI element results in a state transition; using these, a monkey can effectively explore the UI-state transition graph.

¹ Some apps do not include back buttons; this is discussed later.

    System                   Exploration Target                       Page Transition Inputs     Properties Checked     Actions Taken     Instrumentation
    AMC [44]                 Distinct types of pages                  UI events                  Accessibility          None              No
    DECAF [48]               Distinct types of pages containing ads   UI events                  Ad layouts             None              No
    SmartAds [57]            All pages                                UI events                  Page contents          None              Yes
    A3E [15]                 Distinct types of pages                  UI events                  None                   None              Yes
    AppsPlayground [64]      Distinct types of pages                  UI events, text inputs     Information flow       None              Yes
    VanarSena [65]           Distinct types of pages                  UI events, text inputs     App crashes            Inject faults     Yes
    ContextualFuzzing [46]   All pages                                UI events                  Crashes, performance   Change contexts   No
    DynoDroid [52]           Code basic blocks                        UI events, system events   App crashes            System inputs     No

Table 4.1: Recent work that has used a monkey tool for dynamic analysis

4.2.2 Related Work on Dynamic Analysis of Mobile Apps

As discussed above, our work is an instance of a class of dynamic analysis frameworks. Such frameworks are widely used in software engineering for unit testing and random (fuzz) testing. The field of software testing is rather large, so we do not attempt to cover it; the interested reader is referred to [12].

Monkeys have been recently used to analyze several dynamic properties of mobile apps (Table 4.1). AMC [44] evaluates the conformance of vehicular apps to accessibility requirements; for example, apps need to be designed with large buttons and text, to minimize driving distractions. DECAF [48] detects violations of ad placement and content policies in over 50,000 apps. SmartAds [57] crawls contents from an app's pages to enable contextual advertising for mobile apps. A3E [15] executes and visits app pages to uncover potential bugs. AppsPlayground [64] examines information flow for potential privacy leaks in apps. VanarSena [65], ContextualFuzzing [46], and DynoDroid [52] try to uncover app crashes and performance problems by exposing them to various external exceptional conditions.

At a high level, these systems share a common feature: they use a monkey to automate dynamic app execution and use custom code to analyze a specific runtime property as the monkey visits various app states. At a lower level, however, they differ in at least the following five dimensions.

Exploration Target. This denotes what pages in an app are to be explored by the monkey. Fewer pages mean the monkey can perform the analysis faster, but the analysis may be less comprehensive. AMC, A3E, AppsPlayground, and VanarSena aim to visit only pages of unique types. Their analysis goals do not require visiting two pages that are of the same type but contain different contents (e.g., two pages in a news app that are instantiated from the same page class but display different news articles), and hence they omit exploring such pages for greater speed. On the other hand, SmartAds requires visiting all pages with unique content. DECAF can be configured to visit only the pages that are of unique types and that are likely to contain ads.

Page Transition Inputs.
This denotes the inputs that the monkey provides to the app to cause transitions between pages. Most monkeys generate UI events, such as clicks and swipes, to move from one page to another. Some other systems, such as AppsPlayground and VanarSena, can provide text inputs to achieve a better coverage. DynoDroid can generate system inputs (e.g., the “SMS received” event). Properties Checked. This defines what runtime properties the analysis code checks. Different systems check different runtime properties depending on what their analysis logic requires. For example, DECAF checks various geometric properties of ads in the current page in order to identify ad fraud. Actions Taken. This denotes what action the monkey takes at each page (other than transition inputs). While some systems do not take any actions, VanarSena, ContextualFuzzing, and DynoDroid create vari- ous contextual faults (e.g., slow networks, bad user inputs) to check if the app crashes on those faults. Instrumentation. This denotes whether the monkey runs an unmodified app or an instrumented app. Va- narSena instruments apps before execution in order to identify a small set of pages to explore. SmartAds instruments apps to retrieve page contents. Due to these differences, each work listed in Table 4.1 has developed its own automation components from scratch and tuned the tool to explore a specific property of the researchers’ interest. The resulting tools have an intertwining of the app exploration logic and the logic required for analyzing the property 71 of interest. This has meant that many of the advances developed to handle large-scale studies are only utilizable in the context of the specific analyses and cannot be readily generalized to other analyses. PUMA. As mentioned in Section 4.1, our goal is to build a generic framework called PUMA that enables scalable and programmable UI automation, and that can be customized for various types of dynamic anal- ysis (including the ones in Table 4.1). PUMA separates the analysis logic from the automated navigation of the UI-state transition graph, allowing its users to (a) succinctly specify navigation hints for scalable app exploration and (b) separately specify the logic for analyzing the app properties. This has two distinct advantages. It can simplify the analysis of different app properties, since users do not need to develop UI automation components, and the UI automation framework can evolve independently of the analysis logic. As we discuss later, the design of scalable and robust state exploration can be tricky, and PUMA users can benefit from improvements to the underlying monkey, since their analysis code is decoupled from the mon- key itself. Existing monkey tools only generate pseudo-random events and do not permit customization of navigation in ways that PUMA permits. Moreover, PUMA can concurrently run similar analyses, resulting in better scaling of the dynamic analysis. We discuss these advantages below. 4.2.3 Framework Requirements Table 4.1 and the discussion above motivate the following requirements for a programmable UI-automation framework: Support for a wide variety of properties: The goal of using a UI-automation tool is to help users analyze app properties. But it is hard (if not impossible) for the framework to predefine a set of target properties that are going to be useful for all types of analyses. Instead, the framework should provide a set of necessary abstractions that can enable users to specify properties of interest at a high level. 
Flexibility in state exploration: The framework should allow users to customize the UI-state explo- ration. At a high-level, UI-state exploration decides which UI element to click next, and whether a (similar) state has been visited before. Permitting programmability of these decisions will allow 72 analyses to customize the monkey behavior in flexible ways that can be optimized for the analysis at hand. Programmable access to app state: Many of the analyses in Table 4.1 require access to arbitrary app state, not just UI properties, such as the size of buttons or the layout of ads. Examples of app state include dynamic invocations of permissions, network or CPU usage at any given point, or even app-specific internal state. Support for triggered actions: Some of the analyses in Table 4.1 examine app robustness to changes in environmental conditions (e.g., drastic changes to network bandwidth) or exceptional inputs. PUMA must support injecting these runtime behaviors based on user-specified conditions (e.g., change net- work availability just before any call to the network API). These requirements raise significant research questions and challenges. For example, how can PUMA provide users with flexible and easy-to-use abstractions to specify properties that are unknown beforehand? Recall these properties can range from basic UI attributes to those that aim to diagnose various performance bottlenecks. Also, can it provide flexible control of the state exploration, given that the state space may be huge or even infinite? We now describe how PUMA meets these challenges. 4.3 Programmable UI-Automation In this section, we describe PUMA, a programmable framework for dynamic analysis of mobile apps that satisfies the requirements listed in the previous section. We begin with an overview that describes how a user interacts with PUMA and the workflow within PUMA. We then discuss how users can specify analysis code using a PUMAScript, and then discuss the detailed design of PUMA and its internal algorithms. We conclude the section by describing our implementation of PUMA for Android. 73 Figure 4.1: Overview of PUMA 4.3.1 PUMA Overview and Workflow Figure 4.1 describes the overall workflow for PUMA. A user provides two pieces of information as input to PUMA. The first is a set of app binaries that the user wants to analyze. The second is the user-specified code, written in a language called PUMAScript 2 . The script contains all information needed for the dynamic analysis. In the first step of PUMA’s workflow, the interpreter component interprets the PUMAScript specification and recognizes two parts in the script: monkey-specific directives and app-specific directives. The former provides necessary inputs or hints on how apps will be executed by the monkey tool, which are then translated as input to our programmable monkey component. The latter dictates which parts of app code are relevant for analysis, and specifies what actions are to be taken when those pieces of code are executed. These app-specific directives are fed as input to an app instrumenter component. The app instrumenter component statically analyzes the app to determine parts of app code relevant for analysis and instruments the app in a manner described below. The output of this component is the instrumented version of input app that adheres to the app-specific directives in PUMAScript. Then, the programmable monkey executes the instrumented version of each app, using the monkey- specific directives specified in thePUMAScript. 
PUMA is designed to execute the instrumented app either on a 2 In the rest of text, we will usePUMAScript to denote both the language used to write analysis code and the specification program itself; the usage will be clear from the context. 74 phone emulator, or on a mobile device. As a side effect of executing the app,PUMA may produce logs which contain outputs specified in the app-specific directives, as well outputs generated by the programmable monkey. Users can analyze these logs using analysis-specific code; such analysis code is not part of PUMA. In the remainder of this section, we describe these components of PUMA. 4.3.2 The PUMAScript Language Our first design choice for PUMA was to either design a new domain-specific language for PUMAScript or implement it as an extension of some existing language. A new language is more general and can be compiled to run on multiple mobile platforms, but it may also incur a steeper learning curve. Instead, we chose the latter approach and implemented PUMAScript as a Java extension. This choice has its advantage of familiarity for programmers but also limits PUMA’s applicability to some mobile platforms. However, we emphasize that the abstractions in our PUMAScript language are general enough and we should be able to port PUMA to other mobile platforms relatively easily, a task we have left to future work. The next design challenge forPUMA was to identify abstractions that provide sufficient expressivity and enable a variety of analysis tasks, while still decoupling the mechanics of app exploration from analysis code. Our survey of related work in the area (Table 4.1) has influenced the abstractions discussed below. Terminology. Before discussing the abstractions, we first introduce some terminology. The visual ele- ments in a given page of the mobile app consist of one or more UI element. A UI element encapsulates a UI widget, and has an associated geometry as well as content. UI elements may have additional attributes, such as whether they are hidden or visible, clickable or not, etc. The layout of a given page is defined by a UI hierarchy. Analogous to a DOM tree for a web page, a UI hierarchy describes parent-child relationships between elements. One can programmatically traverse the UI hierarchy to determine all the UI elements on a given page, together with their attributes and textual content (image or video content associated with a UI element is usually not available as part of the hierarchy). 75 The UI state of a given page is completely defined by its UI hierarchy. In some cases, it might be desirable to define a more general notion of the state of an app page, which includes the internal program state of an app together with the UI hierarchy. To distinguish it from UI state, we use the term total state of a given app. Given this discussion, a monkey can be said to perform a state traversal: when it performs a UI action on a UI element (e.g., clicks a button), it initiates a state transition which may, in general, cause a com- pletely different app page (and hence UI state) to be loaded. When this loading completes, the app is said to have reached a new state. PUMAScript Design. PUMAScript is an event-based programming language. It allows programmers to specify handlers for events. In general, an event is an abstraction for a specific point in the execution either of the monkey or of a specific app. 
A handler for an event is an arbitrary piece of code that may perform various actions: it can keep and update internal state variables, modify the environment (by altering system settings), and, in some cases, access UI state or total state. This paradigm is an instance of aspect-oriented programming, where the analysis concerns are cleanly separated from app traversal and execution. The advantage of having a scriptable specification, aside from conciseness, is that it is possible (as shown in Section 4.3.3) to optimize joint concurrent execution of multiple PUMAScripts, thereby enabling testing of more apps within a given amount of time. PUMAScript defines two kinds of events: monkey-specific events and app-specific events.

Monkey-specific Events. A monkey-specific event encapsulates a specific point in the execution of a monkey. A monkey is a conceptually simple tool³, and Alg. (2) describes the pseudo-code for a generic monkey, as generalized from the uses of the monkey described in prior work (Table 4.1). The highlighted names in the pseudo-code are PUMA APIs that will be explained later. The monkey starts at an initial state (corresponding to its home page) for an app, and visits other states by deciding which UI action to perform (line 8), and performing the click (line 12). This UI action will, in general, result in a new state (line 13), and the monkey needs to decide whether this state has been visited before (line 15). Once a state has been fully explored, it is no longer considered in the exploration (lines 19-20).

³ However, as discussed later, the implementation of a monkey can be significantly complex.

Algorithm 2: Generic monkey tool. PUMA APIs for configurable steps are highlighted.
 1: while not all apps have been explored do
 2:     pick a new app
 3:     S ← empty stack
 4:     push initial page to S
 5:     while S is not empty do
 6:         pop an unfinished page s_i from S
 7:         go to page s_i
 8:         pick next clickable UI element from s_i              // Next-Click
 9:         if user input is needed (e.g., login/password) then
10:             provide user input by emulating keyboard clicks  // Text Input
11:         effect environmental changes                         // Modifying Environment
12:         perform the click
13:         wait for next page s_j to load
14:         analyze page s_j                                     // In-line Analysis
15:         flag ← s_j is equivalent to an explored page         // State-Equivalence
16:         if not flag then
17:             add s_j to S
18:         update finished clicks for s_i
19:         if all clicks in s_i are explored then
20:             remove s_i from S
21:         flag ← monkey has used too many resources            // Terminating App
22:         if flag or S is empty then
23:             terminate this app

In this algorithm, most of the steps are mechanistic, but six steps involve policy decisions. The first is the decision of whether a state has been visited before (Line 15): prior work in Table 4.1 has observed that it is possible to reduce app exploration time with analysis-specific definitions of state-equivalence. The second is the decision of which UI action to perform next (Line 8): prior work in Table 4.1 has proposed using out-of-band information to direct exploration more efficiently, rather than randomly selecting UI actions. The third is a specification of user input (Line 10): some apps require some form of text input (e.g., a Facebook or Google login). The fourth is a decision (Line 11) of whether to modify the environment as the app page loads: for example, one prior work [65] modifies network state to reduce bandwidth, with the aim of analyzing the robustness of apps to sudden resource availability changes.
The fifth is analysis (Line 14): some prior work has performed in-line analysis (e.g., ad fraud detection [48]). Finally, the sixth 77 is the decision of whether to terminate an app (Line 21): prior work in Table 4.1 has used fixed timeouts, but other policies are possible (e.g., after a fixed number of states have been explored). PUMAScript separates policy from mechanism by modeling these six steps as events, described below. When these events occur, user-defined handlers are executed. (1) State-Equivalence. This abstraction provides a customizable way of specifying whether states are classified as equivalent or not. The inputs to the handler for a state-equivalence event include: the newly visited states j , and the set of previously visited statesS. The handler should returntrue if this new state is equivalent to some previously visited state inS, andfalse otherwise. This capability permits an arbitrary definition of state equivalence. At one extreme, two statess i and s j are equivalent only if their total states are identical. A handler can code this by traversing the UI hierarchies of both states, and comparing UI elements in the hierarchy pairwise; it can also, in addition, compare program internal state pairwise. However, several pieces of work have pointed out that this strict notion of equivalence may not be nec- essary in all cases. Often, there is a trade-off between resource usage and testing coverage. For example, to detect ad violations, it suffices to treat two states as equivalent if their UI hierarchies are “similar” in the sense that they have the same kinds of UI elements. Handlers can take one of two approaches to define such fuzzier notions of equivalence. They can implement app-specific notions of similarity. For example, if an analysis were only interested in UI properties of specific types of buttons (like [44]), it might be sufficient to declare two states to be equivalent if one had at least one instance of each type of UI element present in the other. A more generic notion of state equivalence can be obtained by collecting features derived from states, then defining similarity based on distance metrics for the feature space. In DECAF [48], we defined a generic feature vector encoding the structure of the UI hierarchy, then used the cosine-similarity metric 4 with a user-specified similarity threshold, to determine state equivalence. This state equivalence function is built into PUMA, so a PUMAScript handler can simply invoke this function with the appropriate threshold. 4 http://en.wikipedia.org/wiki/Cosine\_similarity 78 A handler may also define a different set of features, or different similarity metrics. The exploration of which features might be appropriate, and how similarity thresholds affect state traversal is beyond the scope of this work. (2) Next-Click. This event permits handlers to customize how to specify which element to click next. The input to a handler is the current UI state, together with the set of UI elements that have already been clicked before. A handler should return a pointer to the next UI element to click. Handlers can implement a wide variety of policies with this flexibility. A simple policy may decide to explore UI elements sequentially, which may have good coverage, but increase exploration time. Alterna- tively, a handler may want to maximize the types of elements clicked; prioritizing UI elements of different types over instances of a type of UI element that has been clicked before. 
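As an illustration, a Next-Click handler implementing the type-prioritizing policy might look like the sketch below. The getNextClick signature follows the PUMAScript example in Listing 4.1, but the UI-state accessors (getClickableElements, getType) and the stand-in interfaces are assumptions made for the sketch, not necessarily PUMA's actual API.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Minimal stand-ins so the sketch is self-contained; in an actual PUMAScript these
    // types come from the PUMA framework.
    interface UIElement { String getType(); }
    interface UIState { List<UIElement> getClickableElements(); }

    // Sketch of a Next-Click handler that prefers UI element types it has not clicked
    // yet, falling back to sequential exploration otherwise.
    class TypePrioritizingClicker {
        private final Set<String> clickedTypes = new HashSet<>();

        int getNextClick(UIState s) {
            List<UIElement> clickable = s.getClickableElements();
            for (int i = 0; i < clickable.size(); i++) {
                String type = clickable.get(i).getType();   // e.g., "Button", "ListView"
                if (clickedTypes.add(type)) {
                    return i;   // a type we have not exercised yet: click it first
                }
            }
            return 0;   // every type on this page has been clicked; a real handler could
                        // defer to the built-in sequential policy instead
        }
    }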
These two policies are built into PUMA for user convenience. Handlers can also use out-of-band information to implement directed exploration. Analytics from real users can provide insight into how real users prioritize UI actions: for example, an expert user may rarely click a Help button. Insights like these, or even actual traces from users, can be used to direct exploration to visit states that are more likely to be visited by real users. Another input to directed exploration is static analysis: the static analysis may reveal that button A can lead to a particular event handler that sends a HTTP request, which is of interest to the specific analysis task at hand. The handler can then prioritize the click of button A in every visited state. (3) Text Input. The handler of this event provides the text input required for exploration to proceed. Often, apps require login-based authentication to some cloud-service before permitting use of the app. The input to the handler are the UI state and the text box UI element which requires input. The handler’s output includes the corresponding text (login, password etc.), using which the monkey can emulate keyboard actions to generate the text. If the handler for this event is missing, and exploration encounters a UI element that requires text input, the monkey stops exploring the app. (4) Modifying the Environment. This event is triggered just before the monkey clicks a UI element. The corresponding handler for this event takes as input the current UI state, and the UI element to be clicked. 79 Based on this information, the handler may enable or disable devices, dynamically change network avail- ability using a network emulator, or change other aspects of the environment in order to stress-test apps. This kind of modification is coarse-grained, in the sense that it occurs before the entire page is loaded. It is also possible to perform more fine-grained modifications (e.g., reducing network bandwidth just before ac- cessing the network) using app-specific events, described below. If a handler for this event is not specified, PUMA skips this step. (5) In-line Analysis. The in-line analysis event is triggered after a new state has completed loading. The handler for this event takes as input the current total state; the handler can use the total state information to perform analysis-specific computations. For example, an ad fraud detector can analyze the layout of the UI hierarchy to ensure compliance to ad policies [48]. A PUMAScript may choose to forgo this step and perform all analyses off-line; PUMA outputs the explored state transition graph together with the total states for this purpose. (6) Terminating App Exploration. Depending on the precise definition of state equivalence, the number of states in the UI state transition graph can be practically limitless. A good example of this is an app that shows news items. Each time the app page that lists news items is visited, a new news item may be available which may cause the state to be technically not equivalent to any previously visited state. To counter such cases, most prior research has established practical limits on how long to explore an app. PUMA provides a default timeout handler for the termination decision event, which terminates an app after its exploration has used up a certain amount of wall-clock time. A PUMAScript can also define other handlers that make termination decisions based on the number of states visited, or CPU, network, or energy resources used. App-specific Events. 
In much the same way that monkey-specific events abstract specific points in the execution of a generic monkey, an app-specific event abstracts a specific point in app code. Unlike monkey-specific events, which are predetermined because of the relative simplicity of a generic monkey, app-specific events must be user-defined since it is not known a priori what kinds of instrumentation tasks will be needed. In a PUMAScript, an app-specific event is defined by naming an event and associating the named event with a codepoint set [35]. A codepoint set is a set of instructions (e.g., bytecodes or invocations of arbitrary functions) in the app binary, usually specified as a regular expression on class names, method names, or names of specific bytecodes. Thus, a codepoint set defines a set of points in the app binary where the named event may be said to occur. Once named events have been described, a PUMAScript can associate arbitrary handlers with these named events. These handlers have access to app-internal state and can manipulate program state, can output state information to the output logs, and can also perform finer-grained environmental modifications.

A Sample PUMAScript. Listing 4.1 shows a PUMAScript designed to count the network usage of apps. A PUMAScript is effectively a Java extension, where a specific analysis is described by defining a new class inherited from a PUMAScript base class. This class (in our example, NetworkProfiler) defines handlers for monkey-specific events (lines 2-7), and also defines events and associated handlers for app-specific events. It uses the inbuilt feature-based similarity detector with a threshold that permits fuzzy state equivalence (line 3), and uses the default next-click function, which traverses each UI element in each state sequentially (line 6). It defines one app-specific event, which is triggered whenever execution invokes the HTTPClient library (lines 10-11), and defines two handlers, one (line 21) before the occurrence of the event (i.e., the invocation) and another (line 24) after the occurrence of the event. These handlers respectively log the size of the network request and response. The total network usage of an app can be obtained by post-facto analysis of the log.

1  class NetworkProfiler extends PUMAScript {
2    boolean compareState(UIState s1, UIState s2) {
3      return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
4    }
5    int getNextClick(UIState s) {
6      return MonkeyInputFactory.nextClickSequential(s);
7    }
8    void specifyInstrumentation() {
9      Set<CodePoint> userEvent;
10     CPFinder.setBytecode("invoke.*", "HTTPClient.execute(HttpUriRequest request)");
11     userEvent = CPFinder.apply();
12     for (CodePoint cp : userEvent) {
13       UserCode code = new UserCode("Logger", "countRequest", CPARG);
14       Instrumenter.place(code, BEFORE, cp);
15       code = new UserCode("Logger", "countResponse", CPARG);
16       Instrumenter.place(code, AFTER, cp);
17     }
18   }
19 }
20 class Logger {
21   void countRequest(HttpUriRequest req) {
22     Log(req.getRequestLine().getUri().getLength());
23   }
24   void countResponse(HttpResponse resp) {
25     Log(resp.getEntity().getContentLength());
26   }
27 }
Listing 4.1: Network usage profiler

4.3.3 PUMA Design

PUMA incorporates a generic monkey (Alg. (2)), together with support for events and handlers. One or more PUMAScripts are input to PUMA, together with the apps to be analyzed. The PUMAScript interpreter instruments each app in a manner designed to trigger the app-specific events. One way to do this is to instrument apps to transfer control back to PUMA when the specified code point is reached. The advantage of this approach is that app-specific handlers can then have access to the explored UI states, but it would make it harder for PUMA to expose app-specific internal state. Instead, PUMA chooses to instrument apps so that app-specific handlers are executed directly within the app context; this way, handlers have access to arbitrary program state information. For example, in line 22 of Listing 4.1, the handler can access the size of the HTTP request made by the app.

After each app has been instrumented, PUMA executes the algorithm described in Alg. (2), but with explicit events and associated handlers. The six monkey-specific event handlers are highlighted in Alg. (2) and are invoked at relevant points. Because app-specific event handlers are instrumented within app binaries, they are implicitly invoked when a specific UI element has been clicked (line 12).

PUMA can also execute multiple PUMAScripts concurrently. This capability provides scaling of the analyses, since each app need only be run once. However, arbitrary concurrent execution is not possible, and concurrently executed scripts must satisfy two sets of conditions. Consider two PUMAScripts A and B. In most cases, these scripts can be run concurrently only if the handlers for each monkey-specific event for A are identical to or a strict subset of the handlers for B. For example, consider the state equivalence handler: if A's handler visits a superset of the states visited by A and B, then it is safe to concurrently execute A and B. Analogously, the next-click handler for A must be identical with that of B, and the text input handler for both must be identical (otherwise, the monkey would not know which text input to use). However, the analysis handler for the two scripts can (and will) be different, because this handler does not alter the sequence of the monkey's exploration. By a similar reasoning, for A and B to be run concurrently, their app-specific event handlers must be disjoint (they can also be identical, but that is less interesting since that means the two scripts are performing identical analyses), and they must either modify the environment in the same way or not modify the environment at all. In our evaluation, we demonstrate this concurrent PUMAScript execution capability.

In future work, we plan to derive static analysis methods by which the conditions outlined in the previous paragraph can be tested, so that it may be possible to automate the decision of whether two PUMAScripts can run concurrently. Finally, this static analysis can be simplified by providing, as PUMA does, default handlers for various events.

4.3.4 Implementation of PUMA for Android

We have designed PUMA to be broadly applicable to different mobile computing platforms. The abstractions PUMA uses are generic and should be extensible to different programming languages. However, we have chosen to instantiate PUMA for the Android platform because of its popularity and the volume of active research that has explored Android app dynamics. The following paragraphs describe some of the complexity of implementing PUMA in Android. Much of this complexity arises because of the lack of complete native UI automation support in Android.

Defining a Page State.
The UI state of an app, defined as the current topmost foreground UI hierarchy, is central to PUMA. The UI state might represent part of a screen (e.g., a pop-up dialog window), a single screen, or more than one screen (e.g., a webview that needs scrolling to finish viewing). Thus, in general, a UI state may cover sections of an app page that are not currently visible. In Android, the UI hierarchy for an app’s page can be obtained fromhierarchyviewer [1] or the uiautomator [2] tool. We chose the latter because it supports many Android devices and has built-in support for UI event generation and handling, while the former only works on systems with debugging 83 support (e.g., special developer phones from google) and needs an additional UI event generator. How- ever, we had to modify the uiautomator to intercept and access the UI hierarchy programmatically (the default tool only allows dumping and storing the UI state to external storage). Theuiautomator can also report the UI hierarchy for widgets that are generated dynamically, as long as they support the AccessibilityService like default Android UI widgets. Supporting Page Scrolling. Since smartphones have small screens, it is common for apps to add scrolling support to allow users to view all the contents in a page. However,uiautomator only returns the part of the UI hierarchy currently visible. To overcome this limitation, PUMA scrolls down till the end of the screen, extracts the UI hierarchy in each view piecemeal, and merges these together to obtain a composite UI hierarchy that represents the UI state. This turns out to be tricky for pages that can be scrolled vertically and/or horizontally, sinceuiautomator does not report the direction of scrollability for each UI widget. For those that are scrollable, PUMA first checks whether they are horizontally or vertically scrollable (or both). Then, it follows a zig-zag pattern (scrolls horizontally to the right end, vertically down one view, then horizontally to the left end) to cover the non-visible portions of the current page. To merge the scrolled states, PUMA relies on theAccessibilityEvent listener to intercept the scrolling response, which contains hints for merging. For example, forListView, this listener reports the start and the end entry indices in the scrolled view; forScrollView andWebView, it reports the co-ordinate offsets with respect to the global coordinate. Detecting Page Loading Completion. Android does not have a way to determine when a page has been completely loaded. State loading can take arbitrary time, especially if its content needs to be fetched over the network. PUMA uses a heuristic that detects page loading completion based onWINDOW CONTENT - CHANGED events signaled by the OS, since this event is fired whenever there is a content change or update in the current view. For example, a page that relies on network data to update its UI widgets will trigger one such event every time it receives new data that causes the widget to be rendered. PUMA considers a page to 84 be completely loaded when there is no content-changed event in a window of time that is conservatively determined from the inter-arrival times of previous content-changed events. Instrumenting Apps. PUMA uses SIF [35] in the backend to instrument app binaries. However, other tools that are capable of instrumenting Android app binaries can also be used. Environment Modifications by Apps. 
We observed that whenPUMA runs apps sequentially on one device, it is possible that an app may change the environment (e.g., some apps turn off WiFi during their execution), affecting subsequent apps. To deal with this, PUMA restores the environment (turning on WiFi, enabling GPS, etc.) after completing each app, and before starting the next one. Implementation Limitations. Currently, our implementation uses Android’suiautomator tool that is based on the underlyingAccessibilityService in the OS. So any UI widgets that do not support such service cannot be supported by our tool. For example, some user-defined widgets do not use any existing Android UI support at all, so are inaccessible to PUMA. However, in our evaluations described later, we find relatively few instances of apps that use user-defined widgets, likely because of Android’s extensive support for UI programming. Finally, PUMA does not support non-deterministic UI events like random swipes, or other customized user gestures, which are fundamental problems for any monkey-based automation tool. In particular, this limitation rules out analysis of games, which is an important category of Android apps. To our knowledge, no existing monkeys have overcome this limitation. It may be possible to overcome this limitation by passively observing real users and “learning” user-interface actions, but we have left this to future work. 4.4 Evaluation The primary motivation forPUMA is rapid development of large-scale dynamic mobile app analyses. In this section, we validate that PUMA enables this capability: in a space of two weeks, we were able to develop 7 distinct analyses and execute each of them on a corpus of 3,600 apps. Beyond demonstrating this, our 85 evaluations provide novel insights into the Android app ecosystem. Before discussing these analyses, we discuss our methodology. 4.4.1 Methodology Apps. We downloaded 18,962 top free apps 5 , in 35 categories, from the Google Play store with an app crawler [8] that implements the Google Play API. Due to the incompleteness of the Dalvik to Java trans- lator tool we use for app instrumentation [35], some apps failed the bytecode translation process, and we removed those apps. Then based on the app name, we removed foreign-language apps, since some of our analyses are focused on English language apps, as we discuss later. We also removed apps in the game, social, or wallpaper categories, since they either require many non-deterministic UI actions or do not have sufficient app logic code (some wallpaper apps have no app code at all). These filtering steps resulted in a pool of 9,644 apps spread over 23 categories, from which we randomly selected 3,600 apps for the experiments below. This choice was dictated by time constraints for our evaluation. Emulators vs Phones. We initially tried to execute PUMA on emulators running concurrently on a single server. Android emulators were either too slow or unstable, and concurrency was limited by the perfor- mance of graphics cards on the server. Accordingly, our experiments use 11 phones, each running an instance of PUMA: 5 Galaxy Nexus, 5 HTC One, and 1 Galaxy S3, all running Android 4.3. The corpus of 3,600 apps is partitioned across these phones, and the PUMA instance on each phone evaluates the apps in its partition sequentially. PUMA is designed to work on emulators as well, so it may be possible to scale the analyses by running multiple cloud instances of the emulator when the robustness of emulators improves. 
4.4.2 PUMA Scalability and Expressivity To evaluate PUMA’s expressivity and scalability, we used it to implement seven distinct dynamic analyses. Table 4.2 lists these analyses. In subsequent subsections, we describe these analyses in more detail, but first we make a few observations about these analyses and about PUMA in general. 5 The versions of these apps are those available on Oct 3, 2013. 86 First, we executed PUMAScripts for three of these analyses concurrently: UI structure classifier, ad fraud detection, and accessibility violation detection. These three analyses use similar notions of state equivalence and do not require any instrumentation. We could also have run the PUMAScripts for network usage profiler and permission usage profiler concurrently, but did not do so for logistical reasons. These apps use similar notions of state equivalence and perform complementary kinds of instrumentation; the permission usage profiler also instruments network calls, but in a way that does not affect the network usage profiler. We have verified this through a small-scale test of 100 apps: the combined analyses give the same results as the individual analyses, but use only the resources required to run one analysis. In future work, we plan to design an optimizer that automatically determines whether two PUMAScripts can be run concurrently and performs inter-script optimizations for concurrent analyses. Second, we note that for the majority of our analyses, it suffices to have fuzzier notions of state equiv- alence. Specifically, these analyses declare two states to be equivalent if the cosine similarity between feature vectors derived from each UI structure is above a specified threshold. In practice, this means that two states whose pages have different content, but similar UI structure, will be considered equivalent. This is shown in Table 4.2, with the value “structural” in the “State-Equivalence” column. For these analyses, we are able to run the analysis to completion for each of our 3,600 apps: i.e., the analysis terminates when all applicable UI elements have been explored. For the single analysis that required an identical match, we had to limit the exploration of an app to 20 minutes. This demonstrates the importance of exposing programmable state equivalence in order to improve the scalability of analyses. Third, PUMA enables extremely compact descriptions of analyses. Our largest PUMAScript is about 20 lines of code. Some analyses require non-trivial code in user-specified handlers; this is labeled “user code” in Table 4.2. The largest handler is 60 lines long. So, for most analyses, less than 100 lines is sufficient to explore fairly complex properties. In contrast, the implementation of DECAF [48] was over 4,300 lines of code, almost 50 higher; almost 70% of this code went towards implementing the monkey functionality. 
Note that some analyses require post-processing code; we do not count this in our evaluation of PUMA's expressivity, since that code is presumably comparable whether PUMA or a hand-crafted monkey is used.

Analysis | Properties Studied | State-Equivalence | App Instrumentation | PUMAScript (LOC) | User Code (LOC)
Accessibility violation detection | UI accessibility violation | structural | no | 11 | 60
Content-based app search | in-app text crawling | exact | no | 14 | 0
UI structure classifier | structural similarity in UI | structural | no | 11 | 0
Ad fraud detection | ad policy violation | structural | no | 11 | 52
Network usage profiler | runtime network usage | structural | yes | 19 | 8
Permission usage profiler | permission usage | structural | yes | 20 | 5
Stress testing | app robustness | structural | yes | 16 | 5
Table 4.2: List of analyses implemented with PUMA

Finally, another measure of scalability is the speed of the monkey. PUMA's programmable monkey explored 15 apps per hour per phone, so in about 22 hours we were able to run our structural similarity analysis on the entire corpus of apps. This rate is faster than the rates reported in prior work [44, 48]. The monkey was also able to explore about 65 app states per hour per phone, for a total of over 100,000 app states across all 7 analyses. As discussed above, PUMA ran to completion for our structural similarity-based analyses for every app. However, we do not evaluate coverage, since our exploration techniques are borrowed from prior work [48] and that work has evaluated the coverage of these techniques.

4.4.3 Analysis 1: Accessibility Violation Detection

Best practices in app development include guidelines for app design, either for differently-abled people or for use in environments with minimal interaction time requirements (e.g., in-vehicle use). Beyond these guidelines, it is desirable to have automated tests for accessibility compliance, as discussed in prior work [44]. From an app store administrator's perspective, it is important to be able to classify apps based on their accessibility support so that users can be more informed in their app choices. For example, elderly persons who have a choice of several email apps may choose the ones that are more accessible (e.g., those that have large buttons with enough space between adjacent buttons).

In this dynamic analysis, we use PUMA to detect a subset of accessibility violations studied in prior work [44]. Specifically, we flag the following violations: if a state contains more than 100 words; if it contains a button smaller than 80 mm²; if it contains two buttons whose centers are less than 15 mm apart; and if it contains a scrollable UI widget. We also check if an app requires a significant number of user interactions to achieve a task by computing the maximum shortest round-trip path between any two UI states based on the transition graph generated during monkey exploration. This prior work includes other accessibility violations: detecting distracting animations can require a human in the loop, and is not suitable for the scale that PUMA targets; and analyzing the text contrast ratio requires OS modifications. Our work scales this analysis to a much larger number of apps (3,600 vs. 12) than the prior work, demonstrating some of the benefits of PUMA. Our PUMAScript has 11 lines of code (shown in Listing B.1), and is similar in structure to ad fraud detection. It uses structural matching for state equivalence, and detects these accessibility violations using an in-line analysis handler, AMCChecker.inspect().
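To make these checks concrete, the sketch below shows the general shape of such an in-line analysis handler. The thresholds (100 words, 80 mm² buttons, 15 mm spacing) are the ones listed above; since Listing B.1 is not reproduced here, the AMCChecker structure, the UIState accessors (getVisibleText(), getButtons(), getScrollableWidgets()), and the millimeter geometry helpers are all hypothetical.

    // Sketch of an in-line accessibility checker in the spirit of AMCChecker.inspect().
    // The UIState/Button accessors and millimeter units are assumptions.
    class AMCChecker {
      static void inspect(UIState s) {
        if (wordCount(s.getVisibleText()) > 100)
          Log("violation: more than 100 words");
        List<Button> buttons = s.getButtons();
        for (Button b : buttons)
          if (b.widthMm() * b.heightMm() < 80)                   // button area below 80 mm^2
            Log("violation: button too small");
        for (int i = 0; i < buttons.size(); i++)
          for (int j = i + 1; j < buttons.size(); j++)
            if (centerDistanceMm(buttons.get(i), buttons.get(j)) < 15)
              Log("violation: buttons too close");
        if (!s.getScrollableWidgets().isEmpty())
          Log("violation: scrollable widget present");
      }
      static int wordCount(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
      }
      static double centerDistanceMm(Button a, Button b) {
        double dx = a.centerXMm() - b.centerXMm(), dy = a.centerYMm() - b.centerYMm();
        return Math.sqrt(dx * dx + dy * dy);
      }
    }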
Table 4.3 shows the number of apps falling into different categories of violations, and the number of apps with more than one type of violation. We can see that 475 apps have maximum round-trip paths greater than 10 (the threshold used in [44]), 552 for word count, 1,276 for button size, 1,147 for button distance, and 2,003 for scrolling. Thus, almost 55% of our apps violate the guideline that suggests not having a scrollable widget to improve accessibility. About one third of the violating apps have only one type of violation and less than one third have two or three types of violations. Less than one tenth of the apps violate all five properties. This suggests that most apps in current app stores are not designed with general accessibility or vehicular settings in mind. An important actionable result from our findings is that our analyses can be used to automatically tag apps for "accessibility friendliness" or "vehicle unfriendliness". Such tags can help users find relevant apps more easily, and may incentivize developers to target apps towards segments of users with special needs.

violation | user actions per task | word count | button size | button distance | scrolling
#apps | 475 | 552 | 1276 | 1147 | 2003
violation types | 1 type | 2 types | 3 types | 4 types | 5 types
#apps | 752 | 683 | 656 | 421 | 223
Table 4.3: Accessibility violation results

4.4.4 Analysis 2: Content-based App Search

All app stores allow users to search for apps. To answer user queries, stores index various app metadata: e.g., app name, category, developer-provided description, etc. That index does not use app content, i.e., content that an app reveals at runtime to users. Thus, a search query (e.g., for a specific recipe) can fail if the query does not match any metadata, even though the query might match the dynamic runtime content of some of these apps (e.g., culinary apps).

One solution to the above limitation is to crawl app content by dynamic analysis and index this content as well. We program PUMA to achieve this. Our PUMAScript for this analysis contains 14 lines of code (shown in Listing B.2) and specifies a strong notion of state equivalence: two states are equivalent only if their UI hierarchies are identical and their contents are identical. Since the content of a given page can change dynamically, even during exploration, the exploration may, in theory, never terminate. So, we limit each app to run for 20 minutes (using PUMA's terminating app exploration event handler). Finally, the PUMAScript scrapes the textual content from the UI hierarchy in each state and uses the in-line analysis event handler to log this content. We then post-process this content to build three search indices: one that uses the app name alone, a second that includes the developer's description, and a third that also includes the crawled content. We use Apache Lucene (http://lucene.apache.org/core/), an open-source full-featured text search engine, for this purpose.

Keyword Type (Number of Queries) | Search Type | Rate of Queries with Valid Return | Statistics of Valid Search Return (Min / Max / Mean / Median)
App Store Popular Keywords (200) | Name | 68% | 1 / 115 / 17 / 4
App Store Popular Keywords (200) | Name + Desc. | 93% | 1 / 1234 / 156.54 / 36.50
App Store Popular Keywords (200) | Name + Desc. + Crawl | 97% | 1 / 1473 / 200.46 / 46
Bing Trace Search Keywords (9.5 million) | Name | 54.09% | 1 / 311 / 8.31 / 3
Bing Trace Search Keywords (9.5 million) | Name + Desc. | 81.68% | 1 / 2201 / 199.43 / 66
Bing Trace Search Keywords (9.5 million) | Name + Desc. + Crawl | 85.51% | 1 / 2347 / 300.37 / 131
Table 4.4: Search results

We now demonstrate the efficacy of content-based search for apps.
For this, we use two search-keyword datasets to evaluate the generated indices: (1) the 200 most popular app store queries (http://goo.gl/JGyO5P) and (2) a trace of 10 million queries from the Bing search engine. By replaying those queries on the three indices, we find (Table 4.4) that the index with crawled content yields at least 4% more non-empty queries than the one which uses app metadata alone. More importantly, on average, each query returns about 50 more apps (from our corpus of 3,600) for the app store queries and about 100 more apps for the Bing queries.

Here are some concrete examples that demonstrate the value of indexing dynamic app content. For the search query "jewelery deals", the metadata-based index returned many "deals" and "jewelery" apps, while the content-based index returned as the top result an app (Best Deals) that was presumably advertising a deal for a jewelry store. (In practice, for search to be effective, apps with dynamic content need to be crawled periodically.) Some queries (e.g., "xmas" and "bejeweled") returned no answers from the metadata-based index, but the content-based index returned several apps that seemed to be relevant on manual inspection. These examples show that app stores can greatly improve search relevance by crawling and indexing dynamic app content, and PUMA provides a simple way to crawl the data.

4.4.5 Analysis 3: UI Structure Classifier

In this analysis, we program PUMA to cluster apps based on their UI state transition graphs so that apps within the same cluster have the same "look and feel". The clusters can be used as input to clone detection algorithms [32], reducing the search space for clones: the intuition here is that the UI structure is the easiest part to clone and cloned apps might have very similar UI structures to the original one. Moreover, developers who are interested in improving the UI design of their own apps can selectively examine a few apps within the same cluster as theirs and do not need to exhaustively explore the complete app space.

The PUMAScript for this analysis is only 11 lines (shown in Listing B.3) and uses structural page similarity to define state equivalence. It simply logs UI states in the in-line analysis event handler. After the analysis, for each app, we represent its UI state transition graph by a binary adjacency matrix, then perform Singular Value Decomposition (SVD, http://en.wikipedia.org/wiki/Singular_value_decomposition) on the matrix, and extract the singular value vector. SVD techniques have been widely used in many areas such as general classification, pattern recognition, and signal processing. Since the singular vector is sorted by the importance of the singular values, we only keep those vector elements (called primary singular values) which are greater than one tenth of the first element. Finally, the Spectral Clustering algorithm (http://en.wikipedia.org/wiki/Spectral_clustering) is employed to cluster those app vectors, with each entry of the similarity matrix defined as follows:

m_{ij} = \begin{cases} 0, & \text{if } \dim(v_i) \neq \dim(v_j) \text{ or } d_{ij} > r_{spatial} \\ e^{-d_{ij}}, & \text{otherwise} \end{cases}

where v_i and v_j are the singular vectors of two different apps i and j, and d_{ij} is the Euclidean distance between them. dim() gives the vector dimension, and we only consider two apps to be in the same cluster if they have the same number of primary singular values. Finally, the radius r_spatial is a tunable parameter for the algorithm: the larger the radius, the further out the algorithm searches for clusters around a given point (singular vector).
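As a concrete illustration, the following sketch assembles the similarity matrix above from per-app singular vectors, assuming the primary singular values for each app have already been computed (the SVD itself would be done with a linear-algebra library and is not shown). The e^{-d_ij} decay follows the definition above.

    // Sketch: build the spectral-clustering similarity matrix from per-app
    // singular vectors. The inputs and decay function are as defined above;
    // everything else (plain arrays, method name) is illustrative.
    double[][] buildSimilarityMatrix(double[][] singularVectors, double rSpatial) {
      int n = singularVectors.length;
      double[][] m = new double[n][n];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          double[] vi = singularVectors[i], vj = singularVectors[j];
          if (vi.length != vj.length) {        // different numbers of primary singular values
            m[i][j] = 0.0;
            continue;
          }
          double d = 0.0;                      // Euclidean distance d_ij
          for (int k = 0; k < vi.length; k++)
            d += (vi[k] - vj[k]) * (vi[k] - vj[k]);
          d = Math.sqrt(d);
          m[i][j] = (d > rSpatial) ? 0.0 : Math.exp(-d);
        }
      }
      return m;
    }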
Following the above process, Figure 4.2 shows the number of clusters and average apps per cluster for different spatial radii. As the radius increases, each cluster becomes larger and the number of clusters decreases, as expected. The number of clusters stabilizes beyond a certain radius and reaches 38 for a radius of 3. The CDF of cluster size for r_spatial = 3 is shown in Figure 4.3. By manually checking a small set of apps, we confirm that apps in the same cluster have pages with very similar UI layouts and transition graphs.

Figure 4.2: App clustering for UI structure classification (number of clusters and average cluster size vs. spatial radius)
Figure 4.3: CDF of cluster size for r_spatial = 3

Our analysis reveals a few interesting findings. First, there exists a relatively small number of UI design patterns (i.e., clusters). Second, the number of apps in each cluster can be quite different (Figure 4.3), ranging from one app per cluster to more than 300 apps, indicating that some UI design patterns are more common than others. Third, preliminary evaluations also suggest that most apps from a developer fall into the same cluster; this is perhaps not surprising given that developers specialize in categories of apps and likely reuse a significant portion of their code across apps. Finally, manual verification reveals the existence of app clones. For example, Figure 4.4 shows two apps from one cluster that have nearly the same UI design, with slightly different colors and button styles, but were developed by different developers. (We emphasize that clone detection requires sophisticated techniques well beyond UI structure matching; designing clone detection algorithms is beyond the scope of this project.)

Figure 4.4: An app clone example (one app per rectangle)

4.4.6 Analysis 4: Ad Fraud Detection

Recent work [48] has used dynamic analysis to detect various ad layout frauds for Windows Store apps, by analyzing the geometry (size, position, etc.) of ads at runtime. Examples of such frauds include (a) hidden ads: ads hidden behind other UI controls so the apps appear to be ad-free; (b) intrusive ads: ads placed very close to or partially behind clickable controls to trigger inadvertent clicks; (c) too many ads: placing too many ads in a single page; and (d) small ads: ads too small to see. We program PUMA to detect similar frauds in Android apps.

Our PUMAScript for ad fraud detection catches small, intrusive, and too many ads per page. We have chosen not to implement detection of hidden ads on Android, since, unlike Microsoft's ad network [9], Google's ad network does not pay developers for ad impressions [7], and only pays them for ad clicks, so there is no incentive for Android developers to hide ads. Our PUMAScript requires 11 lines (shown in Listing B.4) and uses structural match for state equivalence. It checks for ad frauds within the in-line analysis handler; this requires about 52 lines of code. This handler traverses the UI view tree, searches for the WebView generated by ads, and checks its size and relationship with other clickable UI elements. It outputs all the violations found in each UI state.

Table 4.5 lists the number of apps that have one or more violations. In all, 13 out of our 3,600 apps violate ad policies.
Furthermore, all 13 apps have small ads, which can improve user experience by devoting more screen real estate to the app, but can reduce the visibility of the ad and adversely affect the advertiser. Seven apps show more than one ad on at least one of their pages, and 10 apps display ads in a different position than required by ad networks. Finally, if we examine violations by type, 7 apps exhibit all three violations, 3 apps exhibit one, and 3 exhibit two violations.

violation | small | many | intrusive | 1 type | 2 types | 3 types
#apps | 13 | 7 | 10 | 3 | 3 | 7
Table 4.5: Ad fraud results

These numbers appear to be surprisingly small, compared to results reported in [48]. To understand this, we explored several explanations. First, we found that the Google AdMob API enforces ad size, number, and placement restrictions, so developers cannot violate these policies. Second, we found that 10 of our 13 violators use ad providers other than AdMob, like millennialmedia, medialets, and LeadBolt. These providers' APIs give developers the freedom to customize ad sizes, conflicting with AdMob's policy of predefined ad sizes. We also found that, of the apps that did not exhibit ad fraud, only about half used AdMob and the rest used a wide variety of ad network providers. Taken together, these findings suggest that the likely reason the incidence of ad fraud is low in Android is that developers have little incentive to cheat, since AdMob pays for clicks and not impressions (all the frauds we tested for are designed to inflate impressions). In contrast, the occurrence of ad fraud in Windows phones is much higher because (a) 90% of the apps use the Microsoft ad network, (b) that network's API allows developers to customize ads, and (c) the network pays both for impressions and clicks.

4.4.7 Analysis 5: Network Usage Profiler

About 62% of the apps in our corpus need to access resources from the Internet to function. This provides a rough estimate of the number of cloud-enabled mobile apps in the Android marketplace, and is an interesting number in its own right. But beyond that, it is important to quantify the network usage of these apps, given the prevalence of usage-limited cellular plans and the energy cost of network communication [34].

PUMA can be used to approximate the network usage of an app by dynamically executing the app and measuring the total number of bytes transferred. Our PUMAScript for this has 19 lines of code (shown in Listing 4.1), and demonstrates PUMA's ability to specify app instrumentation. This script specifies structural matching for state equivalence; this can undercount the network usage of the app, since PUMA would not visit similar states. Thus, our results present lower bounds for the network usage of apps. To count network usage, our PUMAScript specifies a user-defined event that is triggered whenever the HTTPClient library's execute function is invoked (Listing 4.1). The handler for this event counts the size of the request and response.

Figure 4.5: Network traffic usage (CDF of the number of bytes transferred per app)

Figure 4.5 shows the CDF of network usage for 2,218 apps; the x-axis is in logarithmic scale. The network usage across apps varies by 6 orders of magnitude, from about 1KB to several hundred MB. Half the apps use more than 206KB of data, and about 20% use more than 1MB of data.
More surprisingly, 5% of apps use more than 10MB of data, 100 times more than the lowest 40% of the apps. The heaviest network users (the tail) are all video streaming apps that stream news and daily shows. For example, the "CNN Student News" app, which delivers podcasts and videos of the top daily news items to middle and high school students, has a usage of over 700MB. We looked at the 508 apps that use more than 1MB of data and classified them by app category. The top five categories are "News and Magazines", "Sports", "Library and Demo", "Media and Video", and "Entertainment". This roughly matches our expectation that these heavy hitters would be heavy users of multimedia information.

This diversity in network usage suggests that it might be beneficial for app stores to automatically tag apps with their approximate network usage, perhaps on a logarithmic scale. This kind of information can help novice users determine whether they should use an app when WiFi is unavailable, and may incentivize developers to develop bandwidth-friendly apps.

4.4.8 Analysis 6: Permission Usage Profiler

Much research has explored the Android security model and the use of permissions. In particular, research has tried to understand the implications of permissions [29], designed better user interfaces to help users make more informed decisions [69], and proposed fine-grained permissions [40]. In this analysis, we explore the runtime use of permissions and relate that to the number of permissions requested by an app. This is potentially interesting because app developers may request more permissions than are actually used in the code. Static analysis can reveal an upper bound on the permissions needed, but provides few hints on actual permission usage. With PUMA, we can implement a permission usage profiler, which logs every permission use during app execution. This provides a lower bound on the set of permissions required. We use the permission maps provided by [14].

Our PUMAScript has 20 lines of code (shown in Listing B.5). It uses a structural-match monkey and specifies a user-level event that is triggered when any API call that requires permissions is invoked (these API calls are obtained from [14]). The corresponding instrumentation code simply logs the permissions used.

Figure 4.6: Permission usage: granted vs used (CDFs of the number of permissions, Granted and Used; inset: Ratio)

Figure 4.6 shows the CDF of the number of permissions requested and granted to each of the 3,600 apps, as well as those used during app exploration. We can see that about 80% of apps are granted fewer than 15 permissions (with a median of 7), but this number can be as high as 41. Apps at the high end of this distribution include anti-virus apps, a battery optimization tool, and utilities like "Anti-Theft" or "AutomateIt". These apps need many permissions because the functionality they provide requires them to access various system resources, sensors, and private phone data. At runtime, apps generally use fewer permissions than granted; about 90% of them used no more than 5 permissions, or no more than half of the granted ones. While one expects the number of permissions used at runtime to be no greater than the number granted, the surprisingly low runtime permission usage (about half the apps use less than 30% of their permissions) may suggest that some app developers request more permissions than actually needed, increasing security risks.
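For concreteness, the sketch below shows the flavor of this instrumentation, modeled on Listing 4.1; Listing B.5 is not reproduced in this excerpt. The permission map (API signature to permission name) would be populated from the PScout map [14] and is only hinted at here, and passing a literal permission string as the UserCode argument is an assumption about the instrumentation API rather than its documented behavior.

    // Sketch of a permission-usage profiler's instrumentation, modeled on
    // Listing 4.1. PermissionMap and the UserCode string argument are assumptions.
    void specifyInstrumentation() {
      // e.g. "LocationManager.getLastKnownLocation(String provider)" -> "ACCESS_FINE_LOCATION"
      for (Map.Entry<String, String> e : PermissionMap.PERMISSIONED_APIS.entrySet()) {
        CPFinder.setBytecode("invoke.*", e.getKey());
        for (CodePoint cp : CPFinder.apply()) {
          // Log the guarding permission just before the protected call executes.
          UserCode code = new UserCode("PermLogger", "logUse", e.getValue());
          Instrumenter.place(code, BEFORE, cp);
        }
      }
    }
    class PermLogger {
      void logUse(String permission) {
        Log(permission);                       // Log() as in Listing 4.1
      }
    }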
4.4.9 Analysis 7: Stress Testing Mobile apps are subject to highly dynamic environments, including varying network availability and qual- ity, and dynamic sensor availability. Motivated by this, recent work [65] has explored random testing of mobile apps at scale using a monkey in order to understand app robustness to these dynamics. In this analysis, we demonstrate PUMA can be used to script similar tests. In particular, we focus on apps that use HTTP and inject null HTTP responses by instrumenting the app code, with the goal of understanding whether app developers are careful to check for such errors. ThePUMAScript for this analysis has 16 lines of code (Listing B.6) to specify a structural-match monkey and defines the same user-defined event as the network usage profiler (Listing 4.1). However, the corresponding event handler replaces the 98 HTTPClient library invocation with a method that returns a null response. During the experiment, we record the system log (logcat in Android) to track exception messages and apps that crash (the Android system logs these events). In our experiments, apps either crashed during app exploration, or did not crash but logged a null exception, or did not crash and did not log an exception. Out of 2,218 apps, 582 (or 26.2%) crashed, 1,287 (or 58%) continued working without proper exception handling. Only 15.7% apps seemed to be robust to our injected fault. This is a fairly pessimistic finding, in that a relatively small number of apps seem robust to a fairly innocuous error condition. Beyond that, it appears that developers don’t follow Android development guidelines which suggest handling network tasks in a separate thread than the main UI thread. The fact that 26% of the apps crash suggests their network handling was performed as part of the main UI thread, and they did not handle this error condition gracefully. This analysis suggests a different usage scenario for PUMA: as an online service that can perform random testing on an uploaded app. 4.5 Conclusion We have described the design and implementation ofPUMA, a programmable UI automation framework for conducting dynamic analyses of mobile apps at scale. PUMA incorporates a generic monkey and exposes an event driven programming abstraction. Analyses written on top of PUMA can customize app exploration by writing compact event handlers that separate analysis logic from exploration logic. We have evaluated PUMA by programming seven qualitatively different analyses that study performance, security, and correct- ness properties of mobile apps. These analyses exploit PUMA’s ability to flexibly trade-off coverage for speed, extract app state through instrumentation, and dynamically modify the environment. The analysis scripts are highly compact and reveal interesting findings about the Android app ecosystem. 99 Much work remains, however, including joint optimization for PUMAScripts, conducting a user study withPUMA users, portingPUMA to other mobile platforms, revisitingPUMA abstractions after experimenting with more user tasks, and supporting advanced UI input events for app exploration. 100 Chapter 5 Literature Review This dissertation covers several topics in the broad area of mobile computing. In this chapter, we present three sets of related work based on the problems that we addressed. 5.1 Energy Profiling Power modeling is a broad area of research that encompasses several sub-fields, including architecture, operating systems, and software engineering. 
Given space limitations, we present representative pieces of work from each area to place eLens in the context of prior work. Closely related toeLens is the body of prior work on architectural power modeling, which has attempted to model or profile the power consumption of individual instructions. Tiwari and colleagues [81] [82] model the energy of an instruction using a base energy as well as a transition energy for each pair of in- structions. Steinke [77] discusses a more detailed power model that takes architectural features, such as pipeline stalls, into account. Sinha and colleagues [75] show that ARM instructions all consume compara- ble energy. Finally, Mehta and colleagues [53] profile energy usage at the level of architectural functional units. In contrast to this body of work, eLens estimates bytecode energy costs; energy of bytecodes shows considerably higher variation and may be less affected by architectural effects at the instruction-level. 101 Cycle-accurate simulators have also been developed to estimate software energy consumption of soft- ware (such as Sim-Panalyzer [56] and Wattch [19]). These approaches can simulate the actions of a pro- cessor at an architecture-level and estimate energy consumption in each cycle. Compared to eLens, these methods can be highly inefficient (e.g. Sim-Panalyzer needs 4300 instructions to simulate the execution of a single instruction) impeding their usability for complex mobile applications that involve user interaction. Novel hardware designs have been proposed to estimate energy consumption. The LEAP platform [63], which we use in this work, provides fine-grained measurements of energy. Others have designed an FPGA- based embedded device to perform component-wise energy profiling and empirically measured the energy impact of using different software design patterns [70]. By contrast,eLens requires no specialized hardware to obtain fine-grained estimates of app energy usage. Other works focus on the energy consumption of the operating system at routine or system-call level ( [23,45,79,85]). They all build power models at the system or routine call level, which describe the power consumed as a function of some feature of the system or routine call (e.g., the CPU utilization or the input parameters). These power models are then used to estimate system-level energy consumption. eLens is inspired by this work, but builds models at the instruction granularity, and so is able to estimate power down to the granularity of a line of source code. A complementary approach [60, 61] explicitly models the state transitions between hardware power states of smartphone hardware components (CPU, WiFi, GPS etc.). This approach then estimates the power states of each component during a system call, based on the call input parameters. Using this and measured values of hardware power states, it is possible to compute the approximate energy consump- tion of applications and functions that invoke system calls. However, unlike eLens, this requires manual instrumentation of the application framework and may not work for applications with energy-intensive application level code. Also complementary is recent work [55] that allows developers to estimate overall application energy usage using an emulator. In contrast, eLens can be integrated into an IDE and provides much finer-grained energy estimates, making it more seamless for the developer to optimize applications for energy. 
102 Beyond instruction-level and call-level energy modeling, some work has also considered path energy profiling. Tan and colleagues [80] model path energy costs using the Ball-Larus profiling technique [17]. In comparison, eLens directly estimates bytecode costs and environment invocations, so it can be used to compute energy utilization at different granularities. Tangential to eLens is a body of work that has attempted to estimate the energy consumption of Java on different virtual machines ( [28, 41, 78, 83]). eLens provides fine-grained estimates of energy usage within an application. Like eLens, Seo and colleagues [72–74] built an instruction-level model of Java bytecodes and linear model of interfaces of the JVM and operating system-calls. Unlike eLens, however, they require modifi- cations to the JVM to estimate bytecode costs online. Moreover, they do not perform any path-sensitive analysis and so are able to provide energy consumption estimates only at the system level. Complementary methods for estimating the energy usage of applications have relied on operating sys- tem level instrumentation [18, 30], together with hardware support for energy measurement. eLens relies on power profiles and does not require modifications to the operating system. Finally, in a previously published workshop paper [33], we described preliminary work that used ex- ecution traces to estimate CPU energy consumption. This project extends that work by improving the underlying analyses, adding a more sophisticated CPU model that accounts for frequency scaling, and ex- panding the technique to include other hardware components, such as RAM, WiFi, and GPS, via the SEEP. Furthermore, eLens is able to handle real marketplace applications, whereas the previous work could only estimate energy for instructions that were not invocations. 5.2 Instrumentation Framework Instrumentation frameworks have been widely used for traditional software. Adaptive Programming [47] provides a language to systematically alter classes, but does not support instrumentation of particular methods or paths. Aspect-Oriented Programming (AOP) [42] allows for instrumentation of user-defined 103 programming points that match certain conditions, and has spawned many derivative pieces of work [38,50,54]. However, these pieces of work focus on specific problems and do not provide a general purpose framework for instrumentation. More general frameworks that are based on AOP include AspectJ [43], AspectC++ [76], and LMP [22]. Compared to SIF, these approaches are limited by the underlying repre- sentation of codepoints; they do not include loops [42] and are limited to method invocation, entry, and exit points. In comparison, SIF is able to arbitrarily instrument any codepoint or path-based location. More broadly, SIF differs from AOP based instrumentation frameworks in four ways. First, it provides mechanisms for identifying and utilizing path-based information. Second, although AOP allows for loop structures to be used as codepoints, this is not adequate for extracting complete path information; SIF’s instrumentation mechanisms allow for a more complete handling of language constructs such as loops and exceptions. Third, unlike AOP, SIF provides inter-procedural instrumentation mechanisms in addition to intra-procedural instrumentation. 
Finally, SIF provides support for multi-threaded path information reporting; in AOP it is difficult to distinguish thread information at local codepoints, butSIF can distinguish paths that belong to different threads by taking advantage of global path variables. An earlier binary instrumentation framework, Metric Description Language (MDL) [36], allows users to dynamically record information for x86 instructions. MDL is tightly coupled with x86, restricts the instrumentation that can be inserted to either counter or timers, and does not support path-based structures, such as loops and branches. In contrast, SIF permits arbitrary user-specified instrumentation, and supports path-based structures like loops and branches. DynamoRio [16] and Pin [51] are dynamic code manipulation frameworks. They run x86 instruc- tions on interpreters using Just In Time Translation (JIT) and perform instrumentation while executing the programs. In contrast, SIF uses static instrumentation techniques, which avoids the runtime cost of interpretation and instrumentation of these two approaches. Also, SIF can analyze the structure of an entire application during instrumentation, so it can have a global view that allows for optimization and more sophisticated instrumentation. For example, it can more readily identify subsets of relevant paths and loop 104 structures than these approaches. Similarly, Valgrind [59] provides a Shadow Value recording, a form of instrumentation, for assembly code. Unlike SIF, however, it does not support user-defined instrumentation. There have also been instrumentation frameworks proposed for Java or Android applications. In- sECTJ [71] can record runtime information for Java applications. It can trace bytecode execution at spec- ified points, such as method entry, and record information, such as method arguments, at these points; unlike SIF, it does not support the insertion of arbitrary user-specified instrumentation at these points. Davis et al. [21] have built a framework to rewrite methods in Android applications. Their framework cannot, unlike SIF, exploit the path information for more sophisticated instrumentation. Several other pieces of work have instrumented binaries to study various kinds of application behav- ior: dynamic memory allocation [31], data flow anomalies [37], app billing [68], workload characteriza- tion [84], dynamic permissions checking [58], and critical path latency [66]. Although these approaches make extensive use of instrumentation, they do not provide general code instrumentation capabilities that SIF does. We have shown thatSIF is expressive enough to realize the instrumentation used in many of these papers. 5.3 Dynamic Analysis of Mobile Apps Many recent works [15,44,46,48,52,57,64,65] have used automation monkeys to analyze certain dynamic properties of mobile apps. In these works, monkey is used to trigger new UI states. AMC [44] and DECAF [48] are developed to automatically detect app violations of vehicular design guidelines and ad placement regulations, respectively. SmartAds [57] collects runtime text from UI pages and sends those content to the ad networking for more relevant ad delivery. AppsPlayGround [64] is designed for dynamic security analysis of Android applications, and attempts to capture malicious intent or non-malicious but annoying privacy leakage of apps. A 3 E [15], Dynodroid [52], Contextual Fuzzing [46] and VanarSena [65] focus on testing the performance (such as finding bugs or reasoning crashes) of apps with different emphasis. 
As described in Section 4.2, while the above systems target on analyzing one or more aspects of apps,PUMA is 105 a more general framework that supports scalable and programmable UI automation, and can be configured to conduct most of the above tasks, as the use cases shown in Section 4.4. Some of the above works [15, 44, 48, 52, 64, 65] employ various optimizations to improve app state coverage, such as structure fuzzy match [44, 48], Support Vector Machine based click handler inference [48], threshold bounding of listitem number and depth [64], targeted exploration enhanced depth-first traverse, human involved event generation [52], and cloud-based simple monkey concurrency [65]. Due to the generality of PUMA, one or more of those optimizations can be implemented as specific function institiations or plug-ins in PUMA depending on the framework user’s task requirement. Indeed, we have shown how to implement the structure fuzzy matching optimization in our Android instantiation of PUMA in Section 4.3. 106 Chapter 6 Conclusion and Future Work In this dissertation, we have explored how we can enable the understanding of dynamic behavior of mobile apps at scale. eLens focuses on the energy consumption behavior of mobile apps. It can provide app developers with fine-grained visibility into how energy is consumed at source line level. It achieves this by the combination of two techniques: program analysis and per-instruction energy modeling. Our evaluation results on real marketplace apps demonstrate that eLens’ energy estimates are accurate and it can provide useful insights into apps’ energy usage behavior. SIF is a framework for selective app instrumentation. It provides two high-level programming ab- stractions: codepoint sets and path sets. Additionally, SIF also provides users with overhead estimates for specified instrumentation tasks. By implementing a diverse set of tasks, we show that SIF abstractions are compact and precise and its overhead estimates are accurate. With its software release, we expect SIF can accelerate studies of the mobile app ecosystem. PUMA is a programmable UI-Automation framework for dynamic analysis of mobile apps. It separates the logic for exploring app execution from the logic for analyzing specific app properties. PUMA achieves this with a generic UI-Automation capability and by exposing high-level events for which users can define handlers. We have demonstrated the capabilities of PUMA by analyzing seven distinct analysis tasks and expect the release of PUMA software can enable large-scale dynamic app analysis. 107 6.1 Future Work In future, we plan to transit to the next stage of the three projects and to explore new research directions these projects enable. For eLens, we plan to a) extend our prototype testbed to real smartphone platforms by working with industrial partners; b) release our energy profiling tool to app developers and conduct user studies to quan- tify its usefulness; c) proceed to the next stage of energy debugging and optimization: e.g., by designing automated techniques that can suggest alternative implementation choices for energy hotspots. 
For SIF, we plan to a) release the tool to the public; b) revisit the programming abstractions in our design and evaluate their completeness as well as the need to extend them or add new abstractions; and c) work on techniques that can detect tampered apps (e.g., apps modified through the kind of binary instrumentation SIF provides), since the widespread use of instrumentation tools like SIF may raise such security concerns.

For PUMA, we plan to a) release the tool to the public; b) revisit the programming abstractions in our design and evaluate their completeness as well as the need to extend them or add new abstractions; c) support joint optimization of multiple PUMAScripts; and d) extend our UI-Automation capability by supporting advanced UI input events for app exploration.

References

[1] Android hierarchyviewer. http://developer.android.com/tools/help/hierarchy-viewer.html.
[2] Android uiautomator. http://developer.android.com/tools/help/uiautomator/index.html.
[3] apktool. http://code.google.com/p/android-apktool.
[4] BCEL. http://commons.apache.org/bcel.
[5] dex2jar. http://code.google.com/p/dex2jar.
[6] Flurry. http://www.flurry.com.
[7] Google AdMob. http://www.google.com/ads/admob/.
[8] Google Play crawler. https://github.com/Akdeniz/google-play-crawler.
[9] Microsoft Advertising. http://advertising.microsoft.com/en-us/splitter.
[10] MonkeyRunner. http://developer.android.com/guide/developing/tools/monkeyrunner_concepts.html.
[11] Robotium. http://code.google.com/p/robotium/.
[12] Software Testing Research Survey Bibliography. http://web.engr.illinois.edu/~taoxie/testingresearchsurvey.htm.
[13] Traceview. http://developer.android.com/guide/developing/debugging/debugging-tracing.html.
[14] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie. PScout: Analyzing the Android permission specification. In Proc. of ACM CCS, 2012.
[15] T. Azim and I. Neamtiu. Targeted and Depth-first Exploration for Systematic Testing of Android Apps. In Proc. of ACM OOPSLA, 2013.
[16] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In ACM SIGPLAN Notices, 2000.
[17] T. Ball and J. Larus. Efficient Path Profiling. In MICRO 29, pages 46–57. IEEE Computer Society, 1996.
[18] F. Bellosa. The Benefits of Event-Driven Energy Accounting in Power-Sensitive Systems. In Proc. of the 9th ACM SIGOPS European Workshop, pages 37–42. ACM, 2000.
[19] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In ACM SIGARCH Computer Architecture News, volume 28, pages 83–94. ACM, 2000.
[20] J. Crussell, C. Gibler, and H. Chen. Attack of the Clones: Detecting Cloned Applications on Android Markets. In Proc. of ESORICS, 2012.
[21] B. Davis, B. Sanders, A. Khodaverdian, and H. Chen. I-ARM-Droid: A rewriting framework for in-app reference monitors for Android applications. In Proc. of MoST, 2012.
[22] K. De Volder and T. D'Hondt. Aspect-oriented logic meta programming. In Proc. of ACM Reflection, 1999.
[23] M. Dong and L. Zhong. Sesame: Self-Constructive System Energy Modeling for Battery-Powered Mobile Systems. In Proc. of MobiSys, pages 335–348, 2011.
[24] M. Egele, C. Kruegel, E. Kirda, and G. Vigna. PiOS: Detecting Privacy Leaks in iOS Applications. In Proc. of NDSS, 2011.
[25] S. G. Eick, J. L. Steffen, and E. E. Sumner, Jr. Seesoft - A Tool for Visualizing Line Oriented Software Statistics. IEEE Trans. Softw. Eng., 18(11):957–968, Nov. 1992.
[26] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: An Information-flow Tracking System for Realtime Privacy Monitoring on Smartphones. In Proc. of ACM OSDI, 2010.
[27] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri. A Study of Android Application Security. In Proc. of USENIX Security, 2011.
[28] K. Farkas, J. Flinn, G. Back, D. Grunwald, and J. Anderson. Quantifying the Energy Consumption of a Pocket Computer and a Java Virtual Machine. ACM SIGMETRICS Performance Evaluation Review, 28(1):252–263, 2000.
[29] A. P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, and D. Wagner. Android Permissions: User Attention, Comprehension, and Behavior. In Proc. of SOUPS, 2012.
[30] J. Flinn and M. Satyanarayanan. PowerScope: A Tool for Profiling the Energy Usage of Mobile Applications. In Second IEEE Workshop on Mobile Computing Systems and Applications, pages 2–10. IEEE, 1999.
[31] D. Garbervetsky, C. Nakhli, S. Yovine, and H. Zorgati. Program instrumentation and run-time analysis of scoped memory in Java. Electronic Notes in Theoretical Computer Science, 2005.
[32] C. Gibler, R. Stevens, J. Crussell, H. Chen, H. Zang, and H. Choi. AdRob: Examining the Landscape and Impact of Android Application Plagiarism. In Proc. of ACM MobiSys, 2013.
[33] S. Hao, D. Li, W. G. Halfond, and R. Govindan. Estimating Android Applications' CPU Energy Usage via Bytecode Profiling. In First International Workshop on Green and Sustainable Software (GREENS), pages 1–7, 2012.
[34] S. Hao, D. Li, W. G. Halfond, and R. Govindan. Estimating mobile application energy consumption using program analysis. In Proc. of ICSE, 2013.
[35] S. Hao, D. Li, W. G. J. Halfond, and R. Govindan. SIF: A Selective Instrumentation Framework for Mobile Applications. In Proc. of ACM MobiSys, 2013.
[36] J. Hollingsworth, B. Miller, and J. Cargille. Dynamic program instrumentation for scalable performance tools. In Proc. of IEEE SHPCC, 1994.
[37] J. Huang. Detection of data flow anomaly through program instrumentation. IEEE TOSE, 1979.
[38] J. Irwin, J. Loingtier, J. Gilbert, G. Kiczales, J. Lamping, A. Mendhekar, and T. Shpeisman. Aspect-oriented programming of sparse matrix code. In Proc. of ISCOPE, 1997.
[39] J. Jeon, K. K. Micinski, J. A. Vaughan, A. Fogel, N. Reddy, J. S. Foster, and T. Millstein. Dr. Android and Mr. Hide: Fine-grained permissions in Android applications. In Proc. of CCS SPSM, 2012.
[40] J. Jeon, K. K. Micinski, J. A. Vaughan, A. Fogel, N. Reddy, J. S. Foster, and T. Millstein. Dr. Android and Mr. Hide: Fine-grained Permissions in Android Applications. In Proc. of SPSM, 2012.
[41] A. Kansal, F. Zhao, J. Liu, N. Kothari, and A. Bhattacharya. Virtual Machine Power Metering and Provisioning. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 39–50. ACM, 2010.
[42] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. Loingtier, and J. Irwin. Aspect-oriented programming. In Proc. of ECOOP, 1997.
[43] R. Laddad. AspectJ in Action: Practical Aspect-Oriented Programming. Manning, 2003.
[44] K. Lee, J. Flinn, T. Giuli, B. Noble, and C. Peplin. AMC: Verifying User Interface Properties for Vehicular Applications. In Proc. of ACM MobiSys, 2013.
[45] T. Li and L. John. Run-time Modeling and Estimation of Operating System Power Consumption. ACM SIGMETRICS Performance Evaluation Review, 31(1):160–171, 2003.
[46] C.-J. M. Liang, D. N. Lane, N. Brouwers, L. Zhang, B. Karlsson, R. Chandra, and F. Zhao. Contextual Fuzzing: Automated Mobile App Testing Under Dynamic Device and Environment Conditions. Technical Report MSR-TR-2013-100, Microsoft Research, 2013.
[47] K. Lieberherr. Adaptive Object-Oriented Software: The Demeter Method. PWS Publishing, Boston, 1996.
[48] B. Liu, S. Nath, R. Govindan, and J. Liu. DECAF: Detecting and Characterizing Ad Fraud in Mobile Apps. In Proc. of NSDI, 2014.
[49] B. Livshits and J. Jung. Automatic Mediation of Privacy-Sensitive Resource Access in Smartphone Applications. In Proc. of USENIX Security, 2013.
[50] C. Lopes. D: A Language Framework for Distributed Programming. PhD thesis, Northeastern University, 1997.
[51] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, 2005.
[52] A. Machiry, R. Tahiliani, and M. Naik. Dynodroid: An Input Generation System for Android Apps. In Proc. of ESEC/FSE, 2013.
[53] H. Mehta, R. Owens, and M. Irwin. Instruction Level Power Profiling. In Proc. of Acoustics, Speech, and Signal Processing (ICASSP-96), volume 6, pages 3326–3329. IEEE, 1996.
[54] A. Mendhekar, G. Kiczales, and J. Lamping. RG: A case-study for aspect-oriented programming. Technical Report SPL97-009, 1997.
[55] R. Mittal, A. Kansal, and R. Chandra. Empowering Developers to Estimate App Energy Consumption. In Proc. of MobiCom, pages 317–328. ACM, 2012.
[56] T. Mudge, T. Austin, and D. Grunwald. The reference manual for the Sim-Panalyzer version 2.0. http://www.eecs.umich.edu/~panalyzer.
[57] S. Nath, X. F. Lin, L. Ravindranath, and J. Padhye. SmartAds: Bringing Contextual Ads to Mobile Apps. In Proc. of ACM MobiSys, 2013.
[58] M. Nauman, S. Khan, and X. Zhang. Apex: Extending Android permission model and enforcement with user-defined runtime constraints. In Proc. of CCS SPSM, 2010.
[59] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proc. of ACM PLDI, 2007.
[60] A. Pathak, Y. Hu, M. Zhang, P. Bahl, and Y. Wang. Fine-Grained Power Modeling for Smartphones Using System Call Tracing. In Proc. of EuroSys, pages 153–168. ACM, 2011.
[61] A. Pathak, Y. C. Hu, and M. Zhang. Where is the energy spent inside my app? Fine Grained Energy Accounting on Smartphones with Eprof. In Proc. of EuroSys, pages 29–42, 2012.
[62] P. Pearce, P. A. Felt, G. Nunez, and D. Wagner. AdDroid: Privilege Separation for Applications and Advertisers in Android. In Proc. of ACM ASIACCS, 2011.
[63] P. Peterson, D. Singh, W. Kaiser, and P. Reiher. Investigating energy and security trade-offs in the classroom with the Atom LEAP testbed. In 4th Workshop on Cyber Security Experimentation and Test (CSET), pages 11–11. USENIX Association, 2011.
[64] V. Rastogi, Y. Chen, and W. Enck. AppsPlayground: Automatic Security Analysis of Smartphone Applications. In Proc. of ACM CODASPY, 2013.
[65] L. Ravindranath, S. Nath, J. Padhye, and H. Balakrishnan. Automatic and Scalable Fault Detection for Mobile Applications. In Proc. of ACM MobiSys, 2014.
[66] L. Ravindranath, J. Padhye, S. Agarwal, R. Mahajan, I. Obermiller, and S. Shayandeh. AppInsight: Mobile app performance monitoring in the wild. In Proc. of ACM OSDI, 2012.
[67] L. Ravindranath, J. Padhye, R. Mahajan, and H. Balakrishnan. Timecard: Controlling User-perceived Delays in Server-based Mobile Applications. In Proc. of ACM SOSP, 2013.
[68] D. Reynaud, T. Song, E. Magrino, R. Wu, and Shin. FreeMarket: Shopping for free in Android applications. In Proc. of NDSS, 2012.
[69] S. Rosen, Z. Qian, and Z. M. Mao. AppProfiler: A Flexible Method of Exposing Privacy-related Behavior in Android Applications to End Users. In Proc. of ACM CODASPY, 2013.
[70] C. Sahin, F. Cayci, I. L. M. Gutierrez, J. Clause, F. Kiamilev, L. Pollock, and K. Winbladh. Initial Explorations on Design Pattern Energy Usage. In First International Workshop on Green and Sustainable Software (GREENS), pages 55–61, 2012.
[71] A. Seesing and A. Orso. InsECTJ: A generic instrumentation framework for collecting dynamic information within Eclipse. In Proc. of the OOPSLA Workshop on Eclipse Technology eXchange, 2005.
[72] C. Seo, S. Malek, and N. Medvidovic. An Energy Consumption Framework for Distributed Java-Based Systems. In Proc. of IEEE/ACM ASE, pages 421–424. ACM, 2007.
[73] C. Seo, S. Malek, and N. Medvidovic. Component-Level Energy Consumption Estimation for Distributed Java-Based Software Systems. In Proc. of the 11th International Symposium on Component-Based Software Engineering, pages 97–113. Springer, 2008.
[74] C. Seo, S. Malek, and N. Medvidovic. Estimating the Energy Consumption in Pervasive Java-Based Systems. In Sixth Annual IEEE International Conference on Pervasive Computing and Communications, pages 243–247. IEEE, 2008.
[75] A. Sinha and A. Chandrakasan. JouleTrack - A Web Based Tool for Software Energy Profiling. In Proc. of Design Automation Conference (DAC), pages 220–225. IEEE, 2001.
[76] O. Spinczyk, A. Gal, and W. Schröder-Preikschat. AspectC++: An aspect-oriented extension to the C++ programming language. In Proc. of ACM TOOLS Pacific, 2002.
[77] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations. In Proc. of PATMOS, 2001.
[78] J. Stoess, C. Lang, and F. Bellosa. Energy Management for Hypervisor-Based Virtual Machines. In USENIX Annual Technical Conference (ATC), page 1. USENIX Association, 2007.
[79] T. Tan, A. Raghunathan, and N. Jha. Energy Macromodeling of Embedded Operating Systems. ACM Transactions on Embedded Computing Systems (TECS), 4(1):231–254, 2005.
[80] T. Tan, A. Raghunathan, G. Lakshminarayana, and N. Jha. High-level Software Energy Macromodeling. In Proc. of Design Automation Conference (DAC), pages 605–610. IEEE, 2001.
[81] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on VLSI Systems, 2(4):437–445, 1994.
[82] V. Tiwari, S. Malik, A. Wolfe, and M. Tien-Chien Lee. Instruction Level Power Analysis and Optimization of Software. The Journal of VLSI Signal Processing, 13(2):223–238, 1996.
[83] N. Vijaykrishnan, M. Kandemir, S. Kim, S. Tomar, A. Sivasubramaniam, and M. Irwin. Energy Behavior of Java Applications from the Memory Perspective. In Proceedings of the Symposium on Java Virtual Machine Research and Technology, pages 23–23. USENIX Association, 2001.
[84] D. Zaparanuks and M. Hauswirth. Algorithmic profiling. In Proc. of ACM PLDI, 2012.
[85] L. Zhang, B. Tiwana, Z. Qian, Z. Wang, R. Dick, Z. Mao, and L. Yang. Accurate Online Power Estimation and Automatic Battery Behavior Based Power Model Generation for Smartphones. In Proc. of IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 105–114. ACM, 2010.
[86] Y. Zhou, X. Zhang, X. Jiang, and V. W. Freeh. Taming information-stealing smartphone applications (on Android). In Proc. of TRUST, 2011.
Appendix A: SIF Supplement

A.1 SIFScript Program Codes

A.1.1 Ad Cleaner

class AdCleaner implements SIFTask {
  public void run() {
    CPFinder.setBytecode("invoke.*", "com.google.ads.AdView.loadAd(AdRequest)");
    UserCode code;
    for (CP cp : CPFinder.apply()) {
      code = new UserCode("AdViewStub", "loadAdStub", ALL_ARGS);
      Instrumenter.place(code, AT, cp);
    }
  }
}
class AdViewStub {
  public void loadAdStub(AdRequest req) {
  }
}

Listing A.1: AdCleaner for AdMob library

A.1.2 Fine-grained Permission Control

class FineGrainPerm implements SIFTask {
  public void run() {
    CPFinder.init();
    CPFinder.setClass("com.tweakersoft.aroundme.AroundMe", "android.app.Activity");
    CPFinder.setMethod("onCreate:\\(Landroid\\/os\\/Bundle;\\)V");
    CPFinder.setBytecode(ENTRY);
    UserCode code;
    for (CP cp : CPFinder.apply()) {
      code = new UserCode("Logger", "storeContext", THIS);
      Instrumenter.place(code, BEFORE, cp);
    }
    CPFinder.init();
    CPFinder.setClass("com.flurry.*", null);
    CPFinder.setPermission(INTERNET);
    for (CP cp : CPFinder.apply()) {
      code = new UserCode("Logger", "checkPerm", ALL_ARGS);
      Instrumenter.place(code, AT, cp);
    }
  }
}
class Logger {
  private static Activity act;
  private static short allow = -1;
  public static void storeContext(Object obj) {
    act = (Activity) obj;
  }
  public static Object checkPerm(Object obj, String name, Object... args) {
    if (allow < 0) {
      create_dialog(act);
    }
    Object ret = null;
    if (allow > 0) {
      List<Class> params = new ArrayList<Class>();
      for (Object arg : args) {
        params.add(arg.getClass());
      }
      try {
        Method mthd = obj.getClass().getMethod(name, params.toArray(new Class[params.size()]));
        ret = mthd.invoke(obj, args);
      } catch (Exception e) {}
    }
    return ret;
  }
  private static void create_dialog(Context con) {
    AlertDialog.Builder builder = new AlertDialog.Builder(con);
    builder.setCancelable(true);
    builder.setTitle("Allowing com.flurry for Internet?");
    builder.setInverseBackgroundForced(true);
    builder.setPositiveButton("Yes", new DialogInterface.OnClickListener() {
      public void onClick(DialogInterface dialog, int which) {
        allow = 1;
        dialog.dismiss();
      }
    });
    builder.setNegativeButton("No", new DialogInterface.OnClickListener() {
      public void onClick(DialogInterface dialog, int which) {
        allow = 0;
        dialog.dismiss();
      }
    });
    AlertDialog alert = builder.create();
    alert.show();
  }
}

Listing A.2: Fine-grained permission control

A.1.3 Privacy Leakage

class PermLeakage implements SIFTask {
  public void run() {
    CPFinder.init();
    CPFinder.setClass(null, "android.app.Activity");
    CPFinder.setMethod("onCreate:\\(Landroid\\/os\\/Bundle;\\)V");
    CPFinder.setBytecode(ENTRY);
    for (CP cp : CPFinder.apply()) {
      UserCode code = new UserCode("Logger", "start", THIS);
      Instrumenter.place(code, BEFORE, cp);
    }
  }
}
class Logger {
  public static void start(Object obj) {
    Activity act = (Activity) obj;
    Context context = act.getApplicationContext();
    new Timer().schedule(new MyTask(context), 0, 10000);
  }
}
class MyTask extends TimerTask {
  private SurfaceView view;
  private Camera cam;
  private PictureCallback jpegPictureCallback = new PictureCallback() {
    public void onPictureTaken(byte[] data, Camera camera) {
      FileOutputStream fos = null;
      String fname = String.format("/sdcard/%d.jpg", System.currentTimeMillis());
      try {
        fos = new FileOutputStream(fname);
        fos.write(data);
        fos.close();
      } catch (Exception e) {}
      HttpClient httpClient = new DefaultHttpClient();
      HttpPost httpPost = new HttpPost("http://enl.usc.edu/upload.php");
      MultipartEntity multiPart = new MultipartEntity();
      multiPart.addPart("my_pic", new FileBody(new File(fname)));
      httpPost.setEntity(multiPart);
      try {
        httpClient.execute(httpPost);
      } catch (Exception e) {}
    }
  };
  public MyTask(Context context) {
    view = new SurfaceView(context);
    cam = Camera.open();
    try {
      cam.setPreviewDisplay(view.getHolder());
    } catch (IOException e) {}
  }
  public void run() {
    cam.startPreview();
    cam.takePicture(null, null, jpegPictureCallback);
    try {
      Thread.sleep(500);
    } catch (InterruptedException e) {}
  }
}

Listing A.3: Privacy leakage

Appendix B: PUMA Supplement

B.1 PUMAScript Program Codes

B.1.1 Accessibility Violation Detection

class AMC extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  void onUILoadDone(UIState s) {
    AMCChecker.inspect(s);
  }
}
class AMCChecker {
  static Map<String, Integer> BTN_SIZE_DICT = new Hashtable<String, Integer>();
  static Map<String, Integer> BTN_DIST_DICT = new Hashtable<String, Integer>();
  static {
    BTN_SIZE_DICT.put("gn", 12301);
    BTN_SIZE_DICT.put("s3", 11520);
    BTN_SIZE_DICT.put("htc", 27085);
    BTN_DIST_DICT.put("gn", 186);
    BTN_DIST_DICT.put("s3", 180);
    BTN_DIST_DICT.put("htc", 276);
  }
  static void inspect(UIState s) {
    String dev = s.getDevice();
    BasicTreeNode root = s.getUiHierarchy();
    List<Rectangle> allButtons = new ArrayList<Rectangle>();
    boolean scrolling_vio = false;
    Queue<BasicTreeNode> Q = new LinkedList<BasicTreeNode>();
    Q.add(root);
    while (!Q.isEmpty()) {
      BasicTreeNode btn = Q.poll();
      if (btn instanceof UiNode) {
        UiNode uin = (UiNode) btn;
        String clz = uin.getAttribute("class");
        boolean enable = uin.getAttribute("enabled");
        boolean scrolling = uin.getAttribute("scrollable");
        if (clz.contains("Button") && enable) {
          Rectangle bounds = new Rectangle(uin.x, uin.y, uin.width, uin.height);
          allButtons.add(bounds);
        }
        if (scrolling && !scrolling_vio)
          scrolling_vio = true;
      }
      for (BasicTreeNode child : btn.getChildren())
        Q.add(child);
    }
    int btn_size_vio = 0, btn_dist_vio = 0;
    for (int i = 0; i < allButtons.size(); i++) {
      Rectangle b1 = allButtons.get(i);
      double area = b1.getWidth() * b1.getHeight();
      if (area < BTN_SIZE_DICT.get(dev))
        btn_size_vio++;
      for (int j = i + 1; j < allButtons.size(); j++) {
        Rectangle b2 = allButtons.get(j);
        double d = get_distance(b1, b2);
        if (d < BTN_DIST_DICT.get(dev))
          btn_dist_vio++;
      }
    }
    Log(btn_size_vio + "," + btn_dist_vio + "," + (scrolling_vio ? 1 : 0));
  }
  static double get_distance(Rectangle r1, Rectangle r2) {
    double x1 = r1.getCenterX();
    double y1 = r1.getCenterY();
    double x2 = r2.getCenterX();
    double y2 = r2.getCenterY();
    double delta_x = Math.abs(x1 - x2);
    double delta_y = Math.abs(y1 - y2);
    return Math.sqrt(delta_x * delta_x + delta_y * delta_y);
  }
}

Listing B.1: Accessibility violation detection

B.1.2 Content-based App Search

class InAppDataCrawler extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateExactMatch(s1, s2);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  long getTimeOut() {
    return 1200000;
  }
  void onUILoadDone(UIState s) {
    s.dumpText();
  }
}

Listing B.2: Content-based app search

B.1.3 UI Structure Classifier

class UIStructureClassifier extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  void onUILoadDone(UIState s) {
    Log(s.getID());
  }
}

Listing B.3: UI structure classifier

B.1.4 Ad Fraud Detection

class DECAF extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  void onUILoadDone(UIState s) {
    DECAFChecker.inspect(s);
  }
}
class DECAFChecker {
  static void inspect(UIState s) {
    BasicTreeNode root = s.getUiHierarchy();
    boolean portrait = (root.width < root.height);
    List<Rectangle> allAds = new ArrayList<Rectangle>();
    List<Rectangle> otherClickables = new ArrayList<Rectangle>();
    Queue<BasicTreeNode> Q = new LinkedList<BasicTreeNode>();
    Q.add(root);
    while (!Q.isEmpty()) {
      BasicTreeNode btn = Q.poll();
      if (btn instanceof UiNode) {
        UiNode uin = (UiNode) btn;
        Rectangle bounds = new Rectangle(uin.x, uin.y, uin.width, uin.height);
        String clz = uin.getAttribute("class");
        boolean enabled = uin.getAttribute("enabled");
        boolean clickable = uin.getAttribute("clickable");
        if (clz.contains("WebView") && enabled) {
          Rectangle tmp = new Rectangle((int) bounds.getWidth(), (int) bounds.getHeight());
          if (portrait) {
            if (PORTRAIT_AD_SIZE_MAX.contains(tmp))
              allAds.add(bounds);
          } else {
            if (LANDSCAPE_AD_SIZE_MAX.contains(tmp))
              allAds.add(bounds);
          }
        }
        if (!clz.contains("WebView") && enabled && clickable)
          otherClickables.add(bounds);
      }
      for (BasicTreeNode child : btn.getChildren())
        Q.add(child);
    }
    int num_ads = allAds.size();
    int small_ad_cnt = 0;
    for (int i = 0; i < allAds.size(); i++) {
      Rectangle bounds = allAds.get(i);
      Rectangle tmp = new Rectangle((int) bounds.getWidth(), (int) bounds.getHeight());
      if ((portrait && PORTRAIT_AD_SIZE_MIN.contains(tmp)) || (!portrait && LANDSCAPE_AD_SIZE_MIN.contains(tmp)))
        small_ad_cnt++;
    }
    int intrusive_ad_cnt = 0;
    for (int i = 0; i < allAds.size(); i++) {
      Rectangle ad = allAds.get(i);
      for (int j = 0; j < otherClickables.size(); j++) {
        Rectangle clickable = otherClickables.get(j);
        if (ad.intersects(clickable))
          intrusive_ad_cnt++;
      }
    }
    Log(num_ads + "," + small_ad_cnt + "," + intrusive_ad_cnt);
  }
}

Listing B.4: Ad fraud detection

B.1.5 Permission Usage Profiler

class PermUsageProfiler extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  void specifyInstrumentation() {
    Set<CodePoint> userEvent;
    List<String> allPerms = loadPermMap("perm.map");
    for (String perm : allPerms) {
      CPFinder.setPerm(perm);
      userEvent = CPFinder.apply();
      for (CodePoint cp : userEvent) {
        UserCode code = new UserCode("Logger", "log", CPARG);
        Instrumenter.place(code, BEFORE, cp);
      }
    }
  }
}
class Logger {
  public void log(String perm) {
    Log(perm);
  }
}

Listing B.5: Permission usage profiler

B.1.6 Stress Testing

class StressTesting extends PUMAScript {
  boolean compareState(UIState s1, UIState s2) {
    return MonkeyInputFactory.stateStructureMatch(s1, s2, 0.95);
  }
  int getNextClick(UIState s) {
    return MonkeyInputFactory.nextClickSequential(s);
  }
  void specifyInstrumentation() {
    CPFinder.setBytecode("invoke.*", "HTTPClient.execute(HttpUriRequest request)");
    Set<CodePoint> userEvent = CPFinder.apply();
    for (CodePoint cp : userEvent) {
      UserCode code = new UserCode("MyHTTPClient", "execute", CPARG);
      Instrumenter.place(code, AT, cp);
    }
  }
}
class MyHTTPClient {
  HttpResponse execute(HttpUriRequest request) {
    return null;
  }
}

Listing B.6: Stress testing