Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Detecting SQL antipatterns in mobile applications
(USC Thesis Other)
Detecting SQL antipatterns in mobile applications
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Detecting SQL Antipatterns in Mobile Applications
by
Yingjun Lyu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulllment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2020
Copyright 2020 Yingjun Lyu
Acknowledgements
Pursuing Ph.D. at USC has been a life-changing experience for me. It would not have been
possible to do without the support and guidance that I received from many people.
First of all, I would like to thank my advisor, Dr. William G.J. Halfond, for his continuous
guidance and support throughout my Ph.D. journey. I greatly appreciate his many eorts in
helping me cultivate myself as a researcher, thinker, and doer. There is an old saying in Chinese,
\if you give a hungry man a sh, you feed him for a day; but if you teach him how to sh, you feed
him for a lifetime". I always recognize how hard it is to teach a person how to \sh" comparing
to giving him \shes". I have been and will be always grateful for my advisor's kindness and
patience.
Besides my advisor, I would like to thank the rest of my dissertation committee: Dr. Nenad
Medvidovic, Dr. Chao Wang, Dr. Jyotirmoy Deshmukh, and Dr. Sandeep Gupta. Their insightful
comments and valuable feedback have helped me greatly improve my dissertation work.
It has been wonderful being companied by my labmates during my Ph.D. journey. We have
had numerous insightful discussions on challenging research issues and interesting cultural topics.
I would like to thank Ding Li for bringing me onboard in the early stage of my Ph.D. study, Sonal
Mahajan for introducing me much knowledge about Indian culture and religion, Jiaping Gui and
Mian Wan for helping me with my research papers, Abdulmajeed Alameer for introducing me
the fancy places in Los Angeles, Negarsadat Abolhassani for teaching me Farsi and giving me
constructive feedback on my research. I would also like to thank Paul Chiou, Ali Alotaibi, and
Sasha Volokh for their support and help.
Last but not the least, my deep appreciation goes out to my family and friends. They always
believe in me, encourage me, and provide whatever support I need. And nally to Xiaofang, who
has been by my side throughout this Ph.D. journey, living every single minute of it, sharing my
cheerful moments, and helping me get through every dicult time.
ii
Table of Contents
Acknowledgements ii
List of Tables v
List of Figures vi
Abstract vii
Chapter 1: Introduction 1
1.1 Major Challenges and Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Detection Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Benchmark Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Overview of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Background 11
2.1 Mobile App Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 SQL Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Local Database Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 SQL Antipatterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 3: A Literature Review of SQL Antipatterns 15
3.1 Literature Review Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Literature Search Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Antipattern: Unbatched-Writes . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Antipattern: Not-Merging-Projection-Predicates . . . . . . . . . . . . . . . 20
3.2.3 Antipattern: Not-Merging-Selection-Predicates . . . . . . . . . . . . . . . . 20
3.2.4 Antipattern: Loop-to-Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.5 Antipattern: Vulnerable-Query . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.6 Antipattern: Not-Using-Parameterized-Query . . . . . . . . . . . . . . . . . 22
3.2.7 Antipattern: Not-Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.8 Antipattern: Unnecessary-Column-Retrieval . . . . . . . . . . . . . . . . . . 24
3.2.9 Antipattern: Unnecessary-Row-Retrieval . . . . . . . . . . . . . . . . . . . . 24
3.2.10 Antipattern: Unbounded-Query . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.11 Antipattern: Readable-Password . . . . . . . . . . . . . . . . . . . . . . . . 26
iii
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 4: A SQL Antipattern Detection Framework 28
4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Overview of SAND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Language Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.3 The SANDLess Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.4 Inter-procedural Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Implementing SQL Antipattern Detectors Using SANDLess . . . . . . . . . 44
4.2.2 RQ1: Code Complexity of SANDLess Detectors . . . . . . . . . . . . . . . . 48
4.2.3 RQ2: Detection Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.4 RQ3: Detection Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Discussion of Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 5: A Benchmark Study of SQL Antipatterns 55
5.1 Benchmark Study Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Benchmark Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.3 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Antipattern: Unbatched-Writes . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.2 Antipattern: Not-Merging-Projection-Predicates . . . . . . . . . . . . . . . 61
5.2.3 Antipattern: Not-Merging-Selection-Predicates . . . . . . . . . . . . . . . . 62
5.2.4 Antipattern: Loop-to-Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.5 Antipattern: Vulnerable-Query . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.6 Antipattern: Not-Using-Parameterized-Query . . . . . . . . . . . . . . . . . 64
5.2.7 Antipattern: Not-Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.8 Antipattern: Unnecessary-Column-Retrieval . . . . . . . . . . . . . . . . . . 64
5.2.9 Antipattern: Unnecessary-Row-Retrieval . . . . . . . . . . . . . . . . . . . . 65
5.2.10 Antipattern: Unbounded-Query . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.11 Antipattern: Readable-Password . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 6: Related Work 68
6.1 Detecting SQL Antipatterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.2 Program Analysis Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Analyzing Database Usage And Cost . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4 Performance Optimization for Mobile Apps . . . . . . . . . . . . . . . . . . . . . . 73
Chapter 7: Conclusion 75
7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
References 78
iv
List of Tables
3.1 Selected conferences and keywords . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Antipatterns and their required abstracted information . . . . . . . . . . . . . . . . 31
4.2 Categorized application-database relationships . . . . . . . . . . . . . . . . . . . . 32
4.3 Scripts statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 SQL antipatterns and their performance impacts. A positive number means the
resource consumption increases after xing the antipattern. A negative number
means the resource consumption decreases. . . . . . . . . . . . . . . . . . . . . . . 61
v
List of Figures
1.1 Dissertation overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Android activity lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Literature review process
ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
vi
Abstract
Local databases underpin important features in many mobile applications. However, bad pro-
gramming practices of using database operations, also called SQL antipatterns, can introduce
high resource consumption, aect the responsiveness, and undermine the security of a mobile
application.
In my dissertation, I designed and evaluated a framework for detecting SQL antipatterns
eectively and eciently in mobile apps. The framework, SAND, abstracts away the interactions
between the application and the database. It provides a language, SANDLess, which is used to
query abstractions of application-database relationships and specify SQL antipattern detection
tasks.
I addressed two key challenges when designing this framework. The rst challenge was to
determine what kinds of application-database relationships should be abstracted. The second
challenge was to determine how to eectively and eciently extract the abstractions from the
database access code in mobile apps. To tackle the rst challenge, I collected a full spectrum of
SQL antipatterns and their detection approaches via a systematic literature review. I analyzed
each of these approaches and identied the application-database relationships that were analyzed
by these approaches. To address the second challenge, I developed a set of static analysis tech-
niques that were specialized at analyzing the database access code in mobile apps. These static
analysis techniques focused on handling the complications of string-based SQL statements, loops,
and other kinds of application-database relationships that were required by the abstractions.
In the evaluation, I showed that the detection rules of all eleven SQL antipatterns reported
in the literature can be compactly expressed in SANDLess with low code complexity. These
SANDLess detectors, which were built on top of abstractions of application-database relation-
ships, identied thousands of instances of SQL antipatterns in marketplace mobile apps with
high precision and fast analysis speed. Overall, these results are positive and indicate that my
framework can detect all kinds of SQL antipatterns eectively and eciently in mobile apps.
vii
Chapter 1
Introduction
The past decade has seen incredible growth in the amount of mobile apps available to end users.
Developers compete for end users by designing innovative apps that can provide a rich user
experience by combining data from web services, location services, and sensor data. In addition
to innovating new services, developers also strive to create apps that are reliable, responsive,
secure, and consume a minimal amount of limited resources, such as network data and energy.
Balancing these concerns is important for developers as complaints about excessive resource usage
or poor performance can lead to poor ratings for their apps [51, 34, 35].
Easy access to and management of data underpins important features in many mobile apps.
To improve the responsiveness and reliability of their apps, developers frequently take advantage
of local databases, such as SQLite [31], to store and manage their data. Local databases are
stored directly on the mobile devices. They can be synchronized with remote databases and can
help maintain reliable and responsive service even when the mobile device does not have a reliable
connection. An application can also share its local data with other applications. These benets
have led to their widespread use and popularity; conservative estimates are that they are used in
over 50% of all mobile apps [71].
Although local databases oer many benets to developers, their use comes with potential
problems. Unlike a server, mobile devices are inherently limited by their battery power and
computation power. Many database operations, such as transactions, require le locking and
rollback capability. This means that such operations require CPU time, memory, and I/O access.
Unsurprisingly, these database operations consume many resources; recent studies have found
that local database services have become one of the top three resource consuming services on
mobile devices [59, 63]. Although all kinds of database interactions can be resource intensive,
1
certain kinds of usage patterns can be especially problematic. A recent study determined that the
bad programming practices of using database operations, also called SQL antipatterns, can have a
signicant impact on the resource consumption of mobile apps, including runtime and energy [70].
Some SQL antipatterns can even increase the resource consumption of database operations over
16 times and noticeably aect user experience [70]. In addition to these performance issues, SQL
antipatterns can also undermine the security of mobile apps, such as by allowing SQL injection
and privacy leakage vulnerabilities [110, 42].
Although there is a clear need to detect these SQL antipatterns in mobile apps, building a
detector is extremely challenging. It requires a profound understanding of the antipattern, a
mastery of various sophisticated static analysis techniques, and many implementation eorts. For
example, a typical SQL antipattern detector requires analyzing the SQL statements. However,
modeling what kinds of SQL statements can be issued in an application can be very challenging.
This is because dynamically constructed string-based SQL statements are prevalent among modern
mobile apps [71]. These SQL statements can be built using a sequence of string operations (e.g.,
concatenation) and may vary over dierent paths of execution. While it requires many eorts to
develop an accurate analysis to model SQL statements, a SQL antipattern detector may need more
sophisticated analyses, depending on what kinds of application-database relationships that the
detector needs to reason about. For example, a tool that detects an antipattern about inecient
database writes needs to analyze loops and transactions and could contain more than 5,000 lines
of code [28]. While the practitioners and database experts have expertise in SQL antipatterns,
the lack of a program analysis background can stop them from developing eective detectors.
The challenges of building detectors of SQL antipatterns can be addressed by a framework
approach. Instead of developing static analysis techniques from scratch to analyze the database
access code, the detector builders can focus on specifying the detection logic while relying on a
framework to carry out the necessary static analyses for them. The framework can account for
the complex semantics of various local database operations and make sure that the detectors built
on top of the framework can analyze the database access code eectively and eciently.
The goal of my dissertation is to provide a general framework for detecting SQL antipatterns
in mobile apps. At a high level, the framework abstracts away the interactions between the
application and the database using static analysis techniques. It provides abstractions of various
kinds of application-database relationships so that detectors of SQL antipatterns can specify their
detection logic on top of these abstractions. When designing this framework, I aim at supporting
2
the detection of all kinds of SQL antipatterns. In addition, I aim at designing eective and ecient
static analyses that can handle the complex semantics of local database operations in mobile apps.
1.1 Major Challenges and Insights
In order to build a framework for detecting SQL antipatterns, I need to address two major
challenges.
The rst challenge lies in determining what kinds of application-database relationships should
be abstracted by the framework. Choosing the right abstractions is essential as they directly de-
termine what kinds of detectors of SQL antipattern can be supported by the framework. However,
this is challenging because there are many types and variations of SQL antipatterns. Detecting
this wide range of SQL antipatterns may require analyzing dierent kinds of application-database
relationships.
To tackle this rst challenge, my insight is to analyze existing detection approaches. More
specically, I rst collect a full spectrum of SQL antipatterns and their detection approaches via a
literature survey. I then analyze each of these detection approaches and identify the application-
database relationships that are needed by these approaches. For example, one detector checks
if a SQL statement contains the keyword \LIMIT". The application-database relationship ana-
lyzed by this detector is the SQL statement. Based on the identied set of application-database
relationships, I design the abstractions that the framework should provide.
After identifying the key abstractions, the second challenge of building the framework lies in
extracting the abstractions eectively and eciently from the database access code in mobile apps.
The extraction process is challenging because local database operations have complex semantics
in mobile apps. At a high level, there are two major factors that contribute to the complexity
of the database access code. First, dynamically constructed string-based SQL statements are
frequently and widely used in mobile apps. In fact, two out of the top three most frequently used
local database APIs issue raw string-based SQL statements [71]. Half of these SQL statements
are not statically embedded, but are constructed dynamically via string operations, such as con-
catenation [71]. Second, database operations are usually involved in complex program paths, such
as loops and long call chains. In fact, the two APIs that issue string-based SQL statements are
also the most frequent APIs invoked in loops [71]. Adding to the complexity, 89% of database
operations are performed inter-procedurally with four methods involved on average [72]. The call
3
chain length can be as long as seventeen. Taken together, these numbers mean that extracting
abstractions needs to examine all inter-procedural paths leading to the database operations and
reason about various kinds of application-database relationships along these dierent paths.
To address the challenges with respect to extracting abstractions from the code, my key
insight is to develop a set of static analysis techniques that are specialized at analyzing the
database access code. These static analysis techniques can focus on handling the complications
of string-based SQL statements, loops, and other kinds of application-database relationships that
are required by the abstractions. Relying on static analyses instead of dynamic proling to extract
application-database relationships can avoid the situation where false negatives are introduced by
the workload. In other words, if dynamic proling is used and the chosen workload suers from
low code/path coverage, the code that matches the antipattern may not be triggered. In this
case, false-negatives are introduced. By extracting application-database relationships via static
analyses, the detectors do not need to make any assumption about the workload.
1.2 Hypothesis
In this section, I present the hypothesis that this dissertation tests. Based on the insights, the
hypothesis of my dissertation is:
Static detectors built on top of abstractions of application-database relationships can eectively
and eciently detect SQL antipatterns in mobile apps.
To evaluate this hypothesis, I designed a SQL antipattern detection framework. At a high
level, the framework abstracts away the interactions between the application and the database
using static analysis techniques. It provides a language, called SANDLess, that is used to query
the application-database relationships and specify the detection tasks of various SQL antipat-
terns. To determine what kinds of application-database relationships should be abstracted, I rst
conducted a systematic literature review to collect a comprehensive list of SQL antipatterns and
their detection approaches. I then analyzed the collected detection approaches and derived the
language specication from them. In order to extract the abstractions from the database access
code, I developed a range of static analysis techniques that can analyze the database access code
eectively and eciently.
To demonstrate the eectiveness and eciency of the framework, I implemented SAND as a
prototype tool for the Android platform and the SANDLess abstractions as a set of Java functions.
4
I used SANDLess to implement eleven SQL antipatterns that were discovered by the literature
review. I then ran these detectors on a large set of 1,000 top-ranked marketplace Android apps.
The results showed that the detection rules of all these eleven SQL antipatterns can be compactly
expressed in SANDLess, using 12 to 74 lines of Java code with low code complexity. These
SANDLess detectors, which are built on top of abstractions of application-database relationships,
precisely identied thousands of instances of SQL antipatterns with a precision of at least 99.4%.
These SANDLess detectors were also fast as applying eleven detectors only took an average of
forty-one seconds per app. Overall, these results conrmed the hypothesis of my dissertation and
indicated that my framework can detect all kinds of SQL antipatterns eectively and eciently
in mobile apps.
1.3 Overview of Dissertation
In this section, I provide an overview of my dissertation. To better illustrate my eorts in working
towards the goal of my dissertation, I divided my dissertation work into three parts, as shown
in Figure 1.1: a literature review (Section 1.3.1), a detection framework (Section 1.3.2), and a
benchmark study (Section 1.3.3). The relationships between these three parts are explained as
follows. I rst conducted a systematic literature review that laid the foundation for my detection
techniques. Through this review, I obtained a comprehensive list of SQL antipatterns, the existing
detection methods, and the repair strategies. Based on the results and insights I obtained from
this review, I designed a framework to detect the list of collected SQL antipatterns. To further
demonstrate the usefulness of my framework, I conducted a benchmark study to quantify the
real-world performance impact of the SQL antipattern instances identied by my framework. In
addition to the antipattern instances, this benchmark study also leveraged the repair strategies
identied by the literature review.
1.3.1 Literature Review
The goal of the literature review is to collect and analyze the current knowledge about SQL an-
tipatterns, which laid the foundation for my detection framework and the rest of my dissertation.
I conducted a systematic literature review of SQL antipatterns by following general guidelines for
a literature review process. The review process included three phases: planning, conducting, and
5
reporting the review. In the planning phase, I dened the research questions and the review pro-
tocol. The protocol included several steps, such as developing the search strategy and specifying
the selection criteria. In the conducting phase, I followed the planned protocol and documented
the results obtained from each step. In the reporting phase, I summarized the identied SQL
antipatterns, the existing detection methods, and the repair strategies. Through this review, I
identied eleven SQL antipatterns and their corresponding detection and repair methods. This
review not only helped me identify potential research gaps with respect to SQL antipattern de-
tection, but also provided insights that guided the design of my framework and benchmark study.
The full description of the protocol and results are in Chapter 3.
SQL Antipatterns A Literature Review of
SQL Antipatterns A Framework for SQL
Antipattern Detection A Benchmark Study of
SQL Antipatterns Detection methods Repair strategies Real-world
antipattern
instances Figure 1.1: Dissertation overview
1.3.2 Detection Framework
The goal is to design a generic framework for detecting SQL antipatterns in mobile apps. Based on
the comprehensive list of SQL antipatterns and their detection methods collected in the literature
review, I designed a detection framework called SAND (SQL Antipattern Detection). The frame-
work abstracted away the application's complex database access code and other implementation
details using a range of novel static analyses. It provided a language called SANDLess, whose
requirement was derived from the list of collected SQL antipatterns and their detection methods.
Using the SANDLess language, I then specied the SQL antipattern detection tasks based on
6
the detection rules proposed in the literature. These SANDLess detectors take an Application
Under Test (AUT) as input, and report the locations of the detected SQL antipattern instances
in the code. The evaluation results demonstrated that my framework can be used to compactly
express the detection tasks for all the collected SQL antipatterns. The detectors built on top of
my framework were also eective and ecient in identifying SQL antipatterns in mobile apps.
The full description of SAND is in Chapter 4.
1.3.3 Benchmark Study
The goal of the benchmark study is to demonstrate the performance impact of the SQL antipat-
terns identied by the detectors built on top of the framework. To achieve this goal, I constructed
a benchmark suite that was designed to measure the resource consumption of two implementa-
tions, the antipattern version and the xed version, of performing the same database related task.
I based the antipattern version of the implementation on real-world antipattern instances that
were collected at a large scale by my detection framework. I then applied the repair strategies
collected in the literature review to the antipattern version of the implementation so as to generate
the xed version. In order to ensure the generalizability of the study results, my benchmark suite
took dierent performance aecting factors into consideration and simulated their representative
values in mobile apps. To evaluate the impact, I measured and compared the resources (energy
and runtime) consumed by running the code snippets on multiple mobile devices. The evaluation
results demonstrated that for eight out of the eleven SQL antipatterns, they were performance-
oriented and they had a signicant performance impact. For the rest of the three antipatterns,
they were security-oriented and the x only introduced a minor performance cost. The detailed
protocol and results are described in Chapter 5.
1.4 Contributions
The contributions of my dissertation are as follows.
1. Literature Review: I conducted a literature review of SQL antipatterns, following the
best practices for a literature review process, to identify a comprehensive list of SQL an-
tipatterns, their detection methods, and repair strategies. The literature review laid the
foundation for the framework. The results obtained from the review revealed the limita-
tions of existing approaches, and more importantly, they helped to determine what kinds of
7
application-database relationships should be abstracted and how they should be abstracted
in the framework. To the best of my knowledge, this work was the rst literature review
that focused on SQL antipatterns. I discuss this contribution in Chapter 3.
2. Detection Framework: I designed and evaluated a framework for SQL antipattern detec-
tion. To the best of my knowledge, my approach was the rst to provide a general detection
framework for SQL antipatterns. As part of this contribution, I analyzed a comprehensive
list of SQL antipatterns and their detection methods collected from the literature review.
With this analysis, I identied the abstractions of application-database relationships that
were required by the framework. As another part of this contribution, I conducted an ex-
tensive evaluation to demonstrate the eectiveness and eciency of my framework in SQL
antipattern detection. I discuss this contribution in Chapter 4.
3. Benchmark Study: I designed a benchmark suite to quantify the performance impact of
the real-world antipattern instances identied by the detectors built on top of my framework.
To the best of my knowledge, my benchmark study was the rst to quantify and compare
the performance impact of SQL antipatterns on mobile apps in a uniform manner.
1.5 Overview of Publications
In this section, I provide an overview of the papers that I published and are directly related to
this dissertation. The papers are listed below. For each of the papers, I was the primary author
(or one of the primary authors), with contributions including idea, design, implementation, and
evaluation of the work. All of the papers were full papers in the main research track. They were
co-authored with my Ph.D. advisor, Prof. William G. J. Halfond.
Chapter 3: A Literature Review of SQL antipatterns
In this chapter, I present my literature review of SQL antipatterns. The information collected in
this review lays the foundation of my detection framework and benchmark study. This work was
published at the IEEE International Conference on Software Maintenance and Evolution (ICSME)
in 2019. This paper was co-authored with Ali Alotaibi, a Ph.D. student at USC.
8
1. [70] Yingjun Lyu, Ali Alotaibi, and William GJ Halfond. Quantifying the Performance
Impact of SQL Antipatterns on Mobile Applications. In 2019 IEEE International Conference
on Software Maintenance and Evolution (ICSME), pages 53{64, 2019
Chapter 4: Detection of SQL antipatterns
In this chapter, I discuss the detection framework that I designed for detecting SQL antipatterns
in mobile apps. The paper that focuses on the framework itself is currently under submission [72].
Related to this framework, there are two additional published papers. The rst paper was an
empirical study that investigated how local databases are used in mobile applications. The paper
provided essential insights about the complexity of database access code in mobile apps, which
guided the design of the detection framework. This paper was published at the IEEE International
Conference on Software Maintenance and Evolution (ICSME) in 2017. The co-authors of this
paper are Jiaping Gui and Mian Wan, who were both Ph.D. students at USC when the paper was
accepted. The second paper included a fully automated detection and optimization approach for
a specic antipattern. The parts of the approach that were responsible for analyzing loops and
transactions were utilized by the detection framework. This paper was published at the ACM
SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) in 2018. This
work was developed in collaboration with Ding Li, a fellow Ph.D. student at USC.
1. [71] Yingjun Lyu, Jiaping Gui, Mian Wan, and William G. J. Halfond. An Empirical Study
of Local Database Usage in Android Applications. In 2017 IEEE International Conference
on Software Maintenance and Evolution (ICSME), pages 444{455, Sept 2017
2. [73] Yingjun Lyu, Ding Li, and William G. J. Halfond. Remove RATs from Your Code:
Automated Optimization of Resource Inecient Database Writes for Mobile Applications.
In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing
and Analysis, ISSTA 2018, pages 310{321, 2018
3. [72] Yingjun Lyu and William G. J. Halfond. SAND: A Static Analysis Framework for
Detecting SQL Antipatterns in Mobile Applications. In Submission
9
Chapter 5: A Benchmark Study of SQL antipatterns
In this chapter, I describe the benchmark study that I conducted to quantify the impact of SQL
antipatterns on the resource consumption of mobile apps. This study was included in the paper
that was published at ICSME 2019, which is the same paper that contains the literature review.
1. [70] Yingjun Lyu, Ali Alotaibi, and William GJ Halfond. Quantifying the Performance
Impact of SQL Antipatterns on Mobile Applications. In 2019 IEEE International Conference
on Software Maintenance and Evolution (ICSME), pages 53{64, 2019
10
Chapter 2
Background
This chapter provides background information on concepts that are used throughout the disserta-
tion. Section 2.1 discusses the fundamentals of a mobile application. Section 2.2 introduces basic
information about SQL. Section 2.3 describes how a mobile application manages a local database.
Lastly, Section 2.4 gives background information about SQL antipatterns.
2.1 Mobile App Basics
Android is a mobile operating system based on a modied version of the Linux kernel. It is
the most widely adopted mobile computing platform [96]. Applications running on the Android
platform are mostly written in Java. Typically, an Android application is rst compiled to .class
les that contain Java bytecode instructions. These .class les are then converted to .dex les,
which contain Dalvik bytecode instructions and are compatible with the Dalvik virtual machine
and Android Runtime. Finally, the .dex les are packaged into an .apk le for distribution and
installation [1].
In the Android platform, each UI window displayed on a screen is called an activity. Each
activity is required to follow a prescribed lifecycle that denes how this activity is created, used,
and destroyed [23]. To navigate transitions between stages of the activity lifecycle, Android
provides a core set of six callbacks: onCreate(), onStart(), onResume(), onPause(), onStop(),
and onDestroy(). As shown in Figure 2.1, an activity's lifecycle always starts with a call to
onCreate(), when the system rst creates the activity, and ends with a call to onDestroy(),
when the system destroys the activity. After onCreate() is called, the onStart() call makes the
activity visible to the user, as the app prepares for the activity to enter the foreground and become
11
interactive. The onStart() method completes very quickly and once this callback nishes, the
activity invokes the onResume() method. The activity's foreground lifetime starts after this call to
onResume(), and lasts until onPause() is called, when another activity comes to the foreground.
If the entire activity goes to the background and becomes invisible, onStop() is called.
Activity on foreground Activity leaves foreground Figure 2.1: Android activity lifecycle
In additional to handling the lifecycle, the callback mechanism is also used to handle events.
These events represent user interactions and each event is registered to a particular UI object
with a callback function. When the user interacts with the UI object, the corresponding callback
function would be executed. For example, the callback method onClick() would be called when
the user touches the UI object, and the callback method onFocusChange() would be called when
the user navigates onto or away from the UI object, using the navigation-keys or trackball.
2.2 SQL Basics
SQL is a domain-specic language used to access and manipulate databases. It is the dominant
language for interacting with databases [87]. SQL provides many functionalities to users, including
accessing data, dening data in a database and manipulating that data, creating and dropping
databases and tables, and so on. SQL can be embedded within other languages using SQL
12
modules, libraries, and pre-compilers. For example, in Android apps, SQL is embedded within
Java in the form of Java strings.
A relational database system contains one or more objects called tables, which are used to store
the data or information for the database. Tables consist of columns and rows and are uniquely
identied by the tables' names. Each column in a table consists of the column name, data type,
and other attributes for the column. The data type of a column constrains the set of possible
values that can be assigned to a column. For instance, if a column is declared to be numerical,
it will not accept arbitrary text strings. Rows contain the records or data for the columns. The
number of rows in a table re
ects how much data is stored at a given moment.
There are some widely used SQL terminologies. Projection means choosing which columns (or
expressions) the query shall return. Selection means which rows are to be returned. In SQL, the
term query is a synonym for a SELECT statement, while a SQL statement is any SQL command,
such as a SELECT, INSERT, UPDATE, and DELETE statement. There are two ways to select
which columns to return in SQL. The rst way is to explicitly specify the columns. For example,
\SELECT id, name, grade FROM students". The second way is to use the \SELECT *" syntax,
which implicitly selects all the columns from the table.
2.3 Local Database Basics
A local database reads and writes data directly to local disk les. In the Android app ecosystem,
SQLite has become the most common local database service. It is used in over 90% of all Android
apps that access a local database service [71]. The Android runtime allows developers to manage
a SQLite database using the class SQLiteDatabase. Two types of APIs are provided in this class
to read or change the contents of a database. The rst type of API allows an entire SQL statement
to be specied as a raw string that is passed as a parameter to an API. The second type of API
predetermines the type of a database command, such as SELECT, and allows dierent parts of
the command to be specied as multiple parameters to an API. Internally, the second type of API
combines the parameters through string concatenations and converts them into a complete SQL
statement in the form of a raw string.
In Android, accessing the results of a query made against a database is done via the Cursor
class [22]. The class provides APIs that allow an app to iterate over the rows of the retrieved
results as well as read the columns of each row. Internally, the cursor stores the rows of data
13
returned by the query along with a position that points to the current row of data. The app
can manipulate this position via API calls, such as moveToNext() and moveToFirst(). Once the
cursor's position is pointing to a valid row, the columns of the row can be read from the cursor. To
read the column data, the app can either specify a string that indicates the name of the column
or an integer that indicates the index of the column.
Transactions are used to ensure data integrity in relational Database Management Systems
(DBMSs), including SQLite. Any command that changes the database has to be done in a
transaction. Each transaction involves operations such as writing and deleting rollback journal
les, and making database changes to the memory and on the disk. Transactions can be started
and ended manually via the calls to beginTransaction() and endTransaction() in Android
SQLite. Dierent from many popular remote DBMS, such as MySQL, transactions can be nested
in Android SQLite [31]; however, only the outermost transaction will perform commits or rollbacks
while the inner ones only keep track of the nesting relationship. If a group of database commands
are issued when a transaction is open, they are processed in a batch.
2.4 SQL Antipatterns
SQL antipatterns describe bad programming practices of using database operations in the ap-
plication code. These antipatterns focus on the problems in the application code instead of the
database design. They usually do not aect the correctness of the program, but undermine many
non-functional aspects, such as performance, security, and maintainability of the program. For
example, a performance-oriented SQL antipattern arises when developers issue a query that reads
more columns than needed. While the data transfer between the application and the database
comes with a cost, retrieving columns that are not needed in subsequent computation can be
harmful to performance and scalability. For another example, a security-oriented SQL antipat-
tern arises when a vulnerable input is not properly sanitized and is concatenated to a SQL query.
This could introduce SQL injection vulnerabilities that allow the attacker to manipulate the syn-
tax of the SQL query, threatening the security of the application and its underlying database.
More details about the denitions and rationales of various SQL antipatterns known to date are
described in Section 3.2.
14
Chapter 3
A Literature Review of SQL Antipatterns
The goal of the literature review is to summarize and analyze the current knowledge about SQL
antipatterns. In this chapter, I give a detailed description of the protocol of my literature review
in Section 3.1. Following the protocol, I obtained a comprehensive list of SQL antipatterns, their
denitions, rationales, detection methods, and optimization methods. The details of these results
are described in Section 3.2. These results are essential to other parts of my dissertation, as they
provide key insights to address the challenges with respect to SQL antipattern detection, and
guide the design of the detection framework (Chapter 4) and the benchmark study (Chapter 5).
3.1 Literature Review Methodology
In this section, I present my literature review protocol. Based on Kitchenham's systematic re-
view guidelines [55], I rst specied the research questions (Section 3.1.1). I then dened the
search process, which includes the initial search strategy (Section 3.1.2.1), the selection criteria
(Section 3.1.2.2), and the citation search strategy (Section 3.1.2.3).
15
Specify research questions (Section 3.1.1) Perform initial search (Section 3.1.2.1) Apply selection criteria (Section 3.1.2.2) Perform citation search (Section 3.1.2.3) Conduct detailed paper review Answer research questions (Section 3.2) Figure 3.1: Literature review process
ow
3.1.1 Research Questions
The goal of the study is to provide a comprehensive overview of SQL antipatterns, covering their
denitions, qualitative impact, detection methods, and repair strategies. In this study, I addressed
the following research questions:
RQ 1: What are the SQL antipatterns known to date? The information about SQL
antipatterns is very fragmented. The vast majority of related papers often focus on specic
problems and only cover a small set of SQL antipatterns. The aim of this research question is
to distill the scattered antipatterns from the literature and provide a comprehensive collection of
SQL antipatterns.
RQ 2: In what aspects do the SQL antipatterns aect the software quality?
The impact of SQL antipatterns on the software quality can be multidimensional. The purpose
of this research question is to qualitatively explore which aspects of software quality each SQL
antipattern aects and why it aects them.
RQ 3: What is the current state of research with respect to the detection and
repair of the SQL antipatterns?
The SQL antipatterns target dierent database programing problems. It is likely that some
SQL antipatterns received more research attention than others. The aim of this research question
16
Table 3.1: Selected conferences and keywords
Venue Research Field Abbreviation Keywords
International Conference on Software Engineering Software Engineering ICSE code smell, antipattern, database, sql
Foundations of Software Engineering Software Engineering FSE code smell, antipattern, database, sql
Automated Software Engineering Software Engineering ASE code smell, antipattern, database, sql
Computer and Communications Security Security CCS database, sql
Security and Privacy Security S&P database, sql
Special Interest Group on Management of Data Database SIGMOD database application, developer
is to explore the methods proposed to detect and x each of the discovered SQL antipatterns. We
discuss the potential limitations of the current techniques and the open research questions with
respect to the detection and repair of SQL antipatterns in mobile applications.
3.1.2 Literature Search Protocol
The literature search protocol aims to identify related studies that form the basis of the survey.
The work
ow of the search process is shown in Figure 3.1. My search protocol has three major
steps: conducting an initial search to collect a primary set of related literature, applying selection
criteria to lter the search results, and iteratively performing a citation search to maximize the
identication of the set of related studies.
3.1.2.1 Initial Search
The goal of this step is to collect an initial set of related studies. In this section, I explain my
search strategy, which includes the selection of search engines, keywords, and search methods.
To collect an initial set of papers, I started the search with the papers that were published in the
related major conferences during the past ten years. Since the studies on SQL antipatterns may
involve research in multiple areas, i.e., software engineering, databases, and security, the initial set
of conferences were selected from these elds. The selected conferences, each of which represents
top conferences in its eld, are shown in Table 3.1. As the publications of these conferences
are stored in the electronic databases, IEEE Explore and ACM Digital Library, I utilized their
searching capability as explained below.
Given the scope of my literature review, I focused on the selected keywords to perform the
search on the papers' titles and abstracts. Since each research domain has its own terminology,
dierent keywords were chosen for dierent research elds. The keywords chosen are listed under
the \Keywords" column in Table 3.1. To retrieve all related papers, each research domain in
the search string was represented as a disjunction of the corresponding keywords provided above.
17
This means that for example, in the security domain, any paper that has \database" or \sql" in
its title or abstract would be included in the initial set.
3.1.2.2 Selection Criteria
Not all of the papers returned by the search engines can help to answer the research questions.
Therefore, I used the following inclusion and exclusion criteria to further lter the candidate
papers.
Inclusion Criteria. A search result should be included if it: (1) discovers/denes at least one
SQL antipattern, (2) explores the impact of at least one SQL antipattern, (3) proposes detection
methods, or (4) develops repair mechanisms.
Exclusion Criteria. A search result should be excluded if it: (1) focuses on general coding
practices or antipatterns that are unrelated to databases; (2) discusses problems at the system
level, such as the database system, operating system, or le system; or (3) discusses problems at
the design level instead of the implementation level.
Each selected paper obtained from the search went through a manual inspection of the title,
keywords, and abstract. The inspection applied the above criteria to carry out the inclusion or
exclusion of the papers.
3.1.2.3 Forward/Backward Citation Search
The objective of this step is to obtain a more comprehensive list of studies by investigating the
citations. After performing the previous two steps, I had an initial set of studies. However, these
primary studies only covered the selected conferences in the selected years. To expand the list
of studies, I performed an iterative forward and backward citation search. The forward search
investigated the works that cite the primary studies. The backward search inspected the articles
that are cited by the primary studies. The inspection followed the same selection criteria and
protocol dened in Section 3.1.2.2. This citation search process leverages the authors' knowledge
and expertise in their domain to identify additional closely related work. As shown in Figure 3.1, I
conducted this step iteratively for multiple rounds in order to maximize the list of related studies.
The iteration continued until the newly obtained citations highly overlapped with the previous
collected studies. In other words, until few new articles that meet the selection criteria could be
found.
18
3.2 Results
Following the literature review protocol, I identied 57 related studies (56 research papers and 1
book). To answer the research questions, I manually inspected the collected literature.
In the following sections, I list the denitions and rationales for the eleven identied SQL
antipatterns. For each of them, I also present and discuss the detection and repair methods.
3.2.1 Antipattern: Unbatched-Writes
Denition: A sequence of database writes issued separately instead of being batched in a single
transaction [17, 105, 14, 71].
Rationale: If the database writes trigger transactions repetitively, the transaction processing
overhead of each write request can lead to signicant ineciency. Batching can reduce the over-
head as the database writes performed in a batch would only trigger the transaction operations
once.
Techniques:
Chen and colleagues [17] targeted the Unbatched-Writes antipattern for applications developed
using Object-relational Mapping (ORM) frameworks. These frameworks map standard APIs to
SQL statements and allow developers to access the database by manipulating objects. Chen's
technique statically detects the database-accessing functions that are invoked inside a loop without
being optimized by some ORM batching annotations (e.g., Batch).
RAT-TRAP [73] statically detects database writes that happen within loops and that will
trigger inecient autocommit behaviors. RAT-TRAP then uses additional analyses to identify
those that are optimizable and rewrites the code.
Tamayo and colleagues [98] developed a technique that can suggest batching opportunities by
analyzing the dependencies between multiple SQL statements. If a sequence of data writes does
not have certain data dependencies, the technique will suggest submitting all the write statements
together with a single database write API. In order to obtain the SQL statements issued by the
application and support further analysis, this technique relies on proling. As with most dynamic
analyses, the technique can only cover the code that is executed by the workload. If the chosen
workload is not representative, it could introduce false-negatives.
19
3.2.2 Antipattern: Not-Merging-Projection-Predicates
Denition: Instead of issuing a single SQL query to read all the needed columns at once,
developers issue multiple queries, each of which only reads a subset of the required columns [74, 3].
For example, \SELECT name FROM User" and \SELECT id FROM User" could have been
merged to \SELECT id, name FROM User".
Rationale: This antipattern describes the problem of transmitting too many small database
requests instead of aggregating them. The processing overhead of each request, e.g., the cost
of network round trip (when communicating with remote databases) and query processing, can
introduce unnecessary ineciency.
Techniques: Manjhi and colleagues [74] developed static analyses to detect the Not-Merging-
Projection-Predicates antipattern. Their analyses check if (1) the second query is executed when-
ever the rst query is executed, and (2) the queries are identical except for the projection pred-
icates. To automate the transformation, they used a source-to-source compiler to merge the
queries. The limitation of this work lies in its assumption about the method of constructing SQL
statements. The technique assumes that the entire SQL statement is statically embedded at the
database interaction point. When analyzing local database operations in mobile apps, it can miss
detection and optimization opportunities as SQL statements are usually constructed at runtime
by various string operations along dierent paths [71].
Arzamasova and colleagues [3] analyzed the existing SQL query logs to look for the Not-
Merging-Projection-Predicates antipattern. Their log analyzer checks if a sequence of SQL queries
has equivalent FROM and WHERE clauses, but a dierent SELECT clause. However, since their
techniques takes the query log as an input and does not analyze the application code, it can only
determine whether the Not-Merging-Projection-Predicates antipattern exists but not where the
antipattern is located in the application code.
3.2.3 Antipattern: Not-Merging-Selection-Predicates
Denition: Instead of issuing a single SQL query to read all the needed rows at once, developers
issue multiple select queries, each of which only reads a subset of the required rows [74, 3].
Rationale: The rationale behind this antipattern is similar to the Not-Merging-Projection-
Predicates antipattern, as they both describe the problem of not aggregating the queries and
introduce performance overheads.
20
Techniques: To detect this antipattern, Manjhi and colleagues [74] developed static analyses
to check if (1) the second query is executed whenever the rst query is executed, and (2) the
queries are identical, except one selection clause. To automate the transformation, they used a
source-to-source compiler to merge the queries.
Arzamasova and colleagues [3] also covered the Not-Merging-Selection-Predicates antipattern
in their query log analyzer. The analyzer identies the SQL queries that have equal SELECT and
FROM clauses, but a dierent WHERE clause.
The limitations of these techniques are the same as the ones discussed in Section 3.2.2.
3.2.4 Antipattern: Loop-to-Join
Denition: Instead of joining two tables and querying from the joint table, developers rst query
from one table to get multiple values and then for each value (using a loop structure) query from
another table [18, 74, 104].
Rationale: This antipattern causes unnecessary database requests. The processing overhead of
each request can introduce unnecessary ineciency.
Techniques: Manjhi and colleagues [74] proposed a detection analysis for the Loop-to-Join
antipattern. The analysis checks if: (1) the loop iterates using the result of a previous query, (2)
the loop issues a query in each iteration, and (3) the previous query is executed whenever the
loop executes. Once the pattern is identied, they used the work done by Kim [53] to replace the
small queries with a merged query. As mentioned previously, this technique assumes that SQL
statements are statically embedded and cannot analyze dynamic string-based SQL statements.
Cheung [18] and Emani [25] both proposed techniques to optimize the Loop-to-Join antipattern
for applications developed using ORM. Their techniques automatically transform imperative code
fragments (written using ORM and nested loops) into SQL queries that use joins. Since these two
techniques target ORM applications, they assume that there is a one-to-one mapping between the
database access API and the concrete SQL statement. However, this assumption does not hold
for mobile apps as dynamic string-based SQL statements are widely used [71].
3.2.5 Antipattern: Vulnerable-Query
Denition: A vulnerable input is not properly sanitized and is concatenated to a SQL query
[5, 85, 95, 49]. For example, if a string derived from an untrusted source, such as user input,
21
is concatenated to a query using string operations, the resulting query would be considered as a
vulnerable query.
Rationale: This antipattern is the root cause of the SQL injection vulnerability. SQL injection
refers to a type of attack in which data that derives from untrusted input sources is included
in an SQL query in such a way that part of the vulnerable input is treated as SQL code. This
allows the attacker to manipulate the syntax of the SQL query, threatening the security of the
application and its underlying database.
Techniques: There is a wide range of techniques in the security domain that tackle this antipat-
tern. These techniques include, but are not limited to static code checkers [33, 102], combined
static and dynamic analyses [97, 10], taint based approaches [44, 69, 37], etc. These techniques
have been well-summarized and discussed by existing surveys [38, 54, 47].
3.2.6 Antipattern: Not-Using-Parameterized-Query
Denition: A parameterized query (also called a prepared statement) takes the form of a tem-
plate that contains the logical purpose of the query. It leaves certain values unspecied, called
parameters or placeholders. The actual values of those parameters are bound at runtime. This
antipattern arises when a query could have been parameterized but it is not [7, 8, 14, 49].
Rationale: By parameterizing the queries, the DBMS can parse, compile, and perform query
optimization on the SQL statement template. The processed template can be used for the future
execution with input parameters. From a security perspective, parameterized queries are very
useful against SQL injection because the parameter values are syntactically bound to the positions
of string literals and are therefore not interpreted as commands. From a performance perspective,
parameterized queries can reduce parsing time because the preparation on the query is done only
once no matter how many times the query is executed. Therefore, not parameterizing the query
would lose these benets.
Techniques: TAPS [7, 8] is an automated technique for query parameterization. It analyzes
the parsed structure of the SQL statements to identify data arguments for the parameterized
query. It then traverses the program backwards to the program statements that generate these
arguments, and substitutes the arguments with placeholders (i.e., the symbol \?"). To compute
the possible SQL statements issued by the application, TAPS uses symbolic execution, which
could be inecient and not scalable.
22
3.2.7 Antipattern: Not-Caching
Denition: Multiple syntactically equivalent or partially equivalent queries are issued to retrieve
data from the database [16, 112, 104, 93, 105]. These queries can be exactly the same or they
share a common subexpression.
Rationale: Not caching query results often leads to redundant computations being performed.
If the database contents have not been altered between the execution of a group of syntactically
equivalent queries, caching their results can help to improve performance.
Techniques: CacheOptimizer [16] is a technique that helps developers optimize the congura-
tion of caching frameworks for web applications that are implemented using Hibernate (a Java
ORM framework). CacheOptimizer leverages existing web logs to discover the optimal cache
conguration so that the caching framework of Hibernate can cache the query results properly.
Yan and colleagues [104] proposed a technique to detect the Not-Caching antipattern in web
applications that are developed using ORM. ORM frameworks allow users to construct queries by
chaining multiple function calls (e.g., where, join), with each chain translated into a SQL query.
To detect the equivalent or partially equivalent queries, the proposed technique analyzes the query
call chains using static analysis: if the same ORM query function is used in two dierent ORM
query function chains, the corresponding queries will share a common expression. This technique
targets ORM applications and therefore cannot analyze dynamic string-based SQL statements.
Sqlcache [93] is a compiler optimization technique that automatically adds sound SQL caching
to Web applications. Sqlcache computes conditions for irrelevance between read queries and
write queries [9]. If relevant, Sqlcache instruments the application so as to enable automatic
cache invalidations. This technique targets applications coded in the Ur/Web domain-specic
functional language. This language forces programmatically constructed queries to follow the
grammatical structure of SQL. Therefore the technique cannot handle dynamic string-based SQL
statements.
As mentioned previously in the Unbatched-Writes section, the technique by Tamayo and col-
leagues [98] analyzes the dependencies between multiple SQL statements. In addition to nding
batching opportunities, the technique also looks for caching opportunities by checking if multiple
queries always return the same result in their dependency graph. However, the technique relies
on dynamic proling to obtain the SQL statements instead of analyzing the application statically.
23
3.2.8 Antipattern: Unnecessary-Column-Retrieval
Denition: Developers issue a query that reads more columns than needed [15, 105, 104, 49]. For
instance, the query \SELECT * FROM User" retrieves all the columns, but in the application,
a subset of the retrieved columns is never used.
Rationale: The more columns the SQL query fetches, the more data must travel between the
application and the database. While the data transfer comes with a cost, retrieving columns that
are not needed in subsequent computation can be harmful to performance and scalability.
Techniques: Chen and colleagues [15] proposed an approach that combines static and dynamic
analysis to detect this Unnecessary-Column-Retrieval antipattern. The static analysis part of
the approach identies and instruments database-accessing functions. The dynamic analysis part
obtains the code execution traces and the SQL queries. Then, the approach discovers instances
of this antipattern by examining the data access mismatches between the needed columns and
the requested columns. This technique relies on dynamic analysis to identify the needed columns.
However, if a column is needed somewhere in the code, but the chosen workload does not trigger
the code, the technique would consider that the column is unnecessarily retrieved, which is a
false-positive. In addition to this limitation, this technique also targets ORM applications and
cannot analyze string-based SQL statements.
Yan and colleagues [104] proposed a detector using only static analysis. The detector identies
the columns retrieved by each query and checks if there are subsequent uses of the retrieved
columns. If there are no such uses, it means the Unnecessary-Column-Retrieval antipattern occurs
in the application. This technique addresses the false-positive problem of Chen's technique [15],
since it uses static analysis to safely identify the used columns. However, this technique targets
ORM applications and cannot analyze string-based SQL statements.
3.2.9 Antipattern: Unnecessary-Row-Retrieval
Denition: Developers issue a query that reads more rows than needed [14, 24]. When table
rows are retrieved from the database, not all of them may be needed by the application. This
antipattern arises when the retrieved rows are ltered by the application logic and only a subset
of the rows are actually used by the application.
24
Rationale: Similar to the Unnecessary-Column-Retrieval antipattern, redundant row retrieval is
a waste of resources and can harm performance. The ltering conditions could have been specied
in the selection clause of the query so as to reduce the amount of data transfered.
Techniques: Chaudhuri and colleagues [14] proposed a cost estimation tool to alleviate this
antipattern. They rst used dynamic proling to compute the number of rows consumed by the
application and the total number of rows returned by the query. If the query returns many rows
(say N) and the application consumes only k rows (k N), then it may be benecial to pass a
query hint to the database server requesting that the plan should be optimized for returning the
top k rows quickly. Based on the proling information, Chaudhuri developed a Fast-k analysis
tool that can estimate how the cost of the query varies with k. Such information can be used by
developers to decide if it is appropriate to rewrite their query to use the hint.
Dugan and colleagues [24] analyzed the antipattern using software performance engineering
modeling techniques and compute several metrics, such as the number of disk I/O operations.
Using the same methodology, they also examined several solutions to the antipattern using a
lower bound and upper bound in the query to limit the number of retrieved rows.
DBridge [25] includes an optimization mechanism for unnecessary row retrieval. It identies
lters on query results expressed using conditional imperative constructs, such as if, and pushes
them into the query as a selection.
3.2.10 Antipattern: Unbounded-Query
Denition: When a query may return an unbounded number of records and there exists a
subsequent computation over the returned records [104, 105, 85, 106].
Rationale: From a performance perspective, if the application directly renders the results of
the query (whose size can be potentially very large), the responsiveness of the application can
be aected because of the long rendering process [104]. From a security perspective, attackers
can carry out a second-order denial-of-service attack by tainting the database table with a large
number of records [85]. Then if an unbounded query is issued to retrieve data from the table,
and the number of executions of a loop is controlled by the query result, CPU exhaustion may
happen.
Techniques: The work by Yan and colleagues [104] examines if a query is bounded by checking
if (1) the query always returns a single value (e.g., a COUNT query); (2) the query always returns
a single record (e.g., retrieving using a unique identier); or (3) the query uses a LIMIT keyword
25
bounding the number of returned records. This work targets ORM application and does not
handle string-based SQL statements.
Olivo and colleagues [85] veried if this antipattern occurs by statically checking if (1) a
database attribute can be tainted by a user input, (2) a query uses the tainted database attribute
in its selection clause, and (3) the number of executions of a loop is controlled by the tainted
query result. In terms of the limitation of this technique, it is mentioned in the paper that some
false positives in their evaluation can be eliminated by employing a more precise string analysis.
Panorama [106] suggests a variety of refactorings that can x this antipattern, such as pagi-
nation, asynchronous loading, etc. It identies opportunities for applying such refactoring in web
applications, and automatically suggests patches.
3.2.11 Antipattern: Readable-Password
Denition: Sensitive information, such as a user password, is stored in plain text in the database [49].
This antipattern arises if an application stores the password by specifying the password as a string
literal in an insert or update statement. When authenticating the password, this antipattern can
also appear if the application compares the user's input to the password string stored in the
database. For example, \SELECT * FROM Accounts WHERE account id = `123'AND password
= `opensesame'".
Rationale: This programming practice results in security vulnerabilities because if attackers can
read the SQL statement, they can see the plain text password. Hackers can steal a password by
searching SQL statement logs, or reading data from database backup les.
Techniques: We have not identied any technique that targets this antipattern.
3.3 Summary
In this chapter, I presented an overview of the current knowledge about SQL antipatterns. I iden-
tied eleven SQL antipatterns through my literature review. They were performance-oriented
or security-oriented. Fixing these antipatterns could have potential performance implications.
However, there does not exist any systematic study that quanties the impact of these SQL an-
tipatterns on the performance of mobile apps. In terms of detection and optimization techniques, I
found that there exist automated approaches for ten out of the eleven SQL antipatterns. However,
I discovered that each of these existing approaches only targeted a small subset of specic SQL
26
antipatterns. Most existing approaches focused on analyzing database operations in web apps and
had assumptions that could undermine their capability in analyzing database operations in mobile
apps. More specically, they either assumed that the entire SQL statement is statically embedded,
or there is a one-to-one mapping from the database access API to the concrete SQL statement.
Overall, the results obtained from the review revealed the limitations of existing approaches, and
more importantly, they provided insights about how to detect and repair various kinds of SQL
antipatterns. These insights guided the design of the detection framework (Chapter 4) and the
benchmark study (Chapter 5).
27
Chapter 4
A SQL Antipattern Detection Framework
The goal of the work presented in this chapter is to provide a general framework that can detect
SQL antipatterns eectively and eciently in mobile apps. In this chapter, I give a detailed
description of my detection framework. I elaborate on the process of determining the key ab-
stractions of application-database relationships, and explain the various static analysis techniques
that I developed in order to extract those abstractions from the database access code. In this
chapter, I also provide an empirical evaluation of the framework to demonstrate its eectiveness
and eciency in SQL antipattern detection.
This chapter is organized as follows. In Section 4.1, I describe my detection framework in detail.
Then in Section 4.2, I give a detailed evaluation of the framework. I discuss the limitations of the
framework in Section 4.3. Finally, I give a summary in Section 4.4.
4.1 Approach
In this section, I describe SAND (SQL Antipattern Detection), a framework for SQL antipattern
detection. I begin with an overview that describes the work
ow of SAND and the goals I am trying
to achieve with this framework. I then discuss the design and implementation of the detection
specication language that SAND provides.
4.1.1 Overview of SAND
At a high level, the SAND framework abstracts away the interactions between the application
and the database. It provides a language, called SANDLess, that is used to query various kinds of
application-database relationships. In the work
ow, a detector written in SANDLess species the
28
concrete detection logic on top of the abstracted application-database relationships. It takes an
AUT as input, and outputs the locations of the pieces of code that matches the detection logic.
The framework has the following goal: to be eective and ecient in supporting the full spec-
trum of SQL antipattern detection tasks. This goal raises two signicant challenges. First, there
are many types and variations of SQL antipatterns. The detection of each of these SQL antipat-
terns may require analyzing dierent kinds of application-database relationships. This challenge
requires the framework to extract proper abstractions of application-database relationships so as
to enable all kinds of SQL antipattern detection tasks. Second, local database operations have
complex semantics in mobile apps. Applications frequently embed the construction of SQL state-
ments in the application logic with various string operations and complex code constructs [71].
Adding to the complexity, 89% of database operations are performed inter-procedurally with long
call chains. SAND needs to accurately and scalably examine all inter-procedural paths leading
to the database operations and reason about various kinds of application-database relationships
along these dierent paths.
In the following sections, I explain the design of SAND, as well as how this design can achieve
the goals and address the challenges.
4.1.2 Language Requirement
The language requirement denes what key abstractions of application-database relationships
SANDLess needs to provide and directly determines whether SAND can support the detection of
the full spectrum of SQL antipatterns. To identied the key abstractions, I analyzed the complete
list of SQL antipatterns and their corresponding detectors collected in Chapter 3. These detectors
and their algorithms contain information about what to analyze and how to analyze the database
access code. They therefore determine what should be included in the requirement.
The process of determining the language requirement can be divided into three steps. First,
given a SQL antipattern, I identied the detection rules proposed by its existing detectors. These
rules do not include the details of the detectors' underlying analyses, but describe how those
detectors determine whether a piece of database access code matches the antipattern. Second,
given the detection rules, I extracted the application-database relationships that were analyzed by
each rule. Third, I categorized the extracted application-database relationships. Following these
three steps, I analyzed each of the SQL antipatterns and their detectors. The results of the rst
two steps are summarized in Table 4.1, where I show the detection rules and the corresponding
29
application-database relationships. The results of the third step are summarized in Table 4.2.
Each column represents one category of application-database relationships. Each row represents
how an antipattern's relationships (the items in the third column of Table 4.1) map to the general
categories of application-database relationships.
Summarizing my ndings, detecting SQL antipatterns generally needs to analyze the (1) SQL
statements, (2) reachability relationships, (3) loops, (4) control dependencies, (5) transactions,
(6) data dependencies with respect to the SQL statements' input (i.e., input defs), and (7) data
dependencies with respect to the SQL statements' output (i.e., output uses). In the next section,
I will explain how I designed abstractions for each of these application-database relationships.
4.1.3 The SANDLess Language
In order to support all kinds of SQL antipattern detection tasks, SAND provides functionalities to
abstract all of the application-database relationships listed in Table 4.2. I designed SANDLess as
a Java extension. The reason is that, rst, Java is a language with which many people are familiar.
Detector builders do not need to learn a new language in order to use the framework. Second,
Java is Turing-complete. The SANDLess abstractions and functions can be used in combination
with Java to express all kinds of analyses. In this section, I explain the abstractions and functions
provided in SANDLess, and elaborate on how SAND handles the complex semantics of database
operations when extracting abstractions from the database access code.
The design of SANDLess is driven by one essential commonality among all the detectors - they
center on the program points that perform API calls to issue SQL commands to the database.
(Using the terminology introduced by Kapfhammer and Soa [48], we call these program points
database interaction points). Every detector needs to analyze these database interaction points
and reason about the application-database relationships with respect to them. This observation
guides the design of SANDLess. I propose an abstraction, called Silica, that helps to analyze the
database interaction points and the possible SQL statements issued at these points. I also propose
a set of SANDLess functions that help to obtain the other application-database relationships listed
in Table 4.2.
4.1.3.1 The Silica Abstraction
The silica abstraction serves as a fundamental unit of SANDLess. It is a tuplehp, ri, consisting of
a database interaction point p, and a string model r that represents the possible SQL statements
30
Table 4.1: Antipatterns and their required abstracted information
Antipattern Detection Rules Application-database Relationships
Unbatched-
Writes
Identify the database writes that are
performed inside a loop but not inside an
open transaction [73, 17].
(1) The type of the SQL statements (2) The
reachability relationship between the
program point that issues the SQL
statement and the loops (3) The control
ow
relationship between the program point that
issues the SQL statement and the open
transactions
Not-Merging-
Projection-
Predicates
Examine if two database reads can execute
one after another, and the two issued queries
are identical except for the projection
predicates [74, 3].
(1) The type of the SQL statements (2) The
reachability relationship between the two
program points that issue the SQL
statements (3) The projection predicates
and the other parts of the SQL statements
Not-Merging-
Selection-
Predicates
Examine if two database reads can execute
one after another, and the two issued queries
are identical except for the selection
predicates [74, 3].
(1) The type of the SQL statements (2) The
reachability relationship between the two
program points that issue the SQL
statements (3) The selection predicates and
the other parts of the SQL statements
Loop-to-Join
Check if a database read is performed to
retrieve data, a loop then iterates using the
result of the read data, and there is another
database read issued in the same loop that
executes after the previous database read and
uses the previous read data [74].
(1) The type of the SQL statements (2) The
data
ow of the output of the SQL
statements (3) The reachability relationship
between the two program points that issue
the SQL statements (4) The control
ow
relationship between the program point that
issues the SQL statement and the loops
Vulnerable-
Query
Trace the concatenated input of a SQL
statement to determine if it is vulnerable [69].
(1) The data
ow of the input of the SQL
statements
Not-Using-
Parameterized-
Query
Analyze the parse structure of the SQL
statement to identify substrings at the
syntactical positions of data values, and
determine if the values are generated
dynamically [7].
(1) The data values of the SQL statements
(2) The data
ow of the input of the SQL
statements
Not-Caching
Check if two database reads can execute one
after another, and the two issued queries are
equivalent [104].
(1) The type of the SQL statements (2) The
reachability relationship between the
program points that issue the SQL
statements (3) The concrete SQL statements
Unnecessary-
Column-
Retrieval
Find the database read, analyze the SQL
statement to identify the columns retrieved
by the read, and verify that there is no
subsequent use of the retrieved
columns [104, 15].
(1) The type of the SQL statements (2) The
projection predicates of the SQL statements
(3) The data
ow of the output of the SQL
statements
Unnecessary-
Row-
Retrieval
Check if any data retrieved from a database
read is used in a conditional construct (such
as if), and this conditional construct guards
the use of the retrieved data [25].
(1) The type of the SQL statements (2) The
data
ow of the output of the SQL
statements (3) The control dependence
relationship between the program points
Unbounded-
Query
Analyze the SQL statement to examine if it
is bounded, such as using the LIMIT
keyword [104], and check if there is any loop
iterating over the output of this unbounded
database read [85, 106].
(1) The concrete SQL statements (2) The
control
ow relationship between the
program points where the output is used and
the loops (3) The data
ow of the output of
the SQL statements
Readable-
Password
Analyze the SQL statement to determine the
part of the query that is likely to be sensitive,
and then check this sensitive data has been
encrypted [49].
(1) The concrete SQL statements (2) The
data
ow of the input of the SQL statements
31
Table 4.2: Categorized application-database relationships
SQL
statements
Reachabil-
ity
Loop
Control
dependence
Transac-
tion
Input def Output use
Unbatched-Writes (1) (2) (3)
Not-Merging-
Projection-Predicates
(1), (3) (2)
Not-Merging-
Selection-Predicates
(1), (3) (2)
Loop-to-Join (1) (3) (4) (2)
Vulnerable-Query (1)
Not-Using-
Parameterized-Query
(1) (2)
Not-Caching (1), (3) (2)
Unnecessary-Column-
Retrieval
(1), (2) (3)
Unnecessary-Row-
Retrieval
(1) (3) (2)
Unbounded-Query (1) (2) (3)
Readable-Password (1) (2)
issued at p. For the ease of explanation, given a silica s, we use s:p and s:r to denote the two
elements in the tuple. We dene S to be the set of all silicas in an AUT, R to be the set of all
models, D to be the set of all database interaction points, and P to be the set of all program
statements. (Note that D is a subset of P .) In order to identify and analyze the silicas, I dene
three functions.
4.1.3.1.1 getSQLPoints : AUT 7! D Given an AUT, SAND locates invocations to the
specic APIs that are responsible for issuing database commands. The program points that
perform these invocations, i.e., the database interaction points D, are returned by this function.
4.1.3.1.2 getSQLModel : p2 D 7! r 2 R Given a database interaction point p, SAND
normalizes the specic semantics of the database API invoked at p, and constructs a unied
string model r that represents the possible SQL statements issued at p.
The normalization is important because of two reasons. First, dierent types of database
APIs can be invoked at the database interaction points and each type has it specic semantics
of constructing SQL statements. For example, one group of database APIs allows an entire SQL
statement to be specied as a string parameter, as shown at line 13 of Listing 4.1. Another
group of database APIs allows dierent parts of a database command to be specied as multiple
parameters, as shown at line 4 of Listing 4.1. Normalization can help the detectors focus on the
issued SQL statements without the need of handling the specic semantics of the database APIs.
32
Second, the process of constructing SQL statements is embedded in the application logic and can
be very complex. As discovered in a recent study [71], dynamically constructed SQL statements are
frequently and widely used in mobile apps. In fact, two out of the top three most frequently used
database APIs issue raw string-based SQL statements, half of which are constructed dynamically
via complex string operations [71]. These complex semantics of constructing SQL statements even
limit the capability of existing detectors in identifying SQL antipatterns in mobile apps. Many
existing detectors either assume that the SQL statements are static/hard-coded [85, 93, 12, 74, 94],
or that there is a one-to-one mapping between the database access API and the concrete SQL
statement (i.e., ORM) [17, 16, 25, 104, 18]. Although some existing detectors develop their own
mechanism or use an existing string analysis approach to deal with the dynamic string-based
SQL statements [7, 8, 39], these string analysis approaches suer from slow performance, low
precision, and the lack of ability to handle complex string operations. These drawbacks in string
analysis ultimately aect the eectiveness and eciency of the detectors. By providing an accurate
and unied string model, SAND allows the detectors to focus on analyzing the SQL statements
themselves without the need of developing complex analyses to reason about how these SQL
statements are constructed.
SAND needs to meet two requirements in order to construct an accurate and unied model
r for the SQL statements issued in mobile apps. First, as mentioned in Section 4.1.1, nearly
90% of database operations are performed inter-procedurally. SAND needs a string model that
can capture context-sensitive call site information. Second, as discovered in the study [71], the
construction of SQL statements involves complex string operations. SAND needs a string model
that can capture all kinds string operations. To address these needs, SAND bases the construction
of r on a string representation proposed by a state-of-the-art string analysis technique, Violist [61],
that I participated in designing and implementing. The representation dened in Violist is called
Intermediate Representation (IR). The IR is structured as an expression tree with the leaf nodes
dening either constants or placeholders for unknown variables. The internal nodes in the tree
are string expressions that represent the values of string variables in the AUT. For the ease of
explanation, we represent expressions in the tree as op a
1
a
2
. In this expression, op represents
the operator and a
i
represents the operand. This IR captures data-
ow dependencies in loops,
context-sensitive call site information, and the string operations applied to the string variable
along the various paths leading to the use of a string variable. In Listing 4.1, the IR of the
parameter at line 13 is denoted as + \INSERT INTO reg (id) VALUES (" (+ X
7
\)"). The +
33
sign denotes string concatenation. The unique placeholder X
7
indicates that the variable that is
external to the analysis scope is dened at line 7.
Depending on the database API invoked at p, SAND constructs the model r accordingly. If
an entire SQL statement is passed as a parameter to the API, the IR of the parameter is directly
used as the value of r. If dierent parts of a SQL statement are passed as multiple parameters,
SAND combines the semantics of the API with the IRs of these parameters, and converts them
into a single IR. This single IR is used as the value of r. For instance, the model r at line 4 of
Listing 4.1 is + \SELECT" (+ \id, name, grade" + (\FROM" (+ \students" (+ \WHERE" (+
\name = " X
3
))))).
SAND denes functionalities to represent r as a set of concrete nite strings, denoted as (r),
that enables further analyses on the concrete SQL statements. At line 13 of Listing 4.1, (r) =
fINSERT INTO reg (id) VALUES (X
7
)g. To generate (r), SAND interprets the expression tree
represented by r. It performs a post-order traversal on the tree and applies the string operations
in the internal nodes to their child nodes. When SAND unravels loops in r, it assumes the upper
boundn to be one on the loop's iterations. Various values ofn were also tested. I discovered that
the number of detected instances remained the same or similar, which means that the value of n
had little impact on the result of antipattern detection. Nevertheless, the detectors can choose
the value of n according to their needs. The set of nite strings provided in (r) enables all kinds
of analyses needed by the SQL antipattern detectors. As shown in Table 4.1, the detectors of
seven SQL antipatterns require analyzing the concrete SQL statements, including their project
predicates, selection predicates, data values, keywords, and so on. These detectors can leverage
an existing SQL parser [99] to parse the statement and extract the parts that they are interested
in. Subsequent analyses, such as string comparison, can be conducted according to the detectors'
needs.
Although there could exist scenarios where parts of a SQL statement are derived from external
inputs, whose values cannot be determined by SAND, they did not prevent the detectors built
on top of SAND from precisely identifying the various SQL antipatterns. The rst reason is that
the unknown parts, marked by unique placeholders, are typically at the syntactical positions of
table names, data values, etc and they do not aect the structure of the SQL statement. The
parts of the SQL statement that were embedded in the code and were captured by SAND's static
string analysis were sucient for the detectors to carry out the necessary analyses on the SQL
statements. For example, my prior empirical study found that nearly 92% of all SQL statements
34
have their prexes embedded in the application code [71]. By analyzing the prexes, the type
of a SQL statement can be determined. As for the parts in the SQL statement other than the
prex, whether they are unknown or not, do not prevent the detectors from precisely identifying
the type of the SQL statement. When conducting the evaluation for this framework, I found that
among 13,418 dierent database interaction points associated with the detected instances, 44% of
them had at least one placeholder embedded in their corresponding r model. In my inspection of
results, these placeholders did not aect the detection accuracy. The second reason is that even if
the detector of certain antipatterns, such as Unnecessary-Column-Retrieval, needs to analyze the
table names or column names, whose values can be derived from external sources, the placeholder
mechanism allows the detectors to handle the situation accordingly. For example, a detector can
treat an unknown column, marked by a placeholder, to be potentially any column. As shown
in the evaluation results, the detector of Unnecessary-Column-Retrieval identied thousands of
instances of this antipattern with a precision of 99.4%.
1 public void main()
2 f
3 String userInput = text.getText().toString();
4 Cursor cursor = database.query("students", new String[]f"id", "name", "grade"g, "name = " + userInput, null,
null, null, null);
5 while(cursor.moveToNext())
6 f
7 int id = cursor.getInt(0);
8 if(id < 10)
9 f
10 database.beginTransaction();
11 for (int i = 0; i < 10; i++)
12 f
13 database.execSQL("INSERT INTO reg (id) VALUES (" + id + ")");
14 g
15 database.endTransaction();
16 g
17 g
18 g
Listing 4.1: Example program
4.1.3.1.3 nd :regex7!fsjs2Sg Given a regular expression (regex), this function returns
a set of silicas whose SQL statements match the given regex. Regexes are sucient for the
detectors to identify an initial set of silicas of interest. Although SQL is more expressive than a
regular language, the detectors do not need to use regexes to express all possible SQL statements'
35
forms. Instead, the detectors only need to identify certain patterns of SQL statements and all these
patterns can be recognized by regexes. I was able to determined this by analyzing the detection
rules in Table 4.1. For example, the detector of Unbatched-Writes focuses on database writes,
which can be recognized by the regex \(insertjupdatejdelete):
". The detectors of Unbounded-
Query and Readable-Password target SQL statements that contain certain keywords. These SQL
statements can be recognized by the regex \:
keyword:
". The detector of Vulnerable-Query needs
to identify all SQL statements, which can be recognized by the regex \:
". The rest of the detectors
focus on database reads and they can use the regex \select:
".
To realize this function, SAND rst gets all the database interaction points via getSQLPoints.
For each identied database interaction point p, if9q2 (getSQLModel(p))(q matches regex), a
silica tuplehp; getSQLModel(p)i will be added to the return set.
4.1.3.2 Analyzing Other Application-database Relationships
SAND provides a set of six SANDLess functions to identify the application-database relationships
categorized in Table 4.2. When designing these functions, I ensure that they are expressive
enough to specify all the required detection rules on the corresponding types of application-
database relationships. In the rest of this section, I explain these functions in detail. For the ease
of explanation, I illustrate the realization of the functions in the context of an intra-procedural
analysis. In Section 4.1.4, I discuss how the analyses conducted by the functions work inter-
procedurally and in the presence of activity lifecycle methods.
4.1.3.2.1 getReachableSet : p2 P 7!fgj g2 Pg This function analyzes the control
ow
relationship between program points. Given a program point p, the function returns a set of
program points, fg j g 2 Pg, that are reachable from p in the Control Flow Graph (CFG).
This function is essential because identifying if one SQL statement can execute after another is
required by the detectors of four SQL antipatterns (i.e., Not-Merging-Projection-Predicates-2,
Not-Merging-Selection-Predicates-2, Loop-to-Join-3, and Not-Caching-2.) Given two silicas s1
and s2, if s2:p2 getReachableSet(s1:p), it means that the SQL statements represented by s2 can
execute after the ones represented bys1. To realize the getReachableSet function, SAND conducts
a reachability analysis [2] and computes the nodes in the CFG that are reachable from the given
node. The analysis iteratively propagates the nodes in the CFG to their successors and converges
36
when no more nodes can propagate to a new node. The nodes that the given program point can
propagate to are returned by getReachableSet.
4.1.3.2.2 getLoopSet : p2 P 7!flj l2 Pg This function helps to obtain the control
ow
relationship with respect to loops. It takes as an argument a program point p, and returns a set
of program points,fljl2Pg, which consists of the headers of the loops that enclosep. With the
help of this function, the detectors can specify a range of detection rules with respect to loops. For
instance, checking if a program pointp executes in a loop can be expressed by getLoopSet(p)6=;.
To ensure that two program points p1, p2 are not in the same loop, the detector can examine if
getLoopSet(p1)\ getLoopSet(p2) =;. The detector can even combine this getLoopSet function
with other SANDLess functions so as to specify more advance detection logic. For example, if
the result of a database read is used at one of the program points returned by this function,
it means that the corresponding loop iterates using the database read result. The getLoopSet
function addresses the need for the detection of three SQL antipatterns (i.e., Unbatched-Writes-2,
Loop-to-Join-4, and Unbounded-Query-2).
To realize this function, SAND employs a region-based analysis that can accurately identify
nesting relationships between loops and the nodes that each loop contains [2]. Formally, a region
is a collection of nodes N and edges E such that (1) there is a header h in N that dominates all
the nodes in N ; (2) if some node m can reach a node n in N without going through h, then m is
also in N ; (3) E is the set of all edges between nodes n
1
and n
2
in N. This region-based analysis
represents the regions of a method as a Region Tree (RT) in which nested regions are shown as
children of their parent regions. In this RT, the root node of the tree is the method body and the
children nodes are the loop bodies. In Listing 4.1, there are three regions R0, R1, and R2. R0
(line 1|18) represents the method body of main. R1 (line 5|17) and R2 (line 11|14) are loops
in the method. To identify which loops enclose a given program point, SAND rst builds the RT
for the method that contains the program point. It then scans each region and identies the ones
that contain the given program point. In Listing 4.1, the loop regions that contain the program
point at line 13 are R1 and R2. The program points where the headers of these loops locate, i.e.,
line 5 and line 11, are returned by getLoopSet.
4.1.3.2.3 getControlDependenceSet : p2 P 7!fhc;iij c2 P; i2 Zg This function an-
alyzes the control dependence relationship. Given a program point p, it returns a set of tuples,
37
fhc;iig. The program point c, along with the integer i, represent which branch of a condition-
al/switch statement p is transitively control dependent on. For example, in Listing 4.1, line 13 is
transitively control dependent on three conditional statements line 5, 8, and 11 when they eval-
uate to true. The returned set is thereforefhline 5; 1i,hline 8; 1i,hline 11; 1igi, where the true
branch is represented by the value 1. This function is needed by the detection of Unnecessary-
Row-Retrieval, which requires checking if there is a conditional construct guards the use of the
retrieved data. The detector can apply this function to the program points where the read data is
retrieved and used. If there is a tuple that exists in all the returned sets, it means that the given
program points are guarded by the same branch of a conditional/switch statement. To realize
this function, SAND employs a control dependence analysis [27], which identies the nodes in
the CFG that the given program point is control dependent on. The algorithm of the analysis
determines that a node m is control dependent on another node n if (1) there exists a path from
n to m such that every node in the path other than n and m is post-dominated [2] by m, and (2)
n is not post-dominated by m.
4.1.3.2.4 getTransactionSet :p2D7!ftjt2Pg This function identies the transactions
that enclose the given database interaction point. It takes an argument a database interaction
point p, and returns a set of program points,ftj t2 Pg, where the transactions that enclose
the given program point are initiated. In Listing 4.1, the database interaction point at line 13 is
enclosed by the transaction initiated at line 10. The returned set would therefore consist of the
program point at line 10. Analyzing the transactions is required by one antipattern, Unbatched-
Writes. The detection rule of checking if a database command issued atp is under the eect of an
open transaction can be expressed by getTransactionSet(p)6=;. To further understand whether
the transaction is open inside a loop, the detector can combine this function with getLoopSet and
check if9t(t2 getTransactionSet(p)^ getLoopSet(t)6=;).
The
exible semantics of transaction control, as explained in Chapter 2, introduce great chal-
lenges to realizing the getTransactionSet function. Thematically, analyzing transaction open and
close operations is similar to detecting resource leakage, e.g., [36, 89]. However, a key dier-
ence is that for resource leakage, once the resource-releasing API is called, the corresponding
resource is released. This is not true for nested transactions. If multiple transactions are open
in a nested manner, the only way to close them is to issue the exact same number of transaction
close operations.
38
To reason about nested transactions, SAND utilizes a static transaction analysis that I de-
veloped [73]. Given a program point that issues a SQL statement, the analysis propagates this
program point backwards in CFG, and uses a counter to tracks the invocations to the transaction
open and close operations along the path leading to the given program point. Using the counter
value, the analysis is able to judge whether a transaction is open or close at a certain program
point. With this analysis, SAND identies the transactions that remain open when the given
program point executes. The program points where these transactions are open are added to the
result set.
4.1.3.2.5 getDefSet : s2 S7!fhf;placeholderij f2 P; placeholder2
g This function
analyzes the data
ow of the input of a SQL statement. It takes as an argument a silica s, and
returns a set of tuples,fhf, placeholderig. The label placeholder represents an external input to
the AUT that is concatenated to an SQL statement, and the program point f indicates one of
the possible denition points of this input. For example, given the database interaction point
at line 13 of Listing 4.1, the returned set of the function isfhline 7;X
7
ig, indicating that an
external input, labeled as X
7
, is dened at the program point at line 7. The label placeholder
can be used to associate the external input with its position in an SQL statement. For instance,
given (r) =f\INSERT INTO reg (id) VALUES (X
7
)"g at line 13 of Listing 4.1, X
7
is at the
syntactical position of data values. This function enables a range of detection rules that reason
about the input of a SQL statement. It is essential for the detection of three SQL antipatterns,
Vulnerable-Query-1, Not-Using-Parameterized-Query-2, and Readable-Password-2. For example,
to check if any substring in the query is derived from a vulnerable API, the detector can inspect
the program point f and verify if any vulnerable API is invoked at this point. Since f is also
associated with a placeholder that embeds in the SQL statement, the detector is able to identify
which part of a SQL statement is vulnerable.
Realizing this function requires SAND reasoning about the data
ow of the variables that
contribute to the construction of the SQL statements. As introduced in Section 4.1.3.1, the
string model r in the silica tuple hp, ri is an expression tree and it captures the data
ow
of variables. Every variable that is external to the analysis scope is represented by a unique
placeholder in r. The position information about where the variables are dened is also incor-
porated in the placeholder. For instance, given the model r at line 13 of Listing 4.1, which is
+ \INSERT INTO reg (id) VALUES (" (+X
7
\)"), the placeholder isX
7
and the denition point
39
is the program point at line 7. To implement the getDefSet function, SAND traverses the nodes
in r and looks for the placeholders in the leaf nodes. For each identied placeholder h, SAND
locates the program pointf that corresponds to the denition point of this placeholder, and adds
a tuplehf;hi to the returned set of this function.
4.1.3.2.6 getUseSet : s 2 S 7! fhu;columni j u 2 P; column 2
g This function is
responsible for analyzing the data
ow of the output of a SQL statement. It takes as an argument
a silica s, and returns a set of tuplesfhu;columnig. The program point u indicates where the
output data is used. The string label column associates u with the concrete column/expression
of the output data. It represents one of the columns/expressions that are selected by the SQL
query issued at s:p, and are used at the program point u. For instance, in Listing 4.1, given the
silica at line 4, one of the returned tuples ishline 8;idi. The tuple means that the id column is
used at line 8. This function allows the detectors to reason about the output of a SQL statement,
including where the output is used, how the output is used, and which columns are used. It lays
the foundation for the detection of three SQL antipatterns, Loop-to-Join-2, Unnecessary-Column-
Retrieval-3, and Unnecessary-Row-Retrieval-2. With this function, the detectors can express a
variety of detection rules. For instance, if the output data of a SQL query issued by a silica s is
never used, getUseSet(s) =;. For another instance, the detector of Loop-to-Join needs to verify
if the result of a database read is used as the input of another SQL query. Using the getUseSet
function, given a silica s that issues a database read, the detector can check if there exists any
program point u returned by getUseSet(s) equals the program point of another silica.
There are two challenges regarding the realization of this function. First, the output data of
a query is stored in a complex object, the cursor. Once the cursor is returned from the database
API, developers can use it inter-procedurally and retrieve data from it
exibly. Identifying the
use of data not only requires tracking this cursor object, but also requires tracing any column
data that is retrieved from it. Second, the cursor APIs provide
exible ways to retrieve a desired
column from the cursor object, making it challenging for SAND to associate the use of data
with the columns statically. In most DBMSs, including the local DBMS provided by Android,
developers can retrieve the column data from the cursor object via the column index or column
name. If the column index is used, the mapping from the index to the name can depend on the
query and the corresponding table, because of the \SELECT *" syntax in SQL. If the column
name is used, identifying its value is also challenging as the name is a string parameter whose
40
value can be constructed inter-procedurally via string operations. Although there exist techniques
that tried to identify the retrieved columns, they did not address the aforementioned challenges.
The existing techniques either fail to do it statically [15], or fail to handle the mappings from the
column index to the column name [104, 20].
To address the rst challenge, SAND develops a static taint analysis to track the cursor
object [4]. The taint analysis tracks the information
ow in an AUT. Information
ows from
object x to object y, whenever an operation uses the value of x to derive the value of y. If x is
tainted, y is considered to be tainted. This process is called taint propagation. SAND treats the
cursor object returned by the database read as a taint source. During the taint propagation, if a
column is retrieved, SAND performs a column identication analysis, which will be explained in
the next paragraph, and annotates the tainted variable with the column name. For example, the
variable id is tainted by the cursor object at line 7 of Listing 4.1. SAND annotates the variable id
with the column name id. If the program point simply uses the cursor object but not any specic
column data, such as line 5 in Listing 4.1, the annotation is null. For each program point u that
uses a tainted variable v with a column annotation c, SAND adds a tuplehu;ci to the return
value of getUseSet.
SAND's column identication analysis addresses the second challenge. Given a silica s, and
a program point u where a column retrieval API is invoked (e.g., line 7 of Listing 4.1), the
analysis returns a set of strings that represent the possible columns retrieved at u. Depending on
whether a column name or a column index is used as the parameter of the column retrieval API,
SAND carries out two dierent approaches. If a column name is used, SAND leverages a string
analysis [61] to identify the possible values of the name, and uses them directly as the output of
this column identication analysis. If a column index is used, after identifying its values, SAND
maps the index to the name as explained below.
The mapping from the column index to the column name is determined by the query issued
at the program point of the given silica s. As there are two ways to select columns in SQL,
SAND handles the mapping accordingly. If the query explicitly selects columns, the mapping
can be done by analyzing the query itself. More specically, SAND parses the SQL statement,
extracts the selected columns in order, and matches the index value to the corresponding column.
For instance, given a query \SELECT id, name, grade FROM students" and an index zero, the
index zero maps to the column id. If the query implicitly selects columns, the mapping requires
analyzing not only the query, but also the table structure. To do that, SAND rst parses the SQL
41
statement and identies the selected table. It then searches for the corresponding table creation
statement by leveraging the nd function. Next, SAND obtains the column order by parsing
the table creation statement. Lastly, SAND matches the index value to the column. Using an
illustrative example, the column index zero maps to the column id if the query is \SELECT *
FROM students" and the table creation statement is \CREATE TABLE students (id INT, name
TEXT, grade TEXT)".
SAND annotates the tainted variable with a placeholder ! as the column name, when a
column is selected but its name is unknown. Introducing the placeholder ! is necessary because
the parameter of the column retrieval API can be derived from an external source. In this case,
the column identication analysis is not able to determine the column name statically. The
placeholder ! allows the detectors built on top of the framework to handle the unknown column
based on their needs. For example, they can safely assume that all the columns are selected
because this unknown column can potentially be any column.
4.1.4 Inter-procedural Analysis
SAND's inter-procedural analysis is precise and fast, which ensures the eectiveness and eciency
of detectors built on top of SAND. To handle activity lifecycle methods, SAND treats each of
these methods as an entry point of the AUT. Within each activity lifecycle method, SAND uses
the Cloned Call Graph (CCG) [81] of the AUT to perform the analyses inter-procedurally. In
the CCG, every distinct calling context invokes a dierent instance of a method. This context-
sensitive CCG improves precision and allows the analyses to work the same as if applied to an
intra-procedural CFG. Although the CCG is generally quite large for even small programs, the
demand-driven nature of the framework allows it to prune the call graph of the AUT so that
only the transitive callers of the silica containing methods and their transitive callees remain.
The pruning process is done during the construction of silicas. After identifying the program
point p in the silica tuple, the framework assumes that all SANDLess functions can be used; it
precomputes all related transitive callers and callees of the method that contains p. Due to the
sparsity of silicas in an AUT, removing the methods that are not related to the silicas typically
results in the removal of 96% on average from my subject applications' code and a dramatic
improvement in scalability and eciency.
To further improve eciency, SAND avoids redundant computation by caching analysis results.
There are two kinds of caching mechanisms. First, the outputs of the functions are cached. For
42
example, if there are multiple calls to getLoopSet on the same program point, SAND conducts
the corresponding analysis once and reuses the cached result. This mechanism is particularly
useful when multiple detectors that share a common set of functions and target silicas are run on
the same AUT. The analysis results can be reused across multiple detectors. The second type
of caching is to cache SAND's intermediate analysis result on the CFG. Several analyses that
the functions are based on, such as region analysis, have the same analysis results given the same
CFG. Such results can be cached and reused. For instance, if getLoopSet is called on two program
points that are inside the same method, SAND can build the RT for this method once and reuse
the RT.
4.1.5 Implementation
I realized SANDLess as a Java extension. The silica abstraction is implemented as a Java class.
The SANDLess functions can be invoked as Java functions. The input of the SANDLess functions
is specied as the function parameter, and the output is returned by the corresponding Java
functions.
The analyses conducted on the AUT, such as building the CFG and Call Graph (CG), are
based on the Soot analysis framework [56]. Soot helped to convert the applications' binaries to
its intermediate representation of bytecode, called Jimple. All the analyses conducted by SAND
were based on the Jimple bytecode.
The current implementation of SAND is targeted to Android application binaries and their
default local database management system, SQLite [31]. This implementation choice allowed us to
evaluate SAND on a large number of marketplace applications, where SQLite is widely used [71].
Note that the design of SAND is generalizable. It can be extended to analyze other languages
for which it is possible to generate CFGs and CCGs. It can also be extended to analyze other
database APIs where SQL statements are strings, such as JDBC.
4.2 Evaluation
To assess the eectiveness of SAND for SQL antipattern detection, I evaluated whether SAND can
support a wide range of SQL antipattern detection tasks with concise scripts and good detection
results. To do that, I implemented the detectors of all the SQL antipatterns discovered in the
literature using SANDLess. I measured the complexity of the SANDLess scripts (RQ1). I then
43
measured the detection accuracy of these SANDLess detectors (RQ2). To assess the eciency of
SAND, I measured the time needed for the SANDLess detectors to run (RQ3). The corresponding
research questions are:
RQ1: What is the complexity of the detection scripts written in SANDLess?
RQ2: What is the accuracy of the SANDLess detectors in comparison with other techniques?
RQ3: How fast are the SANDLess detectors?
4.2.1 Implementing SQL Antipattern Detectors Using SANDLess
In order to answer the three research questions, I rst implemented a set of detectors using
SANDLess. These detectors cover all the SQL antipatterns identied by the literature survey [70].
Each of them implements the detection rules proposed by existing techniques for the corresponding
SQL antipattern. The rules are listed under the \Detection Rules" column of Table 4.1. In the rest
of this section, I elaborate on how the various SQL antipatterns can be detected using dierent
combinations of the SANDLess functions. For the ease of explantation, I use T to denote the set
of instances of each of the targeted SQL antipattern.
Unbatched-Writes For each silica that issues a database write, the script checks if the write
is in a loop but not in a transaction, as shown in Equation (4.1).
T =fsjs2 nd(\^(insertjupdatejdelete):
")^ getLoopSet(s:p)6=;^ getTransactionSet(s:p) =;g
(4.1)
Not-Merging-Projection-Predicates To detect this antipattern, the SANDLess script rst
identies the silicas that issue select queries and may execute one after another. For each pair
of such silicas, the script then checks if the queries issued by the rst and second silica in the
pair are identical except for the projection predicates. It is done by string comparisons on the
44
SQL statements returned by the function. The set of instances of this antipattern is shown in
Equation (4.2).
T =fhs1;s2ij
s1;s22 nd(\^select:
")^
s2:p2 getReachableSet(s1:p)^
9q1;q2 (q12 (s1:r)^q22 (s2:r)^ q1 and q2 are identical except for the projection predicates)g
(4.2)
Not-Merging-Selection-Predicates The steps of detecting this antipattern is similar to the
ones of detecting Not-Merging-Projection-Predicates, as shown in Equation (4.3). The dierence
is that in the detector, the script checks if the queries are identical except for the selection
predicates.
T =fhs1;s2ij
s1;s22 nd(\^select:
")^
s2:p2 getReachableSet(s1:p)^
9q1;q2 (q12 (s1:r)^q22 (s2:r)^ q1 and q2 are identical except for the selection predicates)g
(4.3)
Loop-to-Join To detect this antipattern, the script rst identies the pairs of silicas that issue
select queries and may execute one after another. For each pairhs1;s2i, the script then checks if
there is a loop that containss2:p and iterates using the result ofs1, i.e.,9l (l2 getLoopSet(s2:p)^
9hp0;c0i(hp0;c0i2 getUseSet(s1)^l =p0)). Lastly, the script checks if the output of s1 is used
ats2, i.e.,9hp1;c1i(hp1;c1i2 getUseSet(s1)^p1 =s2:p). The set of instances of this antipattern
is shown in Equation (4.4).
T =fhs1;s2ij
s1;s22 nd(\^select:
")^
s2:p2 getReachableSet(s1:p)^
9l (l2 getLoopSet(s2:p)^9hp0;c0i(hp0;c0i2 getUseSet(s1)^l =p0))^
9hp1;c1i(hp1;c1i2 getUseSet(s1)^p1 =s2:p)g
(4.4)
45
Vulnerable-Query The script rst locates all the silicas by using the regex \:
", which
represents any string. To identify if a tainted source is used to build the SQL statement, the
script inspects each silica and checks if any input of the SQL statements represented the silica is
returned by a tainted API. The set of instances of this antipattern is described in Equation (4.5).
The script builder can decide the list of tainted APIs. For example, the list by Susi [92] can be
used. It contains APIs whose returned values are considered to be untrusted.
T =fsjs2 nd(\:
")^
9hf; placeholderi(hf; placeholderi2 getDefSet(s)^ f invokes a tainted API )g
(4.5)
Not-Using-Parameterized-Query The detection of this antipattern starts with nding all
the silicas. For each silica, the script checks if any substring at the syntactical positions of data
values is dened externally. The set of instances of this antipattern is described in Equation (4.6).
Identifying data values can be done by parsing the SQL statement q with a SQL parser [99].
T =fsjs2 nd(\:
")^
9hf; placeholderi(hf; placeholderi2 getDefSet(s)^q2 (s:r)^ placeholder is a data value of q)g
(4.6)
Not-Caching The script rst identies the pairs of silicas that issue select queries and may
execute one after another. It then veries if the SQL statements issued by the two silicas in the
pair are identical. This is done by string comparison on the SQL statements returned by the
function. The set of instances of this antipattern is shown in Equation (4.7).
T =fhs1;s2ij
s1;s22 nd(\^select:
")^
s2:p2 getReachableSet(s1:p)^
9q1;q2 (q12 (s1:r)^q22 (s2:r)^q1 =q2)g
(4.7)
Unnecessary-Column-Retrieval For each silica s that issues a select query, let C denote
the columns selected by the query q where q2 (s:r), and O denote the columns returned by
getUseSet(s). The script investigates if there exists any retrieved column that is not used, i.e.,
9c2 C(c = 2 O)^@o2 O(o = !). The purpose of checking if there exists any unknown column
46
name, i.e., !, is to avoid false-positives; because an unknown column means any possible column
can be used. The set of instances of this antipattern is shown in Equation (4.8).
T =fsjs2 nd(\^select:
")^9c2C(c = 2O)^@o2O(o =!)g:
(4.8)
Unnecessary-Row-Retrieval For each silicas that issues a select query, the script checks if
the following case exists. A selected column is used in a conditional statement, i.e.,9hu0;c0i(hu0;c0i2
getUseSet(s)^u0 is a conditional statement^c06=null). The other program points that use the
read data are control dependent on this if statement, i.e., 8hu;ci(hu;ci2 getUseSet(s)^u6=
u0 =) u02 getControlDependenceSet(u)). The set of instances of this antipattern is described
in Equation (4.9).
T =fsjs2 nd(\^select:
")^
9hu0;c0i(hu0;c0i2 getUseSet(s)^u0 is an If statement^c06=null^
8hu;ci(hu;ci2 getUseSet(s)^u6=u0 =) u02 getControlDependenceSet(u)))g:
(4.9)
Unbounded-Query The script starts with nding the select queries that may retrieve un-
bounded number of records. This is done by nding the silicas that match certain patterns, such
as not using the keyword LIMIT. For each such silica, the script tests if the application iterates
over the read data. The set of instances is shown in Equation (4.10).
T =fsjs2 nd(\^select:
")^
9q(q2 (s:r)^ q is unbounded)^
9hu1;c1i;hu2;c2i(hu1;c1i2 getUseSet(s)^hu2;c2i2 getUseSet(s)^u12 getLoopSet(u2))g
(4.10)
47
Readable-Password The script rst identies the SQL statements that may include sensitive
information. It searches for password comparisons by using the regex (\:
password = :
"). It
then checks if the value of the password has been encrypted, as shown in Equation (4.11).
T =fsjs2 nd(\:
password =:
")^
9hf; placeholderi(hf; placeholderi2 getDefSet(s)^
placeholder is at the position after \passward ="^
f invokes an encryption API )g
(4.11)
4.2.2 RQ1: Code Complexity of SANDLess Detectors
To answer RQ1, for each of the SANDLess detectors, I counted the number of predicates needed
for specifying the detection rules, the number of lines of Java code (LOC) in the script, and the
cyclomatic complexity of the code. The cyclomatic complexity is a common metric for measuring
code complexity [76]. It is the number of linearly independent paths through a program's source
code, indicating the complexity of a program. In addition to the aforementioned three numbers,
I also counted lines of code that underlay the SANDLess functions that were required by each of
the detectors. For example, for the Unbatched-Writes detector, I added the lines of code related
to obtaining the silicas, getLoopSet, and getTransactionSet. Note that each required function was
only counted once no matter how many times it was invoked in the detection script. This line
number did not include the string analysis tool, Violist, that I implemented, which itself has 9,503
lines of code [29]. I calculated the lines of code underlying the SANDLess functions because the
source code of most of the existing detection techniques are not available, which means a direct
comparison between the existing techniques and the SANDLess detectors was not a viable choice.
Counting the lines of code underlying the SANDLess functions allowed me to demonstrate the
relative conciseness of SANDLess scripts.
The results are summarized in Table 4.3. The four columns in the table represent the the
number of predicates, the number of lines of Java code in the script, the cyclomatic complexity
of the code, and the number of lines of code underlying the SANDLess functions.
Overall, the results demonstrate how the complex SQL antipattern detection tasks can be
expressed concisely using SAND. With the SANDLess functions, the detection logic can be ex-
pressed using a small number of predicates. The SANDLess scripts are very compact, requiring
12 to 74 lines of code for a detection task. The Unbounded-Query detector requires 74 lines of
48
Table 4.3: Scripts statistics
Antipattern
# Predi-
cates
# Lines
Cyclomatic
complexity
# Lines
behind
SANDLess
Unbatched-Writes 3 12 4 2,900
Not-Merging-Projection-Predicates 3 25 11 2,141
Not-Merging-Selection-Predicates 3 25 11 2,141
Loop-to-Join 7 45 13 4,268
Vulnerable-Query 3 18 5 2,012
Not-Using-Parameterized-Query 4 25 7 2,012
Not-Caching 5 21 8 2,141
Unnecessary-Column-Retrieval 6 29 8 3,207
Unnecessary-Row-Retrieval 7 52 14 3,656
Unbounded-Query 4 74 19 3,961
Readable-Password 4 29 7 2,012
code because the detection rules proposed by the technique check several properties on the query
itself in order to determine that the query is unbounded [104]. Processing the query string needed
around 40 lines of code. Comparing to the total lines of code underlying SANDLess, which are at
least 2,000, the lines of code in the detection script are signicantly smaller. In addition to being
compact, the code also has low complexity. Studies on cyclomatic complexity suggest dierent
values (10, 15, and 20) as the thresholds between acceptable and complex [76, 103, 109]. As shown
in Table 4.3, 6/11 of the scripts have complexity values within 10; 10/11 of the scripts have com-
plexity values within 15; and all of the scripts have complexity values within 20. These numbers
demonstrate that the complexity values of most of the scripts are within or close to the thresh-
old. Although there is one script that exceeds the threshold of 10 and 15, I found that the high
complexity is due to the processing code of the query string when detecting Unbounded-Query.
If we do not take into account this part of the code when computing the cyclomatic complexity,
the complexity value of the Unbounded-Query detector would be 9. Based on these results, the
SANDLess scripts are compact and have low code complexity.
4.2.3 RQ2: Detection Accuracy
Subject apps: In order to measure the detection accuracy, I ran the detectors on a set of
subject apps. The subject pool contained 1,000 dierent marketplace Android applications. To
49
ensure diversity in the subject pool, I selected apps that were: (1) downloaded frequently; (2)
designed with dierent functional purposes; and (3) varied in terms of the amount of code. I
identied potential subjects from the Google Play app store [32], which is the dominant app store
where Android users download their apps. For each of the 34 categories dened by the store, I
downloaded the top-ranked apps in the category and conrmed that the apps worked with Soot.
After obtaining 1,000 subject apps through this process, I computed the amount of code that they
had. The results showed that 16% of these apps had less than 10K bytecodes, 48% of them had
bytecode counts between 10K and 100K, and 36% of them had more than 100K bytecodes. These
numbers suggested that the apps were varied in terms of the size of app code.
Protocol: To answer RQ3, I focused on the true-positives and false-positives that the SAND-
Less detectors and existing techniques found in the subject apps.
A detected instance is considered to be a true-positive if the instance conformed to the deni-
tion of the corresponding SQL antipattern. For instance, Unnecessary-Column-Retrieval denes
an instance to be a database column being read without being used. A detected instance is
considered to be a false-positive if all the read columns have been used. For each instance of a
SQL antipattern detected, I decompiled the app and inspected the code of the methods in the
call chain related to the detected instance so as to determine if the instance is a true-positive or
false-positive. From the results of this analysis, I calculated the precision.
I was not able to obtain the number of false-negatives nor compute the recall. This is because
dierent from calculating precision, calculating recall requires to analyze all of the bytecode to
build the ground truth, instead of only examining the related code of the detected instances. It
is extraordinarily dicult to accurately and manually analyze all of the bytecode for the large set
of subject apps to build the ground truth.
In order to establish a comparison between the SANDLess detectors and the existing tech-
niques, I needed the implementation of all these techniques. However, a major issue is that many
of the existing techniques neither publish their tools nor include sucient details to replicate their
implementation. Moreover, the techniques target dierent programming languages (Ruby, PHP,
Java, etc) and platforms. Most of these techniques do not target Android apps. To tackle this
issue, I modied SAND's underlying analyses so as to meet the assumptions that the existing
techniques made about the characteristics of the database access code. For example, the tech-
niques for some SQL antipatterns assume that the SQL statements are hard-coded or there is a
one-to-one mapping from the API to the SQL statement. I congured SAND so that it could only
50
Table 4.4: Analysis results
Antipattern
#
TPs/FPs
# DB
points
# Apps
#
TPs/FPs
(Hard-
coded)
#
TPs/FPs
(One-to-
One)
Unbatched-Writes 990/0 990/0 151/0 N/A 943/0
Not-Merging-Projection-Predicates 262/0 156/0 15/0 54/0 N/A
Not-Merging-Selection-Predicates 8,696/48 508/32 21/1 21/0 N/A
Loop-to-Join 111/0 173/0 20/0 4/0 92/0
Vulnerable-Query 2,398/0 2,398/0 78/0 N/A N/A
Not-Using-Parameterized-Query 1,748/0 1,748/0 189/0 N/A N/A
Not-Caching 5,961/4 1,144/8 80/4 N/A 4,478/4
Unnecessary-Column-Retrieval 2,939/18 2,939/18 270/6 N/A 1,871/12
Unnecessary-Row-Retrieval 108/0 108/0 35/0 N/A N/A
Unbounded-Query 1,517/0 1,517/0 228/0 N/A 858/0
Readable-Password 53/0 53/0 8/0 N/A N/A
handle hard-coded strings, or could only handle APIs where there is a mapping between the API
signature and the underlying SQL statement. I ran these modied detectors on the same set of
subject apps and compared the detection results with the original SANDLess detectors.
Result: The detection results are summarized in Table 4.4. For each SQL antipattern, this
table provides the number of detected TPs (true-positives) and FPs (false-positives), the number
of distinct database interaction points associated with these TPs and FPs, the number of apps
that contain these TPs and FPs, the number of TPs and FPs detected by existing detection
approaches that have assumptions about how a SQL statement is constructed. Overall, the
detectors' detection precision ranged from 99.4% to 100%. In total, 383 out of the 1,000 subjects
contain at least one SQL antipattern.
Discussion: These results show that the SANDLess detectors can identify the various SQL
antipatterns with a high precision from a large portion of apps. There are thousands of instances
detected for some SQL antipatterns and the precision is as high as 99.4% - 100%. Although there
exists some cases where the precision is not 100%, I found that the false-positives are due to two
limitations of SAND's underlying techniques. First, the lifecycle methods are assumed to be the
entry points of the program. These methods were not connected in the call graph because their
execution relationship cannot be predicted statically in general. Since the detectors could not
trace the data that is used in another lifecycle method, some false-positives were introduced while
51
detecting Unnecessary-Column-Retrieval. If a complete call graph is given to SAND, such false-
positives would not occur. Second, the string analysis that SAND used was safe but imprecise in
handling conditional string values. For example, if there is a branch statement that determines the
value of a string, the string analysis safely took all the possible values no matter which branch the
program took at runtime. This imprecision ultimately led to several false-positives in detecting
Not-Caching and Not-Merging-Selection-Predicates. If the string analysis is extended in a way
that can resolve conditional constraints, such false-positives can be eliminated.
Nevertheless, except for this small portion of false-positives, the SANDLess detectors precisely
located a large number of various SQL antipatterns in real-world marketplace apps. Compar-
ing to the existing techniques, which made certain assumptions about the characteristics of the
database access code, the SANDLess detectors were able to identify much more instances of var-
ious SQL antipatterns in real-world apps. This demonstrates that the SANDLess functions are
well supported by SAND's underlying analyses.
4.2.4 RQ3: Detection Speed
To evaluate the speed of the SANDLess detectors in nding SQL antipatterns, I calculated the
execution time of the detectors running on the subject apps. This time included the entire detec-
tion process from converting the apps' bytecode to conducting various analyses. To validate the
contribution of the caching mechanism in improving the execution time of SANDLess detectors, I
ran the detectors on the same set of apps while disabling the caching mechanism. The experiment
result showed that the average and median execution time of running all the SANDLess detectors
on one app were 41 seconds and 21 seconds, respectively. For 92% of the apps, all the detection
analyses were nished within 60 seconds. When breaking down the total execution time, we found
that 42% of the total time (18 seconds on average) was consumed by Soot converting bytecodes,
and 14% of the total time (6 seconds on average) was consumed by Violist conducting the string
analysis. If the caching mechanism was disabled, the average execution time of running all detec-
tors went up to 289 seconds, seven times longer than the execution time of running the detectors
with caching enabled. The median execution time was similar (22 seconds). Taken together these
numbers suggest that the SANDLess detectors were ecient in analyzing modern marketplace
mobile apps. The caching mechanism had played an important role in ensuring the eciency
of the SANDLess detectors. The time consumed by Violist demonstrated that this full power
string analysis used by SAND helped the detectors identify more true-positives of various SQL
52
antipatterns (as shown in Section 4.2.3); but it did not come at the cost of signicant runtime
overhead.
4.2.5 Threats to Validity
The eectiveness of SANDLess in expressing SQL antipattern detection tasks is based on the
assumption that the list of SQL antipatterns I used is complete. In order to maximize the
completeness of this list, I used a list of SQL antipatterns collected by the literature survey in
Chapter 3, which followed the best practices of literature reviews [55] to explore the current
knowledge about SQL antipatterns. If this assumption does not hold, for instance, the list would
not contain any SQL antipattern that has not been discovered yet, SAND still has the potential
to detect it because the SAND framework is extensible.
To mitigate the potential bias of the eectiveness of SANDLess in expressing SQL antipattern
detection tasks, I implemented the rules as proposed by others, instead of creating the detection
rules myself. Furthermore, the evaluation results on detection accuracy demonstrated that these
rules were themselves well dened and accurate.
4.3 Discussion of Limitations
The rst limitation is that SAND currently does not handle inter-callback control-
ow and data-
ow relationships. SAND treats each event handler callback and and lifecycle callback as an entry
point of the AUT and assumes that the callbacks are independent of each other. This led to a
small number of false-positives when detecting Unnecessary-Column-Retrieval, as discovered in
the evaluation. Analyzing relationships between callbacks is still an open problem in program
analysis. Existing research eorts focus on determining the execution orders between GUI-related
callbacks [108, 107], constructing callback summaries based on Android API methods [13, 90], or
assuming that callbacks can be executed in any arbitrary order [4]. These existing analyses of
callbacks could be potentially integrated into SAND so as to further improve the detection results.
The second limitation is that SAND is based on static analyses and cannot identify values
derived from external sources, e.g., user input. In this case, SAND introduces placeholders to
represent these unknown values and allow detectors to handle them accordingly. As discussed
in Section 4.1.3.1, the placeholders are typically at the syntactical positions of data values, table
names, etc, and do not aect the structure of the SQL statements. According to my prior empirical
53
study, nearly 92% of all SQL statements have their prexes embedded in the application code.
These prexes can be captured by SAND's underlying static analysis, which allows the detectors
to identify the type of the SQL statement. When conducting the evaluation for this framework,
I also found that among 13,418 dierent database interaction points associated with the detected
instances, 44% of them have at least one placeholder embedded in their corresponding r model.
However, these placeholders, whose concrete values were unknown, did not prevent the detectors
from precisely identifying the various SQL antipatterns. The parts of the SQL statements that
were embedded in the code and were captured by SAND's static string analysis were sucient for
the detectors to carry out the necessary analyses on the SQL statements.
4.4 Summary
In the chapter, I introduced a framework, SAND, for detecting SQL antipatterns in mobile apps.
The framework abstracts away the interactions between the application and the database. It
provides a language, called SANDLess, for specifying SQL antipattern detection tasks. As part of
this framework, I analyzed a comprehensive list of SQL antipatterns and their detection methods
collected from a systematic literature review. With this analysis, I identied the abstractions
of application-database relationships that were required to build SQL antipattern detectors. To
demonstrate the eectiveness and eciency of the framework, I used SANDLess and implemented
a full spectrum of eleven SQL antipatterns that were discovered by the literature survey in Chap-
ter 3. I then ran these detectors on a large set of 1,000 subject apps. The results showed that the
detection rules of all these eleven SQL antipatterns can be compactly expressed in SANDLess,
using 12 to 74 lines of Java code with low code complexity. These SANDLess detectors, which are
built on top of abstractions of application-database interactions, precisely identied thousands of
instances of SQL antipatterns with a precision of at least 99.4%. These SANDLess detectors were
also fast as applying eleven detectors only took an average of forty-one seconds per app. The
evaluation results also successfully conrmed the hypothesis of my dissertation: static detectors
built on top of abstractions of application-database interactions can eectively and eciently de-
tect SQL antipatterns in mobile apps. In the next chapter, I will further demonstrate the impact
of SQL antipattern instances detected by SAND on the performance of mobile apps.
54
Chapter 5
A Benchmark Study of SQL Antipatterns
The goal of the benchmark study is to quantify the performance impact (in terms of runtime and
energy) of the detected SQL antipatterns on mobile apps. This quantitative impact is important
because it demonstrates whether the framework is able to identify issues that are signicant to
the performance of mobile apps. The results of the study also provide developers information
about the potential tradeos of the various SQL antipatterns. For example, xing an antipattern
may lead to consuming more or less resources. It is also possible that the resource consumption
stays unchanged after the x. In this chapter I present my benchmark study and my analysis on
the results of the study. The rest of this chapter is organized as follows. In Section 5.1, I give
a detailed description of the methodology of my benchmark study. In Section 5.2, I present and
discuss the quantitative impact I obtained from the study.
5.1 Benchmark Study Methodology
In this section, I present the methodology of the study. I explain the design of the benchmark
suite in Section 5.1.1, and the process of measuring the performance impact in Section 5.1.2.
5.1.1 Benchmark Design
In order to quantify the performance impact of the dierent instances of various SQL antipatterns,
I constructed a benchmark suite. This suite was designed to measure the resource consumption
of two implementations, the antipattern version and the xed version, of performing the same
database related task. When designing the benchmark suite, a goal is that the performance
impact obtained from the study is generalizable to the real-world impact of SQL antipatterns.
55
Achieving this goal is challenging because there are many factors that can aect the costs of
database operations, as well as the quantitative performance impact of the SQL antipatterns.
The values of these factors can vary across dierent applications as well. A manually synthesized
benchmark, such as the one published in an online blog post [26], arbitrarily chooses the values
for dierent factors from a predetermined range. These values may not re
ect how developers use
local databases in mobile applications, limiting the generalizability of the results obtained from
the benchmark.
In order to make the study results more generalizable, my benchmark suite based the an-
tipattern version of the implementation on real-world antipattern instances that were identied
at a large scale by the SANDLess detectors. The benchmark suite took dierent performance
aecting factors into consideration. It simulated the representative values of various factors in
mobile applications via a range of static and dynamic analysis techniques. In the following sec-
tions, I explain how the performance aecting factors were chosen and varied (Section 5.1.1.1),
and how the antipattern version and the xed version of the implementation were constructed
(Section 5.1.1.2).
5.1.1.1 Controlling Performance Aecting Factors
To achieve the goal of making the results of the benchmark study generalizable, I controlled and
varied the factors that can aect the performance of database operations. In my study, I varied
the factors that can be changed in the application code and controlled the other factors. This is
because my study focused on the antipatterns that are in the coding aspects. The performance
aecting factors that are not in the code, including the hardware, the le system, the database
management system, the database conguration, etc, these were controlled to be the same across
dierent measurements. In the rest of this section, I will explain the factors that were varied and
how their representative values were chosen.
In my study, the varied factors are the number of SQL statements issued, the SQL statement
forms, the underlying table forms, the size of the table, and the two ways of implementation
(i.e., the antipattern and its x). The SQL statement form is dened to be the SQL keywords
contained in a SQL statement in order. For example, the statement form of \SELECT id FROM
students ORDER BY grade" is SELECT FROM ORDER BY. The table form is dened to be a
multiset where the elements are the types of columns in the table. For instance, the table form
56
of the table created by the statement \CREATE TABLE students (id INT, name TEXT, grade
TEXT)" isfINT, TEXT, TEXTg.
When I varied the values of the performance aecting factors, I chose values that were rep-
resentative of their typical values in mobile applications. To do that, I made use of static and
dynamic analyses, which allowed me to analyze mobile applications at a large scale and to identify
representative factor values.
For the two factors, SQL statement forms and table forms, I utilized static analysis to obtain
representative forms from real-world antipattern instances. To collect these, I ran the SANDLess
detectors on a set of 1,000 marketplace applications from the Google Play app store [32]. (The
details of these apps and detectors are described in Section 4.2.) After identifying real-world
instances, I split them into equivalence classes. Each equivalence class was a subset that included
all instances that shared the same SQL statement form. I then ranked the equivalence classes
based on their sizes. The top-ranked equivalence classes therefore contained instances that had
the most frequently used statement form. From each top-ranked equivalence class, I randomly
selected one instance. I applied the same process to the table forms. Following this protocol,
I selected 52 antipattern instances, whose SQL statement forms and table forms represented at
least 72%, and on average 90%, of all of the detected instances.
Similar to the SQL statement forms and table forms, I also needed to vary the values for the
number of SQL statements issued and the size of the tables. Since the number of columns and the
column types have been determined by the table form, the size of the table depends on the number
of rows and the size of the columns whose sizes are
exible, e.g., the string/text/varchar types of
columns. Instead of naively selecting random values for these numeric factors, I estimated a range
of values from real-world apps. I ran the 1,000 apps in my application corpus using a widely-
used workload generation tool, PUMA [41]. During the execution, I logged the SQL statements
issued along with their execution timestamps. From the logs, I computed empirical values for the
number of rows in a table, the size of a string type column, and the number of SQL statements
issued. These were estimated by computing the average number of insert statements to a table,
the average length of string data values in the insert statements, and the average number of SQL
statements issued in a consecutive sequence. Lastly, I chose the average values and varied them
by one and two standard deviations. These dierent values allowed me to obtain the quantitative
impact of various SQL antipatterns at dierent scale levels.
57
After identifying the values of the performance aecting factors, the last challenge is to deter-
mine the way to combine them. In order to thoroughly test the impact of the antipatterns under
dierent settings, I used the Cartesian product on the sets of factor values. Each tuple in the
resulting Cartesian product corresponds to a benchmark in the benchmark suite. A test in the
suite consists of the two benchmarks that share the same factor values except for the programming
practice, i.e., one of them is the antipattern version while the other one is the xed version.
5.1.1.2 Constructing the Antipattern and its Fix
To generate the antipattern version of the implementation, I utilized the selected real-world an-
tipattern instances. Note that I manually veried the decompiled code to ensure that they were
true positives before using them as benchmarks. I minimized the app code so that only the
database operations and code constructs that were related to the detected antipattern instances
were included in my benchmark.
To generate the xed version of the implementation, I followed the strategies proposed in the
literature and xed the antipattern version of the benchmark. For each of the SQL antipatterns,
the literature contains information about what kind of database access task a developer wants
to perform, how the developer would implement it in an inecient/insecure manner (i.e., the
antipattern) and in an ecient/secure manner (i.e., the x). In Section 5.2, I summarize the
repair strategies proposed in the literature for each SQL antipattern.
5.1.2 Performance Evaluation
In the evaluation, I measured and analyzed the energy consumption and runtime of the benchmark
code. There were four steps in the process. First, I inserted probes to the source code of the
benchmark. Second, I deployed the app containing the benchmark code on a smartphone that
was connected to a power meter. Third, I executed the app and recorded the energy and runtime
measurements during the execution. Last, I analyzed the collected data. In the rest of this section,
I elaborate on these steps.
Inserting Timing Probes: Before launching the app that contained the benchmark code,
timing probes were inserted. These probes recorded the start times and end times of the part of
the code that was changed or inserted after xing the antipattern. The recorded times were used
to calculate the runtime and energy consumption of the benchmark code in later steps.
58
Deploying Benchmarks: In this step, I deployed the app containing the benchmark code and
timing probes on a Samsung Galaxy S5 that was connected to a Monsoon power meter [80]. This
power measurement platform sampled the power consumption of the smartphone at a frequency
of 5KHz and synchronized these samples with the standard Unix time. By aligning starting and
ending times with the Monsoon measurement samples, I could obtain power measurements and
energy consumption for the benchmark code.
Conducting Experiments: In this step, I executed the app containing the benchmark code.
There were three mechanisms that I used to improve the stability of my measurement results.
First, to reduce the impact of any non-deterministic or uncontrolled behavior, I repeated the
measurements fteen times. I obtained an average relative standard error of 4.4%. According to
guidelines [43], if the relative standard error is greater than 25%, it means that the sampling error
is high and should be used with caution. The relative standard error I obtained was signicantly
smaller than this threshold. Second, for each measurement, the database was reinitialized to
ensure consistency across dierent measurements. Third, between each measurement, I added a
waiting time to avoid any impact from tail energy behavior and to allow the device to cool down.
Analyzing Data: In this step, I collected and analyzed the runtime and energy measure-
ments. For each antipattern, I measured the energy consumption and runtime dierence between
two code versions under various settings. I computed the mean, median, and statistical signif-
icance of the dierences in measurements for both the energy and runtime measurements. To
compute statistical signicance, I ran a Mann-Whitney U test ( = .05), which did not assume
the measurement data followed a normal distribution.
5.1.3 Threats to Validity
A potential threat is that my benchmark study was based on SQLite, which could have dierent
performance characteristics than other database management systems. However, SQLite is the
default and dominant local database service in mobile devices [71]. Therefore, my experiment
results apply to most mobile apps.
Other devices and operating systems may have dierent runtime and energy characteristics.
In my experiment, I collected the energy and runtime measurement results from a single device, a
Samsung Galaxy S5 running Android 5.0. To mitigate this threat, I repeated the measurements
on a Nexus 5 running Android 6.0.1 and a Google Pixel running Android 7.1.2. I observed that
59
although the absolute values diered from the results we obtained from the S5, the percentage dif-
ferences and statistical signicance were consistent. Therefore, my conclusions on the performance
impact of SQL antipatterns apply to other mobile devices.
When computing estimated values for the varied factors, I used an automated workload gen-
erator to generate the workload, which may not be representative of real user workload. However,
this potential bias was not likely to undermine my conclusion as my experiment was based on a
large number of applications and SQL statement executions. In addition, my benchmark suite
only required an estimated range of values for the varied factors instead of an accurate number.
I evaluated the performance impact on a synthesized benchmark suite, which may not re
ect
the realistic impact of the SQL antipattern on real apps. I made several eorts to mitigate this
threat. First, the code in the benchmark suite was replicated from SQL antipattern instances
detected in real apps. Second, I chose representative values of the performance aecting factors
to be used in the benchmark suite. For example, the selected table forms and statement forms
represented 90% of all instances detected in the real apps.
I xed the antipattern instances manually. If the repair was carried out incorrectly and altered
the functionality of the original program, it may aect the validity of the performance results I
obtained. To ensure that the code's functionality remained consistent before and after the x, I
added additional checks to the benchmark code. If the detected SQL antipattern instance involved
database reads, I checked if the read data that was used by the application remained consistent
before and after the x. If database writes were involved, I veried the corresponding database
table and checked if the content and size remained the same. Note that these checks were not
part of the measurements.
5.2 Results
In Table 5.1, I summarize the performance impacts. A positive number means the resource
consumption increases after xing the antipattern. A negative number means the resource con-
sumption decreases. For each antipattern, I show the mean (absolute value and percentage) and
median of the dierences in resource consumption. Since I constructed benchmarks for each an-
tipattern using various detected antipattern instances under dierent settings, in the table, I show
the benchmark tests that established statistical signicance in the runtime and energy dierences.
60
Table 5.1: SQL antipatterns and their performance impacts. A positive number means the re-
source consumption increases after xing the antipattern. A negative number means the resource
consumption decreases.
Antipattern
Runtime mean
(ms)
Runtime
median
(ms)
Runtime
signi-
cance
Energy mean
(mJ)
Energy
median
(mJ)
Energy
signi-
cance
Unbatched-Writes -263.99 (-94%) -218.03 108/108 -152.65 (-96%) -104.57 105/108
Not-Merging-Projection-
Predicates
-7.93 (-32%) -2.73 34/36 -5.04 (-36%) -1.03 18/36
Not-Merging-Selection-
Predicates
-18.59 (-66%) -9.9 14/18 -5.64 (-66%) -1.63 12/18
Loop-to-Join -1467.58 (-94%) -538.13 9/9 -3455.40 (-96%) -863.33 9/9
Vulnerable-Query 1.43 (18%) 0.40 7/45 0.49 (15%) 0 4/45
Not-Using-
Parameterized-Query
-21.70 (-50%) -11.07 49/54 -31.02 (-55%) -11.27 48/54
Not-Caching -8.37 (-7%) -3.87 24/27 -7.06 (-6%) -0.53 14/27
Unnecessary-Column-
Retrieval
-8.90 (-27%) -2.93 95/126 -10.42 (-29%) -2.07 78/126
Unnecessary-Row-
Retrieval
-9.41 (-65%) -2.20 35/45 -10.31 (-70%) -1.67 28/45
Unbounded-Query 1.57 (7%) 0.80 31/81 1.18 (7%) 0.07 11/81
Readable-Password 0.84 (0.5%) 0.67 9/18 0.10 (0.1%) 0 2/18
In the following sections, I list the repair strategies that I used for each SQL antipattern. I
present and discuss the quantitative impact I obtained from the benchmark study.
5.2.1 Antipattern: Unbatched-Writes
Repair Strategy: As suggested by my prior work et al. [73], I inserted explicit transaction
control so as to batch the writes.
Result: As shown in Table 5.1, after xing the antipattern, the runtime and energy were both
reduced by over 90% on average with statistical signicance. The results suggest that Unbatched-
Writes can be very resource-intensive. The more implicit transactions are triggered, the more
performance impact this antipattern can make. The average runtime reduction even exceeded
100 ms (a common threshold for human to perceive delay [79, 82]), which means this antipattern
can noticeably impact user experience.
5.2.2 Antipattern: Not-Merging-Projection-Predicates
Repair Strategy: To x this antipattern, I merged the projection predicates and read all the
needed columns with one query as suggested by Manjhi et al. [74].
61
Result: As shown in Table 5.1, the runtime and energy consumption were reduced by 32%
and 36% on average. Most of the runtime dierences established statistical signicance. These
results suggest that Not-Merging-Projection-Predicates can introduce performance overhead to
local database operations. The cost of scanning the database tables and nding the target data
can be resource expensive. By merging the projection predicates and retrieving the necessary
columns at once, the aforementioned costs can be reduced.
5.2.3 Antipattern: Not-Merging-Selection-Predicates
Repair Strategy: As suggested by Arzamasova et al. [3], I merged the selection predicates and
retrieved all the needed rows with one query. The columns used in the merged selection predicates
were added to the projection attributes. For example, \SELECT id FROM ScaleTable WHERE
Name = `Optimism'" and \SELECT id FROM ScaleTable WHERE Name = `Caring' " were
merged to \SELECT id, Name FROM ScaleTable WHERE Name in (`Optimism', `Caring')". To
ensure that the functionality of the program remained unchanged, the returned rows were split
in the application code according to the values in the Name column.
Result: The runtime and energy consumption were both reduced by 66% after xing the an-
tipattern. However, not all of the performance dierences established statistical signicance. I
observed two cases where the resource consumption increased after the x. When looking in depth
at those cases, I found that in the antipattern version, the cost of the queries was low. After merg-
ing them in the xed version, the performance gain was lower than or similar to the overhead of
splitting the data. Nevertheless, in most of the cases, there was a big performance reduction after
xing the antipattern. This was because multiple costly queries were combined. These results
demonstrated that the Not-Merging-Selection-Predicates antipattern can have a negative impact
on the resource consumption of local database operations, but this impact depends on the cost of
the queries and the overhead of splitting the results.
5.2.4 Antipattern: Loop-to-Join
Repair Strategy: To x this antipattern, I joined the two tables that were used in the rst query
and second query, and then queried from the joint table as suggested by the literature [74, 18, 25].
Result: As shown in Table 5.1, the runtime and energy consumption were reduced by 94% and
96% on average after xing the antipattern. The statistical test established statistical signicance
62
for all of the results. As the runtime dierence was over 100 ms, this antipattern is very likely to
noticeably impact user experience. The reason behind this big performance impact is that xing
the antipattern can reduce the number of queries issued from 1+N (N stands for the number of
rows retrieved from the rst table) to 1. By looking in depth at the results, I found that when the
size of the retrieved data exceeded a certain limit, the time required to iterate over the retrieved
rows increased signicantly. In Android, the retrieved data is stored in a Cursor object [22]. It has
a buer of 2 MB in size by default to store the retrieved data. When a row that is not in the buer
is requested and the buer is full, the memory needs to be freed and the buer will be refreshed.
In the antipattern version, the buer refreshing overhead was triggered when the application code
iterated over the data that was retrieved from the rst table. In the xed version, since the result
iteration was avoided by joining the tables, the buer refreshing overhead was avoided as well.
In summary, xing the antipattern can reduce the number of database operations, reduce the
amount of data transferred, and reduce the time that is required to iterate over the result. These
ultimately led to signicant performance improvement in terms of runtime and energy.
5.2.5 Antipattern: Vulnerable-Query
Repair Strategy: To x this antipattern, I inserted encoding functions that sanitized the vul-
nerable input before this input was concatenated to the SQL statement. The encoding functions
were provided by the OWASP Enterprise Security API (ESAPI) Toolkits [88]. These encoding
functions use the proper escaping scheme for the database and sanitize the provided input. The
DBMS will not confuse the sanitized input with the SQL code written by the developer, thus
avoiding any possible SQL injection vulnerabilities. Sanitizing the input is not the only way to
tackle SQL injection. Another typical way is to used parameterized queries. In Section 5.2.6, I
also evaluated how parameterization would impact the performance.
Result: My experiment results showed that sanitizing the vulnerable input can increase the
runtime and energy consumption by 18% and 16%, respectively. But as shown in Table 5.1,
the absolute cost was relatively low. The performance increases did not establish statistical
signicance for most of the cases as well. When I investigated into the ones with signicant
dierences, I discovered that calling the sanitization API for the rst time had a relatively high
cost (10 ms on average). This was introduced by the initialization process of the sanitizer. Overall,
the experiment results suggest that developers can improve the security of their applications with
a relatively low performance cost if they sanitize the vulnerable input.
63
5.2.6 Antipattern: Not-Using-Parameterized-Query
Repair Strategy: To x this antipattern, I parameterized the queries and reused the precompiled
template as suggested by Karwin [49].
Result: As shown in Table 5.1, the runtime and energy of the database operations were reduced
50% and 55% on average after xing the antipattern. Statistical signicance was established for
most of the cases. By reusing the query template, the repetitive string concatenation, query
parsing, and query compilation can be avoided. These ultimately led to signicant runtime im-
provement and energy savings. The results suggest that using parameterized queries is not only
a more secure way, but also a more resource-ecient way to interact with local databases.
5.2.7 Antipattern: Not-Caching
Repair Strategy: Since the existing repair methods usually rely on adding a cache layer, of
which software developers may not have control over, I experimented with a strategy that can
x this antipattern by code refactoring. I avoided issuing repeated queries by reusing the Cursor
object [22]. This object is returned by the select query API and it stores the query results.
Result: The experiment results showed that xing this antipattern can reduce the runtime and
energy of the database operations by 7% and 6%, respectively. When I investigated into the
savings, I found that the resource consumption of retrieving data was reduced signicantly, but
the time required to iterate over the retrieved data (i.e., the cursor), remained unchanged. This is
because caching the data helps to reduce the amount of data transferred from the database to the
application, but does not help to reduce of amount of data that the application needs to iterate
over. Therefore, the buer refreshing overhead would still occur. Nevertheless, the absolute energy
and runtime savings were still relatively high. Many of them established statistical signicance
as well, suggesting the benets of xing this antipattern.
5.2.8 Antipattern: Unnecessary-Column-Retrieval
Repair Strategy: As suggested by Yan et al. [104], I modied the query so that it only retrieved
the needed columns.
Result: The experiment results showed that after xing the antipattern, the runtime and energy
consumption of the database operations can be reduced by 27% and 29% on average, respectively.
The dierences established statistical signicance for most of the runtime results. I investigated
64
into the cases that did not show a statistically signicant impact. I found that the unnecessarily
retrieved column had an integer type, whose size was small. As for the cases that had a signicant
impact, I found that xing the antipattern can reduce the time of retrieving data since less data
needed to be transferred. This demonstrates that even though the data does not need to travel over
the network, there is still a signicant cost for retrieving data from a local database. Moreover,
I discovered that retrieving less columns could also save the resources consumed by iterating
over the results. Although xing the antipattern did not reduce the number of retrieved rows, it
reduced the total amount of retrieved data as less columns were fetched. As a consequence, the
memory pressure was alleviated and the buer refreshing rate was decreased. These ultimately
led to less resource consumption.
5.2.9 Antipattern: Unnecessary-Row-Retrieval
Repair Strategy: As suggested by Emani et al. [25], I modied the query so that it only retrieved
the needed rows.
Result: The experiment results showed that by xing the antipattern, the runtime and energy
can be reduced by 65% and 70% on average, respectively. The statistical test established statistical
signicance for most of the runtime results. The ones that did not show a signicant improvement
were from an instance with statement form SELECT DISTINCT. In that instance, since only the
distinct rows were retrieved, the antipattern version only retrieved two more rows than the xed
version, resulting in a negligible performance dierence. When I looked in depth at the results
that showed a signicant dierence, I found that xing this antipattern not only reduced the time
needed to retrieve the data but also reduced the time needed to iterate over the retrieved result.
This is because there was less data transferred and less rows needed to be iterated over. Similar
to xing the Unnecessary-Column-Retrieval antipattern, when the total amount of data to read
became larger and exceeded the threshold, optimizing the Unnecessary-Row-Retrieval antipattern
can signicantly reduce the time needed to iterate over the result; because it can avoid or alleviate
the buer refreshing overhead.
5.2.10 Antipattern: Unbounded-Query
Repair Strategy: For this antipattern, carrying out a repair can be challenging. This is because
changing a query from unbounded to bounded can alter the semantics and even the display of
65
the program. Depending on the functionality of the application, developers can choose to use
the SQL keyword LIMIT to simply limit the number of retrieved rows. A sophisticated change
at the design level, such as employing pagination [106], can also be applied. Although there can
be a variety of repair strategies, there is one check that developers can always insert into the
code before issuing an unbounded query, which is to check the number of rows that are going to
be retrieved and iterated over. If this number is under certain threshold, the original unbounded
query can be issued anyway. Otherwise, a xed version of the code would be executed. Since there
does not exist a universal repair strategy for this antipattern, in our experiment, we measured
the performance overhead of checking the number of retrieved rows.
Result: Our experiment results showed that the runtime and energy consumption were both
increased 7% on average after adding the bound check. The absolute runtime and energy in-
crease were 1.57 ms and 1.18 mJ, respectively. Comparing to the performance impact of other
antipatterns, this overhead was relatively low. The low cost was due to the fact that the bound
check did not require the application to load all the necessary rows of data from the database
to the memory. Developers can check the number of retrieved rows using the COUNT() function
in SQL, which returns a single number. By checking the number of retrieved rows in advance,
developers can improve the security of their applications with a low performance cost. Developers
can choose to apply a x to this antipattern on top of the bound check so as to further improve
the performance and security.
5.2.11 Antipattern: Readable-Password
Repair Strategy: To x this antipattern, I encrypted the sensitive information before storing it
to the database as suggested by Karwin [49].
Result: The experiments showed that the runtime and energy were increased 0.5% and 0.1%
on average after xing the antipattern. The absolute runtime increased 0.84 ms and the energy
consumption increased 0.1 mJ. These costs are relatively low given the 100 ms threshold for
humans to perceive delay [79, 82]. These results mean that xing this antipattern by encryption
comes at a low performance cost.
66
5.3 Summary
In this chapter, I demonstrated the performance impact of the detected SQL antipatterns on
local database operations in mobile apps. Using a benchmark study that was constructed based
on the detected instances of eleven SQL antipatterns, I discovered that eight SQL antipatterns
have a signicant impact on the runtime and energy consumption of local database operations.
Out of these eight SQL antipatterns, two of them (Unbatched-Writes and Loop-to-Join) are
particularly impactful and can noticeably aect user experience. For the three security oriented
SQL antipatterns, I discovered that the repair only introduced a minor performance cost. The
results demonstrate that the SAND framework was able to help developers identify problematic
code that had a signicant performance impact on mobile apps.
67
Chapter 6
Related Work
In this chapter, I present work that is related to my dissertation. I divide the related work into four
parts: The rst part is the group of work related to detection of SQL antipatterns (Section 6.1).
The second part is the group of work related to program analysis frameworks (Section 6.2). The
third part is the group of work related to analyzing database usage and cost (Section 6.3). The
last part is the group of work related to detection and optimization of various performance issues
in mobile apps (Section 6.4).
6.1 Detecting SQL Antipatterns
Many approaches have been proposed to detect various SQL antipatterns. In Section 3.2, I
provided detailed descriptions for each of these techniques and discussed their drawbacks. In this
section, I summarize the techniques and discuss how they dier from SAND and the SANDLess
detectors.
Chen et al. [17] targeted the Unbatched-Writes antipattern for applications developed using
ORM. Their technique statically detects the database-accessing functions that are invoked inside a
loop without being optimized by some ORM annotations (e.g., Batch). As their technique targets
ORM applications, it cannot handle the dynamically constructed string-based SQL statements in
mobile apps. In additional, their technique examines whether a database write is in a transaction
by checking if the ORM annotation exists. This means that the technique cannot model the
complex transaction operations in mobile apps as the SANDLess detector for Unbatched-Writes
does.
68
Tamayo et al. [98] developed a technique that can suggest batching opportunities by analyzing
the dependencies between multiple dynamically captured SQL statements. Dierent from SAND's
static approach, the technique relies on dynamic proling to extract SQL statements. While
dynamic proling can precisely identify the SQL statements, a general limitation is that the recall
highly depends on the workload. If the workload does not trigger the code that matches the
antipattern, it could introduce false-negatives. The SANDLess detectors are dierent as they are
based on static analyses, which do not require any assumption about the workload.
Manjhi et al. [74] developed static analyses to inspect an application's CFGs and its statically
embedded SQL queries in order to detect the Loop-to-Join, Not-Merging-Projection-Predicates,
and Not-Merging-Selection-Predicates antipatterns. The limitation of this work lies in its assump-
tion about the method of constructing SQL statements. The technique assumes that the entire
SQL statement is statically embedded at the database interaction point. When analyzing local
database operations in mobile apps, it can miss detection targets as SQL statements are usually
constructed at runtime by various string operations along dierent paths [71]. This limitation is
addressed by SAND's design of the silica abstraction.
Arzamasova et al. [3] analyzed the existing SQL query logs to look for the Not-Merging-
Projection-Predicates and Not-Merging-Selection-Predicates antipatterns. Their log analyzer
checks if a sequence of SQL queries has equivalent FROM and WHERE clauses, but a dier-
ent SELECT clause. However, since their techniques takes the query log as an input and does not
analyze the application code, it can only determine whether the antipatterns exist but not where
they are located in the application code. The SANDLess detectors do not have this limitation
because they use static analyses and the detected instances are associated with the silicas, which
contain the locations of the database interaction points.
Cheung et al. [18] and Emani et al. [25] both proposed techniques to detect the Loop-to-Join
antipattern for applications developed using ORM. Their techniques automatically transform
imperative code fragments (written using ORM and nested loops) into SQL queries that use
joins. Since these two techniques target ORM applications, they assume that there is a one-to-
one mapping between the database access API and the concrete SQL statement. However, this
assumption does not hold for mobile apps as dynamic string-based SQL statements are widely
used [71]. The limitation of analyzing SQL statements is addressed by SAND's design of the silica
abstraction.
69
TAPS [7, 8] is an automated technique for query parameterization. To identify SQL queries
that can be parameterized, TAPS rst uses symbolic execution to compute the possible SQL
queries issued by an application. It then analyzes the parsed structure of the symbolic expression
to identify data arguments for the parameterized query. To compute the possible SQL statements
issued by the application, TAPS uses symbolic execution, which could be inecient and not
scalable. SAND's underlying string analysis, Violist, uses a summary-based approach to compute
possible SQL statements. The analysis has been shown to be ecient and scalable [61].
Chen et al. [15] proposed an approach that combines static and dynamic analysis to detect the
Unnecessary-Column-Retrieval antipattern in Java-based ORM frameworks. This technique has
two limitations. First, it targets ORM applications and cannot analyze string-based SQL state-
ments. Second, the technique relies on dynamic analysis to identify the needed columns. However,
if a column is needed somewhere in the code, but the chosen workload does not trigger the code,
the technique would consider that the column is unnecessarily retrieved, which is a false-positive.
The SANDLess detector for Unnecessary-Column-Retrieval addresses the aforementioned two lim-
itations. First, it can handle string-based SQL statements. Second, it contains a static analysis
approach that safely identies all the columns that are possibly used in the application. This
means that the SANDLess detector does not require any assumption about the workload.
Yan et al. [104] proposed static analysis techniques to analyze ORM applications and proposed
detection rules to identify several SQL antipatterns, including Not-Caching, Unnecessary-Column-
Retrieval, and Unbounded-Query. Their work targets ORM application and does not handle
string-based SQL statements, which can be addressed by SAND.
At a high level, in addition to the dierences between the individual techniques and the SAND-
Less detectors, a major dierence is that my dissertation aims at providing a general solution for
SQL antipattern detection, while the aforementioned techniques focus on specic SQL antipat-
terns. Most of the existing techniques only analyze web apps and remote databases instead of
mobile apps and local databases.
6.2 Program Analysis Framework
The most similar work to ours is the work by Dasgupta et al. [20]. Their framework provides a set
of services that analyze database applications that use ADO.NET data access APIs. The services
include extracting SQL statements, extracting properties of how the SQL statement results are
70
used in the application, and analyzing user input and their propagation to SQL statements.
SAND is dierent from their framework in the following aspects. First, their framework only
analyzes a subset of the application-database relationships that SAND analyzes. The control
ow relationships between program points, loops, transactions, etc, which are essential to the
detectors of many SQL antipatterns, are not provided in their framework. Second, SAND provides
a language, SANDLess. Such functionality is not provided in their framework. Third, their
approach only handles string concatenation and simple control
ow, while ours supports all of the
string operations in the Java API and various complex control
ows.
Meurice et al. [78, 77] presented a static analysis approach to analyze dynamic database usage
in Java systems. Their approach helps developers to automatically identify the source code
locations accessing given database tables and columns. There are two major dierences between
their study and mine. First, Meurice's work focused on identifying the table and column names
of interpreted queries. My work analyzed all kinds of application-database relationships. Second,
the string analysis used in their work cannot handle complex string manipulation and data
ow.
This can result in false positives and false negatives in identifying string values.
A variety of program analysis frameworks have been proposed to help people conduct sophis-
ticated program analyses. SIF [40] is an instrumentation framework that contains abstractions
that allow users to compactly express precisely which parts of the app need to be instrumented.
The framework contains a novel analysis that inspects path execution, and provides users feed-
back on the overhead of the instrumentation specication. Burgstaller et al. proposed a generic
symbolic analysis framework [11]. Their framework provides a comprehensive and compact alge-
braic structure that describes the complete control and data
ow analysis information valid at a
given program point. Another framework proposed by Verbaere et al. provides a domain-specic
language for refactoring, named JunGL [100]. JunGL manipulates a graph representation that
contains all information about the program, including ASTs, control
ow and so on. Maritin
et al. [75] proposed a language called PQL that focuses on the error patterns that deal with se-
quences of events associated with a set of related objects. In this work, both static and dynamic
techniques were developed to nd solutions to PQL queries. The static analyzer nds all poten-
tial matches conservatively using alias analysis. The dynamic analyzer instruments the source
program to catch all violations precisely as the workload executes and to optionally perform user-
specied actions. All these frameworks are very useful in tackling their domain specic issues.
However, they do not target application-database relationships and therefore do not support some
71
core functionalities, such as modeling SQL statements, that are required by the detection of SQL
antipatterns.
6.3 Analyzing Database Usage And Cost
A group of studies empirically evaluated database usage in applications. Qiu et al. [91] conducted
an empirical analysis of the co-evolution of database schemas and code in ten large database
applications. They found database schemas evolved frequently during the application life cycle
and caused signicant code-level modications. In their study, they focused on database schemas
and web applications that use remote databases. In contrast, my dissertation focused on the local
database operations in mobile apps and provided a detailed look into the associated antipatterns
and costs. Goeminne et al. [30] explored the survival of ve Java database frameworks among
dierent Java projects to see whether certain database frameworks were used more successfully
in certain combinations than others. Similarly, Decan et al. [21] empirically studied how the
use of relational database access technologies evolved over time in open source Java projects.
Dierent from Goeminne's and Decan's work, my dissertation did not compare dierent database
frameworks for Java applications, but focused on local database usage and related antipatterns
in mobile applications.
Researchers have conducted studies to understand the cost of local database operations on
mobile devices. Jeong and colleagues [45] proposed an Android storage performance analysis
tool, called AndroStep, to characterize and analyze the behavior of the IO subsystem in Android
based devices. In addition to this tool, they also introduced a benchmark that included typical
lesystem workloads, e.g., Random vs. Sequential, Synchronous vs. Buered IO, and SQLite
insert/update. They analyzed the performance result of the benchmark on eight smartphone
models. Kim et al. [52] proposed a storage benchmarking tool, called AndroBench, for Android
devices. It measured the I/O performance and the throughput of various types of database
operations, such as insert, update, and delete. AndroBench accumulated the benchmark results
from hundreds of devices from all over the world. Based on these results, Kim analyzed the
sequential and random I/O performance, SQLite performance, and other characteristics of the
storage performance in detail. Kennedy and colleagues [50] conducted a similar experiment that
recorded SQLite activities on eleven Android phones during one month of usage. They then
collected and analyzed the characteristics and costs of SQL queries. Dierent from my benchmark
72
study, all the aforementioned studies did not measure energy consumption and did not take the
programming practices of using database operations into consideration.
Some existing work that focuses on the energy consumption of mobile applications also reports
database energy usage. Linares-V asquez and colleagues [63] conducted an empirical study on
measuring the energy consumption of Android API methods. In their study, they mined the
usage patterns of energy greedy APIs. Li and colleagues [59] investigated the energy consumption
of over 400 real-world marketplace mobile applications. They discovered that SQLite is one of
the most energy consuming components in Android apps. My prior work [71] also included an
empirical study that measured which groups of SQLite API calls were more resource consuming.
However, all the aforementioned studies do not take the database programming practices into
consideration.
Researchers have focused on analyzing storage mechanisms for database operations. Jeong and
colleagues [46] improved the performance of I/O operations by coordinating the dierent database
journaling modes with dierent lesystems. Oh and colleagues [84] leveraged the phase change
memory technique to reduce redundant page writes. Ouarnoughi and colleagues [86] assessed the
performance behavior of database applications on
ash memory. In their study, micro bench-
marking and modeling methodology were used to analyze SQLite database queries on embedded
ash specic le systems. Although this area of research has made important advancements, it
does not address the role that software engineering practices and program structure can also have
on the resource consumption of database services on mobile devices.
6.4 Performance Optimization for Mobile Apps
One group of techniques characterized and detected performance and energy bugs in mobile apps.
Pathak et al. [89] presented real world no-sleep energy bug characteristics, proposed automatic
solution to detect these bugs, and provided experimental data showing the accuracy of their tool.
Liu et al. [65, 66] proposed an approach, called GreenDriod, that can automatically analyze an
application's sensory data utilization at dierent states, and can report actionable information to
help developers locate energy ineciency problems. Banerjee et al. [6] proposed an automated
test generation framework that detects energy bugs in Android applications. Their framework
generates test inputs that capture user interaction scenarios that lead to energy bugs in a mo-
bile application. Liu et al. [67] studied the characteristics of 70 real-world performance bugs and
73
identied common patterns of these bugs. They then implemented a static code analyzer, Per-
fChecker, to detect their identied performance bug patterns. Guo et al. [36] focused on resource
leak problems in Android apps. They developed a static analysis tool called Relda, which can
automatically analyze an application's resource operations and locate the resource leaks. Their
tool targets Android applications and can handle the features of event-driven mobile programming
by analyzing the callbacks dened in Android framework. Liu et al. [68] identied eight patterns
of wake lock misuses that commonly cause functional and non-functional issues. They designed
a static analysis technique, Elite, to detect two most common patterns of wake lock misuses in
mobile apps. This group of detection techniques generally focused on sensors, wakelocks, and var-
ious issues related to resource leakage. The database usage issues and various SQL antipatterns
were not covered by these techniques.
Much research eort has also been devoted to performance and energy optimization for mobile
apps. Li et al. proposed techniques to optimize energy consumption of mobile apps by bundling
HTTP requests [57, 58, 60]. Zhao et al. [111] proposed a technique, called PALOMA, to optimize
runtime performance of mobile apps by prefetching HTTP requests across callbacks. Li et al. [62]
developed an approach for automatically rewriting applications so that they generate more energy
ecient web pages when these web pages render on smartphones' OLED display. Wan et al. [101]
proposed a technique for detecting display energy hotspots. Their technique leverages display
power modeling and automated display transformation techniques to detect these hotspots and
prioritize them for developers. Linares-V asquez et al. [64] proposed an approach, named GEMMA,
for generating color palettes using a multi-objective optimization technique. The technique pro-
duces color solutions that optimize energy consumption and contrast while using consistent colors
with respect to the original color palette. Nikzad et al. [83] proposed a technique to schedule
tasks on mobile apps in a more energy ecient way. Zhu et al. [113] proposed a set of language
extensions to guide energy optimizations in mobile Web applications. Cito et al. [19] optimized
recurrent advertisement and analytics requests to save energy. However, the aforementioned tech-
niques did not optimize runtime performance or energy consumption of database operations in
mobile apps.
74
Chapter 7
Conclusion
The goal of my research is to design a framework that can detect SQL antipatterns eectively
and eciently in mobile apps. The hypothesis of my dissertation is:
Static detectors built on top of abstractions of application-database relationships can eectively
and eciently detect SQL antipatterns in mobile apps.
To evaluate the hypothesis of my dissertation, I designed a detection framework, SAND, that
abstracts away the interactions between the application and the database using static analysis
techniques. It provides a language, SANDLess, which is used to query abstractions of application-
database relationships and specify SQL antipattern detection tasks.
To demonstrate that detectors of SQL antipatterns can be built on top of abstractions of
application-database relationships, I designed SANDLess and used it to express the detection rules
of a full spectrum of SQL antipatterns. More specically, I rst collected a full spectrum of eleven
SQL antipatterns and their detection approaches via a literature survey. I then analyzed each of
these detection approaches and identied seven types of application-database relationships that
were needed by these approaches. Based on these identied relationships, I designed SANDLess,
which provides abstractions of these seven types of application-database relationships and allows
detector builders to specify detection logic on top of these relationships. In my dissertation work,
I showed that the detection rules of all eleven SQL antipatterns discovered in the literature can
be expressed in SANDLess. This conrmed that detectors of SQL antipatterns can be built on
top of abstractions of application-database relationships.
When realizing SANDLess, I designed a range of static analysis techniques that were spe-
cialized at eectively and eciently extracting the abstractions of application-database relation-
ships from the database access code in mobile apps. These static analysis techniques focused
75
on handling the complications of string-based SQL statements, transactions, and other kinds of
application-database relationships that were required by the abstractions. To empirically evalu-
ate the eectiveness and eciency of the framework, I measured the precision and speed of the
SANDLess detectors. On the aspect of eectiveness, the results showed that these SANDLess
detectors, which are built on top of abstractions of application-database interactions, precisely
identied thousands of instances of SQL antipatterns with a precision of at least 99.4%. On the
aspect of eciency, these SANDLess detectors were fast. Applying eleven detectors only took an
average of forty-one seconds per app. Overall, these results are very positive and they successfully
demonstrate the eectiveness and eciency of the framework in SQL antipattern detection.
In summary, my dissertation work has demonstrated that static detectors can be built on top
of abstractions of application-database relationships and that these static detectors can eectively
and eciently detect a full spectrum of SQL antipatterns in mobile apps, thereby conrming the
hypothesis of my dissertation.
7.1 Future Directions
One possible direction of future work is to extend the detection framework. The design of SAND
is generalizable. SAND can be extended to analyze other languages for which it is possible to
generate CFGs and CCGs. It can also be extended to analyze other database APIs where SQL
statements are strings, such as JDBC. In the future, researchers and practitioners may discover
new SQL antipatterns. Future research can utilize the abstractions provided by SAND to develop
detectors for these new antipatterns. If the existing abstractions are not expressiveness enough,
SAND can be extended to provide more kinds of abstractions.
Another interesting direction is to prioritize the detected instances of SQL antipatterns based
on their predicted performance impact. In my benchmark study, I collected a large amount of data
about the runtime and energy consumption of various instances of SQL antipatterns. This data
provides insights about under what circumstances a SQL antipattern is more likely to consume
more resources in terms of energy and runtime. For example, one of the insight was that when the
size of the retrieved data exceeded a certain limit, the time required to iterate over the retrieved
rows increased signicantly. This was due to the buer refreshing overhead in Android. Future
research can focus on discovering similar insights and can use them to guide the prioritization of
SQL antipatterns based on their predicted impact.
76
Using the functionalities provided by SAND, researchers and practitioners can extend the
application of this framework beyond the detection of SQL antipatterns. One possible direction is
to conduct empirical studies on the database usage in mobile apps. As the SANDLess functions can
eectively and eciently identify various control
ow and data
ow properties with respect to the
database operations, future research can make use of these functions and study the characteristics
of the database access code in mobile apps at a large scale.
Another possible direction of future work is to repair the SQL antipatterns automatically. In
one of my prior work [73], I proposed a fully automated approach that optimizes one of the SQL
antipatterns. The technique was able to repair the antipattern in a large portion of mobile apps
and the performance gain was signicant after the repair. Besides this antipattern, other SQL
antipatterns can also have signicant impact on the performance of mobile apps, as shown in my
benchmark study. However, existing repair approaches, as summarized in my literature review,
have drawbacks that limit their capability of repairing SQL antipatterns in mobile apps. There
are also SQL antipatterns for which automated repair techniques do not exist yet. Future research
can look into the opportunities of repairing the instances of various SQL antipatterns detected by
SAND.
77
References
[1] Android Documentation. https://developer.android.com, March 2020.
[2] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jerey D. Ullman. Compilers: Princi-
ples, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc.,
Boston, MA, USA, 2006.
[3] N. Arzamasova, M. Schler, and K. Bhm. Cleaning antipatterns in an sql query log. IEEE
Transactions on Knowledge and Data Engineering, 30(3):421{434, March 2018.
[4] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques
Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. Flowdroid: Precise con-
text,
ow, eld, object-sensitive and lifecycle-aware taint analysis for android apps. In
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and
Implementation, PLDI '14, pages 259{269, New York, NY, USA, 2014. ACM.
[5] Sruthi Bandhakavi, Prithvi Bisht, P. Madhusudan, and V. N. Venkatakrishnan. Candid:
Preventing sql injection attacks using dynamic candidate evaluations. In Proceedings of the
14th ACM Conference on Computer and Communications Security, CCS '07, pages 12{24,
New York, NY, USA, 2007. ACM.
[6] Abhijeet Banerjee, Lee Kee Chong, Sudipta Chattopadhyay, and Abhik Roychoudhury.
Detecting energy bugs and hotspots in mobile apps. In Proceedings of the 22Nd ACM
SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014,
pages 588{598, New York, NY, USA, 2014. ACM.
[7] Prithvi Bisht, A. Prasad Sistla, and V. N. Venkatakrishnan. Taps: Automatically preparing
safe sql queries. In Proceedings of the 17th ACM Conference on Computer and Communi-
cations Security, CCS '10, pages 645{647, New York, NY, USA, 2010. ACM.
[8] Prithvi Bisht, A Prasad Sistla, and VN Venkatakrishnan. Automatically preparing safe sql
queries. In International Conference on Financial Cryptography and Data Security, pages
272{288. Springer, 2010.
[9] Jos e A. Blakeley, Neil Coburn, and Per-:1Vke Larson. Updating derived relations: Detecting
irrelevant and autonomously computable updates. ACM Trans. Database Syst., 14(3):369{
400, September 1989.
[10] Gregory Buehrer, Bruce W. Weide, and Paolo A. G. Sivilotti. Using parse tree validation to
prevent sql injection attacks. In Proceedings of the 5th International Workshop on Software
Engineering and Middleware, SEM '05, pages 106{113, New York, NY, USA, 2005. ACM.
[11] Bernd Burgstaller, Bernhard Scholz, and Johann Blieberger. A symbolic analysis framework
for static analysis of imperative programming languages. Journal of Systems and Software,
85(6):1418{1439, 2012.
78
[12] Wei Cao and Dennis Shasha. Appsleuth: a tool for database tuning at the application level.
In Proceedings of the 16th International Conference on Extending Database Technology,
pages 589{600. ACM, 2013.
[13] Yinzhi Cao, Yanick Fratantonio, Antonio Bianchi, Manuel Egele, Christopher Kruegel, Gio-
vanni Vigna, and Yan Chen. Edgeminer: Automatically detecting implicit control
ow
transitions through the android framework. In 22nd Annual Network and Distributed Sys-
tem Security Symposium, NDSS 2015, San Diego, California, USA, February 8-11, 2015.
The Internet Society, 2015.
[14] Surajit Chaudhuri, Vivek Narasayya, and Manoj Syamala. Bridging the application and
dbms proling divide for database application developers. In VLDB. Very Large Data Bases
Endowment Inc., September 2007.
[15] T. H. Chen, W. Shang, Z. M. Jiang, A. E. Hassan, M. Nasser, and P. Flora. Finding
and evaluating the performance impact of redundant data access for applications that are
developed using object-relational mapping frameworks. IEEE Transactions on Software
Engineering, 42(12):1148{1161, Dec 2016.
[16] Tse-Hsun Chen, Weiyi Shang, Ahmed E Hassan, Mohamed Nasser, and Parminder Flora.
Cacheoptimizer: Helping developers congure caching frameworks for hibernate-based
database-centric web applications. In Proceedings of the 2016 24th ACM SIGSOFT Inter-
national Symposium on Foundations of Software Engineering, pages 666{677. ACM, 2016.
[17] Tse-Hsun Chen, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Mohamed Nasser, and
Parminder Flora. Detecting performance anti-patterns for applications developed using
object-relational mapping. In Proceedings of the 36th International Conference on Software
Engineering, ICSE 2014, pages 1001{1012, New York, NY, USA, 2014. ACM.
[18] Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. Optimizing database-backed
applications with query synthesis. In Proceedings of the 34th ACM SIGPLAN Conference
on Programming Language Design and Implementation, PLDI '13, pages 3{14, New York,
NY, USA, 2013. ACM.
[19] J urgen Cito, Julia Rubin, Phillip Stanley-Marbell, and Martin Rinard. Battery-aware trans-
formations in mobile applications. In Proceedings of the 31st IEEE/ACM International
Conference on Automated Software Engineering, ASE 2016, pages 702{707, New York, NY,
USA, 2016. ACM.
[20] Arjun Dasgupta, Vivek Narasayya, and Manoj Syamala. A static analysis framework for
database applications. In Proceedings of the 2009 IEEE International Conference on Data
Engineering, ICDE '09, pages 1403{1414, Washington, DC, USA, 2009. IEEE Computer
Society.
[21] Alexandre Decan, Mathieu Goeminne, and Tom Mens. On the Interaction of Rela-
tional Database Access Technologies in Open Source Java Projects. arXiv preprint
arXiv:1701.00416, 2017.
[22] Android Documentation. Android Cursor. https://developer.android.com/reference/
android/database/Cursor, March 2019.
[23] Android Documentation. Understand the activity lifecycle. https://developer.android.
com/guide/components/activities/activity-lifecycle, March 2020.
79
[24] Robert F. Dugan, Jr., Ephraim P. Glinert, and Ali Shokoufandeh. The sisyphus database
retrieval software performance antipattern. In Proceedings of the 3rd International Work-
shop on Software and Performance, WOSP '02, pages 10{16, New York, NY, USA, 2002.
ACM.
[25] K. Venkatesh Emani, Tejas Deshpande, Karthik Ramachandra, and S. Sudarshan. Dbridge:
Translating imperative code to sql. In Proceedings of the 2017 ACM International Confer-
ence on Management of Data, SIGMOD '17, pages 1663{1666, New York, NY, USA, 2017.
ACM.
[26] Jason Feinstein. Squeezing Performance from SQLite: Insertions. https://medium.com/
@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2, April
2017.
[27] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph
and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319{349, July 1987.
[28] Github. Rat-trap: A tool of automated optimization of resource inecient database writes
for mobile applications. https://github.com/USC-SQL/RAT-TRAP, March 2020.
[29] Github. Violist: A string analysis framework for java and android apps. https://github.
com/USC-SQL/Violist, March 2020.
[30] M. Goeminne and T. Mens. Towards a Survival Analysis of Database Framework Usage
in Java Projects. In 2015 IEEE International Conference on Software Maintenance and
Evolution (ICSME), pages 551{555, Sept 2015.
[31] Google. Android SQLite Documentation. https://developer.android.com/reference/
android/database/sqlite/SQLiteDatabase.html, 2019.
[32] Google. Gooogle play app store. https://play.google.com/store/apps, 2019.
[33] Carl Gould, Zhendong Su, and Premkumar Devanbu. Jdbc checker: A static analysis tool
for sql/jdbc applications. In Proceedings of the 26th International Conference on Software
Engineering, ICSE '04, pages 697{698, Washington, DC, USA, 2004. IEEE Computer Soci-
ety.
[34] Jiaping Gui, Stu Mcilroy, Mei Nagappan, and William G. J. Halfond. Truth in advertis-
ing: The hidden cost of mobile ads for software developers. In Proceedings of the 37th
International Conference on Software Engineering (ICSE), May 2015. To Appear.
[35] Jiaping Gui, Meiyappan Nagappan, and William GJ Halfond. What Aspects of Mobile Ads
Do Users Care About? An Empirical Study of Mobile In-app Ad Reviews. arXiv preprint
arXiv:1702.07681, 2017.
[36] Chaorong Guo, Jian Zhang, Jun Yan, Zhiqiang Zhang, and Yanli Zhang. Characterizing
and detecting resource leaks in android applications. In Automated Software Engineering
(ASE), 2013 IEEE/ACM 28th International Conference on, pages 389{398, Nov 2013.
[37] Vivek Haldar, Deepak Chandra, and Michael Franz. Dynamic taint propagation for java.
In Proceedings of the 21st Annual Computer Security Applications Conference, ACSAC '05,
pages 303{311, Washington, DC, USA, 2005. IEEE Computer Society.
[38] William G Halfond, Jeremy Viegas, Alessandro Orso, et al. A classication of sql-injection
attacks and countermeasures. In Proceedings of the IEEE International Symposium on
Secure Software Engineering, volume 1, pages 13{15. IEEE, 2006.
80
[39] William G. J. Halfond and Alessandro Orso. Preventing sql injection attacks using amnesia.
In Proceedings of the 28th International Conference on Software Engineering, ICSE '06,
pages 795{798, New York, NY, USA, 2006. ACM.
[40] Shuai Hao, Ding Li, William G. J. Halfond, and Ramesh Govindan. Sif: A selective in-
strumentation framework for mobile applications. In Proceedings of the 11th International
Conference on Mobile Systems, Applications and Services (MobiSys), June 2013.
[41] Shuai Hao, Bin Liu, Suman Nath, William G.J. Halfond, and Ramesh Govindan. Puma:
Programmable ui-automation for large scale dynamic analysis of mobile apps. In Proceed-
ings of the ACM International Conference on Mobile Systems, Applications, and Services
(MobiSys), June 2014.
[42] Roee Hay, Omer Tripp, and Marco Pistoia. Dynamic Detection of Inter-application Com-
munication Vulnerabilities in Android. In Proceedings of the 2015 International Symposium
on Software Testing and Analysis, ISSTA 2015, pages 118{128. ACM, 2015.
[43] Stephen Horn. Guide to Standard Errors for Cross Section Estimates. Melbourne Institute
of Applied Economic and Social Research, 2004.
[44] Yao-Wen Huang, Fang Yu, Christian Hang, Chung-Hung Tsai, Der-Tsai Lee, and Sy-Yen
Kuo. Securing web application code by static analysis and runtime protection. In Proceed-
ings of the 13th International Conference on World Wide Web, WWW '04, pages 40{52,
New York, NY, USA, 2004. ACM.
[45] Sooman Jeong, Kisung Lee, Jungwoo Hwang, Seongjin Lee, and Youjip Won. Androstep:
Android storage performance analysis tool. In Software Engineering (Workshops), vol-
ume 13, pages 327{340, 2013.
[46] Sooman Jeong, Kisung Lee, Seongjin Lee, Seoungbum Son, and Youjip Won. I/o stack
optimization for smartphones. In Presented as part of the 2013 USENIX Annual Technical
Conference (USENIX ATC 13), pages 309{320, San Jose, CA, 2013. USENIX.
[47] R. Johari and P. Sharma. A survey on web application vulnerabilities (sqlia, xss) exploita-
tion and security engine for sql injection. In 2012 International Conference on Communi-
cation Systems and Network Technologies, pages 453{458, May 2012.
[48] Gregory M. Kapfhammer and Mary Lou Soa. A family of test adequacy criteria for
database-driven applications. In Proceedings of the 9th European Software Engineering Con-
ference Held Jointly with 11th ACM SIGSOFT International Symposium on Foundations of
Software Engineering, ESEC/FSE-11, page 98107, New York, NY, USA, 2003. Association
for Computing Machinery.
[49] Bill Karwin. SQL Antipatterns: Avoiding the Pitfalls of Database Programming. Pragmatic
Bookshelf, 1st edition, 2010.
[50] Oliver Kennedy, Jerry Ajay, Georey Challen, and Lukasz Ziarek. Pocket data: The need
for tpc-mobile. In Technology Conference on Performance Evaluation and Benchmarking,
pages 8{25. Springer, 2015.
[51] H. Khalid, E. Shihab, M. Nagappan, and A. E. Hassan. What Do Mobile App Users
Complain About? IEEE Software, 32(3):70{77, May 2015.
[52] Je-Min Kim and Jin-Soo Kim. Androbench: Benchmarking the storage performance of
android-based mobile devices. In Frontiers in Computer Education, pages 667{674. Springer,
2012.
81
[53] Won Kim. On optimizing an sql-like nested query. ACM Trans. Database Syst., 7(3):443{
469, September 1982.
[54] D. A. Kindy and A. K. Pathan. A survey on sql injection: Vulnerabilities, attacks, and pre-
vention techniques. In 2011 IEEE 15th International Symposium on Consumer Electronics
(ISCE), pages 468{471, June 2011.
[55] Barbara Kitchenham. Procedures for performing systematic reviews. 33, 08 2004.
[56] Patrick Lam, Eric Bodden, Ondrej Lhot ak, and Laurie Hendren. The soot framework
for java program analysis: a retrospective. In Cetus Users and Compiler Infastructure
Workshop (CETUS 2011), 2011.
[57] Ding Li and William G. J. Halfond. An Investigation Into Energy-Saving Programming
Practices for Android Smartphone App Development. In GREENS, 2014.
[58] Ding Li and William G. J. Halfond. Optimizing Energy of HTTP Requests in Android
Applications. In Proceedings of the Third International Workshop on Software Development
Lifecycle for Mobile (DeMobile) { Short Paper, September 2015.
[59] Ding Li, Shuai Hao, Jiaping Gui, and William G.J. Halfond. An empirical study of the
energy consumption of android applications. In Proceedings of the International Conference
on Software Maintenance and Evolution (ICSME), September 2014.
[60] Ding Li, Yingjun Lyu, Jiaping Gui, and William G. J. Halfond. Automated energy opti-
mization of http requests for mobile applications. In Proceedings of the 38th International
Conference on Software Engineering, ICSE 16, pages 249{260, 2016.
[61] Ding Li, Yingjun Lyu, Mian Wan, and William G. J. Halfond. String analysis for java
and android applications. In Proceedings of the 2015 10th Joint Meeting on Foundations of
Software Engineering, ESEC/FSE 2015, pages 661{672, 2015.
[62] Ding Li, Huyen Tran, Angelica, and G. J. Halfond, William. Making Web Applications
More Energy Ecient for OLED Smartphones. In ICSE, 2014.
[63] Mario Linares-V asquez, Gabriele Bavota, Carlos Bernal-C ardenas, Rocco Oliveto, Massimil-
iano Di Penta, and Denys Poshyvanyk. Mining energy-greedy api usage patterns in android
apps: an empirical study. In Proceedings of the 11th Working Conference on Mining Soft-
ware Repositories (MSR), 2014.
[64] Mario Linares-V asquez, Gabriele Bavota, Carlos Eduardo Bernal C ardenas, Rocco Oliveto,
Massimiliano Di Penta, and Denys Poshyvanyk. Optimizing Energy Consumption of GUIs in
Android Apps: A Multi-objective Approach. In Proceedings of the 2015 10th Joint Meeting
on Foundations of Software Engineering, ESEC/FSE 2015, pages 143{154. ACM, 2015.
[65] Yepang Liu, Chang Xu, and S.C. Cheung. Where has my battery gone? nding sensor
related energy black holes in smartphone applications. In PerCom, 2013.
[66] Yepang Liu, Chang Xu, S.C. Cheung, and Jian Lu. Greendroid: Automated diagnosis of
energy ineciency for smartphone applications. Software Engineering, IEEE Transactions
on, 40(9):911{940, Sept 2014.
[67] Yepang Liu, Chang Xu, and Shing-Chi Cheung. Characterizing and detecting performance
bugs for smartphone applications. In Proceedings of the 36th International Conference on
Software Engineering, ICSE 2014, pages 1013{1024, New York, NY, USA, 2014. ACM.
82
[68] Yepang Liu, Chang Xu, Shing-Chi Cheung, and Valerio Terragni. Understanding and de-
tecting wake lock misuses for android applications. In Proceedings of the 2016 24th ACM
SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016,
pages 396{409, New York, NY, USA, 2016. ACM.
[69] V. Benjamin Livshits and Monica S. Lam. Finding security vulnerabilities in java appli-
cations with static analysis. In Proceedings of the 14th Conference on USENIX Security
Symposium - Volume 14, SSYM'05, pages 18{18, Berkeley, CA, USA, 2005. USENIX Asso-
ciation.
[70] Yingjun Lyu, Ali Alotaibi, and William GJ Halfond. Quantifying the Performance Impact
of SQL Antipatterns on Mobile Applications. In 2019 IEEE International Conference on
Software Maintenance and Evolution (ICSME), pages 53{64, 2019.
[71] Yingjun Lyu, Jiaping Gui, Mian Wan, and William G. J. Halfond. An Empirical Study of
Local Database Usage in Android Applications. In 2017 IEEE International Conference on
Software Maintenance and Evolution (ICSME), pages 444{455, Sept 2017.
[72] Yingjun Lyu and William G. J. Halfond. SAND: A Static Analysis Framework for Detecting
SQL Antipatterns in Mobile Applications. In Submission.
[73] Yingjun Lyu, Ding Li, and William G. J. Halfond. Remove RATs from Your Code: Au-
tomated Optimization of Resource Inecient Database Writes for Mobile Applications. In
Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and
Analysis, ISSTA 2018, pages 310{321, 2018.
[74] A. Manjhi, C. Garrod, B. M. Maggs, T. C. Mowry, and A. Tomasic. Holistic query trans-
formations for dynamic web applications. In 2009 IEEE 25th International Conference on
Data Engineering, pages 1175{1178, March 2009.
[75] Michael Martin, Benjamin Livshits, and Monica S. Lam. Finding application errors and
security
aws using pql: A program query language. In Proceedings of the 20th Annual
ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and
Applications, OOPSLA '05, pages 365{383, New York, NY, USA, 2005. ACM.
[76] Thomas J McCabe. A complexity measure. IEEE Transactions on software Engineering,
(4):308{320, 1976.
[77] L. Meurice and A. Cleve. DAHLIA 2.0: A Visual Analyzer of Database Usage in Dynamic
and Heterogeneous Systems. In 2016 IEEE Working Conference on Software Visualization
(VISSOFT), pages 76{80, Oct 2016.
[78] Loup Meurice, Csaba Nagy, and Anthony Cleve. Static analysis of dynamic database usage
in java systems. pages 491{506, 06 2016.
[79] Robert B. Miller. Response time in man-computer conversational transactions. In Proceed-
ings of the December 9-11, 1968, Fall Joint Computer Conference, Part I, AFIPS '68 (Fall,
part I), pages 267{277, New York, NY, USA, 1968. ACM.
[80] Monsoon Solutions, Inc. Monsoon Power Monitor. http://www.msoon.com/
LabEquipment/PowerMonitor, 2017.
[81] Steven S. Muchnick. Advanced Compiler Design Implementation. Morgan Kaufmann, 1997.
[82] Brad A. Myers. The importance of percent-done progress indicators for computer-human
interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems, CHI '85, pages 11{17, New York, NY, USA, 1985. ACM.
83
[83] Nima Nikzad, Octav Chipara, and William G. Griswold. Ape: An annotation language and
middleware for energy-ecient mobile application development. In Proceedings of the 36th
International Conference on Software Engineering, ICSE 2014, pages 515{526, New York,
NY, USA, 2014. ACM.
[84] Gihwan Oh, Sangchul Kim, Sang-Won Lee, and Bongki Moon. Sqlite optimization with
phase change memory for mobile applications. In Proc. VLDB Endow., 2015.
[85] Oswaldo Olivo, Isil Dillig, and Calvin Lin. Detecting and exploiting second order denial-
of-service vulnerabilities in web applications. In Proceedings of the 22Nd ACM SIGSAC
Conference on Computer and Communications Security, CCS '15, pages 616{628, New York,
NY, USA, 2015. ACM.
[86] Hamza Ouarnoughi, Jalil Boukhobza, Pierre Olivier, Loic Plassart, and Ladjel Bellatreche.
Performance analysis and modeling of sqlite embedded databases on
ash le systems. De-
sign Automation for Embedded Systems, 17(3):507{542, 2013.
[87] Stack Over
ow. 2018 survey of stackover
ow. https://insights.stackover
ow.com/survey/2018,
2018.
[88] OWASP. Category:OWASP Enterprise Security API. https://www.owasp.org/index.
php/Category:OWASP_Enterprise_Security_API, August 2018.
[89] Abhinav Pathak, Abhilash Jindal, Y Charlie Hu, and Samuel P Midki. What is keeping
my phone awake?: characterizing and detecting no-sleep energy bugs in smartphone apps.
In MobiSys, 2012.
[90] Danilo Dominguez Perez and Wei Le. Generating predicate callback summaries for the
android framework. In Proceedings of the 4th International Conference on Mobile Software
Engineering and Systems, MOBILESoft 17, page 6878. IEEE Press, 2017.
[91] Dong Qiu, Bixin Li, and Zhendong Su. An Empirical Analysis of the Co-evolution of
Schema and Code in Database Applications. In Proceedings of the 2013 9th Joint Meeting
on Foundations of Software Engineering, ESEC/FSE 2013, pages 125{135. ACM, 2013.
[92] Siegfried Rasthofer, Steven Arzt, Ec Spride, Technische Universitt Darmstadt, and Eric
Bodden. A Machine-learning Approach for Classifying and Categorizing Android Sources
and Sinks. In Proceedings of the 21th Annual Network and Distributed System Security
Symposium (NDSS'14), San Diego, CA, February 2014.
[93] Ziv Scully and Adam Chlipala. A program optimization for automatic database result
caching. ACM SIGPLAN Notices, 52(1):271{284, 2017.
[94] Tushar Sharma, Marios Fragkoulis, Stamatia Rizou, Magiel Bruntink, and Diomidis Spinel-
lis. Smelly relations: Measuring and understanding database schema quality. 2018.
[95] Sooel Son, Kathryn S. McKinley, and Vitaly Shmatikov. Diglossia: detecting code injection
attacks with precision and eciency. In Proceedings of the 2013 ACM SIGSAC conference
on Computer & communications security, CCS '13, pages 1181{1192, New York, NY,
USA, 2013. ACM.
[96] StatCounter. Mobile operating system market share worldwide. https://gs.
statcounter.com/os-market-share/mobile/worldwide, 2020.
84
[97] Zhendong Su and Gary Wassermann. The essence of command injection attacks in web
applications. In Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, POPL '06, pages 372{382, New York, NY, USA,
2006. ACM.
[98] Juan M. Tamayo, Alex Aiken, Nathan Bronson, and Mooly Sagiv. Understanding the
behavior of database operations under program control. In Proceedings of the ACM Inter-
national Conference on Object Oriented Programming Systems Languages and Applications,
OOPSLA '12, pages 983{996, New York, NY, USA, 2012. ACM.
[99] Tobias. JSQLParser. https://github.com/JSQLParser/JSqlParser, 2017.
[100] Mathieu Verbaere, Ran Ettinger, and Oege de Moor. Jungl: A scripting language for
refactoring. In Proceedings of the 28th International Conference on Software Engineering,
ICSE '06, pages 172{181, New York, NY, USA, 2006. ACM.
[101] Mian Wan, Yuchen Jin, Ding Li, and William G. J. Halfond. Detecting display energy
hotspots in android apps. In Proceedings of the 8th IEEE International Conference on
Software Testing, Verication and Validation (ICST), April 2015. To Appear.
[102] Gary Wassermann, Carl Gould, Zhendong Su, and Premkumar Devanbu. Static checking of
dynamically generated queries in database applications. ACM Trans. Softw. Eng. Methodol.,
16(4), September 2007.
[103] Arthur Henry Watson, Dolores R Wallace, and Thomas J McCabe. Structured testing: A
testing methodology using the cyclomatic complexity metric, volume 500. US Department of
Commerce, Technology Administration, National Institute of , 1996.
[104] Cong Yan, Alvin Cheung, Junwen Yang, and Shan Lu. Understanding database performance
ineciencies in real-world web applications. In Proceedings of the 2017 ACM on Conference
on Information and Knowledge Management, CIKM '17, pages 1299{1308, New York, NY,
USA, 2017. ACM.
[105] Junwen Yang, Pranav Subramaniam, Shan Lu, Cong Yan, and Alvin Cheung. How not to
structure your database-backed web applications: a study of performance bugs in the wild.
2018.
[106] Junwen Yang, Cong Yan, Chengcheng Wan, Shan Lu, and Alvin Cheung. View-centric
performance optimization for database-backed web applications. In Proceedings of the 41st
International Conference on Software Engineering, ICSE '19, pages 994{1004, Piscataway,
NJ, USA, 2019. IEEE Press.
[107] Shengqian Yang, Haowei Wu, Hailong Zhang, Yan Wang, Chandrasekar Swaminathan, Da-
cong Yan, and Atanas Rountev. Static window transition graphs for android. Automated
Software Engg., 25(4):833873, December 2018.
[108] Shengqian Yang, Dacong Yan, Haowei Wu, Yan Wang, and Atanas Rountev. Static control-
ow analysis of user-driven callbacks in android applications. In Proceedings of the 37th
International Conference on Software Engineering - Volume 1, ICSE 15, page 8999. IEEE
Press, 2015.
[109] Sheng Yu and Shijie Zhou. A survey on metric of software complexity. In 2010 2nd IEEE
International Conference on Information Management and Engineering, pages 352{356.
IEEE, 2010.
85
[110] Mu Zhang and Heng Yin. AppSealer: Automatic Generation of Vulnerability-Specic
Patches for Preventing Component Hijacking Attacks in Android Applications. In Proceed-
ings of the 21th Annual Network and Distributed System Security Symposium (NDSS'14),
San Diego, CA, February 2014.
[111] Yixue Zhao, Marcelo Schmitt Laser, Yingjun Lyu, and Nenad Medvidovic. Leveraging
program analysis to reduce user-perceived latency in mobile applications. In Proceedings of
the 40th International Conference on Software Engineering, ICSE 18, pages 176{186, 2018.
[112] Jingren Zhou, Per-Ake Larson, Johann-Christoph Freytag, and Wolfgang Lehner. Ecient
exploitation of similar subexpressions for query processing. In Proceedings of the 2007 ACM
SIGMOD International Conference on Management of Data, SIGMOD '07, pages 533{544,
New York, NY, USA, 2007. ACM.
[113] Yuhao Zhu and Vijay Janapa Reddi. Greenweb: Language extensions for energy-ecient
mobile web computing. In Proceedings of the 37th ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation, PLDI '16, pages 145{160, New York, NY,
USA, 2016. ACM.
86
Abstract (if available)
Abstract
Local databases underpin important features in many mobile applications. However, bad programming practices of using database operations, also called SQL antipatterns, can introduce high resource consumption, affect the responsiveness, and undermine the security of a mobile application. In my dissertation, I designed and evaluated a framework for detecting SQL antipatterns effectively and efficiently in mobile apps. The framework, SAND, abstracts away the interactions between the application and the database. It provides a language, SANDLess, which is used to query abstractions of application-database relationships and specify SQL antipattern detection tasks. I addressed two key challenges when designing this framework. The first challenge was to determine what kinds of application-database relationships should be abstracted. The second challenge was to determine how to effectively and efficiently extract the abstractions from the database access code in mobile apps. To tackle the first challenge, I collected a full spectrum of SQL antipatterns and their detection approaches via a systematic literature review. I analyzed each of these approaches and identified the application-database relationships that were analyzed by these approaches. To address the second challenge, I developed a set of static analysis techniques that were specialized at analyzing the database access code in mobile apps. These static analysis techniques focused on handling the complications of string-based SQL statements, loops, and other kinds of application-database relationships that were required by the abstractions. In the evaluation, I showed that the detection rules of all eleven SQL antipatterns reported in the literature can be compactly expressed in SANDLess with low code complexity. These SANDLess detectors, which were built on top of abstractions of application-database relationships, identified thousands of instances of SQL antipatterns in marketplace mobile apps with high precision and fast analysis speed. Overall, these results are positive and indicate that my framework can detect all kinds of SQL antipatterns effectively and efficiently in mobile apps.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Automatic detection and optimization of energy optimizable UIs in Android applications using program analysis
PDF
Utilizing user feedback to assist software developers to better use mobile ads in apps
PDF
Detection, localization, and repair of internationalization presentation failures in web applications
PDF
Automated repair of layout accessibility issues in mobile applications
PDF
Energy optimization of mobile applications
PDF
Automated repair of presentation failures in Web applications using search-based techniques
PDF
Toward understanding mobile apps at scale
PDF
Constraint-based program analysis for concurrent software
PDF
Side-channel security enabled by program analysis and synthesis
PDF
Reducing user-perceived latency in mobile applications via prefetching and caching
PDF
Architectural evolution and decay in software systems
PDF
Proactive detection of higher-order software design conflicts
PDF
Analysis of embedded software architecture with precedent dependent aperiodic tasks
PDF
Studying malware behavior safely and efficiently
PDF
Improving efficiency, privacy and robustness for crowd‐sensing applications
PDF
Assessing software maintainability in systems by leveraging fuzzy methods and linguistic analysis
PDF
Reducing inter-component communication vulnerabilities in event-based systems
PDF
Techniques for methodically exploring software development alternatives
PDF
Machine learning for efficient network management
PDF
Detecting anomalies in event-based systems through static analysis
Asset Metadata
Creator
Lyu, Yingjun
(author)
Core Title
Detecting SQL antipatterns in mobile applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
09/02/2020
Defense Date
04/30/2020
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
antipatterns,database,Energy,mobile applications,OAI-PMH Harvest,performance,Security,SQL,static analysis
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Halfond, William G.J. (
committee chair
), Deshmukh, Jyotirmoy Vinay (
committee member
), Gupta, Sandeep (
committee member
), Medvidovic, Nenad (
committee member
), Wang, Chao (
committee member
)
Creator Email
winsonlyu@gmail.com,yingjunl@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-367069
Unique identifier
UC11666374
Identifier
etd-LyuYingjun-8936.pdf (filename),usctheses-c89-367069 (legacy record id)
Legacy Identifier
etd-LyuYingjun-8936.pdf
Dmrecord
367069
Document Type
Dissertation
Rights
Lyu, Yingjun
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
antipatterns
database
mobile applications
SQL
static analysis