Reducing User-Perceived Latency in Mobile Applications via
Prefetching and Caching
by
Yixue Zhao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2020
Copyright 2020 Yixue Zhao
Acknowledgements
Pursuing a Ph.D. is the best decision I have ever made so far. It certainly is not an easy journey.
There were challenges, confusion, self-doubt, and even tears and breakdowns that made me think
I would never see the end of it. But I am glad that I did. Only in the end, when you reflect back, will
you notice that all the ups and downs were truly necessary and precious. Those lessons I learned
throughout my Ph.D. are invaluable not only for my academic career, but most importantly, are
beneficial for my life in many ways. It has completely reshaped my core values, and challenged
me to rethink the meaning of my life.
This is the end of the program, but also a new and exciting beginning of my life where I
am armed with both courage and humility. Courage, as in there were so many challenges that
seemed impossible at the time, but there were always possibilities and solutions at the same time
as well. After experiencing this pattern over and over again, I am filled with courage when facing
challenges, and will always keep an open mind to enjoy the fun process of problem-solving. This
gives me so much strength and freedom to embrace all the possibilities that I could only dream
of before. Humility, as in I was exposed to the smartest people all around the world who do all
kinds of amazing work with the greatest passion and dedication. It shows me how limited my
current knowledge and beliefs are, how tremendous the universe of knowledge is, and how much
deeper you can go to advance knowledge in our limited lifetime. This combination of courage and
humility will always guide me to go as far as I can with confidence, while acknowledging my own
limitations to always keep an open mind, to learn from everyone and every experience (good or
bad), and to focus on my most important goals.
A Ph.D. is certainly not something I could do by myself. There are so many amazing people who
helped me along the way. Whether big or small, constant support or simply one conversation,
all of those experiences are so meaningful and have shaped my unique Ph.D. journey. I cannot
thank you all enough for the long-lasting impact you had on me that carried me over to where I
am right now, and will always be part of me wherever I go.
First of all, I have to thank my family, especially my parents Jing Chen and Jianwei Zhao.
There is no way I could finish my Ph.D. without your unconditional support and all the freedom
you have been giving me throughout my life. The degree is half yours. When I was in my darkest
times, besides the constant encouragement you always provided, you never imposed your personal
will on me, but gave me complete freedom to make decisions in my own journey. Even when it
came to quitting the program, you did not try to give me suggestions on any particular paths,
but really tried to listen, understand, and analyze the situation with me, trusted my judgement
and decisions, and provided as much support as you could. "If you want to continue, go for it
and we know you can. If you want to stop, what's better than having you home?" That is your
parenting philosophy that I am forever grateful for. It has always given me such a strong
foundation, so that I am not afraid of taking risks and choosing what I truly value.
I also would like to thank my committee members: Nenad Medvidović, Chao Wang, Bhaskar
Krishnamachari, William G.J. Halfond, and Leana Golubchik. Your valuable feedback and
insightful questions have helped me to fill the gaps in my dissertation that I had missed, which have
significantly improved my dissertation work. Additionally, I want to give a special thanks to G.J.,
whose program analysis class raised my interest in the subject, and directly helped me in shaping
my first research project. I remember taking the opportunities at his office hours to discuss my
premature research ideas on program analysis, and G.J. was always very patient and inspiring.
His expertise and tough questions have tremendously helped me in the process, and have guided
me to go deeper in this area.
I must dedicate a paragraph to my advisor Nenad Medvidović, although words are far from
enough to express my deepest gratitude towards all the guidance he has provided throughout the
past six years. I am so glad that I had Neno to guide me through this challenging but worthwhile
research journey, raising me from a research "infant" to an independent "adult". Without Neno,
I would not be where I am right now, and would not be able to imagine all the amazing things
that I can do in my life. Advisors are like parents in some way. Starting as a research infant,
Neno raised me with so much patience when I could not understand scientific papers, proposed
random engineering ideas that had no research value, and could not even articulate my ideas due
to my poor English. As I grew older, just like any parent, Neno started challenging me and
"scolding" me when I did stupid things. When I wrote a 7-line sentence, Neno said: "I'm sorry
Yixue. I'll have to kill you. I'm sure the police will understand. We'll remember all the good
things you did to this world." His humor not only brought so many fun memories to my Ph.D.,
but also helped me to remember the lessons I learned from him more vividly. All the "scolding"
pushed me further than I could ever have imagined. Throughout the process, not
only have I learned invaluable lessons on research skills, but, most importantly, the mindset for
conducting high-quality research. At the end of the day, it is really not about one paper, but it
is the reputation and credibility that can carry you further. Furthermore, I really appreciate the
freedom Neno has given me to explore all kinds of research areas that I am interested in. This
has opened the doors of many possibilities for me to tackle complicated, interdisciplinary, and
important problems that I am passionate about. This is also something I truly admire about
Neno. Even if it is an area that he does not have much past experience in, he is still able to provide
valuable advice and constructive feedback to produce high-quality work, and is never afraid of
the risks and challenges down the road. He puts his students' research interests first, and passes
on his wisdom as much as he can as a guide, not as a boss, to really help us achieve our own
goals. At the same time, he holds a high standard, one that sometimes even seems impossible,
always trusting our abilities and encouraging us to reach our full potential. All the challenges Neno gave
me sometimes resulted in "deadline magic" and sometimes ended in a missed deadline.
No matter what the result is, the process is the most valuable experience that one could ever
ask for in a Ph.D. program. Finally, I must mention how accessible Neno is when we need him.
Whenever I send an email to ask for feedback, or report research progress, or get stuck and not
sure what to do, Neno always replies so fast! If his reply needs more interaction, he would meet
with me in the next few days or even on the same day sometimes. And his advice is always very
open-ended, not giving you direct answers, but asking you the right questions to figure it out in
your own way. This would not be that impressive if he only advised one Ph.D. student full-time, but
considering how many students he has and all the other teaching and service duties he takes on,
how many meetings he has already scheduled, this fast feedback is simply like magic. During
deadline seasons, we would move to Skype and he would always give us instant responses or Skype calls.
With all of our panic and breakdowns going on, Neno could always keep a clear head and calm us
down, leading us in the right direction. Neno even stayed up with us until the very end to help as
much as he could. One time, on the day of the ICSE deadline, he even had a migraine, but he went right back to
the paper once he felt better and sprinted with us until the very end (the ICSE deadline is 5am in
our time zone!). Words simply are not enough to express my gratitude towards what Neno did,
and he really does not have to do any of this. But that is Neno, who got tenure a long time ago
and has way more publications than I could count, but he still tries his best to help his students'
publications with all the expertise and wisdom that he could give. Thank you, Neno. I hope I can
be a good "parent" like you, passing on the precious lessons I have learned from you to inspire
more people, just as you did for me.
I will never forget the help I received from so many wonderful people during my first re-
search project. Without much research experience, shaping my first project was probably the most
challenging task I faced in the Ph.D. program. Without your perspectives, feedback, and
encouragement, I would not have had the strength to keep going until the finish line (the first project
took a long time and was submitted to ICSE 2018 at the end of my third year). First, I would
like to thank Yuhao Zhu, who was a stranger to me at the time and whom I had never met in person.
I was inspired by his work and sent out an email to ask his opinion on my premature research
idea. Surprisingly, not only did Yuhao give me insightful feedback and various perspectives to
tackle the problem, but he also constantly helped me (a stranger to him!) to shape the idea
into a top-tier paper. Especially in the beginning of writing my first research paper, I did not
have much experience and found it very difficult to meet my advisor's standards. Yuhao, a se-
nior Ph.D. student at that time (now a professor), completely understood all my struggles and
directly guided me through the process from a peer's perspective with great patience and en-
couragement. I will always remember the inspiring phone calls we had to discuss the problems
we were interested in, to talk about our future research visions, and all the advice and wisdom
you consistently shared and still share with me. I feel so lucky to get to know you and learn
from you ever since we were connected by a simple email. I cannot thank you enough for all the
mentoring and encouragement you have provided throughout my Ph.D. life. I will always cherish
those memories and our friendship. Furthermore, I have also received constructive feedback and
helpful advice from Wyatt Lloyd, Ding Li, my roommate Yi-Ching Chiu, and my labmates at
that time: Jae young Bang, Gholamreza Safi, Youn Kyu Lee, Duc Le, and Arman Shahbazian. I
am also forever grateful for the encouragement and support I constantly received from my family
and friends outside of my academic circle: Jing Chen, Jianwei Zhao, Jianmin Zhao, Qing Chang,
Beijia Li, and Chihuang Liu. Last but not least, this project would not have been possible without the
direct help from my co-authors: Marcelo Schmitt Laser, Yingjun Lyu, and, of course, my advisor
Nenad Medvidović.
Throughout my Ph.D. life, I was lucky enough to be surrounded by amazing labmates with whom I
could always share my ups and downs, have fun conversations, and enjoy delicious food. My
labmates (Adriana Sejfia, Daye Nam, Marcelo Schmitt Laser, Nikola Lukic, Saghar Talebipour,
Chenggang Li, Suhrid Karthik, Duc Le, Arman Shahbazian, Youn Kyu Lee, Gholamreza Safi, Jae
young Bang) and my \neighbor" labmates (Brandon Paulsen, Marc Juarez, Meng Wu, Shengjian
(Daniel) Guo, Jingbo Wang, Yannan Li, Zunchen Huang, Chungha Sung, Sonal Mahajan, Ding
Li, Abdulmajeed Alameer, Mian Wan, Yingjun Lyu, Jiaping Gui, Negarsadat Abolhasani, Paul
Chiou, Ali Alotaibi, Sasha Volokh) have brought so many fun memories to my Ph.D. journey.
Especially, I can never forget the wonderful memories during the Europe trip I shared with Daye
Nam, and the climbing and rose milk tea I enjoyed with Brandon Paulsen, Adriana Sejfia, and
Marc Juarez.
I am also very fortunate to have the chance to share the lab space with other wonderful people,
even just for a short time period. In my first year, Alessandro Garcia, Leonardo da Silva Sousa,
Diego Cedrim, and Roberto Oliveira visited our research group from Brazil. Although there
was some "culture shock" involved in the beginning, we developed great friendships over time.
I will always remember the bowling we played, the delicious food we had, and what I learned
from you on how dancing should be done. A special thank you to Leo, who brought me to my
very first ICSE experience, including the deadline season, the acceptance excitement, and the fun
conference experience (my very first ICSE in 2016!). I learned so much from you when helping with
the project you led, and I am always grateful for all the help I received from you on my project.
Moreover, I had the great opportunity to visit UCL in London during the summer of 2019 for
two months. Not only did I share the lab space, but also coffee, laughter, and fun conversations
and activities with Jie Zhang, Rebecca Moussa, Giovani Guizzo, Carlos Gavidia, David Kelly,
Iason Papapanagiotakis-Bousy, Profir-Petru Partachi, Leonid Joffe, Vali Tawosi, Maria Kechagia,
Hector D Menendez, Aymeric Blot, Zheng Gao, and Bill Langdon. I will always remember the
"camel" conversations, the movie nights, the yoga classes, and the ice skating, which made my
visit so delightful.
None of my dissertation projects were a one-woman job. I simply cannot achieve what I
have right now without the support from my amazing collaborators: Mark Harman, Justin Chen,
Adriana Sejfia, Marcelo Schmitt Laser, Jie Zhang, Federica Sarro, Yingjun
Lyu, Paul Wat, Haoyu Wang, Siwei Yin, and Kevin Moran. I really appreciate all the help
and guidance you have provided along the way, and I look forward to future opportunities to
collaborate again.
I am also grateful for the sustainable culture in our research group. Although our time in the lab
under our shared advisor did not overlap, I have received invaluable advice from my academic
"siblings", which directly helped me in my career path. Yuriy Brun, Sam Malek, and Joshua
Garcia have always been willing to share what they can offer.
During my Ph.D., I had the opportunities to attend many research conferences where I met
wonderful peers and inspirational role models. I am grateful to Tianyi Zhang, whom I met at
my very first conference when I was a research "infant". Being ahead of me in the program,
Tianyi always set a great example for me to follow, from how many top-tier papers one can have,
which really motivated me when I had none, to how to choose impactful research topics when I
was only focused on getting my first project done. The discussions with him have always been
insightful and helped me grow. I have also met so many amazing women who inspired me greatly
with their passion and dedication: Bara Buhnova (also my ICSE roommate and my tour guide
in Prague!), Grace Lewis, Ladan Tahvildari, Myra Cohen, Christine Julien, Gail Murphy, Amy
J. Ko, Jie Zhang, Lili Wei, Reyhaneh Jabbarvand, Denae Ford Robinson, Brittany Johnson, and
many more! Especially, as my peers at a similar stage, Jie Zhang and Lili Wei have always been
so motivating and inspiring for me. I enjoyed every conversation we have had, from research
discussions to encouragement and support. Although we are still at an early stage in our careers,
I am sure you will do an amazing job and I cannot wait to see what the future holds for you. I
am so grateful to have you grow together with me in this academic journey.
Finally, a special thanks to all my friends who attended my defense and sent positive vibes
from all around the world. Your presence brought me so much support, and I fully enjoyed
every minute of it.
This is not the end, but a new beginning. I know I will always be OK no matter what the
future holds, because of every single one of you. A heartfelt thank you to you all.
Table of Contents
Acknowledgements ii
List of Tables xi
List of Figures xii
Abstract xiv
Chapter 1: Introduction 1
1.1 Insights and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Background 9
2.1 HTTP Protocol and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 HTTP Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 HTTP Libraries Used in Mobile Apps . . . . . . . . . . . . . . . . . . . . . 10
2.2 Content-based Prefetching and Caching . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Mobile App Code Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Content-based Approach Example . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 History-based Prefetching and Caching . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 History-based Approach Workflow . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 History-based Prediction Algorithms . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 3: Literature Review of Prefetching and Caching 16
3.1 Prefetching and Caching in Browser Domain . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Content-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 History-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Content-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 History-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 4: Empirical Study of Prefetching and Caching Opportunities 20
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.1 Prefetchability of HTTP Requests . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.2 Cacheability of HTTP Responses . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2.3 Identifying Truly Redundant HTTP Requests . . . . . . . . . . . . . . . . . 22
4.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 Data Collection Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.2 Initial Set of Subject Apps . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 App Instrumentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.4 App Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.5 Final Set of Subject Apps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4.1 Prefetchability of HTTP Requests . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 Cacheability of HTTP Responses . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.3 Identifying Truly Redundant HTTP Requests . . . . . . . . . . . . . . . . . 33
4.5 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 5: Content-based Approach PALOMA 40
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 String Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.2 Callback Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.3 App Instrumentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.4 Runtime Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Microbenchmark Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5.1 Microbenchmark Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5.2 Microbenchmark Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.5.3 Microbenchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.6 Third-Party App Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 6: History-based Approach HiPHarness 63
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Study Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 The HiPHarness Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.1 HiPHarness's Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.3.2 HiPHarness's Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.4 Empirical Study Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.4.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.5 Results and Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.5.1 RQ1 – User-Request Repetitiveness . . . . . . . . . . . . . . . . . . . . . . 73
6.5.2 RQ2 – Prediction-Algorithm Effectiveness . . . . . . . . . . . . . . . . . . . 75
6.5.3 RQ3 – Training-Data Size Reduction . . . . . . . . . . . . . . . . . . . . . . 78
6.5.4 Broader Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.5.5 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 7: Evaluation Framework FrUITeR 87
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2 Background on Test Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.2.1 Motivating Example and Terminology . . . . . . . . . . . . . . . . . . . . . 90
7.2.2 Strategies Explored to Date . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2.3 Existing Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3 FrUITeR's Principle Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4 FrUITeR's Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.4.1 FrUITeR's Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.4.2 FrUITeR's Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.4.3 FrUITeR's Baseline Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.5 FrUITeR's Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.5.1 Modularizing Existing Techniques . . . . . . . . . . . . . . . . . . . . . . . 105
7.5.2 FrUITeR's Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.5.3 FrUITeR's Implementation Artifacts . . . . . . . . . . . . . . . . . . . . . . 110
7.6 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.6.1 GUI Mapper Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.6.2 Insights and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 8: Related Work 117
8.1 Prefetching and Caching in Mobile Apps . . . . . . . . . . . . . . . . . . . . . . . . 117
8.2 Mobile App Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Chapter 9: Conclusion 121
9.1 Broader Impact and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.1.1 Prefetching and Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.1.2 Software Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.1.3 Open Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
References 130
List of Tables
4.1 App information for each category among initial subjects . . . . . . . . . . . . . . 24
4.2 App information for each category among final subjects . . . . . . . . . . . . . . . 28
5.1 Results of PALOMA's evaluation using MBM apps covering the 25 cases dis-
cussed in Section 5.5.2. "SD", "TP", and "FFP" denote the runtimes of the three
PALOMA instrumentation methods. "Orig" is the time required to run the orig-
inal app. "Red/OH" represents the reduction/overhead in execution time when
applying PALOMA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Results of PALOMA's evaluation across the 32 third-party apps. . . . . . . . . . . 61
6.1 Requests per user . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Repeated requests across users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Average training-data size reduction after applying the MOR, MAD, and MSD
pruning strategies, and the resulting Static Precision (SP), Static Recall (SR), and
Dynamic Recall (DR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4 Pairwise comparison result summary . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1 Fidelity metrics as used in AppFlow [91], CraftDroid [113], ATM [48], GTM [49],
and FrUITeR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Summary information of benchmark apps. . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Benchmark test cases in shopping (TS) and news (TN) categories. . . . . . . . . . 109
List of Figures
4.1 Our data collection workflow. The App Profiling, App Instrumentation, App Test-
ing, and Send GET Requests components perform automated tasks. . . . . . . . . 23
4.2 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of GET requests in apps across the 33 app categories. Apps in 7 categories
had maximums higher than 150 (numbers displayed beside the corresponding bars).
Note that the average for app category 3 is also higher than 150, and thus not shown. 29
4.3 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of GET requests in apps across the 33 app categories. . . . . . . . . . . 29
4.4 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of Expires headers in each app category. Apps in 11 categories had
maximums higher than 30 (numbers displayed beside or above the corresponding
bars). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of the Expires headers for each app category. . . . . . . . . . . . . . . 32
4.6 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of trusted Expires headers in each app category. . . . . . . . . . . . . 32
4.7 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of Cache-Control headers in each app category. Apps in 14 categories had
maximums higher than 30 (numbers displayed beside or above the corresponding
bars). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.8 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of Cache-Control headers in each app category. . . . . . . . . . . . . 33
4.9 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of trusted Cache-Control headers in each app category. . . . . . . . . 33
4.10 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of ostensibly redundant requests in each app category. . . . . . . . . . 34
4.11 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
expiration times for the redundant requests in each app category. . . . . . . . . . . 34
4.12 Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of truly redundant requests in each app category. . . . . . . . . . . . . 35
5.1 CCFG extracted from Listing 2.3 by GATOR [148, 183] . . . . . . . . . . . . . . . 43
5.2 Relationship of PALOMA's Terminology . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 High-level overview of the PALOMA approach . . . . . . . . . . . . . . . . . . . . 44
5.4 PALOMA's detailed workflow. Different analysis tools employed by PALOMA and
artifacts produced by it are depicted, with a distinction drawn between those that
are extensions of prior work and newly created ones. . . . . . . . . . . . . . . . . . 46
5.5 The 24 test cases covering all configurations involving dynamic values. The hori-
zontal divider denotes the Trigger Point, while the vertical divider delimits the two
dynamic values. The circles labeled with "DS_{i,j}" are the locations of the Definition
Spots with respect to the Trigger Point. "H" denotes a hit, "NH" denotes a non-hit,
and "NP" denotes a non-prefetchable request. . . . . . . . . . . . . . . . . . . . . . 58
6.1 HiPHarness's workflow for assessing history-based prediction models . . . . . . . . 66
6.2 Average values of the three accuracy metrics across the four algorithms . . . . . . 76
6.3 Resource consumption of the four algorithms with ten sets of different-sized models
trained on a mobile device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Average accuracy values across the DG, PPM, MP, and Naïve algorithms after
applying MOR (left), MAD (middle), and MSD (right), overlaid on top of the
original accuracy values from Figure 6.2 . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5 A schematic of the Sliding Window approach with window size 5, sliding distance
1, and training ratio 0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6 The averages for Static Precision (top), Static Recall (middle), and Dynamic Recall
(bottom). Y-axes capture the metrics' values; X-axes indicate the different data
points corresponding to window sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.1 Sign-in tests for Wish (a1) and Etsy (b1–b3). . . . . . . . . . . . . . . . . . . . . . 90
7.2 Overview of FrUITeR's automated workflow. . . . . . . . . . . . . . . . . . . . . . 99
7.3 Comparison of average precision and recall. . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Comparison of average accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.5 Comparison of average effort and reduction. . . . . . . . . . . . . . . . . . . . . . . 114
Abstract
Prefetching and caching is a fundamental approach to reducing user-perceived latency, and has been
shown effective in various domains for decades. However, its application to today's mobile apps
remains largely under-explored. This is an important but overlooked research area since mobile
devices have become the dominant platform, and this trend is reflected in the billions of mobile
devices and millions of mobile apps in use today. At the same time, user-perceived latency has
been shown to have a large impact on mobile-user experience and can cause significant economic
consequences.
In this dissertation, I aim to fill this gap by providing a multifaceted solution to establish
the foundation for exploring prefetching and caching in the mobile-app domain. To that end,
my dissertation consists of four major elements. As a first step, I conducted an extensive study
to investigate the opportunities for applying prefetching and caching techniques in mobile apps,
providing empirical evidence on their applicability and distilling insights to guide future tech-
niques. Second, I developed PALOMA, the first content-based prefetching technique for mobile
apps using program analysis, which has achieved significant latency reduction with high accuracy
and negligible overhead. Third, I constructed HiPHarness, a tailorable framework for investigat-
ing history-based prefetching in a wide range of scenarios. Guided by today's stringent privacy
regulations that have limited the access to mobile-user data, I further leveraged HiPHarness to
conduct the first study on history-based prefetching with "small" prediction models, demonstrat-
ing its feasibility on mobile platforms and, in turn, opening up a new research direction. Finally, to
reduce the manual effort required in evaluating prefetching and caching techniques, I have devised
FrUITeR, a customizable framework for assessing test-reuse techniques, in order to automatically
select suitable test cases for evaluating prefetching and caching techniques without real users'
engagement as required previously.
Chapter 1
Introduction
Mobile devices have become the dominant computing platform recently, and this trend is reflected
in the billions of mobile devices and millions of mobile apps in use today [99]. At the same
time, user-perceived latency that is exhibited by mobile apps and exposed to end-users remains
a significant problem. It has been shown to have a large impact on user experience and severe
economic consequences [172]. For example, a recent report shows that the majority of mobile users
would abandon a transaction or even delete an app if the response time of a transaction exceeds
three seconds [42]. Furthermore, Google estimates that an additional 500ms delay per transaction
would result in up to 20% loss of traffic, while Amazon estimates a 1% sales loss with 100ms extra
delay [172]. A previous study showed that the main cause of user-perceived latency is network
transfer since mobile apps spend the majority of their time fetching data from the Internet [147].
A compounding factor is that mobile devices rely on wireless networks, which can exhibit high
latency, intermittent connectivity, and low bandwidth [89], compared to stationary desktops.
Network latency¹ has been studied in distributed systems for decades. Prefetching ("get
it before you need it") and caching ("store it for quick access when you need it") have been
shown to be effective since they can bypass the performance bottleneck (network speed) and
allow near-immediate responses from a local store [89]. While prefetching makes use of built-in
caching schemes, caching-only techniques are also widely employed [188, 141].
Existing prefetching and caching techniques can be divided into three categories based on the
location where they are controlled and performed [37]:
1. Server-based approaches analyze all the network requests sent to a specific server to decide
how to prefetch and cache the network responses [134, 151, 68, 150].
¹ In the context of mobile communication, we define latency as the response time of an HTTP request.
2. Proxy-based approaches deploy proxies that act as intermediaries between the client and
the server to prefetch and cache for a group of users [72, 117, 92, 135].
3. Client-based approaches rely on the information from the client (e.g., browsers, mobile
apps) to prefetch and cache on the client-side for individual users [192, 172, 169, 127].
A key observation that motivates our research is that, in the mobile-app domain, client-based
prefetching and caching techniques can be very effective, and they are complementary to proxy-
or server-based approaches. First, the prefetched responses are stored locally on the client (i.e.,
mobile devices) and are ready to be used immediately without the network's involvement. Second,
individual users tend to have established behavioral patterns in using mobile apps, which makes
them well-suited to prefetching and caching techniques [84]. Furthermore, most of today's mobile apps
depend extensively on heterogeneous third-party servers or proxies over which app developers and
researchers do not have control [172], making client-based solutions especially attractive.
Thus, this dissertation specifically focuses on client-based prefetching and caching tech-
niques in order to address the network latency problem in the mobile-app domain. Given the
existing foundation consisting of a large body of research on traditional browsers, prefetching
and caching in mobile browsers quickly became a crowded research area [169, 118, 172, 127, 166].
However, the research on mobile apps remains surprisingly unexplored, despite the fact that mo-
bile users spend more than 80% of their time in mobile apps [58]. For example, CacheKeeper [188]
made an initial effort to study the redundant web traffic in mobile apps and proposed an OS-level
caching service for HTTP requests. Without proposing any prefetching techniques, CacheKeeper
can only show benefits for a limited number of HTTP requests that are repeatedly sent. Early-
Bird [171] proposed a social network content prefetcher to reduce user-perceived latency, but the
technique is limited to the social-network domain. Another related research thread focuses on
balancing Quality-of-Service (QoS) factors, such as energy consumption and cellular data usage, to
suggest "how much" of the network requests to prefetch and cache [89, 47]. This dissertation, in
contrast, has a complementary focus: addressing the challenges of "what" and "when" to
prefetch and cache.
Specifically, this dissertation explores prefetching and caching in the largely-overlooked mobile-
app domain, via four major components:
1. We performed a literature review of the large body of work in the browser domain to form
an in-depth understanding of existing prefetching and caching techniques, and to determine
the extent to which they can be applied to the mobile-app domain;
2. We conducted a large-scale empirical study to explore the prefetching and caching opportu-
nities in existing mobile apps;
3. We developed a set of novel prefetching and caching techniques to reduce network latency
in mobile apps;
4. We devised a tailorable evaluation framework to select suitable test cases in order to assess
different prefetching and caching techniques under the same baseline.
1.1 Insights and Hypotheses
This section describes the insights and their corresponding hypotheses that guided the four major
elements in the dissertation.
Insight 1
There should be a large opportunity for prefetching and caching
HTTP requests in mobile apps.
Hypothesis 1.1
The majority of all the HTTP requests should be prefetchable
in mobile apps.
Hypothesis 1.2
The majority of all the prefetchable HTTP requests should be
cacheable in mobile apps.
Insight 1 is the overarching insight motivating this dissertation, grounded in the fact that mobile apps
spend the majority of their time fetching data from the Internet [147]. We thus formulated two
hypotheses to explore the pervasiveness of prefetchable requests (Hypothesis 1.1) and cacheable
requests (Hypothesis 1.2).
Moving forward, I aimed to develop novel prefetching and caching solutions that are suitable for
mobile platforms in particular, covering the two principal categories (content-based and history-
based) that have been shown effective in the traditional browser domain.
The content-based approach is guided by two key insights: "user think time" provides
sufficient time to perform prefetching operations in the background (Insight 2a), and mobile app
code contains the information needed to address "what" and "when" to prefetch (Insight 2b).
Insight 2a
"User think time" in mobile apps provides opportunities for
prefetching and caching HTTP requests in the background with
negligible side-effects to the end-users.
Insight 2b
Mobile app code has sufficient information to guide "what"
and "when" to prefetch certain HTTP requests with high accuracy.
Hypothesis 2.1
Program analysis can be leveraged to identify "what" and
"when" to prefetch certain HTTP requests in mobile apps with
high accuracy.
Hypothesis 2.2
A program analysis-based prefetching technique for mobile apps
can achieve an average latency reduction of nearly 100% for
prefetchable HTTP requests with negligible overhead.
Thus, we formulated Hypothesis 2.1 and Hypothesis 2.2 to explore the possibility of employ-
ing program analysis techniques to prefetch and cache HTTP requests, with significant latency
reduction, high accuracy, and low overhead.
Insight 3a
History-based prefetching and caching techniques from the
browser domain can be directly applied to the mobile-app
domain to predict future HTTP requests.
Insight 3b
Mobile users tend to use mobile apps for short time periods
and have established behavior patterns, thus their behaviors
should be predictable with a relatively small amount of
historical data.
Hypothesis 3
Existing history-based techniques can be applied to mobile
platforms with small amounts of historical data, and achieve
accuracy that is no lower than the accuracy achieved using
larger amounts of historical data.
The history-based approach is guided by the insights we learned from our literature review:
history-based prefetching from the traditional browser domain can be directly applied to
mobile platforms (Insight 3a), and relatively small amounts of historical data might be sufficient
to predict mobile-user behaviors (Insight 3b). To that end, we formulated Hypothesis 3 to study
the suitability of existing history-based techniques on mobile platforms, with a specific focus
on leveraging smaller amounts of historical data than the conventional strategies.
Finally, the significant manual effort involved in evaluating prefetching and caching techniques
motivated us to explore solutions that can automatically generate suitable test cases for the
evaluation without real users' engagement. Our solution is guided by Insight 4a and Insight 4b:
similar mobile apps share similar test cases that can potentially be reused to reduce the
manual effort of constructing suitable test cases for evaluating prefetching and caching techniques.
Therefore, we formulated Hypothesis 4.1 and Hypothesis 4.2, aiming to investigate automated
Insight 4a
End-users tend to use mobile apps in similar ways when the
apps share similar functionalities and User Interfaces (UIs).
Insight 4b
Reusing existing test cases to generate new test cases can
reduce the manual effort in evaluating prefetching and
caching techniques.
Hypothesis 4.1
A test generation technique can automatically generate
realistic test cases by reusing existing test cases from
other mobile apps based on UI similarities.
Hypothesis 4.2
A framework can evaluate different test-reuse techniques
under the same baseline to identify suitable test cases
automatically, and thus reduce the manual effort in
evaluating prefetching and caching techniques.
solutions that can produce suitable test cases for evaluating prefetching and caching techniques
with reduced manual eort.
1.2 Dissertation Overview
This section overviews the four major components of the dissertation that are guided by the
insights and hypotheses discussed in Section 1.1.
(1) Literature Review: By studying existing prefetching and caching techniques in the
browser domain, we discovered that client-based techniques can be classified into two principal
categories [37]: content-based approaches predict future network requests by analyzing Web page
content, in order to anticipate "sub-resources" (e.g., images, JavaScript files) that will be needed
by a Web page; history-based approaches predict future requests by analyzing past requests. In the
mobile-app domain, devising a content-based approach requires the analysis of mobile apps instead
of the Web page content. This is more challenging in general, due to the complexity of mobile
app programs, compared to the browser domain where the content dependency is embedded
in the well-structured Web documents (e.g., HTML) and the source code of the Web pages is
publicly available [61]. History-based approaches, on the other hand, can be applied directly to
mobile apps as they analyze past requests, which are domain-independent. These key findings
directly motivated us to develop novel content-based techniques and to adapt existing history-
based techniques in the mobile-app domain.
(2) Empirical Study: Due to the lack of prefetching and caching techniques in mobile
apps, it is unclear whether such techniques, shown to work well in browsers, can be effective
in this emerging domain. Motivated by Insight 1 that there should be a large opportunity for
prefetching and caching in mobile apps, we aim to assess Hypothesis 1.1 and Hypothesis 1.2
regarding the pervasiveness of prefetchable requests and cacheable requests. To that end, we
conducted a large-scale empirical study [194] to understand the characteristics of HTTP requests
in over 1,000 popular Android-based mobile apps. Our work focused on the prefetchability of
network requests using static program analysis and cacheability of the resulting responses. We
found that there is a substantial opportunity to leverage prefetching and caching in mobile apps,
but that suitable techniques should take into account the nature of apps' network interactions
and common idiosyncrasies, such as untrustworthy HTTP headers.
(3) Prefetching and Caching Techniques: Motivated by Insight 2a and Insight 2b re-
garding content-based techniques, we developed PALOMA [192], the first content-based prefetch-
ing technique to reduce user-perceived latency in mobile apps based on program analysis. PALOMA
leverages string analysis and callback control-flow analysis to automatically instrument apps using
our rigorous formulation of scenarios that address "what" and "when" to prefetch. In the end,
PALOMA automatically produces an optimized app with prefetching enabled. PALOMA has veri-
fied Hypothesis 2.1 and Hypothesis 2.2 by demonstrating significant runtime savings (hundreds
of milliseconds per prefetchable HTTP request) in practice with high accuracy and negligible
overhead. More importantly, PALOMA formally defines the conditions under which requests
are prefetchable, and constructs a comprehensive and reusable microbenchmark that forms the
foundation for future content-based prefetching techniques in the mobile-app domain.
Guided by Insight 3a and Insight 3b regarding history-based techniques, we aim to investigate
existing solutions from the traditional browser domain using small amounts of historical data. At
the same time, we noticed that today's privacy regulations make it infeasible to explore history-
based prefetching with the usual strategy of amassing large amounts of historical data over long
periods and constructing conventional, "large" prediction models as studied in the browser do-
main. This observation further reinforced our focus on small amounts of historical data. Thus, to
test Hypothesis 3, which asks whether history-based prefetching is feasible with small amounts
of historical data, we constructed HiPHarness, a framework for automatically assessing history-
based prediction models, and used it to conduct an extensive empirical study based on over 15
million HTTP requests collected from nearly 11,500 mobile users during a 24-hour period, result-
ing in over 7 million models. Our results demonstrated the feasibility of history-based prefetching
with small models on mobile platforms, directly motivating future work in this new research
area. We further introduced several strategies for improving prediction models while reducing
the model size. Our HiPHarness framework provided the foundation for future explorations of
effective history-based prefetching techniques across a range of usage scenarios.
(4) Evaluation Framework: Evaluating prefetching and caching techniques is a time-
consuming task since it requires realistic test cases triggered by real users, which cannot be
generated automatically by existing test generation techniques. Recent research has explored
opportunities for reusing existing realistic tests from an app to automatically generate new tests
for other apps, in order to reduce the manual effort involved [49, 146, 91, 113, 48]. This has
provided Insight 4a and Insight 4b for us to compare different test-reuse techniques in order to
select suitable test cases for evaluating prefetching and caching techniques. However, lacking a
standard baseline and evaluation protocol, it is unclear which test-reuse technique can generate
more suitable test cases for evaluating prefetching and caching techniques under a given scenario.
To address this challenge, we developed FrUITeR [191], a framework that automatically evalu-
ates test-reuse techniques under the same baseline. We applied FrUITeR to existing test-reuse
techniques on a uniform benchmark we established, resulting in 11,917 test-reuse cases from 20
apps that can be used to select suitable test cases of one's interest, directly verifying Hypothesis
4.1 and Hypothesis 4.2. With the selected realistic test cases, the order of the network requests
triggered will be based on real users, and thus can be used for evaluating and comparing dierent
prefetching and caching techniques without engaging new users for each app. Furthermore, we
made the test cases publicly available [11] to benet future prefetching and caching techniques,
as well as other techniques that require real users' engagement in the mobile-app domain.
1.3 Contributions
This section summarizes the main contributions of this dissertation.
1. Literature Review: We conducted a literature review to understand the nature of existing
prefetching and caching techniques, and to provide insights on how they can be applied to
the largely-overlooked mobile-app domain. This contribution is described in Chapter 3.
2. Empirical Study [194]: We conducted the first large-scale study on real mobile apps that
has provided empirical evidence regarding the significant opportunities for prefetching and
caching. The study further identified several characteristics that should be considered when
devising prefetching and caching techniques in the mobile-app domain. This contribution is
presented in Chapter 4.
3. Content-based Approach [192, 190]: We developed PALOMA, the first content-based
prefetching technique in mobile apps using program analysis that achieved significant la-
tency reduction and negligible overhead. I further devised a comprehensive and reusable
microbenchmark for standardized evaluation of content-based prefetching techniques in the
mobile-app domain. This contribution is discussed in Chapter 5.
4. History-based Approach [195]: We developed HiPHarness, a tailorable framework for
automatically exploring history-based prefetching techniques across a wide range of scenar-
ios. I used HiPHarness to conduct the first extensive study of history-based prefetching on
mobile platforms, which has provided empirical evidence on the feasibility of history-based
prefetching using "small" amounts of user data, and has identified several future directions
to improve history-based approaches. This contribution is demonstrated in Chapter 6.
5. Evaluation Framework [191, 11]: We developed FrUITeR, a customizable framework to
automatically assess test-reuse techniques under the same baseline, which can thus be used to
identify suitable test cases for evaluating prefetching and caching techniques in mobile apps
with reduced manual effort. I leveraged FrUITeR to conduct a side-by-side evaluation of
the state-of-the-art test-reuse techniques, uncovering several needed improvements in this
emerging area. I built a reusable benchmark that contains realistic test cases to directly
benefit the evaluation of prefetching and caching techniques, as well as other techniques that
require real users' engagement in the mobile-app domain. This contribution is elaborated
further in Chapter 7.
The remainder of this dissertation is organized as follows. Chapter 2 describes the background
information related to network requests, and prefetching and caching techniques in mobile apps,
covering both content-based and history-based approaches. Chapter 3 presents the literature
review of prefetching and caching techniques and the lessons learned. Chapter 4 describes our
empirical study that explores prefetching and caching opportunities in the mobile-app domain.
Chapters 5 and 6 present our novel prefetching and caching solutions targeting mobile platforms:
content-based approach PALOMA (Chapter 5) and history-based approach HiPHarness (Chapter
6). Chapter 7 details FrUITeR, an evaluation framework for assessing test-reuse techniques in
order to select suitable test cases for evaluating prefetching and caching techniques. Chapter 8
introduces the related work covered in the scope of this dissertation. Chapter 9 concludes the
dissertation and discusses several future directions.
Chapter 2
Background
This chapter provides background information on concepts that are used throughout the dis-
sertation. Section 2.1 introduces the HTTP protocol and HTTP libraries that are relevant to
prefetching and caching. Section 2.2 uses a concrete code example to demonstrate content-based
prefetching and caching in mobile apps. Section 2.3 discusses the typical workflow of history-based
prefetching and caching techniques.
2.1 HTTP Protocol and Libraries
In this section, we overview aspects of the HTTP protocol that are relevant to prefetching and
caching. We then illustrate with concrete examples how developers perform network operations
in mobile apps using HTTP libraries, with a particular focus on Android.
2.1.1 HTTP Protocol
Previous studies have shown that mobile apps spend between 34% and 85% of their time fetching
data from the Internet [147]. The majority of apps run over HTTP [66], where requests are sent
by clients and responses returned by servers.
An HTTP request consists of an HTTP method, the destination of the resource to fetch (i.e.,
the URL), and request headers and body, both of which are optional. The HTTP method (GET,
POST, DELETE, etc.) needs to be specified by developers when sending a request. Optional request
headers allow the client to pass additional information to the server [75], such as Accept-Language:
en-US. The request body contains the resource to send to the server, but is only needed for "write"
HTTP methods, such as POST.
HTTP 1.1 [75] defines eight HTTP methods. Some of them, such as DELETE, are not suitable
for prefetching because they may change the server's state contrary to the user's intention. Only
the GET and HEAD methods are considered "safe", in that they result in the retrieval of data and
do not have any side-effects on the server [75]. The HEAD method is similar to GET, except that
its response does not contain a message body [75]. Thus, GET requests are of particular interest
in our study.
An HTTP response consists of a status code, a status message, and response headers and
body, both of which are optional. The status code and status message indicate whether the request
was successful or not, and why. The response body contains the fetched resource from the server.
Response headers contain additional information that is often used by developers to decide on
their caching strategies. For example, the Expires header specifies when the response will become
stale, while the Cache-Control header contains information pertaining to caching mechanisms
such as no-cache and max-age. Interestingly, as observed in our study (see Chapter 4), those
headers cannot always be trusted by developers, and sometimes they are missing altogether.
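To make the above concrete, the hypothetical exchange below shows a GET request and its response; the host, resource, and header values are invented for illustration only and are not taken from any subject app. The Cache-Control and Expires headers in the response carry the caching information discussed above.
GET /weather?cityId=101 HTTP/1.1
Host: www.ase.com
Accept-Language: en-US

HTTP/1.1 200 OK
Cache-Control: max-age=3600
Expires: Wed, 02 Dec 2020 08:00:00 GMT
Content-Type: application/json

{"cityId": "101", "temperature": "75F"}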
2.1.2 HTTP Libraries Used in Mobile Apps
In Android apps, developers often use off-the-shelf HTTP libraries to interact with the servers.
Listing 2.1 and Listing 2.2 demonstrate how developers send HTTP requests and receive responses
using the two most popular HTTP libraries for Android: URLConnection and OkHttp.
When sending HTTP requests, developers need to specify the URL of the resource to be
fetched (Listing 2.1: line 1, Listing 2.2: line 3), HTTP method (Listing 2.1: line 3, Listing 2.2:
line 5), request headers (line 4 in both Listings), and request body (Listing 2.1: line 6, Listing 2.2:
line 5). Only the URL is mandatory, and the GET method is used by default if the HTTP method is not specified (e.g., if line 3 in Listing 2.1 and line 5 in Listing 2.2 are removed). When receiving
HTTP responses, developers can retrieve the response body (Listing 2.1: line 8, Listing 2.2: line
7) as well as the response headers (Listing 2.1: line 9, Listing 2.2: line 8) that may contain caching
information.
1 URL url = new URL("http://www.ase.com/post");
2 URLConnection conn = url.openConnection();
3 conn.setRequestMethod("POST");
4 conn.setRequestProperty("Accept-Language", "en-US");
5 OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
6 wr.write("post_data_to_send");
7 wr.flush();
8 InputStream responseStream = conn.getInputStream();
9 Map headerMap = conn.getHeaderFields();
Listing 2.1: Sending a POST request using the URLConnection library
1 OkHttpClient client = new OkHttpClient();
2 Request request = new Request.Builder()
3 .url("http://www.ase.com/post")
4 .addHeader("Accept-Language", "en-US")
5 .post("post_data_to_send")
6 .build();
7 Response response = client.newCall(request).execute();
8 Headers headers = response.headers();
Listing 2.2: Sending a POST request using the OkHttp library
2.2 Content-based Prefetching and Caching
In this section, we use a concrete example to introduce the fundamental building blocks and
execution model of mobile app programs, with a particular focus on Android. We then use the same example to illustrate how content-based prefetching and caching techniques can significantly reduce user-perceived latency.
2.2.1 Mobile App Code Example
Mobile apps that depend on the network generally involve two key concepts: events that interact with user inputs and network requests that interact with remote servers. We explain these concepts via Listing 2.3's simplified code fragment of an Android app that responds to user interactions by
retrieving weather information.
Events: In mobile apps, user interactions are translated to internal app events. For instance,
a screen tap is translated to an onClick event. Each event is, in turn, registered to a particular
application UI object with a callback function; the callback function is executed when the event is
triggered. For instance in Listing 2.3, the button object submitBtn is registered with an onClick
event (Line 9), and the corresponding callback function onClick() (Line 10) will be executed when
a user clicks the button. Similarly, the drop-down box object cityNameSpinner is registered with
an onItemSelected event that has an onItemSelected() callback function (Lines 5-7).
1 class MainActivity {
2 String favCityId, cityName, cityId;
3 protected void onCreate(){
4 favCityId = readFromSetting("favCityId");//static
5 cityNameSpinner.setOnItemSelectedListener(new OnItemSelectedListener(){
6 public void onItemSelected() {
7 cityName = cityNameSpinner.getSelectedItem().toString();//dynamic
8 }});
9 submitBtn.setOnClickListener(new OnClickListener(){
10 public void onClick(){
11 cityId = cityIdInput.getText().toString();//dynamic
12 URL url1 = new URL(getString("domain")+"weather?&cityId="+favCityId);
13 URL url2 = new URL(getString("domain")+"weather?cityName="+cityName);
14 URL url3 = new URL(getString("domain")+"weather?cityId="+cityId);
15 URLConnection conn1 = url1.openConnection();
16 Parse(conn1.getInputStream());
17 URLConnection conn2 = url2.openConnection();
18 Parse(conn2.getInputStream());
19 URLConnection conn3 = url3.openConnection();
20 Parse(conn3.getInputStream());
21 startActivity(DisplayActivity.class);
22 }});
23 }
24 }
Listing 2.3: Code snippet with callbacks and HTTP requests using the URLConnection library
Network Requests: Within an event callback function, the app often has to communicate
with remote servers to retrieve information. The communication is performed through network
requests over the HTTP protocol in most non-realtime apps. Each HTTP request is associated
with a URL field that specifies the endpoint of the request. For instance, in Listing 2.3, the onClick event callback sends three HTTP requests, each with a unique URL (Lines 12-14).
There are two types of URL values, depending on when the value is known: static and dynamic. For instance, favCityId in Listing 2.3 is static because its value is obtained statically by reading the application setting (Lines 4, 12). Similarly, getString("domain") reads the constant string value defined in an Android resource file [83] (Lines 12, 13, 14). In contrast, cityName is dynamic
since its value depends on which item a user selects from the drop-down box cityNameSpinner
during runtime (Lines 7, 13). Similarly, cityId is also a dynamic URL value (Lines 11, 14).
2.2.2 Content-based Approach Example
The key insight for content-based approaches is that one can significantly reduce the user-perceived latency by prefetching and caching certain network requests through analyzing the app's content. For instance, Listing 2.3 corresponds to a scenario in which a user selects a city name from the drop-down box cityNameSpinner (Line 7), then clicks submitBtn (Line 9) to get the city's weather information through an HTTP request. By understanding the app's content (i.e., the program's code), a prefetching scheme would submit the request for retrieving the weather information immediately after the user selects a city name, i.e., before the user clicks the submitBtn button. This can significantly reduce the time the user would have to wait to receive the information from the remote server over the network.
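To make this scenario concrete, the sketch below shows one way the onItemSelected callback from Listing 2.3 could be rewritten to prefetch the weather request during user think time. This is a simplified illustration under our own assumptions rather than PALOMA's actual transformation: PrefetchCache and readFully are hypothetical helpers (a thread-safe map from URLs to response bytes, and a stream-to-byte-array utility), and error handling simply falls back to the original on-demand fetch.
// Simplified prefetching sketch: issue the request for url2 as soon as the
// city name is known, i.e., before the user clicks submitBtn.
cityNameSpinner.setOnItemSelectedListener(new OnItemSelectedListener() {
  public void onItemSelected() {
    cityName = cityNameSpinner.getSelectedItem().toString();
    final String prefetchUrl = getString("domain") + "weather?cityName=" + cityName;
    new Thread(new Runnable() {      // keep network I/O off the UI thread
      public void run() {
        try {
          URLConnection conn = new URL(prefetchUrl).openConnection();
          PrefetchCache.put(prefetchUrl, readFully(conn.getInputStream()));
        } catch (IOException e) {
          // ignore: onClick() will simply fetch the response on demand
        }
      }
    }).start();
  }
});
// In onClick(), the prefetched response is reused if it is already available:
// byte[] cached = PrefetchCache.get(url2.toString());
// Parse(cached != null ? new ByteArrayInputStream(cached) : conn2.getInputStream());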
Content-based prefetching and caching is especially promising for two reasons. First, an HTTP request's destination URL can sometimes be known before the actual request is sent out, such as the static URL url1 (Line 12) in Listing 2.3. Second, there is often sufficiently long slack between the time a request's URL value is known and when the request is sent out, due to other code's execution and the "user think time" [127, 74]. Prefetching in effect "hides" the network latency by overlapping the network requests with the slack period.
The key challenges of efficiently prefetching HTTP requests involve determining (1) which HTTP requests to prefetch, (2) what their destination URL values are, and (3) when to prefetch them. Prior work addressed these challenges by relying on various server hints [150, 68, 134], developer annotations [127, 110], and patterns of historical user behaviors [134, 159, 74, 103, 175]. Our goal is to avoid relying on such external information that is often difficult to obtain, and instead to use only the information on the client side, as explained in Chapter 1, such as the mobile app programs.
2.3 History-based Prefetching and Caching
This section describes the typical workflow employed by history-based prefetching techniques (Section 2.3.1), and introduces the widely-adopted algorithms for building history-based prediction
models (Section 2.3.2).
2.3.1 History-based Approach Workflow
A history-based approach consists of three major phases: (1) training the prediction model based on the historical data; (2) predicting future requests based on the trained model; and (3) prefetching requests based on the prediction results and runtime conditions.
In the training phase, a prediction model is trained with past requests (often referred to as "training data") to capture their relationships based on a prediction algorithm (e.g., the probability of visiting a given request next). One prediction algorithm can be used to produce multiple prediction models by taking as input different training data, such as different numbers of past
requests. The trained prediction model will be used in the next phase, and will only be able to
predict the requests that have been used to train the model.
In the predicting phase, the trained model is used to predict future requests based on the current context (e.g., the current request or the previous N requests), depending on the prediction algorithm used to build the model. The current context is used to trigger the prediction of future requests (e.g., those with the highest probability of occurrence after the current request). The predicted requests are candidates for prefetching in the final phase.
Finally, in the prefetching phase, certain predicted requests are prefetched based on runtime
conditions, such as battery life and cellular-data usage [89].
2.3.2 History-based Prediction Algorithms
This section introduces three foundational prediction algorithms from the traditional browser
domain [52, 134, 136], which can be directly applied to mobile apps in principle. Note that there
are multiple variations of the three algorithms [73, 72, 71, 68, 116, 130, 60, 117, 46]; a detailed
summary can be found in a survey [37]. The variations in question target specific characteristics of traditional browsers, such as web-page structure [71], and do not carry over to mobile platforms. We thus rely on the originally defined algorithms.
Most-Popular (MP) [52] maintains a list of the most popular subsequent requests for each request in the training set. In the training phase, MP adds the requests that follow the current request within a specified window to that request's list and stores their occurrence counts. In the predicting phase, MP predicts the requests with the highest occurrence counts in the current request's list.
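As an illustration, the following is a minimal sketch of MP under our own naming, assuming that requests are identified by their URLs and that the window covers only the immediately following request; it is not the implementation from the cited work.
import java.util.*;
import java.util.stream.Collectors;

// Minimal Most-Popular (MP) sketch: for each request, count which requests
// follow it in the training sequence, and predict the top-k most frequent ones.
class MostPopularPredictor {
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  void train(List<String> requestSequence) {
    for (int i = 0; i + 1 < requestSequence.size(); i++) {
      String current = requestSequence.get(i);
      String next = requestSequence.get(i + 1);
      counts.computeIfAbsent(current, k -> new HashMap<>()).merge(next, 1, Integer::sum);
    }
  }

  List<String> predict(String current, int k) {
    Map<String, Integer> followers = counts.getOrDefault(current, Collections.emptyMap());
    return followers.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}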
Dependency Graph (DG) [134] trains a directed graph indicating the dependencies among
the requests, where nodes represent requests and arcs indicate that a target node is visited after
the original node within a window. Each arc has a weight that represents the probability that
the target node will be visited next. In the predicting phase, DG predicts future requests based
on their probabilities stored in the dependency graph.
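A rough sketch of DG's bookkeeping is shown below, again with our own class names and with requests identified by URLs; the arc weight is approximated as the fraction of occurrences of the source request that are followed by the target request within the window.
import java.util.*;

// Dependency Graph (DG) sketch: arcs carry the estimated probability that the
// target request is seen within "window" positions after the source request.
class DependencyGraphPredictor {
  private final Map<String, Integer> nodeCounts = new HashMap<>();
  private final Map<String, Map<String, Integer>> arcCounts = new HashMap<>();
  private final int window;

  DependencyGraphPredictor(int window) { this.window = window; }

  void train(List<String> seq) {
    for (int i = 0; i < seq.size(); i++) {
      String src = seq.get(i);
      nodeCounts.merge(src, 1, Integer::sum);
      for (int j = i + 1; j < seq.size() && j <= i + window; j++) {
        arcCounts.computeIfAbsent(src, k -> new HashMap<>()).merge(seq.get(j), 1, Integer::sum);
      }
    }
  }

  // Weight of the arc src -> target, used to rank prefetching candidates.
  double arcWeight(String src, String target) {
    int srcCount = nodeCounts.getOrDefault(src, 0);
    if (srcCount == 0) return 0.0;
    return arcCounts.getOrDefault(src, Collections.emptyMap()).getOrDefault(target, 0)
        / (double) srcCount;
  }
}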
Prediction-by-Partial-Match (PPM) [136] is based on a high-order Markov model. PPM
is context-sensitive, i.e., it considers the order of requests. In the training phase, it builds a trie
structure [64] that captures the immediately-followed-by relationships among the past requests. In the
predicting phase, it predicts future requests whose parents match the current context, i.e., the
previous N requests.
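The sketch below illustrates the trie-based bookkeeping behind PPM for contexts of up to maxOrder previous requests; it is our simplified rendering and omits details such as the escape mechanism used by full PPM variants.
import java.util.*;

// Prediction-by-Partial-Match (PPM) sketch: a trie whose paths are request
// sequences of length <= maxOrder + 1; each node counts how often its child
// requests followed the context encoded by the path to that node.
class PpmPredictor {
  private static class Node {
    int count = 0;
    Map<String, Node> children = new HashMap<>();
  }
  private final Node root = new Node();
  private final int maxOrder;

  PpmPredictor(int maxOrder) { this.maxOrder = maxOrder; }

  void train(List<String> seq) {
    for (int i = 0; i < seq.size(); i++) {
      Node node = root;
      // Insert the subsequence starting at i, up to maxOrder + 1 requests long.
      for (int j = i; j < seq.size() && j <= i + maxOrder; j++) {
        node = node.children.computeIfAbsent(seq.get(j), k -> new Node());
        node.count++;
      }
    }
  }

  // Predict the request that most often followed the given context
  // (the previous N <= maxOrder requests).
  Optional<String> predict(List<String> context) {
    Node node = root;
    for (String request : context) {
      node = node.children.get(request);
      if (node == null) return Optional.empty();
    }
    return node.children.entrySet().stream()
        .max(Comparator.comparingInt((Map.Entry<String, Node> e) -> e.getValue().count))
        .map(Map.Entry::getKey);
  }
}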
Chapter 3
Literature Review of Prefetching and Caching
Web prefetching and caching have been entrenched techniques for reducing network latency since the early days of the Internet, and they have attracted a large body of work in the browser domain. However, prefetching and caching in the mobile-app domain are surprisingly unexplored. Thus, we conducted a literature review of existing prefetching and caching techniques in the browser domain, starting with existing surveys [37, 167, 165], to understand the reason behind the dearth of such techniques in mobile apps. Our objective is to identify the fundamental differences and similarities between the traditional browser domain and the emerging mobile-app domain, and to provide insights on how prefetching and caching techniques can be applied to today's mobile apps.
3.1 Prefetching and Caching in Browser Domain
As discussed in Chapter 1, our focus in this dissertation is client-based prefetching and caching
techniques. Existing client-based techniques in the browser domain can be classified into two principal categories [37]: content-based approaches that predict future requests by analyzing Web
page contents, and history-based approaches that analyze past requests.
3.1.1 Content-based Techniques
The content-based techniques predict future network requests based on the analysis of Web page
contents. Early efforts focus on predicting the HTML links that are likely to be followed by the clients. For instance, Xu et al. proposed a prefetching technique to predict future requests based on keywords in the anchor text of URLs [178].
In recent years, as the Web page contents have become more complex, the performance bottleneck for network latency in the browser domain has been identified as resource loading [173]: one initial HTTP request will require a large number of sub-resources (e.g., images, JavaScript files), which can only be discovered after the main resource (e.g., HTML) is fetched and parsed. Thus, existing content-based techniques have been focused on analyzing the dependencies among sub-resources. For instance, Tempo [172] is a mobile browser that speculatively loads the sub-resources based on the dependency information in the resource graph extracted from the websites. Shandian [169] analyzes the Web page load process to identify its inefficiencies, such as sequential sub-resource loading, and restructures the page load process to remove those identified inefficiencies. Polaris [133] dynamically schedules sub-resource loading based on a dependency graph it generates that tracks the fine-grained data flows across the JavaScript heap and the DOM.
3.1.2 History-based Techniques
The history-based techniques predict future user requests based on page access behaviors observed in the past and have yielded a large body of work [134, 72, 139, 136, 60, 125, 95, 176, 182, 105, 92, 135, 100, 145, 181]. The performance of those techniques highly depends on the actual order of the past requests and can vary significantly when applied at different granularities. For instance, one prediction algorithm may yield better results when it is applied to an individual user's past requests, compared to the past requests of a group of users. For this reason, we only focus on several representative history-based techniques instead of the incremental work that improves such techniques based on specific user behaviors in the browser domain. Our objective is to study those representative history-based techniques and adapt them to the mobile-app domain based on the particular characteristics of mobile app users. A more detailed summary of existing history-based techniques can be found in a survey [37]. The workflow of history-based techniques and several foundational history-based prediction algorithms are detailed in Section 2.3.
3.2 Lessons Learned
There is clearly a missed opportunity to apply prefetching and caching techniques in the mobile-app domain, with respect to both content-based and history-based techniques. We now present our insights on how those techniques from the browser domain can be adapted to the mobile-app domain.
3.2.1 Content-based Techniques
In the mobile-app domain, the content-based techniques from the browser domain cannot be applied directly for two reasons. First, the techniques for analyzing Web page content are no longer relevant in the mobile-app domain, as the content structure is fundamentally different. However, inspired by the research in the browser domain, the mobile app program structure is the analogous content in the mobile-app domain, and it can be analyzed to extract useful information in order to answer "what" and "when" to prefetch and cache. Second, as discussed in Section 3.1.1, the performance bottleneck for page load times in the browser domain is resource loading because one initial HTTP request will require a large number of sub-resources [173]. In mobile apps, by contrast, the HTTP requests are always lightweight [108]: one request only fetches a single resource that does not require any further sub-resource fetching. Therefore, the focus in the mobile-app domain should be prefetching and caching the individual requests that a user may trigger next, rather than the sub-resources within a single request.
Thus, in the mobile-app domain, devising a content-based approach to analyze mobile app
programs is a promising direction but more challenging in general, due to the complexity of mobile
app programs, compared to the browser domain where the content dependency is embedded in the
well-structured Web documents (e.g., HTML) and the source code of the Web pages is publicly
available [61].
3.2.2 History-based Techniques
The history-based approaches, on the other hand, can be applied directly to mobile apps, as they only rely on past requests, which are domain-independent. Each network request in the browser domain, whether an initial HTML request or a sub-resource request, can be identified as a URL, which is still applicable in the mobile-app domain. However, user behaviors in mobile apps may exhibit different characteristics compared to the browser domain, making the direct applicability of existing history-based techniques unclear. For instance, the fundamental limitation of history-based techniques is that they can only predict the requests that have been triggered in the past.
Therefore, history-based techniques cannot achieve desirable results if mobile app users tend to
trigger brand new requests constantly.
Furthermore, a key challenge in applying history-based prefetching on mobile platforms today
is that the user data is hard to obtain due to the profusion of data-privacy regulations introduced
and regularly tightened around the world [31, 106, 94, 140, 154]. For example, a study published
in 2011 reported on data collected from 25 iPhone users over the course of an entire year [155]
and inspired a number of follow-up studies [172, 179, 129, 119, 37, 111, 143]. A decade later,
such protracted data collection would be difficult to imagine, both because it would likely fall
afoul of legal regulations introduced in the meantime and because today's end-users are more
keenly aware and protective of their data.
Given the above constraints, an obvious strategy would be to limit the amount of data on
which the history-based techniques rely. However, there is no evidence that it is feasible to predict
future requests based on small amounts of data. This appears to have been an important factor
that has discouraged the exploration of history-based prefetching in recent years. We believe
that this has been a missed opportunity. Namely, the previously reported mobile-device usage
patterns (repetitive activities in brief bursts [84, 156, 65, 152, 55]) lead us to hypothesize that history-based prefetching may work effectively with small prediction models trained on mobile-user requests collected during short time periods. Thus, to apply history-based prefetching and caching in the mobile-app domain effectively, it is critical to investigate whether mobile-user behaviors
are predictable today based on small amounts of historical requests.
Chapter 4
Empirical Study of Prefetching and Caching Opportunities
This chapter describes an extensive study [194] we conducted that provided empirical evidence regarding the opportunities for prefetching and caching in mobile apps. Moreover, it identified concrete shortcomings in the current app development practices that may hinder prefetching and
caching solutions.
4.1 Motivation
The research on prefetching and caching techniques in the browser domain has yielded a large
body of work [172, 127, 133, 166, 169, 159, 150, 52]. However, the resulting techniques cannot
be applied to mobile apps due to their different root causes of network latency. In the browser domain, the bottleneck for latency is resource loading, since a large number of resources (usually files such as images) are needed within each HTTP request [173]. In the mobile-app domain, each request only fetches a single response, and additional requests need to be issued explicitly to fetch further resources [108, 192]. Thus, prefetching and caching techniques in the browser domain target sub-resources within a single request [150, 172, 127, 169], while the research in the mobile-app domain focuses on separate HTTP requests [188, 192].
Aside from a couple of exceptions, there has been a lack of research on prefetching and caching
techniques that may be suitable for the mobile-app domain. In fact, it is currently not clear
whether such techniques can be effective or whether they are even feasible in practice. For instance, CacheKeeper [188] made an initial effort to study the redundant web traffic in mobile
apps and proposed an OS-level caching service. However, the resulting service was only evaluated
on 10 apps. Furthermore, CacheKeeper's performance highly depends on the flaws in the web caching strategies employed in the original app, and its broader utility is unclear.
The dearth and shortcomings of previous work motivated us to conduct a more extensive
empirical study that aims to understand the characteristics of HTTP requests in mobile apps.
In this empirical study, we report our results from the automated analysis of 1,687 of the most popular
Android apps, spread across 33 app categories.
4.2 Research Questions
The goal of this study is to understand whether prefetching and caching can be applied to the
mobile-app domain effectively, in order to reduce user-perceived latency. We formulated nine
research questions (RQs) to this end. These RQs target the prefetchability of HTTP requests,
cacheability of HTTP responses, and redundancies among HTTP requests.
4.2.1 Prefetchability of HTTP Requests
Our objective is to assess the extent to which requests in mobile apps are prefetchable. Prefetchable
requests are read-only requests that have no side-effects on the server state. As discussed in Section 2.1.1, these read-only requests are GET requests [75] in the context of the HTTP protocol. Furthermore, we study whether the prevalence of prefetchable requests varies across different app
categories. Such variations may allow identifying app categories that are particularly suitable for
prefetching. We formulate three research questions to this end:
• RQ1 - What is the number of GET requests per app?
• RQ2 - What is the percentage of GET requests among all HTTP requests in mobile apps?
• RQ3 - How prevalent are GET requests across different app categories?
4.2.2 Cacheability of HTTP Responses
A prefetchable request may not be cacheable if the response to the request changes over time
(e.g., in the case of weather data). In such cases, the cached response may be stale and serving it
would lead to incorrect app behavior. To determine when a response becomes stale, or whether
a request is cacheable at all, developers have to rely on the header information specified in the response, specifically, Expires and Cache-Control (recall Section 2.1.1). However, there are no
standard rules for developers to follow when constructing a response, leaving open the possibility
that header information may be unreliable or even missing. To investigate this, we formulate four
additional research questions:
• RQ4 - How prevalent are Expires headers?
• RQ5 - Are Expires headers trustworthy?
• RQ6 - How prevalent are Cache-Control headers?
• RQ7 - Are Cache-Control headers trustworthy?
4.2.3 Identifying Truly Redundant HTTP Requests
Caching is only effective when there exist redundant requests for the same resource. An HTTP request is redundant if a previous request specified the same HTTP method and URL, and yielded the same response; the later request is redundant because the original response could have been stored locally and reused. Previous work [188] suggests an opportunity for mobile app-based caching techniques, in that it identified the presence of redundant HTTP traffic and showed that
implementations of web caching are inadequate for mobile apps. Our work goes beyond identifying
redundant HTTP requests and tries to assess the intent behind them. A set of ostensibly redundant
requests could be generated on purpose (e.g., to retrieve updated weather information), and thus
may not be truly redundant. If a caching scheme fails to consider this, it will lead to cache
staleness. We thus consider the actual responses to the candidate redundant requests, aiming to
distinguish among them and provide better insights for future caching techniques in mobile apps.
With this in mind, we formulate the last two research questions:
• RQ8 - How prevalent are redundant HTTP requests?
• RQ9 - Are the identified ostensibly redundant requests truly redundant?
4.3 Data Collection
This section details (1) the workflow we used for the data collection, (2) the criteria behind our selection of subject apps, (3) app instrumentation, (4) our collection of data via runtime testing,
selection of subject apps, (3) app instrumentation, (4) our collection of data via runtime testing,
and (5) the reasons for eliminating certain apps from the subject set before conducting further
analysis.
Figure 4.1: Our data collection workflow. The App Profiling, App Instrumentation, App Testing, and Send GET Requests components perform automated tasks.
4.3.1 Data Collection Workflow
Figure 4.1 illustrates the workflow we implemented for collecting the data needed to answer the nine research questions stated above. The initial subject apps were downloaded from the Google Play Store (Initial Set of Subject Apps). The apps were automatically instrumented based on the information extracted from HTTP library documentation and the decompiled code of several sample apps (App Instrumentation). The instrumented apps were automatically tested using randomly generated inputs to produce logs that contain the information needed to answer RQ1-RQ4, RQ6, and RQ8 (App Testing). We manually examined the apps that could not be tested due to problems such as installation failures and runtime crashes, to identify the root causes of the problems (Final Set of Subject Apps). Finally, we automatically sent the GET requests observed in the subject apps at different time intervals, to answer RQ5, RQ7, and RQ9.
4.3.2 Initial Set of Subject Apps
We downloaded 1,687 top-ranked apps across 33 categories from the Google Play Store in the
United States. 1,308 of the apps could be processed by Soot [163], a state-of-the-art tool for
instrumenting Android apps, as further discussed in App Instrumentation. The sizes of those
1,308 apps vary between 16 KB and 103.4 MB. The total number of HTTP requests per app
varied between 0 and 1,243 in our tests, as described in App Testing.
Table 4.1 summarizes the information about the 1,308 subject apps. The table shows the
maximum and average numbers of HTTP requests per app for each category; the minimum
number of HTTP requests in every category is 0 and we thus omit it from the table. Finally,
the right-most column shows the number of apps in each category that sent at least four HTTP
requests in our tests, as well as the percentage of such apps compared to the total number of apps
in the given category. The reason behind highlighting this subset of the 1,308 subject apps will
be explained in Final Set of Subject Apps.
Table 4.1: App information for each category among initial subjects
Category #Apps Max #Req Avg #Req #Apps (#Req ≥ 4)
1. Art & Design 11 14 2.27 3 (27.27%)
2. Auto & Vehicles 29 6 1.07 4 (13.79%)
3. Beauty 11 1243 120.82 6 (54.55%)
4. Books & Reference 40 108 11.58 16 (40%)
5. Business 55 87 5.71 17 (30.91%)
6. Comics 55 319 20.84 19 (34.55%)
7. Communications 40 96 3.98 8 (20%)
8. Dating 16 334 29.94 6 (37.5%)
9. Education 55 62 4.98 17 (30.91%)
10. Entertainment 28 134 12.36 11 (39.29%)
11. Events 8 53 14.13 5 (62.5%)
12. Finance 61 150 15.97 27 (44.26%)
13. Food & Drink 28 188 16.43 13 (46.43%)
14. Games 37 59 12.59 25 (67.57%)
15. Health & Fitness 41 14 3.44 15 (36.59%)
16. House & Home 25 149 17.96 8 (32%)
17. Libraries & Demo 45 22 0.6 1 (2.22%)
18. Lifestyle 21 82 12.48 12 (57.14%)
19. Maps & Navigation 54 206 8.37 8 (14.81%)
20. Medical 59 63 2.8 10 (16.95%)
21. Music & Audio 43 44 5.47 14 (32.56%)
22. News & Magazines 49 802 37.71 26 (53.06%)
23. Parenting 24 28 2.54 5 (20.83%)
24. Personalization 31 288 29.61 11 (35.48%)
25. Photography 43 58 7.72 14 (32.56%)
26. Productivity 68 119 8.31 24 (35.29%)
27. Shopping 46 198 21.54 22 (47.83%)
28. Social 48 108 10.4 23 (47.92%)
29. Sports 43 146 19.42 18 (41.86%)
30. Tools 54 130 6.44 16 (29.63%)
31. Travel & Local 63 208 14.33 27 (42.86%)
32. Video Players & Editors 47 134 5.89 8 (17.02%)
33. Weather 30 123 14.7 12 (40%)
Total 1308 1243 50.20 451 (34.48%)
4.3.3 App Instrumentation
Each subject app went through an automated instrumentation process that used Soot [163] to
insert code that captures information about HTTP requests and responses. This information is
primarily located in the HTTP headers. Capturing such information in the browser domain is
straightforward because HTTP requests and responses are managed in a unified way. On the other hand, mobile apps presented a challenge: we first had to identify how the HTTP requests
and responses are handled (recall Section 2.1.2); only then could we instrument the corresponding
code to capture this information automatically.
It was thus necessary to determine what libraries most apps use to send HTTP requests.
We first identified a set of popular HTTP libraries, including URLConnection [28], OkHttp [22], Volley [29], and Retrofit [23]. We then analyzed a sample of the subject apps' bytecodes and checked the package names against the libraries. For example, the presence of the string "java.net.URLConnection" generally indicates the use of the URLConnection library.
The data gathered from our analysis point to URLConnection and OkHttp as the most popular
HTTP libraries used in the subject apps. This is unsurprising: URLConnection is the standard
built-in library of the Android framework, and it has been augmented with OkHttp since Android
v.4.4 (KitKat). We thus decided to focus on URLConnection and OkHttp in our study.
We then performed a more detailed analysis of how our subject apps use these two libraries.
We recorded the runtimes of those methods that are imported from URLConnection and OkHttp,
and narrowed our focus to methods that are most time-consuming. The rationale is that those
are most likely to be the methods related to sending requests and receiving responses over the
network.
In addition, we inspected the decompiled code of the subject apps, as well as the documen-
tation and source code of the HTTP libraries used in the apps, to identify the actual usage of
HTTP requests and responses. The reason for this additional inspection is that developers send
requests and receive responses in various ways, even when using the same HTTP library. List-
ings 2.1 and 2.2 in Section 2.1.2 only demonstrate one common way of using each of the two
HTTP libraries. While recommended in the libraries' documentation, there is no requirement
or guarantee that developers will follow this guidance in their apps. Furthermore, the examples
in the documentation are at the source code level, while our instrumentation using Soot [26] is
at the bytecode level. This meant that we needed to understand the actual usage of those two
HTTP libraries at the bytecode level. With the additional inspection, we were able to identify
the actual methods used for sending requests and receiving responses in the apps, allowing us to
instrument the code to capture the precise information needed for our study. For example, line 9
in Listing 2.1 from Section 2.1 defines headerMap that contains all of the header information; our instrumentation then inserts a method after line 9 to capture the headers relevant to our study, such as the Expires header. It is important to note that the instrumented apps' primary functionality
is left unchanged in this process.
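To give a flavor of this instrumentation, the sketch below uses Soot's Jimple API to insert a call to a logging helper right after each invocation of URLConnection.getHeaderFields(). It is a simplified illustration under our assumptions, not the exact transformer used in the study; in particular, the LogHelper class and its logHeaders(Map) method are hypothetical code that would be packaged with the instrumented app.
import java.util.ArrayList;
import java.util.Map;
import soot.*;
import soot.jimple.*;

// Sketch of a Soot BodyTransformer: after each call to
// URLConnection.getHeaderFields(), insert a call to our own logging helper.
public class HeaderLoggingTransformer extends BodyTransformer {
  private static final String TARGET_SIG =
      "<java.net.URLConnection: java.util.Map getHeaderFields()>";

  @Override
  protected void internalTransform(Body body, String phase, Map<String, String> options) {
    // LogHelper.logHeaders(Map) is a hypothetical helper packaged with the app.
    SootMethod logMethod =
        Scene.v().getMethod("<LogHelper: void logHeaders(java.util.Map)>");
    for (Unit unit : new ArrayList<>(body.getUnits())) {   // snapshot before editing
      Stmt stmt = (Stmt) unit;
      if (stmt instanceof AssignStmt && stmt.containsInvokeExpr()
          && stmt.getInvokeExpr().getMethodRef().getSignature().equals(TARGET_SIG)) {
        Value headerMap = ((AssignStmt) stmt).getLeftOp();
        Unit logCall = Jimple.v().newInvokeStmt(
            Jimple.v().newStaticInvokeExpr(logMethod.makeRef(), headerMap));
        body.getUnits().insertAfter(logCall, stmt);         // app logic is unchanged
      }
    }
  }
}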
4.3.4 App Testing
After the instrumentation, each app was subjected to random input testing through Android
Debug Bridge (adb) [1]. We used the UI/Application exerciser tool Monkey [27] to generate
random streams of user events, such as clicks, touches, and swipes. The apps were run on the
NoxPlayer Android emulator [21]. Each test consisted of 3,000 events under WiFi network settings.
We also explored testing with 1,000, 5,000, and 10,000 events. We found that 3,000 was the
smallest number of events that yielded a representative number of HTTP requests triggered at
runtime across the subject apps; neither 5,000 nor 10,000 events resulted in a significant increase
in HTTP requests, while 1,000 events proved to be too few to adequately exercise the relevant
functionality in the apps.
All tests were preceded by a fresh installation of the given subject app, and the app was
removed from the emulator after each test's conclusion. This minimized the chances of errors
caused by any interference between apps or by previously saved settings.
4.3.5 Final Set of Subject Apps
The objective of our study is to determine whether and when HTTP requests should be prefetched
and their responses cached. In some cases, the number of HTTP requests triggered in our tests
was very low, suggesting that prefetching and caching in such apps would not be beneficial. To determine the nature of "low network usage" apps and the underlying reasons behind the data we obtained, we manually inspected each app, starting with those that do not trigger any requests.
A total of 623 out of the 1,308 subject apps triggered no requests. We identified six recurring
reasons behind this:
1. The app's installation failed.
2. The app crashed upon launching.
3. The app's version was incompatible with the NoxPlayer Android emulator [21].
4. The app was obfuscated so that the methods relevant to HTTP requests were not captured
by our instrumentation.
5. The app required external information before it could be used, such as a bank PIN (com-
monly required in the Finance category) or a vehicle license plate (commonly required in
the Auto & Vehicles category).
6. The app only contained static content and did not rely on the network.
Note that, while we could not automatically test the above apps, many of them may, in fact,
trigger HTTP requests at runtime. The only exceptions are apps from the last category. The automated nature of our app testing prevented us from determining the exact numbers of apps that fell in each of the above six categories. A manual inspection of a random sample of the apps suggests that, with a 95% confidence level, no more than 50% of the 623 apps contained only
static content.
An additional 234 of the 1,308 subject apps triggered 1-3 requests at runtime. We observed a
common pattern among these apps. Namely, regardless of the type of app, those requests tended
to be one or more of the following:
1. Load an application-specic conguration le.
2. Log in with Facebook using Facebook GraphRequest.
3. Use monitoring services Crashlytics or Google Analytics.
Further manual testing of these apps yielded no additional HTTP requests beyond the above
three.
We were unable to identify any patterns such as the above in apps that triggered other numbers of requests. Thus, the analysis of prefetchability and cacheability below is based on the 451 subject apps that triggered four or more requests at runtime, corresponding to the right-most
column of Table 4.1.
4.4 Results and Discussion
This section describes the results of our analysis, framed by the nine research questions from Sec-
tion 4.2, and discusses the lessons learned from the results. Table 4.2 summarizes the information
about the final set of 451 subject apps in each category that are analyzed in this section. Note that the app categories are numbered 1-33, to aid the depiction and understanding of the figures in the remainder of this section. Among the 451 apps, the number of HTTP requests ranged between 4 (the cut-off number for our analysis, as discussed above) and 1,243, with the average
slightly above 35 requests per app.
Table 4.2: App information for each category among nal subjects
Category #Apps Min. #Req Max. #Req Avg. #Req
1. Art & Design 3 4 14 8.33
2. Auto & Vehicles 4 4 6 4.75
3. Beauty 6 4 1243 220.33
4. Books & Reference 16 4 108 27.94
5. Business 17 4 87 17.24
6. Comics 19 4 319 59.58
7. Communications 8 4 96 19
8. Dating 6 5 334 78.83
9. Education 17 4 62 15.06
10. Entertainment 11 6 134 30.73
11. Events 5 11 53 22.2
12. Finance 27 5 150 35.59
13. Food & Drink 13 4 188 33.46
14. Games 25 4 59 18
15. Health & Fitness 15 4 14 8.13
16. House & Home 8 4 149 55.38
17. Libraries & Demo 1 22 22 22
18. Lifestyle 12 4 82 21
19. Maps & Navigation 8 8 206 54.88
20. Medical 10 4 63 14
21. Music & Audio 14 5 44 16.14
22. News & Magazines 26 4 802 70.88
23. Parenting 5 4 28 12
24. Personalization 11 6 288 82.73
25. Photography 14 4 58 23
26. Productivity 24 4 119 22.67
27. Shopping 22 4 198 44.14
28. Social 23 4 108 20.35
29. Sports 18 7 146 45.67
30. Tools 16 4 130 21.44
31. Travel & Local 27 4 208 32.11
32. Video Players & Editors 8 4 134 33.63
33. Weather 12 7 123 36.17
Total 451 4 1243 35.28
4.4.1 Prefetchability of HTTP Requests
Recall from Section 4.2 that we try to answer three research questions regarding the prefetchability
of HTTP requests. Specifically, we are interested in GET requests, which are the primary candidates for prefetching.
• RQ1 - What is the number of GET requests per app?
• RQ2 - What is the percentage of GET requests among all HTTP requests in mobile apps?
• RQ3 - How prevalent are GET requests across different app categories?
To answer the above questions, we instrumented and tested our subject apps using the proce-
dure described in Section 4.3. We calculated the total number of GET requests observed during our
testing, and the percentage of GET requests among all HTTP requests triggered at runtime in each
app. We subsequently grouped the results by app category. Figure 4.2 depicts the minimum, maximum, and average numbers of GET requests per app (RQ1) across the different categories (RQ3). Figure 4.3 depicts the minimum, maximum, and average percentages of GET requests as compared to all HTTP requests (RQ2) in each app category (RQ3).
Figure 4.2: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of GET requests in apps across the 33 app categories. Apps in 7 categories had maximums
higher than 150 (numbers displayed beside the corresponding bars). Note that the average for
app category 3 is also higher than 150, and thus not shown.
Figure 4.3: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of GET requests in apps across the 33 app categories.
Our data indicate that GET requests are pervasive across all 33 app categories. As shown in
Figure 4.2, seven categories contained apps that sent 150 or more GET requests. On average, an
app sent 28 GET requests, and those requests comprised 68% of all HTTP requests sent by the app.
As shown in Figure 4.3, several categories, namely Beauty (94%), Comics (87%), Entertainment (88%), and Events (87%), had very high percentages of GET requests. Only two categories, Dating (43%) and Tools (44%), had slightly fewer than 50% GET requests.
These results suggest that there is a significant opportunity to exploit prefetching among the 451 subject apps that sent 4 or more HTTP requests. It was surprising to see that 102 apps, spanning 29 of the 33 categories, sent only GET requests. Certain categories are potentially more suitable for prefetching than others. This is a by-product of the types of functionality that are typical in a given category. The nature of apps in "stable" domains, such as Art & Design or Libraries & Demo, is such that they may be able to operate with less remotely accessed data than apps in more "dynamic" domains such as News & Magazines or Shopping.
4.4.2 Cacheability of HTTP Responses
As discussed in Section 4.2, the cacheability of HTTP responses is a function of the presence of
Cache-Control and Expires headers, and their trustworthiness. To that end, we try to answer
the following four research questions.
• RQ4 - How prevalent are Expires headers?
• RQ5 - Are Expires headers trustworthy?
• RQ6 - How prevalent are Cache-Control headers?
• RQ7 - Are Cache-Control headers trustworthy?
To answer the above questions, we instrumented the subject apps to capture response headers
(recall Section 4.3.3) and calculate the numbers of occurrences of the two relevant headers. To
determine whether the header of a given request is trustworthy, we made each request 4 times:
at initial time t, t + 10, t + 30, and t + 60 seconds. This allowed us to determine whether later
responses reflect what is specified in the header of the original response.
For example, let us assume that the original request is sent at time(t) and that the response
header contains Expires: time(exp). We will mark the header as untrustworthy if it falls into
any of the following three cases, where x is the time period after the original request is sent:
1. time(exp) <= time(t)
2. (time(t) < time(exp) <= time(t+x)) ∧ (response@(t) = response@(t+x))
3. (time(exp) > time(t+x)) ∧ (response@(t) ≠ response@(t+x))
In our case, x is any of 10s, 30s, or 60s. The first case indicates a scenario where the response expires before the request is even sent. The second case indicates a scenario where the response is supposed to have expired, but it has remained unchanged. Finally, the third case indicates a scenario where the response should have remained the same, but it changed.
We use the analogous algorithm to determine whether the Cache-Control header is trustworthy, based on the max-age field specified within the header.
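For concreteness, the sketch below encodes this trustworthiness check for the Expires header; the method and parameter names are ours, the responses are compared as raw byte arrays, and the Expires value is assumed to use the standard RFC 1123 date format.
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;

// Sketch of the Expires-header trustworthiness check: the header is marked
// untrustworthy if the responses observed at t and t+x contradict it.
class ExpiresCheck {
  static boolean isUntrustworthy(String expiresHeader, Instant t, Instant tPlusX,
                                 byte[] responseAtT, byte[] responseAtTPlusX) {
    Instant exp = ZonedDateTime
        .parse(expiresHeader, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    boolean unchanged = Arrays.equals(responseAtT, responseAtTPlusX);
    if (!exp.isAfter(t)) {
      return true;                    // case 1: expired before the request was sent
    }
    if (!exp.isAfter(tPlusX) && unchanged) {
      return true;                    // case 2: should have expired, but did not change
    }
    if (exp.isAfter(tPlusX) && !unchanged) {
      return true;                    // case 3: should still be fresh, but changed
    }
    return false;
  }
}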
Figure 4.4 shows the minimum, maximum, and average numbers of the Expires headers included in HTTP responses for each app category (RQ4). Figure 4.5 shows the minimum, maximum, and average percentages of the Expires headers among all the response headers in each app category (RQ4). Figure 4.6 shows the percentages of the trustworthy Expires headers among all the Expires headers (RQ5). Figures 4.7, 4.8, and 4.9 show the analogous information for the Cache-Control header (RQ6, RQ7).
From the results, we can conclude that the Expires headers and Cache-Control headers are
not always included in the responses, and they are not always trustworthy. The Cache-Control
header tends to be used more reliably than the Expires header. Across the 33 app categories,
53% of the response headers contain Expires on average, while 65% contain Cache-Control.
Only an average of 25% of the Expires headers are trustworthy, while 77% of the Cache-Control
headers are trustworthy. While there are individual apps among our subjects where each of the
two headers was used in a completely trustworthy manner (100%), there were an even greater
number of apps where the opposite was true (0%).
Figure 4.4: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of Expires headers in each app category. Apps in 11 categories had maximums higher
than 30 (numbers displayed beside or above the corresponding bars).
Figure 4.5: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of the Expires headers for each app category.
Figure 4.6: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of trusted Expires headers in each app category.
Figure 4.7: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
numbers of Cache-Control headers in each app category. Apps in 14 categories had maximums
higher than 30 (numbers displayed beside or above the corresponding bars).
Figure 4.8: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of Cache-Control headers in each app category.
Figure 4.9: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of trusted Cache-Control headers in each app category.
These results strongly suggest that developers should not depend on the response headers
to determine their caching schemes. Unfortunately, there are currently no reliable alternatives
for the mobile app domain. However, this presents a research opportunity to investigate more
intelligent approaches. One strategy that suggests itself based on our study would involve learning
the correct information to include in the headers based on historical data. Such a technique could
then automatically suggest app modifications, in order to fix the "buggy" headers.
4.4.3 Identifying Truly Redundant HTTP Requests
As discussed in Section 4.2, redundant HTTP requests are good candidates for prefetching and
caching. However, certain HTTP requests are only ostensibly redundant in that they seem identical but actually yield different responses. Our final two research questions aim to shed light on this
issue.
• RQ8 - How prevalent are redundant HTTP requests?
• RQ9 - Are the identified ostensibly redundant requests truly redundant?
In our analysis, we have specifically focused on GET requests, as discussed previously.
To answer the above questions, upon completion of testing a given app (by executing the 3,000 events as explained in Section 4.3.4), we identify the ostensibly redundant requests in each app. We then run a script that sends each identified request four times: at initial
time t, t + 10, t + 30, and t + 60 seconds. We check whether the responses change during this
interval. This helps to identify HTTP requests that are truly redundant; the responses to those
requests are thus suitable candidates for caching.
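The following sketch approximates this check under our reading of it: a request is considered truly redundant if its response is unchanged at one or more of the 10s, 30s, and 60s checkpoints, i.e., the initial response could have been reused at least once. The fetch helper and the fixed offsets are our simplifications, and the time spent fetching is ignored.
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Arrays;

// Sketch of the truly-redundant check: re-send an ostensibly redundant GET
// request at t, t+10s, t+30s, and t+60s and compare the response bodies.
class RedundancyCheck {
  static boolean isTrulyRedundant(String url) throws Exception {
    byte[] initial = fetch(url);                             // response at time t
    long previousOffset = 0;
    boolean reusableAtLeastOnce = false;
    for (long offsetSeconds : new long[] {10, 30, 60}) {
      Thread.sleep((offsetSeconds - previousOffset) * 1000); // wait until t + offset
      previousOffset = offsetSeconds;
      if (Arrays.equals(initial, fetch(url))) {
        reusableAtLeastOnce = true;                          // unchanged at this checkpoint
      }
    }
    return reusableAtLeastOnce;
  }

  private static byte[] fetch(String url) throws Exception {
    URLConnection conn = new URL(url).openConnection();
    try (InputStream in = conn.getInputStream()) {
      return in.readAllBytes();                              // Java 9+ convenience method
    }
  }
}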
Figure 4.10: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of ostensibly redundant requests in each app category.
Figure 4.11: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
expiration times for the redundant requests in each app category.
Figure 4.10 shows the minimum, maximum, and average percentages of the identified ostensibly redundant requests as compared to the total number of requests in each app category (RQ8).
Figure 4.12: Minimum (bottom edges), maximum (top edges), and average (horizontal dashes)
percentages of truly redundant requests in each app category.
Figure 4.11 shows the minimum, maximum, and average expiration times for the identified requests (RQ9). A request's expiration time is the time at which its response is different from the response received for the initial request at time t. Finally, Figure 4.12 shows the minimum, maximum, and average percentages of the truly redundant requests (RQ9).
As Figure 4.10 shows, redundant requests comprise a significant proportion of all HTTP requests across most of the app categories. In certain apps, nearly 100% of the requests are redundant, while the average across all apps is approximately 20%. By themselves, these results would suggest
considerable cacheability potential.
This is further bolstered by some of the results in Figure 4.11, which points to several apps
in which the HTTP requests did not expire even after the full 60s. However, this is somewhat
deceptive: The average request expiration time was 12s across the 33 app categories; it was exactly
10s for several of the categories; and only two categories, Food & Drink and House & Home, had average expiration times over 20s. Since 10s was the shortest interval used in our study, these
results suggest that most redundant requests expire within a relatively short time period. This
should be taken into account when devising caching schemes for mobile apps.
Finally, Figure 4.12 shows that, on average, an overwhelming majority of ostensibly redun-
dant requests are truly redundant across the 33 app categories. This means that the ostensibly
redundant request did not expire at one or more of the 10s, 30s, and 60s checkpoints. In a number
of individual apps, all ostensibly redundant requests are truly redundant (the maximum value of
100%), while their average for app categories is as high as 92%. This observation shows a large
opportunity for caching redundant requests in mobile apps.
4.5 Implications
Our analysis identified a number of apps in which prefetching and caching are unlikely to have significant, or any, benefits. At the least, these include the several hundred apps from our original set of subjects that only provide static content or make very few (1-3) HTTP requests. In fact, it is possible that the number of these apps surpasses the 451 apps that do rely on the network and that we included in our final set of subjects (recall the discussion in Section 4.3.5). This
outcome was at least somewhat surprising, given the long history of research on data prefetching
and caching in distributed systems, of which mobile apps are only a more recent example.
A deeper analysis helps to identify several reasons behind this. For example, in hindsight it may
have been expected that apps from the Libraries & Demo or Video Players & Editors categories
provide static content, such as PDF viewers, organizers, digital books, and video players. On the
other hand, we did not expect to find almost as much static content in Auto & Vehicles. We already discussed in Section 4.3.5 that a number of apps from this category required login by supplying a license plate number. An additional, large number of apps also contained purely static content, such as instructions on how to perform car maintenance. This is reflected in our data: under 14% of the Auto & Vehicles apps made it into our final set of 451 subjects (recall Table 4.1).
Another issue was presented by apps that used network communication that was either not
based on HTTP or extensively used HTTP methods other than GET. For example, a number of
apps in the Communications category provide instant messaging capabilities (including VoIP),
while others actually implement browsers. Maps & Navigation apps provide GPS functionality that differs significantly from typical HTTP services. Yet another example is the Finance category. Even though 44% of these apps made it into our final set of subjects, a lot of them are banking apps
that predominantly perform push-type operations, making them ill-suited for prefetching.
Even within the 451 final subject apps, there are clearly some for which the benefits of prefetching and caching may be marginal. Games presented an interesting case. Over 2/3 of the apps in this category made it into our final set of subjects since they used sufficiently large numbers of HTTP requests. These apps also exhibited very high Cache-Control trustworthiness. On the other hand, as expected, their requests tended to expire very quickly and to have little redundancy. Therefore, while a game app may be identified as a candidate for prefetching, the resulting
cached data would become stale very quickly. In turn, this would possibly lead to incorrect app
behavior or, just as bad, constant thrashing of the prefetching facilities that would cripple the
app's performance.
These issues can be further illustrated with a somewhat crude analysis of an average app
from our subject set. The average app sent 28 GET requests (recall Section 4.4.1) as a result of
the 3,000 automatically generated UI events. 20% of those requests were truly redundant (recall
Section 4.4.3). That means that up to 6 GET requests were prefetchable.
While we must be cognizant of apps, such as those above, that are not especially amenable
to prefetching and caching, several scenarios in our study paint a much more favorable picture.
Consider the app from category 3 (Beauty) that issued 1,243 GET requests (recall Figure 4.2),
all of which are truly redundant (corresponding to the maximum value for app category 3 in
Figure 4.12). Even if we assume that the result of each redundant request can only be reused
once before it expires (recall from Figure 4.11 that the expiration time for app category 3 is 10s),
that still yields 621 requests for which the results can be reused from the local cache.
4.6 Threats to Validity
Our study is based on top-ranked, free Android apps. Therefore, our results may not hold for
paid apps or lower-ranked apps. However, over 90% of the Android apps in the Google Play Store
are free [8]. Furthermore, top-ranked apps are used most widely. This suggests that our results
should have broad applicability.
We excluded from our numerical analysis the apps that trigger fewer than four HTTP requests
at runtime. However, part of the objective of our study was to explore this problem space. Specif-
ically, we identified the reasons behind the apps' low numbers of requests (recall Section 4.3.5). Furthermore, we acknowledged explicitly that the exclusion of these apps from the final set of subjects limits the applicability of our findings (recall Section 4.5).
Our study is based on apps that use the HTTP protocol and two HTTP libraries (URLConnection
and OkHttp). Our findings are unlikely to be directly applicable to other protocols for network communication, and they may not carry over to other HTTP libraries. However, most mobile apps, and in particular Android apps, rely on HTTP. Furthermore, our focus is on the fundamental characteristics of HTTP requests and responses, and those characteristics do not change across different HTTP libraries. Including other libraries would naturally result in the inclusion
of greater numbers of subject apps. However, given the popularity of the HTTP libraries we
selected, our results should be widely representative among Android apps.
In our process for answering RQ5, RQ7, and RQ9, we sent out sets of four requests, at times t,
t + 10, t + 30, and t + 60 seconds. As shown in Figure 4.11, redundant requests tend to expire at
t + 10 or soon thereafter. This indicates that t + 60 is a sufficiently long period to identify truly redundant requests in most cases. While choosing different time intervals would likely not lead to different results, finer-grained intervals may give us tighter bounds on request expiration times.
Finally, our app usage information was obtained via automated generation of UI events, as
opposed to logging real user events. This may result in numbers and sequences of HTTP requests
that are not representative of actual app use. Given the nature of the study and the large number
of apps we aimed to analyze, it would have been unreasonable to attempt to find actual users for each app, while our results would potentially suffer from user-specific biases and idiosyncrasies in engaging the app. On the other hand, mimicking actual users with humans who are unfamiliar with the apps in question, which would have been a more likely alternative, would have suffered
from the same potential problem as our automated testing. Finally, neither actual nor novice
human users would have been able to repeatedly and reliably generate large numbers of events
(3,000 per app execution in the main portion of our study, and up to 10,000 per execution in the
preliminary analysis).
4.7 Summary
We presented the results of an extensive empirical study aimed at understanding the characteristics
of HTTP requests and responses in mobile apps. We formulated nine research questions with the
focus on the prefetchability of HTTP requests, cacheability of HTTP responses, and redundancies
among HTTP requests. Our overarching objective is to fill the gap between the well-studied browser domain and the comparatively less explored mobile-app domain, by motivating and providing guidelines for future research in this area.
We found that a large number of HTTP requests used in real apps are prefetchable and the responses to those requests cacheable. This has the potential for significant reductions in user-
perceived latency, which would, in turn, render the use of certain mobile apps even more attractive.
This directly motivates our subsequent work to propose novel prefetching and caching techniques
in the mobile-app domain.
At the same time, our study highlighted the need to carefully consider which requests should
be prefetched and which data cached, for two reasons. First, we empirically demonstrated the
frequent lack of discipline with which developers use the relevant HTTP headers in mobile (specif-
ically, Android) apps, making those headers misleading. Second, we showed that responses to cer-
tain HTTP requests that seem like good candidates for caching may yield incorrect app behaviors
due to cache staleness.
Our results suggest that prefetching and caching can be useful across a wide range of mobile
apps and scenarios, but they are not universally applicable and their benefits will vary. Certain app categories are more amenable to prefetching and caching. However, there is a non-trivial amount of variation even among different apps within a single category. While our analysis reported in this study does not provide definitive answers to questions of what, when, and how
much to prefetch and cache, it provides a process, tools, and data that form a foundation for
answering those questions much more precisely than has been possible thus far.
Chapter 5
Content-based Approach PALOMA
Motivated by the large opportunities for prefetching and caching shown in our empirical study [194] described in Chapter 4, we developed PALOMA [192], the first content-based prefetching technique in the mobile-app domain. This chapter presents PALOMA, a novel client-based technique for reducing network latency by prefetching HTTP requests in Android apps. PALOMA leverages string analysis and callback control-flow analysis to automatically instrument apps using its rigorous formulation of scenarios that address "what" and "when" to prefetch. PALOMA has been shown to incur significant runtime savings (several hundred milliseconds per prefetchable HTTP request), both when applied on a reusable evaluation benchmark we have developed and on real applications.
5.1 Motivation
Prefetching has been explored in distributed systems previously. Existing approaches can be
divided into four categories based on what they prefetch and when they do so. (1) Server-based
techniques analyze the requests sent to the server and provide "hints" to the client on what
to prefetch [150, 68, 134, 151]. However, most of today's mobile apps depend extensively on
heterogeneous third-party servers. Thus, providing server-side "hints" is difficult, not scalable, or
even impossible, because app developers do not have control over the third-party servers [172]. (2)
Human-based approaches rely on developers, who have to explicitly annotate application segments
that are amenable to prefetching [127, 110]. Such approaches are error-prone and pose a significant
manual burden on developers. (3) History-based approaches build predictive models from prior
requests to anticipate what request will happen next [134, 159, 74, 103, 175]. Such approaches
require significant time to gather historical data. Additionally, building a precise predictive model
based on history is more difficult in today's setting because the context of mobile users changes
frequently. (4) Domain-based approaches narrow down the problem to one specific domain. For
example, approaches that focus on the social-network domain [171, 177] only prefetch the constant
URLs in tweets based on user behavior and resource constraints. These approaches cannot be
applied to mobile apps in general.
To address these limitations of current prefetching approaches, we have developed PALOMA
(Program Analysis for Latency Optimization of Mobile Apps), a novel technique that is client-centric,
automatic, domain-independent, and requires no historical data. In its implementation,
we focus on native Android apps because of Android's dominant market share [142] and its
reliance on event-driven interaction, which is the most popular style used in mobile apps today.
Our guiding insight is that an app's code can provide a lot of useful information regarding what
HTTP requests may occur and when. In addition, a mobile user usually spends multiple seconds
deciding what event to trigger next (a period known as "user think time" [127]), providing an
opportunity to prefetch HTTP requests in the background. By analyzing an Android program,
we are able to identify HTTP requests and certain user-event sequences (e.g., onScroll followed
by onClick). With that information, we can prefetch requests that will happen next during user
think time.
User-event transitions are captured as callback control-flow [183] relationships in PALOMA,
and we only perform very targeted, short-term prefetching: a single callback ahead. There are
several reasons we opted for this strategy. First, short-term prefetching minimizes the cache-staleness
problem that is commonly experienced by longer-term prefetching, because the newly
updated cache will be used immediately when the user transitions to the next event. Second,
the information needed to send the HTTP requests (e.g., a parameter in an HTTP request that
depends on user input) is more likely to be known, since the prefetching occurs very close in time
to the actual request. Third, short-term prefetching takes advantage of the user think time between
callbacks, which has been shown to be sufficient for prefetching HTTP requests [127, 74]. By
contrast, prefetching within the same callback would not provide a performance gain, since the
relevant statements would execute within a few milliseconds of one another.
5.2 Terminology
We dene several terms needed for describing our approach to program analysis-based prefetching
of network requests. We use the concrete example (Listing 2.3) introduced in Section 2.2 to
demonstrate the terms.
URL Spot is a code statement that creates a URL object for an HTTP request based on
a string denoting the endpoint of the request. Example URL Spots are Lines 12, 13, and 14 in
Listing 2.3.
Definition Spot_{m,n} is a code statement where the value of a dynamic URL string is defined,
such as Lines 7 and 11 in Listing 2.3. m denotes the m-th substring in the URL string, and
n denotes the n-th definition of that substring in the code. For example, Line 7 would contain
Definition Spot L7_{3,1} for url2, because cityName is the third substring in url2 and Line 7 is
the first definition of cityName.
Fetch Spot is a code statement where the HTTP request is sent to the remote server. Example
Fetch Spots are Lines 16, 18, and 20.
Callback is a method that is invoked implicitly by the Android framework in response to a
certain event. Example callbacks from Listing 2.3 include the onItemSelected() (Line 6) and
onClick() (Line 10) methods. These are referred to as event handler callbacks in Android, as they
respond to user interactions [81]. Android also defines a set of lifecycle callbacks that respond to
changes of an app's "life status" [80], such as the onCreate() method at Line 3.
Call Graph is a control-flow graph representing the explicit invocation relationships between
procedures in the app code.
Target Method is a method that contains at least one Fetch Spot. It is so named because
identifying methods that contain Fetch Spots is the target of PALOMA's analysis. For example,
the onClick() method is a Target Method because it contains three Fetch Spots. A Target
Method may or may not be a Callback.
Target Callback is a Callback that can reach at least one Target Method in a Call Graph.
If a Target Method is itself a Callback, it is also a Target Callback. For example, the onClick()
Callback defined at Lines 10-22 of Listing 2.3 is a Target Callback.
Callback Control-Flow Graph (CCFG) represents the implicit-invocation flow involving
different Callbacks [183]. In a CCFG, nodes represent Callbacks, and each directed edge f→s
denotes that s is the next Callback invoked after f.
Figure 5.1: CCFG extracted from Listing 2.3 by GATOR [148, 183]. Its nodes are the callbacks onCreate (of MainActivity), onItemSelected (of the cityNameSpinner), onClick (of the submitBtn), and onCreate (of DisplayActivity); wn1 is a wait node.
Figure 5.1 illustrates the CCFG extracted from Listing 2.3 using GATOR, a recently-developed
analysis technique [148, 183]. A wait node in a CCFG (e.g., wn1 in Figure 5.1) indicates that the
user's action is required, and the event she triggers will determine which one of the subsequent
callbacks is invoked.
Trigger Callback is any Callback in the CCFG that is an immediate predecessor of a Target
Callback, with only a wait node between them. For instance, in Listing 2.3 the Trigger Callbacks
for the Target Callback onClick() are onCreate() (path 1→2) and onItemSelected() (path
5→2). Note that onClick() cannot be the Trigger Callback for DisplayActivity's onCreate()
method (path 3) because there is no wait node between them.
Trigger Point is the program point that triggers the prefetching of one or more HTTP
requests.
The relationships among these terms are illustrated in Figure 5.2. Each node is a callback,
and the links are the control flow between callbacks. A solid arrow between nodes means
"followed by immediately": e.g., the Target Callback may be triggered immediately after the Trigger
Callback if that path is selected by the user at runtime, with no other callbacks in between.
A dotted arrow between callbacks means "same or after": e.g., Callback 2 may be the same
callback as Callback 1, or Callback 2 may be triggered after Callback 1, possibly with other
callbacks in between. The Call Graph relationship within each callback is abstracted away.
Figure 5.2: Relationship of PALOMA's Terminology. The figure depicts the URL Spot, Definition Spot, Fetch Spot, Trigger Point, Trigger Callback, and Target Callback, together with the "followed by immediately" and "same or after" relations between callbacks.
5.3 Approach
This section presents PALOMA, a prefetching-based solution for reducing user-perceived latency in
mobile apps that does not require any developer effort or remote server modifications. PALOMA is
motivated by the following three challenges: (1) which HTTP requests can be prefetched, (2) what
their URL values are, and (3) when to issue prefetching requests. Our guiding insight is that
static program analysis can help us address all three challenges. To that end, PALOMA employs
an offline-online collaborative strategy shown in Figure 5.3. The offline component automatically
transforms a mobile app into a prefetching-enabled app, while the online component issues
prefetching requests through a local proxy.
Figure 5.3: High-level overview of the PALOMA approach
PALOMA has four major elements. It first performs two static analyses: it (1) identifies
HTTP requests suitable for prefetching via string analysis and (2) detects the points for issuing
prefetching requests (i.e., Trigger Points) for each identified HTTP request via callback analysis.
PALOMA then (3) instruments the app automatically based on the extracted information and
produces an optimized, prefetching-enabled app. Finally, at runtime, the optimized app will
interact with a local proxy deployed on the mobile device. The local proxy (4) issues prefetching
requests on behalf of the app and caches prefetched resources so that future on-demand requests
can be serviced immediately. We detail these four elements next.
5.3.1 String Analysis
The goal of string analysis is to identify the URL values of HTTP requests. Prefetching can only
happen when the destination URL of an HTTP request is known. The key to string analysis is
to dierentiate between static and dynamic URL values. A static URL value is the substring in
a URL whose concrete value can be determined using conventional static analysis. In contrast,
a dynamic URL value is the substring in a URL whose concrete value depends on user input.
For this reason, we identify the Denition Spots of dynamic URL values and postpone the actual
value discovery until runtime.
As Figure 5.4 shows, the output of string analysis is a URL Map that will be used by the proxy
at runtime, and the Definition Spots in the URL Map will be used by the App Instrumentation
step. The URL Map relates each URL substring with its concrete value (for static values) or
Definition Spots (for dynamic values). In the example of Listing 2.3 described in Section 2.2, the
entry in the URL Map that is associated with url2 would be
{ url2: ["http://weatherapi/", "weather?&cityName=", L7_{3,1}] }
We now explain how the URL Map is created for static and dynamic URL values.
Static value analysis – To interpret the concrete value of each static substring, we must find
its use-definition chain and propagate the value along the chain. To do that, we leveraged a recent
string analysis framework, Violist [109], that performs control- and data-flow analyses to identify
the value of a string variable at any given program point. Violist is unable to handle implicit
use-definition relationships that are introduced by the Android app development framework. In
particular, in Android, string values can be defined in a resource file that is persisted in the app's
internal storage and retrieved during runtime. For instance, in Listing 2.3, all three URLs have
a substring getString("domain") (Lines 12-14), which is defined in the app's resource file [83].
Figure 5.4: PALOMA's detailed workflow. Different analysis tools employed by PALOMA and
artifacts produced by it are depicted, with a distinction drawn between those that are extensions
of prior work and newly created ones.
PALOMA extends Violist to properly identify this case and to include the app's resource file,
extracted by decompiling the app, in the control- and data-flow analysis. In the end, the concrete
value of each static substring in each URL is added to the URL Map.
Dynamic value analysis – Dynamic URL values cannot be determined by static analysis.
Instead, PALOMA identifies the locations where a dynamic value is defined, i.e., its Definition
Spots. The Definition Spots are later instrumented (see Section 5.3.3) such that the concrete
values can be determined at runtime.
The key challenge in identifying the Definition Spots is that a URL string may be defined
in a callback different from the callback where the URL is used. Recall that, due to the event-driven
execution model, callbacks are invoked implicitly by Android. Therefore, the control flow
between callbacks, on which the string analysis depends, cannot be obtained by analyzing the
app code statically. Solving the inter-callback data-flow problem is outside the scope of this
dissertation. This is still an open problem in program analysis, because of the implicit control
flow among callbacks as well as the complex and varied types of events that can trigger callbacks at
runtime, such as GUI events (e.g., clicking a button), system events (e.g., screen rotation), and
background events (e.g., sensor data changes). Research efforts on understanding callbacks are
limited to specific objectives that prevent their use for string analysis in general. Such efforts have
included a focus restricted to GUI-related callbacks [184, 183] (which we do use in our callback
analysis, detailed in Section 5.3.2), an assumption that callback control flow can occur in any
arbitrary order [43], and analysis of the Android framework-level, but not app-level, code to
construct callback summaries [138, 57].
To mitigate these shortcomings, we developed a hybrid static/dynamic approach, where the
static part conservatively identifies all potential Definition Spots, leaving to the runtime the
determination of which ones are the actual Definition Spots. In particular, we focus on the
Definition Spots of class fields because a field is a common way to pass data between callbacks.
We identify all potential Definition Spots in two ways. First, if a string variable is a private
member of a class, we include all the Definition Spots inside that class, such as constructor
methods, setter methods, and definitions in the static block. Second, if a variable is a public
member of a class, that variable can be defined outside the class, and we conduct a whole-program
analysis to find all assignments to the variable that propagate to the URL.
At the end of the analysis, all substring Definition Spots for a URL are added to the URL
Map. It is worth noting that although the static analysis is conservative and multiple Definition
Spots may be recorded in the URL Map, the true Definition Spot will emerge at runtime because
false definitions will either be overwritten by a later true definition (i.e., a classic write-after-write
dependency) or will never be encountered if they lie along unreachable paths.
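To make the URL Map concrete, the following minimal Java sketch illustrates one way the proxy could represent a URL as an ordered list of substrings, with dynamic slots filled in as runtime definitions arrive. The class and method names are our own illustration and do not correspond to PALOMA's actual implementation.

import java.util.*;

// Illustrative sketch of a URL Map entry: a URL is an ordered list of
// substrings, each either statically known or awaiting a runtime definition.
class UrlMapEntry {
    private final List<String> substrings;   // null == still unknown

    UrlMapEntry(String... initial) {
        substrings = new ArrayList<>(Arrays.asList(initial));
    }

    // Called when a Definition Spot reports a concrete runtime value
    // (id is 1-based, matching the m-th substring in the URL).
    void define(int id, String value) {
        substrings.set(id - 1, value);
    }

    // The URL can only be prefetched once every substring is concrete.
    boolean isKnown() {
        return substrings.stream().allMatch(Objects::nonNull);
    }

    String concreteUrl() {
        if (!isKnown()) throw new IllegalStateException("URL not fully known");
        return String.join("", substrings);
    }
}

class UrlMapDemo {
    public static void main(String[] args) {
        // url2 from the running example: two static substrings plus one
        // dynamic substring (cityName) that is unknown until runtime.
        UrlMapEntry url2 = new UrlMapEntry("http://weatherapi/", "weather?&cityName=", null);
        System.out.println(url2.isKnown());       // false
        url2.define(3, "Gothenburg");              // corresponds to sendDefinition(cityName, url2, 3)
        System.out.println(url2.concreteUrl());    // http://weatherapi/weather?&cityName=Gothenburg
    }
}

In this representation, a later definition of the same substring simply overwrites an earlier one, mirroring the write-after-write behavior described above.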
5.3.2 Callback Analysis
Callback analysis determines where to prefetch different HTTP requests, i.e., the Trigger Points
in the app code. There may be multiple possible Trigger Points for a given request, depending on
how far in advance the prefetching request is sent before the on-demand request is actually issued.
The most aggressive strategy would be to issue an HTTP request immediately after its URL value
is discovered. However, this approach may lead to many redundant network transmissions: the
URL value may not be used in any on-demand requests (i.e., it may be overwritten), or the callback
containing the HTTP request (i.e., the Target Callback) may not be reached at runtime at all.
In contrast, the most accurate strategy would be to issue the prefetching request right before the
on-demand request is sent. However, this strategy would yield no improvement in latency.
Our approach is to strike a balance between the two extremes. Specifically, PALOMA issues
prefetching requests at the end of the callback that is the immediate predecessor of the Target
Callback. Recall from Section 5.2 that we refer to the Target Callback's immediate predecessor as
the Trigger Callback, because it triggers prefetching. This strategy has the dual benefit of (1) taking
Algorithm 1: IdentifyTriggerCallbacks
Input: CCFG, ECG, App
Output: TriggerMap
1 InstrumentTimestamp(App);
2 NetworkMethodLogs ← Profile(App);
3 Signature ← GetFetchSignature(NetworkMethodLogs);
4 Requests ← GetRequests(Signature);
5 TriggerMap ← ∅;
6 foreach req ∈ Requests do
7   tarMethod ← GetTargetMethod(req);
8   TargetCallbacks ← FindEntries(tarMethod, ECG);
9   foreach tarCallback ∈ TargetCallbacks do
10    TriggerCallbacks ← GetImmediatePredecessors(tarCallback, CCFG);
11    foreach trigCallback ∈ TriggerCallbacks do
12      TriggerMap.Add(trigCallback, req.url);
13 return TriggerMap;
advantage of the "user think time" between two consecutive callbacks to allow prefetching to take
place, while (2) providing high prefetching accuracy, as the Trigger Point is reasonably close to
the on-demand request.
As Figure 5.4 shows, PALOMA creates a Trigger Map at the end of callback analysis that
is used by app instrumentation. The Trigger Map maps each Trigger Callback to the URLs that
will be prefetched at the end of that callback. In the example of Listing 2.3, the Trigger Map will
contain two entries:
{ [onCreate]: [url1, url2, url3] }
{ [onItemSelected]: [url1, url2, url3] }
because both onCreate() and onItemSelected() are Trigger Callbacks that are the immediate
predecessors of the Target Callback onClick(), which in turn contains url1, url2, and url3.
Algorithm 1 details how PALOMA identifies Trigger Callbacks and constructs the Trigger
Map. In addition to the app itself, the algorithm relies on two additional inputs, both obtained
with the help of off-the-shelf tools: the Callback Control-Flow Graph (CCFG) [183] and the Call
Graph (CG) [26]. Note that the CCFG we use in our callback analysis is restricted to GUI callbacks
that are triggered by user actions (recall Section 2.2). However, this fits PALOMA's needs given
its focus on user-initiated network requests. The CCFG captures the implicit-invocation flow
of Callbacks in Android, and thus allows us to find the Trigger Callbacks of a given Target
Callback. On the other hand, the CG, which is extracted by Soot [26], captures the control flow
between functions, and thus allows us to locate the Callbacks that contain any given method.
However, the CG does not include the direct invocations that are initiated from the Android
framework. We identified such invocations from Android's documentation and extended the CG
with the resulting direct edges. An example is the execute()→doInBackground() edge from the
AsyncTask class [79] that is widely used for network operations in Android. We refer to the thus
extended CG as ECG.
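As a simplified illustration of the ECG construction (using a plain string-keyed adjacency map rather than Soot's or GATOR's actual data structures, which differ), framework-initiated invocations can be modeled as extra directed edges added to the call graph:

import java.util.*;

// Minimal call-graph representation used only to illustrate how the CG is
// extended with framework-level edges (e.g., AsyncTask.execute() ->
// doInBackground()) to form the ECG. Methods are plain strings here.
class CallGraph {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addEdge(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
    }

    Set<String> callees(String method) {
        return edges.getOrDefault(method, Collections.emptySet());
    }

    public static void main(String[] args) {
        CallGraph cg = new CallGraph();
        // Edge visible directly from the app code.
        cg.addEdge("MainActivity.onClick()", "FetchWeatherTask.execute()");
        // Framework-initiated edge identified from Android's documentation;
        // adding such edges turns the CG into the ECG.
        cg.addEdge("FetchWeatherTask.execute()", "FetchWeatherTask.doInBackground()");
        System.out.println(cg.callees("FetchWeatherTask.execute()"));
    }
}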
Given these inputs, PALOMA first identifies the signature of a Fetch Spot, i.e., the method
that issues HTTP requests, by profiling the app (Lines 1-3 of Algorithm 1). We found that
the profiling is needed because the methods that actually issue HTTP requests under different
circumstances can vary across apps. For example, the getInputStream() method from Java's
URLConnection library may consume hundreds of milliseconds in one app, but zero in another
app where, e.g., the getResponseCode() method consumes several hundred milliseconds.¹ Thus,
we obtain the signatures by instrumenting timestamps in the app, and select the most time-consuming
network operations according to our profiling results. Using the signatures, we then
identify all HTTP requests that the app can possibly issue (Line 4). In the example of Listing 2.3,
the signature would be getInputStream() and the Requests would be conn1.getInputStream(),
conn2.getInputStream(), and conn3.getInputStream(). We iterate through each discovered
request and identify the method in which the request is actually issued, i.e., the Target Method
(Line 7). Using the control-flow information that the ECG provides, we locate all possible Target
Callbacks of a Target Method (Line 8). We then iterate through each Target Callback and identify
all of its immediate predecessors, i.e., Trigger Callbacks, according to the CCFG (Line 10). Finally,
we add each {Trigger Callback, URL} pair to the Trigger Map (Lines 11-12).
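The following Java sketch mirrors the core loop of Algorithm 1 over simplified, string-keyed versions of the CCFG and ECG; it is an illustration of the algorithm's logic under those assumptions, not PALOMA's actual implementation, and it omits the profiling steps (Lines 1-4).

import java.util.*;

// Simplified sketch of IdentifyTriggerCallbacks (Algorithm 1). The CCFG is
// given as a map from each callback to its immediate predecessors, and the
// ECG as a map from each method to the callbacks that can reach it.
class TriggerMapBuilder {
    private final Map<String, Set<String>> ccfgPredecessors;
    private final Map<String, Set<String>> entryCallbacks;

    TriggerMapBuilder(Map<String, Set<String>> ccfgPredecessors,
                      Map<String, Set<String>> entryCallbacks) {
        this.ccfgPredecessors = ccfgPredecessors;
        this.entryCallbacks = entryCallbacks;
    }

    // requestToTargetMethod maps each request's URL name to its Target Method.
    Map<String, Set<String>> build(Map<String, String> requestToTargetMethod) {
        Map<String, Set<String>> triggerMap = new HashMap<>();
        for (Map.Entry<String, String> req : requestToTargetMethod.entrySet()) {
            String url = req.getKey();
            String targetMethod = req.getValue();
            // Target Callbacks: callbacks that can reach the Target Method (Line 8).
            for (String targetCallback : entryCallbacks.getOrDefault(targetMethod, Set.of())) {
                // Trigger Callbacks: immediate CCFG predecessors of the Target Callback (Line 10).
                for (String triggerCallback : ccfgPredecessors.getOrDefault(targetCallback, Set.of())) {
                    triggerMap.computeIfAbsent(triggerCallback, k -> new HashSet<>()).add(url);
                }
            }
        }
        return triggerMap;
    }

    public static void main(String[] args) {
        // Running example: onClick() is both the Target Method and Target Callback.
        Map<String, Set<String>> ccfgPred = Map.of("onClick", Set.of("onCreate", "onItemSelected"));
        Map<String, Set<String>> entries = Map.of("onClick", Set.of("onClick"));
        Map<String, String> requests =
            Map.of("url1", "onClick", "url2", "onClick", "url3", "onClick");
        System.out.println(new TriggerMapBuilder(ccfgPred, entries).build(requests));
        // Prints a Trigger Map with onCreate and onItemSelected each mapped to {url1, url2, url3}.
    }
}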
5.3.3 App Instrumentation
PALOMA instruments an app automatically based on the information extracted by the two
static analyses, and produces an optimized, prefetching-enabled app, as Figure 5.4 shows. At
runtime, the optimized app will interact with a local proxy that is in charge of issuing prefetching
requests and managing the prefetched resources. While PALOMA's app instrumentation is fully
automated and does not require the app's source code, PALOMA also supports app developers
who have knowledge of, and access to, the app's source code in further improving runtime
latency reduction via simple prefetching hints. We describe the two instrumentation aspects next.
¹ In this work, we focus on URLConnection, a built-in Java standard library widely used by Android developers.
If the developer is using a different library and/or knows which method(s) to optimize, then PALOMA's profiling
step may not be needed.
5.3.3.1 Automated Instrumentation
PALOMA performs three types of instrumentation automatically. Each type introduces a new
API that we implement in an instrumentation library. Listing 5.1 shows an instrumented version
of the app from Listing 2.3, with the instrumentation code bolded. We will use this example to
explain the three instrumentation tasks.
1. Update URL Map – This instrumentation task updates the URL Map as new values
of dynamic URLs are discovered. Recall that the values of static URLs are fully determined
and inserted into the URL Map offline. This instrumentation is achieved through a new API,
sendDefinition(var, url, id), which indicates that var contains the value of the id-th substring
in the URL named url. The resulting annotation is inserted right after each Definition
Spot. For instance, at Line 8 of Listing 5.1, PALOMA will update the third substring in url2
with the runtime value of cityName. This ensures that the URL Map will maintain a fresh copy
of each URL's value and will be updated as soon as new values are discovered.
2. Trigger Prefetching – This instrumentation task triggers prefetching requests at each
Trigger Point. A Trigger Point in PALOMA is at the end of a Trigger Callback. We made this
choice for two reasons: on one hand, it makes no discernible difference in terms of performance
where we prefetch within the same callback; on the other hand, placing the Trigger Point at the
end is more likely to yield known URLs (e.g., when the Definition Spot is also within the Trigger
Callback). PALOMA provides this instrumentation via the triggerPrefetch(url1, ...) API.
The URLs that are to be prefetched are obtained from the Trigger Map constructed in the callback
analysis (recall Section 5.3.2). For instance, PALOMA triggers the proxy to prefetch url1, url2,
and url3 at the end of onItemSelected() (Line 9) and onCreate() (Line 26) of Listing 5.1,
which is consistent with the Trigger Map built in Section 5.3.2.
3. Redirect Requests – This instrumentation task redirects all on-demand HTTP requests
to PALOMA's proxy instead of the origin server. This allows on-demand requests to be served
from the proxy's cache, without latency-inducing network operations. The request redirection
is achieved through the fetchFromProxy(conn) API, where conn indicates the original URL
connection, which is passed in case the proxy still needs to make the on-demand request to the
origin server. This instrumentation replaces the original methods at each Fetch Spot: calls to the
getInputStream() method at Lines 16, 18, and 20 of Listing 2.3 are replaced with calls to the
fetchFromProxy(conn) method at Lines 19, 21, and 23 in Listing 5.1.
1  class MainActivity {
2    String favCityId, cityName, cityId;
3    protected void onCreate() {
4      favCityId = readFromSetting("favCityId"); // static
5      cityNameSpinner.setOnItemSelectedListener(new OnItemSelectedListener() {
6        public void onItemSelected() {
7          cityName = cityNameSpinner.getSelectedItem().toString(); // dynamic
8          sendDefinition(cityName, url2, 3);
9          triggerPrefetch(url1, url2, url3);
10       }});
11     submitBtn.setOnClickListener(new OnClickListener() {
12       public void onClick() {
13         cityId = cityIdInput.getText().toString(); // dynamic
14         sendDefinition(cityId, url3, 3);
15         URL url1 = new URL(getString("domain") + "weather?&cityId=" + favCityId);
16         URL url2 = new URL(getString("domain") + "weather?cityName=" + cityName);
17         URL url3 = new URL(getString("domain") + "weather?cityId=" + cityId);
18         URLConnection conn1 = url1.openConnection();
19         Parse(fetchFromProxy(conn1));
20         URLConnection conn2 = url2.openConnection();
21         Parse(fetchFromProxy(conn2));
22         URLConnection conn3 = url3.openConnection();
23         Parse(fetchFromProxy(conn3));
24         startActivity(DisplayActivity.class);
25       }});
26     triggerPrefetch(url1, url2, url3);
27   }
28 }
Listing 5.1: Example code of the optimized app
5.3.3.2 Developer Hints
Although PALOMA can automatically instrument mobile apps without developer involvement,
it also provides opportunities for developers to add hints in order to better guide the prefetching.
In particular, PALOMA enables two ways for developers to provide hints: by using its
instrumentation APIs and by directly modifying its artifacts. These two approaches are described
below.
API support – PALOMA's three API functions defined in the instrumentation library
(sendDefinition(), triggerPrefetch(), and fetchFromProxy()) can be invoked by developers
explicitly in the app code. For instance, if a developer knows where the true Definition
Spots are, she can invoke sendDefinition() at those locations. Developers can also invoke
triggerPrefetch() at any program point. For example, prefetching can happen farther ahead
than is done automatically by PALOMA if a developer knows that the responses to a prefetching
request and its corresponding on-demand request will be identical.
Artifact modification – Using PALOMA's instrumentation APIs in the manner described
above requires modifications to the app source code. An alternative is to directly modify the
artifacts generated by PALOMA's static analyses (Trigger Map, Fetch Spot Signature, and
Definition Spot; recall Figure 5.4) without altering the code. For example, a developer can add an
entry in the Trigger Map; as a result, PALOMA's instrumenter will automatically insert a call to
triggerPrefetch() at the end of the Trigger Callback specified by the developer.
We now introduce two frequently occurring instances where developers are well positioned to
provide prefetching hints with very little manual effort. These hints can be provided using either
of the above two approaches.
Prefetching at app launch – Launching an app may take several seconds or more because
many apps request remote resources, typically toward the end of the launch process. The URLs
of the launch-time requests are usually statically known, but the ways in which the URL values
can be obtained are highly app-dependent. For instance, apps may retrieve the request URLs
from a configuration file or a local database. Supporting those cases in PALOMA's string analysis
would mean that PALOMA must understand the semantics of each individual app, which is not a
reasonable requirement. However, a practical alternative is for developers to provide prefetching
hints, because they understand their own apps' behavior. One way developers could implement
this is to insert additional static URLs into the URL Map and then call triggerPrefetch() at
the beginning of onCreate(), which for PALOMA's purposes can be treated as the app entry
point in most Android applications.
Prefetching for ListView – The ListView class [82] is commonly used in Android apps to
display the information of a list of items. The app "jumps" to another page to display further
information based on the item a user selects in the list. The URL fetched for the page to which
the app "jumps" is typically only known after the user selects the item in the list. Ordinarily, this
would prevent prefetching. However, Android apps tend to exhibit two helpful trends. First, the
list view usually displays similar types of information. Second, the further information obtained
by selecting an item is related to the information displayed in the list itself. Based on these
observations, we identified, and are exploiting in PALOMA, similar patterns in the URLs for
the list and the subsequent page. Consider a wallpaper app for illustration: the URL that is
fetched to render an item in the list view may be "image1Url_small.jpg", while the URL that is
fetched after the user selects image1 may be "image1Url_large.jpg". Based on this pattern, we
have explored manually adding Definition Spots for the URLs that are fetched in the list view
and sending modified values to the proxy, such as replacing "small" with "large" in the wallpaper
example.
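For the wallpaper scenario, a developer-provided hint could be as simple as rewriting the list-item URL into the detail-page URL before reporting it to the proxy. The following sketch is illustrative only: the URL pattern, URL name, and surrounding calls are assumptions based on the example above, not code taken from an actual app.

// Illustrative developer hint for the wallpaper example: derive the URL of
// the detail page from the URL shown in the list view by swapping the size
// suffix, so the detail page can be prefetched while the list is displayed.
class ListViewHint {
    // e.g., "http://example.com/image1Url_small.jpg" -> ".../image1Url_large.jpg"
    static String detailUrlFor(String listItemUrl) {
        return listItemUrl.replace("_small.jpg", "_large.jpg");
    }

    public static void main(String[] args) {
        String listUrl = "http://example.com/image1Url_small.jpg";
        String detailUrl = detailUrlFor(listUrl);
        System.out.println(detailUrl);
        // In an instrumented app, the developer might then call, e.g.,
        // sendDefinition(detailUrl, "detailImageUrl", 1) followed by
        // triggerPrefetch("detailImageUrl") inside the list's callback
        // (hypothetical URL name, shown only to illustrate the hint APIs).
    }
}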
5.3.4 Runtime Prefetching
PALOMA's first three phases are performed offline. By contrast, this phase captures the interplay
between the optimized apps and PALOMA's proxy to prefetch the HTTP requests at runtime. The
instrumented methods in an optimized app trigger the proxy to perform the corresponding functions.
We now use the example from Listing 5.1 to show how the three instrumented functions from
Section 5.3.3.1 interact with the proxy.
1. Update URL Map – When the concrete value of a dynamic URL is obtained at
runtime, the inserted instrumentation method sendDefinition(var, url, id) is executed and
the concrete runtime value is sent to the proxy. In response, the proxy updates the corresponding
URL value in the URL Map. For instance, in Listing 5.1, when a user selects a city name from the
cityNameSpinner (Line 7), the concrete value of cityName will be known, e.g., "Gothenburg".
Then cityName is sent to the proxy (Line 8) and the URL Map entry for url2 will be updated
to { url2: ["http://weatherapi/", "weather?&cityName=", "Gothenburg"] }.
2. Trigger Prefetching – When the inserted instrumentation method triggerPrefetch
(url1, ...) is executed, it triggers the proxy to perform TriggerPrefetch, as shown in Algorithm 2.
For each request that is sent to the proxy by triggerPrefetch(url1, ...), the proxy
checks whether the whole URL of the request is known but the response to the request has not yet been
cached (Line 2). If both conditions are met, a "wait" flag is set in the cache for that request
(Line 3). This ensures that duplicated requests will not be issued in the case when the on-demand
request is made by the user before the response to the prefetching request has been returned from
the origin server. In the example of Listing 5.1, when the app reaches the end of onCreate (Line
26), it triggers the proxy to perform triggerPrefetch(url1, url2, url3). Only url1 meets both
conditions at Line 2 of Algorithm 2: the URL value is concrete (it is, in fact, a static value) and the
response is not in the cache. The proxy thus sets the "wait" flag for url1 in the cache, prefetches
url1 from the origin server, stores the response in the cache, and finally sends an "unwait" signal
to the on-demand request that is waiting for the prefetched request (Lines 3-6). Thereafter, when
Algorithm 2: TriggerPrefetch
Input: Requests
1 foreach req ∈ Requests do
2   if IsKnown(req.url) and ¬IsCached(req) then
3     SetWaitFlag(req)
4     response ← req.FetchRemoteResponse()
5     cache.Put(req, response)
6     UnWait(req)
the user selects a city name from the dropdown box, onItemSelected (Line 6 of Listing 5.1) will
be triggered. At the end of onItemSelected (Line 9), TriggerPrefetch(url1,url2,url3) is
invoked again and url2 will be prefetched because its URL is known (its dynamic value obtained
at Line 8) and has not been previously prefetched. In contrast, the value of url1 is known at this
point but url1 was already prefetched at Line 26, so the proxy will not prefetch url1.
3. Redirect Requests – When the on-demand request is sent at the Fetch Spot, the replaced
function fetchFromProxy(conn) will be executed, and it will in turn trigger the proxy to perform
ReplacedFetch, as shown in Algorithm 3. If the request has a corresponding response in the
cache, the proxy will first check the "wait" flag for the request. If the flag is set, the proxy
will wait for the signal of the prefetching request (Line 3) and will return the response of the
prefetching request when it is back from the origin server (Line 4). If the "wait" flag has not been
set, the response is already in the cache and the proxy returns the response immediately, with no
network operations involved (Line 4). Otherwise, if the cache does not contain the response to the
request, the proxy issues an on-demand request using the original URL connection conn to fetch
the response from the origin server, stores the response in the cache, and returns the response to
the app (Lines 6-8). For instance, in Listing 5.1, if a user clicks submitBtn, fetchFromProxy(conn)
will be executed to send on-demand requests for url1, url2, and url3 to the proxy (Lines 19, 21, and
23 of Listing 5.1). The proxy in turn returns the responses to url1 and url2 from the local cache
immediately, because url1 and url2 are prefetched at Lines 26 and 9, respectively, as discussed
above. url3 is not known at any of the Trigger Points, so the response to url3 will be fetched
from the origin server on demand, as in the original app. Note that if a user did not select a city
name from the dropdown box before clicking submitBtn, onItemSelected will not be triggered,
meaning that Lines 8 and 9 of Listing 5.1 will not be executed. In this case, only the response for
url1 will be returned from the cache (prefetched at Line 26), while the on-demand requests for
url2 and url3 will be routed to the origin server.
Algorithm 3: ReplacedFetch
Input: req ∈ Requests
Output: response ∈ Responses
1 if IsCached(req) then
2   if GetWaitFlag(req) is TRUE then
3     Wait(req)
4   return cache.GetResponse(req)
5 else
6   response ← req.FetchRemoteResponse()
7   cache.Put(req, response)
8   return response
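To illustrate how the "wait"/"unwait" protocol of Algorithms 2 and 3 could be realized, the following is a minimal Java sketch of a proxy-side cache. It is our simplified illustration rather than PALOMA's actual proxy (which is built on the Xposed framework, as described in Section 5.4), and it replaces the network call with a placeholder.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Simplified proxy cache illustrating the wait/unwait protocol of
// Algorithms 2 and 3. Responses are plain strings; fetchRemote() stands in
// for the real HTTP call to the origin server.
class PrefetchCache {
    private final Map<String, String> responses = new ConcurrentHashMap<>();
    private final Map<String, CountDownLatch> inFlight = new ConcurrentHashMap<>();

    // Algorithm 2: prefetch a fully-known URL unless it is already cached or in flight.
    void triggerPrefetch(String url) {
        if (responses.containsKey(url)) return;                 // response already cached
        CountDownLatch latch = new CountDownLatch(1);
        if (inFlight.putIfAbsent(url, latch) != null) return;   // "wait" flag already set
        new Thread(() -> {
            responses.put(url, fetchRemote(url));               // cache the prefetched response
            inFlight.remove(url);
            latch.countDown();                                  // "unwait" any waiting on-demand request
        }).start();
    }

    // Algorithm 3: serve an on-demand request, from the cache if possible.
    String fetchFromProxy(String url) throws InterruptedException {
        CountDownLatch latch = inFlight.get(url);
        if (latch != null) latch.await();                       // wait for the in-flight prefetch
        String cached = responses.get(url);
        if (cached != null) return cached;                      // cache hit
        String response = fetchRemote(url);                     // fall back to an on-demand fetch
        responses.put(url, response);
        return response;
    }

    private String fetchRemote(String url) {
        return "response for " + url;                           // placeholder for the real network operation
    }
}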
5.4 Implementation
PALOMA has been implemented by reusing and extending several off-the-shelf tools, and integrating
them with newly implemented functionality. PALOMA's string analysis extends the string
analysis framework Violist [109]. The callback analysis is implemented on top of the program
analysis toolkit GATOR [148], and by extending GATOR's CCFG analysis [183]. PALOMA's
instrumentation component is a stand-alone Java program that uses Soot [26] to instrument an
app. The proxy is built on top of the Xposed framework [32] that provides mechanisms to "hook"
method calls. The proxy intercepts the methods that are defined in PALOMA's instrumentation
library and replaces their bodies with corresponding methods implemented in the proxy. The
total amount of newly added code to extend existing tools, implement the new functionality, and
integrate them together in PALOMA is 3,000 Java SLOC.
5.5 Microbenchmark Evaluation
In this section, we describe the design of a microbenchmark (MBM) containing a set of test cases,
which we used to evaluate PALOMA's accuracy and eectiveness.
MBM thoroughly covers the space of prefetching options, wherein each test case contains a
single HTTP request and diers in whether and how that request is prefetched. The MBM is
restricted to individual HTTP requests because the requests are issued and processed indepen-
dently of one another. This means that PALOMA will process multiple HTTP requests simply
as a sequence of individual requests; any concurrency in their processing that may be imposed by
the network library and/or the OS is outside PALOMA's purview. In practice, the look-up time
for multiple requests varies slightly from one execution of a given app to the next. However, as
55
shown below in Table 5.1, the look-up time required by PALOMA would not be noticeable to a
user even with a large number of requests. As we will show in Section 5.6, the number of HTTP
requests in real apps is typically bounded. Moreover, PALOMA only maintains a small cache
that is emptied every time a user quits the app.
In the rest of this section, we will rst lay out the goals underlying the MBM's design (Mi-
crobenchmark Design Goals), and then present the MBM (Microbenchmark Design). Our evalua-
tion results show that PALOMA achieves perfect accuracy when applied on the MBM, and leads
to signicant latency reduction with negligible runtime overhead (Microbenchmark Results).
5.5.1 Microbenchmark Design Goals
The MBM is designed to evaluate two fundamental aspects of PALOMA: accuracy and effectiveness.
PALOMA's accuracy pertains to the relationship between prefetchable and actually prefetched
requests. Prefetchable requests are requests whose URL values are known before the Trigger Point
and thus can be prefetched. Quantitatively, we capture accuracy via the dual measures of precision
and recall. Precision indicates how many of the requests that PALOMA tries to prefetch at a
given Trigger Point were actually prefetchable. On the other hand, recall indicates how many
requests are actually prefetched by PALOMA out of all the prefetchable requests at a given
Trigger Point.
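Stated in the usual terms (our formalization of the two measures just described), for the set of requests that PALOMA attempts to prefetch at a given Trigger Point:

\[
\text{precision} = \frac{|\mathit{attempted} \cap \mathit{prefetchable}|}{|\mathit{attempted}|},
\qquad
\text{recall} = \frac{|\mathit{attempted} \cap \mathit{prefetchable}|}{|\mathit{prefetchable}|}
\]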
PALOMA's effectiveness is also captured by two measures: the runtime overhead introduced
by PALOMA and the latency reduction achieved by it. Our objective is to minimize the runtime
overhead while maximizing the reduction in user-perceived latency.
5.5.2 Microbenchmark Design
The MBM is built around a key concept, prefetchable: a request whose whole URL is known
before a given Trigger Point. We refer to the case where the request is prefetchable and the
response is used by the app as a hit. Alternatively, a request may be prefetchable but the response
is not used because the definition of the URL is changed after the Trigger Point. We call this a
non-hit. The MBM aims to cover all possible cases of prefetchable and non-prefetchable requests,
including hit and non-hit.
There are three factors that affect whether an HTTP request is prefetchable: (1) the number
of dynamic values in a URL; (2) the number of Definition Spots for each dynamic value; and (3)
the location of each Definition Spot relative to the Trigger Point. We now formally define the
properties of prefetchable and hit considering these three factors. The formal definitions will let us
succinctly describe the test cases later.
Formal Definition. Let M be the set of Definition Spots before the Trigger Point and N
the set of Definition Spots after the Trigger Point but within the Target Callback (recall
Section 5.2). Let us assume that a URL has k ≥ 1 dynamic values. (The case where k = 0, i.e.,
the whole URL is static, is considered separately.) Furthermore, let us assume that the dynamic
values are the first k values in the URL.² The i-th dynamic value (1 ≤ i ≤ k) has d_i ≥ 1 Definition
Spots in the whole program. A request is
• prefetchable iff ∀i ∃(j ∈ [1..d_i]) | DefSpot_{i,j} ∈ M
(every dynamic value has a DefSpot before the Trigger Point)
• hit iff prefetchable ∧ ∀(j ∈ [2..d_i]) | DefSpot_{i,j} ∈ M
(all dynamic-value DefSpots are before the Trigger Point)
• non-hit iff prefetchable ∧ ∃(j ∈ [2..d_i]) | DefSpot_{i,j} ∈ N
(some dynamic-value DefSpots are after the Trigger Point)
• non-prefetchable iff ∃i ∀(j ∈ [1..d_i]) | DefSpot_{i,j} ∈ N
(all DefSpots for some dynamic value are after the Trigger Point)
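As a sanity check on these definitions, the following minimal Java sketch (our illustration, not part of PALOMA) classifies a request given, for each dynamic value, the positions of its Definition Spots relative to the Trigger Point, encoded as booleans (true for "in M", false for "in N").

import java.util.List;

// Classifies a request per the formal definition above. Each dynamic value i
// is given as a list of booleans, one per Definition Spot DefSpot_{i,j}:
// true if the spot is before the Trigger Point (in M), false if after (in N).
class PrefetchabilityCheck {
    static boolean prefetchable(List<List<Boolean>> defSpots) {
        // every dynamic value has at least one Definition Spot before the Trigger Point
        return defSpots.stream().allMatch(spots -> spots.contains(true));
    }

    static boolean hit(List<List<Boolean>> defSpots) {
        // prefetchable, and no dynamic value is (re)defined after the Trigger Point
        return prefetchable(defSpots)
                && defSpots.stream().allMatch(spots -> !spots.contains(false));
    }

    public static void main(String[] args) {
        // One dynamic value with a single Definition Spot before the Trigger Point: a hit.
        System.out.println(hit(List.of(List.of(true))));            // true
        // One dynamic value redefined after the Trigger Point: prefetchable but a non-hit.
        List<List<Boolean>> redefined = List.of(List.of(true, false));
        System.out.println(prefetchable(redefined) && !hit(redefined)); // true
    }
}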
Without loss of generality, the MBM covers all cases where k ≤ 2 and d_i ≤ 2. We do not consider
cases where k > 2 or d_i > 2 because we only need two dynamic values to cover the non-prefetchable
case (where some dynamic values are unknown at the Trigger Point) and two Definition Spots
to cover the non-hit case (where some dynamic values are redefined after the Trigger Point).
There are a total of 25 possible cases involving configurations with k ≤ 2 and d_i ≤ 2. The
simplest case is when the entire URL is known statically; we refer to it as case 0. The remaining
24 cases are diagrammatically encoded in Figure 5.5: the two dynamic URL values are depicted
with circles and delimited with the vertical line; the location of the Trigger Point is denoted with
the horizontal line; and the placement of the circles marks the locations of the dynamic values'
Definition Spots ("DS_{i,j}" in the figure) with respect to the Trigger Point.
² This assumption is used only to simplify our formalization. The order of the values in a URL has no impact
on whether the URL is prefetchable and can thus be arbitrary.
Figure 5.5: The 24 test cases covering all configurations involving dynamic values. The horizontal
divider denotes the Trigger Point, while the vertical divider delimits the two dynamic values. The
circles labeled with "DS_{i,j}" are the locations of the Definition Spots with respect to the Trigger
Point. "H" denotes a hit, "NH" denotes a non-hit, and "NP" denotes a non-prefetchable request.
These 24 cases can be grouped as follows:
• single dynamic value – cases 1-5;
• two dynamic values, one Definition Spot each – cases 6-9;
• two dynamic values, one with a single Definition Spot, the other with two – cases 10-15; and
• two dynamic values, two Definition Spots each – cases 16-24.
Each case is labeled with its prefetchable/hit property ("H" for hit, "NH" for non-hit, and
"NP" for non-prefetchable). Of particular interest are the six cases (0, 1, 3, 6, 10, and 16) that
represent the hits, which should allow PALOMA to prefetch the corresponding requests and hence
significantly reduce the user-perceived latency.
5.5.3 Microbenchmark Results
We implemented the MBM as a set of Android apps, along with the remote server, to test each of
the 25 cases. The server is built with Node.js and deployed on the Heroku cloud platform [15].
The apps interact with the server to request information from a dataset in MongoDB [20]. The
evaluation was performed over a 4G network. The testing device was a Google Nexus 5X running
Android 6.0. Overall, our evaluation showed that PALOMA achieves 100% precision and recall
without exception, introduces negligible overhead, and can reduce latency to nearly zero under
appropriate conditions (the hit cases discussed above).
Table 5.1 shows the results for each test case corresponding to Figure 5.5, as well as for case
0 in which the entire URL value is known statically. Each execution value is the average of
multiple executions of the corresponding MBM apps. The highlighted test cases are the hit
cases that should lead to a significant latency reduction. The columns "SD", "TP", and "FFP"
show the average times spent in the corresponding PALOMA instrumentation methods in the
optimized apps: sendDefinition(), triggerPrefetch(), and fetchFromProxy(), respectively
(recall Section 5.3.3). The "Orig" column shows the execution time of the method invoked at the
Fetch Spot in the original app, such as getInputStream().
The final column in the table, labeled "Red/OH", shows the percentage reduction in execution
time when PALOMA is applied on each MBM app. The reduction is massive in each of the six
hit cases (approximately 99%). It was interesting to observe that applying PALOMA also resulted in reduced
average execution times in 11 of the 19 non-hit and non-prefetchable cases. The more expected
scenario occurred in the remaining eight non-hit and non-prefetchable cases: applying
PALOMA introduced an execution overhead (shown as negative values in the table). The largest
runtime overhead introduced by PALOMA was 149ms in case 11, where the original response
time was 2,668ms. This value was due to a couple of outliers in computing the average execution
time, and it may be attributable to factors in our evaluation environment other than PALOMA,
such as network speed; the remaining measurements were significantly lower. However, even this
value is not prohibitively expensive: recall that PALOMA is intended to be applied in
Table 5.1: Results of PALOMA's evaluation using MBM apps covering the 25 cases discussed in
Section 5.5.2. "SD", "TP", and "FFP" denote the runtimes of the three PALOMA instrumentation
methods. "Orig" is the time required to run the original app. "Red/OH" represents the
reduction/overhead in execution time when applying PALOMA.
Case SD (ms) TP (ms) FFP (ms) Orig (ms) Red/OH
0 N/A 2 1 1318 99.78%
1 0 5 0 15495 99.97%
2 0 1 2212 2659 16.81%
3 1 4 1 781 99.24%
4 2 5 611 562 -9.96%
5 0 2 2588 2697 3.97%
6 1 4 2 661 98.95%
7 1 4 2237 2399 6.54%
8 1 9 585 568 4.75%
9 2 2 611 584 -5.31%
10 1 5 0 592 98.99%
11 2 2 2813 2668 -5.58%
12 2 6 546 610 8.16%
13 2 3 2478 2753 10.87%
14 3 3 549 698 20.49%
15 5 1 631 570 -11.75%
16 1 11 0 8989 99.87%
17 0 3 418 555 31.83%
18 2 6 617 596 -4.87%
19 4 6 657 603 -10.61%
20 1 3 620 731 17.15%
21 2 10 611 585 -6.50%
22 2 7 737 967 29.62%
23 2 9 608 607 -1.98%
24 1 10 611 715 14.95%
cases in which a user already typically spends multiple seconds deciding what event to trigger
next [127].
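Although the computation of the final column is not spelled out in the table, the reported values are consistent with interpreting it as the relative difference between the original fetch time and the combined time spent in PALOMA's three instrumentation methods; this is our reading rather than an explicit statement. For case 4, for instance:

\[
\text{Red/OH} \approx \frac{\text{Orig} - (\text{SD} + \text{TP} + \text{FFP})}{\text{Orig}}
             = \frac{562 - (2 + 5 + 611)}{562} \approx -9.96\%
\]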
5.6 Third-Party App Evaluation
We also evaluated PALOMA on third-party Android apps to observe its behavior in a real-world
setting. We used the same execution setup as in the case of the MBM. We selected 32
apps from the Google Play store [13]. We made sure that the selected apps span a range of
application categories (Beauty, Books & Reference, Education, Entertainment, Finance, Food &
Drink, House & Home, Maps & Navigation, Tools, Weather, News & Magazines, and Lifestyle)
and vary in size (between 312KB and 17.8MB). The only other constraints in selecting the apps
were that they were executable, relied on the Internet, and could be processed by Soot.³
We asked two Android users to actively use the 32 subject apps for two minutes each, and
recorded the resulting usage traces.⁴ We then re-ran the same traces on the apps multiple times,
to account for variations caused by the runtime environment. Then we instrumented the apps
using PALOMA and repeated the same steps the same number of times. Each session started
with app (re)installation and exposed all app options to users. As in the case of the MBM, we
measured and compared the response times of the methods at the Fetch Spots between the original
and optimized apps.
Unlike in the case of the MBM, we do not have the ground-truth data for the third-party apps.
Specifically, the knowable URLs at the Trigger Points would have to be determined manually,
which is prohibitively time-consuming and error-prone. In fact, this would boil down to manually
performing inter-callback data-flow analysis (recall Section 5.3.1). For this reason, we measured
only two aspects of applying PALOMA on the third-party apps: the hit rate (i.e., the percentage
of requests that have been hit out of all triggered requests) and the resulting latency reduction.
Table 5.2 depicts the averages, outliers (min and max values), as well as the standard deviations
obtained across all of the runs of the 32 apps.
Table 5.2: Results of PALOMA's evaluation across the 32 third-party apps.
Min. Max. Avg. Std. Dev.
Runtime Requests 1 64 13.28 14.41
Hit Rate 7.7% 100% 47.76% 28.81%
Latency Reduction 87.41% 99.97% 98.82% 2.3%
Overall, the results show that PALOMA achieves a significant latency reduction with a reasonable
hit rate. There are several interesting outlier cases. The minimum hit rate is only 7.7%.
The reason is that the app in question fetches a large number of ads at runtime whose URLs
are non-deterministic, and only a single static URL is prefetched outside those. There are four
additional apps whose hit rates are below 20% because those apps are list-view apps, such as a
wallpaper app (recall Section 5.3.3), and they fetch large numbers of requests at the same time.
³ Soot is occasionally unable to process an Android app for reasons that we were unable to determine. This
issue was also noted by others previously.
⁴ While the average app session length varies by user and app type (e.g., [5]), two minutes was sufficiently long
to observe representative behavior and, if necessary, to extrapolate our data to longer sessions.
In PALOMA, we set the threshold for the maximum number of requests to prefetch at once to
be 5. This parameter can be increased, but that may impact device energy consumption, cellular
data usage, etc. This is a trade-off that will require further study.
Similarly to the MBM evaluation, PALOMA achieves a reduction in latency of nearly 99% on
average for the "hit" cases. Given that the average execution time for processing a single request across the
32 unoptimized apps was slightly over 800ms, prefetching the average of 13.28 requests at runtime
would reduce the total app execution time by nearly 11s, or 9% of a two-minute session. Note that
the lowest latency reduction was 87.41%. This was caused by on-demand requests that arrive
before the prefetching request has returned. In those cases, the response time depends on the
remaining wait time for the prefetching request's return. However, there were only 5 such "wait"
requests among the 425 total requests in the 32 apps. This strongly suggests that PALOMA's choice
for the placement of Trigger Points is effective in practice.
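As a back-of-the-envelope check of this estimate, using the reported averages:

\[
13.28 \times 800\,\text{ms} \approx 10.6\,\text{s},
\qquad
\frac{10.6\,\text{s}}{120\,\text{s}} \approx 9\%
\]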
5.7 Summary
We have presented PALOMA, the first content-based prefetching technique via program analysis
that reduces user-perceived latency in mobile apps by prefetching certain HTTP requests.
While PALOMA cannot be applied to all HTTP requests an app makes at runtime, it provides
significant performance savings in practice. Several of PALOMA's current facets make it well
suited for future work in this area, both by us and by others. For instance, PALOMA formally
defines the conditions under which requests are prefetchable. This can lead to guidelines that
developers could apply to make their apps more amenable to prefetching, and lay the foundations
for further program analysis-based prefetching techniques in the mobile-app domain. We have also
identified several shortcomings of PALOMA whose remedy must include improvements to string
analysis and callback analysis techniques. Another interesting direction is to improve the precision
and reduce the waste associated with prefetching by incorporating certain dynamic information
(e.g., user behavior patterns, runtime QoS conditions). Finally, PALOMA's microbenchmark
(MBM) forms a foundation for standardized empirical evaluation and comparison of future efforts
in this area.
Chapter 6
History-based Approach HiPHarness
History-based prefetching is a well-studied solution from the traditional browser domain that can
reduce network latency by predicting users' future actions based on their past behaviors. As
discussed earlier in Section 3, unlike content-based approaches, history-based prefetching can in
principle be applied to mobile apps directly. However, such techniques are largely unexplored
on mobile platforms. One key challenge is that today's privacy regulations make it infeasible to
explore prefetching with the usual strategy of amassing large amounts of data over long periods
and constructing conventional, "large" prediction models. Our work is based on the observation
that this may not be necessary: given previously reported mobile-device usage trends (e.g.,
repetitive behaviors in brief bursts), we hypothesized that prefetching should work effectively
with "small" models trained on mobile-user requests collected during much shorter time periods.
To test this hypothesis, we constructed HiPHarness, a framework for automatically assessing
prediction models, and used it to conduct an extensive empirical study based on over 15 million
HTTP requests collected from nearly 11,500 mobile users during a 24-hour period, resulting in over
7 million models [195]. This chapter presents the HiPHarness approach in detail and reports the
results obtained with it, demonstrating the feasibility of prefetching with small models
on mobile platforms and directly opening up a new research arena.
6.1 Motivation
Prefetching network requests is a well-established area aimed at reducing user-perceived
latency [69, 70, 67, 85, 151, 127, 169, 61, 192, 172, 136, 134, 93, 139, 46, 182]. Recent research has
highlighted the opportunity to apply prefetching on mobile platforms [194] and started exploring
content-based strategies, such as PALOMA [192] presented in Chapter 5, and its follow-up work
APPx [61] and NAPPA [122]. History-based approaches, on the other hand, remain largely unexplored
on mobile platforms, despite the large body of work targeting history-based prefetching in
the traditional browser domain [69, 70, 67, 85, 136, 134, 93, 139, 46, 182].
To understand the reasons behind this gap, we studied the literature and reached out to
the authors of the most recent prefetching techniques targeting mobile platforms [192, 61, 122].
This uncovered a key challenge in applying history-based prefetching on mobile platforms today:
user data is hard to obtain due to the profusion of data-privacy regulations introduced and
regularly tightened around the world [31, 106, 94, 140, 154]. For example, a study published
in 2011 reported on data collected from 25 iPhone users over the course of an entire year [155]
and inspired a number of follow-up studies [172, 179, 129, 119, 37, 111, 143]. A decade later,
such protracted data collection would be difficult to imagine, both because it would likely fall
afoul of legal regulations introduced in the meantime and because today's end-users are more
keenly aware and protective of their data.
Given these constraints, an obvious strategy would be to limit the amount of data on which the
prefetching relies. However, there is no evidence that it is feasible to predict future requests based
on small amounts of data. This appears to have been an important factor that has discouraged the
exploration of history-based prefetching in recent years. We believe that this has been a missed
opportunity. Namely, the previously reported mobile-device usage patterns (repetitive activities
in brief bursts [84, 156, 65, 152, 55]) lead us to hypothesize that history-based prefetching may
work effectively with small prediction models trained on mobile-user requests collected during
short time periods. If borne out in practice, this would open new research avenues.
To evaluate our hypothesis, and to facilitate future explorations in this area, we first construct
HiPHarness, a tailorable framework that provides several customizable components to automatically
assess prediction models across a range of scenarios. For example, HiPHarness allows
comparing the effectiveness of different prediction algorithms by running them side-by-side on the
same data, measuring the impact of different training-data sizes on accuracy, and so on.
HiPHarness enables us to flexibly assess prediction models built with any algorithm of interest.
In this dissertation, we specifically customize HiPHarness to analyze models built with the
three most widely employed history-based prediction algorithms from the traditional browser domain
[134, 136, 52], as well as a fourth algorithm we introduce to serve as the evaluation baseline.
Our study uses the mobile-network traffic obtained from nearly 11,500 users at a large university
during a 24-hour period. The selection of this time period was guided by previously made
observations of repetitive mobile-user behaviors during a single day [156, 65, 152, 55]. The closest
study to ours [172] only evaluates one algorithm [52] with mobile-browser data collected over a
year from 25 iPhone-using undergraduates at Rice University [155]. By comparison, our study
evaluates four algorithms on both mobile-browser and mobile-app data, relying on roughly 400×
more users from a much more diverse user base, but during a roughly 400× shorter time frame.
Our dataset comprises over 15 million HTTP requests from nearly 31,000 Internet domains,
allowing us to explore orders-of-magnitude more models compared to prior work. We use HiPHarness
to assess over 7 million prediction models tailored to each mobile user based on their past
usage, varying from a single request to over 200,000 requests. While HiPHarness allows the
exploration of a range of research questions, we take an initial step to study (1) the repetitiveness
of user requests during very short time periods, as the key prerequisite for the feasibility of small
prediction models; (2) the effectiveness of existing prediction algorithms on mobile platforms, to
provide insights for future algorithms; and (3) strategies for reducing training-data sizes without
sacrificing model accuracy, as a way of yielding even smaller models.
6.2 Study Focus
We specifically focus on assessing the accuracy of prediction models built in the first two phases
of the prefetching workflow introduced in Section 2.3, i.e., training and predicting. The third,
prefetching phase involves trading off runtime conditions, studied by a complementary body of
research [89, 47]. We exclude such runtime factors since they would taint the results of the models'
accuracy (e.g., failing to prefetch requests due to a third-party server's unavailability). Further-
more, the expiration of the prefetched requests may vary depending on when the experiments
are conducted, which would introduce additional bias into our results. Currently, determining
whether a prefetched request has expired remains an open challenge since the HTTP headers [17]
(e.g., Cache-Control, Expires) are not trustworthy [194]. Thus, to eliminate runtime variations
and fairly assess the models' accuracy, we assume ideal runtime conditions, i.e., each predicted
request will be prefetched and will not expire.
As mentioned earlier, our prediction models are tailored to individual users based on their
past behaviors. This is motivated by today's stringent regulations limiting personal data access.
This has shifted recent research to "on-device" prediction, to avoid exporting sensitive user data
to servers [87, 112, 144]. At the same time, this trend of client-side prediction highlights the need
for small prediction models since mobile devices are resource-constrained. We thus also assess the
resource consumption of different prediction models trained on-device.
6.3 The HiPHarness Framework
This section describes the design of HiPHarness, a tailorable framework for automatically assess-
ing history-based prediction models, followed by the details of its instantiation.
6.3.1 HiPHarness's Design
Figure 6.1 depicts HiPHarness's workflow, comprising six customizable components that can be
reused, extended, or replaced as needed. For instance, one can explore how much training data
to use with a given prediction algorithm by varying Training Selection and reusing the remaining
components.
Figure 6.1: HiPHarness's workflow for assessing history-based prediction models
To enable HiPHarness's automated workflow, Historical Requests need to be provided as input,
and each component (shaded boxes) needs to be reused from existing components, or provided
anew if not available. Historical Requests are past requests used to generate the intermediate
outputs of Training Requests, Trigger Requests, and Test Requests based on the logic defined in
Training Selection, Trigger Selection, and Test Selection, respectively. Training Requests are used
to build the Prediction Model. Trigger Requests are used to trigger the prediction of subsequent
requests, such as the current request. Finally, Test Requests are the future requests used to
evaluate the Prediction Model; they represent the "ground truth".
The three intermediate outputs are used to make the prediction and produce the final Test
Results to evaluate the Prediction Model. The Training Engine implements the algorithm used to
train a specific Prediction Model, such as the dependency graph in DG [134], based on the selected
Training Requests. Optionally, the trained Prediction Model can be updated dynamically (dashed
arrow) while being evaluated: Test Requests become historical requests after being tested, and can
be used to train the Prediction Model. Prediction Engine implements the algorithm that predicts
subsequent requests based on Trigger Requests and the trained Prediction Model. Finally, when
evaluating the Prediction Model, Test Engine iteratively invokes Prediction Engine with a series of
Test Requests. Prediction Engine selects N requests that immediately precede each Test Request
to obtain a set of Predicted Requests. Test Engine then compares the Predicted Requests with the
corresponding Test Requests (i.e., "ground truth") and outputs the Test Results that contain the
information needed to calculate evaluation metrics of interest. The metrics used in our empirical
study are discussed in Section 6.4.
HiPHarness supports "plug and play" by tuning and/or replacing each of the components to
explore different research questions. The Training Selection, Trigger Selection, and Test Selec-
tion components can be customized based on different selection strategies. For example, different
parts of Historical Requests can be chosen based on desired ratios or time periods, to produce the
corresponding Training Requests, Trigger Requests, and Test Requests. Likewise, Training Engine
and Prediction Engine can be tailored with specific prediction algorithms to train the models
based on the selected Training Requests, and to predict future requests based on the selected
Trigger Requests. Finally, Test Engine can be customized to evaluate different prediction models
with various testing strategies. For instance, as discussed earlier, Test Engine may enable up-
dating the prediction model dynamically; by plugging in Test Engines that implement different
dynamic-update strategies, HiPHarness can isolate the impact of those strategies by producing
their side-by-side Test Results.
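To make this plug-and-play workflow concrete, the following minimal Python sketch shows one way the six components could be composed; the function and parameter names are our own illustration and do not reflect HiPHarness's actual API.

    def run_harness(historical_reqs,
                    training_selection, trigger_selection, test_selection,
                    training_engine, prediction_engine, test_engine):
        # Each *_selection component slices Historical Requests according to its own logic,
        # e.g., the first 80% for training and the remaining 20% as the "ground truth".
        training_reqs = training_selection(historical_reqs)
        trigger_reqs = trigger_selection(historical_reqs)
        test_reqs = test_selection(historical_reqs)
        # Training Engine builds the Prediction Model (e.g., DG, PPM, MP, or the baseline).
        model = training_engine(training_reqs)
        # Test Engine iteratively invokes Prediction Engine and compares the Predicted
        # Requests with the Test Requests, producing the Test Results.
        return test_engine(prediction_engine, model, trigger_reqs, test_reqs)

Swapping any argument while keeping the others fixed isolates that component's impact, which is the essence of the side-by-side comparisons described above.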
6.3.2 HiPHarness's Instantiation
We have instantiated HiPHarness by implementing several variations of its six components. In-
stances of Training Selection and Test Selection are implemented to select requests based on
different ratios, such as using the first 80% of the Historical Requests as Training Requests and the
remaining 20% as Test Requests. Trigger Selection is implemented to select one current request
to trigger the prediction of subsequent requests. Four pairs of Training Engine and Prediction
Engine are implemented based on the three algorithms introduced in Section 2.3 (DG [134], PPM
[136], and MP [52]) and a fourth baseline algorithm we will describe in Section 6.5. Finally,
Test Engine's implementation is detailed in Algorithm 4.
Algorithm 4: Test Engine
Input: PredictionEngine PE, TestRequests test_reqs
Output: cache.Size, hit_set, miss_set, #prefetch, #hit, #miss
1  cache = ∅, hit_set = ∅, miss_set = ∅
2  #prefetch = 0, #hit = 0, #miss = 0
3  foreach current_req ∈ test_reqs do
4      predicted_reqs ← PE.Predict(current_req.pre)
5      foreach candidate ∈ predicted_reqs do
6          if ¬IsCached(candidate) then
7              Prefetch(candidate)
8              cache.Put(candidate)
9              #prefetch ← #prefetch + 1
10     if IsCached(current_req) then
11         #hit ← #hit + 1
12         hit_set.Put(current_req)
13     else
14         #miss ← #miss + 1
15         miss_set.Put(current_req)
16     PE.UpdateModel(current_req)
Test Engine's objective is to output information needed for evaluating Prediction Engine's
(PE) results. For each request current_req in Test Requests test_reqs, PE predicts the potential
current requests based on current_req's previous request current_req.pre (Line 4) and prefetches
the predicted requests that are not already in the cache (Lines 5-9).
As discussed in Section 6.2, we assume that all predicted requests should be prefetched and
will not expire. We thus place the predicted requests in the cache without sending them to the
server to avoid tainted results caused by spurious runtime variations. The cache size is unbounded
in our study: since we focus on small amounts of historical data guided by our hypothesis (recall
Section 6.1), the required cache size is negligible compared to the available storage on mobile
devices.
Test Engine then compares current_req with the cached requests to determine whether it can
be reused, and updates the corresponding information (Lines 10-15). Finally, PE dynamically
updates the Prediction Model by adding current_req to train the model (Line 16), since current_req
becomes a past request as Test Engine moves to the next request in test_reqs.
In the end, Test Engine outputs the size of the cache (cache.Size); unique prefetched requests
that were subsequently used (hit_set); unique requests in Test Requests that were not in the cache
when requested (miss_set); the total number of prefetched requests (#prefetch); and the numbers of
times a request was in the cache (#hit) vs. not in the cache (#miss) when requested. We use
this information to define metrics that evaluate the accuracy of prediction models in Section 6.4.
6.4 Empirical Study Overview
This section provides the details of our empirical study enabled by HiPHarness, including the
research questions we focus on, the data collected, and the evaluation metrics used.
6.4.1 Research Questions
An overarching hypothesis frames our study: History-based prefetching may work effectively with
small prediction models trained on mobile-user requests collected during short time periods. To
evaluate it, we focus on three research questions.
• RQ1 – To what extent are mobile users' requests repetitive during short time periods?
• RQ2 – How effective are the existing prediction algorithms when applied on mobile platforms using small prediction models?
• RQ3 – Can the training-data size be reduced without significantly sacrificing the prediction models' accuracy?
6.4.1.1 RQ1 – Repetitiveness of User Requests
Since history-based techniques can only predict requests that have appeared in the past, our
study is centered on repeated requests [119], i.e., identical requests issued at least twice by a user.
Requests are considered identical when they have the same URLs, including GET parameters if
any [77]. Specifically, we explore the extent to which mobile users send repeated requests during
a day, aiming to understand the "ceiling" that can be achieved by any history-based technique
with small models. Note that prior work has only reported the repetitiveness of coarse-grained
behaviors (e.g., phone-calls [152], mobility patterns [160, 90, 180]).
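As an illustration of how repeated requests can be counted, the short Python sketch below computes both measures for a single user's request log. It is our own example and reflects one plausible counting convention; the study's exact bookkeeping may differ slightly.

    from collections import Counter

    def repeated_requests(urls):
        # urls: a user's chronological request URLs, including GET parameters.
        counts = Counter(urls)
        # Count every occurrence of a URL that the user issued at least twice.
        repeated = sum(c for c in counts.values() if c >= 2)
        percentage = repeated / len(urls) if urls else 0.0
        return repeated, percentage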
6.4.1.2 RQ2 – Effectiveness of Prediction Algorithms
In the closest study, Wang et al. [172] found history-based algorithms ineffective. However, their
study was conducted a decade ago, had a limited scope, and relied on the conventional, large
prediction models. We re-assess their conclusions using small models, while extending the study's
scope in three dimensions. First, Wang et al. only analyzed 25 iPhone users who were university
undergraduates; we rely on 11,476 users with different types of mobile devices and diverse occupa-
tions. Second, the previous dataset only included requests collected from the Safari browser; we
rely on both mobile-browser and app data to more comprehensively cover mobile-user behaviors.
Finally, only one history-based algorithm (MP [52]) was previously investigated; our conclusions
are based on the three most widely-employed algorithms, including MP, and an additional baseline
algorithm.
6.4.1.3 RQ3 – Reducing Training-Data Size
We explore whether the amount of training data can be reduced without sacrificing accuracy.
Guided by the Pareto principle [59], we posit that the top 20% of the training data are likely to
account for 80% of the results, and explore three strategies to select the training data: (1) Most
Occurring Requests: Given a request sequence in the training set, we group repeated requests,
count the size of each group, and use the requests in the largest 20% of the groups to train the
models. (2) Most Accessed Domains: We group requests that belong to the same domain (e.g.,
google.com/*), and train the prediction models with the requests in the largest 20% of the groups.
(3) Most Suitable Domains: Domains that tend to contain repeated requests are potentially good
candidates for prefetching. To explore this, we again group the requests in the same domain
together. This time, we rank the groups by the proportion of repeated requests, and use the
top 20% of the groups to train the prediction models.
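The three strategies can be sketched in Python roughly as follows. This is our own paraphrase of the descriptions above; details such as tie-breaking and the exact rounding of the "top 20%" are assumptions.

    from collections import Counter, defaultdict
    from math import ceil
    from urllib.parse import urlparse

    def top_20(groups, key):
        # Keep the requests from the top-20% of groups, ranked by the given key.
        k = max(1, ceil(0.2 * len(groups)))
        best = sorted(groups, key=key, reverse=True)[:k]
        return [req for g in best for req in groups[g]]

    def most_occurring_requests(training_reqs):                    # MOR
        groups = defaultdict(list)
        for req in training_reqs:
            groups[req].append(req)                                # group identical requests
        return top_20(groups, key=lambda g: len(groups[g]))

    def most_accessed_domains(training_reqs):                      # MAD
        groups = defaultdict(list)
        for req in training_reqs:
            groups[urlparse(req).netloc].append(req)               # group by domain
        return top_20(groups, key=lambda g: len(groups[g]))

    def most_suitable_domains(training_reqs):                      # MSD
        groups = defaultdict(list)
        for req in training_reqs:
            groups[urlparse(req).netloc].append(req)
        def repeated_ratio(domain):
            counts = Counter(groups[domain])
            return sum(c for c in counts.values() if c >= 2) / len(groups[domain])
        return top_20(groups, key=repeated_ratio)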
We further investigate whether there is a lower bound on the training-data size that yields the
smallest model without sacrificing the resulting accuracy. To that end, we designed the Sliding-
Window approach (see Section 6.5) to study the impact of different numbers of requests in the
training set on the prediction accuracy, and to identify this lower bound.
6.4.2 Data Collection
We now detail the data collected for our study, including the process we followed, the ethical
considerations made, and the overview of our dataset.
6.4.2.1 Data Collection Process
The network traces were collected at the gateway between a large University's campus network
and the Internet. The measurement servers were provided by the University's network center, and
were placed at the campus gateway router, connected via optical splitters to the Gigabit access
links. This connects the campus network to a commercial network. The HTTP headers were
captured by the servers along with the timestamps. The authentication information (User ID)
identifies the traffic from the same user.
6.4.2.2 Ethical Considerations
Our study was approved by the University, and was guided by the agreement the authors signed
with the University. The study strictly followed the Research Data Management Policy of the
University, including data storage, sharing, and disposal. All raw data collected for the study
was processed by the University's network center. All recorded IP addresses and authentication
information were anonymized. The authors did not have access to any of the raw data at any
time. Unlike prior work that explicitly selected 25 subject users [172, 155], our data includes all
users who accessed the campus network without any selection process.
6.4.2.3 Dataset Overview
Our study aims to assess small models trained on mobile-user requests collected during much
shorter time periods compared to the conventional weekly or monthly models [172, 67, 85, 161,
70, 69]. To that end, we were given access to the network traffic collected by the University,
spanning the 24 hours of May 28, 2018 (a randomly selected date). This included the traffic
from nearly 11,500 accounts, representing all users who accessed the campus network via a mobile
device during that time: students, faculty, staff, contract employees, residents, outside vendors,
and visitors. These users exhibit various behavior patterns and form a diverse group for our
study. (PCs and mobile devices use different clients for authentication; we only collected the
mobile-network traffic based on the authentication information.) We further filtered the mobile-
network traffic to include only the requests involving the HTTP GET method [76]: GET requests
are considered "safe" for prefetching in that they do not have any side-effects on the server [77, 194].
Ultimately, we collected 15,143,757 GET requests from 11,476 users. Each request is identified by
its URL, including GET parameters if any.
6.4.3 Evaluation Metrics
As discussed in Section 6.2, we focus on the accuracy of the prediction models. We thus leverage
two widely adopted accuracy metrics in the prefetching literature, and introduce a new accuracy
metric. The metrics are computed based on the information output by our Test Engine (recall
Algorithm 4).
Static Precision measures the percentage of correctly predicted unique requests, i.e., the
ratio of prefetched requests that are subsequently used to the number of all prefetched requests
[73]. This metric is often referred to as precision or hit ratio in the literature [52, 93, 125, 172,
136, 131, 73].
Static Precision = |hit_set| / #prefetch
Static Recall is a new metric we introduce as the counterpart to Static Precision. Static
Recall measures the ratio of unique requests that were previously prefetched (and have been
cached) to the total number of unique requests made by a user.
Static Recall = |hit_set| / (|hit_set| + |miss_set|)
Dynamic Recall measures the ratio of previously-prefetched requests to all requests a user
issues [73]. This metric is often referred to as recall or usefulness in the prefetching literature [52,
93, 125, 172, 136, 131, 73].
Dynamic Recall = #hit / (#hit + #miss)
We do not use a dynamic counterpart to Static Precision because it would not convey mean-
ingful information. To measure the ratio of correctly predicted requests dynamically, this metric
would need to (1) reward the predicted requests (hits) each time they are accessed and (2) penalize
\useless" requests that were prefetched but never used. In other words:
Dynamic Precision =
#hit
#hit + #useless
72
The reward for hits would be potentially unbounded at runtime, limiting the distribution of values
to a small upper subrange of [0::1]. Moreover, the metric would allow cases in which a model
with a low Static Precision has an articially high Dynamic Precision: repeated hits of the same
request would progressively diminish the penalty for arbitrarily many useless requests.
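Expressed over Test Engine's outputs (Algorithm 4), the three metrics can be computed as in the small Python sketch below, which simply restates the formulas above.

    def static_precision(hit_set, num_prefetch):
        # Unique prefetched requests that were later used, over all prefetched requests.
        return len(hit_set) / num_prefetch if num_prefetch else 0.0

    def static_recall(hit_set, miss_set):
        # Unique requests served from the cache, over all unique requests a user made.
        total = len(hit_set) + len(miss_set)
        return len(hit_set) / total if total else 0.0

    def dynamic_recall(num_hit, num_miss):
        # Cache hits over all requests a user issued, counting repeats.
        total = num_hit + num_miss
        return num_hit / total if total else 0.0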
6.5 Results and Lessons Learned
This section presents the results of our study performed using HiPHarness, its takeaways, and
threats to its validity.
As mentioned earlier, our study is based on over 15 million HTTP GET requests that spanned
30,711 domains, reflecting the mobile traffic collected from 11,476 users at a large university.
Each user in our dataset sent 1,320 requests on average during the single day. This is over 100x
more than what was reported a decade ago, where each user sent 4,665 requests for the entire
year [172, 155], reinforcing the impracticality of conventional large prediction models for mobile
platforms. For instance, the strategy of building monthly models [172] would encompass roughly
40,000 requests if applied on our dataset.
Starting with this initial user set, we used the box-and-whisker plot method [162] to identify
outlier users based on the number of requests they send. The high-end outlier cut-off point was
3,932 requests (equivalent to sending a request roughly every 22 seconds on average), excluding 788
users. Since the low-end outlier cut-off point was a negative number, we decided to remove all
users who made fewer than 10 requests during the entire 24-hour period. Our rationale was that
such infrequent access is also not reflective of common mobile-device usage. Furthermore, these
users would yield training and test data that are too limited to evaluate the models. This process
resulted in the final dataset of 9,900 subject users, as shown in Table 6.1.
Table 6.1: Requests per user
              Initial 11,476 Users        Final 9,900 Users
              Min     Avg     Max         Min    Avg    Max
              1       1,320   235,837     10     981    3,923
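For reference, this outlier removal can be reproduced with a standard box-and-whisker (Tukey) rule, as in the sketch below; the 1.5x-IQR whisker multiplier is the conventional choice and an assumption here, as is the helper's name.

    import numpy as np

    def filter_users(requests_per_user, min_requests=10):
        # requests_per_user: dict mapping an (anonymized) user ID to its request count.
        counts = np.array(list(requests_per_user.values()))
        q1, q3 = np.percentile(counts, [25, 75])
        upper = q3 + 1.5 * (q3 - q1)          # high-end whisker cut-off
        return {user: n for user, n in requests_per_user.items()
                if min_requests <= n <= upper}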
6.5.1 RQ1 – User-Request Repetitiveness
• RQ1 – To what extent are mobile users' requests repetitive during short time periods?
Table 6.2: Repeated requests across users
              Percentage                     Number
              Min    Avg    Max    SD        Min    Avg    Max     SD
              0%     28%    98%    17%       0      293    3,225   353
To answer RQ1, we calculate both the number and percentage of repeated requests (recall
Section 6.4) in each user's data, as they indicate different aspects of predictability. The percentage
provides insight into the potential cost: a low percentage indicates a high proportion of non-
predictable requests, increasing the cost of building the model. The number sets the upper bound
on requests that can be prefetched.
Table 6.2 shows this data across the 9,900 subject users. The minimums indicate that certain
users did not access the same URL more than once. 222 (roughly 2%) of our users fall in this category.
Further investigation uncovered that these users do not access the network frequently enough to
show repetitive behaviors: the average number of network requests sent by these users is 32,
i.e., roughly one request every 80 minutes during the 24 hours.
On average, 28% of the requests sent by our subjects are repeated. An average user sends 293
repeated requests, which can be prefetched and reused from a cache. The maximum percentage
(98%) and number (3,225) show that history-based prefetching can especially benefit certain users.
Finally, we find a large variation across different users (standard deviation of 17% for percentage
and 353 for number of repeated requests). This indicates that individual users exhibit markedly
different behaviors, which reinforces our choice of building personalized prediction models for each
user on the client side (recall Section 6.2).
To provide further insights for future techniques, we study the characteristics of frequently
repeated requests by computing how many times each repeated request occurs per user. Due to
space constraints, we highlight two findings. (1) The average numbers of certain users' repeated
requests are unusually high, with the maximum of 779. Further investigation showed that these
users tended to send large numbers of requests to obtain the same WiFi configuration files from
a specific domain. Since the data available to us was sanitized, we can only hypothesize that this
was due to server-side flaws. (2) The maximum occurrence of a repeated request across all users
was 2,634. This and other high values corresponded to continually obtaining information from
certain servers based on the given users' unchanging location coordinates, as indicated in the GET
requests' parameters. Such domains may also have flaws that require resending the user location
even when it remains the same. Both instances clearly point to opportunities for caching.
6.5.2 RQ2 – Prediction-Algorithm Effectiveness
• RQ2 – How effective are the existing prediction algorithms when applied on mobile platforms using small prediction models?
RQ1's results indicate that our 24-hour dataset is amenable to prefetching, motivating us to
apply the existing prediction algorithms on it. We further develop an additional algorithm, called
Naïve, to serve as the evaluation baseline. Naïve assumes that each request that has appeared in
the past will appear again, and recommends each such request for caching. Naïve thus guarantees
the upper bound on the number of predictable requests in a dataset, and provides a baseline for
measuring any other prediction algorithm's recall.
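In code, the baseline amounts to little more than a growing set of previously seen requests; the class below is a minimal sketch (with hypothetical names) of how Naïve could fill the Prediction Engine role.

    class NaivePredictionEngine:
        def __init__(self):
            self.seen = set()          # every request observed so far

        def predict(self, trigger_req):
            # Recommend every past request for caching, regardless of the trigger.
            return list(self.seen)

        def update_model(self, req):
            self.seen.add(req)         # corresponds to Line 16 of Algorithm 4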
To answer RQ2, we use HiPHarness to obtain the Test Results (recall Figure 6.1) of the
prediction models built for each user. As discussed in Section 6.3.2, we implemented four pairs
of Training Engine and Prediction Engine in HiPHarness, based on the three existing algorithms
discussed in Section 2.3.2 and Naïve. In the end, Test Engine outputs the Test Results needed for
evaluating the models' accuracy (recall Algorithm 4), using the three accuracy metrics discussed
in Section 6.4.3.
For all four algorithms, we follow the common approach of selecting the first 80% of Historical
Requests as Training Requests, and the remaining 20% as Test Requests [45]. The Trigger Requests,
which initiate the prediction, are set to do so a single request ahead. The thresholds of each
prediction algorithm are set to the same values used in the original techniques. Recall that the
goal of our study is to demonstrate the feasibility of small prediction models on mobile platforms.
In turn, this will enable future fine-grained customization of the models to improve accuracy, e.g.,
by tuning their thresholds, pre-processing the training data, and considering additional context
when building the models.
Due to the complexity induced by PPM's context-sensitivity (recall Section 2.3.2), this algo-
rithm was unable to output results when using our entire dataset. To enable a fair comparison
among the algorithms, we thus excluded certain domains that displayed low percentages of re-
peated requests to reduce the dataset. We explored multiple cut-off points as the repeated per-
centage at which a domain is excluded and found that 10% was sufficient to enable PPM. This
eliminated certain users since all of their requests were excluded, resulting in 9,751 users and
39,004 corresponding models built with the four algorithms. Interestingly, this process uncovered
a potential correlation between the cut-off point's size and the increase in the resulting models'
accuracy, regardless of the algorithm used. This suggests that a smaller, better-tailored training
set can yield even more accurate models, directly motivating RQ3.
Figure 6.2: Average values of the three accuracy metrics across the four algorithms
6.5.2.1 Accuracy of Prediction Models
Figure 6.2 shows the average accuracy results of the 39,004 models. Overall, DG outperforms
PPM and MP along all three measures, while PPM and MP trade off precision and recall.
As discussed above, Naïve is our baseline and it achieves 100% recall. On the other hand, its
precision is poor since it aggressively prefetches every request that has appeared in the past. A
trade-off between precision and recall should clearly be considered when deciding on a prefetching
strategy, based on the specific scenario. For instance, if the cache is sufficiently large, Naïve can
be used to maximize recall.
The Static Precision of DG is comparable to the results of conventional large models from the
browser domain, where a model is considered to perform well with a precision between 40% and
50% [67, 52, 125, 131]. This demonstrates DG's potential on mobile platforms since it achieved
comparable precision using much smaller amounts of data. On the other hand, MP achieved
poorer Static Precision than DG and PPM. This was counter-intuitive at first since MP only
uses the most popular requests in its models, which should have yielded smaller model sizes and
smaller denominators in the Static Precision formula (recall Section 6.4.3). However, our results
suggest that the most popular requests stop reappearing at some point, indicating the need to
optimize MP by tuning what training data to use and for how long.
Figure 6.3: Resource consumption of the four algorithms with ten sets of different-sized models
trained on a mobile device
The Static Recall and Dynamic Recall follow similar trends, meaning that the numbers of hit
and miss requests do not significantly affect the usefulness of the prediction models. All three
existing algorithms showed improvements in Dynamic Recall compared to Static Recall, indicating
that the hit requests they predicted are usually accessed multiple times. This directly motivates
future work on designing effective caching strategies.
6.5.2.2 Efficiency of Prediction Models
Due to our study's scale, we needed to train over 7 million models (detailed in Section 6.5.3), which
required the use of a powerful computing environment. Our experiments were run in parallel on
a server with 32 2.60GHz Intel Xeon E5-2650 v2 CPUs and 125GB RAM. The average running
times of DG, PPM, MP, and Naïve were 42ms, 431ms, 4ms, and 9ms per model, respectively.
Note that PPM is at least an order of magnitude slower than the other algorithms. This confirmed
our earlier observation regarding the scalability issues introduced by PPM's context-sensitivity.
To get further insights into the models' efficiency, we trained models of 10 different sizes, from
approximately 1,000 to 10,000 requests, and measured their resource consumption on a mobile device (Honor
Play 3 running Android 9.0). The sizes were selected empirically, as the resource consumption
below 1,000 requests is negligible and training 10,000 requests is already expensive. For each size,
we selected 10 users whose numbers of requests are the closest to the size and built models with
all four algorithms, yielding the results of 400 prediction models.
Figure 6.3 presents the trends (at logarithmic scale) of the models' energy consumptions (in
milliAmpere hour) and runtimes (in seconds) as the number of requests increases. Each data point
represents the average resource consumption when training the model of a specific size 10 times
(the need to train a model multiple times is discussed in Section 6.5.3). Our results show that
DG, MP, and Naïve are practical for use on mobile devices since the largest resource consumption
among them is 36mAh (Naïve) and 81s (DG) when training with roughly 10,000 requests (for
reference, modern mobile devices' battery capacity is roughly 2,000-5,000mAh [126, 149]). MP is
by far the most efficient algorithm: it consumed <3mAh and <10s in the worst case. By contrast,
PPM is not a practical option except with the smallest models: it already begins to surpass the
other algorithms' worst case when trained with roughly 2,000 requests, while its own worst-case
consumption is prohibitive at 1,335mAh and 4,740s.
Recall from Table 6.1 that users from our initial dataset average 1,320 requests. At first blush,
this may suggest that the above analysis was unnecessary and that all four algorithms should
be efficient enough. However, given the wide variations across individual behaviors, many users
would likely benefit from larger models. Thus, our analysis can provide insights for future work
that targets models of varying sizes for specific user groups.
6.5.3 RQ3 – Training-Data Size Reduction
• RQ3 – Can the training-data size be reduced without significantly sacrificing the prediction models' accuracy?
To answer RQ3, we first investigate the three data-pruning strategies introduced in Sec-
tion 6.4.1.3: Most Occurring Requests (MOR), Most Accessed Domains (MAD), and Most Suit-
able Domains (MSD). We then explore whether there is a lower bound on the training-data size
that yields the smallest prediction models without sacrificing accuracy.
6.5.3.1 Data-Pruning Strategies
We apply each of the three pruning strategies to the training data of all 39,004 models studied in
RQ2, and use HiPHarness to evaluate the 117,012 pruned models. Table 6.3 shows the average
training-data size reduction (Size Red.) and the average values of our accuracy metrics, Static
Precision (SP), Static Recall (SR), and Dynamic Recall (DR), after applying each pruning strat-
egy across all four algorithms; since the SR and DR values are always 1 for our baseline algorithm
Naïve, we omit them. In all but six cases, accuracy is improved after data pruning (shaded cells).
Figure 6.4 overlays the average values from Table 6.3 on top of the original results from
Figure 6.2. Each set of three bars corresponds to the results of the given algorithm after applying
MOR (left), MAD (middle), and MSD (right). For example, the leftmost three bars (in red)
represent the MOR-, MAD-, and MSD-yielded values for DG's Static Precision.
Interestingly, all three strategies show promising results across all accuracy metrics and al-
gorithms while reducing the training-data sizes. The largest accuracy drop is only 0.06 (DG's
Dynamic Recall after applying MSD). By contrast, accuracy is significantly improved in most
cases, with the largest boost of 0.29 (MP's Static Precision after applying MSD). Note that the
largest accuracy drop and boost are both achieved by MSD, suggesting that a pruning strategy
may have a highly variable impact on different metrics and/or algorithms. This is confirmed by
the other two strategies: MOR and MAD produce both improvements and drops for different cases.
Figure 6.4: Average accuracy values across the DG, PPM, MP, and Naïve algorithms after ap-
plying MOR (left), MAD (middle), and MSD (right), overlaid on top of the original accuracy
values from Figure 6.2
MOR (left) and MSD (right) outperform MAD (middle) in most cases, while achieving sig-
nificantly larger size reductions (see Table 6.3). However, it would be inappropriate to select the
"best strategy" a priori. The preferred choice will depend on one's objective. For instance, if
one aims to maximize DG's Dynamic Recall, MAD may in fact be the best pruning strategy. If
reducing the data size (i.e., resource consumption) is critical, then MSD should be selected, with
DG or MP as candidate choices of algorithm. Our results provide empirical data to guide future
techniques on selecting suitable pruning strategies and prediction algorithms.
Table 6.3: Average training-data size reduction after applying the MOR, MAD, and MSD pruning
strategies, and the resulting Static Precision (SP), Static Recall (SR), and Dynamic Recall (DR)
Pruning    Size            DG                       PPM                      MP                  Naïve
Strategy   Red.    SP      SR     DR       SP      SR     DR       SP      SR     DR       SP
MOR        54%     0.453   0.368  0.386    0.468   0.128  0.173    0.452   0.404  0.368    0.212
MAD        27%     0.391   0.390  0.437    0.302   0.097  0.150    0.212   0.302  0.325    0.069
MSD        62%     0.484   0.379  0.376    0.432   0.174  0.194    0.453   0.443  0.409    0.194
Among the four algorithms, DG still achieves competitive accuracy, but it benefits the least
from data pruning. A possible reason is that DG tracks dependencies among requests when
building the model, and pruning may cause the loss of dependency information. By contrast,
MP's accuracy is markedly improved by pruning in all cases. MP consistently benefits the most
from the MSD strategy, which also achieves the largest size reduction (recall Table 6.3). This sug-
gests that MSD's grouping of requests in a domain may have aligned especially well with MP's
notion of popular requests. Finally, it is notable that after the pruning, MP achieves compa-
rable results to DG (the best-performing algorithm identified in RQ2); MP's Static Recall even
surpasses DG's after applying either MOR or MSD. This observation, that an algorithm may
outperform its previously superior competitor if the training set is reduced, has not been made
previously [172, 119, 171, 69, 46].
6.5.3.2 Lower-Bound Identification
Figure 6.5: A schematic of the Sliding Window approach with window size 5, sliding distance 1,
and training ratio 0.8
Motivated by the above findings, we conducted an in-depth analysis of the relationship between
the number of requests in a training set and the accuracy of the resulting prediction models. Our
goal was to explore the lower bound of the training-data size that yields the smallest model without
sacrificing its accuracy. To this end, we developed an approach, Sliding Window (SW), for
selecting a subset of training data that, unlike the pruning discussed above, is independent of the
data's contents. SW tailors HiPHarness's Training Selection and Test Selection components.
It helps to assess models of different sizes over different time slices for each user, accounting for
various user behaviors throughout the day and ensuring the freshness of the Historical Requests used
to train the models. SW allowed us to expand our dataset from the 39,004 models studied in RQ2 to
over 7 million models with different training-data sizes of interest.
Figure 6.5 illustrates SW on 10 Historical Requests. We define a Request Window (RW) as a
subset of Historical Requests that is used to train and evaluate a prediction model of a certain size.
Each RW has a corresponding window size that indicates the number of requests in the RW. For
example, in Figure 6.5, RW has a window size of 5. Given Historical Requests and a window size
x, we first build a model using the first x requests, and then slide RW by a sliding distance y to
build the next model. The sliding distance is adjustable. In our study, we set it to be the length
of the test set, so that all Test Requests from the previous RW can be included in the training set
of the next model. In Figure 6.5, the sliding distance is 1. We iteratively build models of the same
window size until the Historical Requests no longer contain sufficient requests to form a complete RW
of size x.
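In code, SW reduces to a simple generator over a user's Historical Requests, as in the Python sketch below; this is our own illustration, while HiPHarness realizes it through its Training Selection and Test Selection components.

    def sliding_windows(historical_reqs, window_size, training_ratio=0.8):
        split = int(window_size * training_ratio)
        step = window_size - split            # sliding distance = length of the test set
        start = 0
        while start + window_size <= len(historical_reqs):
            rw = historical_reqs[start:start + window_size]   # one Request Window (RW)
            yield rw[:split], rw[split:]                      # (training set, test set)
            start += step

With window_size=5 and training_ratio=0.8, the generator reproduces Figure 6.5's setting of a sliding distance of 1.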
We used SW to explore a range of training ratios and window sizes. For brevity, we report
the results with the training ratio of 0.8 and 11 window sizes; other values for the two parameters
yielded qualitatively similar results. Specifically, we use window sizes of 50, 100, 200, 300, 400, 500,
600, 700, 800, 900, and 1000 requests. This choice of parameters resulted in 1,788,648 prediction
models for individual users with each of the four prediction algorithms, placing the total number
of prediction models we studied at over 7.1 million.
To more closely track the trends of different accuracy metrics as the window sizes increase, we
grouped the models of each window size and calculated the mean accuracy values in each group.
Figure 6.6 shows the trends of the three metrics across the four algorithms. Notably, all trends
converge at a certain window size. Furthermore, every trend is monotonic, suggesting that there
may exist a cut-off point after which including more training data will not affect the given model's
accuracy.
Figure 6.6: The averages for Static Precision (top), Static Recall (middle), and Dynamic Recall
(bottom). Y-axes capture the metrics' values; X-axes indicate the different data points corre-
sponding to window sizes
To confirm this finding with statistically significant evidence, we conducted a pairwise-comparison
analysis using the ANOVA post-hoc test based on the Games-Howell method [78], since it does
not require groups of equal sample size. This test analyzes whether each pair of groups' means has
a statistically significant difference. In our case, there are 11 groups corresponding to the different
window sizes, thus the test analyzes the 55 possible pairs for each of the three accuracy metrics
calculated for a given prediction algorithm. For instance, the 55 pairwise comparison results
for DG's Static Precision show that there is a statistically significant difference among the pairs
with window sizes ≤ 400, but not for sizes > 400. Therefore, DG's Static Precision converges at
window size 400. This is consistent with DG's plot in Figure 6.6's top diagram.
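Such pairwise comparisons can be reproduced with an off-the-shelf Games-Howell implementation. The sketch below assumes the pingouin library and uses small placeholder data, since the study's per-model results are far larger; the column names and values are purely illustrative.

    import pandas as pd
    import pingouin as pg

    # One row per trained model: its window-size group and, e.g., its Static Precision.
    records = (
        [{"window_size": 50,   "static_precision": v} for v in (0.30, 0.33, 0.36)] +
        [{"window_size": 400,  "static_precision": v} for v in (0.44, 0.46, 0.47)] +
        [{"window_size": 1000, "static_precision": v} for v in (0.45, 0.46, 0.44)]
    )
    df = pd.DataFrame(records)

    # Games-Howell post-hoc test: pairwise comparison of group means that does not
    # assume equal sample sizes or variances across the window-size groups.
    posthoc = pg.pairwise_gameshowell(data=df, dv="static_precision", between="window_size")
    print(posthoc)   # inspect p-values to find the size beyond which no pair differs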
Table 6.4 summarizes all the pairwise comparisons, including the cut-off points and trends.
The trends refer to the directionality of the relationship between the means and window sizes: a
trend is positive if the mean is higher for larger window sizes, and negative otherwise. For cases
with positive trends, a cut-off point corresponds to the amount of training data that yielded the
highest accuracy; adding more data beyond this point did not improve the model's predictive
power. In the three cases with negative trends, the amount of data that yielded the highest
accuracy corresponds to the smallest window size (50).
Table 6.4: Pairwise comparison result summary
Algorithm   Metric             Cut-off Point   Trend
DG          Static Precision   400             Positive
DG          Static Recall      500             Positive
DG          Dynamic Recall     800             Positive
PPM         Static Precision   300             Positive
PPM         Static Recall      500             Positive
PPM         Dynamic Recall     800             Positive
MP          Static Precision   400             Positive
MP          Static Recall      800             Negative
MP          Dynamic Recall     600             Negative
Naïve       Static Precision   500             Negative
All cut-off points are lower than the largest window size (1000 requests). In fact, none of our
models needed to be trained with more than 800 requests; on average, the models needed to be
trained with fewer than 400 requests to achieve results comparable to those trained with up to 1000
requests. This goes against the conventional wisdom that a prediction model should invariably
perform better with more training data, and directly supports our hypothesis stated above.
6.5.4 Broader Implications
To our knowledge, our work is the first to provide evidence of the feasibility of history-based
prefetching on mobile platforms, using sufficiently small models to accommodate today's pri-
vacy constraints. Furthermore, we have demonstrated the effectiveness of existing algorithms
when properly configured and applied, directly challenging the previously published conclusion
that history-based prefetching is ineffective on mobile platforms (avg. 16% precision and 1%
recall) [172]. Our study thus motivates re-opening the research in this area and highlights the op-
portunity to revisit, and possibly improve, existing prediction algorithms. We now discuss several
insights gained from our study to guide future work.
Even though DG yielded the best accuracy and MP the lowest resource consumption, we
argue that neither should be used without a suitable data-pruning strategy, since pruning had the
dual benefit of reducing the training-data size and improving the models' accuracy. We showed
that, with an effective pruning strategy, MP can outperform DG, achieving comparable accuracy
while maintaining superior resource consumption.
Note that the existing algorithms were not built with mobile users in mind, and that we ap-
plied them as-is. Our results can thus be treated as the "floor" achievable by the algorithms, with
a range of further possible improvements based on mobile-user characteristics. Our data-pruning
strategies are an example of such improvements. The MSD strategy was the standout overall,
with the greatest training-data size reduction (avg. 62%) and accuracy improvement (avg. 84%).
At the same time, our data showed that certain users benefited more from the MOR or MAD
strategies. This is consistent with the diverse user behaviors we observed, suggesting future work
to categorize those behaviors and tailor existing or devise new prefetching strategies accordingly.
Although this chapter aims to draw general conclusions based on a large user base, our dataset
contains the detailed Test Results of each individual subject user, providing a starting point to
explore user-behavior categorization.
The identified cut-off points for the different prediction models (recall Table 6.4) can serve as
a useful guide for exploring suitable model sizes in future techniques. For example, to maximize
the values of all three accuracy metrics in the two best-performing algorithms, DG and MP, no
more than 800 requests are needed. Given the results from Figure 6.3, this indicates that both
algorithms will be able to train hundreds (DG) or even thousands (MP) of models "on-device"
with negligible resource consumption. In turn, this directly facilitates training multiple models
throughout the day, which is necessary to maintain the models' "freshness".
DG and MP, the best-performing history-based algorithms, achieved comparable accuracy to
the state-of-the-art content-based technique PALOMA (avg. precision 0.478) [192]; other content-
based techniques did not report accuracy results [122, 61]. This highlights a potential advantage
of history-based prefetching, as it is applicable to any app. However,
our data indicates that certain users may not benefit from history-based prefetching since they
tend not to send sufficient numbers of repeated requests. This suggests an opportunity of com-
bining history-based and content-based approaches to address their respective limitations. For
instance, a content-based technique can analyze the program structure to determine possible sub-
sequent requests when the historical data is limited; on the other hand, a history-based technique
can personalize the content-based technique to only prefetch the most likely requests based on an
individual user's past behaviors.
In addition to the lessons learned from our empirical study, the HiPHarness framework pro-
vides a novel, reusable, and tailorable foundation for automatically exploring different aspects
of history-based prefetching based on any dataset of interest. Those aspects include identify-
ing suitable training-data sizes, trading off different prediction metrics, exploring data-pruning
strategies, assessing different prediction algorithms, fine-tuning various thresholds, and so on. We
have already demonstrated how HiPHarness can be used to explore some of these aspects in a
flexible manner via "plugging and playing" its customizable components. In fact, HiPHarness's
applicability is not limited to mobile users: it can be applied in other settings by simply replacing
its input (i.e., Historical Requests) with any historical data of interest.
6.5.5 Threats to Validity
Our dataset was collected from a university's network with 11,476 users, suggesting that our
findings may not hold for all settings and types of users. However, universities have large and
diverse populations, spanning students, faculty, administrative staff, contract employees, outside
vendors, and visitors. The university also provides on-campus housing, dining, working, and
entertainment venues, which captured user behaviors in different environments. Our user base is
also over 400 times larger than that of the closest comparable study [172].
Our data was collected over a single day, raising the possibility that our findings may not
apply to historical data collected over longer periods. However, this was done by design: we
aim to investigate the feasibility of small prediction models on mobile platforms, to mitigate the
challenge of obtaining large amounts of user data as discussed in Section 6.1. Furthermore, the
choice of the 24-hour period was made randomly.
The collected data includes network traffic from both mobile apps and mobile browsers across
different device types. Our results may thus fail to capture specific characteristics of mobile
apps, mobile browsers, or certain devices. However, our goal is to demonstrate the feasibility
of history-based prefetching on mobile platforms in general. As discussed in Section 6.5.3, a
smaller but better-tailored model may yield better results; thus, our study "underapproximates"
the models' achievable accuracy. While we may have missed certain users who never connected to
the university network, we had access to data from a large number of diverse users, which allowed
us to obtain statistically significant evidence for our results.
We assumed that the cache size is unbounded and cached requests do not expire. This is
guided by our objective to assess the accuracy of small prediction models (recall Section 6.2).
6.6 Summary
This chapter presents the first attempt to investigate the feasibility of history-based prefetching
using small models on mobile platforms. We did so by developing HiPHarness, a tailorable
framework that enabled us to automatically assess over 7 million models built from real mobile
users' data. Our results provide empirical evidence of the feasibility of small prediction models,
opening up a new avenue for improving mobile-app performance while meeting stringent privacy
requirements. We further show that existing algorithms from the browser domain can produce
reasonably accurate and efficient models on mobile platforms, and provide several insights on how
to improve them. For example, we developed several strategies for reducing the training-data
size while maintaining, or even increasing, a model's accuracy. Finally, HiPHarness's reusability
and customizability provide a flexible foundation for subsequent studies to further explore various
aspects of history-based prefetching.
While this initial study focused on identifying general trends that span our large user base,
tracking personalized network-usage patterns across different time periods (e.g., morning vs. night,
weekday vs. weekend, work vs. vacation) is likely to result in even more accurate models. This
may, however, require access to significantly more user data (in volume, variety, and geographic
span) than we have currently been granted. We will work on overcoming this challenge, and on
the related critical issue of user privacy inherent in studies such as ours.
Chapter 7
Evaluation Framework FrUITeR
Evaluating prefetching and caching techniques is a time-consuming task since it requires realistic
test cases triggered by real users, which cannot be generated automatically by existing test gen-
eration techniques. Recent research has explored opportunities for reusing existing realistic tests
from an app to automatically generate new tests for other apps, in order to reduce the manual
effort involved [49, 146, 91, 113, 48]. However, lacking a standard baseline and evaluation proto-
col, it is unclear which test-reuse technique can generate more suitable test cases for evaluating
prefetching and caching techniques under a given scenario. Furthermore, the evaluation of such
techniques currently remains manual, unscalable, and unreproducible, which can waste effort and
impede progress in this emerging area. To address these challenges, this chapter presents FrUITeR
[191], a framework that automatically evaluates test-reuse techniques under the same baseline.
With such realistic test cases, the order of the triggered network requests is based on
real users, and the tests can thus be used for evaluating and comparing different prefetching and caching
techniques without engaging new users for each app. We further made the test cases publicly
available [11] to benefit future prefetching and caching techniques, as well as other techniques
that require real-user engagement in the mobile-app domain.
7.1 Motivation
Writing UI tests is tedious and time-consuming [91, 114], increasingly driving the focus toward
automated UI testing [63]. However, existing work tends to target tests that yield high code
coverage, rather than realistic, usage-based tests that explore an app's functionality and mimic
real-user behaviors, e.g., sign-in, purchase, search, etc. Developers heavily rely on these usage-
based tests, but currently have to write them manually [114, 63].
To reduce the manual effort of writing usage-based tests, recent research has explored reusing
existing tests in a source app to generate new tests automatically for a target app [49, 146, 91,
113, 48]. The guiding insight is that different apps expose common functionalities via semantically
similar GUI elements. This suggests that it is possible to reuse existing UI tests across apps, in
effect generating the tests automatically, by mapping similar GUI elements.
Four recent techniques have targeted usage-based test reuse across Android apps [49, 91, 113,
48]. (Rau et al. recently proposed a test-reuse technique for web applications [146]; in this
dissertation, we focus on Android apps due to the availability of a larger number of existing
techniques to evaluate, although in principle our work is not limited to Android.) While these
techniques have shown promise, we have identified five important limitations
that hinder their comparability, reproducibility, and reusability. In turn, this can lead to dupli-
cation and wasted effort in this emerging area.
(1) The metrics applied to date evaluate whether GUI events from a source app are correctly
transferred to a target app, but do not consider whether the transferred tests are actually useful.
It is possible that events are transferred correctly, but the generated test is "wrong". This can
be, e.g., because a generated test is missing events and thus not executable. Moreover, the
metrics used in existing work are not standardized even when evaluating the same aspects of different
techniques, making it difficult to compare the techniques.
(2) Each existing technique's evaluation process requires significant manual effort: every trans-
ferred event in each test must be inspected to determine whether the transfer is performed cor-
rectly. This imposes a practical limit on the number of tests that can be evaluated. For instance,
the authors of ATM [48] had to restrict their comparison with GTM [49] to a randomly selected
50% of the possible source-target app combinations due to the task's scale.
(3) There are no standardized guidelines for conducting the manual inspections, making the
evaluation results biased and hard to reproduce. For instance, ATM's authors acknowledge the
possibility of mistakes in the manual process [48]. Such mistakes are currently hard to locate,
verify, or eliminate by other researchers.
(4) Existing techniques are designed as one-off solutions and evaluated as a whole. This makes
it difficult to isolate and compare their relevant components. For instance, GTM [49], ATM [48],
and CraftDroid [113] all contain functionality to compute a "similarity score" between two GUI
elements, but it is unclear which of those specific components performs the best against the same
baseline. This would impede subsequent research that could potentially benefit from identifying
underlying components that should be reused and/or improved.
(5) Existing techniques make different assumptions that hinder their comparison. For instance,
GTM [49] and ATM [48] require access to apps' code, and cannot be directly compared with
techniques evaluated on closed-source apps. Similarly, AppFlow [91] requires its tests to be written
in a special-purpose language it defines, and cannot transfer tests used in other techniques.
To address limitations (1)-(3), as well as limitation (4) in part, we have developed FrUITeR, a
Framework for evaluating UI Test Reuse. FrUITeR consists of three key elements: a set of new
evaluation metrics that consolidate the metrics used by existing techniques and expand them to
measure important aspects that are currently missed; two baseline UI test-reuse techniques that
establish the lower and upper bounds for the evaluation metrics; and an automated workflow
that modularizes UI test-reuse functionality and significantly reduces the manual effort. With
FrUITeR, one can automatically evaluate test-reuse techniques on apps/tests of interest against
the same baseline, thus opening the possibility of in-depth studies at a large scale.
To fully address limitation (4), as well as limitation (5), we have extracted the core components
from existing techniques and established a benchmark for evaluating and comparing them. Our
benchmark currently contains 20 subject apps with 239 test cases, involving 1,082 GUI events.
This benchmark is used by FrUITeR to evaluate side-by-side the extracted components and the
two baseline components we developed, yielding 11,917 test-reuse instances.
The results obtained by FrUITeR revealed several important findings. For example, we have
been able to pinpoint specific trade-offs between ML-based (e.g., AppFlow) and similarity-based
(e.g., ATM) techniques. We have also identified scenarios that may seem counter-intuitive, such
as the fact that manually writing tests requires less effort than attempting automated transfer in
certain cases. Finally, performing evaluations on a much larger data corpus allowed us to refute
some conclusions reached in prior work.
This chapter makes the following contributions. (1) We develop FrUITeR to automatically
evaluate UI test reuse with an expanded set of metrics as compared to existing work, and two
baseline techniques that help to provide the lower and upper bounds of UI test reuse in a given
scenario. (2) We identify and extract the core components from existing test-reuse techniques,
enabling their fair comparison. (3) We establish a reusable benchmark with standardized ground
truths that facilitates the reproducibility of UI test-reuse techniques' evaluation and compari-
son. (4) We use FrUITeR to conduct a side-by-side evaluation of the state-of-the-art test-reuse
techniques, uncovering several needed improvements in this area. (5) We make FrUITeR's imple-
mentation and all data artifacts publicly available [11], directly fostering future research.
7.2 Background on Test Reuse
In this section, we introduce a motivating example and terminology related to UI test reuse,
followed by an overview of the strategies pursued by existing work and how they have been
evaluated to date.
Figure 7.1: Sign-in tests for Wish (a1) and Etsy (b1-b3)
7.2.1 Motivating Example and Terminology
Figure 7.1 shows the screenshots of the sign-in process of two popular shopping apps: Wish (left)
and Etsy (right). Each screen is labeled with an identifier, e.g., a1 is the first screen of Wish. In
each screen, there may be one or more actionable GUI elements with which end-users can interact
based on the associated actions. For instance, the "Sign In" button in screen a1 (a1-3) is associated
with a click action. Actionable elements and their associated actions embody GUI events (defined
below). By contrast, the label "Sign In" that is circled in screen a1 is a non-actionable GUI
element.
As an illustration, assume that Wish's sign-in test exists and our goal is to automatically
transfer it to Etsy. The relevant actionable GUI elements in this sign-in example are labeled and
will be used to describe the following key terms used throughout this chapter.
GUI Event, or event in short, is a triple comprising (1) an actionable GUI element, (2) an
associated action, and (3) an optional input value (e.g., user input for a text box). We reuse this
definition from existing work [49, 113, 48]. For simplicity, we use the label of a GUI element (e.g.,
a1-1) to refer to the GUI event triple (a small code sketch of these notions appears at the end of
this subsection).
Canonical Event is an abstracted event that captures a category of commonly occurring
events. An example canonical event may be AppSignIn, and it would correspond to a1-3 and
b3-3 from Figure 7.1, as well as similar events from other apps, such as Log In.
Usage-Based Test exercises a given functionality in an app, such as sign-in; unless noted
otherwise, test or test case refers to a usage-based test in this chapter. A usage-based test
consists of a sequence of GUI events. For instance, Figure 7.1 highlights the sign-in test in
Wish (left) as the event sequence {a1-1, a1-2, a1-3}.
Source App is the app with known tests that can be transferred to other apps with similar
usage. For instance, Wish is a source app with a sign-in test that can potentially be transferred
to other apps with sign-in functionality. Target App is the app to which one aims to transfer
existing tests. A target app can reuse the tests from multiple source apps; at the same time, it
can serve as a source app to other target apps if it contains known tests. Both source apps and
target apps are used extensively in existing work [49, 113, 48].
Source Test is an existing test for a given source app that should be transferred to a target
app to generate a Transferred Test. Ground-Truth Test is an existing test for a target app
that is used to evaluate whether the transferred test is correct (i.e., whether the two tests match).
Source Event, Transferred Event, and Ground-Truth Event refer to the GUI events that
belong to the source test, transferred test, and ground-truth test, respectively.
Ancillary Event is a special type of transferred event that is not mapped from a source
event, but is added in order to reach certain states in the target app. For example, b1-1 and b2-1
from Figure 7.1 may need to be added as ancillary events in order to reach Etsy's sign-in screen
b3; such events do not exist in the source test.
Null Event is an event that should have been mapped from a source event, but was not identified as such by a given test-reuse technique. Thus, the null event does not exist in the transferred test, but it has a corresponding source event from which it maps. This could be because of (1) a test-reuse technique's inaccuracy or (2) the difference in app behaviors. An example of the latter would be the inability to map Etsy's events b1-1 and b2-1 to Wish in Figure 7.1.
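To make these terms concrete, the following sketch (in Python, with hypothetical names and made-up input values) shows one way to represent a GUI event as the triple defined above, and a usage-based test as a sequence of such events; it is an illustration only, not FrUITeR's internal representation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class GUIEvent:
    element: str                       # label of the actionable GUI element, e.g., "a1-1"
    action: str                        # associated action, e.g., "click" or "send_keys"
    input_value: Optional[str] = None  # optional input value, e.g., text typed by the user

# Wish's sign-in test from Figure 7.1 as a sequence of GUI events
# (the concrete input values below are made up for this example):
wish_sign_in: List[GUIEvent] = [
    GUIEvent("a1-1", "send_keys", "user@example.com"),
    GUIEvent("a1-2", "send_keys", "secret-password"),
    GUIEvent("a1-3", "click"),
]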
7.2.2 Strategies Explored to Date
Four recent techniques [49, 48, 91, 113] have targeted UI test reuse in Android. The shared core concern of these techniques is to correctly map the GUI events from a source app to a target app. In the example from Figure 7.1, the source test sign-in in Wish comprises the event sequence {a1-1, a1-2, a1-3}. By mapping the GUI events in this test from Wish to Etsy as {a1-1 → b3-1, a1-2 → b3-2, a1-3 → b3-3}, a new sign-in test for the target app, Etsy, is generated as {b3-1, b3-2, b3-3}.
The existing techniques can be classified into two main categories, based on how they map GUI events across different apps: AppFlow [91] is an ML-based technique, while CraftDroid [113], GTM [49], and ATM [48] are similarity-based techniques. We have abstracted the two categories and their underlying workflows through studying the similarities and differences across the existing techniques.
ML-based techniques learn a classifier from a training dataset of different apps' GUI events based on certain features, such as text, element sizes, and image recognition results of graphical icons. The classifier is used to recognize app-specific GUI events and map them to canonical GUI events used in a test library, so that app-specific tests can be generated by reusing the tests defined in the test library.
Similarity-based techniques define their own algorithms to compute a similarity score between pairs of GUI events in a source app and a target app based on the information extracted from the two apps, such as text, element attributes, and Android Activity/Fragment names. The similarity score is used to determine whether there is a match between each GUI event in the source app and the one in the target app based on a customizable similarity threshold. For example, a1-1 in Wish (left) from Figure 7.1 is likely to have a higher similarity score with b3-1 than with other GUI events in Etsy (right). In that case, a1-1 in Wish will be mapped to b3-1 in Etsy. Another important component in similarity-based techniques is the exploring strategy, which determines the order of computing the similarity score between the GUI events in the source and target apps. The target app's events that are explored earlier usually have a higher chance of being mapped.
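As a rough illustration of this general strategy (not the exact algorithm of any of the four techniques), the following Python sketch maps each source event to the first explored target event whose score clears a threshold; the similarity function, the threshold value, and the exploration order of the target events are all placeholders.

def map_events(source_events, target_events, similarity, threshold=0.5):
    """Map each source event to a sufficiently similar target event, if any.

    `similarity` stands in for a technique-specific scoring function (e.g.,
    based on text, element attributes, or Activity/Fragment names); the order
    of `target_events` plays the role of the exploring strategy.
    """
    gui_map = {}
    for src in source_events:
        mapped = None
        for tgt in target_events:        # events explored earlier win first
            if similarity(src, tgt) > threshold:
                mapped = tgt
                break
        gui_map[src] = mapped            # None means no target event cleared the threshold
    return gui_map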
7.2.3 Existing Evaluation Metrics
To evaluate their test-reuse strategies, existing techniques have focused on the accuracy of the GUI event mapping. This section overviews the metrics they applied, which guided us in defining the expanded set of FrUITeR's metrics (see Section 7.4.1.1). Note that the exact definitions of existing metrics were not provided in the prior publications [91, 49, 48, 113]; we had to separately contact the authors of each technique to obtain the details introduced below.
AppFlow [91] is an ML-based technique that maps app-specific events to canonical events using a classifier, as discussed earlier. AppFlow's classifier is evaluated with the standard accuracy metric [123], indicating the percentage of the correctly-classified GUI events among all the GUI events being classified. Correctly-classified GUI events include two cases: (1) app-specific events that are mapped to the correct canonical events (true positives); and (2) app-specific events that are not mapped to any canonical events because no such canonical events exist (true negatives).
CraftDroid [113] is a similarity-based technique. After the transfer of events from a source app to events in a target app, CraftDroid's authors manually identify three cases: (1) true positive (TP) occurs when the transferred event is the same as the one obtained during a manual transfer; (2) false positive (FP) occurs when the transferred event is different; and (3) false negative (FN) occurs when CraftDroid fails to find a matching event, while the manual transfer succeeds. Precision and recall are then calculated based on the three cases. It is important to note that CraftDroid's FP includes both the incorrectly transferred events and the newly added ancillary events (if any), which is different from the FP case defined in other techniques. We further illustrate this in Section 7.4.1.1.
ATM [48] and GTM [49] are also similarity-based techniques, and ATM is an enhancement of GTM by the same authors. Similar to CraftDroid, the authors manually inspect the transferred results and identify four cases: (1) correctly matched means the source event is mapped to the correct event in the target app (TP); (2) incorrectly matched means the source event is mapped to the wrong event in the target app (FP); (3) unmatched (!exist) means the source event is not mapped to any events and no such events exist in the target app (TN); (4) unmatched (exist) means the source event is not mapped to any events although the matching event exists in the target app (FN). ATM and GTM do not calculate precision or recall, but present the raw percentages of each of the four cases.
7.3 FrUITeR's Principal Requirements
This section elaborates on the key limitations of current test-reuse techniques and their evaluation processes, initially outlined in Section 7.1. These limitations serve as the foundation of the five principal requirements we focused on in FrUITeR's design (Section 7.4) and instantiation (Section 7.5).
Prior to developing FrUITeR, we investigated the existing techniques and their evaluations [91, 113, 49, 48] in depth. Beyond consulting the available publications, we also studied the techniques' implementations and produced artifacts [2, 4, 7, 14], and engaged their authors in, at times, extensive discussions to obtain missing details and resolve ambiguities. In the end, we identified five limitations that are likely to hinder future advances in this emerging area. We base FrUITeR's principal requirements on these limitations.
Req 1: Metrics used by FrUITeR to evaluate test-reuse techniques shall be standardized and reflect practical utility. Existing techniques are evaluated with different, and differently applied, metrics (recall Section 7.2.3), which harms their side-by-side comparison. More importantly, all techniques to date have focused on whether GUI events from a source app are correctly transferred to a target app, without considering whether the transferred tests are actually meaningful and applicable in the context of the target app. It is thus possible that all GUI events are mapped correctly, but the transferred test cannot be applied, e.g., due to missing ancillary events (recall Section 7.2.1). None of the existing techniques are able to identify such scenarios; FrUITeR must be able to do so.
Req 2: FrUITeR's workflow shall reduce the required manual effort and thus scale to larger numbers of apps and tests than possible with current test-reuse techniques. Existing techniques' evaluation processes require significant manual effort to inspect every transferred event in each test. For example, ATM [48] was evaluated on 4 app categories, where each category, in turn, consisted of 4 apps. On average, each app had 10 tests to be transferred and each test had 5 events. Within each app category, ATM transferred the tests of each app to the remaining 3 apps, resulting in 48 source-target app pairs in total. For each app pair, ATM's authors had to manually inspect an average of 50 transferred events (10 tests × 5 events), i.e., 2,400 events in total. This is why they were forced to restrict their comparison with GTM [49] to a randomly selected half of possible source-target app pairs. FrUITeR must address this shortcoming by providing a more scalable evaluation workflow that requires markedly less manual effort.
Req 3: Evaluation results produced by FrUITeR shall be reproducible. As discussed in Section 7.2.3, the current techniques' evaluation results depend on identifying the case to which each transferred event belongs (e.g., correctly matched, false positive, etc.). Such "ground-truth mappings" are determined manually. However, there are no standard guidelines for conducting inspections, making the results potentially biased and unreproducible. In Figure 7.1's example, it is debatable whether {a1-1 → b3-1} is correct because a1-1 only takes the user's email, while b3-1 takes both the email and username. ATM's authors also acknowledge the possibility of mistakes in the manual process [48]. More importantly, any such mistakes are hard to locate or verify by other researchers, since the results of manual inspection, and the ground-truth mappings on which they are based, are recorded in ad-hoc ways. Thus, to facilitate future research in this area, the evaluation results produced by FrUITeR must be reproducible, with a ground-truth representation that can be independently verified, reused, and modified.
Req 4: Test-reuse capabilities incorporated and evaluated by FrUITeR shall be modularized. Despite providing similar functionality, existing test-reuse techniques are designed as one-off solutions and evaluated as a whole. This makes it difficult to reuse or compare their relevant components. In turn, it invites duplication of effort and introduces the risk of missed opportunities for advances by other researchers, and even by the techniques' own developers. To address this problem, FrUITeR must modularize each test-reuse artifact it evaluates, allow its independent (re)use, and associate the obtained evaluation results with the appropriate artifacts.
Req 5: Benchmarks provided and applied by FrUITeR shall be reusable. Existing test-reuse techniques have been evaluated using different benchmark apps and tests, additionally hampering their comparison. In fact, only three subject apps were shared by two (AppFlow [91] and CraftDroid [113]) out of the four existing techniques in their evaluations. The underlying reason is the different assumptions made by the techniques. For instance, GTM and ATM rely on the Espresso testing framework [9] that requires the apps' source code. As another example, AppFlow's tests are written in a special-purpose language based on Gherkin [12] and cannot be reused by techniques that capture tests in other languages (e.g., Java, used by ATM and GTM). Thus, FrUITeR must establish a set of uniform benchmarks with reusable apps and tests that can serve as the foundation for evaluating and comparing solutions in this area.
7.4 FrUITeR's Design
This section presents FrUITeR's design, with a focus on two features that address requirements Req 1, Req 2, Req 3, and partially Req 4: new evaluation metrics and an automated, modular workflow. We also introduce two novel test-reuse techniques to serve as baselines for bounding the existing techniques' evaluation results.
7.4.1 FrUITeR's Metrics
To address Req 1, FrUITeR incorporates a pair of evaluation metrics: (1) fidelity focuses on how correctly the GUI events are mapped from a source app to a target app; (2) utility measures how useful the transferred tests are in practice.
7.4.1.1 Fidelity Metrics
As explained in Section 7.2.3, fidelity of the mapping has been the main focus of existing techniques, but the previous metrics have been used inconsistently. (Existing publications in this area have referred to some of these as "accuracy" metrics; we use "fidelity" to avoid confusion with the specific metric named "accuracy" defined previously in the literature [123] and used by one of the techniques we studied [91].) To form a fair playground for comparing test-reuse techniques, we investigated existing metrics in depth. We did so by consulting available documentation and discussing the metrics with the authors of all four techniques. We standardized this information into a comprehensive set of fidelity metrics in FrUITeR, as shown in Table 7.1.
Table 7.1: Fidelity metrics as used in AppFlow [91], CraftDroid [113], ATM [48], GTM [49], and FrUITeR.

| Technique  | True Pos. (TP)    | False Pos. (FP)     | True Neg. (TN)     | False Neg. (FN)   | Accuracy | Precision | Recall    |
|------------|-------------------|---------------------|--------------------|-------------------|----------|-----------|-----------|
| AppFlow    | anon              | anon                | anon               | anon              | Accuracy | dnc       | dnc       |
| CraftDroid | TP                | FP1                 | none               | FN                | none     | Precision | Recall    |
| ATM/GTM    | Correctly Matched | Incorrectly Matched | Unmatched (!exist) | Unmatched (exist) | dnc      | dnc       | dnc       |
| FrUITeR    | Correct           | Incorrect           | NonExist           | Missed            | Accuracy | Precision | Recall    |
Table 7.1 presents the fidelity metrics used across the different test-reuse techniques, and their relationship to the standard metrics as defined in the literature [123]. Each row shows a mapping from the names for the metrics used by each technique to the typical fidelity metrics' names indicated in the header. "anon" cells represent metrics that are not reported by a technique, but are used internally to calculate other metrics that are reported. "dnc" cells represent metrics that are not calculated by a given technique, but can be determined based on other metrics used. Finally, "none" cells represent cases where a metric is not used by a technique and cannot be calculated from the available information. FrUITeR covers all seven metrics, changing several metrics' names to better reflect their application to test reuse, as will be further discussed below.
Recall from Section 7.2.3 that CraftDroid's FP category covers two cases: FP1 corresponds to "Incorrectly Matched" events in ATM/GTM and "Incorrect" in FrUITeR; FP2 corresponds to the ancillary events that are not considered by other techniques. FrUITeR also excludes the ancillary events from its Incorrect category because they can be benign or even needed (e.g., b1-1 and b2-1 from Figure 7.1), and do not reflect the fidelity of the GUI event mapping. For instance, if ancillary events were considered to be False Positives, a large number of them would result in a low Precision for the GUI event mapping. However, this would not be a meaningful measure since the ancillary events are not mapped from the source app. Such events are thus not relevant to the mapping's fidelity, but should be considered by the utility metrics, introduced next.
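For concreteness, once the four sets of cases are available, the seven fidelity metrics reduce to counts and ratios; a minimal Python sketch using the standard definitions [123], and assuming the four sets have already been computed, is shown below.

def fidelity_metrics(correct, incorrect, missed, non_exist):
    # Per Table 7.1: Correct ~ TP, Incorrect ~ FP, Missed ~ FN, NonExist ~ TN.
    tp, fp, fn, tn = len(correct), len(incorrect), len(missed), len(non_exist)
    total = tp + fp + fn + tn
    return {
        "Correct": tp, "Incorrect": fp, "Missed": fn, "NonExist": tn,
        "Accuracy": (tp + tn) / total if total else 0.0,
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }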
7.4.1.2 Utility Metrics
FrUITeR introduces two utility metrics to indicate how useful a transferred test is. This aspect is not considered in prior work, but is needed because a high-fidelity event mapping does not guarantee a successfully transferred test, or vice versa. For instance, a target app's ground-truth test may contain ancillary events not covered by source events, making it impossible to generate a "perfect" test by event mapping alone. On the flip side, a low-fidelity mapping may accidentally generate a "perfect" test. Thus, it is important to measure the utility with respect to the ground-truth test independently of the event mapping's fidelity.
To this end, we first define an effort metric, to measure how close the transferred test is to the ground-truth test, by calculating the two tests' Levenshtein distance [107]. Levenshtein distance is widely used in NLP to measure the steps needed to transform one string into another. In our case, each step is defined as the insertion, deletion, or substitution of an event in the transferred test.
Secondly, we define a reduction metric, to assess the manual effort saved by the generation of the transferred test, compared to writing the ground-truth test from scratch:
Reduction = (#gtEvents − effort) / #gtEvents
The value of reduction may be negative if transforming the transferred test into the corresponding ground-truth test takes more steps than constructing such a ground-truth test from scratch.
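A minimal sketch of the two metrics is shown below; the Levenshtein computation is a textbook dynamic-programming version operating over event sequences rather than FrUITeR's actual implementation, and it assumes events can be compared for equality.

def levenshtein(a, b):
    """Minimum number of event insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x with y
        prev = curr
    return prev[-1]

def utility_metrics(transferred_events, ground_truth_events):
    effort = levenshtein(transferred_events, ground_truth_events)
    reduction = (len(ground_truth_events) - effort) / len(ground_truth_events)
    return effort, reduction  # reduction may be negative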
Note that each usage-based test targets a scenario with a specific flow of interest; multiple flows would result in multiple tests (e.g., sign-in from "homepage" vs. from "settings"). For each particular flow, it is possible for the ground-truth test to contain different "acceptable" ancillary events based on one's interest, which would result in multiple "acceptable" ground-truth tests. In FrUITeR's current benchmark (see Section 7.5.2), we manually constructed one ground-truth test for each flow with the minimal amount of ancillary events, to match prior work. However, FrUITeR's ground-truth tests can be modified or extended to obtain their corresponding utility results. For instance, researchers can specify multiple "acceptable" ground-truth tests for a given flow, and measure the transferred test's utility with respect to each ground-truth test.
We acknowledge that the utility aspect (i.e., how useful a transferred test is) can be subjective, depending on one's goal. Alternative utility metrics (e.g., bug-identification power, executability, code coverage) can be added to FrUITeR's customizable workflow (see Section 7.4.2). FrUITeR's current utility metrics specifically center around effort because they are applied to tests transferred by techniques whose end-goal is to reduce the effort of writing tests manually. Refining utility's definition with extended metrics is worthy of further study, but outside our scope. Our goal was to show that utility is important and measurable, to motivate further exploration of this important aspect that has been missed by prior work.
7.4.2 FrUITeR's Workflow
To address Req 2, Req 3, and partially Req 4 from Section 7.3, we designed an automated evaluation workflow with customizable components, shown in Figure 7.2. The goal of FrUITeR's workflow is to generate reproducible evaluation results for a test-reuse technique's core functionality. The workflow's automation is enabled by two key aspects: (1) the uniform representation of the inputs and artifacts needed in the evaluation process, and (2) a set of customizable components that output the evaluation results of interest automatically.
Figure 7.2: Overview of FrUITeR's automated workflow.
7.4.2.1 Uniform Representation of Inputs
As Figure 7.2 shows, FrUITeR takes two types of input: Test Input (bottom-left) and Mapping
Input (top-right). The two are a combination of inputs taken and artifacts produced by exist-
ing test-reuse techniques, as well as three new inputs introduced in FrUITeR to automate the
evaluation process: Ground-Truth Tests, GUI Maps and Canonical Maps.
Test Input contains source tests, ground-truth tests, and transferred tests as defined in Section 7.2.1. The tests may be captured in various forms by a test-reuse technique, and we
cannot assume that they will be analyzable in a standard way a priori. For instance, all tests in
ATM [48] and GTM [49] are represented as Espresso tests [9] in Java, while CraftDroid [113]'s
source tests are written in Python using Appium [3] and its transferred tests are represented in
JSON [19]. In order to enable their automated evaluation, the heterogeneous tests that are part of
FrUITeR's Test Input thus need to be standardized. FrUITeR's Event Extractor converts the Test
Input into a standardized representation of source events, ground-truth events, and transferred
events as detailed in Section 7.4.2.2.
Mapping Input consists of the GUI Map and the Canonical Map, which enable automated evaluation of a test-reuse technique's fidelity. The two maps are newly introduced by FrUITeR and captured using a standardized representation. The GUI Map contains the GUI event mapping from a source app to a target app generated by a given test-reuse technique, and is used to compute the fidelity metrics introduced in Section 7.4.1.1. Prior work does not provide GUI Maps, but only the final Transferred Tests. The events in these tests cannot be used to calculate fidelity by comparing with source events directly, because the transferred events may include ancillary and null events. We further illustrate how we extract the GUI Maps from existing techniques and evaluate their fidelity automatically with FrUITeR in Section 7.5. On the other hand, the Canonical Map contains the mapping from app-specific events to canonical events. This map is manually constructed and is used as the ground-truth mapping for FrUITeR's Fidelity Evaluator component discussed below. Note that AppFlow [91] can generate a Canonical Map automatically using ML techniques. However, some of AppFlow's mapping results can be wrong, and thus cannot be used as the ground truth.
7.4.2.2 Customizable Components
FrUITeR introduces three customizable components, shown as shaded boxes in Figure 7.2: Event
Extractor, Fidelity Evaluator, and Utility Evaluator.
Event Extractor leverages program analysis to extract the GUI event sequence from the
usage tests' code. The sequence is represented as each event's ID or XPath, depending on which
of the two is used in the test. ID and XPath are widely used to locate specific GUI elements in tests in various domains, including Android apps [10] and web apps [16]. For simplicity, we will use "ID" to refer to either the ID or XPath of a specific event in the rest of this chapter.
To extract the event sequence, Event Extractor analyzes the Test Input to locate the program
point of each event based on its corresponding API, e.g., click [6] or sendKeys [25] for tests written
with Appium [3]. Once it identifies the location, Event Extractor determines the event's caller,
i.e., the GUI element where the event is triggered, and performs a def-use analysis [38] to trace
back the definition of the caller's ID. This definition is specified in a given API of the testing framework, such as findElementById() in Appium [34]. In that case, the def-use analysis is used to pinpoint the findElementById() call that corresponds to the event's caller so that the ID's value
can be determined. The input value associated with the event (if any) is determined by def-use
analysis in the same manner. In the end, the converted Source Events, Ground-Truth Events,
and Transferred Events are represented in a uniform way with IDs regardless of what testing
framework is used.
Note that if the Test Input is written in different programming languages or testing frameworks, multiple Event Extractor instances need to be implemented if not available previously. However, this is a one-time effort, and subsequent work can reuse an existing Event Extractor when applied on tests written in the same language and testing framework.
Algorithm 5: Fidelity Evaluator
Input: EventList srcEvents, GUIMap guiMap, CanonicalMap srcCanMap, tgtCanMap
Output: Sets correct, incorrect, missed, nonExist
1  correct = incorrect = missed = nonExist = ∅
2  for i = 1 to srcEvents.size do
3      src ← srcEvents.GET(i)
4      trans ← guiMap.GetMapped(src)
5      srcCan ← srcCanMap.GetCanonical(src)
6      transCan ← tgtCanMap.GetCanonical(trans)
7      if trans != null then
8          if transCan == srcCan then
9              correct.Put(src)
10         else
11             incorrect.Put(src)
12     else
13         if tgtCanMap.contains(srcCan) then
14             missed.Put(src)
15         else
16             nonExist.Put(src)
17 return correct, incorrect, missed, nonExist
Furthermore, the Event Extractor is easily customizable to process tests written with different frameworks by replacing the relevant APIs' signatures. For instance, when identifying an event caller's ID, the relevant API is findElementById() if using Appium [3] to test mobile apps, or findElement() if using Selenium [24, 30] to test web apps. By simply replacing the relevant API signature, the Event Extractor will be able to process tests written in both the Appium and Selenium frameworks.
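For instance, a hypothetical configuration table for such an extractor might simply list the relevant API signatures per framework, so that supporting a new framework only requires extending the table; the structure and names below are illustrative, not FrUITeR's actual configuration.

# Hypothetical per-framework signature table an Event Extractor could consult.
FRAMEWORK_APIS = {
    "appium":   {"locators": ["findElementById", "findElementByXPath"],
                 "actions":  ["click", "sendKeys"]},
    "selenium": {"locators": ["findElement"],
                 "actions":  ["click", "sendKeys"]},
}

def is_locator_call(framework, method_name):
    return method_name in FRAMEWORK_APIS[framework]["locators"]

def is_action_call(framework, method_name):
    return method_name in FRAMEWORK_APIS[framework]["actions"]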
Fidelity Evaluator takes the Source Events produced by Event Extractor and Mapping
Input, and automatically outputs the sets of (1) correct, (2) incorrect, (3) missed, and (4) nonExist
cases for calculating FrUITeR's seven fidelity metrics (recall Table 7.1).
Algorithm 5 describes the Fidelity Evaluator in detail. The algorithm iterates through each source event to determine to which of the four cases it should be assigned (Lines 2-16). To do so, it first gets the current source event (src) and the transferred event mapped from it (trans) based on the GUI Map (Lines 3-4). It then converts the app-specific events src and trans into their corresponding canonical events srcCan and transCan, using their respective Canonical Maps, so that the events are comparable (Lines 5-6). Finally, to determine which of the four cases src falls into, the algorithm first checks whether trans is a null event. If not, transCan will be compared against srcCan to determine whether the transferred event refers to the same canonical event as the source event, and src will be added to either the correct or incorrect set accordingly (Lines 7-11). If trans is null, the source event has not been mapped to any events in the target app. The algorithm then iterates through the Canonical Map of the target app (tgtCanMap) to determine whether the matching event srcCan exists in the target app, and src will be added to either the missed set or the nonExist set accordingly (Lines 12-16).
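Assuming the GUI Map and the two Canonical Maps behave like dictionaries keyed by app-specific events, Algorithm 5 can be rendered in a few lines of Python; the following is a simplified sketch, not FrUITeR's Fidelity Evaluator itself.

def fidelity_evaluator(src_events, gui_map, src_can_map, tgt_can_map):
    correct, incorrect, missed, non_exist = set(), set(), set(), set()
    for src in src_events:
        trans = gui_map.get(src)               # transferred event, or None (null event)
        src_can = src_can_map[src]             # canonical form of the source event
        if trans is not None:
            trans_can = tgt_can_map.get(trans)
            (correct if trans_can == src_can else incorrect).add(src)
        elif src_can in tgt_can_map.values():  # a matching event exists in the target app
            missed.add(src)
        else:
            non_exist.add(src)
    return correct, incorrect, missed, non_exist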
Utility Evaluator automatically analyzes the Ground-Truth Events and Transferred Events produced by the Event Extractor. It uses this information to compute the two utility metrics, effort and reduction, based on their definitions described in Section 7.4.1.2.
7.4.2.3 Relationship to FrUITeR's Requirements
FrUITeR's workflow yields three key benefits that directly target Req 2, Req 3, and Req 4, introduced in Section 7.3.
First, the only manual effort required by FrUITeR is to construct the Canonical Maps by relating app-specific events to canonical events. This is a one-time effort per app, and each event only needs to be labeled once regardless of how many times it appears in a test (Req 2). By contrast, in previous work [113, 49, 48], each app-specific event needs to be manually labeled every time it appears in a test, possibly resulting in thousands of manual inspections.
Second, FrUITeR establishes ground truths with uniform representations: Canonical Maps are the ground truth for assessing fidelity, while Ground-Truth Events help to assess utility. This renders the evaluation results yielded by FrUITeR reproducible (Req 3). For instance, any mistakes or subjective judgments made in the current techniques' manual evaluation processes can be easily located by inspecting the Canonical Maps, and independently reproduced. Further, FrUITeR's Canonical Maps are reusable, modifiable, and extensible for subsequent studies, helping to avoid duplicated work.
Third, FrUITeR's workflow consists of customizable modules that isolate the evaluation to a relevant component of a test-reuse technique (Req 4). For instance, the Fidelity Evaluator only assesses the performance of GUI event mapping, instead of evaluating a technique as a whole. Moreover, both the Fidelity Evaluator and the Utility Evaluator can be customized, reused, or extended to automatically evaluate other metrics of interest based on the standardized inputs and artifacts that FrUITeR defines, directly fostering future research.
7.4.3 FrUITeR's Baseline Techniques
To better understand the performance of a test-reuse technique, we developed two baseline techniques, Naïve and Perfect, that establish the lower- and upper-bounds achievable by the fidelity and utility metrics in a given scenario.
Algorithm 6: Naïve Baseline Technique
Input: EventList srcEvents, AppInfo tgtAppInfo
Output: EventList transEvents
1  transEvents ← ∅
2  currentAct ← tgtAppInfo.getMainActivity()
3  foreach src ∈ srcEvents do
4      isMapped ← FALSE
5      events ← tgtAppInfo.getAllEvents(currentAct)
6      events.randomizeOrder()
7      foreach event ∈ events do
8          if event.action == src.action then
9              similarity ← getRandomSimilarity(0, 1)
10             if similarity > Threshold then
11                 transEvents.add(event)
12                 currentAct ← event.nextActivity()
13                 isMapped ← TRUE
14                 break
15     if ¬isMapped then
16         transEvents.add(null)
17 return transEvents
7.4.3.1 Naïve Baseline
The Naïve baseline uses a random strategy to select the events in a target app to which each source event should be mapped. This sets the practical lower-bound of fidelity. As Algorithm 6 shows, Naïve initially explores the target app from the main Activity [18] (Line 2). For each source event, it obtains all the events at the current Activity (events) in a random order (Lines 5-6), and then tries to find a match between the current source event src and each event in events (Lines 7-14). When mapping src to event, Naïve first checks if the associated actions of the two events are the same, and only computes the similarity score when they are. The similarity score is computed by selecting a random value between 0 and 1 (Line 9), which are the lower and upper bounds used in existing work. If the similarity score of src and event is above a certain threshold, event is added to the list maintained in transEvents (Line 11). At that point, Naïve continues to explore the target app from the Activity reached by the transferred event (Line 12), and marks the current source event src as mapped (Line 13). In the end, if the source event is not mapped, it will be marked as a null event and added to transEvents (Lines 15-16). Null events correspond to either the True Negative or False Negative categories in Table 7.1.
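The core of the baseline is the random scoring step; a condensed Python sketch is given below, assuming a simple app model that exposes the main Activity, the events available in each Activity, and the Activity reached by an event (these accessors are hypothetical). It simplifies Algorithm 6 rather than reproducing FrUITeR's implementation.

import random

def naive_transfer(src_events, target_app, threshold=0.5):
    transferred = []
    current_activity = target_app.main_activity()
    for src in src_events:
        mapped = None
        candidates = list(target_app.events_in(current_activity))
        random.shuffle(candidates)                   # random exploration order
        for event in candidates:
            # Only same-action events are considered; the "similarity" is random.
            if event.action == src.action and random.random() > threshold:
                mapped = event
                current_activity = target_app.next_activity(event)
                break
        transferred.append(mapped)                   # None marks a null event
    return transferred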
7.4.3.2 Perfect Baseline
The Perfect baseline transfers the source events based on the ground-truth mapping we establish (recall Section 7.4.2), assuming all source events are correctly mapped to the target app. The Perfect baseline thus represents a "perfect" GUI event mapping and always achieves 100% fidelity by definition. Specifically, we are interested in the utility achieved by the Perfect baseline since it represents the upper-bound of the transferred tests' practical usefulness, which is not considered by existing work. This can help us identify the room for improvement and guide future research in test-reuse techniques. We will discuss our findings regarding this in Section 7.6.
7.5 FrUITeR's Instantiation
This section describes how we instantiate FrUITeR to automatically evaluate the relevant modules of existing techniques alongside FrUITeR's baseline techniques, in partial satisfaction of Req 4. The evaluation is performed based on FrUITeR's reusable benchmark that addresses Req 5. To this end, we needed to provide information that enables FrUITeR's automated workflow discussed in Section 7.4.2 and depicted in Figure 7.2: the Source Tests that are supplied as inputs to a given test-reuse technique; the Transferred Tests and GUI Maps, which are produced as outputs of a given test-reuse technique; and the manually constructed ground truths, namely, Canonical Maps and Ground-Truth Tests. However, existing test-reuse techniques were not designed with FrUITeR's modular workflow in mind, and thus do not provide such information directly.
Section 7.5.1 explains how we mitigated the above challenge in order to extract the relevant
components from existing techniques and generate the information needed by FrUITeR. Note
that this step will not be necessary for future techniques if they follow FrUITeR's modularized
design. Section 7.5.2 presents FrUITeR's reusable benchmark for the uniform evaluation of test-
reuse techniques, which contains the Source Tests, Ground-Truth Tests, and Canonical Maps
used in FrUITeR's automated workflow. Finally, Section 7.5.3 provides the details of FrUITeR's
implementation and generated datasets.
7.5.1 Modularizing Existing Techniques
To lay the foundation for addressing Req 4, we modularized FrUITeR's design. In turn, this isolated the evaluation of the GUI event mapping's fidelity and the transferred tests' utility, as discussed in Section 7.4.2. However, the existing techniques are implemented and evaluated as fully integrated, one-off solutions that do not provide the artifacts needed by FrUITeR to generate the modularized evaluation results. Because of this, we had to extract the specific functionality from existing techniques' implementations that performs the GUI event mapping (recall Section 7.2.2). Once the GUI Maps are available, we can generate the Transferred Tests used in FrUITeR's Utility Evaluator. Note that the step of extracting GUI Mapper components is not needed for future test-reuse techniques if they follow FrUITeR's modularized design. For example, we directly applied FrUITeR on the two baseline techniques we developed, with no extra effort.
Extracting the GUI Mapper components from the existing techniques was challenging since we had to understand each technique's design and implementation in detail, and to modify its source code. To this end, in addition to the available publications, we studied in depth the existing approaches' implementations [2, 4, 14, 7] and communicated with their authors extensively. We describe the challenges we faced during this process and the specific component-extraction strategies we applied to each existing solution.
7.5.1.1 Extracting AppFlow's GUI Mapper
AppFlow [91] is an ML-based technique whose key component trains a classifier that maps app-specific events to canonical events, but does not map the events from a source app to a target app. To compare AppFlow with similarity-based techniques, we leverage its Canonical Maps to transfer the source events to the target app by (1) mapping each source event to the corresponding canonical event based on the source app's Canonical Map and (2) mapping this canonical event back to the app-specific event in the target app based on the target app's Canonical Map. AppFlow's implementation does not output its Canonical Maps, so we had to locate and modify the relevant component to do so. Moreover, AppFlow does not store its trained classifier, so we had to configure its ML model and re-train it. During this process, we communicated with AppFlow's authors closely to understand its code, to obtain proper configuration files and training data, and to ensure the correctness of our re-implementation.
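Conceptually, this two-step composition can be sketched as follows, treating each Canonical Map as a dictionary from app-specific events to canonical events; this is an illustration of our adaptation, not AppFlow's code, and when several target events share a canonical label the sketch simply keeps the first one.

def appflow_gui_map(src_events, src_can_map, tgt_can_map):
    # Invert the target's Canonical Map: canonical event -> target app-specific event.
    canonical_to_target = {}
    for tgt_event, canonical in tgt_can_map.items():
        canonical_to_target.setdefault(canonical, tgt_event)

    gui_map = {}
    for src in src_events:
        canonical = src_can_map.get(src)                   # step (1): source -> canonical
        gui_map[src] = canonical_to_target.get(canonical)  # step (2): canonical -> target
    return gui_map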
7.5.1.2 Extracting ATM's and GTM's GUI Mappers
As discussed earlier, ATM [48] was developed as an enhancement to GTM [49] and was shown to outperform it [48]. However, the authors of these two techniques compared them only on half of the source-target app pairs used in ATM's publication [48], due to the large manual effort required. Since FrUITeR largely automates the comparison process, we decided to extract the GUI Mapper components from both techniques to enable their comparison at a large scale.
An obstacle we had to overcome was that ATM and GTM both require the app's source code due to their use of the Espresso framework [9]. Thus, they cannot be compared as-is with techniques evaluated on closed-source apps, which would have limited our choice of benchmark apps. We discussed this issue with ATM's and GTM's authors and learned that the only step that requires source code for both techniques' GUI Mappers is computing the textual similarity score of image GUI elements (e.g., ImageButton). In that case, the text of the image's filename is retrieved from the app's code and analyzed to compute the similarity score. However, the main author confirmed that, in her experience, this feature is rarely needed in practice. We thus decided to extract ATM's and GTM's GUI Mapper components as stand-alone Java programs that do not require Espresso, omitting the filename-retrieval feature. We subsequently confirmed the correctness of our implementation with the two techniques' authors.
7.5.1.3 Extracting CraftDroid's GUI Mapper
CraftDroid's [113] implementation is only partially available. Its authors informed us that two of CraftDroid's modules, Test Augmentation and Model Extraction, were not releasable when we requested them, due to ongoing modifications, while the prior versions of the two modules were no longer available. The authors confirmed our observation that CraftDroid's GUI mapping functionality depends on the outputs of the two missing modules, and advised us that the best strategy would be for us to reimplement them based on CraftDroid's lone publication [113]. However, the publication in question is missing a number of details, which would introduce bias in our re-implementation: we would have no guarantee that the versions of the two components we produce are the same as those used in CraftDroid. Instead, we decided to rely on CraftDroid's published Transferred Tests [7] in our evaluation.
To obtain CraftDroid's GUI Maps, we inspected its published artifacts [7] and found that only certain events in the Transferred Tests have associated similarity scores, while other events are labeled as "empty". Further investigation showed that each event in the Transferred Tests belongs to one of three cases: (1) events with available similarity scores are successfully mapped from the source events; (2) "empty" events are mapped from the source events but no match is found by CraftDroid (i.e., null events); (3) the remaining events are not mapped from the source events but are added by CraftDroid (i.e., ancillary events). We excluded the ancillary events so that the resulting transferred events have a 1-to-1 mapping from the source events, giving us CraftDroid's GUI Maps.
7.5.2 FrUITeR's Benchmark
As discussed above in the motivation for Req 5, existing test-reuse techniques are evaluated on different apps and tests, which hinders their comparability. To address this, we established a reusable benchmark with the same apps and tests to serve as a shared measuring stick in this emerging domain. This section discusses our strategy for including existing apps and tests in the benchmark, and for generating the required ground truth.
7.5.2.1 Benchmark Apps and Tests
To maximize the results from existing work that we can attempt to reproduce, we first included the intersection of the subject apps used by existing work. This yielded 3 shopping apps: Geek, Wish, and Etsy. We further randomly selected 7 additional shopping apps and 10 news apps used by AppFlow [91]. This gave us 20 benchmark apps in total, as described in Table 7.2. Our rationale behind this choice of apps was two-fold: (1) AppFlow's authors manually inspected all app categories on Google Play and identified shopping and news as categories with common functionalities suitable for test reuse; (2) AppFlow was evaluated on the largest number of subject apps among the existing techniques. By comparison, ATM [48] used 16 open-source apps that are not as popular as those used in AppFlow.
To construct the benchmark tests, we further followed the test cases defined in AppFlow, with a similar rationale: (1) AppFlow's authors conducted an extensive study to manually identify tests that are shared in shopping and news apps; (2) AppFlow defines a larger number of tests compared to other work. For example, CraftDroid [113] only has 2 tests defined in each app category. We excluded those tests that require mocking external dependencies (e.g., a payment service). This resulted in 15 tests in the shopping category and 14 tests in the news category, shown in Table 7.3. Note that we cannot reuse AppFlow's tests directly because they are written in a special-purpose language defined by AppFlow for an entire app category rather than a specific app. Instead, we relied on multiple undergraduate and graduate students with Android experience to write the applicable tests for each of the 20 subject apps using Appium [3]. Some benchmark apps did not provide every functionality described in Table 7.3, ultimately resulting in a total of 239 tests involving 1,082 events across the 20 apps (the two right-most columns of Table 7.2), requiring 3,920 SLOC of Java code.

Table 7.2: Summary information of benchmark apps.

| Category | App ID | App Name        | #Downloads | #Tests | #Events |
|----------|--------|-----------------|------------|--------|---------|
| Shopping | S1     | AliExpress      | 100M       | 15     | 76      |
| Shopping | S2     | Ebay            | 100M       | 13     | 48      |
| Shopping | S3     | Etsy            | 10M        | 13     | 55      |
| Shopping | S4     | 5miles          | 5M         | 12     | 78      |
| Shopping | S5     | Geek            | 10M        | 13     | 85      |
| Shopping | S6     | Google Shopping | 1M         | 15     | 72      |
| Shopping | S7     | Groupon         | 50M        | 14     | 66      |
| Shopping | S8     | Home            | 10M        | 14     | 98      |
| Shopping | S9     | 6PM             | 500K       | 14     | 63      |
| Shopping | S10    | Wish            | 100M       | 14     | 85      |
| News     | N1     | The Guardian    | 5M         | 13     | 76      |
| News     | N2     | ABC News        | 5M         | 9      | 31      |
| News     | N3     | USA Today       | 5M         | 11     | 28      |
| News     | N4     | News Republic   | 50M        | 10     | 40      |
| News     | N5     | BuzzFeed        | 5M         | 11     | 50      |
| News     | N6     | Fox News        | 10M        | 11     | 28      |
| News     | N7     | SmartNews       | 10M        | 9      | 20      |
| News     | N8     | BBC News        | 10M        | 9      | 22      |
| News     | N9     | Reuters         | 1M         | 10     | 37      |
| News     | N10    | CNN             | 10M        | 9      | 24      |
These benchmark tests currently do not contain oracle events because only ATM [48] and
CraftDroid [113] can transfer oracles in principle. However, due to the limited availability of
CraftDroid's source code as mentioned earlier, we would not be able to obtain CraftDroid's results
using our benchmark, making a comparison across different techniques impossible. As additional
test-reuse techniques are developed with the ability to transfer oracles, FrUITeR's benchmark
tests can be extended to include oracle events to obtain their results as well. Note that as long
as future techniques follow FrUITeR's modularized design to provide the needed input (e.g., GUI
Maps of the oracle event mapping), FrUITeR will be able to automatically generate the results
of oracle events. A detailed tutorial is provided on FrUITeR's website [11].
Table 7.3: Benchmark test cases in shopping (TS) and news (TN) categories.

| Test ID   | Test Case Name  | Tested Functionalities                                 |
|-----------|-----------------|--------------------------------------------------------|
| TS1/TN1   | Sign In         | provide username and password to sign in               |
| TS2/TN2   | Sign Up         | provide required information to sign up                |
| TS3/TN3   | Search          | use search bar to search a product/news                |
| TS4/TN4   | Detail          | find and open details of the first search result item  |
| TS5/TN5   | Category        | find first category and open browsing page for it      |
| TS6/TN6   | About           | find and open about information of the app             |
| TS7/TN7   | Account         | find and open account management page                  |
| TS8/TN8   | Help            | find and open help page of the app                     |
| TS9/TN9   | Menu            | find and open primary app menu                         |
| TS10/TN10 | Contact         | find and open contact page of the app                  |
| TS11/TN11 | Terms           | find and open legal information of the app             |
| TS12      | Add Cart        | add the first search result item to cart               |
| TS13      | Remove Cart     | open cart and remove the first item from cart          |
| TS14      | Address         | add a new address to the account                       |
| TS15      | Filter          | filter/sort search results                             |
| TN12      | Add Bookmark    | add first search result item to the bookmark           |
| TN13      | Remove Bookmark | open the bookmark and remove first item from it        |
| TN14      | Textsize        | change text size                                       |
7.5.2.2 Benchmark Ground Truth
As described in Section 7.4.2, we define Canonical Maps to represent the ground truth for the fidelity of the GUI event mapping, and Ground-Truth Events to represent the ground truth for the utility of the transferred tests.
In our benchmark, we define 72 canonical events for the shopping apps and 55 for the news apps. Our canonical events are extended from AppFlow, aiming to reflect a finer-grained classification of GUI events. For instance, event "password" in the sign-in test (TS1/TN1 in Table 7.3), and events "password" and "confirm password" in the sign-up test (TS2/TN2 in Table 7.3), are all represented as the same canonical event "Password" in AppFlow. However, it is debatable whether that is appropriate. For example, mapping "password" in sign-up to "password" in sign-in may lead to non-executable tests. To remove ambiguity, we capture such events separately.
Based on the canonical events, we construct 20 Canonical Maps, one per subject app. We do so by manually relating to the canonical events a total of 561 subject apps' GUI events that appear in one or more of the 239 tests. As discussed in Section 7.4.2, this is the only manual step required by FrUITeR and is a one-time effort: the Canonical Maps can be reused when relying on the same subject apps. As a point of comparison, recall from Section 7.3 that evaluating 48 app pairs in ATM [48] required manually inspecting 2,400 events. By contrast, our one-time inspection of the 561 events enabled the use of 200 app pairs (2 categories × 10×10 apps, i.e., including an app's test transfer to itself) by every technique FrUITeR evaluated.
The Ground-Truth Events in our benchmark are extracted from the 239 tests by FrUITeR's
Event Extractor (recall Figure 7.2).
7.5.3 FrUITeR's Implementation Artifacts
FrUITeR's artifacts are publicly available [11]: its source code; final datasets; GUI Mappers extracted from existing work; implementations of baseline techniques, their GUI Maps, and Transferred Tests; benchmark apps and tests; and manually constructed benchmark ground truths. We highlight the key details of these artifacts below.
7.5.3.1 Source Code
FrUITeR's Event Extractor (recall Figure 7.2) is implemented in Java using Soot [26] (235 SLOC). FrUITeR's Fidelity Evaluator and Utility Evaluator are implemented in Python (1,045 SLOC). FrUITeR's baseline techniques Naïve and Perfect (recall Section 7.4.3) are likewise implemented in Python (112 SLOC). The GUI Mapper components extracted from existing techniques (recall Section 7.5.1) are implemented in their original programming languages: AppFlow in Python (1,084 SLOC); GTM in Java (1,409 SLOC); and ATM in Java (1,314 SLOC). The functionality that processes their outputs and generates the uniform representation of GUI Maps and Transferred Tests is implemented in Python (404 SLOC). As discussed earlier, due to CraftDroid's unavailable source code, we can only interpret its published artifacts [7]; that functionality is implemented in Python (86 SLOC). The data analyses that interpret our final datasets are written in R (585 SLOC).
7.5.3.2 Final Datasets
Our final datasets contain the results of 11,917 test transfer cases generated by the GUI Mappers from the existing techniques and our two baselines when applied on FrUITeR's benchmark. We apply 5 techniques (AppFlow, ATM, GTM, Naïve, and Perfect) to transfer tests across 20 shopping and news apps, involving 1,000 source-target app pairs (2 app categories × 100 app pairs in each category × 5 techniques). This yielded 2,381 result entries per technique. As discussed earlier, we have to rely on CraftDroid's final results, and can thus only compare CraftDroid to the other techniques on the 3 shopping apps (Geek, Wish, and Etsy) used both in our benchmark and in CraftDroid's evaluation. This gave us 12 result entries for CraftDroid since only 2 tests are transferred by CraftDroid in each app. Each of the total 11,917 result entries contains the following information: (1) the source and target apps; (2) the source, transferred, and ground-truth tests; (3) the technique used to transfer the test; (4) the correct/incorrect/missed/nonExist sets of GUI events output by FrUITeR's Fidelity Evaluator as described in Algorithm 5, and the seven corresponding fidelity metrics defined in Section 7.4.1.1; and (5) the values of the two utility metrics (effort and reduction) defined in Section 7.4.1.2. Note that obtaining these 11,917 result entries following prior work's evaluation processes would have required manual inspection of 53,963 events that appear across all of the source tests, which is infeasible in practice.
7.6 Findings
The datasets produced by FrUITeR include the results obtained by evaluating side-by-side the
extracted GUI Mapper components from the four existing test-reuse techniques [91, 49, 48, 113]
and the two baseline techniques we developed. In turn, this data enables further in-depth studies
of a range of research questions in this emerging domain. As an illustration, this section highlights
several findings uncovered by FrUITeR's datasets that are missed by prior work.
7.6.1 GUI Mapper Comparison
As discussed earlier, existing techniques are evaluated in their entirety, on different benchmark apps and tests, and using different evaluation metrics, all of which makes their results hard to compare. By contrast, FrUITeR was able to evaluate their extracted GUI Mappers side-by-side, with our two techniques, Naïve and Perfect, serving as baselines. We note that it is possible for a given test-reuse technique to produce results as a whole that may be different from those produced only by its extracted GUI Mapper. One reason may be that there is additional relevant functionality that is scattered across the technique's implementation. However, any such functionality can be added to the existing GUI Mappers, or introduced in additional FrUITeR components.
7.6.1.1 Fidelity Comparison
FrUITeR's website [11] contains the results of all seven fidelity metrics from Section 7.4.1.1 obtained using our benchmark. Due to space limitations, we show the results of three fidelity metrics (Precision, Recall, Accuracy), and restrict our discussion to Precision and Recall since Accuracy follows a similar trend to Precision; the results of the four remaining fidelity metrics (Correct, Incorrect, Missed, NonExist) can provide an in-depth understanding of each of the four specific cases, and can be found on FrUITeR's website [11]. Figure 7.3 shows the average precision and recall achieved by the four existing techniques as well as Naïve; we omit Perfect since its values are always 100% by definition.
Figure 7.3: Comparison of average precision and recall.
Figure 7.4: Comparison of average accuracy.
For each technique except CraftDroid, the top (blue) bar shows the average calculated based on 2,381 cases transferred among both shopping and news apps. To meaningfully compare CraftDroid with other techniques, even if only partially, we show the averages calculated based on the 12 cases for which we have CraftDroid's data in the bottom (orange) bars. CraftDroid only transferred "Sign In" and "Sign Up" tests in the 3 shopping apps (Geek, Wish, and Etsy), leading to the 12 cases (6 source-target app pairs × 2 tests).
We highlight three observations based on the results from Figure 7.3. First, every existing technique yields lower recall than precision on the larger (blue) dataset, meaning that it suffers from more missed (i.e., false negative) than incorrect (i.e., false positive) cases. Although AppFlow's recall is the highest among the existing techniques, it exhibits the largest drop-off between its precision and recall values. A plausible explanation is that, as an ML-based technique, AppFlow will likely fail to recognize relevant GUI events if no similar events exist in its training data. This was somewhat unexpected, however, given that AppFlow's authors carefully crafted its ML model to the app categories we also used in FrUITeR, and suggests that additional research is needed in selecting and training effective ML models for UI test reuse. By comparison, similarity-based techniques such as ATM will miss fewer GUI events in principle: they can always compute a similarity score between two events and return the mapped events whose scores are above a given threshold. However, if the similarity threshold is set too low, it will result in more incorrect cases, leading to low precision.
A related observation is that AppFlow's precision outperforms the other techniques across the board, for both the larger (blue) and smaller (orange) datasets. This is because AppFlow has the advantage of more information, obtained from a large corpus of apps in its training dataset, than the similarity-based techniques, which compute the similarity scores based only on the information extracted from the source and target apps under analysis. However, AppFlow's recall is lower than both ATM's and CraftDroid's on the 12 (orange) cases from Geek, Wish, and Etsy. This reinforces the above observation that an ML-based technique will fail to recognize GUI events if no similar events exist in its training data.
Finally, our data confirms that ATM indeed improves upon GTM, as indicated in their pairwise comparisons across both precision and recall, on both the large and small datasets. In fact, GTM exhibits the lowest fidelity of all existing techniques, and its recall across the 2,381 (blue) cases is actually lower than that achieved by the Naïve strategy. We note that GTM's design is geared to transferring tests in programming assignments that share identical requirements, and is clearly not suited to heterogeneous real-world apps.
7.6.1.2 Utility Comparison
Figure 7.5 shows the two utility metrics yielded by each of the four existing techniques and the two baselines. Recall from Section 7.4.1.2 that utility measures how useful the transferred tests are in practice compared to the ground-truth tests. The objective of utility is to minimize the effort while maximizing the reduction.
Figure 7.5: Comparison of average effort and reduction.
The utility of existing techniques shows similar trends to those observed in the case of fidelity. For example, AppFlow outperforms other techniques, while GTM exhibits similar performance to that of Naïve. This indicates a possible correlation between the fidelity of the GUI event mapping and the utility of the transferred tests.
At the same time, we observe that, while our Perfect GUI Mapper achieves higher utility than the remaining techniques, that utility is not optimal. In fact, Perfect's average reduction is under 50% across the 2,381 cases in the larger dataset (top, blue bar). In other words, even with the best possible mapping strategy, we save less than half of the effort required to complete the task manually. The previously published techniques perform much worse than this: AppFlow saves under 30%, ATM under 10%, and GTM under 1% of the required manual effort, while the reduction yielded by CraftDroid on the smaller (orange) dataset is lower than Perfect's on either of the two datasets. This indicates that fidelity is clearly not the only factor to consider in order to achieve the desired utility, and that there is large room for improvement in future test-reuse techniques.
To verify the above insights, we conducted pairwise correlation tests between the seven fidelity and two utility metrics. Overall, the results, further discussed below and provided in their entirety in FrUITeR's online repository [11], show a weak correlation between fidelity and utility. This reinforces our observation that accurate GUI mappings can yield useful transferred tests, but are not the only relevant factor. In turn, this finding calls for exploration of other components in test-reuse techniques, since the focus on GUI event mapping alone can hit a "ceiling", as shown by the Perfect baseline. We discuss such possible directions next.
7.6.2 Insights and Future Directions
Guided by the above observations, we explore potential strategies for improving UI test reuse with various statistical tests and manual inspections of FrUITeR's datasets. Due to space limitations, we highlight four findings that were not reported by previous work.
Source app selection matters for a given target app. Figures 7.3, 7.4, and 7.5 all show consistent improvement across the techniques in the smaller datasets (12 cases transferred among 3 apps) compared to the larger ones (2,381 cases transferred among 20 apps). This suggests that certain source-target app pairs achieve better results than others. For example, we found that app pairs involving Wish, Geek, and a benchmark app called Home (all of which are developed by the same company, Wish Inc.) achieve high fidelity and utility, regardless of the technique used. Another such compatible app pair is ABC News and Reuters. Performing a large-scale evaluation enabled by FrUITeR will help spot pairings like this, and give researchers a starting point to explore the characteristics that can lead to better transfer results.
Automated transfer is not suitable for all tests. Our utility metrics revealed large effort and negative reduction in some cases, meaning that correcting a transferred test required more work than writing it from scratch. Further inspection revealed that this is primarily due to a test's length rather than a technique's accuracy. For instance, Perfect showed no benefit (reduction ≤ 0) 16% of the time, and the average number of source events in those cases is only 4. This suggests that, for simple tests, manual construction may be preferable. Future research should consider the criteria for suitable tests to transfer instead of transferring all source tests.
There is a trade-o between ML- and similarity-based techniques. As discussed
above, an insucient training set in an ML-based technique may yield low recall, while a low
similarity threshold in a similarity-based technique can address this but may yield low precision.
This suggests two future research directions. First, selecting training sets and similarity thresholds
is important, but existing techniques did not justify their choices [91, 49, 48, 113]. There is clearly a
need for further study of novel strategies such as incorporating dynamic selection criteria based on
target app characteristics. Second, future research should consider the trade-offs across different
test-reuse techniques and provide guidance on selecting the most suitable techniques for a given
scenario.
Test length is not a key factor influencing fidelity. CraftDroid and GTM studied the relationship between test length and their transferred results. For instance, CraftDroid showed a strong negative correlation between test length and its two fidelity metrics (coefficient < -0.5 in both cases). To verify these findings, we conducted correlation tests on FrUITeR's much larger datasets. Our results indicate a negative but very weak correlation between test length and FrUITeR's fidelity metrics (-0.25 < coefficient < 0 across all seven cases). This shows that test length is not the key factor that impacts fidelity, arguing that future research targeting reuse of complex tests may be a fruitful direction.
7.7 Summary
This chapter has presented FrUITeR, a customizable framework for automatically assessing UI
test reuse, in order to select suitable test cases for evaluating prefetching and caching techniques
in the mobile-app domain. FrUITeR has been instantiated and successfully demonstrated on
the key functionality extracted from existing test-reuse techniques that target Android apps. In
the process, we have been able to identify several avenues of future research that prior work has
either missed or actually flagged as not viable. We publicly release FrUITeR [11], its accompanying
artifacts, and all of our evaluation data, as a way of fostering future research in this area of growing
interest and importance. Finally, the usage-based test cases used by FrUITeR can directly benefit
the evaluation of future prefetching and caching techniques, as well as other techniques that
require real-user engagement in the mobile-app domain.
Chapter 8
Related Work
This chapter discusses the work related to this dissertation. Recall that Chapter 3 introduced a literature review of existing prefetching and caching techniques in the browser domain, along with the lessons learned in order to apply them to the mobile-app domain. In the rest of this chapter, we focus on prefetching and caching in the mobile-app domain, and discuss existing work on mobile app testing that is related to the evaluation framework FrUITeR presented in Chapter 7.
8.1 Prefetching and Caching in Mobile Apps
Prefetching has yielded a large body of work since the birth of the Internet [96, 69, 70, 67,
85, 151, 127, 169, 61, 192, 172, 136, 134, 93, 139, 46, 182, 35, 56, 150]. More recently, our
empirical study presented in Chapter 4 has shown the large potential of prefetching and caching on
mobile platforms [194]. Related work includes identifying performance bottlenecks [132, 168, 164],
balancing trade-offs between prefetching benefits and waste [89, 47], and speculatively loading web resources to speed up page load time [169, 127].
Cachekeeper [188] made an initial effort to study the redundant web traffic in mobile apps
and proposed an OS-level caching service for HTTP requests. Without proposing any prefetching
techniques, Cachekeeper can only show benets for a limited number of redundant HTTP requests.
EarlyBird [171] proposed a social network content prefetcher to reduce user-perceived latency,
but the technique is limited to the social-network domain. Looxy [84] uses the Apriori algorithm [36] to identify groups of requests that are highly related to each other in order to predict future requests, and offloads the prefetching and caching functionalities to a local proxy. Bouquet [108] has applied program analysis techniques to bundle HTTP requests in order to reduce energy consumption in mobile apps. Bouquet detects Sequential HTTP Requests Sessions (SHRS), in which the generation of the first request implies that the following requests will also be made, and then bundles the requests together to save energy. This can be considered a form of prefetching. However, this work does not address inter-callback analysis, and the SHRS are always in the same callback. Therefore, the "prefetching" only happens a few statements ahead (within milliseconds most of the time) and has no tangible effect on app execution time.
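To illustrate the request-grouping idea behind proxy-side approaches such as Looxy, the sketch below mines frequently co-occurring request pairs from logged sessions and uses an observed request to suggest prefetch candidates; the session format, URLs, and support threshold are illustrative assumptions rather than any tool's actual implementation.

    # Illustrative Apriori-style request grouping: find request pairs that
    # frequently co-occur in the same session, then use an observed request to
    # prefetch its frequent partners. Thresholds and log format are assumptions.
    from collections import Counter
    from itertools import combinations

    sessions = [  # placeholder request logs, one set of URLs per user session
        {"/home", "/feed", "/profile"},
        {"/home", "/feed", "/ads"},
        {"/home", "/feed", "/profile"},
        {"/home", "/settings"},
    ]

    min_support = 0.5  # a pair must appear in at least half of the sessions
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    frequent_pairs = {
        pair for pair, n in pair_counts.items() if n / len(sessions) >= min_support
    }

    def prefetch_candidates(observed: str) -> set[str]:
        """Requests frequently seen together with the observed request."""
        return {b if a == observed else a
                for a, b in frequent_pairs if observed in (a, b)}

    print(prefetch_candidates("/home"))  # e.g., {'/feed', '/profile'}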
To fill the gap, our work PALOMA (recall Chapter 5) is the first content-based prefetching technique to address the network latency problem in the mobile-app domain, and it has motivated several recent follow-up efforts [61, 122]. For example, APPx [61] utilizes program analysis to extract the dependency relationships between network transactions (request-response pairs), which are used to construct prefetch requests by learning the unknown values from past network requests. NAPPA [122] leverages user navigation patterns to personalize "what" and "when" to prefetch
using dynamic analysis.
History-based prefetching still remains largely unexplored on mobile platforms. Prior work
has mainly focused on mobile browsers [104, 174, 97, 53, 54, 119], while missing mobile apps
where users spend over 80% of their time [58]. The few techniques targeting mobile apps have
been restricted to specic domains, such as social media [171], video streaming [101, 157], and
ads [129]. We believe this restriction is caused by the already mentioned challenge of accessing
user data, limiting history-based prefetching to specic applications and/or domains with public
data (e.g., Twitter [171]). Thus, our work HiPHarness (recall Chapter 6) takes the first step to explore a novel strategy of assessing history-based prefetching across different domains, by relying
on small amounts of user data. The closest work to ours is the evaluation of one history-based
prediction algorithm (MP [52]) using data collected during one year from 25 iPhone users [172].
By contrast, our conclusions are based on four algorithms, including MP, and a much larger user
base but over a much shorter time period.
Another complementary research thread focuses on balancing Quality-of-Service (QoS) in order to answer "how much" to prefetch and cache under different circumstances [186, 89, 47], such as different network conditions, battery life, and cellular data usage. Our focus, on the other hand, is "what" and "when" to prefetch and cache.
Finally, a number of existing efforts focus on fast app pre-launching by predicting which app the user will use next [137, 179, 45], while our focus is prefetching and caching network requests
triggered within an app.
8.2 Mobile App Testing
In the past decade, a large body of research has been devoted to mobile app testing, due to mobile apps' dominant market share and frequent updates [63, 51, 158, 124, 115, 27].
Existing testing techniques in mobile apps can be categorized into manual testing and auto-
mated testing. Currently, manual testing is still preferred and widely used in practice since it can generate high-quality tests based on realistic use cases [98, 63, 124, 102, 115] (this type of realistic test case is defined as a usage-based test in this dissertation, as discussed in Section 7.2.1). However, the biggest drawback of manual testing is the large amount of time and effort involved, such as the domain-specific knowledge required of testers and the time spent on manually constructing test cases.
Thus, different automated test generation techniques have been proposed, which can be divided into three main categories based on a recent survey [63]. (1) Random testing techniques [27, 120, 153, 187] employ a random strategy to generate UI events (e.g., clicks) to test the app. They are particularly suitable for stress testing, but cannot trigger events that need specific user inputs and are likely to generate redundant, and sometimes unrealistic, event sequences. In addition, they do not have a stopping criterion for each test, but rather resort to a manually specified timeout or a maximum number of events. (2) Model-based testing (MBT) techniques [158, 39, 40, 185, 62, 86, 128, 44] build a Graphical User Interface (GUI) model for an app, usually a finite state machine, and use it to generate events that cover the states in the model. The results of MBT techniques highly depend on the models they use, which share the limitations of the underlying techniques used to build the model, such as the unsoundness of static analysis [170] and coarse-grained representation of states [44]. These limitations can lead to missing or unrealistic tests; e.g., tests that change the internal states of the app without affecting the GUI will not be generated [63]. (3) Systematic testing techniques [124, 121, 41, 44, 33] use systematic exploration strategies (e.g., evolutionary algorithms) to guide the exploration of tests towards previously uncovered code. These techniques have clear benefits in discovering tests that require specific inputs, but they rely on sophisticated techniques, such as symbolic execution or evolutionary algorithms, and are thus usually expensive and unscalable [63].
Recently, a few automated test generation techniques that can generate realistic test cases by reusing existing tests (i.e., test-reuse techniques) have been proposed in the mobile-app domain [49, 146, 91, 113, 48]. These techniques are of particular focus in this dissertation, as they are targeted
by the evaluation framework FrUITeR (recall Chapter 7). We now introduce the state-of-the-art
test-reuse techniques in detail.
Four recent techniques [49, 48, 91, 113] have targeted UI test reuse in mobile apps. The shared
core concern of these techniques is to correctly map the GUI events from a source app to a target
app. The existing test-reuse techniques can be classified into two main categories, based on how they map GUI events across different apps: AppFlow [91] is an ML-based technique, while CraftDroid [113], GTM [49], and ATM [48] are similarity-based techniques. We have abstracted the two categories and their underlying workflows by studying the similarities and differences across the existing techniques.
ML-based techniques learn a classifier from a training dataset of different apps' GUI events based on certain features, such as text, element sizes, and image recognition results of graphical icons. The classifier is used to recognize app-specific GUI events and map them to canonical GUI events used in a test library, so that app-specific tests can be generated by reusing the tests defined in the test library.
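The minimal sketch below illustrates this general idea: a text classifier trained on widget descriptions from several apps labels an unseen widget with a canonical event. The features, labels, and model choice are illustrative assumptions and do not reproduce AppFlow's actual feature set or classifier.

    # Minimal sketch of the ML-based mapping idea: train a text classifier that
    # labels app-specific GUI widgets with canonical events (e.g., "sign_in"),
    # so library tests written against canonical events can be reused.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder training data: widget text/ids from several apps -> canonical event.
    widgets = ["Sign in button login", "Log in submit", "Search bar query",
               "search icon magnifier", "Add to cart", "add item basket"]
    labels = ["sign_in", "sign_in", "search", "search", "add_to_cart", "add_to_cart"]

    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    classifier.fit(widgets, labels)

    # Map an unseen target app's widget to a canonical event.
    print(classifier.predict(["Login with email"]))  # expected to map to 'sign_in'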
Similarity-based techniques define their own algorithms to compute a similarity score be-
tween pairs of GUI events in a source app and a target app based on the information extracted
from the two apps, such as text, element attributes, and Android Activity/Fragment names. The
similarity score is used to determine whether there is a match between each GUI event in the
source app and the one in the target app based on a customizable similarity threshold. Another
important component in similarity-based techniques is the exploring strategy, which determines
the order of computing the similarity score between the GUI events in the source and target apps.
The target app's events that are explored earlier usually have a higher chance of being mapped.
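The following sketch illustrates the general shape of such a mapping: a similarity score averaged over a few textual GUI attributes, combined with a customizable threshold. The scoring function, attributes, and threshold value are illustrative assumptions rather than the algorithm of any specific technique.

    # Minimal sketch of the similarity-based mapping idea: score each candidate
    # event in the target app against a source event using textual attributes,
    # and accept the best match only above a threshold (values are assumptions).
    from difflib import SequenceMatcher

    def similarity(source_event: dict, target_event: dict) -> float:
        """Average textual similarity over a few GUI attributes."""
        attrs = ("text", "resource_id", "activity")
        scores = [SequenceMatcher(None, source_event.get(a, ""),
                                  target_event.get(a, "")).ratio()
                  for a in attrs]
        return sum(scores) / len(scores)

    def map_event(source_event: dict, target_events: list[dict],
                  threshold: float = 0.5):
        """Return the best-matching target event, or None if below the threshold."""
        best = max(target_events, key=lambda t: similarity(source_event, t))
        return best if similarity(source_event, best) >= threshold else None

    source = {"text": "Sign in", "resource_id": "btn_login", "activity": "LoginActivity"}
    targets = [
        {"text": "Log in", "resource_id": "login_button", "activity": "SignInActivity"},
        {"text": "Search", "resource_id": "search_bar", "activity": "MainActivity"},
    ]
    print(map_event(source, targets))  # expected to match the "Log in" target event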
Chapter 9
Conclusion
This dissertation has extensively investigated prefetching and caching techniques in the largely
under-explored mobile-app domain, through four major components.
(1) To lay the foundation for the dissertation, Chapter 3 presented a literature review that aimed to form a deep understanding of the nature of existing prefetching and caching techniques, and the extent to which they can be applied to the mobile-app domain. This process has uncovered two principal categories of prefetching and caching techniques that are suitable for mobile platforms (content-based and history-based), which directly guided the rest of the dissertation.
(2) To motivate and guide future prefetching and caching techniques in the mobile-app domain,
we conducted an empirical study [194] to explore whether there is room for applying prefetching
and caching techniques in mobile apps, and to understand the characteristics of HTTP requests
and responses, such as prefetchability and cacheability. Chapter 4 detailed the empirical study and
its results, which have provided empirical evidence on the significant opportunities for applying
prefetching and caching techniques. Furthermore, the study pointed to several promising direc-
tions for devising prefetching and caching techniques, directly motivating the third component in
this dissertation as discussed next.
(3) To demonstrate the effectiveness of prefetching and caching techniques in the mobile-app domain, Chapter 5 has presented PALOMA [192], the first content-based prefetching technique
in mobile apps using program analysis, forming the foundation for content-based approaches in
this new research area. PALOMA has been shown to incur significant runtime savings (several
hundred milliseconds per prefetchable HTTP request) with negligible overhead, both when ap-
plied on a reusable microbenchmark we have developed and on real applications. PALOMA's
microbenchmark further forms a fair playground for standardized evaluation and comparison of
future eorts in this area. Several of PALOMA's current facets make it well suited for future
work, and have already attracted follow-up research efforts [61, 122].
Furthermore, Chapter 6 has presented the first study to investigate the effectiveness of history-based prefetching techniques using small amounts of user data on mobile platforms [195]. We first developed HiPHarness, a tailorable framework that enabled us to automatically assess over 7 million models built from real mobile-users' data. The results provided empirical evidence on the feasibility of small prediction models, opening up a new avenue for improving mobile-app performance while meeting today's stringent privacy requirements. Chapter 6 further demonstrated that existing algorithms from the browser domain can produce reasonably accurate and efficient models on mobile platforms, and provided several insights on how to improve them. For example, we developed several strategies for reducing the training-data size while maintaining, or even increasing, a model's accuracy. Finally, HiPHarness's reusability and customization provided a flexible foundation for subsequent studies to further explore various aspects of history-based prefetching.
(4) To reduce the manual effort required in evaluating prefetching and caching techniques, Chapter 7 has presented FrUITeR [191], a customizable evaluation framework for automatically assessing UI test-reuse techniques side-by-side. With FrUITeR, proper realistic test cases that are required for evaluating prefetching and caching techniques can be automatically selected without real users' engagement. This largely reduces the manual effort required in the process of evaluating and comparing different prefetching and caching techniques. To facilitate future work in this area, FrUITeR has been publicly released [11], including the test cases used by FrUITeR and their corresponding artifacts, which can directly benefit the evaluation of future prefetching and caching techniques, as well as other techniques that require real users' engagement in the mobile-
app domain.
To summarize, this dissertation has extensively explored the fundamental research area of
prefetching and caching in the context of the overlooked mobile-app domain. We first investigated
this area through a literature review (Chapter 3) and an empirical study (Chapter 4) to form a
deep understanding of this domain. We further developed novel solutions that cover both of the two principal categories in the prefetching and caching literature: the content-based approach (Chapter 5) and the history-based approach (Chapter 6). The approaches have been shown effective
in reducing user-perceived latency with high accuracy and negligible overhead, directly motivating
future work in this emerging area. Finally, we have devised a tailorable framework to facilitate
the evaluation and comparison of different prefetching and caching techniques in the mobile-app
domain (Chapter 7).
9.1 Broader Impact and Future Directions
Despite the main focus on the mobile-app domain, this dissertation has made broad contributions
spanning several research disciplines in computer science, including prefetching and caching, soft-
ware testing, and open science. This section discusses the broader impact and future directions
that stemmed from this dissertation.
9.1.1 Prefetching and Caching
The direct impact of this dissertation is the foundation it establishes to explore different aspects of prefetching and caching strategies in order to reduce user-perceived latency. This, in turn, provides an effective way to largely improve the end-user experience of mobile applications across a wide range of scenarios, which is especially important in cases where latency is a critical concern, such as in health care systems. Furthermore, this dissertation has demonstrated
the possibility of several novel research directions and has provided further insights to guide
subsequent studies in those areas. We highlight three such directions below.
Content-based Prefetching and Caching. Our content-based approach PALOMA has
shown the large potential of prefetching techniques using program analysis, which has achieved
significant latency reduction and negligible overhead. However, PALOMA inherits limitations from the complexity of program analysis techniques: it can miss certain prefetchable requests due to the unsound nature of the analysis, and may fail to analyze certain apps in practice (mainly due to limitations of the state-of-the-art program analysis tool [26], which is occasionally unable to process an Android app for reasons we were unable to determine, an issue also noted by others previously; furthermore, certain apps are written in programming languages and development frameworks that current program analysis techniques are unable to process). These identified shortcomings point to the need to improve the soundness and practicality of program analysis techniques, such as the string analysis and callback analysis employed by PALOMA. Another future direction is to improve prefetching accuracy and reduce the associated "waste" by incorporating certain dynamic information, such as user behavior patterns and runtime QoS conditions. This direction has already attracted follow-up research efforts [61, 122].
History-based Prefetching and Caching. Our history-based approach HiPHarness has
provided evidence on the feasibility of history-based prefetching using "small" prediction models,
which directly challenges previous conclusions and, in turn, re-opens this promising area. This is especially impactful under today's stringent privacy regulations, since it has demonstrated the possibilities of practical prefetching solutions that can improve mobile-app performance without requiring large amounts of user data, as prior work did. Furthermore, the results of our study have highlighted that existing prediction algorithms are promising starting points for developing prefetching solutions for mobile platforms, and that data-pruning strategies can be leveraged to both improve the prediction accuracy and reduce the training-data size. The customizability of the HiPHarness framework provides a foundation for automatically exploring these future directions in a flexible way, including identifying suitable training-data sizes, trading off different prediction metrics, exploring proper data-pruning strategies, assessing different prediction algorithms, fine-tuning various thresholds, and so on. It is worth noting that HiPHarness's applicability is not limited to exploring history-based prefetching on mobile platforms: it can be applied in other settings by
simply replacing its input (i.e., Historical Requests component) with any historical data of interest,
directly facilitating subsequent studies on history-based prefetching in other domains as well.
Hybrid Prefetching and Caching. Throughout the in-depth investigation of both content-
based and history-based approaches, we have learned their respective strengths and weaknesses,
as well as how they can be leveraged to complement each other. This points to a promising
direction of hybrid approaches that combine both content-based and history-based techniques,
but several considerations should be made in order to develop practical solutions. For instance,
content-based approaches have the advantage of only requiring the app's code, without any other external information. In practice, this is especially suitable for situations where no external information is available, such as when access to user data is prohibited. However, at the same time, the effectiveness of a content-based approach can be hindered by the limitations of the complex program analyses it employs, as discussed above. Furthermore, without incorporating a user's historical data, content-based approaches cannot be personalized based on a specific user's
individual behavior patterns. History-based approaches, on the other hand, are easier to adopt
by developers compared to content-based approaches since they do not require complex analyses,
but only rely on users' historical data. As mentioned above, such historical data can be leveraged
to improve the prefetching solution by learning from past user behaviors in order to achieve
personalization. However, the fundamental limitation of history-based techniques is that they
can only predict user behaviors that have already appeared in the past, while content-based
approaches are able to identify all the possible behaviors that a user may exhibit in the future by
analyzing the app's code.
Thus, based on the situations in practice (e.g., how much historical data is available, whether
the app's code is analyzable), content-based and history-based approaches can be combined at
the proper granularity to address their respective limitations. For instance, a content-based
technique can analyze the program structure to determine possible subsequent user behaviors
when the historical data is limited; on the other hand, a history-based technique can personalize
the content-based approach to only prefetch the most likely user requests based on an individual
user's past behavior patterns.
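The sketch below illustrates one possible shape of such a hybrid, in which a statically derived candidate set (the content-based component) is ranked by an individual user's request history (the history-based component); the candidate set, history, and prefetch budget are placeholders rather than a concrete design.

    # Hypothetical hybrid prefetcher: a content-based component supplies the set
    # of requests the app could issue next (e.g., derived via program analysis),
    # and a history-based component ranks them by how often this user actually
    # issued them before. Both inputs here are placeholders.
    from collections import Counter

    def hybrid_prefetch(static_candidates: set[str],
                        user_history: list[str],
                        budget: int = 2) -> list[str]:
        """Prefetch up to `budget` candidates, most-frequent-in-history first;
        unseen candidates remain as fallbacks when history is sparse."""
        freq = Counter(r for r in user_history if r in static_candidates)
        ranked = sorted(static_candidates, key=lambda r: freq[r], reverse=True)
        return ranked[:budget]

    candidates = {"/feed", "/profile", "/settings"}     # from static analysis (assumed)
    history = ["/feed", "/feed", "/profile", "/feed"]   # this user's past requests
    print(hybrid_prefetch(candidates, history))         # ['/feed', '/profile']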
9.1.2 Software Testing
Besides improving mobile-app performance, the last component of this dissertation presented
in Chapter 7 has also impacted the area of software testing. First, FrUITeR defines the first usage-based testing metrics to measure the practical usefulness of a test case that tests an app's functionality. This novel aspect complements traditional testing techniques that mainly focus on the comprehensiveness of test cases (e.g., aiming to maximize various code coverage metrics) without considering whether the test cases are indeed "realistic" in practice. In other words, it is possible that the bugs identified by traditional testing techniques will never be encountered by the end-users; such bugs are not as critical and may not need to be fixed. FrUITeR thus makes an initial effort to advance the understanding of "useful" test cases in practice, and paves the way for the emerging area of usage-based testing. Second, FrUITeR demonstrates a way to enable the standardized evaluation of different testing techniques on a fair playground, side-by-side, and we have successfully migrated the state-of-the-art usage-based testing techniques to show its usefulness in practice. This establishes the foundation for evaluating and comparing different testing techniques with a standard protocol, directly improving the comparability, reusability, and reproducibility of such techniques in this area. Furthermore, the results of different testing techniques produced
by FrUITeR have uncovered several promising directions to improve the emerging domain of
usage-based testing. We highlight three such directions next.
Source App Selection. Currently, the state-of-the-art usage-based testing techniques gen-
erate test cases by reusing existing test cases from "similar" apps in the same app category. However, such "similarity" criteria based on app categories are very coarse-grained and cannot pinpoint suitable source apps to reuse test cases from, posing extra effort for developers to choose
their desired test cases among the tests generated based on all the apps in the same category.
FrUITeR's results clearly showed that certain source-target app pairs can always achieve better
results compared to other app pairs across all the techniques and all the evaluation metrics. This
reinforces the importance of selecting proper source apps since it has the potential to largely
improve the quality of the generated tests across the board. Thus, given a specic target app,
future work should consider finer-grained criteria to select "similar" apps to reuse test cases from, such as leveraging code-clone techniques and computer vision to identify the most "similar" apps
as source apps.
Testing Technique Selection. As discussed in Section 7.6, there is no clear "winning" technique across all situations, and certain trade-offs should be made between similarity-based and ML-based techniques. Furthermore, automated test generation is not always desirable, since sometimes it can generate tests whose "fix" takes more manual effort than writing such a test manually from scratch. Thus, to develop practical solutions for developers, future research should consider the trade-offs across different testing techniques and provide guidance on selecting proper
techniques for a given scenario, as well as the criteria to determine whether certain tests should
be generated automatically or written manually.
Evaluation Metrics. As discussed above, current evaluation metrics in the software testing
domain center around code coverage, and do not necessarily indicate the practical usefulness of
a given test case. Recent research on UI test reuse has introduced several accuracy metrics (recall Chapter 7; these existing metrics have been unified into FrUITeR's seven fidelity metrics) to measure how well the technique maps the GUI events from the source app to the target app. However, as discussed in Chapter 7, a highly-accurate technique does not guarantee a highly-useful transferred test, or vice versa. This points to the need to design novel metrics to cover the "usefulness" aspect that has been missed by prior work, such as the two utility metrics introduced in FrUITeR. However, we note that the utility aspect (i.e., how useful a test is) can be subjective depending on one's goal, and our utility metrics are only starting points toward understanding the "usefulness" of a test. Other aspects of "usefulness" may concern code executability, how "realistic" the test is, and the "popularity" of the test being triggered by end-users. This
will require further study in this emerging domain to formalize such aspects and design novel
evaluation metrics to advance this area.
9.1.3 Open Science
Throughout this dissertation, we have noticed several repetitive challenges when there is a need
to reuse, evaluate, or compare our research artifacts with others. Namely, the underlying com-
ponents of research tools are usually difficult to extract and reuse outside their entirety, their
evaluation results are hard to reproduce, the tools are hard to compare under the same baseline,
and sometimes the instructions of the tools or even the tools themselves are not publicly available.
This introduces significant extra effort to communicate with the tools' original authors, and it may not always lead to fruitful results (e.g., no responses from the authors), which creates considerable barriers to advancing science on top of prior work and, in turn, consistently causes duplicated research efforts.
To ease these challenges and facilitate follow-up work on the dissertation research, we have
employed a modular workflow in the research process and developed corresponding frameworks that are customizable, reusable, and reproducible, in order to enable the exploration of various aspects of a specific problem domain in a flexible manner. These frameworks are detailed earlier in this dissertation: the HiPHarness framework (Chapter 6) focuses on the domain of history-based prefetching, while the FrUITeR framework tackles the emerging area of UI test reuse.
This dissertation has had the following impact on the community from the perspective of open science. First, it establishes an initial set of requirements that should be met when designing a framework to incorporate open science aspects during the evaluation process of different techniques (recall Section 7.3), and has demonstrated the usefulness of such frameworks with concrete examples in two different problem domains (i.e., HiPHarness [195], FrUITeR [191]). This serves as a promising starting point to engage conversations on open science, and to encourage future work in this important area. Furthermore, the dissertation work has inspired us to design an open science infrastructure that goes beyond evaluation frameworks, laying out a roadmap to incorporate open science aspects throughout the entire lifecycle of research techniques. The open science infrastructure is discussed in our subsequent work outside the scope of this dissertation [189, 193]. This has the potential to yield reusable, reproducible, and practical techniques that can benefit both future research and their adoption, which, in turn, bridges the gap between different researchers
and practitioners. We now discuss three future directions in this important area.
Infrastructure Design. An open science infrastructure requires significant effort that involves different stakeholders, such as academic researchers, policy makers, and industrial practitioners. Our work has only taken an initial step to discuss the potential high-level design of an open science infrastructure in the mobile-app domain [189, 193]. To implement such an infrastructure in practice, the perspectives of relevant stakeholders should be considered when making the detailed design decisions. For instance, the reference architecture of techniques in a specific domain (e.g., MAOMAO [193] in the mobile-app domain) should be designed at a proper granularity, by engaging domain experts, to enable reusability without posing too much burden on technique developers. Another example is the design of test results when evaluating different research techniques, since the results should be useful both for researchers to compare different techniques and identify their respective limitations, and for developers to understand which techniques are most suitable to adopt in practice. Thus, different views of the infrastructure should be considered at the design phase, in order to benefit various types of "end-users" who have different interests and objectives.
Scalability, Privacy, and Usability. The natural scope of an open science infrastructure
involves multiple stakeholders with different interests and roles, and thus faces the challenges of scalability, privacy, and usability by design. This provides research opportunities for future work in these areas, such as assuring the privacy of the data uploaded to the infrastructure, scaling to large numbers of researchers and developers, providing organizational support to create different communities, and standardizing API documentation of different research techniques. As early versions of the infrastructure are deployed and adopted, this scope will grow to include capabilities such as recommenders of related techniques based on certain requests from developers, and access-control models to enable fine-grained data sharing.
Incentives, Guidelines, and Practical Support. Open science is an emerging and multi-
disciplinary research area, which requires significant community effort from various areas in order to be implemented in practice. For example, besides the technical aspects concerning computer science as discussed above, another important aspect is to create sufficient incentives for people to contribute in the long term. One possible future direction is to involve experts outside of computer science as well, such as psychologists and sociologists, to establish effective reward mechanisms and policies in order to build a sustainable open science culture. Furthermore, there is a lack of standard and detailed guidelines on how open science should be reinforced. The current model is decentralized and mainly relies on volunteers, which can lead to duplicated effort and different quality criteria. This direction has been discussed more extensively in the context of Artifact Evaluation recently [88]. Finally, the lack of guidelines can pose a significant workload on the participants, discouraging them from contributing to open science in the future [88, 50]. One potential solution is to provide practical support in order to ease the process, such as tool support and assistance. For instance, creating new interdisciplinary roles dedicated to open science might be a promising direction. One possible such role on the academic side would require a deep understanding of open science best practices, reasonable knowledge of the scientific work, and engineering skills to assist in the preparation of research artifacts. On the industry side, such a role would require a deep understanding of industrial needs, academic skills to understand research publications, and engineering skills to execute and evaluate research artifacts. This points to the need for interdisciplinary experts and organizational support, in order to decentralize the responsibility and workload involved in making open science contributions, and to encourage participation in a
sustainable way.
References
[1] Android Debug Bridge. URL: https://developer.android.com/studio/command-line/
adb.
[2] AppFlow's Source Code and Artifacts. URL: https://github.com/columbia/appflow.
[3] Appium: Mobile App Automation Made Awesome. URL: http://appium.io.
[4] ATM's Source Code and Artifacts. URL: https://sites.google.com/view/
apptestmigrator/.
[5] Average mobile app session length as of 4th quarter 2015. URL: https://www.statista.
com/statistics/202485/average-ipad-app-session-length-by-app-categories/.
[6] Click - Appium. URL: http://appium.io/docs/en/commands/element/actions/click.
[7] CraftDroid's Source Code and Artifacts. URL: https://sites.google.com/view/
craftdroid/.
[8] Distribution of Free and Paid Android Apps in the Google Play Store. URL: https://www.
statista.com/statistics/266211/distribution-of-free-and-paid-android-apps/.
[9] Espresso. URL: https://developer.android.com/training/testing/espresso.
[10] Find Elements - Appium. URL: http://appium.io/docs/en/commands/element/
find-elements.
[11] FrUITeR's Website. URL: https://felicitia.github.io/FrUITeR/.
[12] Gherkin Syntax - Cucumber Documentation. URL: https://cucumber.io/docs/gherkin.
[13] Google Play App Store. Google. URL: http://play.google.com/store/apps.
[14] GTM's Source Code and Artifacts. URL: https://sites.google.com/view/
testmigration/.
[15] Heroku. URL: https://www.heroku.com/.
[16] How to Locate an Element on the Page - Web Performance. URL: https://www.
webperformance.com/load-testing-tools/blog/articles/real-browser-manual/
building-a-testcase/how-locate-element-the-page.
[17] HTTP/1.1: Header Field Definitions. URL: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html.
[18] Introduction to Activities | Android Developers. URL: https://developer.android.com/guide/components/activities/intro-activities.
[19] JSON - Wikipedia. URL: https://en.wikipedia.org/wiki/JSON.
[20] MongoDB. URL: https://docs.mongodb.com/getting-started/shell/import-data/.
[21] NoxPlayer. URL: https://www.bignox.com/.
[22] OkHttp Documentation. URL: http://square.github.io/okhttp/.
[23] Retrofit Documentation. URL: http://square.github.io/retrofit/.
[24] SeleniumHQ Browser Automation. URL: https://www.selenium.dev.
[25] Send Keys - Appium. URL: http://appium.io/docs/en/commands/element/actions/
send-keys.
[26] Soot - A Java Optimization Framework. URL: https://github.com/Sable/soot.
[27] UI/Application Exerciser Monkey. URL: https://developer.android.com/studio/
test/monkey.html.
[28] URLConnection Class Documentation. URL:https://docs.oracle.com/javase/7/docs/
api/java/net/URLConnection.
[29] Volley Overview. URL: https://developer.android.com/training/volley/.
[30] Web Element :: Documentation for Selenium. URL: https://selenium.dev/
documentation/en/webdriver/web_element.
[31] What's Data Privacy Law In Your Country? - Privacy Policies. URL: https://www.
privacypolicies.com/blog/privacy-law-by-country.
[32] Xposed Framework. URL: http://repo.xposed.info/.
[33] Christoffer Quist Adamsen, Gianluca Mezzetti, and Anders Møller. Systematic execution of Android test suites in adverse conditions. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, pages 83–93. ACM.
[34] admin. Chapter-4: Appium Locator Finding Strategies - Kobiton. URL:https://kobiton.
com/book/chapter-4-appium-locator-finding-strategies.
[35] Victor Agababov, Michael Buettner, Victor Chudnovsky, Mark Cogan, Ben Greenstein, Shane McDaniel, Michael Piatek, Colin Scott, Matt Welsh, and Bolian Yin. Flywheel: Google's data compression proxy for the mobile web. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 367–380.
[36] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487–499. Morgan Kaufmann Publishers Inc. URL: http://dl.acm.org/citation.cfm?id=645920.672836.
[37] Waleed Ali, Siti Mariyam Shamsuddin, Abdul Samad Ismail, et al. A survey of web caching and prefetching. 3(1):18–44.
[38] Frances E. Allen and John Cocke. A program data flow analysis procedure. 19(3):137.
[39] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Salvatore De Carmine, and Atif M Memon. Using GUI ripping for automated testing of Android applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pages 258–261. ACM.
[40] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Bryan Dzung Ta, and Atif M Memon. MobiGUITAR: Automated model-based testing of mobile apps. 32(5):53–59.
[41] Saswat Anand, Mayur Naik, Mary Jean Harrold, and Hongseok Yang. Automated con-
colic testing of smartphone apps. In Proceedings of the ACM SIGSOFT 20th International
Symposium on the Foundations of Software Engineering, page 59. ACM.
[42] AppDynamics. The app attention span.
[43] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. 49(6):259–269.
[44] Tanzirul Azim and Iulian Neamtiu. Targeted and depth-first exploration for systematic testing of Android apps. In ACM SIGPLAN Notices, volume 48, pages 641–660. ACM.
[45] Ricardo Baeza-Yates, Di Jiang, Fabrizio Silvestri, and Beverly Harrison. Predicting The Next App That You Are Going To Use. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pages 285–294. ACM. URL: http://doi.acm.org/10.1145/2684822.2685302, doi:10.1145/2684822.2685302.
[46] Zhijie Ban, Zhimin Gu, and Yu Jin. An online PPM prediction model for web prefetching. In Proceedings of the 9th Annual ACM International Workshop on Web Information and Data Management, pages 89–96. ACM.
[47] Paul Baumann and Silvia Santini. Every Byte Counts: Selective Prefetching for Mobile
Applications. 1(2):6.
[48] Farnaz Behrang and Alessandro Orso. Test Migration Between Mobile Apps with Similar
Functionality. In 34th International Conference on Automated Software Engineering (ASE
2019).
[49] Farnaz Behrang and Alessandro Orso. Test migration for efficient large-scale assessment of
mobile app coding assignments. In Proceedings of the 27th ACM SIGSOFT International
Symposium on Software Testing and Analysis.
[50] Moritz Beller. Why I will never join an Artifacts Evaluation Committee Again. URL:
https://inventitech.com/blog/why-i-will-never-review-artifacts-again/.
[51] Dieter Bohn. ANDROID AT 10: THE WORLD'S MOST DOMINANT
TECHNOLOGY. URL: https://www.theverge.com/2018/9/26/17903788/
google-android-history-dominance-marketshare-apple.
[52] Christos Bouras, Agisilaos Konidaris, and Dionysios Kostoulas. Predictive prefetching on
the web and its potential impact in the wide area. 7(2):143–179.
[53] Joel W. Branch, Han Chen, Hui Lei, and International Business Machines Corp. Data
Pre-Fetching Based on User Demographics. URL: https://patents.google.com/patent/
US8509816B2/en.
[54] Andrew James Guy Brown. Pre-Caching Web Content for a Mobile Device. Google Patents.
[55] Eyuphan Bulut and Boleslaw K. Szymanski. Understanding user behavior via mobile data
analysis. pages 1563–1568. doi:10.1109/ICCW.2015.7247402.
[56] Michael Butkiewicz, Daimeng Wang, Zhe Wu, Harsha V Madhyastha, and Vyas Sekar.
Klotski: Reprioritizing Web Content to Improve User Experience on Mobile Devices. In
NSDI, volume 1, pages 2–3.
[57] Yinzhi Cao, Yanick Fratantonio, Antonio Bianchi, Manuel Egele, Christopher Kruegel, Gio-
vanni Vigna, and Yan Chen. EdgeMiner: Automatically Detecting Implicit Control Flow
Transitions through the Android Framework. In NDSS.
[58] Dave Chaffey. Mobile Marketing Statistics Compilation. URL: https:
//www.smartinsights.com/mobile-marketing/mobile-marketing-analytics/
mobile-marketing-statistics/.
[59] Jim Chappelow. Pareto Principle. URL: https://www.investopedia.com/terms/p/
paretoprinciple.asp.
[60] Xin Chen and Xiaodong Zhang. Popularity-based PPM: An effective web prefetching technique for high accuracy and low storage. In Proceedings International Conference on Parallel Processing, pages 296–304. IEEE.
[61] Byungkwon Choi, Jeongmin Kim, Daeyang Cho, Seongmin Kim, and Dongsu Han. APPx:
An automated app acceleration framework for low latency mobile app. In Proceedings of
the 14th International Conference on Emerging Networking EXperiments and Technologies,
pages 27–40. ACM.
[62] Wontae Choi, George Necula, and Koushik Sen. Guided GUI testing of Android apps with minimal restart and approximate learning. In ACM SIGPLAN Notices, volume 48, pages 623–640. ACM.
[63] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. Automated Test Input Generation for Android: Are We There Yet? (E). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE '15, pages 429–440. IEEE Computer Society. doi:10.1109/ASE.2015.89.
[64] Contributors to Wikimedia projects. Trie - Wikipedia. URL: https://en.wikipedia.org/
wiki/Trie.
[65] Thomas D. W. Wilcockson, David A. Ellis, and Heather Shaw. Determining Typical Smartphone Usage: What Data Do We Need? doi:10.1089/cyber.2017.0652.
[66] Shuaifu Dai, Alok Tongaonkar, Xiaoyin Wang, Antonio Nucci, and Dawn Song. NetworkProfiler: Towards automatic fingerprinting of Android apps. In INFOCOM, 2013 Proceedings IEEE, pages 809–817. IEEE.
[67] Brian D Davison. Predicting web actions from html content. In Proceedings of the Thir-
teenth ACM Conference on Hypertext and Hypermedia, pages 159–168. ACM.
[68] B. De La Ossa, J. A. Gil, Julio Sahuquillo, and Ana Pont. Improving web prefetching by making predictions at prefetch. In 2007 Next Generation Internet Networks, pages 21–27. IEEE.
[69] Bernardo de la Ossa, José A. Gil, Julio Sahuquillo, and Ana Pont. Web prefetch performance evaluation in a real environment. In Proceedings of the 4th International IFIP/ACM Latin American Conference on Networking, pages 65–73. ACM.
[70] Bernardo de la Ossa, Ana Pont, Julio Sahuquillo, and José A. Gil. Referrer graph: A low-cost web prediction algorithm. In Proceedings of the 2010 ACM Symposium on Applied Computing, pages 831–838. ACM.
[71] J. Domenech, J. A. Gil, J. Sahuquillo, and A. Pont. DDG: An Efficient Prefetching Algorithm for Current Web Generation. In 2006 1st IEEE Workshop on Hot Topics in Web Systems and Technologies, pages 1–12. doi:10.1109/HOTWEB.2006.355260.
[72] Josep Doménech, José A. Gil, Julio Sahuquillo, and Ana Pont. Using current web page structure to improve prefetching performance. 54(9):1404–1417.
[73] Josep Doménech, José A. Gil, Julio Sahuquillo, and Ana Pont. Web prefetching performance metrics: A survey. 63(9-10):988–1004.
[74] Li Fan, Pei Cao, Wei Lin, and Quinn Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. In ACM SIGMETRICS Performance Evaluation Review, volume 27, pages 178–187. ACM.
[75] Roy Fielding. RFC 2616, Part of Hypertext Transfer Protocol – HTTP/1.1. URL: https://www.w3.org/Protocols/rfc2616/rfc2616.html.
[76] Roy Fielding. RFC 2616, Part of Hypertext Transfer Protocol – HTTP/1.1: https://tools.ietf.org/html/rfc2616#section-5.1.1. URL: https://tools.ietf.org/html/rfc2616#section-5.1.1.
[77] Roy Fielding. RFC 2616, Part of Hypertext Transfer Protocol – HTTP/1.1: https://tools.ietf.org/html/rfc2616#section-9. URL: https://tools.ietf.org/html/rfc2616#section-9.
[78] Paul A. Games and John F. Howell. Pairwise Multiple Comparison Procedures with Unequal N's and/or Variances: A Monte Carlo Study. 1(2):113–125. doi:10.3102/
10769986001002113.
[79] Android Developers Guide. Android AsyncTask. URL:https://developer.android.com/
reference/android/os/AsyncTask.html.
[80] Android Developers API Guides. The Activity Lifecycle. URL: https://developer.
android.com/guide/components/activities/activity-lifecycle.html.
[81] Android Developers API Guides. Android Input Events. URL: https://developer.
android.com/guide/topics/ui/ui-events.html.
[82] Android Developers API Guides. Android ListView. URL: https://developer.android.
com/guide/topics/ui/layout/listview.html.
[83] Android Developers API Guides. String Resources. URL: https://developer.android.
com/guide/topics/resources/string-resource.html.
[84] Yao Guo, Mengxin Liu, and Xiangqun Chen. Looxy: Web Access Optimization for Mobile
Applications with a Local Proxy. In 2017 IEEE 85th Vehicular Technology Conference
(VTC Spring), pages 1–5. IEEE.
[85] Şule Gündüz and M. Tamer Özsu. A web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–540. ACM.
[86] Shuai Hao, Bin Liu, Suman Nath, William GJ Halfond, and Ramesh Govindan. PUMA: Programmable UI-automation for large-scale dynamic analysis of mobile apps. In Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, pages 204–217. ACM.
[87] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated Learning for Mobile Keyboard Prediction. URL: https://arxiv.org/abs/1811.03604.
[88] Ben Hermann, Stefan Winter, and Janet Siegmund. Community Expectations for Research
Artifacts and Evaluation Processes. In Proceedings of the 28th ACM Joint European Soft-
ware Engineering Conference and Symposium on the Foundations of Software Engineering,
ESEC/FSE '20. ACM.
[89] Brett D Higgins, Jason Flinn, Thomas J Giuli, Brian Noble, Christopher Peplin, and David
Watson. Informed mobile prefetching. In Proceedings of the 10th International Conference
on Mobile Systems, Applications, and Services, pages 155–168. ACM.
[90] W-J Hsu, Thrasyvoulos Spyropoulos, Konstantinos Psounis, and Ahmed Helmy. Modeling time-variant user mobility in wireless mobile networks. In IEEE INFOCOM 2007-26th IEEE International Conference on Computer Communications, pages 758–766. IEEE.
[91] Gang Hu, Linjie Zhu, and Junfeng Yang. AppFlow: Using machine learning to synthesize robust, reusable UI tests. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 269–282. ACM.
[92] Yin-Fu Huang and Jhao-Min Hsu. Mining web logs to improve hit ratios of prefetching and caching. 21(1):62–69.
[93] Tamer I Ibrahim and Cheng-Zhong Xu. Neural nets based predictive prefetching to tolerate WWW latency. In Proceedings 20th IEEE International Conference on Distributed Computing Systems, pages 636–643. IEEE.
[94] Consumers International. The State of Data Protection Rules around the World. URL:
https://www.consumersinternational.org/media/155133/gdpr-briefing.pdf.
[95] Yingyin Jiang, Min-You Wu, and Wei Shu. Web prefetching: Costs, benets and perfor-
mance. In Proceedings of the 7th International Workshop on Web Content Caching and
Distribution (WCW2002). Boulder, Colorado. Citeseer.
[96] Zhimei Jiang and Leonard Kleinrock. Web prefetching in a mobile environment. 5(5):25–34.
[97] Beihong Jin, Sihua Tian, Chen Lin, Xin Ren, and Yu Huang. An integrated prefetching and caching scheme for mobile web caching system. In Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), volume 2, pages 522–527. IEEE.
[98] Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. Real challenges in mobile app development. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium On, pages 15–24. IEEE.
[99] Simon Kemp. Digital in 2018. URL: https://wearesocial.com/blog/2018/01/global-digital-report-2018.
[100] Faten Khalil, Jiuyong Li, and Hua Wang. An integrated model for next page access prediction. 1(1/2):48–80.
[101] Christian Koch and David Hausheer. Optimizing mobile prefetching by leveraging usage patterns and social information. In 2014 IEEE 22nd International Conference on Network Protocols, pages 293–295. IEEE.
[102] Pavneet Singh Kochhar, Ferdian Thung, Nachiappan Nagappan, Thomas Zimmermann, and David Lo. Understanding the Test Automation Culture of App Developers. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference On, pages 1–10. IEEE.
[103] Vassilis Kostakos, Denzil Ferreira, Jorge Goncalves, and Simo Hosio. Modelling Smartphone
Usage: A Markov State Transition Model. In Proceedings of the 2016 ACM International
Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '16, pages 486–497.
ACM. URL: http://doi.acm.org/10.1145/2971648.2971669, doi:10.1145/2971648.
2971669.
[104] Surya Kumar Kovvali, Charles Boyle, and Movik Networks. Content Pre-Fetching and
CDN Assist Methods in a Wireless Mobile Network. URL: https://patents.google.com/
patent/US8799480B2/en.
[105] Bin Lan, Stephane Bressan, Beng Chin Ooi, and Kian-Lee Tan. Rule-assisted prefetching
in web-server caching. In Proceedings of the Ninth International Conference on Information
and Knowledge Management, pages 504–511. ACM.
[106] Elena Leu. Data Privacy Trends Across The Globe. URL: https://www.clym.io/
articles/data-privacy-trends-across-the-globe.
[107] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.
[108] Ding Li, Yingjun Lyu, Jiaping Gui, and William G. J. Halfond. Automated Energy Optimization of HTTP Requests for Mobile Applications. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 249–260. ACM. URL: http://doi.acm.org/10.1145/2884781.2884867, doi:10.1145/2884781.2884867.
[109] Ding Li, Yingjun Lyu, Mian Wan, and William GJ Halfond. String analysis for Java and Android applications. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 661–672. ACM.
[110] Yang Li. Reflection: Enabling event prediction as an on-device service for mobile interaction. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology, pages 689–698. ACM.
[111] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Can your smartphone infer your mood. In PhoneSense Workshop, pages 1–5.
[112] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang
Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge
networks: A comprehensive survey. arXiv:1909.11875.
[113] Jun-Wei Lin, Reyhaneh Jabbarvand, and Sam Malek. Test Transfer Across Mobile Apps
Through Semantic Mapping. In 34th International Conference on Automated Software En-
gineering (ASE 2019).
[114] Mario Linares-Vásquez, Carlos Bernal-Cárdenas, Kevin Moran, and Denys Poshyvanyk. How do developers test Android applications? In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).
[115] Mario Linares-Vásquez, Martin White, Carlos Bernal-Cárdenas, Kevin Moran, and Denys Poshyvanyk. Mining Android app usages for generating actionable GUI-based execution scenarios. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pages 111–122. IEEE.
[116] Jun Liu, Cheng Fang, and Nirwan Ansari. Request dependency graph: A model for web usage mining in large-scale web of things. 3(4):598–608.
[117] Qinghui Liu. Web latency reduction with prefetching.
[118] Xuanzhe Liu, Yun Ma, Yunxin Liu, Tao Xie, and Gang Huang. Demystifying the imperfect
client-side cache performance of mobile web browsing. 15(9):2206–2220.
[119] Dimitrios Lymberopoulos, Oriana Riva, Karin Strauss, Akshay Mittal, and Alexandros
Ntoulas. PocketWeb: Instant Web Browsing for Mobile Devices. In Proceedings of the
Seventeenth International Conference on Architectural Support for Programming Languages
and Operating Systems, ASPLOS XVII, pages 1–12. ACM. URL: http://doi.acm.org/
10.1145/2150976.2150978, doi:10.1145/2150976.2150978.
[120] Aravind Machiry, Rohan Tahiliani, and Mayur Naik. Dynodroid: An input generation
system for android apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of
Software Engineering, pages 224–234. ACM.
[121] Riyadh Mahmood, Nariman Mirzaei, and Sam Malek. Evodroid: Segmented evolutionary testing of Android apps. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 599–609. ACM.
[122] Ivano Malavolta, Francesco Nocera, Patricia Lago, and Marina Mongiello. Navigation-Aware and Personalized Prefetching of Network Requests in Android Apps. pages 17–20.
doi:10.1109/ICSE-NIER.2019.00013.
[123] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Infor-
mation Retrieval. Cambridge university press.
[124] Ke Mao, Mark Harman, and Yue Jia. Sapienz: Multi-objective automated testing for
Android applications. In Proceedings of the 25th International Symposium on Software
Testing and Analysis, pages 94–105. ACM.
[125] Evangelos P Markatos, Catherine E Chronaki, et al. A top-10 approach to prefetching on the web. In Proceedings of INET, volume 98, pages 276–290.
[126] Philip Michaels. Best phone battery life in 2020: The longest lasting smartphones. URL: https://www.tomsguide.com/us/smartphones-best-battery-life,review-2857.html.
[127] James W Mickens, Jeremy Elson, Jon Howell, and Jay R Lorch. Crom: Faster Web Browsing Using Speculative Execution. In NSDI, volume 10, pages 9–9.
[128] Nariman Mirzaei, Joshua Garcia, Hamid Bagheri, Alireza Sadeghi, and Sam Malek. Reducing combinatorics in GUI testing of Android applications. In Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference On, pages 559–570. IEEE.
[129] Prashanth Mohan, Suman Nath, and Oriana Riva. Prefetching mobile ads: Can advertising systems afford it? In Proceedings of the 8th ACM European Conference on Computer Systems, pages 267–280. ACM.
[130] Alexandros Nanopoulos, Dimitrios Katsaros, and Yannis Manolopoulos. A data mining algorithm for generalized web prefetching. 15(5):1155–1169.
[131] Alexandros Nanopoulos, Dimitris Katsaros, and Yannis Manolopoulos. Effective Prediction of Web-user Accesses: A Data Mining Approach.
[132] Javad Nejati and Aruna Balasubramanian. An in-depth study of mobile browser performance. In Proceedings of the 25th International Conference on World Wide Web, pages 1305–1315. International World Wide Web Conferences Steering Committee.
[133] Ravi Netravali, Ameesh Goyal, James Mickens, and Hari Balakrishnan. Polaris: Faster Page
Loads Using Fine-grained Dependency Tracking. In NSDI, pages 123–136.
[134] Venkata N Padmanabhan and Jeffrey C Mogul. Using predictive prefetching to improve world wide web latency. 26(3):22–36.
[135] George Pallis, Athena Vakali, and Jaroslav Pokorny. A clustering-based prefetching scheme on a Web cache environment. 34(4):309–323.
[136] Themistoklis Palpanas and Alberto Mendelzon. Web Prefetching Using Partial Match Prediction. Citeseer.
[137] Abhinav Parate, Matthias Böhmer, David Chu, Deepak Ganesan, and Benjamin M Marlin. Practical prediction and prefetch for faster access to applications on mobile phones. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 275–284. ACM.
[138] Danilo Dominguez Perez and Wei Le. Generating Predicate Callback Summaries for the Android Framework. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, MOBILESoft '17, pages 68–78. IEEE Press. doi:10.1109/MOBILESoft.2017.28.
[139] James Pitkow and Peter Pirolli. Mining longest repeating subsequences to predict World Wide Web surfing. In Proc. USENIX Symp. on Internet Technologies and Systems, page 1.
[140] PricewaterhouseCoopers. Top Policy Trends 2020: Data Privacy. URL:
https://www.pwc.com/us/en/library/risk-regulatory/strategic-policy/
top-policy-trends/data-privacy.html.
[141] Feng Qian, Kee Shen Quah, Junxian Huang, Jerey Erman, Alexandre Gerber, Zhuoqing
Mao, Subhabrata Sen, and Oliver Spatscheck. Web caching on smartphones: Ideal vs. re-
ality. In Proceedings of the 10th International Conference on Mobile Systems, Applications,
and Services, pages 127{140. ACM.
[142] QUARTZ. Android just hit a record 88% market share of all smartphones.
[143] Ahmad Rahmati, Chad Tossell, Clayton Shepard, Philip Kortum, and Lin Zhong. Exploring
iPhone usage: The in
uence of socioeconomic dierences on smartphone adoption, usage
and usability. In Proceedings of the 14th International Conference on Human-Computer
Interaction with Mobile Devices and Services, pages 11{20.
[144] Swaroop Ramaswamy, Rajiv Mathews, Kanishka Rao, and Fran coise Beaufays. Federated
learning for emoji prediction in a mobile keyboard. arXiv:1906.04329.
[145] Santosh K Rangarajan, VirV Phoha, Kiran Balagani, RR Selmic, and SS Iyengar. Web user
clustering and its application to prefetching using ART neural networks. pages 45{62.
[146] Andreas Rau, Jenny Hotzkow, and Andreas Zeller. Transferring Tests Across Web Applica-
tions. In Tommi Mikkonen, Ralf Klamma, and Juan Hern andez, editors, Web Engineering,
pages 50{64. Springer International Publishing.
[147] Lenin Ravindranath, Jitendra Padhye, Sharad Agarwal, Ratul Mahajan, Ian Obermiller, and Shahin Shayandeh. AppInsight: Mobile App Performance Monitoring in the Wild. In OSDI, volume 12, pages 107–120.
[148] PRESTO research group. GATOR: Program Analysis Toolkit For Android. URL: http://web.cse.ohio-state.edu/presto/software/gator/.
[149] Alejandro Rioja. 17 Best Smartphones with Largest Battery Capacity [2019 List]. URL: https://www.fluxchargers.com/blogs/flux-blog/best-smartphones-largest-battery-capacity-life.
[150] Sanae Rosen, Bo Han, Shuai Hao, Z. Morley Mao, and Feng Qian. Push or Request: An Investigation of HTTP/2 Server Push for Improving Mobile Performance. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 459–468. International World Wide Web Conferences Steering Committee. doi:10.1145/3038912.3052574.
[151] Vaspol Ruamviboonsuk, Ravi Netravali, Muhammed Uluyol, and Harsha V Madhyastha. VROOM: Accelerating the Mobile Web with Server-Aided Dependency Resolution. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 390–403. ACM.
[152] Iqbal H. Sarker, Alan Colman, Muhammad Ashad Kabir, and Jun Han. Individualized Time-Series Segmentation for Mining Mobile Phone User Behavior. 61(3):349–368. doi:10.1093/comjnl/bxx082.
[153] Raimondas Sasnauskas and John Regehr. Intent fuzzer: Crafting intents of death. In Proceedings of the 2014 Joint International Workshop on Dynamic Analysis (WODA) and Software and System Performance Testing, Debugging, and Analytics (PERTEA), pages 1–5. ACM.
[154] Congressional Research Service. Data Protection Law: An Overview. URL: https://fas.org/sgp/crs/misc/R45631.pdf.
[155] Clayton Shepard, Ahmad Rahmati, Chad Tossell, Lin Zhong, and Phillip Kortum. LiveLab: Measuring wireless networks and smartphone users in the field. 38(3):15–20.
[156] Omar K Shoukry and Magda B Fayek. Evolutionary content pre-fetching in mobile networks. In 2013 12th International Conference on Machine Learning and Applications, volume 1, pages 386–391. IEEE.
[157] Vasilios A Siris, Maria Anagnostopoulou, and Dimitris Dimopoulos. Improving mobile video streaming with mobility prediction and prefetching in integrated cellular-WiFi networks. In International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services, pages 699–704. Springer.
[158] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. Guided, stochastic model-based gui testing of android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 245–256. ACM.
[159] N. Swaminathan and S. V. Raghavan. Intelligent prefetch in WWW using client behavior characterization. In Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728), pages 13–19. doi:10.1109/MASCOT.2000.876424.
[160] Vincent S Tseng and Kawuu W Lin. Efficient mining and prediction of user behavior patterns in mobile web systems. 48(6):357–369.
[161] Nor Jaidi Tuah, MJ Kumar, and Svetha Venkatesh. Performance modelling of speculative prefetching for compound requests in low bandwidth networks. In Proceedings of the 3rd ACM International Workshop on Wireless Mobile Multimedia, pages 83–92. ACM.
[162] John W. Tukey. Exploratory Data Analysis. Addison-Wesley.
[163] Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay Sundaresan. Soot - a Java Bytecode Optimization Framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON '99, pages 13–. IBM Press. URL: http://dl.acm.org/citation.cfm?id=781995.782008.
[164] Jamshed Vesuna, Colin Scott, Michael Buettner, Michael Piatek, Arvind Krishnamurthy, and Scott Shenker. Caching doesn't improve mobile web performance (much). In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 159–165.
[165] Greeshma G Vijayan et al. A survey on web pre-fetching and web caching techniques in a mobile environment.
[166] Haoyu Wang, Junjun Kong, Yao Guo, and Xiangqun Chen. Mobile web browser optimizations in the cloud era: A survey. In 2013 IEEE 7th International Symposium on Service Oriented System Engineering (SOSE), pages 527–536. IEEE.
[167] Jia Wang. A survey of web caching schemes for the internet. 29(5):36–46.
[168] Xiao Sophia Wang, Aruna Balasubramanian, Arvind Krishnamurthy, and David Wetherall. Demystifying page load performance with WProf. In Presented as Part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 473–485.
[169] Xiao Sophia Wang, Arvind Krishnamurthy, and David Wetherall. Speeding up web page loads with Shandian. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 109–122.
[170] Yan Wang, Hailong Zhang, and Atanas Rountev. On the unsoundness of static analysis for Android GUIs. In Proceedings of the 5th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis, pages 18–23. ACM.
[171] Yichuan Wang, Xin Liu, David Chu, and Yunxin Liu. Earlybird: Mobile prefetching of social network feeds via content preference mining and usage pattern analysis. In Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 67–76. ACM.
[172] Zhen Wang, Felix Xiaozhu Lin, Lin Zhong, and Mansoor Chishtie. How far can client-only solutions go for mobile browser speed? In Proceedings of the 21st International Conference on World Wide Web, pages 31–40. ACM.
[173] Zhen Wang, Felix Xiaozhu Lin, Lin Zhong, and Mansoor Chishtie. Why are web browsers slow on smartphones? In Proceedings of the 12th Workshop on Mobile Computing Systems and Applications, pages 91–96. ACM.
[174] Benjamin T Weber. Mobile map browsers: Anticipated user interaction for data pre-fetching.
[175] Ryen W. White, Fernando Diaz, and Qi Guo. Search Result Prefetching on Desktop and Mobile. 35(3):23:1–23:34. URL: http://doi.acm.org/10.1145/3015466, doi:10.1145/3015466.
[176] Bin Wu and Ajay D Kshemkalyani. Objective-greedy algorithms for long-term Web prefetching. In Third IEEE International Symposium on Network Computing and Applications, 2004. (NCA 2004). Proceedings., pages 61–68. IEEE.
[177] C. Wu, X. Chen, Y. Zhou, N. Li, X. Fu, and Y. Zhang. Spice: Socially-driven learning-based mobile media prefetching. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. doi:10.1109/INFOCOM.2016.7524568.
[178] Cheng-Zhong Xu and Tamer I Ibrahim. A keyword-based semantic prefetching approach in Internet news services. 16(5):601–611.
[179] Tingxin Yan, David Chu, Deepak Ganesan, Aman Kansal, and Jie Liu. Fast App Launching for Mobile Devices Using Predictive User Context. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services, MobiSys '12, pages 113–126. ACM. URL: http://doi.acm.org/10.1145/2307636.2307648, doi:10.1145/2307636.2307648.
[180] Jie Yang, Yuanyuan Qiao, Xinyu Zhang, Haiyang He, Fang Liu, and Gang Cheng. Characterizing user behavior in mobile internet. 3(1):95–106.
[181] Qiang Yang, Joshua Zhexue Huang, and Michael Ng. A data cube model for prediction-based web prefetching. 20(1):11–30.
[182] Qiang Yang, Tianyi Li, and Ke Wang. Building association-rule based sequential classifiers for web-document prediction. 8(3):253–273.
[183] Shengqian Yang, Dacong Yan, Haowei Wu, Yan Wang, and Atanas Rountev. Static control-flow analysis of user-driven callbacks in Android applications. In Proceedings of the 37th International Conference on Software Engineering-Volume 1, pages 89–99. IEEE Press.
[184] Shengqian Yang, Hailong Zhang, Haowei Wu, Yan Wang, Dacong Yan, and Atanas Rountev. Static window transition graphs for android (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference On, pages 658–668. IEEE.
[185] Wei Yang, Mukul R Prasad, and Tao Xie. A grey-box approach for automated GUI-model generation of mobile applications. In International Conference on Fundamental Approaches to Software Engineering, pages 250–265. Springer.
[186] Y. Yang and G. Cao. Prefetch-Based Energy Optimization on Smartphones. 17(1):693–706. doi:10.1109/TWC.2017.2769646.
[187] Hui Ye, Shaoyin Cheng, Lanbo Zhang, and Fan Jiang. Droidfuzzer: Fuzzing the android apps with intent-filter tag. In Proceedings of International Conference on Advances in Mobile Computing & Multimedia, page 68. ACM.
[188] Yifan Zhang, Chiu Tan, and Li Qun. CacheKeeper: A system-wide web caching service for smartphones. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 265–274. ACM.
[189] Yixue Zhao. Mobile-app analysis and instrumentation techniques reimagined with DECREE. In Proceedings of the 41st International Conference on Software Engineering: Companion Proceedings, ICSE '19, pages 218–221. IEEE Press. doi:10.1109/ICSE-Companion.2019.00086.
[190] Yixue Zhao. Toward Client-centric Approaches for Latency Minimization in Mobile Applications. In Proceedings of the 4th International Conference on Mobile Software Engineering and Systems, MOBILESoft '17, pages 203–204. IEEE Press. doi:10.1109/MOBILESoft.2017.34.
[191] Yixue Zhao, Justin Chen, Adriana Sejfia, Marcelo Schmitt Laser, Jie Zhang, Federica Sarro, Mark Harman, and Nenad Medvidović. FrUITeR: A Framework for Evaluating UI Test Reuse. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE '20. ACM. doi:10.1145/3368089.3409708.
[192] Yixue Zhao, Marcelo Schmitt Laser, Yingjun Lyu, and Nenad Medvidović. Leveraging Program Analysis to Reduce User-perceived Latency in Mobile Applications. In Proceedings of the 40th International Conference on Software Engineering, ICSE '18, pages 176–186. ACM. URL: http://doi.acm.org/10.1145/3180155.3180249, doi:10.1145/3180155.3180249.
[193] Yixue Zhao and Nenad Medvidović. A microservice architecture for online mobile app optimization. In Proceedings of the 6th International Conference on Mobile Software Engineering and Systems, MOBILESoft '19, pages 45–49. IEEE Press.
[194] Yixue Zhao, Paul Wat, Marcelo Schmitt Laser, and Nenad Medvidović. Empirically Assessing Opportunities for Prefetching and Caching in Mobile Apps. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, pages 554–564. ACM. URL: http://doi.acm.org/10.1145/3238147.3238215, doi:10.1145/3238147.3238215.
[195] Yixue Zhao, Siwei Yin, Adriana Sejfia, Marcelo Schmitt Laser, Haoyu Wang, and Nenad Medvidović. Assessing the Feasibility of Web-Request Prediction Models on Mobile Platforms. URL: https://arxiv.org/abs/2011.04654.