AUTOMATED REPAIR OF PRESENTATION FAILURES IN WEB APPLICATIONS USING SEARCH-BASED TECHNIQUES

by Sonal Mahajan

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2018

Copyright 2018 Sonal Mahajan

Dedication

To my parents, Pradip and Sharayu, and sister, Payal, for their endless love, support, and encouragement.

Acknowledgements

Pursuing a Ph.D. at USC has been one of the most rewarding, enriching, and challenging experiences of my life. It is my pleasure to acknowledge several people who were instrumental in the course of this pursuit.

Firstly, I would like to thank my advisor, Professor William G. J. Halfond, for his continuous support, feedback, and encouragement at all stages of my Ph.D. study. His enthusiasm and positive attitude towards new ideas dared me to think creatively and differently, and explore them without the fear of failing. The innumerable meetings that we had, and his timeless feedback and criticism, helped me strengthen the roots of my dissertation research. He has repeatedly inspired me by setting an example through hard work, and carefully nurtured my technical and communication skills, all of which have helped me grow as a researcher.

Besides my advisor, I would like to thank the rest of my dissertation committee: Prof. Nenad Medvidovic, Prof. Sandeep Gupta, Prof. Chao Wang, and Prof. Jyotirmoy Deshmukh, for their support, insightful comments, constructive feedback, and challenging questions. I am thankful to Prof. Phil McMinn from the University of Sheffield for being a great collaborator and imparting his knowledge of search-based techniques, which has helped me significantly improve my dissertation work.

My time at USC would not have been fun without the wonderful camaraderie of my labmates. They contributed towards creating a truly great and fun research environment, and helped me appreciate our different cultures, cuisines, and languages. I would like to thank Abdulmajeed Alameer for being a diligent co-author and a great friend, and for the interesting brainstorming sessions that we shared; Negarsadat Abolhassani for being a dear friend and for the entertaining and witty discussions that we had; Mian Wan for accompanying me in the walks back home and for introducing me to delicious Chinese snacks; Jiaping Gui for the many discussions that we had, especially during our job search; Yingjun Lyu for the several fun conversations; and Ding Li (now at NEC Labs) for being very supportive and for helping me get through my early days at USC.

My heartfelt acknowledgements go to my teachers in all stages of my education, the individuals who have inspired me to learn and grow. I would like to specially thank Prof. Ilmi Yoon and Prof. Dragutin Petkovic of San Francisco State University for encouraging me to apply and enter the doctoral program at USC. Their constant guidance and support helped me in making an informed decision. I am also thankful to Prof. Ashutosh Muchrikar of Cummins College of Engineering for his valuable support and encouragement during my undergraduate years.

Last but not the least, I would like to thank my family. My parents, Pradip and Sharayu, who have faithfully believed in me, instilled in me the zeal for perfection and a quest for knowledge, and supported me in all my pursuits.
My sister, Payal, and brother-in-law, Navdeep, who have always ensured that I don't feel homesick by providing me with a warm and welcoming home in the U.S., and have always offered great technical insights that have helped me shape my dissertation effectively. My little nephew, Neil, who has been a perfect means of relieving my stress and anxiety with his infectious laughter and loving, playful nature. My grandmother, who has been particularly supportive and encouraging.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Presentation Failures: Definition and Types
  1.2 Major Challenges
  1.3 Insights and Hypothesis
    1.3.1 Insight 1: Search-based techniques can be used for repair
    1.3.2 Insight 2: Repair has to quantify and balance two objectives
      1.3.2.1 Insight 2A: Presentation failures can be quantified
      1.3.2.2 Insight 2B: Aesthetic similarity can be quantified
    1.3.3 Hypothesis
  1.4 Contributions
  1.5 Overview of Publications
Chapter 2: Background
  2.1 Web App Basics
  2.2 Types of Presentation Failures
    2.2.1 Layout Cross Browser Issues (XBIs)
    2.2.2 Mobile Friendly Problems (MFPs)
    2.2.3 Internationalization Presentation Failures (IPFs)
    2.2.4 Mockup-driven Development Problems (MDDPs)
    2.2.5 Regression Debugging Problems (RDPs)
  2.3 The Process of Debugging Presentation Failures
    2.3.1 P1. Detection
    2.3.2 P2. Localization
    2.3.3 P3. Repair
  2.4 Search-Based Techniques
Chapter 3: Overview of the Generalized Repair Approach, *Fix
  3.1 Design of *Fix
    3.1.1 AP1: Initialize
    3.1.2 AP2 and AP3: Search and Evaluate
    3.1.3 AP4: Terminate
  3.2 Specializations of *Fix
    3.2.1 XFix: Repair of Layout Cross Browser Issues (XBIs)
    3.2.2 MFix: Repair of Mobile Friendly Problems (MFPs)
    3.2.3 IFix: Repair of Internationalization Presentation Failures (IPFs)
    3.2.4 GFix: Repair of Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)
Chapter 4: XFix: Repair of Layout Cross Browser Issues (XBIs)
  4.1 Background and Example
  4.2 Specialization of the Generalized Approach, *Fix
    4.2.1 Detection and Localization of Layout XBIs
    4.2.2 Repair of Layout XBIs
      4.2.2.1 Overall Algorithm
      4.2.2.2 Search for Candidate Fixes
      4.2.2.3 Search for the Best Combination of Candidate Fixes
  4.3 Evaluation
    4.3.1 Implementation
    4.3.2 Subjects
    4.3.3 Methodology
    4.3.4 Threats to Validity
    4.3.5 Discussion of Results
      4.3.5.1 RQ1: Reduction of XBIs
      4.3.5.2 RQ2: Impact on Cross-browser Consistency
      4.3.5.3 RQ3: Time Needed to Run XFix
      4.3.5.4 RQ4: Similarity of Repair Patches to Real-world Websites' Code
  4.4 Conclusion
Chapter 5: MFix: Repair of Mobile Friendly Problems (MFPs)
  5.1 Background
    5.1.1 Types of MFPs
    5.1.2 Current Methods for Addressing MFPs
  5.2 Specialization of the Generalized Approach, *Fix
    5.2.1 Detection of MFPs
    5.2.2 Repair of MFPs
    5.2.3 Phase 1: Segmentation
    5.2.4 Phase 2: Localization
      5.2.4.1 Identifying Problematic Segments
      5.2.4.2 Identifying Problematic CSS Properties
    5.2.5 Phase 3: Repair
      5.2.5.1 Metrics
      5.2.5.2 Computing Candidate Mobile Friendly Patches
      5.2.5.3 Generating the Mobile Friendly Patch
  5.3 Evaluation
    5.3.1 Implementation
    5.3.2 Subjects
    5.3.3 Experiment One
      5.3.3.1 Discussion of Results
    5.3.4 Experiment Two
      5.3.4.1 Discussion of Results
    5.3.5 Threats to Validity
  5.4 Conclusion
Chapter 6: IFix: Repair of Internationalization Presentation Failures (IPFs)
  6.1 Background
  6.2 Specialization of the Generalized Approach, *Fix
    6.2.1 Detection and Localization of IPFs
    6.2.2 Repair of IPFs
    6.2.3 Identifying Stylistically Similar Clusters
      6.2.3.1 Visual Similarity Metrics
      6.2.3.2 DOM Information Similarity Metrics
    6.2.4 Candidate Solution Representation
    6.2.5 Fitness Function
      6.2.5.1 Generating the PUT'
      6.2.5.2 Fitness Function Components
    6.2.6 Search
  6.3 Evaluation
    6.3.1 Implementation
    6.3.2 Subjects
    6.3.3 Experiment One
      6.3.3.1 Presentation of Results
      6.3.3.2 Discussion of Results
    6.3.4 Experiment Two
      6.3.4.1 Presentation of Results
      6.3.4.2 Discussion of Results
    6.3.5 Threats to Validity
  6.4 Conclusion
Chapter 7: GFix: Repair of Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)
  7.1 Background and Motivation
  7.2 Specialization of the Generalized Approach, *Fix
    7.2.1 Detection and Localization of MDDPs and RDPs
    7.2.2 Repair of MDDPs and RDPs
      7.2.2.1 Overall Algorithm
      7.2.2.2 CSS Properties from Visual Symptoms of MDDPs and RDPs
      7.2.2.3 Size and Position Analysis
      7.2.2.4 Color Analysis
      7.2.2.5 Predefined Values Analysis
  7.3 Evaluation
    7.3.1 Implementation
    7.3.2 Subjects
      7.3.2.1 Selection of subjects for RDPs
      7.3.2.2 Selection of subjects for MDDPs
      7.3.2.3 Test case generation
    7.3.3 Experiment One
      7.3.3.1 Presentation of Results
      7.3.3.2 Discussion of Results
    7.3.4 Experiment Two
      7.3.4.1 Presentation of Results
      7.3.4.2 Discussion of Results
    7.3.5 Threats to Validity
  7.4 Conclusion
Chapter 8: Related Work
  8.1 Automated repair of software systems
  8.2 Cross Browser Issues (XBIs)
  8.3 Mobile Friendly Problems (MFPs)
  8.4 Internationalization Presentation Failures (IPFs)
  8.5 Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)
  8.6 Other presentation failure detection techniques
  8.7 Testing and analysis of web app client-side components
  8.8 Graphical User Interface (GUI) Testing
Chapter 9: Conclusion and Future Work
  9.1 Summary
  9.2 Future Work
References

List of Tables

3.1 Overview of the specializations of *Fix
4.1 Subjects used in the evaluation of XFix
4.2 Effectiveness of XFix in reducing XBIs
4.3 XFix's average run time in seconds
5.1 Subjects used in the evaluation of MFix
6.1 Subjects used in the evaluation of IFix
6.2 Effectiveness of IFix in reducing IPFs
7.1 Mapping between symptoms and CSS properties
7.2 Categorization of CSS properties based on visual impact
7.3 Subjects used in the evaluation of GFix
7.4 Effectiveness of GFix in reducing MDDPs and RDPs
7.5 GFix's median run time in seconds

List of Figures

2.1 Process flow overview of debugging presentation failures
3.1 Overview of the generalized repair approach, *Fix
4.1 Example screenshots showing XBI
4.2 Overview of the XFix approach for repairing layout XBIs
4.3 Fitness function components of the XBI repair approach
4.4 High-level overview of the XFix tool
4.5 Example of repair.css generated by XFix
4.6 Similarity ratings given by participants in XFix human study
4.7 Similarity of XFix generated repair patches to real-world websites' code
5.1 Overview of the MFix approach for repairing MFPs
5.2 Example for demonstrating the MFix approach
5.3 Distribution of the median mobile friendliness score across 10 runs
5.4 Breakdown of the running time of MFix
6.1 Example of an IPF and different ways of fixing it
6.2 Overview of the IFix approach for repairing IPFs
6.3 Example of ancestor elements with fixed width that need to be adjusted together with SimSet elements
6.4 Initializing the population
6.5 Two UI snippets in the same equivalence class (from Hotwire)
6.6 Similarity ratings given by user study participants
6.7 Weighted distribution of the ratings
7.1 Overview of the GFix approach for repairing MDDPs and RDPs
7.2 Illustrative example
7.3 Anti-aliasing example
7.4 Examples of UI snippets used in GFix's user study
7.5 GFix's user study results

Abstract

The appearance of a web application's User Interface (UI) plays an important part in its success. Issues degrading the UI can negatively affect the usability of a website and impact an end user's perception of the website and the quality of the services that it delivers. Such UI related issues, called presentation failures, occur frequently in modern web applications. Despite their importance, there exist no automated techniques for repairing presentation failures. Instead, repair is typically a manual process, where developers must painstakingly analyze the UI of a website, identify the faulty UI elements (i.e., HTML elements and CSS properties), and carry out repairs. This is labor intensive and requires significant expertise of the developers.

My dissertation addresses these challenges and limitations by automating the process of repairing presentation failures in web applications. My key insight underlying this research is that search-based techniques can be used to find repairs for the observed presentation failures by intelligently and efficiently exploring the large solution spaces defined by the HTML elements and CSS properties in a web page. Based on this insight, I designed and developed four techniques for the automated repair of different types of presentation failures in web applications. The first technique focuses on the repair of layout Cross Browser Issues (XBIs), i.e., inconsistencies in the appearance of a website when rendered in different web browsers.
The second technique addresses Mobile Friendly Problems (MFPs) in websites, i.e., it improves the readability and usability of a website when accessed from a mobile device. The third technique repairs problems related to internationalization in web application UIs. Lastly, the fourth technique addresses issues arising from mockup-driven development and regression debugging. In the empirical evaluations, all four techniques were highly effective in repairing presentation failures, while in the conducted user studies, participants overwhelmingly preferred the visual appeal of the repaired versions of the websites compared to their original (faulty) versions. Overall, these are positive results and indicate that my repair techniques can help developers repair presentation failures in web applications, while maintaining their aesthetic quality.

Chapter 1: Introduction

The number of web applications is on a constant rise, with over 1.8 billion websites reported as of December 2017 [1]. The proliferation of web applications has pervaded all aspects of human activities, from banking to social networking and from entertainment to shopping. In fact, based on a recent report released by the Department of Commerce, a total revenue of over 450 billion dollars was generated in the United States from e-commerce sales alone in the year 2017 [4]. The increasing popularity of web applications is driven by companies offering their services and products via interactive and feature-rich websites that end users can access from a wide range of browsers running on a variety of different platforms and devices. To further expand their outreach to international markets, companies often provide translated text and localized media content on their websites in order to communicate effectively with a global audience.

An attractive and visually appealing User Interface (UI) plays an important part in a website's success. A recent study underscores this point by noting that an average visitor forms a first impression of a web page within the first 50 milliseconds of visiting the page [152], a first impression that is heavily influenced by the web page's aesthetics. Companies invest a significant amount of effort to design and implement their websites. It is typical for companies to employ teams of graphic designers and web programmers to get a website's "look and feel" correct. This effort is important because studies have shown that the aesthetics of a website significantly impact end users' overall evaluation of it, particularly their impressions of trustworthiness and usability [152, 88, 87, 151, 71, 67, 70, 133, 145]. Issues degrading the visual consistency and aesthetics of a website can undermine this effort and negatively impact end users' perception of the website and the quality of the services that it delivers. These issues can also seriously impact a website's usability or functionality, likely leading to a frustrating and poor user experience. Thus, such appearance related issues have the potential to severely impact the website's success and affect the branding a company is trying to achieve [30]. Unfortunately, despite their severe impact, appearance related issues are found to occur frequently in modern web applications [137, 47, 73].

Researchers have recognized the need for automating the process of debugging appearance related issues in web applications and have proposed several approaches for various parts of the debugging process.
However, these techniques are limited in their applicability, as they can only detect the appearance related issues; repair remains a labor intensive manual task. This motivates the need for automating the process of repairing appearance related issues in web applications.

1.1 Presentation Failures: Definition and Types

Appearance related issues in web applications are called presentation failures, which are formally defined as discrepancies between the actual appearance of a website and its intended appearance. Presentation failures are found pervasively in modern web applications [137, 47, 73] and can cause serious usability problems or significantly distort the intended appearance of a website. Presentation failures can occur in web applications for several reasons, such as developer errors and differences in the rendering engines of browsers. I now discuss examples of different types of presentation failures.

The first type is layout Cross Browser Issues (XBIs), which occur when a web page is rendered inconsistently across different browsers. XBIs tend to arise because of differences in the interpretations of HTML and CSS standards by the different browsers. The second type is Mobile Friendly Problems (MFPs), which occur when users access web pages from a non-traditional sized device, such as a smartphone or tablet, and the pages are not designed to be mobile friendly. Such pages can exhibit a range of usability issues, such as unreadable text, cluttered navigation, or content that overflows the device's viewport. The third type is Internationalization Presentation Failures (IPFs), which occur when a page is translated from one language to another (e.g., English to Spanish), causing distortions in the layout of the page. Such distortions are caused by differences in the lengths of text in different languages. The fourth type is Regression Debugging Problems (RDPs). These presentation failures occur when developers perform maintenance tasks, such as correcting a bug or refactoring the HTML structure, and cause the page to appear differently than its original (intended) version. For example, a refactoring may change a page from a table-based layout to one based on <div> tags. During this process, the developers may inadvertently introduce a fault that changes the appearance of the page in an unintended manner. The fifth and last type is Mockup-driven Development Problems (MDDPs). In the mockup-driven style of development, developers use mockups (highly detailed renderings of the intended appearance of the web page) to guide their implementation of the web pages. Developers may accidentally introduce faults during this process, which causes the implemented page to appear differently than its mockup.

1.2 Major Challenges

The detection, localization, and repair of presentation failures pose numerous challenges for developers and testers. First, detection of presentation failures is an expensive task, since testers must typically look at the UI of a website and identify when the UI does not conform to its intended appearance. Second, localization of presentation failures is difficult given the complex layout and styles of modern web pages. Third, developers lack a standardized way to repair presentation failures and generally have to resolve them on a case by case basis.

Existing tools are limited in helping developers to debug presentation failures.
Although tools such as Firebug [22] can provide useful information, developers still require expertise to manually analyze the presentation failures and then repair them by performing the necessary modifications so that the page renders correctly. Automated UI testing techniques, such as X-PERT [137], the Google Mobile-Friendly Test Tool (GMFT) [26], and GWALI [49], are only able to detect and localize presentation failures (i.e., they address only the first two of the three previously listed challenges), but are incapable of repairing presentation failures so that the rendering of a web page conforms to its expected appearance. Repairing presentation failures thus remains strictly a manual process that is labor intensive and guided by a developer's intuition and experience. Therefore, the accuracy and efficiency of this process can vary significantly by developer. The challenges and limitations of existing techniques motivate my research to automate the process of repairing presentation failures in web applications.

Repairing presentation failures, however, poses several challenges. The first challenge is that the solution space for identifying a repair is very large. Finding a repair for presentation failures means identifying new values for CSS properties of HTML elements that can make the faulty appearance of the web page match its correct appearance as closely as possible. These three UI aspects, i.e., HTML elements, CSS properties, and their values, form the three dimensions of the solution space, which can grow multiplicatively with respect to the number of HTML elements and CSS properties in a page. A typical web page can contain hundreds or thousands of HTML elements, each with several dozen CSS properties that range over a large set of possible values; for example, the background-color CSS property can take one of 16 million colors. Therefore, a brute force approach for finding the fix value is not scalable.
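To make the scale of this solution space concrete, the following back-of-the-envelope sketch in Python multiplies out the three dimensions for a hypothetical mid-sized page. The counts are illustrative assumptions, not measurements from this dissertation.

```python
# Rough size of the repair solution space for a hypothetical page:
# every (element, property, value) triple is one candidate single fix.
num_elements = 1_000          # HTML elements on the page (assumed)
props_per_element = 50        # CSS properties per element (assumed)
values_per_property = 256**3  # e.g., background-color: ~16.7 million RGB values

candidates = num_elements * props_per_element * values_per_property
print(f"{candidates:.2e} candidate single-property fixes")
# ~8.39e+11 candidates, before even considering repairs that must change
# several elements at once -- far too many to enumerate exhaustively.
```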
The second challenge is that the rendering of a web page is controlled by complex and dynamic interactions between the HTML and CSS of a page, making it difficult to analytically model these interactions. This problem is further compounded because the rendering rules are context-sensitive, meaning that an analytical model deduced for one web page rendered in a browser is not generalizable across different web pages and browsers. The third challenge is that a repair must be constructed carefully so that it modifies the problematic UI elements in the page without introducing new presentation failures, which can easily occur due to the complex and cascading interactions between HTML and CSS in a web page. This means that any potential repair must be evaluated in the context of not only how well it resolves the targeted presentation failure, but also its impact on the rest of the page's layout as a whole. This task is complicated because it is possible that more than one element will have to be adjusted to repair a presentation failure.

1.3 Insights and Hypothesis

In this section, I present the key insights underlying my research and the hypothesis that this dissertation tests in order to realize the goal of automatically repairing presentation failures in web applications.

1.3.1 Insight 1: Search-based techniques can be used for repair

To address the repair challenges discussed above, my first key insight is that search-based techniques can be used to repair presentation failures in web applications. An important characteristic of the presentation failures problem domain that motivates this insight is that repairing presentation failures does not require finding a pixel-perfect, optimal solution. Identifying an approximate or close-to-optimal solution that can resolve the presentation failures while maintaining close aesthetic similarity to the page's original version is sufficient. Search-based techniques are ideal for this type of problem because they can explore large solution spaces intelligently and efficiently to find a good approximate solution. Such techniques typically work by evaluating the quality of potential solutions until a good enough solution is found or the allocated computation budget is fully exhausted. To evaluate the quality of a solution, search-based techniques use an objective function that can guide the search to likely solutions. The insight of designing such an objective function for repairing presentation failures is described in Insight 2.

1.3.2 Insight 2: Repair has to quantify and balance two objectives

My second key insight is that the best repair has to balance two objectives, or competing constraints: minimize the number of presentation failures in the page under test and maximize the page's aesthetic similarity to its original version. Search-based techniques are well-suited for such multi-objective optimization problems, since they can effectively balance a number of competing constraints to produce an overall best repair. The details for quantifying these two objectives are explained in Insights 2A and 2B, respectively.

1.3.2.1 Insight 2A: Presentation failures can be quantified

My insight for the first objective is that presentation failures in a web page can be quantified by leveraging existing detection techniques, such as X-PERT [137] for XBIs. The detection techniques typically compare a page under test with its intended appearance and report the differences as presentation failures. The number of presentation failures in a page reported by the detection techniques can be used to determine how close a candidate repair found during a search is to making the page failure free. Such an objective function could guide the search to a repair that minimizes the number of presentation failures in a page. When the number of presentation failures converges to zero, it would imply that all of the presentation faults in the page have likely been identified and repaired. One weakness of such an objective function, however, is that it is unlikely to be able to reliably guide the search to solutions that maintain (or enhance) the aesthetic quality of the page. For example, solutions generated by simply hiding the problematic HTML elements or assigning extreme CSS values can resolve the presentation failures in a page, but they can also significantly disrupt the page's layout and affect its functionality and usability, thereby likely being unacceptable to developers and end users. This motivates the need for finding a repair that can faithfully maintain (or enhance) the page's aesthetic quality after repair, which forms the second objective of the objective function.

1.3.2.2 Insight 2B: Aesthetic similarity can be quantified

My insight for the second objective is that the aesthetic similarity of a page before and after repair can be quantified by analyzing the page's layout, and that this quantification can also be used as part of the objective function. The layout of a web page rendered in a browser can be modeled as a graph, with the HTML elements in the page forming the vertices and the layout relationships of the elements with respect to each other, such as parent-child, sibling, and position, forming the edges. Such a model can be built for a page before and after applying a repair, and the two models can be compared. The differences extracted from the comparison serve as a quantification of how closely the repaired page's aesthetics match its original version. This measurement of aesthetic similarity can be used as part of the objective function to guide the search to a repair that minimizes the layout deviation between the before and after repair versions of the page.
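As a concrete illustration of how these two objectives might be combined into a single fitness value, consider the minimal Python sketch below. The detector interface, the edge-set layout model, and the weights are hypothetical stand-ins for the components that the individual techniques define in later chapters.

```python
from typing import Callable, Set, Tuple

# A layout model as a set of relationship edges, e.g.
# ("#logo", "left-of", "#nav") or ("#nav", "child-of", "#header").
LayoutGraph = Set[Tuple[str, str, str]]

def layout_deviation(before: LayoutGraph, after: LayoutGraph) -> int:
    """Count edges present in only one model (symmetric difference)."""
    return len(before ^ after)

def fitness(candidate_page: str,
            detect: Callable[[str], int],         # returns #failures (Insight 2A)
            model: Callable[[str], LayoutGraph],  # builds layout graph (Insight 2B)
            original_layout: LayoutGraph,
            w_failures: float = 10.0,
            w_layout: float = 1.0) -> float:
    """Lower is better: balance failure count against layout deviation."""
    failures = detect(candidate_page)
    deviation = layout_deviation(original_layout, model(candidate_page))
    return w_failures * failures + w_layout * deviation
```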
1.3.3 Hypothesis

Based on the above two insights, the hypothesis statement of my dissertation is: Search-based techniques can repair presentation failures in a web page with high effectiveness.

To evaluate the hypothesis, I developed different repair algorithms using search-based techniques to resolve the different types of presentation failures in web applications discussed in Section 1.1. I then showed the effectiveness of these techniques in resolving the detected presentation failures. I quantified effectiveness as the reduction in the presentation failures reported by existing detection techniques between the before and after repair versions of the pages. I also conducted user studies to understand the reduction in human-observable presentation failures and to quantify the impact of the generated repairs on the aesthetic quality of the pages from a human perspective. The empirical evaluations demonstrated that my repair algorithms can provide fixes for resolving a high number of the observed presentation failures in web pages, while, in most cases, maintaining or enhancing the pages' aesthetic appeal. These results confirm the hypothesis of my dissertation and indicate that my research is potentially of high usefulness to developers for automatically repairing different types of presentation failures in web applications.

1.4 Contributions

The contributions of my dissertation include the design and development of four search-based techniques for the automated repair of different types of presentation failures in web applications, and empirical evaluations of the techniques to assess their effectiveness. To the best of my knowledge, my work is the first automated approach for generating repairs for presentation failures, and the first to apply search-based repair techniques to web pages.

1. Repair techniques: I designed and developed the four techniques listed below for repairing the different types of presentation failures in web applications discussed in Section 1.1. The different techniques are explained in detail in Chapters 4, 5, 6 and 7. For easy understanding of my repair techniques, I explain their common structure in Chapter 3.

(a) XFix: The goal of this technique is to find potential fixes that can repair layout XBIs detected in web pages.
(b) MFix: The goal of this technique is to improve the mobile friendliness of web pages by automatically repairing the MFPs detected in the pages.
(c) IFix: The goal of this technique is to automatically repair IPFs that have been detected in translated versions of web pages.
(d) GFix: The goal of this technique is to automatically repair the MDDPs and RDPs detected in web pages.
2. Empirical evaluation of the techniques: I conducted empirical evaluations of the four techniques listed above on real-world subject web pages to demonstrate their effectiveness. For measuring effectiveness, I used existing detection techniques to compute the reduction in the number of observed presentation failures before and after repair, and conducted user studies to assess the visual quality of the generated fixes. I also measured the total running time required by the different techniques to generate repairs for each of the subject web pages.

1.5 Overview of Publications

In this section, I provide an overview of the publications that I have developed during the course of this dissertation. The dissertation work is mainly divided into four chapters, Chapters 4, 5, 6 and 7, that correspond to the four techniques I developed to repair different types of presentation failures in web pages. Each of the chapters is based on one or more papers, which have been published or are under submission. The 15 papers corresponding to the four chapters are listed below. For each of the papers, I was the primary author (or one of the primary authors), with contributions including the idea, design, and evaluation of the work. All of the papers were co-authored with my Ph.D. advisor, Prof. William G. J. Halfond.

Chapter 4: Repair of Layout Cross Browser Issues (XBIs)

In this chapter, I discuss the repair technique, XFix, that I designed for repairing layout XBIs in web pages. This chapter had two publications at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) in 2017. The first publication was a full paper in the main research track [97] and the second publication was a tool demo paper [98]. The main research track paper received the ACM SIGSOFT Distinguished Paper Award. An extended version of the main research track paper is currently in preparation to be submitted to a software engineering journal [96]. All of the papers were co-authored with Abdulmajeed Alameer, a fellow Ph.D. student at USC, and Prof. Phil McMinn, a collaborator from the University of Sheffield, UK.

1. [97] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Automated Repair of Layout Cross Browser Issues using Search-Based Techniques. In Proceedings of the 26th International Symposium on Software Testing and Analysis (ISSTA), July 2017. Acceptance rate: 26%. Distinguished Paper Award.

2. [98] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. XFix: An Automated Tool for the Repair of Layout Cross Browser Issues. In Proceedings of the 26th International Symposium on Software Testing and Analysis (ISSTA), Demo Track, July 2017.

3. [96] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Search-Based Automatic Repair of Layout Cross-Browser Issues. In submission.

Chapter 5: Repair of Mobile Friendly Problems (MFPs)

In this chapter, I discuss the repair technique, MFix, that I designed for repairing MFPs in web pages to improve their mobile friendliness. This chapter had one publication in the research track of the IEEE/ACM International Conference on Software Engineering (ICSE) in 2018 [95]. This work was developed in collaboration with Negarsadat Abolhassani, a fellow Ph.D. student at USC, and Prof. Phil McMinn.

4. [95] Sonal Mahajan, Negarsadat Abolhassani, Phil McMinn, and William G. J. Halfond. Automated Repair of Mobile Friendly Problems in Web Pages.
In Proceedings of the 40th International Conference on Software Engineering (ICSE), May 2018. Acceptance rate: 20%.

Chapter 6: Repair of Internationalization Presentation Failures (IPFs)

This chapter discusses the repair technique, IFix, that I developed for repairing IPFs in web pages. This work was published in the research track of the IEEE International Conference on Software Testing, Verification and Validation (ICST) in 2018, and was a recipient of the IEEE Distinguished Paper Award [99]. The paper was co-authored with Abdulmajeed Alameer and Prof. Phil McMinn. To facilitate the repair, I first designed a technique (GWALI) for detecting and localizing IPFs in web pages. This work was originally published at the IEEE International Conference on Software Testing, Verification and Validation (ICST) in 2016, where it received the IEEE Best Paper Award [49]. An extension of this work is currently in preparation to be submitted to a software engineering journal [48]. The detection and localization papers were developed in collaboration with Abdulmajeed Alameer.

5. [99] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Automated Repair of Internationalization Failures in Web Applications Using Style Similarity Clustering and Search-Based Techniques. In Proceedings of the 11th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2018. Acceptance rate: 25%. Distinguished Paper Award.

6. [49] Abdulmajeed Alameer, Sonal Mahajan, and William G. J. Halfond. Detecting and Localizing Internationalization Presentation Failures in Web Applications. In Proceedings of the 9th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2016. Acceptance rate: 27%. Best Paper Award.

7. [48] Abdulmajeed Alameer, Sonal Mahajan, and William G. J. Halfond. Detecting and Localizing Internationalization Layout Failures in Web Applications. In submission.

Chapter 7: Repair of Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)

In this chapter, I explain the repair technique, GFix, that I developed for repairing MDDPs and RDPs in web pages. This work is currently under submission at a software engineering journal [107]. It was co-authored with Prof. Phil McMinn. Prior to developing GFix, I developed a detection and localization technique, WebSee. WebSee was originally published as a new ideas paper [102] at the IEEE/ACM International Conference on Automated Software Engineering (ASE) in 2014. It was later extended and published as a full paper in the research track of the IEEE International Conference on Software Testing, Verification and Validation (ICST) in 2015 [103]. Its tool demo paper was also published at the same conference (ICST 2015) [104]. I later developed a technique based on WebSee to identify visual inconsistencies across web pages of the same web application, which was published as a short paper at the Asia-Pacific Software Engineering Conference (APSEC) in 2016 [101]. This work was developed during my summer internship at Infosys Labs in collaboration with Krupa Benhur Gadde and Anjaneyulu Pasala. An extended version of the APSEC 2016 paper is currently in preparation to be submitted to a software engineering journal [100]. I developed two techniques to perform root cause analysis of MDDPs and RDPs to identify the faulty HTML elements and CSS properties that are likely responsible for the observed failures.
The first technique was a preliminary work using search-based techniques that appeared as a new ideas short paper [106] at the Workshop on Search-Based Software Testing (SBST) in 2014. The second technique was based on visual symptoms and probabilistic modeling, and was published as a full paper [105] in the research track of the IEEE International Conference on Software Testing, Verification and Validation (ICST) in 2016. Both papers were developed in collaboration with Bailan Li, a former Ph.D. student at USC. The ICST 2016 paper was also co-authored with Pooyan Behnamghader, a fellow Ph.D. student at USC.

8. [107] Sonal Mahajan, Phil McMinn, and William G. J. Halfond. A Search-Based Framework for the Automated Repair of Presentation Failures in Web Applications. In submission.

9. [102] Sonal Mahajan and William G. J. Halfond. Finding HTML Presentation Failures Using Image Comparison Techniques. In Proceedings of the 29th IEEE/ACM International Conference on Automated Software Engineering (ASE), New Ideas Track, September 2014. Acceptance rate: 24%.

10. [103] Sonal Mahajan and William G. J. Halfond. Detection and Localization of HTML Presentation Failures Using Computer Vision-Based Techniques. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2015. Acceptance rate: 24%.

11. [104] Sonal Mahajan and William G. J. Halfond. WebSee: A Tool for Debugging HTML Presentation Failures. In Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation (ICST), Tool Track, April 2015. Acceptance rate: 36%.

12. [101] Sonal Mahajan, Krupa Benhur Gadde, Anjaneyulu Pasala, and William G. J. Halfond. Detecting and Localizing Visual Inconsistencies in Web Applications. In Proceedings of the 23rd Asia-Pacific Software Engineering Conference (APSEC), Short Paper, December 2016. Acceptance rate: 29%.

13. [100] Sonal Mahajan, Krupa Benhur Gadde, Anjaneyulu Pasala, and William G. J. Halfond. Detecting and Suggesting Fixes for Visual Inconsistencies in Web Applications. In submission.

14. [106] Sonal Mahajan, Bailan Li, and William G. J. Halfond. Root Cause Analysis for HTML Presentation Failures Using Search-based Techniques. In Proceedings of the 7th International Workshop on Search-Based Software Testing (SBST), June 2014. Acceptance rate: 53%.

15. [105] Sonal Mahajan, Bailan Li, Pooyan Behnamghader, and William G. J. Halfond. Using Visual Symptoms for Debugging Presentation Failures in Web Applications. In Proceedings of the 9th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2016. Acceptance rate: 27%.

Chapter 2: Background

This chapter provides the necessary background information that is used throughout the dissertation. Section 2.1 discusses the fundamentals of a typical web application and how a browser renders the User Interface (UI) of a web page. In Section 2.2, I discuss five different types of presentation failures. Finally, in Section 2.3, I discuss the process of debugging presentation failures in web applications.

2.1 Web App Basics

A web application is a client-server software application in which the application code is present on a remote server and is delivered over the Internet to a client running a web browser. Typically, the client-server framework of web applications is modeled as a three-tier architecture: the presentation tier, the business logic tier, and the data tier.
The presentation tier is the topmost tier, displaying the web application to end users as HyperText Markup Language (HTML) pages rendered in their browsers. The business logic or middle tier contains the core functionality or behavior of the web application, and controls the interaction between the presentation and data tiers. The data tier houses the database servers where information is stored persistently. When end users interact with the presentation tier of the web application, the client (browser) submits a HyperText Transfer Protocol (HTTP) request message to the business logic tier on the server. The business logic tier processes the request, interacts with the data tier if required, fetches or generates the HTML page and related resources, and sends this as a response back to the client. The client then renders the received HTML page in the browser.

Modern web applications typically follow the Model-View-Controller (MVC) design pattern, in which the application code (Model and Controller) runs on a server accessible via the Internet and delivers HTML and CSS based web pages (View) to a client running a web browser. The repair techniques presented in my dissertation focus on repairing the View part of the web application.

I now discuss how an HTML page is rendered in a web browser. The rendering engine (also called the layout engine) in a browser is responsible for displaying the requested HTML content. At a high level, the rendering engine follows four steps to display the HTML page. The first step is parsing the HTML to construct the Document Object Model (DOM) tree. The HTML code is parsed using a tokenization algorithm, which parses the HTML into tokens such as start tags, end tags, and attribute name-value pairs. The DOM tree construction is performed in parallel with the tokenization algorithm: every HTML element produced by the tokenization algorithm, such as html, body, and div, is added to the DOM tree. Each HTML element may be referenced in the DOM tree using a unique expression, called an "XPath". After the DOM tree is constructed, the second step is render-tree construction. The render tree represents the visual order in which the elements will be displayed. The render tree is constructed by first parsing the Cascading Style Sheets (CSS) that describe the presentation properties of an HTML page. By parsing the CSS, the style rule for every element to be rendered is computed by calculating the visual properties that apply to it. The visual properties are calculated based on the CSS rules, cascade order, and specificity. After the render tree is constructed, the third step is to decide the layout of the render tree. This is done by mapping the DOM tree to the render tree. The bounding box for each element to be rendered on the browser screen is computed using a flow-based model. A bounding box gives the physical display location and size of an HTML element on the browser screen. The last step is painting the render tree. This is done by traversing the render tree and calling the rendering engine's paint() function to display content on the screen. The painting is done from the back to the front of the layout, meaning the background color is painted first, followed by the background image, the border, children, and finally the outline.
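As an aside, these rendering artifacts, i.e., the DOM tree, XPaths, computed CSS values, and bounding boxes, are exactly what an automated repair tool can observe from outside the browser. The sketch below shows one plausible way to read them with Selenium WebDriver in Python; the URL and XPath are illustrative placeholders, not values from the dissertation.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://localhost:8000/put.html")  # hypothetical page under test

# Reference an element in the DOM tree by its unique XPath.
element = driver.find_element(By.XPATH, "/html/body/div[1]")

# The bounding box computed by the layout step (location and size).
print(element.rect)  # e.g. {'height': 120, 'width': 784, 'x': 8, 'y': 8}

# The computed value of a CSS property after cascade and specificity
# have been resolved.
print(element.value_of_css_property("background-color"))

driver.quit()
```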
2.2 Types of Presentation Failures

A presentation failure is defined as a discrepancy between the actual appearance of a web page and its intended appearance. Examples of presentation failures are Cross Browser Issues (XBIs), Mobile Friendly Problems (MFPs), Internationalization Presentation Failures (IPFs), Mockup-driven Development Problems (MDDPs), and Regression Debugging Problems (RDPs). My dissertation targets the automated repair of these five different types of presentation failures. I discuss each of them in detail below.

2.2.1 Layout Cross Browser Issues (XBIs)

XBIs are defined as inconsistencies in the appearance or behavior of a website across different browsers. XBIs have been a serious concern for web developers for a long time. A simple search on StackOverflow, a popular technical forum, with the search term "cross browser" results in over 23,000 posts discussing ways to resolve XBIs, of which approximately 7,000 are currently active questions [43]. Although XBIs can impact the appearance or functionality of a website, the vast majority, over 90%, result in appearance related problems [137]. A significant class of appearance related XBIs is called layout XBIs, which collectively refer to any XBI that relates to an inconsistent layout of HTML elements in a web page when viewed in different browsers. These are the type of XBIs targeted by my dissertation. Layout XBIs appear in over 56% of the websites manifesting XBIs [137]. The impact of layout XBIs on web pages can range from minor cosmetic differences to severe aesthetic distortions to serious usability problems. Layout XBIs tend to arise from different interpretations of the HTML and CSS specifications, and are not, per se, faults in the browsers themselves [5]. Additionally, some browsers may implement new CSS properties or existing properties differently in an attempt to gain an advantage over competing browsers [113].

2.2.2 Mobile Friendly Problems (MFPs)

Many websites are not designed to gracefully handle users who access their pages through a non-traditional sized device, such as a smartphone or tablet. These problematic sites may exhibit a range of usability issues, such as unreadable text, cluttered navigation, or content that overflows the device's viewport and forces the user to pan and zoom the page in order to access content. Such usability issues are collectively referred to as MFPs [26, 13] and can lead to a frustrating and poor user experience.

2.2.3 Internationalization Presentation Failures (IPFs)

Companies often employ internationalization (i18n) frameworks for their websites, which allow the websites to provide translated text or localized media content in order to communicate effectively with a global audience. However, because translated text differs in length from text written in the original language of the page, the page's appearance can become distorted. HTML elements that are fixed in size may clip text or look too large, while those that are not fixed can expand, contract, and move around the page in ways that are inconsistent with the rest of the page's layout. Such distortions, called IPFs, reduce the aesthetics or usability of a website and occur frequently: a recent study reports their occurrence in over 75% of internationalized web pages [47].

2.2.4 Mockup-driven Development Problems (MDDPs)

Mockup-driven development [123, 92, 130] is a popular style of web app development. In this style of development, front-end developers use mockups (highly detailed renderings of the intended appearance of the web application) to guide their development of web application templates.
The developers are generally expected to create "pixel perfect" matches of these mockups [24] using web development tools, such as Adobe Muse, Amaya, or Visual Studio. Back-end developers also make changes to these templates by adding dynamic content. Both front-end and back-end developers need to check that their respective changes are consistent with the mockup. Inconsistencies in the appearance of the implemented web application and its mockups are called MDDPs.

2.2.5 Regression Debugging Problems (RDPs)

Developers often perform maintenance on their web pages in order to introduce new features, correct a bug, or refactor the HTML structure. The goal of such regression debugging tasks is to change the structure or HTML code of a page without altering its visual appearance. For example, a developer may refactor a web page to transition it from using a table-based layout to one based on the use of the <div> tag. During this modification, developers may inadvertently introduce a fault in the code that results in presentation failures called RDPs.

2.3 The Process of Debugging Presentation Failures

At a high level, the process of debugging presentation failures in web applications involves answering three questions: is there a presentation failure (Detection), where are the faulty HTML elements causing the failure (Localization), and how to correct the faulty elements to prevent the failure (Repair). Figure 2.1 gives an overview of the process of debugging presentation failures in web applications. The process takes two inputs: the page under test (PUT) to be analyzed for presentation failures and an oracle that specifies the visual correctness properties of the PUT. The process of repairing presentation failures comprises three phases. The first phase, Detection, compares the rendering of the PUT with the oracle to identify whether there exist visual differences. The second phase, Localization, identifies a set of potentially faulty HTML elements in the PUT that are most likely responsible for causing the detected failure. The third phase, Repair, produces and applies fixes to the faulty PUT to resolve the reported presentation failure. The output of the repair process is a page, PUT', which is a repaired version of the PUT. I now introduce the three phases of the debugging process in more detail.

[Figure 2.1: Process flow overview of debugging presentation failures]

2.3.1 P1. Detection

The goal of the first phase, Detection, is to report areas of visual difference in the PUT with respect to its appearance oracle. The UIs of modern web applications are highly complex and dynamic. Back-end server code dynamically generates content, and client-side browsers render this content based on complex HTML and CSS rules. This makes detecting presentation failures both a labor-intensive and error-prone process. To illustrate, a tester must visually compare the rendering of each page against an oracle, such as a design mockup, to detect that a presentation failure has occurred. Therefore, the ability to automatically detect presentation failures is a key first step toward automating the localization and repair phases.

2.3.2 P2. Localization

The goal of the second phase, Localization, is to identify a set of HTML elements in the PUT that may be responsible for the detected presentation failures.
Identifying the potentially faulty HTML elements for a presentation failure in modern web applications is challenging since an element's visual appearance is controlled by a complex series of interactions defined by the page's HTML structure and CSS rules. The widespread use of HTML rendering features, such as floating elements, overlays, and dynamic sizing, also increases the difficulty of identifying the faulty element as there is often no obvious or direct connection from the rendered appearance to the structure of the underlying HTML. Therefore an accurate localization of potentially faulty HTML elements in the PUT is an important step for effectively repairing the observed presentation failures.

2.3.3 P3. Repair

The third and final phase, Repair, of the debugging process takes the Detection and Localization information to an actionable result, finding a suitable fix for the presentation fault. Although it is possible for developers to manually fix their code given the localization information provided by phase two, this process poses several challenges. First, modern web pages may contain several CSS properties defined for each HTML element that control its appearance. This makes it challenging for developers to accurately determine which CSS properties of the reported faulty elements need to be adjusted in order to repair the presentation fault. Assuming that the relevant CSS properties can be identified, the developers must still carefully construct the repair. Due to complex and cascading interactions between styling rules, a change in one part of the PUT's UI can easily introduce further issues in another part of the page. This means that any potential repair must be evaluated in the context of not only how well it resolves the targeted presentation failure, but also its impact on the rest of the page's layout. For these reasons, automated techniques would help developers to more effectively and efficiently repair their faulty web pages. My dissertation work automates this phase of the debugging process to generate repairs for the observed presentation failures in web pages, while maintaining their aesthetic quality.

2.4 Search-Based Techniques

Search-based techniques explore large solution spaces intelligently and efficiently to find an optimal solution. Search-based techniques typically begin by selecting a sample set of one or more candidate solutions from the solution space. A fitness function then assesses their quality and assigns each candidate a score indicating how fit (i.e., good) they are. This score, called the objective score or fitness score, helps in establishing a "direction" of search. If an improvement in the fitness score is observed over the original score, then it indicates that the search is progressing in the correct direction. The search algorithm can then select new candidate solutions that are in this promising direction to achieve further fitness improvements. The search then checks if a stopping criterion is met; if so, the best candidate, i.e., the candidate with the optimal fitness score, is output as the solution. If not, the search cycle continues. This cycle is illustrated by the sketch below.
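To make this cycle concrete, the following minimal Java sketch shows one possible instantiation of the generic search loop. The candidate type, the fitness and neighbor functions, and the iteration cap are hypothetical placeholders supplied by a caller; they are not part of any specific repair technique described in this dissertation.

import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// A minimal sketch of the generic fitness-guided search cycle. Here a higher
// fitness score is better; the search keeps moving toward improving neighbors
// until no neighbor improves the score or the iteration budget is exhausted.
public class GuidedSearch<CandidateSolution> {
    public CandidateSolution search(List<CandidateSolution> initialSample,
                                    Function<CandidateSolution, Double> fitness,
                                    Function<CandidateSolution, List<CandidateSolution>> neighbors,
                                    int maxIterations) {
        // Score the initial sample and keep the fittest candidate so far.
        CandidateSolution best = initialSample.stream()
                .max(Comparator.comparing(fitness))
                .orElseThrow(() -> new IllegalArgumentException("empty sample"));
        for (int i = 0; i < maxIterations; i++) {           // stopping criterion
            CandidateSolution bestNeighbor = neighbors.apply(best).stream()
                    .max(Comparator.comparing(fitness)).orElse(best);
            if (fitness.apply(bestNeighbor) <= fitness.apply(best)) {
                break;  // no improvement: a (local) optimum has been reached
            }
            best = bestNeighbor;  // continue in the promising direction
        }
        return best;  // the candidate with the best observed fitness score
    }
}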
Chapter 3

Overview of the Generalized Repair Approach, *Fix

The four different repair techniques that I have developed as a part of this dissertation follow a common design of using search-based techniques for identifying a repair that can resolve the observed presentation failures while maintaining the aesthetic quality of the pages. To facilitate easy explanation of the repair techniques, I have abstracted their common structure into a generalized approach called *Fix. I provide details of *Fix in Section 3.1 and discuss its specializations in Section 3.2.

3.1 Design of *Fix

[Figure 3.1: Overview of the generalized repair approach, *Fix]

The presentation of a web page, i.e., the placement, size, and appearance of a page's UI elements, is controlled by the page's HTML elements and CSS properties. Therefore, the presentation failures in a page can be fixed by changing the values of CSS properties so that the faulty appearance of HTML elements in the page matches the correct, intended appearance as closely as possible. With this in mind, *Fix formally defines a fix for resolving presentation failures as a tuple ⟨e, p, v, v′⟩, where e is an HTML element in the page, p is a CSS property of e, v is the value of p, and v′ is the suggested new value for p.

Figure 3.1 shows an overview of *Fix. It takes four inputs. The first input is the page under test (PUT) exhibiting presentation failures. The PUT is obtained via a URL that points to a location on the file system or network that provides access to all of the necessary HTML, CSS, Javascript, and media files for rendering the PUT. The second input is the oracle that specifies the correct or intended rendering of the PUT. The third and fourth inputs are the detection (D) and localization (L) functions, respectively, that can identify the presentation failures and locate the potentially faulty HTML elements in the PUT responsible for the observed presentation failures. Existing detection and localization techniques, such as X-PERT [137] for Cross Browser Issues (XBIs) and GWALI [49] for Internationalization Presentation Failures (IPFs), can be leveraged to provide D and L. The output of *Fix is a page, PUT′, a repaired version of the PUT.

I have abstracted the common components of my repair techniques into four abstraction points in *Fix. An abstraction point is a processing hook or plug-in point that allows developers to add specialized code to provide specific functionality. The first abstraction point, Initialize, identifies a set of initial candidate solutions from the solution space. Then, the Search abstraction point takes as input the initial set of candidate solutions and explores the solution space to find an optimal solution by using the Evaluate abstraction point as a guide. The Evaluate abstraction point (i.e., the fitness function) guides the search to the optimal solution by considering two objectives: minimize the number of presentation failures in the PUT and maximize its aesthetic similarity to the original version. Finally, the Terminate abstraction point determines whether the search should terminate or proceed to another iteration of the search cycle. I now discuss the purpose of each of the four abstraction points in more detail.

3.1.1 AP1: Initialize

Proper initialization of the search space has been repeatedly cited in the literature as an important step for speeding up the convergence of a search-based algorithm [79, 114, 90, 65, 93]. With this in mind, the goal of the Initialize abstraction point is to identify a set of initial candidate solutions from the solution space that are potentially close to the optimal solution.
To do this, domain specific knowledge of the presentation failure being targeted can be used. For identifying e in the fix tuple, the potentially faulty HTML elements reported by the input function L can be used. For example, for IPFs, e can be selected from the potentially faulty HTML elements reported by the GWALI tool [49]. For p, relevant CSS properties that can influence the observed symptom of a presentation failure can be used. For example, the relevant CSS properties for IPFs are font-size, width, and height, as they can be adjusted to make the layout of the page adapt to the changes from text translation. For identifying v′, the visual manifestation of the presentation failure can be analyzed. For example, for IPFs, the text expansion that occurred in the PUT can be used to estimate a candidate v′. The design of the Initialize abstraction point will vary for other types of presentation failures.

3.1.2 AP2 and AP3: Search and Evaluate

The goal of the Search abstraction point is to explore the solution space to find an optimal solution by using the Evaluate abstraction point (i.e., the fitness function) to guide it. In the context of presentation failures, an optimal solution is defined as a repair that, when applied to the PUT, causes it to conform with the oracle. The Search abstraction point takes as input the initial set of candidate solutions produced by the Initialize abstraction point and searches the solution space to find new CSS values for the potentially faulty HTML elements and CSS properties that can repair the PUT. The Search abstraction point can be designed using different search-based algorithms, such as local search (e.g., Alternating Variable Method (AVM) search and simulated annealing) and global search (e.g., genetic algorithms). An appropriate search-based algorithm must be selected based on the following two criteria.

First, the search technique should generate a repair for a presentation failure in a reasonable amount of time. In a search-based technique, the fitness function needs to be invoked multiple times to evaluate the quality of the generated candidate solutions. In the context of presentation failures, the fitness function is an expensive call since it requires applying the candidate repair to the PUT, capturing the new layout of the page, and comparing the new layout with the oracle to get the new fitness score. Therefore, a quicker convergence of the search-based algorithm to generate a repair can be achieved by minimizing the calls to the fitness function.

Second, the search technique should resolve the PUT's presentation failures without introducing new failures, while also ensuring an aesthetically pleasing and usable layout. This can be done by encoding these considerations in the design of the fitness function used to guide the search. A good fitness function can be built to balance both objectives: minimize the number of presentation failures in the PUT and maximize the PUT's aesthetic similarity to the original version. The first objective can be designed by leveraging a measurement of the number of presentation failures detected in the PUT, by using the function D. For the second objective, different User Interface (UI) change metrics can be designed to measure the deviation in the appearance of the changed PUT with respect to its original version. The design of the fitness function and the selection of the search technique is different for different types of presentation failures. The sketch below illustrates the general shape of such a two-objective fitness function.
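As an illustration, here is a minimal Java sketch of a two-objective fitness function in the style described above. The Page type, the detect() call (standing in for the input function D), and the uiDifference() change metric are hypothetical placeholders, not the implementation used by any particular *Fix specialization.

// A minimal sketch of a two-objective fitness function: fewer detected
// failures and less visual deviation from the original page both lower
// the score, so lower scores indicate fitter candidate repairs.
public final class RepairFitness {
    private final double wFailures;   // weight of the failure-count objective
    private final double wDeviation;  // weight of the aesthetic-deviation objective

    public RepairFitness(double wFailures, double wDeviation) {
        this.wFailures = wFailures;
        this.wDeviation = wDeviation;
    }

    /** Lower is better: fewer failures and less deviation from the original. */
    public double score(Page candidate, Page original) {
        int failures = detect(candidate);                      // objective 1
        double deviation = uiDifference(candidate, original);  // objective 2
        return wFailures * failures + wDeviation * deviation;
    }

    // Placeholder: would delegate to the input function D (e.g., X-PERT or GWALI).
    private int detect(Page p) { return 0; }

    // Placeholder: a UI change metric comparing the changed page to the original.
    private double uiDifference(Page p, Page original) { return 0.0; }

    private interface Page {}  // stand-in for a rendered page under test
}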
3.1.3 AP4: Terminate

At the end of each search cycle, the Terminate abstraction point determines whether the search should terminate or proceed to another iteration of the search cycle. The search termination criteria can be different for different problem domains. In the context of presentation failures, common terminating conditions can be: all of the failures in the PUT are resolved, or the amount of allocated resources, such as time, is exhausted.

3.2 Specializations of *Fix

Table 3.1: Overview of the specializations of *Fix

                    | XBI          | MFP       | IPF        | MDDP         | RDP
Detection (D)       | X-PERT [137] | GMFT [26] | GWALI [49] | WebSee [103] | WebSee [103]
Localization (L)    | X-PERT [137] | MFix (§5) | GWALI [49] | WebSee [103] | WebSee [103]
*Fix specialization | XFix (§4)    | MFix (§5) | IFix (§6)  | GFix (§7)    | GFix (§7)

In this section, I give an overview of the four specializations of *Fix that I have designed for the repair of the different types of presentation failures discussed in Chapter 2. The four specializations are explained in detail in Chapters 4, 5, 6 and 7. In each of these chapters, I summarize the input functions D and L for completeness and discuss the repair approach based on *Fix in detail. Table 3.1 shows a summary of the specializations and the techniques used for providing the D and L inputs. The cells highlighted in gray in the original table represent the techniques developed by me.

3.2.1 XFix: Repair of Layout Cross Browser Issues (XBIs)

I designed an approach, XFix, for the automated repair of layout XBIs in web pages. XFix utilizes two phases of guided search to find the best repair. The first search finds one or more candidate fixes for each XBI by quantitatively comparing the layout similarity of the page rendered in different browsers via a fitness function. The second search then seeks to find an optimal combination of the candidate fixes identified in the first phase to produce an overall best repair by leveraging a measurement of the number of layout XBIs detected in a page. The empirical evaluation of XFix on 15 real world web pages showed that it was able to resolve 86% of the layout XBIs reported by X-PERT, and 99% of the layout XBIs observable by humans. The results demonstrate that my approach is potentially of high use to developers by providing automated fixes for layout XBIs. I provide more details about XFix in Chapter 4.

3.2.2 MFix: Repair of Mobile Friendly Problems (MFPs)

I designed an approach, MFix, to automatically generate CSS patches that can repair MFPs in a web page. A unique challenge relevant to the MFP problem domain is that a repair must maintain the page's original layout as faithfully as possible. This requires fixing MFPs while maintaining, where possible, the relative proportions and positioning of elements that are related to one another on the page. To do this, MFix first segments the page, i.e., identifies elements that form natural visual groupings on the page, and then builds a property dependence graph for each segment to adjust elements within a segment in synchronization with each other. The approach then builds graph-based models of the layout of a web page and uses constraints encoded by these graphs to compute patches that can improve mobile friendliness while minimizing layout disruption. To identify the best patch efficiently, MFix leverages unique aspects of the problem domain to quantify metrics related to layout distortion and parallelize the computation of the solution.
The empirical evaluation of MFix on 38 popular websites listed in the Alexa Top 50 most visited websites showed that it could effectively resolve mobile friendly problems for 95% of the subjects. I also evaluated the results with a user study, in which participants overwhelmingly preferred the repaired version of the website for use on mobile devices and also considered the repaired page to be more readable than the original. Chapter 5 discusses the MFix approach in more detail.

3.2.3 IFix: Repair of Internationalization Presentation Failures (IPFs)

I designed an approach, IFix, for automatically repairing IPFs in web pages. Repairing IPFs is challenging as any kind of style change to one element must be mirrored in stylistically related elements to maintain the aesthetic consistency of the page. To address this challenge, I designed a novel clustering technique that identifies groupings of elements that are stylistically similar and adjusts them together in order to maintain the visual consistency of the page. To find repairs, I devised a guided search-based technique that quantifies the amount of distortion in a page by leveraging an existing IPF detection technique, GWALI [49], and UI change metrics. The empirical evaluation showed that my approach was able to successfully resolve 98% of the reported IPFs for 23 real-world web pages. In a user study of the repaired web pages, I found that the repairs met with high user approval, with over 70% of the user responses rating the repaired pages as better than the faulty versions. I explain the IFix approach in more detail in Chapter 6.

3.2.4 GFix: Repair of Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)

I designed a novel automated approach, GFix, for repairing MDDPs and RDPs in web pages. GFix uses guided search-based techniques to automatically find repairs for the MDDPs and RDPs detected in web pages. As its fitness function, GFix uses computer-vision techniques to quantify the amount of human perceptible differences between the actual appearance of a web page and its intended appearance, with the aim of minimizing the visual differences. In the evaluation of the approach on a set of real-world web applications, I found that the approach was able to accurately and quickly identify repairs for the failures. I provide more details of the GFix approach in Chapter 7.

Chapter 4

XFix: Repair of Layout Cross Browser Issues (XBIs)

The constantly increasing number of web browsers with which users can access a website has introduced new challenges in preventing appearance related issues. Differences in how various browsers interpret HTML and CSS standards can result in Cross Browser Issues (XBIs): inconsistencies in the appearance or behavior of a website across different browsers. Although XBIs can impact the appearance or functionality of a website, the vast majority (over 90%) result in appearance related problems [137]. This makes XBIs a significant challenge in ensuring the correct and consistent appearance of a website's User Interface (UI).

Despite the importance of XBIs, their detection and repair poses numerous challenges for developers. First, the sheer number of browsers available to end users is large; an informal listing reports that there are over 115 actively maintained and currently available [33]. Developers must verify that their websites render and function consistently across as many of these different browsers and platforms as possible.
Second, the complex layouts and styles of modern web applications make it difficult to identify the UI elements responsible for the observed XBI. Third, developers lack a standardized way to address XBIs and generally have to resolve XBIs on a case by case basis. Fourth, for a repair, developers must modify the problematic UI elements without introducing new XBIs. Predictably, these challenges have made XBIs an ongoing topic of concern for developers. A simple search on StackOverflow, a popular technical forum, with the search term "cross browser" results in over 23,000 posts discussing ways to resolve XBIs, of which approximately 7,000 are currently active questions [43].

Tool support to help developers repair XBIs is limited in terms of capabilities. Although tools such as Firebug [22] can provide useful information, developers still require expertise to manually analyze the XBIs (which involves determining which HTML elements to inspect, and understanding the effects of the various CSS properties defined for them), and then repair them by performing the necessary modifications so that the page renders correctly. XBI-oriented techniques from the research community (e.g., X-PERT [137, 59, 141] and Browserbite [144]) are only able to detect and localize XBIs (i.e., they address the first two of the four previously listed challenges), but are incapable of repairing XBIs so that a web page can be "fixed" to provide a consistent appearance across different browsers.

To address these limitations, I propose a novel search-based approach that enables automatic generation of fixes for a significant class of appearance related XBIs. The XBIs targeted by my approach are known as layout XBIs (also referred to as "structure XBIs" by Choudhary et al. [137]), which collectively refer to any XBI that relates to an inconsistent layout of HTML elements in a web page when viewed in different browsers. Layout XBIs appear in over 56% of the websites manifesting XBIs [137]. My key insight is that the impact of layout XBIs can be quantified by a fitness function capable of guiding a search to a repair that minimizes the number of XBIs present in a page. I implemented this search-based approach as a tool, XFix, and evaluated it on 15 real world web pages containing layout XBIs. XFix was able to resolve 86% of the XBIs reported by X-PERT [137], and 99% of the XBIs observed by humans. The results therefore demonstrate that my approach is potentially of high use to developers by providing automated fixes for problematic web pages involving layout XBIs.

4.1 Background and Example

In this section I provide background information that details why layout XBIs occur and what the common practices are to repair them, and introduce an illustrative example.

Layout XBIs

Inconsistencies in the way browsers interpret the semantics of the Document Object Model (DOM) and Cascading Style Sheets (CSS) can cause layout XBIs (hereafter, layout XBIs are simply referred to as XBIs): differences in the rendering of an HTML page between two or more browsers. These inconsistencies tend to arise from different interpretations of the HTML and CSS specifications, and are not, per se, faults in the browsers themselves [5]. Additionally, some browsers may implement new CSS properties or existing properties differently in an attempt to gain an advantage over competing browsers [113].

Fixing Layout XBIs

When a layout XBI has been detected, developers may employ several strategies to adjust its appearance. For example, changing the HTML structure, replacing unsupported HTML tags, or adjusting the page's CSS.
My approach targets XBIs that can be resolved by finding alternate values for a page's CSS properties. There are two significant challenges to carrying out this type of repair. First, the appearance (e.g., size, color, font style) of any given set of HTML elements in a browser is controlled by a series of complex interactions between the page's HTML elements and CSS properties, which means that identifying the HTML elements responsible for the XBI is challenging. Second, assuming that the right set of elements can be identified, each element may have dozens of CSS properties that control its appearance, position, and layout. Each of these properties may range over a large domain. This makes the process of identifying the correct CSS properties to modify and the correct alternate values for those properties a labor intensive task. Once the right alternate values are identified, developers can use browser-specific CSS qualifiers to ensure that they are used at runtime. These qualifiers direct the layout engine to use the provided alternate values for a CSS property when it is rendered on a specific browser [7, 6]. This approach is widely employed by developers. In my analysis of the top 480 websites (see Section 4.3), I found that 79% employed browser-specific CSS to ensure a consistent cross browser appearance. In fact, web developers typically maintain an extensive list of browser specific styling conditions [6] to address the most common XBIs.

[Figure 4.1: Example screenshots showing an XBI. (a) Correct rendering of the page with Internet Explorer 11.0.33. (b) The same page displaying an XBI when rendered with Mozilla Firefox 46.0.1.]

Example XBI and Repair

Figure 4.1 shows screenshots of the menu bar of one of the evaluation subjects, IncredibleIndia, as rendered in Internet Explorer (IE) (Figure 4.1a) and Firefox (Figure 4.1b). As can be seen, an XBI is present in the menu bar, where the text of the navigational links is unreadable in the Firefox browser (Figure 4.1b). An excerpt of the HTML and CSS code that defines the navigation bar is shown in Listing 4.1. To resolve the XBI, an appropriate value for the margin-top or padding-top CSS property needs to be found for the HTML element corresponding to the navigation bar to push it down and into view. In this instance, the fix is to add "margin-top:1.7%" to the CSS for the Firefox version. The inserted browser-specific code is shown in the highlighted section of Listing 4.1. The "-moz" prefixed selector declaration directs the layout engine to only use the included value if the browser type is Firefox (i.e., Mozilla), and other browsers' layout engines will ignore this code.

1  <style>
2    .menubar {
3      position: relative;
4    }
5
6    @-moz-document url-prefix("") {
7      .menubar {
8        margin-top: 1.7%;
9      }
10   }
11
12 </style>
13 <body>
14   <div class="menubar">
15     ...
16   </div>
17 </body>

Listing 4.1: HTML and CSS excerpt of the IncredibleIndia example shown in Figure 4.1. The highlighted section (lines 6-10) represents the fix added to the CSS to address the XBI.

This particular example was chosen because the fix is straightforward and easy to explain. However, most XBIs are much more difficult to resolve. Typically multiple elements may need to be adjusted, and for each one multiple CSS properties may also need to be modified. A fix itself may introduce new XBIs, meaning that several alternate fixes may need to be considered.
4.2 Specialization of the Generalized Approach, *Fix

In this section, I provide details of my approach for repairing layout XBIs in web pages that is based on the generalized repair approach, *Fix, explained in Chapter 3. The two prerequisites for repairing presentation failures, Detection and Localization, can be instantiated using existing Cross Browser Testing (XBT) techniques, such as X-PERT [137]. For completeness, I summarize X-PERT's detection and localization algorithm in Section 4.2.1. My contribution is in developing a repair approach for finding suitable fixes for layout XBIs in web pages using search-based techniques. Section 4.2.2 discusses this approach in more detail.

4.2.1 Detection and Localization of Layout XBIs

For implementing the Detection and Localization phases of the debugging process, I use the X-PERT tool [137], which is a well-known XBI oriented technique. X-PERT uses DOM differencing techniques to detect and localize relative-layout XBIs in web applications. For completeness, I provide a synopsis of X-PERT's detection and localization algorithm and its evaluation results.

X-PERT first captures the layout of a page rendered in a browser as a directed graph, where the vertices are HTML elements in the page and an edge exists between two vertices only if they represent a parent-child or sibling relationship with each other. The edges are further qualified with attributes specifying the relative position of the two vertices with respect to each other, such as left-align, center-align, and above. These attributes are computed by comparing the Minimum Bounding Rectangles (MBRs) of the vertices.

After building the layout models for the page in the two browsers, X-PERT performs a heuristic-based matching to find corresponding vertices in the two layout models. It then systematically compares each edge in the matched set of vertices to find discrepancies in the edge attributes. If a discrepancy is found, X-PERT reports this as a layout XBI, represented by a tuple of the form ⟨label, ⟨e1, e2⟩⟩. Here e1 and e2 are the XPaths of the two HTML elements of the page that are rendered differently in the two browsers, and label is the discrepant edge attribute that denotes the layout position of e1 and e2 in one browser that was violated in the other browser. For example, ⟨top-align, e1, e2⟩ indicates that e1 is pinned to the top edge of e2 in one browser, but not in the other browser.

In the evaluation on 14 real-world subject web applications, X-PERT was found to be highly accurate, with 76% precision and 95% recall, on average. X-PERT also provided significantly better results when compared with the state-of-the-art XBI detection tool, CrossCheck [59], which had an average precision of 18% and recall of 83%.

4.2.2 Repair of Layout XBIs

For repairing layout XBIs in web pages, I propose a novel automated approach using search-based techniques based on *Fix. The placement of a web page's UI elements is controlled by the page's HTML elements and CSS properties. Therefore, to resolve the XBIs, my approach attempts to find new values for CSS properties that can make the faulty appearance match the correct appearance as closely as possible. In the remainder of this section, I discuss the overall algorithm of my approach, followed by details about each of the different stages in the approach.

Formally, XBIs are due to one or more HTML-based root causes. A root cause is a tuple ⟨e, p, v⟩, where e is an HTML element in the page, p is a CSS property of e, and v is the value of p. Given a set of XBIs X for a page PUT and a set of potential root causes, my approach seeks to find a set of fixes that resolve the XBIs in X. I define a fix as a tuple ⟨r, v′⟩, where r is a root cause and v′ is the suggested new value for p in the root cause r. A set of XBI-resolving fixes is referred to as a repair. The sketch below renders these definitions as simple data types.
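For illustration, the root cause and fix tuples defined above can be written as plain Java records. The field types here are assumptions made for the sketch, and the example repair in the comment mirrors the IncredibleIndia fix from Section 4.1.

// A root cause ⟨e, p, v⟩: an element, one of its CSS properties, and its value.
record RootCause(String elementXPath,  // e: the HTML element, identified by XPath
                 String cssProperty,   // p: a CSS property of e
                 String value) {}      // v: the current value of p

// A fix ⟨r, v′⟩: a root cause together with a suggested new value for p.
record Fix(RootCause rootCause,        // r: the root cause being adjusted
           String newValue) {}         // v′: the suggested new value for p

// A repair is simply a set of XBI-resolving fixes, e.g. (values hypothetical):
// Set<Fix> repair = Set.of(new Fix(
//     new RootCause("/html/body/div[3]/div/div", "margin-top", "0px"), "20px"));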
The approach generates repairs using guided search-based techniques [80, 60]. Two aspects of the XBI repair problem motivate this choice of technique. The first is that the number of possible repairs is very large, since there can be multiple XBIs present in a page, each of which may have several root causes, and for which the relevant CSS properties range over a large set of possible values. Second, fixes made for one particular XBI may interfere with those for another, or a fix for any individual XBI may itself cause additional XBIs, requiring a tradeoff to be made among possible fixes. Search-based techniques are ideal for this type of problem because they can explore large solution spaces intelligently and efficiently, while also identifying solutions that effectively balance a number of competing constraints. Furthermore, the visual manifestation of XBIs also lends itself to quantification via a fitness function, which is a necessary element for a search-based technique. A fitness function computes a numeric assessment of the "closeness" of candidate solutions found during the search to the solution ultimately required. My insight is that a good fitness function can be built that leverages a measurement of the number of XBIs detected in a PUT, by using well-known XBI detection techniques, and the similarity of the layout of the PUT when rendered in the reference and test browsers, by comparing the sizes and positions of the bounding boxes of the HTML elements involved in each identified XBI.

The approach works by first detecting XBIs in a page and identifying a set of possible root causes for those XBIs. Then the approach utilizes two phases of guided search to find the best repair. The first search takes the CSS property of each root cause and tries to find a new value for it that is most optimal with respect to the fitness function. This optimized property value is referred to as a candidate fix. The second search then seeks to find an optimal combination of the candidate fixes identified in the first phase. This additional search is necessary since not all candidate fixes may be required, as the CSS properties involved may have duplicate or competing effects. For instance, the CSS properties margin-top and padding-top may both be identified as root causes for an XBI, but can be used to achieve similar outcomes, meaning that only one may actually need to be included in the repair. Conversely, other candidate fixes may be required to be used in combination with one another to fully resolve an XBI. For example, an HTML element may need to be adjusted for both its width and height. Furthermore, candidate fixes produced for one XBI may have knock-on effects on the results of candidate fixes for other XBIs, or even introduce additional and unwanted XBIs.
By searching through different combinations of candidate fixes, the second search aims to produce a suitable subset, a repair, that resolves as many XBIs as possible for a page when applied together. I now introduce the steps of this approach in more detail, beginning with an overview of the complete algorithm.

4.2.2.1 Overall Algorithm

The top level algorithm of the approach is shown by Algorithm 1. Four inputs are required. The first input is the page under test, PUT, which exhibits XBIs. The PUT is obtained via a URL that points to a location on the file system or network that provides access to all of the necessary HTML, CSS, Javascript, and media files for rendering PUT. The second input is the reference browser, R, that shows the correct rendering of PUT. The third input is the test browser, T, in which the rendering of PUT shows XBIs with respect to R. The fourth input is a detection and localization function, DL, that can identify and report XBIs. For this input, I use the X-PERT tool [137] (summarized in Section 4.2.1). The output of the approach is a page, PUT′, a repaired version of PUT.

The overall algorithm, shown as Algorithm 1, comprises five stages, as shown by the overview diagram in Figure 4.2. The diagram also shows the instantiations of the different abstraction points (AP1-4) of the *Fix approach.

[Figure 4.2: Overview of the XFix approach for repairing layout XBIs]

Algorithm 1: Overall Algorithm
Input:  PUT: Web page under test
        R: Reference browser
        T: Test browser
        DL: Detection and Localization function for XBIs
Output: PUT′: Modified PUT with repair applied
 1: /* Stage 1 - Initial XBI Detection */
 2: X ← DL(PUT, R, T)
 3: DOM_R ← buildDOMTree(PUT, R)
 4: DOM_T ← buildDOMTree(PUT, T)
 5: while true do
 6:   /* Stage 2 - Extract root causes */
 7:   rootCauses ← {}
 8:   for each ⟨label, ⟨e1, e2⟩⟩ ∈ X do
 9:     props ← getCSSProperties(label)
10:     for each p ∈ props do
11:       v1 ← getValue(e1, p, DOM_T)
12:       rootCauses ← rootCauses ∪ ⟨e1, p, v1⟩
13:       v2 ← getValue(e2, p, DOM_T)
14:       rootCauses ← rootCauses ∪ ⟨e2, p, v2⟩
15:     end for
16:   end for
17:   /* Stage 3 - Search for Candidate Fixes */
18:   candidateFixes ← {}
19:   for each ⟨e, p, v⟩ ∈ rootCauses do
20:     candidateFix ← searchForCandidateFix(⟨e, p, v⟩, PUT, DOM_R, T)
21:     candidateFixes ← candidateFixes ∪ candidateFix
22:   end for
23:   /* Stage 4 - Search for Best Combination of Candidate Fixes */
24:   repair ← searchForBestRepair(candidateFixes, PUT, R, T)
25:   /* Stage 5 - Check Termination Criteria */
26:   PUT′ ← applyRepair(PUT, repair)
27:   X′ ← DL(PUT′, R, T)
28:   if X′ = ∅ or X′ = X then
29:     return PUT′
30:   else if |X′| > |X| then
31:     return PUT
32:   else
33:     X ← X′
34:     PUT ← PUT′
35:     DOM_T ← buildDOMTree(PUT′, T)
36:   end if
37: end while

Stage 1: Initial XBI Detection

The initial part of the algorithm (lines 1-4) involves obtaining the set of XBIs X when PUT is rendered in R and T. To identify XBIs, I use the input DL function, the X-PERT tool [137], which is represented by the "DL" function called on line 2. X-PERT returns a set of identified XBIs, X, in which each XBI is represented by a tuple of the form ⟨label, ⟨e1, e2⟩⟩, where e1 and e2 are the XPaths of the two HTML elements of the PUT that are rendered differently in T versus R, and label is a descriptor that denotes the original (correct) layout position of e1 that was violated in T. For example, ⟨top-align, e1, e2⟩ indicates that e1 is pinned to the top edge of e2 in R, but not in T. Lines 3 and 4 capture the DOM trees of the PUT as rendered in each browser; the sketch below illustrates this kind of browser-based layout capture. After identifying the XBIs, the algorithm then enters its main loop, which comprises Stages 2-5.
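The following minimal Java sketch illustrates the kind of layout capture behind buildDOMTree, using the Selenium WebDriver API that the XFix implementation also relies on (Section 4.3.1). The local URL is hypothetical, and the flat printout of bounding rectangles stands in for building the actual DOM tree model.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

// Render the PUT in a browser and read back each element's minimum
// bounding rectangle, the raw data from which the layout model is built.
public class LayoutCapture {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();  // e.g., the test browser T
        try {
            driver.get("http://localhost/put.html");  // hypothetical local PUT
            for (WebElement e : driver.findElements(By.cssSelector("*"))) {
                // getLocation()/getSize() together give the bounding box
                System.out.printf("%s @ (%d,%d) %dx%d%n",
                        e.getTagName(),
                        e.getLocation().getX(), e.getLocation().getY(),
                        e.getSize().getWidth(), e.getSize().getHeight());
            }
        } finally {
            driver.quit();  // always release the browser instance
        }
    }
}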
Stage 2: Extract Root Causes

The second stage of the algorithm (lines 7-16) extracts the root causes relevant to each XBI. The key step in this stage identifies CSS properties relevant to the XBI's label (shown as "getCSSProperties" at line 9). For example, for the top-align label, the CSS properties margin-top and top can alter the top alignment of an element with respect to another and would therefore be identified in this stage. I identified this mapping through analysis of the CSS properties, and it holds true for all web applications without requiring developer intervention. Each relevant CSS property forms the basis of two root causes, one for e1 and one for e2. These are added to the running set rootCauses, with the values of the CSS properties for each element (v1 and v2, respectively) extracted from the DOM of the PUT when it is rendered in T (lines 11 and 13).

Stage 3: Search for Candidate Fixes

Comprising the first phase search, this stage produces individual candidate fixes for each root cause (lines 18-22). The fix is a new value for the CSS property that is optimized according to a fitness function, with the aim of producing a value that resolves, or is as close as possible to resolving, the layout deviation. This optimization process occurs in the "searchForCandidateFix" procedure, which is described in detail in Section 4.2.2.2.

Stage 4: Search for the Best Combination of Candidate Fixes

Comprising the second phase search, the algorithm makes a call to the "searchForBestRepair" procedure (line 24) that takes the set of candidate fixes in order to find a subset, repair, representing the best overall repair. This procedure is described in Section 4.2.2.3.

Stage 5: Check Termination Criteria

The final stage of the algorithm (lines 26-36) determines whether the algorithm should terminate or proceed to another iteration of the loop and two-phase search. Initially, the fixes in the set repair are applied to a copy of PUT by adding test browser (T) specific CSS code to produce a modified version of the page, PUT′ (line 26). The approach identifies the set of XBIs, X′, for PUT′, using the technique explained in Section 4.2.1, which is represented by the "DL" function called on line 27. Ideally, all of the XBIs in PUT will have been resolved by this point, and X′ will be empty. If this is the case, the algorithm returns the repaired page PUT′. If the set X′ is identical to the original set of XBIs X (originally determined on line 2), the algorithm has made no improvement in this iteration of the algorithm, and so PUT′ is returned, having potentially only been partially fixed as a result of the algorithm rectifying a subset of XBIs in a previous iteration of the loop. If the number of XBIs has increased, the current repair introduces further layout deviations. In this situation, PUT is returned (which may reflect partial fixes from a previous iteration of the loop, if there were any). However, if the number of XBIs has been reduced, the current repair represents an improvement that may be improved further in another iteration of the algorithm. (A sketch summarizing this decision logic appears below.)

Broadly, there are two scenarios under which the approach could fail: (1) X-PERT does not initially include the faulty HTML element in X; or (2) the search does not identify an acceptable fix, which could happen due to the non-determinism of the search.
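The Stage 5 decision logic can be summarized by the following minimal Java sketch. The Set-based XBI representation is a simplification of the ⟨label, ⟨e1, e2⟩⟩ tuples reported by the DL function.

import java.util.Set;

// Mirrors lines 26-36 of Algorithm 1: decide whether to return the repaired
// page, fall back to the (possibly partially fixed) input page, or iterate.
class TerminationCheck {
    enum Outcome { RETURN_REPAIRED, RETURN_ORIGINAL, CONTINUE }

    static Outcome decide(Set<?> previousXbis, Set<?> newXbis) {
        if (newXbis.isEmpty() || newXbis.equals(previousXbis)) {
            return Outcome.RETURN_REPAIRED;  // all fixed, or no improvement made
        }
        if (newXbis.size() > previousXbis.size()) {
            return Outcome.RETURN_ORIGINAL;  // the repair introduced new XBIs
        }
        return Outcome.CONTINUE;             // fewer XBIs: run another iteration
    }
}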
4.2.2.2 Search for Candidate Fixes

The first search phase (represented as the procedure "searchForCandidateFix") focuses on each potential root cause ⟨e, p, v⟩ in isolation of the other root causes, and attempts to find a new value v′ for the root cause that improves the similarity of the page when rendered in the reference browser R and the test browser T. Guidance to this new value is provided by a fitness function that quantitatively compares the relative layout discrepancies between e and the elements that surround it when PUT is rendered in R and T. I begin by giving an overview of the search algorithm used, and then explain the fitness function employed.

Search Algorithm

The inputs to the search for a candidate fix are the page under test, PUT, the test browser, T, the DOM tree from the reference browser, DOM_R, and the root cause tuple, ⟨e, p, v⟩. The search attempts to find a new value, v′, for p in the root cause. The search process used to do this is inspired by the variable search component of the Alternating Variable Method (AVM) [82, 84], and specifically the use of "exploratory" and "pattern" moves to optimize variable values. The aim of exploratory moves is to probe values neighboring the current value of v to find one that improves fitness when evaluated with the fitness function. Exploratory moves involve adding small delta values (i.e., -1 and +1) to v and observing the impact on the fitness score. If the fitness is observed to be improved, pattern moves are made in the same "direction" as the exploratory move to accelerate further fitness improvements through step sizes that increase exponentially. If a pattern move fails to improve fitness, the method establishes a new direction from the current point in the search space through further exploratory moves. If exploratory moves fail to yield a new direction (i.e., a local optimum has been found), this value is returned as the best candidate fix value. The fix tuple, ⟨e, p, v, v′⟩, is then returned to the main algorithm (line 20 of Algorithm 1). A sketch of these moves follows.
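The following minimal Java sketch illustrates exploratory and pattern moves over a single numeric CSS value. The caller-supplied fitness function (lower is better) and the unit step size are simplifying assumptions; the actual implementation evaluates fitness by re-rendering the page.

import java.util.function.DoubleUnaryOperator;

// AVM-style optimization of one variable: probe v-1 and v+1 (exploratory
// moves); if one improves fitness, accelerate in that direction with
// exponentially growing steps (pattern moves) until fitness stops improving.
class CandidateFixSearch {
    static double optimize(double v, DoubleUnaryOperator fitness) {
        double best = fitness.applyAsDouble(v);
        while (true) {
            int direction = 0;
            // Exploratory moves: probe the immediate neighbors of v.
            for (int d : new int[]{-1, 1}) {
                if (fitness.applyAsDouble(v + d) < best) { direction = d; break; }
            }
            if (direction == 0) {
                return v;  // local optimum: neither neighbor improves fitness
            }
            // Pattern moves: step sizes double while fitness keeps improving.
            int step = direction;
            while (fitness.applyAsDouble(v + step) < best) {
                v += step;
                best = fitness.applyAsDouble(v);
                step *= 2;
            }
            // Pattern move failed to improve: re-explore from the new point.
        }
    }
}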
Fitness Function

The fitness function for producing a candidate fix is shown by Algorithm 2.

Algorithm 2: Fitness Function for Candidate Fixes
Input:  e: XPath of HTML element under analysis
        p: CSS property of HTML element e
        v^: Value of CSS property p
        PUT: Web page under test
        DOM_R: DOM tree of PUT rendered in R
        T: Test browser
Output: fitness: Fitness value of the hypothesized fix ⟨e, p, v^⟩
 1: PUT^ ← applyValue(e, p, v^, PUT)
 2: DOM_T ← buildDOMTree(PUT^, T)
 3: /* Component 1 - Difference in location of e with respect to R and T */
 4: ⟨x_t1, y_t1, x_t2, y_t2⟩ ← getBoundingBox(DOM_T, e)
 5: ⟨x_r1, y_r1, x_r2, y_r2⟩ ← getBoundingBox(DOM_R, e)
 6: D_TL ← sqrt((x_t1 − x_r1)^2 + (y_t1 − y_r1)^2)
 7: D_BR ← sqrt((x_t2 − x_r2)^2 + (y_t2 − y_r2)^2)
 8: Δpos ← D_TL + D_BR
 9: /* Component 2 - Difference in size of e with respect to R and T */
10: width_R ← x_r2 − x_r1
11: width_T ← x_t2 − x_t1
12: height_R ← y_r2 − y_r1
13: height_T ← y_t2 − y_t1
14: Δsize ← |width_R − width_T| + |height_R − height_T|
15: /* Component 3 - Differences in locations of neighboring elements of e */
16: neighbors_T ← getNeighbors(e, DOM_T, N_r)
17: Δnpos ← 0
18: for each n ∈ neighbors_T do
19:   n′ ← getMatchingElement(n, DOM_R)
20:   ⟨x_t1, y_t1, x_t2, y_t2⟩ ← getBoundingBox(DOM_T, n)
21:   ⟨x_r1, y_r1, x_r2, y_r2⟩ ← getBoundingBox(DOM_R, n′)
22:   D_TL ← sqrt((x_t1 − x_r1)^2 + (y_t1 − y_r1)^2)
23:   D_BR ← sqrt((x_t2 − x_r2)^2 + (y_t2 − y_r2)^2)
24:   Δpos ← D_TL + D_BR
25:   Δnpos ← Δnpos + Δpos
26: end for
27: /* Compute final fitness value */
28: fitness ← (w_1 · Δpos) + (w_2 · Δsize) + (w_3 · Δnpos)
29: return fitness

The goal of the fitness function is to quantify the relative layout deviation for PUT when rendered in R and T following the change to the value of a CSS property for an HTML element. Given the element e in PUT, the fitness function considers three components of layout deviation between the two browsers: (1) the difference in the location of e; (2) the difference in the size of e; and (3) any differences in the locations of e's neighbors. Figure 4.3 shows a diagrammatic representation of these components. Rectangles with a solid background correspond to the bounding boxes of elements rendered in R and the rectangles with diagonal lines correspond to the bounding boxes of elements rendered in T. Intuitively, all three components should be minimized as the evaluated fixes make progress towards resolving an XBI without introducing any new differences or introducing further XBIs for e's neighbors. The fitness function for an evaluated fix is the weighted sum of these three components.

The first component, the location difference of e, is computed by lines 3-8 of Algorithm 2 and assigned to the variable Δpos. This value is calculated as the sum of the Euclidean distances between the top-left (TL) and bottom-right (BR) corners of the bounding box of e when it is rendered in R and T. The bounding box is obtained from the DOM tree of the page for each browser.

The second component, the difference in size of e, is calculated by lines 10-14 of the algorithm and assigned to the variable Δsize. The value is calculated as the sum of the differences of e's width and height when rendered in R and T. The size information is obtained from the bounding box of e obtained from the DOM tree of the page in each browser.

The third and final component of the fitness function, finding the location difference of e's neighbors, occurs on lines 16-26 of the algorithm and is assigned to the variable Δnpos. The neighbors of e are the set of HTML elements that are within N_r hops from e in PUT's DOM tree as rendered in T. For example, if N_r = 1, then the neighbors of e are its parent and children. If N_r = 2, then the neighbors are its parent, children, siblings, grandparent, and grandchildren. For each neighbor, the approach finds its corresponding element in the DOM tree of PUT rendered in R and calculates Δpos for each pair of elements. The final fitness value is then formed from the weighted sum of the three components Δpos, Δsize, and Δnpos (line 28).

[Figure 4.3: Fitness function components of the XBI repair approach. (a) Component 1: Δpos = D_TL + D_BR, where D_TL and D_BR are the Euclidean distances between the top-left (TL) and bottom-right (BR) corners, respectively, of e rendered in R and T; Δpos decreases as the boxes move closer. (b) Component 2: Δsize = |w_R − w_T| + |h_R − h_T|, where w_R and h_R are the width and height of e rendered in R, and w_T and h_T are the width and height of e rendered in T; Δsize decreases as the boxes become similar in size. (c) Component 3: Δnpos = D_TL + D_BR, where D_TL and D_BR are the Euclidean distances between the top-left (TL) and bottom-right (BR) corners, respectively, of e's neighbor n rendered in R and T; Δnpos decreases as e's boxes move closer, which causes n's boxes to also move closer.]
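For illustration, the core of Algorithm 2 can be expressed compactly in Java. The sketch below uses java.awt.Rectangle as a stand-in for an element's bounding box and assumes the neighbor arrays are already matched pairwise; it is not the actual XFix implementation.

import java.awt.Rectangle;

// Weighted layout-deviation fitness in the style of Algorithm 2: corner
// distances for position, absolute width/height differences for size, and
// the summed corner distances of matched neighbors.
class LayoutFitness {
    // Component helper: D_TL + D_BR for one element's boxes in T and R.
    static double cornerDistance(Rectangle inTest, Rectangle inRef) {
        double dTL = Math.hypot(inTest.x - inRef.x, inTest.y - inRef.y);
        double dBR = Math.hypot((inTest.x + inTest.width) - (inRef.x + inRef.width),
                                (inTest.y + inTest.height) - (inRef.y + inRef.height));
        return dTL + dBR;
    }

    static double fitness(Rectangle eTest, Rectangle eRef,
                          Rectangle[] neighborsTest, Rectangle[] neighborsRef,
                          double w1, double w2, double w3) {
        double pos = cornerDistance(eTest, eRef);             // Component 1: Δpos
        double size = Math.abs(eRef.width - eTest.width)      // Component 2: Δsize
                    + Math.abs(eRef.height - eTest.height);
        double npos = 0;                                      // Component 3: Δnpos
        for (int i = 0; i < neighborsTest.length; i++) {
            npos += cornerDistance(neighborsTest[i], neighborsRef[i]);
        }
        return w1 * pos + w2 * size + w3 * npos;              // weighted sum (line 28)
    }
}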
4.2.2.3 Search for the Best Combination of Candidate Fixes

The goal of the second search phase (represented by a call to "searchForBestRepair" at line 24 of Algorithm 1) is to identify a subset of candidateFixes that together minimize the number of XBIs reported for the PUT. This step is included in the approach for two reasons. Firstly, a fix involving one particular CSS property may only be capable of partially resolving an XBI and may need to be combined with another fix to fully address the XBI. Furthermore, the interaction of certain fixes may have emergent effects that result in further unwanted layout problems. For example, suppose a submit button element appears below, rather than to the right of, a text box. Candidate fixes will address the layout problem for each HTML element individually, attempting to move the text box down and to the left, and the button up and to the right. Taking these fixes together will result in the button appearing at the top right corner of the text box, rather than next to it. Identifying a selection of fixes, a candidate repair, that avoids these issues is the goal of this phase.

To guide this search, I use the number of XBIs that appear in the PUT after the candidate repair has been applied. The search begins by evaluating a candidate repair with a single fix, the candidate fix that in the first search phase produced the largest fitness improvement. Assuming this does not eradicate all XBIs, the search continues by generating new candidate repairs in a biased random fashion. Candidate repairs are produced by iterating through the set of fixes. A fix is included in the repair with probability imp_x / imp_max, where imp_x is the improvement observed in the fitness score when the fix was evaluated in the first search phase and imp_max is the maximum improvement observed over all of the fixes in candidateFixes. Each candidate repair is evaluated for fitness in terms of the number of resulting XBIs, with the best repair retained. A history of evaluated repairs is maintained, so that any repeat solutions produced by the biased random generation algorithm are not re-evaluated. The random search terminates when (a) a candidate repair is found that fixes all XBIs, (b) a maximum threshold of candidate repairs to be tried has been reached, or (c) the algorithm has produced a sequence of candidate repairs with no improvement in fitness. A sketch of this biased random generation follows.
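A minimal Java sketch of this biased random generation follows. It simplifies the actual procedure (for example, it does not seed the search with the single best candidate fix or check for a no-improvement sequence), and the Fix record and the improvement scores carried over from the first search phase are assumptions.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.ToIntFunction;

// Each fix is included in a candidate repair with probability imp_x/imp_max;
// candidates are scored by the number of XBIs remaining after application,
// and previously evaluated subsets are skipped via a history set.
class RepairCombinationSearch {
    record Fix(String element, String property, String newValue) {}

    static List<Fix> search(List<Fix> fixes, double[] improvements,
                            ToIntFunction<List<Fix>> countXbis, int maxTries) {
        double impMax = 0;
        for (double imp : improvements) impMax = Math.max(impMax, imp);
        if (impMax <= 0) impMax = 1;  // degenerate case: avoid division by zero
        Random rng = new Random();
        Set<List<Fix>> seen = new HashSet<>();  // history of evaluated repairs
        List<Fix> best = List.of();
        int bestXbis = countXbis.applyAsInt(best);
        for (int t = 0; t < maxTries && bestXbis > 0; t++) {
            List<Fix> candidate = new ArrayList<>();
            for (int i = 0; i < fixes.size(); i++) {
                if (rng.nextDouble() < improvements[i] / impMax) {
                    candidate.add(fixes.get(i));  // include with prob imp_x/imp_max
                }
            }
            if (!seen.add(candidate)) continue;  // repeat subset: skip re-evaluation
            int xbis = countXbis.applyAsInt(candidate);
            if (xbis < bestXbis) { bestXbis = xbis; best = candidate; }
        }
        return best;
    }
}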
4.3 Evaluation

I conducted empirical experiments to assess the effectiveness and efficiency of my repair approach for resolving XBIs, with the aim of answering the following four research questions:

RQ1: How effective is the approach at reducing layout XBIs?
RQ2: What is the impact on the cross-browser consistency of the page when the suggested repairs are applied?
RQ3: How long does the approach take to find repairs?
RQ4: How similar in size are the approach generated repair patches to the browser-specific code present in real-world websites?

4.3.1 Implementation

I implemented the approach in a prototype tool in Java, named "XFix" [69]. XFix is a standalone tool that exposes a simple API to specify inputs and run the repair technique. To facilitate easy management of third-party dependencies, XFix is packaged as a Maven project. XFix can be run on any platform, such as Windows, Linux, and macOS. Since XFix analyzes the client side code of the page under test, it is agnostic to the server side technology used.

[Figure 4.4: High-level overview of the XFix tool]

Figure 4.4 shows a high-level overview of XFix with the different stages explained in Section 4.2.2 shown in italics. XFix accepts the page under test (PUT) input in the form of a URL pointing to the location of the HTML page on the file system where all the CSS, Javascript, and media necessary for rendering the PUT can be accessed. For R and T, XFix provides an option to select a browser from XFix's supported set of browsers. XFix currently supports all the versions of the three most widely used browsers, Firefox, Chrome, and IE. More browsers can be easily added to XFix by including the browser's standalone server implementation of the Selenium WebDriver's wire protocol. Note that R and T need to be pre-installed on the user's computer.

I leveraged Javascript and the Selenium WebDriver library for making dynamic changes to web pages, such as applying candidate fix values. For identifying the set of layout XBIs, I used the latest publicly available version [58] of the well-known XBI detection tool, X-PERT [137, 139]. I made minor changes to the publicly available version to fix bugs and add accessor methods for data structures. I used this modified version throughout the rest of the evaluation. Details of the changes made to X-PERT can be found on the XFix project page (https://github.com/sonalmahajan/xfix).

The fitness function parameters for the search for candidate fixes discussed in Section 4.2.2.2 are set as: N_r = 2, and w_1 = 1, w_2 = 2, and w_3 = 0.5 for the weights for Δpos, Δsize, and Δnpos, respectively. (The weights assigned prioritize Δsize, Δpos, and Δnpos, in that order. I deemed the size of an element as most important, because of its likely impact on all three components, followed by location, which is likely to impact its neighbors.) For the termination conditions (b) and (c) of the search for the best combination of candidate fixes (Section 4.2.2.3), the maximum threshold value is set to 50 and the sequence value is set to 10.

1. @-moz-document url-prefix() {
2.   html > body > div:nth-of-type(3) > div:nth-of-type(1) > div:nth-of-type(1) {
3.     margin-top: 1.7% !important; /* 20px */
4.   }
5. }

Figure 4.5: Example of repair.css generated by XFix

Upon termination, XFix generates a repair.css file (shown in Figure 4.5 for the IncredibleIndia example of Figure 4.1) containing the repair patch and modifies the PUT file to include repair.css in the <head> section of the HTML of the page. Note that a new repair.css is created with a timestamp appended to the name for every run of XFix to effectively resolve XBIs across different test browsers for the PUT.

XFix generates the repair.css as follows. First, XFix adds a browser specific qualifier corresponding to the test browser (e.g., -moz for Firefox) as shown in line 1 of Figure 4.5. Such qualifiers direct the layout engine to use the provided alternate values for the CSS property when it is rendered on a specific browser. For example, the repair patch shown in Figure 4.5 is only applied if the browser type is Firefox, and is ignored by the layout engines of other browsers. Then, for each fix tuple ⟨e, p, v, v′⟩ in the repair set identified in Stage 4, XFix converts the XPath of the element e to a CSS selector and adds it to the browser specific qualifier block. For example, line 2 shows the CSS selector derived from the XPath /html/body/div[3]/div/div. XFix then converts the fix value v′, which is an absolute value (e.g., margin-top:20px), to a relative fix value with respect to an element's parent's dimensions, such as margin-top:1.7%. XFix then adds this relative fix value for the CSS property p (line 3). The sketch below illustrates this style of patch emission.
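The following minimal Java sketch illustrates emitting a browser-specific repair rule in the style of Figure 4.5. The selector string, the percentage computation from a hypothetical parent dimension, and the Firefox-only qualifier handling are simplified stand-ins for what XFix actually does.

// Convert an absolute pixel fix into a percentage of the parent's dimension
// and wrap the resulting rule in a browser-specific qualifier block.
class RepairCssWriter {
    static String emit(String browser, String cssSelector, String property,
                       double fixPx, double parentPx) {
        double relative = 100.0 * fixPx / parentPx;  // absolute px -> % of parent
        String rule = String.format("%s {%n  %s: %.1f%% !important; /* %.0fpx */%n}",
                cssSelector, property, relative, fixPx);
        if (browser.equals("firefox")) {
            // The -moz qualifier makes Firefox apply the rule; other layout
            // engines ignore the whole block.
            return "@-moz-document url-prefix() {\n" + rule + "\n}";
        }
        return rule;  // other browsers would need their own qualifier syntax
    }
}

For example, a 20px fix on a parent element roughly 1,176px wide (a hypothetical value) would yield the 1.7% rule shown in Figure 4.5.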
4.3.2 Subjects

For the evaluation, 15 real-world subjects were used, as listed in Table 4.1. The columns labeled "#HTML" and "#CSS" report the total number of HTML elements present in the DOM tree of a subject and the total number of CSS properties defined for the HTML elements in the page, respectively. These metrics of size give an estimate of a page's complexity in debugging and finding potential fixes for the observed XBIs. The "Ref" column indicates the reference browser in which the subject displays the correct layout, while the column "Test" refers to the browser in which the subject shows a layout XBI. In these columns, "CH", "FF", and "IE" refer to the Chrome, Firefox, and Internet Explorer browsers, respectively.

I collected the subjects from three sources: (1) websites used in the evaluation of X-PERT [137], (2) my prior interaction with websites exhibiting XBIs, and (3) the random URL generator, UROULETTE [39]. The "GrantaBooks" subject came from the first source. The other subjects from X-PERT's evaluation could not be used because their GUI had been reskinned or the latest version of the IE browser now rendered the pages correctly. The "HotwireHotel" subject was chosen from the second source, and the remaining thirteen subjects were gathered from the third source.

The goal of the selection process was to select subjects that exhibited human perceptible layout XBIs. I did not use X-PERT for an initial selection of subjects because it was found that it reported many subjects with XBIs that were difficult to observe. For selecting the subjects, I used the following process: (1) render the page, PUT, in the three browser types; (2) visually inspect the rendered PUT in the three browsers to find layout XBIs; (3) if layout XBIs were found in the PUT, select the browser showing a layout problem, such as overlapping, wrapping, or distortion of content, as the test browser, and one of the other two browsers showing the correct rendering as the reference browser; (4) try to manually fix the PUT by using the developer tools in browsers, such as Firebug for Firefox, and record the HTML elements to which the fix was applied; (5) run X-PERT on the PUT with the selected reference and test browsers; and (6) use the PUT as a subject if the manually recorded fixed HTML elements were present in the set of elements reported by X-PERT. I included steps 4-6 in the selection process to ensure that if X-PERT reported false negatives, they would not bias the evaluation results.
Table 4.1: Subjects used in the evaluation of XFix

Name            | URL                                  | #HTML | #CSS   | Ref | Test
BenjaminLees    | http://www.benjaminlees.com          |   317 |  1,525 | CH  | FF
Bitcoin         | https://bitcoin.org/en/              |   207 |  1,957 | FF  | IE
Eboss           | http://www.e-boss.gr                 |   439 |    789 | IE  | FF
EquilibriumFans | http://www.equilibriumfans.com       |   340 |    868 | CH  | FF
GrantaBooks     | http://grantabooks.com               |   325 |  6,545 | FF  | IE
HenryCountyOhio | http://www.henrycountyohio.com       |   300 |    983 | IE  | FF
HotwireHotel    | https://goo.gl/pH9d6d                | 1,457 | 10,618 | FF  | IE
IncredibleIndia | http://incredibleindia.org           |   251 |  2,172 | IE  | FF
Leris           | http://clear.uconn.edu/leris/        |   195 |  1,262 | FF  | CH
Minix3          | http://www.minix3.org                |   118 |    821 | IE  | CH
Newark          | http://www.ci.newark.ca.us           |   598 | 17,426 | FF  | IE
Ofa             | http://www.ofa.org                   |   578 |  5,381 | IE  | CH
PMA             | http://www.pilatesmethodalliance.org |   456 | 10,159 | FF  | IE
StephenHunt     | http://stephenhunt.net               |   497 | 13,743 | FF  | IE
WIT             | http://www.wit.edu                   |   300 |  3,249 | FF  | IE

4.3.3 Methodology

For the experiments, the latest stable versions of the browsers, Mozilla Firefox 46.0.1, Internet Explorer 11.0.33, and Google Chrome 51.0, were used. These browsers were selected for the evaluation as they represent the top three most widely used desktop browsers [15, 35]. The experiments were run on a 64-bit Windows 10 machine with 32GB memory and a 3rd Generation Intel Core i7-3770 processor. Since the set of XBIs reported by X-PERT can vary based on screen resolution, I also report the test monitor setup, which had a resolution of 1920×1080 and a size of 23 inches. The subjects were rendered in the browsers with the browser viewport size set to the screen size.

Each subject was downloaded using the Scrapbook-X Firefox plugin and the wget utility, which download an HTML page along with all of the files (e.g., CSS, JavaScript, images, etc.) it needs to display. I then commented out portions of the JavaScript files and HTML code that made active connections with the server, such as Google Analytics, so that the subjects could be run locally in an offline mode. The downloaded subjects were then hosted on a local Apache web server. X-PERT was run on each of the subjects to collect the set of initial XBIs present in the page. XFix was then run 30 times on each of the subjects to mitigate non-determinism in the search, and the run time was measured in seconds. After each run of XFix on a subject, X-PERT was run on the repaired subject and the remaining number of reported XBIs, if any, was recorded.

I also conducted a human study with the aim of judging XFix with respect to the human-perceptible XBIs, and to gauge the change in the cross-browser consistency of the repaired page. The study involved 11 participants consisting of PhD and post-doctoral researchers whose field of study was Software Engineering. For the study, I first captured three screenshots of each subject page: (1) rendered in the reference browser, (2) rendered in the test browser before applying XFix's suggested repair, and (3) rendered in the test browser after applying the suggested fixes. I embedded these screenshots in HTML pages provided to the participants. I varied the order in which the before (pre-XFix) and after (post-XFix) versions were presented to participants, to minimize the influence of learning on the results, and referred to them in the study as version 1 and version 2 based on the order of their presentation. Each participant received a link to an online questionnaire and a set of printouts of the renderings of the page.
I instructed the participants to individually (i.e., without consultation) answer four questions per subject. The first question asked the participants to compare the reference and version 1 by opening them in different tabs of the same browser and to circle the areas of observed visual differences on the corresponding printout. The second question asked the participants to rate the similarity of version 1 and the reference on a scale of 0–10, where 0 represents no similarity and 10 means identical. Note that the similarity rating includes the participants' reaction to intrinsic browser differences as well, since they were not asked to exclude these. The third and fourth questions in the questionnaire were the same, but for version 2.

For RQ1, I used X-PERT to determine the initial number of XBIs in a subject and the average number of XBIs remaining after each of the 30 runs of XFix. From these numbers I calculated the reduction of XBIs as a percentage.

For RQ2, I classified the similarity rating results from the human study into three categories for each subject: (1) improved: the after similarity rating was higher than that of the before version, (2) same: the after and before similarity ratings were exactly the same, and (3) decreased: the after similarity rating was lower than that of the before version. The human study data can be found at the project website [69].

For RQ3, I collected the average total running times of XFix and of Stages 3 and 4, the search phases of the algorithm.

For RQ4, I compared the size, measured by the number of CSS properties, of browser-specific code found in real-world websites to that of the automatically generated repairs. I used size for comparing similarity because CSS has a simple structure and does not contain any branching or looping constructs. I used wget to download the homepages of 480 websites in the Alexa Top 500 Global Sites [10] and analyzed their CSS to find the number of websites containing browser-specific code. Twenty sites could not be downloaded as they pointed to URLs without UIs, for instance, the googleadservices.com and twimg.com web services. To find whether a website has browser-specific CSS, I parsed its CSS files using the CSS Parser tool [18] and searched for browser-specific CSS selectors, such as the one shown in Listing 4.1, based on well-known prefix declarations: -moz for Firefox, -ms for IE, and -webkit for Chrome. To calculate the size, I summed the numbers of CSS properties declared in each browser-specific selector. To establish a comparable size metric for each subject web page used with XFix, I added the size of each subject's previously existing browser-specific code for T, the test browser, to the average size of the repair generated for T.
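As an illustration of this measurement, the following is a simplified sketch that scans raw CSS text with a regular expression instead of the CSS Parser tool [18] used in the actual analysis. The class and method names are invented, and the regex deliberately ignores nested rules and other ways of declaring browser-specific styling, so it only approximates the reported procedure (and, like the study itself, yields a lower bound).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, regex-based sketch of the browser-specific CSS size metric
// described above; a real parser would handle nesting and edge cases.
public class BrowserSpecificCssCounter {

    // Matches a selector block whose selector text contains one of the
    // well-known prefixes, and captures the declarations inside its braces.
    private static final Pattern PREFIXED_BLOCK =
        Pattern.compile("[^{}]*(-moz|-ms|-webkit)[^{}]*\\{([^}]*)\\}");

    // Counts the CSS properties declared inside browser-specific selectors:
    // each "property: value;" declaration counts as one.
    static int countBrowserSpecificProperties(String cssText) {
        int count = 0;
        Matcher m = PREFIXED_BLOCK.matcher(cssText);
        while (m.find()) {
            for (String decl : m.group(2).split(";")) {
                if (decl.contains(":")) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String css = "@-moz-document url-prefix() { div { margin-top: 1.7%; width: 50%; } }";
        System.out.println(countBrowserSpecificProperties(css)); // prints 2
    }
}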
4.3.4 Threats to Validity

External Validity: The first potential threat is that I used a manual selection of the subjects. To minimize this threat, I only performed a manual filtering of the subjects to ensure that the subjects showed human-perceptible XBIs and that X-PERT did not miss the observed XBIs (i.e., have a false negative). I also selected subjects from three different sources, including a random URL generator, to make the selection process generalizable across a wide variety of subjects. All the subjects had multiple XBIs reported by X-PERT (Table 4.2), and a mix of single (e.g., Bitcoin and IncredibleIndia) and multiple (e.g., HotwireHotel and GrantaBooks) human-observable XBIs. A second potential threat is the use of only three browsers. To mitigate this threat, I selected the three most widely used browsers, as reported by different commercial agencies studying browser statistics [35, 15]. Furthermore, my approach is not dependent on the choice of browsers, so the results should generalize to other browsers.

Internal Validity: One potential threat is the use of X-PERT. However, there are no other publicly available tools for detecting XBIs that report the level of detail required by XFix to produce repairs. A further threat is that the changes I made to X-PERT may have favored my approach. However, the changes made were either to provide access to existing information (and so do not change the XBI-identifying behavior) or to address specific bugs. An example of one of the defects I found was a mismatch in the data type of a DomNode object being checked to see if it is contained in an array of Strings specifying the HTML tags to be ignored. I corrected this defect by adding a call to the getTagName() method of the DomNode object, which returns the String HTML tag name of the node (a hypothetical reconstruction of this fix is sketched at the end of this section). I have made my patched version of X-PERT publicly available [69], with the download containing a README.txt file detailing the defects that were corrected.

The fact that my judgment was used to determine which browser rendering was the reference is not a threat to validity. This is because the metrics used were relative comparisons (e.g., consistency), and flipping the choice of reference rendering would have produced the same difference. Human participant understanding as to what constituted an XBI was not a threat to the correctness of the protocol either, since I only asked them to spot differences between the renderings.

A potential threat is the number of real-world (Alexa) websites found to be using browser-specific styling. There exist numerous other ways to declare browser-specific styling [6, 7] than the simple prefix selector declarations I used, and therefore the number of Alexa websites I found to be using browser-specific styling, and the browser-specific code sizes calculated for each, only represent a lower bound.

Construct Validity: A potential threat is that the similarity metric used in the human study is subjective. To mitigate this threat I used the relative similarity ratings given by the users, as opposed to the absolute values, to understand the participants' relative notion of consistency quality. A second potential threat to validity is that screenshots of the subjects were used in the human study instead of actual HTML pages. I opted for this mechanism as not all of the users had the required environment (OS and browsers). Also, to mitigate this threat I designed the HTML pages containing the screenshots to scale based on the width of the user's screen. Another potential threat is that the browser-specific code found in real-world (Alexa) websites might not necessarily be repair code for XBIs, so it might not be fair to compare it with XFix-generated repair patches. However, to the best of my knowledge the primary purpose of browser-specific code is to target a particular browser and ensure cross-browser consistency.
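To illustrate the shape of the X-PERT defect described under Internal Validity and of its one-line correction, the following hypothetical reconstruction is provided; the variable names are invented and DomNode is stubbed here as a stand-in for the DOM node type used by X-PERT (the actual corrected code is in the patched distribution [69]).

import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction of the X-PERT tag-filtering defect; names
// are invented, so this only illustrates the shape of the fix.
public class TagFilterFix {

    // Minimal stand-in for the DOM node type used by X-PERT.
    static class DomNode {
        private final String tagName;
        DomNode(String tagName) { this.tagName = tagName; }
        String getTagName() { return tagName; }
    }

    public static void main(String[] args) {
        List<String> ignoredTags = Arrays.asList("script", "style", "head");
        DomNode node = new DomNode("script");

        // Defective check: a DomNode is compared against a list of Strings,
        // so the membership test can never succeed.
        System.out.println(ignoredTags.contains(node));               // false

        // Corrected check: compare the node's String tag name instead.
        System.out.println(ignoredTags.contains(node.getTagName()));  // true
    }
}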
4.3.5 Discussion of Results

4.3.5.1 RQ1: Reduction of XBIs

Table 4.2 shows the results of RQ1. The results show that XFix reported an average 86% reduction in XBIs, with a median of 93%. This shows that XFix was effective in finding XBI fixes. Of the 15 subjects, XFix was able to resolve all of the reported XBIs for 33% of the subjects, and was able to resolve more than 90% of the XBIs for 67% of the subjects.

Table 4.2: Effectiveness of XFix in reducing XBIs

Subject  #Before XBIs  Avg. #After XBIs  Reduction (%)
BenjaminLees  25  0  100
Bitcoin  37  0  100
Eboss  49  29  41
EquilibriumFans  117  6  95
GrantaBooks  16  0  100
HenryCountyOhio  11  0  100
HotwireHotel  40  4  90
IncredibleIndia  20  12  40
Leris  13  0  100
Minix3  11  0.73  93
Newark  42  2  95
Ofa  16  3  83
PMA  39  10  75
StephenHunt  159  33  79
WIT  40  3  92
Mean  42  7  86
Median  37  3  93

I investigated the results to understand why XFix was not able to find suitable fixes for all of the XBIs. I found that the dominant reason for this was that there were pixel-level differences between the HTML elements in the test and reference browsers that were reported as XBIs. In many cases, perfect matching at the pixel level was not feasible due to the complex interaction among the HTML elements and CSS properties of a web page. Also, the different implementations of the layout engines of the browsers meant that a few pixel-level differences were unavoidable. After examining these cases, I hypothesized that these differences would not be human perceptible. To investigate this hypothesis, I inspected the user-marked printouts of the before and after versions from the human study. I filtered out the areas of visual differences that represented inherent browser-level differences, such as font styling, font face, and native button appearance, leaving only the areas corresponding to XBIs. I found that, for all but one subject, the majority of participants had correctly identified the areas containing layout XBIs in the before version of the page but had not marked the corresponding areas again in the after version. This indicated that the after version did not show the layout XBIs after they had been resolved by XFix. Overall, this analysis showed an average 99% reduction in the human-observable XBIs (median 100%), confirming my hypothesis that almost all of the remaining XBIs reported by X-PERT were not actually human observable.

4.3.5.2 RQ2: Impact on Cross-browser Consistency

I calculated the impact of XFix on the cross-browser consistency of a subject based on the user rating classifications: improved, same, or decreased. I found that 78% of the user ratings reported an improved similarity of the after version, implying that the consistency of the subject pages had improved with XFix's suggested fixes. 14% of the user ratings reported the consistency quality as same, and only 8% of the user ratings reported a decreased consistency. Figure 4.6 shows the distribution of the participant ratings for each of the subjects. As can be seen, all of the subjects, except two (Eboss and Leris), show a majority agreement among the participants in giving the verdict of improved cross-browser consistency. The improved ratings without considering Eboss and Leris rise to 85%, with the ratings for same and decrease dropping to 10% and 4%, respectively.

Figure 4.6: Similarity ratings given by participants in the XFix human study (per-subject counts of improvement, same, and decrease verdicts across the 11 participants)

I investigated the two outliers, Eboss and Leris, to understand the reason for the high discordance among the participants. I found that the reason for this disagreement was the significant number of inherent browser-level differences related to font styling and font face in the pages.
Both of the subject pages are text intensive and contain specific fonts that were rendered very differently by the respective reference and test browsers. In fact, I found that the browser-level differences were so dominant in these two subjects that some of the participants did not even mark the areas of layout XBIs in the before version. Since the approach does not suggest fixes for resolving inherent browser-level differences, the judgment of consistency was likely heavily influenced by these differences, thereby causing high disagreement among the users. To further quantify the impact of the intrinsic browser differences on participant ratings, I controlled for intrinsic differences, as discussed in Section 4.3.5.1. This controlled analysis showed a mean of 99% reduction in XBIs, a value consistent with the results in Table 4.2.

4.3.5.3 RQ3: Time Needed to Run XFix

Table 4.3 shows the average time results over the 30 runs for each subject. These results show that the total analysis time of XFix ranged from 43 seconds to 110 minutes, with a median of 14 minutes.

Table 4.3: XFix's average run time in seconds

Subject  Search for Candidate Fixes  Search for Best Combination  Total
BenjaminLees  159  14  204
Bitcoin  144  42  358
Eboss  1,729  780  2,685
EquilibriumFans  822  225  1,208
GrantaBooks  41  7  86
HenryCountyOhio  219  41  291
HotwireHotel  3,281  2,036  5,582
IncredibleIndia  599  247  908
Leris  105  46  169
Minix3  18  6  43
Newark  477  232  841
Ofa  122  113  257
PMA  3,050  1,384  4,488
StephenHunt  5,535  1,114  6,639
WIT  3,725  1,409  4,980
Mean  369  90  1,916
Median  194  48  841

The table also reports the time spent in the two search routines. The "searchForCandidateFix" procedure was found to be the most time consuming, taking up 67% of the total runtime, with "searchForBestRepair" occupying 32%. (The remaining 1% was spent in other parts of the overall algorithm, for example, the setup stage.) The time for the two search techniques was dependent on the size of the page and the number of XBIs reported by X-PERT. Although the runtime is lengthy for some subjects, it can be further improved via parallelization, as has been achieved in related work [85, 105].

4.3.5.4 RQ4: Similarity of Repair Patches to Real-world Websites' Code

My analysis of the 480 Alexa websites revealed that browser-specific code was present in almost 80% of the websites and was therefore highly prevalent. This indicates that the patch structure of XFix's repairs, which employs browser-specific CSS code blocks, follows a widely adopted practice of writing browser-specific code.

Figure 4.7 shows a box plot of the browser-specific code sizes observed in the Alexa websites and the XFix subjects. The boxes represent the distribution of browser-specific code size for the Alexa websites for each browser (i.e., Firefox (FF), Internet Explorer (IE), and Chrome (CH)), while the circles show the data points for the XFix subjects. In each box, the horizontal line and the upper and lower edges show the median and the upper and lower quartiles for the distribution of browser-specific code sizes, respectively. As the plot shows, the sizes of the browser-specific code reported for the Alexa websites and the XFix subjects are in a comparable range, with both reporting an average size of 9 CSS properties across all three browsers (Alexa: FF = 9, IE = 7, CH = 10; XFix: FF = 9, IE = 13, CH = 6).
Figure 4.7: Similarity of XFix-generated repair patches to real-world websites' code (box plots of browser-specific code size, 0 to 30 CSS properties, for the Alexa websites per browser FF/IE/CH, with the XFix subjects overlaid as individual points)

4.4 Conclusion

To summarize, in this chapter I introduced an approach, XFix, that addresses layout XBIs in web applications. The prerequisites for repair, Detection and Localization, are instantiated using the existing XBT technique, X-PERT [137]. Designing an algorithm to resolve XBIs is my contribution. My repair approach implemented in XFix uses two phases of guided search to find suitable fixes for the detected layout XBIs. The first phase of the search finds candidate fixes for each of the root causes identified for an XBI. The second phase then finds a subset of the candidate fixes that together minimizes the number of XBIs in the web page. In the evaluation performed on 15 real-world web pages, my repair approach was able to resolve, on average, 86% of the X-PERT reported layout XBIs and 99% of the human-observed XBIs. In a human study assessing the impact on the cross-browser consistency of the pages, 78% of the participant ratings reported improved consistency after XFix's suggested fixes were applied. The repair patches generated by my approach were comparable in size to the browser-specific code present in real-world websites. Overall, these evaluation results are strong and support the hypothesis of my dissertation by showing that this approach using search-based techniques is highly effective in repairing layout XBIs in web pages.

Chapter 5
MFix: Repair of Mobile Friendly Problems (MFPs)

Mobile devices have become one of the most common means of accessing the Internet. In fact, recent studies show that for a significant portion of web users, a mobile device is their primary means of accessing the Internet and interacting with other web-based services, such as online shopping, news, and communication [73, 35, 21, 29]. Unfortunately, many websites are not designed to gracefully handle users who are accessing their pages through a non-traditional sized device, such as a smartphone or tablet. These problematic sites may exhibit a range of usability issues, such as unreadable text, cluttered navigation, or content that overflows the device's viewport and forces the user to pan and zoom the page in order to access content. Such usability issues are collectively referred to as Mobile Friendly Problems (MFPs) [26, 13] and lead to a frustrating and poor user experience.

Despite the importance of MFPs, they are highly prevalent in modern websites: in a recent study, over 75% of users reported problems in accessing websites from their mobile devices [73]. Over one third of users also said that they abandon mobile-unfriendly websites and find other websites that work better on mobile devices. This underscores the importance for developers of ensuring the mobile friendliness of the web pages they design and maintain. Adding to this motivation is the fact that, as of April 2015, Google has incorporated mobile friendliness as part of its ranking criteria when returning search results to mobile devices [28]. This means that unless a website is deemed to be mobile friendly, it is less likely to be highly ranked in the results returned to users.

Making websites mobile friendly is challenging even for a well-motivated developer. These challenges arise from the difficulties in detecting and repairing MFPs. To detect these problems, developers must be able to verify a web page's appearance on many different types and sizes of mobile devices.
Since the scale of testing required for this is generally quite large, developers often use mobile testing services, such as BrowserStack [16] and SauceLabs [41], to determine if there are problems in their sites. However, even with this information it is difficult for developers to improve or repair their pages. The reason for this is that the appearance of web pages is controlled by complex interactions between the HTML elements and CSS style properties that define a web page. This means that to fix an MFP, developers must typically adjust dozens of elements and properties while at the same time ensuring that these adjustments do not impact other parts of the page. For example, a seemingly simple solution, such as increasing the font size of text or the margins of clickable elements, can result in a distorted user interface that is unlikely to be acceptable to end users or developers.

Existing approaches are limited in helping developers to detect and repair MFPs. For example, the Mobile Friendly Test Tools produced by Google [26] and Bing [13] only focus on the detection of MFPs in a web page. While these tools may provide hints or suggestions as to how to repair the pages, the task of performing the repair is still a manual effort. Developers may also use frameworks, such as Bootstrap and Foundation, to help create pages that will be mobile friendly. However, the use of frameworks cannot guarantee the absence of mobile-friendly problems [3]. Some commercial websites attempt to automate this process (e.g., [14, 34, 20]), but are generally targeted at hobbyist pages, as they require the transformed website to use one of their preset templates. This leaves developers with a lack of automated support for repairing MFPs.

To address this problem, I designed an approach to automatically generate CSS patches that can improve the mobile friendliness of a web page. To do this, the approach builds graph-based models of the layout of a web page. It then uses constraints encoded by these graphs to find patches that can improve mobile friendliness while minimizing layout disruption. To efficiently identify the best patch, the approach leverages unique aspects of the problem domain to quantify metrics related to layout distortion and to parallelize the computation of the solution. I implemented the approach in a prototype tool, MFix, and evaluated its effectiveness on the home pages of 38 of the Alexa Top 50 most visited websites. The results showed that MFix could effectively increase the mobile friendliness ratings of a page, typically by 33%, while minimizing layout distortion. MFix was also fast, needing less than 5 minutes, on average, to generate the CSS patch. I also evaluated the results with a user study, in which participants overwhelmingly preferred the repaired version of the website for use on mobile devices, and also considered the repaired page to be more readable than the original. Overall, these results are very positive and indicate that MFix can help developers to improve the mobile friendliness of their web pages.

5.1 Background

In this section I discuss a variety of MFPs and current ways of addressing them in order to build a mobile friendly website.

5.1.1 Types of MFPs

Widely used mobile testing tools provided by Google [26] and Bing [13] report mobile friendly problems in five areas:

1. Font sizing: Font sizes optimized for viewing a web page on a desktop are often too small to be legible on a mobile device, forcing users to zoom in to read the text, and then out again to navigate around the page.
2. Tap target spacing: "Tap targets" are elements on a web page, such as hyperlinks, buttons, or input boxes, that a user can tap or touch to perform actions, such as navigating to another page or filling and submitting a form. If tap targets are located close to each other on a mobile screen, it can become difficult for a user to physically select the desired element without hitting a neighboring element accidentally. Targets may also be too small, requiring users to zoom into the page in order to tap them on their device.

3. Content sizing: When a web page extends beyond the width of a device's viewport, the user is required to scroll horizontally or zoom out to access content. Horizontal scrolling is considered particularly problematic since users are typically used to scrolling vertically but not horizontally [74]. This can lead to important content being missed by users. Therefore, attention to content sizing is particularly important on mobile devices, where a smaller screen means that space is limited, and the browser may not be resizable to fit the page.

4. Viewport configuration: Using the "meta viewport" HTML tag allows browsers to scale web pages based on the size of a user's device. Web pages that do not specify or correctly use the tag may have content sizing issues, as the browser may simply scale or clip the content without adjusting for the layout of the page.

5. Flash usage: Flash content is not rendered by most mobile browsers. This makes content based on Flash, such as animations and navigation, inaccessible.

In the MFix approach, detailed in Section 5.2.2, I focus on addressing the first three of these problems. I regard Flash usage as out of scope for the approach, since it requires a major content change in the page, while the viewport configuration problem is trivial to address, as it only requires insertion of a missing "meta viewport" tag (e.g., <meta name="viewport" content="width=device-width, initial-scale=1">) into the page's HTML head.

5.1.2 Current Methods for Addressing MFPs

I now discuss different approaches for addressing MFPs in websites to improve their mobile friendliness. There are a number of ways in which a website can be adjusted to become more mobile friendly. In the early days of mobile web browsing, a common approach was to simply build an alternative mobile version of an existing desktop website. Such websites were typically hosted at a separate URL and delivered to a user when the web server detected the use of a mobile device. However, the cost and effort of building such a separate mobile website was high. To address this problem, commercial services, such as bMobilized [14] and Mobify [34], can automatically create a mobile website from a desktop version using a series of pre-designed templates. A drawback of these templated websites, however, is that they fail to capture the distinct design details of the original desktop version, making them look identical to every other organization using the service. Broadly speaking, although having a separate mobile website could address mobile friendly concerns, it introduces a heavy maintenance debt on the organization in ensuring that the mobile website renders and behaves as consistently and reliably as its regular desktop version, thereby doubling the cost of an organization's online presence. Furthermore, having a separate mobile-only site would not help improve the search-engine ranking of the organization's main website, since the two versions reside at different URLs.
To avoid developing and maintaining separate mobile and desktop versions of a website, an organization may employ responsive design techniques. This kind of design makes use of CSS media queries to dynamically adjust the layout of a page to the screen size on which it will be displayed. The advantage of this technique over mobile-dedicated websites is that the URL of the website remains the same. However, converting an existing website into a fully responsive website is an extremely labor-intensive task, and is better suited for websites that are being built from scratch. As such, repairing an existing website may be a more cost-effective solution than completely redeveloping the site. Furthermore, although a responsive design is likely to allow for a good mobile user experience, it does not necessarily preclude the possibility of MFPs, since additional styles may be used or certain provided styles may be incorrectly overridden [3]. MFix introduces a novel technique for handling MFPs by adjusting specific CSS properties in the page and producing a repair patch. The repair patch uses CSS media queries to ensure that the modified CSS is only used for mobile viewing; that is, it does not affect the website when viewed on a desktop.

5.2 Specialization of the Generalized Approach, *Fix

In this section, I provide details of my approach for repairing layout MFPs in web pages, which is based on the generalized repair approach, *Fix, explained in Chapter 3. For identifying MFPs in web pages, i.e., for the Detection phase of the debugging process, existing mobile testing tools, such as the Google Mobile-Friendly Test Tool (GMFT) [26], can be used as an input to my approach. For completeness, I discuss the GMFT technique in Section 5.2.1. The GMFT has limited support for Localization, as the list of faulty HTML elements it supplies is generally incomplete. Therefore, I developed a localization technique that is incorporated into the MFix approach. I explain the localization technique in detail in Section 5.2.4. My contribution is in developing a localization and repair approach for finding suitable fixes for MFPs in web pages using search-based techniques to improve their mobile friendliness. Section 5.2.2 discusses this approach in more detail.

5.2.1 Detection of MFPs

For implementing the Detection phase of the debugging process, I use the mobile testing tool provided by Google, the GMFT [26]. For completeness, I provide an overview of the GMFT; however, since the GMFT is a commercial product, its algorithmic details are not available. The GMFT reports mobile friendly problems in the five areas discussed in Section 5.1.1, namely, font sizing, tap target spacing, content sizing, viewport configuration, and Flash usage. The GMFT web service takes as input the URL of a web page and returns a list of mobile friendly problems in the page that it finds and a screenshot of how the page appears on a mobile device. The GMFT also provides a list of suggested values for CSS properties for improving the mobile friendliness of web pages [25].

5.2.2 Repair of MFPs

The goal of MFix is to automatically generate a patch that can be applied to the CSS of a web page to improve its mobile friendliness. MFix addresses the three specific problem types introduced in Section 5.1, namely font sizing, tap target spacing, and content sizing for the viewport, the factors used by Google to rate the mobile friendliness of a page.
There is usually a straightforward fix for these problems: simply increase the font size used in the page and the margins of the elements within it. The result, however, is one that would likely be unacceptable to an end user: such changes, when taken to excess, can significantly disrupt the layout of a page and require the user to perform excessive panning and scrolling. The challenge in generating a successful repair, therefore, involves balancing two objectives: addressing a page's mobile friendliness problems, while also ensuring an aesthetically pleasing and usable layout. With this in mind, the goal of MFix is to generate a solution that is as faithful as possible to the page's original layout. This requires fixing mobile friendliness problems while maintaining, where possible, the relative proportions and positioning of elements that are related to one another on the page (for example, links in the navigation bar, and the proportions of fonts for headings and body text in the main content pane).

My approach for generating a CSS patch can be roughly broken down into three distinct phases: segmentation, localization, and repair. These are shown in Figure 5.1. The diagram also shows the instantiations of the different abstraction points (AP1–4) of the *Fix approach.

Figure 5.1: Overview of the MFix approach for repairing MFPs (P1. Segmentation; P2. Localization, using the Mobile Friendly Oracle (MFO); P3. Repair, guided by the mobile friendliness score (F) and layout distortion (L))

MFix takes two inputs. The first input is the URL of a page under test (PUT). Typically, this would be a page that has been identified as failing a mobile friendly test (e.g., by using Google's [26] or Bing's [13] tool), but it may also be a page for which a developer would like to simply improve mobile friendliness. The second input is a detection function (D) that identifies MFPs. I use the GMFT as the function D. The segmentation phase identifies elements that form natural visual groupings on the page, referred to as segments. The localization phase then identifies the MFPs in the page and relates these to the HTML elements and CSS properties in each segment. The last phase, repair, seeks to adjust the proportional sizing of elements within segments, along with the relative positions of each segment and the elements within them, in order to generate a suitable patch. I now explain each of these three phases in more detail.

5.2.3 Phase 1: Segmentation

The first phase analyzes the structure of the page to identify segments: sets of HTML elements whose properties should be adjusted together to maintain the visual consistency of the repaired web page. An example of a segment is a series of text-based links in a menu bar where, if the font size of any link in the segment is too small, then all of the links should be adjusted by the same amount to maintain the links' visual consistency.

Figure 5.2: Example for demonstrating the MFix approach: (a) PUT with segments, (b) PUT with distortions, (c) repaired PUT. (Thick red rectangles highlight segments identified by our technique, while dashed red ovals indicate distortions caused by undesirable adjustments to the page's layout.)
The reason MFix uses segments is that, through manual experimentation with pages that contained MFPs, I found that once the optimal fix value for an element was identified, to maintain visual consistency, the same value would also need to be applied to closely related elements (i.e., those in the element's segment). This insight motivated the use of segments, and it allowed the approach to treat many HTML elements as an equivalence class, which also reduced the complexity of the patch generation process.

To identify the segments in a page, the approach analyzes the Document Object Model (DOM) tree of the PUT. In informal experiments, I evaluated several well-known page segmentation analyses, such as VIPS [55], Block-o-matic [143], and correlation clustering [56]. I chose to use an automated clustering-based partitioning algorithm proposed by Romero et al. [136], as its segmentation results more readily conformed to my definition of a segment. I summarize the algorithm here for completeness (a simplified sketch of the merge loop is given below). The approach starts by assigning each leaf element of the DOM tree to its own segment. Then, to cluster the elements, the approach iterates over the segments and uses a cost function to determine when it can merge adjacent segments. The cost function is based on the number of hops in the DOM tree between the lowest common ancestors of the two segments under consideration. If the number of hops is below a threshold based on the average depth of leaves in the DOM tree, then the approach will cluster the adjacent segments. The value of this threshold is determined empirically. The approach continues to iterate over the segments until no further merges are possible (i.e., the segment set has reached a fixed point). The output is a set of segments, Segs, where each segment contains a set of XPath IDs denoting the HTML elements that have been grouped together in the segment.

Figure 5.2a shows a simplified version of the segments that were identified for one of the web pages, Wiley, used in the evaluation. The red overlay rectangles show the visible elements that were grouped together as segments. These include the header content, a left-aligned navigation menu, the content pane, and the page's footer.
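The following is a minimal sketch of this merge loop under two stated simplifications: the DOM is modeled with a bare-bones Node type rather than the rendered DOM that MFix analyzes, and "adjacent" segments are approximated as consecutive segments in document order. Class and method names are invented.

import java.util.*;

// Simplified sketch of the clustering-based segmentation loop (after
// Romero et al.); adjacency is approximated by document order.
public class Segmenter {

    static class Node {
        Node parent;
        List<Node> children = new ArrayList<>();
    }

    static int depth(Node n) {
        int d = 0;
        for (Node c = n; c.parent != null; c = c.parent) d++;
        return d;
    }

    // Lowest common ancestor of two nodes.
    static Node lca(Node a, Node b) {
        Set<Node> ancestors = new HashSet<>();
        for (Node c = a; c != null; c = c.parent) ancestors.add(c);
        for (Node c = b; c != null; c = c.parent)
            if (ancestors.contains(c)) return c;
        return null; // unreachable if both nodes share a root
    }

    // Lowest common ancestor of all nodes in a segment.
    static Node lca(List<Node> segment) {
        Node acc = segment.get(0);
        for (Node n : segment) acc = lca(acc, n);
        return acc;
    }

    // Cost of merging two segments: hops in the DOM tree between their LCAs.
    static int mergeCost(List<Node> s1, List<Node> s2) {
        Node a = lca(s1), b = lca(s2);
        return depth(a) + depth(b) - 2 * depth(lca(a, b));
    }

    // Start with one segment per leaf; repeatedly merge adjacent segments
    // whose cost is below the threshold until a fixed point is reached.
    static List<List<Node>> segment(List<Node> leaves, int threshold) {
        List<List<Node>> segs = new ArrayList<>();
        for (Node leaf : leaves) segs.add(new ArrayList<>(List.of(leaf)));
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int i = 0; i + 1 < segs.size() && !merged; i++) {
                if (mergeCost(segs.get(i), segs.get(i + 1)) < threshold) {
                    segs.get(i).addAll(segs.remove(i + 1));
                    merged = true;
                }
            }
        }
        return segs;
    }
}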
5.2.4 Phase 2: Localization

The second phase identifies the parts of the PUT that must be targeted to address its MFPs. The second phase consists of two steps. In the first step, the approach analyzes the PUT to identify which segments contain MFPs. Then, based on the structure and problem types identified for each segment, the second step identifies the CSS properties that will most likely need to be adjusted to resolve each problem. The output of the localization phase is a mapping of the potentially problematic segments to these properties.

5.2.4.1 Identifying Problematic Segments

In the first step of the localization phase, the approach identifies MFP types in the PUT and the subset of segments that will likely need to be adjusted to address them. In MFix, MFPs in the PUT are detected by a Mobile Friendly Oracle (MFO). An MFO is a function that takes a web page as input and returns a list of MFP types it contains. The MFO can identify the presence of MFPs but cannot identify the faulty HTML elements and CSS properties responsible for the observed problems. In the implementation of MFix, I use the GMFT [26] as MFix's MFO. However, any detector or testing tool may also be used as an MFO. The basic requirement for an MFO is that it can accurately report whether there are any types of MFPs present in the page. Ideally, the MFO should also detail what types of problems are present, along with a mapping of each problem to the corresponding HTML elements. However, these are not strict requirements: MFix can correctly function with the assumption that all segments have all problem types, though this over-approximation can increase the amount of time needed to compute the best solution in the second phase. Since I leverage the GMFT in the implementation, I discuss how the output of this particular tool is used by MFix. I expect that other MFOs, such as Bing, could be adapted in a similar way.

Given a PUT, the GMFT returns, for each problem type it detects, a reference to the HTML elements that contain that problem. However, through experimentation with the GMFT, I learned that the list of HTML elements it supplies is generally incomplete. Therefore, given a reported problem type, MFix applies a conservative filtering to the segments to identify which ones may be problematic with respect to that problem type. For example, if the GMFT reports that there is a problem with font sizing in the PUT, then MFix identifies any segment that contains a visible text element as potentially problematic. As mentioned above, this over-approximation may increase the time needed to compute the best solution, but does not introduce unsoundness into MFix. The output of this step is a set of tuples of the form ⟨s, T⟩, where s ∈ Segs is a potentially problematic segment and T is the set of problem types associated with s (i.e., drawn from {tap targets, font size, content size}). Referring back to the example in Figure 5.2a, the GMFT identified S3 as having two problem types, the tap targets were too close and the font size was too small, so the approach would generate a tuple for S3 where T includes these two problem types.

5.2.4.2 Identifying Problematic CSS Properties

After identifying the subset of problematic segments, the approach needs to identify the CSS properties that may need to be adjusted in each segment to make the page mobile friendly. The general intuition of this step is that each of a segment's identified problem types generally maps to a set of CSS properties within the segment. However, this step is complicated by the fact that HTML elements may not explicitly define a CSS property (i.e., they may inherit a style from a parent element) and that MFix adjusts CSS properties at the segment level instead of the individual element level. To address these issues, I introduce the concept of a Property Dependence Graph (PDG), which, for a given segment and problem type, models the relevant style relationships among its HTML elements based on CSS inheritance and style dependencies. Formally, I define a PDG as a directed graph of the form ⟨E, R, M⟩. Here, e ∈ E is a node in the graph that corresponds to an HTML element in the PUT that has an explicitly defined CSS property p ∈ P, where P is the set of CSS properties relevant for a problem type (e.g., font-size for font sizing problems, margin for tap target issues, etc.). R ⊆ E × E is a set of directed edges, such that for each pair of elements ⟨e1, e2⟩ ∈ R, there exists a dependency relationship between e1 and e2. M is a function M : R → 2^C that maps each edge to a set of tuples of the form c = ⟨p, φ⟩, where p ∈ P and φ is a ratio between the values of p for e1 and e2. This function is used in the following repair phase (Section 5.2.5) to ensure that style changes made to a segment remain consistent across pairs of elements in a dependency relationship.
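To make the PDG definition concrete, the following is a minimal sketch of the graph structure, using the font sizing problem type as the example. The types, field names, and XPaths are illustrative, not the MFix implementation.

import java.util.*;

// Minimal sketch of the Property Dependence Graph <E, R, M>:
// nodes are elements with explicitly defined relevant properties,
// directed edges carry the per-property ratios phi used by the repair phase.
public class PropertyDependenceGraph {

    // A node: an HTML element (identified here by XPath) that explicitly
    // defines at least one CSS property relevant to the problem type.
    static class PdgNode {
        final String xpath;
        PdgNode(String xpath) { this.xpath = xpath; }
    }

    // An edge label c = <p, phi>: a relevant CSS property and the ratio
    // between the values of p defined on the two endpoint elements.
    static class PropertyRatio {
        final String property;
        final double phi;   // value(e1) / value(e2)
        PropertyRatio(String property, double phi) {
            this.property = property;
            this.phi = phi;
        }
    }

    // The edge set R with the mapping M : R -> 2^C stored as edge labels.
    final Map<PdgNode, Map<PdgNode, List<PropertyRatio>>> edges = new HashMap<>();

    void addDependency(PdgNode e1, PdgNode e2, String property, double phi) {
        edges.computeIfAbsent(e1, k -> new HashMap<>())
             .computeIfAbsent(e2, k -> new ArrayList<>())
             .add(new PropertyRatio(property, phi));
    }

    public static void main(String[] args) {
        // The S3 example from Figure 5.2a: a <div> with font-size 13px and a
        // descendant <h2> with font-size 18px give phi = 13.0 / 18.0, i.e. 0.72.
        PropertyDependenceGraph fpdg = new PropertyDependenceGraph();
        PdgNode div = new PdgNode("/html/body/div[3]");    // hypothetical XPath
        PdgNode h2 = new PdgNode("/html/body/div[3]/h2");  // hypothetical XPath
        fpdg.addDependency(div, h2, "font-size", 13.0 / 18.0);
    }
}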
MFix defines a variant of the PDG for each of the three problem types: the Font PDG (FPDG), the Content Size PDG (CPDG), and the Tap Target PDG (TPDG). Each of these three graphs has a specific set of relevant CSS properties (P), a dependency relationship, and a mapping function (M). Note that I only present the formal definition of the FPDG, as the other two graphs are defined in a similar manner. The FPDG is constructed for any segment for which a font sizing problem type has been identified. For this problem type, the most relevant CSS property is clearly font-size, but the line-height, width, and height properties of certain elements may also need to be adjusted if font sizes are changed. Therefore P = {font-size, line-height, width, height}. A dependency relationship exists between any e1, e2 ∈ E if and only if e1 is an ancestor of e2 in the DOM tree and e2 has an explicitly defined CSS property p ∈ P, i.e., the value of the property is not inherited from e1. The general intuition of using this dependency relationship is that only nodes that explicitly define a relevant property may need to be adjusted, and the remainder of the nodes in between e1 and e2 will simply inherit the style from e1. The ratio, φ, associated with each edge is the value of p defined for e1 divided by the value of p defined for e2. To illustrate, consider two HTML elements in S3 of Figure 5.2a. The first, e1, is a <div> tag wrapping all of the elements in S3 with font-size = 13px, and the second, e2, is the <h2> element containing the text "Resources" with font-size = 18px. A dependency relationship exists from e1 to e2 with p as font-size and the ratio φ = 13/18 ≈ 0.72.

The output of this final step is the set, I, of tuples, where each tuple is of the form ⟨s, g, a⟩, where s identifies the segment to which the tuple corresponds, g identifies a corresponding PDG, and a is an adjustment factor for the PDG that is initially set to 1. The adjustment factor is used in the repair phase and serves as a multiplier to the ratios defined for the edges of each PDG. A tuple is added to I for each problem type that was identified as applicable to a segment. Referring back to the example in Figure 5.2a, the approach would generate two tuples for S3, one containing an FPDG and the other containing a TPDG.

5.2.5 Phase 3: Repair

The goal of the third phase is to compute a repair for the PUT. The best repair has to balance two objectives. The first objective is to identify the set of changes (a patch) that will most improve the PUT's mobile friendliness. The second objective is to identify the set of changes that does not significantly change the layout of the PUT.

5.2.5.1 Metrics
The amount of change in a layout can be determined by building models that express the relative visual positioning among and within the segments of a page. I refer to these models as the Segment Model (SM) and Intra-Segment Model (ISM), respectively. Given these two models,MFix uses graph comparison techniques to quantify the dierence between the models for the original page and a page with an applied candidate solution. I now provide a more formal denition of the SM and ISM. A Segment Model (SM) is dened as a directed complete graph where the nodes are the segments identied in the rst phase (Section 5.2.3) and the edge labels represent layout relationships between segments. To determine the edge labels, the approach rst computes the Minimum Bounding Rectangles (MBRs) of each segment. This done by nding the maximum and minimum X and Y coordinates of all of the elements included in the segment, which can be found by querying the DOM of the page. Based on the coordinates of each pair of MBRs, the approach determines which of the following relationships apply: (1) intersection, (2) containment, or (3) directional (i.e., above, below, left, right). Each edge in an SM is labeled in this manner. Referring to Figure 5.2a, one of the relationships identied would be that S1 is above S3 and S4. An ISM is the same, but is built for each segment and the nodes are the HTML elements within the segment. To quantify the layout dierences between the original page and a transformed page to which a candidate patch has been applied, the approach computes two metrics. The rst metric is at the segment level. The approach sums the size of the symmetric dierence between each edge's labels in the SM of the original page and the SM of the transformed page. Recall that both models are complete graphs, so a counterpart for each edge exists in the other model. To illustrate, consider the examples shown in Figures 5.2a and 5.2b. The change to the page has caused segments S3 and S4 to overlap. This change in the relationship between the two segments would be counted as a dierence between the two SMs and increase the amount of layout dierence. The second metric 60 is similar to the rst but compares the ISM for each segment in the original and transformed page. The one dierence in the computation of the metric is that the symmetric dierence is only computed for the intersection relationship. The intuition behind this dierence in counting is that I consider movement of elements within a segment, except for intersection, to be an acceptable change to accommodate the goal of increasing mobile friendliness. Referring back to the example shown in Figure 5.2b, nine intra-segment intersections are counted among the elements in segment S4 as shown by dashed red ovals. The dierence sums calculated at the segment and intra-segment level are returned as the amount of layout dierence. 5.2.5.2 Computing Candidate Mobile Friendly Patches To identify the best CSS patch, the approach must nd new values for the potentially problematic properties, identied in the rst phase, that make the PUT mobile friendly while also maintaining its layout. To state this more formally, given I, the approach must identify a set of new values for each of the adjustment factors (i.e.,a) in each tuple ofI so that the value ofF is 100 (i.e., the maximum mobile friendliness score) and the value ofL is zero (i.e., there are no layout dierences). A direct computation of this solution faces two challenges. 
5.2.5.2 Computing Candidate Mobile Friendly Patches

To identify the best CSS patch, the approach must find new values for the potentially problematic properties, identified in the first phase, that make the PUT mobile friendly while also maintaining its layout. To state this more formally, given I, the approach must identify a set of new values for each of the adjustment factors (i.e., a) in each tuple of I so that the value of F is 100 (i.e., the maximum mobile friendliness score) and the value of L is zero (i.e., there are no layout differences). A direct computation of this solution faces two challenges. The first of these challenges is that an optimal solution that satisfies both of the above conditions may not exist. This can happen due to constraints in the layout of the PUT. The second challenge is that, even if such a solution were to exist, it exists in a solution space that grows exponentially based on the number of elements and properties that must be considered. Since many of the CSS properties have a large range of potential values, a direct computation of the solution would be too expensive to be practical. Both of these challenges motivate the use of an approximation algorithm to identify a repair. Therefore, the approach must find a set of values that minimizes the layout score while maximizing the mobile friendliness score.

The design of my approximation algorithm in MFix takes into account several unique aspects of the problem domain to generate a high quality patch in a reasonable amount of time. The first of these aspects is that, through manual experimentation, I learned that good or optimal solutions typically involve a large number of small changes to many segments. This motivates targeting a solution space comprised of candidate solutions that differ from the original page in many places but by only small amounts. The second of these aspects is that computing the values of the L and F functions is expensive. The reason for this is that F requires accessing an API on the web, and L requires rendering the page and computing layout information for the two versions of the PUT. This motivates us to avoid algorithms that require sequential processing of L and F (e.g., simulated annealing or genetic algorithms).

To incorporate these insights, the approximation algorithm first generates a set of size n of candidate patches. To generate each candidate patch, the approach creates a copy of I, called I′, then iterates over each tuple in I′ and, with probability x, randomly perturbs the value of the adjustment factor (i.e., a) using a process I describe in more detail in the next paragraph. Then I′ is converted into a patch, R, using the process described in the next section (Section 5.2.5.3), and added to the set of candidate patches. This process is repeated until the approach has generated n candidate patches. The approach then computes, in parallel, the values of F and L for a version of the PUT with an applied candidate patch. (The implementation of MFix uses Amazon Web Services (AWS) to parallelize this computation.) The objective score for the candidate patch is then computed as a weighted sum of F and L. The candidate patch with the maximum score, i.e., with the highest value of F and the lowest value of L, is selected as the final solution, Rmax. Figure 5.2c shows Rmax applied to the example page.

MFix perturbs adjustment factors in such a way as to take advantage of my insight that the optimal solutions differ from the original page in many places but by only small amounts. To represent this insight, I based the perturbation on a Gaussian distribution around the original value in a property. Through experimentation, I found that it was most effective to have the mean (μ) and standard deviation (σ) values used for the Gaussian distribution vary based on the specific MFP type being addressed. For each problem type, the goal was to identify a μ and σ that provided a large enough range to allow sufficient diversity in the generation of candidate patches. For identifying μ values, I found through experimentation that setting μ at the values suggested by the GMFT [25] was not effective in generating candidate patches that could improve the mobile friendliness of the PUT. Therefore, I added an amendment factor to the values suggested by the GMFT to allow the approach to select a value considered mobile friendly with a high probability. The specific amendment factors I found the most effective were: +14 for font size, -20 for content sizing, and 0 for tap target sizing problems. For example, if the GMFT-suggested value for font size problems was 16px, I set μ at 30px. For each problem type, I then identified a σ value. The specific σ values I determined to be most effective were: σ = 16 for content size problems, σ = 5 for font size problems, and σ = 2 for tap target spacing problems.
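The following is a minimal sketch of this generate-and-evaluate loop. Several elements are assumptions on my part: how a sampled Gaussian value is mapped back onto an adjustment factor is not spelled out above, so the sketch divides the sampled target value by the property's original value, and the weighted objective is written as wF·F - wL·L as one natural reading of "highest F and lowest L". F and L are passed in as stand-in functions, and MFix evaluates them for all n candidates in parallel on AWS rather than sequentially as shown.

import java.util.*;
import java.util.function.ToDoubleFunction;

// Illustrative sketch of candidate-patch generation and selection.
public class CandidateSearch {

    static final Random RAND = new Random();

    // One tuple <s, g, a> of the set I: a segment, its problem type (standing
    // in for the PDG g), and the adjustment factor a that multiplies the
    // ratios on the PDG's edges.
    static class Adjustment {
        final String segment, problemType;
        double factor = 1.0;
        Adjustment(String segment, String problemType) {
            this.segment = segment; this.problemType = problemType;
        }
        Adjustment copy() {
            Adjustment c = new Adjustment(segment, problemType);
            c.factor = factor;
            return c;
        }
    }

    // Perturb a copy of I: with probability x, resample each adjustment
    // factor from a Gaussian whose mu is the tool-suggested value plus the
    // amendment factor for the tuple's problem type (sigma likewise per
    // type). Dividing by the original value is an assumption here.
    static List<Adjustment> perturb(List<Adjustment> i, double x,
                                    Map<String, double[]> muSigmaByType,
                                    Map<String, Double> originalValueBySegment) {
        List<Adjustment> copy = new ArrayList<>();
        for (Adjustment a : i) {
            Adjustment c = a.copy();
            if (RAND.nextDouble() < x) {
                double[] ms = muSigmaByType.get(c.problemType);
                double sampled = ms[0] + ms[1] * RAND.nextGaussian();
                c.factor = sampled / originalValueBySegment.get(c.segment);
            }
            copy.add(c);
        }
        return copy;
    }

    // Generate n candidates and keep the one maximizing wF*F - wL*L.
    static List<Adjustment> search(List<Adjustment> i, int n, double x,
                                   Map<String, double[]> muSigmaByType,
                                   Map<String, Double> originalValueBySegment,
                                   ToDoubleFunction<List<Adjustment>> F,
                                   ToDoubleFunction<List<Adjustment>> L,
                                   double wF, double wL) {
        List<Adjustment> best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < n; k++) {
            List<Adjustment> candidate =
                perturb(i, x, muSigmaByType, originalValueBySegment);
            double score = wF * F.applyAsDouble(candidate)
                         - wL * L.applyAsDouble(candidate);
            if (score > bestScore) { bestScore = score; best = candidate; }
        }
        return best;
    }
}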
5.2.5.3 Generating the Mobile Friendly Patch

Given a set I, the approach generates a repair patch, R, and modifies the PUT so that R will be applied at runtime. The general form of R is a set of CSS style declarations that apply to the HTML elements of each segment in I. To generate R, the approach iterates over all tuples in I. For each tuple, the approach iterates over each node of its PDG, starting with the root node, and computes a new value that will be assigned to the CSS property represented by the node. The new value for a node is computed by dividing the new value assigned to its predecessor by the ratio, φ, defined on the edge with the predecessor (i.e., multiplying by 1/φ, as in the example below). Once new property values have been computed for all nodes in the PDG, the approach generates a set of fixes, where each fix is represented as a tuple ⟨i, p, v⟩, where i is the XPath of a node in the PDG that had a property change, p is the changed CSS property, and v is the newly computed value. These tuples are made into CSS style declarations by converting i into a CSS selector and then adding the declarations of p and v within the selector. All of the generated CSS style declarations are then wrapped in a CSS media query that will cause them to be loaded when the page is accessed by a mobile device. In practice, I found that the size range specified in the MFix-generated patch's media query is applicable to a wide range of mobile devices. However, to allow developers to generate patches for specific device sizes, I provide configurable size parameters in the media query.

Referring back to the example, the ratio (φ) between e1 (the <div> containing all elements in S3) and e2 (the <h2> containing the text "Resources") is 0.72. Consider a tuple ⟨S3, font-size, 2⟩ in I. Thus, a value v of 26px is calculated for the predecessor node e1 based on the adjustment factor 2 (13px × 2 = 26px). Accordingly, v = 26px × 1/0.72 = 36px is calculated for e2. Thus, the approach generates two fix tuples: ⟨div, font-size, 26px⟩ and ⟨h2, font-size, 36px⟩.
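A minimal sketch of this final emission step is shown below, using the two fix tuples from the running example. The XPath-to-selector conversion is elided (plain selectors are passed in), and the 480px breakpoint is an illustrative placeholder for MFix's configurable media query parameters.

import java.util.*;

// Sketch of wrapping the generated fix tuples <i, p, v> in a mobile-only
// CSS media query; selector conversion and the breakpoint are simplified.
public class PatchEmitter {

    static class Fix {
        final String selector, property, value;
        Fix(String selector, String property, String value) {
            this.selector = selector; this.property = property; this.value = value;
        }
    }

    static String emit(List<Fix> fixes, int maxWidthPx) {
        StringBuilder css = new StringBuilder();
        css.append("@media (max-width: ").append(maxWidthPx).append("px) {\n");
        for (Fix f : fixes) {
            css.append("  ").append(f.selector).append(" { ")
               .append(f.property).append(": ").append(f.value).append("; }\n");
        }
        css.append("}\n");
        return css.toString();
    }

    public static void main(String[] args) {
        // The two fix tuples from the running example.
        List<Fix> fixes = List.of(
            new Fix("div", "font-size", "26px"),
            new Fix("h2", "font-size", "36px"));
        System.out.print(emit(fixes, 480)); // 480px is an illustrative breakpoint
    }
}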
5.3 Evaluation

To evaluate my approach for repairing MFPs, I designed experiments to determine its effectiveness, running time, and the visual appeal of its solutions. The specific research questions I considered were:

RQ1: How effective is MFix in repairing mobile friendly problems in web pages?
RQ2: How long does it take for MFix to generate patches for the mobile friendly problems in web pages?
RQ3: How does MFix impact the visual appeal of web pages after applying the suggested CSS repair patches?

5.3.1 Implementation

I implemented the approach in Java as a prototype tool named MFix [94]. For identifying the mobile friendly problems in a web page, I used the GMFT [26] and PSIT [27] APIs. I also used the PSIT for obtaining the mobile friendliness score (labeled as "usability" in the PSIT report). For identifying segments in a web page and building the SM and ISM, MFix first builds the DOM tree by rendering the page in an emulated mobile Chrome browser v60.0 and extracting rendering information, such as element MBRs and XPaths, using JavaScript and Selenium WebDriver. The segmentation threshold value determined by the average depth of leaves in a DOM tree was capped at four to avoid the situation where all of the visible elements in a page were wrapped in one large segment. This constant value was determined empirically and was implemented as a configurable parameter in MFix. I used jStyleParser for identifying explicitly defined CSS properties of HTML elements in a page for building the PDG. I parallelized the evaluation of candidate solutions using a cloud of 100 Amazon EC2 t2.xlarge instances pre-installed with Ubuntu 16.04.

5.3.2 Subjects

For the experiments I used 38 real-world subjects collected from the top 50 most visited websites across all seventeen categories tracked by Alexa [9]. The subjects are listed in Table 5.1. The columns "Category" and "Rank" refer to the source Alexa category and the rank of the subject within that category, respectively. The column "#HTML" refers to the total number of HTML elements in a subject, which I counted by parsing the subject's DOM for node type "element". This value gives an approximation of the size and complexity of the subject.

Table 5.1: Subjects used in the evaluation of MFix

ID  URL  Category  Rank  #HTML
1  http://aamc.org  Health  23  598
2  https://arxiv.org  Science  21  381
3  http://us.battle.net  Kids and teens  2  615
4  https://bitcointalk.org  Science  25  1302
5  http://blizzard.com  Kids and teens  33  313
6  https://boardgamegeek.com  Games  31  4474
7  https://bulbagarden.net  Kids and teens  26  151
8  http://coinmarketcap.com  Science  8  1964
9  http://correios.com.br/para-voce  Society  14  769
10  http://dict.cc  Reference  20  633
11  https://www.discogs.com  Arts  26  5738
12  http://drudgereport.com  News  23  779
13  http://www.finalfantasyxiv.com  Games  37  61
14  http://www.flashscore.com  Sports  16  6621
15  https://www.fragrantica.com  Health  35  1091
16  http://forum.gsmhosting.com/vbb  Home  39  2618
17  http://www.intellicast.com  Science  38  1393
18  https://www.irctc.co.in  Regional  34  1031
19  https://www.irs.gov  Home  14  569
20  https://www.leo.org  Reference  31  990
21  http://letour.fr  Sports  3  1260
22  http://lolcounter.com  Kids and teens  30  1257
23  http://www.mmo-champion.com  Games  29  1903
24  http://myway.com  Computers  42  135
25  https://www.ncbi.nlm.nih.gov  Science  2  833
26  http://www.nexusmods.com  Games  28  2108
27  http://nvidia.com  Games  20  719
28  http://rotoworld.com  Sports  41  2523
29  http://sigmaaldrich.com  Science  37  141
30  http://us.soccerway.com  Sports  30  2708
31  http://www.square-enix.com  Games  30  198
32  https://travel.state.gov  Home  26  440
33  http://www.weather.gov  Science  18  1101
34  http://www.bom.gov.au  Kids and teens  48  685
35  http://www.wiley.com  Shopping  14  460
36  http://onlinelibrary.wiley.com  Business  33  824
37  https://www.wowprogress.com  Games  46  2828
38  https://xkcd.com  Arts  48  121

I used Alexa as the source of the subjects as the websites represent popular, widely used sites and a mix of different layouts. From the 651 unique URLs that were identified across the 17 categories, I excluded the websites that passed the GMFT or had adult content. Each of the remaining 38 subjects was downloaded using the Scrapbook-X Firefox plugin, which downloads an HTML page and its supporting files, such as images, CSS, and JavaScript. I then removed the portions of the subject pages that made active internet connections, such as for advertisements, to enable running of the subjects in an offline mode.

5.3.3 Experiment One

To address RQ1 and RQ2, I ran MFix ten times on each of the 38 subjects to mitigate the non-determinism inherent in the approximation algorithm used to find a repair solution. For RQ1, I considered two metrics to gauge the effectiveness of MFix. For the first metric, I used the GMFT to measure how many of the subjects were considered mobile friendly after the patch was applied. For the second metric, I compared the before and after scores for mobile friendliness and layout distortion for each subject.
For comparing mobile friendliness score, I 64 Table 5.1: Subjects used in the evaluation ofMFix ID URL Category Rank #HTML 1 http://aamc.org Health 23 598 2 https://arxiv.org Science 21 381 3 http://us.battle.net Kids and teens 2 615 4 https://bitcointalk.org Science 25 1302 5 http://blizzard.com Kids and teens 33 313 6 https://boardgamegeek.com Games 31 4474 7 https://bulbagarden.net Kids and teens 26 151 8 http://coinmarketcap.com Science 8 1964 9 http://correios.com.br/para-voce Society 14 769 10 http://dict.cc Reference 20 633 11 https://www.discogs.com Arts 26 5738 12 http://drudgereport.com News 23 779 13 http://www.finalfantasyxiv.com Games 37 61 14 http://www.flashscore.com Sports 16 6621 15 https://www.fragrantica.com Health 35 1091 16 http://forum.gsmhosting.com/vbb Home 39 2618 17 http://www.intellicast.com Science 38 1393 18 https://www.irctc.co.in Regional 34 1031 19 https://www.irs.gov Home 14 569 20 https://www.leo.org Reference 31 990 21 http://letour.fr Sports 3 1260 22 http://lolcounter.com Kids and teens 30 1257 23 http://www.mmo-champion.com Games 29 1903 24 http://myway.com Computers 42 135 25 https://www.ncbi.nlm.nih.gov Science 2 833 26 http://www.nexusmods.com Games 28 2108 27 http://nvidia.com Games 20 719 28 http://rotoworld.com Sports 41 2523 29 http://sigmaaldrich.com Science 37 141 30 http://us.soccerway.com Sports 30 2708 31 http://www.square-enix.com Games 30 198 32 https://travel.state.gov Home 26 440 33 http://www.weather.gov Science 18 1101 34 http://www.bom.gov.au Kids and teens 48 685 35 http://www.wiley.com Shopping 14 460 36 http://onlinelibrary.wiley.com Business 33 824 37 https://www.wowprogress.com Games 46 2828 38 https://xkcd.com Arts 48 121 65 0 10 20 30 40 50 60 70 80 90 100 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 After mobile friendliness score Subjects Before mobile friendliness score Difference between before and after scores Figure 5.3: Distribution of the median mobile friendliness score across 10 runs selected, for each subject over the ten runs, the repair that represented a median score. For layout distortion, I selected, for each subject over the ten runs, the best and worst repair, in terms of layout distortion, that passed the mobile friendly test. Essentially, for each subject, these were the two patched pages that passed the mobile friendly test and had the lowest (best) and highest (worst) amount of distortion. For the subjects that did not pass the mobile friendly test, I considered the patched pages with the highest mobile friendly scores to be the \passing" pages. For RQ2, I measured the average total running time ofMFix for each of the ten runs for each of the subjects, and also measured the time spent in the dierent stages of the approach. 5.3.3.1 Discussion of results The results for eectiveness (RQ1) were that 95% (36 out of 38) of the subjects passed the GMFT after applyingMFix's suggested CSS repair patch. This shows that the patches generated by MFix were eective in making the pages pass the mobile friendly test. Figure 5.3 shows the results of comparing the before and after median mobile friendliness scores for each subject. For each subject, the dark gray portion shows the score reported by the PSIT for the patched page and the light gray portion shows the score for the original version. The black horizontal line drawn at 80 indicates the value above which the GMFT considers a page to have passed the test and to be mobile friendly. 
On average, MFix improved the mobile friendliness score of a subject by 33%. Overall, these results show that MFix was able to consistently improve a subject's mobile friendliness score.

I also compared the layout distortion score for the best and worst repairs of each subject. On average, the best repair had a layout distortion score 55% lower than the worst repair. These results show that MFix was effective in identifying patches that could reduce the amount of distortion in a solution that was able to pass the mobile friendly test. (For RQ3, I examined, via a user study, whether this reduction in distortion translates into a more attractive page.)

I investigated the results to understand why two subjects did not pass the GMFT. The patched version of the first subject, gsmhosting, contained a content sizing problem. The original version of the page did not contain this problem, which indicates that the increased font size introduced by the patch caused content in this page to overflow the viewport width. For the second subject, aamc, MFix was not able to fully resolve its content sizing problem because the required value was extremely large compared to the range explored by the Gaussian perturbation of the adjustment factor. Both of these issues suggest further refinements to MFix that could be explored in future work, such as making the process iterative and expanding the initial search space.

The total running time (RQ2) required by MFix for the different subjects ranged from 2 minutes to 10 minutes, averaging a little less than 5 minutes. As of August 2017, an Amazon EC2 t2.xlarge instance was priced at $0.188 per hour. Thus, with an average time of 5 minutes, the cost of running MFix on 100 instances was $1.50 per subject. Figure 5.4 shows a breakdown of the average time for the different stages of the approach.

[Figure 5.4: Breakdown of the running time of MFix. Pie chart: finding repair for mobile friendly problems (59%), identifying problematic segments (25%), identifying problematic CSS properties (15%), segmentation (1%).]

As can be seen from the chart, finding the repair for the mobile friendly problems (phase 3) was the most time consuming, taking up almost 60% of the total time. A major portion of this time was spent in evaluating the candidate solutions by invoking the PSIT API. The remainder of the time was spent in calculating layout distortion, which is dependent on the size of the page. The overhead caused by network delay in communicating with the Amazon cloud instances was negligible. For the API invocation, I implemented a random wait time of 30 to 60 seconds between consecutive calls to avoid retrieving stale or cached results. Identifying problematic segments was the next most time consuming step, as it required invoking the GMFT API.
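The random wait between consecutive PSIT calls mentioned above can be realized with a simple throttling wrapper. The sketch below is illustrative only; callPsit is a hypothetical stand-in for the actual HTTP request to the PageSpeed Insights API, which is not shown.

import java.util.concurrent.ThreadLocalRandom;

public final class ThrottledPsitClient {
    private long lastCallMillis = 0;

    // Hypothetical placeholder for the real PageSpeed Insights request.
    private String callPsit(String url) { return ""; }

    public synchronized String score(String url) throws InterruptedException {
        // Enforce a random 30-60 second gap after the previous call to avoid
        // retrieving stale or cached results.
        long gap = ThreadLocalRandom.current().nextLong(30_000, 60_001);
        long wait = lastCallMillis + gap - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        lastCallMillis = System.currentTimeMillis();
        return callPsit(url);
    }
}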
5.3.4 Experiment Two

To address RQ3, I conducted a user-based survey to evaluate the aesthetics and visual appeal of the repaired pages. The main intent of the study was to evaluate the effectiveness of the layout distortion metric, L (Section 5.2.5), in minimizing layout disruptions and producing attractive pages. The general format of the survey was to ask participants to compare the original and repaired versions of a subset of the subjects. To make the survey length manageable, I divided the 38 subjects into six different surveys, each with six or seven subjects. For each subject, the survey presented, in random order, a screenshot of the original and repaired pages when displayed in the frame of a mobile device. The screenshots were obtained from the output of the GMFT. An example of one such screenshot is shown in Figure 5.2c. I asked each human subject to (1) select which of the two versions (original or repaired) they would prefer to use on their mobile device; (2) rate the readability of each version of the page on a scale of 1-10, where 1 means low and 10 means high; and (3) rate the attractiveness of the page on a scale of 1-10, where 1 means low and 10 means high. I had two variants of the survey: one used the best repair as the screenshot of the repaired page, and the other used the worst repair as the screenshot of the repaired page. Here, the best and worst repairs were as defined in Experiment One. I used the Amazon Mechanical Turk (AMT) service to conduct the surveys. AMT allows users (requesters) to anonymously post jobs, which it then matches to anonymous users (workers) who are willing to complete those tasks to earn money. To avoid workers who had a track record of haphazardly completing tasks, I only allowed workers that had high approval ratings for their previously completed tasks (over 95%) and had completed more than 5,000 approved tasks to complete the survey. In general, these are considered fairly selective criteria for participant selection on AMT. For each survey, I had 20 anonymous participants, giving a total of 240 completed surveys across both variants of the survey. Each participant was paid $0.65 for completing a survey.

5.3.4.1 Discussion of results

Based on the analysis of the results of the first variant of the survey, I found that the users preferred to use the repaired version for 26 out of 38 subjects, three subjects received equal preference for the original and repaired versions, and only nine subjects received a preference for using the original version. Interestingly, users preferred to use the repaired version even for the two subjects that did not pass the GMFT. For readability, all but four subjects were rated as having improved readability over the original versions. On average, the readability rating of the repaired pages showed a 17% improvement over the original versions (original = 5.97, repaired = 6.98). This result was also confirmed as statistically significant using the Wilcoxon signed-rank test, with a p-value = 1.53 × 10^-14 < 0.05. Using the effect size metric based on the Vargha-Delaney A measure, the readability of the repaired version was observed to be better than that of the original version 62% of the time. With regard to attractiveness, no statistical significance was observed, implying that MFix did not deteriorate the aesthetics of the pages in the process of automatically repairing the reported mobile friendly problems. In fact, overall, the repaired versions produced by MFix were rated slightly higher than the original versions for attractiveness (avg. original = 6.50, avg. repaired = 6.67; median original = 6.02, median repaired = 7.12).

I investigated the nine subjects where the repaired version was not preferred by the participants. Based on my analysis, I found two dominant reasons that applied to all nine subjects. First, these subjects all had a fixed-size layout, meaning that the section and container elements in the pages were assigned absolute size and location values. This caused a cascading effect with any change introduced in the page, such as increasing font sizes or decreasing widths to fit the viewport. The second reason was linked to the first, as the pages were text intensive, thereby requiring MFix to increase font sizes.
These results motivate future work on techniques that can better handle these types of pages. Overall, these results indicate that MFix was very effective in generating repaired pages that (1) users preferred over the original version, (2) were considered to be more readable, and (3) did not suffer in terms of visual aesthetics.

The results for the second variant of the survey underscored the importance of the layout distortion objective and the impact visual distortions can have on end users' perception of a page's attractiveness. The results showed that the users preferred to use the original, non-mobile friendly version for 22 out of 38 subjects and preferred to use the repaired version for only 16 subjects. Readability showed results similar to the first survey variant. On average, an improvement of 11% in readability was observed for the repaired pages compared to the original versions, and this was still found to be statistically significant (p-value = 7.54 × 10^-6 < 0.05). This is expected, as the enlarged font sizes can make the text very readable in the repaired versions despite layout distortions. However, in this survey statistical significance (p-value = 2.20 × 10^-16 < 0.05) was observed for the attractiveness of the original version being rated higher than the repaired version. On average, the original version was rated 6.82 (median 7.00) and the repaired version was rated 5.64 (median 5.63). In terms of the effect size metric, the repaired version was rated as having better attractiveness only 38% of the time. These results strongly indicate that the layout distortion objective plays an important role in generating patches that make the pages more attractive to end users.
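For reference, the Vargha-Delaney A measure used in the discussion above estimates the probability that a rating drawn from one group exceeds a rating drawn from the other, counting ties as half a win. The sketch below computes it directly over all pairs; the data shown are toy values, not the study's actual responses.

public final class VarghaDelaneyA {
    /** A = P(X > Y) + 0.5 * P(X = Y), estimated over all pairs of observations. */
    public static double aMeasure(double[] x, double[] y) {
        double wins = 0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) wins += 1.0;
                else if (xi == yj) wins += 0.5;   // ties count half
            }
        }
        return wins / ((double) x.length * y.length);
    }

    public static void main(String[] args) {
        // Toy readability ratings for a repaired page versus its original version.
        double[] repaired = {7, 8, 6, 9, 7};
        double[] original = {6, 6, 5, 8, 7};
        System.out.printf("A = %.2f%n", aMeasure(repaired, original)); // 0.5 = no effect
    }
}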
5.3.5 Threats to Validity

External Validity: The first potential threat is bias in the selection of participants for the user-based study in Experiment Two. To address this threat, I used AMT, which provided a large pool of anonymous participants. I only specified qualification requirements for the participants in the user study (i.e., high numbers of previously completed tasks and high approval ratings, as outlined in Section 5.3.4) to ensure authentic results. Another potential threat is in the selection of subject web pages for the evaluation of MFix. To mitigate any bias, I used the home pages of websites drawn from Alexa's 50 top ranked websites from different categories.

Internal Validity: One potential threat is that the survey used in the user-based study may not render in full resolution when viewed on small screen devices, potentially impacting the results. To mitigate this threat, I asked participants to enter the device they used for answering the survey, so that I could isolate those results. However, only a small minority of the participants did not use a desktop or laptop, and the results from the few who used a mobile phone or tablet to answer the survey did not indicate any anomalous responses. Another potential threat is the use of the GMFT and PSIT in MFix to determine mobile friendly problems and the mobile friendliness score. However, the PSIT is the only publicly available tool that reports a mobile friendliness score. Bing only offers a web interface for detecting mobile friendly problems, unlike the GMFT, which provides an API. Furthermore, the GMFT and PSIT are stable tools that are used by Google to rank pages in their own search results.

Construct Validity: A potential threat is that the layout distortion objective used in MFix quantifies the aesthetic value of a page, which is a subjective aspect of a web page. To address this threat, I conducted two user-based studies (i.e., Experiment Two) to qualitatively understand the impact of layout distortion on the visual appeal of a page. A second potential threat is that the numbers supplied by participants in response to the readability and attractiveness ratings are also subjective. To mitigate this threat, I used the relative values given by the participants for the before and after repair versions of the subjects, as opposed to their absolute values. That is, although two participants may supply different numbers for the ratings of the same pair of web pages, they will supply higher values for one of the pages if they believe that its readability/attractiveness is better. A third potential threat is that I used screenshots of the subject pages in the user-based study as opposed to allowing the users to interact with the pages on mobile devices. I selected this mechanism because I wanted the users to visualize the before and after repair versions of the pages next to each other to allow for an easy comparison. Also, I did not have control over participants' mobile devices and wanted to avoid the variations in results that this could cause. A fourth potential threat is that the participants may have a bias in selecting the repaired version based on the order in which the original and repaired versions are presented in the survey. To mitigate this threat, I randomized the order of the two versions for each question of the survey. A final potential threat is that the definition of correct repair used in MFix may differ from developer intent. However, this threat is mitigated by the user study results discussed in RQ3, which show that the repairs generated by MFix were considered visually appealing and were preferred by the participants.

5.4 Conclusion

To summarize, in this chapter I introduced an approach, MFix, for the automated repair of MFPs in web pages. MFix performs a search for the repair for MFPs using an approximation algorithm. Candidate repairs in the search are selected using Gaussian perturbation around the value suggested by the GMFT. The fitness function is comprised of two objectives: first, maximizing the mobile friendliness score given by the PSIT; second, minimizing the amount of change between the layout of a page containing a candidate repair and the layout of the original page. The amount of change in the layout of a page is calculated using constraints encoded by the graph-based models of the page's layout. For building the graph-based models, MFix first segments the page into areas that form natural visual groupings. It then builds graph-based models of the HTML elements within each segment and also among the different segments in a page. In the evaluation, I found that MFix was effective in resolving mobile friendly problems for 95% of the subjects and required only an average of five minutes per subject. In a user study, the participants overwhelmingly preferred the repaired version of the website for use on mobile devices and also considered the repaired page to be significantly more readable than the original. Overall, these results are strong and support the hypothesis of my dissertation by showing that this approach using search-based techniques can help developers to improve the mobile friendliness of their web pages, while maintaining a usable layout.

Chapter 6

IFix: Repair of Internationalization Presentation Failures (IPFs)

Web applications enable companies to easily establish a global presence.
To more effectively communicate with this global audience, companies often employ internationalization (i18n) frameworks for their websites, which allow the websites to provide translated text or localized media content. However, because the length of translated text differs from that of text written in the original language of the page, the page's appearance can become distorted. HTML elements that are fixed in size may clip text or look too large, while those that are not fixed can expand, contract, and move around the page in ways that are inconsistent with the rest of the page's layout. Such distortions, called Internationalization Presentation Failures (IPFs), reduce the aesthetics or usability of a website and occur frequently: a recent study reports their occurrence in over 75% of internationalized web pages [47]. Avoiding presentation problems such as these is important. Studies show that the design and visual attractiveness of a website affects users' impressions of its credibility and trustworthiness, ultimately impacting their decision to spend money on the products or services that it offers [67, 70, 71].

Repairing IPFs poses several challenges for web developers. First, modern web pages may contain hundreds, if not thousands, of HTML elements, each with several CSS properties controlling their appearance. This makes it challenging for developers to accurately determine which elements and properties need to be adjusted in order to resolve an IPF. Assuming that the relevant elements and properties can be identified, the developers must still carefully construct the repair. Due to complex and cascading interactions between styling rules, a change in one part of a web page User Interface (UI) can easily introduce further issues in another part of the page. This means that any potential repair must be evaluated in the context of not only how well it resolves the targeted IPF, but also its impact on the rest of the page's layout as a whole. This task is complicated because it is possible that more than one element will have to be adjusted together to repair an IPF. For example, if the faulty element is part of a series of menu items, then all of the menu items may have to be adjusted to ensure their new styling matches that of the repaired element. Existing techniques targeting internationalization problems, such as GWALI [49], are only able to detect IPFs, and cannot generate repairs. Meanwhile, other web page repair approaches target fundamentally different UI problems and are not capable of repairing IPFs. These include XFix [97], which repairs cross-browser issues; and PhpRepair [142] and PhpSync [126], which repair malformed HTML.

To address these limitations, I present an approach, IFix, for automatically repairing IPFs in web pages. IFix is designed to handle the practical and conceptual challenges particular to the IPF domain: To identify elements whose styling must be adjusted together, I designed a novel style-based clustering approach that groups elements based on their visual appearance and DOM characteristics. To find repairs, I designed a guided search-based technique that efficiently explores the large solution space defined by the HTML elements and CSS properties. This technique is capable of finding a repair solution that best fixes an IPF while avoiding the introduction of new layout problems. To guide the search, I designed a fitness function that leverages existing IPF detection techniques and UI change metrics.
In an evaluation of the implementation of IFix, I found that it was effective at repairing IPFs, resolving over 98% of the detected IPFs, and also fast, requiring about four minutes on average to generate a repair. In a user study of the repaired web pages, I found that the repairs met with high user approval: over 70% of user responses rated the repaired pages as better than the faulty versions. Overall, these results are positive and indicate that IFix can help developers automatically resolve IPFs in web pages.

6.1 Background

Developers internationalize web applications by isolating language-specific content, such as text, icons, and media, into resource files. Different sets of resource files can then be utilized depending on the user's language (a piece of information supplied by their browser) and inserted into placeholders in the requested page. This isolation of language-specific content allows a developer to design a universal layout for a web page, easing its management and maintenance, while also modularizing language-specific processing.

However, the internationalization of web pages can distort their intended layout because the length of different text segments in a page can vary depending on their language. An increase in the length of a text segment can cause it to overflow the HTML element in which it is contained, be clipped, or spill over into surrounding areas of the page. Alternatively, the containing element may expand to fit the text, which can, in turn, cause a cascading effect that disrupts the layout of other parts of the page. IPFs can affect both the usability and the aesthetics of a web page. An example is shown in Figure 6.1b. Here, the text of the page in Figure 6.1a has been translated, but the increased number of characters required by the translated text pushes the final link of the navigation bar under an icon, making it difficult to read and click. Internationalization can also cause non-layout failures in web pages, such as corrupted text, inconsistent keyboard shortcuts, and incorrect/missing translations. IFix does not target these non-layout related failures, as I see their solutions as primarily requiring developer intervention to provide correct translations.

The complete process of debugging an IPF requires developers to (1) detect when an IPF occurs in a page, (2) localize the faulty HTML elements that are causing the IPF to appear, and (3) repair the web page by modifying the CSS properties of the faulty elements to ensure that the failure no longer occurs. An existing technique, GWALI [49], has been shown to be an accurate detection and localization technique for IPFs, i.e., it addresses the first and second parts of the debugging process described above. The inputs to GWALI are a baseline (untranslated) page, which represents a correct rendering of the page, and a translated version (the page under test (PUT)), which is analyzed for IPFs. To detect IPFs, GWALI builds a model called a Layout Graph (LG), which captures the position of each HTML element in a web page relative to the other elements. Each node of the graph represents a visible HTML element, while an edge between two nodes is annotated with a type of visual layout relationship (e.g., "East of", "intersects", "aligns with", "contains", etc.) that exists between the two elements. After building the LGs for the two versions of a page, GWALI compares them and identifies edges whose annotations are different in the PUT.
A difference in annotations indicates that the relative positions of the two elements are different, signaling a potential IPF. If an IPF is detected, GWALI outputs a list of HTML elements that are most likely to have caused it. IFix leverages the output of GWALI to initialize the repair process.

Assuming that an IPF has been detected and localized, there are several strategies developers can use to repair the faulty HTML elements. One of these is to change the translation of the original text so that the length of the translated text closely matches the original. However, this solution is not normally applicable, for two reasons. Firstly, the translation of the text is not always under the control of developers, having typically been outsourced to professional translators or to an automatic translation service. Secondly, a translation that matches the original text length may not be available. Therefore, a more typical repair strategy is to adapt the layout of the internationalized page to accommodate the translation. To do this, developers need to identify the right sets of HTML elements and CSS properties among the potentially faulty elements, and then search for new, appropriate values for their CSS properties. Together, these new values represent a language-specific CSS patch for the web page. To ensure that the patch is employed at runtime, developers use the CSS :lang() selector. This selector allows developers to specify alternative values for CSS properties based on the language in which the page is viewed. Although this repair strategy is relatively straightforward to understand, complex interactions among HTML elements, CSS properties, and styling rules make it challenging to find a patch that resolves all IPFs without introducing new layout problems or significantly distorting the appearance of a web UI. This challenge motivates my approach, which I present in the next section.
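As a concrete illustration of the patch format just described, the sketch below emits a language-scoped CSS rule of the kind a repair might contain. The language, selector, and property value are hypothetical and are not IFix's actual output for any subject.

public final class LangPatchExample {
    public static void main(String[] args) {
        String language = "es";          // language of the translated page
        String selector = ".nav-item";   // hypothetical selector for the repaired elements
        // The :lang() pseudo-class makes the browser apply this rule only when the
        // page is viewed in the given language, leaving the baseline page untouched.
        String patch = String.format(":lang(%s) %s { font-size: 13px; }", language, selector);
        System.out.println(patch);
    }
}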
6.2 Specialization of the Generalized Approach, *Fix

In this section, I provide details of my approach for repairing IPFs in web pages, which is based on the generalized repair approach, *Fix, explained in Chapter 3. The two prerequisites for repairing presentation failures, Detection and Localization, can be instantiated using existing techniques, such as GWALI [49]. For completeness, I summarize GWALI's detection and localization algorithm in Section 6.2.1. My contribution is in developing a repair approach for finding suitable fixes for IPFs in web pages using search-based techniques. Section 6.2.2 discusses this approach in more detail.

6.2.1 Detection and Localization of IPFs

For implementing the Detection and Localization phases of the debugging process, I used our prior work, GWALI [49]. For completeness, I provide a summary of GWALI's detection and localization algorithm and its evaluation results. GWALI takes as input a PUT and a baseline (untranslated) version of the page that shows its correct layout. To detect and localize IPFs, GWALI first builds graph-based models, called layout graphs, of the layout of each of these pages. GWALI models the layout of a page as a complete graph, where the nodes are the HTML and text elements in the page and the edges represent the visual relationships, such as alignment, direction, and containment, between the elements. After building the layout graphs for the PUT and the baseline, GWALI performs a heuristic-based matching to find corresponding nodes in the two layout graphs. GWALI then compares the two layout graphs to identify differences between them. These differences represent potentially faulty elements. Rather than performing a comprehensive pair-wise comparison of the edges in the graph, GWALI compares subgraphs of nodes and edges that are spatially close in the layout graph. Finally, GWALI analyzes and filters the elements identified by the graph comparison to produce a ranked list of elements for the developer. In the evaluation on 54 real-world subject web pages, GWALI was accurate, detecting IPFs with 91% precision and 100% recall, and identifying the faulty element with a median rank of three. GWALI was also fast, performing detection and localization for a given web page in 9.75 seconds.

6.2.2 Repair of IPFs

The goal of IFix is to automatically repair IPFs that have been detected in a translated version of a web page. As described in Section 6.1, a translation can cause the text in a web page to expand or contract, which leads to text overflow, element movement, incorrect text wrapping, and misalignment. The placement and the size of elements in a web page are controlled by their CSS properties. Therefore, these failures can be fixed by changing the values of the CSS properties of elements in a page to allow them to accommodate the new size of the text after translation. Finding these new values for the CSS properties is complicated by several challenges.

The first challenge is that any kind of style change to one element must also be mirrored in stylistically related elements. This is illustrated in Figure 6.1, which shows an IPF on the DMV homepage (https://www.dmv.ca.gov) when translated from English to Spanish. To correct the overlap shown in Figure 6.1b, the text size of the word "Informacion" can be decreased, resulting in the layout shown in Figure 6.1c. However, this change is unlikely to be visually appealing to an end user, since the consistency of the header appearance has been changed. Ideally, the change in Figure 6.1d is preferred, which subtly decreases the font size of all of the stylistically related elements in the header.

[Figure 6.1: Example of an IPF and different ways of fixing it. (a) Correct and untranslated web page. (b) Translated web page containing an IPF (the last element overlaps with the button). (c) Inconsistent fix (the faulty element has been shrunk by using a significantly smaller font-size). (d) Consistent fix (slight font-size reduction for all header elements).]

This challenge requires that my solution identify groupings of elements that are stylistically similar and adjust them together in order to maintain the aesthetics of a web page. The second challenge is that a change for any particular IPF may introduce new layout problems into other parts of the page. This can happen when the elements surrounding the area of the IPF move to accommodate the changed size of the repaired element. This challenge is compounded when there are multiple IPFs in a page or there are many elements that must be adjusted together, since multiple changes to the page increase the likelihood that the final layout will be distorted. This challenge requires that the solution find new values for the CSS properties that fix IPFs while avoiding the introduction of new layout problems.

Two insights into these challenges guide the design of IFix. The first insight is that it is possible to automatically identify elements that are stylistically similar through an approach that uses traditional density-based clustering techniques.
I designed a clustering technique that is based on a combination of visual aspects (e.g., elements' alignment) and DOM-based metrics (e.g., XPath similarity). This allows IFix to accurately group stylistically similar elements that need to be changed together to maintain the aesthetic consistency of a web page's style. The second insight is that it is possible to quantify the amount of distortion introduced into a page by IPFs and use this value as a fitness function to guide a search for a set of new CSS values. I designed IFix's fitness function using existing detectors for IPFs (i.e., GWALI [49]) and other metrics for measuring the amount of difference between two UI layouts. Therefore, the goal of IFix's search-based approach is to find a solution (i.e., new CSS values) that minimizes this fitness function.

Figure 6.2 shows an overview of IFix and shows the instantiations of the different abstraction points (AP1-4) of the *Fix approach.

[Figure 6.2: Overview of the IFix approach for repairing IPFs. Pipeline: Detection and Localization via GWALI [ICST'16]; identify stylistically similar clusters (AP1); generate initial population (AP1); fine tuning (AP2); mutation (AP2); fitness function to quantify layout inconsistency and amount of change (AP3); terminate? (AP4); output: repaired page (PUT').]

The inputs to the approach are: a version of the web page (baseline) that shows its correct layout, a translated version (PUT) that exhibits IPFs, and a list of HTML elements of the PUT that are likely to be faulty. The last input can be provided either by a detection technique, such as GWALI, or manually by developers. Developers could simply provide a conservative list of possibly faulty HTML elements, but the use of an automated detection technique allows the debugging process to be fully automated. IFix begins by analyzing the PUT and identifying the stylistically similar clusters that include the potentially faulty elements. Then, the approach performs a guided search to find the best CSS values for each of the identified clusters. When the search terminates, the best CSS values obtained from all of the clusters are converted to a web page CSS repair patch and provided as the output of the approach. I now explain the parts of the approach in more detail in the following subsections.

6.2.3 Identifying Stylistically Similar Clusters

The goal of this step is to group HTML elements in the page that are visually similar into Sets of Stylistically Similar Elements (SimSets). To group a page's elements into SimSets, IFix computes the visual similarity and DOM information similarity between each pair of elements in the page. I designed a distance function that quantifies the similarity between each pair of elements e1 and e2 in the page. Then IFix uses a density-based clustering technique to determine which elements are in the same SimSet. After computing these SimSets, IFix identifies the SimSet associated with each faulty element reported by GWALI. This subset of the SimSets serves as an input to the search.

Different techniques can be used to group the HTML elements in a web page. A naive mechanism is to put elements having the same style class attribute into the same SimSet. In practice, I found that the class attribute is not always used by developers to set the style of similar elements, or, in some cases, it does not match for elements in the same SimSet.
There are several more sophisticated techniques that may be applied to group related elements in a web page, such as Vision-based Page Segmentation (VIPS) [55], Block-o-Matic [143], and R-Trees [103]. These techniques rely on elements' locations in the web page and use different metrics to divide the web page into multiple segments. However, these techniques do not produce sets of visually similar elements as needed by IFix. Instead, they produce sets of web page segments that group elements that are located close to each other and are not necessarily similar in appearance. The clustering in IFix uses multiple visual aspects to group the elements, while the aforementioned techniques rely solely on the location of the elements, which makes them unsuitable for IFix.

To identify stylistically similar elements in the page, IFix uses a density-based clustering technique, DBSCAN [68]. A density-based clustering technique finds sets of elements that are close to each other, according to a predefined distance function, and groups them into clusters. Density-based clustering is well suited for IFix for several reasons. First, the distance function can be customized for the problem domain, which allows IFix to use style metrics instead of location. Second, this type of clustering does not require prior knowledge of the number of clusters, which is ideal for IFix since each stylistically similar group may have a different number of elements, making the total number of clusters unknown beforehand. Third, the clustering technique puts each element into only one cluster (i.e., hard clustering). This is important because if an element were placed into multiple SimSets, the search could define multiple change values for it, which may prevent the search from converging if the changes are conflicting.

IFix's distance function uses several metrics to compute the similarity between pairs of elements in a page. At a high level, these metrics can be divided into two types of similarity: (1) similarity in the visual appearance of the elements, including width, height, alignment, and CSS property values; and (2) similarity in the DOM information, including XPath, HTML class attribute, and HTML tag name. I include DOM-related metrics in the distance function because using only visual similarity metrics may produce inaccurate clusters in cases where the elements belonging to a cluster are intentionally made to appear different, for example, to highlight the link of the currently rendered page in a list of navigational menu links. Since the different metrics have vastly different value ranges, IFix normalizes the value of each metric to the range [0, 1], with zero representing a match for the metric and 1 being the maximum difference. The overall distance computed by the function is the weighted sum of the normalized metric values. The metrics' weights were determined based on experimentation on a set of web pages and are the same for all subjects. Next, I provide a detailed description of each of the metrics IFix uses in the distance function.
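Before the detailed definitions, here is a minimal sketch of how such metrics could combine into the weighted distance. The ElementInfo type and the normalization choices are hypothetical simplifications, and the weights are the ones reported later in Section 6.3.1; a function of this kind drives the DBSCAN clustering (the implementation uses the Apache Commons Math3 DBSCAN, as noted there).

import java.util.*;

final class ElementInfo {
    double width, height;                            // rendered size
    double x1, y1, x2, y2;                           // MBR corners
    String tag;                                      // HTML tag name
    String[] xpathTags;                              // XPath split into its tag steps
    Set<String> classNames = new HashSet<>();        // HTML class attribute values
    Map<String, String> cssProps = new HashMap<>();  // explicitly defined CSS only
}

final class StyleDistance {
    // Weights determined experimentally (Section 6.3.1).
    static final double W_SIZE = 0.1, W_ALIGN = 0.1, W_CSS = 0.3,
                        W_TAG = 0.4, W_XPATH = 0.3, W_CLASS = 0.2;

    static double distance(ElementInfo a, ElementInfo b) {
        double d = 0;
        d += W_SIZE * (a.width == b.width ? 0 : 1);  // binary width match
        d += W_SIZE * (a.height == b.height ? 0 : 1);
        d += W_ALIGN * (a.x1 == b.x1 ? 0 : 1);       // left-edge alignment
        d += W_ALIGN * (a.x2 == b.x2 ? 0 : 1);       // right-edge alignment
        d += W_ALIGN * (a.y1 == b.y1 ? 0 : 1);       // top-edge alignment
        d += W_ALIGN * (a.y2 == b.y2 ? 0 : 1);       // bottom-edge alignment
        d += W_CSS * (1 - matchRatio(a.cssProps, b.cssProps));
        d += W_TAG * (a.tag.equals(b.tag) ? 0 : 1);  // binary tag-name match
        d += W_XPATH * levenshtein(a.xpathTags, b.xpathTags)
                / (double) Math.max(1, Math.max(a.xpathTags.length, b.xpathTags.length));
        d += W_CLASS * (1 - jaccard(a.classNames, b.classNames));
        return d;
    }

    // Ratio of matching CSS values over all properties defined on either element.
    static double matchRatio(Map<String, String> p, Map<String, String> q) {
        Set<String> all = new HashSet<>(p.keySet());
        all.addAll(q.keySet());
        if (all.isEmpty()) return 1;
        long match = all.stream().filter(k -> Objects.equals(p.get(k), q.get(k))).count();
        return match / (double) all.size();
    }

    // Ratio of matching class names over all class names that are set.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1;
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return inter.size() / (double) union.size();
    }

    // Edit distance over the XPath tag steps (insert/delete/substitute one tag).
    static int levenshtein(String[] s, String[] t) {
        int[][] d = new int[s.length + 1][t.length + 1];
        for (int i = 0; i <= s.length; i++) d[i][0] = i;
        for (int j = 0; j <= t.length; j++) d[0][j] = j;
        for (int i = 1; i <= s.length; i++)
            for (int j = 1; j <= t.length; j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (s[i - 1].equals(t[j - 1]) ? 0 : 1));
        return d[s.length][t.length];
    }
}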
6.2.3.1 Visual Similarity Metrics

These metrics are based on the similarity of the visual appearance of the elements. IFix uses three types of visual metrics to compute the distance between two elements e1 and e2. These are:

Elements' width and height match: Elements that are stylistically similar are more likely to have matching widths and/or heights. IFix defines width and height matching as a binary metric. If the widths of the two elements e1 and e2 match, then the width metric value is set to 0; otherwise it is set to 1. The height metric value is computed similarly.

Elements' alignment match: Elements that are similar are more likely to be aligned with each other. This is because browsers render a web page using a grid layout, which aligns elements belonging to the same group either horizontally or vertically. Alignment includes left edge alignment, right edge alignment, top edge alignment, and bottom edge alignment. These four alignment metrics are binary metrics, so they are computed in a way similar to the width and height metrics.

Elements' CSS properties similarity: Aspects of the appearance of the elements in a web page, such as their color, font, and layout, are defined in the CSS properties of these elements. For this reason, elements that are stylistically similar typically have the same values for their CSS properties. IFix computes the similarity of the CSS properties as the ratio of the matching CSS values over all CSS properties defined for both elements. For this metric, IFix only considers explicitly defined CSS properties, so it does not take into account default CSS values and CSS values that are inherited from the body element in the web page. These values match for all elements and are not helpful in distinguishing elements of different SimSets.

6.2.3.2 DOM Information Similarity Metrics

These metrics are based on the similarity of features defined in the DOM of the web page. IFix uses three types of DOM-related metrics to compute the distance between two elements e1 and e2. These are:

Elements' tag name match: Elements in the same SimSet have the same type, so their HTML tag names need to match. HTML tag names are used as a binary metric, i.e., if e1 and e2 have the same tag name, then the metric value is set to 0; otherwise it is set to 1.

Elements' XPath similarity: Elements that are in the same SimSet are more likely to have similar XPaths. The XPath similarity between two elements quantifies the commonality in the ancestry of the two elements. In HTML, elements in the page inherit CSS properties from their parent elements and pass them on to their children. More ancestors in common between two elements means more inherited styling information is shared between them. To compute the XPath distance, IFix uses the Levenshtein distance between the elements' XPaths. More formally, the XPath distance is the minimum number of HTML tag edits (insertions, deletions, or substitutions) required to change one XPath into the other.

Elements' class attribute similarity: As mentioned earlier, an HTML element's class attribute is often insufficient to group similarly styled elements. Nonetheless, it can be a useful signal; therefore, I use class attribute similarity as one of the metrics for style similarity. An HTML element can have multiple class names for the class attribute. Our approach computes the similarity in the class attribute as the ratio of class names that match over all class names that are set.

6.2.4 Candidate Solution Representation

A repair for the PUT is represented as a collection of changes for each of the SimSets identified by the clustering technique. More formally, I define a potential repair as a candidate solution, which is a set of change tuples. Each change tuple is of the form ⟨S, p, Δ⟩, where Δ is the change value that IFix applies to a specific CSS property p for a particular SimSet S.
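Read literally, a candidate solution is just a set of such tuples. The following minimal sketch of the representation (class and field names are hypothetical) also encodes the example candidate solution discussed next:

final class ChangeTuple {
    final String simSet;    // identifies the SimSet S
    final String property;  // CSS property p, e.g. "font-size"
    final double delta;     // change value: negative decreases, zero leaves unchanged

    ChangeTuple(String simSet, String property, double delta) {
        this.simSet = simSet; this.property = property; this.delta = delta;
    }

    public static void main(String[] args) {
        // The example candidate solution given in the text below.
        java.util.List<ChangeTuple> candidate = java.util.Arrays.asList(
            new ChangeTuple("S1", "font-size", -1), new ChangeTuple("S1", "width", 0),
            new ChangeTuple("S1", "height", 0), new ChangeTuple("S2", "font-size", -1),
            new ChangeTuple("S2", "width", 10), new ChangeTuple("S2", "height", 0));
    }
}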
The change value Δ can be positive or negative, representing an increase or decrease in the value of p. Note that a candidate solution can have multiple change tuples for the same SimSet as long as they target different CSS properties. An example candidate solution is (⟨S1, font-size, -1⟩, ⟨S1, width, 0⟩, ⟨S1, height, 0⟩, ⟨S2, font-size, -1⟩, ⟨S2, width, 10⟩, ⟨S2, height, 0⟩). This candidate solution represents a repair to the PUT that decreases the font-size of the elements in S1 by one pixel, decreases the font-size of the elements in S2 by one pixel, and increases the width of the elements in S2 by ten pixels. Note that the value "0" means that there is no change to the elements in the SimSet for the specified property.

6.2.5 Fitness Function

To evaluate each candidate solution, IFix first generates a PUT' by adjusting the elements of the PUT based on the values in the candidate solution. The approach then calculates the fitness score of the PUT' when it is rendered in a browser. I now describe both of these steps in detail.

6.2.5.1 Generating the PUT'

To generate the PUT', IFix modifies the PUT according to the values in the candidate solution that will subsequently be evaluated. The approach also modifies the width and the height of any ancestor element that has a fixed width or height that prevents the children elements from expanding freely. An example of such an ancestor element is shown in Figure 6.3.

[Figure 6.3: Example of ancestor elements with fixed width that need to be adjusted together with SimSet elements. A div ancestor with a fixed width contains the SimSet S elements (Header1 through Header4); a change value Δ applied to SimSet S also needs to be applied to the parent with the fixed width.]

In the example, increasing the width of the elements in SimSet S requires a modification to the fixed width value of the ancestor div element in order to make space for the children elements' expansion. To modify the elements that need to be changed in the PUT, IFix uses the following algorithm. IFix iterates over each change tuple ⟨S, p, Δ⟩ in the candidate solution and modifies the elements e ∈ S by changing their CSS property values: e.p = e.p + Δ. Then IFix computes the cumulative increase in width and height for all the elements in S and determines the new coordinates ⟨x1, y1⟩, ⟨x2, y2⟩ of the Minimum Bounding Rectangle (MBR) of each element e. Then IFix finds the new position of the right edge of the rightmost element, max(e_x2), and the new position of the bottom edge of the bottommost element, max(e_y2). After that, IFix iterates over all the ancestors of the elements in S. For each ancestor a, if a has a fixed value for the width CSS property and max(e_x2) is larger than a_x2, then IFix increases the width of the ancestor: a.width = a.width + (max(e_x2) - a_x2). A similar increase is applied to the height, if the ancestor has a fixed value for the height CSS property and max(e_y2) is larger than a_y2.
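A compact sketch of the ancestor-adjustment step just described, using a minimal stand-in type rather than IFix's actual element model:

import java.util.List;

final class AncestorAdjuster {
    static final class Box {                 // minimal element stand-in
        double width, height, x2, y2;        // size and bottom-right MBR corner
        boolean fixedWidth, fixedHeight;     // explicit CSS width/height set
    }

    /** Expand fixed-size ancestors so the modified SimSet elements fit inside them. */
    static void adjust(List<Box> simSetElements, List<Box> ancestors) {
        // New right edge of the rightmost element and bottom edge of the bottommost one.
        double maxX2 = simSetElements.stream().mapToDouble(e -> e.x2).max().orElse(0);
        double maxY2 = simSetElements.stream().mapToDouble(e -> e.y2).max().orElse(0);
        for (Box a : ancestors) {
            if (a.fixedWidth && maxX2 > a.x2) {
                a.width += maxX2 - a.x2;     // a.width = a.width + (max(e_x2) - a_x2)
                a.x2 = maxX2;
            }
            if (a.fixedHeight && maxY2 > a.y2) {
                a.height += maxY2 - a.y2;    // analogous increase for a fixed height
                a.y2 = maxY2;
            }
        }
    }
}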
This component measures the impact of IPFs by quantifying the dissimilarity between the PUT 0 layout and the baseline layout. The second part of the tness 82 function is the Amount of Change component. This component quanties the amount of change the candidate solution applies to the page in order to repair it. To combine the two components of the tness function,IFix uses a prioritized tness function model in which minimizing the amount of layout inconsistency has a higher priority than minimizing the amount of change. The amount of layout inconsistency is given higher priority because it is strongly tied with resolving the IPFs, which is the goal ofIFix, while amount of change component is used after resolving the IPFs to make the changes as minimal as possible. The prioritization is done by using a sigmoid function to scale the amount of change to a fraction between 0 and 1 and adding it to the amount of layout inconsistency value. Using this, the overall tness function is equal to amount of layout inconsistency +sigmoid(amount of change). I now describe the components of the tness function in more detail. Amount of Layout Inconsistency: This component represents a quantication of the dis- similarity between the baseline and the PUT 0 LGs. To compute the value for this component, IFix computes the coordinates of the MBRs of each element and the inconsistencies in the PUT as reported by GWALI. ThenIFix computes the distance (in pixels) required to make the relation- ships in the two LGs match. The number of pixels is computed for every inconsistent relationship reported by GWALI. For alignment inconsistencies, if two elements e1 and e2 are top-aligned in the baseline and not top-aligned in the PUT 0 ,IFix computes the dierence in the vertical position of the top side of the two elementsje1 y1 e2 y1 j. A similar computation is performed for bottom-alignment, right-alignment, and left-alignment. For direction inconsistencies, if e1 is situated to the \West" ofe2 in the baseline, and is no longer \West" in the PUT 0 ,IFix computes the number of pixels by which e2 needs to move to be to the West of e1, which is e1 x2 e2 x1 . A similar computation is performed for East, North, and South relationships. For containment inconsistencies, if e1 bounds (i.e., contains) e2 in the baseline, and no longer bounds it in the PUT 0 ,IFix computes the vertical and horizontal expansion needed for each side of e1's MBR to make it bound e2. The number of pixels computed for each of these inconsistent relationships (alignment, directional, and bounding) is added to get the total amount of layout inconsistency. Amount of Change: This component represents the amount of change a candidate solution causes to the page. To compute this amount, IFix calculates the percentage of change that is applied to each CSS property for every modied element in the page. The total amount of change is the summation of the squared percentages of changes. The intuition behind squaring the percentages of change is to penalize solutions more heavily if they represent a large change. 83 6.2.6 Search The goal of the search is to nd values for the CSS properties of each SimSet that make the baseline page and the PUT have LGs that are matching with minimal changes to the page.IFix generates candidate solutions using the search operations I dene in this section. ThenIFix evaluates each candidate solution it generates using the tness function to determine if the candidate solution produces a better version of the PUT. 
6.2.6 Search

The goal of the search is to find values for the CSS properties of each SimSet that make the baseline page and the PUT have matching LGs with minimal changes to the page. IFix generates candidate solutions using the search operations I define in this section. Then IFix evaluates each candidate solution it generates using the fitness function to determine if the candidate solution produces a better version of the PUT. The approach operates by going through multiple iterations of the search. In each iteration, the approach generates a population of candidate solutions. Then, the approach refines the population by keeping only the best candidate solutions and performing the search operations on them for another iteration. The search terminates when a termination condition is satisfied. After the search terminates, the approach returns the best candidate solution in the population. More formally, each iteration includes five main steps: (1) initializing the population, (2) fine-tuning the best solution using local search, (3) performing mutation, (4) selecting the best set of candidate solutions, and (5) terminating the search if a termination condition is satisfied. The following is a description of each step in more detail:

Initializing the population: This step creates the initial population of candidate solutions that IFix performs the search on. The goal of this step is to create a diverse initial population that allows the search to explore different areas of the solution space. Figure 6.4 shows an overview of the process of initializing the population.

[Figure 6.4: Initializing the population. From the baseline and the PUT, the approach analyzes the text expansion and generates candidate solutions based on the expansion: one with increased width, one with increased height, and one with decreased font; further candidate solutions with randomly mutated values are produced via mutation to complete the initial population.]

In the figure, the first set of candidate solutions represents modifications to the elements that are computed based on the text expansion that occurred in the PUT. To generate this set of candidate solutions, IFix computes the average percentage of text expansion in the elements of each SimSet that includes a faulty element. Then IFix generates three candidate solutions based on the expansion percentage. The first candidate solution increases the width of the elements in the SimSets by a percentage equal to the percentage of the text expansion. The second candidate solution increases the height by the same percentage. The third candidate solution decreases the font-size of the elements in the SimSets by the same percentage. The rest of the candidate solutions in the initial population (i.e., the fourth candidate solution in the figure) are generated by creating copies of the current candidate solutions and mutating the copies using the mutation operation described in the mutation step below.

Fine tuning using local search: This step works by selecting the best candidate solution in the population and fine-tuning its change values in order to get the best possible fix. To do this, IFix uses the Alternating Variable Method (AVM) local search algorithm [82, 84]. IFix performs the local search by iterating over all the change tuples in the candidate solution; for each change tuple, it tries a new value in a specific direction (i.e., it either increases or decreases the change value for the CSS property), then evaluates the fitness of the new candidate solution to determine if it is an improvement. If there is an improvement, the search keeps trying larger values in the same direction. Otherwise, it tries the other direction. This process is repeated until the search finds the best possible change values based on the fitness function. The newly generated candidate solution is added to the population.
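A compact sketch of the AVM-style fine-tuning loop just described. The Evaluator stands in for rendering the modified PUT and computing the fitness of Section 6.2.5 (lower is better), and the doubling acceleration schedule is a typical AVM choice rather than IFix's documented one.

final class AvmTuner {
    interface Evaluator { double fitness(double[] deltas); }  // lower is better

    /** Alternating Variable Method: tune one change value at a time. */
    static double[] tune(double[] deltas, Evaluator eval) {
        double best = eval.fitness(deltas);
        boolean improved = true;
        while (improved) {                                    // repeat full passes until stable
            improved = false;
            for (int i = 0; i < deltas.length; i++) {         // one variable at a time
                for (int dir : new int[] {+1, -1}) {          // probe both directions
                    double step = 1;
                    while (true) {                            // accelerate while improving
                        deltas[i] += dir * step;
                        double f = eval.fitness(deltas);
                        if (f < best) {
                            best = f;
                            improved = true;
                            step *= 2;                        // pattern move: larger step
                        } else {
                            deltas[i] -= dir * step;          // revert the failed move
                            break;
                        }
                    }
                }
            }
        }
        return deltas;
    }
}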
Mutation: The goal of the mutation step is to diversify the population and explore change values that may not be reached during the AVM search. IFix performs standard Gaussian mutation operations on the change values in the candidate solutions. It iterates over all the candidate solutions in the population and generates a new mutant for each one. IFix creates a mutant by iterating over each tuple in the candidate solution and changing its value with a probability of 1 / (number of change tuples). The new change value is picked randomly from a Gaussian distribution around the old value. The newly generated candidate solutions are added to the population to be evaluated in the selection step.

Selection: IFix evaluates all of the candidate solutions in the current population and selects the best n candidate solutions, where n is the predefined size of the population. The best candidate solutions are identified based on the fitness function described in Section 6.2.5.2. The selected candidate solutions are used as the population for the next iteration of the search.

Termination: The algorithm terminates when either of two conditions is satisfied. The first condition is when a predefined maximum number of iterations is reached. This condition is used to bound the execution time of the search and prevents it from running for a long time without converging to a solution. The second condition is when the search reaches a saturation point (i.e., no improvement in the candidate solutions for multiple consecutive iterations). In this case, the search has most likely converged to the best candidate solution it can find, and further iterations will not introduce more improvement.

Table 6.1: Subjects used in the evaluation of IFix

ID  Name           URL                             #HTML  Baseline  Translated
1   akamai         https://www.akamai.com          304    English   Spanish
2   caLottery      http://www.calottery.com        777    English   Spanish
3   designSponge   http://www.designsponge.com     1,184  English   Spanish
4   dmv            https://www.dmv.ca.gov          638    English   Spanish
5   doctor         https://sfplasticsurgeon.com    689    English   Spanish
6   els            https://www.els.edu             483    English   Portuguese
7   facebookLogin  https://www.facebook.com        478    English   Bulgarian
8   flynas         http://www.flynas.com           1,069  English   Turkish
9   googleEarth    https://www.google.com/earth    323    Italian   Russian
10  googleLogin    https://accounts.google.com     175    English   Greek
11  hightail       https://tinyurl.com/y9tpmro7    1,135  English   German
12  hotwire        https://www.hotwire.com         583    English   Spanish
13  ixigo          https://www.ixigo.com/flights   1,384  English   Italian
14  linkedin       https://www.linkedin.com        586    English   Spanish
15  mplay          http://www.myplay.com           3,223  English   Spanish
16  museum         https://www.amnh.org            585    English   French
17  qualitrol      http://www.qualitrolcorp.com    401    English   Russian
18  rentalCars     http://www.rentalcars.com       1,011  English   German
19  skype          https://tinyurl.com/ycuxxhso    495    English   French
20  skyScanner     https://www.skyscanner.com      388    French    Malay
21  twitterHelp    https://support.twitter.com     327    English   French
22  westin         https://tinyurl.com/ycq4o8ar    815    English   Spanish
23  worldsBest     http://www.theworlds50best.com  581    English   German

IFix could fail to find an acceptable fix under two scenarios. The first scenario is when GWALI does not include the actual faulty HTML element in its reported list. IFix assumes that the initial set of elements provided as input contains the faulty elements. If this assumption is violated, IFix will not be able to find a repair. The second scenario is when the search does not converge to an acceptable fix. This could occur due to the non-determinism of the search.
6.3 Evaluation

To assess the effectiveness and performance of IFix, I conducted an empirical evaluation on 23 real-world subject web pages and answered three research questions:

RQ1: How effective is IFix in reducing IPFs?
RQ2: How long does it take for IFix to generate repairs?
RQ3: What is the quality of the fixes generated by IFix?

6.3.1 Implementation

I implemented the approach in Java as a prototype tool named IFix [2]. I used the Apache Commons Math3 library implementation of the DBSCAN algorithm to group similarly styled HTML elements. I used Javascript and Selenium WebDriver for dynamically applying candidate fix values to the pages and for extracting the rendered Document Object Model (DOM) information, such as element MBRs and XPath. I used the jStyleParser library for extracting explicitly defined CSS properties for HTML elements in a page. For obtaining the set of IPFs, I used the latest version of GWALI [49]. For the search technique described in Section 6.2.2, I used the following parameter values: population size = 100, mutation rate = 1.0, maximum number of iterations = 20, and saturation point = 2. For the Gaussian distribution used by the mutation operator, I used a 50% decrease and increase as the min and max values, and σ = (max - min) / 8.0 as the standard deviation. For clustering, I used the following weights for the different metrics: 0.1 for width/height and alignment, 0.3 for CSS properties similarity, 0.4 for tag name, 0.3 for XPath similarity, and 0.2 for class attribute similarity.

6.3.2 Subjects

For the evaluation I used 23 real-world subject web pages, as shown in Table 6.1. The column "#HTML" shows the total number of HTML elements in the subject page, which was counted by parsing the subject page's DOM for nodes of type "element". This gives a rough estimate of the page's size and complexity. The column "Baseline" shows the language of the subject used in the baseline version that shows the correct appearance of the page, and "Translated" shows the language that exhibits IPFs in the subject with respect to the baseline. I gathered these subjects from the web pages used in the evaluation of GWALI [49]. The main criteria behind selecting this source were the presence of known IPFs in the study of GWALI and the diversity in size, layouts, and translation languages that the GWALI subjects offered. The 54 subjects used in the evaluation of GWALI were collected from three different sources: (1) builtwith.com, (2) the Alexa top 100 most visited websites, and (3) manual sampling of popular travel-related and telecom company websites. Out of the 54 subject pages used in the evaluation of GWALI, I filtered and selected only those web pages for which at least one IPF was reported.

6.3.3 Experiment One

To answer RQ1 and RQ2, I ran IFix on each subject and recorded the set of IPFs before and after each run, as reported by GWALI, and measured the total time taken.
To minimize the variance in the results that can be introduced by the non-deterministic aspects of the search, I ran IFix on each subject 30 times and used the mean values across the runs in the results. To further assess and understand the effectiveness of the two main features of IFix, guided search and style similarity clustering, I conducted additional experiment runs with three variations of IFix. The first variation replaced the guided search in the approach with a random search to evaluate the benefit of guided search with a fitness function. For every subject, I time-bounded the random search by terminating it once the average time required by IFix for that subject had been used. The second variation removed the clustering component from IFix to evaluate the benefit of clustering stylistically similar elements in a page. The third variation combined the first and second variations. As with IFix, I ran the three variations 30 times on each subject. All of the experiments were run on a 64-bit Ubuntu 14.04 machine with 32GB memory, an Intel Core i7-4790 processor, and a screen resolution of 1920 × 1080. For rendering the subject web pages, I used Mozilla Firefox v46.0.01 with the browser window maximized to the screen size.

For RQ1, I used GWALI to determine the initial number of IPFs in a subject and the number of IPFs remaining after each of the 30 runs. I calculated the reduction in IPFs as a percentage of the before and after values for each subject. For RQ2, I computed the average total running time of IFix and variation 2 across the 30 runs for each subject. I did not compare the performance of IFix with its first and third variations, since I time-bounded their random searches, as described above. I also measured the time required for the two main stages in IFix: clustering stylistically similar elements (Section 6.2.3) and searching for a repair patch (Section 6.2.6).

6.3.3.1 Presentation of Results

Table 6.2 shows the results for RQ1. The initial numbers of IPFs are shown under the column "#Before". The columns headed "#After" show the average number of IPFs remaining after each of the 30 runs of IFix and of its three variations: "Rand", "NoClust", and "Rand-NoClust". (Since it is an average, the results under the "#After" columns may show decimal values.) The average percentage reduction is shown in parentheses.

Table 6.2: Effectiveness of IFix in reducing IPFs

                        #After (Average Reduction in %)
Name           #Before  IFix       Rand           NoClust        Rand-NoClust
                                   (Variation 1)  (Variation 2)  (Variation 3)
akamai         6        0 (100)    2 (74)         0 (100)        0.20 (97)
caLottery      4        0 (100)    0 (100)        1 (70)         0.73 (81)
designSponge   9        0.07 (99)  3 (63)         0.07 (99)      3 (71)
dmv            18       0 (100)    4 (78)         2 (85)         9 (41)
doctor         21       0 (100)    0 (100)        6 (72)         21 (0)
els            6        0 (100)    0 (100)        0 (100)        0 (100)
facebookLogin  16       0 (100)    6 (65)         12 (25)        16 (0)
flynas         9        0 (100)    0.07 (99)      0 (100)        0 (100)
googleEarth    15       0 (100)    0 (100)        4 (72)         7 (55)
googleLogin    6        0 (100)    0 (100)        0 (100)        0 (100)
hightail       2        0 (100)    0 (100)        0 (100)        0 (100)
hotwire        30       0 (100)    0.47 (98)      4 (87)         4 (87)
ixigo          38       12 (68)    12 (68)        0 (100)        12 (68)
linkedin       22       0 (100)    0 (100)        12 (46)        19 (13)
mplay          76       0.40 (99)  3 (96)         3 (95)         51 (33)
museum         32       0.40 (99)  0 (100)        12 (63)        19 (40)
qualitrol      19       0 (100)    0 (100)        21 (-9)        22 (-16)
rentalCars     6        0 (100)    2 (74)         0 (100)        1 (99)
skype          3        0 (100)    0 (100)        0 (100)        0 (100)
skyScanner     4        0 (100)    0 (100)        0 (100)        0 (100)
twitterHelp    5        0 (100)    0 (100)        0 (100)        0.17 (97)
westin         11       1 (91)     1 (91)         1 (91)         1 (91)
worldsBest     24       0 (100)    7 (69)         0 (100)        17 (29)
Average        16       0.6 (98)   2 (90)         3 (82)         8 (65)
6.3.3.2 Discussion of Results

The results show that IFix was the most effective in reducing the number of IPFs, with an average 98% reduction, compared to its variations. This shows the effectiveness of IFix in resolving IPFs. The results also strongly validate my two key insights of using guided search and clustering in the approach. The first key insight was validated as IFix was able to outperform a random search that had been given the same amount of time. IFix was substantially more successful in primarily two scenarios. The first was pages (e.g., dmv and facebookLogin) containing multiple IPFs concentrated in the same area, which require a careful resolution of the IPFs by balancing the layout constraints without introducing new IPFs. The second was pages (e.g., akamai) that have strict layout constraints, permitting only a very small range of CSS values to resolve the IPFs. I also found that, overall, the repairs generated by random search were not visually pleasing, as they often involved a substantial reduction in the font-size of text, indicating that guidance was helpful for IFix. This observation was also reflected in the total amount of change made to a page, captured by the fitness function, which reported that random search introduced 28% more changes, on average, compared to IFix. The second key insight of using a style-based clustering technique was validated as IFix not only rendered the pages more visually consistent compared to its non-clustered variations, but also increased effectiveness by resolving a relatively higher number of IPFs.

Figure 6.5: Two UI snippets in the same equivalence class (from Hotwire)

Out of the 23 subjects, IFix was able to completely resolve all of the reported IPFs in 18 subjects in each of the 30 runs, and in 21 subjects in more than 90% of the runs. I investigated the two subjects, ixigo and westin, where IFix was not able to completely resolve all of the reported IPFs. I found that the dominant reason for the ixigo subject was false positive IPFs that were reported by GWALI. This occurred because the footer area of the page had significant differences in terms of layout and structure between the baseline and translated page. Therefore the CSS changes made by IFix were not sufficient to resolve the IPFs in the footer area. For the westin subject, elements surrounding the unrepaired IPF needed to be modified in order to completely resolve it. However, these elements were not reported by GWALI, thereby precluding IFix from finding a suitable fix.

The total running time of IFix ranged from 73 seconds to 17 minutes, with an average of just over 4 minutes and a median of 2 minutes. IFix was also three times faster, on average, than its second variation (no clustering). This was primarily because clustering narrowed the search space by grouping together potentially faulty elements reported by GWALI that were also stylistically similar, so that a single change to a cluster was capable of resolving multiple IPFs. Moreover, the clustering overhead in IFix was negligible, requiring less than a second, on average. The detailed timing results can be found at the project website [2].

6.3.4 Experiment Two

To address RQ3, I conducted a user study to understand the visual quality of IFix's suggested fixes from a human perspective. The general format of the survey was to present, in random order, a UI snippet containing an IPF from a subject web page, before and after repair.
The participants were then asked to compare the two UI snippets on a 5-point Likert scale with respect to their appearance similarity to the corresponding UI snippet from the baseline version. Each UI snippet showing an IPF was captured in the context of its surrounding region to allow participants to view the IPF from a broader perspective. Examples of UI snippets are shown in Figure 6.1b and Figure 6.5. To select the "after" version of a subject, I used the run with the best fitness score across the 30 runs of IFix in Experiment One. To determine the number of IPFs to be shown for each subject, I manually analyzed the IPFs reported by GWALI and identified groups of IPFs that shared a common visual pattern. I called these groups "equivalence classes". Figure 6.5 shows an example of an equivalence class from the Hotwire subject, where the two IPFs, caused by the price text overflowing its container, are highly similar. One IPF from each equivalence class was presented in the survey. To make the survey length manageable for the participants, I divided the 23 subjects over five different surveys, with each containing four or five subjects.

The participants of the user study were 37 undergraduate-level students. Each participant was assigned to one of the five surveys. The participants were instructed to use a desktop or laptop for answering the survey so as to be able to view the IPF UI snippets in full resolution.

6.3.4.1 Presentation of Results

The results for the appearance similarity ratings given by the participants for each of the IPFs in the 23 subjects are shown in Figure 6.6. On the x-axis, the ID and number of IPFs for a subject are shown. For example, 4a, 4b, and 4c represent the dmv subject with three IPFs. The blue colored bars above the x-axis indicate the number of ratings in favor of the after (repaired) version. The dark blue color shows participants' responses for the after version being much better than the before version, while the light blue color shows responses for the after version being somewhat better than the before version. Similarly, the red bars below the x-axis indicate the number of ratings in favor of the before-repair version, with dark and light red showing the responses for the before version being much and somewhat better than the after version, respectively. The gray bars show the number of ratings where the participants responded that the before and after versions had the same appearance similarity to the baseline. For example, IPF 23a had a total of 11 responses: six for the after version being much better, three for the after version being somewhat better, one reporting both versions as the same, and one reporting the before version as being somewhat better. As can be seen from Figure 6.6, 64% of the participant responses favored the after-repair versions, 21% favored the before-repair versions, and 15% reported both versions as the same.

Figure 6.6: Similarity ratings given by user study participants (the x-axis lists the IPFs from all subjects, 1a through 23b; the y-axis shows the number of participant responses in each rating category: before much/somewhat better, same, after somewhat/much better)

6.3.4.2 Discussion of Results

The results of the user study show that the participants largely rated the after (repaired) pages as better than the before (faulty) versions.
This indicates that IFix generates repairs that are high in visual quality. The IPFs presented in the user study, however, do not comprehensively represent all of the IPFs reported for the subjects, as the surveys only contained one representative from each equivalence class. Therefore I weighted the survey responses by multiplying each response from an equivalence class by the size of the class. The results are shown in Figure 6.7. With the weighting, 70% of the responses show support for the after version. Also, interestingly, the results show the strength of support for the after version: 41% of the responses rate the after version as much better, while only 5% of the responses rate the before version as much better.

Figure 6.7: Weighted distribution of the ratings (41% after much better, 29% after somewhat better, 13% same, 12% before somewhat better, 5% before much better)

Two of the IPFs, 3b and 23b, had no participant responses in favor of the after version. I inspected these subjects in more detail and found that the primary reason for this was that IFix substantially reduced the font-size (e.g., from 13px to 5px for 3b) to resolve the IPFs. Although these changes were visually unappealing, I was able to confirm that these extreme changes were the only way to resolve the IPFs. I also found that three IPFs, 7a, 19a, and 22b, had a majority of the participant responses reporting both versions as the same. IFix was unable to resolve 22b, implying that the before and after versions were practically the same. The issue with 7a and 19a was slightly different. Both IPFs were caused by guidance text in an input box being clipped because the translated text exceeded the size of the input box. Unless the survey takers could understand the target-language translation, there was no way to know that the guidance text was missing words.

6.3.5 Threats to Validity

The first potential threat is the use of only GWALI for detecting IPFs. However, there exist no other available automated tools that can detect IPFs and report potentially faulty HTML elements. Another potential threat is that I manually categorized IPFs into equivalence classes for the user study. However, this categorization was fairly straightforward, and in practice there was no ambiguity regarding membership in an equivalence class, as shown, for example, in Figure 6.5. To further support this, I have made the surveys and subject pages publicly available [2] for verification. A potential threat to construct validity is that I presented UI snippets of the subject pages to the participants, rather than full-page screenshots, which might have had an impact on their appearance similarity ratings. I opted for this mechanism because the full-page screenshots of the subjects were large, making it difficult to view all three screenshots (baseline, before (faulty), and after (repaired)) in one frame for comparison. The benefit of this mechanism was that it allowed the participants to focus only on the areas of the pages that contained IPFs and were thus modified by IFix.

6.4 Conclusion

In summary, in this chapter I presented the design of my approach, IFix, for the automated repair of IPFs in web pages. IPFs are distortions in the intended appearance of a web page that are caused by the relative expansion or contraction of translated text. IFix uses guided search-based techniques for finding repairs for IPFs. IFix uses text-expansion-based heuristics to identify initial candidate solutions.
It then uses the AVM search to fine-tune the best candidate solution in the population and mutation to diversify the population. The fitness function used to guide the search is designed based on existing detectors for IPFs (i.e., GWALI) and other metrics for measuring the amount of difference between two UI layouts. In the evaluation, IFix was able to resolve 98% of the reported IPFs. In a user study, 70% of the participant responses rated the fixed versions as better than the unfixed versions. Overall, these results are positive and support the hypothesis of my dissertation by showing that this approach using search-based techniques can help developers automatically resolve IPFs in web pages while maintaining the pages' aesthetic consistency.

Chapter 7

GFix: Repair of Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)

An attractive and visually appealing appearance is important for the success of a website. A recent study by Google underscores this point by noting that the average visitor forms a first impression of a web page within the first 50 milliseconds of visiting a page [152], an amount of time that is heavily influenced by a page's aesthetics. Other studies report that the appearance of a web application's User Interface (UI) plays an important role in shaping end users' perception of the services and products it offers [151, 87]. Companies put significant effort into the look and feel of their websites, employing graphic designers and illustrators to carefully craft their layout and graphics. Presentation failures, that is, discrepancies between the actual appearance of a website and its intended appearance, can undermine this effort and negatively impact end users' perception of the quality of the site and the services it delivers. These types of failures can also impact the usability of a web application's UI or be symptomatic of underlying logic or data problems.

The UIs of modern web applications are highly complex and dynamic. Back-end server code dynamically generates content, and client-side browsers render this content based on complex HTML, CSS, and JavaScript rules. A typical web page may contain hundreds or thousands of HTML elements. In turn, each of these elements may contain definitions of several dozen Cascading Style Sheets (CSS) properties ranging over a large set of possible values that control the element's appearance. These attributes can also interact with each other via cascading style rules and other modern layout mechanisms, such as floating elements, overlays, and dynamic sizing. This complexity makes it challenging for developers to debug presentation failures. The sheer number of HTML elements and style attributes that could be faulty makes it a labor-intensive task, and the complex interactions of the CSS rules and layout mechanisms make it difficult to accurately identify the relationship between an observed failure and the underlying HTML code responsible for that failure. To assist in this effort, developers have tools, such as Firebug [22], that can compute useful HTML and CSS information for a web page. Although helpful, such a debugging process remains a developer-driven process, and its accuracy and speed are dependent on the developer's expertise. For example, to use Firebug effectively, developers must be able to determine which HTML elements to investigate, understand the effects of the various style properties defined by those elements, and then repair them by performing the necessary modifications so that the page renders correctly.
Developers have several techniques available to them to help detect and localize presentation failures, but these techniques cannot generate repairs. Moreover, these techniques have limitations in even detecting presentation failures that either reduce their effectiveness or make them inappropriate for general usage. For example, many techniques are focused on one type of presentation failure, such as Cross Browser Issues (XBIs) (e.g., [137, 59, 141]), or on a limited and predefined set of application-independent failure types (e.g., Fighting Layout Bugs [149]), and cannot detect other types of presentation failures. Other techniques can only support debugging efforts where there is a prior working version that can be compared against (e.g., [76, 147]). Finally, a group of techniques requires testers to exhaustively specify all correctness properties to be checked (e.g., Selenium [42], Crawljax [119, 116, 120], Cucumber [19], and Sikuli [57, 165]), which is labor-intensive and potentially error-prone.

To address these limitations, I propose a novel approach, GFix, to assist developers in debugging presentation failures. The advantages of GFix are that it is fully automated and more widely applicable than previous approaches in terms of the types of presentation failures it can be used with. I frame the repair approach as a search-based problem where the identified answer is a new suggested fix value for the faulty HTML element and CSS style property that causes a presentation failure. My key insight in transforming this problem into a search-based problem is that image processing techniques can be used to compare the amount of deviation between a rendered page and its intended appearance. This difference can then be used as a fitness function to guide the exploration of likely repairs, and when the two have no differences, a successful repair has been identified. In the evaluation, I found that the GFix approach is accurate: it was able to resolve 94% of the presentation failures reported by WebSee [103], an automated tool for detecting presentation failures. In a user study of the repaired web pages, I found that the repairs met with high user approval: 79% of user responses rated the repaired pages as better than the faulty versions. Overall, these results are positive and indicate that GFix can help developers automatically resolve presentation failures in web pages.

7.1 Background and Motivation

After detecting a presentation failure, developers must debug their web applications to identify the underlying fault. In modern web applications, the appearance of a web page is defined by HTML tags and CSS properties that specify how each HTML tag will be rendered by the browser. Therefore, when a developer is debugging a page under test (PUT), they are trying to identify a new value for a CSS property of an HTML element that is set incorrectly. This is represented by a repair tuple, ⟨e, p, v, v′⟩, where e is an HTML element in the PUT, p is a CSS property of e, v is a value of p, and v′ is a suggested value of p for fixing the presentation failure. In the remainder of this section I describe two scenarios in which developers need to debug web pages in order to identify repairs for an observed presentation failure.
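To make this definition concrete, a repair tuple could be represented in Java roughly as follows. The class is a hypothetical sketch for illustration, not code from GFix.

    import org.w3c.dom.Element;

    // Sketch of the repair tuple <e, p, v, v'> described above: an HTML
    // element e, one of its CSS properties p, the property's current
    // (incorrect) value v, and the suggested fix value v'.
    public final class RepairTuple {
        final Element element;      // e: faulty HTML element in the PUT
        final String property;      // p: CSS property of e (e.g., "padding-left")
        final String currentValue;  // v: value of p as rendered (e.g., "75px")
        final String fixValue;      // v': suggested new value (e.g., "5px")

        public RepairTuple(Element element, String property,
                           String currentValue, String fixValue) {
            this.element = element;
            this.property = property;
            this.currentValue = currentValue;
            this.fixValue = fixValue;
        }
    }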
The first scenario is regression debugging. Developers often perform maintenance on their web pages in order to introduce new features, correct a bug, or refactor the HTML structure. For example, a refactoring may convert a <table>-based layout to one based on <div> tags, or convert HTML 4 tags to HTML 5. During this transformation, a developer may introduce a fault that causes a type of presentation failure called a Regression Debugging Problem (RDP). Developers must debug the UI to repair such RDPs. Existing techniques, such as Cross Browser Testing (XBT) [141, 59, 137], GUI differencing [161, 76, 75], automated oracle comparators [147, 146], or tools based on diff, may be of limited use in this scenario. The reason for this is that these techniques use a tree-based representation (e.g., the Document Object Model (DOM)) to compare the versions of the faulty web page. If a faulty change is small and localized within the tree, it may be straightforward for these techniques to identify the fault. However, if the tree structure has changed significantly (as in the above refactoring example), then a comparison will most likely result in many spurious changes being identified as the fault. Furthermore, these techniques assume that any difference between the tree-based representations implies a failure. This is not always true, as there can be multiple ways to implement the same visual appearance using HTML and CSS properties.

The second scenario is mockup-driven development, which occurs during the initial development of a web page's templates and user interfaces. In some development shops, front-end developers are guided by design mockups, which are detailed renderings of the intended appearance of a web page [123, 92, 130]. Front-end developers are expected to create "pixel-perfect" matches of these mockups [24] using web development tools, such as Adobe Muse, Amaya, or Visual Studio, which back-end developers can then modify by adding dynamic content. During this process, both types of developers need to ensure that their changes have not caused the page to look different than the mockup. Discrepancies in the actual appearance of the page from its mockup cause Mockup-driven Development Problems (MDDPs) to occur. Tree-based debugging techniques, such as those discussed earlier, would not be applicable in this scenario because there are no prior working versions against which to compare the DOMs. Other well-known techniques, such as Selenium [42], Cucumber [19], Crawljax [116, 119, 120], Sikuli [165, 57], graphical automated oracles [64], or Cornipickle [78], are not practical in this scenario because they require all correctness properties to be exhaustively specified, which is labor-intensive. Furthermore, the correctness properties are expressed in terms of HTML syntax, not the visual appearance of an element. Therefore, these techniques may miss presentation failures, such as the incorrect inheritance of an ancestor element's CSS properties.

7.2 Specialization of the Generalized Approach, *Fix

In this section, I provide details of my approach, GFix, for repairing MDDPs and RDPs (MRPFs)¹ in web pages. GFix is a specialization of the *Fix approach explained in Chapter 3. To provide the input functions D and L, I developed a technique, WebSee [103, 104], that detects and localizes MRPFs in web pages automatically. For completeness, I summarize WebSee's detection and localization algorithm in Section 7.2.1. I then explain the repair approach, GFix, in detail in Section 7.2.2.

7.2.1 Detection and Localization of MDDPs and RDPs

Testers have several techniques available to them to help detect and localize MRPFs. However, these techniques have limitations that either reduce their effectiveness or make them inappropriate for general usage.
For example, many techniques are focused on one type of presentation failure, such as XBIs (e.g., [137]), or on a limited and predefined set of application-independent failure types (e.g., Fighting Layout Bugs [149]), and cannot detect MRPFs. Other techniques can only support debugging efforts where there is a prior working version that can be compared against (e.g., [161, 147]). Finally, a group of techniques requires testers to exhaustively specify all correctness properties to be checked (e.g., Selenium [42], Crawljax [119], Cucumber [19], and Sikuli [57]), which is labor-intensive and potentially error-prone. To address these limitations, I developed a novel approach, WebSee [103, 104], to assist developers in detecting and localizing MRPFs. WebSee applies techniques from the field of computer vision to analyze the visual representation of a web page, identify MRPFs, and then determine which elements in the HTML source of the page could be responsible for the observed failures.

¹Henceforth, in this chapter I use MRPFs to collectively refer to MDDPs and RDPs.

From a high level, WebSee takes two inputs, a PUT and an appearance oracle (O) that specifies the visual correctness properties of the PUT, and processes them through three phases to locate MRPFs. The first phase, detection, compares the visual representations of the PUT and O to detect a set of differences in the PUT. The second phase, localization, analyzes a rendering map of the PUT to identify the set of HTML elements that define the difference pixels. Finally, the third phase, result set processing, prioritizes the identified set of elements and provides this as output to the developer.

To detect MRPFs, WebSee uses Perceptual Image Differencing (PID), a computer vision based technique for image comparison. PID uses computational models of the human visual system to compare two images [163, 164]. This allows the approach to compare the visual representations of the PUT and O based on an idea of "similarity" that corresponds to humans' visual concept of similarity. To compare a given pair of images, PID models three features of the human visual system: (1) spatial sensitivity, (2) luminance sensitivity, and (3) color sensitivity. The PID algorithm also accepts a threshold value as a parameter, which is used to decide whether the images are below a threshold of perceptible difference, and a field of view value in degrees, F, which indicates how much of the observer's view the screen occupies. The PID technique is particularly well-suited for this problem for two reasons. First, the three modeled features roughly account for the location (or size), contrast, and color of the HTML elements in the two pages, which together cover almost all possible visual rendering effects available via CSS or HTML. Second, the threshold and F allow the difference detection to be scaled to reduce false positives (via the threshold) and to account for visual representation sizes that are either very small (e.g., a smartphone) or large (e.g., a desktop web browser). WebSee uses the PID algorithm to compare the visual representations of O and the PUT at a tolerance level specified by the threshold and F. The result of this is a set DP that contains all pixels of the two images considered to be perceptually different.

Next, WebSee identifies a set of HTML elements in the PUT that may be responsible for the detected presentation failure. The general intuition is to identify the HTML elements whose rendering area includes the difference pixels DP. To do this, WebSee builds a Rectangle-tree (R-tree) model of the rendered PUT. An R-tree is a height-balanced tree data structure that is widely used in the spatial database community to store multidimensional information [77, 53]. In this case, the multidimensional data is the bounding rectangle that corresponds to the rendering area of an HTML element. The leaves of the R-tree are the HTML elements of the page, and the non-leaf nodes are bounding rectangles that contain groups of nearby elements. For each ⟨x, y⟩ pixel in DP, the containing HTML elements are found by traversing the R-tree's edges and are added to a set of potentially faulty HTML elements, E.
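To make the localization step concrete, the following is a minimal sketch of mapping difference pixels to the elements whose bounding rectangles contain them. For brevity, it uses a linear scan over element MBRs as a stand-in for the R-tree index (which answers the same containment queries more efficiently); the types and names are illustrative and are not taken from WebSee.

    import java.awt.Point;
    import java.awt.Rectangle;
    import java.util.*;

    // Sketch: map each difference pixel to the HTML elements whose rendered
    // bounding rectangle (MBR) contains it. A real implementation would index
    // the MBRs in an R-tree instead of scanning them linearly.
    public class Localizer {
        public static Set<String> localize(Set<Point> differencePixels,
                                           Map<String, Rectangle> elementMBRs) {
            // Keys are element identifiers, e.g., XPaths; values are their MBRs.
            Set<String> suspicious = new LinkedHashSet<>();
            for (Point dp : differencePixels) {
                for (Map.Entry<String, Rectangle> entry : elementMBRs.entrySet()) {
                    if (entry.getValue().contains(dp)) {
                        suspicious.add(entry.getKey());
                    }
                }
            }
            return suspicious;
        }
    }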
An R-tree is a height-balanced tree data structure that is widely used in the spatial database community to store multidimensional information [77, 53]. In this case, the multidimensional data is the bounding rectangle that corresponds to the rendering area of an HTML element. The leaves of the R-tree are the HTML elements of the page and the non-leaf nodes are bounding rectangles that contain groups of nearby elements. For eachhx;yi pixel inDP , the containing HTML elements are found by traversing the R-tree's edges and added to a set of potentially faulty HTML elements, E. The elements in the set E are then ordered 99 based on their likelihood of being the faulty element, since not all of the elements inE are equally likely to contain the fault. To perform this prioritization, WebSee utilizes heuristics based on the relationship of elements that contain dierence pixels. For example, prioritizing elements with a higher percentage of their pixels identied as dierence pixels. I evaluated the accuracy of WebSee using real-world mockups provided by an industrial partner and on several hundred faults seeded into well-known web apps, including Gmail, Craigslist, Oracle, Virgin America, and USC CS Research. For the test cases with the real-world mockups, WebSee performed strongly, 45% of the faulty elements were listed in the top 5 and 70% were within the top 10. For the seeded faults, WebSee was able to detect all of the failures and 93% of the time was able to return a set of potentially faulty HTML elements that contained the original seeded fault. I also evaluated WebSee in the context of a user study, where users had to manually perform the detection and identication of the potentially faulty elements. The users were only able to visually detect 76% of the failures, while WebSee detected all of the failures. Furthermore, the users could generate a set containing the original seeded fault only 36% of the time, while WebSee was able to correctly identify the fault 93% of the time. WebSee was also much faster, needed an average of 87 seconds to perform this analysis versus an average of 7 minutes for the users. 7.2.2 Repair of MDDPs and RDPs The goal ofGFix is to nd potential xes that can repair the detected MRPFs. TheGFix approach treats the repair as a search-based problem where the identied solutions | modications to the web page's CSS | are potential xes for the observed MRPFs. When the search nds the correct CSS value, applying it to the failing page will cause the rendering of that page to match an oracle representing the intended appearance. Finding these correct values for the CSS properties is complicated by several challenges. The rst challenge is that a repair for any particular MRPF may introduce new layout problems into other parts of the page. This can happen when the elements surrounding the area of the MRPF move to accommodate the changed size of a repaired element. This challenge is compounded when there are multiple MRPFs in a page or there are many elements that must be adjusted together, since multiple changes to the page increase the likelihood that the nal layout will be distorted. This challenge requires thatGFix must be able to nd new values for the CSS properties that balance between xing MRPFs and avoiding the introduction of new layout problems. My key insight is that it is possible to quantify the amount of distortion introduced into a page by MRPFs and use this value as a tness function to guide a search for a set of new CSS values. 
The idea is 100 Page under test (PUT) Oracle 1. Initialization (AP1) 2. Search for candidate fixes (AP2) 3. Search for best combination of candidate fixes (AP2) Potentially fixed page (PUT’) Y N 4. Terminate? (AP4) Fitness function to quantify the amount of visual difference between the PUT and the oracle (AP3) D and L WebSee [ICST’15] Figure 7.1: Overview of theGFix approach for repairing MDDPs and RDPs to use image comparison techniques, such as the one provided by WebSee [103], to measure the amount of distortion via the size of the dierence pixels set. When the number of dierence pixels is zero, the rendering of the web page matches the oracle, implying that a fault has likely been identied and repaired. The second challenge is to nd the repair in a reasonable amount of time. Exploring every possible CSS property is not practical as the CSS properties can range over a large set of possible values, thereby making this a time and resource intensive approach. To address this challenge I have two key insights. The rst key insight is that the visual dierences between the PUT and its appearance oracle can be analyzed to identify a set of relevant CSS properties that are likely to have caused the MRPF in the PUT. Since the relevant CSS properties are a subset of all possible properties, this can help decrease the run time ofGFix. The second key insight is that groups of CSS properties dier in ways that can allow specialized search techniques to be designed to identify repairs in a short amount of time. Formally, MRPFs are caused by one or more root causes. A root cause is a tuple,he, p, vi, where e is a faulty HTML element in the page, p is a CSS property of e, and v is the value of p. Given a set of potential rootCauses,GFix tries to nd a set of xes that resolve the observed MRPFs. A x is dened by the tuple,he, p, v, v 0 i, wherehe, p, vi2 rootCauses and v 0 is a suggested new value for p. At a high level,GFix works by rst identifying a set of possible root causes for the failures detected in the PUT. Then the approach utilizes two forms of guided search to nd the best repair. The rst search examines each root cause in isolation to nd viable candidate xes. The search takes the CSS property of the root cause and nds a new value for it that minimizes the number of dierence pixels between the page and the oracle. The second search then seeks to nd a combination of candidate xes identied in the rst phase that can minimize the number of MRPFs reported in the page. The second search is necessary since not all candidate xes may be required, as the CSS properties involved may have duplicate or competing eects. For instance, the CSS properties margin-top and padding-top may both be identied as root causes for a 101 presentation failure, but can be used to achieve similar outcomes | meaning that only one may actually need to be included in the repair. Conversely, other candidate xes may be required to be used in combination with one another to fully resolve an MRPF. For example, an HTML element may need to be adjusted for both its width and height. Furthermore, candidate xes produced for one MRPF may interfere with those for another, or even introduce additional and unwanted MRPFs. By searching through dierent combinations of candidate xes, the second search aims to produce a suitable subset | a repair | that overall minimizes the number of MRPFs in a page when applied together. 
Figure 7.2: Illustrative example ((a) Oracle (O), (b) Page under test (PUT), (c) Differences between O and the PUT)

Figure 7.2 shows an example web application that I will use to illustrate the GFix approach. Figure 7.2a shows the intended rendering of a web page, which is used as an oracle in the approach. A screenshot of the appearance of the web page under development is shown in Figure 7.2b. By visual inspection, one can determine that there are three presentation failures: (1) the location of the 'Sign in' button has changed, (2) the color of the text in the bottom box is different, and (3) the style of the text, 'Cellphone advertisement', has changed. The three presentation failures are shown in Figure 7.2c by the areas marked A, B, and C, respectively.

7.2.2.1 Overall Algorithm

Algorithm 3 shows the overall algorithm of my approach.

Algorithm 3 Overall Algorithm
Input:  PUT: Web page under test
        O: Oracle for PUT
        D: Function to detect MRPFs in the PUT
        L: Function to localize MRPFs to faulty HTML elements in the PUT
Output: PUT′: Modified PUT with repair applied
 1:  /* Stage 1: Initialization */
 2:  DP ← D(PUT, O)
 3:  E ← L(PUT, DP)
 4:  while true do
 5:    rootCauses ← {}
 6:    for each e ∈ E do
 7:      props ← getCSSProperties(e, DP)
 8:      for each p ∈ props do
 9:        v ← getValue(e, p, PUT)
10:        rootCauses ← rootCauses ∪ {⟨e, p, v⟩}
11:      end for
12:    end for
13:    /* Stage 2: Search for Candidate Fixes */
14:    candidateFixes ← {}
15:    for each ⟨e, p, v⟩ ∈ rootCauses do
16:      if p ∈ SizeAndPositionProperties then
17:        v′ ← sizeAndPositionAnalysis(⟨e, p, v⟩, PUT, O, D)
18:      else if p ∈ ColorProperties then
19:        v′ ← colorAnalysis(⟨e, p, v⟩, PUT, O, D)
20:      else if p ∈ PredefinedValuesProperties then
21:        v′ ← predefinedValuesAnalysis(⟨e, p, v⟩, PUT, O, D)
22:      end if
23:      candidateFixes ← candidateFixes ∪ {⟨e, p, v, v′⟩}
24:    end for
25:    /* Stage 3: Search for Best Combination of Candidate Fixes */
26:    repair ← searchForBestCombination(candidateFixes, PUT, O, D)
27:    /* Stage 4: Check Termination Criteria */
28:    PUT′ ← applyRepair(PUT, repair)
29:    DP′ ← D(PUT′, O)
30:    E′ ← L(PUT′, DP′)
31:    if |DP′| = 0 or (|DP′| = |DP| and E′ = E) then
32:      return PUT′
33:    else
34:      DP ← DP′
35:      E ← E′
36:      PUT ← PUT′
37:    end if
38:  end while

The approach takes four inputs. The first input is the web page under test, PUT. The form of the PUT is a URL that points to either a location on the network or filesystem where all of the HTML, CSS, JavaScript, and media files of the PUT can be accessed. The second input is the oracle (O) that specifies the visual correctness properties of the PUT. The form of O is an image. This oracle could be the design mockup used by the front-end developers or a screenshot of the previously correct version of the PUT. The third input is a function, D, that can detect MRPFs by comparing the visual representations
For example, \suspiciousness" could be computed using statistical based fault localization techniques or a tester could rank elements based on debugging experience. InGFix, I leverage WebSee [103] to implementL, which uses a set of heuristics to prioritize the faulty elements in the PUT . The overall algorithm, shown by Algorithm 3, comprises four stages, as shown by the overview diagram in Figure 7.1. The gure also shows the instantiations of the dierent abstraction points (AP1{4) of the *Fix approach. Stage 1: Initialization The initial part of the algorithm (lines 2{12) involves extracting a list of root causes relevant to the detected MRPFs in PUT . To do this, the approach begins by identifying the areas of visual dierences between O and the rendered appearance of PUT (line 2) by invoking the detection functionD. The approach then obtains a ranked list of potentially faulty HTML elements, E, by invoking the localization functionL (line 3). To detect and localize MRPFs, i.e., forD andL, the approach uses the WebSee tool [103, 104]. Then the approach builds a list of root causes by iterating over each element e2E and identifying CSS properties relevant to the detected MRPF (shown as \getCSSProperties" at line 7). Identifying relevant CSS properties is challenging in the domain of MRPFs since the oracle is available in the form of an image, which lacks the details of the underlying HTML elements and CSS properties that dene the page's appearance. Moreover, the functionsD andL do not provide any descriptors indicating the nature of the MRPFs observed. To address these challenges, I have designed the \getCSSProperties" (line 7) function based on the insight that the visual symptoms of MRPFs on a page can be analyzed to identify the relevant CSS properties. Determining the relevant CSS properties is analogous to diagnosing a sick patient. In this analogy, I can observe visual symptoms | an observable and quantiable feature of the PUT 's appearance, such as color usage frequency, or text size | that can guide GFix in identifying the applicable CSS properties. I describe the process of analyzing visual 104 symptoms to identify relevant CSS properties in detail in Section 7.2.2.2. Each relevant CSS property forms the basis of one root cause. It is added to the running set rootCauses, with the value, v, extracted from the rendered PUT (lines 10). Stage 2: Search for Candidate Fixes This stage produces individual candidate xes for each root cause (lines 14{24), comprising the rst phase search. This stage works on each root cause,he,p,vi, in isolation to nd a new x value for the root cause that is optimized according to a tness function, with the aim of minimizing the visual dierence between PUT andO. The optimal x value, v 0 , is returned as the output of this stage. The design of the search algorithm for this stage is based on the key insight that tailored search algorithms can be designed for groups of CSS properties that have a common visual impact on the appearance of an HTML element. This allowsGFix to eectively and eciently identify candidate xes. Based on this insight, I have developed specialized search functions for the three categories by considering the unique aspects of the categories. The three search functions are described in Sections 7.2.2.3, 7.2.2.4 and 7.2.2.5, respectively. Each of the three search functions take as input the root cause tuple,he,p,vi, the page under test, PUT , the oracle, O, and the detection function,D. 
Stage 2: Search for Candidate Fixes

This stage produces individual candidate fixes for each root cause (lines 14-24), comprising the first search phase. This stage works on each root cause, ⟨e, p, v⟩, in isolation to find a new fix value for the root cause that is optimized according to a fitness function, with the aim of minimizing the visual difference between the PUT and O. The optimal fix value, v′, is returned as the output of this stage. The design of the search algorithm for this stage is based on the key insight that tailored search algorithms can be designed for groups of CSS properties that have a common visual impact on the appearance of an HTML element. This allows GFix to effectively and efficiently identify candidate fixes. Based on this insight, I have developed specialized search functions for three categories of properties by considering the unique aspects of each category. The three search functions are described in Sections 7.2.2.3, 7.2.2.4, and 7.2.2.5, respectively. Each of the three search functions takes as input the root cause tuple, ⟨e, p, v⟩, the page under test, PUT, the oracle, O, and the detection function, D. The search function attempts to find a new candidate fix value, v′, for p (lines 17, 19, and 21). The search functions use the number of difference pixels reported by D as the fitness function to guide the search.

Stage 3: Search for Best Combination of Candidate Fixes

The goal of the second search phase (represented by a call to "searchForBestCombination" on line 26) is to identify a subset of candidateFixes that, when applied together, minimizes the overall number of MRPFs reported for the PUT. This stage takes as input the candidateFixes set produced by stage 2, the PUT, O, and D, and produces a set, repair, which is a subset of candidateFixes. This stage is designed using a biased random search. The search begins by creating a pool of candidate repairs. A candidate repair is generated by adapting the roulette wheel technique using stochastic acceptance [89]. A candidate fix is included in the repair with probability imp_fix / imp_max. Here, imp_fix is the improvement in fitness score when the candidate fix was evaluated in the first search phase (stage 2), and imp_max is the maximum improvement in fitness score observed across all of the candidate fixes. All of the candidate repairs in the pool are evaluated using the function D as the fitness function, and the candidate repair resulting in the lowest fitness score is selected as the best repair, which is returned as the output of the search. A sketch of this candidate repair construction is shown after the description of stage 4.

Prior to making the design choice of using a biased random search for finding the repair, I experimented with a Genetic Algorithm (GA). However, in my experience I found that the GA took longer to converge than the biased random search, and so I used the biased random search in order to decrease the run time of GFix.

Stage 4: Check Termination Criteria

The fourth and last stage determines whether the algorithm should terminate or proceed to another iteration of the search (lines 28-37). Before checking the termination criteria, the approach first applies the repair generated by stage 3 to a copy of the PUT to produce a modified version of the page, PUT′ (line 28). Then the approach obtains a new set of difference pixels, DP′, and a new list of potentially faulty HTML elements, E′, by invoking the detection and localization functions, D and L, on PUT′ and O (lines 29 and 30). The algorithm terminates under two conditions: first, all MRPFs in the page have been resolved, resulting in |DP′| = 0, or, second, no fitness improvement was observed in this iteration of the algorithm compared to the previous iteration, indicating that the approach was potentially only able to partially repair the MRPFs in the PUT. The modified page, PUT′, is returned as the output of the algorithm upon termination (line 32). If the algorithm does not terminate in the iteration, then it implies that the current repair represents an improvement that may be improved further in another iteration of the algorithm.
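The following is a minimal sketch of the stochastic-acceptance construction of a single candidate repair in stage 3. It assumes that each candidate fix records the fitness improvement it achieved during stage 2, and it reuses the hypothetical RepairTuple class from Section 7.1; the names are illustrative and are not taken from GFix.

    import java.util.*;

    // Sketch of stage 3: build one candidate repair by including each
    // candidate fix with probability imp_fix / imp_max (roulette wheel
    // selection via stochastic acceptance).
    public class RepairBuilder {
        public static class CandidateFix {
            final RepairTuple fix;
            final long improvement; // imp_fix: fitness gain measured in stage 2
            CandidateFix(RepairTuple fix, long improvement) {
                this.fix = fix;
                this.improvement = improvement;
            }
        }

        private static final Random RANDOM = new Random();

        public static List<CandidateFix> buildCandidateRepair(List<CandidateFix> candidateFixes) {
            long impMax = 1; // guard against division by zero when no fix improved
            for (CandidateFix c : candidateFixes) {
                impMax = Math.max(impMax, c.improvement);
            }
            List<CandidateFix> repair = new ArrayList<>();
            for (CandidateFix c : candidateFixes) {
                double acceptanceProbability = (double) c.improvement / impMax;
                if (RANDOM.nextDouble() < acceptanceProbability) {
                    repair.add(c);
                }
            }
            return repair;
        }
    }

Calling buildCandidateRepair n times (n = 50 in the evaluation, as described in Section 7.3.1) would yield the pool of candidate repairs, each of which is then evaluated with the fitness function based on D.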
7.2.2.2 CSS Properties from Visual Symptoms of MDDPs and RDPs

Recall that in stage 1 of the approach, the visual symptoms of an MRPF observed on the PUT are used to identify the CSS properties relevant to the MRPF ("getCSSProperties" at line 7 of Algorithm 3). In this section, I discuss the details of this process. The approach of identifying the relevant CSS properties works by analyzing the visual symptoms of a web page using image processing techniques. An MRPF manifests itself as a difference between the appearance of the PUT and O. My insight is that these differences, or visual symptoms, are indications or clues to the underlying faulty CSS properties of the observed failure. For example, consider an MRPF where a login button is expected to be red (as specified in the oracle), but is green in the PUT. In this case, a visual symptom exists, as there is a difference between the set of colors present in the oracle and the set of colors present in the rendering of the PUT. This visual symptom would likely be due to the incorrect setting of the CSS property background-color, which controls the color of the button. Formally, I define a visual symptom as a predicate that can be either true or false in describing a visual difference between the PUT and O.

There are two challenges in identifying symptoms for GFix. First, a symptom must be independent of any particular web page. Otherwise, the usefulness of the symptom will not generalize to other web pages. To deal with this challenge, I define symptoms that are not based on a web page's structure or content, since these can vary significantly among web pages. Second, the set of symptoms must be comprehensive enough to cover all CSS properties and also provide distinguishing power among related properties. This is challenging because there are many CSS properties. Fortunately though, many of these properties have similar visual impacts, which allows the definition of broad-level groupings of symptoms. Fine-grained symptoms within each broad-level grouping further allow accurate distinction between the properties. To identify these groupings, I analyzed the set of visual properties and classified them based on their visual impact on an HTML element. For example, the CSS properties display and visibility can both make an element appear or disappear, so they are grouped in the visual impact category "Visibility". In Table 7.1, under the column labeled "Impact Area", I show the six areas of visual impact that were identified in this process. For each of the visual impact areas, I list the visual properties that were classified into that area (column "CSS Properties"). In the third column ("Symptoms") I then list the set of symptoms I defined that relate to the group. I will now discuss each of the visual impact areas and their corresponding symptoms in more detail; a code sketch showing how this mapping can be consumed follows the list.

Table 7.1: Mapping between symptoms and CSS properties

Impact Area: Color
CSS Properties: color, background-color, border-top-color, border-left-color, border-right-color, border-bottom-color, outline-color
Symptoms: ModifiedColor

Impact Area: Size
CSS Properties: height, width, padding, padding-bottom, padding-top, padding-right, padding-left, border-top-width, border-bottom-width, border-left-width, border-right-width, max-height, min-height, border-spacing
Symptoms: SamePageSize, AllDPInLeftOfElement, AllDPInRightOfElement, AllDPInTopOfElement, AllDPInBottomOfElement

Impact Area: Position
CSS Properties: margin, margin-top, margin-bottom, margin-right, margin-left, position, line-height, vertical-align, bottom, top, right, left, float
Symptoms: ShiftLeftElement, ShiftTopElement, ShiftRightElement, ShiftBottomElement

Impact Area: Visibility
CSS Properties: display, overflow, visibility
Symptoms: VisibleElement

Impact Area: Text appearance
CSS Properties: font-size, direction, font-family, white-space, text-align, letter-spacing, text-decoration, word-spacing, text-indent, font-weight, text-transform, font-variant
Symptoms: TextElement

Impact Area: Decoration style
CSS Properties: outline-style, border-left-style, border-top-style, border-right-style, border-bottom-style
Symptoms: DPOnBorder, DPOutsideBorder, DPInsideBorder
1. Color Symptoms: There are eight CSS properties that can manipulate the color of HTML elements. Developers can use these properties to change the color of text, background, border, or outline. To differentiate these CSS properties from others, the approach analyzes the color differences between the screenshot of the PUT and O. To perform the color analysis, the approach first crops the screenshot of the PUT and O to the area of the HTML element under consideration, which I refer to as PUT_e and O_e.

ModifiedColor: This symptom is true when one or more RGB colors appear in PUT_e that do not appear in O_e. For example, when developers set a wrong color for an HTML element, it is possible that the color is changed to some other color that did not appear in the original web page. This causes a new color to appear in the color set of the web page, which means the symptom will be true. This symptom is also true when there is a mismatch in the frequency of colors in PUT_e and O_e, i.e., the number of pixels with the color under consideration is different. This can happen, for example, if a developer sets the value of the faulty color property to a color already present in the element; the color set of the element will remain the same as O_e, but the frequency will be different. The ModifiedColor symptom is calculated as the symmetric difference between the color histograms of PUT_e and O_e.

2. Size Symptoms: There are 14 CSS properties that are able to change the size of an HTML element. They can either directly change the HTML element by modifying its width or height, or indirectly change the element's size through size-related properties, such as border, margin, or padding. My approach defines the following symptoms for this group.

SamePageSize: This symptom is false when the size of O is not equal to that of the PUT. This indicates that the MRPF changed the size of an HTML element, which then led to a change in the size of the page.

AllDPInTopOfElement: This symptom is true when all of the pixels that are different between the PUT and O are in the top half of an HTML element. This indicates the MRPF may have been caused by CSS properties that refer to the top part of the element. Examples of these are border-top-width and padding-top. These types of faults either directly affect the pixels near the top of the element or indirectly affect them by moving the top either down or up in relation to its intended position. I have defined analogous symptoms AllDPInRightOfElement, AllDPInLeftOfElement, and AllDPInBottomOfElement.

3. Position Symptoms: In this group, there are 12 CSS properties that are able to change the position of an HTML element. The visual impact of these properties is moving the HTML element without changing its appearance. To differentiate this group, the idea is to try to match the appearance of an HTML element in the PUT with a location in O. To do this, the approach first retrieves the position and size of the HTML element in the PUT. Then, the approach generates a screenshot of the PUT and crops it to the area of the HTML element. Lastly, the approach determines if the same cropped part of O matches the cropped image of the HTML element from the screenshot of the PUT. Based on this matching, the approach derives the following symptoms.

ShiftBottomElement: This symptom is true when the HTML element moves downwards in the screenshot of the PUT. When the approach is able to match the element in O, the approach further checks if the element has moved downwards.
This happens for properties such as padding-top and margin-top, which modify the spacing at the top of the element, likely pushing it down. Similarly, the approach defines ShiftLeftElement, ShiftRightElement, and ShiftTopElement.

4. Visibility Symptoms: There are three CSS properties in this group, and they can make an HTML element invisible (i.e., not appear in the PUT). To identify MRPFs that are caused by these CSS properties, the approach checks if the HTML element is shown in the PUT.

VisibleElement: This symptom is true when the HTML element is visible in the PUT. The approach checks that the height and width of the element are greater than zero in the browser rendering.

5. Text Appearance Symptoms: In this group, there are 12 CSS properties related to the appearance of text. They can change the appearance of text in terms of its size, style, etc. To distinguish this group from others, the idea is to extract the text of the HTML element and then further analyze the content of the text using text processing techniques, as follows.

TextElement: This symptom is true when the HTML element in the PUT contains text. Because the 12 CSS properties change the appearance of the text in the HTML element, if the element does not contain any text, none of the 12 CSS properties will have any visual effect on the element, let alone cause an MRPF. Hence, this symptom being true is highly correlated with one of the 12 CSS properties in this group having an effect.

6. Decoration Style Symptoms: In this group, there are six CSS properties. They control the style of the border or outline of the HTML element. One pattern I found for this group is that the visual change is likely to affect a certain area of the HTML element, such as the border or the outline (a line drawn around the HTML element outside of the border). Hence, the approach first calculates the border of the HTML element by retrieving rendering information from the DOM of the PUT. Next, the approach calculates the location of the difference pixels. The approach then checks if the difference pixels are on the border, outside the border, or inside the border of the HTML element.

DPOnBorder: This symptom is true when the difference pixels are located on the border of the HTML element. This indicates that the root causes contain CSS properties related to border-style. Similarly, the approach defines the analogous symptoms DPOutsideBorder and DPInsideBorder.

Table 7.2: Categorization of CSS properties based on visual impact

Category: Size and Position Properties
CSS Properties: height, max-height, max-width, min-height, min-width, width, padding-bottom, padding-top, padding-right, padding-left, border-bottom-width, border-top-width, border-left-width, border-right-width, outline-width, font-size, letter-spacing, line-height, margin-bottom, margin-top, margin-right, margin-left, bottom, top, left, right, z-index, text-indent, word-spacing

Category: Color Properties
CSS Properties: color, background-color, border-top-color, border-left-color, border-right-color, border-bottom-color, outline-color

Category: Predefined Values Properties
CSS Properties: background-attachment, background-repeat, border-bottom-style, border-top-style, border-left-style, border-right-style, outline-style, font-family, font-style, font-variant, font-weight, list-style-position, list-style-type, clear, display, float, overflow, position, visibility, border-collapse, caption-side, table-layout, direction, text-align, text-decoration, text-transform, white-space, vertical-align
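To show how the mapping in Table 7.1 could be consumed by "getCSSProperties", the following is a minimal sketch that returns the candidate CSS properties for a set of observed symptoms. The structure and the (abbreviated) property lists are for illustration only; GFix's actual implementation and the full lists in Table 7.1 are more extensive, and the truth semantics of symptoms such as SamePageSize (which signals a fault when false) are simplified here.

    import java.util.*;

    // Sketch: map each visual symptom observed on the PUT to the CSS
    // properties of its impact area (Table 7.1, abbreviated). The union of
    // these properties becomes the relevant-property set used to form
    // root causes in stage 1.
    public class SymptomMapping {
        public enum Symptom { MODIFIED_COLOR, SAME_PAGE_SIZE, SHIFT_BOTTOM_ELEMENT,
                              VISIBLE_ELEMENT, TEXT_ELEMENT, DP_ON_BORDER }

        private static final Map<Symptom, List<String>> MAPPING = new EnumMap<>(Symptom.class);
        static {
            MAPPING.put(Symptom.MODIFIED_COLOR,
                    List.of("color", "background-color", "border-top-color", "outline-color"));
            MAPPING.put(Symptom.SAME_PAGE_SIZE,
                    List.of("height", "width", "padding-top", "border-top-width"));
            MAPPING.put(Symptom.SHIFT_BOTTOM_ELEMENT,
                    List.of("margin-top", "padding-top", "top", "position"));
            MAPPING.put(Symptom.VISIBLE_ELEMENT,
                    List.of("display", "overflow", "visibility"));
            MAPPING.put(Symptom.TEXT_ELEMENT,
                    List.of("font-size", "font-family", "letter-spacing", "text-align"));
            MAPPING.put(Symptom.DP_ON_BORDER,
                    List.of("border-top-style", "border-bottom-style", "outline-style"));
        }

        public static Set<String> relevantProperties(Set<Symptom> observedSymptoms) {
            Set<String> properties = new LinkedHashSet<>();
            for (Symptom s : observedSymptoms) {
                properties.addAll(MAPPING.getOrDefault(s, List.of()));
            }
            return properties;
        }
    }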
7.2.2.3 Size and Position Analysis

As discussed in Section 7.2.2.1, I designed a specialized search technique for finding candidate fixes for a group of CSS properties, including margin, padding, height, and width, that affect the size and position of elements ("sizeAndPositionAnalysis" at line 17 of Algorithm 3). A complete list of the CSS properties in SizeAndPositionProperties can be found in Table 7.2. I now explain the search technique for handling the size and position properties.

The goal of this analysis is to process the root cause, ⟨e, p, v⟩, for p ∈ SizeAndPositionProperties, to find the fix value v′. The key insight that motivates the design of the search technique for this analysis is that changes in the size and position of an HTML element are directly proportional to the number of difference pixels, which is used as the fitness score to guide the search. The number of difference pixels is reduced as the chosen value gets closer to the correct value, since the HTML element in the rendering of the PUT starts to overlap with that in O. On the other hand, the number of difference pixels increases or stays the same as the overlap decreases, when the chosen value moves away from the correct value.

The search technique for the size and position analysis is inspired by the Alternating Variable Method (AVM) [82, 84]. The AVM is a local search strategy that starts at an initial point in the solution space and then tries to optimize each input variable, one at a time. In the context of GFix, the input variable is the value v of the root cause that needs to be optimized, and the initial point is a value v_i, calculated based on a translation heuristic discussed below. The AVM technique first tries to establish a direction of search, and then rapidly explores the space in that direction to find the optimal value. To establish a direction of search, the AVM technique performs exploratory moves, which add small delta values (e.g., -1 and +1) to v, and observes the impact on the fitness score. If the fitness is observed to improve, indicating that a potential direction for further improving the input variable has been found, the AVM switches to performing pattern moves. A pattern move adds values to v through step sizes that increase exponentially. The pattern move continues with this exploration as long as the newly computed fitness score shows improvement over the previous fitness score, with the improvement implying that the search is progressing in the correct direction. If a pattern move fails to improve fitness, the AVM switches back to exploratory moves to explore a new direction from the current point in the search. If the exploratory moves fail to improve fitness, the current point in the search is returned as the best candidate fix value, v′, to the main algorithm (line 17 of Algorithm 3).
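The following is a minimal sketch of this AVM-style search over a single numeric CSS value. It assumes a fitness operator that applies the candidate value to the PUT, re-renders the page, and returns the number of difference pixels reported by D; the names are illustrative.

    import java.util.function.LongUnaryOperator;

    // Sketch of the AVM-style search for one size/position property. The
    // fitness operator applies the candidate value to the PUT, re-renders it,
    // and returns the number of difference pixels against the oracle.
    public class AvmSearch {
        public static long search(long initialValue, LongUnaryOperator fitness) {
            long current = initialValue;
            long best = fitness.applyAsLong(current);
            while (true) {
                // Exploratory moves: probe current - 1 and current + 1.
                long direction = 0;
                if (fitness.applyAsLong(current - 1) < best) direction = -1;
                else if (fitness.applyAsLong(current + 1) < best) direction = 1;
                if (direction == 0) return current; // no improving direction: done
                // Pattern moves: exponentially increasing steps in that direction.
                long step = 1;
                while (true) {
                    long next = current + direction * step;
                    long f = fitness.applyAsLong(next);
                    if (f < best) {
                        best = f;
                        current = next;
                        step *= 2;
                    } else {
                        break; // overshot the optimum: return to exploratory moves
                    }
                }
            }
        }
    }

In GFix's setting, initialValue would be the v_i = v + T produced by the translation heuristic described next, so that the search typically starts close to the correct value.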
Translation Heuristic: The translation heuristic is used to identify an initial value for the AVM search that is potentially close to the expected correct value, to facilitate quick convergence. At a high level, the initial value is calculated based on the translation that the elements in the PUT have undergone from their expected locations in O. The translation heuristic is based on the observation that a change in the position or size of a rendered HTML element can be seen as a translation (displacement) of pixels between the two images, O and the screenshot of the PUT. The translation heuristic computes the amount of translation that the difference pixels have undergone by taking relative complements of the two images. In set theory, the relative complement of a set A with respect to a set B is given by the elements in B but not in A (B \ A = {x ∈ B | x ∉ A}). Thus, taking the relative complement of the pixels in the PUT screenshot with respect to the pixels in O, we get a set DP_o of difference pixels in O but not in the PUT. Similarly, taking the relative complement of the pixels in O with respect to the PUT gives a set DP_t of difference pixels in the PUT but not in O. The translation value, T, is calculated as the average of the differences between corresponding pixel pairs' (x, y) values in the two complement sets, DP_o and DP_t. The initial value v_i for the AVM search is then set to v + T, where v is the original value of the property for the element e.

To illustrate the working of the size and position analysis, consider the example presented earlier (Figure 7.2). The area A marked in Figure 7.2c shows the positional shift of the "Sign in" button from the left (as can be seen in O, Figure 7.2a) to the right (as can be seen in the PUT screenshot, Figure 7.2b). Consider the faulty element identified for this visual difference to be <div style="padding-left: 75px;">. The correct value for the padding-left property is 5px. Let us assume the number of difference pixels in area A is 5,000. Say the position of the top-left corner of the "Sign in" button in O, with 5px padding, is (100, 100). With a padding-left of 75px, the new position of the top-left corner is (170, 100). These are two corresponding difference pixels present in the relative complement sets, DP_o and DP_t, respectively. Taking the difference between the two pixels' values gives a translation of 70px to the right, i.e., a correction of -70px. Similarly, the translation values of all the difference pixels of the button are computed, and the average translation value is obtained as -68px. With this value, the AVM search is initialized to 7px (v + T = 75 - 68), which gives a fitness score of 300. Trying a new padding-left value of 8px increases the fitness score to 450, indicating that the search is moving away from the expected fix value, as 450 is greater than 300. Then trying the value 6px reduces the fitness score to 240, indicating that the search is moving in the correct direction. Eventually, the search explores the value 5px, which reduces the number of difference pixels in area A to 0. Thus, the search terminates and returns 5px as the candidate fix value, v′, for area A.

7.2.2.4 Color Analysis

The color analysis uses a specialized search strategy for finding candidate fixes for the color-related CSS properties, such as the text color, background-color, and border-color ("colorAnalysis" at line 19 of Algorithm 3). A complete list of the CSS properties in the ColorProperties category can be found in Table 7.2. In this section, I discuss the search technique I designed for handling these properties.

The color analysis processes the root cause, ⟨e, p, v⟩, if p ∈ ColorProperties, to find the candidate fix value v′. It is challenging to identify the correct color value since the color-related CSS properties can take any one of 16 million colors, ranging from #000000 to #FFFFFF. To address this challenge, my insight is that the expected color can be identified by extracting the color in the area of e from the oracle image, O. However, identifying the expected color is complicated by the fact that in practice multiple colors are likely to be retrieved from O due to the process of anti-aliasing.
7.2.2.4 Color Analysis

The color analysis proposes a specialized search strategy for finding candidate fixes for the color-related CSS properties, such as text color, background-color, and border-color ("colorAnalysis" at line 19 of Algorithm 3). A complete list of the CSS properties in the ColorProperties category can be found in Table 7.2. In this section, I discuss the search technique I designed for handling these properties.

The color analysis processes the root cause, ⟨e, p, v⟩, if p ∈ ColorProperties, to find the candidate fix value v′. It is challenging to identify the correct color value, since a color-related CSS property can take any one of 16 million colors, ranging from #000000 to #FFFFFF. To address this challenge, my insight is that the expected color can be identified by extracting the color in the area of e from the oracle image, O. However, identifying the expected color is compounded by the fact that, in practice, multiple colors are likely to be retrieved from O due to the process of anti-aliasing. Anti-aliasing is used to smooth out the jagged edges of curves in text and shapes by adding gradient colors between the curve line and the background. An example of this is shown in Figure 7.3, where shades of pink are added to smooth the red text into the white background.

[Figure 7.3: Anti-aliasing example]

To find the expected color value in spite of anti-aliasing, my insight is that the color occurring with the highest frequency in the area of e in O is likely to be the expected correct color, with the remaining colors likely reported due to anti-aliasing. Based on this insight, GFix first builds a color histogram, H_o, for O by extracting the color value of every pixel in the area of e. Similarly, a color histogram, H_t, is built for e by extracting the color values from the screenshot of the PUT. GFix then calculates the relative complement of H_t in H_o and stores all of the unmatched colors, i.e., colors that are present in O but not in the PUT, in a new histogram, H. The histogram H is then sorted in descending order of frequency. This ranked list of colors is searched systematically to find the fix value, v′, that reports the minimum fitness score and is returned to the main algorithm (line 19 of Algorithm 3).

Consider the running example for understanding the working of the color analysis. The difference area B in Figure 7.2c is processed for color. The oracle (Figure 7.2a) requires the text color in that region to be red (#FF0000), but the PUT (Figure 7.2b) has the text color as black (#000000). The approach first builds the color histograms H_o = {⟨#FFAAFF, 1000⟩, ⟨#FF55AA, 500⟩, ..., ⟨#FF0000, 5500⟩} and H_t = {⟨#566573, 1000⟩, ⟨#D6DBDF, 500⟩, ..., ⟨#000000, 5500⟩}. The relative complement histogram H, after sorting, is obtained as {⟨#FF0000, 5500⟩, ⟨#FFAAFF, 1000⟩, ..., ⟨#FF55AA, 500⟩}. The first value from the ranked list chosen by the approach, #FF0000, reduces the number of difference pixels in area B to 0, indicating that the correct candidate fix value has been found.
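The histogram complement and ranking step can be sketched as follows; packed RGB integers stand in for colors, and the names are illustrative rather than taken from the GFix code, which performs the pixel extraction with OpenCV.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ColorAnalysisSketch {
        // Builds the relative complement of H_t in H_o and returns the
        // candidate colors ranked by descending frequency in the oracle.
        // Each histogram maps a packed RGB color to its pixel count
        // within the element's area.
        static List<Integer> rankedCandidates(Map<Integer, Integer> hOracle,
                                              Map<Integer, Integer> hTest) {
            Map<Integer, Integer> complement = new HashMap<>();
            for (Map.Entry<Integer, Integer> entry : hOracle.entrySet()) {
                if (!hTest.containsKey(entry.getKey())) { // in O, not in the PUT
                    complement.put(entry.getKey(), entry.getValue());
                }
            }
            List<Integer> ranked = new ArrayList<>(complement.keySet());
            // Most frequent oracle-only color first.
            ranked.sort((a, b) -> complement.get(b) - complement.get(a));
            return ranked; // tried in order until the fitness reaches its minimum
        }
    }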
7.2.2.5 Predefined Values Analysis

The predefined values analysis finds candidate fixes for the CSS properties in the category PredefinedValuesProperties (Table 7.2), which includes properties such as font-style, display, and font-family ("predefinedValuesAnalysis" at line 21 of Algorithm 3). The CSS properties in this category can take a value from a set of discrete predefined values. For example, font-style can take a value from the predefined set {italic, oblique, normal}. A guided search cannot be used for selecting a candidate value in this analysis, as there is no clear relationship between the different discrete values in the set. Therefore, I process this category using an exhaustive exploration of all of the predefined values for a given CSS property. In general, exploring all possible values can be expensive; however, that expense is mitigated by the fact that the W3C defines only small sets of possible values for the CSS properties in this category. The average size of these sets is 5, with the largest set containing only 21 elements. The predefined value with the best fitness score from the exhaustive search is returned as the candidate fix value, v′ (line 21 of Algorithm 3).

Consider the running example for understanding the working of the predefined values analysis. The difference area C in Figure 7.2c is processed for predefined values. The oracle (Figure 7.2a) expects the font-style to be normal, but the test web page (Figure 7.2b) has the font-style as italic. Consider the faulty element identified for this visual difference to be ⟨span style="font-style: italic;"⟩. GFix explores all the values from the predefined set {italic, oblique, normal} for the font-style property, and eventually observes that, with the value normal, the difference pixels in area C are reduced to 0. This indicates that the correct candidate fix value has been identified.
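Since this analysis is a plain exhaustive search, it can be sketched in a few lines; as before, the fitness function that applies a value and counts difference pixels is abstracted away, and the names are illustrative assumptions.

    import java.util.function.ToDoubleFunction;

    public class PredefinedValuesSketch {
        // Tries every predefined value for a CSS property and keeps the
        // one with the lowest fitness (fewest difference pixels).
        static String bestValue(String[] predefined,
                                ToDoubleFunction<String> fitness) {
            String best = null;
            double bestScore = Double.MAX_VALUE;
            for (String value : predefined) {
                double score = fitness.applyAsDouble(value);
                if (score < bestScore) {
                    bestScore = score;
                    best = value;
                }
            }
            return best; // the candidate fix value v'
        }
    }

For the running example, this would amount to calling bestValue(new String[] {"italic", "oblique", "normal"}, fitness) and obtaining "normal".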
7.3 Evaluation

I conducted an empirical evaluation to determine the accuracy and analysis time of GFix for MDDPs and RDPs in web applications. The specific research questions that were considered are as follows:

RQ1: How effective is GFix in reducing MDDPs and RDPs in web pages?
RQ2: How long does GFix take to find repairs?
RQ3: What is the quality of the fixes generated by GFix?

Table 7.3: Subjects used in the evaluation of GFix

  Name              URL                                  #HTML    #CSS   #T
  RDPs
  Perl              http://dbi.perl.org                    269   1,771   30
  GTK               http://www.gtk.org                      75   1,334   28
  Konqueror         http://konqueror.org                   247   7,240   24
  Amulet            http://www.cs.cmu.edu/~amulet           98      94   21
  UCF               http://www.ucf.edu                     315   2,414   29
  MDDPs
  Gmail             http://www.gmail.com                    75     310   32
  USC CS Research   https://tinyurl.com/y9me423d           771   5,107   32
  Craigslist        http://losangeles.craigslist.org     1,145  31,356   28
  Java Tutorial     https://tinyurl.com/y8nrlwum           160     308   31
  Virgin America    http://www.virginamerica.com           396   3,730   25

7.3.1 Implementation

For the purpose of the evaluation, I implemented the approach in a prototype tool, GFix, in Java. To compute the different visual symptoms listed in Table 7.1, GFix uses Selenium WebDriver for making dynamic changes to web pages, such as applying candidate fix values, taking screenshots of the web pages, and extracting the bounding rectangle information of the HTML elements. jStyleParser is used to extract the different CSS properties defined for an HTML element. GFix uses OpenCV to compare the screenshots, extract color information, crop screenshots, and perform sub-image searching. For stage 3 of the approach, the number of candidate repairs, n, is set to 50. For the input functions D and L, I used the WebSee tool [103, 104], which provides as output a set of difference pixels identified by perceptual image differencing and a ranked list of suspicious HTML elements. I parallelized the search process (stages 2 and 3) using four threads on a 4th Generation Intel Core i7-4770 processor with 32GB RAM. All of the experiments were run on an Ubuntu 14.04 platform. For rendering the subject web pages, I used Mozilla Firefox v46.0.01 with the browser window maximized to the screen size.

7.3.2 Subjects

Table 7.3 shows the ten subject web pages selected for the evaluation of GFix, five for each of the two types of presentation failures, i.e., RDPs and MDDPs. The columns labeled "#HTML" and "#CSS" report the total number of HTML elements present in the DOM tree of a subject and the total number of CSS properties defined for the HTML elements in the page, respectively. These metrics of size give an estimate of a page's complexity for debugging and finding potential fixes for the observed presentation failures.

7.3.2.1 Selection of subjects for RDPs

In the evaluation, I focused on one scenario in which regression debugging might be needed: refactoring web pages. The goal of web page refactoring is to change the structure or HTML code of a page without altering its visual appearance. I chose refactoring because it is a topic that has received a lot of attention in the research and development community (e.g., [81]), and as a result there are objective and independently defined processes for how to refactor for certain objectives. Based on a literature review, I chose to focus on three refactoring techniques that were widely mentioned and had clearly defined processes. The first refactoring is to migrate HTML 4 to HTML 5, for example, converting <div id="header"> into <header>. The W3C provides official guidelines for converting a typical HTML 4 web page into an HTML 5 page [32]. The second refactoring is to convert a table-based layout to a div-based layout. Related work [108] provides a set of best practices to be used in this conversion that will maintain the original appearance of the page. The third refactoring is to replace deprecated tags. For example, <font> is deprecated in favor of using the CSS font property. Here again, the W3C provides a complete list of the deprecated HTML tags [31].

My choice of refactorings guided the selection of subject pages. The initial starting point was to gather URLs from an online random URL generator (http://www.uroulette.com). I then filtered this set to include only web pages to which all three refactorings could apply. The resulting set of subjects is shown in Table 7.3. For each of these subjects, I created a refactored version with all three refactorings applied. I performed the refactorings manually, but followed the instructions provided for each to make the process objective and repeatable. To confirm that the visual appearance was unchanged, I used an image differencing tool to ensure that there were zero difference pixels between the original and refactored versions. On average, refactoring and verifying each subject took 5-10 hours, depending on its complexity.

7.3.2.2 Selection of subjects for MDDPs

For the experiments, I utilized the five subject web pages shown in Table 7.3 for repairing MDDPs. I chose these web pages because they represent a mix of different implementation technologies and layouts that are widely used across web applications. In particular, I chose the set of test subjects to include web pages that were defined by statically generated HTML, CSS, and JavaScript, and pages defined by dynamically generated HTML.

7.3.2.3 Test case generation

To create test cases for the evaluation, I used a random-seeding-based approach that inserted presentation failures into the subject web pages. I used this approach because I was not able to obtain a sufficiently large set of mockups and refactoring errors in real-world web applications to ensure that GFix could be validated against a wide range of MDDPs and RDPs. To create the test cases, I seeded presentation failures into each subject web page. Note that for RDPs, the seeding was performed on the refactored version of each subject. I used the following process for each subject page p: (1) download p and all files required to display it; (2) take an image capture of p to serve as the oracle O; and (3) create a set P′ that contains variants of p, each variant created by randomly seeding a unique presentation fault. To identify the types of faults to seed, I first manually analyzed the W3C HTML and CSS specifications to identify visual properties, i.e., HTML attributes or CSS properties that could change the visual appearance of an element. I seeded faults by changing the original value of each unique visual property present in p. I then eliminated any variant of p whose seeded presentation fault did not actually produce a presentation failure. To identify these, I computed the set of pixel differences between the rendering of p and O, and only included the variant if this set's size was non-zero.
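As an illustration only, this filtering step can be expressed as a short loop; the diffPixels function, which would render a variant and compare it against O, and all other names here are assumptions rather than the actual experimental harness.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.ToIntFunction;

    public class TestCaseFilterSketch {
        // Keeps only the seeded variants that manifest a visible failure,
        // i.e., whose rendering differs from the oracle O by at least one
        // pixel. diffPixels abstracts the render-and-compare step.
        static List<String> filter(List<String> variants,
                                   ToIntFunction<String> diffPixels) {
            List<String> kept = new ArrayList<>();
            for (String variant : variants) {
                if (diffPixels.applyAsInt(variant) > 0) {
                    kept.add(variant);
                }
            }
            return kept;
        }
    }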
The visual impact of a seeded fault varied, making the test cases vary in complexity for GFix. In some cases, the seeded fault caused almost all of a page to be shown as having a pixel-level difference, for example, changing the value of the padding CSS property in the ⟨body⟩ tag. In other cases, the seeded fault impacted only a small area (e.g., changing the text color of a ⟨span⟩ tag). Each test case comprised an appearance oracle (O) and the page with a seeded fault (a variant in P′). The number of test cases generated for each subject is shown under the column "#T" in Table 7.3.

7.3.3 Experiment One

To address RQ1 and RQ2, I ran GFix five times on each of the subjects to mitigate the non-determinism in the search. For RQ1, I used WebSee to determine the initial number of MDDPs and RDPs, represented as visual difference pixels, in a test case and the average number of difference pixels remaining after each of the five runs of GFix. From these numbers I calculated the reduction of MDDPs and RDPs as a percentage. For RQ2, I collected the median total running times of GFix and the median time required for the three types of search analyses, namely, the size and position analysis, the color analysis, and the predefined values analysis.

Table 7.4: Effectiveness of GFix in reducing MDDPs and RDPs

  Subject           Avg. #Before   Avg. #After   Avg. Reduction (%)
  Perl                    16,371           124                   95
  GTK                    101,700             0                  100
  Konqueror               19,004        11,010                   90
  Amulet                  25,626             2                   97
  UCF                     22,456         3,352                   86
  Gmail                    6,204           134                   95
  USC CS Research          4,454           101                   91
  Craigslist               3,157            71                   92
  Java Tutorial           15,821            35                   94
  Virgin America           2,838             0                  100
  Average                 21,763         1,483                   94

Table 7.5: GFix's median run time in seconds

  Subject           Size and Position   Color      Predefined Values   Total
                    Analysis            Analysis   Analysis
  Perl               4                   1            3                  376
  GTK                3                   1           12                1,972
  Konqueror          5                   2           33                1,355
  Amulet             6                   5          119                  721
  UCF               12                  12          209                3,320
  Gmail              6                   4           13                  314
  USC CS Research    4                   2           17                1,084
  Craigslist         3                   3            1                  696
  Java Tutorial      5                   3            7                1,002
  Virgin America     2                   1            3                  306
  Median             4                   2            9                  861

7.3.3.1 Presentation of Results

Table 7.4 shows the results for RQ1. The average initial number of MDDPs and RDPs across the test cases of a subject is shown under the column "#Before". The column "#After" shows the average number of MDDPs and RDPs remaining after the five runs of each of the test cases of a subject. The average percentage reduction in the number of MDDPs and RDPs across the test cases of a subject is shown under the column "Reduction". Table 7.5 shows the results for RQ2. The columns show the median running time per root cause for the different analyses and the total running time across the test cases of each subject.

7.3.3.2 Discussion of Results

The results show that GFix reported an average 94% reduction in MDDPs and RDPs. This shows that GFix was effective in finding fixes for MDDPs and RDPs. Of the total 280 test cases across all subjects, GFix was able to reduce the number of difference pixels to zero, i.e., resolve all of the reported MDDPs and RDPs, for 89% of the test cases. I investigated the results to understand why GFix was not able to find suitable fixes for all of the MDDPs and RDPs. I found two dominant reasons for this.
First, WebSee's perceptual image differencing, used in the fitness function, was not sensitive to the small delta changes tried by the AVM search in the size and position analysis. This behavior was particularly observed if the test case had a significant amount of distortion seeded in it, i.e., if the test case had a high number of pixel-level differences with its oracle. In such cases, no improvement in the fitness score was observed, leaving the presentation failure unresolved. Second, the faulty CSS property that needed to be adjusted in order to repair the observed failure was sometimes not included in the set of relevant CSS properties identified by the visual symptoms.

The total running time of GFix ranged from five minutes to 55 minutes, with a median of 14 minutes. As the per-analysis results show, the predefined values analysis was the most time consuming, followed by the size and position analysis and the color analysis. This is an expected result, since the predefined values analysis performs an exhaustive search over the set of predefined values. The total time of GFix was dependent on the size of the page and the number of potentially faulty HTML elements reported by WebSee. Despite the use of four threads for parallelization, the runtime for some subjects is lengthy. This could be further improved by using cloud instances for additional parallelization, as has been achieved in related work [85].

7.3.4 Experiment Two

To address RQ3, I conducted a user study to understand the visual quality of GFix's suggested fixes from a human perspective. The general format of the survey was to present, in random order, an MDDP (or RDP) containing a UI snippet from a subject web page before and after repair. The participants were then asked to compare the appearance similarity of the two UI snippets with the corresponding UI snippet from the oracle. Each UI snippet showing a failure was captured in the context of its surrounding region to allow participants to view the failure from a broader perspective. Examples of UI snippets are shown in Figure 7.4. To select the "after" version of a test case to display the best repair identified by GFix, I used the run with the best fitness score across the five runs of GFix in Experiment One.

[Figure 7.4: Examples of UI snippets used in GFix's user study. (a) UI snippet from GTK before repair; (b) UI snippet from GTK after repair; (c) UI snippet from Java Tutorial before repair; (d) UI snippet from Java Tutorial after repair.]

To make the survey length manageable for the participants, I selected a random test case from each subject and presented it in the survey, thus making the length of each survey 10 questions, one question for each subject. For each question, the survey presented, in random order, the UI snippets of the before and after versions of the selected test case and a corresponding reference UI snippet from the oracle. I conducted a total of five surveys.

I used the Amazon Mechanical Turk (AMT) service to conduct the surveys. AMT allows users (requesters) to anonymously post jobs, which it then matches to anonymous users (workers) who are willing to complete those tasks to earn money. To avoid workers who had a track record of haphazardly completing tasks, I only allowed workers that had high approval ratings for their previously completed tasks (over 95%) and had completed more than 5,000 approved tasks to complete the survey. In general, this is considered fairly selective criteria for participant selection on AMT.
Ten anonymous participants undertook each survey, giving a total of 50 completed surveys and 500 data points. Each participant was paid $0.20 for completing a survey.

[Figure 7.5: GFix's user study results. Panels (a)-(e) show the results from surveys 1-5, and panel (f) shows the aggregated results; each panel plots, per subject, the number of participants who found the before or after version more similar to the reference.]

7.3.4.1 Presentation of Results

The results for the appearance similarity preference given by the participants for each survey are shown in Figures 7.5a-7.5e. The aggregated results from all five surveys are shown in Figure 7.5f. In each figure, the number of participants is shown on the Y-axis and the subjects are shown on the X-axis. The light gray bars indicate the participants reporting the after-repair version as more similar to the reference than the before version. Similarly, the dark gray bars show the preference for the before-repair version as more similar to the reference than the after version.

7.3.4.2 Discussion of Results

Based on the analysis of the results of the five surveys, I found that, overall, 79% of the participants reported the repaired (after) versions of the subjects as more similar to the reference than the unrepaired (before) versions. This indicates that GFix generates repairs for MDDPs and RDPs that make the pages aesthetically similar to their oracles from a human perspective. Interestingly, out of the 50 test cases presented in the surveys, 10 test cases received responses in favor of the after version from all of the participants, and in 30 test cases the after version was rated as more similar to the reference by at least eight participants.

As can be seen from Figure 7.5, all of the test cases except two ("ucf" from survey 1 and "konqueror" from survey 4) showed a majority agreement among the participants. I investigated these two test cases to understand the reason for the high discordance among the participants. The dominant reason I found was that both test cases had failures that impacted only a small area of the page, likely causing both the before and after versions to appear the same. The participants could easily have missed noticing such small failures, resulting in mixed preferences and a lack of majority agreement.
A similar finding was observed for the test case "usc cs research" from survey 3, which in fact is the only test case for which a majority of the participants favored the faulty version over the repaired version. Although the failure in this test case was resolved by GFix, it was perhaps not very noticeable to the participants.

7.3.5 Threats to Validity

In this section I discuss threats to the validity of my conclusions about the results of GFix. A possible threat is that I used a fault-seeding mechanism for generating the test cases for RQ1 and RQ2 instead of real-world faults. This was unavoidable, as I did not have access to a large enough source of web pages with mockups, multiple versions, and refactorings. However, three factors minimize this threat. First, the fault-seeding mechanism was based on mutation testing techniques, which have been shown to produce useful and representative test suites. Second, my prior work has shown that detection and localization results achieved on artificially generated test suites are similar to those achieved when using real mockups [103]. Third, I used a systematic seeding process based on the actual visual properties present in a page to guide the seeding. This was done to reduce unintentional bias in the definition of the test cases.

The second threat is that I controlled the changes introduced for simulating regression debugging for RDPs. To mitigate this, I chose to carry out HTML page refactoring, because there were externally defined standards for how to perform the three different types of refactorings, and I could follow these approaches to minimize the unintentional introduction of bias.

7.4 Conclusion

In this chapter, I presented my novel search-based approach, GFix, for the automated repair of MDDPs and RDPs in web applications. The approach uses two phases of guided search to find the repair. The first phase uses the AVM search to identify candidate fixes, while the second phase uses a biased random search to find a subset of the candidate fixes that produces an overall best repair. The fitness function used to guide both of the search algorithms employs computer vision techniques to quantify the deviation between the appearance of a rendered page and its intended appearance, with a minimization goal. In the evaluation, GFix was able to reduce, on average, 94% of the presentation failures reported by WebSee. In a user study assessing the visual similarity of the pages, the participants overwhelmingly reported the repaired versions as more similar to the reference than the original (faulty) versions. Overall, these are positive results and support the hypothesis of my dissertation by showing that this approach using search-based techniques can be highly effective in resolving MDDPs and RDPs in web pages.

Chapter 8

Related Work

In this chapter, I discuss related work for the repair of presentation failures in web applications. I also discuss different approaches in the literature that are related to the four techniques, XFix, MFix, IFix, and GFix.

8.1 Automated repair of software systems

Automated repair of software systems has been an area of active research. Many different approaches have been proposed that focus on repairing different aspects of software systems; however, none of them are capable of repairing presentation failures in web applications. These approaches can be broadly divided into two categories, as described below.
Techniques for repairing faults in software programs

There has been extensive work on the automated repair of software programs. Several techniques that use search-based algorithms have been proposed. Two examples include GenProg [159, 85], which uses genetic programming to find viable repairs for C programs, and SPR [91], which uses a staged repair strategy to search through a large space of candidate fixes. Alternative analytical approaches also exist, including FixWizard [127], which analyzes bug fixes in a piece of code and suggests fixes to similar parts of the code base, and FlowFixer [167], which repairs sequences of Graphical User Interface (GUI) interactions in modified test scripts for Java programs. However, these techniques are not capable of repairing presentation failures (e.g., Cross Browser Issues (XBIs), Mobile Friendly Problems (MFPs), Internationalization Presentation Failures (IPFs), Mockup-driven Development Problems (MDDPs), and Regression Debugging Problems (RDPs)) in web applications, because they are structured to work for general-purpose programming languages, such as Java and C.

Techniques for repairing faults in web applications

Recently, techniques to repair different types of faults in web applications have been proposed. These techniques deal with specific components of the client side of web applications and as such are not meant for repairing presentation failures. PhpRepair [142] and PhpSync [126] focus on repairing problems arising from malformed HTML. Although these techniques can help resolve a certain class of presentation failures, i.e., problems caused by HTML syntax errors, they cannot repair presentation failures such as the ones discussed in Chapter 2, as those are not caused by malformed HTML. Another technique [158] assumes that an HTML/CSS fix has been found and focuses on propagating it to the server side using hybrid analysis. Cassius [131] proposes a framework for repairing faulty CSS in web applications by assuming the availability of a set of faulty source lines in CSS files and HTML page layout examples that the technique can use to synthesize a repair. This technique uses the CSS from the page layout examples as the oracle to identify the fix values for the faulty CSS. In general, HTML page layout examples may not be available; for example, the intended layout of a page is available only in the form of an image in the case of MDDPs and RDPs. Also, in the case of XBIs, MFPs, and IPFs, the oracle and the page under test (PUT) use the same CSS files. Therefore this technique is generally not applicable for repairing the types of presentation failures described in Chapter 2. Another technique, ARROW [156], uses static analysis to repair client-side race conditions that are caused by the concurrent and asynchronous rendering of JavaScript, CSS, and images during a page load. This technique corrects the ordering of JavaScript in web pages to avoid the race conditions and can thereby repair presentation failures caused by such event races, but not XBIs, MFPs, IPFs, MDDPs, and RDPs.

8.2 Cross Browser Issues (XBIs)

Cross Browser Testing (XBT) techniques, such as X-PERT [137, 140, 59, 141], CrossT [118], Browserbite [144], Browsera [12], WebMate [62, 63], and work by Eaton and Memon [66], are effective in detecting XBIs. However, repairing the reported XBIs when using these techniques must still be performed manually.
Crossfire [61] presents a protocol for XBI debugging by extending browser developer tools, such as Firefox's Firebug, to enable cross-browser support. However, the task of using the debugger to find potential fixes is developer-driven. Another technique, FMAP [138], analyzes the traces of client-server communication of a web application on different platforms (e.g., desktop and mobile) to detect unmatched functional and behavioral features. Thus, the problem addressed by FMAP is fundamentally different from that of XFix, as the User Interface (UI) of a web application is expected to be substantially dissimilar on different platforms.

A common resource used by web developers to reduce possible XBIs is simple CSS resetting techniques, such as Normalize CSS [72] and YUI 3 CSS Reset [8]. Such techniques establish a consistent CSS baseline across different browsers to minimize the browser differences that can lead to XBIs. For example, Chrome defines the default width of an input/text box as 155px, while Firefox defines it as 184px, which can likely lead to an XBI. The CSS reset techniques overwrite such browser-specific default CSS values with standard values (e.g., width = 160px) defined in the CSS reset files to potentially reduce cross-browser differences. Such techniques, however, cannot handle complex XBIs that are application dependent, are caused by unsupported CSS properties/values, or are caused by complex interaction between HTML and CSS. For example, Internet Explorer (IE) does not support the value initial for the CSS property left, leading to unpredictable behavior of the HTML elements in the PUT containing this property-value pair. This can likely introduce layout XBIs in the PUT, as the positioning of such elements is application dependent, i.e., based on the styling and layout of their surrounding elements, requiring such XBIs to be resolved on a case-by-case basis.

8.3 Mobile Friendly Problems (MFPs)

There are approaches that circumvent MFPs by presenting alternative versions of a desktop website, rather than repairing MFPs in the website. For example, commercial services such as bMobilized [14], WompMobile [45], Mobilifyit [36], Duda [20], and Mobify [34] can convert a given desktop website to a mobile friendly version using pre-designed templates. Although helpful, these solutions are not appropriate in all situations. First, the templates are unlikely to capture the carefully crafted layout and graphics designed for the desktop versions, possibly undermining the branding efforts a company is trying to achieve. My approach, MFix, avoids these limitations by maintaining a close similarity to the original version. Second, the output represents a separate mobile friendly website with a new URL, requiring the development team to maintain two websites. In contrast, the MFix approach generates a CSS media query patch that is added to the existing CSS of the original website and that is only triggered if the page is requested from a device with a smaller screen size. Alternatively, modern browsers, such as Chrome [17], Safari [40], and Firefox [23], provide a "reader" mode intended for easy, clutter-free viewing of web pages on mobile devices by presenting only their text and stripping out layout and page styling. The primary purpose of this mode, however, is to allow for easier reading of a page's primary content, rather than to address mobile friendly problems.
8.4 Internationalization Presentation Failures (IPFs)

Different techniques exist that target the detection of internationalization failures in web applications. GWALI [49] and the i18n checker [44] are automated techniques, while Apple's pseudo-localization [11] requires manual checking to identify IPFs. There is also a group of techniques [52, 50, 132] that performs automated checks for identifying internationalization problems, such as corrupted text, inconsistent keyboard shortcuts, and incorrect or missing translations. However, none of the aforementioned techniques are capable of repairing IPFs.

Another technique related to internationalization in web pages is TranStrL [157]. It locates strings in web applications that developers need to translate before internationalization can be considered complete. Although this technique helps developers carry out the internationalization process more thoroughly, it does not help developers repair their pages when translations have led to an IPF.

8.5 Mockup-driven Development Problems (MDDPs) and Regression Debugging Problems (RDPs)

There is a body of work related to my detection and localization approach, WebSee [103, 104, 102], that uses different differencing techniques to identify presentation problems in web applications.

DOM Differencing: Techniques based on textual differencing (e.g., diff) of the HTML source or on DOM comparisons [141, 59, 137, 118, 37, 147, 146, 161, 76, 75] are not applicable for detecting MDDPs and are of limited use for RDPs. There are two problems with these techniques. First, a textual difference between two pages does not necessarily imply that there is a visual difference. This is because (1) there are often several ways to implement the styling of HTML elements to make them appear the same (e.g., margin and padding can be used interchangeably to achieve the same visual effect, or setting the CSS property font-size to 'small' has the same visual effect as setting it to 13px), or (2) a page may have been restructured in a way that did not translate into a visual difference (e.g., when a <table> is converted to a table-less layout with <div> tags). Second, the lack of a textual difference does not imply that there are no presentation failures. For example, a failure can occur without any textual change to an <img> tag if the tag does not specify size attributes and the dimensions of the image file change on disk.

Invariant specification techniques: A group of techniques allows developers to specify invariants that will be checked and enforced on a web page. These include Selenium [42], Sikuli [57, 165], Cucumber [19], Crawljax [119, 116, 120], and Cornipickle [78]. There are two main disadvantages to this type of invariant specification approach compared to WebSee. First, these techniques require testers to exhaustively specify every correctness property to be checked, which may be very labor intensive. Second, the correctness properties are expressed in terms of HTML syntax, not the visual appearance of an HTML element. Therefore, these techniques may miss presentation failures caused, for example, by incorrect inheritance of an ancestor element's CSS properties.

Sikuli [57, 165] is an automation framework based on computer vision techniques that uses sub-image searching to identify and manipulate GUI controls in a web page. Although not intended for verification, one could provide a set of screenshots of each GUI element and use Sikuli to ensure that they match (i.e., that there are no presentation failures).
However, since Sikuli uses a sub-image-based search of the page, it could match the provided screenshots against any portion of the page, not necessarily the intended region. This means it would be ineffective if there were visually identical elements in the page. Furthermore, Sikuli only provides an element after a positive match; therefore, when there is a failure, no match will be made and no element(s) will be provided to the testers to help with localization.

Automated Oracles: Work by Sprenkle and colleagues proposes a suite of HTML-based automated oracle comparators to detect presentation failures in the current version of a web application compared to its previous version [147, 146]. These techniques therefore cannot be used for detecting MDDPs, where there is no prior working version of the web application to compare against. Another work facilitates the use of a graphical oracle to detect presentation problems [64]. However, this technique requires the testers to specify the characteristics that should be used to compare the images of the oracle and the page under test, and then to implement them in the form of extractors. The tester is also expected to implement the similarity function used to determine how close in appearance the two images are.

Visual Regression Testing: The tools "Wraith" [46] and "PhantomCSS" [38] were developed by BBC News and Huddle, respectively, for helping developers with visual/CSS regression testing. Both of these tools use pixel-level screenshot comparison to identify the differences between test and baseline pages, and report the pixel-level image differences to the developers. These tools have several shortcomings limiting their applicability in debugging presentation failures in web applications. First, the false positive rate in detecting presentation problems is high, owing to the strict pixel-to-pixel equivalence comparison used to identify differences. Small differences that represent concessions to coding simplicity, or failures that are within a level of tolerance that the development team does not consider to be a presentation failure, are still reported. Second, the tools do not facilitate the handling of the dynamic portions of pages. Third, the tools only report
8.6 Other presentation failure detection techniques There exist techniques for detecting other types of presentation failures besides the ones listed above. TheReDeCheck technique [155, 153, 154] uses a layout graph to nd regression failures in responsive web pages that adjust their layout according to the size of the browser's viewport. Another technique, \Fighting Layout Bugs" [149] can be used to automatically nd application agnostic presentation failures, such as overlapping text and unreadable text. However, both the techniques can only detect presentation failures, and are not able to repair them. 8.7 Testing and analysis of web app client-side components Techniques to test JavaScript [51, 129, 128, 83] and analyze CSS [117] have been proposed recently. These techniques deal with specic components of the client side and as such are not meant for repairing presentation failures in a web application. Another technique does impact analysis of CSS changes across a website [86], and noties the developer if changes made in a CSS le are introducing new presentation failures in other web pages of the website. However, this technique is not capable of repairing presentation failures. 129 There exist several techniques [112, 109, 110, 111, 150, 134, 135, 125] that use program slicing to extract features or behaviors of a web application for code reuse and debugging. However, these techniques cannot repair presentation failures in web applications. 8.8 GUI Testing Memon and colleagues [148, 115, 162, 122, 124] have done extensive work in the area of model- based GUI testing. These techniques test the behavior of a software system by triggering event sequences from the GUI. The purpose of their work is not xing presentation issues in the GUI, but rather using the GUI as a driver to nd behavioral problems in the system. 130 Chapter 9 Conclusion and Future Work In this chapter, I conclude by giving a summary of my dissertation work and discuss directions of future work. 9.1 Summary Web applications have become an important part of our daily lives for performing both profes- sional and personal activities, such as shopping, banking, networking, and email. We can very conveniently access websites from a range of dierent browsers that run on a variety of dierent platforms and devices, and render the sites in a language of our choice. Although this model is very convenient for the end-users, it is extremely challenging for developers to ensure that a website ren- ders consistently across the wide range of browsers, that the website is as user friendly on a mobile device as it is on a desktop device, and that the website adapts its layout gracefully for all of the dierent languages. The inability to do so can result in the website having User Interface (UI) is- sues, such as Cross Browser Issues (XBIs), Mobile Friendly Problems (MFPs), Internationalization Presentation Failures (IPFs), Mockup-driven Development Problems (MDDPs), and Regression Debugging Problems (RDPs). I collectively refer to such UI issues as presentation failures | a discrepancy between the actual appearance of a website and its intended appearance | that can degrade the aesthetics of a website and aect its usability and functionality likely resulting in a frustrating and poor user experience. Despite the importance of presentation failures, there exist no techniques for their automated repair, making it a manual task that is labor-intensive and requires signicant expertise. 
To address these limitations, the goal of my research is to automate the process of repairing presentation failures in web applications. My dissertation work furthers this goal by developing different techniques for the automated repair of different types of presentation failures in web applications, such as XBIs, MFPs, and IPFs. The hypothesis statement of my dissertation is: Search-based techniques can repair presentation failures in a web page with high effectiveness.

To evaluate this hypothesis, I designed and developed four approaches for repairing different types of presentation failures in web applications, namely XBIs, MFPs, IPFs, MDDPs, and RDPs. All four repair approaches were designed using search-based techniques. The effectiveness of my approaches in repairing presentation failures was evaluated by measuring the reduction in the number of presentation failures reported by existing detection techniques in the before- and after-repair versions of the pages. I also conducted user studies to measure the effectiveness of the repair approaches in reducing the human-observable presentation failures and to understand the impact of the generated repairs on the aesthetic quality of the pages from a human perspective.

The first approach is XFix, which targets the repair of layout XBIs in web pages. XFix uses two phases of guided search. The first phase finds candidate fixes for each of the root causes identified for an XBI. The second phase then finds a subset of the candidate fixes that together minimizes the number of XBIs in the web page. The empirical evaluation of XFix on 15 real-world web pages showed that it was able to resolve 86% of the XBIs reported by X-PERT, a well-known XBI detection tool, and 99% of the XBIs observed by humans. In a user study assessing the improvement in consistency between the repaired and reference pages, 78% of the participant ratings reported an improvement in the cross-browser consistency of the repaired web pages.

The second approach, MFix, targets the repair of MFPs in web pages. MFix first segments the page into areas that form natural visual groupings. It then builds graph-based models of the segments and layout of the page and uses the constraints represented by these graphs to compute a repair that can improve mobile friendliness while minimizing layout disruption. In the evaluation, MFix was successfully able to resolve MFPs for 95% of the subjects. In a user study, the participants overwhelmingly preferred the repaired version of the website for use on mobile devices, and also considered the repaired page to be significantly more readable than the original.

The third approach, IFix, targets the repair of IPFs in web pages. IFix first uses a clustering technique that identifies groupings of elements that are stylistically similar and adjusts them together in order to maintain the visual consistency of the page. It then uses a guided search-based technique that quantifies the amount of distortion in a page by leveraging existing IPF detection techniques and UI change metrics. In the evaluation, IFix was able to successfully resolve 98% of the reported IPFs in the subjects. In a user study of the repaired web pages, the repairs met with high user approval, with over 70% of the user responses rating the visual quality of the fixed pages as significantly higher than that of the unfixed versions.

The fourth and final approach, GFix, targets the repair of MDDPs and RDPs in web pages.
GFix uses guided search-based techniques to automatically find repairs for the detected MDDPs and RDPs in web pages. As its fitness function, GFix uses computer vision techniques to quantify the amount of visual difference between the actual appearance of a web page and its intended appearance. In the evaluation of GFix on a set of real-world subjects, I found that the approach was able to accurately identify repairs for the failures, and it met with a high user approval rate.

Overall, the four approaches have demonstrated high effectiveness in repairing different types of presentation failures in web pages while maintaining or enhancing the visual appeal of the pages, thereby confirming the hypothesis of my dissertation. To the best of my knowledge, my research is the first automated approach for generating repairs for presentation failures, and the first to apply search-based repair techniques to web pages.

9.2 Future Work

In the future, addressing presentation failures will continue to be an important problem for developers, not just in the domain of web applications but in other domains as well, such as mobile applications. My dissertation identifies several real-world challenges in the domain of presentation failures that motivate this direction of research, and it also lays the foundation for developing automated techniques that can repair presentation failures in different software applications.

One possible direction for future work is to design approaches using search-based techniques for repairing more types of presentation failures in web applications. One such example is accessibility issues. Web accessibility is important, as it allows the inclusion of people with disabilities in the current technology-savvy society and enables them to independently perform work and personal activities, such as shopping and banking. An approach could be designed for the repair of accessibility issues by identifying detection techniques to quantify their impact and designing a suitable search-based algorithm to identify successful fixes. Another example of presentation failures that could be repaired using search-based techniques is responsive web design problems. Responsive web design is a paradigm that allows developers to design web pages that dynamically adapt their layout to different device sizes. The repair of presentation problems caused by responsive web design would pose new research challenges: for example, analysis to identify which lists of hyperlinks could be grouped into a drop-down menu, a refactoring to carry out this change, and a method to quantify the change's impact.

Another direction of future work is to retarget my repair approaches to areas other than web applications, particularly native mobile apps and apps on IoT (Internet of Things) devices with a Graphical User Interface (GUI), such as smartwatches, thermostats, and gaming systems. This would pose new research challenges, such as analyzing the characteristics of the presentation failures in these domains, finding a method to quantify the impact of the failures, and identifying ways to carry out their repair.

The search-based techniques employed by my repair approaches are non-deterministic in nature, requiring an approach to be run multiple times and the best repair to be selected. Another direction of future work could be to explore deterministic algorithms for repairing presentation failures in web applications. One possible way is to model the layout of web pages as a system of constraints and find suitable repairs using constraint solving.
Another possible way is to mine large web app repositories to identify repair patches.

References

[1] Hobbes' Internet Popularity. http://www.zakon.org/robert/internet/timeline/.
[2] IFix Evaluation Data. https://github.com/USC-SQL/ifix.
[3] Stack Overflow search - mobile friendly problems with bootstrap.
[4] Web Sales Generated Revenue. https://www.census.gov/retail/mrts/www/data/pdf/ec_current.pdf.
[5] How Do Browsers Display Web Pages, and Why Don't They Ever Look the Same? http://www.makeuseof.com/tag/how-do-browsers-display-web-pages-and-why-dont-they-ever-look-the-same/, 2015.
[6] Browser Specific CSS Hacks. http://browserhacks.com/, 2016.
[7] CSS Hacks. https://en.wikipedia.org/wiki/CSS_hack, 2016.
[8] YUI 3 CSS Reset. http://cssreset.com/scripts/yahoo-css-reset-yui-3/, 2016.
[9] Alexa Top 50 Websites by Category. https://www.alexa.com/topsites/category, 2017.
[10] Alexa Top 500 Global Sites. http://www.alexa.com/topsites, 2017.
[11] Apple Internationalization and Localization Guide. https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/TestingYourInternationalApp/TestingYourInternationalApp.html, 2017.
[12] Automated Browser Compatibility Testing. http://www.browsera.com/, 2017.
[13] Bing Mobile Friendly Test Tool. https://www.bing.com/webmaster/tools/mobile-friendliness, 2017.
[14] bMobilized Website. http://bmobilized.com/, 2017.
[15] Browser Net Market Share. https://www.netmarketshare.com/, 2017.
[16] BrowserStack for Testing Mobile Websites. https://www.browserstack.com/, 2017.
[17] Chrome Reader Mode. https://github.com/chromium/dom-distiller, 2017.
[18] CSS Parser. http://cssparser.sourceforge.net/, 2017.
[19] Cucumber for BDD. https://cucumber.io/, 2017.
[20] Duda Website. https://www.dudamobile.com/, 2017.
[21] Estimates for Digital Users. https://www.emarketer.com/Article/eMarketer-Releases-Updated-Estimates-US-Digital-Users/1015275, 2017.
[22] Firebug. https://addons.mozilla.org/en-US/firefox/addon/firebug/, 2017.
[23] Firefox Reader Mode. https://support.mozilla.org/en-US/kb/firefox-reader-view-clutter-free-web-pages, 2017.
[24] Front-end Developers Job Postings. http://www-scf.usc.edu/~spmahaja/front-end-job-postings/, 2017.
[25] Google Mobile Friendly Problem Types. https://support.google.com/webmasters/answer/6352293, 2017.
[26] Google Mobile Friendly Test Tool. https://search.google.com/test/mobile-friendly, 2017.
[27] Google PageSpeed Insights Tool. https://developers.google.com/speed/pagespeed/insights/, 2017.
[28] Google Search Ranking based on Mobile Friendliness. https://support.google.com/adsense/answer/6196932?hl=en, 2017.
[29] Google Study for Mobile Usage. https://developers.google.com/search/mobile-sites/, 2017.
[30] GTAC 2011: Web Consistency Testing. https://www.youtube.com/watch?v=_6fV-6eMSUM, 2017.
[31] HTML 5. http://www.w3schools.com/tags/, 2017.
[32] HTML 5 Migration. http://www.w3schools.com/html/html5_migration.asp, 2017.
[33] List of Browsers. https://en.wikipedia.org/wiki/List_of_web_browsers, 2017.
[34] Mobify Website. https://www.mobify.com/, 2017.
[35] Mobile Market Share. http://gs.statcounter.com/platform-market-share/desktop-mobile/worldwide/#monthly-201407-201707, 2017.
[36] Mobilifyit Website. http://www.mobilifyit.com/, 2017.
[37] Mogotest. http://mogotest.com/, 2017.
[38] PhantomCSS - Visual/CSS regression testing. https://github.com/Huddle/PhantomCSS, 2017.
[39] Random URL Generator. http://www.uroulette.com/, 2017.
[40] Safari Reader Mode.
https://en.wikipedia.org/wiki/Safari_(web_browser), 2017.
[41] SauceLabs for Testing Mobile Websites. https://saucelabs.com/, 2017.
[42] Selenium HQ. http://docs.seleniumhq.org/, 2017.
[43] Stack Overflow cross-browser posts. http://stackoverflow.com/questions/tagged/cross-browser, 2017.
[44] W3C Internationalization Checker. https://validator.w3.org/i18n-checker/, 2017.
[45] WompMobile Website. http://www.wompmobile.com/, 2017.
[46] Wraith - A screenshot comparison tool. https://github.com/BBC-News/wraith, 2017.
[47] Abdulmajeed Alameer and William G. J. Halfond. An empirical study of internationalization failures in the web. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), October 2016.
[48] Abdulmajeed Alameer, Sonal Mahajan, and William G. J. Halfond. Detecting and Localizing Internationalization Layout Failures in Web Applications. In submission.
[49] Abdulmajeed Alameer, Sonal Mahajan, and William G. J. Halfond. Detecting and Localizing Internationalization Presentation Failures in Web Applications. In Proceedings of the 9th IEEE International Conference on Software Testing, Verification and Validation (ICST), April 2016. Acceptance rate: 27%. Best Paper Award.
[50] J. Archana, S. R. Chermapandan, and S. Palanivel. Automation framework for localizability testing of internationalized software. In 2013 International Conference on Human Computer Interactions (ICHCI), pages 1-6, Aug 2013.
[51] Shay Artzi, Julian Dolby, Simon Holm Jensen, Anders Møller, and Frank Tip. A framework for automated testing of JavaScript web applications. In Proceedings of the 33rd International Conference on Software Engineering, ICSE, pages 571-580, New York, NY, USA, 2011. ACM.
[52] Aiman M. Ayyal Awwad and Wolfgang Slany. Automated bidirectional languages localization testing for Android apps with rich GUI. Mobile Information Systems, 2016, 2016.
[53] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, pages 322-331, New York, NY, USA, 1990. ACM.
[54] David Binkley, Nicolas Gold, Mark Harman, Syed Islam, Jens Krinke, and Shin Yoo. ORBS: Language-independent Program Slicing. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 109-120, New York, NY, USA, 2014. ACM.
[55] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, November 2003.
[56] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A Graph-theoretic Approach to Webpage Segmentation. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, 2008.
[57] Tsung-Hsiang Chang, Tom Yeh, and Robert C. Miller. GUI Testing Using Computer Vision. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 1535-1544, New York, NY, USA, 2010. ACM.
[58] Shauvik Roy Choudhary. X-PERT Code. https://github.com/gatech/xpert, 2015.
[59] Shauvik Roy Choudhary, Mukul R. Prasad, and Alessandro Orso. CrossCheck: Combining Crawling and Differencing to Better Detect Cross-browser Incompatibilities in Web Applications. In Proceedings of the IEEE Fifth International Conference on Software Testing, Verification and Validation (ICST), pages 171-180, Washington, DC, USA, 2012. IEEE Computer Society.
[60] John Clarke, Jose Javier Dolado, Mark Harman, Rob Hierons, Bryan Jones, Mary Lumkin, Brian Mitchell, Spiros Mancoridis, Kearton Rees, Marc Roper, et al. Reformulating software engineering as a search problem. In IEE Proceedings - Software, volume 150, pages 161-175. IET, 2003.
[61] Michael G. Collins and John J. Barton. Crossfire: Multiprocess, Cross-browser, Open-web Debugging Protocol. In Proceedings of the ACM International Conference Companion on Object Oriented Programming Systems Languages and Applications Companion, OOPSLA, pages 115-124, 2011.
[62] Valentin Dallmeier, Martin Burger, Tobias Orth, and Andreas Zeller. WebMate: A Tool for Testing Web 2.0 Applications. In Proceedings of the Workshop on JavaScript Tools (JSTools), pages 11-15. ACM, 2012.
[63] Valentin Dallmeier, Bernd Pohl, Martin Burger, Michael Mirold, and Andreas Zeller. WebMate: Web application test generation in the real world. In Software Testing, Verification and Validation Workshops (ICSTW), 2014 IEEE Seventh International Conference on, pages 413-418. IEEE, 2014.
[64] Marcio Eduardo Delamaro, Fatima de Lourdes dos Santos Nunes, and Rafael Alves Paes de Oliveira. Using concepts of content-based image retrieval to implement graphical testing oracles. Softw. Test. Verif. Reliab., volume 23, pages 171-198, 2013.
[65] Pedro A. Diaz-Gomez and Dean F. Hougen. Initial Population for Genetic Algorithms: A Metric Approach. In Proceedings of the International Conference on Genetic and Evolutionary Methods, GEM '07, 2007.
[66] Cyntrica Eaton and Atif M. Memon. An Empirical Approach to Testing Web Applications Across Diverse Client Platform Configurations. In International Journal of Web Engineering and Technology (IJWET), Special Issue on Empirical Studies in Web Engineering, volume 3, pages 227-253, Geneva, Switzerland, 2007. Inderscience Publishers.
[67] Florian N. Egger. "Trust Me, I'm an Online Vendor": Towards a Model of Trust for e-Commerce System Design. In CHI Extended Abstracts on Human Factors in Computing Systems. ACM, 2000.
[68] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD, 1996.
[69] Sonal Mahajan et al. XFix Project Page. https://github.com/sonalmahajan/xfix, 2016.
[70] Andrea Everard and Dennis F. Galletta. How Presentation Flaws Affect Perceived Site Quality, Trust, and Intention to Purchase from an Online Store. Journal of Management Information Systems, 22:56-95, January 2006.
[71] B. J. Fogg, Jonathan Marshall, Othman Laraki, Alex Osipovich, Chris Varma, Nicholas Fang, Jyoti Paul, Akshay Rangnekar, John Shon, Preeti Swani, and Marissa Treinen. What Makes Web Sites Credible?: A Report on a Large Quantitative Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI, 2001.
[72] Nicolas Gallagher and Jonathan Neal. Normalize CSS. https://necolas.github.io/normalize.css/, 2016.
[73] Google. Consumer Study. 2018.
[74] Google. Content Sizing. 2018.
In Proceedings of the 31st International Conference on Software Engineering, ICSE '09, pages 408{418, Washington, DC, USA, 2009. IEEE Computer Society. [77] Antonin Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD Rec., volume 14, pages 47{57, June 1984. [78] Sylvain Hall e, Nicolas Bergeron, Francis Guerin, and Gabriel Le Breton. Testing Web Ap- plications Through Layout Constraints. In 8th IEEE International Conference on Software Testing, Verication and Validation, ICST 2015, Graz, Austria, April 13-17, 2015, pages 1{8, 2015. [79] Mark Harman. The Current State and Future of Search Based Software Engineering. In 2007 Future of Software Engineering, FOSE '07, pages 342{357, Washington, DC, USA, 2007. IEEE Computer Society. [80] Mark Harman and Bryan F Jones. Search-based software engineering. In Information and software Technology, volume 43, pages 833{839. Elsevier, 2001. [81] E. R. Harold. Refactoring HTML: Improving the Design of Existing Web Applications. Addison-Wesley Professional, 1st edition, Dec 2012. [82] Joseph Kempka, Phil McMinn, and Dirk Sudholt. Design and Analysis of Dierent Al- ternating Variable Searches for Search-Based Software Testing. In Theoretical Computer Science, volume 605, pages 1{20, 2015. [83] Emre Kiciman and Benjamin Livshits. Ajaxscope: A platform for remotely monitoring the client-side behavior of web 2.0 applications. SIGOPS Oper. Syst. Rev., 41(6):17{30, October 2007. [84] B. Korel. Automated Software Test Data Generation. In IEEE Transactions on Software Engineering, volume 16, pages 870{879, 1990. [85] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. A sys- tematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering, ICSE, pages 3{13, 2012. [86] Hsiang-Sheng Liang, Kuan-Hung Kuo, Po-Wei Lee, Yu-Chien Chan, Yu-Chin Lin, and Mike Y. Chen. SeeSS: Seeing What I Broke { Visualizing Change Impact of Cascading Style Sheets (CSS). In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, UIST '13, pages 353{356, New York, NY, USA, 2013. ACM. [87] Gitte Lindgaard, Cathy Dudek, Devjani Sen, Livia Sumegi, and Patrick Noonan. An Ex- ploration of Relations Between Visual Appeal, Trustworthiness and Perceived Usability of Homepages. In ACM Trans. Comput.-Hum. Interact., volume 18, pages 1:1{1:30, May 2011. [88] Gitte Lindgaard, Gary Fernandes, Cathy Dudek, and Brown J. Attention web designers: You have 50 milliseconds to make a good rst impression! In Behaviour & Information Technology, volume 25, pages 115{126, 2006. 139 [89] Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic acceptance. In Physica A Volume 391, Issue 6, pages 2193{2196, 2012. [90] Fernando G. Lobo and Cl audio F. Lima. A Review of Adaptive Population Sizing Schemes in Genetic Algorithms. In Proceedings of the 7th Annual Workshop on Genetic and Evolu- tionary Computation, GECCO '05, 2005. [91] Fan Long and Martin Rinard. Staged program repair with condition synthesis. In Proceed- ings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 166{178, 2015. [92] Patrick J. Lynch and Sarah Horton. Web Style Guide, 3rd Edition: Basic Design Principles for Creating Web Sites. Yale University Press, New Haven, CT, USA, 3rd edition, 2009. [93] Heikki Maaranen, Kaisa Miettinen, and Antti Penttinen. 
On initial populations of a genetic algorithm for continuous optimization problems. Journal of Global Optimization, 2006. [94] Sonal Mahajan. MFix Project. 2017. [95] Sonal Mahajan, Negasadat Abolhassani, Phil McMinn, and William G. J. Halfond. Au- tomated Repair of Mobile Friendly Problems in Web Pages. In Proceedings of the 40th International Conference on Software Engineering (ICSE), May 2018. Acceptance rate: 20%. [96] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Search- Based Automatic Repair of Layout Cross-Browser Issues. In submission. [97] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Auto- mated Repair of Layout Cross Browser Issues using Search-Based Techniques. In Proceedings of the 26th International Symposium on Software Testing and Analysis (ISSTA), July 2017. Acceptance rate: 26%. Distinguished Paper Award. [98] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. XFix: An Automated Tool for the Repair of Layout Cross Browser Issues. In Proceedings of the 26th International Symposium on Software Testing and Analysis (ISSTA) { Demo Track, July 2017. [99] Sonal Mahajan, Abdulmajeed Alameer, Phil McMinn, and William G. J. Halfond. Auto- mated Repair of Internationalization Failures in Web Applications Using Style Similarity Clustering and Search-Based Techniques. In Proceedings of the 11th IEEE International Conference on Software Testing, Verication and Validation (ICST), April 2018. Accep- tance rate: 25%. Distinguished Paper Award. [100] Sonal Mahajan, Krupa Benhur Gadde, Anjaneyulu Pasala, and William G. J. Halfond. De- tecting and Suggesting Fixes for Visual Inconsistencies in Web Applications. In submission. [101] Sonal Mahajan, Krupa Benhur Gadde, Anjaneyulu Pasala, and William G. J. Halfond. Detecting and Localizing Visual Inconsistencies in Web Applications. In Proceedings of the 23rd Asia-Pacic Software Engineering Conference (APSEC) { Short paper, December 2016. Acceptance rate: 29%. [102] Sonal Mahajan and William G. J. Halfond. Finding HTML Presentation Failures Using Image Comparison Techniques. In Proceedings of the 29th IEEE/ACM International Con- ference on Automated Software Engineering (ASE) { New Ideas track, September 2014. Acceptance rate: 24%. 140 [103] Sonal Mahajan and William G. J. Halfond. Detection and Localization of HTML Presen- tation Failures Using Computer Vision-Based Techniques. In Proceedings of the 8th IEEE International Conference on Software Testing, Verication and Validation (ICST), April 2015. Acceptance rate: 24%. [104] Sonal Mahajan and William G. J. Halfond. WebSee: A Tool for Debugging HTML Pre- sentation Failures. In Proceedings of the 8th IEEE International Conference on Software Testing, Verication and Validation (ICST) { Tool track, April 2015. Acceptance rate: 36%. [105] Sonal Mahajan, Bailan Li, Pooyan Behnamghader, and William G. J. Halfond. Using Visual Symptoms for Debugging Presentation Failures in Web Applications. In Proceedings of the 9th IEEE International Conference on Software Testing, Verication and Validation (ICST), April 2016. Acceptance rate: 27%. [106] Sonal Mahajan, Bailan Li, and William G. J. Halfond. Root Cause Analysis for HTML Pre- sentation Failures Using Search-based Techniques. In Proceedings of the 7th International Workshop on Search-Based Software Testing (SBST), June 2014. Acceptance rate: 53%. [107] Sonal Mahajan, Phil McMinn, and William G. J. Halfond. 
A Search-Based Framework for the Automated Repair of Presentation Failures in Web Applications. In submission. [108] Andy Y. Mao, James R. Cordy, and Thomas R. Dean. Automated conversion of table- based websites to structured stylesheets using table recognition and clone detection. In Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research, CASCON '07, pages 12{26, Riverton, NJ, USA, 2007. IBM Corp. [109] Josip Maras, Jan Carlson, and Ivica Crnkovi. Extracting Client-side Web Application Code. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 819{828, New York, NY, USA, 2012. ACM. [110] Josip Maras, Jan Carlson, and Ivica Crnkovic. Towards Automatic Client-side Feature Reuse. In Web Informations System Engineering, WISE 2014, October 2013. [111] Josip Maras, Maja Stula, and Jan Carlson. Firecrow - A tool for Web Application Analy- sis and Reuse. In The 29th IEEE/ACM International Conference on Automated Software Engineering, pages 847{850. ACM, September 2014. Tool Demonstration. [112] Maras, Josip and Stula, Maja and Carlson, Jan and Crnkovic, Ivica. Identifying code of individual features in client-side web applications. IEEE Trans. Softw. Eng., 39(12):1680{ 1697, December 2013. [113] David Sawyer McFarland. CSS: The Missing Manual. O'Reilly, 2006. [114] Phil McMinn. Search-based Software Test Data Generation: A Survey: Research Articles. In Softw. Test. Verif. Reliab., volume 14, pages 105{156, Chichester, UK, June 2004. John Wiley and Sons Ltd. [115] Atif M. Memon, Ishan Banerjee, and Adithya Nagarajan. What Test Oracle Should I Use for Eective GUI Testing? In Proceedings of the International Conference on Automated So!ware Engineering (ASE), pages 164{173, 2003. [116] Ali Mesbah, Engin Bozdag, and Arie van Deursen. Crawling AJAX by Inferring User Interface State Changes. In Proceedings of the Eighth International Conference on Web Engineering, ICWE, pages 122{134, Washington, DC, USA, 2008. IEEE Computer Society. 141 [117] Ali Mesbah and Shabnam Mirshokraie. Automated analysis of CSS rules to support style maintenance. In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 408{418, Piscataway, NJ, USA, 2012. IEEE Press. [118] Ali Mesbah and Mukul R. Prasad. Automated cross-browser compatibility testing. In Proceedings of the 33rd International Conference on Software Engineering, ICSE, pages 561{570, New York, NY, USA, 2011. ACM. [119] Ali Mesbah and Arie van Deursen. Invariant-based automatic testing of AJAX user inter- faces. In Proceedings of the 31st International Conference on Software Engineering, ICSE, pages 210{220, Washington, DC, USA, 2009. IEEE Computer Society. [120] Ali Mesbah, Arie van Deursen, and Stefan Lenselink. Crawling Ajax-Based Web Applica- tions through Dynamic Analysis of User Interface State Changes. In ACM Trans. Web, volume 6, pages 3:1{3:30, March 2012. [121] Jan Odvarko Mike Buckley, Lorne Markham. Pixel Perfect Firefox. https://addons. mozilla.org/en-us/firefox/addon/pixel-perfect/, 2017. [122] Rodrigo M. L. M. Moreira, Ana C. R. Paiva, and Atif Memon. A Pattern-Based Approach for GUI Modeling and Testing. In Proceedings of the International Symposium on Software Reliability Engineering, ISSRE, pages 288 { 297, 2013. [123] Mark W. Newman and James A. Landay. Sitemaps, Storyboards, and Specications: A Sketch of Web Site Design Practice. 
In Proceedings of the 3rd Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, DIS '00, pages 263{ 274, New York, NY, USA, 2000. ACM. [124] Bao N. Nguyen, Bryan Robbins, Ishan Banerjee, and Atif Memon. GUITAR: An Innova- tive Tool for Automated Testing of GUI-driven Software. In Automated Software Engg., volume 21, pages 65{105, March 2014. [125] Hung Viet Nguyen, Christian K astner, and Tien N. Nguyen. Cross-language Program Slic- ing for Dynamic Web Applications. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 369{380, New York, NY, USA, 2015. ACM. [126] Hung Viet Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. Auto- locating and Fix-propagating for HTML Validation Errors to PHP Server-side Code. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, ASE, pages 13{22, Washington, DC, USA, 2011. IEEE Computer Society. [127] Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar Al-Kofahi, and Tien N. Nguyen. Recurring Bug Fixes in Object-oriented Programs. In Proceedings of the 32Nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE, pages 315{324, 2010. [128] Frolin S. Ocariza, Jr., Karthik Pattabiraman, and Ali Mesbah. Vejovis: Suggesting xes for javascript faults. In Proceedings of the 36th International Conference on Software Engi- neering, ICSE 2014, pages 837{847, New York, NY, USA, 2014. ACM. [129] Frolin S. Ocariza Jr., Karthik Pattabiraman, and Ali Mesbah. AutoFLox: An Automatic Fault Localizer for Client-Side JavaScript. In Proceedings of the 2012 IEEE Fifth Interna- tional Conference on Software Testing, Verication and Validation, ICST '12, pages 31{40, Washington, DC, USA, 2012. IEEE Computer Society. 142 [130] Fatih Kursat Ozenc, Miso Kim, John Zimmerman, Stephen Oney, and Brad Myers. How to Support Designers in Getting Hold of the Immaterial Material of Software. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 2513{ 2522, New York, NY, USA, 2010. ACM. [131] Pavel Panchekha and Emina Torlak. Automated Reasoning for Web Page Layout. In Pro- ceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, 2016. [132] R. Ramler and R. Hoschek. How to test in sixteen languages? automation support for lo- calization testing. In 2017 IEEE International Conference on Software Testing, Verication and Validation (ICST), pages 542{543, March 2017. [133] C. Ranganathan and Shobha Ganapathy. Key dimensions of business-to-consumer web sites. Inf. Manage., 39(6):457{465, May 2002. [134] Filippo Ricca and Paolo Tonella. Web Application Slicing. In Proceedings of the IEEE In- ternational Conference on Software Maintenance (ICSM'01), ICSM '01, pages 148{, Wash- ington, DC, USA, 2001. IEEE Computer Society. [135] Filippo Ricca and Paolo Tonella. Construction of the System Dependence Graph for Web Application Slicing. In Proceedings of the Second IEEE International Workshop on Source Code Analysis and Manipulation, SCAM '02, pages 123{, Washington, DC, USA, 2002. IEEE Computer Society. [136] Richard Romero and Adam Berger. Automatic Partitioning of Web Pages Using Clustering. In Proceedings of Mobile Human-Computer Interaction - MobileHCI 2004: 6th International Symposium, 2004. [137] Shauvik Roy Choudhary, Mukul R. Prasad, and Alessandro Orso. 
X-PERT: Accurate Iden- tication of Cross-browser Issues in Web Applications. In Proceedings of the 2013 Interna- tional Conference on Software Engineering, ICSE, pages 702{711, 2013. [138] Shauvik Roy Choudhary, Mukul R. Prasad, and Alessandro Orso. Cross-platform feature matching for web applications. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, pages 82{92, New York, NY, USA, 2014. ACM. [139] Shauvik Roy Choudhary, Mukul R. Prasad, and Alessandro Orso. X-PERT: A Web Ap- plication Testing Tool for Cross-browser Inconsistency Detection. In Proceedings of the International Symposium on Software Testing and Analysis, ISSTA, pages 417{420, 2014. [140] Shauvik Roy Choudhary, Mukul R. Prasad, and Alessandro Orso. X-pert: A web appli- cation testing tool for cross-browser inconsistency detection. In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, pages 417{420, New York, NY, USA, 2014. ACM. [141] Shauvik Roy Choudhary, Husayn Versee, and Alessandro Orso. Webdi: Automated identi- cation of cross-browser issues in web applications. In Proceedings of the IEEE International Conference on Software Maintenance, ICSM, pages 1{10, 2010. [142] Hesam Samimi, Max Sch afer, Shay Artzi, Todd Millstein, Frank Tip, and Laurie Hendren. Automated Repair of HTML Generation Errors in PHP Applications Using String Con- straint Solving. In Proceedings of the International Conference on Software Engineering, ICSE, pages 277{287, 2012. 143 [143] Andrs Sanoja and Stphane Ganarski. Block-o-Matic: A web page segmentation frame- work. In Proceedings of the International Conference on Multimedia Computing and Sys- tems, ICMCS, 2014. [144] Nataliia Semenenko, Marlon Dumas, and Tnis Saar. Browserbite: Accurate Cross-Browser Testing via Machine Learning over Image Features. In Proceedings of the IEEE Interna- tional Conference on Software Maintenance, ICSM, pages 528{531, Washington, DC, USA, 2013. IEEE Computer Society. [145] Gurvinder S Shergill and Zhaobin Chen. Web-based shopping: Consumers' attitudes to- wards online shopping in new zealand. Journal of Electronic Commerce Research, 6:2{79, 2005. [146] Sara Sprenkle, Holly Esquivel, Barbara Hazelwood, and Lori Pollock. WEBVIZOR: A Visualization Tool for Applying Automated Oracles and Analyzing Test Results of Web Applications. In Practice and Research Techniques, 2008. TAIC PART '08. Testing: Aca- demic & Industrial Conference, 2008. [147] Sara Sprenkle, Lori Pollock, Holly Esquivel, Barbara Hazelwood, and Stacey Ecott. Au- tomated Oracle Comparators for Testing Web Applications. Technical report, In the Intl. Symp. on Software Reliability Engineering, 2007. [148] Jaymie Strecker and Atif M. Memon. Testing Graphical User Interfaces. In Encyclopedia of Information Science and Technology, Second ed. IGI Global, 2009. [149] Michael Tamm. Fighting layout bugs. https://code.google.com/p/ fighting-layout-bugs/, October 2009. [150] Paolo Tonella and Filippo Ricca. Web Application Slicing in Presence of Dynamic Code Generation. Automated Software Engg., 12(2):259{288, April 2005. [151] Noam Tractinsky, Avivit Cokhavi, Moti Kirschenbaum, and Tal Shar. Evaluating the Consistency of Immediate Aesthetic Perceptions of Web Pages. In International booktitle of Human-Computer Studies, volume 64, pages 1071 { 1083, 2006. [152] Alexandre N. Tuch, Eva E. Presslaber, Markus St oCklin, Klaus Opwis, and Javier A. Bargas-Avila. 
The Role of Visual Complexity and Prototypicality Regarding First Im- pression of Websites: Working Towards Understanding Aesthetic Judgments. In Int. J. Hum.-Comput. Stud., volume 70, November 2012. [153] Thomas Walsh, Gregory Kapfhammer, and Phil McMinn. Automated Layout Failure De- tection for Responsive Web Pages without an Explicit Oracle. In Proceedings of the 26th International Symposium on Software Testing and Analysis (ISSTA), July 2017. [154] Thomas A. Walsh, Gregory M. Kapfhammer, and Phil McMinn. Redecheck: An automatic layout failure checking tool for responsively designed web pages. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2017, pages 360{363, New York, NY, USA, 2017. ACM. [155] Thomas A. Walsh, Phil McMinn, and Gregory M. Kapfhammer. Automatic Detection of Potential Layout Faults Following Changes to Responsive Web Pages. In International Conference on Automated Software Engineering (ASE), pages 709{714. ACM, 2015. [156] Weihang Wang, Yunhui Zheng, Peng Liu, Lei Xu, Xiangyu Zhang, and Patrick Eugster. Arrow: Automated repair of races on client-side web pages. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, pages 201{212, New York, NY, USA, 2016. ACM. 144 [157] Xiaoyin Wang, Lu Zhang, Tao Xie, Hong Mei, and Jiasu Sun. Locating Need-to-Translate Constant Strings in Web Applications. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE '10, 2010. [158] Xiaoyin Wang, Lu Zhang, Tao Xie, Yingfei Xiong, and Hong Mei. Automating Presentation Changes in Dynamic Web Applications via Collaborative Hybrid Analysis. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software En- gineering, FSE, pages 16:1{16:11, New York, NY, USA, 2012. ACM. [159] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. Automati- cally nding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering, ICSE, pages 364{374, 2009. [160] WellDoneCode. Perfect Pixel Chrome. https://chrome.google.com/webstore/detail/ perfectpixel-by-welldonec/dkaagdgjmgdmbnecmcefdhjekcoceebi?hl=en, 2017. [161] Qing Xie, Mark Grechanik, Chen Fu, and Chad M. Cumby. Guide: A GUI Dierentiator. In ICSM, pages 395{396, 2009. [162] Qing Xie and Atif M. Memon. Studying the Characteristics of a "Good" GUI Test Suite. In Proceedings of the 17th International Symposium on Software Reliability Engineering, ISSRE, pages 159{168, Washington, DC, USA, 2006. IEEE Computer Society. [163] Hector Yee, Sumanita Pattanaik, and Donald P. Greenberg. Spatiotemporal Sensitivity and Visual Attention for Ecient Rendering of Dynamic Environments. In ACM Trans. Graph., volume 20, January 2001. [164] Yangli Hector Yee and Anna Newman. A Perceptual Metric for Production Testing. In ACM SIGGRAPH 2004 Sketches, SIGGRAPH '04, 2004. [165] Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller. Sikuli: Using GUI Screenshots for Search and Automation. In Proceedings of the 22Nd Annual ACM Symposium on User Interface Software and Technology, UIST, pages 183{192, New York, NY, USA, 2009. ACM. [166] Shin Yoo, David Binkley, and Roger Eastman. Seeing Is Slicing: Observation Based Slicing of Picture Description Languages. In Proceedings of the 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation, SCAM '14, pages 175{ 184, Washington, DC, USA, 2014. IEEE Computer Society. 
[167] Sai Zhang, Hao L u, and Michael D. Ernst. Automatically Repairing Broken Work ows for Evolving GUI Applications. In Proceedings of the International Symposium on Software Testing and Analysis, ISSTA, pages 45{55, 2013. 145
Abstract
The appearance of a web application's User Interface (UI) plays an important part in its success. Issues that degrade the UI can negatively affect the usability of a website and impact an end user's perception of the website and of the quality of the services it delivers. Such UI related issues, called presentation failures, occur frequently in modern web applications. Despite their importance, there exist no automated techniques for repairing presentation failures. Instead, repair is typically a manual process in which developers must painstakingly analyze the UI of a website, identify the faulty UI elements (i.e., HTML elements and CSS properties), and carry out repairs. This process is labor intensive and requires significant expertise from developers.

My dissertation addresses these challenges and limitations by automating the process of repairing presentation failures in web applications. My key insight underlying this research is that search-based techniques can be used to find repairs for the observed presentation failures by intelligently and efficiently exploring the large solution spaces defined by the HTML elements and CSS properties in a web page. Based on this insight, I designed and developed four techniques for the automated repair of different types of presentation failures in web applications. The first technique focuses on the repair of layout Cross Browser Issues (XBIs), i.e., inconsistencies in the appearance of a website when rendered in different web browsers. The second technique addresses Mobile Friendly Problems (MFPs) in websites, i.e., improves the readability and usability of a website when accessed from a mobile device. The third technique repairs problems related to internationalization in web application UIs. Lastly, the fourth technique addresses issues arising from mockup-driven development and regression debugging. In the empirical evaluations, all four techniques were highly effective in repairing presentation failures, and in the accompanying user studies, participants overwhelmingly preferred the visual appeal of the repaired versions of the websites over their original (faulty) versions. Overall, these positive results indicate that my repair techniques can help developers repair presentation failures in web applications while maintaining their aesthetic quality.
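To make the search-based idea concrete, below is a minimal Python sketch of the general approach described in the abstract, not the dissertation's actual algorithms: candidate repairs are alternative values for a page's CSS properties, a fitness function scores how far the current candidate is from a failure-free rendering, and a simple local search repeatedly mutates the best candidate and keeps improvements. The property names, target values, and mutation step sizes are illustrative assumptions.

    import random

    # Illustrative sketch of search-based repair over CSS property values.
    # The fitness targets and step sizes below are hypothetical stand-ins for
    # an oracle that would, in practice, render the page and measure the
    # severity of the observed presentation failures.

    def fitness(css_values):
        """Lower is better; 0 means no presentation failure is detected."""
        target = {"width": 360, "font-size": 16}  # assumed failure-free values
        return sum(abs(css_values[p] - target[p]) for p in target)

    def mutate(css_values):
        """Perturb one CSS property value to produce a neighboring candidate."""
        child = dict(css_values)
        prop = random.choice(list(child))
        child[prop] += random.choice([-4, -2, -1, 1, 2, 4])
        return child

    def repair(initial, budget=1000):
        """Simple hill climb: keep the candidate with the best fitness."""
        best, best_score = initial, fitness(initial)
        for _ in range(budget):
            candidate = mutate(best)
            score = fitness(candidate)
            if score < best_score:   # accept only improvements
                best, best_score = candidate, score
            if best_score == 0:      # all measured failures repaired
                break
        return best

    faulty = {"width": 420, "font-size": 11}  # hypothetical faulty CSS values
    print(repair(faulty))                     # converges toward the targets

In the dissertation's techniques, the fitness evaluation would involve rendering the modified page and quantifying the remaining failures (e.g., cross-browser layout differences or mobile-friendliness scores), and richer search strategies such as genetic algorithms could replace this simple hill climb.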
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Detection, localization, and repair of internationalization presentation failures in web applications
Automatic detection and optimization of energy optimizable UIs in Android applications using program analysis
Automated repair of layout accessibility issues in mobile applications
Detecting SQL antipatterns in mobile applications
Techniques for methodically exploring software development alternatives
Energy optimization of mobile applications
Constraint-based program analysis for concurrent software
Utilizing user feedback to assist software developers to better use mobile ads in apps
Reducing user-perceived latency in mobile applications via prefetching and caching
Detecting anomalies in event-based systems through static analysis
Toward understanding mobile apps at scale
Deriving component-level behavior models from scenario-based requirements
Analog and mixed-signal parameter synthesis using machine learning and time-based circuit architectures
Static program analyses for WebAssembly
Efficient and effective techniques for large-scale multi-agent path finding
Asset Metadata
Creator
Mahajan, Sonal (author)
Core Title
Automated repair of presentation failures in Web applications using search-based techniques
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science (Software Engineering)
Publication Date
08/01/2018
Defense Date
04/19/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
automated repair, cross-browser issues, internationalization problems, mobile friendly problems, mockup-driven development problems, OAI-PMH Harvest, presentation failures, regression debugging problems, search-based techniques, software engineering, web applications
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Halfond, William G. J. (committee chair), Deshmukh, Jyotirmoy (committee member), Gupta, Sandeep (committee member), Medvidovic, Nenad (committee member), Wang, Chao (committee member)
Creator Email
sonalp.mahajan@gmail.com, spmahaja@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-45275
Unique identifier
UC11671340
Identifier
etd-MahajanSon-6596.pdf (filename), usctheses-c89-45275 (legacy record id)
Legacy Identifier
etd-MahajanSon-6596.pdf
Dmrecord
45275
Document Type
Dissertation
Rights
Mahajan, Sonal
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA