IMPROVING NETWORK SECURITY THROUGH COLLABORATIVE SHARING

by

Calvin Satiawan Ardi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2020

Copyright 2020 Calvin Satiawan Ardi

Dedication

To my parents, Helen and Edward.

Acknowledgements

This dissertation has been made possible with the guidance, encouragement, and support from many individuals, to whom I offer my profound gratitude.

John Heidemann, for his guidance and patient support as my advisor throughout the Ph.D. I am fortunate to have his encouragement from the very beginning to explore my own ideas and pursue many directions. His emphasis on clear and excellent work has enabled me to conduct research at the highest quality.

Ramesh Govindan and Bhaskar Krishnamachari for serving on my defense committee, and Aleksandra Korolova and Kristina Lerman for serving on my thesis proposal committee. I am also thankful to Ethan Katz-Bassett for the opportunity to work together earlier on in my studies.

My mentors and supervisors at LANL, Gina Fisk and Mike Fisk, and colleagues and collaborators at LANL, for providing their support and knowledge during my summers there. Our lengthy discussions enabled me to understand different perspectives about and applications of this research, and influenced the direction of this dissertation.

Colleagues and collaborators at the ANT Lab, for the sharing and cross-checking of ambitious ideas and research: Guillermo Baltra, Xue Cai, Asma Enayet, Xun Fan, Hang Guo, Zi Hu, Basileal Imana, Abdul Qadeer, Lin Quan, ASM Rizvi, Lan Wei, and Liang Zhu. Special thanks to Wes Hardaker, Yuri Pradkin, and Robert Story for their feedback on this research, and their helpfulness and patience in maintaining the computing infrastructure, especially during the times I pushed our resources to their limits.

Lizsl De Leon at USC, and Joe Kemp, Alba Regalado, and Jeanine Yamazaki at USC/ISI for their help and support over the years.

Tobias Bajwa (né Flach), Matt Calder, and Kyriakos Zarifis, for our fantastic friendship that started at USC and the insightful conversations on and beyond the Ph.D.

Alexander Bolton, Daniel Byrne, and Aaron Pope, for the unrelenting banter on research and life, and the adventures we have had during our summers at LANL and whenever our paths cross.

Raymond Cheng and Diana & Ian Vo, for our serendipitous and everlasting friendship, and their feedback and encouragement in this research—without them, I would have finished this dissertation much sooner.

David Chan, Jonathan Chu, Mike Kendall, Jameson Lee, Joanna Lee, Jonathan Liu, Thomson Nguyen, Edward Podojil, Alan Wong, George Wu, and many others for their support and the meaningful moments and stories that we share.

My parents and my brother, Laurence, for their constant and endless love, encouragement, and support.

Jodi, for her understanding and encouragement, as we move forward through life's journey together.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Thesis Statement
  1.2 Demonstrating the Thesis Statement
    1.2.1 Precise Detection of Content Reuse in the Web
    1.2.2 AuntieTuna: Personalized Content-Based Phishing Detection
    1.2.3 Retro-Future: Improving Network Security with Controlled Information Sharing
    1.2.4 Building a Collaborative Defense to Improve Resiliency Against Phishing Attacks
  1.3 Additional Research Contributions

Chapter 2: Precise Detection of Content Reuse in the Web
  2.1 Introduction
  2.2 Problem Statement
  2.3 Related Work
  2.4 Methodology
    2.4.1 Overview
    2.4.2 Discovery
    2.4.3 Detection
    2.4.4 Cleaning the Data
      2.4.4.1 Detecting and Handling Recursion Errors
      2.4.4.2 Stop Chunk Removal
    2.4.5 Chunking and Hashing
  2.5 Datasets and Implementation
    2.5.1 Datasets and Relevance
    2.5.2 Implementation
    2.5.3 Reproducibility
  2.6 Validation
    2.6.1 Do Our Cleaning Methods Work?
    2.6.2 Removing Recursion Errors
    2.6.3 Can We Discover Known Files and Chunks?
    2.6.4 Can We Detect Specific Bad Pages?
    2.6.5 Can We Detect Known Bad Neighborhoods?
    2.6.6 Cryptographic vs. Locality-Sensitive Hashes
  2.7 Analysis of Blind Discovery of Web Copying
    2.7.1 Why is File-level Discovery Inadequate?
    2.7.2 How Does Chunking Affect Discovery?
    2.7.3 Are There Bad Neighborhoods in the Real World?
  2.8 Applications With Expert-Identified Content
    2.8.1 Detecting Clones of Wikipedia for Profit
    2.8.2 Detecting Phishing Sites
  2.9 Conclusions

Chapter 3: AuntieTuna: Personalized Content-based Phishing Detection
  3.1 Introduction
  3.2 Related Work
    3.2.1 Automating Phish Detection
    3.2.2 Anti-Phishing User Interfaces
    3.2.3 The Role of User Education
  3.3 Design for User-customizable Anti-Phishing
    3.3.1 Identifying and Personalizing Target Content
    3.3.2 Processing Pages: Hashing and Detection
      3.3.2.1 Processing a Known-Good Page
      3.3.2.2 Processing Unknown Content
    3.3.3 Design Choices for Usability
    3.3.4 Implementation of Anti-Phishing in AuntieTuna
      3.3.4.1 Page Processing Workflow
      3.3.4.2 Platform-Specific Customizations
  3.4 Effectiveness of Phishing Detection
    3.4.1 Evaluation of Phish Detection Accuracy
    3.4.2 Resisting Potential Countermeasures
    3.4.3 Browser Performance with AuntieTuna
    3.4.4 Experiences in Real-World Usage
  3.5 Conclusions

Chapter 4: Retro-Future: Improving Network Security with Controlled Information Sharing
  4.1 Introduction
  4.2 Threat Model
    4.2.1 Data at Rest
    4.2.2 Data in Use
    4.2.3 Data in Motion
  4.3 Enabling Information Sharing with Cross-Site Queries
    4.3.1 Establishing Trust and Sharing Policies
    4.3.2 Moderating Queries
    4.3.3 Controlling Data Disclosure
    4.3.4 Time Travel
    4.3.5 Securing the Retro-Future System
  4.4 Case Studies Quantifying the Benefits of Sharing
    4.4.1 Detecting Bots and Botnet Activity
      4.4.1.1 Problem Statement
      4.4.1.2 Can sites detect malicious activity on their own?
      4.4.1.3 Does sharing help sites detect more malicious activity?
      4.4.1.4 How consistent are the benefits of sharing?
      4.4.1.5 Can sites improve their detection sensitivity when they share?
    4.4.2 Finding Malicious Activity with DNS Backscatter
      4.4.2.1 Problem Statement
      4.4.2.2 Can sites detect and classify originators on their own?
      4.4.2.3 Does sharing the results of processed DNS backscatter help sites find more malicious activity?
  4.5 Related Work
  4.6 Conclusions

Chapter 5: Building a Collaborative Defense to Improve Resiliency Against Phishing Attacks
  5.1 Introduction
  5.2 Problem Statement and Threat Model
    5.2.1 Target User Population and Their Attackers
    5.2.2 Threats and Defenses
    5.2.3 Problem Statement
  5.3 Related Work
  5.4 Improving Network Security with Anti-Phishing and Data Sharing
    5.4.1 Anti-Phishing with AuntieTuna
    5.4.2 Improving Collective Immunity with Data Sharing
    5.4.3 Risks
    5.4.4 Adversarial Countermeasures
    5.4.5 Implementation
  5.5 Case Study: Real Phishing Attacks on USC
  5.6 Evaluation
    5.6.1 Quantifying Services Using Single Sign-On at Universities
      5.6.1.1 How many online services are at a university?
      5.6.1.2 Are our service lists complete?
      5.6.1.3 How fast is the number of online services growing?
    5.6.2 Improving Enterprise Security at Home and the Office
      5.6.2.1 Is AuntieTuna effective in protecting enterprise sites without sharing?
      5.6.2.2 How many friends must share to protect enterprise sites?
      5.6.2.3 Evaluating SSO's advantages in effectively protecting enterprise sites
      5.6.2.4 How many friends must share to protect community sites?
      5.6.2.5 Do browsing histories of simulated users reflect the histories of actual users?
      5.6.2.6 Generalizing the benefits of sharing
    5.6.3 Improving Election Campaign Security
  5.7 Conclusions

Chapter 6: Conclusions
  6.1 Future Directions
  6.2 Conclusions

Bibliography

List of Tables

1.1 Demonstrating the thesis statement
2.1 Performance of detection on a phish corpus using cryptographic and locality-sensitive hashing
2.2 Categories of the top 100 distinct chunks in C_cc
2.3 Classification of a sample of 100 distinct chunks with more than 10^5 occurrences in C_cc
2.4 Classification of a sample of 40 bad neighborhoods from C_cc
2.5 Classification of a sample of 100 bad neighborhoods from C_g
2.6 Classification of the top 40 bad neighborhoods in C_cc, L = Wikipedia
3.1 Classification of phish in two days of PhishTank reports, based on detection against PayPal
3.2 Page render and AuntieTuna execution times
4.1 Summary of Retro-Future's threat model
4.2 Example of an organization's access control list (ACL)
4.3 Number of domains and IPs detected as suspicious activity at each site independently and with sharing
4.4 The sensitivity of BotDigger's detection is improved with controlled data sharing
4.5 Datasets used in processing DNS backscatter
4.6 Number of originators in each class for all datasets
4.7 Finding more malicious activity with the sharing of processed DNS backscatter
5.1 Examples of SSO-enabled web services at USC
5.2 Characterizing web services at USC and UCB
5.3 Datasets and user populations used to generalize the benefits of sharing, and corresponding growth function and baseline values

List of Figures

2.1 Prefix lengths of neighborhoods in Common Crawl (C_cc)
2.2 Cumulative distribution of neighborhood prefix lengths in GeoCities (C_g)
2.3 File-level discovery of injected duplicates in C_cc, compared to file frequency
2.4 Chunk-level discovery of injected duplicates in C_cc, compared to chunk distribution
2.5 Percentage of chunks discovered in blog.archive.org given the number of times it is duplicated
2.6 Effects of continuously adding chunks on pages
2.7 Effects of continuously deleting chunks on pages
2.8 Effects of continuously changing chunks on pages
2.9 Number of pages detected as bad after continuously changing chunks on all pages in blog.archive.org
2.10 Effects of continuously adding chunks in a neighborhood
2.11 Effects of continuously deleting chunks in a neighborhood
2.12 Effects of continuously changing chunks in a neighborhood
2.13 File-level discovery on C_g
2.14 Chunk-level discovery on C_g
2.15 Frequency of badness of neighborhoods in C_cc, as a histogram and CDF
2.16 Frequency of badness of neighborhoods in C_g, as a histogram and CDF
2.17 Implementation diagram of the AuntieTuna anti-phishing plugin
3.1 AuntieTuna's Personalize button
3.2 Detecting a phishing attempt against PayPal
3.3 Example of actively preventing a user from accessing a phishing site
3.4 Implementation diagram of the AuntieTuna anti-phishing plugin
4.1 Detecting suspect C&C domains and IPs at each site independently
4.2 Detecting suspect C&C domains and IPs at each site individually and using CSU's shared botnet activity lists
4.3 Improving the sensitivity of BotDigger's detection with controlled data sharing between sites over time
4.4 The sensitivity of BotDigger's detection is improved with controlled data sharing
4.5 Fraction of originator classes of Top-N originators
5.1 Users at USC use a standardized Single Sign-On process to access many first- and third-party services
5.2 An email sent on 2019-05-15 instructing a faculty member to complete a mandatory compliance survey
5.3 A spear-phishing email sent to everyone at USC/ISI on 2019-07-19
5.4 Detecting a phishing site attack against USC
5.5 An example of a Microsoft OneDrive phishing site attack against USC/ISI on 2020-06-09 that does not reuse content from the original, legitimate site
5.6 Number of Identity (IdP) and Service (SP) Providers in the InCommon Federation over time
5.7 A profile of actual and simulated users at USC and the number of USC services they use
5.8 Sharing known-good between users effectively inoculates them on sites at USC
5.9 Sharing with more friends increases the fraction of sites protected
5.10 Sharing known-good between users inoculates them on sites at USC even when SSO is not used
5.11 A comparison of the number of friends that users need to share with and the resulting protection when SSO is used and not used
5.12 Profile of users and all the external and internal sites and services they use
5.13 Time series of DNS activity for 8 random homes in the Case Connection Zone
5.14 Sharing with more friends increases the fraction of internet sites protected
5.15 A graphical representation of sites protected in one-on-one sharing
5.16 Sharing known-good between users effectively inoculates them on internet sites
5.17 Growth curves of median fraction of sites protected across different user populations and datasets
5.18 An example of deriving web history profiles from social network interactions

Abstract

As our world continues to become more interconnected through the Internet, cybersecurity incidents are correspondingly increasing in number, severity, and complexity. The consequences of these attacks include data loss and financial damages, and are steadily moving from the digital to the physical world, impacting everything from public infrastructure to our own homes. The existing mechanisms in responding to cybersecurity incidents have three problems: they promote a security monoculture, are too centralized, and are too slow.

In this thesis, we show that improving one's network security strongly benefits from a combination of personalized, local detection, coupled with the controlled exchange of previously-private network information with collaborators. We address the problem of a security monoculture with personalized detection, introducing diversity by tailoring to the individual's browsing behavior, for example. We approach the problem of too much centralization by localizing detection, emphasizing detection techniques that can be used on the client device or local network without reliance on external services. We counter slow mechanisms by coupling controlled sharing of information with collaborators to reactive techniques, enabling a more efficient response to security events.

We prove that we can improve network security by demonstrating our thesis with four studies and their respective research contributions in malicious activity detection and cybersecurity data sharing. In our first study, we develop Content Reuse Detection, an approach to locally discover and detect duplication in large corpora, and apply our approach to improve network security by detecting "bad neighborhoods" of suspicious activity on the web. Our second study is AuntieTuna, an anti-phishing browser tool that implements personalized, local detection of phish with user-personalization and improves network security by reducing successful web phishing attacks. In our third study, we develop Retro-Future, a framework for controlled information exchange that enables organizations to control the risk-benefit trade-off when sharing their previously-private data. Organizations use Retro-Future to share data within and across collaborating organizations, and improve their network security by using the shared data to increase detection's effectiveness in finding malicious activity. Finally, we present AuntieTuna-Schooling in our fourth study, extending the proactive detection of phishing sites in AuntieTuna with data sharing between friends.
Users exchange previously-private information with collaborators to collectively build a defense, improving their network security and their group's collective immunity against phishing attacks.

Chapter 1: Introduction

As our world continues to become more interconnected through the Internet, the number of cybersecurity incidents has correspondingly increased. These cyberattacks have remarkable diversity in attack vectors and a seemingly unlimited amount of bandwidth. The consequences of these incidents include the loss of an enormous amount of private data about individuals (Anthem [76], OPM [138], Yahoo [93]) and corporations (Sony [26], Bangladesh Bank [82]), and money spent cleaning up. In addition to cleanup costs, some organizations buy cybersecurity insurance [91], but claims may be denied [109] or may not cover the true cost of data loss. Data loss can lead to financial damages or, in the case of intellectual property theft, place victims at a competitive disadvantage. Not limited to "simple" data loss, the damage is growing past the digital boundary into the physical world, affecting critical infrastructure, from industrial systems (Stuxnet [78]) and hospitals (ransomware [133]) to our own homes (Internet of Things (IoT) malware, phish).

There are three problems with our current response to these cybersecurity incidents: existing mechanisms promote a security monoculture, are too centralized, and are too slow. These deficiencies can leave our systems and networks in a state of lesser security.

Existing solutions promote a security monoculture, defined as a computing environment using the same defensive techniques, resulting in the same vulnerabilities and attack vectors across environments. In web phishing detection, many browsers by default use blacklists (Google's Safe Browsing API) to determine whether a visited website URL is phish or not. As a consequence, an attacker counters this simply by quickly moving their phish around to different websites, leaving users vulnerable.

Existing solutions are too centralized, leading to new privacy risks. By delegating defenses to centralized services (like URL blacklists or spam detection outsourcing), users and organizations give their sensitive information to third parties. These third parties, now prime targets, are susceptible to data loss (by external attackers) and misuse (data sold to others), resulting in the loss of privacy of their users (Equifax [70]).

Finally, existing mechanisms are too slow, as reactive techniques are inefficient and proactive techniques are often slow-to-update, centralized sources. Reactive techniques, defined as responses requiring humans in the loop, are ultimately customized (an incident response team responding to a particular security event), as their response is inherently particular to the network or organization. However, reactive techniques are limited by human timescales both in responding to the event and in sharing the results, making them inefficient as attacks propagate much faster, on computer timescales. While proactive techniques (URL blacklists, intrusion detection systems, or antivirus software) are effective, they often require centralized updates in response to newfound threats and are thus slow to update, leaving users and organizations vulnerable while updates propagate.

In this thesis, we will show how we can improve network security through personalized, local detection and controlled information sharing (Section 1.1).
We prove that we can improve network security by demonstrating our thesis with four studies (Section 1.2) and their respective research contributions (Section 1.2, Section 1.3) in malicious activity detection and cybersecurity data sharing.

1.1 Thesis Statement

This thesis shows that improving one's network security strongly benefits from a combination of personalized, local detection, coupled with the controlled exchange of previously-private network information with collaborators.

We define improving one's network security in two ways: increasing the effectiveness of detecting malicious activity and of responding to security incidents. Our contribution is to quantitatively show more effective detection by finding more, previously undiscovered, malicious activity with a novel personalized, local detection technique, and by augmenting existing detection techniques with a novel cross-site, controlled data exchange of previously-private information. Responding to security incidents more effectively means decreasing the amount of resources it takes to respond to security incidents.

We define personalized, local detection and controlled data exchange below.

Personalized detection means a detection technique that is customized to the user based on their behavior or environment. In contrast, traditional detection generally applies the same set of broad techniques or software (antivirus, firewalls), treating all environments the same (or in a set number of categories). Our novel contribution is to apply personalization to anti-phishing. When we apply personalization to phish detection, it generates uncertainty in attackers, improving resiliency against a monoculture defense and increasing an attacker's effort to launch a successful attack. For example, we personalize our anti-phishing techniques by keeping track of target content (susceptible to phish) that the user uses. This personalization makes it harder for attackers to successfully launch bulk phishing campaigns, because we can detect phish copying directly from the sites the user uses, requiring attackers to expend much more resources to craft customized phish to avoid our specific technique.

Local detection means a detection technique that runs standalone or independently on the client or on a particular organization's network, without reliance on third-party or cloud services. Keeping detection on the client side has benefits in protecting users' privacy while maintaining good detection performance. Our use of local detection is not novel, but it is an important foundation upon which we build in our work.
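To make these two building blocks concrete, the sketch below illustrates personalized, local detection in Python. It is a minimal illustration, not the AuntieTuna implementation described in Chapter 3: the paragraph-level chunking rule, the names, and the match threshold are assumptions chosen for clarity. All state stays on the user's device, and only sites the user actually visits are tracked.

```python
import hashlib
from urllib.parse import urlparse

def chunk_hashes(html):
    """Split a page into coarse structural chunks (here, naively on </p>)
    and hash each chunk with a cryptographic hash."""
    chunks = (c.strip() for c in html.split("</p>"))
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks if c}

class PersonalizedDetector:
    """Local detection: all state lives on the client; no URLs leave it."""

    def __init__(self):
        self.known_good = {}  # site -> set of chunk hashes the user saw there

    def learn(self, url, html):
        """Personalization: remember content only for sites this user visits."""
        site = urlparse(url).netloc
        self.known_good.setdefault(site, set()).update(chunk_hashes(html))

    def is_suspect(self, url, html, threshold=3):
        """Flag a page that reuses several chunks from a *different* known site."""
        site = urlparse(url).netloc
        seen = chunk_hashes(html)
        return any(len(seen & hashes) >= threshold
                   for good, hashes in self.known_good.items() if good != site)
```

Because each user's known-good list reflects only their own browsing, two users running the same code present different defenses, which is exactly the diversity that counters a monoculture.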
The controlled exchange of network information is data sharing while managing ownership and the amount and quality of information disclosed. While we do not develop any new techniques in controlled exchange, we use it to enable data sharing of previously-private information—information that has not been shared previously.

With controlled exchange in place, data owners can share previously-private network information—parts of private data that are now shareable with collaborators because of protections put in place on data in storage, in transit, and in use. Sharing previously-private network information is an old idea—indeed, users and organizations have long shared private data in an ad-hoc manner. Our novel contribution is to show that when we formalize sharing, organizations are more willing to share their previously-private information. We start by using controlled data exchange to share sensitive data with different users and organizations at different corresponding trust levels. Data owners balance the risk-benefit trade-off by controlling data disclosure through minimizing data (anonymization, differential privacy) and enforcing system limits (privacy budgets, query logs and audits). These tools provide data owners with the capability to share what is essential to making forward progress while minimizing risk to users' and the organization's privacy.

Finally, exchanging with collaborators generally means sharing data with friends or peers in social circles. Our novel contribution is to show that when we apply privilege escalation to sharing data, a user or organization can view competitors in the same industry, or even strangers, as additional collaborators. For example, an organization which initially shares little to nothing today might share its previously-private information with a competitor if the situational context is sufficient (network incident response) to make forward progress.

1.2 Demonstrating the Thesis Statement

We first discuss how the parts of our thesis statement address the problems presented at the beginning of Chapter 1, and then we demonstrate the thesis statement with four studies in malicious activity detection and cybersecurity data sharing.

We described earlier that there are three problems with our current response to cybersecurity incidents: existing mechanisms promote a security monoculture, are too centralized, and are too slow.

We address the problem of a security monoculture with personalized detection. We introduce diversity in network defenses by personalizing the individual's or organization's defense. In our phishing example, we personalize each user's defense by keeping track of "known good" sites specifically used by that user and comparing their content with each visited page for possible phish. By tailoring to each user's browsing habits, we present a varying defense across all users. Having different defenses means different available attack vectors, requiring the attacker to make a greater, concerted effort.

We address the problem of too much centralization by localizing detection. By emphasizing detection that is located at the client or in the network, without undue reliance on centralized services, we preserve user and organizational privacy by maintaining full control of our own data.

We approach the problem of slow mechanisms with controlled sharing of information with collaborators. By coupling information sharing to reactive techniques, we can respond to security events more efficiently by removing humans from parts of the response loop and automating certain processes (such as retrieving relevant information from disparate sources). Similarly, adding information sharing to proactive techniques augments centralized updates by enabling collaborating users and organizations to quickly exchange data (malicious website fingerprints, detection signatures) to protect against new and developing threats.

Given these problems and our corresponding solutions, the result will be better network security to protect against existing and future attacks at both the local and global levels.
We will demonstrate the thesis statement with four studies in detecting malicious activity and studying the benefits of sharing cybersecurity data. Table 1.1 highlights each study and the specific parts of the thesis statement it supports.

thesis statement                                Chapter 2       Chapter 3    Chapter 4     Chapter 5
                                                Content Reuse   AuntieTuna   Retro-Future  AuntieTuna-
                                                Detection                                  Schooling
improve network security                             ✓              ✓             ✓             ✓
personalized detection                                              ✓                           ✓
local detection                                      ✓              ✓             ✓             ✓
controlled information exchange of
previously-private data between collaborators                                    ✓             ✓

Table 1.1: Demonstrating the thesis statement.

In Chapter 2, we develop Content Reuse Detection, an approach to locally discover and detect content reuse in web-size corpora using cryptographic hashing techniques, and apply our approach to improve network security by detecting "bad neighborhoods" of suspicious activity on the web. We next present AuntieTuna in Chapter 3, an anti-phishing tool that implements personalized, local detection of phish using user-personalization and our previous methods in content reuse to improve network security by reducing successful web phishing attacks. In Chapter 4, we present Retro-Future, a controlled information exchange framework with principled risk and benefit management that formalizes sharing of previously-private data between collaborating organizations to improve network security by increasing detection's effectiveness with shared data. Finally, we conduct a study on extending AuntieTuna with data sharing in Chapter 5 to build AuntieTuna-Schooling, enabling the controlled information exchange between collaborating friends to promote a collective immunity against phishing attacks, leading to improved network security for the individual user and their group of friends. We present our conclusions in Chapter 6.

1.2.1 Precise Detection of Content Reuse in the Web

Our first study in Chapter 2 presents Content Reuse Detection, an approach that efficiently finds duplication in large corpora, using hash-based methods to blindly discover content duplication and then detect this duplication in web-sized datasets.

Replicating web content is easy, and some people replicate content for commercial gain. Some individuals bulk copy high-quality content from Wikipedia, adding advertisements or using copied content for search engine optimization or link farms. Others reuse selected content to impersonate sites for phishing.

Our contribution is an approach to blindly discover content duplication and detect this duplication in web-size corpora using hash-based methods. To support our thesis, we can improve network security by finding previously undetected "bad neighborhoods", or hierarchical clusters of copied and potentially malicious content, on the web using our approach of hash-based informed discovery and detection.

The novelty of our work is that our hash-based methods enable the scaling of local discovery and detection to web-sized datasets—all processing can be done on a local cluster of commodity hardware. Additionally, we implement a form of personalization in informed discovery, seeding the discovery process with labeled content of interest that we want to detect in other corpora. For example, we demonstrate our approach by looking for copies of Wikipedia on the web. We first seed a copy of Wikipedia in discovery, and then detect Wikipedia clones in a corpus of the web, finding that most copies (86 %) are monetized with advertisements or in link farms.
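The following sketch shows the shape of this seeded ("informed") discovery and detection. It is illustrative only: chunk_hashes is the helper from the sketch in Section 1.1, and the 0.5 badness threshold is an arbitrary placeholder rather than the tuned value used in Chapter 2.

```python
def build_label_set(seed_pages):
    """Informed discovery: hash every chunk of a labeled seed corpus
    (for example, pages from a Wikipedia dump)."""
    L = set()
    for html in seed_pages:
        L |= chunk_hashes(html)
    return L

def detect_copies(corpus, L, badness=0.5):
    """Detection: yield files whose chunks are mostly drawn from L."""
    for url, html in corpus:
        hashes = chunk_hashes(html)
        if hashes and len(hashes & L) / len(hashes) >= badness:
            yield url, len(hashes & L) / len(hashes)
```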
Our approach inspires our later studies in the controlled sharing of network information, as our informed discovery process can be initialized with labeled data from another source: we ask if we can take observations in malicious activity detection from one vantage point and effectively apply its results to other, different vantage points.

1.2.2 AuntieTuna: Personalized Content-Based Phishing Detection

Our second study in Chapter 3 presents AuntieTuna, a web browser extension that implements an effective and novel technique to detect phishing sites using both cryptographic hashing and user-personalization.

Phishing sites are fake sites that masquerade as their legitimate counterparts to steal sensitive information. We need to improve one's network security by protecting against the increasing threat of these phishing attacks, thereby preventing potential information and financial losses.

AuntieTuna helps improve one's network security, thereby demonstrating that part of the thesis, by reducing successful web phishing attacks through locally detecting and blocking phish sites. By detecting and blocking phishing websites that look like the original target site, AuntieTuna prevents users from having their login or financial details stolen. Phishing site detection is accomplished by first identifying and indexing the original, target sites' content using cryptographic hashing into target content lists, and then watching for entries in the lists to appear at incorrect websites as an indicator of phish.

AuntieTuna is able to reduce web phishing attacks effectively because of its local and personalized detection, thereby proving that part of the thesis. Our novel contribution is to implement our phish detection algorithm as a browser extension that runs locally on the host: AuntieTuna works without reliance on cloud or third-party services and does not require users to submit visited URLs to blacklisting services. Another contribution is to personalize phish detection to each user's browsing behavior, keeping detection lightweight in computation and storage by tracking only the sites that the user uses.

AuntieTuna implements manual data sharing to increase phish detection's effectiveness, which inspires our later studies in cross-site, controlled sharing of network information. Although AuntieTuna does not implement a controlled information exchange, users can manually share their target content lists with each other, and organizations can distribute their lists to their members, employees, or collaborators. By combining the shared data with their existing target content list, users can augment AuntieTuna's effectiveness in phish detection.
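As a toy sketch of that manual exchange, assume a target content list is (as in the Section 1.1 sketch) a map from site to chunk hashes; the JSON encoding here is an assumption for illustration, not AuntieTuna's actual exchange format. Merging a collaborator's list is a set union, so coverage only grows, and the shared hashes reveal which sites are protected but not the recipient's browsing history.

```python
import json

def export_target_list(known_good):
    """Serialize a target content list (site -> chunk hashes) for sharing."""
    return json.dumps({site: sorted(hashes) for site, hashes in known_good.items()})

def import_target_list(known_good, shared_json):
    """Merge a collaborator's exported list into our own."""
    for site, hashes in json.loads(shared_json).items():
        known_good.setdefault(site, set()).update(hashes)
```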
Our contribution is Retro-Future, a framework for the controlled exchange of information that for- malizes data sharing between organizations and enables organizations to control the risk-benet trade-o between the risks of data disclosure with the benets of forward progress in network security. To support our thesis, Retro-Futureimprovesnetworksecurity by increasing the eectiveness of local detection of malicious activity with previously-private data shared between organizations. Another contribution is to show that with data sharing, we can improve our local detection algo- rithms’ sensitivity by increasing the diversity, or quantity and quality, of the input data. (Although these local detection algorithms aren’t explicitly personalized, we consider in our next study how personaliza- tion across multiple users can be used to improve network security.) For example, detecting botnet activity on local networks via DNS queries is not very sensitive (detecting true positive malicious activity) if an organization does not have diversity in network trac. In order for improve detection with greater precision, organizations need to share its sensitive botnet activity data with 9 others, and they require that a secure sharing system be in place before being comfortable with disclosing their data. Retro-Future enables thesharingofpreviously-privatenetworkinformation across collaborat- ing organizations by allowing them to control and balance the risk and benet trade-o in data sharing. Our contribution with Retro-Future is to provide controls to manage the risks and benets in sharing through cross-site information sharing, sensitivity levels at each site, ecient retrospective search and processing. In our botnet activity detection example, a larger organization can use Retro-Future to share its sensitive botnet activity data securely with smaller organizations. The shared data allows the smaller organization to detect malicious activity with greater precision. Retro-Future provides a unique approach to balancing privacy risks with forward progress with the ability to “time travel” through data archives. Retrospective search and processing, as part of Retro-Future, helps resolve sharing on human timescales (interesting network events happen at computer timescales, with subsecond granularity) and enables us to make decisions on privacy after-the-fact. An incident re- sponse team may start immediate triage of an event by querying their own and their collaborators’ data and escalate their privilege to look at more sensitive data if needed to resolve an incident. Once the rest of the world is “awake”, they can jointly make post-mortem decisions on data handling and sanitization. 1.2.4 BuildingaCollaborativeDefensetoImproveResiliencyAgainstPhishingAttacks Our fourth study in Chapter 5 presentsAuntieTuna-Schooling, an approach to improving network se- curity at home and the enterprise with the proactive detection of phishing sites and data sharing between friends. Individuals at home and work continue to be at risk of phishing attacks, with the consequence of nancial loss and theft of personal data or intellectual property. Single Sign-On (SSO) is increasingly in use at large organizations for authentication. While SSO improves security with a common authentication 10 process, it also becomes an attractive vector to use in phishing attacks: attackers that phish for victims’ SSO credentials can get access to all of the organization’s resources. 
AuntieTuna-Schooling helpsimprovesone’snetworksecurity by improving phishing defenses with preemptive ltering that leverages the user’s or organization’s social circles: a user builds a whitelist of sites that are known-good, and then shares their known-good with their friends. The insight behind our approach is to bring together components and ideas from our prior two studies on personalized, local de- tection and the controlled exchange of previously-private network information with collaborators. Com- bining detection and sharing provides a defense that is tailored to each group and continuously improving, increasing a malicious actor’s level of eort needed to launch a successful attack. We protect users from phishing sites using AuntieTuna’spersonalized,localdetection to nd mali- cious websites and users thenexchangepreviously-privatenetworkinformationwithcollaborators to collectively build a defense. AuntieTuna allows the user to customize their local defense against phish- ing, and sharing enables them to share their formerly-sensitive customization with circles of their friends, improving their group’s collective immunity. Our novel contribution is to extend AuntieTuna with methods for peer-to-peer and centralized data sharing of known-good sites, enabling users to build a collaborative defense against web phishing attacks via inoculation. Our insight is that groups of users that share the same anity (like aliation or extracur- ricular interests) will benet in increased protection due to sharing because they visit the same common sites for work and personal use. We show that our collaborative defense can be successful at protecting internal institutional sites and external “community” sites, even with relatively modest sharing between other users. 11 1.3 AdditionalResearchContributions In addition to proving the thesis, each of our studies also have additional research contributions. We next describe each study’s additional research contributions. In our study of precise content reuse detection, our additional contributions include evaluating our approach in detecting content reuse on the web, presenting our design choices in adapting hashing to scale discovery and detection across web-size datasets, and showing that our approach is robust to minor changes. We nd that 6–11 % of content on the web is duplicated for commercial gain, and develop a new technique in bad neighborhood detection which nds clusters of duplicated content based on the hierar- chical nature of the web. We present our choices in heuristics for cleaning the data of recursion errors, and normalizing and chunking text content. We quantify the trade-o in choosing between cryptographic and locality-sensitive hash (LSH) functions for detection, nding that the LSH’s false positive rate is untenable for very large datasets, motivating our use of cryptographic hashing. We carefully validate our approach with a series of experiments in adding, modifying, and deleting content, nding that our approach works despite changes to the underlying content. In our study of personalized phishing detection, our additional contributions include AuntieTuna’s de- sign to improve usability and detection’s performance in phishing. We explore how AuntieTuna’s usability is maximized by minimizing user interaction for “hands-free” use, automation of target content identica- tion, and minimizing performance overhead. We show that we can use cryptographic hashing to precisely detect phish with 58.8 % accuracy and no false positives. 
In our study of controlled information sharing, our additional contributions include quantifying the benets of sharing between organizations by discovering additional malicious activity and presenting the qualitative benets of regularizing data sharing. We quantify the benets of Retro-Future’s data sharing in two studies in detecting previously unknown botnet activity and nding Internet-wide activity with DNS backscatter. After sharing and combining the dierent views of multiple organizations’ networks, 12 organizations were able to discover more suspect activity together than on their own—sometimes even detecting activity that was previously undetected at any location. By regularizing data sharing, we no longer treat sharing as one-o, exceptional events but as the normal course of business: we can then start to build more advanced and sensitive detection techniques. Finally, in our study of building a collaborative defense against phishing attacks, our additional contri- butions include identifying Single Sign-On (SSO) as a defense and target for phishing, quantifying the risk SSO presents for enterprises such as large universities, and showing that building a collaborative defense across social and even ad-hoc groups is possible. We identify SSO as an attractive target for phishing, presenting a case study about a phishing attack targeting users at a university using SSO. We quantify the risk that SSO presents by enumerating services at multiple universities, nding that most are SSO- enabled, and measuring the growth of SSO-enabled services at hundreds of organizations. We then show that our approach in building a collaborative defense against phishing is well matched to protecting users and organizations carrying out political election campaigns. 13 Chapter2 PreciseDetectionofContentReuseintheWeb In this chapter, we present Content Reuse Detection, an approach that eciently nds duplication in large corpora, using hash-based methods to blindly discover content duplication and then detect this duplication in web-sized datasets. Our techniques in content reuse detection will form the basis of our approach in AuntieTuna, an anti-phishing browser plugin, which we describe in the next chapter. This study of content reuse detection on the web partially supports our thesis statement. We can improve network security by nding previously undetected bad neighborhoods, or hierarchical clusters of copied and potentially malicious content, on the web. We nd bad neighborhoods with our approach of local discovery and detection, using hash-based methods to enable our approach to scale to web-sized datasets using commodity hardware. For example, we apply our approach to nd that 86 % of Wikipedia copies are monetized with link farms or advertisements. Detection can be personalized by seeding the informed discovery process with labeled content of interest that we want to detect in other corpora; we demonstrate using informed discovery that we can detect 59 % of PayPal phishing sites. Part of this chapter was previously published in the ACM SIGCOMM Computer Communications Re- view [12]. 14 2.1 Introduction A vast amount of content is online, easily accessible, and widely utilized today. User-generated content lls many sites, sometimes non-commercial like Wikipedia, but more often commercial like Facebook and Yelp, where it supports billions of dollars in advertising. 
However, sometimes unscrupulous entities repackage this content, wrapping their commercial content around this previously published information to make a quick prot. There are several recent examples of misleading reuse of content. Content farming involves reposting copies of Wikipedia or discussion forums to garner revenue from new advertisements, or to ll out link farms that support search-engine “optimization”. E-book content farming republishes publicly available information as e-books to attract naive purchasers and spam the e-book market. (Tools like Autopilot Kindle Cash can mass-produce dozens of “books” in hours.) Review spamming posts paid reviews that are often fake and use near-duplicate content to boost business rankings. The common thread across these examples is that they gather and republish publicly available information for commercial gain. Our goal is to develop methods that eciently nd duplication in large corpora, and to show this approach has multiple applications. We show that our method (Section 2.7) eciently nds instances of mass-republishing on the Internet: for example, sites that use Wikipedia content for advertising (Sec- tion 2.8.1). While copying Wikipedia is explicitly allowed, bulk copying of copyrighted content is not. Even when allowed, content farming does little to enhance the web, and review spamming and e-book content farming degrade the impact of novel reviews and books, much as click fraud degrading legiti- mate advertising. Our approach can also detect phishing sites that use duplicated content to spoof users (Section 2.8.2). Our insight is that cryptographic hashing can provide an eective approach in duplication detection, and scales well to very large datasets. A hash function takes arbitrary content input and produces a sta- tistically unique, simple, xed-length bitstring. We build lists of hashes of all documents (or “chunks”, 15 subparts of documents) in web-size corpora, allowing very rapid detection of content reuse. Although minor changes to content result in dierent hashes, we show that copying can often be identied in the web by nding the same chunks across documents. Economically, spammers seek the greatest amount of prot with minimal work: we see that current spammers usually do not bother to obfuscate their copying. (If our work forces them to hide, we at least increase their eort.) Our work complements prior work in semantic ngerprints [58, 123, 85] and locality-sensitive hashing [80]. Such approaches provide approx- imate matching, reducing false negatives at the cost of some false positives. While semantic hashing is ideal for applications as computer forensics, where false positives are manageable, our approach is rele- vant to duplicate detection in web-size corpora, whereprecise matching without false positives, since even a tiny rate of false positives overwhelms true positives when applied to millions of documents on the web (Section 2.6.6). We also explore blind (automated) discovery of duplicated content. We evaluate our approach on several very large, real-world datasets. We show that blind discovery can automatically nd previously unknown duplicated content in general web-scale corpora (Section 2.7), evaluating Common Crawl (2.8610 9 les) and GeoCities (26.710 6 les). While most general duplication is benign (such as templates), we show that 6–11 % of widespread duplication on the web is for commercial gain. 
We also show thatexpert-labeleddatasets can be used with our approach to eciently search web-size corpora or to quickly search new pages on the web. We demonstrate bulk searches by looking for copies of Wikipedia on the web (Section 2.8.1), nding that most copies of Wikipedia (86 %) are commercialized (link farming or advertisements). We also show that our approach can detect phishing in web pages (Sec- tion 2.8.2), demonstrating a Chrome plugin and evaluating it with a targeted dataset that nds that 59 % of PayPal phish, even without taking measures to defeat intentional cloaking (for example, source-code obfuscation). Contributions: The contributions of this chapter are to support the thesis statement (described at the beginning of Chapter 2) and to show that hash-based methods can blindly discover content duplication 16 thendetect this duplication in web-size corpora. The novelty of this work is not the use of hashing (a long- existing technique), but design choices in adapting hashing (with chunking and cleaning, Section 2.4.4) to scale to discovery (Section 2.4.2) and detection (Section 2.4.3) across web-size datasets, and to operate robustly in the face of minor changes. In particular, our approach to discovery can be thought of as a form of semi-supervised machine learning. One important step is our use of the hierarchical nature of the web to nd clusters of copied content (“bad neighborhoods”, Section 2.7.3). We show that this approach applies not only to discovering general duplication (Section 2.7) and identifying bulk copying in the web (Section 2.8.1), but also to detecting phishing activity (Section 2.8.2). We support our work with valida- tion of the correctness of discovery and detection in our methodology (Section 2.6) and evaluation of bad neighborhood detection’s robustness to random document changes (Section 2.6.5). Our data and code are available for research reproducibility (Section 2.5.3). 2.2 ProblemStatement Replicating web content is easy. Some individuals bulk copy high-quality content from Wikipedia or Face- book to overlay advertisements, or to back-ll for link farms. Others reproduce selected content to im- personate high-value sites for phishing. We seek to develop new approaches to address two problems. First, we want to automatically discover content that is widely duplicated, or large-scale duplication in a few places. Second, given a list of known duplicated content, we want to detect where such content is duplicated. We next dene these two problems more precisely. Consider a corpusC of lesf. Interesting corpora, such as a crawl of the Internet, are typically far too large to permit manual examination. We assume the corpus consists of semi-structured text; we use minimal understanding of the semantics of the text to break it into chunksc f by choosing basic, structural delimiters (without attempting to infer the meaning of the content itself). Each le is identied by URLs; 17 we can exploit the hierarchical structure in the path portion of the URL, or treat them as at space identied only by the sitename portion. Our rst problem is discovery. In discovery, our goal is to discover a labeled dataset L consisting of content of interest we believe to be copied. The simplest way to determineL is for an expert to examine C and manually identify it, thus building an expert-labeled datasetL from content inC. (The corpusC used to buildL can be the same as or dierent than the corpus later used in detection—for now we use the sameC in discovery and detection.) 
Although not possible in general, semi-automated labeling is suitable for some problems (Section 2.8) where one can build L independently from known information. Alternatively, we show how to discover L through a blind discovery process, without external knowledge. We explore this approach to discover content that is widely duplicated in the web (Section 2.7).

The detection process finds targets T in the corpus C that duplicate portions of the labeled dataset L. In addition to finding individual files that show high levels of copying, we also exploit the hierarchical grouping of documents in C to find bad neighborhoods N, defined as clusters of content sharing the same URL hierarchy where many files appear to be duplicated.

2.3 Related Work

There is significant prior work in detection of duplicated content to reduce storage or network use and to find near-duplicate content for plagiarism detection or information retrieval, and in phish detection.

Storage & Network Optimization: Content duplication detection serves many purposes, and several fields have revolved around the idea. Data deduplication can be used to efficiently store similar or identical pieces of data once [100, 136, 84]. Our work shares some of the same fundamental ideas through the use of cryptographic hashing and chunking to effectively find duplicate or similar files. They have explored, for example, chunking files into both fixed- and variable-sized blocks and hashing those chunks to find and suppress duplicate chunks in other files. In our chunking methods, we consider variable-sized blocks delimited by HTML tags, leveraging the structure provided by the markup language. While they target storage optimization of files, focusing on archival hardware and systems, our applications target detection of duplication, often for commercial gain, at both the file and neighborhood level.

The same concept can be used in a network to reduce the amount of data transferred between two nodes [117, 98]. Network operators can reduce WAN bandwidth use by hashing transmitted packets (or chunks of packets) in real time and storing this data in a cache. Network users can then see improved download times when retrieving data matched in the cache (for example, when multiple users download the same file or visit the same webpage). Our work also uses the idea of hashing to detect duplicates, but with different applications. While their work looks at suppressing redundant, duplicate downloads from the web, our work looks at finding where duplication exists on the web. Their application forces efficient, streaming processing and a relatively small corpus (caches are less than 100 GB to minimize overhead), while our web analysis is suitable for offline processing with corpora larger than 1 PB.

Plagiarism Detection is a very different class of application. Storage and network optimization requires exact reproduction of original contents. Existing approaches to plagiarism detection in documents emphasize semantic matching, as plagiarism is also concerned with subtly copied concepts, in addition to exact text reuse. Plagiarism detection makes use of stylometric features [40, 114], measuring writing structure and styles, in addition to text statistics (counts of words and parts-of-speech). Our work aims to answer the question of whether massive duplication exists at web scale using syntactic methods; we do not attempt to infer semantic equivalence of the content.
Because plagiarism detection is typically done over small- to moderate-sized corpora (comparing an essay or homework assignment to roughly 1000 others), long runtimes (minutes to sometimes hours per document [37]) and a relatively large rate of false positives (precision = 0.75, for example [40]) are tolerable. Manual review can address false positives, and with a relatively small corpus, the absolute number of false positives can be manageable even if the rate is not small. In our applications, our parallelized processing enables us to maintain good performance even as the corpus grows (Section 2.5.2). Additionally, we require high precision in detecting reuse since with large corpora (10^9 documents or more), even a small false positive rate quickly makes human review impractical (Section 2.6.6).

Information Retrieval: Document similarity and detection is at the heart of the field of information retrieval (IR). Approaches in IR have explored duplicate detection to improve efficiency and the precision of answers [85, 22, 58, 118]. Our use of cryptographic hashing has high precision at the cost of lower recall by missing mutated files.

Broder et al. [18] develop a technique called "shingling" (today known as n-grams) to generate unique contiguous subsequences of tokens in a document and cluster documents that are "roughly the same". They use this technique to find, collapse, and ignore near-duplicates when searching for information (to avoid showing users the same content multiple times). In our applications, we specifically look for content matches and require new approaches (cryptographic hashes) to avoid overwhelming numbers of false positives from approximate matching (Section 2.6.6).

SpotSigs [123] and Chiu et al. [23] use n-grams in different applications to search for similarities and text reuse with approximate matching. SpotSigs extends the use of n-grams by creating signatures of only semantic (meaningful) text in documents, ignoring markup language details (like HTML tags). Their system for approximate matching is quadratic (O(n^2)) in its worst case, but it can trade off runtime performance against the threshold of similarity. Our work looks for precise content matches in quasilinear time (O(n log n)). In one of our applications, phish detection, we leverage the details of HTML to enable precise detection of phish.

Chiu et al. build an interface within a web browser to interact with their back-end, enabling users to query for sentence-level similarities between their visited page and other pages in a larger corpus. We use precise matching of paragraph-level chunks, with applications in detecting widespread duplication across the web. We distinguish between the discovery and detection sides of the problem, allowing us to better separate the problems of finding "what content is duplicated" and finding "where the content is being duplicated". In Chiu et al., "discovery" is a manual search query performed by the user, while in our work we can perform discovery as an automated process.

Cho et al. [24] use sentence-level hashing of text to search for duplicated content in order to improve Google's web crawler and search engine. By identifying duplicates with hashing, a web crawler becomes more efficient as it avoids visiting clusters of similar pages. Because the crawler avoids and suppresses duplicates, the quality of search results is improved, with more diverse sites being returned in response to a user's query.
Our work complements this prior work with different chunking strategies and different applications in measurement and anti-phishing. While Cho et al. extract, chunk, and hash only textual information (no markup), we look at paragraph-level chunking of a document's native format, finding it to be effective in duplicate detection. Rather than avoid and suppress duplicates, we focus on precisely finding and identifying clusters of similar pages to measure content reuse on the web (for commercial gain or otherwise) and to detect phishing websites as part of an anti-phishing strategy.

Zhang et al. [141] build a near-duplicate detection technique by chunking documents at the sentence level and matching their signatures across a variety of English and Chinese datasets (1.69×10^6–50.2×10^6 documents, 11–490 GB). We chunk documents at the paragraph level, and compare the performance of matching at the file and paragraph level. We focus on applications in detecting commercialized duplication and web phishing, showing that our techniques can scale to web-size corpora (Section 2.5; 2.86×10^9 documents, 99 TB). Their signature creation also leverages prior work, using shingles [18], SpotSigs [123], and I-Match (SHA-1) [25], preferring I-Match and its efficiency. Our work validates SHA-1's performance, which we leverage to achieve precise and efficient detection.

Henzinger [58] compares the performance of algorithms that use shingling and Charikar's locality-sensitive hashing (LSH). While LSH achieves better precision than shingling, combining the two provides even higher precision. Exploration of LSH is an interesting possible complement to our use of cryptographic hashing: although the objective of this chapter is not a survey of algorithms, we briefly compare LSH and cryptographic hashing in Section 2.6.6.

Yang and Callan [135] develop a system that implements a clustering algorithm using document metadata as constraints to group near-duplicates together in EPA and DOT document collections. They exploit constraints in document metadata; we instead focus on general datasets that provide no such metadata.

Kim et al. [71] develop an approximate matching algorithm for overlap and content reuse detection in blogs and news articles. A search query is compared to sentence-level signatures for each document, with returned results being within some Euclidean distance √d of each other. Their system, tested on corpus sizes of 1×10^3–100×10^3 documents, balances the trade-off between a higher true positive rate (recall) and lower precision with their algorithm's quadratic runtime (O(n^2)). They also optimize part of their processing by using incremental updates for new content. We focus on precise matching in quasilinear time (O(n log n)) on larger (paragraph-level) chunks of content reuse. In our applications, we look at detecting large-scale content reuse in web-scale corpora (10^9 documents), requiring high precision to avoid being overwhelmed with false positives (which would require costly post-processing).

Phish Detection: We summarize prior work here, from our previous, more detailed review [10]. Machine learning has been used to detect phish, by converting a website's content [54] or URL and domain properties [81] into a set of features to train on. Other approaches measure the similarity of phish and original sites by looking at their content and structure: similarities can be computed based on the website's visual features like textual content, styles, and layout [79].
Many of these approaches use approximate matching, which runs the risk of producing false positive detections, and machine learning techniques have high computational requirements (which would make them difficult to run on clients). Our use of precise content matching helps avoid false positives, runs efficiently in clients, and can provide a first pass that complements heavier approaches.

2.4 Methodology

We next describe our general approach to detecting content reuse. Although we have designed the approach for web-like corpora, it also applies to file systems or other corpora containing textual content like news sources.

2.4.1 Overview

Our general approach is to compute a hash for each data item, then use these hashes to find identical objects. In this section we present our workflow and address the discovery and detection phases of our approach.

Collecting the Data:

0. Crawl the web, or use an existing web crawl, and correct acquisition errors (Section 2.4.4.1).

1. For each file f in corpus C, compute a hash of the whole file f: H(f), and

2. Split f into a vector of chunks c_f = {c_{f,1}, ..., c_{f,n}} and hash each chunk H(c_{f,i}) to form a chunk hash vector H(c_f). (While the vector contains an ordering of hashed chunks, we do not currently use the order.)

Discovery: (Section 2.4.2)

3. Populate the labeled dataset with files L_f or chunks L_c by either:

(a) informed discovery: seeding it with known content a priori

(b) blind discovery: (i) identifying the most frequently occurring files or chunks in C as suspicious, after (ii) discarding known common but benign content (stop-chunk removal, Section 2.4.4.2)

Detection: (Section 2.4.3)

4. Simple Object Matching: Given a labeled dataset of hashed chunks or files L, find all matches in C whose hash is in L_o. This results in target (suspicious) files and chunks: T_f and T_c.

5. Partial Matching: To identify files containing partial matches, we use the chunk hash vectors to compute the badness ratio of target chunks to total file content:

    contains(L_c, f) = |L_c ∩ H(c_f)| / |H(c_f)|

If contains(L_c, f) is greater than a threshold, we consider f to be a partial target in T.

6. Bad Neighborhood Detection: Apply stop-chunk removal (Section 2.4.4.2), then for each neighborhood N = {f_{N,1}, ..., f_{N,n}}, where the files share a hierarchical relationship, compute the overall badness ratio of labeled content matches to total content:

    badness(N) = ( Σ_{n ∈ N} contains(L_c, n) ) / |N|

If badness(N) is greater than a threshold, we consider N a bad neighborhood in T_N.

The thresholds for partial matching and bad neighborhood detection are configurable; we set the default threshold to one standard deviation over the mean. We elaborate on our choice and how to select a threshold in Section 2.4.2.

2.4.2 Discovery

Discovery is the process of building a labeled dataset of items we wish to find in the corpus during detection. We can do this with an informed or blind process.

With informed discovery (Step 3a), an expert provides labeled content of interest L, perhaps by exploring C manually, or by using external information. As one example, we know that Wikipedia is widely copied, and so we seed L with a snapshot of Wikipedia (Section 2.8.1). One could also seed L with banking websites to identify phishing sites that reproduce this content (we seed L with PayPal in Section 2.8.2).

One can also identify L through a blind discovery process (Step 3b) that automatically finds widely duplicated content. Blind discovery is appropriate when an expert is unavailable, or if the source of copying is unknown.
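As a concrete illustration of Step 3b, the following minimal sketch counts chunk duplicates in memory and labels those above a threshold, skipping stop chunks. The names and the default threshold are illustrative, not from our implementation, which distributes the same count-and-threshold computation with MapReduce (Section 2.5.2):

    from collections import Counter

    def blind_discovery(chunk_hashes, stop_chunks, threshold=10**5):
        """Blind discovery (Step 3b): label every chunk whose duplicate
        count exceeds `threshold`, discarding known-benign stop chunks
        (Section 2.4.4.2). Returns the labeled dataset L_c as a set."""
        counts = Counter(chunk_hashes)  # duplicates(c) for each distinct chunk c
        return {h for h, n in counts.items()
                if n > threshold and h not in stop_chunks}

The same logic applies at file granularity to build L_f from whole-file hashes.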
We first populate L_f and L_c with the most frequently occurring files or chunks in the corpus. We set the discovery threshold depending on the dataset size and the type of object being identified. For example, one would set the threshold higher when the dataset is larger. We looked at the ROC curves (a plot of the true positive rate against the false positive rate) and found a trade-off between false positives (FP) and true positives (TP). There was no strong knee in the curve, so we picked thresholds with a reasonable balance of FP to TP. In the Common Crawl dataset of 40.5×10^9 chunks, we set the threshold to 10^5.

Additionally, in our discovery process we expect to find trivial content that is duplicated many times as part of the web publishing process: the empty file, a chunk consisting of an empty paragraph, or the reject-all robots.txt file. These will inevitably show up very often and litter L: while common, they are not very significant or useful indicators of mass duplication. To make blind discovery more useful, we remove this very common but benign content using stop-chunk removal, described in Section 2.4.4.2.

Given a threshold, all chunks c in the corpus C whose number of duplicates exceeds the threshold and that are not "stop chunks" are automatically labeled and added to the labeled dataset L:

    L := {c ∈ C : duplicates(c) > threshold, c ∉ {stop chunks}}

We next look at properties of the discovery process.

An important property of discovery is that it is not distributive—analysis must consider the entire corpus. Parts of discovery are inherently parallelizable and allow for distributed processing by dividing the corpus among various workers; we use MapReduce to parallelize the work (Section 2.5). However, to maximize detection, the final data join and thresholding must consider the full corpus. Given an example threshold of 1000, consider a corpus C = C_1 ∪ C_2 and an object j = j_1 = j_2 such that duplicates(j) = duplicates(j_1) + duplicates(j_2): j_1 ∈ C_1 with duplicates(j_1) = 1000, and j_2 ∈ C_2 with duplicates(j_2) = 100. Object j only exceeds the threshold in the complete corpus (with duplicates(j) = 1100), not with consideration of only j_1 or j_2.

Discovery runtime is O(n log n), and performance on a moderate-size Hadoop cluster is reasonable (hours). We look at the runtime performance to understand which part of discovery dominates the computation time and, if possible, to identify areas for improvement. After we hash all the desired objects (O(n)), we sort and count all hashes (O(n log n)), and cull objects (O(n)) whose number of duplicates does not exceed the threshold. Discovery's performance is dominated by sorting, leading to an overall performance of O(n log n).

2.4.3 Detection

In the detection phase, we find our targets T at varying levels of granularity in the corpus C by looking for matches with our labeled dataset L.

In simple object matching, our targets T are an exact match of a chunk c or file f in L. Given L, we find all chunks or files in C whose hash is in L_o and add them to the set of targets T. We can then analyze T to understand whether objects in L are being duplicated in C and how often they are duplicated. While those statistics are relevant, we expect that duplication happens often and would like to further understand the details of where duplication happens.

Algorithmic performance of detection is O(mn log n), where m is the size of the labeled dataset L and n the size of the corpus C.
Since |L| ≪ |C| (the corpus is large, with millions or billions of pages, only a fraction of which are labeled as candidates of copying), performance is dominated by the O(n log n) cost of sorting. With optimized sorting algorithms (such as those in Hadoop), our approach scales to handle web-sized corpora.

We also consider partial file matching. Rather than look at whole objects, we can detect target files that partially duplicate content from elsewhere based on the number of bad chunks they contain. Partial matches are files that belong in T_p because they contain part of L. Containment examines the chunk hash vector H(c_f) of each file to see what fraction of its chunks are in L.

Finally, we use bad neighborhood detection to look beyond identification of individual files. Examination of "related" files allows detection of regions where large numbers of related files each contain a duplicated copy. For example, finding a copy of many Wikipedia pages might lead to a link farm that utilized Wikipedia to boost its credibility or search engine ranking.

We define a neighborhood based on the hierarchical relationship of files in the corpus. A neighborhood N is defined by a URL prefix p; it consists of all files f ∈ C where p(f) = p(N). Many sites have shallow hierarchies, so in the worst case each site is a neighborhood. For example, while people might easily create domains and spread content across them, the duplicated content would be detected as matches and reveal a cluster of neighborhoods (or sites) containing duplicated content. However, for complex sites with rich user content (e.g., GeoCities), individuals may create distinct neighborhoods. Each site will have neighborhoods at each level of the hierarchy. For example, in arxiv.org/archive/physics/, we would consider three neighborhoods: arxiv.org/archive/physics/, arxiv.org/archive/, and arxiv.org/.

We assess the quality of a neighborhood by applying partial matching to all chunks in the neighborhood N using contains(L_c, N) from Step 5, and add N to the set of targets T if the result is greater than a threshold. Like the chunk hash vector for files, the neighborhood chunk hash vector will have duplicated components when there are multiple copies of the same chunk in the neighborhood. Because neighborhood analysis is done over a larger sample, when we find regions that exceed our detection threshold, they are less likely to represent outliers and instead show a set of files with suspicious content.

We next look at properties of the detection process. Unlike discovery, the detection process is parallelizable when processing distinct neighborhoods N (neighborhoods that do not share the same URL prefix). This property allows us to process many neighborhoods simultaneously without affecting whether a particular neighborhood is detected as "bad" or not. Given C_1 and C_2, we assert that

    detected(L, C_1 ∪ C_2) = detected(L, C_1) ∪ detected(L, C_2).

This holds true because C_1 and C_2 share no neighborhoods: given some neighborhood N ∈ C_1, N ∉ C_2. As we showed earlier, runtime performance is O(n log n) because of the sort during the join. However, since neighborhoods are independent and numerous, we get "easy" parallelism: with p processors, we get runtime O(n log n)/p.

2.4.4 Cleaning the Data

We do two types of cleaning over the data: first we identify recursion errors that result in false duplication from the crawling process, and then we eliminate common, benign features with stop-chunk removal and whitespace normalization. We evaluate the effectiveness of these methods in Section 2.6.1.
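To make the detection quantities of Steps 4–6 concrete before turning to the cleaning steps, the sketch below computes contains() for one file, enumerates the URL-prefix neighborhoods that contain a file, and averages contains() into badness() over a neighborhood. This is a minimal in-memory sketch with names of our choosing, not our MapReduce implementation:

    from urllib.parse import urlparse

    def contains(labeled_chunks, chunk_hash_vector):
        """contains(L_c, f) = |L_c ∩ H(c_f)| / |H(c_f)| (Step 5)."""
        if not chunk_hash_vector:
            return 0.0
        hits = sum(1 for h in chunk_hash_vector if h in labeled_chunks)
        return hits / len(chunk_hash_vector)

    def neighborhoods(url):
        """Yield each URL-prefix neighborhood of a file; for example,
        http://arxiv.org/archive/physics/x.html yields arxiv.org/,
        arxiv.org/archive/, and arxiv.org/archive/physics/."""
        parsed = urlparse(url)
        prefix = parsed.netloc + '/'
        yield prefix
        for component in [p for p in parsed.path.split('/') if p][:-1]:
            prefix += component + '/'
            yield prefix

    def badness(labeled_chunks, neighborhood_vectors):
        """badness(N): the mean of contains(L_c, n) over files n in N (Step 6)."""
        if not neighborhood_vectors:
            return 0.0
        return (sum(contains(labeled_chunks, v) for v in neighborhood_vectors)
                / len(neighborhood_vectors))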
2.4.4.1 Detecting and Handling Recursion Errors

Crawling the real-world web is a perilous process, with malformed HTML, crawler traps, and other well-understood problems [17, 60]. We detect and remove crawler artifacts that appear in both Common Crawl and GeoCities. Our main concern is recursion errors, where a loop in the web graph duplicates files under multiple URLs—such results will skew our detection of copied data. We see this problem in both datasets; we use heuristics involving how often a URL path component is repeated, and remove a URL from processing if it is determined to be a recursion error. We evaluate these heuristics in Section 2.6.2, finding that they have a very low false positive rate in detecting crawler problems and are sufficient to avoid false positives in our duplication detection. Future work may refine these heuristics to reduce the number of false negatives in recursion-error removal.

2.4.4.2 Stop Chunk Removal

We see many common idioms in both files and chunks. We call these stop chunks, analogous to stop words in natural language processing (such as "a", "and", and "the"). For chunks, these include the empty paragraph (<p></p>) or a single-space paragraph (<p> </p>). For files, examples are the empty file, or a reject-all robots.txt file. These kinds of common, benign idioms risk skewing our results.

We remove stop chunks before applying bad neighborhood detection, and use both manual and automated methods of generating lists of stop chunks. We find that automated generation, while less accurate, is sufficient. We manually generated lists of stop chunks for duplicate detection in the web (Section 2.7). In Common Crawl, the list of 226 chunks is short enough to allow manual comparison; if the list becomes too large, we can apply Bloom filters [16] to support efficient stop-chunk removal. For automated generation, we apply the heuristic of treating all short chunks as stop chunks. In our evaluation of expert-identified content (Section 2.8), we discard chunks shorter than 100 characters. We also compare manual and automated generation: we previously labeled 226 (45 %) of the top 500 chunks in Common Crawl as benign by hand. By discarding chunks shorter than 100 characters, we automatically label 316 (63 %) as benign: 222 benign and 94 non-benign from our manually labeled list. The non-benign chunks that are automatically labeled aren't necessarily "bad"—typically they are basic layout building blocks used in web templates or editors. Thus we find the trade-off acceptable: manual generation is more accurate, but automatic generation is sufficient for applications using expert-identified content.

2.4.5 Chunking and Hashing

Chunking text and data into non-overlapping segments has been done in natural language processing (NLP) [2, 102], in disk deduplication optimization [100, 136], and in information retrieval [101]. We chunk all content in our corpora, breaking at the HTML delimiters <p> (paragraph) and <div> (generic block-level component), tags that are used to structure documents. We could also chunk on other tags for tables, frames, and inline frames, but find that our chosen delimiters are sufficiently effective.

Hash function: Unlike prior work with hashing from natural language processing, we use cryptographic hashing to summarize data. We employ the SHA-1 [89, 103] cryptographic hash function for its precision—identical input always produces the same output, and different input yields a different output.
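The following minimal sketch shows this chunk-and-hash step on one document. The regular-expression splitter is a simplification we introduce for illustration; real-world HTML needs more careful handling (Section 2.4.4):

    import hashlib
    import re

    def chunk(html):
        """Split a document at <p> and <div> delimiters (Section 2.4.5)."""
        parts = re.split(r'(?i)<(?:p|div)\b[^>]*>', html)
        return [p.strip() for p in parts if p.strip()]

    def chunk_hash_vector(html):
        """Return H(c_f), the chunk hash vector of one file f (Step 2)."""
        return [hashlib.sha1(c.encode('utf-8')).hexdigest() for c in chunk(html)]

    def file_hash(raw_bytes):
        """Return H(f), the whole-file hash (Step 1)."""
        return hashlib.sha1(raw_bytes).hexdigest()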
Cryptographic hashing is used in disk deduplication [100], but most prior work considering duplicate detection uses locality-sensitive [80, 30] and semantic [107] hashes. We use cryptographic hashing to eliminate the small false positive rate seen in other schemes: we show in Section 2.6.6 that even a tiny rate of false positives is magnified by large corpora.

2.5 Datasets and Implementation

2.5.1 Datasets and Relevance

This chapter uses three web datasets: Common Crawl, GeoCities, and our own phishing site corpus.

We use the Common Crawl crawl-002 dataset (C_cc) collected in 2009/2010 and publicly provided by the Common Crawl Foundation [42] to represent recent web data. crawl-002 is freely available on Amazon S3 and includes 2.86×10^9 items (26 TB compressed, 99 TB uncompressed). Most of its data is HTML or plain text, with some supporting textual material (CSS, JavaScript, etc.); it omits images.

As a second dataset, we use the GeoCities archive (C_g) crawled by the Archive Team [9] just before the GeoCities service was shuttered by Yahoo! in October 2009. The dataset was compiled between April–October 2009 and contains around 33×10^6 files (650 GB compressed), including documents, images, MIDI files, etc. in various languages. Although the content is quite old, having been generated well before its compilation in 2009, it provides a relatively complete snapshot of diverse, user-generated content.

We generate the third dataset of phish (C_p) by extracting the top-level webpages (HTML) from a stream of 2374 URLs of suspected phishing sites provided by PhishTank [95], a crowd-sourced anti-phishing service, over two days (2014-09-24 and 2014-09-25). (Our other datasets, Common Crawl and GeoCities, are not suitable for phish detection since phish lifetimes are often only hours to days, leaving very little time to crawl phish.) From the collected URL stream, we automatically crawl the suspect URLs and manually classify each as phish or otherwise. We ignore phishing sites that were removed by the time we crawled, discarding about 20 % of the stream.

Dataset Relevance: As the web evolves quickly, its content and structure also evolve. Most GeoCities content dates from the late 1990s to the early 2000s, Common Crawl is from 2009–2010, and our phishing dataset is from 2014. Does evaluation of our techniques over these older datasets apply to the current web? We strongly believe it does, for two reasons: our datasets fulfill the requirements for content classification, and we show that our approach easily adapts to today's and tomorrow's web.

First, the key requirement for classification of content in a corpus is that the corpus be diverse and large enough to approach real-world size and diversity. Both GeoCities and Common Crawl satisfy the diversity requirement. While GeoCities is perhaps small relative to the current web, we believe it is large enough to provide diversity. Our phishing dataset is intentionally small because it addresses a more focused problem; it shows considerable diversity in that domain.

Second and more importantly, the web will always be different tomorrow, and will continue to change over the next ten years. The increasingly dynamic and personalized nature of the web will modify the edges (like recommended links) but leave the core content unchanged and still detectable with hashing. We show that our approach works well over many years of web pages with only modest changes (for example, adding the use of <div> in addition to <p> to identify chunks).
Our largest change was to shift from static web content to crawling a browser-parsed DOM in our phishing study (Section 2.8.2)—while conceptually straightforward, its implementation is quite different. This change allows us to accommodate dynamically generated, JavaScript-only web content. We believe that this range of ages in our datasets strongly suggests that our approach (perhaps with similar modest changes) will generalize to future web practices, whatever they may be.

2.5.2 Implementation

We implement our methods and post-processing on cloud computing services (Amazon EC2) and a local cluster of 55 commodity PCs running Apache Hadoop [7]. Processing was done with custom MapReduce programs [33], Apache Pig [8], and GNU Parallel [122]. Our current cluster can ingest data at a rate of around 9 TB/hour.

We initially hash files and chunks in Common Crawl (C_cc) on EC2 in 11.5×10^3 compute hours (18 real hours, $650), producing 2.86×10^9 file hashes and 40.5×10^9 chunk hashes along with backreferences to the original dataset (1.8 TB of metadata). We use our local cluster to hash GeoCities (C_g) in 413 compute hours (1.5 real hours), producing 33×10^6 file hashes and 184×10^6 chunk hashes along with backreferences (4.7 GB of metadata).

Overall performance is good as the corpus grows to the size of a large sample of the web. Although the theoretical bound on processing is the O(n log n) sort of hashes, in practice performance is dominated by scanning the data, an O(n) process that parallelizes well with MapReduce. We see linear runtime over uncompressed dataset sizes ranging from 5 GB to 5 TB. We expect processing 225 TB of uncompressed data (Common Crawl, Feb. 2019, CC-MAIN-2019-09) on the same EC2 setup used earlier to take 41 real hours and cost $1300.

2.5.3 Reproducibility

Our research is reproducible, and we make our code and data freely available as open source. For data generated by others, we provide pointers to their public sources. Source code and data used for validation, discovering and detecting duplication of web content, and detecting clones of Wikipedia are available at https://ant.isi.edu/mega. Source code for our AuntieTuna anti-phishing plugin and data used in our anti-phishing application are available at https://ant.isi.edu/software/antiphish. Instructions for reproducibility are included in their respective repositories.

2.6 Validation

We next validate our design choices, showing the importance of cleaning and the correctness of our methodology.

2.6.1 Do Our Cleaning Methods Work?

Initial analysis of our raw data is skewed by crawler errors, and identification of bad neighborhoods can be obscured by common benign content. We next show that our cleaning methods from Section 2.4.4 are effective.

We have reviewed our data and taken steps to confirm that recursion errors do not skew discovery of duplicates. While only 1 % of all 913×10^6 neighborhoods in Common Crawl are the result of recursion errors, removing the obvious errors is helpful although not essential. We discuss details of removing recursion errors in Section 2.6.2.

We next validate that our stop-chunk removal process (Section 2.4.4.2) is effective. To identify stop chunks, we manually examine the 500 most frequently occurring chunks in Common Crawl and identify 226 as benign. These chunks occur very frequently in the dataset as a whole, accounting for 35 % of all chunks that occur ≥10^5 times.
To verify that we do not need to consider additional frequent chunks, we also examine the next 200 and identify only 43 as benign, showing diminishing returns (these 43 account for only 1 % of all chunks that occur ≥10^5 times). We therefore stop with the benign list of 226 chunks found in the top 500 most frequent, as it is sufficient to avoid false positives due to benign data.

To demonstrate the importance of stop-chunk removal, we compare bad neighborhood detection with and without it. Stop chunks dilute some pages and skew the quality of the badness ratio; if we do not remove stop chunks in Common Crawl (913×10^6 neighborhoods), we would detect 1.88×10^6 (2.35 %) more "bad" neighborhoods than the 79.9×10^6 bad neighborhoods we find after stop-chunk removal. These additional 1.88×10^6 "bad" neighborhoods are false positives, mainly consisting of detected stop chunks, which would dilute the results above the detection threshold and reduce detection's precision: (true positives)/(true positives + false positives).

2.6.2 Removing Recursion Errors

To validate our cleaning of recursion errors, we examine the distributions of the prefix lengths of all neighborhoods in the Common Crawl and GeoCities datasets.

In Common Crawl, we observe a noticeable gap in the distribution shown in Figure 2.1 when the prefix length is 97. The graph to the right of the red dotted line (x = 97) represents neighborhoods due to recursion errors: all links with prefix lengths of 97 or longer (0.005 %) need cleaning. We confirm this with a random sample of 50 pages of prefix length 97 or longer. We therefore set a threshold of 96, removing neighborhoods with prefix lengths ≥97, and retain almost all of the original data. We know that some of this data still contains errors: a sample of 100 neighborhoods of prefix length ≥20 (0.46 % of total data) shows that about two-thirds are bad, but we limit additional pruning to avoid removing valid data. As future work, we plan to explore better methods to remove crawling errors. While some recursion errors remain, our cleaning leaves most pages unaffected by recursion errors: a random sample of 100 neighborhoods of all prefix lengths suggests that only 1 % of all neighborhoods are recursion errors.

[Figure 2.1: Prefix lengths of neighborhoods in Common Crawl (C_cc).]

We find similar problems in GeoCities, and resolve them with a similar cleaning process. For example, we generally see that neighborhoods resulting from crawling errors contain repetitive path name components in the URL, like "clown" in geocities.com/SoHo/Workshop/9176/clown/clown/clown/Clown/clown/show.htm. We also see more complex, multi-hop loops. We use several heuristics to reduce problems with recursion in GeoCities: we ignore a URL if one path name component is repeated ≥5 times, or if any 4 components are repeated ≥3 times.

[Figure 2.2: Cumulative distribution of neighborhood prefix lengths in GeoCities (C_g).]

Cleaning the GeoCities dataset using our heuristics is effective in removing recursion errors. Figure 2.2 shows the cumulative distribution of neighborhood prefix lengths before cleaning (dashed green line) and after applying heuristics (dashed blue) and additional manual cleaning (solid blue).
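A sketch of these path-repetition heuristics follows; the function name and structure are ours, made concrete from the rules just stated, and our production pipeline may differ in details:

    from collections import Counter
    from urllib.parse import urlparse

    def is_recursion_error(url):
        """Flag a URL as a likely crawler recursion error: one path
        component repeated >= 5 times, or any 4 components each repeated
        >= 3 times (the GeoCities heuristics described above)."""
        components = [c for c in urlparse(url).path.split('/') if c]
        counts = Counter(components)
        if any(n >= 5 for n in counts.values()):
            return True
        return sum(1 for n in counts.values() if n >= 3) >= 4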
We see a total reduction of 26.1 % of the neighborhoods in the original dataset, from 12.9×10^5 to 8.1×10^5, and similar sampling (as done with Common Crawl) of the removed neighborhoods shows that all are the result of recursion errors. While new data crawls will need to be reviewed for collection errors, resolution of these problems is straightforward, if time-consuming. As crawlers mature over time, this additional work will not be required.

2.6.3 Can We Discover Known Files and Chunks?

We next turn to the correctness of our approach. We begin by validating and verifying that hashing can detect specific content in spite of the background "noise" of millions of web pages, with the following experiment.

[Figure 2.3: File-level discovery of injected duplicates (black triangles) in C_cc, compared to file frequency (grey dots). j: JavaScript, r: robots.txt, E: empty file.]

Duplicated full files: We first consider a spammer that duplicates a file many times to provide content for thousands of parked domains. To emulate this scenario, we take a known website (blog.archive.org as of 2013-08-22) containing roughly 4000 pages or files and duplicate the entire site from d = 100 to 5000 times. For each duplication we generate a unique, emulated website, process that data with Steps 0–2 of our methodology, and merge this with our full processed data. We then build our labeled dataset via blind discovery.

Our blind discovery process populates the labeled dataset with the most frequently occurring content. In Common Crawl (C_cc), our blind discovery threshold is 10^3 (the threshold is set as described in Section 2.4.2): all files that have more than 10^3 duplicates are labeled.

Figure 2.3 shows the results of this experiment over C_cc file frequency. This frequency-occurrence graph shows the number of occurrences (y-axis) that a file object has, given the number of times it has been duplicated (x-axis). Our discovery threshold is marked by a red dotted line at x = 10^3; all the content (indicated by points) past the threshold is added to the labeled dataset. Duplicating the entire blog moves it from unique, unduplicated content (a grey dot in the top left) to an outlying point with 4000 pages occurring 5000 times (indicated by a labeled black triangle at (x, y) = (5000, 4000)). We see that the point passes our threshold and we have discovered our injected and massively duplicated blog. This change from the top-left to an outlier above and further right on the graph represents what happens when spammers duplicate parts of the web.

Spammers may duplicate files a smaller number of times. To consider this scenario, we change the number of duplications d to values less than in our previous example. The blue circles represent the injected site had the entire site (4000 files) been duplicated different numbers of times (at d = 100, 250, and 1000). When the injected site has been duplicated ≤10^3 times (the three blue circles on and to the left of the red threshold line), that site and its corresponding files will not be automatically discovered; all points right of the red threshold line (the black triangle at x = 5000) will.
Note that even with fewer duplications (blue circles left of the red threshold), the injected files will be visibly obvious outliers on the graph and may be detected with manual analysis or more sensitive automation (using additional analysis of the corpus, or an iterative search to determine an optimal threshold as defined in Section 2.4.2).

Partially duplicated pages: The above experiment shows our ability to track duplicated files, but spammers almost always add to the duplicated content to place their own links or advertisements. We therefore repeat our study of duplicating files by duplicating an entire website d = 5000 times, but we add a different paragraph to the beginning and end of each duplicated page to represent unique advertisements attached to each page. Since each page varies here, file-level analysis will detect nothing unusual, but chunk-level analysis will show outliers. The size distribution of pages is skewed and appears heavy-tailed (mean: 48 chunks, median: 23, max: 428). Our discovery threshold is increased from 10^3 to 10^5, because the number of chunks in C_cc is much larger than the number of pages.

[Figure 2.4: Chunk-level discovery of injected duplicates (black triangles) in C_cc, compared to chunk distribution (grey dots).]

Figure 2.4 shows a chunk-level evaluation of this scenario, with each dot representing a particular chunk. The red dotted line at x = 10^5 marks our discovery threshold: all 6000 chunks to the right of this line are discovered, added to the labeled dataset, and further analyzed. We now see more evidence of duplicated chunked content, shown by a cluster of black triangles (as opposed to a single outlying point) corresponding to the 1.2×10^9 chunks that make up the duplicated content of blog.archive.org (originally 242×10^3 chunks). The light grey dots correspond to all the existing chunks in C_cc.

We see that many of the chunks that make up the pages of blog.archive.org pass our defined threshold, and we discover 78 % of the total distinct chunks. As in the previous experiment, we can "control" where the points are distributed by varying the number of times we duplicate the site. If all the chunks in the site had fallen below the threshold, we would not have automatically discovered the site via our blind discovery process.

[Figure 2.5: Percentage of chunks discovered in blog.archive.org given the number of times it is duplicated.]

Hashing a finer-grained object in our discovery process allows us to discover more content that has been duplicated. File-level discovery returns a binary result: either we discover the file or we do not. Chunk-level discovery allows us to discover varying percentages of content depending on how many times it was duplicated. Figure 2.5 shows how many chunks from blog.archive.org are discovered given the number of times all chunks have been duplicated. When we duplicate the website d = 5000 times (the black triangles in Figure 2.4 and the point marked by the red dotted line in Figure 2.5), we discover 78 % of the chunks.
(Trivially, we discover 100 % of the chunks when we duplicate the site ≥10^5 times.) Our simple threshold detects some but not all of the duplicated chunks that were injected. The duplicated content (black triangles) in Figure 2.4 consists of clear outliers from most of the traditional content (grey dots), suggesting a role for manual examination. This experiment shows that chunk-level analysis is effective even though only portions of pages change. We next look at the effects of content mutation more systematically.

2.6.4 Can We Detect Specific Bad Pages?

Having shown that we can discover known files and chunks, we next validate our detection mechanism by finding known targets T and understanding the conditions under which our mechanism fails. Given a labeled dataset curated by an expert (L_expert) and one built via blind discovery (L_blind), can we detect bad pages? Furthermore, as we increasingly mutate each page, at what point can we no longer detect it?

To evaluate our bad page detection mechanism, we continue our prior example where we rip and duplicate blog.archive.org; this set of pages becomes our injected corpus C_i. We mutate C_i in a consistent manner that can be applied to all pages in C_i to get a resulting C_i'. We can categorize each mutation as one of the following:

+ Add additional content, such as ads or link spam
~ Modify existing content by rewriting links
− Remove content, such as headers, copyright notices, footers, or the main body of the page

We build both L_expert and L_blind from C_i (as described in Section 2.6.3), then run the detection process to see if pages in C_i' are detected. We continue mutating C_i' (e.g., C_i'', ..., C_i^(n)) to understand the kinds and amount of mutation that the detection process can handle. While we utilize a copy of blog.archive.org to build L and C_i, our results for each mutation experiment are consistent with other L because we mutate each of the 4626 pages. For each experiment, we have the base site C_i and apply n independent mutations to each page, resulting in C_i^(n).

In our first mutation experiment, we continuously add content to a page such that the page is diluted with non-target content and we do not detect it (because the badness ratio does not reach a particular threshold). Figure 2.6 shows the performance with both L_expert (green) and L_blind (blue). The bottom x-axis details the number of chunks added per page relative to the average number of chunks per page in C_i (cpp = 48). The y-axis shows the average badness ratio per page (averaged over all 4626 pages in C_i). The badness threshold is labeled on each graph at 0.144 (we describe its computation in a later section). We perform 10 runs over C_i at each x value and take the average. We omit error bars when the standard error is <0.01 for clarity (Figure 2.6 in particular has no error bars).

[Figure 2.6: Effects of continuously adding chunks on pages.]

This experiment shows that we can tolerate an additional 3.4× (using L_blind) or 4.5× (using L_expert) the mean number of chunks per page (cpp) in each labeled dataset and still detect duplicated content.
These tolerances are represented visually in Figure 2.6, where the blue (L_blind) or green (L_expert) dotted lines meet the red dotted line (the badness threshold): points to the left of the blue or green lines on the x-axis have an average badness above the detection threshold on the y-axis. This behavior is not surprising: if we were to dilute the content with many other unrelated chunks, the average badness would asymptotically approach 0.

[Figure 2.7: Effects of continuously deleting chunks on pages.]

We next continuously delete content at random; deleting content will increase the badness ratio, but pages may be overlooked because the total number of chunks on the page becomes smaller. Users of hashing might require a minimum number of chunks per page before applying the badness ratio. Figure 2.7 shows the average badness of a page given the number of chunks we delete per page. Using L_expert (green), we see that the ratio is always 1.0: deleting chunks does not affect the badness because the entire page is bad regardless. Using L_blind (blue), we initially see an increase in the average badness of a page, which then stabilizes up to a certain point as we increase the number of deleted chunks per page. Pages that have a small number of total chunks have, on average, a lower badness ratio until the page is eventually removed from the population; this results in a higher average badness, as pages with a higher number of total chunks survive deletion. In this experiment, our detection mechanism on average handles all ≈400 deletions per page.

Similarly, we see a large variance in badness at the tail of the graph because the population of pages in C_i^(n) (after mutation) decreases. As we increase the number of deleted chunks per page, the average number of chunks per page (orange) falls. Pages also cease to exist after all their chunks have been deleted; we see in Figure 2.7 that the average number of chunks per page increases as the population of pages decreases. This behavior is expected: as a trivial example, consider a page with only two chunks, only one of which is in L: the badness of the page is 0.5. If we delete the bad chunk, the badness falls to 0, but if we delete the other, the badness increases to 1. Thus, depending on the chunks we delete, the badness of a page will fluctuate.

[Figure 2.8: Effects of continuously changing chunks on pages.]

In our final experiment, we continuously modify content to the point where we can no longer detect it (e.g., if every chunk is modified at least once, our detection algorithm will fail). We consider a stream of mutations: we randomly pick a chunk to modify and change one random character in that chunk, with replacement (in successive mutations, the same chunk can be modified again).
We consider a stream of mutations: we randomly pick a chunk to modify and change one random character in that chunk, with 44 0 1 2 3 4 5 6 7 8 Relative Number of Random Changes per Page ( c p p ) 0 1000 2000 3000 4000 Number of Pages Detected as Bad 0 50 100 150 200 250 300 350 Number of Random Changes per Page (absolute) Data Mutation: mutating chunks, blog.archive.org 0.0 0.2 0.4 0.6 0.8 1.0 Fraction of Pages Detected as Bad expert L (green) blind L (blue) Figure 2.9: Number of pages detected as bad after continuously changing chunks on all pages in blog.archive.org. replacement (in successive mutations, the same chunk can be modied again). Figure 2.8 shows the average badness of a page given the number of random changes with replacement. We see an exponential drop in the average badness of the page as we linearly increase the number of random changes (with replacement) per page. On average, our bad page detection mechanism handles 1:8cpp (L blind ) and 2:0cpp (L expert ) changes before the page falls below the threshold. To show that we can tolerate 3:4cpp mutations, we look at the performance of our bad page detection mechanism. Figure 2.9 shows how many pages we detect as bad given the number of random changes per page inC i . In the perfect case (such as usingL expert on an unmodied site), we detect all 4626 pages in C i as bad. While theL expert performs much better initially (detecting between 300–700 more pages than withL blind ), we see both lines eventually converge. We can detect known bad pages to a certain degree of mutation. Our validation experiments show that we can handle between 1:8–4:5 cpp mutations onC i depending on the type of mutation and the labeled 45 dataset we utilize. While utilizing theL expert slightly increases the number of mutations we can tolerate (compared to using theL blind ), theL expert contains over 4.8 the number of entries (jL expert j = 2110 3 , jL blind j = 4:410 3 ). We next transition into the validation of detecting known bad neighborhoods. 2.6.5 CanWeDetectKnownBadNeighborhoods? Given our success nding bad pages, we next validate the robustness of detecting known bad neighbor- hoods. Recall that a neighborhood contains a set of pages that share a common URL prex. As with pages, we evaluate both expert and blind labeled datasets, and change a known target to evaluate the sensitivity of our detection mechanism. We evaluate our detection mechanism by designing a mutation experiment with an example neighbor- hoodN. The goal of our experiment is to understand the degree of change before our detection process fails. We continue to use the same neighborhoodN (blog.archive.org) and the same approach as in the previous section (Section 2.6.4) with the following change: mutate all pages inN in a consistent manner to get a resultingN 0 : n mutations results inN 0(n) . We then run the bad neighborhood detection process to see ifN 0(n) is detected. We see similar results in the performance of bad neighborhood detection compared to bad page de- tection. Figures 2.10, 2.11, and 2.12 show the bad neighborhood detection performance using bothL expert (green) andL blind (blue) for add, delete, and modify operations, respectively. We compare the relative number of mutated chunks per page inN (cpp) against the resulting badness ratio of the neighborhood after mutation (N 0(n) ). We use a xed badness threshold as described in Section 2.7.3. We again take the average of 10 runs overN at eachx value and omit error bars when standard error is<0.01. 
Our experiments show that we can tolerate between 4.4–5.4× cpp mutations, and that bad neighborhood detection is much more robust than bad page detection—on average our process can handle 2.7–3.0× more modifications per page than bad page detection. Analysis at the neighborhood level is much more robust because we consider the badness across a collection of pages and have a larger population of content to work with; considering only a single page when calculating badness is much more susceptible to fluctuation, and not as robust to mutation, because of its smaller magnitude.

[Figure 2.10: Effects of continuously adding chunks in a neighborhood.]
[Figure 2.11: Effects of continuously deleting chunks in a neighborhood.]
[Figure 2.12: Effects of continuously changing chunks in a neighborhood.]

We have now validated the mechanisms that we will use in two applications: content reuse detection over web content using the blind process, and detection of expert-identified content in the web.

2.6.6 Cryptographic vs. Locality-Sensitive Hashes

Our work uses cryptographic hash functions to minimize the impact of false positives that result from locality-sensitive and semantic hashing. To quantify this trade-off, we next compare bad page detection with SHA-1 (our approach) to the use of Nilsimsa [30], a locality-sensitive hashing algorithm focused on anti-spam detection. We use a corpus C_p of 2374 suspected phish (as described in Section 2.5) and build a labeled dataset L from current and recent PayPal U.S., U.K., and France home pages (Sep. 2014, plus Jan. 2012 to Aug. 2013 from archive.org).

We process the datasets, chunking on <p> and <div> tags, computing hashes of each chunk in C_p and L with both SHA-1 and Nilsimsa. We then use L to detect PayPal phish in C_p. For detection with Nilsimsa, we use a matching threshold of 115 (0 being the fuzziest and 128 an exact match), a relatively conservative value.

    True Nature of Page   Classified As                  Crypto   LSH
    PayPal Phish          PayPal Phish (TP)                  43     43
                          Missed PayPal (FN)                 42     42
    Non-PayPal            Misclassed PayPal Phish (FP)        0     10
                          Non-PayPal (TN)                  1803   1793
    Total                                                  1888   1888

Table 2.1: Performance of detection on a phish corpus using cryptographic and locality-sensitive hashing.

Table 2.1 compares the confusion matrices when using SHA-1 (Crypto) and Nilsimsa (LSH) independently in detection. Both algorithms detect (TP = 43) and miss (FN = 42) the same number of PayPal phish. However, Nilsimsa has false positives, misclassifying 10 pages as PayPal phish, while SHA-1 misclassifies none.
Even very low, non-zero false-positive rates (0 < FPR < 1 %) are bad when used against web-size corpora, since false positives in a large corpus will overwhelm true positives. For Common Crawl with 2.86×10^9 files, Nilsimsa's very low 0.55 % FPR at threshold 115 could result in 15.7×10^6 false positives (an upper bound)!

LSH's design for approximate matching makes some false positives inevitable. A challenge with any LSH is finding the "right" threshold for each particular dataset to minimize the FPR. The number of false positives can differ greatly with small variations in threshold. We exhaustively studied the parameter space for Nilsimsa and our phishing dataset. A threshold of 128 forces exact matching, causing no false positives, but also makes the algorithm equivalent to cryptographic hashing. Our initial parameter choice was 115; all thresholds from 120 down to 115 give a very low but non-zero false-positive rate, from 0.33 % to 0.55 %. At a threshold of 114, the false-positive rate doubles (1.11 %), and as we continue to decrease the threshold, the FPR grows rapidly. Matching with thresholds from 128 to 120 is like exact matching with no false positives, in which case our analysis is needed to evaluate its performance. Although we find some thresholds with no false positives, in general, exhaustive search of the parameter space is not possible, and no one value will be "correct" across varying inputs.

The problem of false positives overwhelming rare targets is known as the base rate fallacy, and it is a recognized barrier to the use of imprecise detection in security problems [15, 111]. This problem motivates our use of cryptographic hashing.

2.7 Analysis of Blind Discovery of Web Copying

We next study the application of blind discovery to duplication of web content, and use this application to understand our approach.

2.7.1 Why is File-level Discovery Inadequate?

We first consider file-level discovery on both datasets. File-level comparisons are overly sensitive to small mutations; we use them to establish a baseline against which to evaluate chunk-level comparisons.

Figure 2.3 (grey dots) shows the long-tail distribution of the frequency of file-level hashes in Common Crawl (2.86×10^9 files, C_cc). We look at both the top 50 most-occurring files and a sample of 40 random files that have more than 10^3 occurrences, and find only benign content (e.g., JavaScript libraries, robots.txt). We see the same distribution of files in GeoCities (33×10^6 files, C_g), shown in Figure 2.13, where common files include the GeoCities logo, colored bullets, and similar benign elements.

[Figure 2.13: File-level discovery on C_g.]

    Description        |c|   Type
    Common (benign)     68   Benign
    Templates           17   Benign
    e-Commerce           8   Benign
    Other                9   Benign
    Misc.               15   Benign
    Total              100

Table 2.2: Categories of the top 100 distinct chunks in C_cc.

2.7.2 How Does Chunking Affect Discovery?

We expect the greater precision of chunk-level analysis to be more effective. We next consider chunking (Section 2.4.1) of textual files (HTML, plaintext, JavaScript) by paragraphs (i.e., the literal <p> tag).

Figure 2.4 shows the frequency-occurrence distribution of the 40.5×10^9 chunks in Common Crawl (C_cc). Again, we see a heavy-tailed distribution: 40 % of chunks are unique, but ≈3.7×10^9 distinct chunks appear more than 10^5 times. The most common chunk is the empty paragraph (<p>).
Chunking's precision reveals several different kinds of duplication: affiliate links, JavaScript ads, analytics, and scripts, with benign content dominating the list. Table 2.2 classifies the 100 most frequent chunks. After common web idioms (the empty paragraph, etc.), we see templates from software tools or web pages begin to appear.

Again, we turn to a random sample of the tail of the graph to understand what makes up duplicated content. We draw a sample of 100 chunks from those with more than 10⁵ occurrences and classify them in Table 2.3.

Description                 |c|   Type
Misc.                         4   Benign
JavaScript                    2   -
  escaped                     1   Benign
  other                       1   Benign
Templates                    83   -
  navigation                 17   Benign
  forms                      32   Benign
  social                      4   Benign
  other                      30   Benign
Commercial                    6   -
  spam                        1   Malicious
  JavaScript advertising      3   Ambiguous
  JavaScript tracking         2   Ambiguous
Possibly Commercial           5   Ambiguous
Total                       100
Table 2.3: Classification of a sample of 100 distinct chunks with more than 10⁵ occurrences in C_cc.

This sample begins to show common web components that support monetization of websites. JavaScript occurs in some chunks (7 %) and is used for advertising via Google AdSense (3 %), user tracking, and analytics (2 %). We sampled one instance of spam where an article from The Times (London) was copied and an advertising snippet for travel insurance was inserted into the article. Other snippets were potentially spam-like or linked to a scam (5 %), but were ambiguous enough to qualify as non-malicious (if poorly designed) attempts at legitimate monetization.

We also find instances of potentially malicious escaped JavaScript: decoding it reveals an email address (obfuscated via JavaScript to throw off spammers). Most content we discovered consists of elements of sites that make heavy use of templates (83 %), such as navigation elements, headers, and footers. Given an L_o of the most frequently occurring content, this is not surprising: thousands of pages containing such template elements would naturally show up at the tail of the distribution.

We confirm our results over a second dataset with chunk-level discovery on C_g (GeoCities) in Figure 2.14. We see a similar distribution overall, and find similar templates and JavaScript as in C_cc.

Figure 2.14: Chunk-level discovery on C_g (97×10⁶ chunks, after heuristic and stop-chunk removal). (Log-log plot of occurrences vs. duplicates: many unique items and few highly duplicated items; the most duplicated chunks include an escaped eval(unescape('\%70\%61\%72\%65\%6E\%74...')) snippet, Google AdSense code, and the filtered <p> chunk.)

We discovered and examined the kinds of content duplicated in C_cc. Chunking identifies frequent duplication, but not bad behavior. However, we can now use the results to build a labeled dataset of objects L_o. We next utilize L_o in our detection mechanism to identify and detect areas where copying runs rampant.

Figure 2.15: Frequency of badness of neighborhoods in C_cc, as a histogram (bars) and CDF (lines). (The badness threshold marks 8.88 % of prefixes as bad.)

2.7.3 Are There Bad Neighborhoods in the Real World?

Chunking is successful at identifying bad chunks and pages, but duplication for profit can draw on many related pages to maximize commercial potential. Detection at the individual page level can result in false positives, so we would prefer to detect groups of related pages that show a significant amount of copied content. We now shift our focus to detecting bad neighborhoods.
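Concretely, the neighborhood test aggregates page-level results before thresholding. The condensed sketch below flags neighborhoods whose bad-content ratio exceeds the mean plus one standard deviation across all neighborhoods, the threshold used in the analysis that follows; the URL-prefix grouping and names here are illustrative assumptions, and the precise definitions follow Section 2.4, step 6.

    # Condensed sketch of neighborhood badness detection; the prefix
    # heuristic and data structures are illustrative assumptions.
    from statistics import mean, stdev
    from urllib.parse import urlparse

    def neighborhood_of(url: str) -> str:
        # Approximate a neighborhood as site plus first path component.
        u = urlparse(url)
        first = u.path.strip("/").split("/")[0]
        return f"{u.netloc}/{first}"

    def badness(page_hash_sets, labeled: set) -> float:
        # Fraction of a neighborhood's chunks found in the labeled set L.
        total = sum(len(h) for h in page_hash_sets)
        bad = sum(len(h & labeled) for h in page_hash_sets)
        return bad / total if total else 0.0

    def bad_neighborhoods(pages, labeled: set):
        # pages: iterable of (url, set_of_chunk_hashes);
        # assumes at least two neighborhoods so stdev is defined.
        groups = {}
        for url, h in pages:
            groups.setdefault(neighborhood_of(url), []).append(h)
        scores = {n: badness(hs, labeled) for n, hs in groups.items()}
        vals = list(scores.values())
        cutoff = mean(vals) + stdev(vals)   # mu + sigma threshold
        return {n for n, s in scores.items() if s > cutoff}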
In Common Crawl: To look for bad neighborhoods, we utilize the top 2121 common distinct chunks from C_cc as our labeled dataset L_c (from Section 2.4.2), and identify bad neighborhoods in the full dataset using the algorithm in Section 2.4, step 6. C_cc contains 900×10⁶ neighborhoods. Our detection threshold uses the mean and standard deviation across all neighborhoods.

As one would hope, most neighborhoods N ∈ C_cc are not bad (91 %). Figure 2.15 shows a combined histogram and CDF of the bad content ratios of all neighborhoods. We observe that 79.8×10⁶ prefixes (9 %) out of 900×10⁶ would be classified as bad neighborhoods: neighborhoods with badness > 0.163 (since μ_N,cc = 0.04 and σ_N,cc = 0.123, and the threshold is μ_N,cc + σ_N,cc).

Description     |N|      %
Ads*              2    5.0
Blog*            19   47.5
Empty             1    2.5
Forms             1    2.5
Forum             1    2.5
"Suspect"         0    0.0
JavaScript        2    5.0
Templated/CMS    17   42.5
Total*           43  100.0
* indicates overlap between categories
Table 2.4: Classification of a sample of 40 bad neighborhoods from C_cc.

To understand the nature of the neighborhoods we identify as employing common content, we extract a sample of 40 neighborhoods from the 19.6×10⁶ that are above the threshold and classify them in Table 2.4. We find 82.5 % of the sampled sites to be benign: mostly blogs, forums, or newspapers that make heavy use of templates. Only 13 % of the content is clearly for profit: either spam, or search-engine optimization with ads.

Our results show that there is duplication on the web: our approach discovers it through a blind process and then detects the broad kinds of copying that exist. Our approach is best at finding content that uses templates or a uniform, repeated structure. Most content with this structure is benign, but we find a fraction of it is spam. The prevalence of templates in the sites we detect is a direct result of obtaining L via our blind process, since the definition of L is commonly reused content. This observation suggests that a labeled dataset more focused on malicious content (not just duplicated content) would improve the yield, as we explore in Section 2.8 with an expert-provided L.

Figure 2.16: Frequency of badness of neighborhoods in C_g, as a histogram (bars) and CDF (lines). (The badness threshold marks 4.6 % of prefixes as bad.)

Description    |N|   Type
Link farms      40   Profit
Templates       50   Benign
  Default       37   Benign
  Other         13   Benign
Misc.           10   Benign
Total          100
Table 2.5: Classification of a sample of 100 bad neighborhoods from C_g.

In GeoCities: Here we consider the top 2121 common distinct chunks from C_g (GeoCities) as our labeled dataset L_c, and identify bad neighborhoods out of the 807×10³ neighborhoods in C_g. Like C_cc, most neighborhoods N ∈ C_g are not bad (only 5 % are). Figure 2.16 shows a combined histogram and CDF of bad content ratios and indicates that most of our neighborhoods have low badness ratios. Roughly 37×10³ neighborhoods out of 807×10³ have a badness ratio > 0.448 (μ_N,g = 0.152, σ_N,g = 0.296).

We classify a sample of 100 randomly chosen bad neighborhoods in GeoCities (Table 2.5) and find 40 % to be link farms, while the other 60 % are benign (including GeoCities-provided and other templates). These link farms likely tried to monetize by gaming search engine results with repeated usage of various keywords, self-referencing links, and advertisements. For example, some link farms focused on credit cards, automobile loans, and other financing tools, linking back to their own pages containing Google AdSense advertisements. Others used seemingly haphazard keywords like automobile manufacturers and music downloads (.mp3s). We believe the higher rate of copying in link farms reflects the greater susceptibility of search engines to duplicated content at this earlier time.

2.8 Applications With Expert-Identified Content

We next look at two systems that use our approach, with expert-identified content instead of blind discovery. Expert identification is useful when target locations are known but the locations of copies are unknown.

2.8.1 Detecting Clones of Wikipedia for Profit

We first explore finding copies of Wikipedia on the web. Although Wikipedia's license allows duplication [132], and we expect sharing across sibling sites (Wiktionary, Wikiquote, etc.), other copies are of little benefit to users and often serve mainly to generate advertisement revenue, support link farms, or dilute spam.

We next consider a copy of Wikipedia (from June 2008 [43], in English) as our labeled dataset L and use it to understand if Wikipedia is copied wholesale or just in parts. Wikipedia becomes an L_c of 75.0×10⁶ distinct chunks of length more than 100 characters (we treat shorter chunks as stop chunks, Section 2.4.4.2); we then search for this content in the Common Crawl corpus (C_cc, Nov. 2009 to Apr. 2010). Utilizing L_c, we identify bad neighborhoods in C_cc using the algorithm described in Section 2.4.

Description                   |N|    %   Type
Wikipedia Clones/Rips          31   78   -
  "Wikipedia Ring"             13        Profit
  Reference Sites               5        Profit
  Ads                          10        Profit
  Fork                          1        Ambiguous
  Unknown                       2        Ambiguous
Search Engine Optimization      3    8   -
  e-Commerce                    2        Profit
  Stock Pumping                 1        Profit
Wikipedia/Wikimedia             5   13   Benign
Site using MediaWiki            1    3   Benign
Total                          40  100
Table 2.6: Classification of the top 40 bad neighborhoods in C_cc, with L = Wikipedia.

The length of time between the crawl dates of L and C_cc may bias our detection's true positive rate in a particular direction. To understand Wikipedia's rate of change: during the 16–22 months between L and C_cc, Wikipedia added an additional 1.45×10⁶–1.86×10⁶ pages/month (an increase from 9.00×10⁶ to 14.6×10⁶ pages), encompassing 2.52–3.46 GB of edits/month [44]. Thus, if sites in C_cc copy from a more recent version of Wikipedia than L, we would expect that to bias our detection's true positive rate to be lower.

Our detection mechanism finds 136×10³ target neighborhoods (2 % of the 68.9×10⁶ neighborhoods in C_cc) of path length 1 that include content chunks of length > 100 from Wikipedia. To understand how and why more than 100×10³ sites copy parts of Wikipedia, we focus our analysis on neighborhoods that duplicate more than 1000 chunks from Wikipedia. We look at the 40 neighborhoods with the largest number of bad chunks and classify them in Table 2.6. We find 5 Wikimedia affiliates, including Wikipedia, Wikibooks, and Wikisource. More interestingly, we find 34 instances of duplicate content on third-party sites: 31 sites rip Wikipedia wholesale, and the remaining 3 utilize content from Wikipedia subtly for search-engine optimization (SEO).
Almost all of the 31 third-party sites significantly copying Wikipedia are doing so to promote commercial interests. One interesting example was a "Wikipedia Ring": a group of 13 site rips of Wikipedia, with external links to articles that lead to another site in the ring. In addition to the intra-ring links, each site had an advertisement placed on each page to generate revenue. Other clones are similar, sometimes with the addition of other content. Finally, we also observe Wikipedia content used to augment stock-pumping promotions or to draw visitors to online gambling.

Our study of Wikipedia suggests that our approach is very accurate, at least for bulk copies. All neighborhoods in our sample of the tail of the distribution were copies of Wikipedia, and only one site was a false positive (because it uses MediaWiki, an open-source wiki application). All others were true positives. We have shown that from a labeled dataset (Wikipedia), our approach detects dozens of copies across the entire web, and that most of the bulk copies are for monetization. We next shift from bulk copying of Wikipedia to targeted copying in phishing sites.

2.8.2 Detecting Phishing Sites

Phishing websites attempt to trick users into giving up information (passwords or banking details) with replicas of legitimate sites—these sites often duplicate content, making them detectable with our methods. We adapt our system to detect phishing websites with an expert-labeled dataset built from common targets (such as real banking pages), and we briefly describe our adaptation in AuntieTuna here—in Chapter 3, we will present a more detailed analysis of the system design and performance of AuntieTuna.

Figure 2.17: Implementation diagram of the AuntieTuna anti-phishing plugin. (Components: the Personalize Button (Browser Action), Page Watcher (Content Script), and Storage Manager (Core Extension); (1) discovery: the user marks known-good sites, (2) known-good content is chunked, hashed, and stored in the labeled dataset of target content (§2.4.2), and (3) detection: visited pages are hashed and checked for matches (§2.4.3), with access to suspected phish prevented.)

In our prototype browser extension, AuntieTuna (implementation diagram in Figure 2.17), users first identify pages they care about (manually or automatically, via Trust on First Use) to build a custom labeled dataset (Discovery, Section 2.4.2). AuntieTuna then checks each page the user visits for potential phish (Detection, Section 2.4.3). As an alternative system implementation, phish could be detected centrally by crawling URLs found in email or website spam, then testing each as potential phish with our method.

We evaluate our approach by examining PayPal phishing. We build a labeled dataset of PayPal homepages (L_pp, 2012–2014) and a corpus of known PayPal phish (C_p, Sep. 2014). Our mechanism detects 50 (58.8 %) of the 85 PayPal phishing sites in C_p (Table 3.1). Our precise approach prevents false positives (specificity is 100 %), although we see that about 40 % of phish copy too little content from the original for us to detect. This evaluation shows that our hash-based detection can be a part of an anti-phishing scheme, complementing other techniques.

2.9 Conclusions

In this chapter, we developed a method to discover previously unknown duplicated content and to precisely detect that or other content in a web-size corpus. We also showed how to exploit hierarchy in the corpus to identify bad neighborhoods, improving robustness to random document changes. We verified that our approach works with controlled experiments, then used it to explore duplication in a recent web crawl with both informed and uninformed discovery processes. Although most duplicated content is benign, we show that our approach does detect duplication as it exists in link farms, webpage spam, and phishing websites.
This chapter supports our thesis statement by demonstrating how we can improve network security by finding previously undetected bad neighborhoods on the web using our approach of informed (personalized) discovery and local detection. Our discovery and detection precisely and efficiently find content reuse using hash-based methods, which also enable our approach to scale to web-sized datasets on commodity hardware, like local compute clusters.

In the next chapter, we will further support the thesis statement with AuntieTuna, a web browser plugin that uses our local detection approach and a novel application of user personalization in order to protect end-users from phishing website attacks. We previously looked at how we can apply our detection technique to finding phishing sites (Section 2.8.2). In Chapter 3, we will look at how designing AuntieTuna with an emphasis on usability and user personalization enables us to reduce successful phishing attacks.

Chapter 3

AuntieTuna: Personalized Content-based Phishing Detection

In this chapter, we present AuntieTuna, an anti-phishing browser extension, and evaluate its performance and usability in detecting phishing sites. The content reuse detection algorithms from the previous chapter form the basis of AuntieTuna's approach to phishing site detection, and our experiences in developing and using AuntieTuna will inspire the next chapters on data sharing.

This study of AuntieTuna partially supports our thesis statement. AuntieTuna helps improve its users' network security by reducing successful phishing site attacks. AuntieTuna reduces successful phishing attacks by finding and preventing access to phishing sites using personalized and local detection. AuntieTuna first personalizes a user's defense by selecting target sites based on the user's behavior: the "known-good" content of these target sites will then be hashed and tracked by AuntieTuna. Then, by running entirely within the user's browser without external dependencies, AuntieTuna locally detects phish by comparing the content of unknown, visited sites with the known-good content of the original, legitimate target site using cryptographic hashing.

Part of this chapter was previously published in the Network and Distributed System Security Workshop on Usable Security [10].

3.1 Introduction

Phish are fake websites that masquerade as legitimate sites, with the goal of tricking unsuspecting visitors into sharing sensitive information: their credentials, passwords, financial or other personal information (recently surveyed [62]). In phishing, an adversary constructs a phishing site from target content drawn from a legitimate service used by the user. The phishing site fools the user (as a Trojan horse) into disclosing information that can then be exploited for identity theft, fraud, and compromise of other services.

Phishing is an increasing threat, with widespread opportunities as the general public makes extensive use of the Internet for banking and electronic commerce. This threat is especially dire for financial services, sites with online payment, and their users: an attacker can use stolen credentials to steal money or make fraudulent transactions. Sophisticated attacks also target specific individuals in spear phishing attempts, customizing e-mail with personal information to draw individuals to specific Trojan-horse websites.
Phishing is sometimes seen as a problem of education and experience, raising the question: "why can't people just stop clicking on the bad links?" While studies show training can help [74], training is expensive and time-intensive, and other studies show training provides more mixed benefits [19]. Moreover, the user-specific content in spear phishing exploits social pressures to encourage targets to set aside training and click. Even with training, ideally technical methods for anti-phishing would assist users, both naïve and trained.

There are two classes of technical methods to intercept phishing attempts. Most browsers today detect potential phishing with URL blacklists such as the Google Safe Browsing API, PhishTank [95], the Is It Phishing service [127], and the Netcraft toolbar [90]. The browser checks each website a web user visits against a list of known bad sites that is typically cached locally and refreshed regularly. While effective at stopping previously known threats, blacklists must react to new threats as they are discovered, leaving an inevitable window where users are vulnerable. Attackers exploit this gap by changing URLs for phishing sites frequently. Moreover, while blacklists may protect against common phishing sites, they are unlikely to track "pop-up" sites used for spear-phishing against a small number of targeted victims.

Alternatively, whitelists can identify pre-determined websites as "known-good". Whitelists thus avoid the race to identify and add new phishing sites, but have their own delays in approving new sites, and by definition prohibit (or strongly discourage) use of sites off the list. This delay makes them too limited for many users.

Our goal is to create a system that provides proactive and personalized detection of phishing websites. Our mechanism provides proactive in-browser testing of visited websites against likely phishing content, providing rapid defense with neither the delay of blacklist identification nor the strict constraints of whitelists. Each user can personalize the sites they visit and identify target content that might be used in phishing. Personalization customizes defenses and generates uncertainty for attackers, increasing protection against targeted, user-specific sites and spear phishing. Personalization can also augment shared, centralized lists.

In this chapter we introduce AuntieTuna, a web browser plugin that provides anti-phishing alerts as a user browses. Our approach includes a usable and simple mechanism for users to identify and customize protection against their own target sites. Our system indexes the target site's content and watches for this content to appear at incorrect sites as a sign of active phishing. While prior work has visually compared good website layouts with potential phishing sites [142], we focus on the content itself using cryptographic hashing. Our insight is that cryptographic hashing of page contents allows precise and efficient bulk identification of content reuse at phishing sites.

The contributions of this chapter are to support the thesis statement (described at the beginning of Chapter 3) and to show that our precise phishing detection using cryptographic hashing and user-personalized lists is both usable and effective.

One of our contributions is to provide usability in AuntieTuna, our anti-phishing plugin. We emphasize usability through automated and simple manual addition of target sites, and through clean reports of potential phish that include context about the targeted site. Since each user develops a customized list of target sites, our approach presents a diverse defense against phishers.

Another contribution is to provide a precise and effective approach to detecting phishing sites, with zero false positives. We show that our algorithms detect a majority of phish and are robust to several countermeasures, although they can be defeated by techniques such as a phishing site using only new images. Finally, AuntieTuna does not slow web browsing time and presents alerts on phishing pages before users can divulge information. A small number of alpha users have been using the browser extension, and we have released our extension and source code at https://auntietuna.ant.isi.edu.

3.2 Related Work

Given the importance of phishing, many anti-phishing solutions have been proposed. We build on prior experience in phish detection, anti-phish user interfaces, and education.

3.2.1 Automating Phish Detection

There are many different approaches to detecting phish. Anti-phishing blacklists, page heuristics, or a combination of both are used in browser toolbars [95, 127, 90], but these aren't always effective against phishing attacks [134], performing poorly even when blacklists were kept up-to-date [143]. Our plugin proactively detects phish using target content from known-good sites, thus avoiding the delay in updating and retrieving blacklists.

Machine learning can also be used to detect phish. By converting a website's content [54] or URL and domain properties [81] into a set of features or feature vectors, machine learning can look for websites that are similar but have anomalous properties, such as the "right" content in the "wrong" place. Computer vision techniques [3] can also be used to visually match the images on visited webpages with the originals. While these techniques can detect new phish, their approximate matching risks many false positives, and their high computational requirements make them difficult to run on clients. We instead employ precise content matching using cryptographic hashing to avoid false positives, and to provide lightweight detection that can run in a client's browser without centralized support.

Other approaches measure the similarity of phish and original sites by looking at their content and structure. Similarities can be computed based on the website's visual features (text content, styles, and layout) [79], or on object positioning in their Document Object Model (DOM) trees [106, 142]. CANTINA [144] sends signatures based on the highest-ranked words from the page's content through search engines and assumes valid content will be highly ranked in the results. Each of these approaches uses approximate matching, while we apply cryptographic hashing to avoid false positives when detecting phish that reuse content from the original website.

Although not for phishing, CodeShield uses personalized whitelists to identify good PC-based applications [49]. Users must verify newly installed applications, and the process requires multiple steps to encourage careful review. We too apply personalized lists to anti-phishing detection, but we emphasize fully automated or easy manual addition to the whitelist.

3.2.2 Anti-Phishing User Interfaces

User interfaces in anti-phishing tools play an important role in determining whether a user clicks on phish after it has been detected. There is a need for clear, non-subtle visual indicators of security problems [48]. Zhang et al. [143] found that while some user interfaces used colored indicators to signal whether a website was legitimate or phish, there was a lack of meaningful user interaction once a warning had been presented. Egelman et al. [39] found that most users responded positively when active warnings interrupt the user, present clear, recommended actions, and are not easily closable. Inspired by this work, our active warnings follow these guidelines, interrupting the page and providing information explaining our decision, along with educational resources.

Other approaches seek to prevent information disclosure by focusing on the site's login page. Dhamija and Tygar [35] propose using a memorable "visual hash" as a prominent graphical indicator that the site being accessed is secure and trusted. Google's Password Alert [53] binds together the known-good website and the user's login details; if the password is reused or entered somewhere else, the user is warned about phish and asked to change their password. We instead focus on phish website detection, but our approach could be used with these alternatives.

3.2.3 The Role of User Education

Multiple studies explore why users are susceptible to phish and how to educate users against phishing attacks. For example, they encourage users to recognize indicators such as incorrect URLs or broken locks (TLS) around the main content (in the browser user interface) that indicate something is amiss. Herley et al. [59] found that the mental cost of frequent evaluation of such indicators exceeds its benefits; users often perceive the consequences of getting phished as low and ignore warnings. Additional studies [74, 75] showed that anti-phishing training for users was effective when provided immediately as the user clicks on phish in email and when done periodically. We follow these studies' recommendations by working silently without any requisite indicators and intervening with an active alert only when we detect a phishing website. We also point the user to resources where they can learn more about phishing and how to avoid falling victim to phishing attacks.

Figure 3.1: Users click the "Personalize Button" on websites to add to the whitelist of known-good sites.

3.3 Design for User-customizable Anti-Phishing

Our anti-phishing system consists of three components: a browser plugin watches websites a user tries to browse (Section 3.3.4). Using our detection algorithms (Section 3.3.2), it compares each new website against a list of target content by comparing cryptographic hashes: a detected phish will have a match in the content list and will not be in a whitelist of known-good sites. Finally, we allow users to personalize the list of target content (Section 3.3.1), customizing a common list of well-known phishing targets. We describe these approaches below, and in Section 3.3.3 we highlight our choices that optimize usability.

3.3.1 Identifying and Personalizing Target Content

A central goal of our approach is easy-to-use, per-user customization. Here we describe how and what information is collected, and in Section 3.3.3 we discuss usability.

Detection is based on looking for target content in unexpected places. Users identify both target content and its expected locations by marking sites that may be targets for phishing using a simple web button (Figure 3.1). This button adds that site to the whitelist of known-good sites, and adds hashes that identify the content of that page to the list of target content.
This approach is analogous to public key pinning [105], and allows each user to build a custom list of sites they trust. (In Section 3.3.1 we show how even this button can be automated.)

Once a site is marked as known-good, its content may evolve over time. We update our content for known-good targets by opportunistically rehashing these pages when a user revisits their URLs after some time.

We choose to build and store the whitelist and target content in each client, distributing detection and avoiding any centralized infrastructure. (Some blacklists or whitelists, like Google's Safe Browsing API or an HTTP proxy, depend on centralized infrastructure and require global network connectivity and infrastructure managed by a third party.) We expect to draw on both centralized and per-user whitelists and target content. Organizations may distribute target content lists, either generated centrally or aggregated from many users. But we expect user customization to help build robust resistance to phishing in two ways.

First, some sites offer user-specific "skins". For example, users can indicate a preferred background or color scheme on Google's and Yahoo's websites. By selecting these specific versions of these popular sites, users tune anti-phishing to their profiles. Second, individuals often access smaller sites that are specific to their behavior, yet are vulnerable targets for spear phishing. For example, a company may have a public-facing internal portal that requires authentication. By making each user's defenses more diverse and unique, we avoid a "monoculture" of anti-phishing filtering [50], decreasing the effectiveness of bulk attacks.

We augment manual identification of pages with a fully automatic approach: every time a user agrees to save the password for a web page, we automatically mark that site as a phishing target. This approach leverages the existing indication of user trust (save my password) to provide a form of Trust On First Use [131]. We are in the process of integrating this method into our system; when deployed, it will provide protection without any need for user interaction (the button can be eliminated).

3.3.2 Processing Pages: Hashing and Detection

Our process of identifying target content and matching it against new pages to detect phishing uses cryptographic hashing. We first describe how this process is used to add a known-good page to the target content and the whitelist, then how it is used to check unknown content for potential phishing. We have explored the use of hashing previously to detect plagiarism and content duplication for advertising on the web (Chapter 2); here we consider how it can be used specifically for anti-phishing.

3.3.2.1 Processing a Known-Good Page

When a web page is identified as known-good (as described previously, Section 3.3.1), we must record the content and the URL of that website. We place the URL on the whitelist. To track the content and process a given web page, we walk the page's DOM representation in the browser, breaking it into "chunks" delimited by <p> and <div> tags. (Other delimiters are possible, but we found these to be most effective.) We remember the contents of each chunk 25 or more characters in length by computing its cryptographic hash with SHA-256 [104] (we filter out common, small-length chunks to avoid affecting our results). We then save the hashes in the client's local storage and add the URL to a whitelist of sites allowed to host this content. In our current use, both sets are relatively small and are stored directly in the client.
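To make this bookkeeping concrete, the following minimal, browser-agnostic sketch captures the two operations just described: recording a known-good page, and later checking an unknown page against the stored hashes. It is shown in Python for consistency with earlier sketches; the actual plugin implements the same logic in JavaScript against the rendered DOM, and the data-structure names here are illustrative assumptions.

    # Browser-agnostic sketch of known-good processing and detection;
    # the real plugin is JavaScript operating on the rendered DOM.
    import hashlib
    import re

    target_hashes: set = set()   # hashes of known-good target content
    whitelist: set = set()       # URLs allowed to host that content

    def page_chunks(dom_text: str):
        # Chunk on <p>/<div> delimiters; keep chunks of 25+ characters,
        # per the filtering described above.
        parts = re.split(r"<(?:p|div)[^>]*>", dom_text, flags=re.IGNORECASE)
        return [p.strip() for p in parts if len(p.strip()) >= 25]

    def mark_known_good(url: str, dom_text: str) -> None:
        # User pressed the Personalize Button: remember content and URL.
        whitelist.add(url)
        for c in page_chunks(dom_text):
            target_hashes.add(hashlib.sha256(c.encode("utf-8")).hexdigest())

    def looks_like_phish(url: str, dom_text: str, threshold: int = 1) -> bool:
        # A page reusing target content from a non-whitelisted URL is
        # suspected phish and would be overlaid with a warning.
        if url in whitelist:
            return False
        matches = sum(
            hashlib.sha256(c.encode("utf-8")).hexdigest() in target_hashes
            for c in page_chunks(dom_text)
        )
        return matches >= threshold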
Efficient techniques such as Bloom filters allow very large sets to be compressed to fixed-size storage and compared very efficiently [16].

The left part of Figure 3.2 shows a known-good page, with the PayPal login page taken on 2015-12-08 as an example. The truncated hashes of each DOM element are listed in the middle column (7667cd7, b3a4ac5, etc.).

Figure 3.2: Detecting a phishing attempt against PayPal. The known-good site (left) is visually similar to the phish (right), and common elements are identified by identical hashes of DOM elements (red values in the middle columns).

Target content of some sites will evolve over time. Opportunistic recrawl keeps our record of those pages fresh: when a user re-accesses a page in the whitelist after some time, we use that opportunity to refresh our hashes of their content.

3.3.2.2 Processing Unknown Content

We process an unknown page for potential phishing in the same way: we walk the DOM, breaking it into chunks and computing the hash of each chunk. We then compare the number of chunks that match the list of target content. If the number of matches is greater than a threshold, we flag the webpage as suspected phish and actively prevent the user from accessing it (Figure 3.3).

The right page in Figure 3.2 shows an actual PayPal phishing page we copied on 2015-12-08 (the URL has been obscured for privacy; the phish is no longer accessible as of 2015-12-10). Visually, the sites are identical, to mislead a user. The red hashes in the middle indicate that this similarity was accomplished by copying graphical and text elements. In fact, not all of the duplication is visible—we detect duplication in a hidden error-message div (indicated with dotted arrows). This duplication supports our automated detection.

Figure 3.3: Example of actively preventing a user from accessing a phishing site.

Detection of a phish results in an overlay that obscures the content (Figure 3.3). The alert encourages users to back off and gives links to web pages that can help them learn about phishing and avoid falling victim to phish. We also provide users with a link to the "most similar trusted page" as a method of explaining why we distrust this page. We also allow an escape mechanism to handle false positives, but it is presented in a manner that discourages and cautions against its use (initially hidden under the "Advanced" option).

3.3.3 Design Choices for Usability

Our approach to maximizing the usability of our anti-phishing methods is based on minimizing user interaction and no-knobs ("hands free") use. We adopt these goals based on studies which show that users reject security advice when it poses too great a burden relative to its perceived benefits [59], and on the need for clear, non-subtle visual indicators of security problems [48]. These goals are reflected in four design choices: full automation of building user-customized lists of target content, minimal controls for optional user additions, suppression of untrusted content, and some explanation and reasoning for that suppression.

First, we can fully automate user customization by integrating identification of target content with password storage, an existing method of managing trust. Although we optionally allow users to manually flag pages as target content, and organizations to distribute centralized lists of trusted sites, this automation customizes phishing defense for each user with no explicit effort. In addition, our manual interface is very simple: use one button to add a site (Figure 3.1).

We actively suppress the visited webpage if it is suspected phish. Prior studies have shown that users often ignore passive warnings [39] and continue through to dangerous content. We explicitly choose not to redirect the user to their intended website automatically, so as to discourage users from becoming complacent about following phishing links.

Finally, rather than treat AuntieTuna as a black box, we give the user some background about why we flagged the candidate site as phish by including a link to the "closest trusted site". We provide this link as content, not as assistance to redirect, and accompany it with links to information about phishing.

3.3.4 Implementation of Anti-Phishing in AuntieTuna

Our plugin is implemented as an extension to the Google Chrome browser, written in JavaScript and using only the Chrome APIs. This section summarizes our implementation choices to operate with the security model for Chrome plugins [20]. We expect that our approach can port to other browsers.

AuntieTuna consists of three components (Figure 3.4): the Personalize Button, Page Watcher, and Storage Manager. In the Chrome model, these components run as a Browser Action, Content Script, and Core Extension, respectively.

Figure 3.4: Implementation diagram of the AuntieTuna anti-phishing plugin.

3.3.4.1 Page Processing Workflow

Processing pages begins with users personalizing their list of target content by using the Personalize Button (located on the browser toolbar, Figure 3.1) on known-good websites. Pressing the button signals the Page Watcher to mark the current webpage as known-good and chunk it. The Storage Manager stores the resulting hashes and site URL in the list of target content and the whitelist of known-good sites, respectively. The updated list of target content is ready for use immediately.

The Page Watcher runs continuously in the background, watching for and processing unknown pages not found in the whitelist of known-good sites. When the page has rendered, the Page Watcher processes the page as described in Section 3.3.2 and, if the page is suspected phish, injects an overlay (Figure 3.3) on the current page to prevent the user from accessing it.

3.3.4.2 Platform-Specific Customizations

Implementing our browser plugin required changes to the discovery and detection methodology from our prior work (Chapter 2). Chrome's security model prevents our extension from accessing the raw underlying HTML of sites, but it does allow access to the parsed version of the page in the form of the page's document object model (DOM). Because the DOM is the processed (rendered) version of the underlying HTML, it can be modified by scripts on the page or other concurrent extensions, potentially reducing the accuracy of our phish detection mechanism. (We discuss this problem and possible countermeasures in Section 3.4.2.) Additionally, the rendered DOM is browser-specific. Thus, the hashes in a given user's phishing target content may not apply to users of other browsers.

We generate and store all hashes and lists in the client browser, making our methodology completely self-contained, without dependence on outside infrastructure or processing. Our approach runs after page render time, imposing no increase in page render time. However, processing time creates a gap during which users are briefly unalerted about a potential phish attempt. In Section 3.4.3 we show that we are faster than user reaction time on PC-class hardware, but this gap will be larger on lower-end hardware such as tablets or mobile phones.
We can reduce the classification time of unknown pages by using Bloom filters to speed comparison of the contents of a new page against our list of target content. We can eliminate the false positives that occur with Bloom filters by using a two-tiered search: if a hash of some content chunk is "found" in the Bloom filter, do another search in the full list of target content. Since we expect most searches to return negative, the amortized cost of doing a full search is sufficiently negligible to maintain a zero false-positive detection rate. We have not yet implemented this optimization.

3.4 Effectiveness of Phishing Detection

To evaluate AuntieTuna, we consider its effectiveness today and in the face of potential countermeasures. We also examine its effects on browser performance and in our usage to date.

3.4.1 Evaluation of Phish Detection Accuracy

We now evaluate the effectiveness of the core algorithms of AuntieTuna. This is our first evaluation of DOM-based hashing, although it builds on our prior work evaluating duplication of HTML (Chapter 2). Since we do not have access to a large source of spam, we approximate this system as follows. We target PayPal phishing, and fill our target content list with current and recent PayPal U.S., U.K., and French home pages (Sept. 2014, plus Jan. 2012 to Aug. 2013) loaded from archive.org. We gather six variations on these three web pages, resulting in a target content list containing 311 distinct chunks longer than 25 characters.

We test AuntieTuna against a suspected phish stream of 2374 URLs drawn from PhishTank [95] over 2 days (2014-09-24 and 2014-09-25). PhishTank is a crowd-sourced anti-phishing service. Since the lifetime of a phish is short, we automatically rip the target of each suspected phishing link. We compare each suspected phish against our target content list with our algorithm (Section 3.3.2), with a detection threshold of one or more non-trivial chunks.

To evaluate ground truth, we manually examine the suspected phish stream and identify 124 (of the 1888) as PayPal phish attempts. We further identify 85 of these sites as phish utilizing content from PayPal. Our mechanism detects 50 (58.8 %) pages that pass the detection threshold: 43 are direct rips detected with no normalization applied, and an additional 7 are detected with whitespace normalization. Table 3.1 classifies the type of techniques each phishing site uses.

Description                                  Num. Pages      %
Candidates                                         2374
  Unavailable                                       486
  Ripped                                           1888
Other                                              1764           TN = 1764
PayPal (image-based, removed)                        39
PayPal                                               85   100.0   FP = 0
  Successfully detected                              50    58.8   TP = 50
    Direct rips                                      35
    Whitespace normalization                          8
    JavaScript obfuscation                            7
  Custom-styled with minor PayPal content            35    41.2   FN = 35
Table 3.1: Classification of phish in two days of PhishTank reports, based on detection against PayPal. Sensitivity = 58.8 %, Specificity = 100 %.

Without taking steps to defeat countermeasures, our approach has a fairly high false negative rate, with a sensitivity of 58.8 %. However, on our targeted dataset it has zero false positives and a specificity of 100 %. This experiment suggests our approach is a valuable additional technical method to automatically block phishing attempts, at least against our sample. Evaluating against more diverse phishing sites and use by more users is important future work. We next discuss hardening our approach against countermeasures.

3.4.2 Resisting Potential Countermeasures

While most phish copy much of the original site, other phish use different techniques to attack their targets, sometimes deliberately obscuring the source of their content. We discuss how these countermeasures affect the accuracy of our phish detection, and strategies to work around them.

All phish are constrained by the requirement that they must look very similar to the original. Most simply copy content from the original, prompting our approach. However, others obscure that content. A fair number of phish (39 of the 124 PayPal-appearing phish) replace the original content with images. Our approach cannot see through this concealment, and we exclude these from our list of PayPal phish that are potentially detectable by our method. Fortunately, such sites can be obvious (for example, text is not selectable, or fonts vary by platform), and are subject to image analysis.

We next focus on the 85 potentially detectable (non-image-based) PayPal phish: of these we detect 50 phish (58.8 %).

Sites can vary the original site's HTML slightly, replacing whitespace or making other changes that do not affect the visual result. The DOM passes some variations through; thus we normalize whitespace as part of our processing, detecting 8 (9.4 %) sites that we would otherwise miss. A phisher willing to mutate every element will evade our approach; however, we argue that such a phish would also appear suspicious (due to misspellings or awkward phrasings) and is more work to generate than cut-and-paste.

More challenging are sites that generate or obfuscate content dynamically with JavaScript to elude web crawlers that look for and process static HTML only. Because we parse the DOM after any JavaScript has run, we can see through this obfuscation. Manual identification showed 7 suspected PayPal phish (8.2 % of detectable phish) that used JavaScript content that we find in DOM-based analysis but not in HTML alone: we found all 7 of them.

A phisher could use homographs (look-alike characters) in ASCII or Unicode (e.g., Greek Ρ for "P") to spoof the original. Our approach cannot currently see through these techniques, although we could potentially normalize characters by shape just as we normalize whitespace.

Finally, we see a fair number (35 pages, 41 %) of potentially detectable phish that construct an original phishing site using only a small amount of content taken from the original site. We miss these phish, although we expect their deviation from target content makes them less believable.

We conclude that we miss a number of phish that use images or copy minimally from the target; however, we detect more than half of phish with no false positives, thus providing a useful service. Wide use of our approach will of course cause phishers to move to other types of attacks or to target the thresholds our tool uses. Personalization makes such countermeasures difficult, and we would consider "raising the bar" on rip-and-copy attacks a partial victory.

3.4.3 Browser Performance with AuntieTuna

A concern with any plugin is that it slows the browsing experience, so we next examine the computation performed by our plugin. AuntieTuna introduces zero increase in page render time because we process the page only after it has finished rendering. Thus the performance question for AuntieTuna is not about a "slower web", but instead about a potential gap between when the page is visible and when we detect it as phish.

We run our benchmarks on four sites using Google Chrome (v47.0.2526.80, 64-bit, 2015-12-08) with the list of target content from Section 3.4.1.
We test on a PC running OS X 10.10.5 with an Intel i7-2760QM processor and 8 GB of memory. In Table 3.2 we report mean and standard deviation of page render and AuntieTuna execution times taken from five runs.

                    Page Render      AuntieTuna
Website             ms (σ ms)        ms (σ ms)
google.com           327   (15)      144 (6)
paypal.com           349    (5)       20 (2)
nytimes.com         5316 (1632)      167 (6)
en.wikipedia.org     321    (7)       75 (3)
Table 3.2: Page render and AuntieTuna execution times.

We find that our plugin's execution time ranges between 20–167 ms per page. There is quite a bit of variation depending on the complexity of the page. Only for the highly optimized Google home page does scan time approach page render time; in other cases it is small relative to page render time. However, we again emphasize that analysis happens after rendering and in parallel with viewing, so user browsing is not affected.

A more serious concern is whether we can put up a warning fast enough, before a user divulges private information, since they will see content while we process the page. Our longest scanning time is 167 ms; while this one-sixth of a second may be noticeable, we believe there are few users who could enter their information and click submit in this short amount of time.

A great deal of web use today occurs on less powerful hardware, such as tablet computers or mobile phones. We have not evaluated our plugin on these devices. While our plugin will be slower on slower computers, user data entry will also be slower. Porting our plugin to mobile devices is future work.

We conclude that AuntieTuna has no effect on web browsing performance on reasonably powerful hardware, and it runs fast enough to protect users.

3.4.4 Experiences in Real-World Usage

We have been using the browser extension continuously since March 31, 2015. So far the extension works reasonably well, detecting the known phish we use for testing without noticeably affecting the speed of normal browsing operations. We have not yet seen any real phish, nor any false positives. A larger and formal user study remains as future work.

3.5 Conclusions

This chapter has described a new approach to phish detection and its realization in AuntieTuna, a Chrome browser plugin. We described our design decisions to make our approach easy to use, with automatic or simple manual addition of targets and clean reports of potential phish. We have shown that our approach is precise (no false positives), that it detects a majority of phish in controlled experiments, that it does not affect browsing speed, and that it presents alerts before users can divulge information. We have released our extension and source code on our website at https://auntietuna.ant.isi.edu.

This chapter supports the thesis statement by showing how AuntieTuna improves one's network security by reducing successful phishing attacks using local and personalized detection of phishing websites. AuntieTuna protects users from falling victim to phishing attacks by first detecting whether an unknown, visited site is phish, and then preventing further access to it. Our detection techniques use cryptographic hashing to find phishing sites with precision and are self-contained, without external dependencies, running locally in the client-side browser. By personalizing detection to the user's behavior, tracking only the "known-good" sites that they use, we keep detection lightweight in resource usage while presenting a diverse defense against attackers.
In the next chapter, we will further support the thesis statement by showing how one can improve their network security with the controlled exchange of information between collaborators.

Chapter 4

Retro-Future: Improving Network Security with Controlled Information Sharing

In this chapter, we present Retro-Future, a controlled information exchange framework with principled risk and benefit management that formalizes data sharing for cybersecurity applications. Our previous chapter's work on AuntieTuna and its manual data sharing inspires this study of cross-organizational data sharing. Correspondingly, this study motivates our work in the next chapter, exploring how data sharing between users and friends enables them to protect themselves and their social circles from phishing attacks.

This study of Retro-Future partially supports our thesis statement. Retro-Future improves network security by increasing the effectiveness of local detection of malicious activity when previously-private data is shared between organizations. Retro-Future is a framework that enables the controlled exchange of previously-private network information with collaborators within and across organizations by allowing them to control and balance the risk and benefit trade-off in data sharing. When data is shared between organizations, each organization improves its local detection algorithms' sensitivity by increasing the diversity, or quantity and quality, of the input data. We quantify the benefit of cross-site sharing in two case studies: detecting DGA-based botnet activity and finding Internet-wide activity with DNS backscatter.

This study was joint work with Prof. John Heidemann, Gina Fisk (Los Alamos National Laboratory†), Mike Fisk (Los Alamos National Laboratory†), Shannon Beck (Los Alamos National Laboratory†), and Prof. Christos Papadopoulos (Colorado State University). († At the time of this work.) Part of this chapter was previously published in the ACM SIGCOMM Workshop on Traffic Measurements for Cybersecurity [11].

4.1 Introduction

Cybersecurity incidents continue to increase in size, with highly damaging economic and, increasingly, physical consequences. The consequences of these incidents include an enormous loss of private data on individuals (Anthem [1], OPM [32], Yahoo! [52]) and corporations (Bangladesh Bank [28], Sony [26]), and the money spent cleaning up. Not limited to "simple" data loss, the damage is growing past the digital boundary, affecting critical infrastructure: from industrial systems (Stuxnet [78]) and hospitals (ransomware [29]) to our own homes (Internet of Things (IoT) malware, phish).

In order for organizations to improve and maintain their cybersecurity posture, they need to share data across and within organizations during the incident response process. Data and working processes are distributed and independent across and within organizations: each organization has its own unique and incomplete view of the Internet or its local network. Data sharing during and after a security incident helps expedite the incident response process by collectively increasing the global knowledge and corresponding effort against an attack. The increased, shared knowledge and resulting collaboration help lead to forward progress, defined as advances in research and understanding, in improved network security.

Data sharing today is difficult, as many organizations share limited or no information with other organizations for several reasons. Organizations might not share their data because it contains highly sensitive and private information (competitive intelligence, proprietary data). Organizations also sometimes cannot share (prohibited by law), choose not to share due to fear, uncertainty, and doubt about the risks of data disclosure, or both. Even within an organization, different parts of the organization are often discouraged or prevented from sharing. Groups might be segmented in order to maintain independence and prevent conflicts of interest (for example, a logical "firewall" between investment groups to prevent insider trading, or between the editorial and advertising groups of a publication), and to establish security (accounting and IT have little to no visibility into each other's systems).

Organizations that share data with other organizations and within their own will accelerate progress in cybersecurity. Sharing across different organizations enables them to solve problems that are inherently distributed (stepping-stone attacks across many network boundaries), as each organization contributes a different view of the Internet. These benefits also apply to sharing within different parts of large organizations.

Our contributions are to support the thesis statement (described at the beginning of Chapter 4), to provide the controlled, cross-site data sharing mechanisms in the Retro-Future system, and to quantify the benefits of data sharing with two case studies in finding malicious network activity.

We provide the controlled, cross-site data sharing mechanisms in Retro-Future, a system that provides retrospective, post-event understanding with time travel. Our insight is that greater control over the risk and benefit trade-off enables data sharing, and these controls are implemented in Retro-Future with three techniques: individual sensitivity levels at each site (Section 4.3.1), controlled cross-site sharing of information (Section 4.3.2, Section 4.3.3), and efficient retrospective search and processing of data (Section 4.3.4). Individual sensitivity levels enable organizations to customize and specify their sharing for each trust relationship with other organizations, ensuring flexibility in implementing sharing policy. Controlled cross-site sharing via a query/response system minimizes the risks of data disclosure for both the querier and the responder. Finally, efficient retrospective search and processing helps resolve cross-site sharing on human timescales and is necessary for retrospective analysis in light of new information.

We quantify the benefits of sharing using our data sharing mechanisms in Retro-Future through two case studies in finding malicious network activity: detecting botnet activity and finding Internet-wide activity with DNS backscatter. A key result of sharing cybersecurity data is an improvement in network data diversity (we can find more malicious activity) and in our detection algorithms' sensitivity (we can detect malicious activity with greater precision). In DGA botnet detection (Section 4.4.1), we conduct a two-year longitudinal study, showing that sharing enables us to detect botnet activity in secure networks where few bots exist and to improve the sensitivity of malicious activity detection. In processing DNS backscatter (Section 4.4.2), we show that sharing improves wide network visibility by increasing the aperture of our sensors, helping DNS authorities detect previously unknown malicious activity.

Finally, by showing the benefits of data sharing, we hope to regularize and normalize information exchange within and across organizations. Today, ad-hoc sharing occasionally happens in closed groups, generally with a limited number of participants, as it is hard to scale trust and sharing when the number of participants increases. With our remote query moderation system and data controls, we can enforce the different levels of trust given to participants and remove some of the human-intensive elements in the data sharing loop, enabling participants to focus on making forward progress in cybersecurity. Our mechanisms shift the risk-benefit trade-offs in data sharing, showing that for these applications, sharing makes sense. The Retro-Future framework and tools are open-sourced and are available online at https://ant.isi.edu/retrofuture. Our hope is that these examples and tools will promote broader sharing of security-related data in other applications.

4.2 Threat Model

Retro-Future's threat model is summarized in Table 4.1. Retro-Future considers the primary threats to sharing data (via query-response) and their countermeasures when data is at rest, in motion, and in use. Retro-Future handles data in three stages in response to a query: it first pulls data from archives ("at rest"), processes and manipulates the data ("in use"), then transmits the processed results to the client ("in motion").

data       threat                actor               countermeasure
at rest    unauthorized access*  internal, external  secured data archive with data encryption and minimization (Section 4.3.5)
           abused access         internal            strong user authentication and authorization, data federation (Section 4.3.1, Section 4.3.5)
in use     active data breach*   internal            ACLs, restricted query languages, query logs and audits, privacy and execution budgets (Section 4.3.1, Section 4.3.2, Section 4.3.3)
           passive data leaks*   internal            remote and moderated queries, data anonymization and redaction, differential privacy (Section 4.3.2, Section 4.3.3)
in motion  eavesdropping         external            secure communication protocols (Section 4.3.5)
           wrong endpoint*       internal, external  public-key authentication, certificate/public-key pinning (Section 4.3.5)
* indicates threats introduced or caused by Retro-Future
Table 4.1: Summary of Retro-Future's threat model.

While some threats we briefly discuss are general to any distributed system, Retro-Future introduces new threats (indicated by '*' in Table 4.1), as we purposefully collect and share sensitive data with others. We thus focus our discussion on the threats unique to Retro-Future and how Retro-Future counters these threats to minimize risks in data sharing.

If the threats to data sharing are not properly mitigated, they can lead to unintended data disclosure or misuse, with further consequences. Disclosure or misuse of data often leads to loss of sensitive data (like unmasked IP addresses or PII), resulting in consequences that range from financial and identity fraud (at an academic, health care, or enterprise organization) to possible loss of life (government/military).

We assume that the systems external to Retro-Future (including the hardware and operating systems on which Retro-Future operates and network data live) are reasonably secured against general external and internal threats (for example, using NIST's framework [88]). For example, while we design Retro-Future to be robust to an eavesdropper on the network, we assume that the underlying Retro-Future system is trustworthy and has not been compromised by a malicious actor.
4.2.1 Data at Rest

Data at rest is defined as inactive data (not in use or in motion) sitting in storage or archives. Retro-Future encourages organizations to save, archive, and use their raw network traffic and system log data, data that had not previously been stored. We must secure this newly saved data to protect against the increase in risk of unintended disclosure caused by Retro-Future collecting, storing, and eventually sharing it. We secure the saved data using existing techniques and best practices.

Today's best practices in securing data against unauthorized access, or abuse of authorized access, apply here: data should be protected with encryption, aging (removing information over time, like payloads from packet captures), and anonymization. Similarly, access to data should be protected with strong user authentication, authorization, and data federation (keeping data in different locations). Retro-Future's query system works well with data federation, distributing queries across all the needed archives or other Retro-Future systems.
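To make the aging technique concrete, the following minimal Python sketch drops payloads from archived records once they pass a retention window. The record layout (a dict with an ISO timestamp and an optional payload field) and the 30-day window are illustrative assumptions, not Retro-Future's actual storage format.

    # A minimal sketch of payload "aging" for archived capture records.
    from datetime import datetime, timedelta

    PAYLOAD_RETENTION = timedelta(days=30)  # keep payloads only this long

    def age_record(record: dict, now: datetime) -> dict:
        """Return a copy of the record with the payload removed once it ages out.

        Headers and metadata are kept for long-term retrospective analysis;
        the sensitive payload is dropped to reduce disclosure risk at rest.
        """
        aged = dict(record)
        captured = datetime.fromisoformat(aged["captured_at"])
        if now - captured > PAYLOAD_RETENTION and "payload" in aged:
            del aged["payload"]
        return aged

    if __name__ == "__main__":
        rec = {"captured_at": "2017-01-01T00:00:00",
               "src": "10.0.0.2", "dst": "192.0.2.1", "payload": b"..."}
        print(age_record(rec, datetime(2017, 3, 1)))  # payload removed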
4.2.2 Data in Use

Data in use is defined as data being actively manipulated or processed by Retro-Future in response to a query from a client (local or remote). Another definition of data in use is data that sits in volatile, non-persistent storage, like RAM and CPU caches or registers; we focus on the higher-layer threats to data at the software level while it is being processed by Retro-Future. Data in use sits in a state between at rest and in motion as Retro-Future is processing and manipulating the data.

The new threat to data in use is unintended data disclosure, resulting from remote queries to data, provided by Retro-Future, made by authenticated and authorized users. Unintended data disclosure can happen via active privacy breaches (generally intentional) or passive privacy leaks (generally inadvertent). Prior to Retro-Future, sensitive data was not shared nor able to be queried by users outside of the originating organization. To counter the threat to data in use, Retro-Future uses a combination of its novel query system and other existing techniques to mitigate data disclosure.

An active privacy breach can result from many intrusive queries over a period of time. For example, a query similar to "SELECT * FROM all_data", or many smaller queries sent over time that are equivalent to it, would likely be too intrusive and resource intensive. Retro-Future provides protective measures like ACLs to enforce data and query access (restricted query languages), query logging and auditing, and "privacy budgets" (limiting the amounts and execution times of queries) to minimize intrusive queries and possible privacy breaches.

Passive privacy leaks happen because of accidental disclosure of sensitive data attributes in response to otherwise valid and reasonable queries to data. Accidental disclosure can happen as the result of poor choices in anonymization (enabling re-identification) or the lack thereof (overlooked attributes that needed to be protected). Retro-Future first minimizes disclosure by enabling remote queries on data, as opposed to bulk data transfers, ensuring that data is kept secure at the owner's site. Retro-Future also provides multiple levels of anonymization and redaction of data through moderated queries to provide only the attributes needed to make forward progress. Additional countermeasures to be explored in Retro-Future include differential privacy, to protect the privacy of individual records.

4.2.3 Data in Motion

Data in motion (or in transit) is defined as data being transmitted between two endpoint nodes (the Retro-Future system and data archives, or a remote client and Retro-Future) across a network. The primary new threat to data in motion introduced by Retro-Future is transmitting data to a wrong endpoint, by mistake (human error or misconfiguration) or malicious intent (man-in-the-middle attacks). Retro-Future's query system protects against wrong endpoints by ensuring that data is sent only in response to queries made by authenticated and authorized clients. Similarly, connecting to Retro-Future over TLS/SSH can use certificate or public key pinning to verify the intended host or service.

Another generic threat to data in motion (in any system) is an eavesdropper tapping into the connection. Retro-Future adheres to best practices, using strong cryptographic protocols for communication (TLS, SSH) to provide both privacy and data integrity.

4.3 Enabling Information Sharing with Cross-Site Queries

We next describe our approach and design decisions on how we enable controlled information sharing with cross-site queries. The goal of controlled cross-site information sharing is to share usable data while balancing the privacy and exposure risks (Section 4.2) between query and response.

We achieve this goal in Retro-Future with the principled risk and privacy management needed to share data safely, through trust and sharing policies and query management. Owners first set and modify accesses, even granting temporary escalated privileges, based on their trust relationships and sharing policies (Section 4.3.1) with other organizations. Queries to data are then moderated (Section 4.3.2) such that more-specific queries are more likely to be answered than broad queries, and queries are always remotely processed to control disclosure (Section 4.3.3).

In addition to principled risk and privacy management, Retro-Future provides additional benefits and features that enhance data sharing and its applications. Time travel (Section 4.3.4) resolves cross-site sharing on human timescales and enables retrospective analysis of prior events across many data types when new information is acquired. We emphasize owners' full control of the system and data using best common practices in system security (Section 4.3.5) while maintaining flexibility over data accessibility.

4.3.1 Establishing Trust and Sharing Policies

Trust relationships form the backbone between two collaborating parties in data exchange. These relationships, as they exist today in informal, ad-hoc exchanges, cause problems in accountability and privacy: the relationship is unofficial and possibly short-lived, and the information exchange between the two parties is not audited, nor are the disclosure risks properly managed. Additionally, trust relationships are assumed to be zero or full trust, corresponding with zero or full access to one's data; the assumption that trust is binary precludes most organizations from sharing anything. Retro-Future encourages organizations to define and formalize a data sharing policy to enable the regular information sharing needed to make forward progress.
While creating the legal policies needed for sharing is out of scope for this study, we describe the mechanisms Retro-Future provides that enable different policies to exist: organizations codify and make official their trust relationships through granted accesses (which Retro-Future enforces), and understand the risks in data disclosure by explicitly giving access to specific data and query types.

Objectives: We present a framework for how trust relationships can correspond to information sharing at different sensitivity levels. After establishing a sharing policy and prior to sharing data, organizations will need to inventory the data they collect and explicitly define in Retro-Future how it should be accessed in an ACL; the ACL then enables organizations to account for the risks and benefits in cross-site sharing. We challenge the notion that trust relationships are binary (zero or full trust) and provide mechanisms that allow relationships to operate at either end and in the middle ground. This allows organizations to calculate their own risk/reward trade-off: an increase in risk tolerance enables greater gains in forward progress. Organizations then use our mechanisms in a flexible way to implement their data sharing policy (which may vary across trust relationships).

Mechanisms: We present a novel use of an existing mechanism to map trust relationships to data sharing policy details through access control lists (ACLs). We describe (and present an example of) how organizations use ACLs in Retro-Future to provide remote queriers access to shared data at varying sensitivity levels. Retro-Future then enforces these ACLs through a unified query interface that clients use to access data (different types of data would normally require individually different interfaces).

Organizations inventory and set permissions in an ACL on three categories given a user access level or role: data sources, data types, and query types. Building an inventory and corresponding ACL in the context of sharing with others is new, requiring careful planning. A data source identifies the underlying raw data, which is shared in different forms (data types) and queried in different ways (query types). This enables organizations to share at different levels of detail with different users given their trust relationship, even over the same underlying data source.

      | query types       |
level | SQL | Snort | BPF | data types                      | data sources
1     | X   |       |     | netflow                         | dns
2     | X   | X     |     | netflow, pcap-headers           | dns
3     | X   | X     | X   | netflow, pcap-headers, pcap-raw | dns, gateway

Table 4.2: Example of an organization's access control list (ACL)

Table 4.2 is an example ACL that an organization could use to share network traffic. Data sources of border (gateway) and DNS (dns) traffic can be shared at various levels of detail (netflow, packet headers, and raw packet captures) and queried in specific ways (SQL, Snort signatures, or BPF). Using this ACL, a data sharing policy might allow levels 1 and 2 (the least permissive) to be assigned to trusted collaborators at external organizations, while level 3 is assigned only to internal users. (We see in Section 4.3.2 how permissions can be temporarily escalated in specific events.)

These trust relationships and data sharing policies are implemented and enforced by the Retro-Future system to enable information sharing with cross-site queries. We next show how queries can be moderated to control the sensitivity and types of questions that can be asked by a remote querier on the owner's data.
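A minimal Python sketch of how the ACL in Table 4.2 might be represented and enforced at query time follows; the dictionary layout and the check function are illustrative, not Retro-Future's actual implementation.

    # A minimal sketch of the ACL in Table 4.2: three access levels gate
    # query types, data types, and data sources.
    ACL = {
        1: {"query_types": {"sql"},
            "data_types": {"netflow"},
            "data_sources": {"dns"}},
        2: {"query_types": {"sql", "snort"},
            "data_types": {"netflow", "pcap-headers"},
            "data_sources": {"dns"}},
        3: {"query_types": {"sql", "snort", "bpf"},
            "data_types": {"netflow", "pcap-headers", "pcap-raw"},
            "data_sources": {"dns", "gateway"}},
    }

    def is_permitted(level: int, query_type: str, data_type: str,
                     data_source: str) -> bool:
        """Check a remote query against the querier's assigned access level."""
        entry = ACL.get(level)
        return (entry is not None
                and query_type in entry["query_types"]
                and data_type in entry["data_types"]
                and data_source in entry["data_sources"])

    # An external collaborator at level 1 may run SQL over netflow from DNS
    # traffic, but not BPF over raw captures (reserved for internal level 3).
    assert is_permitted(1, "sql", "netflow", "dns")
    assert not is_permitted(1, "bpf", "pcap-raw", "gateway")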
4.3.2 Moderating Queries

Retro-Future enables cross-site information sharing with others using a query/response system as its foundation, with additional risk and privacy management on top of the queries. Although a remote client has been authenticated and authorized, that client is not granted free rein to query everything about its permitted data and query types; doing so leads to unintended data disclosure. Retro-Future thus moderates incoming queries from remote clients by sensitivity before processing them, to further minimize risks to privacy in the response.

Objectives: We need to moderate queries on both responder and querier sides to control the query's sensitivity. Controlling the sensitivity will enable each party to preserve privacy while providing (or receiving) usable output. Our insight is that we can achieve a privacy balance by continuously adjusting the information trade-offs in the query and response. We apply this insight to both the responder and querier below.

Responders moderate queries by dynamically adjusting what Retro-Future does in response to incoming queries, selecting between different levels or layers of sensitivity. For example, the querier is more likely to get the answers that they need by adding additional details in their query. Put another way, responders are more inclined to answer more specific questions ("did 10.0.0.2 visit example.com?") than broad ones ("who visited example.com?").

Similarly, we also need to balance the privacy needs of the querier, as the information disclosed in their query presents a privacy risk. In the previous example query, the more specific question reveals that the querier is particularly interested in an IP (the broader query obscures that fact). Certain situations for the querier (an ongoing security incident) may require that the query itself has minimal disclosure, with the query then revealed and studied in the post-mortem.

Mechanisms: Retro-Future has several mechanisms that support moderating queries, enabling queriers to receive actionable information (enabling forward progress) and responders to protect privacy (managing risk): a privacy budget to control the level of sensitivity, and contextual access to additional query levels.

Users are allocated and spend a certain amount of their privacy/token budget (which replenishes over time) with any given query. Although general budget allocation is still an unexplored area (and remains an unsolved problem in its roots of differential privacy), we assign rough values based on the query's attributes, like the data being queried or the query's specificity. For example, packet payload inspection is much more sensitive than inspection of headers and correspondingly has a higher cost, both in privacy budget and in disclosure by the querier. A compromise that lowers both privacy and disclosure costs might be a query that matches on the hash of its contents.

Another mechanism to support query moderation and, ultimately, forward progress is contextual access to additional queries. For example, a positive response on an initial query about a vulnerability ("were you affected by X?") might be followed up with a more sensitive query ("which IPs were affected by X?") that would normally be rejected as too sensitive. Thus, permission to a query can be upgraded given the context and corresponding evidence: in light of handling a possible incident requiring a fast response, privacy decisions can then be made after-the-fact.
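A minimal Python sketch of such a replenishing privacy budget, written as a token bucket, follows. The per-query costs (payload inspection costing more than headers or content hashes) and the replenishment rate are illustrative placeholders, not values Retro-Future prescribes.

    # A minimal sketch of a token-bucket privacy budget for query moderation.
    import time

    QUERY_COST = {"headers": 1, "content-hash": 2, "payload": 10}

    class PrivacyBudget:
        def __init__(self, capacity: float, tokens_per_sec: float):
            self.capacity = capacity
            self.rate = tokens_per_sec
            self.tokens = capacity
            self.last = time.monotonic()

        def _replenish(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now

        def try_spend(self, query_kind: str) -> bool:
            """Deduct the query's cost if the budget allows it; else reject."""
            self._replenish()
            cost = QUERY_COST[query_kind]
            if cost > self.tokens:
                return False
            self.tokens -= cost
            return True

    budget = PrivacyBudget(capacity=20, tokens_per_sec=0.01)
    assert budget.try_spend("payload")   # expensive, but within budget
    assert budget.try_spend("headers")   # cheap queries remain possible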
Upgraded permissions can be handled manually (by a human) or semi-automatically, which can lead to a fully automated "query negotiation" process (where both querier and responder settle on a satisfactory cost). Retro-Future supports manual escalation and automated de-escalation ("downgrade-until-success") of a query. Similarly, Retro-Future provides the raw APIs (via RPCs and ACLs) that a query negotiation mechanism could use.

4.3.3 Controlling Data Disclosure

Problem and Objectives: Controlling data disclosure is the final step in the query process and part of how we balance risks with forward progress, by protecting users' privacy in the data while providing usable results. Our insight is that by controlling the level of detail in the results while the query is being processed, the querier can query on more sensitive attributes (returning more useful results) while the responder maintains their users' and organizational privacy. For example, it is sometimes sufficient for the querier to receive a simple yes/no response, in contrast to a more traditional reply of matching packets or log entries. The terse nature of the simple reply limits disclosure of more sensitive data.

Mechanisms: We control data disclosure to solve the problem of providing usable results while protecting user privacy with three types of techniques: data minimization, rate limiting, and query logging and auditing.

Data minimization controls what is being shared by obfuscating or removing sensitive attributes that contain PII. Retro-Future makes use of existing tools like dnsanon [115] and LANDER [116], which can anonymize or remove payloads in DNS and network packet capture data. These tools can be used to minimize and replace the original raw data in archives (minimizing the risk of disclosure) or used on-the-fly as the results are being processed (keeping the original preserved).

Rate limiting controls disclosure by giving users a strict budget to spend on queries and query processing time, limiting data throughput for unintended data disclosure: Retro-Future uses existing rate limiting techniques in the new context of data sharing for preserving privacy. For example, execution time limits place constraints on how long a query can take, ensuring that users can't monopolize available processing power or run data-intensive queries (a query that runs on all available data).

Finally, existing techniques in query logging and auditing allow operators to quantify exactly what information is being shared, assess whether the data disclosure controls are too strict or too permissive, and adjust accordingly.
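A minimal Python sketch of response shaping under these controls follows: the responder downgrades matching records to a yes/no answer, a count, or redacted records before anything leaves the site. The record fields and disclosure levels are assumptions for illustration.

    # A minimal sketch of disclosure control at response time.
    from typing import Iterable

    def answer(matches: Iterable[dict], disclosure: str):
        """Shape the response by the querier's permitted disclosure level."""
        matches = list(matches)
        if disclosure == "boolean":      # least disclosure: existence only
            return len(matches) > 0
        if disclosure == "count":        # aggregate, no record attributes
            return len(matches)
        if disclosure == "redacted":     # drop sensitive fields per record
            return [{k: v for k, v in m.items() if k != "client_ip"}
                    for m in matches]
        raise ValueError("unknown disclosure level")

    hits = [{"client_ip": "10.0.0.2", "qname": "example.com"}]
    print(answer(hits, "boolean"))   # True: enough for "were you affected?"
    print(answer(hits, "redacted"))  # [{'qname': 'example.com'}]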
4.3.4 Time Travel

The ability to time travel through data archives, or easily downselect and search through historical data, helps resolve cross-site sharing on human timescales (for example, making query escalation in Section 4.3.2 easier) and is necessary for retrospective analysis when we acquire new information, such as indicators of compromised systems or specific vulnerability details.

Objectives: Time travel is part of our system's design in balancing privacy and forward progress, enabling us to make decisions on privacy after-the-fact. Interesting network events, such as network intrusions or outages, happen at computer timescales, with millisecond granularity and at arbitrary times. Exploring and resolving events by a research or security team happens at human timescales; Retro-Future bridges the two timescales, making rapid and efficient incident response possible.

Time travel is also needed for retrospective analysis to build an event timeline in light of new, additional information. Organizations can use this event timeline to understand how an attack propagated through their network (and bolster their defense), or assert that they were unaffected, a statement that many can't make today because of the lack of data.

Mechanisms: Retro-Future continuously indexes multiple network and system data types through timefind and allows users to run queries on downselected data in a given time range. timefind operates in tandem with existing packet and log capture systems, supporting indexing and search for 18 data types, including packet capture, Windows/Linux system log, and firewall data. By providing a simple and unified interface across heterogeneous data types, organizations can use timefind to quickly pull all relevant data together (e.g., DNS and email data for a phishing incident) in one query between two timestamps for closer inspection. Our implementation is open-sourced and freely available at [77].

With massive amounts of historical, archived data, our ability to time travel is essential to solving these applications, as we can quickly determine what data we have or don't have, and bring it together to solve our targeted applications. Time travel bridges the gap between interesting events (at computer timescales) and our ability to respond and act (at human timescales).
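The downselection idea can be sketched in Python as follows; this illustrates the concept behind timefind but is not its actual interface. We assume each archive file has been indexed with the first and last timestamps of the records it holds.

    # A minimal sketch of time-indexed downselection across data types.
    from datetime import datetime

    # index entries: (first_ts, last_ts, path), built once as data is archived
    INDEX = [
        (datetime(2017, 3, 16, 0), datetime(2017, 3, 16, 6),
         "pcap/20170316-00.pcap"),
        (datetime(2017, 3, 16, 6), datetime(2017, 3, 16, 12),
         "pcap/20170316-06.pcap"),
        (datetime(2017, 3, 16, 9), datetime(2017, 3, 16, 15),
         "syslog/20170316.log"),
    ]

    def downselect(start: datetime, end: datetime) -> list:
        """Return only the archive files whose time ranges overlap the query
        window, so a retrospective query never touches irrelevant data."""
        return [path for first, last, path in INDEX
                if first <= end and last >= start]

    # Pull everything relevant to an incident window in one query:
    print(downselect(datetime(2017, 3, 16, 7), datetime(2017, 3, 16, 10)))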
4.3.5 Securing the Retro-Future System

Because organizations are now collecting and storing sensitive data (network traffic, system logs, etc.) that was previously discarded, Retro-Future must be and remain secure, to avoid amplifying existing threats to data disclosure or introducing new ones.

Objectives: To meet an organization's security needs, Retro-Future's system security emphasizes and builds on data owners' full control over the system and data. Full control over data (thus eschewing storing data at cooperatives, escrows, or in the cloud) allows owners to manage the risks in data sharing. By storing and managing data locally, organizations control both its disclosure and the flexibility in choosing how it's shared.

Mechanisms: To secure Retro-Future and corresponding data access, we adhere to best practices, using standardized and widely deployed protocols for access control (client-side certificates and Kerberos, SSH/TLS transport). Securing local data archives is done with current best practices, including data encryption, aging (removing information over time), and anonymization.

4.4 Case Studies Quantifying the Benefits of Sharing

We next quantify the benefit of cross-site sharing in two scenarios: detecting botnet activity using a domain generation algorithm (DGA) based technique, and detecting malicious activity with DNS backscatter.

4.4.1 Detecting Bots and Botnet Activity

In our first case study, we look at how cross-site data sharing helps in support of detecting bots and botnet activity on local networks from their command and control (C&C) traffic. We will show why sites using DGA-based (Domain Generation Algorithm) botnet detection will benefit from Retro-Future's cross-site data sharing (Section 4.4.1.1). We first evaluate each site's individual ability in detection (Section 4.4.1.2), then show how sites can leverage data sharing to improve detection (Section 4.4.1.3) and their detection sensitivity (Section 4.4.1.5).

At the time of this writing, we have deployed Retro-Future for sharing botnet activity lists continuously at CSU, allowing users at USC, Los Alamos National Laboratory (LANL), and Northrop Grumman to query for CSU's botnet activity lists.

4.4.1.1 Problem Statement

Today, organizations can use BotDigger [139] to detect bot activity on a host by examining its DNS traffic for DNS access patterns that indicate botnet C&C traffic. BotDigger works well when run over data from a large organization (like CSU, which has access to campus-wide DNS data). However, it becomes much less sensitive for organizations that are not as large and diverse as CSU. For example, CSU is diverse in population, with its relatively open and permissive network having large volumes of network traffic from a number of users, applications, and hosts. CSU's traffic is also rich, with a wide variety of network protocols, operating systems, and user types (sysadmins, casual users, etc.). We also run BotDigger at USC, with a smaller and more homogeneous population (158 hosts, mostly Linux-based), and at Los Alamos National Laboratory (LANL), with many hosts (26×10^3) and less richness, because of a more secure (and therefore somewhat homogeneous) set of hosts.

Our hypothesis is that cross-site sharing with Retro-Future will allow larger, more diverse organizations (CSU) to share sensitive data securely with smaller (USC) or less diverse (LANL) organizations, helping the smaller or less diverse organization detect malicious activity happening on its network. CSU can help because it has a much higher chance of malicious activity happening on its network: CSU has greater diversity in population and richness compared to USC and LANL.

Retro-Future addresses privacy concerns in sharing sensitive data by providing the access controls and query system needed to control data disclosure. Collecting and classifying DNS data is privacy-sensitive because the collected data involves IPs of and metadata about end users, while processed data includes false positives and unvetted activity. Retro-Future's access controls and query system ensure that data is shared only with authorized users, and enable each site to selectively choose which data is shareable with whom, based on its risk tolerance: in the following case study, end-user IPs are not shared by CSU and cannot be queried for. With the Retro-Future system in place, CSU is then comfortable with the controlled sharing of botnet activity.

4.4.1.2 Can sites detect malicious activity on their own?

We first ask if sites can detect malicious activity on their own with BotDigger (without cross-site sharing). We run BotDigger over roughly 2 years (2016-01-01 to 2017-12-31) at CSU (39×10^3 people, 20×10^3 hosts, 5.2×10^9 queries/month), a subset of USC (about 50 people, 158 hosts, 46.2×10^6 queries/month), and LANL (10×10^3 people, 26×10^3 hosts, 3.3×10^9 queries/month). BotDigger first looks at DNS queries at each site, clustering and labeling queried domains based on their linguistic features, and then further classifies any set of at least 10 domains resolving to the same IP as suspect [139].

[Figure 4.1: Detecting suspect C&C domains and IPs at each site independently. (Zero-valued entries are not shown.)]
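The final thresholding step just described can be sketched in Python as follows, assuming (domain, IP) resolution pairs for domains already flagged as suspicious by the earlier linguistic clustering (which this sketch omits).

    # A minimal sketch of BotDigger's final thresholding step: flag any IP
    # with at least 10 suspect domains resolving to it as possible C&C.
    from collections import defaultdict

    THRESHOLD = 10  # minimum domains resolving to one IP to flag it

    def flag_suspect_ips(resolutions):
        """Group suspect domains by resolved IP; flag IPs over threshold."""
        by_ip = defaultdict(set)
        for domain, ip in resolutions:
            by_ip[ip].add(domain)
        return {ip: domains for ip, domains in by_ip.items()
                if len(domains) >= THRESHOLD}

    pairs = [(f"xj{i}qpt.example", "192.0.2.7") for i in range(12)]
    pairs.append(("www.example.com", "198.51.100.3"))
    print(flag_suspect_ips(pairs).keys())  # only 192.0.2.7 crosses threshold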
We then analyze the suspected C&C domains and IPs to evaluate BotDigger's efficacy (does BotDigger detect bots or botnet activity?) and look for commonalities across sites (does each respective BotDigger instance at each site detect the same botnet activity?).

Figure 4.1 shows us that sites can sometimes detect malicious activity on their own, depending on the diversity of richness and population of the site. CSU (green), a large organization with data and user diversity, is able to consistently detect suspect activity over time (avg. 6.67 detections/day), with several bursts of high activity (2016-07 to 2016-08 and 2016-10 to 2016-11). LANL, sitting between CSU and USC in diversity, detects far less (avg. 4.30 detections/day), with less consistency. Finally, BotDigger at USC has the least coverage (avg. 1.67 detections/day), with long periods of time (e.g., 2016-01 to 2016-05, 2016-10 to 2017-06) passing with zero detections.

The amount of network diversity (in population and richness) at an organization affects the amount of malicious activity detected: greater diversity correlates with more malicious activity. Although we see that LANL has a comparable number of hosts and queries as CSU, as a government lab, LANL has a much stronger security posture and greater centralization than a public university. Smaller organizations (like USC) may not have sufficient network data diversity to detect large-scale malicious activity. It may be that USC generally has no bots on the part we observe, or that BotDigger's algorithm isn't sensitive enough.

We next ask if sites detect more malicious activity when they use Retro-Future to share BotDigger's output of detected malicious activity with each other.

4.4.1.3 Does sharing help sites detect more malicious activity?

We next show how sites detect more botnet activity when using Retro-Future to share sensitive data. Here we focus on the benefits of sharing data from a large organization with data diversity (CSU) with less diverse (LANL) and smaller (USC) organizations.

To test if sites detect more malicious activity when they share, we take CSU's botnet activity lists and share them with LANL and USC using Retro-Future. Both sites then check whether their hosts have queried or interacted with hosts in CSU's lists, potentially revealing activity with undetected malicious hosts. Retro-Future is required to support sharing since CSU regards botnet detection data as sensitive and restricted for limited sharing only.

LANL and USC retrieve CSU's botnet activity lists containing the suspected C&C domains and IPs found by BotDigger (Section 4.4.1.2), published daily from 2016–2017. We then check if either site (LANL or USC) has queried or interacted with these hosts 30 days before or after the list was published, potentially revealing activity with previously undetected malicious hosts. (For example, when CSU published a list on 2017-03-16, the two other sites check their interactions during the period [2017-02-14, 2017-04-15].)

Sharing helps sites detect more malicious activity. In Figure 4.2, we visually see the benefits of CSU sharing with LANL and USC, focusing first on the region marked and highlighted 'A' (2017-02-14 to 2017-04-15). During this time period, LANL (middle graph, blue) and USC (bottom graph, red) detect more suspect activity with sharing (dark blue and dark red '+' symbols) than individually (light blue and light red points): the darker '+'s (which are present only when sharing helps) are higher than the lighter points on most days. (Zero-values are not shown for clarity.)
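The cross-site check can be sketched in Python as follows; the log format ((timestamp, host, query name) tuples) and the shared-list format are assumptions for illustration.

    # A minimal sketch of checking local DNS logs against a shared list of
    # suspect entries within +/-30 days of the list's publication date.
    from datetime import datetime, timedelta

    WINDOW = timedelta(days=30)

    def interactions_with_shared_list(local_log, shared_entries,
                                      published: datetime):
        """Return local hosts that queried a shared suspect entry near the
        publication date, revealing previously undetected activity."""
        lo, hi = published - WINDOW, published + WINDOW
        suspects = set(shared_entries)
        return {host for ts, host, qname in local_log
                if lo <= ts <= hi and qname in suspects}

    log = [(datetime(2017, 3, 20), "10.1.2.3", "xj7qpt.example"),
           (datetime(2016, 1, 1), "10.1.2.4", "xj7qpt.example")]  # too old
    print(interactions_with_shared_list(log, ["xj7qpt.example"],
                                        datetime(2017, 3, 16)))  # {'10.1.2.3'}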
[Figure 4.2: Detecting suspect C&C domains and IPs at each site individually (light-colored points) and using CSU's shared botnet activity lists (dark-colored '+'). A '+' is not shown if sharing did not help, and zero-valued entries are not shown.]

site                 | domains | %    | IPs | %
CSU                  | 1845    | 100% | 9   | 100%
LANL                 | 52      | 100% | 2   | 100%
  self (only)        | 10      | 19%  | 1   | 50%
  with CSU's sharing | 42      | 81%  | 1   | 50%
USC                  | 30      | 100% | 2   | 100%
  self (only)        | 0       | 0%   | 0   | 0%
  with CSU's sharing | 30      | 100% | 2   | 100%

Table 4.3: Number of domains and IPs detected as suspicious activity at each site independently (self) and with sharing. Data in this table corresponds to 'A' in Figure 4.2. Dates: 2017-02-16 to 2017-04-15.

We quantify the benefits of sharing numerically in Table 4.3 ("with CSU's sharing") for the same time period (marked 'A' in Figure 4.2). Most domains/IPs that we detected at LANL (81 %, 50 %) and all the domains/IPs detected at USC (100 %, 100 %) are due to CSU's data sharing. All suspect activity found with sharing was previously undetected when LANL and USC ran BotDigger individually on their own respective sites. We have shown that sharing helps sites detect more malicious activity, and we next ask how consistent the benefits of data sharing are.

4.4.1.4 How consistent are the benefits of sharing?

Having shown that sharing helps sites detect more activity (Section 4.4.1.3), we next show, through longitudinal observations, that sites consistently see the benefits of sharing, though with large variations.

Sharing consistently helps sites detect more malicious activity. Figure 4.2 shows that, during the two-year period (2016–2017), LANL and USC often see the benefits of improved detection ability from CSU's sharing. For example, we see that at LANL (middle graph, blue), the darker blue '+'s, representing the total activity found with sharing, are often higher than the lighter blue points, representing activity found individually. (On a given day, a '+' is not plotted if no additional activity was found with sharing, and nothing is plotted if there was no activity found.) We see similar benefits at USC (bottom graph, red).

During our longitudinal observation, we see that there are large variations in the benefits of sharing. For example, there are two time periods of relatively high activity, marked 'B' and 'C' in Figure 4.2. LANL detected an additional 52–57 ('B', 2016-07 to 2016-08) and 25–29 ('C', 2016-10 to 2016-11) IP/domain sets per day. USC sees similar results, detecting an additional 37–42 ('B') and 16–18 ('C') sets per day. At other times, such as in 'A', both LANL and USC detect, at most, an additional 1–2 IP/domain sets per day. These large variations suggest that sharing helps sites in either a significant or minor way, and rarely in between.
[Figure 4.3: Improving the sensitivity of BotDigger's detection with controlled data sharing between sites over time. Top: each of the IP/domain sets was previously undetected (negative result) at one of the sites in each pair. Bottom: each of the IP/domain sets was previously undetected (negative result) at all sites in each pair. (90-day moving window; zero-valued entries are not shown.)]

How much sharing helps is correlated with the number of detected botnet events: the chance of finding shared activity is higher when there are a lot of hosts (>30). For example, we find sharing significantly helps in periods 'B' and 'C' (Figure 4.2): USC finds that its hosts have interacted with roughly 80–100 % of entries in CSU's lists. At other times, like in 'A', sharing seems to have a less significant impact: LANL's hosts have interacted with 10–20 % of hosts found by CSU. We see similar results (both significant and minor) again at USC.

Finally, in order to realize these benefits of sharing, sites need to run detection and observe the results for a long period of time (>1 year). If our study were focused solely on a particular period of time ('A', 'B', or 'C' in Figure 4.2), the varying results would lead to exaggerated or underwhelming conclusions about sharing's efficacy.

Prior to data sharing, less diverse organizations like LANL and USC were not able to locally detect malicious activity (we saw earlier in Section 4.4.1.2 that USC, on many days, detects 0 domains/IPs). With sharing, sites can now leverage the diversity of other sites, augmenting their local capability in detecting botnet activity. We next examine if the larger, more diverse site also benefits from sharing.

site (color in Figure 4.4) | A   | B   | C
CSU (green)                | 3   | 4   | 1
LANL (blue)                | 7   | 9   | 0
USC (red)                  | 0   | 0   | 21*
Total                      | 10* | 13* | 22*

Table 4.4: Number of domains detected per IP/domain set; the sensitivity of BotDigger's detection is improved with controlled data sharing. '*' denotes that the entry passes the detection threshold. Data in this table corresponds to Figure 4.4.

4.4.1.5 Can sites improve their detection sensitivity when they share?

We next demonstrate how sites can improve the sensitivity (recall, or true positive rate) of malicious activity detection when they share data with one another. False negatives can occur when potential botnet activity (C&C domains and IPs) identified by BotDigger falls under the threshold of fewer than 10 domains (as set in Section 4.4.1.2 and [139]) resolving to the same IP.

Sites can reduce false negatives by sharing and merging each other's BotDigger results and identifying any positives: suspect activity that meets the threshold. Each site exchanges with the others botnet activity lists containing C&C domain and IP pairs that fall below the threshold, over a 90-day sliding window. Each site then combines the exchanged lists with its own results and checks if any C&C domain/IP pairs cross the threshold, thus revealing previously undetected activity.

The top graph in Figure 4.3 shows that over a two-year period, each of the three sites (CSU, LANL, USC) can improve its detection sensitivity when they share.
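A minimal Python sketch of this merge-and-redetect step follows, assuming each site's below-threshold candidates are represented as a mapping from IP to the set of suspect domains it has observed.

    # A minimal sketch of improving sensitivity by merging candidates: the
    # union of per-site domain sets can push an IP over the 10-domain
    # threshold that no single site reached alone.
    THRESHOLD = 10

    def merge_and_redetect(*site_candidates):
        """Union per-IP domain sets across sites; return IPs over threshold."""
        merged = {}
        for candidates in site_candidates:
            for ip, domains in candidates.items():
                merged.setdefault(ip, set()).update(domains)
        return {ip: doms for ip, doms in merged.items()
                if len(doms) >= THRESHOLD}

    csu = {"192.0.2.7": {f"a{i}.example" for i in range(6)}}   # 6: below
    lanl = {"192.0.2.7": {f"b{i}.example" for i in range(5)}}  # 5: below
    print(merge_and_redetect(csu, lanl))  # 11 combined domains: now detected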
Prior to sharing, one site in each pair (there were no commonalities between all three sites simultaneously) would have missed up to 15 sets of domain/IP pairs as false negatives. With data sharing, the combined number of domains resolving to each IP per domain/IP set reaches the minimum threshold of 10 domains per IP, and these sets are re-identified as possible suspect C&C activity. We see sharing's greatest effectiveness with 15 additional sets between the two diverse sites, LANL (population) and CSU (population and diversity). Sharing is also effective between sites where one or both are less diverse, as the other pairs ({CSU, USC} and {LANL, USC}) detect between 1–5 additional suspect C&C domain/IP sets.

[Figure 4.4: The sensitivity of BotDigger's detection is improved with controlled data sharing. With sharing, all three domain/IP sets meet or pass the detection threshold (# domains >= 10). Each IP/domain set label ('A', 'B', 'C') corresponds to the same annotation in Figure 4.3.]

The bottom graph in Figure 4.3 reinforces sharing's effectiveness by highlighting suspicious activity missed at both sites (possible false negatives) in each pairing. Prior to sharing, these domain/IP sets were marked as candidates for suspect activity but fell below the threshold for detection at both {CSU, LANL} and {LANL, USC} (as we saw earlier, there were no commonalities across all three sites). After sharing, each site pair detected 1–2 previously undetected suspect C&C domain/IP sets (1–2 at {CSU, LANL} and 1 at {LANL, USC}), moving them from a negative to a positive detection result.

We further visualize and quantify the benefits of sharing by looking at three particular examples labeled 'A', 'B', and 'C' in Figure 4.3, which correspond to the same labels in Figure 4.4 and Table 4.4. Prior to sharing, one (C) or both sites (A, B) would have missed each corresponding domain/IP set as a false negative. With data sharing, the number of domains resolving to these 3 IPs (A, B, and C) reaches the minimum threshold of 10 domains per IP, and they are re-identified as possible C&C activity for further follow-up at each site.

Surprisingly, we see that even large, diverse organizations can benefit from sharing data. Prior to sharing, CSU would not have detected C as suspect, with only one domain resolving to that IP. After exchanging botnet activity reports with USC and combining USC's results with its own, CSU now flags C as suspect, adding 1 additional IP/domain pair (+10 % additional entries in its aggregated botnet activity list for the period [2017-06-05, 2017-07-20]).

Organizations with all levels of diversity in their network traffic can benefit from sharing data by improving their detection sensitivity in botnet activity detection. In this final part of the case study, all sites benefited from sharing their respective BotDigger output, detecting more suspect activity that had previously fallen beneath the detection threshold.

4.4.2 Finding Malicious Activity with DNS Backscatter

In our second case study, we look at how cross-site data sharing helps in support of finding network-wide malicious activity with DNS backscatter. DNS backscatter classification is the process of identifying someone (the originator) that touches many Internet hosts, based on the reverse DNS queries their behavior elicits from the hosts they contact [47].
We detect their behavior by watching for reverse DNS queries (the "backscatter" from their behavior) at an authoritative DNS server (the authority). Reverse DNS queries are generated by queriers: firewalls, spam filters, and similar computers that try to identify the originator's behavior.

We describe, through an example, how malicious activity like spam generates DNS backscatter. An email spammer (originator) sends spam email to many hosts. Host firewalls and email servers (queriers) will look up the reverse DNS name of the originator's source IP address when processing email. This reverse name query is sent to a recursive resolver, and is eventually handled by the authoritative nameserver that holds the DNS record (mapping) of an IP address (1.2.3.5) to a hostname (bad.example.com).

Processing DNS backscatter has two phases: detection of large originators, then their classification using features of the queries and queriers. Classification for IPv4 uses machine learning to tune parameters in a Random Forest model, while IPv6 uses human-defined rules; we focus on IPv4 in this section.

We will show why sites using DNS backscatter will benefit from Retro-Future's cross-site data sharing (Section 4.4.2.1). We then evaluate a site's individual ability in detection and classification (Section 4.4.2.2) and show how sites can leverage data sharing to detect more malicious activity (Section 4.4.2.3).

4.4.2.1 Problem Statement

DNS backscatter depends on data observed at a recursive resolver for a large ISP or organization, or at some authoritative nameserver handling reverse DNS requests, such as a root or top-level (.com, .jp, etc.) authority. Root nameservers (serving in-addr.arpa and 1.in-addr.arpa, for example) potentially see all originators, but caching and root nameserver selection algorithms will attenuate the actual numbers of queriers seen. The final authority (3.2.1.in-addr.arpa) would see all queriers, but only for the originators in its address space; it would not see, for example, originators in 1.2.4.0/24. For example, LANL is only the final (reverse) authority for its own IP address space, and correspondingly will only detect originators in its IP ranges. Processing backscatter with only data from a final authority limits visibility in finding network-wide activity. (LANL also runs a recursive resolver for its hosts: we thus analyze data from both its recursive resolver and nameserver.)

Our hypothesis is that cross-site data sharing between a root and a large end-organization with Retro-Future will help DNS authorities detect previously unknown activity (malicious originators). Cross-site sharing enables authorities at different levels (root, top-level, final) to access richer data sources to improve detection and classification.

Nameservers closer to or at the root can see additional originators that lower authorities may have missed because of attenuation, enabling a greater view of Internet-wide activity. These originators can also be used to improve the accuracy of the classifier, by providing additional inputs for training or retraining. When an authority shares its classified originators with a final authority, the final authority can use these results to help with detecting malicious activity in its own address space (especially if human or computational resources are scarce), or to take steps to prevent attacks from originators outside its address space (preemptive blacklisting).
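The detection phase reduces to counting unique queriers per originator; a minimal Python sketch follows, assuming reverse-query logs already reduced to (originator IP, querier IP) pairs (real processing must first parse PTR query names back into addresses).

    # A minimal sketch of DNS backscatter detection: keep originators whose
    # reverse name was looked up by enough distinct queriers.
    from collections import defaultdict

    def detect_originators(reverse_queries, threshold: int):
        """Return originators with enough distinct queriers to make
        classification feasible in the next phase."""
        queriers = defaultdict(set)
        for originator, querier in reverse_queries:
            queriers[originator].add(querier)
        return {o: len(q) for o, q in queriers.items() if len(q) >= threshold}

    logs = [("203.0.113.9", f"198.51.100.{i}") for i in range(1, 8)]  # 7
    logs += [("203.0.113.10", "198.51.100.1")]                        # 1
    print(detect_originators(logs, threshold=5))  # only 203.0.113.9 detected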
4.4.2.2 Can sites detect and classify originators on their own?

To examine if sites can detect and then classify originators in DNS backscatter, we first consider detection at sites on their own, without cross-site sharing. We show that only sites with sufficient diversity can detect and classify originators on their own, while smaller organizations cannot. We then show that, with data sharing, these smaller enterprises can also detect and classify originators.

Detection: Detecting originators is the first step in processing DNS backscatter, and requires a certain amount of diversity in queriers to make classification feasible in the next step. We analyze DNS backscatter at B-root (root authority; 2.81×10^9 queries, 3.13×10^8 reverse queries over 36 hours) and LANL (final authority and recursive resolver; 5.79×10^9 queries, 3.71×10^8 reverse queries over 1 month). We set varying thresholds at each site (Table 4.5), requiring a certain number of unique queriers for an observed originator to be detected. We then analyze detected originators (is there a sufficient number of originators?) and look for commonalities across sites (are the views between different nameservers unique?).

dataset (2014) | start | duration | window   | thr. | queries, all (10^9) | reverse (10^9) | originators, all (10^5) | detected
B-root-ditl    | 04-28 | 36 hours | 36 hours | 15   | 2.81                | 0.313          | 2000                    | 11 108
LANL (final)   | 04-01 | 1 month  | 30 days  | 5    | 5.79                | 0.371          | 4.28                    | 16 345

Table 4.5: Datasets used in processing DNS backscatter.

Table 4.5 shows that sites can detect originators on their own. At B-root, we detected 0.06 % (11 108) of originators as analyzable. At LANL, we detected 3.82 % (16 345) of originators. We generally don't see a significant percentage of originators at a root authority due to the effects of caching at the levels below the root, leading to an attenuation of observed originators. We also see an overlap of 856 detected originators across sites, which is not surprising as it's fairly easy for a prolific originator to reach many targets. However, the view at each nameserver is mostly unique, which we believe is true of most sites today: this allows us to later test if there are benefits to data sharing when "combining" each site's view. (If sites had the exact same view, then the benefits of data sharing would be realized in work deduplication: only one site needs to process DNS backscatter and the results would apply to all others.)

Classification: Classification with machine learning is the second and final step in processing DNS backscatter. We use ground truth of known-good/bad as labeled data to train the classifier, then classify originators into two application categories: malicious (spammers/scanners) and benign (CDNs, mail servers, etc.). We then classify, at each site, the Top-N detected originators (those with the most unique queriers).

B-root is able to classify originators on its own, finding 4323 (43.2 %) originators to be malicious and 5675 (56.8 %) to be benign. LANL is initially unable to classify originators on its own, because it doesn't have enough resources to build the labeled dataset required for training. Classification using machine learning requires extensive training on accurate labeled data (ground truth of known-good/bad) and is human-intensive, requiring an expert to make the good/bad determination. Additionally, the training weights generated from labeled originators have a useful lifetime of one month, after which the labeled data and corresponding weights must be regenerated.

                      |       | benign                                              | malicious
dataset               | total | ad  | cdn | crawl | dns | gcloud | mail | mes | ntp | scan | spam
B-root-ditl (self)    | 9998  | 116 | 515 | 316   | 110 | 311    | 3630 | 456 | 221 | 1844 | 2479
LANL (self)           | 0     |     |     |       |     |        |      |     |     |      |
LANL (w/ B-root wts.) | 10000 | 68  | 4   |       | 339 |        | 1    |     |     |      | 9588

Table 4.6: Number of originators in each class for all datasets.
[Figure 4.5: Fraction of originator classes of Top-N originators.]

Using others' training data: As LANL is initially unable to do classification on its own (Table 4.6, LANL (self)), we ask if LANL can perform classification with another site's (B-root's) help through data sharing. B-root shares its training weights with LANL, and LANL then uses B-root's weights to train its classifier and classify originators at LANL.

After B-root's sharing, LANL is now able to classify its detected originators. Table 4.6 shows the results of classifying the Top-N originators, with both sites using the same training data for their classifiers. LANL finds 9588 (95.9 %) originators to be malicious (spam) and 412 (4.12 %) to be benign. Although each site has a mostly unique view of originators, we do find some commonality, with 640 of the same originators (in the Top-N) at both sites. Both sites agree on the classification of 25 originators (all spam), and disagree on all others, showing that the training data is tailored or biased towards the original training site (future work might evaluate making the training process easier and more universal across sites).

We've shown that while sites can detect originators on their own, they also require good training data to classify originators. LANL, initially unable to perform classification, is now able to classify its originators due to B-root's sharing. Now that sites have processed DNS backscatter, we next ask if sharing the results can help sites find more malicious activity.

4.4.2.3 Does sharing the results of processed DNS backscatter help sites find more malicious activity?

Having shown that sites can detect and classify originators on their own, we next demonstrate how sites can find more malicious activity when sharing their results.

To test if sites find more malicious activity when they share, each site (B-root and LANL) shares the malicious originators it earlier classified with the other. Each site then checks whether the originator appeared below the Top-N originators, revealing previously missed activity. Each site also checks if there was any interaction with the malicious originator (hosts querying about or receiving an answer with an originator, or an originator making queries to a site's authority), potentially revealing interactions with previously undetected originators.

B-root and LANL retrieve a list of malicious (scan or spam) originators from one another, and check for interactions with these originators in their own data (Table 4.5). We remove the commonalities in classified originators between both sites, ensuring that we accurately find new, additional malicious activity (we do not recount what was previously detected): B-root shares with LANL a list of 3947 malicious originators, and LANL shares with B-root a list of 8966.

Sharing helps sites find additional malicious activity. We quantify the benefits of sharing in Table 4.7, showing the results of processing DNS backscatter individually (self) and with sharing.
With sharing, we see that each site sees an increase in originator activity: B-root with LANL's sharing sees 3611 additional malicious originators (accounting for 45.5 % of total activity seen at B-root), and LANL with B-root's sharing sees an additional 1830 (16.0 % of total activity).

site and description                                          | malicious originators | %     | %
B-root-ditl                                                   | 7934                  | 100.0 |
  self (only)                                                 | 4323                  | 54.5  |
  w/ LANL's sharing                                           | 3611                  | 45.5  | 100.0
    originators below Top-N                                   | 3389                  |       | 93.9
    interactions with originator                              | 222                   |       | 6.1
      originator sends a direct query/response to site        | 222                   |       |
      host at site receives DNS record containing originator  | 0†                    |       |
LANL (d = 30, threshold 5)                                    | 11418                 | 100.0 |
  self (only)                                                 | 0                     | 0.0   |
  self (w/ B-root's training weights)                         | 9588                  | 84.0  |
  w/ B-root's sharing                                         | 1830                  | 16.0  | 100.0
    originators below Top-N                                   | 1417                  |       | 77.4
    interactions with originator                              | 413                   |       | 22.6
      originator sends a direct query/response to site        | 74*                   |       |
      host at site receives DNS record containing originator  | 354*                  |       |
† root authorities don't serve final DNS records
* indicates overlap

Table 4.7: Finding more malicious activity with the sharing of processed DNS backscatter.

We further break down in Table 4.7 the additional activity that was found, into two categories: originators below the Top-N, and interactions with originators.

At each site, we see that most of the additional originators found due to sharing are originators that were observed but fell below the Top-N (93.9 % at B-root, 77.4 % at LANL). Since these observed originators originally fell below the threshold at each respective site, these newly "discovered" originators could be candidates for another pass at classification or tagged for monitoring.

The remainder are originators that directly interacted with a site or its hosts in some way: by sending queries or responses to the site, or by appearing in a DNS record returned in response to a host's query. These originators can be tagged for additional scrutiny at each site. A tagged spammer or scanner could be preemptively blacklisted to prevent DoS attacks, or a host could be analyzed for malware or other indicators of compromise if it makes DNS queries about and connects to a malicious originator.

Finally, we note that the significance of the results in sharing is affected by the quality of the classified originators at each site. We believe that sharing from B-root to LANL is more significant than the reverse because of the careful and resource-intensive work in building a labeled dataset and training the machine learning classifier. In the reverse case (LANL sharing to B-root), the quality of results could be improved with a site-specific labeled dataset and training. We believe this sharing is still beneficial as it enables B-root to discover and classify additional originators it previously missed due to a lack of observations. Future work will explore how sites can collaboratively build and train a more generalized classifier with sharing.

Organizations can benefit from sharing processed DNS backscatter data by finding more malicious activity. With sharing, sites can now leverage the diversity of other sites, enabling them to combine their respective views of DNS activity to discover new, additional malicious activity.

4.5 Related Work

There have been many efforts and much work done in enabling and promoting Internet data sharing. We build on prior experience in information sharing frameworks and data collection.
Data Sharing Frameworks: Several logical frameworks for enabling and implementing data sharing have been proposed in prior work, outlining privacy, usability, and utility considerations in developing policies for data sharing.

Allman and Paxson, recognizing the prevalence of ad-hoc data sharing in the research community, proposed a set of high-level considerations for data sharing through "Acceptable Use" policies [4]. While they primarily consider Internet measurement data, these policies can be applied to cybersecurity incident data shared by a given organization. Retro-Future provides the mechanisms that can be used to implement and support data sharing in conjunction with these Acceptable Use policies, for example, by refusing queries from a requester who violates policy.

Kenneally and Claffy proposed a Privacy-Sensitive Sharing Framework (PS2) that seeks to balance the risks that can occur with data sharing with privacy management [27]. Their framework enumerates the principles that a data sharing component should have, and challenges the assumption that the privacy risks of sharing data outweigh the benefits; they show that PS2 enables their organization, CAIDA [21], to realize utility goals in a risk-sensitive manner. Organizations can use PS2 as a guide to create policies and agreements with others, and use Retro-Future to enforce such policies in data sharing. We have also quantified the benefits of data sharing using Retro-Future in case studies in detecting DGA-based botnet activity and finding Internet-wide malicious activity using DNS backscatter.

In prior work, we enumerated the privacy principles and corresponding engineering approaches for sharing cybersecurity data across organizational boundaries, recognizing the risk and benefit trade-off and the need to balance risks in disclosure with making forward progress in research and solving operational problems [41]. The Retro-Future system uses these principles and engineering techniques to implement a system for controlled information exchange across organizations, and quantifies its benefits with its case studies in malicious activity detection.

Data Collection, Storage, and Retrieval: There is much work in network data capture and collection, from the user level [86] (used for capturing traffic at a specific node) to the network level [116, 6, 73]. Prior work has looked at both efficient (using deduplication or removing redundancy [99]) and secure (using encryption) capture and storage of network traffic for long-term storage and retrieval, especially in the context of intrusion detection [97] and network security analysis [83]. Retro-Future builds upon this work by looking at data capture, collection, and "time travel" in the context of, and with the explicit purpose of, sharing with other, outside organizations. In addition to supporting generalized time travel across heterogeneous data types, Retro-Future shows how time travel can resolve cross-site data sharing on human timescales, and is a necessary component for retrospective analysis when new information is acquired. Retro-Future encourages organizations to collect, archive, and use their network traffic and system log data, providing the mechanisms needed to share and use data with others in the context of collaborative (across organizations) intrusion detection and network security analysis.
4.6 Conclusions

This chapter described our steps towards formalizing and regularizing cross-site information sharing, providing the sharing mechanisms in the Retro-Future system and quantifying the benefits of data sharing in the context of botnet detection and finding Internet-wide activity with DNS backscatter. Retro-Future is our framework and system that provides post-event understanding with time travel, and enables controlled sharing with cross-site queries through query moderation and controlled data disclosure. We used Retro-Future in two case studies on DGA-based botnet detection and malicious, Internet-wide activity detection with DNS backscatter, showing how sharing cybersecurity data enabled sites to detect more malicious activity on their networks and improve the sensitivity of their detection algorithms.

This chapter supports our thesis statement by showing how we can use data sharing in Retro-Future to improve an organization's network security by finding additional malicious network activity. Retro-Future provides the framework and tools needed for organizations to conduct a controlled exchange of previously-private network information with collaborators within their own and across other organizations. When organizations share data with each other, each participating organization increases the effectiveness of its local detection of malicious activity.

In the next chapter, we will further support the thesis statement by showing how one can improve their network security (and that of their friends) by using data sharing to build a collaborative defense against phishing attacks. We have previously shown that providing the controls in Retro-Future to manage the risk-benefit trade-off helps enable data sharing at organizations, and that data sharing helps organizations find more malicious activity. We will next show how the same concepts and controls from Retro-Future help enable data sharing in AuntieTuna, which will help prevent users and their social circles from falling victim to phishing sites by inoculating them with the known-good, legitimate site beforehand.

Chapter 5

Building a Collaborative Defense to Improve Resiliency Against Phishing Attacks

In this chapter, we present AuntieTuna-Schooling, the next evolution of AuntieTuna, a web browser extension that proactively detects phishing sites, extended with friend-to-friend data sharing. The previous chapters on an earlier prototype of AuntieTuna and on data sharing in Retro-Future motivate bringing the two together to protect users and their friends from web phishing attacks.

This study of AuntieTuna-Schooling partially supports our thesis statement. AuntieTuna-Schooling improves one's network security by improving phishing defenses with preemptive filtering that leverages the user's social circles and proactive phish detection. Users exchange previously-private network information about the known-good, legitimate websites they use with collaborators or friends to collectively build a defense. We leverage the commonalities in web browsing that users have with each other and will quantify the benefits of "inoculation" through friend-to-friend sharing. As users browse unknown sites, AuntieTuna-Schooling's personalized and local detection will find and prevent access to malicious websites.

5.1 Introduction

Individuals are at risk of phishing attacks, with the consequence of financial loss and theft of personal data or intellectual property.
The risk of successful phishing is particularly high at home and in small organizations, where there is limited technical expertise for defense. The risk of phishing is also high in large organizations, which have greater assets at risk.

Large organizations protect themselves with dedicated security personnel and the deployment of mechanisms such as Single Sign-On (SSO) across their services. SSO improves security in many ways (improved usability and management [69]) by providing a common method of authentication across multiple web-based services. However, it also presents a tempting target for phishing [137], since one compromise can open all of the organization's resources [124]. A compromise can even affect other services outside of the organization due to a high chance of password reuse [31]. Users are then easily targeted, as SSO teaches users to enter their organizational username and password when presented with a familiar SSO dialog or portal. Even riskier are organizations that use SSO mechanisms but with varying, non-standardized interfaces: users can become further accustomed to entering their credentials whenever prompted by any site plausibly related to the organization (Section 5.5).

Ad-hoc organizations, like political election campaigns, are at even greater risk than well-established organizations. Campaigns are noted for fluid membership with many volunteers; rapidly changing, ad-hoc infrastructure for data sharing; and under-resourced cybersecurity defenses. These organizations are high-value targets for phishing, even by nation-state-level actors [57, 108, 96].

Today's defenses against phishing often lag behind attacks. Training and warning emails about phishing are either general and easy to forget [121], or specific warnings that can be sent only after an attack is well underway. Preemptive phishing defenses, such as browser-wide blacklists (for example, Google's Safe Browsing API), can be hours to days behind [55] and still must chase attackers that rapidly set up new websites at low cost.

This chapter proposes improving phishing defenses with preemptive filtering that leverages the organization's social circles. We encourage users to independently identify sites they use to authenticate, building a whitelist of sites that are known good and that phishing attackers may attempt to replicate. We then provide tools to share these whitelists, through peer-to-peer or centralized methods, with peers in the same circle or group. Users then use an existing tool, AuntieTuna, to identify when an untrusted website attempts to impersonate a known website with a phishing attack.

The contributions of this chapter are to support the thesis statement (described at the beginning of Chapter 5), to identify and quantify the risks of SSO as a target for phishing, to describe data sharing methods to protect against phishing, and to quantify the protective benefits of sharing against phishing sites.

Our second contribution is to identify SSO both as a defense against and as a new target for phishing (Section 5.2). SSO raises new risks because it is an attractive phishing target that may expose hundreds of services at large enterprises. We quantify this risk by first examining specific examples at a university (Section 5.5) and then measuring the "surface area" and growth of SSO (Section 5.6.1).
Our final contribution is to propose a solution using peer-to-peer and centralized data sharing in order to share information about known-good sites and promote a collaborative inoculation against phishing (Section 5.4). This defense builds on our prior work with a hash-based anti-phishing plugin for web browsers, AuntieTuna (Chapter 3). We show that our collaborative defense can be successful, even with relatively modest sharing (Section 5.6.2).

5.2 Problem Statement and Threat Model

We first describe the phishing threats that attackers use against end-users and their organizations. These threats lead to our problem statement and how AuntieTuna provides the solution to our phishing problem.

5.2.1 Target User Population and Their Attackers

Our target user population consists of individuals and their friends, in the context of using online web services at home and work. Our secondary targets are the organizations that the previously mentioned users belong to.

Our attackers will initially target the individual users at any organization. For example, targets at a university will access online services for educational or work use (classes), and also for personal use (clubs or recreation), both at home and school. Many of these online services are run by first- or third-party organizations, and often require first-party authentication (we describe this in further detail in Section 5.6.1).

Attackers want to steal users' credentials for malicious gain. Attackers can use the victim's personal or financial information to steal money, or to steal intellectual property at universities or companies. They can also use stolen credentials to gain access to internal systems (stepping stones), or conduct additional phishing campaigns from inside the organization (lateral phishing [61]). We will next describe how attackers can steal users' information using phishing.

5.2.2 Threats and Defenses

Phishing is the attempt to obtain sensitive (and often personal) information by pretending to be official or legitimate. Phishing sites, websites that masquerade as legitimate sites to trick potential victims into sharing information like passwords or banking details, are a serious threat to our users. Our goal is to develop a countermeasure to phishing sites by detecting whether an unknown site a user visits is suspected to be phish. Framed around the STRIDE model [72], phishing is a spoofing threat, and our countermeasure will determine the authenticity of a phishing site.

Our goal is to protect against phishing sites that look and feel like the original site and target users at a large university or enterprise. Users can fall victim to these phishing sites as the phish are often visual duplicates that copy content exactly from the original, while behaving in a realistic way: we will see in Section 5.5 how one phishing site pretends to be a university's Single Sign-On portal and evades a second glance as it ultimately redirects to a legitimate site.

The countermeasure against this threat needs to determine whether a visited site is a suspected phish or not and ensure that the user does not lose their credentials, while minimizing false positives that would irritate the user. Because phishing sites are quickly created and ephemeral, the solution should also be proactive in its detection and minimize dependencies on remote resources or computation. Finally, the countermeasure should require minimal configuration and work silently in the background so as not to annoy the user.
Given our discussion of our target users and the threats against them, we next formally define our phishing problem and solution.

5.2.3 Problem Statement

As organizations outsource core and sensitive functions to third parties that adopt the organization's first-party Single Sign-On (SSO) authentication flow, it is difficult for users to know which party has the right to a user's credentials. This mix of many parties with many services can leave users vulnerable to SSO phishing and credential-stealing threats [137].

SSO is beneficial for users—it provides a consistent process to log in, aiding usability. It is easier for users to use one set of credentials to access many different services than to use separate accounts for each service. However, users may become habituated by SSO to enter their organizational passwords whenever they see its logo or login forms, making them vulnerable to phishing threats [145]. Users are vulnerable because they don't always know which services are legitimate, as there are many services for many groups: users may thus assume that sites using the SSO portal are trustworthy. Once an attacker successfully phishes a user's passwords, the attacker can access all SSO-enabled sites as the victim (and possibly many other sites, due to password reuse).

AuntieTuna provides the countermeasure to the phishing threat by detecting and protecting users against phishing website attacks. AuntieTuna detects phishing websites by looking for the "right content" (like website logins and SSO portals) in the "wrong place" (unknown websites being visited). We further augment AuntieTuna with data sharing between friends, which bootstraps and inoculates users and their social circles with the "right content", quickly protecting users before they can get phished.

Our users will benefit from AuntieTuna's protection because AuntieTuna automatically protects users from phishing with minimal configuration ("set it and forget it"). While our users are knowledgeable about using the web, and might even be trained in safe web browsing, they may not always be able to identify phish (we discuss in Section 5.5 actual examples of phish that look legitimate, and legitimate emails that look like phish).

Our users can also benefit from collective immunity through data sharing—they gain additional protection because of their common or shared interests with other users. We leverage these commonalities to bootstrap trust in data sharing between friends and peers. For example, users in the same research lab or on the same sports team can effectively share data with each other because they use many common services, improving their collective defense. We will show in Section 5.6.2.1 that even the group that encompasses the university in its entirety benefits from data sharing between its members.

5.3 Related Work

There have been many efforts in anti-phishing, in data sharing, and in assessing the risks of Single Sign-On (SSO) systems. We build on prior experience in detecting phish (including user education), data sharing, and analyzing SSO protocols and systems.

Phish Detection: Detecting phishing sites can happen locally or with help from remote resources. URL blacklists and webpage heuristics [95, 127, 90], and machine learning [54, 81, 3], can be used for detection, but their effectiveness varies [134]. Blacklists can perform poorly even when kept updated [143], and the false positive rate and high computational requirements of machine learning make usability difficult.
AuntieTuna proactively and precisely detects visited sites as possible phish based on the appearance of known-good content at other locations, using hashing techniques—we previously evaluated AuntieTuna's accuracy in Chapter 3.

User Education in Anti-Phishing: Several studies have looked at understanding why users click on links to phishing websites in phishing emails, with the goal of prioritizing responses to phish that are likely to be clicked on [56] or of aiding the development of training for users [119]. Although users are trained to look at security indicators like the "green lock" (indicating HTTPS) and the website's domain in the browser's address bar, multiple studies [13, 137] have observed that users do not look at these indicators and sometimes could not find features that distinguish between legitimate and phishing login pages [121]. Herley et al. [59] found that the mental cost of frequently evaluating these indicators (URLs, the green lock) exceeds its benefits; users often perceive the consequences of getting phished as low and ignore warnings. Today, it is free, easy, and encouraged for all websites to get a TLS certificate [67]. While communication between a client and server is secured, the use of HTTPS and its indicators now says little about a site's validity, as many phishing sites are also secured [38]. AuntieTuna augments user education by automatically preventing phishing attacks with minimal user configuration. When a phish is detected, AuntieTuna provides general information about phishing sites and suggests the original site that the user likely wanted to visit (based on the content of the suspected phish).

Data Sharing: The benefits of sharing have been studied in detecting and preventing attacks in the context of both network security [46, 45, 64, 140] and anti-phishing: we focus on the latter. There are both centralized and decentralized methods for the use and distribution of anti-phishing data.

Browsers can send URLs to centralized services like the Google Safe Browsing API and Microsoft Defender SmartScreen, which check if a URL is suspicious based on proprietary heuristics and crowdsourcing. While these phishing blacklists can be effective [112], updates can take hours to days [55], leaving users vulnerable in the meantime. AuntieTuna correspondingly supports centralized distribution of known-good lists, but the same delays apply, as a central authority is needed to process and update the lists.

Decentralized solutions enable the exchange of anti-phishing databases, but often require centralized services to coordinate the exchange. Nourian et al. [92] developed CASTLE, a peer-to-peer database framework, similar to a distributed hash table, which allows lookups of URL- and content-based blacklists of phish. To distribute the workload, the domain name space is split into subsets, with servers that are run by a trusted "social network" assigned responsibility for manually maintaining the blacklists of their designated subset. AuntieTuna enables the peer-to-peer exchange of known-good lists between users and their social circles and performs all checks against known-good lists locally in the client browser, without requiring centralized services for coordination or lookups.

Viecco et al. [128] developed Net Trust, a browser toolbar that displays the overall credibility rating of a website based on ratings from friends.
These ratings can be implicit, based on passive behavior (how often a site was visited by friends), or explicit, like a numerical value or comment given by another user. Net Trust users then create and join limited social groups to securely exchange their browser histories and ratings with everyone else in that group via centralized services. AuntieTuna uses a similar notion of rating sites and sharing with social groups: users add the original, known-good site to their lists (a positive rating in Net Trust) and then securely and directly share their list of known-good sites with their social groups. Phishing sites pose a challenge to Net Trust: users are vulnerable to phish until someone in the social group visits a suspicious site, and then identifies and rates the site as phish. Consequently, there is an additional challenge in requiring users to continuously check Net Trust's security indicator for every site they visit. AuntieTuna resolves these challenges and avoids the aforementioned delays, as AuntieTuna only requires users to identify "good" sites or friends to share and receive known-good data with. AuntieTuna then uses the known-good as a preemptive filter against phish as it continuously looks for and blocks access to suspicious sites that contain content from known-good.

Security of Single Sign-On (SSO): We look at prior work specifically in Web SSO (used interchangeably with SSO), in contrast to other network authentication like Kerberos. One category of related work analyzes SSO protocols and systems for vulnerabilities [130, 14, 51] such as Cross-Site Scripting (XSS), replay, or credential hijack attacks, and assumes that end-users have sufficient knowledge to identify phish or are protected against phishing attacks. Our work focuses on protecting the end-user's use of and experience with SSO: we look at how a user who is accustomed to the SSO process becomes more susceptible to phishing attacks, and develop a defense to protect against such attacks (Section 5.4).

Prior work has examined the challenges of adopting a "universal" SSO (OpenID or other OAuth providers) for use on many Internet services. Although Sun et al. [121] found that privacy concerns and the lack of a compelling business model would reduce the rate of SSO's adoption on the Internet, in Section 5.6.1 we show that SSO is widely used today at universities and enterprises for convenient access to a variety of first- and third-party services.

Yue [137] identifies that SSO provides an attractive and enlarged surface area for attackers: SSO credentials for large services like Google, Facebook, Microsoft, and Yahoo provide concentrated value, as they contain a user's private data and can be used to access many other services. We confirm that in an enterprise, SSO also has a large surface area and its credentials provide concentrated value. Our work is the first to quantify the surface area and growth of SSO-enabled services in an enterprise context. We show that almost all first- and third-party services at multiple universities are SSO-enabled (Section 5.6.1.1). We also measure the longitudinal growth of online services, finding that the number of SSO-enabled services grows 29.1 % every year (Section 5.6.1.3).

While SSO provides a convenient login process for users, studies have found that SSO does not reduce a user's susceptibility to phishing attacks [145, 137], and a successful phishing attack against a user potentially affects all services they use.
Multiple studies have looked at how attackers can use SSO credentials to take over a user's entire online identity [124] and even maintain long-term access by associating alternative credentials with another email account under an attacker's control [51]. Similarly, an attacker can use the same password used with SSO to laterally access other, independent services: Das et al. [31] found that passwords are likely to be reused across multiple sites. Given these consequences, AuntieTuna helps protect the user against SSO phishing sites by leveraging an enterprise's use of SSO as part of the user's defense. For example, by tracking the content of the SSO portal and looking for that content in the "wrong" place, we can detect and protect users from falling victim to SSO phishing sites (Section 5.6.2.1, Section 5.6.2.2).

5.4 Improving Network Security with Anti-Phishing and Data Sharing

We next describe our approach to improving network security at home and in the enterprise with the proactive detection of phishing sites and data sharing between friends.

We protect users from phishing sites with AuntieTuna, a web browser extension that detects phish by looking for "known-good" content on unknown sites (Section 5.4.1). Users then share their data with their friends, inoculating them with known-good and improving their group's collective immunity (Section 5.4.2).

5.4.1 Anti-Phishing with AuntieTuna

We protect users from phishing with AuntieTuna, a web browser extension that proactively detects phishing sites as the user browses. We described an early version of AuntieTuna in Chapter 3; here we summarize its approach.

The idea behind AuntieTuna is to allow users to label known good sites. AuntieTuna then monitors web browsing and flags sites as potential phish based on the appearance of content from known-good sites at other locations. AuntieTuna detects phish with precision by first hashing content from known-good sites, personalized to the user, and then finding that content on unknown sites as the user browses.

Users first tailor AuntieTuna and their protection by identifying the sites they use as known-good, creating their own defense that is diverse and personalized to themselves. Users can select known-good sites as they are first used (Trust On First Use), or they can inoculate themselves and others with data sharing—we discuss data sharing in Section 5.4.2.

After a user tags a site as good, AuntieTuna records the site's content by chunking the page, hashing the chunks, and storing the site's URL and hashes in its whitelist ("known-good"). Known-good is kept updated through opportunistic recrawl: AuntieTuna automatically monitors and refreshes the content of tagged sites as they change over time.

AuntieTuna then detects suspect phishing sites precisely and efficiently by comparing the hashes of content on unknown, visited pages with known-good: if an unknown page contains known-good content, we flag it as possible phish. AuntieTuna detects phish with minimal false positives because it uses cryptographic hashing to process content, and it is efficient because it compares only against the sites that the user actually uses. We evaluate the accuracy and robustness of AuntieTuna's detection algorithms in detail in Chapter 3.
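To make the detection concrete, the following minimal Python sketch chunks a page's text, hashes each chunk, and flags an unknown page that reuses known-good content. It is an illustration under simplifying assumptions, not AuntieTuna's implementation: AuntieTuna chunks the page's DOM rather than paragraphs of plain text, and the match threshold here is an assumed parameter.

import hashlib

def chunk_hashes(page_text: str) -> set:
    # split the page into chunks (paragraphs here, for brevity) and
    # hash each chunk with a cryptographic hash
    chunks = [c.strip() for c in page_text.split("\n\n") if c.strip()]
    return {hashlib.sha256(c.encode()).hexdigest() for c in chunks}

known_good = {}  # domain -> set of content hashes ("known-good")

def add_known_good(domain: str, page_text: str) -> None:
    # called when the user tags a site as good (or on opportunistic recrawl)
    known_good.setdefault(domain, set()).update(chunk_hashes(page_text))

def is_possible_phish(domain: str, page_text: str, threshold: int = 3) -> bool:
    # flag a page that is not itself known-good but contains at least
    # `threshold` chunks copied from some known-good site
    if domain in known_good:
        return False
    page = chunk_hashes(page_text)
    return any(len(page & hashes) >= threshold
               for hashes in known_good.values())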
We next look at how we augment AuntieTuna with data sharing to improve the collective immunity of our users and their friends.

5.4.2 Improving Collective Immunity with Data Sharing

The second aspect of our approach is data sharing between friends, either directly (peer-to-peer) or through a centralized website. Sharing data between friends allows AuntieTuna to provide collective immunity, as sites approved by one user can be provided automatically to their friends after they share data. This inoculation helps close a user's vulnerability gap when new services are created.

To share information about known-good sites, users either share sites directly with one another or through a centralized website. AuntieTuna supports both methods of sharing. To bootstrap trust in the data exchange, we leverage the relationships that users already have with a centralized authority (for example, through their employment with an enterprise) or with their friends.

In centralized data sharing, a central authority like an enterprise or university provides and manages known-good lists containing its own services for its users. We will see in Section 5.6.1, however, that it is not always possible for large enterprises to enumerate all of their own services, requiring another way to share data.

Users can also share sites directly. Our insight in friend-to-friend sharing is that users benefit from increased protection when they share with their friends or their social groups, because they likely visit the same common sites due to their shared interests. For example, graduate students in the same lab group likely use the same internal (wikis) and external (social networking, intramural sports leagues) sites. We leverage a group's shared behavior to maximize the benefits of inoculation.

Once a user receives another's known-good, AuntieTuna automatically incorporates the shared data with existing known-good for detection on unknown sites (see the sketch below). Sharing has the added benefit of improving usability, as users don't necessarily need to keep track of their own known-good sites.
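As a concrete illustration of the exchange, this short Python sketch serializes and merges known-good lists that carry only domains and content hashes (consistent with the privacy discussion in Section 5.4.3); the JSON container and function names are assumptions for illustration, not AuntieTuna's actual exchange format.

import json

def export_known_good(known_good: dict) -> str:
    # serialize a user's known-good (domain -> set of content hashes);
    # only domains and hashes are shared, never full URLs or raw content
    return json.dumps({domain: sorted(hashes)
                       for domain, hashes in known_good.items()})

def import_known_good(known_good: dict, shared: str) -> None:
    # merge a friend's known-good into ours; the exchange itself is
    # assumed to happen over an authenticated, TLS-secured channel
    for domain, hashes in json.loads(shared).items():
        known_good.setdefault(domain, set()).update(hashes)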
We next look at the risks in AuntieTuna's design.

5.4.3 Risks

While designed to improve security, use of AuntieTuna poses its own risks to security and privacy. We look at the specific risks introduced by AuntieTuna, but consider overly general risks to the underlying technology (like vulnerabilities in the web browser) or to peer-to-peer sharing and decentralized systems (Sybil attacks [36]) to be out of scope in this chapter. We have previously analyzed AuntieTuna's robustness to countermeasures used by web phishing attacks in Chapter 3 and do not cover it here.

We assume that exchanging known-good happens between trusted parties (we covered trust bootstrapping in Section 5.4.2), and that the exchange happens over secured communication. AuntieTuna adheres to best practices, using strong cryptographic algorithms and protocols (TLS) to secure communication.

Security: There is a risk of increased vulnerability to phishing if false known-good content is shared and used. Centralized sharing introduces an increased attack surface: attackers can inject their own known-good content at a compromised authority. When sharing directly between friends, a user might share false known-good because they were compromised or careless in their habits of adding to known-good. We mitigate the risks of false known-good by exposing the content and management of known-good: users are always in control of their own known-good and can verify its content. Suspicious URLs are easily removed, and hashes of false content associated with a known-good URL would cause annoyance at worst (false positive phishing alerts) or potentially backfire (a hash of a malicious payload would identify actual phish).

Privacy: The risk to privacy is exposing a user's list of known-good sites and the content being protected when known-good is shared or inadvertently revealed. It is possible that a user may be embarrassed if their websites are shared, or that attackers could target this data. We mitigate the risk to privacy by keeping only the essential information on known-good websites and maximizing user control over this data and how it is shared. AuntieTuna stores only the domain components of sites' URLs and hashes of content: the complete URL is not stored or shared, and potentially personalized content on a website is already cryptographically hashed. We could further hash the domains (like HashKnownHosts in SSH) at the cost of usability—we leave this potential feature as future work.

5.4.4 Adversarial Countermeasures

Phishing is adversarial, and we must consider what attackers will do as they consider and potentially develop countermeasures to our work. Our new contribution is data sharing between friends—fortunately, such sharing is very robust to attackers when friends are either personally known or institutionally known (your school's security officer). We build on AuntieTuna, whose use of cryptographic hashing to detect phishing sites, while precise, can be fragile to content tweaked to look similar but hash differently. Such changes raise the bar of effort required for an attacker and could be detected as suspicious by augmenting AuntieTuna with a second level of semantic hashing or image-based analysis.

An attacker could also present a new service with a copied SSO portal and induce a user to mark it as "known-good", subverting our system with their phishing site. However, users of our system should be extra suspicious when we raise warnings, since their organization will be using consistent SSO and we will flag any external copies or near-copies.

5.4.5 Implementation

We implement AuntieTuna as a browser extension for Mozilla Firefox and Chromium-based browsers (Google Chrome, Chromium, Brave) in JavaScript, in about 1200 lines of code. The extension is open-source, free, and available online at https://ant.isi.edu/auntietuna as well as in the web store for Chrome extensions. At the time of writing, there have been approximately 10 continuous users since its release.

5.5 Case Study: Real Phishing Attacks on USC

We now present a case study of phishing attacks on the University of Southern California (USC) in 2019 and 2020, showing an example of a legitimate email that looks suspicious and phishing emails that look real, and how AuntieTuna is well suited to prevent such attacks. Students, faculty, and staff at USC use Single Sign-On (SSO) to conveniently access many essential first- and third-party services (illustrated in Figure 5.1, described and quantified in Section 5.6.1.1).

Figure 5.1: Users at USC use a standardized Single Sign-On (SSO) process to access many first- and third-party services. (1) The user enters and visits the URL for the desired service; (2) the user is redirected to Single Sign-On authentication; (3) after authentication, the user is redirected to the desired service.

Figure 5.2: An email sent on 2019-05-15 instructing a faculty member to complete a mandatory compliance survey.
One of the phishing attacks is an example of the threats we target (Section 5.2), with the attacker attempting to steal credentials by mimicking legitimate login pages like USC's SSO portal. Similarly, the suspicious-looking, legitimate email is a potential false positive that we need to differentiate from actual phish.

Looks Suspicious, but it's Legitimate: On 2019-05-15, a suspicious-looking email was sent to a USC faculty member requesting the completion of a compliance survey (shown in Figure 5.2). This survey was hosted on a third-party service and required users to first authenticate with USC SSO credentials on a first-party login portal that was not the standardized SSO portal before continuing. Because the mail was sent from a previously unknown staff member and pointed to a third-party site, it was difficult to tell if this request was legitimate or a sophisticated form of spear phishing. Only after an additional confirmation (through e-mail or phone) was the survey verified as legitimate. This scenario highlights the danger of requiring SSO credentials on non-SSO login pages. It suggests that SSO login portals must be strictly standardized, since acceptable variations are hard to distinguish from phishing variants. (Out of scope of our work are compromises to third-party services using approved SSO.)

Figure 5.3: A spear-phishing email sent to everyone at USC/ISI on 2019-07-19. A link inside a blue box labeled "Restore" leads users to the phishing site (seen in Figure 5.4, right). This email can be very convincing because it contains the recipient's name, title, and phone number (harvested from a public directory) and instills a sense of panic, as users are told that their email is unavailable.

Looks Legitimate, but it's Phish: On 2019-07-09, a legitimate-looking phishing email (shown in Figure 5.3) was sent to all users at USC/ISI. The phish addressed each recipient by name, title, and phone number, and instructed the recipient to click on a link to "restore" their account due to exceeding an email storage quota. The "restore" link led to a phishing site (Figure 5.4, right) that looked and behaved exactly like USC's SSO (Figure 5.4, left), except that the phish likely logged any entered credentials (and even redirected to an actual service at USC, unauthenticated).

Figure 5.4: Detecting a phishing site attack against USC. The phishing site (right) is visually almost identical to the original, known-good site (left). By identifying the common elements by the hashes of a page's content (red values in the middle columns), AuntieTuna detects the page on the right as phish.

Looks Ambiguous, and it's Phish: Attackers sometimes make phishing sites that share nothing in common with the source website. On 2020-06-09, an attacker used a compromised email account of a
senior staff member at USC/ISI to send mail to hundreds of other students, staff, and faculty in a lateral phishing attack. The email, with subject "INVOICE", had a legitimate signature and correct headers (due to a previously-compromised account). Its payload was an Excel attachment with a link to an ambiguous-looking portal to OneDrive (a service used at USC/ISI). In this case, the attackers used a customized, but basic, "phishing kit" that did not mimic any part of the original, legitimate site. This kind of phish provides a generic-looking portal and can be reused to phish multiple services with a simple logo or wording change. Figure 5.5 shows an example Microsoft OneDrive phishing site (top, Figure 5.5a) targeting users at USC/ISI compared to the actual, legitimate login portal (bottom, Figure 5.5b): we observe that the phish has a completely different visual style and shares no content in common with the legitimate site.

This scenario again reinforces the potential danger of requiring SSO credentials on non-SSO login pages. If users become habituated to non-standardized login portals, such as the one in the prior example ("Looks Suspicious, but it's Legitimate"), users might not realize that the proper authentication flow should require the SSO login portal, and fall victim to this type of phish.

Figure 5.5: An example of a Microsoft OneDrive phishing site attack (a, top) against USC/ISI on 2020-06-09 that does not reuse content from the original, legitimate site (b, bottom: Microsoft OneDrive, visited on 2020-06-09). This phishing site requested login credentials on the same form. On the legitimate service, users first enter their USC email, and are then redirected to USC's SSO portal to complete authentication (Figure 5.1, step 2).

Preventing SSO Phishing Attacks with AuntieTuna: The prior examples teach us that the use of SSO, or its inconsistent application, can leave users confused about the legitimacy of an email or website and thus susceptible to phishing attacks. AuntieTuna prevents SSO phishing attacks by proactively detecting phish on visited sites, looking for the right things (known-good) in the wrong place (unknown sites). When users first mark USC's SSO page as known-good, AuntieTuna hashes the page's content and checks every other visited page for that content as possible phish. Users can also preemptively inoculate themselves by sharing their known-good data with their friends, distributing the effort of marking and tracking known-good sites. In both cases, AuntieTuna detects the phishing page described previously and blocks access to it, preventing the user from losing their credentials. We quantify AuntieTuna's effectiveness on sites at USC in Section 5.6.2.1, and on sites outside of USC's domain in Section 5.6.2.4.

The increasing trend of deploying USC's SSO on top of third-party services also increases the opportunity for phishing attacks against USC. We will show in Section 5.6.1 that this potentially problematic trend is getting worse by quantifying the growth of SSO services.

Finally, we consider phishing sites that create or use generic "phishing toolkits" (described in "Looks Ambiguous, and it's Phish"), instead of reusing and copying elements from actual sites, to be outside of AuntieTuna's scope of detection. We expect other phish detection schemes using content or visual comparison techniques to also miss this style of site.
If they do correctly classify it, we would expect their detection to have a higher false positive rate on other sites: it can be difficult to automatically distinguish between, as in our earlier example, a site that is phishing for OneDrive credentials and a legitimate service that supports or integrates with OneDrive (while also having its own, separate login form). It would be reasonable for legitimate services to reuse some OneDrive images and terms on their sites, which makes it difficult to treat such reuse as a phishing-site feature. Positive identification of phish using generic toolkits would require either seeding a database with toolkit elements or encouraging users to look for (and report) SSO sites with non-standard elements. We leave this analysis for future work.

name         party  description
myUSC        1st    general portal for students
myViterbi    1st    portal for engineering courses/grades
Concur       3rd    travel management
Workday      3rd    HR/payroll
Outlook 365  3rd    email, calendar, cloud storage

Table 5.1: Examples of SSO-enabled Web Services at USC

5.6 Evaluation

We first show the need for our approach by quantifying the "surface area" and growth of Single Sign-On services (Section 5.6.1). We then show that sharing with AuntieTuna helps secure the enterprise at home and the office (Section 5.6.2). We also suggest that AuntieTuna would be particularly relevant for election campaign security, where sharing is critical but often fluid and unstructured (Section 5.6.3).

5.6.1 Quantifying Services Using Single Sign-On at Universities

We will show that the "surface area" of Single Sign-On (SSO) is large by quantifying the number of online services at university enterprises, and then measure its growth over time.

5.6.1.1 How many online services are at a university?

We first show the need for anti-phishing protection in large organizations by looking at how many online services exist and use Single Sign-On (SSO) at the University of Southern California (USC, USA; 48×10³ students and 21×10³ faculty and staff in 2019 [126]) and the University of California, Berkeley (UCB, USA; 41×10³ students and 15×10³ faculty and staff in 2018 [125]). In a sense, we are measuring the "surface area" of SSO. With many departments and services, universities like USC and UCB have encouraged the use of SSO for security, but decentralized control makes it hard to count the exact number of first- and third-party services.

                            USC*          UCB
description                 #     %       #     %
SSO-enabled                 48    78.7    605   97.6
  First-party               21            454
  Third-party               27            151
No SSO                      13    21.3    15    2.4
  Credentials same as SSO   3             2
    First-party             3             2
    Third-party             0             0
  Credentials distinct      10            13
    First-party             5             11
    Third-party             5             2
Total                       61    100     620   100
* manual count

Table 5.2: Characterizing Web Services at USC and UCB

We manually find, verify, and count web services at each university, crawling university homepages (for example, portals and resources for current students) and augmenting with ground truth when available, and characterize each service by its owner (first- or third-party) and whether first-party SSO is used for authentication. If SSO is not used, we further label whether the service uses the same credentials as SSO or a separate, distinct login. In Table 5.1, we detail some example services at USC.

We counted 61 (manual) and 620 (augmented with ground truth) web services at USC and UCB, respectively, and find that most use SSO (48 (78.7 %) at USC and 605 (97.6 %) at UCB), a benefit for both users and system administrators. We summarize our findings in Table 5.2.
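The first- versus third-party labeling can be illustrated with a short Python sketch that compares a service's initial URL against its "final destination" URL; the example URLs come from the text, while the suffix test is a simplification of our manual process (a real classifier would use the Public Suffix List to find registered domains).

from urllib.parse import urlparse

def party(initial_url: str, final_url: str, org: str = "usc.edu") -> str:
    # label a service by whether its initial and final URLs stay within
    # the organization's domain (simplified suffix check)
    starts_first = urlparse(initial_url).hostname.endswith(org)
    ends_first = urlparse(final_url).hostname.endswith(org)
    if starts_first and ends_first:
        return "first-party"
    if starts_first:
        return "third-party (first-party URL, third-party destination)"
    return "third-party (accessed directly)"

print(party("https://workday.usc.edu", "https://wd5.myworkday.com/usc"))
# -> third-party (first-party URL, third-party destination)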
Of the services using SSO at USC, 27 are run by or hosted on third parties. USC has 17 third-party services that are accessed directly at a third-party URL (like usc.qualtrics.com), while 8 are initially accessed from a first-party URL but ultimately resolve to a third-party, "final destination" URL (for example, workday.usc.edu to wd5.myworkday.com/usc). (The final 2 services, while run by a third party, consistently maintain their first-party URL.)

We also find a significant number of third-party, SSO-enabled services (151, 25 % of 605) at UCB, and see similar behavior in how services are accessed directly at third-party URLs or at first-party URLs redirecting to different destinations. Augmented with ground truth data for SSO-enabled services at UCB, while we see a greater proportion of services run by the first party, many of them are development or staging sites and not necessarily used by campus affiliates.

While the use of third-party services with first-party SSO provides a familiar single login method and centralized account management, it can also increase users' risk of phishing attacks. End-users cannot know when a third-party service is authorized to use SSO, so attackers can create (and have created, in Section 5.5) phish that mimic USC's SSO with legitimate-looking URLs. We believe that keeping all services accessible at URLs within the same domain name is one way an enterprise can reduce its users' risk of phishing attacks: services under the same second- or third-level domain name provide consistency for users (for example, all authorized services live inside usc.edu) and control for administrators (rogue or disabled services are removed easily from DNS).

Finally, some services (only a few: 3 at USC and 2 at UCB) require SSO login credentials but do not use the SSO portal. Finding only a small number of services that use the same SSO credentials but different login pages is a mostly positive result: users would greatly increase their risk of phishing attacks if they became accustomed to authenticating with SSO login details without a common SSO page. (These services are in the process of migrating to the SSO process at the time of writing.)

We have shown that the surface area of services using SSO is large: there are at least 48 and 605 SSO-enabled first- and third-party services at USC and UCB (78.7 % and 97.6 % of the total services we counted, respectively). We next ask if the services we enumerated form a complete list.

5.6.1.2 Are our service lists complete?

We next examine the completeness of our enumeration of services at USC and UCB. We show that our enumeration is incomplete at USC, but covers all services at UCB.

At USC, we know our count of 61 is incomplete and roughly accounts for 25 % of all services: what we have enumerated is a representative sample of all services at USC, consisting of the sites most commonly used by students, faculty, and staff based on crawling the USC homepages. After sharing our list with system administrators at USC, they confirmed that there were "probably more than 200" services using SSO, although no centralized list is kept. At UCB, we augment our enumeration with ground truth data from system administrators at UCB and consider it complete.

We have shown that the surface area of SSO-enabled services is large (Section 5.6.1.1) and that our enumeration is representative at USC and complete at UCB. We next look at the growth of online services over time to understand if there are any trends in how fast services are deployed.
5.6.1.3 How fast is the number of online services growing?

We next measure the rate of growth of online services in campus environments, using data from the InCommon Federation [65, 66]. Members of the InCommon Federation, including USC and UCB, typically have seamless access to third-party services provided by other members. We can thus measure the growth of third-party services using first-party Single Sign-On (SSO) for many organizations (770 as of 2020-06-03).

The InCommon Federation is a global network of academic institutions and service providers (commercial and non-profit) and provides the resources to enable first-party SSO on third-party services. The federation uses Shibboleth [113], a SAML-based SSO system, and contains two types of providers: Identity (IdP) and Service (SP). IdPs are similar to "eyeball networks" and are typically academic institutions like USC that wish to access outside resources provided by an SP. SPs are resource providers, which can be run by academic or commercial entities. As a simplified example, within the InCommon Federation, users at USC (an IdP) can access Qualtrics (an SP) using USC's SSO portal.

Figure 5.6: Number of Identity (IdP) and Service (SP) Providers (×10³) in the InCommon Federation over time. Dataset: InCommon [66], 2010-05-11 to 2020-06-09.

We find that the number of online services in campus environments has grown steadily over time, reflecting an increasing embrace of technology in education over the years. Figure 5.6 shows graphically the number of Service Providers (SPs, red) and Identity Providers (IdPs, blue) over time: between 2010 and 2020, the number of SPs in the InCommon Federation grew at an average annual rate of 29.1 %.

While the growth in providers is consistent over time, we see one large spike in SPs and IdPs on 2019-02-16, marked 'A' in Figure 5.6. The number of SPs increased 30.7 % (3029 to 3958) and the number of IdPs increased 423 % (371 to 1939) in one day: this sharp growth was the result of a planned integration of the InCommon Federation with eduGAIN, a federation of federations. This integration [110] enabled interoperability on an international scale—the InCommon Federation previously consisted primarily of US institutions.

We next validate an aspect of our enumeration: can all participating institutions in InCommon use the 6769 services provided by its members? The actual number of services that a particular IdP (like USC or UCB) has access to will vary. For federation members, we can consider the number of SPs depicted in Figure 5.6 to be an upper bound in terms of access by any IdP. For example, while an SP supports federated logins by any IdP, that SP might require additional agreements prior to authorization (for example, licensing with the Zoom video conferencing service). Other SPs (like collaboration wikis) are open to all participants. We leave detailed analysis of these relationships for future work.
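For reference, an average annual growth rate like the 29.1 % reported above can be computed from yearly provider counts, as in the short Python sketch below; the counts are made-up placeholders rather than the InCommon dataset, and the geometric mean is one reasonable definition of "average annual growth" (we do not claim it is the exact method used here).

yearly_sp_counts = [500, 680, 850, 1100, 1400]  # hypothetical SPs per year
n_years = len(yearly_sp_counts) - 1
growth = (yearly_sp_counts[-1] / yearly_sp_counts[0]) ** (1 / n_years) - 1
print(f"average annual growth: {growth:.1%}")  # about 29.4% for these counts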
We showed previously that the "surface area" of SSO is large, and we have now shown that this surface area is also increasing for many academic institutions: we measured an average annual growth of 29.1 % in online services within the InCommon Federation, of which many academic and research institutions are members. The rate of growth in SSO-enabled services is a net-positive benefit for access to information and resources, but it also reinforces the need to ensure users' security. We next evaluate how we improve users' security with AuntieTuna.

5.6.2 Improving Enterprise Security at Home and the Office

We now look at how data sharing in AuntieTuna helps improve enterprise security at home and the office by preventing phishing attacks. We first evaluate AuntieTuna's effectiveness at an enterprise like USC (Section 5.6.2.1) and examine how users should share with each other in order to effectively protect against phishing (Section 5.6.2.2). Finally, we generalize AuntieTuna's effectiveness from inside the enterprise to external sites (Section 5.6.2.4, Section 5.6.2.6).

5.6.2.1 Is AuntieTuna effective in protecting enterprise sites without sharing?

We first ask if AuntieTuna is effective in protecting sites at an enterprise, using the University of Southern California (USC) as our test case. We hypothesize that AuntieTuna can leverage USC's use of SSO to effectively protect users against phishing sites. Users first mark USC's SSO portal and any other services they use as known-good, and AuntieTuna then detects and labels visited sites that contain content from their known-good as possible phish.

To understand AuntieTuna's effectiveness, we evaluate how many sites a user needs to add to their known-good in order to be sufficiently protected from phishing sites. We collect and analyze anonymized web browser histories of 14 users in the computer science department at USC, including students, faculty, and staff, for one week between 2020-03-27 and 2020-04-02 (reviewed and approved by the University of Southern California Institutional Review Board, #UP-19-00826).
We initially focus on USC and USC/ISI services for each user, filtering for sites on the usc.edu and isi.edu domains: the users' resulting web histories range between 3 and 16 distinct sites. (We will consider all sites in Section 5.6.2.4.)

Figure 5.7: A profile of actual (a, top) and simulated (b, bottom) users at USC (x-axis) and the number of USC services they use (y-axis). All services using SSO are grouped together (green).

Figure 5.7a is a stacked bar chart of our users (categorically on the x-axis) and the number of services that each user accessed at USC and USC/ISI (y-axis). Given a user, each bar in the stack represents services grouped by their login method: the bottom green bar, for example, represents SSO-enabled sites, while the other bars represent sites that have their own, distinct login pages. Figure 5.7a shows that AuntieTuna is effective at protecting sites at USC, requiring only 1 site, the SSO login portal, to gain about 63 % (mean) coverage of a user's sites.

Because all path components of the URLs in users' web histories are anonymized, except for public suffixes (.com, .net, etc.) and the USC (usc.edu) and USC/ISI (isi.edu) domains, we cannot prove that all of the visited USC and USC/ISI sites use SSO. We found earlier in Section 5.6.1.1 that most (79 %) of services on usc.edu are SSO-enabled. Similarly, we find in our user population that 69 % (37) of visited sites at USC are in usc.edu, and the remaining 31 % (17) are in isi.edu. We therefore assume that all sites on usc.edu domains are SSO-enabled and group them together as one bar (green, bottom bar), and treat other sites on isi.edu as distinct sites.
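The coverage metric used in this section can be stated concretely: it is the fraction of a user's visited enterprise sites whose login page is in their known-good. The Python sketch below follows the grouping assumption just described (all usc.edu sites share one SSO login); the history and the portal name sso.usc.edu are hypothetical.

def coverage(visited_sites: set, known_good_sites: set) -> float:
    def login_site(site: str) -> str:
        # per our assumption, all usc.edu services share the SSO portal
        return "sso.usc.edu" if site.endswith("usc.edu") else site
    protected = {s for s in visited_sites if login_site(s) in known_good_sites}
    return len(protected) / len(visited_sites)

history = {"workday.usc.edu", "myviterbi.usc.edu", "wiki.isi.edu"}
print(coverage(history, {"sso.usc.edu"}))  # 2 of 3 sites protected (~0.67)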
Users adding the SSO portal to their known-good benefit greatly because the majority of sites that our users access are SSO-enabled (mean: 5.35 sites, median: 5). For the two users D and G who stay within the USC domain, adding the SSO portal is sufficient to achieve 100 % coverage. For others to achieve complete coverage, most users need to add an additional 1–4 sites—in the worst case, user A needs 9 additional sites.

The outsized benefit of adding one site to AuntieTuna's known-good shows that USC uses SSO on most of its services. We expect to see similar benefits at other academic and enterprise institutions—we showed earlier in Section 5.6.1.1 that most sites at UCB are also SSO-enabled. We next look at sharing known-good with others using AuntieTuna, and at how many friends one needs to share with to be protected at an enterprise like USC.

5.6.2.2 How many friends must share to protect enterprise sites?

We now look at sharing known-good with others, evaluating the number of friends that one needs to share with in order to protect enterprise sites. Later, in Section 5.6.2.4, we will evaluate how sharing with friends improves protection on external, community sites that are outside of the enterprise.

Users sharing data with other friends at an enterprise help improve security by inoculating their friends on more sites than the friends could manually discover on their own. Rather than manually adding to their individual collection of known-good, users can use AuntieTuna's data sharing to bootstrap protection quickly with the help of their friends.

In organizations as large as USC, it is challenging to make centralized sharing work since services are decentralized: different groups run their own mix of first- and third-party services.
Correspondingly, we hypothesize that we can leverage the commonalities that users have with each other to overcome the challenges of centralized sharing: users in the same sub-group (within USC, for example) likely use the same sites, and will benefit in protection when they share their known-good with each other.

Figure 5.8: Sharing known-good between actual (a, top) and simulated (b, bottom) users effectively inoculates them on sites at USC. The solid colored bars in blue, red, and yellow show the percentage of sites protected due to sharing by A, B, and J, respectively. Bars are omitted if a user was sharing with themselves.

To understand how many friends a user must share with in order to protect sites at an enterprise, we simulate transfers between friends and analyze the resulting level of coverage (protection) that users receive after sharing. We first enumerate all possible transfers of known-good between users, using their profiles from Section 5.6.2.1, and then analyze the sites that are now inoculated for each user after sharing. We assume that the recipient user at the time of sharing has not used any services yet (or has not yet marked any service as known-good at the time of sharing). This would be representative, for example, of a new student starting at a university or joining a research group.

Users need to share with at least one friend to be sufficiently protected on 72.3 % (mean) of sites used at USC (median: 79.2 %). Figure 5.8a shows a grouped bar chart of each user's profile (x-axis) and the resulting percentage of sites that are inoculated for each user due to sharing by three other users, independently (y-axis). We omit a bar if a user was to share with themselves (A sharing with A). For example, when A (blue) shares with everyone else, we see that users B–J are covered on 71.4–100 % (blue bars) of their sites. We also see that users benefit from the sharing of others even if their own web profiles are smaller: when J (yellow, 6 sites) shares with others, all other users receive between 20 % and 100 % coverage (yellow bars).
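The simulation just described reduces to a few lines; the Python sketch below gives each sender's known-good to an otherwise-empty recipient and reports the fraction of the recipient's sites that are then protected. The three profiles are illustrative stand-ins, not our measured user data.

profiles = {                        # user -> set of sites they use
    "A": {"sso", "wiki", "gym"},
    "B": {"sso", "wiki"},
    "C": {"sso", "news", "gym"},
}

def protection_after_sharing(sender: str, recipient: str) -> float:
    # the sender's known-good inoculates a recipient who starts empty
    covered = profiles[recipient] & profiles[sender]
    return len(covered) / len(profiles[recipient])

for sender in profiles:
    for recipient in profiles:
        if sender != recipient:     # bars are omitted for self-sharing
            print(sender, "->", recipient,
                  f"{protection_after_sharing(sender, recipient):.0%}")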
Users need to share with at least one friend to be sufficiently protected on 72.3 % (mean) of sites used at USC (median: 79.2 %). Figure 5.8a shows a grouped bar chart of each user's profile (x-axis) and the resulting percentage of sites that are inoculated for each user due to sharing by three other users, independently (y-axis). We omit a bar if a user was to share with themselves (A sharing with A). For example, when A (blue) shares with everyone else, we see that users B–J are covered on 71.4–100 % (blue bars) of their sites. We also see that users benefit from the sharing by others even if their web profiles are smaller: when J (yellow, 6 sites) shares with others, all other users receive between 20–100 % coverage (yellow bars).

When more than one friend shares with another, the fraction of sites protected for any given user asymptotically approaches 1 as the number of friends sharing increases. Figure 5.9 shows a series of box plots (blue, with medians colored red) and the mean fraction (blue points) of sites protected (y-axis) given the number of users sharing with all other users (x-axis). For example, any 2 users sharing their known-good with everyone else (such as A and B sharing with C–J) will result in 83.1 % (mean) coverage of protected sites for the destination user (median: 92.9 %).

[Figure 5.9: Sharing with more friends increases the fraction of sites protected. The box plots for actual (left, blue) and simulated (right, green) users show the ranges of protection given a number of users sharing with all others, in addition to the plotted mean (blue points, green 'x's) and median (red) values.]
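The multi-sharer experiment extends the same sketch: every size-k subset of users shares its combined known-good with each remaining user, and we report the mean coverage (again an illustrative sketch over set-based profiles, not our actual analysis code):

from itertools import combinations

def mean_coverage_k(profiles, k):
    """Mean fraction of a recipient's sites protected when k users share with all others."""
    users = list(profiles)
    fractions = []
    for sharers in combinations(users, k):
        # union of the known-good shared by the k sharing users
        shared = set().union(*(profiles[s] for s in sharers))
        for recipient in users:
            if recipient not in sharers and profiles[recipient]:
                fractions.append(len(profiles[recipient] & shared) / len(profiles[recipient]))
    return sum(fractions) / len(fractions) if fractions else 0.0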
We showed in the previous section that AuntieTuna, without sharing, is effective at protecting sites after the user has manually added the known-good. With data sharing in AuntieTuna, its users can preemptively share their known-good with others, bootstrapping and inoculating recipient users with sites they no longer need to manually add. Users at USC receive benefits when sharing with at least one other friend that shares a common affiliation with USC; part of those benefits are due to USC's use of SSO. As users share with more people, they are inoculated with more sites: users will be protected on sites that they might not use today, but might use in the future. We next evaluate SSO's advantages to sharing.

[Figure 5.10: Sharing known-good between users inoculates them on sites at USC even when SSO is not used. The solid colored bars in blue, red, and yellow show the percentage of sites protected due to sharing by A, B, and J, respectively. Bars are omitted if a user was sharing with themselves. Panels: (a) actual users at USC; (b) simulated users at USC.]

5.6.2.3 Evaluating SSO's advantages in effectively protecting enterprise sites

To evaluate SSO's advantages, we consider known-good sharing with friends in the worst-case scenario where SSO is never used: each site or service has its own distinct login page, requiring distinct entries in AuntieTuna's known-good. We treat each site that was previously grouped together (SSO-enabled sites) as now having a distinct login: Figure 5.10a (bar charts) and Figure 5.11 (box plots) are the "SSO-disabled" counterparts to the SSO-enabled Figure 5.8a (bar charts) and Figure 5.9 (box plots).

We see that sharing still helps overall, but without SSO, a given user requires at least 10 friends sharing with them to achieve a similar coverage of 70.2 % (mean) (median: 71.4 %). (We showed in Section 5.6.2.2 that users sharing with 1 friend in our SSO-enabled case achieve 72.3 % coverage.)

[Figure 5.11: Sharing with more friends increases the fraction of sites protected. Without SSO, users need to share with more friends to achieve sufficient protection. The box plots (black, medians in red) show the ranges of protection given a number of users sharing with all others, in addition to the plotted mean values (blue).]

Although the number of friends required is relatively large to achieve coverage equivalent to the SSO-enabled case, there are benefits to sharing with at least 1 other: users are protected on 33.2 % (mean) of their sites. When sharing with 5 others, recipient users almost double (1.83×) their protection to 60.6 %.

AuntieTuna enables the sharing of known-good between its users and friends to protect them against phishing attacks at USC, and we have shown that AuntieTuna's protection due to sharing is effective even when SSO is not used.
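Concretely, the SSO-disabled scenario above is a per-profile rewrite: with SSO, every SSO-enabled site collapses into the single login-portal entry, while without SSO each site keeps its own distinct entry. A minimal sketch, with an illustrative portal name:

def sso_variants(profile, sso_sites, portal="sso.example.edu"):
    """Return (with_sso, without_sso) views of a profile (a set of site names)."""
    # With SSO, all SSO-enabled sites share one login page, so one
    # known-good entry (the portal) covers them all.
    with_sso = {portal if site in sso_sites else site for site in profile}
    # Without SSO, every site has a distinct login and needs its own entry.
    without_sso = set(profile)
    return with_sso, without_sso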
We have also quantified the benefits that SSO provides in reducing the number of friends a user at USC needs to share with: with SSO, a user needs to share with 1 other friend (Section 5.6.2.2), and with 10 others when SSO is not used. We next show how sharing with AuntieTuna also protects users on online services outside of (and not controlled by) USC when users share their known-good with their informal social groups.

5.6.2.4 How many friends must share to protect community sites?

People at a university like USC often visit the same local websites (sports teams, clubs, local shopping venues, etc.), and they would benefit from phishing protection for these sites with our approach. These external services are often outside of USC's control, and USC cannot be responsible for maintaining and distributing known-good lists outside of its domain: friend-to-friend sharing is needed to fill the gap.

[Figure 5.12: Profile of users at USC (x-axis) and all the external and internal sites and services they use (y-axis).]
We evaluate how many friends a user needs to share with to protect the external services they use with the same methodology as Section 5.6.2.2: we enumerate all possible transfers of known-good between users using all of their web browsing history data, and then analyze the levels of coverage for each recipient user. (Earlier, in Section 5.6.2.2, we filtered and analyzed users' histories for sites at USC and USC/ISI only.)

In addition to analyzing users' web browsing histories at USC, we validate our findings with two additional datasets. We build and analyze web browsing profiles using DNS data from the Case Connection Zone (CCZ) [5] (a recursive DNS resolver at a neighborhood ISP) and SURFnet [129] (a large authoritative DNS server).

USC: Users at USC need to share with at least 4 other friends to be sufficiently protected on 31.6 % (mean) of the internet sites they use (median: 29.4 %). Figure 5.12 shows the sizes of the users' profiles, ranging from 27–404 distinct sites (mean: 147, median: 145). As more users share with each other, the mean fraction of sites protected asymptotically approaches 0.44: Figure 5.14a shows the benefits of sharing via a series of box plots (black, medians colored red) and the mean fraction (blue) of sites protected (y-axis) given the number of users within the group sharing with all other users (x-axis).

We see that sharing community sites is more challenging than sharing university sites. This challenge arises because the community sites that our users visit are more diverse than university sites, and most community sites do not share SSO for authentication. Inoculation therefore requires sharing information about each site, rather than discovering a widely used SSO method. Although sharing community sites is more difficult than sharing university sites, even sharing with only one other person has benefits: sharing with just one friend results in protection on 18.8 % (mean) of their sites (median: 16.6 %).

CCZ: CCZ contains anonymized DNS queries made by all ~100 homes in a small neighborhood located next to Case Western Reserve University. Each home has one public IP address and typically has multiple client devices behind a NAT; we treat the entirety of a home as a "user" and do not attempt to distinguish the distinct devices in each home. We randomly pick 20 homes with profiles containing at least 200 websites (Sep. 2018, duration: 1 day, 79 users). Each home's profile is then populated with "websites" based on the hashed DNS queries (A records) made by that home. We confirm that the data is realistic in Figure 5.13, which shows a time series of DNS activity for 8 random homes over a continuous one-week period (Sep. 2018): each home exhibits clear diurnal patterns of activity without any extreme outliers.

SURFnet: SURFnet contains DNS queries from Google's public DNS resolvers to SURFnet's authoritative DNS server (10^4 zones). We randomly pick 20 users from our dataset with profiles containing at least 25 websites (2017-12-20, duration: one day, 606 users). To build a "user", we use the location of Google's resolver and the originating querier's Autonomous System (AS). (The querier's AS is aggregated from the querier IP's /24 subnet set in the EDNS Client Subnet (ECS) option.) Each user's profile is then populated with "websites" based on the hashed DNS queries (A records) made by that user.
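Both DNS datasets reduce to the same set-based profiles used earlier. The following is a minimal sketch of that construction, assuming a simple whitespace-delimited query log (the field layout and the choice of SHA-256 are illustrative stand-ins for the datasets' actual formats):

import hashlib

def build_dns_profiles(log_lines, min_sites=200):
    """Map each source (home or user) to the set of hashed names it queried."""
    profiles = {}
    for line in log_lines:
        timestamp, source_id, qname, qtype = line.split()
        if qtype != "A":  # keep only A-record queries
            continue
        hashed = hashlib.sha256(qname.lower().encode()).hexdigest()
        profiles.setdefault(source_id, set()).add(hashed)
    # keep only sources meeting the profile-size threshold (e.g., 200 for CCZ)
    return {s: sites for s, sites in profiles.items() if len(sites) >= min_sites}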
In these datasets, users need to share with at least 5 other friends to be protected on more than half of their internet sites (53.8 % for CCZ and 55.3 % for SURFnet), based on the mean number of sites (medians: 52.9 % and 56.6 %). Figure 5.14b (CCZ) and Figure 5.14c (SURFnet) show a series of box plots of the fraction of sites protected, with asymptote values of 0.68 and 0.71, respectively. (We find similar results with CCZ data from Oct. 2017, and in Section 5.6.2.6, we will show how we can generalize the benefits of sharing by studying sharing in different communities or cohorts.)

[Figure 5.13: Time series of DNS activity for 8 random homes in the Case Connection Zone, binned by the hour. The top full-width graph shows the combined activity of all 8 homes and the bottom graphs show the activity for each individual home. Dataset: CCZ, Sep. 2018.]

[Figure 5.14: Sharing with more friends increases the fraction of internet sites protected. The box plots (black, medians in red) show the ranges of protection given a number of users sharing with all others, in addition to the plotted mean values (blue). Panels: (a) dataset: USC (actual, all sites); (b) dataset: CCZ, Sep. 2018; (c) dataset: SURFnet.]

We again see that sharing with only one other person or home has benefits. Sharing with just one friend results in protection on 27.1 % (CCZ) and 32.4 % (SURFnet) of their sites (mean) (medians: 25.1 %, 31.0 %). To understand the benefits of one-to-one sharing in CCZ, we visually examine in Figure 5.15 the fraction of sites protected when an individual home (row) shares with another home (column). The size of each circle represents the magnitude of the home's web profile (numerical values are labeled at each column) and the filled-in wedge represents the fraction of sites protected. For this subset of homes, we see the recipient homes are protected on 5–67 % of their sites. We see similar rates of protection in SURFnet: we look at the individual sharing by three other users in Figure 5.16 and find that recipient users are protected on 5–47 % of their visited sites.

Conclusions: While users' individual web histories are generally unique to each user [94], we leverage the commonalities in the sites they visit to show that data sharing with AuntieTuna can be effective. As more users share with each other, they are protected on an increasing number of sites that they will eventually or already use.
Having shown that sharing is effective for community sites, we next ask if we can simulate the web browsing histories of users at an enterprise or university like USC.

[Figure 5.15: A graphical representation of sites protected in one-on-one sharing when an individual home (row) shares with another home (column). The size of each circle represents the magnitude of the home's web profile, and its numeric value is listed at the top of each column. The filled-in wedge represents the fraction of sites protected (the diagonal, representing a home "sharing" with itself, contains completely filled-in circles). Dataset: CCZ, Sep. 2018.]

5.6.2.5 Do browsing histories of simulated users reflect the histories of actual users?

We now ask: if we simulate user profiles, are the resulting profiles representative of actual user profiles, such that we can use and study the simulated profiles in lieu of actual ones? We initially simulated users and their profiles at USC because of privacy concerns with acquiring and handling users' web histories, and we discuss our experiences here.

To evaluate whether simulated profiles are equivalent to actual user profiles, we first create 10 users at USC and randomly select the services they use, then compare the benefits of sharing between the simulated and actual user populations. For each user, we build a profile of USC services by selecting 5–30 services uniformly at random from Section 5.6.1.1, and then analyze each profile.
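A minimal sketch of this profile construction (the service catalog below is an illustrative stand-in for the Section 5.6.1.1 list):

import random

def simulate_profile(catalog, lo=5, hi=30):
    """Build one simulated profile: lo-hi services chosen uniformly at random."""
    return set(random.sample(catalog, random.randint(lo, hi)))

# Illustrative stand-in for the catalog of USC services.
catalog = [f"service-{i}.example.edu" for i in range(60)]
simulated_users = {chr(ord("A") + i): simulate_profile(catalog) for i in range(10)}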
[Figure 5.16: Sharing known-good between users effectively inoculates them on internet sites. The solid colored bars in blue, red, and yellow show the percentage of sites protected due to sharing by A, B, and J, respectively. Bars are omitted if a user was sharing with themselves. Dataset: SURFnet.]

When we compare our simulated users (Figure 5.7b) with actual users (Figure 5.7a), we note that the proportions of SSO-enabled to non-SSO services are similar in both groups. In this particular example, we overestimate the total number of sites that each of the simulated users uses: future work might recruit volunteers across a diverse set of departments in an enterprise or university to understand if our numbers are too divergent (recall that our volunteer group of actual users is within the computer science department).

We repeat on our simulated users the evaluation in Section 5.6.2.1, asking if AuntieTuna is effective in protecting sites at an enterprise. We find that AuntieTuna is effective at protecting sites at USC for our simulated users, again requiring 1 site (the SSO login portal) to gain ~80 % (mean) coverage of their used sites. With our actual users, adding the SSO portal covered ~63 % (mean).

We next ask how many friends our simulated users need to share with to be sufficiently protected on sites at USC. As in Section 5.6.2.2, we see that simulated users need to share with at least one friend to be sufficiently protected on 85.6 % (mean) of sites used at USC (median: 88.0 %). Grouped bar charts depicting sharing between several users in our simulated (Figure 5.8b) and actual (Figure 5.8a) populations visually reveal similarities in coverage due to sharing.

When we evaluate multiple users sharing with one another, we see in both simulated and actual users that the coverage asymptotically approaches 1. Figure 5.9 shows the mean values and box plots for both simulated (green) and actual (blue) users. While the medians and means for both groups are relatively close, we do see that the coverage for actual users has a much greater spread. In both groups, SSO provides an outsized advantage in minimizing the number of friends required to share with.
To understand if SSO provides the same advantage with our simulated users, we run the same scenarios (treating each site as distinct) on our simulated users as on our actual users: we consider one-on-one sharing with three sharing users (Figure 5.10b and its counterpart for actual users in Figure 5.10a) and multiple users sharing with every other user (green box plots in Figure 5.11). We see that while SSO does provide an advantage, this advantage is not as prominent as observed with our actual users. While both sets of charts evaluating coverage on simulated and actual users when SSO is not used are visually similar, coverage for our actual users exhibits a slower rate of growth, a smaller asymptote, and a larger spread as the number of users sharing increases.

Although there are differences in our evaluation of simulated and actual users, we believe the web browsing histories of our simulated users are representative of those of actual users. The differences that we see between our simulated and actual users are likely due to several factors. Our actual users, consisting of faculty, staff, and students, likely use different services that are reflective of their roles. Similarly, we know that in general, the popularity of services has a long-tail distribution (there are a few services used by almost everyone, and most services are used only by a few), which our use of uniform sampling does not take into account. Future work can modify the sampling method by taking into account access patterns and analytics of web services, if available. We have shown that our use of uniform sampling in building simulated user profiles provides similar results when compared to using actual user profiles, showcasing that the use of simulated user profiles can be beneficial in early evaluation of ideas. We next ask if we can generalize the benefits of sharing.

5.6.2.6 Generalizing the benefits of sharing

We next examine if we can generalize the benefits of sharing from our prior studies of sharing in an enterprise (Section 5.6.2.2) and on the Internet (Section 5.6.2.4). We generalize the benefits of sharing known-good by observing that the fraction of sites protected due to sharing typically has a starting baseline and logistic-like growth that typically stops near some asymptote that is usually below 1. The precise values of the baseline, growth function, and asymptote depend on the shared affinity between users (how often they share affiliation or interests), the number of users sharing, and the population of sites to be protected.

In Figure 5.17, we plot the median fraction of sites protected using data from a diverse group of datasets, and we see that the general shape of each fitted line is roughly the same, with varying baselines and asymptotes. We replicate some plots (CCZ, SURFnet, USC) as previously shown in Section 5.6.2.2 and Section 5.6.2.4, and compare with additional data from two social networks (Hacker News, Twitter), discussed later in this section. In Table 5.3, we further describe each dataset and enumerate the precise values of the baseline, growth function, and asymptote. For each dataset (or subset thereof), we fit the growth function f(x) = Ae^(-Bx) + C using non-linear least squares; we then find the baseline value when x = 1, and the asymptote as C (the limit as x → ∞).

The effectiveness of sharing given the number of sharing users has high rates of growth and high maximums (asymptotes) when the affinity of the sharing users is high (as at a university), and is less pronounced when their affinity is low (as in Internet communities).
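This fit is standard non-linear least squares; the following is a minimal sketch with SciPy, using synthetic stand-in data rather than our measurements:

import numpy as np
from scipy.optimize import curve_fit

def growth(x, a, b, c):
    # f(x) = A * e^(-B*x) + C; C is the asymptote as x grows large
    return a * np.exp(-b * x) + c

x = np.arange(1, 14)                       # number of users sharing
y = growth(x, -0.51, 0.24, 0.68)           # synthetic medians, shaped like CCZ's

(a, b, c), _ = curve_fit(growth, x, y, p0=(-0.5, 0.2, 0.7))
baseline = growth(1, a, b, c)              # protection with one user sharing
asymptote = c                              # limiting fraction of sites protected

The fitted (A, B, C) correspond directly to the growth-function columns of Table 5.3.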
At USC, we see that the baseline when using SSO is very high (0.791) and, when SSO is not used, on par (0.324) with other groups like CCZ and SURFnet. The fraction of sites protected when SSO is used quickly approaches 1; we expect similar behavior at other universities or university-like environments. Even when SSO is not used, the asymptote is relatively high at 0.806.

[Figure 5.17: Growth curves of the median fraction of sites protected across different user populations and datasets: USC (USC-only, SSO), USC (USC-only, no SSO), USC (all sites), SURFnet, CCZ 2018-09, CCZ 2018-09 (heavy), Hacker News, Hacker News (heavy), Twitter, and Twitter (heavy). "Heavy" users are indicated with a dashed line, and some lines are shorter than others if the population of users meeting the threshold was small. Corresponding numerical values can be found in Table 5.3.]

dataset           filter               #sites (min)  #sites (max)  type    date      A        B       C**     baseline*
USC (actual)      USC-only, SSO        3             16            web     2020-04   -0.7125  1.2202  1.0015  0.7912
                  USC-only, no SSO     3             16                              -0.5741  0.1771  0.8057  0.3248
                  all                  27            404                             -0.3084  0.2103  0.4252  0.1753
USC (simulated)   USC-only, SSO        5             30            web     2020-01   -0.2076  0.3059  1.0000  0.8471
                  USC-only, no SSO     5             30                              -1.0035  0.4046  0.9639  0.2944
CCZ (2018)        all                  200           -             DNS     2018-09   -0.5133  0.2407  0.6796  0.2762
                  heavy users          3000          -                               -0.4734  0.3640  0.6978  0.3688
                  medium-heavy users   2000          3000                            -0.5039  0.2111  0.7230  0.3150
                  medium users         1000          2000                            -0.5091  0.2124  0.7047  0.2930
                  light users          200           1000                            -0.5452  0.1831  0.6572  0.2032
CCZ (2016)        all                  200           -             DNS     2016-10   -0.5108  0.2138  0.6746  0.2622
                  heavy users          3000          -                               -0.4572  0.4023  0.6095  0.3037
                  medium-heavy users   2000          3000                            -0.4666  0.1971  0.6812  0.2980
                  medium users         1000          2000                            -0.4865  0.1999  0.6793  0.2809
                  light users          200           1000                            -0.4918  0.2037  0.6697  0.2686
SURFnet           all                  25            -             DNS     2017-12   -0.4750  0.2336  0.7098  0.3338
Hacker News       all                  10            -             social  2020-01   -0.5534  0.1052  0.5554  0.0573
                  heavy users          50            -                               -0.4061  0.2594  0.4465  0.1332
Twitter           all                  10            -             social  2019-05   -0.1759  0.1484  0.1395  0.0000
                  heavy users          30            -                               -0.1661  0.1069  0.1437  0.0000

Growth function: f(x) = Ae^(-Bx) + C
* baseline: value of the growth function at x = 1
** asymptote: limit of the growth function as x → ∞

Table 5.3: Datasets and user populations used to generalize the benefits of sharing, with the corresponding growth-function and baseline values. Studied subsets of populations in some datasets are indicated, along with the threshold values for the number of sites visited by a user or home. Graphs of selected growth functions are in Figure 5.17.

When we look at a more diverse population of users and sites like CCZ and SURFnet, their lower starting baselines (0.203–0.369) and slower growth functions show that many more sharing users are required to reach their respective limits (0.610–0.723), which are also lower. In SURFnet, even when we consider the entire population of all 606 users sharing with each other, the resulting asymptote of 0.863 shows that it is rare to be able to achieve 100 % protection for all users.

Surprisingly, when we consider all sites for users at USC, the growth curve and corresponding values (baseline: 0.175, asymptote: 0.425) are lower than those of CCZ or SURFnet. We attribute this outcome to several possible factors. The users at USC represent a small sample from the computer science department with diverse roles (faculty, staff, students) and interests.
Data at CCZ (and similarly at SURFnet) is collected from a larger number of households, each with a varying and unknown number of users per household. We therefore expect that sharing appears more effective in these communities due to some level of aggregation at the household level.

Finally, we find that users of social networking sites like Hacker News (HN) and Twitter can benefit somewhat from sharing, despite having low and varying affinity levels: the users of HN, a community focused on technology and entrepreneurship, have a much higher affinity (and corresponding benefit) compared to Twitter's users (randomly picked from a global sample), who are much more diverse and loosely knit. For both social networks, we build web histories for each user based on their comments on a story/link (HN) or replies/retweets to a tweet (Twitter). (We follow the approach of Su et al. [120] in using social network activity to represent web histories. We walk through an example of how we build such a history in Figure 5.18, providing our datasets and analysis at https://ant.isi.edu/auntietuna.) We find that the coverage for users on both sites has similar growth-curve shapes and much lower baselines (0–0.133) and asymptotes (0.140–0.555) when compared to the previous groups.

Example Interaction on a Social Network:
  Alice: Lorem ipsum dolor sit amet, example.com.
  Bob: Praesent est ligula, tempus id lacinia non.
  Charlie: Nulla aliquet diam ac another.example.com.
  Alice: Maecenas sit amet pretium erat.

Derived Web History Profiles:
  Alice: {example.com, another.example.com}
  Bob: {example.com}
  Charlie: {example.com, another.example.com}

Figure 5.18: An example of deriving web history profiles (bottom) from social network interactions (top): Alice posts a story containing (or tweets) a link to example.com, and Bob and Charlie reply to Alice's story. We consider example.com to be in both Alice's and Bob's respective histories. Charlie also includes a link in their comment to another.example.com, to which Alice replies: this URL is also included in Alice's and Charlie's histories.
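One simple reading of this derivation, consistent with the Figure 5.18 example, is that a linked site is credited to its poster and to every participant who posts later in the same thread. A minimal sketch of that rule (the thread representation and the URL-matching regex are illustrative assumptions):

import re

URL_RE = re.compile(r"\b((?:[\w-]+\.)+[a-z]{2,})\b")

def derive_profiles(thread):
    """thread: list of (author, text) in posting order; returns author -> set of sites."""
    profiles = {}
    seen_sites = set()  # sites linked so far in the thread
    for author, text in thread:
        seen_sites |= set(URL_RE.findall(text))
        profiles.setdefault(author, set()).update(seen_sites)
    return profiles

thread = [("Alice", "Lorem ipsum dolor sit amet, example.com."),
          ("Bob", "Praesent est ligula, tempus id lacinia non."),
          ("Charlie", "Nulla aliquet diam ac another.example.com."),
          ("Alice", "Maecenas sit amet pretium erat.")]
print(derive_profiles(thread))  # matches the profiles derived in Figure 5.18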
We have generalized the benefits of sharing by enumerating and analyzing the growth curves of multiple communities, showing that the effectiveness of sharing increases in proportion to the affinity of the sharing users. We next describe a scenario in which AuntieTuna can help improve the security of election campaigns by securing their staff's web browsers.

5.6.3 Improving Election Campaign Security

Modern political campaigns are increasingly dependent on online data, data sharing, and Internet services to operate. This dependence places these campaigns (including the candidate and staff) at high risk of cyberattacks, both from the opposition and from nation-states seeking to influence elections [57, 108, 96]. Specifically, both types of adversaries are motivated to phish either the work or personal accounts of political candidates and campaign staff, either to directly acquire information or to launch stepping-stone attacks (lateral phishing). Moreover, because political campaigns employ many volunteers, brought together to form ad-hoc teams and all operating under the pressure of an election, consistent defensive operational security is extremely challenging. To secure election campaigns from cyberattacks, we need to start with securing the work and personal web accounts of their candidates, staff, and volunteers.

Multi-factor authentication (MFA) adds an additional layer of security on services that implement it: Google, for example, provides campaigns with free hardware tokens that are used to authenticate a user in addition to their password [34]. However, campaigns often use many services in a fragmented ecosystem, with services (including custom, "homebuilt" ones) provided by many different providers, often with reused passwords [31]. This ecosystem can be worse than that of a university environment (Section 5.6.2) because there is often no unified Single Sign-On (SSO) process, and not all services implement MFA. A compromised password on an MFA-protected service may still leave other services vulnerable.

We can help secure people's web accounts by protecting against phishing attacks with AuntieTuna, complementing other security measures like MFA. As users add the services they use to their known-good, existing users can easily and immediately share their known-good in AuntieTuna with new members of a campaign as they join, bootstrapping the new member's known-good with inoculated sites.

AuntieTuna's phish detection and data sharing help secure people's web accounts with minimal friction. Data sharing helps inoculate users before they first use the services they need: in an environment without SSO and where the shared affinity between users is high, users need to share with at least 5 friends to be sufficiently protected (evaluated in Section 5.6.2.2). (If many services are behind SSO, users only need to share with at least 1 friend.) Inoculation has an additional benefit in improving AuntieTuna's usability (users do not need to keep track of services) and in reducing the time needed for onboarding at a campaign.

A major challenge to adoption of AuntieTuna (or something similar) is developing the ability to operate on mobile devices, specifically on the popular but restrictive Apple iOS operating system. While iOS provides strong sandboxing as part of its security, this sandboxing places constraints on the lower-level functionality we need in order to implement AuntieTuna on mobile devices; we leave the specific implementation on mobile as future work.

We have walked through an application of improving election campaign security by protecting the people involved in such a campaign from phishing attacks. We believe our techniques in phish detection and data sharing are beneficial and efficient for campaigns, enabling them to focus on their main tasks while maintaining a defensive cybersecurity posture.

5.7 Conclusions

This chapter described a collaborative phishing defense based on per-user whitelisting with multi-user data sharing. The addition of data sharing to AuntieTuna-Schooling, an anti-phishing browser extension that precisely detects phishing websites, enables a collaborative defense and inoculation through friends who use common websites. We showed that this approach is particularly effective at large organizations, like USC, where many services use Single Sign-On (SSO). Users at such organizations often share common non-work interests, and we showed a "halo effect" around phish prevention for community websites that are "enterprise-adjacent". We suggest that AuntieTuna-Schooling would be particularly effective in improving security for loosely-coupled organizations like political election campaigns.
This kind of collaborative defense is particularly important for these groups because of the damage a compromise of SSO represents for an organization, and because of the importance of protecting loosely-coupled organizations like election campaigns. AuntieTuna-Schooling is free and open-source and available today.

This chapter supports the thesis statement by showing how AuntieTuna-Schooling improves one's network security by reducing successful phishing attacks using local and personalized detection of phishing websites (based on an earlier version of AuntieTuna in Chapter 3) and the controlled sharing of previously-private information with collaborators. Users first personalize their local defense against phishing, and then share their previously-private customization with their social circles, improving their group's collective immunity. AuntieTuna-Schooling then protects now-inoculated users with personalized (bootstrapped by sharing) and local detection to detect and prevent access to phishing sites as they browse the web.

Chapter 6

Conclusions

Improving network security is a challenging and continuous task because the Internet is built of many independent, distributed, and diverse networks connected with one another. This thesis has shown how we can improve network security through collaborative sharing. There are many steps and challenges that remain in pursuit of forward progress. We now describe potential future directions from our experiences and then present our conclusions.

6.1 Future Directions

Implementing new and existing malicious activity detection and instrumentation to further improve network security across a diverse set of devices. Mobile and IoT devices bring unparalleled convenience in computing at low cost, as well as their own set of challenges in securing them. These platforms have more limited functionality and computational power, and are often inflexible or heavily sandboxed for development, yet are able to inflict a disproportionate amount of damage. Given the popularity and continuing growth of mobile and IoT devices on the network, malicious activity detection techniques need to be effective and self-sufficient on these resource-constrained devices. (We found in Chapter 5 that high-profile targets like those involved in election campaigns often exclusively use mobile devices.) For example, we showed how content reuse detection, including finding phishing sites, can be localized to the client device's web browser (Chapter 3) and implemented on commodity hardware (Chapter 2). While content reuse detection can additionally be implemented in a straightforward manner in email gateways or crawlers, we found challenges in implementing our approach on mobile phones and tablets because of platform limitations. As another example, how could we effectively apply the idea of inoculation (Chapter 5) to IoT devices on home and enterprise networks and prevent the next class of DDoS or other attacks?

Improving and using cybersecurity data sharing to make forward progress in network security while preserving privacy. There are still many human and technical barriers to regularizing cybersecurity data sharing in the context of improving network security. We have shown that data sharing helps improve malicious activity detection (Chapter 4, Chapter 5) when exchanging data between parties that often already share some level of trust with one another.
Could we convince parties that have no existing relationship (trust or otherwise) that data sharing between them not only improves their network security but can happen safely with minimal risk, by further quantifying and qualifying the risk-benefit trade-off? For example, if we could prove or quantitatively show that both parties risk and benefit equally over one (or more) data-sharing transactions, would enumerating and assigning weights to the potential outcomes help people make more informed decisions about how sharing would help improve network security? One possible approach could be to apply game theory to determine the viability of a sharing transaction by finding an equilibrium in which both parties are satisfied with the risks and benefits given their constraints (tolerance for risk and expected value). Part of finding this equilibrium requires further exploration of negotiation and escalation mechanisms when asking for another entity's data: how does a querier bargain or justify asking for more sensitive data while the responder seeks to maximally preserve its own privacy (and that of its users)?

When data sharing is regularized, new applications in network security will become possible. With an increased diversity and amount of data, we can rapidly develop and prototype new algorithms to detect or predict malicious activity, for example, forecasting when or where the next cyberattack will hit.

Improving the usability of techniques and tools that detect malicious activity for their users. Transitioning detection techniques from a research environment to an operational context on both enterprise and home networks requires us to consider their design and use from the perspective of the end-user. Network operators, often already saturated with security alerts, will be further burdened and unlikely to use a technique or tool if, for example, it cannot explain concisely why a particular traffic pattern is anomalous or malicious. Similarly, the general user at home does not understand [63, 68] or will not configure [59] complex settings, indicators, and algorithm thresholds at the cost of their convenience. Too many false positives or user-experience interruptions lead to a decreased security posture when alerts are ignored or the tool is disabled (User Account Control in Windows Vista and Windows 7 [87]).

To improve usability, we need to better understand how our end-users operate and design solutions that minimize friction and present clear, actionable results. In AuntieTuna (Chapter 3), for example, we focused on usability by keeping the user interface and experience minimal: only when AuntieTuna strongly suspects that a phishing site has been detected does it interfere with the user's workflow. A future improvement would be to automate the personalization of AuntieTuna even further and require zero configuration, while simultaneously balancing the unintended effects that might occur (false positives).

As another example, we showed how organizations can use data sharing safely in Retro-Future (Chapter 4) to improve network security, but additional challenges remain in collating the diverse sources of data and producing actionable output from malicious activity detection (without requiring expert knowledge of the underlying detection algorithms). As a start in data collation, we introduced timefind [77], enabling the indexing, searching, and downselecting of many data types over time.
Additional functionality, like automatic gathering and selection of the correct input data for a query, would improve the usability of a system like Retro-Future. Similarly, the output of many malicious activity detection techniques requires a careful understanding of the technique and its limitations. Network operators, as we discussed earlier, are often subject to time constraints and cannot delve into all the nuances of every technique and tool. Improving the usability of our detection by presenting clear and actionable output would increase its usage by the end-user and further improve network security.

6.2 Conclusions

In this thesis we have shown how improving one's network security strongly benefits from a combination of personalized, local detection, coupled with the controlled exchange of previously-private network information with collaborators. At the beginning of this thesis, we described how our current responses to cybersecurity incidents use mechanisms that promote a security monoculture, are too centralized, and are too slow. We studied four approaches to improving network security, resolving the problems of a security monoculture with personalized detection, of centralized mechanisms with local detection, and of slow mechanisms with controlled sharing of information with collaborators.

In Chapter 2, we showed how Content Reuse Detection improves network security by finding previously undetected bad neighborhoods, or hierarchical clusters of copied and potentially malicious content. Our design choices of hash-based, local discovery and detection enabled our approach to scale to web-sized datasets on commodity hardware. We used our approach to explore how content is duplicated on the Internet, and to find monetized instances of Wikipedia and phishing sites. Our application of our approach to finding phishing sites then motivated the core of our next work in AuntieTuna, bringing phishing detection directly and locally to the end-user in their web browser.

We presented AuntieTuna in Chapter 3, a self-contained web browser extension that improves network security by finding and preventing access to phishing sites as a user browses. We adapted our previous approach of content reuse detection to locally detecting phishing sites in the browser, and personalized AuntieTuna's detection to each user's browsing behavior to keep detection lightweight and to diversify the user's defense. We additionally focused on maximizing AuntieTuna's usability with a hands-off approach to configuration and use. Our study of AuntieTuna inspired the next two studies in how we could use data sharing to improve the effectiveness of malicious activity detection algorithms.

In Chapter 4, we presented Retro-Future, a controlled information exchange framework that enables cybersecurity data sharing with controls to manage the risk-benefit trade-off. We showed how Retro-Future improves network security by increasing the effectiveness of local detection of malicious activity through the sharing of previously-private data. We implemented the mechanisms to manage data disclosure when organizations query or respond to others for their previously-private network data. We quantified how data sharing enables organizations to detect more malicious activity, and showed how both small and large organizations benefit in detecting previously unknown activity when they share.
Our study of Retro-Future improving an organization's network security motivated our next study in how we could use its concepts and controls in controlled information exchange to further improve the network security of users at home and in the office against phishing attacks.

Finally, in Chapter 5 we developed AuntieTuna-Schooling to further improve network security by protecting users and their friends against phishing site attacks. We augmented an earlier version of AuntieTuna (first introduced in Chapter 3) with methods for peer-to-peer and centralized data sharing to enable users and their friends to collectively build a continuously improving defense against phishing sites. We identified and evaluated the growth of Single Sign-On authentication at large organizations to be both a benefit for its users and an attractive phishing target for attackers. We showed that users and their friends can successfully leverage their commonalities when sharing data with one another, inoculating themselves against phish.

The Internet is ingrained as a critical part of our lives, and there are many who seek to take advantage of its weaknesses for malicious gain. As the Internet and our reliance on it continue to evolve, we will need to continuously collaborate on new techniques to secure the Internet while maintaining the security and privacy of the people that depend on it.

Bibliography

[1] Reed Abelson and Matthew Goldstein. "Millions of Anthem Customers Targeted in Cyberattack". In: The New York Times (Feb. 2015). url: https://www.nytimes.com/2015/02/05/business/hackers-breached-data-of-millions-insurer-says.html.

[2] Steven P. Abney. "Parsing by Chunks". In: Principle-Based Parsing: Computation and Psycholinguistics. Ed. by Robert C. Berwick, Steven P. Abney, and Carol Tenny. USA: Kluwer Academic Publishers, 1991, pp. 257–278. isbn: 0792311736.

[3] Sadia Afroz and Rachel Greenstadt. "PhishZoo: Detecting Phishing Websites by Looking at Them". In: 2011 IEEE Fifth International Conference on Semantic Computing. IEEE, Sept. 2011, pp. 368–375. doi: 10.1109/ICSC.2011.52.

[4] Mark Allman and Vern Paxson. "Issues and Etiquette Concerning Use of Shared Measurement Data". In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement. IMC '07. San Diego, California, USA: ACM, 2007, pp. 135–140. isbn: 978-1-59593-908-1. doi: 10.1145/1298306.1298327.

[5] Mark Allman. Case Connection Zone DNS Transactions. Apr. 2019. url: https://www.icir.org/mallman/data.html.

[6] C. J. Antonelli, M. Undy, and P. Honeyman. "The Packet Vault: Secure Storage of Network Data". In: Proceedings of the 1st Conference on Workshop on Intrusion Detection and Network Monitoring - Volume 1. ID'99. Santa Clara, California: USENIX Association, 1999, pp. 11–11. url: http://www.citi.umich.edu/techreports/reports/citi-tr-98-5-usenix.pdf.

[7] Apache. Hadoop. 2012. url: http://hadoop.apache.org.

[8] Apache. Pig. 2013. url: http://pig.apache.org.

[9] ArchiveTeam. GeoCities. 2009. url: http://archiveteam.org/index.php/GeoCities.

[10] Calvin Ardi and John Heidemann. "AuntieTuna: Personalized Content-based Phishing Detection". In: Proceedings of the 2016 NDSS Workshop on Usable Security. USEC '16. San Diego, CA, USA: Internet Society, Feb. 2016. isbn: 1-891562-42-8. doi: 10.14722/usec.2016.23012.

[11] Calvin Ardi and John Heidemann. "Leveraging Controlled Information Sharing for Botnet Activity Detection". In: Proceedings of the 2018 Workshop on Traffic Measurements for Cybersecurity. WTMC '18.
Budapest, Hungary: Association for Computing Machinery, 2018, pp. 14–20. isbn: 978-1-45035-910-8. doi: 10.1145/3229598.3229602.

[12] Calvin Ardi and John Heidemann. "Precise Detection of Content Reuse in the Web". In: SIGCOMM Comput. Commun. Rev. 49.2 (May 2019), pp. 9–24. issn: 0146-4833. doi: 10.1145/3336937.3336940.

[13] Majid Arianezhad, L. Jean Camp, Timothy Kelley, and Douglas Stebila. "Comparative Eye Tracking of Experts and Novices in Web Single Sign-On". In: Proceedings of the Third ACM Conference on Data and Application Security and Privacy. CODASPY '13. San Antonio, Texas, USA: Association for Computing Machinery, Feb. 2013, pp. 105–116. isbn: 9781450318907. doi: 10.1145/2435349.2435362.

[14] Alessandro Armando, Roberto Carbone, Luca Compagna, Jorge Cuellar, Giancarlo Pellegrino, and Alessandro Sorniotti. "From Multiple Credentials to Browser-Based Single Sign-On: Are We More Secure?" In: Future Challenges in Security and Privacy for Academia and Industry. Ed. by Jan Camenisch, Simone Fischer-Hübner, Yuko Murayama, Armand Portmann, and Carlos Rieder. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 68–79. isbn: 978-3-642-21424-0. doi: 10.1007/978-3-642-21424-0_6.

[15] Stefan Axelsson. "The Base-rate Fallacy and the Difficulty of Intrusion Detection". In: ACM Trans. Inf. Syst. Secur. 3.3 (Aug. 2000), pp. 186–205. issn: 1094-9224. doi: 10.1145/357830.357849.

[16] Burton H. Bloom. "Space/time trade-offs in hash coding with allowable errors". In: Commun. ACM 13.7 (July 1970), pp. 422–426. issn: 0001-0782. doi: 10.1145/362686.362692.

[17] Sergey Brin and Lawrence Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". In: Proceedings of the Seventh International World Wide Web Conference. Brisbane, Queensland, Australia, Apr. 1998, pp. 107–117. doi: 10.1016/S0169-7552(98)00110-X.

[18] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. "Syntactic clustering of the Web". In: Papers from the Sixth International World Wide Web Conference. Vol. 29. 8. Santa Clara, California, United States: Elsevier Science Publishers Ltd., 1997, pp. 1157–1166. doi: 10.1016/S0169-7552(97)00031-7.

[19] Deanna D. Caputo, Shari Lawrence Pfleeger, Jesse D. Freeman, and M. Eric Johnson. "Going Spear Phishing: Exploring Embedded Training and Awareness". In: IEEE Security & Privacy 12.1 (Jan. 2014), pp. 28–38. issn: 1558-4046. doi: 10.1109/MSP.2013.106.

[20] Nicholas Carlini, Adrienne Porter Felt, and David Wagner. "An Evaluation of the Google Chrome Extension Security Architecture". In: Presented as part of the 21st USENIX Security Symposium (USENIX Security 12). Bellevue, WA: USENIX, 2012, pp. 97–111. url: https://www.usenix.org/conference/usenixsecurity12/technical-sessions/presentation/carlini.

[21] Center for Applied Internet Data Analysis. CAIDA. 2017. url: https://caida.org.

[22] Moses S. Charikar. "Similarity Estimation Techniques from Rounding Algorithms". In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing. STOC '02. Montreal, Quebec, Canada: ACM, 2002, pp. 380–388. isbn: 1-58113-495-9. doi: 10.1145/509907.509965.

[23] Stanford Chiu, Ibrahim Uysal, and W. Bruce Croft. "Evaluating text reuse discovery on the web". In: Proceedings of the third symposium on Information interaction in context. IIiX '10. New Brunswick, New Jersey, USA: ACM, 2010, pp. 299–304. isbn: 978-1-4503-0247-0. doi: 10.1145/1840784.1840829.

[24] Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina. "Finding Replicated Web Collections".
In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD '00. Dallas, Texas, USA: ACM, 2000, pp. 355–366. isbn: 1-58113-217-4. doi: 10.1145/342009.335429.

[25] Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe. "Collection Statistics for Fast Duplicate Document Detection". In: ACM Trans. Inf. Syst. 20.2 (Apr. 2002), pp. 171–191. issn: 1046-8188. doi: 10.1145/506309.506311.

[26] Michael Cieply and Brooks Barnes. "Sony Cyberattack, First a Nuisance, Swiftly Grew Into a Firestorm". In: The New York Times (Dec. 2014). url: https://www.nytimes.com/2014/12/31/business/media/sony-attack-first-a-nuisance-swiftly-grew-into-a-firestorm-.html.

[27] K. Claffy and E. Kenneally. "Dialing Privacy and Utility: A Proposed Data-Sharing Framework to Advance Internet Research". In: IEEE Security & Privacy 8.4 (July 2010), pp. 31–39. issn: 1540-7993. doi: 10.1109/MSP.2010.57.

[28] Michael Corkery. "Hackers' $81 Million Sneak Attack on World Banking". In: The New York Times (Apr. 2016). url: https://www.nytimes.com/2016/05/01/business/dealbook/hackers-81-million-sneak-attack-on-world-banking.html.

[29] Stacy Cowley and Liam Stack. "Los Angeles Hospital Pays Hackers $17,000 After Attack". In: The New York Times (Feb. 2016). url: https://www.nytimes.com/2016/02/19/business/los-angeles-hospital-pays-hackers-17000-after-attack.html.

[30] Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. "An Open Digest-based Technique for Spam Detection." In: ISCA PDCS 2004 (2004), pp. 559–564. url: http://spdp.di.unimi.it/papers/pdcs04.pdf.

[31] Anupam Das, Joseph Bonneau, Matthew Caesar, Nikita Borisov, and XiaoFeng Wang. "The Tangled Web of Password Reuse". In: Proceedings of the 21st Annual Network and Distributed System Security Symposium. NDSS '14. San Diego, California, USA: Internet Society, Feb. 2014. url: https://www.ndss-symposium.org/ndss2014/programme/tangled-web-password-reuse/.

[32] Julie Hirschfeld Davis. "Hacking of Government Computers Exposed 21.5 Million People". In: The New York Times (July 2015). url: https://www.nytimes.com/2015/07/10/us/office-of-personnel-management-hackers-got-data-of-millions.html.

[33] Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6. OSDI '04. San Francisco, CA: USENIX Association, 2004, p. 10. url: https://research.google/pubs/pub62/.

[34] Defending Digital Campaigns. Google Joins Defending Digital Campaigns To Protect 2020 Campaigns. Feb. 2020. url: https://www.defendcampaigns.org/press-release-february-11-2020.

[35] Rachna Dhamija and J. D. Tygar. "The Battle Against Phishing: Dynamic Security Skins". In: Proceedings of the 2005 Symposium on Usable Privacy and Security. SOUPS '05. Pittsburgh, Pennsylvania, USA: ACM, 2005, pp. 77–88. isbn: 1-59593-178-3. doi: 10.1145/1073001.1073009.

[36] John R. Douceur. "The Sybil Attack". In: Peer-to-Peer Systems. Ed. by Peter Druschel, Frans Kaashoek, and Antony Rowstron. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 251–260. isbn: 978-3-540-45748-0. doi: 10.1007/3-540-45748-8_24.

[37] Heinz Dreher. "Automatic Conceptual Analysis for Plagiarism Detection". In: Journal of Issues in Informing Science and Information Technology 4 (2007), pp. 601–614. doi: 10.28945/3141.

[38] Vincent Drury and Ulrike Meyer. "Certified Phishing: Taking a Look at Public Key Certificates of Phishing Websites".
In: Proceedings of the Fifteenth USENIX Conference on Usable Privacy and Security. SOUPS ’19. Santa Clara, CA, USA: USENIX Association, Aug. 2019, pp. 211–223.isbn: 9781939133052.url: https://www.usenix.org/conference/soups2019/presentation/drury. [39] Serge Egelman, Lorrie Faith Cranor, and Jason Hong. “You’Ve Been Warned: An Empirical Study of the Eectiveness of Web Browser Phishing Warnings”. In: Proceedings of the SIGCHI Conference onHumanFactorsinComputingSystems. CHI ’08. Florence, Italy: ACM, 2008, pp. 1065–1074.isbn: 978-1-60558-011-1.doi: 10.1145/1357054.1357219. [40] Sven Meyer zu Eissen and Benno Stein. “Intrinsic Plagiarism Detection”. In: Advances in Information Retrieval. Ed. by Mounia Lalmas, Andy MacFarlane, Stefan Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 565–569.isbn: 978-3-540-33348-7.doi: 10.1007/11735106_66. [41] Gina Fisk, Calvin Ardi, Neale Pickett, John Heidemann, Mike Fisk, and Christos Papadopoulos. “Privacy Principles for Sharing Cyber Security Data”. In:Proceedingsofthe2015IEEEInternational Workshop on Privacy Engineering. IWPE ’15. San Jose, California, USA: IEEE, May 2015.doi: 10.1109/SPW.2015.23. [42] Common Crawl Foundation. Common Crawl. 2010.url: http://commoncrawl.org. [43] Wikimedia Foundation. Static HTML Dump of Wikipedia. June 2008.url: http://dumps.wikimedia.org/other/static_html_dumps/2008-06/en/. [44] Wikimedia Foundation. Wikimedia Statistics. Feb. 2019.url: https://stats.wikimedia.org/v2/#/en.wikipedia.org. 171 [45] Julien Freudiger, Emiliano De Cristofaro, and Alejandro E. Brito. “Controlled Data Sharing for Collaborative Predictive Blacklisting”. In: Detection of Intrusions and Malware, and Vulnerability Assessment. Ed. by Magnus Almgren, Vincenzo Gulisano, and Federico Maggi. Cham: Springer International Publishing, 2015, pp. 327–349.isbn: 978-3-319-20550-2.doi: 10.1007/978-3-319-20550-2_17. [46] José M. de Fuentes, Lorena González-Manzano, Juan Tapiador, and Pedro Peris-Lopez. “PRACIS: Privacy-preserving and aggregatable cybersecurity information sharing”. In: Computers & Security 69 (2017), pp. 127–141.issn: 0167-4048.doi: 10.1016/j.cose.2016.12.011. [47] Kensuke Fukuda, John Heidemann, and Abdul Qadeer. “Detecting Malicious Activity with DNS Backscatter Over Time”. In: IEEE/ACM Transactions on Networking 25.5 (Oct. 2017), pp. 3203–3218.issn: 1558-2566.doi: 10.1109/TNET.2017.2724506. [48] Steven Furnell. “Phishing: can we spot the signs?” In: Computer Fraud & Security 2007.3 (Mar. 2007), pp. 10–15.issn: 1361-3723.doi: 10.1016/S1361-3723(07)70035-0. [49] Christopher Gates, Ninghui Li, Jing Chen, and Robert Proctor. “CodeShield: Towards Personalized Application Whitelisting”. In: Proceedings of the 28th Annual Computer Security Applications Conference. ACSAC ’12. Orlando, Florida, USA, 2012, pp. 279–288.isbn: 978-1-4503-1312-4.doi: 10.1145/2420950.2420992. [50] Daniel Geer, Charles P. Peeger, Bruce Schneier, John S. Quarterman, Perry Metzger, Rebecca Bace, and Peter Gutmann. CyberInsecurity: The Cost of Monopoly: How the Dominance of Microsoft’s Products Poses a Risk to Security. Tech. rep. Computer and Communications Industry Association, Sept. 2003.url: https://www.schneier.com/essays/archives/2003/09/cyberinsecurity_the.html. [51] Mohammad Ghasemisharif, Amrutha Ramesh, Stephen Checkoway, Chris Kanich, and Jason Polakis. “O Single Sign-O, Where Art Thou? An Empirical Analysis of Single Sign-On Account Hijacking and Session Management on the Web”. 
In: 27th USENIX Security Symposium (USENIX Security 18). Baltimore, MD: USENIX Association, Aug. 2018, pp. 1475–1492.isbn: 978-1-939133-04-5.url: https://www.usenix.org/conference/usenixsecurity18/presentation/ghasemisharif. [52] Vindu Goel and Nicole Perlroth. “Yahoo Says 1 Billion User Accounts Were Hacked”. In: The New York Times (Dec. 2016).url: https://www.nytimes.com/2016/12/14/technology/yahoo-hack.html. [53] Google. Password Alert. Apr. 2015.url: https://www.google.com/ideas/products/password-alert/. [54] R. Gowtham and Ilango Krishnamurthi. “PhishTackle–a Web Services Architecture for Anti-phishing”. In: Cluster Computing 17.3 (Sept. 2014), pp. 1051–1068.issn: 1386-7857.doi: 10.1007/s10586-013-0320-5. [55] Xiao Han, Nizar Kheir, and Davide Balzarotti. “PhishEye: Live Monitoring of Sandboxed Phishing Kits”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS ’16. Vienna, Austria: Association for Computing Machinery, 2016, pp. 1402–1413. isbn: 9781450341394.doi: 10.1145/2976749.2978330. 172 [56] Amber van der Heijden and Luca Allodi. “Cognitive Triaging of Phishing Attacks”. In: 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 1309–1326.isbn: 978-1-939133-06-9.url: https://www.usenix.org/conference/usenixsecurity19/presentation/van-der-heijden. [57] Susan Hennessey. “Deterring Cyberattacks: How to Reduce Vulnerability”. In: Foreign Aairs. Vol. 96. 6. Nov. 2017, pp. 39–46.url: https://www.foreignaffairs.com/reviews/review-essay/2017-10-16/deterring-cyberattacks. [58] Monika Henzinger. “Finding near-duplicate web pages: a large-scale evaluation of algorithms”. In: Proceedings ofthe 29th annualinternational ACM SIGIR conference on Research and development in information retrieval. SIGIR ’06. Seattle, Washington, USA: ACM, 2006, pp. 284–291.isbn: 1-59593-369-7.doi: 10.1145/1148170.1148222. [59] Cormac Herley. “So Long, and No Thanks for the Externalities: The Rational Rejection of Security Advice by Users”. In: Proceedings of the 2009 Workshop on New Security Paradigms Workshop. NSPW ’09. Oxford, United Kingdom: ACM, 2009, pp. 133–144.isbn: 978-1-60558-845-2.doi: 10.1145/1719030.1719050. [60] Allan Heydon and Marc Najork. “Mercator: A Scalable, Extensible Web Crawler”. In: World-Wide Web Journal 2.4 (Dec. 1999), pp. 219–229.doi: 10.1023/A:1019213109274. [61] Grant Ho, Asaf Cidon, Lior Gavish, Marco Schweighauser, Vern Paxson, Stefan Savage, Georey M. Voelker, and David Wagner. “Detecting and Characterizing Lateral Phishing at Scale”. In: 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 1273–1290.isbn: 978-1-939133-06-9.url: https://www.usenix.org/conference/usenixsecurity19/presentation/ho. [62] Jason Hong. “The State of Phishing Attacks”. In: Commun. ACM 55.1 (Jan. 2012), pp. 74–81.issn: 0001-0782.doi: 10.1145/2063176.2063197. [63] Adele E. Howe, Indrajit Ray, Mark Roberts, Malgorzata Urbanska, and Zinta Byrne. “The Psychology of Security for the Home Computer User”. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy. SP ’12. USA: IEEE Computer Society, 2012, pp. 209–223.isbn: 9780769546810.doi: 10.1109/SP.2012.23. [64] Martin Husák and Jaroslav Kašpar. “Towards Predicting Cyber Attacks Using Information Exchange and Data Mining”. In: 2018 14th International Wireless Communications Mobile Computing Conference (IWCMC). June 2018, pp. 536–541.doi: 10.1109/IWCMC.2018.8450512. [65] InCommon/Internet2. InCommon Federation. 
2019.url: https://www.incommon.org. [66] InCommon/Internet2. InCommon Metadata Aggregate. 2020.url: https://spaces.at.internet2.edu/display/federation/Download+InCommon+metadata. [67] Internet Security Research Group (ISRG). Let’s Encrypt - Free SSL/TLS Certicates. 2020.url: https://letsencrypt.org. 173 [68] Iulia Ion, Rob Reeder, and Sunny Consolvo. “‘. . . No One Can Hack My Mind’: Comparing Expert and Non-Expert Security Practices”. In: Proceedings of the Eleventh USENIX Conference on Usable Privacy and Security. SOUPS ’15. Ottawa, Canada: USENIX Association, 2015, pp. 327–346.isbn: 9781931971249.url: https://www.usenix.org/system/files/conference/soups2015/soups15-paper-ion.pdf. [69] Jostein Jensen. “Benets of Federated Identity Management - A Survey from an Integrated Operations Viewpoint”. In: Availability, Reliability and Security for Business, Enterprise and Health Information Systems. Ed. by A. Min Tjoa, Gerald Quirchmayr, Ilsun You, and Lida Xu. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1–12.isbn: 978-3-642-23300-5.doi: 10.1007/978-3-642-23300-5_1. [70] Merrit Kennedy. “Equifax Conrms Another ’Security Incident’”. In: NPR (Sept. 2017).url: https://www.npr.org/sections/thetwo-way/2017/09/19/552124551/equifax-confirms-another- security-incident. [71] Jong Wook Kim, K. Selçuk Candan, and Junichi Tatemura. “Ecient overlap and content reuse detection in blogs and online news articles”. In: Proceedings of the 18th international conference on World wide web. WWW ’09. Madrid, Spain: ACM, 2009, pp. 81–90.isbn: 978-1-60558-487-4.doi: 10.1145/1526709.1526721. [72] Loren Kohnfelder and Praerit Garg. The Threats to Our Products. Microsoft. Apr. 1999. [73] Stefan Kornexl, Vern Paxson, Holger Dreger, Anja Feldmann, and Robin Sommer. “Building a Time Machine for Ecient Recording and Retrieval of High-volume Network Trac”. In: Proceedingsofthe5thACMSIGCOMMConferenceonInternetMeasurement. IMC ’05. Berkeley, CA: USENIX Association, 2005, pp. 23–23.url: https://www.usenix.org/legacy/events/imc05/tech/kornexl.html. [74] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham. “School of Phish: A Real-World Evaluation of Anti-Phishing Training”. In: Proceedings of the 5th Symposium on Usable Privacy and Security. SOUPS ’09. Mountain View, California, USA: Association for Computing Machinery, 2009.isbn: 978-1-60558-736-3.doi: 10.1145/1572532.1572536. [75] Ponnurangam Kumaraguru, Steve Sheng, Alessandro Acquisti, Lorrie Faith Cranor, and Jason Hong. “Teaching Johnny Not to Fall for Phish”. In: ACM Trans. Internet Technol. 10.2 (June 2010).issn: 1533-5399.doi: 10.1145/1754393.1754396. [76] Supriya Kurane and Jim Finkle. “Health insurer Anthem hit by massive cybersecurity breach”. In: Reuters (Feb. 2015).url: https://www.reuters.com/article/us-anthem-cybersecurity/health- insurer-anthem-hit-by-massive-cybersecurity-breach-idUSKBN0L907J20150205. [77] Los Alamos National Laboratory and University of Southern California/Information Sciences Institute. timend and indexer. 2016.url: https://ant.isi.edu/software/timefind/. [78] Ralph Langner. “Stuxnet: Dissecting a Cyberwarfare Weapon”. In: IEEE Security & Privacy 9.3 (May 2011), pp. 49–51.issn: 1558-4046.doi: 10.1109/MSP.2011.67. 174 [79] Wenyin Liu, Xiaotie Deng, Guanglin Huang, and Anthony Y Fu. “An antiphishing strategy based on visual similarity assessment”. In: Internet Computing, IEEE 10.2 (2006), pp. 58–65.doi: 10.1109/MIC.2006.23. 
[80] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. “Multi-Probe LSH: Ecient Indexing for High-Dimensional Similarity Search”. In: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB ’07. Vienna, Austria: VLDB Endowment, 2007, pp. 950–961.isbn: 978-1-59593-649-3.url: http://www.vldb.org/conf/2007/papers/research/p950-lv.pdf. [81] Justin Ma, Lawrence K. Saul, Stefan Savage, and Georey M. Voelker. “Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs”. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’09. Paris, France: ACM, 2009, pp. 1245–1254.isbn: 978-1-60558-495-9.doi: 10.1145/1557019.1557153. [82] Syed Zain Al-Mahmood. “Hackers Lurked in Bangladesh Central Bank’s Servers for Weeks”. In: The Wall Street Journal (Mar. 2016).url: https://www.wsj.com/articles/hackers-in-bangladesh- bank-account-heist-part-of-larger-breach-1458582678. [83] Gregor Maier, Robin Sommer, Holger Dreger, Anja Feldmann, Vern Paxson, and Fabian Schneider. “Enriching Network Security Analysis with Time Travel”. In: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication. SIGCOMM ’08. Seattle, WA, USA: ACM, 2008, pp. 183–194.isbn: 978-1-60558-175-0.doi: 10.1145/1402958.1402980. [84] J. Malhotra and J. Bakal. “A survey and comparative study of data deduplication techniques”. In: 2015 International Conference on Pervasive Computing (ICPC). Jan. 2015, pp. 1–5.doi: 10.1109/PERVASIVE.2015.7087116. [85] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. “Detecting near-duplicates for web crawling”. In: Proceedings of the 16th international conference on World Wide Web. WWW ’07. Ban, Alberta, Canada: ACM, 2007, pp. 141–150.isbn: 978-1-59593-654-7.doi: 10.1145/1242572.1242592. [86] Steven McCanne and Van Jacobson. “The BSD Packet Filter: A New Architecture for User-level Packet Capture”. In: Proceedings of the USENIX Winter 1993 Conference Proceedings on USENIX Winter1993ConferenceProceedings. USENIX’93. San Diego, California: USENIX Association, 1993, pp. 2–2.url: https://www.usenix.org/legacy/publications/library/proceedings/sd93/mccanne.pdf. [87] Sara Motiee, Kirstie Hawkey, and Konstantin Beznosov. “Do Windows Users Follow the Principle of Least Privilege? Investigating User Account Control Practices”. In: Proceedings of the Sixth Symposium on Usable Privacy and Security. SOUPS ’10. Redmond, Washington, USA: Association for Computing Machinery, 2010.isbn: 9781450302647.doi: 10.1145/1837110.1837112. [88] National Institute of Standards and Technology. Framework for Improving Critical Infrastructure Cybersecurity Version 1.1 Draft 2. Dec. 2017.url: https://www.nist.gov/framework. [89] National Institute of Standards and Technology. Secure Hash Standard (SHS). Federal Information Processing Standard (FIPS) 180-3. National Institute of Science and Technology, Oct. 2008.url: http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf. 175 [90] Netcraft Ltd. Netcraft Extension: Phishing Protection and Site Reports. 2019.url: https://toolbar.netcraft.com/. [91] Dennis Nishi. “The Ins and Outs of Cybersecurity Insurance”. In: The Wall Street Journal (June 2019).url: https://www.wsj.com/articles/the-ins-and-outs-of-cybersecurity-insurance-11559700180. [92] Arash Nourian, Sameer Ishtiaq, and Muthucumaru Maheswaran. “CASTLE: A social framework for collaborative anti-phishing databases”. In: Dec. 2009.doi: 10.4108/ICST.COLLABORATECOM2009.8310. [93] Matt O’Brien. 
“Yahoo: 3 billion accounts breached in 2013. Yes, 3 billion”. In:AssociatedPressNews (Oct. 2017).url: https://apnews.com/06a555ad1c19486ea49f6b5b80206847/. [94] Lukasz Olejnik, Claude Castelluccia, and Artur Janc. “Why Johnny Can’t Browse in Peace: On the Uniqueness of Web Browsing History Patterns”. In: 5th Workshop on Hot Topics in Privacy Enhancing Technologies (HotPETs 2012). Vigo, Spain, July 2012.url: https://hal.inria.fr/hal-00747841. [95] OpenDNS. PhishTank. 2019.url: https://www.phishtank.com. [96] Miles Parks. “Chinese, Iranian Hackers Targeted Biden And Trump Campaigns, Google Says”. In: NPR (June 2020).url: https://www.npr.org/2020/06/04/869922456/chinese-iranian-hackers- targeted-biden-and-trump-campaigns-google-says. [97] Vern Paxson. “Bro: A System for Detecting Network Intruders in Real-time”. In: Proceedings of the 7th Conference on USENIX Security Symposium - Volume 7. SSYM’98. San Antonio, Texas: USENIX Association, 1998, pp. 3–3.url: https://www.usenix.org/legacy/publications/library/ proceedings/sec98/full_papers/paxson/paxson.pdf. [98] Himabindu Pucha, David G. Andersen, and Michael Kaminsky. “Exploiting Similarity for Multi-Source Downloads Using File Handprints”. In: Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation. NSDI ’07. Cambridge, MA: USENIX Association, 2007, p. 2.url: https://www.usenix.org/conference/nsdi-07/exploiting-similarity-multi-source- downloads-using-file-handprints. [99] Lin Quan and John Heidemann. “On the Characteristics and Reasons of Long-Lived Internet Flows”. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement. IMC ’10. Melbourne, Australia: Association for Computing Machinery, 2010, pp. 444–450.isbn: 9781450304832.doi: 10.1145/1879141.1879198. [100] Sean Quinlan and Sean Dorward. “Venti: A New Approach to Archival Data Storage”. In: Proceedings of the 1st USENIX Conference on File and Storage Technologies. FAST ’02. Monterey, CA: USENIX Association, 2002.url: https://static.usenix.org/events/fast02/quinlan.html. [101] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. “Automatic Detection of Fragments in Dynamically Generated Web Pages”. In: Proceedings of the 13th International Conference on World Wide Web. WWW ’04. New York, NY, USA: ACM, 2004, pp. 443–454.isbn: 1-58113-844-X.doi: 10.1145/988672.988732. 176 [102] L. A. Ramshaw and M. P. Marcus. “Text Chunking Using Transformation-Based Learning”. In: Natural Language Processing Using Very Large Corpora. Ed. by Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky. Dordrecht: Springer Netherlands, 1999, pp. 157–176.isbn: 978-94-017-2390-9.doi: 10.1007/978-94-017-2390-9_10. [103] D. Eastlake 3rd and P. Jones. US Secure Hash Algorithm 1 (SHA1). RFC 3174. Updated by RFCs 4634, 6234. Fremont, CA, USA: RFC Editor, Sept. 2001.doi: 10.17487/RFC3174. [104] D. Eastlake 3rd and T. Hansen. US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF). RFC 6234. Fremont, CA, USA: RFC Editor, May 2011.doi: 10.17487/RFC6234. [105] C. Evans, C. Palmer, and R. Sleevi.PublicKeyPinningExtensionforHTTP. RFC 7469. Fremont, CA, USA: RFC Editor, Apr. 2015.doi: 10.17487/RFC7469. [106] Angelo P.E. Rosiello, E. Kirda, C. Kruegel, and F. Ferrandi. “A layout-similarity-based approach for detecting phishing pages”. In: Security and Privacy in Communications Networks and the Workshops, 2007. SecureComm 2007. Third International Conference on. Sept. 2007, pp. 454–463. doi: 10.1109/SECCOM.2007.4550367. 
[107] Ruslan Salakhutdinov and Georey Hinton. “Semantic hashing”. In: Int. J. Approx. Reasoning 50.7 (July 2009), pp. 969–978.issn: 0888-613X.doi: 10.1016/j.ijar.2008.11.006. [108] David E. Sanger and Emily Cochrane. “House Republican Campaign Committee Says It Was Hacked This Year”. In: The New York Times (Dec. 2018).url: https: //nytimes.com/2018/12/04/us/politics/national-republican-congressional-committee-hack.html. [109] Adam Satariano and Nicole Perlroth. “Big Companies Thought Insurance Covered a Cyberattack. They May Be Wrong.” In: The New York Times (Apr. 2019).url: https://www.nytimes.com/2019/04/15/technology/cyberinsurance-notpetya-attack.html. [110] Thomas Scavo. [InCommon NOTICE] Re: Getting Ready for eduGAIN [ACTION REQUIRED]. Feb. 2016.url: https://lists.incommon.org/sympa/arc/inc-ops-notifications/2016-02/msg00002.html. [111] Bruce Schneier. “Why Data Mining Won’t Stop Terror”. In: Wired Magazine (Mar. 2005).url: https://schneier.com/essays/archives/2005/03/why_data_mining_wont.html. [112] Steve Sheng, Brad Wardman, Gary Warner, Lorrie Cranor, Jason Hong, and Chengshan Zhang. “An Empirical Analysis of Phishing Blacklists”. In: Proceedings of the Sixth Conference on Email and Anti-Spam. CEAS ’09. Mountain View, CA, July 2009.url: https://kilthub.cmu.edu/articles/An_Empirical_Analysis_of_Phishing_Blacklists/6469805. [113] Shibboleth Consortium. Shibboleth. 2020.url: https://www.shibboleth.net/. [114] Antonio Si, Hong Va Leong, and Rynson W. H. Lau. “CHECK: a document plagiarism detection system”. In: Proceedings of the 1997 ACM symposium on Applied computing. SAC ’97. San Jose, California, United States: ACM, 1997, pp. 70–77.isbn: 0-89791-850-9.doi: 10.1145/331697.335176. [115] University of Southern California/Information Sciences Institute.dnsanon:extractDNStracfrom pcap to text with optionally anonymization. 2016.url: https://ant.isi.edu/software/dnsanon/. 177 [116] University of Southern California/Information Sciences Institute. LANDER Trace Capture Software. 2017.url: https://ant.isi.edu/software/lander/. [117] Neil T. Spring and David Wetherall. “A Protocol-independent Technique for Eliminating Redundant Network Trac”. In: Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. SIGCOMM ’00. Stockholm, Sweden: ACM, 2000, pp. 87–95.isbn: 1-58113-223-9.doi: 10.1145/347059.347408. [118] Benno Stein, Moshe Koppel, and Efstathios Stamatatos. “Plagiarism Analysis, Authorship Identication, and Near-duplicate Detection PAN’07”. In: SIGIR Forum 41.2 (Dec. 2007), pp. 68–71. issn: 0163-5840.doi: 10.1145/1328964.1328976. [119] Michelle P. Steves, Kristen K. Greene, and Mary F. Theofanos. “A Phish Scale: Rating Human Phishing Message Detection Diculty”. In: Proceedings of the 2019 Workshop on Usable Security. San Diego, CA: Internet Society, 2019.isbn: 978-1-891562-57-0.doi: 10.14722/usec.2019.23028. [120] Jessica Su, Ansh Shukla, Sharad Goel, and Arvind Narayanan. “De-Anonymizing Web Browsing Data with Social Networks”. In: Proceedings of the 26th International Conference on World Wide Web. WWW ’17. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 1261–1269.isbn: 9781450349130.doi: 10.1145/3038912.3052714. [121] San-Tsai Sun, Eric Pospisil, Ildar Muslukhov, Nuray Dindar, Kirstie Hawkey, and Konstantin Beznosov. “What Makes Users Refuse Web Single Sign-on? An Empirical Investigation of OpenID”. In: Proceedings of the Seventh Symposium on Usable Privacy and Security. SOUPS ’11. 
Pittsburgh, Pennsylvania: Association for Computing Machinery, 2011.isbn: 9781450309110.doi: 10.1145/2078827.2078833. [122] O. Tange. “GNU Parallel - The Command-Line Power Tool”. In: ;login: The USENIX Magazine 36.1 (Feb. 2011), pp. 42–47.url: http://www.gnu.org/s/parallel. [123] Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. “SpotSigs: robust and ecient near duplicate detection in large web collections”. In: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR ’08. Singapore, Singapore: ACM, 2008, pp. 563–570.isbn: 978-1-60558-164-4.doi: 10.1145/1390334.1390431. [124] Kurt Thomas, Frank Li, Ali Zand, Jacob Barrett, Juri Ranieri, Luca Invernizzi, Yarik Markov, Oxana Comanescu, Vijay Eranti, Angelika Moscicki, Daniel Margolis, Vern Paxson, and Elie Bursztein. “Data Breaches, Phishing, or Malware? Understanding the Risks of Stolen Credentials”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. CCS ’17. Dallas, Texas, USA: Association for Computing Machinery, 2017, pp. 1421–1434.isbn: 9781450349468.doi: 10.1145/3133956.3134067. [125] University of California, Berkeley. Our Berkeley. 2020.url: https://opa.berkeley.edu/campus-data/our-berkeley. [126] University of Southern California. Facts and Figures. 2019.url: https://about.usc.edu/facts/. [127] Vade Secure. isitPhishing - anti phishing tools and information. 2020.url: https://www.isitphishing.org. 178 [128] Camilo Viecco, Alex Tsow, and L. Jean Camp. “A Privacy-Aware Architecture for a Web Rating System”. In: IBM J. Res. Dev. 53.2 (Mar. 2009), pp. 290–305.issn: 0018-8646.doi: 10.1147/JRD.2009.5429049. [129] Wouter Bastiaan de Vries and Roland Martijn van Rijswijk-Deij.DNSQueriestoAuthoritativeDNS Server at SURFnet by Google’s Public DNS Resolver. June 2018.doi: 10.4121/uuid:1ef815ea-cb39-4b41-8db6-c1008af6d5aa. [130] Rui Wang, Shuo Chen, and XiaoFeng Wang. “Signing Me onto Your Accounts through Facebook and Google: A Trac-Guided Security Study of Commercially Deployed Single-Sign-On Web Services”. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy. SP ’12. USA: IEEE Computer Society, May 2012, pp. 365–379.isbn: 9780769546810.doi: 10.1109/SP.2012.30. [131] Dan Wendlandt, David G. Andersen, and Adrian Perrig. “Perspectives: Improving SSH-Style Host Authentication with Multi-Path Probing”. In: USENIX 2008 Annual Technical Conference. ATC ’08. Boston, Massachusetts: USENIX Association, 2008, pp. 321–334.url: https://www.usenix.org/legacy/event/usenix08/tech/full_papers/wendlandt/wendlandt.pdf. [132] Wikipedia. Reusing Wikipedia Content. Oct. 2017.url: https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content. [133] Richard Winton. “Hollywood hospital pays $17,000 in bitcoin to hackers; FBI investigating”. In: The Los Angeles Times (Feb. 2016).url: https://www.latimes.com/business/technology/la-me-ln- hollywood-hospital-bitcoin-20160217-story.html. [134] Min Wu, Robert C. Miller, and Simson L. Garnkel. “Do Security Toolbars Actually Prevent Phishing Attacks?” In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’06. Montréal, Québec, Canada: ACM, 2006, pp. 601–610.isbn: 1-59593-372-7.doi: 10.1145/1124772.1124863. [135] Hui Yang and Jamie Callan. “Near-duplicate detection by instance-level constrained clustering”. In: Proceedings ofthe 29th annualinternational ACM SIGIR conference on Research and development in information retrieval. SIGIR ’06. 
Seattle, Washington, USA: ACM, 2006, pp. 421–428.isbn: 1-59593-369-7.doi: 10.1145/1148170.1148243. [136] Lawrence You, Kristal Pollack, and Darrell D. E. Long. “Deep Store: An Archival Storage System Architecture”. In: Proceedings of the 21st International Conference on Data Engineering (ICDE ’05). Apr. 2005.doi: 10.1109/ICDE.2005.47. [137] Chuan Yue. “The Devil Is Phishing: Rethinking Web Single Sign-On Systems Security”. In: Proceedings of the 6th USENIX Conference on Large-Scale Exploits and Emergent Threats. LEET ’13. Washington, D.C.: USENIX Association, Aug. 2013.url: https://www.usenix.org/conference/leet13/workshop-program/presentation/yue. [138] Patricia Zengerle and Megan Cassella. “Millions more Americans hit by government personnel data hack”. In: Reuters (July 2015).url: https://www.reuters.com/article/us-cybersecurity- usa/millions-more-americans-hit-by-government-personnel-data-hack-idUSKCN0PJ2M420150709. 179 [139] Han Zhang, Manaf Gharaibeh, Spiros Thanasoulas, and Christos Papadopoulos. “BotDigger: Detecting DGA Bots in a Single Network”. In: Proceedings of the International Workshop on Trac Monitoring and Analysis. TMA ’16. Louvain La Neuve, Belgium: IFIP, 2016.url: https://tma.ifip.org/2016/papers/tma2016-final56.pdf. [140] Jian Zhang, Phillip Porras, and Johannes Ullrich. “Highly Predictive Blacklisting”. In: Proceedings of the 17th USENIX Security Symposium. SS ’08. San Jose, CA: USENIX Association, 2008, pp. 107–122.url: https://www.usenix.org/conference/17th-usenix-security-symposium/highly- predictive-blacklisting. [141] Qi Zhang, Yue Zhang, Haomin Yu, and Xuanjing Huang. “Ecient Partial-duplicate Detection Based on Sequence Matching”. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’10. Geneva, Switzerland: ACM, 2010, pp. 675–682.isbn: 978-1-4503-0153-4.doi: 10.1145/1835449.1835562. [142] Weifeng Zhang, Hua Lu, Baowen Xu, and Hongji Yang. “Web phishing detection based on page spatial layout similarity”. In: Informatica 37.3 (2013), pp. 231–244.url: https://www.informatica.si/index.php/informatica/article/view/452/463. [143] Yue Zhang, Serge Egelman, Lorrie Cranor, and Jason Hong. “Phinding Phish: Evaluating Anti-Phishing Tools”. In: Proceedings of the 14th Annual Network and Distributed System Security Symposium. NDSS ’07. San Diego, California, USA: Internet Society, 2007.url: https://www.ndss-symposium.org/ndss2007/phinding-phish-evaluation-anti-phishing-toolbars/. [144] Yue Zhang, Jason I. Hong, and Lorrie F. Cranor. “Cantina: A Content-based Approach to Detecting Phishing Web Sites”. In: Proceedings of the 16th International Conference on World Wide Web. WWW ’07. Ban, Alberta, Canada: ACM, 2007, pp. 639–648.isbn: 978-1-59593-654-7.doi: 10.1145/1242572.1242659. [145] Rui Zhao, Samantha John, Stacy Karas, Cara Bussell, Jennifer Roberts, Daniel Six, Brandon Gavett, and Chuan Yue. “The Highly Insidious Extreme Phishing Attacks”. In: Proceedings of the 2016 25th International Conference on Computer Communication and Networks. ICCCN ’16. Waikoloa, HI, USA: IEEE, Aug. 2016, pp. 1–10.doi: 10.1109/ICCCN.2016.7568582. 180
Abstract
As our world continues to become more interconnected through the Internet, cybersecurity incidents are correspondingly increasing in number, severity, and complexity. The consequences of these attacks include data loss and financial damages, and they are steadily moving from the digital to the physical world, impacting everything from public infrastructure to our own homes. The existing mechanisms for responding to cybersecurity incidents have three problems: they promote a security monoculture, they are too centralized, and they are too slow.

In this thesis, we show that improving one's network security strongly benefits from a combination of personalized, local detection coupled with the controlled exchange of previously-private network information with collaborators. We address the problem of a security monoculture with personalized detection, introducing diversity by tailoring detection to, for example, an individual's browsing behavior. We address over-centralization by localizing detection, emphasizing techniques that run on the client device or local network without relying on external services. We counter slow mechanisms by coupling the controlled sharing of information with collaborators to reactive techniques, enabling a more efficient response to security events.

We demonstrate this thesis with four studies and their respective research contributions in malicious activity detection and cybersecurity data sharing. In our first study, we develop Content Reuse Detection, an approach to locally discover and detect duplication in large corpora, and we apply it to improve network security by detecting “bad neighborhoods” of suspicious activity on the web. Our second study is AuntieTuna, an anti-phishing browser tool that implements personalized, local detection of phish and improves network security by reducing successful web phishing attacks. In our third study, we develop Retro-Future, a framework for controlled information exchange that enables organizations to manage the risk-benefit trade-off when sharing their previously-private data. Organizations use Retro-Future to share data within and across collaborating organizations, improving their network security by using the shared data to make detection more effective at finding malicious activity. Finally, in our fourth study we present AuntieTuna-Schooling, which extends AuntieTuna's proactive detection of phishing sites with data sharing between friends: users exchange previously-private information with collaborators to collectively build a defense, improving both their own network security and their group's collective immunity against phishing attacks.
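The sketch below illustrates the chunk-hashing idea that underlies both Content Reuse Detection and AuntieTuna as described above: hash paragraph-sized pieces of known-good content, then flag pages that reuse many of those pieces. It is a minimal sketch under stated assumptions, not the dissertation's implementation; the paragraph-based chunking rule, the SHA-256 hash, the 0.5 threshold, and the TRUSTED_PAGE sample are all illustrative choices.

```python
import hashlib

def chunk_hashes(text: str, min_len: int = 40) -> set[str]:
    """Split text into paragraph-sized chunks and hash each one.

    SHA-256 stands in for whatever cryptographic hash a real
    deployment would choose; short chunks are skipped so that
    boilerplate fragments do not dominate the comparison.
    """
    return {
        hashlib.sha256(chunk.strip().encode("utf-8")).hexdigest()
        for chunk in text.split("\n\n")
        if len(chunk.strip()) >= min_len
    }

# Personalized whitelist: hashes of a page the user knows and trusts
# (an illustrative stand-in, not real site content).
TRUSTED_PAGE = """Welcome to Example Bank. Please sign in with your
username and password to access your accounts and statements.

Example Bank will never ask for your password over email or phone."""
trusted = chunk_hashes(TRUSTED_PAGE)

def looks_like_phish(candidate: str, threshold: float = 0.5) -> bool:
    """Flag a page that reuses a large fraction of trusted content:
    a phish typically clones the original page it impersonates."""
    chunks = chunk_hashes(candidate)
    if not chunks:
        return False
    return len(chunks & trusted) / len(chunks) >= threshold
```

A deployed tool would additionally check the page's origin so that the genuine site is never flagged as a phish of itself; the sketch captures only the content-reuse signal, not that whitelisting step.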
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Design of cost-efficient multi-sensor collaboration in wireless sensor networks
Detecting and characterizing network devices using signatures of traffic about end-points
A protocol framework for attacker traceback in wireless multi-hop networks
Balancing security and performance of network request-response protocols
Leveraging programmability and machine learning for distributed network management to improve security and performance
Collaborative detection and filtering of DDoS attacks in ISP core networks
Relative positioning, network formation, and routing in robotic wireless networks
Enabling efficient service enumeration through smart selection of measurements
Multichannel data collection for throughput maximization in wireless sensor networks
Congestion control in multi-hop wireless networks
Global analysis and modeling on decentralized Internet
Intelligent near-optimal resource allocation and sharing for self-reconfigurable robotic and other networks
Improving network reliability using a formal definition of the Internet core
Robust routing and energy management in wireless sensor networks
Benchmarking interactive social networking actions
On practical network optimization: convergence, finite buffers, and load balancing
Efficient pipelines for vision-based context sensing
Scaling-out traffic management in the cloud
Modeling, searching, and explaining abnormal instances in multi-relational networks
Backpressure delay enhancement for encounter-based mobile networks while sustaining throughput optimality
Asset Metadata
Creator
Ardi, Calvin Satiawan (author)
Core Title
Improving network security through collaborative sharing
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
07/14/2020
Defense Date
06/11/2020
Publisher
University of Southern California (original), University of Southern California Libraries (digital)
Tag
anomaly detection, anti-phishing, botnet detection, content reuse detection, cybersecurity data sharing, DNS backscatter, duplicate detection, network measurement, network security, OAI-PMH Harvest, phishing, single sign-on, spam detection
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Heidemann, John (committee chair), Govindan, Ramesh (committee member), Krishnamachari, Bhaskar (committee member)
Creator Email
cardi@ieee.org, cardi@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-329570
Unique identifier
UC11665819
Identifier
etd-ArdiCalvin-8672.pdf (filename), usctheses-c89-329570 (legacy record id)
Legacy Identifier
etd-ArdiCalvin-8672.pdf
Dmrecord
329570
Document Type
Dissertation
Rights
Ardi, Calvin Satiawan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA