EFFICIENT PROCESSING OF STREAMING DATA IN MULTI-USER AND MULTI-ABSTRACTION WORKFLOWS by Abdul Qadeer A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) May 2021 Copyright 2021 Abdul Qadeer Dedication To my parents, wife, children (Farmeen and Haseeb), siblings, and extended family members whose constant support, love, and sacrifices made it pos- sible for me to take on this extraordinary research endeavor, that we call a Ph.D. ii Acknowledgments I am completing my Ph.D. with mixed emotions of humbleness and gratitude. Hum- bleness because I became more aware that my knowledge is minuscule as compared to what is out there. Gratitude because the lord of the knowledge, God Almighty, let me see His ocean of knowledge and collect some pebbles at the shore. Ph.D. is an apprenticeship, and the role of the mentor is critical. I thank my advisor Prof. John S. Heidemann, for patiently and diligently teaching me the traits of research; how to pick a good problem and to refine it; how and when to pay attention to the big picture and the minute details; and the importance of being clear in all of our research communications. Over the years, he dedicated many hours for our research, mentoring, and striving to make us our better selves. I would also thank Prof. Aiichiro Nakano, Prof. Shahram Ghandeharizadeh, Prof. Jelena Mirkovic, Prof. Barath Raghavan, and Prof. Murali Annavaram for taking the time to serve on my qualifying examination and thesis defense committees, and providing valu- able feedback on my research. Many thanks to Lizsl De Leon in the computer science department for her heroic eorts to facilitate the students at all times. I also want to thank my collaborators, colleagues and friends at the ANT Lab, ISI and USC for the valuable inputs on my work, sharing our research ideas, and to sup- port research: Prof. Kensuke Fukuda, Yuri Pradkin, Xun Fan, Liang Zhu, Calvin Ardi, iii Hang Guo, Lan Wei, Aqib Nisar, A.S.M. Rizvi, Guillermo Baltra, Basileal Imana, Asma Enayet, Wes Hardaker, Robert Story, B-Root DNS team, Joe Kemp, Alba Ragalado, Jea- nine Yamazaki, Matt Binkley and many others. I would also like to thank Melissa Snearl-Smith, Sharon Uyeda, Tracy Charles, Jen- nifer Gerson, Andy Chen, Sohe Claya, and other support sta at USC and ISI, specifi- cally: ISI Business and HR, GAP, Viterbi, Computer Science, and OIS oces. Thank you for making our stay at USC pleasant and taking care of ever increasing paperwork. I would also take this opportunity to thank my housemates over the years at the ”1071” house, to make it a home, away from home, and the delicious cuisines. The completion of this dissertation would not have been possible without the end- less support of my family. Many members of my extended family took care of many important tasks on my behalf while I was away for my studies. I am indebted to their selfless acts of love and support. Abdul Qadeer University of Southern California March 2021 iv Table of Contents Dedication ii Acknowledgments iii List of Tables ix List of Figures xi Abstract xiii 1 Introduction 1 1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Demonstrating Thesis Statement and Contributions . . . . . . . . . . . 6 1.2.1 Demonstration of Thesis Statement . . . . . . . . . . . . . . . 6 1.2.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 
9 2 Ecient Processing of Streaming Data in Multi-User Workflows 14 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.1 Definitions, Goals, Assumptions, and Scope of Work . . . . . . 20 2.2.2 Design Requirements and Case Study . . . . . . . . . . . . . . 22 2.2.3 Plumb Overview . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2.4 De-duplicating Processing and Data via the Pipeline Graph . . . 27 2.2.5 Data Storage De-duplication . . . . . . . . . . . . . . . . . . . 28 2.2.6 Detecting I/O-Bound Stages . . . . . . . . . . . . . . . . . . . 29 2.2.7 Mitigating Structural Skew . . . . . . . . . . . . . . . . . . . . 30 2.2.8 Mitigating Computational Skew . . . . . . . . . . . . . . . . . 33 2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.1 Benefits of De-duplication . . . . . . . . . . . . . . . . . . . . 34 2.3.2 I/O Costs and Merging . . . . . . . . . . . . . . . . . . . . . . 39 2.3.3 Pipeline Disaggregation Addresses Structural Skew . . . . . . . 42 2.3.4 Disaggregation Addresses of Computational Skew . . . . . . . 44 v 2.3.5 Comparing Design Alternatives for Stages per Container . . . . 46 2.3.6 Improving Throughput:Overall Evaluation on Real Inputs . . . 48 2.3.7 Improving End-to-End Latency . . . . . . . . . . . . . . . . . 50 2.3.8 Ease of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.4 Deployment Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.5 Onward to Multiple Abstractions . . . . . . . . . . . . . . . . . . . . . 52 2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3 Ecient Processing of Streaming Data in Multi-Abstraction Workflows 54 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2.1 Case Study: A DNS Workflow . . . . . . . . . . . . . . . . . . 59 3.2.2 Plumb Goals and Requirements . . . . . . . . . . . . . . . . . 60 3.2.3 Large-Block Streaming Abstraction . . . . . . . . . . . . . . . 61 3.2.4 Windowed-Streaming Abstraction . . . . . . . . . . . . . . . . 64 3.2.5 The Stateful-Streaming Abstraction . . . . . . . . . . . . . . . 67 3.2.6 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.2.6.1 Compute Failures . . . . . . . . . . . . . . . . . . . 69 3.2.6.2 Data Failures . . . . . . . . . . . . . . . . . . . . . . 69 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.3.1 Block-Streaming Enables Low Latency Processing . . . . . . . 72 3.3.2 Windowed-Streaming Enables Ecient Reduction . . . . . . . 75 3.3.2.1 Windowed-Streaming is Easy-to-use . . . . . . . . . 76 3.3.2.2 Windowed-Streaming Improves the Correctness-Latency Tradeo . . . . . . . . . . . . . . . . . . . . . . . . 77 3.3.2.3 Windowed-Streaming Enables Low Latency Processing 79 3.3.3 Stateful-Streaming Enables Continuous Applications . . . . . . 82 3.3.3.1 Deployment . . . . . . . . . . . . . . . . . . . . . . 86 3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4 Beyond DNS: The Generality of Plumb Optimizations 88 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.2 Application Catalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2.1 Early Detection of Malicious Activity . . . . . . . . . . . . . . 92 4.2.2 Flow Analysis . . 
. . . . . . . . . . . . . . . . . . . . . . . . 93 4.2.3 Merging Bidirectional Trac . . . . . . . . . . . . . . . . . . 93 4.2.4 Annual Day-In-The-Life of the Internet (DITL) Collection . . . 93 4.2.5 Shared Datasets Inside a Cloud . . . . . . . . . . . . . . . . . . 94 4.2.6 Plumb Enables Easy Analytics Evolution . . . . . . . . . . . . 94 4.2.7 Plumb Provides Context for Collaboration . . . . . . . . . . . . 94 4.2.8 Plumb Enables Easy Back-fill Processing . . . . . . . . . . . . 95 4.2.9 Data Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 95 vi 4.2.10 Internet Outage Detection . . . . . . . . . . . . . . . . . . . . 96 4.2.11 Detection of Aggressive DNS Resolver . . . . . . . . . . . . . 96 4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5 Related Work 98 5.1 Multi-user Optimizations in a Workflow . . . . . . . . . . . . . . . . . 98 5.1.1 Deduplication of Unstructured Data and Black-box Code . . . . 98 5.1.1.1 Duplicate Detection . . . . . . . . . . . . . . . . . . 98 5.1.1.2 Duplicate Detection Across Multiple Users . . . . . . 99 5.1.1.3 Data Deduplication . . . . . . . . . . . . . . . . . . 100 5.1.1.4 Processing Deduplication . . . . . . . . . . . . . . . 101 5.1.2 Skew Management . . . . . . . . . . . . . . . . . . . . . . . . 101 5.1.2.1 Straggler Mitigation . . . . . . . . . . . . . . . . . . 101 5.1.2.2 Resource Optimization for High Throughput . . . . . 102 5.1.2.3 Big-data Schedulers . . . . . . . . . . . . . . . . . . 102 5.2 Multiple Abstractions in a Workflow . . . . . . . . . . . . . . . . . . . 102 5.2.1 Single Abstraction in a Framework . . . . . . . . . . . . . . . 102 5.2.1.1 Batch Systems . . . . . . . . . . . . . . . . . . . . . 102 5.2.1.2 Streaming Big-data Systems . . . . . . . . . . . . . . 103 5.2.1.3 Streaming Databases . . . . . . . . . . . . . . . . . 104 5.2.2 Analytics Coverage . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2.2.1 Facebook Systems . . . . . . . . . . . . . . . . . . . 105 5.2.2.2 Google System . . . . . . . . . . . . . . . . . . . . . 106 5.2.2.3 Nexus . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2.2.4 Spark Streaming . . . . . . . . . . . . . . . . . . . . 106 5.2.2.5 Hadoop Based Systems . . . . . . . . . . . . . . . . 107 5.2.2.6 Extending Relational Databases . . . . . . . . . . . . 107 5.2.2.7 Industrial Integrator . . . . . . . . . . . . . . . . . . 107 5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6 Future Work and Conclusions 109 6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.1.1 Debugging in Multi-User Workflows . . . . . . . . . . . . . . . 109 6.1.2 Eect of Heterogeneity on I/O Intensity and Stage Merging . . 110 6.1.3 Workflow Consistency and Job Churn . . . . . . . . . . . . . . 111 6.1.4 Better Visual Interfaces . . . . . . . . . . . . . . . . . . . . . . 111 6.1.5 Malicious Actors and Sustained Malicious Attacks . . . . . . . 112 6.1.6 Resource Rate Limiting . . . . . . . . . . . . . . . . . . . . . 113 6.1.7 Garbage Collection and Consistency Checks . . . . . . . . . . 113 6.1.8 Explicit Compute Requirements, Fairness and Priorities . . . . 114 6.1.9 Performance Isolation . . . . . . . . . . . . . . . . . . . . . . 115 6.1.10 Private Name-Spaces . . . . . . . . . . . . . . . . . . . . . . . 116 vii 6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Bibliography 119 Appendices 129 A Plumb APIs 130 A.1 Queue API . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 130 A.2 Block Naming Conventions and API . . . . . . . . . . . . . . . . . . . 137 A.3 User Job Submission API . . . . . . . . . . . . . . . . . . . . . . . . . 138 A.4 Block-Streaming API . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.5 Windowed-Streaming API . . . . . . . . . . . . . . . . . . . . . . . . 140 A.6 Stateful-Streaming API . . . . . . . . . . . . . . . . . . . . . . . . . . 140 viii List of Tables 1.1 Demonstrating the thesis statement. . . . . . . . . . . . . . . . . . . . 7 1.2 Our contributions, their demonstration, and the benefits. . . . . . . . . . 9 2.1 Relative costs of IO, processing and IO-intensity relating them. . . . . . 30 2.2 Comparison of stages to containers mapping configurations. . . . . . . 31 2.3 Cluster capabilities for experiments. . . . . . . . . . . . . . . . . . . . 35 2.4 Plumb latency: One day DNS data using 100 cores. . . . . . . . . . . . 49 3.1 Comparison of three abstractions. . . . . . . . . . . . . . . . . . . . . 61 3.2 Properties of a window. . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.3 Block-Streaming generated data comparison: Plumb vs pre-Plumb. . . . 74 3.4 Number of lines-of-code: Plumb vs pre-Plumb. . . . . . . . . . . . . . 77 3.5 Data stalls and missed events: Plumb vs pre-Plumb. . . . . . . . . . . . 78 3.6 RSSAC latency comparison: Plumb vs pre-Plumb. . . . . . . . . . . . . 81 4.1 Example application coverage of Plumb. . . . . . . . . . . . . . . . . . 92 A.1 Possible block states inside Plumb. . . . . . . . . . . . . . . . . . . . . 131 A.2 Queue API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.3 Block naming convention. . . . . . . . . . . . . . . . . . . . . . . . . 138 A.4 API to extract parts of block names. . . . . . . . . . . . . . . . . . . . 139 ix A.5 Job submission API. . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.6 Plumb internal API to manage windowing. . . . . . . . . . . . . . . . . 141 A.7 Interfacing with Plumb for Windowed-Streaming based applications. . . 142 A.8 Part of the PULL-API for the Stateful-Streaming based applications. . . 143 x List of Figures 2.1 Our case study: The DNS processing pipeline. . . . . . . . . . . . . . . 24 2.2 A Pipeline Graph for a portion of the DNS pipeline. . . . . . . . . . . . 24 2.3 A user submits his pipeline into Plumb. . . . . . . . . . . . . . . . . . 25 2.4 Pipeline-graph: Before and after the optimizations. . . . . . . . . . . . 28 2.5 Comparison of un-optimized VS de-duplicated pipelines. . . . . . . . . 36 2.6 Storage use over the lifespan of dierent pipelines. . . . . . . . . . . . 38 2.7 Computation comparison for dierent pipelines. . . . . . . . . . . . . . 39 2.8 Understanding role of I/O merging. . . . . . . . . . . . . . . . . . . . . 41 2.9 Understanding benefits of de-aggregation. . . . . . . . . . . . . . . . . 43 2.10 Understanding benefits of dynamic scheduling. . . . . . . . . . . . . . 45 2.11 Understanding limits of static allocation. . . . . . . . . . . . . . . . . . 45 2.12 Dynamic scheduling at 0%, 50% and 100% skew. . . . . . . . . . . . . 46 2.13 Understanding stages to containers mapping. . . . . . . . . . . . . . . 47 2.14 Latency: Plumb VS batch-based processing. . . . . . . . . . . . . . . . 50 3.1 The expanded DNS processing pipeline. . . . . . . . . . . . . . . . . . 59 3.2 System architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3 Three abstractions in use in the DNS workflow. . . . . . . . . . . . . . 
63 3.4 Message question per-block latency: Plumb vs pre-Plumb. . . . . . . . 73 3.5 Pcap.xz per-block latency: Plumb vs pre-Plumb. . . . . . . . . . . . . . 74 xi 3.6 RSSAC per-block latency: Plumb vs pre-Plumb. . . . . . . . . . . . . . 75 3.7 Latency comparison at LAX site. . . . . . . . . . . . . . . . . . . . . . 80 3.8 Latency comparison at newer AMS site. . . . . . . . . . . . . . . . . . 80 3.9 Latency comparison at MIA site. . . . . . . . . . . . . . . . . . . . . . 80 3.10 Latency comparison at newer SIN site. . . . . . . . . . . . . . . . . . . 80 3.11 Latency comparison at IAD site. . . . . . . . . . . . . . . . . . . . . . 81 3.12 Latency comparison at ARI site. . . . . . . . . . . . . . . . . . . . . . 81 3.13 Comparison of LAX inter-arrival time VS inter-departure time. . . . . . 83 3.14 Zeek: 4-way connection hashing. . . . . . . . . . . . . . . . . . . . . . 84 3.15 Pre-Plumb cluster use. . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.16 Plumb cluster use. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.1 Design dimensions in analytics. . . . . . . . . . . . . . . . . . . . . . 90 xii Abstract Ever-increasing data and evolving processing needs force enterprises to scale-out expen- sive computational resources to prioritize processing for timely results. Teams pro- cess their organization’s data either independently or using ad hoc sharing mechanisms. Often dierent users start with the same data and the same initial stages (decrypt, decom- press, clean, anonymize). As their workflows evolve, later stages often diverge, and dierent stages may work best with dierent abstractions. The result is workflows with some overlap, some variations, and multiple transitions where data handling changes between continuous, windowed, and per-block. The system processing this diverse, multi-user, multi-abstraction workflow should be ecient and safe, but also must cope with fault recovery. Analytics from multiple users can cause redundant processing and data, or encounter performance anomalies due to skew. Skew arises due to static or dynamic imbalance in the workflow stages. Both redundancy and skew waste compute resources and add latency to results. When users bridge between multiple abstractions, such as from per- block processing to windowed processing, they often employ custom code. These tran- sitions can be error prone due to corner cases, can easily add latency as an ineciency, and custom code is often a source of errors and maintenance diculty. We need new solutions to manage the above challenges and to expose opportunities for data shar- ing explicitly. Our thesis is: new methods enable ecient processing of multi-user xiii and multi-abstraction workflows of streaming data. We present two new methods for ecient stream processing—optimizations for multi-user workflows, and multiple abstractions for application coverage and ecient bridging. Our first method is a new approach to address challenges from duplication, skew, and ad hoc sharing in a workflow. These algorithms use a pipeline-graph to detect duplication of code and data across multiple users and cleanly delineate workflow stages for skew management. The pipeline-graph is our job description language that allows developers to specify their need easily and enables our system to automatically detect duplication and manage skew. The pipeline-graph acts as a shared canvas for collabo- ration amongst users to extend each other’s work. 
To efficiently implement our deduplication and skew management algorithms, we present streaming data to processing stages as fixed-sized but large blocks. Large blocks have low metadata overhead per user, provide good parallelism, and help with fault recovery. Our second method enables applications to use a different abstraction on a different workflow stage. We provide three key abstractions and show that they cover many classes of analytics and that our framework can bridge them efficiently. We provide the Block-Streaming, Windowed-Streaming, and Stateful-Streaming abstractions. Block-Streaming is suitable for single-pass applications that care about temporal or spatial locality. Windowed-Streaming allows applications to process accumulated data (time-aligned blocks to sync with external information) and reductions like summation, averages, or other MapReduce-style analytics. Stateful-Streaming supports applications that require long-term state. We believe our three abstractions cover many classes of analytics and enable processing of one block, many blocks, or an infinite stream. Plumb allows multiple abstractions in different parts of the workflow and provides efficient bridging between them so that users can build complex analytics from individual stages without worrying about data movement.

Our methods aim for good throughput, low latency, and clean, easy-to-use support for more applications, achieving better efficiency than our prior hand-tuned but often brittle system. The Plumb framework is the implementation of our solutions and a testbed to validate them. We use real-world workloads from the B-Root DNS domain to demonstrate the effectiveness of our solutions. Our processing deduplication increases throughput up to 6× and reduces storage by 75% as compared to the pre-Plumb counterparts. Plumb reduces CPU wastage due to structural skew by up to half and reduces latency due to computational skew by 50%. Plumb has cut per-block latency by 74% and the latency of daily statistics by 97%, while reducing code size by 58% and lowering manual intervention to handle problems by 73% as compared to the pre-Plumb system. The operational use of Plumb for the B-Root service provides a multi-year validation of our design choices under many traffic conditions. Over the last three years, Plumb has processed more than 12 PB of DNS packet data and daily statistics. We show that our abstractions apply to many applications in the domain of networking big data and beyond.

Chapter 1
Introduction

Complexity is on the rise for analytics workflows in enterprises due to ever-increasing data and evolving processing needs. To manage complexity, developers divide workflows into many stages, and teams take responsibility for different parts of the workflow. To meet application needs, developers may use diverse abstractions on different workflow stages. These multi-user and multi-abstraction workflows have open challenges of duplication, performance anomalies, and brittle code. This thesis presents solutions to these challenges. We show that multi-user and multi-abstraction workflows can be made efficient and cost-effective.

The context of our work is streaming data. Examples of streaming analytics include failure rate in the Domain Name System (DNS), application crash reports in distributed logs, a reverse web-link graph from continuous worldwide web crawling, and click-through rate on a website.
Figure 2.1 (with 7 stages, 3 contributors, and 1 abstraction) and Figure 3.1 (with 11 stages, 5 contributors, and 3 abstractions) show multi-user and multi-abstraction analytics from our networking domain of B-Root DNS. The above examples of multi-user and multi-abstraction workflows exhibit challenges of duplication, skew, and brittle code. We now study these challenges in detail.

The first challenge in multi-user workflows is to identify duplication of computation and storage that can occur when different groups share components of a pipeline. There are many reasons for duplication amongst individual users and teams. First, at times different users do not see each other's workflows because users like to process data independently and only share exciting results. Second, often there are no standardized methods of workflow expression and sharing of results across multiple users; some users might prefer to generate their own results (causing duplication), while others might rely on ad hoc sharing. Third, workflows evolve, and duplication can occur across business units when one unit starts treating another's work as a black box; a primary reason for such black boxes is to avoid the complexity of unit-specific solutions that grows over time due to custom or legacy solutions. Fourth, duplication also happens because different teams run their analytics in a decoupled manner to mitigate the risk of cascading failures propagating to other teams. Fifth, duplication happens because users usually start with the same high-value data in an organization and solve related problems. While the challenge of duplication also occurs in structured data and for programs with known semantics [JQP+18], it remains unsolved for workflows that use arbitrary programs to process unstructured data (the focus of this work).

The second challenge in multi-user workflows is processing skew. Data, computational, and structural skew are all different cases where analytics hold cluster resources for too long (compared to averages), where resource requirements balloon, or where a resource is held but not fully utilized. Data skew is a known problem, where many data items fall into one processing bin, slowing the overall workflow [CF15]. We define two additional types of skew. Computational skew occurs when a data bin takes extra long to process, not necessarily because there is more data, but because the data interacts with the processing algorithms to take extra time. Structural skew occurs when one stage of the processing pipeline is noticeably slower than other stages. Applications that require large-block data preclude the use of adaptive sharding schemes to prevent skew. Structural skew wastes compute resources, and computational skew increases latency. We address the problems of computational and structural skew in the new context of multi-user workflows.
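To make the two kinds of skew concrete, the short sketch below contrasts them with hypothetical per-block service times. The stage names echo the DNS pipeline, but every number is made up for illustration; none are measurements from Plumb.

```python
# Hypothetical per-block service times (seconds) for three pipeline stages.
STAGE_TIMES = {"decompress": 30.0, "tcp_reassembly": 300.0, "anonymize": 60.0}

def lockstep_utilization(stage_times):
    """If all stages advance in lock step, each stage is busy only for its own
    service time out of the slowest stage's time; structural skew wastes the rest."""
    slowest = max(stage_times.values())
    return {name: t / slowest for name, t in stage_times.items()}

def skew_factor(typical_seconds, observed_seconds):
    """Computational skew: the same stage takes longer on some blocks because the
    data interacts badly with the algorithm (for example, an attack-heavy block)."""
    return observed_seconds / typical_seconds

if __name__ == "__main__":
    for stage, busy in lockstep_utilization(STAGE_TIMES).items():
        print(f"{stage:16s} busy {busy:6.1%} of the time under lock-step execution")
    print("computational skew on an attack-heavy block:", skew_factor(300.0, 900.0))
```

Under lock-step execution the fast stages sit idle most of the time; that idle capacity is what decoupled, per-stage scheduling later recovers.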
The third challenge is that multi-abstraction workflows can be brittle. Two primary reasons for such brittleness are data grouping and bridging between toolsets (the implementations of an abstraction). Developers often rely on ad hoc solutions for data grouping (windowing) without support from the framework, and those solutions are not general enough to be readily useful for other users (who might need a window). In the absence of a systematic sharing mechanism, sharing such code is not easy. Developers' ad hoc solutions often do not address all challenges, such as the correctness of heuristics over time, data completeness, timely scheduling and processing of data, and parallelism. These gaps result not only in wasted person-hours but also in correctness and performance issues.

In summary, this thesis tackles challenges in multi-user and multi-abstraction workflows. Multi-user workflows can suffer from duplication and skew anomalies. Multi-abstraction workflows can have correctness and efficiency problems due to ad hoc code. Our goal is to make collaborative stream processing efficient, cost-effective, and applicable to many analytics classes.

1.1 Thesis Statement

This thesis asserts that new methods enable efficient processing of multi-user and multi-abstraction workflows of streaming data. We now explain each concept in our thesis statement.

This thesis describes two new methods to solve the challenges described above. The first method solves duplication and skew challenges in multi-user workflows. The second method provides three key abstractions and efficient bridging between them to tackle ad hoc and brittle code.

Our first new method addresses the challenges of duplication and skew in multi-user workflows. We use a pipeline-graph to enable multiple users to express their workflows using a text-based description. Our system integrates the pipeline-graphs of multiple users after addressing the challenges of duplication and processing skew. First, this thesis shows that data de-duplication of unstructured data is possible with low metadata overhead. Second, it shows that processing deduplication of arbitrary code is possible.

Our second new method identifies three key abstractions that cover many application classes, and provides efficient bridging between them. These processing abstractions are Block-Streaming, Windowed-Streaming, and Stateful-Streaming. The Block-Streaming abstraction is suitable for workflows that need some context, as in temporal or spatial locality, and block-level parallelism. The Windowed-Streaming abstraction enables developers to operate on some group of data; grouping can be based on time or on a number of blocks, and this abstraction also enables users to run many reductions on data. The Stateful-Streaming abstraction enables applications to consume an infinite stream where they keep long-term state and need always-on behavior. We show that our abstractions can support many types of applications.

Next, we define the remaining terms in our thesis statement (efficiency, processing, multi-user workflow, multi-abstraction workflow, and streaming).

In our work, efficiency means high system throughput, low end-to-end latency, and ease of use for developers. We strive for higher throughput rather than the absolute lowest latency because of the well-known trade-off between throughput and latency [MYN16, SMPT14, AD14]. By ease of use we mean that developers need little effort to utilize our work; our measure of ease of use is the number of lines of code needed to interface with our system. We strive for simple user-facing solutions so that more applications can use them.

In our thesis statement, processing can be arbitrary user code that runs in a stage of a workflow; it could be a binary, a script, or even a driver program that invokes other frameworks. We chose this definition so that our processing abstractions apply generally to all kinds of stream processing. Other frameworks trade off generality by narrowing their focus to well-known program structures (as in relational algebra), which unlocks many optimizations from the database domain. Many analytics in the networking domain are challenging to represent in an SQL-like language.

We define a multi-user workflow as one where different parts of the pipeline are contributed by different developers (or teams) independently. Our multi-user workflows decouple stages to enable individual users (or teams) to innovate on different workflow parts at their own speed and with fault tolerance, while our system automatically manages the multi-user challenges. Additionally, our workflow definition encourages developers to divide their workflows into multiple stages to conquer complexity. A workflow has one or more stages, and each stage consumes at least one input and generates at least one output. The interpretation of the stream data is the responsibility of developer programs.

A multi-abstraction workflow uses different abstractions on different stages of the pipeline. We make this choice to let developers match their needs to an abstraction while allowing our system to efficiently manage the bridging of the abstractions. Applications can get one block (Block-Streaming), many blocks (Windowed-Streaming), or an infinite stream of blocks (Stateful-Streaming).

In the context of our work, a large block is an accumulation of tens or hundreds of megabytes of data. The size of the block depends on finding a good trade-off between de-duplication metadata overhead and parallelism needs, along with other domain-specific concerns, as we will explain in §2.2.4. Although our system considers data as opaque, developers' applications can interpret data differently (for example, as small records) inside a stage.

We define streaming data as an unbounded flow of data with a time-varying rate. Streaming is an alternative to a batch, where the amount of data is fixed.
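A minimal sketch of these definitions follows; the class names, fields, and stream names are illustrative assumptions, not Plumb's actual job-description API. It models a stage as an input-processing-output tuple whose processing is an opaque executable, and a workflow as a list of such stages.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Stage:
    # One workflow stage: named input streams, an opaque executable, named output streams.
    inputs: frozenset      # e.g., frozenset({"pcap.sz"})
    program: str           # any binary, script, or driver for another framework
    outputs: frozenset     # e.g., frozenset({"pcap"})

@dataclass
class Workflow:
    owner: str
    stages: list = field(default_factory=list)

# Illustrative two-stage workflow: decompress a captured block, then re-compress it
# for archival storage (the stream and program names are hypothetical).
archive = Workflow(owner="ops", stages=[
    Stage(frozenset({"pcap.sz"}), "snzip -d", frozenset({"pcap"})),
    Stage(frozenset({"pcap"}), "xz -9", frozenset({"pcap.xz"})),
])
print([sorted(s.outputs) for s in archive.stages])
```

Because the system sees only named inputs, an executable, and named outputs, any serial tool or driver program can act as a stage.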
1.2 Demonstrating Thesis Statement and Contributions

We first discuss how we prove our thesis statement and then show additional contributions from our work.

1.2.1 Demonstration of Thesis Statement

We believe that our two new methods enable developers to solve many problems in a shared workflow, using multiple abstractions, with better efficiency than the pre-Plumb system (a hand-tuned but often brittle system). The first method enables deduplication of unstructured data and arbitrary code, and manages skew in multi-user workflows. The second method provides three abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming) to express many classes of applications, and manages efficient bridging between them in a single workflow. Table 1.1 summarizes the different aspects of efficiency improvement from our new methods.

Multi-user optimizations: Multi-user optimizations detect and remove duplication and processing skew across all workflows. We use workloads from the B-Root DNS service (§2.2.2) to demonstrate the efficacy of our first method. We show that our deduplication algorithm can detect data and processing duplication across multiple users, and that removing duplication uses fewer compute containers and less storage as compared to the un-optimized counterparts (§2.3.1). By reducing compute resources we can accommodate more work on the same cluster (or reduce dollar cost on a public cloud), increasing throughput and reducing latency. Our first method also detects I/O-bound stages in a workflow and merges them with a CPU-bound stage to reduce pressure on shared storage (§2.3.2). Throughput increases and latency drops because of fewer I/O operations as compared to the un-optimized counterparts.
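Chapter 2 (§2.2.6 and Table 2.1) defines I/O intensity precisely; the sketch below is only a hedged approximation of the idea, with an assumed disk rate and threshold, to show how an I/O-bound stage might be flagged as a candidate for merging with a CPU-bound neighbor.

```python
def io_intensity(bytes_read, bytes_written, cpu_seconds, disk_mb_per_s=100.0):
    """Rough I/O intensity: estimated storage time divided by CPU time.
    The 100 MB/s disk rate is an illustrative assumption, not a calibrated value."""
    io_seconds = (bytes_read + bytes_written) / (disk_mb_per_s * 1e6)
    return io_seconds / cpu_seconds

def is_merge_candidate(bytes_read, bytes_written, cpu_seconds):
    # A stage whose I/O time rivals its compute time mostly moves data; running it
    # fused with a CPU-bound neighbor avoids a round trip through shared storage.
    return io_intensity(bytes_read, bytes_written, cpu_seconds) >= 1.0

# Example: a stage that reads and writes 2 GB but needs only 5 CPU seconds is I/O-bound.
print(is_merge_candidate(2e9, 2e9, 5.0))   # True
```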
Table 1.1: Demonstrating the thesis statement. Two new methods: multi-user optimizations and multiple abstractions. How each new method improves efficiency:

Multi-user Optimizations:
(1) Ease of use for developers due to a text-based job description where a workflow stage contains (input, processing, output) tuples and processing can be an arbitrary executable.
(2) Improves ease of use by allowing easy expression of complex use-cases, allowing multiple abstractions on each stage of the workflow using fewer lines of code as compared to the pre-Plumb system (a hand-tuned but often brittle system).
(3) Improves throughput and reduces latency by detecting and removing duplicate processing and data across multiple users.
(4) Improves throughput by mitigating imbalances (skew) in the multi-user workflow.
(5) Reduces latency by detecting I/O-intensive workflow stages and merging them with CPU-bound stages.

Block-Streaming:
(1) Improves throughput and reduces latency by leveraging temporal and spatial locality in a block and enabling single-pass use cases.
(2) Improves latency by enabling per-block parallel processing.

Windowed-Streaming:
(1) Improves ease of use by offloading correct window completion under data faults (missing, late, out-of-order, duplicate) to the system.
(2) Improves ease of use by allowing developers to tweak window behavior, with a knob to dial the correctness-latency trade-off.
(3) Reduces latency by scheduling a window as soon as it completes, and by running different windows in parallel when possible.

Stateful-Streaming:
(1) Improves ease of use by enabling developers to deploy always-on, state-bearing applications.
(2) Reduces latency by allowing parallelism based on stream hashing.

We demonstrate that multi-user workflows can be disaggregated, with each stage run separately, to address structural skew in the pipeline (§2.3.3). We show that dynamically moving compute resources from one workflow stage to another mitigates computational skew (§2.3.4). Mitigating skew allows us to increase throughput by using compute resources better, and skew management reduces end-to-end latency due to targeted allocation of resources. We demonstrate high throughput (§2.3.6) and low latency (Table 2.4) when all of our multi-user optimizations are in effect. We show that expressing workflows is easy for developers because a few lines suffice (§2.3.8), and our system automatically manages the challenges of duplication and skew.

Multiple abstractions: We demonstrate that the Block-Streaming abstraction provides low latency (§3.3.1) as compared to the pre-Plumb system. We use daily statistics from B-Root DNS to show that Windowed-Streaming empowers developers to dial a trade-off between correctness and latency (§3.3.2.2). We use historical data from the B-Root service (over fourteen months) to demonstrate that Windowed-Streaming consistently provides lower latency (§3.3.2.3). We show that Stateful-Streaming can keep up with real-time traffic rates to detect network attacks (§3.3.3). We demonstrate that our abstractions are easy to use for developers by comparing the number of lines of code in the pre-Plumb and Plumb systems (§3.3.2.1). We use analogies to our DNS workloads to argue that our new methods apply to many classes of analytics (§4.2). We implement our new methods in a new framework, Plumb, which has been serving in operations since 2018, where it uses resources efficiently and enables developers to bring additional workflows into Plumb.
It demonstrates that our new methods withstand stringent environments and meets its design goals. In summary, our focused and long-term evaluations of multi-user optimizations and multiple abstractions demonstrate ecacy of our new methods, and hence prove our thesis. Our work has other contributions as well that we discuss next. 8 Contributions Demonstrating Contribution Benefit of Contribution De-duplication of unstructured data and arbitrary code possible. (1) By turning the data stream into large blocks (2) By pipeline-graph on an application (DNS processing app) (1) Higher throughput (2) Lower latency (3) Intuitive workflow expression (4) Providing a medium for multi-user collaboration Mitigating processing skew possible without sharding. (1) By per-block scheduling and processing of work (2) Dynamic re-allocation of compute (3) Finding IO intensive stages of workflow (1) Higher throughput (2) Lower latency (3) Block level parallelism and fault-tolerance Finding good large-block streaming abstractions possible. (1) By providing three key abstractions: (a) Block-Streaming (b) Windowed-Streaming (c) Stateful-Streaming (2) By using a dierent abstraction on dierent stages of the same workflow (1) Increasing productivity by: (a) Saving person-hours (b) Reduction in ad hoc user code (c) by data completeness (d) by framework managed parallelism (2) Our abstractions cover a large spectrum of real-work applications (3) Bridging and interoperability between applications Proving the thesis statement. By providing two new methods for multi-user and multi-abstraction workflows: (a) Multi-user optimizations (b) Multiple abstractions (1) Many types of real-world processing with good eciency (2) Implementation of our solutions into the Plumb framework that is serving in production for B-Root DNS processing Table 1.2: Our contributions, their demonstration, and the benefits. 1.2.2 Contributions The primary contribution of this work is to prove our thesis. There are additional con- tributions of our work as well. We summarize all of our contributions in Table 1.2. We discussed demonstration of our thesis in §1.2.1, and now we explain the additional contributions. Un-structured data and black-box module de-duplication is possible: We show that processing and data deduplication is possible for multi-user workflows for arbitrary code acting on unstructured stream of data. We transform stream data into large-block streaming (LBS) so that our system can detect and deduplicate processing and data with low meta-data overhead. We use a real-world, DNS processing workflow to show that 9 our system provides higher throughput, lower latency, and an easy-to-use workflow sys- tem as compared to pre-Plumb system. Our initial user study indicates that our workflow expression system provides a shared canvas for multi-user collaboration. Our deduplication algorithm uses a pipeline-graph where developers can easily express their workflows, and our system can detect and remove duplication across all users. Our definition of a pipeline-graph (§2.2) is an extended directed acyclic graph for large block streaming workflows. It is easy-to-use because multi-users can express their real-world workflows using a few lines of YAML based description, and tens of lines of code to interface with our system (§2.3.8). The pipeline-graph helps our system to integrate workflows from multi-users. 
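The pipeline-graph is what lets the system spot this duplication: as §2.2.4 describes, two stages are treated as the same processing whenever their input and output name sets match. A rough illustrative sketch of that idea follows; the stage names, programs, and data structure are hypothetical, not the production pipeline or Plumb's internal representation.

```python
def deduplicate(stages):
    """Merge stages that consume and produce the same named streams.
    Each stage is (user, inputs, program, outputs); matching input/output name sets
    are taken to imply the same processing, per the equivalence rule of Section 2.2.4."""
    merged = {}
    for user, inputs, program, outputs in stages:
        key = (frozenset(inputs), frozenset(outputs))
        merged.setdefault(key, {"program": program, "users": []})["users"].append(user)
    return merged

# Two teams independently decrypt and decompress the same raw capture: one copy runs.
stages = [
    ("monitoring", ["pcap.sz"], "decrypt | snzip -d", ["pcap"]),
    ("security",   ["pcap.sz"], "decrypt | snzip -d", ["pcap"]),
    ("security",   ["pcap"],    "detect_malicious",   ["alerts"]),
]
for (ins, outs), info in deduplicate(stages).items():
    print(sorted(ins), "->", sorted(outs), "shared by", info["users"])
```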
Large block streaming helps with data dedu- plication by dividing the stream data into fixed, and large-sized data blocks such that each block has a uniquely identifiable name and our system maintains usage reference counting on each block with low meta-data overhead (§2.2.5). The pipeline-graph helps with detection of similar processing, when the set of input and output streams of any stage of multi-user workflow is the same (§2.2.4). Our system then de-duplicates data and processing in the integrated multi-user workflow and schedules it for execution. The optimized workflow is ecient, due to deduplication and skew management that results in higher throughput and lower latency when we evaluate it using a real-world application, DNS processing workflow (§2.2.2). Skew management without sharding is possible: We show that mitigating the aects of processing skew is possible when data sharding is not possible (as is the case for LBS workflows). We once again transform stream data into LBS and use per-block scheduling and processing of blocks to mitigate structural skew. We reduce the aects of inter-stage data copy by detecting I/O bound stages, so that they can be merged with a CPU-bound stage. We monitor work backlog for each workflow stage, and dynamically re-allocate computational resources to mitigate computational skew. We evaluate our 10 system using real-world workloads and show that our skew management provides higher throughput, lower end-to-end latency, per-block parallelism and fault-tolerance. The optimized pipeline-graph helps our system to treat each stage of integrated and deduplicated workflow separately, and skew mitigation makes processing ecient. Such disaggregation of the optimized workflow helps to schedule and run the stage programs independent of each other, such that a stage reads and writes its data blocks from stor- age. The independent run of the stages removes structural skew from the optimized workflow (§2.2.7). The disaggregated stage run can be detrimental if reading and writ- ing o storage is faster than the stage processing. We provide I/O intensity detection algorithm for our workflow stages. We then alert the users about the stages that are a candidate for a merging with some CPU-bound processing, and hence reducing eects of I/O-bound stages. We maintain a work queue associated with each stage of the opti- mized workflow. We allocate workers to the stages in proportion to the backlog in each queue. The dynamic reallocation of workers helps in detecting and mitigating compu- tational skew specific to the stage and redirection of resources where needed (§2.2.8). Our evaluation with a real-world workload shows that, by removing structural skew with IO management increases system throughput and appropriate reallocation of com- putational resources improve end-to-end latency for the terminal user data (§2.3.3 and §2.3.4). Collectively these two improvements provide ecient processing of optimized multi-user workflows. Three key abstractions can express many applications: We find three key abstractions—Block-Streaming, Windowed-Streaming, and Stateful-Streaming, that provide a good balance of responsibilities between a developer and the system. Our system enables developers to express their processing need in few lines of code, and automatically manages complex challenges of timely scheduling, data completeness, correctness, and fault tolerance. These three abstractions can process data on the full 11 spectrum of small records to large blocks to infinite streams. 
We evaluate our system using real-world applications with dierent processing needs. We show that user pro- ductivity increases due to reduction in ad hoc code, throughput increases and latency reduces. We believe that our three abstractions can express many workloads. The Block-Streaming abstraction makes it possible to represent and execute each workflow stage independently. We stream one large block of each input type, that a stage request via pipeline-graph and writes one block of each output type to the storage. We schedule stages based on per-block because it provides block-level parallelism. We can change the degree of parallelism by tweaking the size of the block. We get lower latency by increasing parallelism. Per-block execution also helps with fault-tolerance by confining any failures at the block level, and easier retry mechanism. Collectively parallelism and fault-tolerance improve eciency by reducing latency and curtailing throughput reduction due to any failures. We evaluate per-block processing throughput (§2.3.6) and latency (§2.3.7 and §3.3.1) by using a real workload (see §2.2.2). The Windowed-Streaming abstraction is for window-based processing of LBS work- flows (§3.2.4). This abstraction helps analytics that need the accumulation of data before processing could start. This abstraction is easy-to-use for the users by expressing their windowing and parallelism needs by augmenting their pipeline-graph (Figure 3.3). The challenges of time-based data ordering inside a window and data completeness, either in terms of the number of blocks or time, is managed by our system. User applications also need a variety of parallelism needs inside and outside of the windows. Our abstrac- tion manages parallelism needs across the windows, while managing parallelism inside a window is the developer’s responsibility (we plan to provides helpful libraries for major use-cases for inside window parallelism in the future). We evaluate windowing capabilities using workloads from DNS (§3.3.2.1, §3.3.2.2, and §3.3.2.3). 12 The Stateful-Streaming abstraction enables our developers to easily build state- bearing applications on in-order stream. Our system helps developers to easily check- point and recover application state, and parallelization after hashing on stream data. We evaluate continuous streaming capability using an intrusion detection system and work- loads form the DNS (§3.3.3). Chapter 2 explores our multi-user optimizations and how it solve duplication of processing and data, and processing skew. Chapter 3 explains and evaluates our three abstractions on a real-world workflow. Chapter 4 shows that our new methods apply to many classes of streaming analytics. We then compare our eort with the related work in Chapter 5. Finally we conclude our work and also provide many exciting research problems for further study in Chapter 6. Appendix A explains how Plumb interfaces with applications. 13 Chapter 2 Ecient Processing of Streaming Data in Multi-User Workflows We established in Chapter 1 that multi-user and multi-abstraction workflows are com- mon, have many ineciencies, and need new solutions. This chapter focuses on chal- lenges in multi-user workflows and presents our new methods to tackle them. We defer study of multi-abstraction workflows for Chapter 3. This chapter presents three contributions. First, we prove part of our thesis statement that new methods enable ecient processing of multi-user workflows of streaming data. 
Second, we show that processing and data deduplication is possible across multi-user workflows of arbitrary code and unstructured data. Third, we demonstrate that skew mitigation is possible without data sharding for multi-user workflows—hence allowing our system to manage skew without the need to understand the data. These contribu- tions collectively show that we can make multi-user workflows ecient by removing duplicate work and mitigating performance anomalies. 2.1 Introduction As the field of big data analytics matures, workflows are increasingly complex and often include components that are shared by dierent developers and built by dierent teams. With multiple groups contributing to dierent stages of a complex workflow, it is easy to lose track of computation that may be shared across dierent groups. 14 Domain Name System [Moc87] (DNS, providing a mapping from human-readable names to computer-friendly addresses) analytics is an example of a complex workflow with across-team duplicate processing. A typical DNS service is geographically dis- tributed and needs monitoring to estimate service quality [Teaa], while analysis of the same data can detect malicious activity [FHQ17]. With geographically distributed sites and limited local processing capability, data from multiple sites may be back-hauled to a central site for analysis. High trac volumes, bandwidth limitations, and the need to ensure that data is complete dictate that data be collected in large blocks, compressed, and shipped to the processing site. In our example, two dierent teams (service monitor- ing, malicious activity detection) need the same data. Their initial stages are the same, where teams decrypt, uncompress, and clean the data. Later processing diverges into specialized workflow. Considerable eciency can be gained by detecting and remov- ing duplication in the common initial stages of the pipeline. When future needs arise (perhaps an anti-DDoS analysis team), additional duplication of and similar wasted resources will occur if steps are not taken to avoid it. In this chapter we propose a new framework, Plumb, a workflow system for multi- stage pipelines, where parts of computation and data are shared across dierent groups. Plumb focuses on streaming workflows where data is processed in blocks, a middle ground between large-size batch processing and small-record streaming. A particular challenge in this problem domain is structural and computational skew since the com- putation of dierent stages and dierent data blocks can vary by a factor of ten due to dierences in the work or data. Continuing our DNS example, both teams need to consume a stream of DNS data packaged as large-blocks. Many analytics programs for this domain can emit output after consuming a small part of input data. An example is the TCP reassembly. The DNS captured data have TCP flows, where each flow can have multiple DNS records. 15 Processing needs to stitch together all such flows so that it can extract DNS data out of it. Our captured large blocks (each with about 2 GB of DNS data before compression) have many such flows in a single block. DNS over TCP has the property that flows are small in size and nearby in time. Due to these properties, the TCP reassembly program can start emitting output data without consuming all the input. Such streaming operators make it possible to run adjacent pipeline stages concurrently (like Unix pipelines). The output of one stage is going into the input of the next workflow stage. 
Let us say the input to TCP reassembly was from a fast decompression (like snzip), and both stages run concurrently. TCP reassembly is slower than the snzip stage, and snzip waits for slower TCP stage, an example of structural skew in the pipeline. Additionally, it is hard to accommodate such skew statically. Often, under a malicious attack, TCP reassembly stage can take substantially longer than the typical case, hence throwing any static adjustments into the disarray (an example of computational skew). Plumb is designed for large-block, streaming (LBS) workloads. Traditional map- reduce has focused on batch processing, and systems such as Storm [TTS + 14] consider streams of small records. We have identified a class of applications that involve long- term streams of data, but where the processing requires examination of large blocks of data (say, 10 to 1000 megabytes) at a time. These applications need to capture temporal or spatial locality, integration with existing serial tools, and to support fault tolerance and human-guided recovery in the long-running data processing. Applications that require large-block data preclude the use of adaptive sharding schemes to present skew. Plumb exploits this “middle ground” where per-block scheduling is possible. One block is the smallest unit of consumption by an instance of the analytics in the LBS domain. Skew management schemes that rely on sharding data and utilizing many workers (one per shard) are not suitable for LBS. For example, arbitrarily breaking data of a TCP flow makes it harder to stitch it together. One needs a stage where we need to 16 gather everything again based on the key before stitching could be applied. Similarly, schemes that place data in memory suer due to a rapidly changing working set. It implies that either data spills to the disk or we discard it altogether. Skew can make one branch of a workflow slower, and when it needs some data (and it is not there), it will cost penalties in terms of disk reads (or worse full recomputation using data provenance). In both cases, one challenge is that for how long to keep state before moving on. LBS processing has the temporal and spatial context of data; hence its state management is mostly confined within a block and is much simpler. Plumb’s first goal is to identify duplication of computation and storage that can occur when dierent groups share components of a pipeline. When workflows are shared across developers, work done in common stages will be duplicated if each user assumes they begin with raw input, particularly as the workflow evolves throughout develop- ment. Previous work has made strides for processing de-duplication where operator semantics are well defined (for example [SRLS17, CMR + 18]), or where multiple users use the same programming language and run-time [JQP + 18]. Databases and other sys- tems [GRT + 10] sometimes save and share intermediate or final results because workload access patterns are amenable to caching. Finding duplication in arbitrary user-defined programs remains challenging, and LBS workloads are single-pass and hence not a good fit for caching. Our novel equivalence definition exploits LBS constraint where we define that for any two pipeline stages of all users, if the set of inputs and outputs blocks are the same, then the processing is the same. The second problem we address is skew. Data skew is a known problem, where many data items fall into one processing bin, slowing the overall workflow [CF15]. 
We address computational and structural skew under the new context of LBS, where adaptive sharding is not possible. Computational skew occurs when a bin of data takes extra long to process, not necessarily because there is more data, but because the data 17 interacts with the processing algorithms to take extra time. Structural skew occurs when one stage of the processing pipeline is noticeably slower than other stages. We address structural skew in Plumb by scheduling additional processing elements when one data block or one stage falls behind. Plumb decouples processing for each stage of the workflow, buering output when required and supporting independent stage execution. However, to avoid overhead from data buering, Plumb can run stages con- currently when they are well matched. This decoupling also addresses computational skew, since additional computation can be brought to bear when specific data inputs take extra time. Plumb provides a new method to its users to solve the challenges of duplication (of code and data) and skew (structural and computational) in a multi-user, shared workflow. The strength of this method is that it is simple to use for the developers, where they spec- ify their workflow using domain-specific names of inputs and outputs. Developers do not need to remember or find the same programs because Plumb infers code similar- ity from input/output names. Plumb uses naming in a novel way to detect and remove duplication and manage skew automatically. Our new method lets individual devel- opers/teams remain decoupled from each other, yet Plumb tackles sharing and skew aspects. We compare Plumb to pre-Plumb (a hand-tuned system), resulting in one-third the original latency (§2.3.7) and 39% less container hours (§2.3.1). Plumb’s new methods promise to support a much more flexible, multi-user analysis (§2.4) while being robust to DDoS-driven changes in processing needs. The first contribution of this chapter is to prove part of our thesis statement (§1.1)— providing two new methods that enable ecient processing of multi-user workflows of streaming data. To prove part of our thesis statement, we design our first new method for deduplication and skew management. We use the Block-Streaming abstraction to 18 execute our workloads, and show that our algorithms solve multi-user challenges of duplication and skew to make processing ecient. We evaluate eciency (§2.3) of our new methods using real workloads from B-Root DNS service. The second contribution of this chapter is to show that deduplication of unstructured data and black-box modules is possible (§2.2.4,§2.2.5). The third contribution of this chapter is that skew mitigation is possible without sharding (§2.2.7,§2.2.8) hence obliterating the need to understand the structure of the data. In summary, we show that real-world analytics often have substantial similarity across components developed by dierent teams. We identify Large-Block Steaming as a new domain for big data processing, and identify computational and structural skew as a new challenge (in addition to existing data skew) for such workflows. Current solutions are deficient in solving these problems (see Chapter 5 for detailed compar- ison with the current systems). We show that workflow component naming, coupled with new strategies to manage both computational and structural skew can reduce dupli- cation of computation to provide ecient processing. Finally, Large-Block Streaming provides a simple abstraction while addressing these challenges. 
(In this chapter we use the Block-Streaming abstraction as a placeholder for any of our new abstractions that we define in Chapter 3, because our solutions are valid for all of them.)

2.2 System Design

We next describe Plumb's requirements and how it reduces duplicate storage and processing and addresses skew.

2.2.1 Definitions, Goals, Assumptions, and Scope of Work

In this section, we define different terms as they apply to our work. We also list the goals, assumptions, and scope of our work.

Definitions: Plumb supports streaming, large-block data that is opaque to the system. By streaming we mean that data is unbounded—new data continues to arrive at all times. For our work, we assume long-term, soft-real-time processing requirements, but our primary goal is to guarantee processing of 100% of the provided data. Thus we support buffering to handle temporary bursts or maintenance, and we do not target the hard-real-time guarantees of some other stream-processing systems. (Typically we see end-to-end processing delays of minutes; if necessary, backlogs can grow to fill disks.) Plumb is not a batch processing system; it processes an unbounded stream of large-block data. Batch processing systems instead consume a fixed amount of data for a one-time task. Our system is therefore a middle ground between pure-batch systems like map-reduce and fine-grain streaming systems that process individual (small) records.

By large block, we mean a collection of data into the smallest unit of consumption in any processing. The exact size of the block is domain-specific and depends on factors like data collection and relay, and the spatial and temporal efficacy of the operators.

Plumb can work with both structured and unstructured data because we see data and user programs as opaque, and operators are complex. By unstructured data we mean that data does not necessarily conform to a specific schema, as one would see in a relational database or a streaming system following that model. By data opaqueness, we mean that the framework does not understand the structure of the underlying data. By complex code, we mean an arbitrary binary whose semantics are not known. The operator semantics precisely tell what code does to data; while many relational operators have well-known semantics, we assume no knowledge of our user code.

Plumb uses three-way data replication to guard against data loss. Three-way data replication means that we make three discrete copies of the same data for reliability. These copies use in-lined replication, where all three copies are in progress in parallel. When processing with streaming operators and in-lined replication, input reading, processing, and replicated output writing overlap in time. By streaming operators, we mean code that can start emitting output without first consuming its full input. Most of our user code in this work consists of streaming operators.
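A minimal sketch of the in-lined replication idea appears below: as a streaming operator emits output for one block, the bytes are written to three replicas while reading and processing continue, rather than being replicated after the fact. The paths, operator, and chunk size are hypothetical; this is not Plumb's implementation.

```python
import subprocess
from pathlib import Path

REPLICA_DIRS = [Path(p) for p in ("/data/r1", "/data/r2", "/data/r3")]  # hypothetical paths

def run_stage_with_inline_replication(cmd, in_block: Path, out_name: str, chunk=1 << 20):
    # Run a streaming operator on one large block and tee its output to three replicas
    # as it is produced, so reading, processing, and replicated writing overlap in time.
    outs = [open(d / out_name, "wb") for d in REPLICA_DIRS]
    with open(in_block, "rb") as src:
        proc = subprocess.Popen(cmd, stdin=src, stdout=subprocess.PIPE)
        while data := proc.stdout.read(chunk):
            for f in outs:
                f.write(data)          # all three copies advance as processing proceeds
    for f in outs:
        f.close()
    if proc.wait() != 0:
        raise RuntimeError(f"stage failed: {cmd}")

# Example (operator and paths are illustrative):
# run_stage_with_inline_replication(["xz", "-9", "-c"], Path("/in/pcap.00123"), "pcap.xz.00123")
```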
We assume a lack of malice because the cost of protecting against an adversarial user is quite high and the users are part of the same organization. However, we take typical precautions against bugs, for example protecting data from processing stages that fail. We also use group membership and authentication to control data access. Idempotent processing implies user code generates the same output when provided the same input, even if run multiple times or other things change.

Scope: We use per-block processing as the execution model in this chapter, and will extend it in Chapter 3. Our solutions presented in this chapter are equally applicable to those other execution models as well. Per-block processing implies the system does not manage any state across blocks. State within a large block can be maintained by the applications (and emitted in the outgoing data). For state maintenance across blocks, applications need a windowing abstraction. Windowing is one of the topics of Chapter 3. However, even in the absence of any windowing, applications can do stateful computation if they work at it: each application can write state to a new place, then another thread can watch for the state to appear and replay it (in order) when it is all there. As we will see in Chapter 3, Plumb provides efficient abstractions for windowing and state management—relieving developers of this tedious and error-prone activity. While our system takes special care to run each block only once at a time, under some scenarios (for example, network partitioning) processing for a block can run more than once concurrently. Our system ensures that output from only one of these concurrent instances goes to the output. The per-block execution model does not support shuffling or any other reductions over multiple blocks. Users needing such facilities can either use the Plumb API (Appendix A) to take data out of the system, process it using a specific abstraction, and bring it back inside Plumb, or, better, use one of Plumb's other abstractions (Chapter 3) suitable for this purpose.

2.2.2 Design Requirements and Case Study

Our system is designed to solve multi-user processing problems while being easy to use. Each user expresses their workflow through a framework that abstracts inputs and outputs, allowing Plumb to detect and eliminate duplicate computation and storage. The framework is flexible and we have evolved it over time, adding additional optimizations.

Multiple users implies that different individuals or groups contribute components to the pipeline over time. This requirement affects our choice of processing-similarity definition and supports de-duplication of computation (§2.2.4).

Large-block streaming means data constantly arrives at the system, and it is delivered in relatively large blocks. Many applications involve continuous data collection with analytics. Unlike other streaming systems that emphasize small events (perhaps single records), we process data in large blocks—from 512 MB to 2 GB in different deployments. Large blocks are important for applications where actions frequently span multiple records that are nearby in space or time, since those records can often be processed together. Large blocks also amortize processing costs and simplify detection of completeness (§2.2.3) and error recovery (§2.2.8). For our sample application of DNS processing, large blocks are motivated by the need to do TCP reassembly, since all packets for a TCP connection are usually in the same block.
Compression is much more efficient on bulk data (many MB). We find error handling (such as disk space exhaustion or data-specific bugs) and verification of completeness are easier when handling large, discrete chunks of data (instead of millions of small records). Debuggability is increasingly crucial in today's complex systems, and large-block processing not only makes manual inspection viable but also lets most debugging expertise from the serial world carry over to this context.

Streaming also implies that we must keep up with real-time data input over the long term. Fortunately, we can buffer and queue data on disk. At times (after a processing error in the current hand-coded system) we have queued almost two weeks of data, requiring more than a week to drain.

Ease-of-use for the programmer is an explicit design goal in Plumb. As with map-reduce [DG04], individual processing modules focus on discrete inputs and outputs and the framework handles parallelism. Inspired by Storm [TTS+14] queue management, we adapt the parallelism of each pipeline stage to match the workload skew (§2.2.8).

Pipeline-graph Abstraction and Programming Model: Users express their workflow by defining an input-processing-output (IPO) graph, where each processing element has one or more inputs and outputs. The format of any input or output is defined by a name, and all users share a common name space. (We use conventions analogous to the naming of Java or Go libraries to deconflict name-space management.) The full pipeline is the concatenation of such named stages in the IPO workflow. Two users utilizing the same input or output name are, by definition, referring to the same data, and processing stages are considered identical when they share the same input and output types. The user semantics for a processing stage are to read a large block, process it, and write a new large block. Each large block is identified by its type and a unique sequence number.

Figure 2.1: The DNS processing pipeline, our case study described in §2.2.2. Intermediate and final data are ovals; computation occurs on each arc.

Figure 2.2: A pipeline graph for a portion of the DNS pipeline.

In this framework, operators and data are opaque to the system; processing elements are programs supplied by the user and are idempotent. User programs can be drivers for other frameworks for specific operator needs. Many networking pipelines use well-established tools that are serial and can relatively easily adapt to a cluster environment using our large-block abstraction. User programs can either tolerate possible analytic errors due to data cut-off at block boundaries, or use more sophisticated techniques to handle such cases if needed (Chapter 3). Figure 2.2 is an example use of the pipeline-graph abstraction.

Case Study - DNS: These requirements are driven by our case study: the B-Root DNS processing pipeline. Figure 2.1 shows the user-level view of this workflow: three different output files (the ovals at the bottom of the figure) include archival data (pcap.xz), statistics (rssacint) [Teaa], and processed data (message_question.fsdb.xz).
Generating this output logically requires five stages (the IPO squares), each of which has very different requirements for I/O and computation (shown later as Table 2.1). This pipeline has been in use for 3 years and takes as input 1.5 to 2 TB of data per day. We are extending it with additional stages, and migration of the current hand-crafted code to Plumb is already complete and in production use (§2.4).

2.2.3 Plumb Overview

We next briefly describe the workflow in Plumb to provide context for the optimizations described in the following sections. Figure 2.3 shows the overall Plumb workflow.

Figure 2.3: A user submits his pipeline into Plumb. A YAML-based job description (inputs, program, outputs) passes through the following steps: (1) request and get the current optimized pipeline graph; (2) extend the pipeline graph for a new pipeline; (3) submit the pipeline graph; (4) compile, security-check, and make a new optimized pipeline; (5) create new queues and establish publish-subscribe links; (6, 7) return status codes; (8) schedule and run the final optimized pipeline.

Users provide their workloads to Plumb (step 2 in Figure 2.3) as a YAML-based pipeline specification (Figure 2.2). Plumb integrates workloads from multiple users and can provide both graphical (similar to Figure 2.1 but with all redundancies removed) and textual (Figure 2.2) descriptions of the integrated graph. Plumb detects and optimizes away duplicate stages from multiple users (§2.2.4). Data access for each user is protected by proper authentication and authorization, mediated by a database of all available content (step 5).

Each pipeline stage has a single user-supplied program. Programs with only one input read from standard input, and those with one output write to standard output. Programs with multiple inputs or outputs have them specified as command-line FIFO stream arguments. Each stage is allocated a single core. These resource limits allow Plumb to densely allocate processing over many cores in multiple computers. Allocation is done within Hadoop YARN, so stages are isolated from each other.

When a user submits his or her pipeline graph, Plumb evaluates it and integrates it with graphs from other users. Plumb verifies the pipeline syntax. Then the system finds any processing or storage duplication across all users' jobs and removes it. Internally, our system abstracts storage as queues with data stored in a distributed file system, with the input and output of each stage bound to specific queues. If accepted, our system schedules each stage of the optimized pipeline to run in a YARN cluster. Stages typically require only small YARN containers (1 core and 1 GB RAM), allowing many to run on each multi-core computer in the cluster, and avoiding the external fragmentation (when multiple cores are needed for a task but are not available on any particular computer) that comes from scheduling larger jobs onto compute nodes. The system also returns each container after processing one instance of a user program for better and fairer resource sharing on the cluster. Our system assigns workers in proportion to the current stage slowness due to skew. Finally, each input block is stored as a file with a unique sequence number. Confirming that all sequence numbers have been processed is a useful check of completion, and we can set aside files that trigger errors for manual analysis (§2.2.8).
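To make the pipeline-graph specification concrete, the fragment below sketches what one user's YAML description of two stages in the spirit of Figure 2.2 might look like. The exact field names and program paths are illustrative only and may differ from Plumb's real schema; what matters is that each stage lists its inputs, an opaque program, and its outputs, all drawn from the shared name space:

    # Hypothetical pipeline-graph fragment (field names are illustrative).
    - stage: decompress
      input: [pcap.sz]
      program: /users/alice/bin/run-snzip     # opaque user binary
      output: [pcap]
    - stage: anonymize
      input: [pcap]
      program: /users/alice/bin/run-dnsanon   # opaque user binary
      output: [message.fsdb, message_question.fsdb]

Because names such as pcap.sz, pcap, and message_question.fsdb are global, a second user who lists the same inputs and outputs is, by definition, describing the same stages, which is what allows Plumb to de-duplicate them (§2.2.4).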
To provide de-duplication, our optimizations combine stages that perform identical computation (§2.2.4), combine storage of intermediate and final output (§2.2.5), and identify I/O-bound stages that are better merged with upstream or downstream computation (§2.2.6). We also explain solutions to structural skew (§2.2.7) and computational skew (§2.2.8) in large-block, multi-user, streaming workloads.

2.2.4 De-duplicating Processing and Data via the Pipeline Graph

We detect duplication by identifying identical data when merging user-supplied workflows. Figure 2.2 is an example of a user's pipeline graph. In the pipeline graph, each stage defines its input, the computation to take place, and its outputs. Input and output are identified by global names such as pcap, DNS, anon-DNS, etc. Data is opaque to Plumb, and any format or structural information about inputs and outputs is left to user applications. By definition, any stages that use the same named inputs and outputs refer to the same backing data.

Plumb can now eliminate processing duplication by detecting stages submitted by multiple users that have the same set of inputs and outputs. We make this judgment based on textual equivalence of the input and output names of a program in the pipeline graph, since the general problem of algorithmically determining that two programs are equivalent is undecidable [Sip13]. Plumb considers all user programs as black boxes and assumes no knowledge of operators. Many operators in the networking domain do not lend themselves to SQL-like representation and do arbitrary computation. The outcome of merging users' pipelines is a combined pipeline that contains all users' computation. We then schedule this computation in a cluster with YARN [VMD+13]. As an example, when a new user (the right-most branch in Figure 2.1) submits his pipeline (Figure 2.2), Plumb detects that the first two stages are identical to some other users' stages and de-duplicates them. Consequently, the input to the new user's third stage was already available and hence reused (Figure 2.4).

Figure 2.4: Optimized pipeline before (left, with two users) and after (right) merging a third user's pipeline.
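The name-based duplicate detection just described reduces to a small amount of bookkeeping. The sketch below is hypothetical Python, not Plumb's actual code: it keys every black-box stage by the textual names of its inputs and outputs and keeps only one copy of each such stage when merging pipeline graphs from several users.

    def merge_pipelines(user_pipelines):
        """Combine per-user stage lists into one de-duplicated pipeline.

        Each stage is a dict with 'input' and 'output' (lists of global,
        domain-specific data names) and 'program' (an opaque user binary).
        Stages are treated as identical when their input and output name
        sets match textually, so the combined pipeline runs them only once.
        """
        combined = {}
        for pipeline in user_pipelines:
            for stage in pipeline:
                key = (frozenset(stage["input"]), frozenset(stage["output"]))
                combined.setdefault(key, stage)   # later duplicates are dropped
        return list(combined.values())

    # Example: a third user resubmitting the snzip and dnsanon stages adds
    # nothing new; only their final xz stage joins the combined pipeline.
    user_a = [{"input": ["pcap.sz"], "output": ["pcap"], "program": "snzip"},
              {"input": ["pcap"],
               "output": ["message.fsdb", "message_question.fsdb"],
               "program": "dnsanon"}]
    user_c = user_a + [{"input": ["message_question.fsdb"],
                        "output": ["message_question.fsdb.xz"],
                        "program": "xz"}]
    print(len(merge_pipelines([user_a, user_c])))   # prints 3, not 5

The judgment is deliberately this textual one: since general program equivalence is undecidable, the shared name space is what makes de-duplication tractable.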
2.2.5 Data Storage De-duplication

De-duplication of identical computation and data requires safe storage of a single copy of each input and output. We store exactly one copy of each large block (that is, a single file) in a shared storage system. To store a single copy of the data, but with the user illusion of private individual data, we use a publish-subscribe system. This system emulates Linux hard links using a database for metadata and the HDFS distributed file system to store the actual data. Plumb subscribes each pipeline stage to its input. Whenever an input data item appears, our system publishes it to all registered subscribers by placing an emulated hard link per subscriber. The storage efficiency and at-least-once processing semantics rely on the unique identification of data blocks. To uniquely identify a data item across the system, we enforce a two-level naming scheme on all file names. Each file name has two parts: a data-store name based on user-provided input or output names (in the pipeline graph), and a time-stamp plus a monotonically increasing number (for example: 20161108-235855-00484577.pcap.sz).

Although there is one logical copy of each output, HDFS replicates the data multiple times (default: 3). HDFS replication provides reliability in the case of machine failures.

In our multi-user processing environment, security is very important and fully enforced. For all user interactions with the system, we use two-way strong authentication based on digital signatures. For data-access authorization, we use HDFS group membership. Any user's pipeline is only accepted for execution if that user has access to all the data formats mentioned in his pipeline graph.

2.2.6 Detecting I/O-Bound Stages

While we strive for computation and storage de-duplication, in some cases duplicate computation saves run time by reducing data movement across stages via HDFS. Prior systems recognize this trade-off, recommending the use of lightweight compression between stages as a best practice. We generalize this approach by detecting I/O-bound stages; in §2.3.2 we show the importance of this optimization to good performance.

We automatically gather performance information about each stage during execution, measuring bytes in (I), bytes out (O), and compute time (P, which includes time for data read and write along with CPU time). From this information we can compute the I/O-intensity of each stage as follows:

    IO-intensity of a pipeline stage (MB/s) = (I + (HDFSReplicationFactor × O)) / P        (2.1)

Here the value of HDFSReplicationFactor is three due to our use of 3-way HDFS data replication. For our cluster's hardware, we consider stages with I/O-intensity of more than 50 MB/s to suggest that the computation should be duplicated to reduce I/O; clusters with different hardware and networks may choose different thresholds. Such a threshold depends on cluster hardware and can be established empirically by running a stage known to have high I/O intensity.

    stage     I     O     P    IO-Intensity
    snzip     765   2048  28   246.75
    dnsanon   2048  1590  180  37.87
    rssac     717   73    180  5.2
    xz1       2048  235   780  3.52
    xz2       873   110   540  2.23

Table 2.1: Relative costs of input and output (I and O, measured in megabytes), processing (P: I read time + CPU time + O write time, in seconds), and the IO-intensity relating them.

Table 2.1 shows an example of I/O intensity from our DNS pipeline. It correctly identifies the snzip decompression stage as the most I/O-bound of all. Identification of I/O-bound stages allows the user to restructure the pipeline. We recommend that users duplicate lightweight computation to avoid I/O by connecting it through a pipe with another CPU-bound stage. In principle, this step could be automated, but we encourage user control of structural changes to support informed decisions about what is merged and to ensure that users are aware of intermediate data for debugging.
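As a concrete check of Equation 2.1, the short sketch below (hypothetical Python, not part of Plumb) recomputes the IO-intensity column of Table 2.1 from the measured I, O, and P values, using the 3-way replication factor and the 50 MB/s threshold discussed above:

    # Per-stage measurements from Table 2.1: I and O in megabytes, P in seconds.
    STAGES = {
        "snzip":   (765, 2048, 28),
        "dnsanon": (2048, 1590, 180),
        "rssac":   (717, 73, 180),
        "xz1":     (2048, 235, 780),
        "xz2":     (873, 110, 540),
    }
    REPLICATION = 3    # 3-way HDFS replication
    THRESHOLD = 50     # MB/s; cluster-specific, established empirically

    def io_intensity(i_mb, o_mb, p_sec, replication=REPLICATION):
        """Equation 2.1: (I + replication * O) / P, in megabytes per second."""
        return (i_mb + replication * o_mb) / p_sec

    for name, (i, o, p) in STAGES.items():
        mbps = io_intensity(i, o, p)
        flag = (" <- I/O-bound; candidate to merge with a CPU-bound stage"
                if mbps > THRESHOLD else "")
        print(f"{name:8s} {mbps:7.2f} MB/s{flag}")

Only snzip crosses the threshold, matching the observation above that it should be duplicated and connected through a pipe to a downstream, CPU-bound stage rather than staged through HDFS.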
2.2.7 Mitigating Structural Skew

Structural skew is when one stage of a pipeline is slower than the others in the pipeline. We can define it as the ratio of the execution time of the slowest stage to that of the fastest. As an example, our DNS pipeline (Figure 2.1) has a structural skew ratio of 27.9, comparing the xz stage to the snzip stage.

One way structural skew can materialize is when unbalanced pipeline stages run concurrently. Individual users often use such a run configuration, assuming it will minimize their wait time. In reality, when stages are skewed, doing so wastes computational resources, because faster stages need to wait on slower stages, and during this time resources are held but idle.

We mitigate structural skew by creating additional workers for slow stages and scheduling stages independently. We share this approach with other big-data systems (for example MapReduce [DG04], and even early systems like TransScend [FGC+97]), but we go beyond strictly independent stages by supporting stage execution both independently (here) and concurrently (using pipelines for reduced I/O contention but without wasting resources). Such concurrent execution (as compared to traditional operators where the next stage can only start when the previous one completes) provides many pipeline execution alternatives. We describe these alternatives next and compare them in §2.3.5.

When mapping pipeline stages to YARN containers, we can merge stages or split them across containers, so one stage may require one or more cores, with parallelism either between containers or inside each container. In addition, containers can buffer output in the file system, or stream it through pipes. Table 2.2 shows the four options we consider.

Table 2.2: Comparison of different configurations for mapping stages to containers. RAT: Resource Allocation Time. SFD: Skew management, Fault-tolerance, and Debugging. Fragmentation refers to unused CPU inside an allocated execution container. Columns, in order: Throughput, Latency, Cost Efficiency, Disk Use, Fragmentation, RAT, Cluster Sharing, Stage Scaling, SFD, Heterogeneous Cluster.

    Linearized/multi-stage/1-core:                    High, High, Good, High,   No,  Low,  Good,  Bad,  Complex, Higher average latency
    Parallel/multi-stage/multi-core with limits:      Low,  Low,  Low,  Low,    Yes, High, Worse, Bad,  Complex, Higher structural skew
    Parallel/multi-stage/multi-core without limiting: High, Low,  Good, Low,    No,  High, Bad,   Bad,  Complex, Higher structural skew
    Linearized/single-stage/1-core:                   High, High, Good, Higher, No,  Low,  Good,  Good, Simple,  Lower latency

With Linearized/multi-stage/1-core, each input file is assigned a single container with one core, which runs each stage sequentially inside that task. This scheme is efficient and flexible, providing excellent parallelism across input files. However, it has high latency (because pipeline stages run sequentially, one after the other) and is unable to mitigate the computational skew problem (because we cannot increase the container assignment for an affected stage).

For Parallel/multi-stage/multi-core with limits, multiple stages map to a single YARN container with as many cores as the number of stages in the pipeline. We then run all stages in parallel, with the output of one feeding directly into the next, and we strictly limit computation to the number of cores that are assigned (using an enforcing container in YARN). The advantage of this approach is that data can be shared directly between processes running in parallel, rather than through the file system. The difficulty is that it is hard to predict how many cores are required: structural skew means some stages may under-utilize their core (resulting in internal core fragmentation), or stages with varying parallelism may overly stress what they have been allocated (and will add to latency). Figures (a) and (b) in Figure 2.13 show that this configuration requires more container-hours compared to other choices, while providing only a modest reduction in latency.
For Parallel/multi-stage/multi-core without limiting, we assign multiple stages to a single YARN container with as many cores as there are stages in the pipeline, running in parallel with one core per stage, but here we allow the container to consume cores beyond what it strictly allocates. The challenge is to come up with the right degree of multiprogramming for the duration of the pipeline execution. The risk here is that resources are stressed—if we under-provision cores per container, we reduce internal core fragmentation, but we also stress the system as a whole when computation exceeds the allocated number of cores. Figure (c) in Figure 2.13 shows an 8-core server from a deployment that started well with very little core waste, but became overloaded over time as workload characteristics changed. (This approach might benefit from techniques that adapt to system-wide over-commitment by adjusting limits and throttling computation on the fly [ZTH+13]; such an approach is not yet widely available, needs complex feedback loops, and assumes a good combination of short and long-lived jobs.)

With Linearized/single-stage/1-core, we assign each stage (or two adjacent stages for I/O-limited tasks (§2.2.6)) to its own YARN container with a single core. In effect, the pipeline is disaggregated into many independent tasks. This approach minimizes both internal and external core fragmentation: there is no internal fragmentation because each stage runs to completion on its own, and no external fragmentation because we can always allocate stages in single-core increments. It also solves structural skew since we can schedule additional tasks for stages that are slower than others. The downside is that data between stages must queue through the file system, but we minimize this cost with our I/O-based optimizations. In §2.3.5 we compare these alternatives, showing that Linearized/single-stage/1-core is the most efficient.

2.2.8 Mitigating Computational Skew

Our solution to structural skew (running each stage independently) also addresses computational skew. The challenge of computational skew is that a shift in input data can suddenly change the computation required by a given stage.

To detect computational skew we monitor the amount of data queued at each stage over time (recall that each stage runs separately, with its own queue, §2.2.5). We then reduce the effects of computational skew by assigning computational resources in proportion to the queue lengths. The stage with the longest queue is assigned the most computational resources. We sample stages periodically (currently every 3 minutes) and ensure that no stage is starved of processing.

Another risk of computational skew is that processing for a stage grows so much that the stage program times out. Our use of named, large-file processing helps here, since we can detect repeated failures on a given input and set those inputs aside for manual evaluation, applying the error-recovery processes from MapReduce [DG04] to our streaming workload.

Plumb's pipeline graph addresses both structural and computational skew with no additional user effort. As new, improved skew solutions evolve, the system can transparently switch to and deploy them. Next, we evaluate the efficacy of our new methods using B-Root DNS workloads. Besides DNS, our new methods apply to many applications (Chapter 4).
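One way to read the scheduling policy above is as a simple proportional allocator: each epoch (about 3 minutes in our deployment), hand out the available single-core containers in proportion to per-stage queue lengths, while making sure every backlogged stage gets at least one worker. The following is an illustrative sketch of that policy in Python, not Plumb's actual scheduler:

    def assign_workers(queue_lengths, total_workers):
        """Split single-core containers across stages by queue length.

        queue_lengths: dict of stage name -> number of blocks waiting.
        Returns a dict of stage name -> worker count for the next epoch.
        """
        workers = {stage: 0 for stage in queue_lengths}
        backlog = sorted(((q, s) for s, q in queue_lengths.items() if q > 0),
                         reverse=True)
        if not backlog or total_workers <= 0:
            return workers

        # Starvation avoidance: every backlogged stage gets one worker first
        # (longest queues first if there are more stages than workers).
        for _, stage in backlog[:total_workers]:
            workers[stage] = 1

        remaining = total_workers - sum(workers.values())
        pool, total_queued = remaining, sum(q for q, _ in backlog)
        for queued, stage in backlog:          # proportional share of the rest
            share = min(remaining, pool * queued // total_queued)
            workers[stage] += share
            remaining -= share
        if remaining > 0:                      # rounding leftovers
            workers[backlog[0][1]] += remaining
        return workers

    # Re-run every epoch (about 3 minutes) with fresh queue measurements, e.g.:
    # assign_workers({"snzip": 4, "dnsanon": 60, "xz1": 10, "rssac": 2}, 25)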
2.3 Evaluation

We next evaluate how our design choices improve efficiency. We measure efficiency as cluster-container hours, so lower numbers are better. When using a commercial cloud, these hours translate directly to cost. A private cluster must be sized well above the mean cluster hours per input file, so efficiency translates into time to clear bursts, or into availability of unused cluster hours for other projects.

2.3.1 Benefits of De-duplication

We first examine optimizations to eliminate duplicate computation (§2.2.4) and data storage (§2.2.5) across multiple users.

To measure the benefits of de-duplication, we use the DNS processing pipeline (Figure 2.1) with 8 stages, two of which (snzip and dnsanon) are duplicated across three users. We run our experiment on our development YARN cluster with configuration A from Table 2.3, scheduling stages as soon as inputs are available.

    resource            config. A   config. B
    Servers             30          37
    vCores              328         544
    Memory (GB)         908         1853
    HDFS Storage (TB)   139         224
    Networking (Mbps)   1000        1000

Table 2.3: Cluster capabilities for experiments. One vCore uses one physical CPU exclusively.

As input, we provide 8, 16, or 24 files (each 765 MB) and measure the total container hours consumed. In a real deployment, new data will always be arriving, and compute cycles will be shared with other applications.

We compare measurements from our DNS pipeline running with related sample data in two configurations: unoptimized and de-duplicated. With the unoptimized configuration all stages run independently, while with de-duplication, identical stages are computed only once (§2.2.4), with data between each stage buffered in a file (§2.2.5). We expect that removing redundant computation will reduce overall compute hours and HDFS storage, lowering the time to finish a given input size and freeing resources for other concurrent jobs. These benefits are a function of how many duplicate stages can be combined, so an additional duplicated stage (for example, a decryption stage) would provide even more savings relative to an unoptimized pipeline.

Figure 2.5: Comparing un-optimized (left bar) and de-duplicated (second bar) experimental pipelines and modeled (right two bars) for different workloads (bar groups). Mean of 10 runs with standard deviations as error bars.

Experimental Evaluation: Figure 2.5 shows our results for three different input sizes (the left three groups of bars), and then for two different pipelines with 8 inputs (the right two groups). The de-duplicated pipeline (the second bar from the left) is always faster than unoptimized (the leftmost bar). To evaluate the effect of input size, we first compare within each of the left three groups of bars. With 8 files as input, de-duplication uses 39% fewer container hours (compare the first and second bar on the left in each group). These benefits increase
36 As a final variation of the pipeline, we add two users to the tail of the pipeline (consuming dnsanon output and emitting statistics). In this case, the rightmost set of bars, we see that de-duplication shows a 6 speedup relative to unoptimized, and a much greater savings compared to either of the other two pipeline structures. Adding additional users late in the pipeline exposes significant potential savings. Modeling De-Duplication: To confirm our understanding, we defined a simple ana- lytic model of speedup based on observed times and a linear increase of execution time for duplicated stages. In each group, the right two bars show our model’s prediction for unoptimized and de-duplicated cases. The model significantly underestimates the savings we see (compare the change between the two right bars of each group to the change between the two left bars), but it captures the cost of additional input files and the benefits of additional stages that may be de-duplicated. This underestimate is due to I/O overhead of synchronized file access (a “thundering herd problem”) as we evaluate next. This model confirms our understanding of the underlying speedup. De-Duplication and Storage: Like processing, disk usage decreases as we de- duplicate. Figure 2.6 shows the amount of unique data that is stored over time (data is three-way replicated, so actual disk use is triple) from the experiments of Figure 2.5. Each point is the mean of ten runs. In this experiment, de-duplication reduces storage by 55% relative to un-optimized for the 8-file workload (compare the peak of the green squares at time 600 s). Other workloads show similar or greater savings. Storage is particularly high in the middle of each run when the cluster is fully utilized. Greater storage not only consumes storage space, but can result in disk arm contention with disks, and even network contention in clusters that lack a Clos network. This experiment shows latency reduction as well. With fewer required resources, the backlog of input data is cleared more quickly, as shown by earlier termination 37 0 5x10 10 1x10 11 1.5x10 11 2x10 11 2.5x10 11 0 500 1000 1500 2000 2500 3000 3500 4000 4500 de-dup un-opt 8 files 16 files 24 files 8-files-2way-stream 8-files-5-users HDFS Queues Size (Bytes) Time (20 sec steps) Figure 2.6: Amount of data stored over each processing run for dierent workloads (color and shape), unoptimized (filled shapes) and de-duplicated (empty shapes). of the de-duplicated cases (empty shapes) relative to their unoptimized counterparts (filled shapes). For example, the 8-file-5-user case terminates around 2700 s, with de- duplication, while unoptimized takes about 4100 s, 50% longer (compare the rightmost filled curve against its unfilled pair). Summary: These experiments show significant duplicate computation can be elim- inated in multi-user workloads, and that these savings grow both with additional input data and additional pipeline complexity. These savings reduce costs in the cloud, or latency in a dedicated cluster. 38 0 5 10 15 20 25 30 35 40 8f-unopt 8f-opt 16f-unopt 16f-opt 24f-unopt 24f-opt YARN Container Hours snzip dnsanon xz-1 & xz-2 message_to_rssacint Figure 2.7: Computation spent in each stage (bar colors), unoptimized (left bar) and de-duplicated (right), for three size workloads (bar groups). 2.3.2 I/O Costs and Merging We next evaluate the improvements that are possible by reducing I/O. Reducing I/O is particularly important as the workload becomes more intense and disk contention occurs. 
For example, when we increased the size of the initial workload from 8 to 16 and 24 files (the left three groups in Figure 2.5), doubling the input size increased the container hours by about 2.3× for the un-optimized case. (SSDs avoid spindle contention, but hard disks are still often used in big-data projects where capacity cost can be critical.) We first establish that I/O contention can be a problem, then show that merging I/O-intensive stages with CPU-intensive stages (§2.2.6) can reduce this problem.

The problem of I/O contention: Figure 2.7 examines the container hours spent in each stage for our three workloads, both without optimization and with de-duplication. We see that the compute hours of the snzip stage grow dramatically as the workload size increases. Even with de-duplication, snzip consumes many container hours even though it actually requires little computation (Table 2.1, where it is 6× faster than the next slower stage). This huge increase in cost for the snzip stage is because it is very I/O intensive (in Table 2.1, its I/O-intensity is 6× to 100× that of the other stages), reading a file of about 765 MB and creating a 2 GB output in a short amount of time. Without contention, this stage takes 28 s, but when multiple concurrent instances are run with 3-way replication underneath, we see a significant amount of disk contention. Some of this cost is due to our hardware configuration, where our compute nodes have fewer disk spindles than cores. But even with a more expensive SSD-based storage system, contention can occur over memory and I/O buses.

Benefits of Merging I/O-Bound Stages: Next we quantify how throughput improves when we allow duplicate I/O-intensive stages and merge them with a downstream, compute-intensive stage. This merger avoids storing I/O on stable storage (including replication); the stages communicate directly through FIFO pipes. We expect that this reduction in I/O will make computation with merged stages more efficient because merged stages read and write data at a lower rate per unit time.

We examine three different input sizes (8, 16, and 24 files) with the DNS pipeline in four different configurations: first without computation de-duplication and with each stage in a separate process, then merging the snzip stage with the next downstream stage, then adding compute de-duplication. Figure 2.8 shows the different input sizes as clusters of bars, with each optimization as one of the bars in the cluster.

Figure 2.8: Computation with and without I/O merging, with duplicated processing (left two bars) and de-duplicated (right two bars), for three workload sizes (bar groups).

We expect merging the snzip stage with the next stage to both reduce I/O and to lower compute time through less disk contention. Comparing the left two bars (brown and green) of each group shows that this optimization helps a great deal. Stage de-duplication still helps (compare the right two bars of each group), but the relative savings are much less, because the amount of I/O contention is much, much lower.

Summary: We have shown that I/O contention can cause a super-linear increase in cost. Balancing I/O across stages by running I/O-intensive stages together with the next stage can greatly improve efficiency, reducing container hours even if some lightweight stages duplicate computation.
We saw up to 2× lower container-hour consumption, and this benefit increases with higher I/O contention. That merging snzip helps should be expected—enabling compression for all stages of MapReduce output is a commonly used best practice, and the snzip protocol was designed to be computationally lightweight. However, Plumb generalizes this optimization to support merging any I/O-intensive stages, and we provide measurements to detect candidate stages to merge.

2.3.3 Pipeline Disaggregation Addresses Structural Skew

Pipeline disaggregation (§2.2.7) addresses structural skew by allocating additional workers to slower stages to run in parallel.

Structural skew occurs when two stages have unbalanced run times and they are forced to run together, allowing progress at only the rate of the slowest stage. Concurrent running of stages is possible due to the streaming nature of the operators, and is often the preferred execution strategy for end-users because of ease of coding (using FIFOs) and the belief that it provides low latency. Here we first demonstrate the problem, then show how pipeline disaggregation addresses it.

The problem of structural skew: To demonstrate structural skew we use a two-stage pipeline where the first stage decompresses snzip-compressed input and re-compresses it with xz, and the second stage does the opposite. Xz compression is quite slow, while xz decompression and snzip compression and decompression are quite fast. (The first stage runs in about 20 minutes, but the second runs in less than 1 minute.)

We compare two pipeline configurations: aggregated and disaggregated. With an aggregated pipeline, both stages run in a single container with 2 cores and 2 GB of RAM, with the processes communicating via pipes. For a disaggregated pipeline, each of the stages runs independently in a container with 1 core and 1 GB of memory, with communication between stages through files. Thus the aggregated pipeline will be inefficient due to internal core fragmentation, since the first stage is 20× slower than the second, while the disaggregated pipeline will have somewhat greater I/O costs but its computation will be more efficient.

Figure 2.9: Comparing aggregated (left bar) and disaggregated (right) processing, with delay added by cpulimit (top row) and sleep minutes (bottom row). x is the baseline skew between the two configurations.

During the experiment we vary skew in two ways: first by reducing the available CPU in the OS with cpulimit, and second by lengthening computation with intentional sleeps. Figure 2.9 compares aggregated and disaggregated pipelines (the left and right of each pair of bars), examining compute time used (the left two graphs) and wall-clock latency (the right two graphs), with both methods of slowdown (cpulimit in the top two graphs and sleep in the bottom two). We see that disaggregation is consistently much lower in compute minutes used (compare the left and right bars in the left two graphs).
Aggregated usually takes twice the number of compute minutes because one of its cores is often idling. Disaggregation adds some latency (compare the right bar to the left in the bottom graph), but only a fixed amount. This latency reflects queuing intermediate data on disk. We see generally similar results for both methods of slowdown, except that CPU throttling shows much greater variance. This variance follows because our Hadoop cluster has nodes of very different speeds.

Summary: This experiment shows that disaggregation can greatly reduce the overhead of structural skew, although at the cost of slightly higher latency.

2.3.4 Disaggregation Addresses Computational Skew

Disaggregation is also important to address computational skew. With computational skew, changes in input temporarily alter the compute balance of the stages of the pipeline. Disaggregation enables dynamic scheduling, where Plumb adjusts the worker mix to accommodate the change in workload (§2.2.8). We next demonstrate the importance of this optimization by replaying a scenario drawn from our test application.

We encountered computational skew in our DNS pipeline when data captured during a DDoS attack stressed the dnsanon stage of processing—TCP assembly increased stage runtime and memory usage six-fold. We reproduce this scenario here by replaying an input of 100 files (200 GB of data when uncompressed), while changing none, half, or all of the data to data from a DDoS attack. (Both regular and attack traffic are sampled from real-world data.) We use YARN with 25 cores for this experiment. We then measure throughput (container hours) and time to process all input.

Figure 2.10 shows latency in this experiment. (Throughput, measured by container hours, is similar in shape, as shown in Figure 2.11.) The strong result is that dynamic scheduling greatly reduces latency as computational skew increases, as shown by the relative difference between each pair of bars. Without any DDoS traffic we can pick a good static configuration of workers, but dynamically adapting to the workload is important when data changes.

Figure 2.10: Dynamically scheduled workers always beat static assignment, with the margin increasing with increasing skew (compare the left bar with the right bar).

Figure 2.11: As skew increases, static allocation wastes more resources (compare the left bar with the right bar).

To show how the system adapts, each column of Figure 2.12 increases the amount of computational skew (the fraction of DDoS input files). Each row of graphs shows one aspect of operation: the number of workers, the number of files output from each of the three final stages, and the queue lengths at each stage. In the top row we see how the mix of workers changes with dynamic scheduling as the input changes, with more dnsanon processes (the blue line with the * symbol) scheduled as skew increases. Second, comparing the last row with the top row, we see that available workers are strictly assigned according to queue lengths.
Figure 2.12: Dynamic scheduling at 0%, 50% and 100% skew. Each column shows one skew level; the rows show per-stage worker counts, terminal-stage file counts, and queue lengths, sampled at epochs 3 minutes apart.

2.3.5 Comparing Design Alternatives for Stages per Container

In §2.2.7 we examined four alternatives (Table 2.2) for mapping pipeline stages into containers. We suggest that linearized/single-stage/1-core provides flexibility and efficiency. The alternative is to allow many stages to run in one container, with or without resource over-commitment. Here we show that both of those alternatives have problems: without over-commitment is inefficient, and allowing over-commitment results in thrashing.

The top two graphs in Figure 2.13 compare parallel/multi-stage/multi-core with limits (left bar in each group) against linearized/single-stage/1-core ("disaggregated", the right bar in each group) for 4 sizes of input data. The left graph examines resource consumption, measured in container hours, and the right, latency. We see that disaggregation cuts resource consumption to less than half because it avoids internal core fragmentation (idling cores). The effect on latency (right graph) is present but not as clear in this experiment; latency differences are difficult to see because CPU heterogeneity in our cluster results in high variance in latency.

Figure 2.13: Comparing alternative configurations of stages per container. Parallel/multi-stage/multi-core with limits is marked as lumped. Panels: (a) throughput (YARN container hours), (b) latency (average delay in minutes), and (c) parallel CPU sharing (load averages and total cores over time). The bottom-row graph is from one server; the others in the cluster are similar.

The bottom graph in Figure 2.13 evaluates parallel/multi-stage/multi-core with and without limits by showing compute load over almost two years. Load is taken per minute from measurements of system load on one 8-core compute node of our hand-built pipeline. There are at most 4 concurrent stages in our hand-built pipeline. Around Feb. 2017 (about one-third of the way across the graph) we changed this system from under-committed, with each container including 4 cores, to over-committed, with each container allocating only 2 cores.
Under-committed resources ran well within machine capacities, but often left cores idle (as shown in the top two graphs). Over-commitment after Feb. 2017 shows that the load average often peaks with the machine stressed by many more processes to run than it has cores. We conclude that it is very difficult to assign a static number of cores to a multi-process compute job while avoiding both under-utilization and over-commitment. With dynamically changing workloads, one problem or the other will almost always appear. Disaggregation with linearized/single-stage/1-core avoids this problem by better exposing application-level parallelism to the batch scheduler.

2.3.6 Improving Throughput: Overall Evaluation on Real Inputs

To provide an overall view of the cumulative effect of these optimizations, we next look at the runtime to process two different days of real-world data, with a third day showing a synthetic attack to show differences in our system. (We cannot directly compare our system with streaming systems like Spark for several reasons. Our LBS workloads don't map to Spark's distributed shared memory (RDD) model, our working set changes rapidly (hence not benefiting from an in-memory cache), and Spark has no analog for our multi-user data and code de-duplication.)

We use data from two full days of B-Root DNS: 2016-06-22, a typical day with 1.8 TB of data in 896 files, and 2016-06-25, a day when there was a large DDoS attack [Roo16]. Because of network rate-limiting during the attack, the second day has less data, with about 1.4 TB of traffic in 711 files. To account for this shortfall, we construct a synthetic attack day where we replicate one attack input 896 times to get the same traffic volume as the normal day, but with data that is more expensive to process (the dnsanon stage takes about 6× longer for attack data than for regular data). This synthetic attack data depicts a worst-case scenario where a full day of traffic is stressful. (Input data is compressed with xz rather than snzip as in our prior experiments, but xz decompression is fast and has minimal effect on the workload.) We process this data on our compute cluster with a fixed 100 containers, while using all optimizations (de-duplication and skew mitigation) and I/O-bound stage merging.

    scenario           date         input                latency
    normal day         2016-06-22   1.8 TB / 896 files   8.30 h
    attack day         2016-06-25   1.4 TB / 711 files   6.20 h
    simulated attack   —            1.8 TB / 896 files   11.75 h

Table 2.4: Plumb latency: one day of DNS data using 100 cores.

Table 2.4 shows the results of this experiment. The first observation is that, in all cases, Plumb is able to keep up with the real-time data feed, processing a day of input data in less than half a day. Our pre-Plumb system processed data with hand-coded software using a parallel/multi-stage/multi-core without limiting strategy (all stages run in a single large YARN container). Based on estimates from individual file processing times, we believe that Plumb requires about one-third the compute time of the pre-Plumb system. Most of the savings results from the elimination of internal core fragmentation: we must over-size our hand-built system's containers to account for worst-case compute requirements (if we do not, tasks will terminate when they exceed the container size), but that means that the containers are frequently underutilized.

We were initially surprised that the day of DDoS attacks was processed faster than the typical day (6.2 h vs. 8.3 h), but the savings follows from less saved data.
This drop in traffic is due to a link-layer problem with B-Root's upstream provider, where we were throttling attack traffic before collection and thus not receiving all traffic addressed to B-Root. The actual traffic sent to B-Root on that day was about 100× normal during the attack. We correct for this under-reporting with our synthetic attack data, which shows that a day-long attack requires about 40% more time to complete processing than our regular day. We discuss deployment status in §2.4.

Figure 2.14: Latency: Plumb vs. batch-based processing (cumulative distribution of RSSAC per-block processing time, in seconds).

2.3.7 Improving End-to-End Latency

Plumb's goal is to minimize the latency of streaming block data. To evaluate latency, we compared end-to-end latency for Plumb and our pre-Plumb, Hadoop-based system over 24 hours of data (1393 files, each 2 GB) with our DNS workflow (Figure 2.1) in Figure 2.14. Current Plumb latency (left) is much lower, with a median latency of 695 s instead of 2724 s for RSSAC files. In addition, Plumb latency is much more consistent, with a standard deviation of 50 s instead of 614 s (compare the narrow range of Plumb against the wide range of batch). Plumb latency is lower because it processes blocks as they arrive, rather than batching them. Plumb latency is good at about twice the theoretical minimum (see Figure 2.1), but it has some room for optimization.

2.3.8 Ease of Use

Another of Plumb's goals is ease of use, letting developers express their workloads succinctly. We use the number of lines of code as our primary measure of ease. The three users in our example (Figure 2.1) provide their pipeline-graph representation of the workflow (similar to Figure 2.2, but with the first stage merged with the downstream stage to reduce I/O intensity); it contained 3 lines for the first user (pcap.sz to pcap.xz) and 6 for the remaining two (pcap.sz to rssacint, and pcap.sz to message_question.fsdb.xz). Before employing new workflows, developers can see the current state of the optimized pipeline as a topologically sorted graph (similar to Figure 2.1 but with redundancies removed) or the corresponding YAML text representation (similar to Figure 2.2). The programs in the workflow stages of the pipeline graph were ported to our new system from the older deployment with only minor changes. These changes were to accommodate stage merges (§2.2.6).

2.4 Deployment Status

Plumb is currently in production use at B-Root for DNS analysis. Plumb keeps up with real-time processing with fewer resources than our prior system, and with much lower per-file latency. At steady state, Plumb's queue is about 20 files deep, while our prior system varied from 50 to 150 over the day, and overall loads are much more consistent, without over- or under-utilization. Plumb's deployment has prompted new applications for multiple users to run on Plumb infrastructure.

2.5 Onward to Multiple Abstractions

We extend Plumb's ability to allow multiple abstractions in a workflow in the next chapter (Chapter 3). Some of our users need sophisticated reductions that operate on more than one data block at a time, typically to align data with a fixed time window (hours of the day, or a 24-hour period). Earlier, developers utilized Plumb's API (Appendix A) to extract data out of a queue, processed it with their desired system (for example, MapReduce), and fed the resulting data as large blocks back into Plumb.
We will see in the next chapter that doing so creates many correctness and performance challenges.

2.6 Conclusions

Plumb is designed for processing large-block, streaming data in a multi-user environment. Plumb's novelty comes from integrating workflows from multiple users while de-duplicating computation and storage, and from its use of dynamic scheduling to accommodate structural and computational skew. Success stories from the operational use of Plumb and an expanding deployment of workflows provide long-term validation of our design choices for efficient processing of multi-user workflows. Plumb has been in production use since November 2018, handling all B-Root DNS data and dealing well with increasing traffic. Plumb provides lower latency with fewer resources compared to our prior system, while its ease of use supports new users and analyses. Median latency drops to 695 seconds, as compared to 2724 seconds, on real-world DNS workloads (§2.3.7). Compute usage is reduced by 39%, while storage use is reduced by 55%, after employing our optimizations (§2.3.1). Developers can add a new stage to the workflow with just three lines of YAML text (§2.3.8). We believe Plumb can support several similar workflows.

This chapter proves part of our thesis statement—new methods enable efficient processing of multi-user workflows of streaming data. Our first new method constitutes algorithms to detect and remove duplication of data and processing across multiple users, and to mitigate skew. The Block-Streaming abstraction for per-block processing is a new, useful abstraction. In the next chapter we will see other abstractions and how to move between them. Block-Streaming coordinates with the algorithms from our first method to tackle performance anomalies due to skew from large-block streaming data of multiple users. Block-Streaming allows excellent parallelism and fault tolerance. The deduplication and skew-management algorithms, Block-Streaming, and related optimizations enable efficient processing of multi-user workflows by improving throughput, increasing utilization, reducing the end-to-end latency of results, and making development easy through succinct and simple interfacing with Plumb. We use real workloads from the B-Root DNS service and compare Plumb with a pre-Plumb system to empirically demonstrate the efficiency improvements.

In this chapter we used Block-Streaming as our primary execution model for per-block processing. In Chapter 3 we will provide additional abstractions and efficient interoperability between those abstractions to cover a broader range of workflows. In Chapter 4 we show that our new processing methods apply to many applications.

Chapter 3

Efficient Processing of Streaming Data in Multi-Abstraction Workflows

Chapter 2 proved part of our thesis statement (§1.1) by providing a new method to deduplicate processing and data, and to mitigate skew in multi-user workflows. Chapter 2 used the Block-Streaming abstraction to execute our DNS workloads. Here we introduce two additional abstractions and show that efficient bridging between the three abstractions is possible. In this chapter we focus on challenges from multi-abstraction workflows, and present our solutions. To meet the need to use multiple frameworks in different workflow stages, developers rely on custom code, and such custom code causes correctness, performance, and software maintenance challenges. This chapter presents two contributions.
First, we prove part of our thesis statement (a new method enables efficient processing of multi-abstraction workflows of streaming data) by showing that developers can efficiently move between multiple abstractions while our framework manages data movement between the abstractions. Second, we complement Block-Streaming with the two new abstractions of Windowed-Streaming and Stateful-Streaming, and show that collectively they cover a broad spectrum of workflows.

3.1 Introduction

Services like websites, web crawling, on-line advertising, advertising action systems, and the Domain Name System (DNS) generate continuous data streams of user and system activity. Analysis of this activity provides business intelligence, performance analysis, intrusion detection, and security monitoring that helps monetize, optimize, and secure these services. Such analytics often employ multi-stage workflows, and mature systems will have analytic components that are developed by different teams, requiring multiple, different software frameworks to provide efficient, effective, and timely results.
These abstractions cover a range of tasks in our case study, a DNS processing and analytics workflow (§3.2.1), as well as other applications we have considered.

With Block-Streaming, large data blocks (often 0.2 to 2 GB in size) are sent to the user processing functions, which emit new large blocks as output. When blocks can be processed out of order, Block-Streaming can exploit easy parallelism by processing blocks as they arrive, in parallel, across multiple cores or machines. In addition to supporting parallelism, Block-Streaming simplifies error handling, since the processing status of each block can be tracked, and when processing fails for a block, it can be automatically retried. This abstraction supports applications with single-pass ingestion, leveraging temporal or spatial locality of data, and with block-level parallelism. We have found Block-Streaming is a good fit for many aspects of statistical analysis of network traffic such as DNS.

Some applications require processing a fixed window of data—often a particular time period (a day or an hour), or a large amount of data. For these applications, we see Windowed-Streaming as a second important abstraction. Windowed-Streaming enforces time ordering on blocks and provides a window of such data to the developer for processing, which can then operate on the entire window in one workflow stage. Error handling in Windowed-Streaming is more complicated than with Block-Streaming, since blocks may arrive late, violating processing-time constraints, and if blocks are lost, a window may never be complete. Windowed-Streaming can automate error handling, providing common methods to handle completeness and timeouts based on the developer's requirements. We have found windowed processing is necessary to match the requirement that reporting occur at regular times (perhaps daily), while bulk data is more efficiently handled with fixed-size Block-Streaming. Periodic MapReduce analysis often assumes a time-based window of input data.

Finally, some applications require tracking state over all time. Such applications require that all data arrive, in order. Stateful-Streaming streams blocks with lower latency than Block-Streaming to a specific developer application instance that keeps long-running state. The user application here does not run to completion; rather, it continuously runs and consumes data. This abstraction supports applications that need running state, as in an intrusion-detection application.

Prior systems support each of these abstractions, but typically provide only one abstraction. For example, Spark [ZXW+16] and Kafka [KNR11] support record-level streaming, MapReduce [DG04] works well for periodic analysis of windowed data (but leaves window management to the developer), and most network tools such as intrusion detection systems (IDS) [Pax] expect to work directly on an infinite stream of data.

The first contribution of this chapter is to prove part of our thesis statement (§1.1)—providing a new method that enables efficient processing of multi-abstraction workflows of streaming data. To provide efficient processing of multi-abstraction workflows, we show that Plumb can manage data movement between the abstractions to achieve correctness, high throughput, low latency, and error handling.
The challenge in reaching this goal is to provide good performance as data moves between very different formats (large blocks of MB or GB; very large windows of hours or days and many GB or TB; and continuous streams), while providing clear error semantics and a simple programming workflow. We demonstrate that throughput increases, latency decreases, and developers do not need their custom glue code (§3.3). We demonstrate efficiency in a system that has been operational for over 3 years (initially with a custom workflow, now with our system), has processed more than 12 PB of DNS packet data (when uncompressed) and daily statistics for 3 years, and is now being used for streaming intrusion detection. Plumb has cut per-block latency by 74% (§3.3.1) and daily-statistics latency by 97% (§3.3.2.3), while reducing code size by 58% (§3.3.2.1) and lowering human intervention to handle problems by 73% (§3.3.2.2).

The second contribution of this chapter is to show that our three key abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming) can support many classes of streaming applications. To demonstrate that our abstractions apply to many use-cases, we use workloads from B-Root DNS analytics. To further support our contribution we show that our abstractions can express many classes of analytics (Chapter 4).

Plumb is open source and available for download from our website.

3.2 System Design

Plumb makes it easy to move data between three abstractions—Block-Streaming, Windowed-Streaming, and Stateful-Streaming—to support a large class of applications. In this section we describe the goals of our system and these abstractions, illustrated with a DNS workflow as a case study that we describe first.

Figure 3.1: The DNS processing pipeline, our case study described in §3.2.1. Intermediate and final data are ovals; computation occurs on each arc. Different stages use one of our abstractions to achieve diverse processing goals. Over time more users have contributed their workflows (compare with Figure 2.1).

3.2.1 Case Study: A DNS Workflow

We use DNS (the domain name system that maps human-readable names to machine-friendly addresses) as a case study to show the need for multiple frameworks in a real workflow. Figure 3.1 shows the workflow, with most stages requiring Block-Streaming, but the two on the right requiring our other two abstractions. Versions of this workflow have been in use since 2015 (for example, Figure 2.1), so it provides a good example of how we manage the workflow with and without multiple abstractions. The main goal of this workflow is to compute daily statistics and archive trace information in two formats, but it has grown to include different kinds of intrusion detection as well. This workflow has evolved over 5 years, and different individuals are responsible for different components.

DNS data from geographically distributed B-Root sites arrive at our processing cluster (A in Figure 3.1). The first few stages of processing need Block-Streaming-based processing to harness the DNS flows' temporal and spatial locality (marked as stages B, C, and F in Figure 3.1). Blocks are processed into rssacint format (C in the figure), a summary with the information needed for daily statistics. We accumulate a 24-hour window of rssacint files to drive daily statistics (from stage C to D). Currently this stage is done with custom code to bridge the Block-Streaming abstraction to MapReduce; we plan to move to Windowed-Streaming, as we evaluate in our prototype in §3.3.2.
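To make the cost of that custom bridging concrete, the following is a minimal sketch of the kind of cron-driven glue such a stage needs today: count the day's rssacint blocks against a historical estimate and, only if enough have arrived, launch the daily reduction. The paths, threshold, and job launcher below are hypothetical stand-ins that illustrate the pattern, not our actual scripts.

# Hypothetical pre-Plumb-style bridge: poll for the day's blocks, compare the
# count against a historical estimate, and launch the daily statistics job.
import glob
import subprocess
import sys

EXPECTED_BLOCKS = 1300      # conservative estimate from historical traffic
DAY = "20181023"            # in practice derived from the current date

def day_probably_complete(day: str) -> bool:
    # Hypothetical on-disk layout for accumulated rssacint blocks.
    blocks = glob.glob(f"/data/rssacint/{day}/*.xz")
    return len(blocks) >= EXPECTED_BLOCKS

if __name__ == "__main__":
    if not day_probably_complete(DAY):
        # Stall: an operator must investigate or adjust the estimate.
        sys.exit("day looks incomplete; not launching daily statistics")
    # Placeholder for the real MapReduce launch command.
    subprocess.run(["./run_daily_rssac_stats.sh", DAY], check=True)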
Applications like network intrusion detection need to run continuously to consume streaming data, keeping long-term state and detecting and alerting about any interesting events. Workflow stages A to E in Figure 3.1 depict one such application, where our Stateful-Streaming abstraction makes it easy to build the application with low delay. Our B-Root DNS infrastructure often comes under malicious attack, and our early studies show that there is often reconnaissance pre-activity before the actual attack [FHQ17]. To allow reaction to security problems, intrusion detection must run as quickly as possible, in near-real-time.

3.2.2 Plumb Goals and Requirements

The primary goal of our work is to enable developers to process data correctly in an efficient and flexible way, applying the best applicable abstraction in each stage of the workflow.

We improve correctness by avoiding ad hoc code that is otherwise required to bridge abstractions. In our pre-Plumb system, we required significant code to confirm that a window is complete before processing when going from Block-Streaming to Windowed-Streaming. With Plumb-managed windowing, system operators can improve algorithms over time without any change in the developer code.

Our second goal is to provide efficient processing by reducing latency. Custom developer code either redundantly processes large amounts of data to achieve lower delay or sacrifices low latency by running windowing algorithms at long intervals. In §3.3.2.3 we show that Plumb reduces a site's latency by 97%.

Ease-of-use is also an explicit goal. Our developers can easily specify their processing needs using one of our abstractions in a YAML job description, and our framework takes care of all the data transport, accumulation, scheduling, and fault details. Developers don't need to employ custom cron jobs to schedule their jobs. In §3.3.2.1 we show that Plumb reduces code by 58%.

Figure 3.2 shows the logical architecture of our system. Developer applications use one or more of our abstractions in their workflow. The three abstractions of Block-Streaming, Windowed-Streaming, and Stateful-Streaming are our approach to achieving the above requirements. In the following we explain our abstractions in detail. Table 3.1 summarizes them as they compare with each other.

Table 3.1: Comparison of three abstractions.

Block-Streaming — unit of consumption: one full block; processing order: none; call-back function: read one or more blocks, write one or more blocks, run to completion in bounded time; I/O size control: no direct control; state: none, except what is emitted in output block(s); canonical use cases: single-pass applications, temporal and spatial locality, block-level parallelism, block-level fault isolation (example: DNS TCP reassembly).

Windowed-Streaming — unit of consumption: one full window of time-ordered blocks; processing order: strict ordering, or out-of-order across windows on request; call-back function: process input windows, write output blocks, run to completion in bounded time; I/O size control: yes, by specifying window size; state: none, except what is emitted in output block(s); canonical use cases: data accumulation, reductions like MapReduce, multi-input ordering (example: DNS daily statistics).

Stateful-Streaming — unit of consumption: never-ending data stream; processing order: oldest data first, or strict monotonic stream order on request; call-back function: read input blocks, write output block(s), application runs continuously; I/O size control: no, but the application can flow-control incoming data; state: yes, long-term state; canonical use cases: lower-latency applications, always-on processing (example: real-time intrusion detection).
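Returning to the YAML job description mentioned above: the fragment below sketches what a single Block-Streaming stage might look like. The field names (input, output, program) are illustrative guesses rather than Plumb's exact YAML schema (Figure 3.3 shows the real specification), but they convey how little a developer must write to add a stage.

# A hypothetical pipeline-graph stage description, parsed here only to show
# its shape; requires PyYAML.
import yaml

stage_spec = yaml.safe_load("""
- input: pcap.sz
  output: pcap.xz
  program: recompress-sz-to-xz.sh   # developer call-back, run once per block
""")

# The framework would schedule this program on every arriving input block.
print(stage_spec[0]["program"])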
3.2.3 Large-Block Streaming Abstraction

The Large-Block Streaming abstraction divides the stream data into fixed-size blocks for processing. Under Block-Streaming, we process each block independently, hence providing good block-level parallelism and fault isolation. As developer call-back functions process these blocks, they generate new large blocks, whose size depends on the amount of data emitted by the application. Developers don't have direct control over the size of the block.

Figure 3.2: System architecture.

Block-Streaming is suitable for applications that need single-pass processing and that leverage spatial or temporal locality. Additionally, Block-Streaming blocks facilitate low-overhead deduplication of computation and storage (as we saw in Chapter 2). The size of the block is domain-specific and configurable as needed.

Plumb streams one large block to an instance of the developer's call-back function. Developer code is expected to process the data and stream out one or more data blocks. Block-Streaming requires the developer call-back function to run to completion in a bounded time. Plumb uses a pre-defined upper run-time limit for Block-Streaming-based developer code.

Figure 3.3: Three abstractions in use in the DNS workflow.

Developers specify their job requirements in a pipeline-graph (see Figure 3.3), and our framework takes care of streaming in, streaming out, and retries on failures. The first stage of the workflow in Figure 3.3 receives captured network data compressed with sz and changes it to the more compact xz format for long-term archival by utilizing spatial locality. The second stage also consumes captured DNS data and does TCP reassembly to generate DNS records for output. This stage utilizes temporal locality to correctly stitch DNS TCP flows, because multiple packets of a TCP flow are nearby in the data.

3.2.4 Windowed-Streaming Abstraction

A large class of analytics requires a particular duration of data—a time window. (As a related problem, some applications want to operate on a very large chunk of data.) With Block-Streaming, processing happens in fixed-size blocks, providing a more uniform unit of work and allowing different blocks to be processed in parallel. By contrast, time-based windows are often aligned with reporting requirements and process varying amounts of work. Windowed-Streaming must bridge the gap between data arriving in discrete blocks, possibly out of order, while providing the developer the simplifying abstraction of a complete window of data.

In addition, developers can specify windowing variations: are windows measured by time or data, how much, what phase, can different windows be processed in parallel, how are boundary conditions handled (data that crosses the window boundary), how long should we wait or how much missing data should we allow before timing out and processing an incomplete window, and how do we recover from errors. Table 3.2 lists these parameters and their defaults. We review these design options below.

To minimize latency while providing simple and clean error handling, Plumb's primary job is to trigger window processing when the window is complete, and to handle error detection and recovery. Since Plumb tracks the arrival of new data to the window, it can check for completeness as soon as new data arrives and can schedule window processing as soon as a window completes.
By contrast, pre-Plumb systems without framework support for windowing periodically probe to check for window completion, with infrequent polling unnecessarily adding latency.

Error handling is more difficult. Developers specify a timeout of hours after the window should be complete, and require that the number of missing blocks be below a threshold or percentage. Plumb requires that applications provide a sequence number and a data start time for each block in a data stream. With these values it can identify that a window may be complete from block start times (the window will be complete when timestamps bracket the window start and end times), and confirm it is complete by ensuring all sequence numbers are present. This information also allows Plumb to handle error conditions: has too long passed after the window should be complete, and if so, is a large-enough percentage of blocks present to process anyway?

Table 3.2: Properties of a window.

size — the size of the window, in time or number of blocks. Mandatory: yes. Allowed values: X hours or Y blocks. Default: N/A.
start — where the window should start; this property must be used along with size. Mandatory: no. Allowed values: YYYY:MM:DD:hh:mm UTC. Default: head of the queue.
start and end — when to start and finish windowing; either size or this property is required. Mandatory: yes. Allowed values: <start time–end time>. Default: N/A.
max wait — maximum wait time, number of missing blocks, or percentage of missing blocks before moving on. Mandatory: no. Allowed values: x hours, y blocks, or z%. Default: 1 hour.
across window ordering — if windows complete out of order, whether they are scheduled in order. Mandatory: no. Allowed values: yes or no. Default: no.
cutoff delivery requirement — for time-based windows, how to deliver a cut block. Mandatory: no. Allowed values: (1) cutoff exactly once to current window, (2) cutoff exactly once to next window, (3) cutoff duplicate delivery. Default: cutoff exactly once to current window.
use queue sequence number instance — make sub-windows based on a developer-provided label in block names, such as sites. Mandatory: no. Allowed values: yes or no. Default: no.
queue sequence number instance list — the developer can provide specific labels for sub-windowing; if not given, the framework learns them from names. Mandatory: no. Allowed values: comma-separated list of text labels. Default: N/A.
framework — for future use: allows organizations to performance-tune wrappers for specific frameworks. Mandatory: no. Allowed values: not implemented yet. Default: N/A.

Formalizing error handling and tracking window completion is a major advance in Plumb compared to our pre-Plumb systems—their ad hoc code judged completeness by counting "enough" blocks per day, based on estimates of typical counts. Such estimates are fragile in the face of operational changes such as sites being closed for maintenance or large increases or decreases in traffic from external changes. Centralizing error handling simplified our implementation, reducing code size by 58% (§3.3.2.1). Most operator interventions in our pre-Plumb system were due to violations of our ad hoc windowing implementation under unexpected operational changes; use of Plumb would have lowered operator interventions by 73% (§3.3.2.2).

Error recovery is necessarily specific to applications. We allow developers to specify a recovery function that lists the blocks that are present and missing. That function can choose to ignore missing data, or can estimate what that data would have said. We are still experimenting with error handling, but being able to make an informed response to missing data is a step forward. Our pre-Plumb system did not do error recovery because it lacked the information to estimate what was missing.
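A minimal sketch of the completeness test just described, assuming each block carries a sequence number and data start time: the window can trigger as soon as timestamps bracket it and no sequence number is missing, or after a timeout if the missing fraction is small enough. The thresholds and structure are illustrative, not Plumb's implementation.

from dataclasses import dataclass

@dataclass
class Block:
    seq: int            # unique, monotonically increasing within the stream
    start_time: float   # data start time, seconds since the epoch

def window_ready(blocks, win_start, win_end, now,
                 max_wait=3600.0, max_missing_frac=0.05):
    inside = [b for b in blocks if win_start <= b.start_time < win_end]
    if not inside:
        return False
    # Timestamps bracket the window once some block starts at or after its end.
    bracketed = any(b.start_time >= win_end for b in blocks)
    seqs = {b.seq for b in inside}
    expected = max(seqs) - min(seqs) + 1
    missing = expected - len(seqs)
    if bracketed and missing == 0:
        return True                          # complete: trigger immediately
    timed_out = now > win_end + max_wait
    return timed_out and (missing / expected) <= max_missing_frac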
Plumb supports parallelism between windows (the default), or allows the developer to process windows sequentially. Sequential processing is important if a window must consider state saved from the prior window, but it requires that window processing be faster than real-time.

With time-based windows, phase and boundary conditions must be considered. The phase of the window is its start time relative to an absolute clock: should the window run every 24 hours based on when it began, or should it run midnight-UTC-to-midnight? When data arrives in blocks, the first and last blocks of the window will usually only partially overlap the window's time frame. By default, the first block in a window starts in its time frame, so each block is processed once but the window may start slightly late. The developer can also request that blocks that straddle window boundaries be delivered to both windows, allowing the application to get a 100% complete window after discarding data outside the time frame. Our pre-Plumb system did not handle boundary blocks.

We have several examples of window-based processing in our workflows. In our DNS case study (§3.2.1), RSSAC statistics concern a midnight-UTC-to-midnight window. In another deployment (pre-dating Plumb) we processed 24- or 12-hour windows of packets to convert to netflow. Conversion to Plumb has reduced the code developers must maintain, reduced latency by 74% (§3.3.1), and lowered operator intervention to handle problems by 73% (§3.3.2.2).

In summary, Windowed-Streaming enables developers to write correct, low-latency applications with ease.

3.2.5 The Stateful-Streaming Abstraction

The Stateful-Streaming abstraction is suitable for those applications that need long-term state and a long-running process for near-real-time consumption of streaming data. Intrusion detection is an example application that benefits from this abstraction, as are other systems that often capture data directly from the network. Plumb's Stateful-Streaming enables these applications to share a single network tap.

Design choices include handling data ordering and missing data, parallelization, and fault tolerance and crash recovery of the long-running process. We discuss them one by one.

Plumb streams data in order to the developer's code, reordering blocks that arrive out of order. However, missing or very late blocks require other choices—should Plumb ignore any absent data and move on, or should Plumb wait, and how long? The developer can specify how long to wait for missing data, and what action to take to handle a timeout. Missing data means different things for different applications—a financial application might not be willing to lose any data, while an application counting a long-term average might be willing to let some data go.

An application's ability to process data in time depends on the stream arrival rate and the speed of the underlying hardware and software. While a single-instance application is easier to write and manage, at some point aggregate data may exceed the capacity of a single processing instance. Some streaming data tools (for example, Zeek) include support for parallel operation. Alternatively, Plumb can hash traffic into separate partitions to support parallelism across multiple instances of the data stream. In this case, hashing would be a Plumb workflow stage that sends data to several separate queues (each with Stateful-Streaming), and Plumb will handle parallelism and fault tolerance.
We currently support single-instance operation and plan support for these forms of parallelism.

Finally, we need to plan for failure of even long-running processes. In some cases, a Stateful-Streaming worker can restart, but in other cases it may need to retain and restore state. If the application is prepared to periodically checkpoint its state, but requires the replay of recent data to recover, Plumb can assist by retaining and replaying recent data (similar to Kafka [KNR11]).

As an example of Stateful-Streaming in our case study, we pass all traffic through Zeek [Pax]. Normally Zeek reads directly from the network, but we use Plumb to multiplex data collected for our existing workflow across Zeek. (Zeek runs indefinitely as if it was capturing directly from the network.) Zeek output can be fed back into Plumb as an output queue, or logged independently.

In summary, Stateful-Streaming allows applications to efficiently process continuously arriving data. We evaluate our design choices in §3.3.3 by using the network intrusion detection system Zeek on our DNS workflow (the last stage in Figure 3.3).

3.2.6 Fault Tolerance

Failures happen for a myriad of reasons, so automatic masking and graceful tolerance of these failures are important to make a system robust and easy-to-use. Inside our system, we can categorize failures into two classes—compute failures and data failures.

3.2.6.1 Compute Failures

Compute failures happen when a stage fails unexpectedly, probably due to some hardware or software problem. We use retries like MapReduce [DG04] (most likely on a different server in the cluster) after some time to recover from intermittent problems such as network or storage failures. Since stage execution is idempotent, possible duplicate retries do not cause problems other than wasting a small amount of capacity. We cannot do automatic recovery for state-bearing stages, but we assist the developer in doing custom recovery by saving and restoring checkpoints. Our system allows stages to store their state inside our meta-data store or use an external state-management system while keeping a pointer to the external state inside Plumb (see Table A.8). If a stage fails even after multiple retries on specific data, we mark the data block as faulty and request a manual inspection. There is a global value of four for the retries (the same as in Apache Hadoop [VMD+13]), and we plan to allow developers to provide their own retry value in the future. Our system schedules and executes each stage separately (§2.2.7), and hence the failure of one stage does not impact other stages.

3.2.6.2 Data Failures

Data failures happen when some blocks come in late or never arrive. On an asynchronous network like the Internet, such data faults are common, and we need to manage them for our three abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming). Our system schedules stages using the Block-Streaming abstraction out of order, as soon as data becomes available. Many programs might ignore late or missing data when using per-block processing, but developers can detect it if there is such a need. Our Windowed-Streaming abstraction highlights the tradeoff between data completeness and latency and enables developers to manage both of them. Developers can specify their data completeness and latency requirements using provisions for scheduling when partial data is available and with bounded waits.
Our Stateful-Streaming abstraction enables developers to detect missing data and allows developer-controlled wait times. This behavior differs from our Windowed-Streaming because applications using the Stateful-Streaming abstraction have stringent latency requirements. Next, we discuss how our three abstractions manage data faults.

Our Block-Streaming abstraction provides out-of-order block execution and, as a consequence, does not manage data faults directly. The Block-Streaming abstraction is unaffected by missing blocks because developers are willing to process blocks out of order. Stages using Block-Streaming might ignore missing data to achieve high parallelism and low latency now and deal with data faults in a later stage. Some applications (using Block-Streaming) might track block completeness using the unique and monotonically increasing sequence numbers of each block. Many of our B-Root DNS workflow stages use the Block-Streaming abstraction (see Figure 3.1). Fault handling in Block-Streaming differs from the other two abstractions (Windowed-Streaming and Stateful-Streaming), where late or missing blocks can add latency. Applications that use our Block-Streaming abstraction focus on one block, and any missing or delayed blocks might become evident in a later stage, possibly one using our other two abstractions.

For the Windowed-Streaming abstraction, developers need to process complete data in a window and they also need low latency. Data completeness and latency have a tradeoff between them—waiting longer might help get a complete window, but at the expense of added latency. We enable developers to choose a tradeoff between data completeness and latency for their windows and allow developers to set parameters to run windows with partial data. Developers can specify missing data either as an absolute value or as a percentage of data in a window. Developers can bound their wait time using a timeout value. To facilitate data completeness, each block has a unique identity in our system so that we can keep a tally of which blocks of a window are present and which are missing. As soon as all the blocks of a window are present, our system schedules the window for processing. A developer can specify the maximum wait time, maximum number of missing blocks, or percentage of missing blocks allowed to trigger a window for processing. To dial the tradeoff between data completeness and latency, we provide the max wait argument (Table 3.2) so that our system can automatically manage missing or late blocks in a window. Developers can control latency or completeness using the above-mentioned parameters. If a window is unable to trigger for processing (for example, because a developer requested availability of all data and something is missing), we inform the related developer of such cases and provide tools to interact with the window for a manual directive (for example, process with whatever is available).

For the Stateful-Streaming abstraction, developers need an in-order stream. We divide the stream into blocks (§2.2.2) and assign each block a unique, monotonically increasing sequence number so that our system knows the order of the stream and any missing blocks. Our system provides an API (Table A.8) so that stage programs can detect gaps in the in-order stream, and developers can directly manage their wait time. Stages invoke our API after consuming each block to learn whether the next block is available, as sketched below.
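The following is a minimal sketch of that consume-and-check loop. PlumbStream and its methods are hypothetical stand-ins for the real calls listed in Table A.8; only the pattern—consume a block, ask for the in-order successor, then bounded-wait or move on—follows the text.

import time

class PlumbStream:
    """Hypothetical client for an in-order Stateful-Streaming queue."""
    def next_block_available(self) -> bool: ...
    def read_next_block(self) -> bytes: ...

def consume(stream: PlumbStream, process, max_wait_s: float = 60.0) -> None:
    while True:
        waited = 0.0
        # Bounded wait for the next in-order block.
        while not stream.next_block_available() and waited < max_wait_s:
            time.sleep(1.0)
            waited += 1.0
        if stream.next_block_available():
            process(stream.read_next_block())   # long-running, state-bearing app
        else:
            # The in-order successor never arrived: the application decides
            # whether to keep waiting, skip the gap, or raise an alert.
            break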
If the immediate next block is not available, the stage can either do a bounded wait or move on. We take a different approach here compared to Windowed-Streaming because in-order streaming usually is more sensitive to latency, and latency needs can change dynamically.

Stream rates can change dynamically over time for many reasons, such as diurnal traffic patterns or denial-of-service attacks. Cluster capacity is often provisioned based on average loads (and not on peak loads) for economical use of resources, but an unexpected surge of data can overload a stream-processing system. We buffer data on stable storage to hold any excess data until resources free up in the cluster. Traffic surges are an intermittent phenomenon, and our system clears any backlog during relatively quiet periods. Our primary goal is to process all data (§2.2.1), and occasionally, due to excessive buffering, average latency can increase. We evaluate the data completeness and latency tradeoff in §3.3.2.2, and show that our system provides consistently low latency under faults in §3.3.2.3.

3.3 Evaluation

We use applications from our DNS workflow to evaluate our system. This workflow (shown in Figure 3.1) shows all three Plumb abstractions: it uses large-block streaming for statistics generation and archiving (nine stages on the left of the figure). It uses Windowed-Streaming to accumulate a 24-hour window of data that is then processed with map/reduce. Finally, it uses Stateful-Streaming to send a virtual stream of all data to Zeek, a standard logging and intrusion-detection tool.

3.3.1 Block-Streaming Enables Low Latency Processing

To evaluate Block-Streaming's latency, we use three data items from our B-Root DNS workflow (Figure 3.1). We construct DNS flows from blocks and compress this data for later use as message_question.fsdb.xz (stage A to message_question.fsdb to message_question.fsdb.xz in Figure 3.1). We change block compression for compact storage (stage A to pcap.xz). We decompress and decode DNS (stages A and B), then compute intermediate statistics (stage B to C). All stages involved in the above three data items use Block-Streaming. Our latency results are in Figure 3.4, Figure 3.5, and Figure 3.6. We also summarize latency percentiles in Table 3.3.

Figure 3.4: Latency: Plumb vs. pre-Plumb processing, as each block is processed by three stages to extract DNS flows from packets and recompress them as xz for future use (CDF of message_question.fsdb.xz per-block processing time, in seconds). Delay includes queuing and processing time.

We will discuss the rssacint data (Figure 3.6) in detail, but a similar analysis applies to the other two. We evaluate latency from stage A to stage C (via stage B), comparing our pre-Plumb, batch-based system with Plumb using Block-Streaming. We evaluate one day's worth of real data (2018-10-23, 1393 files, each 2 GB when uncompressed). Figure 3.6 shows the distribution of latency for this workflow over all 1393 files. Current Plumb latency (the left, green line in Figure 3.6) is much lower, with a median latency of 695 s instead of 2724 s in our prior system. In addition, Plumb latency is much more consistent, with a standard deviation of 50 s instead of 614 s (compare the narrow range of Plumb against the wide range of pre-Plumb). Plumb latency is lower because it processes blocks as soon as they arrive, while our pre-Plumb system grouped blocks into batches to use Hadoop-based parallelism.
Figure 3.5: Latency: Plumb vs. pre-Plumb processing, as each block is processed by two stages to change data compression for long-term archival (CDF of pcap.xz per-block processing time, in seconds). Delay includes queuing and processing time.

Table 3.3: Comparison of pre-Plumb and Plumb latency (in seconds) at the 50th, 90th, and 99th percentiles for Block-Streaming-generated data: message_question.fsdb.xz, pcap.xz, and rssacint processing.

message_question.fsdb.xz — 50%ile: 3315 (pre-Plumb) vs. 1302 (Plumb); 90%ile: 4137 vs. 1373; 99%ile: 4329 vs. 1415.
pcap.xz — 50%ile: 2524 vs. 1385; 90%ile: 3375 vs. 1448; 99%ile: 3573 vs. 1493.
rssacint — 50%ile: 2724 vs. 695; 90%ile: 3580 vs. 764; 99%ile: 3779 vs. 782.

Plumb latency is good, approaching the theoretical minimum of 388 s from stage A to C without any queuing delays, when run on an idle server. (But there is some room for optimization.) Plumb's delays are due either to queuing (all compute resources currently busy) or to a thundering-herd phenomenon when there are many readers and writers on its HDFS shared file system with 3-way replication (see Chapter 2 for more details).

Figure 3.6: Latency: Plumb vs. pre-Plumb processing, as each block is processed by three stages to get statistics (CDF of rssacint per-block processing time, in seconds). Delay includes queuing and processing time. This graph is from our earlier work in Chapter 2, when only Block-Streaming was available in Plumb; we re-report it for completeness.

Our Block-Streaming abstraction provides low latency for all the stages of the workflow that use this abstraction. Table 3.3 shows that all three data-item latencies under Plumb are consistently lower than pre-Plumb (up to 80% lower at the 99th percentile). Additionally, Plumb's improvement is larger at the tail than at the median (for example, rssacint latency is 80% lower than pre-Plumb at the 99th percentile and 74% lower at the 50th percentile; see the third row of Table 3.3). Hence the Block-Streaming abstraction consistently provides lower latency than pre-Plumb.

3.3.2 Windowed-Streaming Enables Efficient Reduction

We next show ease-of-use, correctness, and low latency of analytics via Windowed-Streaming and compare the results with a pre-Plumb system.

3.3.2.1 Windowed-Streaming is Easy-to-use

One of our design goals is that our system be easy-to-use, so we next compare a pre-Plumb solution that bridges Block-Streaming and windowing with Plumb. For this experiment we compare the lines of code to do 24-hour RSSAC statistics [Teaa] in our pre-Plumb, hand-coded windowing mechanism to that with Plumb. We use the pygount tool [Agl] to count the lines of code of the bash and Python scripts.
Statistics code is smaller by 11% because Plumb automates forwarding output results to downstream in the workflow and archiving the output on disk in the final stage. Two lines of scheduling code (a cron job) are eliminated because Plumb auto- mates scheduling across the cluster. Avoiding cron-based polling also reduces latency, as we show in §3.3.2.3. The main new code is the Plumb specification and windowing requirements, 10 lines of YAML (Figure 3.3). In summary, Windowed-Streaming is easy-to-use and enables developers to ooad data completion and scheduling responsibilities to Plumb while concentrating on their analytics. 76 Developer Code lines of code Reduction pre-Plumb Plumb Window Creation 163 28 83% Statistics 66 59 11% Scheduling 2 0 100% Job Specification 0 10 (-100%) Table 3.4: Comparison of number of lines-of-code in pre-Plumb RSSAC (with devel- oper custom windowing) and Plumb RSSAC (with Windowed-Streaming abstraction). 3.3.2.2 Windowed-Streaming Improves the Correctness-Latency Tradeo Analytics should consider all data that is collected. Yet in real-world data collection sys- tems, data can be late or out-of-order, or even missing, as it is captured and relayed to the data warehouse and components or networks fail and disks fill. The developer must choose between waiting for perfect data (which may never come if there was a failure), or processing data after a reasonable time to allow for retries and perhaps human inter- vention. This tradeo between latency, correctness, and degree of manual intervention is common to most distributed systems and something for which Plumb should allow control. We next evaluate Plumb’s correctness and the degree of manual intervention when compared our pre-Plumb system with a custom-designed but ad hoc algorithm to esti- mate completeness, and in the next section (§3.3.2.3) we will see how it aects latency. Table 3.5 summarizes the tradeo between data completeness and latency between pre-Plumb and Plumb algorithms. We use real-world B-Root data taken from 2019-08- 19 to 2020-10-10 (14 months). Over this time B-Root expanded, from 3 sites (LAX, MIA and ARI starting 2020-08-19, adding AMS on 2020-01-20 and IAD and SIN on 2020-01-22). This data reflects typical operational issues that arose over that time, including occasional site network failures and shutdown for planned maintenance. Each 77 Pre-Plumb RSSAC Plumb RSSAC Total Site Missed Stalls Missed Stalls Days (early trig.) (1 wt.) (1 h wt.) (100%) LAX 0 7 (1.7%) 0 13 (3.1%) 2 (0.5%) 416 MIA 0 8 (1.9%) 0 19 (4.6%) 4 (1.0%) 416 ARI 1 (0.2%) 25 (6.0%) 0 45 (10.8%) 1 (0.2%) 416 AMS 1 (0.4%) 16 (6.1%) 0 76 (28.8%) 4 (1.5%) 264 IAD 0 1 (0.4%) 0 22 (8.5%) 3 (1.2%) 260 SIN 0 1 (0.4%) 0 33 (12.7%) 2 (0.8%) 260 all 2 (0.1%) 58 (2.8%) 0 208 (10.2%) 16 (0.8%) 2032 Table 3.5: Comparing pre-Plumb and Plumb data stalls and number of missed data events. over 14 months. site computes separate rssacint intermediate statistics, these are then later combined on another system. The input gives data generation time and the time it is delivered to our analytics cluster. We replay this data with these times through two systems: pre-Plumb and Plumb to compare latency. Our pre-Plumb system uses our custom-built but ad-hoc algorithm to estimate completeness. These algorithms use conservative estimates from historical data to estimate how many blocks each site should generate each day to detect data relay stalls. 
If these checks fail, they alert the operator and block progress until the operator can investigate and confirm or correct any problem. Our Plumb-based system allows the operator to select the tradeoff between completeness and delay in the Plumb specification, and allows Plumb to infer completeness based on file times and sequence numbers. Plumb checks are therefore much more accurate and robust to changes in site load over time.

We first evaluate correctness in Table 3.5, then later consider latency in Table 3.6. Pre-Plumb rarely misses available data (column 2 in Table 3.5) because it adopts a very conservative algorithm. We compare to Plumb with two settings: with infinite wait, it never misses data (but requires operator intervention on a stall), and with a 1 h threshold we see it misses data about 10% of the time. Here both have similar behavior when correct, but for different reasons—pre-Plumb's conservative wait time coupled with the long-term traffic rate mostly works, while Plumb relies on the monotonically increasing sequence numbers of the underlying blocks for completion checks. With an aggressive threshold, Plumb can be configured to meet a strict (real-time) deadline at the cost of incomplete results. In some cases (if timely results are more important than perfect ones) this choice may be preferred.

Pre-Plumb stalls falsely many times (58 events, 2.8% of the time, column 3 in Table 3.5). These stalls occur because traffic changes, usually due to intentional engineering, making prior traffic estimates invalid. Each of these events requires an operator intervention to reset the thresholds. By contrast, Plumb stalls 16 times, always because some data file was blocked due to an error or network problem.

Finally, we see that pre-Plumb's ad hoc algorithm silently misses some actual stalls. Plumb reports 3 and 2 for sites IAD and SIN, while pre-Plumb stalled only once for each, missing 3 events. Pre-Plumb's traffic-estimation algorithm cannot detect single-file losses, while Plumb's check of sequence numbers is complete. Although missing one file has only a small effect on statistics, detecting errors is important.

In summary, Plumb only produces true stalls for developer intervention, while allowing developers to pick a suitable wait-timeout value—hence allowing a developer-guided tradeoff between completeness and latency.

3.3.2.3 Windowed-Streaming Enables Low Latency Processing

Our second design goal is low latency, so we next evaluate latency using the same pre-Plumb and Plumb (with Windowed-Streaming) implementations as we did in §3.3.2.2, again comparing RSSAC statistics for the six B-Root sites.

Figures 3.7–3.10 (CDFs of RSSAC trigger latency, in hours, Plumb vs. pre-Plumb):
Figure 3.7: Latency comparison at LAX site.
Figure 3.8: Latency comparison at newer AMS site.
Figure 3.9: Latency comparison at MIA site.
Figure 3.10: Latency comparison at newer SIN site.
Figure 3.7 shows the CDF of latency for RSSAC output at LAX with Plumb (the left, green line) and pre-Plumb (the right, red line) over 14 months. Figure 3.8 shows similar results for AMS for nearly 9 months. Table 3.6 summarizes median and tail latency for all six sites.

At all sites, Plumb has dramatically lower latency. At LAX, median and 90%ile latency are only 3% of their prior values (to 0.29 hours from 9.72 at the 50%ile, and 0.41 from 12.72 at the 90%ile). The tail for both systems is long, but 99%ile latency is 33% of its prior value (the left curve compared to the right curve in Figure 3.7, and the 99%ile columns in Table 3.6). The LAX site behaves with Plumb as we hope, usually processing data within the hour it becomes available and only occasionally requiring longer.

Figure 3.11: Latency comparison at IAD site.
Figure 3.12: Latency comparison at ARI site.

Table 3.6: Comparison of pre-Plumb and Plumb latency (in hours) at the 50th, 90th, and 99th percentiles for RSSAC processing.

LAX — 50%ile: 9.72 (pre-Plumb) vs. 0.29 (Plumb); 90%ile: 12.72 vs. 0.41; 99%ile: 53.91 vs. 7.32.
MIA — 50%ile: 9.75 vs. 0.37; 90%ile: 13.07 vs. 0.52; 99%ile: 89.64 vs. 26.64.
ARI — 50%ile: 15.04 vs. 0.41; 90%ile: 17.58 vs. 1.26; 99%ile: 666.04 vs. 169.36.
AMS — 50%ile: 10.76 vs. 0.43; 90%ile: 34.47 vs. 18.68; 99%ile: 90.14 vs. 91.00.
IAD — 50%ile: 10.44 vs. 0.79; 90%ile: 13.95 vs. 0.81; 99%ile: 39.17 vs. 12.22.
SIN — 50%ile: 10.5 vs. 0.70; 90%ile: 15.1 vs. 1.98; 99%ile: 39.18 vs. 30.32.

AMS and some other sites show much longer tail latency, with AMS 99%ile latency of nearly four days. These delays represent a few days spent back-hauling data for this site when it was coming on-line. Finally, we see very long latency for the ARI site: 99%ile latency is 169 hours (more than a week). There were events when ARI had traffic engineering that dropped its data rate to zero. Correcting this problem brought ARI latency back to our goal of less than one hour of latency per day.

Finally, median latencies of the older sites (LAX, MIA) are lower than those of the newer sites (for example, 0.29 hours for LAX versus 0.43 hours for AMS) because, with the addition of newer sites, we are getting more data. Our publicly available RSSAC statistics [Teaa] show that traffic increased by about 50% with the addition of the three newest sites, while our analytics cluster's processing capacity did not change during this time, so latencies rose correspondingly.

The tail latency in the graphs reflects our conservative choice of requiring all data (or operator intervention) before processing statistics. If we wished to guarantee processing each day, even with incomplete data, Plumb makes it easy to set a 1-hour timeout and process whatever data is available. Plumb's infinite and 1-hour columns of Table 3.5 show the implication: about 10% of cases would run with incomplete data, a choice that might be preferred to stalling if approximate results were required by operational needs.

In summary, we see that RSSAC processing based on our Windowed-Streaming improves correctness and makes it easy to provide fixed-latency results when required. More importantly, it relieves the developers and operators of ad hoc, error-prone windowing algorithms and makes it easy (a configuration parameter) to balance correctness and latency.
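That configuration parameter can be a single line. The fragment below contrasts the two policies evaluated above—wait for all data versus run after one hour—with field names modeled on Table 3.2; Plumb's exact YAML keys may differ.

# Two hypothetical window settings for the daily RSSAC reduction; requires PyYAML.
import yaml

strict = yaml.safe_load("""
size: 24 hours
# no max_wait: wait for all blocks, so a true stall needs operator attention
""")

bounded = yaml.safe_load("""
size: 24 hours
max_wait: 1 hour   # after one hour, run with whatever data has arrived
""")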
3.3.3 Stateful-Streaming Enables Continuous Applications

Block-Streaming works well for processing that requires little state, and windowing supports data with predictable amounts of state, but some applications expect continuous state. Intrusion detection systems and long-term logging tools like Zeek fall into this third category, so we next consider how Plumb can drive Zeek with Stateful-Streaming.

As we described in §3.2.5, any Stateful-Streaming application will be limited by the ability of one compute node to keep up. What that exact rate is depends on the workload and the hardware doing the processing. As one example, we evaluate data from one B-Root site over 24 hours starting on 2020-11-27, considering about 1 TB of data (538 blocks, each 2 GB before compression). Figure 3.13 shows the distribution of block arrival and Zeek processing times to see how well it keeps up with real-time on our hardware.

Figure 3.13: Comparison of LAX inter-arrival time vs. inter-departure time (CDF, in minutes).

For our hardware and this application, we see that data arrives every three minutes (median) and is processed in 9.7 minutes (median). Only about 3% of blocks are handled faster than real-time—our hardware cannot keep up with this workload. To process this workload will require either Plumb-level flow-based hashing, use of cluster-mode Zeek where it does the flow hashing, or the addition of windowing to allow different days to be processed concurrently (without retaining state between days).

Either Plumb- or Zeek-based flow hashing would support this workload with 4-way parallelism, since processing time is consistently between 8 and 10 minutes. Processing time is so consistent because input data is fixed size (2 GB per block). Our initial prototype, which uses two developer-level stages—the first for hashing and splitting [Sel] and the second for Zeek processing—confirms (see Figure 3.14) that four instances can keep up with the LAX real-time traffic.

Figure 3.14: 4-way connection hashing with 4 Zeek processes is enough to keep up with the LAX real-time traffic rate (time to process, in minutes, versus number of hash buckets of 5-tuple connections, compared against the median inter-arrival time of input blocks).

In summary, Stateful-Streaming was able to support our state-bearing application, though we found that we need 4-way parallelism to keep up with the traffic rate on our hardware. This need for processing parallelism suggests we should provide Plumb-level flow splitting as a primitive, and allow Plumb jobs to schedule multiple cores to support cluster-mode Zeek. We are able to achieve parallelism by using an additional hashing-and-splitting stage before the Zeek queues. We are extending Plumb for better support for hashing/splitting input.
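A minimal sketch of what that hashing-and-splitting stage computes: a stable hash of the connection 5-tuple assigns each flow to one of N output queues, so N Zeek instances can each see complete connections. The function and bucket count below are illustrative, not the prototype's actual code.

import zlib

NUM_BUCKETS = 4   # 4-way parallelism keeps up with LAX traffic (Figure 3.14)

def bucket_for_flow(src_ip: str, src_port: int,
                    dst_ip: str, dst_port: int, proto: int) -> int:
    # Sort the endpoints so both directions of a connection hash identically.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a}-{b}-{proto}".encode()
    return zlib.crc32(key) % NUM_BUCKETS

# Both directions of one TCP connection land in the same Zeek queue.
print(bucket_for_flow("192.0.2.1", 53, "198.51.100.7", 41952, 6))
print(bucket_for_flow("198.51.100.7", 41952, "192.0.2.1", 53, 6))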
Figure 3.15: Pre-Plumb cluster resource overload: pcap.sz queue depletion over time (top graph), load on servers that have eight cores per server (middle graph), and free memory over time on servers that have 16 GB RAM per server (bottom graph). Total vCores: 120; max used vCores: 75.

Figure 3.16: Plumb cluster resource utilization: pcap.sz queue depletion over time (top graph), load on servers that have 16 cores per server (middle graph), and free memory over time on servers that have 48 GB RAM per server (bottom graph). Total vCores: 96; max used vCores: 96.

3.3.3.1 Deployment

We next show that our system is more efficient than the prior system and has been running stably in a production environment for more than three years. We measure how Plumb (see Figure 3.16) uses cluster resources as compared to pre-Plumb (see Figure 3.15). The main input data queue (for the pcap.sz data) is shorter with Plumb—an indication of lower latency. There are fewer load spikes, and cluster load conditions are stable for the Plumb system. Load is also more or less equally distributed with Plumb as compared to the old system.

3.4 Conclusions

Plumb makes it easy for developers to access multiple abstractions in their big-data workflows, simplifying code, lowering latency, and improving correctness. To do so, it bridges data streams between three different abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming), while managing error handling and allowing developers to dial the tradeoffs between latency and completeness. These abstractions optimize for different goals (Block-Streaming for high parallelism when some context is needed, Windowed-Streaming for time-dependent processing, and Stateful-Streaming for long-running applications with state). Versions of Plumb have been in production use at B-Root since November 2018 and have already reduced latency by 74% (blocks) or 97% (daily statistics), while reducing code size by 58% and improving correctness (resulting in 73% less manual intervention). Plumb is open source and available for use.

This chapter proves part of our thesis statement—a new method enables efficient processing of multi-abstraction workflows of streaming data. Our new method enables developers to move between three key abstractions efficiently. We use real workloads from our DNS processing pipeline and compare our hand-tuned pre-Plumb system with Plumb to demonstrate Plumb's higher efficiency. This chapter also demonstrates that our three abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming) can express different classes of analytics.

We used the B-Root service's DNS workloads and related processing applications for our demonstration of higher efficiency in Chapter 2 and here (Chapter 3). In Chapter 4 we argue that our new methods are applicable to other use-cases as well.

Chapter 4
Beyond DNS: The Generality of Plumb Optimizations

DNS workflows and data from the B-Root service are our primary case study (§2.2.2 and §3.2.1) to demonstrate the efficiency of our new methods. In this chapter we argue that our new methods apply to large classes of streaming applications. We classify stream processing (§4.1) based on multiple dimensions—business domain, smallest unit of input consumption, and processing goals. We present example applications (§4.2) for all of the above dimensions to show the applicability of our methods.

This chapter supports our thesis statement (§1.1) by showing that our new methods cover a large spectrum of stream processing. In this chapter we discuss how developers might use our new methods to efficiently process their data from different business domains, for variable-sized input consumption, and for broad efficiency goals. We use arguments and analogy to our primary case study as ways to show applicability.

4.1 Introduction

Streaming applications can be divided along a three-dimensional space—business use-case or business domain, input data grouping, and system goals. First we describe these dimensions, and then present example applications that collectively cover part of the streaming space.
Figure 4.1 represents the design space of analytics along three axes. The black axis is the smallest unit of consumption (from single records to blocks to multi-block windows or a continuous stream). The red axis is the processing goals (from hard-real-time to delay-tolerant, low correctness to high correctness, and low throughput to high throughput). The green axis is the business use cases (such as stock-exchange order processing, DNS processing, and search-engine pipelines); it shows the business application domain by category.

Plumb is applicable in most of the space, except the plane where the red axis is at hard-real-time. Plumb covers the black axis of input size because developers can use data from small records to continuous streams. Plumb does not support hard-real-time latency goals on the red axis of processing goals. Plumb lets its developers dial a tradeoff between correctness and latency, and Plumb makes compute-cluster utilization higher by removing inefficiencies. Plumb supports many applications such as search-engine pipelines, log-collector analytics, DNS processing, and clickstream analytics as a consequence of its goals and input sizes. Hard-real-time applications such as processing buy/sell orders at a stock exchange are beyond Plumb's scope. Soft-real-time applications such as fraud detection may use Plumb.

All modern businesses connect to a complex web of supply chains and customers—generating many types of data streams such as click streams, credit-card transactions, service telemetry data, and many more. Plumb's processing methods can apply to all such streams because stream data is opaque to Plumb and it allows arbitrary user code for processing. Plumb can support most applications in the space of analytics because Plumb supports all units of data consumption from small to continuous, and supports many processing goals except hard-real-time (Figure 4.1). Applications will need to comply with Plumb's APIs. Many business data streams that conform to Plumb's naming might use our data and processing deduplication. Many applications with niche code can easily become part of a Plumb workflow stage.

Figure 4.1: Design dimensions in analytics (unit of consumption, processing goals, business use-cases).

Data streams arrive at analytics in varying units—from small records to large blocks (see the black axis in Figure 4.1). Often the reason for small records is latency: there is not enough time to accumulate the stream into large blocks. Plumb is not a suitable framework when the incoming stream arrives as small records, because Plumb internally stores the stream as large blocks (§2.2.2) to support low-overhead deduplication.

Different business verticals can have different processing needs (see the green axis in Figure 4.1). Workflows for the financial sector put correctness and data completeness in front of all other goals, while some others (such as "like" counters) might allow some loss. Similarly, some applications might have very stringent latency requirements while others might be delay-tolerant. Plumb's abstractions allow developers to dial a tradeoff between correctness and latency, and long-running Plumb applications provide lower delay than others. Plumb can support applications with varying latency goals, either directly or by allowing users to run a low-latency framework (for example, Apache Spark) at a stage.
However, applications with hard-real-time latency goals (for example, buy and sell orders arriving at a stock exchange may have nanosecond latency goals) cannot be processed using Plumb.

Analytics need to meet goals such as latency, throughput, and correctness (see the red axis in Figure 4.1). Often these goals conflict with each other, and one might need to trade one off to achieve another. As an example, developers might need to trade off absolute correctness to achieve low latency when data is missing. Often utilization of the system increases to achieve high throughput, but the increased offered load can increase response time. Plumb enables its developers to dial a tradeoff between correctness and latency (§3.3.2.2), and to prioritize higher throughput over the lowest possible latency (§2.3.4).

In summary, Plumb's new processing methods apply to many parts of the streaming-application space (Figure 4.1). Our division of streaming analytics based on use case, input size, and processing goals shows that Plumb can support all those applications that do not have hard real-time constraints. We expressed DNS workflows using Plumb and presented a detailed evaluation in this work. We classify example applications using our three-dimensional space (Table 4.1) and show that Plumb can support most of them.

Table 4.1: Example application coverage of Plumb: Plumb can support applications with latency requirements from tens of seconds onward. The axis colors refer to Figure 4.1.

Search engine — black axis: large blocks to continuous; red axis: minutes to hours, high correctness, high throughput; Plumb coverage: yes.
Log collector — large blocks to continuous; minutes to hours, high correctness, high throughput; yes.
DNS processing — large blocks to continuous; minutes to hours, high correctness, high throughput; yes.
Click stream — small records; minutes, high correctness, high throughput; yes.
Real-time fraud detection — small records; seconds, high correctness, low throughput; no.
Stock exchange — small records; nanoseconds, high correctness, high throughput; no.
DNS malicious activity detection — large blocks to continuous; minutes, high correctness, high throughput; yes.
Network flow anonymization and monitoring — large blocks to windows; hours, high correctness, high throughput; yes.
Merging high-speed network traffic — large blocks to windows; minutes to hours, high correctness, high throughput; yes.
Internet outage detection — small records to continuous; seconds to minutes, high correctness, high throughput; partially (minutes).

4.2 Application Catalog

In addition to our operational DNS pipeline, Plumb is ideal for other applications. The following applications benefit from our pipeline-graph and our three processing abstractions of Block-Streaming, Windowed-Streaming, and Stateful-Streaming. These applications are our data points in the three-dimensional streaming space.

4.2.1 Early Detection of Malicious Activity

DNS backscatter can detect scanning and other malicious activity [FHQ17]. The input to this workflow is DNS data, and the final output is a list of IPs labeled as malicious (spam, scan, etc.) or benign (CDN, research scan, etc.) activity. The longitudinal studies in that paper employed largely ad hoc processing with GNU parallel on specific computers, a process that was difficult to implement, slow to run, and often interfered with other users. We are currently adapting this analysis to use Plumb, which will be easier, faster, and will automatically share resources over a full cluster.
4.2.2 Flow Analysis

Several sites at Colorado State University capture packets and process them into anonymized flows. The input to this workflow is Argus flows, while the final output is monitoring alerts. Converting this post-processing to Plumb will support higher traffic rates. The current system is hand-tuned (hence challenging to evolve) and suffers from throughput issues.

4.2.3 Merging Bidirectional Traffic

Network data capture on high-speed optical networks often must be done separately for each direction (incoming and outgoing). We want to knit the independent captures together after capture to provide a unified view at very high bit rates. The inputs to this workflow are two TCP flows, one for incoming traffic and the other for outgoing traffic. The output is a single bidirectional TCP flow. Note that, after this stage, our current pipeline workflow can be used as usual (without this stage, each subsequent stage's code would need to change to accommodate two separate traffic flows).

4.2.4 Annual Day-In-The-Life of the Internet (DITL) Collection

DITL [DNS] is an annual data collection event where many service operators come together to collect their service data and share it publicly to facilitate research and development and to understand evolving trends. B-Root participates in this event regularly. The initial workflow stages of DITL processing work well with our per-block processing abstraction, where we decompress data, consistently anonymize it for privacy preservation, and re-compress it for relay. Windowed-Streaming helps identify and support processing over just the DITL time window, and makes the processing simple.

4.2.5 Shared Datasets Inside a Cloud

Plumb can also apply to data processing in the cloud, particularly when multiple groups share parts of the same workflow. These days all major cloud providers share massive datasets for public use, and many internal and external projects utilize this data. A large cloud provider has internal data flowing between different projects (loosely coupled groups), and they, too, end up shuffling big files around every 5 minutes, as we do for B-Root. Their files are cut at fixed times rather than fixed durations, but they share all the same issues of duplication, skew, and so on. Our system applies to such use cases as well.

4.2.6 Plumb Enables Easy Analytics Evolution

Analytics need to adapt to an evolving world, but one hurdle is to develop and test new code against production data concurrently and without interruption. The pipeline-graph provides a simple way to achieve this goal. By using the same input(s) as the old application but changing the name of the output(s) (to avoid de-duplication by Plumb), a new application stage receives production data, processes it, and can compare its results with the old stage's.

4.2.7 Plumb Provides Context for Collaboration

In a large workflow with many teams, it becomes cumbersome for new users to initiate collaboration. An example of such collaboration is when a new user observes that slightly changing an existing analytics stage can meet his or her needs as well (as compared to creating a new stage). Fsdb [Hei] is an example of a processing tool that uses column names for processing instead of column positions, and hence suits use cases where a new user asks an earlier user to generate some additional columns to meet his or her needs. Plumb provides an annotated representation of the cumulative pipeline-graph so that the new user has proper context about the currently available processing stages and their owners.
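The following toy example (plain Python, not Fsdb itself) illustrates why name-based column handling makes this kind of request cheap: a consumer keyed on column names keeps working when an upstream user emits additional columns for someone else, whereas a position-based consumer would break. The column names and values are invented for illustration.

```python
#!/usr/bin/env python3
# Illustration only: name-based column selection tolerates upstream column additions.
import csv
import io

original = "timestamp\tqname\tlatency_ms\n1614556800\texample.com\t12\n"
# An upstream user later adds a src_ip column for another consumer:
extended = "timestamp\tsrc_ip\tqname\tlatency_ms\n1614556800\t10.0.0.1\texample.com\t12\n"

def mean_latency(tsv_text: str) -> float:
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = [float(r["latency_ms"]) for r in rows]   # select by column name
    return sum(values) / len(values)

# Both inputs give the same answer even though a column was inserted upstream.
print(mean_latency(original), mean_latency(extended))
```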
4.2.8 Plumb Enables Easy Back-fill Processing

Back-fill is a process where some data is re-processed, possibly because data was missing in an earlier run whose approximate results we could tolerate in the short term. Such data losses, re-ordering, delays, or duplicates are common on an asynchronous network like the Internet.

The Plumb pipeline-graph and Windowed-Streaming provide a simple solution where correctness is traded off for latency in the short term, while correctness is preserved in the long term. Using the pipeline-graph, a user can define two windows on the same data: one for low-latency triggering, and a second that waits conservatively and amends any results affected by late arrivals or other data losses. Plumb automatically manages the underlying data effectively without any need for developers to re-load old data.

4.2.9 Data Sampling

Sampling is another common use case where our system could be used. The need is to pick every nth block in a data stream, possibly to check some data-related property, for example whether both ingress and egress data are present in the stream, guarding against the error case where the capture system has shut down part of the capture. There are multiple ways to achieve this use case in Plumb. Developers can use Block-Streaming and external state to record which block to sample, while discarding the rest of the blocks. Alternatively, developers might use Windowed-Streaming to get a window of n blocks, pick one, and discard the remaining blocks.

4.2.10 Internet Outage Detection

Trinocular [QHP13] detects outages in the global Internet: earlier using quarterly data, and more recently in near real-time. The current outage pipeline is custom built and has many components spanning probing, collection, cleaning, detection, and visualization. Bringing Trinocular under Plumb reduces custom code, and individual components (now individual stages in a Plumb workflow) can be swapped for better versions without breaking the rest of the pipeline. Multiple teams can easily collaborate to improve parts of the pipeline concurrently, and it opens new doors for further analytics (for example, combining signals from DNS data with Trinocular detection, without needing to learn the complexities of the whole pipeline).

4.2.11 Detection of Aggressive DNS Resolver

A flash-crowd attack is an example of a distributed denial-of-service attack that floods the victim. One component in tackling such an attack is the timely detection of aggressive DNS resolvers. Such a component can use Block-Streaming to process DNS blocks and then Stateful-Streaming for in-order data delivery so that the algorithm can build the appropriate state.

4.3 Conclusions

Many streaming workflows are a good fit for our large-block streaming domain and can benefit from our new processing methods. The pipeline-graph facilitates collaborative analytics between multiple users, and Plumb allows efficient bridging between our three key processing abstractions; these two characteristics help developers tame the increasing complexity of workflows.

Plumb is not a suitable framework when input data does not form large blocks or when latency requirements are hard real-time. Other streaming workloads (for example, small record-level streaming) can be transformed into large-block streaming at the expense of added latency (we sketch such a transformation at the end of this chapter).

In this chapter, we discussed how Plumb's new methods are applicable to many applications. In Chapter 5 we compare our work with many classes of big-data analytics for similarities and differences.
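As a concrete illustration of the record-to-block transformation mentioned above, the sketch below accumulates small records into large blocks, closing a block on a size or age threshold; larger thresholds yield larger blocks at the cost of more delay. The thresholds and the idea of running such a batcher in front of Plumb's ingest are assumptions for illustration, not part of the framework.

```python
#!/usr/bin/env python3
# Illustrative batcher: turn a small-record stream into large blocks, trading latency.
import time
from typing import Iterable, Iterator, List

def batch_records(records: Iterable[str],
                  max_bytes: int = 512 * 1024 * 1024,
                  max_seconds: float = 300.0) -> Iterator[List[str]]:
    block: List[str] = []
    size = 0
    started = time.monotonic()
    for rec in records:
        block.append(rec)
        size += len(rec)
        if size >= max_bytes or time.monotonic() - started >= max_seconds:
            yield block                              # block complete: hand it to ingest
            block, size, started = [], 0, time.monotonic()
    if block:                                        # flush the trailing partial block
        yield block

if __name__ == "__main__":
    # Demo with a tiny byte threshold so the split is visible; real blocks are far larger.
    for blk in batch_records((f"record-{i}\n" for i in range(10)), max_bytes=64):
        print(len(blk), "records in this block")
```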
Chapter 5

Related Work

We next briefly compare Plumb to representatives from several classes of big-data processing systems to highlight the contributions of this thesis (Table 1.2). Overall, our primary difference from prior work is our focus on integrating workflows from multiple users, de-duplicating arbitrarily complex code and data, and coping with classes of performance problems (structural and computational skew) within one framework. We provide three key abstractions to cover many use cases, with efficient interoperability between them in a single workflow.

We compare the contributions of this thesis with related work in two sections: multi-user optimizations (§5.1) and multiple abstractions (§5.2).

5.1 Multi-user Optimizations in a Workflow

First we compare our deduplication optimizations (§5.1.1) with others, and then we compare skew management (§5.1.2).

5.1.1 Deduplication of Unstructured Data and Black-box Code

5.1.1.1 Duplicate Detection

Workflow systems for scientific processing use explicit representations of their processing pipeline to capture dependencies and assign work to large, heterogeneous compute infrastructure. Unlike scientific workflow systems (for example [DSS+05]), we use the workflow only to capture data-flow dependencies to facilitate de-duplication (§2.2.4). Like some big-data workflow systems [IHB+12], our stage programs can bring together computation from different systems using our multiple abstractions (perhaps MapReduce and Spark) into one workflow (§3.2.2), but with the benefits of de-duplication (§2.2.4). Other systems place greater constraints on components (such as [MMI+13]), allowing component-specific optimizations (for example, joins); our model avoids this level of integration to allow users to work in different languages and frameworks. We treat operators as black boxes and data as an opaque binary stream with a unique naming scheme to find and remove data and processing duplication. We use Apache YARN [VMD+13] for scheduling.

5.1.1.2 Duplicate Detection Across Multiple Users

Several programming languages provide abstractions and optimizations for big-data processing (for example, [YIF+08, CRP+10]). These systems often optimize a specific job but not the jobs of multiple users; we optimize multi-user workloads. Compilers can easily match an I/O pattern to a suitable optimization when the framework enforces structure on I/O consumption [HSS+14]. SQL compilers for declarative languages [ORS+08, PDGQ05] and parallel databases [PPR+09] match language abstractions to database access patterns and optimizations. Multi-query optimization [Sel88, OAA05] tries to find data sharing in compiled plans over relational data. Most approaches integrated with programming languages focus on structured data and understand operator semantics. We instead target arbitrary programs processing data without a formal schema (§2.2.4, §2.2.5), because black-box modules often provide flexibility in use. We optimize processing from multiple groups (§2.2.2), while related work typically optimizes each user's computation independently.

5.1.1.3 Data Deduplication

Many systems rely on consistent hashing of data for duplicate detection [PL19, OK16] in shared data repositories (like file servers) and multi-user computation. The fundamental shortcoming is that they still generate duplicate data (albeit de-duplicated afterward). For streaming workloads, such a scheme causes unnecessary network traffic (due to inline three-way replication) and disk bandwidth use.
We detect all duplication at user-DAG optimization time, and never generate duplicate data (§2.2.5).

The Sed-Dedup system [TLXX19] allows modification of the data and uses delta de-duplication to reduce storage. We do not allow direct data modification (processing writes a new data block after consuming its input). Such a delta de-duplication scheme is more applicable to long-term data archives than to a streaming system.

The Edison-Data project [ALKL19] uses a data de-duplication algorithm similar to ours. The system can label data for different scientific domains after semantic analysis. That implies that the same data gets similar labels from the system and is hence marked as a possible duplicate (though removal of duplicates is done manually by end users). End users can search the repository based on these system-generated labels. As a consequence, de-duplication in Edison-Data depends on programmers manually searching the available library. They also do not manage workflow skew. We automatically infer code similarity from the input/output data types and manage computational and structural skew in the workflow (§2.2.7, §2.2.8).

The SDAM system [CRC18] uses programmer-generated annotations to detect common parts of the code to optimize their in-memory, small-record workloads. Our system is similar to SDAM's annotations in that our de-duplication strategy depends on domain-specific naming. However, Plumb's names, being explicit in user DAGs, are much simpler than careful code annotation and more robust to incorrect naming (a shared, optimized DAG is easier to review, making it easier to catch naming problems).

5.1.1.4 Processing Deduplication

Like our work, Task Fusion [Dye13] merges jobs from multiple users. They show the potential of optimizing over multiple users, but their work is not automated and does not address performance problems such as structural and computational skew (§2.2.7, §2.2.8). The Nectar system [GRT+10] examines work de-duplication, as we do, but we also expose data to users to promote data de-duplication and encourage data sharing. We also address the problem of workload skew. Several systems suggest frameworks or libraries to improve cluster sharing and utilization [PTS+17]. Some of them resemble our optimized pipeline, but we focus on a very simple streaming API and loosely coupled jobs.

5.1.2 Skew Management

5.1.2.1 Straggler Mitigation

Straggler handling is a special case of what we call skew, with several prior solutions. Speculative execution [DG04] recovers from general faults. Static sampling of data [VYF+16], predictive performance models [MBG10], dynamic key-range adjustment [IBY+07], and aggressive data division [OPR+13] seek to detect or address computational skew. These systems are often optimized around specific data types or computation models and assume structured data. Our system can be thought of as an approach to addressing this problem while making very few assumptions about the underlying data (§2.2.7). The large-block data-consumption needs of our applications preclude adaptive data-sharding schemes (§2.2.8).

5.1.2.2 Resource Optimization for High Throughput

Several systems exploit close control and custom scheduling of cluster I/O [RLC+12] or memory [LGZ+14] to provide high-throughput or low-latency operation. Such systems often require full control of the cluster; we instead assume a shared cluster and target good efficiency instead of absolute maximum performance (§2.3).
5.1.2.3 Big-data Schedulers

Different schedulers have been proposed to optimize resource consumption or delay [ES13, DDKZ15], or to enhance data locality [ZBSS+10]. We work with existing schedulers (§2.2.3), optimizing inside our framework to mitigate skew.

5.2 Multiple Abstractions in a Workflow

The primary goal of our work is to enable our developers to process data using multiple abstractions easily and with good performance. We compare our effort with related systems to evaluate the differences and similarities. Some frameworks provide only a single abstraction (§5.2.1), while others provide many (§5.2.2) but differ from us in approach and applicability.

5.2.1 Single Abstraction in a Framework

5.2.1.1 Batch Systems

Batch systems such as MapReduce [DG04] and Dryad [IBY+07] focus on scheduling and fault tolerance, but do not directly consider streaming data, nor the integration of multi-user workloads as we do. Google's pipelines provide meta-level scheduling, "filling holes" in a large cluster [Den15]. Like them, we optimize across multiple users, but unlike their work, we assume a single framework and leave cluster-wide sharing to YARN (§2.2.2).

5.2.1.2 Streaming Big-data Systems

Several systems focus on low-latency processing of small, streaming data, including Kafka [KNR11], MillWheel [ABB+13], and Spark Streaming [ZDL+13], but without multi-user optimization. They often strive to provide transaction-like semantics and exactly-once evaluation, and stream data in small pieces to minimize latency. These systems focus on processing small objects (records), while we instead focus on large-block streaming. Large blocks are critical to our application and to similar applications that need a broader view than a single record provides, and because large blocks provide clean accountability for fault tolerance and completeness (each block is processed or not) and more comfortable debugging (for example, manually examining a few large blocks rather than millions of records). Larger blocks of data raise issues of structural and computational skew that differ from systems with fine-grained processing; we address skew with specific optimizations. In addition to different abstractions, systems like Spark Streaming make assumptions about how much data fits into memory and how quickly the working set changes. Our applications generally make one pass over data that far exceeds memory, making Spark's optimizations ineffective.

Streaming systems like Flink [CKE+15] optimize for throughput or latency by configuration; we focus on throughput while considering latency as a secondary goal. Flink is optimized for streams of small records, relies heavily on data sharding, and does not provide multi-user collaboration or deduplication abilities. Plumb is optimized for large blocks, manages skew when data sharding is not possible, and provides multi-user collaboration and deduplication. Flink interfaces with developers via a programming language and is confined to the functionality available inside the framework. Plumb uses text-based job descriptions via the pipeline-graph, which are easier to express and allow external binaries, and its multiple abstractions enable the use of other frameworks.

One area of work emphasizes exploiting cluster memory for iterative workloads [ZDL+13]. While we consider sharing across branches of a multi-user pipeline, our workloads are streaming and so are not amenable to caching (§2.2.2).
5.2.1.3 Streaming Databases

For some big-data systems [BAH+19, TSJ+09] and databases [ACC+03, SAB+18, JMS+08], the differences between the two have blurred: big-data systems use SQL-like query languages and database optimizations, and databases allow streaming operations on events.

Hive [TSJ+09] is a canonical example of SQL-based processing on structured big data, and it benefits from well-established optimizations from the database community. Achieving good performance in Hive depends on utilizing the same table for multiple queries. Hive is not suitable for a single query on the data, while streaming databases optimize this use case [SAB+18].

There are two differences between Plumb and the above-mentioned systems. First, for Plumb, data is opaque, and hence Plumb optimizations apply to both structured and unstructured data, while database-oriented systems often work with structured data only. If data has a columnar structure at some stage, Plumb developers might use Hive for a simpler expression of that stage's code. Because Plumb allows its developers to invoke any custom code on workflow stages, executing Hive on a stage is possible. Second, Plumb's abstractions can apply to many applications (including those that need single or multiple passes over the input) in the analytics space (Chapter 4). Plumb can read and write efficiently by merging I/O-bound stages with CPU-bound stages, while streaming databases optimize reading and traditional databases optimize writing separately.

Some big-data systems [BAH+19, TSJ+09] provide a SQL-like query language and use database optimizations when data is structured and operator semantics are well known. On the other hand, traditional databases have been extended for event streams and allow database operators on small windows of data. Neither type of system provides deduplication of data and black-box code.

5.2.2 Analytics Coverage

5.2.2.1 Facebook Systems

Facebook's data processing system with Puma, Stylus, and Swift [CWI+16] provides multiple abstractions, as we do, but it uses three independent frameworks: building a workflow that might use all of them is left to the developers. Plumb allows developers to mix abstractions in a single workflow (§3.2). Facebook's Puma system allows developers to operate on data of arbitrary size, but it uses SQL queries to combine many small records and relies on query optimizations to get locality and good performance. Our Block-Streaming abstraction allows developers to write code that benefits from single-pass processing for temporal and spatial locality, at the cost of a lower-level abstraction (§3.2.3). Facebook's Stylus system provides data ordering, but they do not specify how they handle late, missing, or duplicated data. In Windowed-Streaming we allow the developer to choose completeness criteria and handle errors (§3.2.4). Facebook's Swift system provides continuous streaming like our Stateful-Streaming, but the underlying data management [ET19] is different: their data moves out to long-term archives automatically over time, while Plumb keeps track of each data item and proactively deletes or archives it right after its last use, keeping storage free to absorb any DoS-attack traffic (§3.2.5). Facebook's approaches reflect the need for multiple abstractions, but they do not describe integration to combine them across a workflow as we do.
5.2.2.2 Google System

Google's dataflow system [ABC+15] maps a developer's SQL code to one of the underlying frameworks (MapReduce [DG04], FlumeJava [CRP+10], MillWheel [ABB+13]), allowing abstractions from only one framework at a time. Plumb allows its developers to use a different abstraction at each stage of a workflow. Their frameworks do not support Block-Streaming, and implementing complex functions like TCP reassembly is not straightforward in their SQL-like language.

5.2.2.3 Nexus

To provide developers multiple frameworks on a single shared cluster, Nexus [HKZ+09] (a precursor of the Mesos system [HKZ+11]) provides an isolation boundary between different frameworks (MapReduce [DG04], Dryad [IBY+07]) and gives each framework a slot abstraction. We go beyond their work to provide three different abstractions and to support the developer in moving data between abstractions in our framework.

5.2.2.4 Spark Streaming

Spark Streaming [ZCF+10] provides a distributed shared-memory abstraction for continuous streams and allows its developers to perform a series of operations on a group of data. They provide abstractions similar to ours, but these apply only to in-memory data and to iterative workloads that repeatedly apply operations on slowly changing data for sub-second latency. Our Block-Streaming leverages spatial and temporal locality for single-pass operations. Plumb allows developers to use Spark as a framework with our Stateful-Streaming.

5.2.2.5 Hadoop-Based Systems

Airflow [Bea] and Oozie [IHB+12] are workflow managers that glue together Hadoop-based frameworks, where execution moves from one stage to the next based on different user-defined events (for example, a "one success" event executes the current stage if any one of the parent stages has completed successfully). Their workflow stages are similar to our processing (the P in IPO), but they do not manage input or output, such as data windowing. They do not provide multi-user data and code deduplication or skew management.

5.2.2.6 Extending Relational Databases

SECRET and TSpoon [A19], parts of the system described by Lorenzo Affetti et al., use data windowing to provide interoperability between relational databases and streaming data, allowing database operators on streaming data. They do not provide Block-Streaming or Stateful-Streaming. They provide ACID transactional guarantees, while Plumb operations provide at-least-once semantics.

5.2.2.7 Industrial Integrators

There are many systems (some examples include [FFRF19, Spo, BC, Teab, SWWF18]) that provide glue between two specific frameworks. Plumb provides three abstractions that cover a broad part of the design space, and glue to move data between them.

5.3 Conclusions

This chapter compares Plumb and its optimizations with all classes of related systems to highlight the contributions of this thesis (Table 1.2). First, our algorithm to detect similarity is novel, and it makes the de-duplication of unstructured data and arbitrary code possible. Second, we show that mitigating processing skew is possible without sharding, and Plumb automates this process. Third, we demonstrate that our three new abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming) cover most of the streaming application domain, and that efficient interoperability between them is possible in a single workflow. Fourth, the two new methods that we designed and evaluated in this thesis are novel additions to multi-user and multi-abstraction workflow processing.
This thesis tackles significant challenges in stream processing, but many more problems remain open; we discuss them in Chapter 6.

Chapter 6

Future Work and Conclusions

Before we conclude (§6.2), we first point out some promising research and development opportunities (§6.1) to further our work. We presented new methods for stream processing and their implementation in the Plumb framework, which lets teams collaboratively solve data problems with ease and applies to many use cases. Many challenges are still open, and we believe that our work is a cornerstone for solving them by building further upon it.

6.1 Future Work

We present a handful of research opportunities below for further exploration.

6.1.1 Debugging in Multi-User Workflows

The presence of multiple users can make the debugging cycle more cumbersome. We present two such scenarios and potential solutions as possible future work.

First, understanding diagnostic information becomes problematic when a workflow stage is shared. Distributed debugging relies on logging [SSKM92, Bra10] to generate diagnostic information. When multiple users (say users A and B) share a workflow stage, Plumb will use code from one of them (because the similarity definition declares A's and B's code functionally the same). If user A's code is in use and user B wants to debug this stage, the logging will contain information that might not make sense to B. The second challenge is to allow stage-by-stage debugging [GICK17] on a workflow stage without adding latency for other users.

We propose possible future work for both of the above scenarios. Plumb can make a temporary copy of the stage (with a specific user's code) but with a different output name (to disable deduplication on this stage). After that, the user can do the debugging and compare against the currently running code in the production branch. If the user finds a bug, they can then inform all other sharing users about it, and together they will need to devise a strategy to fix the problem. Care must be taken to allow only a limited number of concurrent debugging sessions so that production jobs' throughput and latency are not disrupted.

6.1.2 Effect of Heterogeneity on I/O Intensity and Stage Merging

Plumb might merge an I/O-bound stage into a CPU-bound stage for better performance (§2.2.6). Reliable detection of I/O-bound stages and automated merging of two workflow stages are challenging. We present these two challenges next.

We need to study further how Plumb's I/O-intensity calculation behaves on heterogeneous clusters (different processor speeds, spinning disks and SSDs, Clos and over-subscribed networks, etc.) to make sure the algorithm provides stable results. The I/O characteristics can be quite different if the cluster has both hard disks and flash-based SSDs. Changes in I/O characteristics from one run to the next can disrupt the calculation of I/O intensity. Knowledge of where a job is running and of the hardware characteristics (for example, the kind of disks data is read from or written to) might be a promising direction for solving this challenge.

Automated merging of workflow stages changes the structure of the user-visible workflow. (We established in §2.2.6 that performance improves by merging an I/O-bound stage of a workflow into another, CPU-bound stage.) A mismatch between the user-submitted workflow and the optimized workflow can reduce developer comprehension and the ability to debug.
This situation is analogous to debugging a highly optimized binary in gdb, where, due to possible instruction reordering, the debugging flow does not match the source-level flow. Keeping a visual timeline from the unoptimized to the optimized workflow, and using an unoptimized version for user debugging, might be a solution.

In summary, the heterogeneity of compute, storage, and networking components and the dynamic load conditions on shared clusters are challenges for detecting I/O-bound stages and providing a stable merged workflow. Dynamic conditions can be an impediment to automated merging, and the system can enter an unstable situation where the optimized workflow keeps swinging between different structures and never stabilizes. Careful study is required to enable automated merging of developers' workflows.

6.1.3 Workflow Consistency and Job Churn

Plumb atomically moves from one workflow to the next when developers add new jobs or retract old ones, but at the cost of added latency. We use a rather simple implementation to achieve this effect: when new jobs are added or old jobs are retracted, all currently running jobs finish before the changes are incorporated. During this transition time, some compute might sit idle (for example, a straggler may hold everyone else up), adding unwanted latency. Terminating all running jobs and re-running them with the new optimized workflow might reduce latency at the expense of some lost work, but care must be taken because high job churn can make such a scheme wasteful. A careful study is required to allow almost instantaneous adoption of changes.

6.1.4 Better Visual Interfaces

Plumb provides a shared canvas where its users can process data collaboratively (without needing to be present at the same time). While Plumb provides the currently optimized workflow in YAML and graphical form, there is an opportunity to further annotate the graphical representation, for example with which developers provided a specific stage and a one-line summary of the stage's computation.

6.1.5 Malicious Actors and Sustained Malicious Attacks

Here we discuss future work on challenges arising from malicious actors and on the consequences of sustained attacks.

Plumb assumes its developers are not malicious. Plumb picks the code for a shared stage randomly from one of its users, which provides an opportunity for a malicious developer. A malicious user might provide a shared stage that does not always produce correct results, affecting all downstream users of those results. Bugs in a program can cause the same behavior, making it harder to judge whether a problem was intentional or accidental. Proving program correctness is a challenging problem and is outside the scope of Plumb. We can easily enhance the pipeline-graph to pick code for a shared stage from a reputable developer (for example, the most senior, or by some other metric). For non-malicious cases, if a stage fails, we might run another user's version of the code on the next try to handle the case where a specific version had a bug.

Our strategy for allocating compute resources (which depends on the current queue backlog) can become a new attack vector. If an attacker knows Plumb's scheduling and stage algorithms, he or she might be able to influence Plumb's compute allocation at will by sending specific kinds of traffic (similar to algorithmic-complexity attacks such as [CW03] and [BYW07]).
Sustained malicious traffic can generate hot spots in the workflow, where some stage queues grow consistently longer and Plumb moves compute resources to those hot spots at the expense of others. We have seen cases where such a scenario occurred for non-malicious reasons: for example, one of our copy-out stages had sustained I/O errors, causing its queue to grow and proportionally more workers to be assigned to copy-out. Taking unrelated workflow branches (those with no relation to the slowing stage) into account is a possible direction to explore. Detecting the nature of a failure (malicious or non-malicious, persistent or intermittent) can give Plumb extra hints when allocating workers, though reliably detecting different error conditions remains a challenging sub-field of distributed systems.

6.1.6 Resource Rate Limiting

A compute cluster consists of shared components such as servers, storage of different types (for example, HDFS, RAID), and network bandwidth. We need to manage the offered load on these resources so that response times remain low. Rate limiting might offer a solution to avoid resource overload, though dynamic conditions (new jobs arriving, old ones graduating, job characteristics changing) make this problem challenging. Next, we describe a scenario from our Plumb deployment that warrants rate limiting.

Plumb currently uses HDFS for I/O, but a few use cases require reading from or writing to non-HDFS network file systems while also keeping the load on such drives under a threshold. Our current scheduling system relies primarily on queue-length growth over time to assign workers proportionally. Some of our important data (for example, archived data from the past) resides on various RAID-based systems that have I/O constraints in terms of maximum concurrent reads and writes. We would like Plumb to rate-limit such input/output devices as well when scheduling jobs. It becomes a multi-objective optimization problem: process files as quickly as possible (taking queue-length growth into account) while not overwhelming file-system resources.

6.1.7 Garbage Collection and Consistency Checks

While Plumb proactively deletes data that is no longer needed, it is still possible for some unwanted files to accumulate over time in HDFS. These files could be due to a failed data relay. Similarly, cases can arise where an HDFS file was deleted but Plumb was stopped before its references inside Plumb were removed. For both kinds of problems, we need a File System Consistency Check (FSCK)-like facility to garbage-collect data and to validate Plumb metadata. While many FSCK algorithms assume that no one is using the file system, our consistency-check algorithm might not have that luxury and will need to do its work while data sources and Plumb jobs are running. A possible direction is to check only data older than a few days (for example, 7 days); we sketch such a check at the end of this subsection.

Another kind of garbage collection involves faulty files: files that could not be processed even after retries. Currently, Plumb marks a block faulty after four retries, and these faulty files need manual inspection. But as traffic and user jobs increase, a more automated response to this scenario needs further investigation. The challenge here is that a failure could occur for many reasons, intermittent or permanent, and deciding which ones require manual intervention is important.
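The sketch below illustrates the reconciliation described in §6.1.7. The storage and metadata listings are stand-ins for enumerating HDFS contents and Plumb's internal references (no real HDFS API is used here), and the paths and seven-day grace period are only examples: files outside the grace window are compared in both directions, yielding orphaned files that are candidates for garbage collection and dangling metadata references that need repair.

```python
#!/usr/bin/env python3
# Illustrative FSCK-like reconciliation between storage contents and Plumb metadata.
# Assumed inputs: dicts mapping path -> modification time (seconds since the epoch).
import time
from typing import Dict, Optional, Set, Tuple

def consistency_check(storage: Dict[str, float],
                      metadata: Dict[str, float],
                      grace_days: float = 7.0,
                      now: Optional[float] = None) -> Tuple[Set[str], Set[str]]:
    now = time.time() if now is None else now
    cutoff = now - grace_days * 86400.0
    old_storage = {p for p, mtime in storage.items() if mtime < cutoff}
    old_refs = {p for p, mtime in metadata.items() if mtime < cutoff}
    orphans = old_storage - old_refs          # files present but unreferenced: GC candidates
    dangling = old_refs - old_storage         # references to files that no longer exist
    return orphans, dangling

if __name__ == "__main__":
    ten_days_ago = time.time() - 10 * 86400
    storage = {"/plumb/pcap.sz/b1": ten_days_ago,
               "/plumb/tmp/relay-fail": ten_days_ago,
               "/plumb/pcap.sz/b2": time.time()}     # recent file: inside grace window, skipped
    metadata = {"/plumb/pcap.sz/b1": ten_days_ago,
                "/plumb/pcap.sz/b3": ten_days_ago}
    print(consistency_check(storage, metadata))
```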
6.1.8 Explicit Compute Requirements, Fairness, and Priorities

Plumb currently assigns just one virtual CPU to each stage and asks its developers to stay within that limit, though there may be use cases where users need more than one CPU per stage. A user might like to annotate their YAML job with a requested CPU count. Such a count should be taken only as a hint and not a strict requirement, because Plumb might not be able to find a reasonable slot where the request can be fulfilled, which would cause delays. An additional complexity is deciding what to do if a stage is shared by more than one user and one of them asks for more than one CPU. Should Plumb prefer the code from the user requesting fewer cores, or should it try to assign more cores so that a specific user's code runs faster through potential parallelization? On the same grounds, Plumb should have some mechanism to estimate a user job's memory needs. Currently, we allocate a reasonable amount of memory and have some flexibility to grow it dynamically. We would prefer not to ask users about their programs' memory needs, because that is hard for a user to know and the need varies during a program's run.

Plumb does not consider per-stage fairness when scheduling jobs. A shared workflow belongs to all users, and fairness per user might not be the correct fairness criterion. Large resource consumption at a workflow stage can interfere with other jobs, and hence a notion of fairness per stage might be a better fairness metric. Dynamically arriving jobs, and jobs needing a large pool of resources for an extended time, warrant the ability to preempt. Preempting a job that has spawned many other jobs is challenging in many respects; one challenge is that network connections might reset when jobs resume after a long delay, and the associated time-out algorithms might also be affected.

The need for selective job prioritization (for lower delay) might conflict with fairness policies. We should probably apply fairness criteria only to those jobs that do not have special priority. In our current system, all users' jobs are considered to have the same priority. One might need to escalate priority not only for a specific terminal job, but also for the upstream ancestors that generate that terminal data.

6.1.9 Performance Isolation

An issue related to fairness is performance isolation, where one user's workload should not impact other, unrelated users. Performance isolation is challenging on shared clusters. Some users might take unfair advantage of others (for example, by hogging all the disk bandwidth on a server) or make the situation worse for everyone (for example, by making shared storage slow).

Plumb allows different kinds of abstractions in the same workflow on a shared cluster. That might mean that a heavy shuffle by a MapReduce job leaves little I/O headroom for other, streaming jobs. We need performance isolation of computational resources between different kinds of jobs. Unintended interference between jobs on a shared cluster remains an active area of research. Disks, network bandwidth, and local memory-system bandwidth are all shared resources. Beyond fairness-of-use concerns, it is difficult to performance-isolate one job from another. Google's CPI system [ZTH+13] did seminal work on performance isolation for processing resources, while Microsoft [MBFM15] extended the work to other, I/O-related resources as well.
Both of these works target specific workloads (for example, the CPI work assumes an abundance of very small and very big jobs, where big jobs can be stopped and re-run later), and both target very large-scale data centers with complex machinery. While these two works show that performance isolation is possible at very large scale, the first challenge is how smaller, enterprise-level data centers can use them. Second, the mechanisms require specifying what percentage of resources should be assigned to whom, and devising the right policies for a changing workload is challenging.

6.1.10 Private Name-Spaces

While using input and output names from a common namespace helps deduplication, there are some use cases where a user might want to use a name without the danger of it conflicting with the global namespace. One such use case arose when we wanted to use the pcap.sz name for data from a new source, but that data needed some processing before going to the usual pcap.sz queue. Since the pcap.sz input name was already in use, a user had to prequalify it with a different name. Allowing both a common global namespace and user-specific namespaces should resolve this issue.

In summary, stream processing is rich with challenges, and our system Plumb can provide a stable testbed for experimenting with and evaluating new approaches.

6.2 Conclusions

Organizations receive streaming data from a web of services, and transforming it into useful and actionable information requires teamwork and a supportive framework that lets developers pick and choose suitable abstractions. Multi-user workflows (where many users in different teams collaboratively process data) experience data and processing duplication and the performance anomalies of structural and computational skew. Multi-abstraction workflows (where developers need to match a workflow stage to a suitable processing method) suffer from custom glue code that causes incorrect results and performance challenges. This thesis provides two new methods to tackle these challenges and make multi-user and multi-abstraction workflows efficient.

We prove our thesis (§1.1) by designing and thoroughly evaluating two new methods for processing streaming workflows of multiple users and multiple abstractions. Our first method shows that deduplication of unstructured data and arbitrary code is possible. It also shows that skew management is possible without data sharding and hence applies to large-block workloads. Our second new method provides three key abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming) and efficient bridging between them. It shows that many classes of streaming applications can be built using these abstractions. We demonstrate the efficiency of our new methods in focused evaluations and in production use.

We provide a new framework, Plumb, for processing large-block streaming data in a multi-user environment using multiple abstractions. Plumb's novelty comes from integrating workflows from multiple users while de-duplicating computation and storage, and from its use of dynamic scheduling to accommodate structural and computational skew. Plumb makes it easy for developers to access multiple abstractions in their big-data workflows, simplifying code, lowering latency, and improving correctness. To do so, it bridges data streams between three different abstractions (Block-Streaming, Windowed-Streaming, and Stateful-Streaming), while managing error handling and allowing developers to dial the trade-offs between latency and completeness.
These abstractions optimize for dierent goals (Block-Streaming for high parallelism when some context is needed, Windowed-Streaming for time-dependent processing, and Stateful-Streaming for long-running applications with state). Plumb shows how the right combination of simple ideas (use of naming, disaggre- gating pipeline-graph and our three key abstractions) can solve challenging problems (multi-user duplication, skew, and interoperable abstractions). Plumb’s new methods provide a shared (but secure) canvas to its users to collaboratively (but not requiring their simultaneous presence) process available data using multiple abstractions, and with less delay and high resource utilization. Plumb is in production use today handling all B-Root DNS data. Plumb’s new methods reduce latency by 74% (blocks: §2.3.7 and §3.3.1) or 97% (daily statistics: §3.3.2.3), while reducing code size by 58% (§3.3.2.1) and improving correctness (result- ing in 73% less manual interventions as described in §3.3.2.2). We are in the process of deploying it at additional sites, and for additional workflows. Plumb is open source and available for use. 118 Bibliography [ABB + 13] Tyler Akidau, Alex Balikov, Kaya Bekiroundefinedlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. Millwheel: Fault-tolerant stream processing at internet scale. volume 6, page 1033–1044. VLDB Endowment, August 2013. [ABC + 15] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fern´ andez-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive- scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792–1803, August 2015. [ACC + 03] Daniel J Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Chris- tian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream manage- ment. the VLDB Journal, 12(2):120–139, 2003. [AD14] Thomas Anderson and Michael Dahlin. Section 7.5 (Queuing Theory) and 7.6 (Overload Management), Operating Systems: Principles and Practice. Recursive books, 2nd edition, 2014. [A19] Lorenzo Aetti. New Horizons for Stream Processing. PhD thesis, Politec- nico Di Milano Dipartimento Di Elettronica, Informazione E Bioingegne- ria, 2019. [Agl] Thomas Aglassinger. Pygount project at PyPI. https://pypi.org/ project/pygount/. (Accessed on 02/11/2021). [ALKL19] Sunil Ahn, Jeongcheol Lee, Jaesung Kim, and JongSuk R. Lee. Edison- data: A flexible and extensible platform for processing and analysis of com- putational science data. Wiley Journal of Software, practice & experience., 49(10):1509–1530, October 2019. 119 [BAH + 19] Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, and Kenneth Knowles. One sql to rule them all - an ecient and syntactically idiomatic approach to management of streams and tables. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, pages 1757–1772, New York, NY , USA, 2019. ACM. [BC] Gilles Barbier and Louis Cibot. Zenaton — workflow builder for develop- ers. https://zenaton.com/. (Accessed on 02/11/2021). [Bea] Maxime Beauchemin. Apache airflow. https://airflow.apache.org/. (Accessed on 02/11/2021). [Bra10] Ryan Evans Braud. Query-based debugging of distributed systems. PhD thesis, UC San Diego, 2010. [BYW07] Noa Bar-Yosef and Avishai Wool. 
Remote algorithmic complexity attacks against randomized hash tables. In International Conference on E-Business and Telecommunications, pages 162–174. Springer, 2007. [CF15] Emilio Coppa and Irene Finocchi. On data skewness, stragglers, and mapreduce progress indicators. In Proceedings of the Sixth ACM Sym- posium on Cloud Computing, SoCC ’15, page 139–152, New York, NY , USA, 2015. Association for Computing Machinery. [CKE + 15] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, V olker Markl, Seif Haridi, and Kostas Tzoumas. Apache flink: Stream and batch process- ing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 38:28–38, 2015. [CMR + 18] Shumo Chu, Brendan Murphy, Jared Roesch, Alvin Cheung, and Dan Suciu. Axiomatic foundations and algorithms for deciding semantic equiv- alences of sql queries. Proc. VLDB Endow., 11(11):1482–1495, July 2018. [CRC18] Paolo Cappellari, Mark Roantree, and Soon Ae Chun. Optimizing data stream processing for large-scale applications. Wiley Journal of Software, practice & experience., 48(9):1607–1641, September 2018. [CRP + 10] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. Flumejava: Easy, ecient data-parallel pipelines. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implemen- tation, PLDI ’10, page 363–375, New York, NY , USA, 2010. Association for Computing Machinery. 120 [CW03] Scott A Crosby and Dan S Wallach. Denial of service via algorithmic complexity attacks. In USENIX Security Symposium, pages 29–44, 2003. [CWI + 16] Guoqiang Jerry Chen, Janet L Wiener, Shridhar Iyer, Anshul Jaiswal, Ran Lei, Nikhil Simha, Wei Wang, Kevin Wilfong, Tim Williamson, and Serhat Yilmaz. Realtime data processing at Facebook. In Proceedings of the 2016 International Conference on Management of Data, pages 1087–1098. ACM, 2016. [Dat] Databricks. CIO survey: Top 3 challenges adopting AI and how to over- come them. https://pages.databricks.com/rs/094-YMS-629/images/ DatabricksIDG_eBK0911[1].pdf. (Accessed on 02/12/2021). [DDKZ15] Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, and Willy Zwaenepoel. Hawk: Hybrid datacenter scheduling. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’15, page 499–510, USA, 2015. USENIX Association. [Den15] Dan Dennison. Continuous pipelines at google. SRECon, May 2015. [DG04] Jerey Dean and Sanjay Ghemawat. Mapreduce: Simplified data process- ing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association. [DNS] DNS-OARC. Day in the life of the internet (DITL). https://www.dns- oarc.net/oarc/data/ditl. (Accessed on 02/12/2021). [DSS + 05] Ewa Deelman, Gurmeet Singh, Mei-Hui Su, James Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, G. Bruce Berriman, John Good, and et al. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Sci. Program., 13(3):219–237, July 2005. [Dye13] Robert Dyer. Task fusion: Improving utilization of multi-user clusters. In Proceedings of the 2013 Companion Publication for Conference on Sys- tems, Programming, and Applications: Software for Humanity, SPLASH ’13, pages 117–118, New York, NY , USA, 2013. ACM. [ES13] Moussa Ehsan and Radu Sion. Lips: A cost-ecient data and task co- scheduler for mapreduce. 
IPDPS workshop, pages 2230–2233, 2013. 121 [ET19] Facebook Engineering and Infrastructure Team. Scribe: Transporting petabytes per hour via a distributed, buered queuing system. https: //engineering.fb.com/2019/10/07/data-infrastructure/scribe/, 10 2019. (Accessed on 02/12/2021). [FFRF19] Daniela L Freire, Rafael Z Frantz, and Fabricia Roos-Frantz. Ranking enterprise application integration platforms from a performance perspec- tive: An experience report. Software: Practice and Experience, 49(5):921– 941, 2019. [FFRFS19] Daniela L Freire, Rafael Z Frantz, Fabricia Roos-Frantz, and Sandro Saw- icki. Survey on the run-time systems of enterprise application integration platforms focusing on performance. Software: Practice and Experience, 49(3):341–360, 2019. [FGC + 97] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. volume 31 of SIGOPS Oper. Syst. Rev., page 78–91, New York, NY , USA, October 1997. Association for Computing Machinery. [FHQ17] Kensuke Fukuda, John Heidemann, and Abdul Qadeer. Detecting mali- cious activity with dns backscatter over time. IEEE/ACM Trans. Netw., 25(5):3203–3218, October 2017. [GICK17] Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, and Miryung Kim. Debugging big data analytics in spark with bigdebug. In Proceed- ings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, page 1627–1630, New York, NY , USA, 2017. Association for Computing Machinery. [GRT + 10] Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang. Nectar: Automatic management of data and computation in datacenters. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, page 75–88, USA, 2010. USENIX Association. [HBP + 05] Alefiya Hussain, Genevieve Bartlett, Yuri Pryadkin, John Heidemann, Christos Papadopoulos, and Joseph Bannister. Experiences with a continu- ous network tracing infrastructure. In Proceedings of the ACM SIGCOMM MineNet Workshop, pages 185–190, Philadelphia, PA, USA, August 2005. ACM. [Hei] John Heidemann. Fsdb. https://www.isi.edu/ ~ johnh/SOFTWARE/FSDB/. (Accessed on 02/12/2021). 122 [HKZ + 09] Benjamin Hindman, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Scott Shenker, and Ion Stoica. Nexus: A common substrate for cluster computing. In Workshop on Hot Topics in Cloud Com- puting, 2009. [HKZ + 11] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceed- ings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, page 295–308, USA, 2011. USENIX Associa- tion. [HSS + 14] Martin Hirzel, Robert Soul´ e, Scott Schneider, Buundefinedra Gedik, and Robert Grimm. A catalog of stream processing optimizations. ACM Com- put. Surv., 46(4), March 2014. [IBY + 07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fet- terly. Dryad: Distributed data-parallel programs from sequential building blocks. volume 41, page 59–72, New York, NY , USA, March 2007. ACM, Association for Computing Machinery. [IHB + 12] Mohammad Islam, Angelo K. Huang, Mohamed Battisha, Michelle Chi- ang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, and Alejandro Abdelnur. Oozie: Towards a scalable workflow management system for hadoop. 
In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, SWEET ’12, New York, NY , USA, 2012. Association for Computing Machinery. [JMS + 08] Namit Jain, Shailendra Mishra, Anand Srinivasan, Johannes Gehrke, Jen- nifer Widom, Hari Balakrishnan, Uˇ gur C ¸ etintemel, Mitch Cherniack, Richard Tibbetts, and Stan Zdonik. Towards a streaming sql standard. Pro- ceedings of the VLDB Endowment, 1(2):1379–1390, 2008. [JQP + 18] Alekh Jindal, Shi Qiao, Hiren Patel, Zhicheng Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, and Sriram Rao. Computation reuse in analytics job service at microsoft. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18, page 191–203, New York, NY , USA, 2018. Association for Computing Machinery. [KNR11] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: A distributed messaging system for log processing. Proceedings of the NetDB, pages 1–7, 2011. 123 [LGZ + 14] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Tachyon: Reliable, memory speed storage for cluster computing frame- works. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, page 1–15, New York, NY , USA, 2014. ACM, Association for Computing Machinery. [MBFM15] Jonathan Mace, Peter Bodik, Rodrigo Fonseca, and Madanlal Musuvathi. Retro: Targeted resource management in multi-tenant distributed sys- tems. In Proceedings of the 12th USENIX Conference on Networked Sys- tems Design and Implementation, NSDI’15, page 589–603, USA, 2015. USENIX Association. [MBG10] Kristi Morton, Magdalena Balazinska, and Dan Grossman. Paratimer: A progress indicator for mapreduce dags. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, page 507–518, New York, NY , USA, 2010. Association for Computing Machinery. [MMI + 13] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Mart´ ın Abadi. Naiad: A timely dataflow system. In Proceed- ings of the Twenty-Fourth ACM Symposium on Operating Systems Princi- ples, SOSP ’13, page 439–455, New York, NY , USA, 2013. Association for Computing Machinery. [Moc87] Paul Mockapetris. https://www.ietf.org/rfc/rfc1035.txt. https://www. ietf.org/rfc/rfc1035.txt, November 1987. (Accessed on 02/22/2021). [MYN16] Hitoshi Mitake, Hiroshi Yamada, and Tatsuo Nakajima. Analyzing the tradeo between throughput and latency in multicore scalable in-memory database systems. In Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, APSys ’16, pages 17:1–17:9, New York, NY , USA, 2016. ACM. [OAA05] K O’Gorman, A. El Abbadi, and D. Agrawal. Multiple query optimization in middleware using query teamwork. Wiley Journal of Software, practice & experience., 35(4):361–391, April 2005. [OK16] Mikito Ogata and Norihisa Komoda. Improvement of deduplication e- ciency by two-layer deduplication system. Wiley Periodicals of Electronics and Communications in Japan., e11781(2):1–9, February 2016. 124 [OPR + 13] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. The case for tiny tasks in compute clusters. In Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems, HotOS’13, page 14, USA, 2013. USENIX Association. [ORS + 08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: A not-so-foreign language for data process- ing. 
In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 1099–1110, New York, NY , USA, 2008. Association for Computing Machinery. [Pax] Vern Paxon. The Zeek network security monitor. https://zeek.org/. (Accessed on 02/12/2021). [PDGQ05] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpret- ing the data: Parallel analysis with sawzall. Sci. Program., 13(4):277–298, October 2005. [PL19] J. Periasamy and B. Latha. Ecient hash function-based duplication detec- tion algorithm for data deduplication deduction and reduction. Wiley Jour- nal of Concurrency and computation., e5213(2):1–9, February 2019. [PPR + 09] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, page 165–178, New York, NY , USA, 2009. Association for Computing Machinery. [PTS + 17] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. Weld: A common runtime for high performance data analytics. In The bien- nial Conference on Innovative Data Systems Research, CIDR ’17, January 2017. [QHP13] Lin Quan, John Heidemann, and Yuri Pradkin. Trinocular: Understanding internet reliability through adaptive probing. In Proceedings of the ACM SIGCOMM Conference, pages 255–266, Hong Kong, China, August 2013. ACM. 125 [RLC + 12] Alexander Rasmussen, Vinh The Lam, Michael Conley, George Porter, Rishi Kapoor, and Amin Vahdat. Themis: An i/o-ecient mapreduce. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC ’12, New York, NY , USA, 2012. Association for Computing Machinery. [Roo16] Root Server Operators. Events of 2016-06-25. Technical report, Root Server Operators, ISI, June 29 2016. [SAB + 18] Mike Stonebraker, Daniel J Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, et al. C-store: a column-oriented dbms. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, pages 491–518. 2018. [Sel] Seladb. Pcapplusplus/examples/pcapsplitter at master · seladb/pcapplusplus · github. https://github.com/seladb/ PcapPlusPlus/tree/master/Examples/PcapSplitter. (Accessed on 02/12/2021). [Sel88] Timos K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, March 1988. [Sip13] Michael Sipser. Introduction to the Theory of Computation, Theorem 5.4 EQ TM is undecidable. Course Technology, Boston, MA, third edition, 2013. [SMPT14] Beatriz Soret, Preben E. Mogensen, Klaus I. Pedersen, and Mari Car- men Aguayo Torres. Fundamental tradeos among reliability, latency and throughput in cellular networks. In GLOBECOM Workshops, pages 1391– 1396. IEEE, 2014. [Spo] Spotify. Luigi: Python module for building complex pipelines of batch jobs. https://github.com/spotify/luigi. (Accessed on 02/12/2021). [SRLS17] Matthias Schlaipfer, Kaushik Rajan, Akash Lal, and Malavika Samak. Optimizing big-data queries using program synthesis. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, page 631–646, New York, NY , USA, 2017. Association for Computing Machin- ery. 126 [SSKM92] M. Satyanarayanan, David C. Steere, Masashi Kudo, and Hank Mashburn. Transparent logging as a technique for debugging complex distributed sys- tems. 
In Proceedings of the 5th Workshop on ACM SIGOPS European Workshop: Models and Paradigms for Distributed Systems Structuring, EW 5, pages 1–3, New York, NY, USA, 1992. Association for Computing Machinery.
[SWWF18] Matthias J. Sax, Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag. Streams and tables: Two sides of the same coin. In Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, BIRTE '18, New York, NY, USA, 2018. Association for Computing Machinery.
[Teaa] B-Root Analytics Team. B-Root DNS statistics. https://b.root-servers.org/rssac/. (Accessed on 02/11/2021).
[Teab] LinkedIn Engineering Team. Azkaban: LinkedIn workflow manager. https://azkaban.github.io/. (Accessed on 02/12/2021).
[TLXX19] Wenlong Tian, Ruixuan Li, Cheng-Zhong Xu, and Zhiyong Xu. Sed-dedup: An efficient secure deduplication system with data modifications. Wiley Journal of Concurrency and Computation, e5350(4):1–14, April 2019.
[TSJ+09] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[TTS+14] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147–156, New York, NY, USA, 2014. Association for Computing Machinery.
[VMD+13] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, New York, NY, USA, 2013. Association for Computing Machinery.
[VYF+16] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. In Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation, NSDI '16, pages 363–378, USA, 2016. USENIX Association.
[YIF+08] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI '08, pages 1–14, USA, 2008. USENIX Association.
[ZBSS+10] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pages 265–278, New York, NY, USA, 2010. Association for Computing Machinery.
[ZCF+10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.
[ZDL+13] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 423–438, New York, NY, USA, 2013.
Association for Computing Machinery.
[ZTH+13] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. CPI2: CPU performance isolation for shared compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 379–391, New York, NY, USA, 2013. Association for Computing Machinery.
[ZXW+16] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache Spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, October 2016.

Appendices

Appendix A
Plumb APIs

Plumb provides APIs for developers to interact with the system. Here we discuss the queue API (§A.1) for interaction with the metadata, block naming rules and the related API (§A.2), the job submission API (§A.3), and APIs for our three key abstractions (§A.4, §A.5, §A.6).

A.1 Queue API

We summarize our queue API in Table A.2. All functions in Table A.2 take an optional final parameter, caller id. By default, entries are associated with a user, but caller id lets the system administrator assume the identity of an ordinary user (like Unix sudo). If not provided, the framework fills it in; otherwise an admin user can use it to perform operations on behalf of other users (after the framework authenticates that the caller is an admin user). This feature is analogous to the Linux root user. As per Unix convention, the return code on success is zero, and any error messages are printed on stderr.

Plumb queues hold blocks, each with data in a format specific to that queue. Typically blocks are 512 MB to 2 GB in size. Each block can be in one of many states (Table A.1). For example, a new block gets the UNRESERVED state, its state changes to RESERVED when processing starts, and the block is garbage collected if the processing is successful. Plumb detects failed or stuck processing by using timeouts (how long a block has been under processing). If failures on a block persist after retries, the block is marked as FAULTY.

State and synopsis:
UNRESERVED: Block is ready for processing.
RESERVED: Block is being processed by some worker.
FAULTY: Block is bad or failed under some processing.
LATE: Block came in too late; the window has moved forward.
INWINDOW: Block is part of some window.
INMULTIWINDOW: Block is in more than one window.
DUPLICATE: Block already processed by some window.
OUTOFLEFTBOUNDARY: Block is before the start of the window.
OUTOFRIGHTBOUNDARY: Block is after the end of the window.
UNWANTED: Block is from a site not requested by Plumb.
Table A.1: Possible block states inside Plumb.

A queue is a derived queue if its data arrives from another queue after processing. There are always queues that do not have any parent (such as our pcap.sz queue in the DNS workflow of Figure 2.2). A user directly inserts blocks into such queues from an external source, such as packet captures, data copied from an external system, or locally stored data that will be reprocessed. Due to the semantics of our queue management system (initially used in an extended version of [HBP+05]), each queue has an owner. Plumb makes queues on behalf of multiple users, and Plumb takes ownership of all queues.

Function and synopsis:
listDerivedQueues (from queue id): Prints the list of queues that are derived from the source queue (i.e., from queue id).
createQueue (queue id, data type): Creates a new disconnected queue with the name queue id and of the specified type. This type remains the same for the whole life of the queue. The user calling this function must have permissions to receive data of data type (i.e., the user must belong to the Linux and HDFS groups associated with the data type).
deleteQueue (queue id): Deletes the queue referred to by queue id if the caller is the owner of this queue. Additionally, this function deletes all files in the queue (all references, and files whose reference count has reached 0).
connectDerivedQueue (from queue id, to queue id): Connects two queues so that data can flow from from queue id to to queue id. If to queue id has multiple parent queues, then the client should call this function multiple times with the appropriate from queue id. The call succeeds only if the following two conditions are met: the calling user must be a member of the Linux group of the data type of the source queue (i.e., from queue id), and the calling user must be the owner of the destination queue (i.e., to queue id).
activateQueue (queue id): Activates data delivery for queue id. When a new queue is created, it is in the deactivated state by default. In that state, the queue cannot receive any data. For data reception in the queue, this function must be called.
deactivateQueue (queue id): Deactivates the queue, effectively stopping data delivery to this queue.
distributeLocal (from queue id, localfile): Delivers "instances" of localfile to all active queues derived from from queue id.
enqueueLocal (to queue id, localfile): If to queue id is active, then this function adds localfile to the queue. Only the owner of the queue (i.e., to queue id) can call this function on his or her queue.
queueSize (queue id): Provides the size of the queue specified by queue id.
queueFiles (queue id): Lists files in the queue that are in status "UNRESERVED", ordered by file modification time (oldest file first).
queueFileOlder (queue id, hdfs file uri, ageseconds): Checks whether the file's age (based on its modification time) is greater than ageseconds.
queueReserveTop (queue id, localtmp): Changes the status of the oldest file in the queue to "RESERVED" and returns its URI. Downloads the file locally and saves the local file path into localtmp.
queueReserveFile (queue id, localtmp, file uri): Changes the status of the file file uri in the queue to "RESERVED". Downloads the file locally and saves the local file path into localtmp.
queueUnreserve (queue id, file uri): Changes the status of file uri in the queue to "UNRESERVED".
queueWipe (queue id): Deletes all files (irrespective of their state, i.e., RESERVED, UNRESERVED, FAULTY, etc.) in the queue named queue id.
queueRelease (queue id, file uri): Removes the file file uri from the queue because its processing is complete and it is no longer needed.
markAsFaulty (queue id, file uri): Marks file uri as faulty. Doing so stops this file from being considered for further processing (reserve functions will not consider faulty files). Faulty files consume disk space.
markAsHealthy (queue id, file uri): Marks a faulty file uri (a URI that has timed out in the RESERVED state) as healthy and a candidate for processing. Doing so, this file is considered for further processing (reserve functions will consider UNRESERVED files).
listFilesInStateX (queue id, file state): Lists files in a specific queue with state file state.
queueList (queue id, data type=OPTIONAL): Prints the list of queues that are in the system. If the user provides data type, then only queues of the specified type are returned. A user calling this function must be the owner of the source queue.
queueDu (queue id): Finds the number of bytes consumed by files in any state that are in the queue.
queueUserDu (): Finds the queues whose owner is user id, then finds and returns the size of all files in such queues.
queueUser2queues (): Provides queue ids given user id. For our system, Plumb owns all queues.
queue2user (queue id): Provides the queue owner given queue id. The caller must be the owner of the given queue. Only an admin is allowed to query other users' queues.
getEnqueueURI (to queue id, localfile): If to queue id is active, then this function gives a URI in HDFS where the client writes the output data. Only the owner of the queue (i.e., to queue id) can call this function on his or her queue.
enqueue (to queue id, hdfs uri): If to queue id is active, then this function adds hdfs uri to the queue. Only the owner of the queue (i.e., to queue id) can call this function on his or her queue.
getDistributeURI (from queue id, localfile): Returns an HDFS URI where the client can write output data.
distribute (from queue id, hdfs uri): Delivers "instances" of hdfs uri to all active queues derived from from queue id.
copyFromLocalToHDFS (localfile uri, hdfs uri): Copies a local file into HDFS.
queueReserveTopWithoutLocalCopy (queue id): Changes the status of the oldest file in the queue to "RESERVED" and returns its URI.
doErrorRecovery (queue id, how old, max fault count): Finds all files in queue id that have been in the reserved state for more than how old time. If a file's retry count will not reach max fault count after bumping it by one, then do so and change the file status to unreserved; else bump the retry count and mark the file as faulty.
queueReleaseUnconditional (queue id, file uri): Removes the file file uri from the queue regardless of its status (RESERVED, UNRESERVED, etc.).
queueDeactivateAllQueues (): If an admin is calling this function, then disable all the queues in the system.
Table A.2: Queue API.
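The sketch below strings several of these calls together into a typical derived-queue setup and worker loop. It assumes a hypothetical Python binding named plumb_queue whose functions mirror Table A.2; the module name, queue names, local path, and timeout values are illustrative, and the real interface may instead be a set of command-line tools.

    # Hypothetical Python binding for the queue API of Table A.2. The module
    # name, queue names, local path, and timeout values are assumptions.
    import plumb_queue as q

    def process_block(path):
        """Stand-in for real per-block work (e.g., decompress and anonymize)."""
        pass

    # Set up a derived queue fed by the raw pcap queue (args: queue id, data type).
    q.createQueue("pcap.anon.sz", "pcap.anon.sz")      # new queues start deactivated
    q.connectDerivedQueue("pcap.sz", "pcap.anon.sz")   # data flows pcap.sz -> pcap.anon.sz
    q.activateQueue("pcap.anon.sz")                    # begin receiving blocks

    # Worker loop: reserve the oldest block, process it, release it on success.
    localtmp = "/tmp/plumb-block"                      # local copy of the reserved block
    uri = q.queueReserveTop("pcap.sz", localtmp)
    try:
        process_block(localtmp)
        q.queueRelease("pcap.sz", uri)                 # done; the block can be garbage collected
    except Exception:
        q.queueUnreserve("pcap.sz", uri)               # make it available again for a retry

    # Periodic recovery of blocks stuck in RESERVED: retry them, then mark FAULTY
    # after max fault count attempts (the time unit here is an assumption).
    q.doErrorRecovery("pcap.sz", 4 * 3600, 3)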
A.2 Block Naming Conventions and API

Plumb requires specific names for blocks to support deduplication (§2.2.5), for example 20200505-151105-01789592.lax.pcap.sz and 20200504-221112-00000001-us east 1f-1588617573-ac1f337b-2001187804000001000000008009a092.aws.pcap.sz. Table A.3 summarizes the naming conventions. Plumb provides an API for developers to extract parts of the name from a block (Table A.4).

Field, length limits, and comment:
Event Time in UTC (min 15, max 15, mandatory): Time (in UTC) when this specific data was collected or generated. Example: 20200414-003030. It must be followed by a -.
Sequence Number (min 1, max 767, mandatory): A monotonically increasing integer. Together with the queue sequence number instance, it is unique in a queue. It must be followed by a non-number (either - or .).
Additional Info (min 0, max 767, optional): User-provided metadata. It is present if the sequence number was followed by a -. A dot is not allowed in this field.
Queue Sequence Number Instance (min 1, max 767, mandatory): Some application-specific way to generate a sequence number. It is the application's responsibility to make sure (sequence number + queue sequence number instance) is unique in a queue. The character dot (.) is not allowed in this field, and it will be preceded and followed by a dot.
Queue name (min 1, max 255, mandatory): As per the YAML specification, the final string should be a queue name. An exception is the None queue that we use to send data out. This field CAN have dots in it.
Table A.3: Block naming convention.

Function and synopsis:
get event time in utc (blockName): Given a block name, extract the event time from it. A block's event time is based on the time of the first data item in the block.
get sequence number (blockName): Given a block name, get the sequence number. Sequence numbers are unique per site, and they monotonically increase. Sequence numbers are important for data completeness.
get additional info (blockName): Given a block name, get the additional info section. The data capturing system might append useful information (for example, the specific server's identity inside the cloud) for analytics.
get queue sequence number instance (blockName): Given a block name, get the sequence number instance. For systems inside a public cloud, servers might cycle over time. The sequence number instance conveys such information to analytics.
get queue name (blockName): Given a block name, get the queue name. Block names always end with a specific queue name. These conventions allow for an additional sanity check for correct data flow in the system.
Table A.4: API to extract parts of block names.
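To make the convention concrete, the sketch below parses a block name into the fields of Table A.3 and is one way the accessors of Table A.4 could be implemented. The regular expression is our reading of the rules and of the two example names above, not code taken from Plumb, and the helper name is ours.

    import re

    # A sketch of a parser for the naming convention in Table A.3 (assumed form,
    # derived from the rules and example names above, not Plumb's own code).
    BLOCK_NAME_RE = re.compile(
        r"^(?P<event_time>\d{8}-\d{6})"       # event time in UTC, e.g. 20200505-151105
        r"-(?P<sequence_number>\d+)"          # monotonically increasing sequence number
        r"(?:-(?P<additional_info>[^.]+))?"   # optional user metadata (no dots allowed)
        r"\.(?P<queue_seq_instance>[^.]+)"    # queue sequence number instance
        r"\.(?P<queue_name>.+)$"              # queue name; may itself contain dots
    )

    def parse_block_name(name: str) -> dict:
        """Split a block name into its named fields, or raise if it does not conform."""
        m = BLOCK_NAME_RE.match(name)
        if m is None:
            raise ValueError(f"not a valid Plumb block name: {name}")
        return m.groupdict()

    # Example from the text:
    print(parse_block_name("20200505-151105-01789592.lax.pcap.sz"))
    # {'event_time': '20200505-151105', 'sequence_number': '01789592',
    #  'additional_info': None, 'queue_seq_instance': 'lax', 'queue_name': 'pcap.sz'}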
A.3 User Job Submission API

Plumb developers can easily submit or retract their jobs, and can also inspect the currently available data in YAML text and graphical form. Table A.5 summarizes the job submission API. An example of a YAML file (for job submission) is Figure 3.3. A developer can submit a new job to the system, and if there are no issues (for example, syntax errors or data access errors), the job status becomes submitted. Submitted jobs are those that are currently not active and are waiting. Plumb consistently moves from one system state to the next, and the current implementation waits for all currently running jobs to finish before incorporating any changes. After some wait (until the currently running jobs complete), all submitted jobs change status to accepted and become fully active. If a user's job was rejected, a copy of the job is saved that the user can query. If a user wants to retract his or her job, Plumb stops using this specific job, changes its state to retracted, and keeps a copy for user query. To avoid accidental removal of jobs, we currently do not provide a remove API function and keep a copy of a retracted job (we plan to provide a remove function in the future).

Function and synopsis:
submitJob <YAML job file>: Submits the job to the Plumb server after syntax checking. If accepted, Plumb will incorporate it into the optimal workflow.
retractJob <YAML job file>: Retracts an already submitted and active job.
listAcceptedJobs: Lists all the accepted jobs for the current user.
listSubmittedJobs: Lists all the submitted (but not yet incorporated) jobs of the user.
listRejectedJobs: Lists all rejected jobs of the current user.
listRetractedJobs: Lists all the retracted jobs of the current user.
getOptYamlText: Gets the current optimal workflow in YAML format.
getOptYamlDiagram: Gets a pictorial representation of the current optimal workflow.
Table A.5: Job submission API.
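The sketch below shows one way a developer might drive this API from a script. It assumes the functions of Table A.5 are exposed as command-line programs on the path; the job file name is illustrative, and its contents would follow the pipeline-graph format of Figure 3.3.

    # Sketch of a submit-and-check script around the API of Table A.5, assuming
    # the functions are available as command-line programs; the job file name
    # is illustrative and its contents follow the pipeline-graph YAML (Figure 3.3).
    import subprocess

    def plumb(*cmd):
        """Run one Plumb command; per Unix convention, non-zero means failure."""
        return subprocess.run(cmd, capture_output=True, text=True)

    result = plumb("submitJob", "dns-anonymize.yaml")
    if result.returncode != 0:
        print("job rejected:", result.stderr)        # e.g., syntax or data-access errors
    else:
        print(plumb("listSubmittedJobs").stdout)     # waiting to be incorporated
        print(plumb("listAcceptedJobs").stdout)      # active once running jobs finish
        print(plumb("getOptYamlText").stdout)        # the current optimized workflow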
A.4 Block-Streaming API

Plumb runs an instance of the user program (as provided in the YAML job description), streams the input block to stdin (or named FIFOs), and expects the program to write to stdout (or named FIFOs). If there is only one input and one output, Plumb uses stdin and stdout; otherwise, named FIFOs are passed to the application via the command line. The names of these FIFOs conform to the naming conventions (Table A.3) to make it easy for users to discern them without relying on positional arguments. Plumb provides a special tag None for the output when a user program does not want to write a block locally (for example, when sending data over the network to a remote archive). Conforming to Linux conventions, a zero return code indicates success. In case of failure, Plumb retries processing of the same block later. If processing fails after three retries, Plumb marks the block as faulty; such blocks need manual intervention.
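A Block-Streaming stage with a single input and single output therefore reduces to a standard Unix filter. The sketch below is one such stage; the pass-through "processing" is a placeholder for real per-block work such as anonymization, and the script as a whole is our illustration rather than code from Plumb.

    #!/usr/bin/env python3
    # Sketch of a Block-Streaming stage with one input and one output, so Plumb
    # streams the block on stdin and reads the result on stdout. The pass-through
    # body is a placeholder for real per-block work (e.g., anonymization).
    import sys

    def main() -> int:
        try:
            block = sys.stdin.buffer.read()      # one complete input block
            result = block                       # placeholder: transform the block here
            sys.stdout.buffer.write(result)      # one complete output block
            return 0                             # zero: success; the input can be garbage collected
        except Exception as e:
            print("block processing failed:", e, file=sys.stderr)
            return 1                             # non-zero: Plumb retries, then marks the block faulty

    if __name__ == "__main__":
        sys.exit(main())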
A.5 Windowed-Streaming API

Plumb uses an internal API (Table A.6) to track windowing state and to schedule windows. Plumb streams the scheduled window's information in YAML format to the user program (via stdin) and expects the results of the processing in YAML format (via stdout). The structure of the input and output YAML is explained in Table A.7.

Function and synopsis:
startMaintainingWindowState (WindowRequest) returns (WindowReply): Plumb asks the Queue Server to track window state for the specified queues.
getWindowsToSchedule (WindowRequest) returns (WindowReply): Plumb asks the Queue Server to provide any windows ready to schedule.
changeWindowState (WindowOperationRequest) returns (WindowOperationResponse): Plumb changes the state of a window based on processing results.
Table A.6: Plumb internal API to manage windowing.

Key, value type, and synopsis:
input count (an integer): Number of blocks in the input YAML list.
output count (an integer): Number of blocks in the output YAML list.
missing count (an integer): Number of blocks in the missing YAML list.
late count (an integer): Number of blocks in the late YAML list.
hdfs cleanup (an integer): Number of files in the hdfs cleanup YAML list.
local cleanup (an integer): Number of files in the local cleanup YAML list.
input queue count (an integer): Number of input queues connected to this stage.
output queue count (an integer): Number of output queues connected to this stage.
input <queueName> (YAML list): HDFS URIs of input in queue <queueName>.
output <queueName> (YAML list): HDFS URIs of output in queue <queueName>.
missing <queueName> (YAML list): HDFS URIs of missing blocks in queue <queueName>.
late <queueName> (YAML list): HDFS URIs of late blocks in queue <queueName>.
hdfs cleanup list (YAML list): HDFS URIs to delete.
local cleanup list (YAML list): Local file paths to delete.
ret Code (an integer): Processing return code. 0 is success; all other codes indicate failure.
err message (string): Error message if ret Code is not zero.
data from time (date): Window start date.
data to time (date): Window end date.
data from block (an integer): Window start block number.
data to block (an integer): Window end block number.
Max execution time (an integer): Time limit on the processing, in minutes.
del late <queueName> (boolean): Delete files in late <queueName> if the value is True.
Table A.7: Input and output structure for Windowed-Streaming based applications.
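As an illustration, a minimal Windowed-Streaming stage might look like the sketch below: it reads the window description from stdin and writes a result to stdout using keys from Table A.7. The queue name, the PyYAML dependency, and the exact set of result keys emitted are our assumptions, not a required minimum.

    #!/usr/bin/env python3
    # Minimal sketch of a Windowed-Streaming stage. Plumb delivers the scheduled
    # window as YAML on stdin and reads a YAML result from stdout (Table A.7).
    # The queue name "rcode.fsdb" and the PyYAML dependency are assumptions.
    import sys
    import yaml

    window = yaml.safe_load(sys.stdin.read())

    in_queue = "rcode.fsdb"                          # hypothetical input queue for this stage
    blocks = window.get("input " + in_queue, [])     # HDFS URIs of the window's blocks

    # ... aggregate the window's blocks here (e.g., daily MapReduce-style statistics) ...

    result = {
        "ret Code": 0,              # 0 tells Plumb the window succeeded
        "err message": "",
        "output count": 0,          # this sketch writes no blocks back to a queue
    }
    sys.stdout.write(yaml.safe_dump(result))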
A.6 Stateful-Streaming API

Plumb streams binary data to the application instance and allows the application to manage late or duplicate arrivals. Plumb also enables applications to store key-value state inside the meta store. Applications that use external state managers can store external state information using Plumb APIs (Table A.8).

The Plumb API for streaming has two types of functions: fetching data with fault tolerance, and state storage and retrieval. A typical application will ask Plumb if the next data is available or not. Plumb enables applications to decide whether they want to wait for missing data or move on. Applications using a continuous stream are usually latency-sensitive, and they might want to control their wait behavior. Plumb provides the URI of the next block to the application. Plumb does not stream the block itself because many applications might need to strip some information from the block before consuming it. Plumb treats user data as opaque for deduplication reasons and cannot do application-specific data processing (we are extending the API to provide such a block manipulation function that Plumb could apply before sending the stream into the application).

API function and semantics:
peek next (stream id), returns (gap, minutesSinceLastProcessNext, uri): Finds whether the next block in stream id is available (indicated by a gap value of 1; gap 0 implies duplicate data, while a negative gap value means a late arrival). If not, reports how long it has been since the last processing.
process next (uri): The application asks Plumb to mark block uri for processing and starts processing it.
get value (key), returns (value): The application can retrieve a value against a key that it stored in Plumb earlier.
put value (key, value): The application can store a key-value pair in Plumb for later use.
list keys (), returns (keys): The application can ask Plumb for all of its keys.
Table A.8: Part of the PULL-API for Stateful-Streaming based applications.
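The loop below sketches how an application might drive this PULL-API. It assumes a hypothetical Python binding (plumb_stream) with underscore-separated versions of the names in Table A.8; the stream identifier, the state key, the treatment of gap values other than 1, 0, and negative, and the wait policy are all illustrative assumptions.

    # Sketch of a Stateful-Streaming consumer over the PULL-API of Table A.8.
    # plumb_stream is a hypothetical binding; the stream id, state key, and the
    # handling of missing data below are assumptions, not Plumb's semantics.
    import time
    import plumb_stream as ps

    STREAM_ID = "dns-continuous"          # illustrative stream identifier

    while True:
        gap, minutes_idle, uri = ps.peek_next(STREAM_ID)
        if gap == 1:                      # the next block in order is available
            ps.process_next(uri)          # mark the block and consume it here
            seen = int(ps.get_value("blocks_seen") or 0)
            ps.put_value("blocks_seen", str(seen + 1))   # long-term state in the meta store
        elif gap == 0 or gap < 0:         # duplicate or late arrival
            ps.process_next(uri)          # this sketch folds it in; an app may instead skip it
        elif minutes_idle > 30:           # data missing and we have waited long enough
            break
        else:
            time.sleep(60)                # latency-sensitive apps may wait only briefly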
Abstract
Ever-increasing data and evolving processing needs force enterprises to scale out expensive computational resources to prioritize processing for timely results. Teams process their organization's data either independently or using ad hoc sharing mechanisms. Often different users start with the same data and the same initial stages (decrypt, decompress, clean, anonymize). As their workflows evolve, later stages often diverge, and different stages may work best with different abstractions. The result is workflows with some overlap, some variations, and multiple transitions where data handling changes between continuous, windowed, and per-block. The system processing this diverse, multi-user, multi-abstraction workflow should be efficient and safe, but also must cope with fault recovery.

Analytics from multiple users can cause redundant processing and data, or encounter performance anomalies due to skew. Skew arises due to static or dynamic imbalance in the workflow stages. Both redundancy and skew waste compute resources and add latency to results. When users bridge between multiple abstractions, such as from per-block processing to windowed processing, they often employ custom code. These transitions can be error-prone due to corner cases, can easily add latency as an inefficiency, and custom code is often a source of errors and maintenance difficulty. We need new solutions to manage the above challenges and to expose opportunities for data sharing explicitly. Our thesis is: new methods enable efficient processing of multi-user and multi-abstraction workflows of streaming data. We present two new methods for efficient stream processing: optimizations for multi-user workflows, and multiple abstractions for application coverage and efficient bridging.

Our first method is a new approach to address challenges from duplication, skew, and ad hoc sharing in a workflow. These algorithms use a pipeline-graph to detect duplication of code and data across multiple users and to cleanly delineate workflow stages for skew management. The pipeline-graph is our job description language that allows developers to specify their needs easily and enables our system to automatically detect duplication and manage skew. The pipeline-graph acts as a shared canvas for collaboration amongst users to extend each other's work. To efficiently implement our deduplication and skew management algorithms, we present streaming data to processing stages as fixed-sized but large blocks. Large blocks have low metadata overhead per user, provide good parallelism, and help with fault recovery.

Our second method enables applications to use a different abstraction in different workflow stages. We provide three key abstractions and show that they cover many classes of analytics and that our framework can bridge them efficiently. We provide Block-Streaming, Windowed-Streaming, and Stateful-Streaming abstractions. Block-Streaming is suitable for single-pass applications that care about temporal or spatial locality. Windowed-Streaming allows applications to process accumulated data (time-aligned blocks to sync with external information) and reductions like summation, averages, or other MapReduce-style analytics. Stateful-Streaming supports applications that require long-term state. We believe our three abstractions cover many classes of analytics and enable the processing of one block, many blocks, or an infinite stream.
Plumb allows multiple abstractions in different parts of the workflow and provides efficient bridging between them so that users can build complex analytics from individual stages without worrying about data movement.

Our methods aim for good throughput, low latency, and clean and easy-to-use support for more applications, to achieve better efficiency than our prior hand-tuned but often brittle system. The Plumb framework is the implementation of our solutions and a testbed to validate them. We use real-world workloads from the B-Root DNS domain to demonstrate the effectiveness of our solutions. Our processing deduplication increases throughput by up to 6× and reduces storage by 75% compared to the pre-Plumb counterparts. Plumb reduces CPU wastage due to structural skew by up to half and reduces latency due to computational skew by 50%. Plumb has cut per-block latency by 74% and the latency of daily statistics by 97%, while reducing code size by 58% and lowering manual intervention to handle problems by 73%, compared to the pre-Plumb system.

The operational use of Plumb for the B-Root service provides a multi-year validation of our design choices under many traffic conditions. Over the last three years, Plumb has processed more than 12 PB of DNS packet data and daily statistics. We show that our abstractions apply to many applications in the domain of networking big data and beyond.