Data and Computation Redundancy in Stream Processing Applications for Improved Fault Resiliency and Real-Time Performance

by

Geoffrey Phi Calara Tran

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

August 2020

Copyright 2020 Geoffrey Phi Calara Tran

Dedication

To my parents Tam and Evelyn, and sister Angelica,
to Chú Hau and Tito Jeriel,
to Windy and Joey,
and to all my family and friends.

Acknowledgments

The path to the completion of this dissertation has been long and filled with twists and dead ends. I once heard an apt analogy that described getting a Ph.D. as a dark tunnel of unknown length. Somewhere along the way, you start stumbling in the dark, too far from both the entrance and exit and unsure of how long it continues. But one day, you make it to the end. The folks mentioned in this acknowledgment, and the countless family and friends that I could not list here, were my lights in the dark that led me out. For that I am eternally grateful.

First and foremost, I would like to thank my family, especially my parents Tam and Evelyn and sister Angelica, for their love, persistent support, patience, and time taken to listen to me vent. Seeing what they each have accomplished provides constant inspiration in all aspects of life. Their support never wavered, even if I did not make it easy at times. Thanks also to Angelica for helping to polish this section. I am grateful for my dogs Joey and Windy, who never failed to perk up my day and were the "goodest" of boys; both are missed every day.

I also would like to thank God for giving me the persistence and ability to power through and complete this program. It would not have been possible without His endless blessings.

Of course, this dissertation would never have been possible without the mentorship and guidance of my advisor, Dr. Stephen Crago. Without his constant support, I most likely would have been demoralized and left the program ages ago. His unending patience and clear guidance enabled me to take the time to explore various research directions while keeping me pointed towards a focused objective. I would like to express my gratitude for Dr. John Paul Walters, who served as a second mentor. He helped immensely to both polish ideas and improve my technical writing, and was always available to chat both remotely and on-site at ISI-East.

I would like to recognize my extended family, who picked me up every time I stumbled and provided guidance and moral support. There are some whose optimism and guidance left a lasting impression. I would most like to thank Nhi for the many calls just to check up on how I was doing, for the time spent playing games or watching shows to unwind, and for helping me smile in spite of the struggles. I would like to acknowledge Michael and MyMy for being incredible role models for the rest of the cousins and always giving needed pep talks. From Cô Út, Cô Ánh, and Cô Kiêu, I sincerely appreciate the encouragement and kicks to get me on track when needed. Their strength supports the entire family. I am grateful to Bác Minh and Bác Gia for their care and concern regarding both academic and outside problems, and Matt for expanding my culinary horizons. I would also like to thank Tita Emily, Tito John, Tito Eugene, and Tita Emma for housing me and supporting me whenever I visited Oahu to destress and recover. Thank you to Tito Jun and Tita Elma for making sure that I saw more of Los Angeles. Last, but not least, I would like to thank my grandparents on both sides. Their sacrifices enabled the growth and success for so many of their children and grandchildren.
Furthermore, I am fortunate to have received the guidance and assistance of many in the University of Southern California and Information Sciences Institute. I would also like to thank Dr. Viktor Prasanna and Dr. Aiichiro Nakano for serving on both my qualification and defense committees, in addition to providing suggestions to improve my dissertation. Similarly, I would like to recognize my qualification exam committee members Dr. Alice Parker and Dr. John Silvester for their time and useful feedback on my proposal. I sincerely appreciate the numerous researchers, administrative assistants, and staff members that helped me throughout the years, especially: Diane Demetras, with academic advising; Claire Jansa, with travel and reimbursements; Barbara Dixon, with meetings and scheduling; Melissa Snearl-Smith, with getting the research assistantship logistics in order; Lorna Kaludis, with expense processing; and Robin Roundtree, with scheduling and logistics at ISI-West. I would like to acknowledge Dr. David Kang, Dr. Mikyung Kang, Dr. Kaushik Datta, and Dr. Andrew Rittenbach for their support and encouragement, and thank Janice for editing and reviewing numerous papers, my proposal, and this dissertation.

I would be remiss to not recognize the support of the friends I have met along the way. Every one of them has impacted my life in such a meaningful way that I cannot imagine life without them. First, thank you to Kelson for always being reliable and providing big-picture perspectives on everything since the day we met, and Todd for checking in and reminding me that there is more than just studies. I could not forget my dear friends Kellen, HanHa, April, Erik, Yousuke, and Jon, who got me through undergrad. I am grateful to Cody and Mitchell for keeping me sane as I worked on the screening exam. Teale, Steph S., Roxy, Kyle, Ashley, Steph H., Sam, and Eric helped pull me out of one of the lowest times in my life with their kindness, even if they may not know it. I could not have finished this dissertation without Sanmukh's positivity, or his countless motivational talks and brainstorming sessions over bowls of ramen. I would also like to thank Victor and Max for sharing their time at USC with me; both will go on to do great things. I am grateful for Austin, for always providing a positive outlook on whatever life threw and for friendship since childhood. Yuchen provided enthusiastic encouragement and was a great surf buddy. I could not have gotten this far without Jacob, who helped me overcome the wrenches life threw and was always available to hear me vent about the various issues of the day. I would like to thank Vivi, Justin, Jayden, and Holly for their unconditional love and for always taking time to go to Disneyland with me. Finally, I would like to express my gratitude to Lesther for the encouragement to start writing my proposal and dissertation at a time when I seemed to be dragging things on: "The best proposal/thesis is the finished one."

I would like to acknowledge the organizations that have assisted me in one way or another. First, I would like to thank the University of Hawaii at Manoa, which built my undergraduate foundation. Many thanks to Professors Tep Dobry, Galen Sasaki, and Wayne Shiroma for sparking my passion for engineering and pointing me towards this path. I would like to express my gratitude to the Office of Naval Research for their support, most recently through grant #N00014-16-1-2887. I am indebted to the University of Southern California's Center for High-Performance Computing (hpcc.usc.edu) for providing a platform for my work.
I am also grateful for the technical support staff's time spent debugging the bizarre and unusual problems that would inevitably come up.

Overall, I have been blessed with the support and love from so many folks. Although I could not list everyone, I truly appreciate and will be forever thankful for their contribution to this work. While this dissertation represented the end of one chapter in life, it is also the first piece of a larger body of future work and the first steps into the "real world." I am relieved to have this much support for wherever it leads.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Background and Definitions
    1.2.1 Stream Processing
    1.2.2 Real-Time Performance Requirements
  1.3 Target Platform
    1.3.1 Stream Processing Execution Framework
    1.3.2 Computational Resource Manager and Scheduler
    1.3.3 Deployment on The University of Southern California's Center for High-Performance Computing
  1.4 Redundancy for Fault Tolerance and Performance
  1.5 Dynamic Resource Allocation and Scaling
  1.6 Thesis Statement
  1.7 Research Contributions
  1.8 Dissertation Outline

Chapter 2: Related Work
  2.1 Stream Processing
  2.2 Fault Tolerance
    2.2.1 Fault Tolerance in Stream Processing
  2.3 Dynamic Allocation in Stream Processing

Chapter 3: Data and Computation Redundancy for Improved Fault Resiliency in Stream Processing Applications
  3.1 Fault Classes
    3.1.1 Computational Performance Faults
    3.1.2 Failed Communication Faults
      3.1.2.1 Single-Shot Occurrence Model
      3.1.2.2 Correlated Occurrence Model
    3.1.3 Failed Instance Faults
    3.1.4 Fault Injector
  3.2 Design and Experimental Evaluation of Redundancy Implementations
    3.2.1 Simplified Application Patterns
      3.2.1.1 Experimental Evaluation
    3.2.2 Generalized Approach
      3.2.2.1 Stream Groupings
      3.2.2.2 Operator Input/Output Ratio and Topology Patterns
      3.2.2.3 Redundancy Granularity
      3.2.2.4 Experimental Evaluation
    3.2.3 Progress Tracking Granularity
      3.2.3.1 Coarse-Grained Progress Tracking
      3.2.3.2 Fine-Grained Progress Tracking through Custom Acknowledgments
      3.2.3.3 Experimental Evaluation
Chapter 4: Matching Fault Resiliency to Runtime Load and Resource Availability
  4.1 Dynamic Allocation for Simplified Stream Processing Model and Injection
    4.1.1 Experimental Evaluation
  4.2 Dynamic Allocation for Generalized Stream Processing and Redundancy Injection
    4.2.1 Experimental Evaluation

Chapter 5: Conclusion and Future Work
  5.1 Broader Impacts
  5.2 Future Work
  5.3 Conclusion

Acronyms

Reference List

List of Tables

2.1 System dependability attributes and definitions
3.1 Increased redundancy results in improved real-time performance
3.2 Dynamo imposes minimal overhead relative to baseline
3.3 Tuple identifier fields prevent aliasing
3.4 Measured topology input rates for generalized stream processing model experiments
3.5 Performance comparison across redundancy amounts for topologies (uniform at all operators)
3.6 Average throughput for topologies
3.7 Heron provides a simple interface for progress tracking
3.8 Maximum topology input rates for generalized stream processing model experiments with fine-grained progress tracking
3.9 Higher operator input/output ratios increase exposure to runtime faults
4.1 Dynamically managing resources results in similar performance to sufficient static allocation
4.2 Dynamically managed redundancy reduces the number of missed deadlines
4.3 Redundancy reduces real-time performance jitter

List of Figures

1.1 Stream processing applications are made up of discrete operations
1.2 Heron example wordcount topology is made up of three parallelizable operations
1.3 Tuples flow between operators in stream processing applications
1.4 Stream groupings define the flow of tuples between operators
1.5 Apache Storm uses tuple trees for guaranteeing message processing
1.6 Heron translates user topology logical plans to physical plans as part of its fine-grained execution model
1.7 Apache Aurora interfaces with Apache Mesos to enable long-running jobs
1.8 Typical experiment script deployment on USC's HPC
3.1 Computational performance faults decrease the real-time performance of stream processing applications
3.2 Communication failures cause degraded real-time performance
3.3 Typical tuple processing on Heron has multiple stages of communication
3.4 Fault injector wrapper class enables injection of faults at different phases of execution
3.5 An initially restricted stream processing model allows for simplified initial study into redundancy applications
3.6 Redundancy enabled continued progress in the presence of faults
3.7 Dynamo distributes redundant tuples to different operator instances
3.8 The number of active and passive operator instances determines the amount of redundancy in the topology
3.9 Dynamo uses a modular wrapper implementation on Twitter's Heron
3.10 Image processing workload for evaluation of simplified stream processing model
3.11 Computational performance faults drastically deteriorate real-time performance
3.12 Redundancy reduces the number of tuples with prolonged latencies
3.13 Tradeoff between resource usage and improved resiliency
3.14 Stream groupings define the flow of tuples between operators
3.15 Generalized stream groupings enable support for redundancy in fields grouping
3.16 Redundancy introduces opportunities for aliasing in intermediate tuples
3.17 Fine-grained redundancy increases the flexibility for distributing redundant tuples
3.18 Comparison of improved tail-latencies (uniform redundancy)
3.19 Comparison of Dynamo versus baseline runtime latency (uniform redundancy)
3.20 Deadlines missed when replicating at a single operator
3.21 Comparison of overhead for bolts across redundancy levels (uniform redundancy)
3.22 Root and intermediate tuple example
3.23 Fine-grained acknowledgments increase redundancy effectiveness
3.24 Heron tracks acknowledgments through the use of a tuple tree
3.25 Modular approach enables easy addition of fine-grained progress tracking layer
3.26 Custom acknowledgments enable fine-grained progress tracking
3.27 Limited scalability of Grep topology
3.28 Fine-grained acknowledgments reduce mean and tail latencies
3.29 Redundancy increases performance resiliency for all topologies
3.30 Larger improvements in tail-latency as redundancy increases
3.31 Failed tuple percentages across communication error parameters
3.32 Overhead for computation scales slower than for communication
3.33 Performance improvements and resource overheads over redundancies
3.34 Redundancy increases resiliency to recoverable single occurrence failed instances
3.35 Redundancy increases resiliency to aggregated persistent failed instances
4.1 A monitor can make use of runtime metrics feedback and an existing performance model to dynamically manage resources
4.2 Image processing workload for evaluation of simplified stream processing model
4.3 Insufficient resources cause deteriorated real-time performance
4.4 More frequent faults cause more missed deadlines
4.5 Increasing resources blindly can cause more missed deadlines
4.6 Comparison of per-tuple performance
4.7 Runtime metrics are used to dynamically adjust fault resiliency
4.8 Efficient redundancy level varies based on workload characteristics relative to latency bound
4.9 Resiliency manager dynamically seeks out efficient redundancy level
4.10 Resiliency manager can trade off increased resiliency for lower resource overheads
5.1 Triple-modular redundancy enables single error correction/double error detection

Abstract

The advent of faster Internet connections and affordable yet highly-performant resources in the cloud has sparked a movement towards providing dynamic and interactive services. As a result, data analytics and telemetry have become paramount to monitoring and maintaining quality-of-service in addition to business analytics. Stream processing, a model where a directed acyclic graph of operators receives and processes continuously arriving discrete elements, is well-suited for these needs. Current frameworks boast high throughput, low latency, and fault tolerance through continued execution.

However, the per-data-element latencies are important to meet real-time performance constraints. These constraints become important in the context of interactive applications and user quality-of-service guarantees. Furthermore, runtime faults can degrade the real-time performance more drastically than aggregate metrics reveal. The fault tolerance touted by state-of-the-art frameworks does not address this issue. Furthermore, the application loads for stream processing applications can be dynamic. While existing works study the runtime allocation of resources to match the load, these works do so without considering the per-data-element real-time performance implications.

In this work, we address these issues by developing a framework for increasing the resiliency to intermittent run-time faults by introducing data and computation redundancy through the replication of data elements in stream processing applications with at-least-once semantics. We first study a simplified model as a proof of concept, then increase the complexity to generalize the supported applications. Our work studies the effects of a range of fault types including computational performance, failed communication, and failed stream operator instance faults. While our results show that redundancy can be highly effective in mitigating the effects of these faults (78% tail latency reduction, 60% mean latency reduction, 80% missed deadline reduction), the gains do come with substantial overheads (40% computational resource overhead and 100% communication overhead). However, we show that it is possible to dynamically manage resiliency to take advantage of both the multi-modal behavior in the application and to adapt to intermittent faults. This dynamic management can reduce the real-time violations over a baseline framework while reducing the resource overhead from a static level of redundancy injection (98%+ reduction in windows with constraint violations using 0.15x-0.29x computation and 0.13x-0.31x communication overheads compared to static redundancy).

Chapter 1: Introduction

Stream processing, a programming paradigm where a set of dataflow operations process discrete data elements, has become increasingly relevant to real-time large-scale data processing. For example, stream processing is well-suited towards data analytics applications because the way in which data arrives and the processing patterns resemble a Directed Acyclic Graph (DAG) of operations.
Furthermore, the advent of faster Internet connections and affordable yet highly-performant resources in the cloud has sparked a movement towards providing dynamic and interactive services over the web, including online gaming, in-browser office suites, and the plethora of social media applications. This migration of services and products to the cloud and Internet has brought with it an increase in the importance of online and real-time data analytics and telemetry. These applications commonly have real-time constraints, including, but not limited to, latency and throughput. Stream processing has become popular as one way to model, develop, and deploy these applications. While models such as batch processing, MapReduce, and Spark are still relevant, stream processing is more commonly used to run workloads that have real-time constraints. There are a number of active stream processing frameworks that target high performance and reliability. However, while frameworks are reliable from the perspective of continuity of operations, the issue of how to provide resilient performance remains. In this chapter, we first introduce stream processing, the motivations for our work, and a brief background, and present our thesis and contributions.

There are numerous examples of real-time stream processing workloads. Many of these examples are forms of data analytics. For example, one can identify trending topics and process searches in social media using stream processing [17]. Stream processing can also be used to process frames from video streams for purposes such as image recognition [38]. Another example workload is Extraction, Transformation, and Loading (ETL) of data. For instance, a stream processing application can sit in front of a data ingestion source and clean data before storing it in a database. These examples illustrate scenarios where applications have constant, long-lived executions and where responsiveness is paramount.

There are currently many actively used stream processing frameworks. Apache Storm (Storm) is one of the more popular projects. Storm presents a simple abstraction where stream operators are categorized as sources (spouts) or processors (bolts) [1]. Twitter's Heron is a high-performance framework that adapts Storm's programming abstractions but changes the underlying implementation and execution model [33]. It arose from a need to address performance and practical issues in Storm. Trident is an alternative high-level abstraction for Storm that provides stateful stream processing by trading performance for reliable processing semantics using microbatching [4].

These frameworks aim to provide continuity-of-operations by detecting stalled or failed processes and restarting them elsewhere. However, these methods do not apply to other types of runtime faults that can affect performance. Some examples of these faults are computational performance faults, where an operation takes longer than expected due to issues such as resource starvation or garbage collection pauses, and failed communication faults. Furthermore, frameworks and benchmarking efforts focus on aggregate metrics such as the mean latency and throughput. These types of metrics hide the effects of the aforementioned runtime faults on real-time performance.

In this work, we address the problem of how runtime faults can affect stream processing applications. These runtime faults cause real-time performance errors which affect both average and tail latency. Studies show that high tail latency can negatively affect revenue streams for service providers [20, 47, 37].
Given that cloud-based applications are commonly critical, long-lived services, it is imperative to provide developers with a mechanism for increasing resiliency to these errors. We leverage common techniques for fault-tolerance and resiliency and apply them in the real-time context of stream processing. Foundational work in the fault-tolerance field defines redundancy as the key to fault tolerance [27]. We study opportunistically replicating work (tuples in the topology) to improve the resiliency to real-time performance errors and reduce latency constraint violations. Furthermore, our results show that this technique can also improve the real-time performance and reduce the tail latency.

The remainder of this chapter is organized as follows: Section 1.1 presents our motivation for this work, while Section 1.2 introduces a brief background. We then detail the target platform and deployment environment for this work in Section 1.3. Next, we give a brief introduction to relevant areas for our study, redundancy for fault-tolerance and performance and dynamic resource allocation, in Sections 1.4 and 1.5, respectively. Finally, we present our thesis and research contributions in Sections 1.6 and 1.7, followed by an outline of the complete dissertation in Section 1.8.

1.1 Motivation

Stream processing evolved to fulfill a need for scalable, real-time data processing. Previous models, such as MapReduce and Apache Spark, scale well due to their restrictive abstractions. However, under these models, data starts and ends processing on disk storage, thereby limiting real-time response. Stream processing is flexible, as data can arrive from a variety of sources, and result storage is also versatile. As the data arrives and stays in memory, this model is well-suited to real-time, online processing.

However, stream processing as a model does not solve all of the problems around real-time processing. In practice, deployments of stream processing systems are susceptible to a wide range of faults that can deteriorate performance. Some examples of these faults are computational performance faults (garbage collection pauses and excessive context switching overheads due to resource contention) and failed communication faults (dropped connections due to saturation, unreliable links). These faults cause real-time performance errors, where the per-data-element processing latency is increased, and drastically affect the tail latency of stream data elements.

It is these faults and their effects that motivate our work. We study these faults and solutions to mitigate their effects on real-time performance and to increase performance resiliency. In today's computing world, companies provide services at massive scale. The result of this is that even small delays can add up to large losses in profit. Resiliency and fault-tolerance are mature and well-studied areas. The key difference is that in the stream processing area, existing works focus on keeping the applications online and running. In other words, these works target continuity-of-service. The effects on real-time performance, including tail latency, missed deadlines, and throughput, have not been considered until our work. We show that there are ways to both increase the resiliency to the aforementioned errors and reduce the tail latency of stream processing applications.

1.2 Background and Definitions

In this section, we present the background for stream processing and real-time performance. First, we introduce the stream processing model and frameworks that implement that model in Section 1.2.1. While there are a number of abstractions for this model, our work focuses on the model used by Apache Storm [1] and Twitter's Heron [33].
Then, we introduce real-time performance requirements and studies in Section 1.2.2. Violating these constraints can result in large amounts of lost revenue for companies and service providers. We then present related work in Chapter 2.

1.2.1 Stream Processing

Stream processing is a programming model where the input data is partitioned into a stream of discrete elements and processed by a number of stream operators in a pipelined manner. A collection of these stream operators defines a stream processing application. Data arrives continuously with no assumptions made on a beginning or ending of data. Furthermore, data may arrive periodically or intermittently. Therefore, stream processing is suitable for online real-time processing of data [61, 34]. Some example applications are data analytics, such as social media trends, Internet of Things (IoT) sensor processing, network monitoring, and dataset ETL. Stream processing exists at multiple levels of abstraction, from hardware support in processors to cloud-scale programming frameworks. The focus of our work is on high-level stream processing frameworks.

While the abstractions may vary slightly from framework to framework, a stream processing application is generally represented as a DAG of stream operators. Figure 1.1 shows an example of a wordcount stream processing application.

Figure 1.1: Stream processing applications are made up of discrete operations

Our work targets the abstraction provided by Storm [1] and Twitter's Heron [33]. Although both frameworks use the same abstraction, our work uses the Heron implementation. The Heron API provides a low-level abstraction where developers define both the operators and how data should flow between them. Under this model, each application is represented by a DAG of stream operators and is known as a topology. Source stream operators that only send data are referred to as spouts. Stream operators that receive data, and potentially send data downstream, are known as bolts. Each operator may be parallelized into a number of instances. In the example shown in Figure 1.2, the Sentence spout has three parallel instances while the SplitSentence and Count bolts have three and four instances, respectively.

Figure 1.2: Heron example wordcount topology is made up of three parallelizable operations

The input stream is split into elements known as tuples. The Heron API provides two types of message processing semantics: none (no guarantees that data is processed), or at-least-once (a tuple emitted by a spout is guaranteed to be processed by all required stream operators). For at-least-once semantics, a tuple is re-sent from the spout if it has not been fully processed by the end of a configurable time-out.
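To make the abstraction concrete, the following is a minimal sketch of how the wordcount topology in Figure 1.2 could be declared against the Storm-compatible API that Heron adopts. The spout and bolt classes (SentenceSpout, SplitSentenceBolt, WordCountBolt) are hypothetical stand-ins for user-defined operators.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Three parallel spout instances emit sentence tuples.
        builder.setSpout("sentence", new SentenceSpout(), 3);

        // Three SplitSentence bolts; shuffle grouping spreads tuples
        // across instances with no placement guarantees.
        builder.setBolt("split", new SplitSentenceBolt(), 3)
               .shuffleGrouping("sentence");

        // Four Count bolts; fields grouping routes tuples with the same
        // "word" value to the same instance so counts stay consistent.
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumAckers(1); // enable at-least-once tuple tracking
        StormSubmitter.submitTopology("wordcount", conf, builder.createTopology());
    }
}
```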
Other frameworks abstract stream processing applications at a higher level. For example, Apache Flink [32] and Trident [4] allow developers to define operators through a set of limited transformations such as Map, FlatMap, Filter, Fold, Reduce, and Aggregate. Furthermore, the execution model groups data into microbatches. The benefit is that these frameworks can allow for stronger guarantees such as exactly-once message processing semantics and stateful processing. However, the trade-off is reduced flexibility and increased per-tuple latency due to the additional queuing delays introduced by the microbatches.

1.2.2 Real-Time Performance Requirements

Real-time performance has become increasingly important in today's computing environment. Developers and companies are moving more of their services and products to clouds instead of the locally-executing single download scheme of the past. The advent of IoT devices provides another source of data for online analytics [48]. Furthermore, telemetry and data analytics are more frequently used to improve products. All of these allow for more dynamic products but also require high real-time performance, such as low tail latency, to provide an acceptable user experience.

However, IoT devices are not just sources of data. Applications can take data from IoT devices, process them, and provide commands or other actionable knowledge back to the devices themselves. Depending on the application, the value of this knowledge decreases over time. Therefore, again there is a need for low latency and high real-time performance.

There are multiple studies that report varying losses in revenue due to latency delays. At these corporations' scales, even marginal decreases in revenue add up to a large amount. In [20, 47], Google and Microsoft report a 0.20% to 0.59% reduction in search volume due to latency increases of 100 ms to 400 ms. Amazon further reported a 1% loss in sales for 100 ms of delay [37]. Recent work shows that these magnitudes of delays are real and can occur frequently. An example source of timing delays is garbage collection. For example, [36] characterizes the ParallelOld and G1 garbage collectors and shows that the average garbage collection pause ranges from sub-milliseconds to hundreds of milliseconds.
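Throughout this work, real-time performance is judged per data element rather than in aggregate. As a rough illustration of the metrics involved, the hypothetical helper below tracks a window of per-tuple latencies and reports the tail (99th-percentile) latency and the deadline-miss rate; it is an explanatory sketch, not part of the dissertation's framework.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Tracks per-tuple latencies over a sliding window (hypothetical helper). */
public class LatencyWindow {
    private final ArrayDeque<Long> window = new ArrayDeque<>();
    private final int capacity;
    private final long deadlineMs;

    public LatencyWindow(int capacity, long deadlineMs) {
        this.capacity = capacity;
        this.deadlineMs = deadlineMs;
    }

    public void record(long latencyMs) {
        if (window.size() == capacity) {
            window.removeFirst(); // evict the oldest sample
        }
        window.addLast(latencyMs);
    }

    /** Fraction of tuples in the window that missed their deadline. */
    public double missRate() {
        if (window.isEmpty()) return 0.0;
        long misses = window.stream().filter(l -> l > deadlineMs).count();
        return (double) misses / window.size();
    }

    /** 99th-percentile (tail) latency of the window. */
    public long p99() {
        if (window.isEmpty()) return 0L;
        long[] sorted = window.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(sorted);
        return sorted[(int) Math.floor(0.99 * (sorted.length - 1))];
    }
}
```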
1.3 Target Platform

In this section, we present details on the target platform for this work. Our studies target stream processing applications running on the Heron stream processing framework. Heron itself does not do any resource management and instead relies on a pluggable scheduling interface to leverage existing resource management frameworks. Heron supports a wide variety of distributed resource management frameworks. For our work, we deploy Heron using the Apache Mesos and Aurora frameworks. Finally, we use resources provided by the University of Southern California (USC)'s Center for High-Performance Computing cluster to run our experiments at scale.

1.3.1 Stream Processing Execution Framework

Heron is a stream processing framework first developed and maintained by Twitter [33]. Twitter developed Heron primarily to address shortcomings in Apache Storm, another popular framework. While Heron changes the execution model for these applications, it maintains Application Programmer Interface (API) compatibility with Storm. In this section, we first introduce Storm's model before discussing Heron's differences and execution model.

Apache Storm is one of the most popular frameworks for stream processing at the cloud level [1, 9, 52]. As first discussed in Section 1.2.1, Storm provides a low-level abstraction for stream processing applications. The Storm model represents applications (called topologies) as a DAG. The input data is a discrete flow of data elements called tuples. Tuples themselves are a collection of key-value pairs where the keys are called fields. The nodes of this graph are stream operators. Each operator may have one or more parallel instances. Data source operators that send, but do not receive data, are called spouts. Spouts may fetch data from external sources such as Apache Kafka, databases, or social media websites. Operators that receive, process, and potentially send result tuples to downstream operators are called bolts. Tuples flow between these nodes along edges. We illustrate this in Figure 1.3.

Figure 1.3: Tuples flow between operators in stream processing applications

Users specify how data flows between operators using stream groupings. Some examples are all, shuffle, and fields. All grouping means that outgoing tuples are sent to all parallel instances of the downstream operators, while shuffle grouping means that no guarantees are made on the distribution of tuples. In practice, Storm distributes shuffle-grouped tuples to instances in a round-robin manner. Under fields grouping, all tuples that have the same value in user-defined fields will be sent to the same downstream instance. We illustrate the different groupings in Figure 1.4. Figure 1.4a depicts shuffle grouping, where tuples are randomly distributed, while Figure 1.4b depicts fields grouping using the status field.

Figure 1.4: Stream groupings define the flow of tuples between operators. (a) Shuffle grouping distributes tuples with no guarantees. (b) Fields grouping sends tuples with the same status to the same instance.

Finally, Storm provides two options for data processing guarantees. The first, at-most-once, means that tuples are sent once and no guarantees are made that the application will successfully process every tuple. This means that some tuples may not be included in the topology's results due to failures. The second, at-least-once, guarantees that each data element sent from the spout will be fully processed. This means that a tuple could be included in the topology's results multiple times. Storm accomplishes this through an acknowledgment scheme. Storm tracks every time an operator instance sends a tuple to another operator instance. Each downstream instance must then acknowledge each tuple to confirm that it has completed processing that tuple. Storm tracks the pending acknowledgments using a tuple tree as illustrated in Figure 1.5. Each time an instance emits a tuple, Storm adds another node to this tree. In this example, the root tuple is a sentence. An example wordcount application would split the sentence into word tuples and tag the word tuples with the current count. In the Storm model, a tuple fails processing if that tuple is not fully acknowledged by all relevant instances within a user-configurable period. Storm reports this to the spouts through a callback. The spout may then resend the tuple until it is successfully processed. The at-least-once model is a good match for injecting redundancy, as executing tuples multiple times does not violate the processing semantics.

Figure 1.5: Apache Storm uses tuple trees for guaranteeing message processing [8]
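The acknowledgment scheme is visible directly in the bolt API. The sketch below shows how a bolt participates in the tuple tree: emitting with the input tuple as an anchor adds a node to the tree, and acking marks that node complete (SplitSentenceBolt is the hypothetical splitter from the earlier example).

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Splits sentences into words while participating in the tuple tree.
public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split("\\s+")) {
            // Anchoring on the input tuple adds the new tuple to the tree.
            collector.emit(input, new Values(word));
        }
        // Acking marks this node of the tree complete; a missing ack
        // eventually fails the root tuple and triggers a spout replay.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```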
Twitter's Heron is another stream processing framework where low-level control of operator logic and data distribution is exposed to application developers [33, 26]. Twitter developed Heron as an answer to a number of deficiencies in Storm. These deficiencies include difficulty in debugging topologies, poor performance scalability, and dependence on dedicated cluster resources. Heron solves these issues by changing the implementation of Storm's model.

Heron addresses the first deficiency, difficulty in debugging, by leveraging a finer-granularity execution model. Under Storm, different components in a topology are bundled into a single system process. Spouts and bolts run as tasks grouped into executors. Multiple executors run as threads in a worker. Each worker is a single Java Virtual Machine (JVM) process. This is problematic as spouts and bolts can have different execution patterns. The grouping makes reasoning about the resource requirements difficult. Resource allocation is also made difficult by Storm's assumption that all workers are homogeneous. This can lead to wasteful utilization if worker load is not even. Furthermore, tasks write their logs to the same file; this can increase the difficulty of debugging, as messages and exceptions from different instances will be grouped together. Finally, an exception or failure in a single task can result in a failure for the whole worker process, thereby affecting other tasks that are otherwise healthy.

Heron's finer-granularity execution model can be illustrated in two alternate views of topologies as visualized in Figure 1.6. While the topology simply defines the stream operators in a DAG, the logical view further shows the actual parallelism of each of the operators. Finally, the physical view shows the actual distribution of the spout and bolt instances. Heron groups stream operator instances together into containers based on a packing plan specified by the user. Each container runs the stream operator instances and a single local stream manager. The stream manager is responsible for communicating directly with the local operator instances and forwarding tuples to the remote stream managers for other stream instances. The stream managers also track the progress of each tuple as the topology operators process it and report completions and failures to the spout instances. Furthermore, in Heron, each stream operator instance runs as its own Java process. This provides stronger isolation between instances for both execution and logging purposes.

Figure 1.6: Heron translates user topology logical plans to physical plans as part of its fine-grained execution model

The differences in Heron's implementation also improve real-time performance and stability. The improvement is primarily due to the inclusion of a backpressure mechanism. Backpressure allows downstream operators to slow down or halt the input of data at the topology sources if the downstream operators are overloaded. Twitter cites three scenarios for reduced performance that make overloaded operators more likely: high fan-out topologies, long garbage collection pauses, and queue contention. High fan-out topologies increase the probability of a failure in a tuple tree. A single failure causes the total replay of the tuple and its tree. The replays add additional load to the topology. Long garbage collection pauses increase the per-tuple latencies. In the worst case, this can cause failures due to time-outs. Queue contention can happen due to resource oversubscription at the worker level, particularly when a worker is running many executors. Backpressure allows the topology to slow down when operators need relief to recover from any of these scenarios. Heron's implementation also reduces the number of thread transitions that data must make from entry to exit. In Storm, data makes five transitions: user-logic → local send thread → global receive thread → global send thread → local receive thread → user-logic. In Heron, data makes only four transitions (user-logic → user-gateway → stream manager → user-gateway → user-logic), which results in reduced latency in addition to mitigating thread resource and queue contention issues.
Finally, Heron provides a pluggable scheduler interface. This enables the deployment of Heron onto any framework for which a scheduler plugin exists. Currently, Heron includes plugins for many frameworks including Apache Mesos, Apache Aurora, YARN, SLURM, Kubernetes, and Marathon. This increases the flexibility for usage compared to Storm. Storm uses its own dedicated resources to run applications. This is wasteful, as Storm may be holding onto resources for which there is no current need. Under Heron, those resources are managed elsewhere, for example by Apache Aurora. Heron only allocates the resources that it needs at the moment for running topologies and releases unused resources.

1.3.2 Computational Resource Manager and Scheduler

The recommended scheduler for Heron is Apache Aurora (Aurora). Aurora itself is a framework for running long-running services and jobs on clusters managed by Apache Mesos (Mesos). We first discuss Mesos, the underlying resource manager, then Aurora.

Mesos is a highly-scalable cluster resource manager [28]. Mesos achieves scalability by utilizing a two-level scheduling mechanism. Mesos keeps track of what resources are available, then issues offers to applications (called frameworks in Mesos parlance). It is then up to the frameworks themselves to accept resources and decide where to place the computation tasks. Mesos also allows for different isolation mechanisms for CPU, memory, GPU, and network resources.

Aurora builds on Mesos by providing an interface for long-running jobs [7]. This is well-suited for stream processing. Users submit applications (jobs in Aurora terminology) to Aurora, which will then obtain resources from Mesos on the application's behalf. Aurora further provides an execution environment for each of an application's containers. Each container may contain one or more processes. Each of the executors communicates with the central Aurora scheduler. This allows Aurora to detect when jobs or containers have failed and restart them by obtaining more resources from Mesos.

Figure 1.7: Apache Aurora interfaces with Apache Mesos to enable long-running jobs

We illustrate the relationship between Mesos and Aurora in Figure 1.7. The left side depicts the common program flow while the right side illustrates which components from each framework are actually running on the compute resources. First, users submit an application script, called a job, to Aurora (step 1). This script contains details on how a job should fetch necessary files and start any executables. Then, Aurora parses those jobs into tasks that are distributed to Mesos (step 2). Next, Mesos launches those tasks in executors that are running on each compute node (step 3). Those executors then launch the user application. The difference from plain Mesos is that Aurora's model keeps track of the unique tasks and is able to manage them throughout more of the execution lifecycle. For example, Aurora can relaunch failed tasks.
Aurora has observers running on each node that can detect when tasks have failed and will restart them in Mesos (step 4), thereby enabling support for long-running services while still providing the same fault-tolerance and scalability as Mesos.

1.3.3 Deployment on The University of Southern California's Center for High-Performance Computing

For our work, we deployed the above frameworks and executed our experiments at scales similar to the current state of the art on the Center for High-Performance Computing (HPC) at USC [3]. The HPC was the 12th-fastest academic supercomputer and 435th-fastest supercomputer in the world as of fall 2017. More than 2,700 heterogeneous compute nodes make up HPC, each running CentOS as the operating system. High-speed Ethernet, Infiniband, and 10-gigabit Myrinet connections link the nodes together. Furthermore, the nodes are heterogeneous with a wide range of processor types and speeds. A subset of nodes is also equipped with GPUs. Users request resources and submit jobs using the Slurm (Simple Linux Utility for Resource Management) job scheduler. The resources available on USC's HPC are typical of cloud platforms today.

For our experiments, we request a subset of nodes for exclusive use. We deploy Zookeeper, Apache Mesos and Aurora, and Kafka services on those nodes. Heron [33], Mesos [28], and Aurora [7] use Zookeeper [45] to manage distributed state, while Apache Mesos and Aurora support running distributed applications (Section 1.3.2), such as Heron.

We illustrate our typical deployment in Figure 1.8. Our submission scripts request a set of N nodes. We limit the requested nodes to ensure a consistent execution environment. The nodes are Lenovo NeXtScale nx360 M5 nodes, each configured with two Intel Xeon CPU E5-2640 v3 processors, 64 GB of memory, and 1.79 TB of disk space, and connected using both gigabit Ethernet and 56.6-gigabit FDR Infiniband. Each CPU core runs at 2.60 GHz. The first node (lowest node ID) serves as the master for running the experiment scripts and hosting the masters for the relevant services. The remaining nodes host workers for Kafka, Apache Mesos, and Aurora. The submission script then starts the Zookeeper master, Mesos master, Mesos workers, Aurora master, Aurora workers, and finally the Kafka brokers, in order.

Figure 1.8: Typical experiment script deployment on USC's HPC

1.4 Redundancy for Fault Tolerance and Performance

Fault tolerance studies commonly aim to prevent fail-stop failures such as node and component failures. However, Section 1.2.2 shows that performance failures can also result in large losses. Therefore, these types of failures should be studied as well.

Redundancy is a common characteristic in fault tolerance schemes to increase resiliency to fail-stop and incorrect-output faults. A common application is to provide dual modular or triple modular redundancy by providing parallel copies of components in systems. Feeding the outputs to a comparator for dual modular redundancy or a voter for triple modular redundancy allows the detection, or detection and correction, of a single error, respectively.

Our work proposes the use of redundancy to increase resiliency to performance faults as well. Parallel instances of stream operators can provide multiple paths for data to flow through computation steps. Each additional path then reduces the probability of progress halting due to an error at a compute element.
While redundancy can introduce side effects in the processing semantics of a given stream operator, it does not violate the at-least-once semantics employed in Heron. Furthermore, we minimize the effects on downstream operators by filtering redundant tuples. Work in this area has applied this idea to other systems, but to the best of our knowledge none has studied it in the context of stream processing systems. Furthermore, we utilize redundancy to increase the resiliency to stream operator instance crashes and communication failures. We present these works in Section 2.2.1.
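As a rough sketch of this idea (not the implementation evaluated in Chapter 3), the bolt below lets each logical tuple be processed downstream only once even when upstream operators emit several redundant copies: copies carry a shared identifier, the first copy to arrive is forwarded, and later copies are acknowledged and dropped. All names here are hypothetical, and a production filter would also evict old identifiers to bound memory.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

/** Drops redundant copies of a tuple so downstream operators see each
 *  logical tuple once; the fastest copy through the topology "wins". */
public class DedupFilterBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Set<Long> seen; // IDs already forwarded (unbounded in this sketch)

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
        this.seen = new HashSet<>();
    }

    @Override
    public void execute(Tuple input) {
        long id = input.getLongByField("tuple_id");
        if (seen.add(id)) {
            // First copy to arrive: forward it, anchored for at-least-once.
            collector.emit(input, new Values(id, input.getValueByField("payload")));
        }
        // Every copy is acked, so duplicates still complete their tuple trees.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tuple_id", "payload"));
    }
}
```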
1.5 Dynamic Resource Allocation and Scaling

Dynamic resource allocation is a widely studied field. However, existing works do not study tail latency and resiliency to errors. Of these studies, there are a few that directly target stream processing systems. In [51, 50], the authors present a framework, named Dynamic Resource Scaling (DRS), that uses runtime metrics feedback to autonomously scale the resources for Storm. They accomplish this by using Jackson open queuing network theory to determine the number of parallel instances that each stream operator requires to maintain the requested load. However, this work considers average, not tail, latency. Furthermore, in [25], the authors present Dhalion, a system for autonomous self-regulation for stream processing systems. Dhalion provides a high-level framework for monitoring the health of stream processing applications. The collected metrics may trigger symptoms. Dhalion then combines symptoms to generate diagnoses. Some example diagnoses are resource over-provisioning, resource under-provisioning, data skew, and slow instances. Once Dhalion makes a diagnosis, it then selects a resolver to try to resolve the issue. For instance, a resolver may scale the number of instances of a bolt up or down. Our work fits in well with [25]'s concept of self-regulating health if we consider real-time performance violations as a symptom of unhealthiness.

However, our work focuses on the dynamic scaling of redundancy levels in the context of a static resource allocation but under dynamically varying application load. The rationale for this is that rescaling in current frameworks is a costly operation relative to the real-time performance requirements. While the per-element latency constraints could be in the millisecond range, rescaling operations are in the seconds-to-minutes range. Furthermore, the application is paused, scaled, then restarted when a resource change is commanded. This is unlikely to be acceptable for applications with tight performance constraints. Therefore, we assume that resources are provisioned for worst-case scenarios and study how our work can make better use of those resources in times of lower load. We do so by increasing the redundancy level to provide lower latencies and higher resiliency to real-time performance errors.

1.6 Thesis Statement

Previous works show that real-time performance can directly affect profits and revenue, and that runtime issues can affect performance in tangible ways. Our work develops a programming model that targets increasing stream processing applications' resiliency to runtime performance issues. We do so by applying redundancy to these applications. Therefore, we present the following thesis statement.

Fault-tolerance and reliability techniques can increase fault resiliency and improve real-time performance, including tail latency, in stream processing applications without introducing excessive resource and performance overhead.

We test our thesis statement using a model that applies redundancy as follows: we first opportunistically replicate tuples. We then use the first result that finishes to both improve resiliency to real-time performance errors and reduce tail latency. This use of redundancy increases the resiliency to computational performance faults and to failed communication faults. Finally, we developed algorithms to dynamically scale the resource amount for stream processing applications to both match incoming load and provide a margin for redundant computation.

1.7 Research Contributions

Our work in testing the thesis yields two main research contributions, each a result of one of two subproblems in the thesis. The two problems that we studied are: (1) the effectiveness of redundancy in achieving the performance and resiliency goals, and (2) the ability of a dynamic runtime resource allocation to effectively match resources used with current load. Overall, we find that redundancy can be highly effective in increasing the real-time performance resiliency to faults. We also found that feedback-based resiliency management can mitigate wasteful resource overhead.

Our contribution from the first research problem is a design, implementation, and characterization of a framework for introducing redundancy in stream processing applications through speculative parallel execution. We found that this redundancy can greatly increase the real-time performance resiliency to computational performance, failed communication, and failed stream operator instance faults. However, we did find that the resource overheads can be large. More precisely, we show that for computational performance faults we can reduce the mean latency by 78%, the tail latency by 60%, and the number of missed deadlines by 80%, with a computation overhead of 40% and a communication overhead of 100%. While this technique may not be applicable to all workloads due to the overheads incurred, it can have substantial benefits for those applications where real-time performance is critical.

Secondly, we study how to manage the redundancy (and thereby resiliency) dynamically to match changing conditions at runtime. We contribute a solution that uses a runtime resiliency manager that processes feedback from the running application to determine the suitable redundancy using a hill-climbing algorithm. We show that, while a static allocation of redundancy reduces the number of window violations to zero, this solution can improve the resiliency by reducing the number of windows with constraint violations by 98% relative to a workload with no redundancy. Our solution achieves this improvement while using 10.4%-27.1% computation overhead and 46.0%-164.3% communication overhead, compared to 62.0%-94.3% computation overhead and 359.0%-529.2% communication overhead for the static redundancy. This means that the dynamically managed resiliency only uses 15%-29% of the computation overhead and 13%-31% of the communication overhead of the static redundancy.
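To illustrate the shape of such a feedback loop, the sketch below adjusts the redundancy level once per metrics window based on observed constraint violations and available headroom. It is a simplified, threshold-style rendering of the hill-climbing idea; all names and thresholds are hypothetical, and the manager evaluated in Chapter 4 instead integrates with Heron's runtime metrics.

```java
/** Simplified sketch of a feedback-driven resiliency manager (hypothetical). */
public class ResiliencyManager {
    private static final int MAX_REDUNDANCY = 4;       // copies per tuple, upper bound
    private static final double TARGET_MISS_RATE = 0.01;
    private static final double COST_PER_LEVEL = 0.25; // est. resource cost of one more copy

    private int redundancy = 1; // 1 = no redundancy

    /**
     * Called once per metrics window.
     * @param missRate fraction of tuples that violated the latency constraint
     * @param headroom fraction of provisioned resources currently unused
     * @return the redundancy level to apply for the next window
     */
    public int adjust(double missRate, double headroom) {
        if (missRate > TARGET_MISS_RATE
                && redundancy < MAX_REDUNDANCY
                && headroom >= COST_PER_LEVEL) {
            redundancy++; // climb: spend idle resources on more copies
        } else if (missRate == 0.0 && redundancy > 1) {
            redundancy--; // back off: shed overhead while constraints hold
        }
        return redundancy;
    }
}
```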
1.8 Dissertation Outline

The remainder of this dissertation provides details on our work. Chapter 2 presents more details on related work in this area. In Chapter 3 we present details on how redundancy can improve fault resiliency for stream processing applications, along with an evaluation of our solution for increasing performance resiliency. Likewise, we present our feedback-based solution for dynamically matching error resiliency to runtime characteristics, and an evaluation of that work, in Chapter 4. Finally, we discuss broader impacts of this work and conclude in Chapter 5.

Chapter 2: Related Work

In this chapter, we present an overview of works related to stream processing, fault tolerance, and dynamic allocation in stream processing. To the best of our knowledge, our work is the first that directly studies the problem of resilient real-time performance in these stream processing applications. In Section 2.1, we first present a more detailed introduction to the Apache Storm and Apache Heron abstractions and framework implementations, followed by a discussion of other stream processing models. Then, we present related work on fault-tolerance in Section 2.2. Finally, in Section 2.3 we present works that study the problem of dynamically managing resources for stream processing systems.

2.1 Stream Processing

While we target Heron as our stream processing platform and Storm's abstractions as discussed in Section 1.3.1, theirs is not the only popular model for stream processing frameworks. Other frameworks adopt a higher-level abstraction where users define stream operations using transformations such as map, reduce, flat-map, etc. This model imposes limitations, as operations must fit into one of the existing transformations. The Storm model allows users to implement logic at a lower level. However, the tradeoff is that frameworks using the higher-level abstraction tend to also provide stronger processing semantics in the form of exactly-once semantics. This is commonly done by grouping tuples into ordered micro-batches. The tradeoff here is increases in, and less consistency of, the per-tuple latencies.

Some examples of these other frameworks are Apache Flink, Apache Trident, and Google MillWheel. All three of these frameworks use the aforementioned higher-level abstraction of applying operators to streams (called DataStreams in Flink or streams in Trident). For example, developers define applications in these frameworks by declaring operations applied to streams. Some examples of these operations are Map, FlatMap, Filter, KeyBy, Reduce, Fold, and Split. These operators then implicitly define the edges that connect nodes of operators. For example, given a stream of sentences, a developer would declare a wordcount application as a Split operation followed by a FlatMap. Each of these frameworks also uses the microbatching principle to partition streams into groups of small numbers of data elements. The frameworks then process each batch exactly once, much like a transaction. Furthermore, the frameworks manage state for the operators through different schemes such as ordering batch IDs and backing up state to a reliable fallback. For example, Flink maintains consistent, periodic snapshots of both the data stream and operator state through a checkpoint-based scheme inspired by the Chandy-Lamport algorithm [21]. However, the shortcoming of this model is a stricter restriction on the types of operations that users can apply. Each operation must fit into one of the top-level classes of framework-supported operations. This is in contrast to the Storm model, where users have very fine control of each stream operator. Furthermore, if an application requires higher performance instead of exactly-once semantics, users are not forced to use the micro-batching or failure-tolerance mechanisms, which may impart real-time performance overheads.
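For contrast with the Storm-style declaration shown in Chapter 1, here is a rough sketch of the same wordcount expressed in Flink's transformation-based DataStream API (the socket source host and port are placeholders).

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Each transformation implicitly adds a node (and edge) to the dataflow graph.
        env.socketTextStream("localhost", 9999) // placeholder source
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(t -> t.f0) // analogous to fields grouping on the word
           .sum(1)           // running count per word
           .print();

        env.execute("wordcount");
    }
}
```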
Finally, the management, deployment, and resource utilization of stream processing applications is also a well-studied field. Recent works characterize and compare between the common frameworks such as Heron and Storm [22, 57, 44]. Other works study ways to deploy these stream processing workloads in a cost-efficient manner [58]. There has also been work studying how to further decrease the latency of stream processing by leveraging high-speed interconnects [30]. Others study how to better achieve real-time performance through better buffer sizing [62].

2.2 Fault Tolerance

Fault tolerance in computing is a well-established field. While relatively recent works have attempted to define and refine the definitions and terminology of fault tolerance, reliability, and dependability [15], this effort has existed since 1980. Dependability is commonly defined through five attributes defined in Table 2.1: availability, reliability, safety, integrity, and maintainability. From the highest-level perspective, systems are entities that interact with other entities such as systems, people, and the world. The other entities are referred to as the environment of that system. Each system therefore has a boundary between it and its environment. Systems provide services to their environment, such as to users. The services themselves are made up of internal and external states. A system provides correct service when it implements a function as designed, while a service failure is when the delivered service deviates from the specified function. These failures are caused by errors in the internal state. Those errors are caused by faults. Therefore, there is a chain of effects: faults may cause errors, which may cause failures in service. As an example in stream processing, an operator may fail to process a data element in a timely manner due to a computational performance fault such as resource congestion or a garbage collection pause. This fault can then cause that data element to not finish processing within a latency constraint, thereby causing a real-time performance error in the form of a missed deadline. Finally, this error causes a service failure: the stream is not fully processed within the real-time constraints.

Table 2.1: System dependability attributes and definitions [15]

Attribute       | Definition
----------------|------------------------------------------------------------------
Availability    | Readiness for correct service
Reliability     | Continuity of correct service
Safety          | Absence of catastrophic consequences on the user(s) and the environment
Integrity       | Absence of improper system alterations
Maintainability | Ability to undergo modifications and repairs

Given these definitions, many works look at how to provide fault tolerance and dependable systems. In one of the seminal papers in the field, Gartner both provides a structure to the area while also surveying the fundamental building blocks for asynchronous, fault-tolerant applications [27]. Gartner defines fault tolerance as "the ability of a system to behave in a well-defined manner once faults occur." As such, there are four types of fault tolerance: masking (where a system is still live and in a safe state), fail safe (where a system is in a safe state but no longer live), non-masking (where a system is live, but not in a safe state), and none (a system is neither live nor in a safe state). However, a key takeaway is that redundancy is "the key to fault tolerance." In fact, Gartner more strongly proves that there is no fault tolerance without redundancy.

Gartner's paper further claims that there are two types of redundancy: redundancy in space and redundancy in time. Redundancy in space is distributing the execution of an operation into multiple copies. This is commonly seen in redundancies for hardware failures such as dual-modular and triple-modular redundancy in avionics. Conversely, redundancy in time is when an operation is run multiple times, such as repeating an execution of a program on the same hardware and comparing results.
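As a concrete illustration of redundancy in time, the following minimal sketch repeats a deterministic computation on the same hardware and compares the results to detect a transient fault. The computeChecksum function is a hypothetical stand-in for any pure operation; the class name is illustrative.

import java.util.zip.CRC32;

public class TimeRedundancyDemo {
    // Hypothetical deterministic operation; any pure computation works here.
    static long computeChecksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] input = "example payload".getBytes();
        long first = computeChecksum(input);   // first execution
        long second = computeChecksum(input);  // repeated execution in time
        if (first != second) {
            // A mismatch indicates a transient fault during one execution;
            // a real system would retry or vote among three or more runs.
            System.err.println("Transient fault detected");
        } else {
            System.out.println("Results agree: " + first);
        }
    }
}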
Studies looking at redundancies in both spaces exist in many contexts, one example being multistage interconnection networks [35].

Redundancy can support resiliency goals in different ways. For example, Apache Hadoop uses redundancy in the Hadoop Distributed File System (HDFS) by providing three copies of data fragments to meet higher availability targets [49]. The software security and fault-tolerance fields provide redundancy through both design and data diversity to harden systems [12, 43]. Design diversity uses multiple implementations of the same program specification and compares the output with identical input data in order to detect and optionally correct errors. Data diversity modifies the input data to force different program flow paths.

2.2.1 Fault Tolerance in Stream Processing

Our work aims to increase the resiliency of stream processing applications to real-time performance errors. Our evaluation considers multiple underlying faults that cause these errors, such as computational performance faults, communication faults, and faults where the stream operators themselves crash. We do this to reduce the service failures: cases when a stream is not fully processed within real-time constraints. The techniques we study show that we can reduce the failure rate even in the presence of these faults.

Furthermore, our solution makes wide use of redundancy in space. The key idea is to add redundancy in the data elements in flight throughout the stream processing application and distribute the copies to different stream operator instances. This is a probabilistic approach that makes use of speculative execution to mask intermittent real-time performance errors. While we show that this technique is effective at increasing both the real-time performance and resiliency, it does come with a cost, which we also quantify.

Studies have looked at this type of solution under different contexts [13, 46, 60]. In this section, we present related works that apply redundancy for performance. It is clear from these works that redundancy is effective at increasing the real-time performance.

Ananthanarayanan et al. applied redundancy to Hadoop systems in order to reduce tail latencies [13]. The authors note that speculative execution that waits to detect slow jobs before launching duplicates is ineffective, as copies may start too late. Furthermore, the problem of determining slow tasks requires significant numbers of samples. The authors solve this by launching tasks multiple times and using the result of the earliest to finish. They show that applying redundancy in addition to other straggler mitigation techniques can speed up small jobs by 46%.

Bashir et al. studied redundancy (through duplication) as a first-class concept in clouds [16]. The authors introduce duplicate-aware scheduling (DAS), which provides system stability through prioritization and purging of requests. The authors further introduce D-Stage, an abstraction for supporting DAS across the different layers. In an example workload running on Google's public cloud, the authors show that DAS can reduce the 99% tail latency by 4.6 times and the mean latency by 5.6 times for an HDFS cluster workload.

Other related work applies redundancy to a number of applications to reduce both mean and tail latencies [60]. The authors apply redundancy more generally to a number of areas: Domain Name Service (DNS) and database queries, Memcached, and networks. In DNS, the authors use redundancy in the form of parallel queries to different servers. The authors poll parallel servers for redundancy in database and Memcached queries.
Finally, in networks the authors replicate initial packets for a given flow, but at a lower priority to mitigate effects of increased resource utilization. The results show that a single copied DNS query to a parallel server can reduce the average latency by roughly 30%. Increasing the number of copies to five improves the reduction to almost 60%.

2.3 Dynamic Allocation in Stream Processing

While many works, including ours, show that redundancy can be highly effective at increasing both the real-time performance and resiliency to errors, the overhead costs may be high. To mitigate the costs, we further study solutions to dynamically adjust the redundancy amount to trade off resource utilization with resiliency. Dynamic resource allocation is a widely studied field [59, 50, 51, 25, 14, 29]. The works summarized in this section also look at techniques for dynamically reallocating resources for improved utilization and runtime stability in the context of stream processing systems.

Dynamic Resource Scaling (DRS) directly studies the problem of autonomously scaling resources for Storm stream processing applications [51, 50]. The authors accomplish this by treating each operator instance as a server and monitoring the input queues at each operator instance. Their system then uses Jackson open queuing network theory to determine the number of parallel instances that each stream operator requires to maintain the requested load. However, this work considers average, not tail, latency. Furthermore, the operation to reallocate and rebalance resources is costly in time, as Storm suspends the entire system, modifies the routing, then resumes the system. This can be unacceptable if per-data-element (tuple) latencies are in the millisecond range. Our work runs in the context of a continuously running system: we make no adjustments to the number of stream operator instances but instead adjust the redundancy amounts for tuples flowing between them. This allows fast response without interrupting the existing running system.

Another related work is Dhalion, a system for maintaining system health through autonomous self-regulation for stream processing systems, from Microsoft Research [25, 14]. Dhalion provides a high-level framework for monitoring the health of stream processing applications. The collected metrics may trigger symptoms. Dhalion then combines symptoms to generate diagnoses. Some examples of diagnoses are resource over-provisioning, resource under-provisioning, data skew, and slow instances. Once Dhalion makes a diagnosis, it then selects a resolver to try to resolve the issue. For instance, a resolver may scale the number of instances of a bolt up or down. Our work fits in well with [25]'s concept of self-regulating health if we consider real-time performance violations as a symptom of unhealthiness. For example, if we consider meeting real-time constraints as a policy, a diagnosis could be missed deadlines for tuples. Dhalion could self-regulate by increasing the redundancy to those operators that have high variance in the execution latencies.

Finally, other recent works study ways to more quickly converge to the correct resource allocation for maintaining resource goals. DS2 is an automatic scaling controller for distributed stream processing workloads that uses runtime metrics as feedback [29]. The DS2 model uses knowledge of the workload graph and dependencies between operators to determine true processing rates and how much resources each operator requires. DS2 has been applied to Heron, Apache Flink, and Timely Dataflow applications.
While this work addresses issues with scaling to match resources with compute load, our work studies the problem of how to increase real-time performance resiliency given a fixed resource availability.

Chapter 3

Data and Computation Redundancy for Improved Fault Resiliency in Stream Processing Applications

In this chapter, we present our study and findings on the first problem of this dissertation by implementing and quantifying the effectiveness of redundancy in increasing the real-time performance and fault resiliency. Our work studies three high-level classes of faults: performance, failed communication, and failed instance faults. Computational performance faults include those faults which directly affect the per-instance execution latencies, such as garbage collection pauses and resource contention. Failed communication faults affect the communication between stream operator instances. These types of faults can affect the delivery of acknowledgments and tuples. Finally, failed instance faults occur when the instances themselves crash and recover due to the underlying execution framework. Overall, we find that redundancy can be highly effective at increasing both the baseline performance and resiliency to faults, resulting in reduced tail and mean latencies and fewer missed deadlines while maintaining throughput.

We organize this chapter as follows: we detail the fault model for the various classes in Section 3.1. Then we present our redundancy models, implementations, and evaluation results in Section 3.2. We took an evolutionary approach where we targeted a simpler, more restrictive programming model first, then evolved the redundancy scheme to generalize our contributions.

3.1 Fault Classes

In order to evaluate the effectiveness of redundancy in increasing the resiliency to the faults that can affect stream processing workloads, we must first define the applicable classes, their effects, and model the occurrence rates. We present our model of the relevant fault classes in this section. First, in Section 3.1.1 we present computational performance faults, the most common type of faults that affect stream processing workloads. Secondly, we present failed communication faults in Section 3.1.2 and failed instance faults in Section 3.1.3. Finally, in Section 3.1.4 we present details on how we inject faults into the execution of stream processing applications in order to measure both their effects and the effectiveness of redundancy in increasing the resiliency to those faults.

3.1.1 Computational Performance Faults

There are many runtime issues that can affect the performance of stream processing applications. Many fall directly into the computational performance fault class. These are faults that directly affect the stream operator execution latencies for processing the tuples. For example, resource congestion at the compute hosts may mean that operators have longer than expected queueing latencies before processing any received tuples. Other examples are excessive or slow context switching adding additional delay, or bolts taking longer to process tuples due to incorrect assumptions by the application developer. A common performance fault for current frameworks is that of garbage collection pauses. Many stream processing frameworks, including Heron, Storm, and Flink, use Java as the programming and execution language. Java is susceptible to garbage collection pauses, where all processing of user-level code is paused while the JVM performs memory management operations and frees unused memory. These pauses are well-characterized and can vary in magnitude from sub-milliseconds to hundreds of milliseconds [36].
The frequency of garbage collection pauses is highly application-dependent. Under the Heron execution model, each and every stream processing instance runs as a separate Java process and is therefore susceptible to these types of pauses.

While the sources of these delays may vary, the actual effects on stream processing applications are similar. In Figure 3.1 we illustrate how these delays can affect the real-time performance. In this example, the topology has three stages: one spout followed by two stages of bolts. Tuples A and B are normal executions of a tuple: the latency at each bolt is the expected 10 ms. However, a computational performance fault occurs at Bolt 2 during the execution of Tuple C and causes an additional 10 ms of execution latency. This drastically increases the latency for Tuple C by 40%.

Figure 3.1: Computational performance faults decrease the real-time performance of stream processing applications

We can model these faults using two characteristics: magnitude and frequency. The frequency of each fault defines how often a computational performance fault occurs and affects a stream processing operator instance, while the magnitude defines how much additional delay the fault adds to the execution latency. Both characteristics are stochastic values and therefore also have a random distribution. In fault-tolerance studies, the frequency is generally modeled as an exponential distribution due to its memoryless property. In our original motivating fault class (garbage collection pauses), the number and lifetime of references are commonly modeled as Gaussian distributed [24, 31]. Therefore, we also model the magnitude as a Gaussian distribution.
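The following is a minimal sketch of how such a fault stream can be sampled under these assumptions. The parameter values mirror the injection settings used later in this chapter (Listing 3.1 expresses them in microseconds; here we use milliseconds), and the class name and clamping choices are illustrative rather than part of our implementation.

import java.util.Random;

public class PerformanceFaultSampler {
    private final Random rng = new Random();
    private final double mtbfMeanMs; // mean time between faults (exponential)
    private final double magMeanMs;  // mean fault magnitude (Gaussian)
    private final double magStdMs;   // magnitude standard deviation

    PerformanceFaultSampler(double mtbfMeanMs, double magMeanMs, double magStdMs) {
        this.mtbfMeanMs = mtbfMeanMs;
        this.magMeanMs = magMeanMs;
        this.magStdMs = magStdMs;
    }

    // Inverse-transform sample of an exponential inter-arrival time.
    double nextTimeBetweenFaultsMs() {
        return -mtbfMeanMs * Math.log(1.0 - rng.nextDouble());
    }

    // Gaussian-distributed pause magnitude, clamped to be non-negative.
    double nextMagnitudeMs() {
        return Math.max(0.0, magMeanMs + magStdMs * rng.nextGaussian());
    }

    public static void main(String[] args) {
        // Faults every ~1 s on average, pauses of ~150 ms +/- 37.5 ms.
        PerformanceFaultSampler s = new PerformanceFaultSampler(1000.0, 150.0, 37.5);
        for (int i = 0; i < 5; i++) {
            System.out.printf("next fault in %.1f ms, pause %.1f ms%n",
                    s.nextTimeBetweenFaultsMs(), s.nextMagnitudeMs());
        }
    }
}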
3.1.2 Failed Communication Faults

Generalizing the redundancy approach for computational performance faults also extends our framework's effectiveness to failed communication faults. These faults interrupt communication between stream operator instances. In the context of stream processing applications, the effects take the form of dropped acknowledgments or tuples. We detail two ways to model these faults. The first, discussed in Section 3.1.2.1, is a single-shot occurrence model. In this case, only single acknowledgments or tuples are dropped intermittently. However, reliable communication protocols make this type of fault model less common than the second model. The second model, presented in Section 3.1.2.2, is a correlated occurrence model where an instance may lose total connectivity with its stream managers, thereby dropping all tuples and acknowledgments until the connection is restored.

3.1.2.1 Single-Shot Occurrence Model

Single-shot occurrence faults are those faults where a single independent tuple or acknowledgment is dropped. There are a multitude of opportunities for these faults, as Heron communication runs through multiple steps between the source and destination component. For example, to send tuples, stream operator instances serialize data using Protocol Buffers before sending the data to the local stream manager over sockets [5]. The stream manager then sends the data to the stream manager for the destination component. The destination stream manager then sends the data to the target component, which deserializes the data and processes it. Acknowledgments are similar: the only difference is that source stream managers send the acknowledgment message to the stream manager that manages the original spout. This results in many opportunities where either tuples or acknowledgments may drop due to issues such as network stack errors. Losing the data would result in no progress for the relevant tuple. Although Heron provides at-least-once semantics by replaying the tuple after a user-configurable timeout (shown in Figure 3.2a), the real-time performance would already be significantly degraded, as timeouts are managed in seconds while tuple latencies can be in the millisecond range.

We illustrate the implications of these types of faults on the real-time performance of stream processing applications in Figure 3.2. For this example, there is a spout A that sends data to a bolt B1, which then sends data to bolts C1 and C2. In Figure 3.2a, we show a single communication fault affecting a tuple. First, spout A sends the data to bolt B1. Bolt B1 then sends the data to bolt C1, but the tuple is lost. After a timeout period, spout A resends the data to bolt B1, which sends the data to bolt C2 instead. In this case the tuple completes processing, but the damage has been done to the latency.

In Figure 3.2b we show how this looks in a timeline form. The first timeline shows the successful execution, which takes only 11 ms. However, in the failed communication execution, the tuple takes over 2017 ms, assuming a 2-second timeout. Furthermore, bolt B1 does wasteful, repeated work due to Heron's model.

Figure 3.2: Communication failures cause degraded real-time performance ((a) a failure in communication results in a replayed tuple; (b) a replayed tuple results in extended latency)

However, the weakness in this model is that it is rare for single-shot errors to occur. Local machine sockets are more likely to fail totally and affect multiple tuples (following the correlated occurrence model) instead of affecting a single tuple. Cross-machine communications are over reliable TCP.

3.1.2.2 Correlated Occurrence Model

An alternate occurrence model for communication faults is the correlated model. In this model, faults occur periodically and last for a period of time before communication is restored. Under this model, all tuples and acknowledgments during the fault are lost. This can occur if a stream operator loses its connection to the local stream manager. In Figure 3.3 we illustrate the processing of a tuple in a typical topology. For this execution we have one spout followed by two stages of bolts. The first bolt has a parallelism of two. As each bolt finishes processing a tuple, it reports acknowledgments to the spout. Furthermore, we see that all communication between stream operator instances goes through the stream managers. This becomes a single point of failure for all instances belonging to each stream manager. Therefore, if an instance's link to the stream manager fails, that instance will be unable to receive tuples and send acknowledgments.

Figure 3.3: Typical tuple processing on Heron has multiple stages of communication

We model these faults using two parameters: the time between fault occurrences, the Mean Time Between Faults (MTBF), and the time to recover from the fault, the Mean Time To Recovery (MTTR). The MTBF characterizes the time between fault occurrences while the MTTR characterizes the time until restored connectivity. Similarly to the computational performance faults, each parameter is a stochastic value and has a random distribution. We model the MTBF as an exponentially distributed random variable and the MTTR as a Gaussian random variable.
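As a hedged sketch of what this correlated model looks like operationally, the code below generates outage windows from an exponential MTBF and a Gaussian MTTR and tests whether a given send time falls inside an outage. The class name and the example parameter values are illustrative, not those of our injector.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CorrelatedFaultModel {
    // Generates [start, end) outage windows over a time horizon, in milliseconds.
    static List<double[]> outageWindows(double horizonMs, double mtbfMs,
                                        double mttrMeanMs, double mttrStdMs,
                                        Random rng) {
        List<double[]> windows = new ArrayList<>();
        double t = 0.0;
        while (true) {
            // Exponential time until the next connectivity fault.
            t += -mtbfMs * Math.log(1.0 - rng.nextDouble());
            if (t >= horizonMs) break;
            // Gaussian recovery time, clamped to be non-negative.
            double mttr = Math.max(0.0, mttrMeanMs + mttrStdMs * rng.nextGaussian());
            windows.add(new double[] { t, t + mttr });
            t += mttr; // the next fault can only start after recovery
        }
        return windows;
    }

    // A tuple or acknowledgment sent during an outage window is lost.
    static boolean isLost(double sendTimeMs, List<double[]> windows) {
        for (double[] w : windows) {
            if (sendTimeMs >= w[0] && sendTimeMs < w[1]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // E.g., a fault every ~30 s on average, ~500 ms to reconnect.
        List<double[]> windows = outageWindows(60_000, 30_000, 500, 100, rng);
        System.out.println("tuple at t=15s lost? " + isLost(15_000, windows));
    }
}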
3.1.3 Failed Instance Faults

Failed instance faults are those faults that occur when a stream operator instance crashes or otherwise becomes unresponsive and stops processing tuples. This may be due to issues such as programmer errors and host failures. Examples of programmer errors are null-pointer and divide-by-zero errors. These could occur due to edge cases in the input data. Host failures can be caused by hardware failures, software failures, configuration issues, or cyber-security attacks.

The effects of these faults can be substantial even if the actual faults are rare. This is due to the execution and replay model as discussed in Section 3.1.2. Due to the lack of progress and inability to receive, process, send, or acknowledge tuples, failed instance faults have similar effects to failed communication faults. Therefore, for the purposes of this work we model failed instance faults as an extreme case of correlated communication faults where the MTTR is lengthy, on the order of tens to hundreds of seconds.

We conducted a short experiment to justify the parameters for the failed instance injector. We launch a topology and periodically kill an instance. Then, we measure the time from when an instance stops until Heron and Aurora detect the failed instance and restart it. Our results show that the MTTR has a mean value of 10.816 seconds on USC's HPC. However, this value will vary by platform and execution framework implementation. This is substantial relative to tuple latencies, which are commonly in the tens to hundreds of milliseconds range.

3.1.4 Fault Injector

In order to study the effects of faults with different characteristics and the effectiveness of our solutions for increasing the resiliency to those faults, we designed and implemented a modular fault injector. This fault injector is extensible to allow for the addition of more types of faults in the future.
The frequency is exponentially distributed with a mean period of one second while the magnitudeisGaussiandistributedwithameanof150ms. 39 Heron Framework User-spout SpoutOutputCollector Spout ack() emit() nextTuple() Heron Framework User-bolt ack() OutputCollector Bolt emit() execute() Fault Injector (a)Faultinjectorsitsbetweenframeworkanduser-levelcode Heron Framework User-bolt ack() execute() Heron Framework Fault Injector OutputCollector Fault Injector Bolt OutputCollector Bolt ack() emit() execute() User-bolt ack() execute() OutputCollector Bolt emit() emit() (b)Comparisonoffault-injectedandnormalbolt Figure3.4: Faultinjectorwrapperclassenablesinjectionoffaultsatdifferentphasesof execution Listing3.1: File-basedconfigurationoffaultinjectortypeandcharacteristics bolt1: # name of operator enabled: # list of enabled ids - 0 - 1 faults: # list of injectors timing: parameters: # microseconds mtbf: exponential: # Distribution - 100 # min - 900000000 # max - 1000000 # mean magnitude: gaussian: # Distribution - 75000 # min - 300000 # max - 150000 # mean - 37500 # std bolt2: # name of operator enabled: # list of enabled ids - 0 ... 40 This fault injector enables the study of redundancy effectiveness applied to faults with a wide-range of characteristics. Furthermore, the modular implementation allows for easy addition to existing stream processing applications. Finally, the ease of fault parameterconfigurationfacilitatesfastiterationofexperimentparameters. 3.2 DesignandExperimentalEvaluationofRedundancy Implementations WefirsttargetedasimplifiedstreamprocessingmodelasdiscussedandevaluatedinSec- tion3.2.1. Thisallowedustoconductapreliminarystudyintothepotentialeffectiveness ofredundancyforreal-timeperformanceresiliency. Oncetheinitialworkwasprovento beeffective,wegeneralizedtheredundancyschemewhichrelaxedconstraintsplacedby the simplified model. We present details and an evaluation of the generalized model in Section3.2.2. 3.2.1 SimplifiedApplicationPatterns Our initial work targets a simplified stream processing model. However, we relax these simplifications in our later work in Section 3.2.2. In this work, we restrict topologies to those with a single operator pipeline pattern. Furthermore, the operators themselves mayonlyhaveoperatorinput/outputratioof 1:1whichmeansthatforeverytuplethata streamoperatorinstancereceives,itmaysendatmostone. Wealsoimposearestriction that the last stage of a topology be an aggregation bolt in order to filter out redundancy injectedatthelastbolt. Finally,werestrictthegroupingsbetweentheoperatorstoshuffle grouping. WeillustrateanexampletopologyunderthisrestrictedmodelinFigure3.5. 41 Legend Optional additional stages Spout Bolt 1 Bolt 2 Bolt N Topology consists of single pipeline of operations All operators maintain 1:1 input/ output data ratio Aggregation Bolt Tuple Stream operator Topology ends in aggregation bolt operator Figure3.5: Aninitiallyrestrictedstreamprocessingmodelallowsforsimplifiedinitial studyintoredundancyapplications Given this simplified model, we proposed and evaluated a probabilistic solution to increase the performance resiliency to runtime computational performance faults. Our solution injects redundancy in the intermediate in-flight tuples and speculative executes theseparallelcopiesofdata. Thisreducestheprobabilityofafaultcompletelystallinga tuple’sprogress. 
Moreprecisely,ifwedefineatuplet’stotalexecutiontimeasT (t),the time from when a spout first emits the tuple until all instances acknowledge processing the tuple, then tuple is on time if the tuple with start time T s (t) completes processing before a deadlineD(t),T (t) < D(t)T s (t). We consider a deadline miss as a service failure. If T (t) = T f + n P i=0 T i (t) whereT f is the communication and framework over- heads,whilethefirstsummationtermisthesumofexecutiontimesperoperatorinstance that processes this tuple. However, computational performance faults can directly af- fectT i (t)inthisformulation. Anydelaysduetothosecomputationalperformancefaults will increase the timeT i (t) and potentially cause a tuple to miss its deadline. However, theoccurrenceofacomputationalperformancefaultduringatuple’sexecutiondoesnot necessarilycauseasystemperformancefailure. Ifthereisenoughbuffertimeinthepro- cessing chain or if the computational performance fault is small enough, the deadlines maystillbemet. 42 Op. A Op. B Op. B 1 2 3 1 3 (a)Noredundancy Active Passive Op. A Op. B Op. B 1 3 2 1 3 2 1 1 3 2 (b)Redundancyindata Figure3.6: Redundancyenabledcontinuedprogressinthepresenceoffaults We illustrate a simple example in Figure 3.6. The figures illustrate the progress of threetuples(1,2,3)attheendofexecution. Currentstreamprocessingexecutionframe- works follow Figure 3.6a. A computational performance fault, such as a garbage col- lection pause, at Operator B stopped the progress of tuple 2. Here, a computational performance fault afflicts the first instance of Operator B (thereby increasing n P i=0 T i (t)) and thus processes tuple 3 more slowly and potentially causes a deadline miss. We hy- pothesizedthatwecouldsolvethisissuebyincreasingtheresiliencyofstreamprocessing applicationstothesecomputationalperformancefaultsbyreplicatingthedataandsend- ing it to multiple instances of a stream operator as shown in Figure 3.6b. We can then gain increased resiliency to the these faults and improve the performance by taking the resultsfromthefirstcopyofatuplethatfinishes. Injecting redundancy does have effects on the programming semantics. Heron pro- vides allows for two types of semantics: at-most-once and at-least-once. The semantics affectthecorrectnessofresults. Forexample,inaWordCounttopology,ifagiveninput sentence was the single word "USC," the correct results under at-most-once semantics arecountsofoneandzero(oneifthetupleisprocessed,andzeroiftheprocessingfails). However,underat-least-oncesemanticstherearemanycorrectanswersasthecountcan 43 range from one toN, whereN is the number of executions required for complete, suc- cessfulprocessing. At-most-once semantics are weak, as by definition a data element might not be pro- cessed. Therefore,forbestreal-timeresponse,anexecutionenginemaydroptuples. Our workfocusesontheat-least-oncesemanticswheretuplesarereplayeduntiltheyarefully processed. While we minimize the effects on the results by applying filters at down- streamoperators,theadditionaltuplesfromredundancydonotcauseincorrectresultsas processingtuplesmultipletimesisacceptableunderat-least-oncesemantics. Our work implements redundancy by replicating tuples and distributing data to dif- ferentoperators. Weaddresstwoissues: howtodeterminewhichoperatorstodistribute data to, and how much data to replicate. The remainder of this section discusses how we approached this problem and how we implemented the solution: a framework and methodologynamedDynamo. 
First, we discuss how to determine which operators receive what data. Our solution partitions the parallel instances into two groups: active and passive. Tuples are always processedbyanactiveinstance. Eachtuplethatisselectedforreplicationissenttoboth anactiveandpassiveinstance. Wedeterminewhichinstancetosendatupletobyhashing each tuple’s unique ID. We use the result to index a list of active or passive instances as needed. Thisisarequirementsothattheresultsofredundanttuplesaresenttothesame downstream instance. Each instance only processes the first tuple per unique ID that it receives. WeillustratetheschemeinFigure3.7. Inthisexample,Bolt1,Bolt2,andBolt3all have one active and one passive instance. If a tuple is replicated to Bolt 1, there will be redundant results going into Bolt 2 instances. Therefore, Bolt 2 should filter duplicate results. 44 Bolt 1 Spout A P Bolt 2 A P Bolt 3 A P Aggregator Ta Tp Active and duplicate send to same targets Active and duplicate send to same targets Bolt only processes first received Bolt only processes first received Spout sends to bolts Only process first tuple received per tupleId Only process first tuple received per tupleId Only process first tuple received per tupleId Active Passive Figure3.7: Dynamodistributesredundanttuplestodifferentoperatorinstances Next,wediscusshowwedeterminetheamountofreplication. Wefirstassumethatall parallel instances of an operator have the same average processing speed. Furthermore, given a partition of active and passive instances, we assume that the number of active instancesissufficienttoprocessthefulltuplestreamwithoutexceedingdeadlines. This means we use passive nodes only for redundant computation. Under the assumption that all parallel instances have the same average processing speed, Dynamo needs to determine how much computation can be duplicated without overloading the passive instances. We do this by finding the amount of additional computation that keeps the utilizationconstantoverallparallelinstances. Therefore,theratioofduplicatetuplesfor areceivinginstanceisR p =R a ,whereR a isthenumberofactiveinstancesandR p isthe numberofpassiveinstances. However,thisquantityisfurtherscaledbytotalquantityof sent tuples. The total tuples sent at the sender is 1 +S p =S a whereS a is the number of 45 active senders andS p is the number of passive senders. Combing the above two values resultinthefinalredundancyratioequation: R p =R a 1 +S p =S a (3.1) We show an example in Figure 3.8. In this example, operator A is sending tuples to operator B. Here operator A has one active and one passive instance while operator B has three active and two passive instances. Therefore, per Equation 3.1, the redundancy ratio is 1=3, meaning that each sender instance should duplicate 1=3 of the tuples that it emits. A1 A2 B1 B2 B3 B4 B5 S a =1 S p =1 R a =3 R p =2 Fraction to duplicate (2/3) / (1+1/1) = 2/6 = 1/3 Figure3.8: Thenumberofactiveandpassiveoperatorinstancesdeterminestheamount ofredundancyinthetopology Finally, we discuss the implementation details. We implemented our work on Twit- ter’s Heron framework [33]. Heron adopts Storm’s API where operators are classified as sources (spouts) or processors (bolts). One of our goals for Dynamo is to maintain compatibility with user-level operator code. We also strove to avoid modifications to the framework. Figure 3.9 illustrates how we accomplished this. For both spouts and bolts, we maintain the interface for both the user-level and framework by “wrapping" user-level code. 
The framework notifies the spouts to generate a tuple by calling the nextTuple() method, and send tuples through the SpoutOutputCollector by calling 46 the emit() method. Likewise, a tuple is given to a bolt to process through the ex- ecute() method and send tuples through an OutputCollector by using the emit() method. Heron API DynamoSpout User-Level Spout Logic SpoutOutput Collector DynamoSpout OutputCollector nextTuple() nextTuple() emit() emit() Heron API DynamoBolt User-Level Bolt Logic SpoutOutput Collector DynamoSpout OutputCollector execute() execute() emit() emit() preprocess Figure3.9: DynamousesamodularwrapperimplementationonTwitter’sHeron 3.2.1.1 ExperimentalEvaluation In this section, we present our experimental evaluation of redundancy for the simplified streamprocessingmodel. Firstwepresenttheworkloadforthisstudy. Then,wepresent themetricsusedforevaluation,actualresultsandanalysis. We illustrate an overview of the test stream processing application in Figure 3.10. This application performs simple image processing by performing facial detection on a stream of images using OpenCV [19]. The topology has five stages: ImageFetch, DetectFaces, ImageMark, ImageWrite, AggBolt. The first stage is ImageFetch, a spout which emits images from a random set. It also records performance metrics for each trial. The second stage is a bolt that performs the facial detection. The third and fourth stages are bolts that mark the images and write the results to disk, respectively. Finally, thelast stageisanaggregating boltasrequired byourassumptions(Section 3.2.1). The aggregatingboltalsodumpsotherruntimemetricsthatthespoutcannotcollect. Our methodology consists of first measuring baseline executions of the topology’s executions. We then compare trials to the baseline to determine the improvement. Fur- thermore, we inject computational performance faults at a uniform rate at all instances. 47 imageFetch detectFaces imageMark aggBolt imageWrite Dump Metrics Dump Metrics Figure3.10: Imageprocessingworkloadforevaluationofsimplifiedstreamprocessing model ThefaultsareinjectedattheOutputCollectors. Wecheckwhetheranfaultispendingbe- foreemittingatuple. Ifso,weinjectadelayusingbuilt-insleepfunctionsandcalculate the time and magnitude for the next fault. The main metric for our work is the latency and number of missed deadlines. The fault characteristics are inspired by the observa- tions in [36]. The mean time between faults is 2 seconds and exponentially distributed, as commonly used to model faults. The magnitude is Gaussian distributed with a mean of250milliseconds. Thismeansthatthisworkloadspends7.5%oftheexecutiontimein garbagecollection. Thetrialsvarythetotalnumberofinstancesavailabletothetopology whileusingafixednumberofactiveinstances,therebyvaryingtheredundancyamount. Results We now present the results of our experimental evaluation. We first evaluate the effects ofcomputationalperformancefaultinjection. Wethencomparetheeffectofredundancy over a range of redundancy amounts. Finally, we study the potential performance over- heads. We represent the parameters in the format 1=A=B=C=1 where A is the number ofactivedetectFacesbolts,BisthenumberofactiveimageMarkinstances,andCisthe 48 number of active imageWrite instances. The first 1 specifies a single spout while the latter 1specifiesasingleaggregationbolt,perourassumptions. Webeginourstudybydeterminingwhetherthefaultscancauseenoughpoorperfor- mancetobeofinterest. 
InFigure3.11,weillustrateabaselineandfaultinjectedrunfor atopologywith 1=5=2=2=1instances. Here,wecanseethatwhennofaultsareinjected, the mean latency is 444.59 milliseconds. However, when faults are injected, the mean latencyincreasesto483.27milliseconds. Furthermore,thelatencyiswidelyvaryingdue to the injections which highlights the detrimental effect of computational performance faults. Duetothenatureofstreamprocessingframeworks,adelayforasingletuplecan alsoaffectotherfollowingtuples. Figure3.11: Computationalperformancefaultsdrasticallydeterioratereal-time performance Next,weevaluateDynamo’sperformanceandimprovementsinresiliencytothecom- putationalperformancefaults. Forthesesetsofexperiments,thetopologyhas 1=10=5=2=1 totalinstancesperoperator. Therefore,thenumberofpassiveinstancesis 10 #active. The active/passive partitioning varies per each trial. The baseline has 1=5=2=2=1 total instancesandallinstancesareactive. WeillustratetheresultsinFigure3.12andtabulate thedatainTable3.1. Ourprimarymetricisthenumberofdeadlinesmissed. Thedead- line is 500 milliseconds and notated on the figures. As the number of passive instances increases,theredundancyratealsoincreases. Avisualcomparisonofthesubplotsshows 49 (a)Nopassiveboltinstances (b)Onepassiveboltinstance (c)Twopassiveboltinstances (d)Threepassiveboltinstances (e)Fourpassiveboltinstances (f)Fivepassiveboltinstances Figure3.12: Redundancyreducesthenumberoftupleswithprolongedlatencies Latencyinmilliseconds(1/10/5/2/1instancesperoperator) that the variance in latencies decreases as the level of redundancy goes up. This is seen inthedensityofthepoints. Forinstance,thedensityat600millisecondsinFigure3.12a ismuchhigherthanatthesamelatencyinFigure3.12f. Likewise,thenumberofpoints abovethedeadlinedecreaseasweincreasethelevelofredundancy. InTable3.1,wetabulatevaluesfortheexperimentsshowninFigure3.12. Itisclear that the standard deviation does indeed decrease with increasing passive instances. We alsoobservethatatlownumbersofpassiveinstances,thenumberofdeadlinesisactually 50 higherthanthebaseline. Thisisduetotheserunshavingmoreactiveinstancesthanthe baseline,therebyincreasingtheexposureoftuplestocomputationalperformancefaults. The increased exposure is due to the assumption that all instances suffer computational performance faults at the same rate and that tuples are always processed by an active instance. Thisisakeyfinding: increasingresourcesblindlywithoutredundancycanpo- tentially cause worse performance. However, adding more passive instances diminishes and eventually overcomes the negative effect. In the best case, we observe a 73.15% reductioninthenumberofdeadlinesmissed. Table3.1: Increasedredundancyresultsinimprovedreal-timeperformance Allocation Std. Dev. Missed DeadlineMiss Miss Deadlines Percentage Reduction Baseline 63.157 1665 23.79% 0% 1/10/2/2/1 73.695 2639 37.70% -58.50% 1/9/2/2/1 71.294 2253 32.19% -35.32% 1/8/2/2/1 70.232 2219 31.70% -33.27% 1/7/2/2/1 63.523 1381 19.73% 17.06% 1/6/2/2/1 53.141 1013 14.47% 39.16% 1/5/2/2/1 34.173 447 6.39% 73.15% Finally,wediscusstheadditionalperformanceoverheadastabulatedinTable3.2. We comparethetotalexecutiontimeforanumberoftrialsforbothabaselineandDynamo- enabled topology for 7000 tuples. We measure the time from, the time the spout sends the first tuple until it receives the last acknowledgment. From these results, we can see that Dynamo and the redundancy scheme only adds an average of 71.5 milliseconds, which is minuscule compared to the total execution time. 
The overhead per tuple is therefore 10.214 microseconds, which is negligible relative to the average tuple latency ofroughly450milliseconds. ThisshowsthatDynamohasasmallperformanceoverhead. Furthermore,wecanconcludethattheredundantdata,intheformofduplicatetuples,in thisworkdoesnotoversubscribethecommunicationpathsandleadtomoredelays. We 51 alsoillustrateacurveofthenumberofdeadlinemissescomparedtothebaselineversus the passive/active ratio in Figure 3.13. Here, we note the clear trend in deadline misses asmoreredundancyisintroduced. Table3.2: Dynamoimposesminimaloverheadrelativetobaseline System Totalexecutiontime(seconds) Trial1 Trial2 Trial3 Trial4 Average Baseline 700.391 700.391 700.391 700.391 700.391 Dynamo 700.325 700.329 700.308 700.315 700.320 Difference 0.066 0.062 0.083 0.076 0.0715 Per-tuple(us) 9.429 8.857 11.857 10.857 10.214 Figure3.13: Tradeoffbetweenresourceusageandimprovedresiliency Overall, it is clear that redundancy in the form of duplicate tuples is an effective method for increasing the resiliency to computational performance faults. Our results show the potential for poor performance due to computational performance faults. We furthershowthatredundancycanprovideclearimprovementswithuptoa73.15%reduc- tion in the number of deadlines missed. We also show that the additional performance overhead is small relative to the total execution and that the communication overhead doesnotoversubscribethesystem. 52 3.2.2 GeneralizedApproach While the results in Section 3.2.1 are promising, the scheme presented there also im- poses a number of simplifying assumptions. This results in a restriction in the patterns of applications that can be supported. We therefore expand our work by relaxing those restrictions such that we can support a larger class of topologies. There are three main assumptions that we lift: stream groupings, operator ratios and topology patterns, and redundancy granularity. We expand our support of stream groupings to include fields groupings (Section 3.2.2.1). Our generalization also lifts the restriction on operator ra- tios, in that an operator may emit more than one result tuple for every received tuple (Section 3.2.2.2). Finally, we enable a finer redundancy granularity by modifying how ourmodeldistributesredundanttuples(Section3.2.2.3). 3.2.2.1 StreamGroupings Streamgroupingsdefinehowdata(tuplesinHeron)flowsbetweenstreamoperators. The twomostcommongroupingsareshuffleandfieldsgrouping. Therearenoguaranteeson wheretuplesaredeliveredtoundershufflegrouping. Forexample,tuplesmaybesentto downstream operator instances in a round-robin fashion, or in a load-balanced manner. In fields grouping, the user selects tuple fields to group the tuples on. This means that every tuple that has the same value in the user-selected fields will be sent to the same operator instance. This is similar to the reduce step in MapReduce. An example of this is a stream of HTTP page accesses. A user could choose to group tuples on the return status. Thiswouldcausealltupleswiththesamereturnstatuswouldbesenttothesame instance. WeillustratebothtypesofgroupingsinFigure3.14,firstshowninFigure1.4. In Figure 3.14a Bolt B receives tuples randomly, while Figure 3.14b Bolt B receives tuplesgroupedonthestatusfield. 
Figure 3.14: Stream groupings define the flow of tuples between operators ((a) shuffle grouping distributes tuples with no guarantees; (b) fields grouping sends tuples with the same status to the same instance)

In Section 3.2.1, we detailed limiting assumptions on the stream groupings that the redundancy model supported. In that work, we supported topologies that use shuffle grouping. However, this did prove to limit the topologies that our work could support. Therefore, we expanded the groupings to allow fields grouping.

Fields grouping requires more care to maintain the delivery semantics when introducing redundancy. While our initial work determined which instances to deliver tuples to by indexing an active and a passive list, we use a different methodology for our expanded work. Here, we instead access a complete list of target operator instances. We generate the index by hashing the user-defined fields for fields grouping, or the unique tuple ID for shuffle grouping. When we select a tuple for replication, we increment the generated index by one and send the tuple there. This maintains the grouping semantics. We illustrate this in Figure 3.15. In Figure 3.15a, the tuple with Tuple ID 2 is selected for replication. The original hash of Tuple ID 2 was 4, so the duplicate's index is hash(2)+1 = 5; the final index after applying the modulo with the size of the target instance list is 0. Likewise, in Figure 3.15b the tuple with Tuple ID 1 is selected for replication. The original index is 0, so the replicated tuple is sent to the operator at index 1.

Figure 3.15: Generalized stream groupings enable support for redundancy in fields grouping ((a) shuffle grouping; (b) fields grouping, selected field: Status)

3.2.2.2 Operator Input/Output Ratio and Topology Patterns

The next generalization that we make is the operator data input/output ratio. We define the ratio at each operator as 1:(number of tuples sent). Our initial work assumed that all operators have a ratio of either 1:1 or 1:0 (a sink).

We generalize our work to allow a ratio of 1:N. However, this introduces a number of potential issues with aliasing of intermediate tuples. Intermediate tuples are those tuples that are in flight in a topology. A requirement for applying redundancy in the topology is that a mechanism for identifying duplicates must exist. Our initial work identified tuples simply based on the unique tuple ID that is generated when a spout first sends a tuple. This is insufficient for 1:N ratio operators, as all of the emitted tuples will have the same ID. This aliasing case is annotated as (5) in Figure 3.16.

Furthermore, we lift the restriction that topologies must arrange operators in a linear, pipelined manner. This leads to other cases of aliasing due to splitting and joining of tuple streams.
We annotate these in Figure 3.16 as (3, 4, 5). Our solution expands each intermediate tuple ID to include a number of fields to address these new sources of aliasing. We tabulate and explain the reasons (as annotated in Figure 3.16) for each of these fields in Table 3.3.

Figure 3.16: Redundancy introduces opportunities for aliasing in intermediate tuples

Table 3.3: Tuple identifier fields prevent aliasing

Field name          | Aliasing type (as annotated in Figure 3.16)
--------------------|------------------------------------------------------------------
RootTupleIdentifier | Identifies the root tuple (1)
SpoutIndex          | Identifies which spout instance a tuple originated at (2)
SenderOperator      | Prevents aliasing when receiving tuples from different operators (joins) (3)
SenderIndex         | Prevents aliasing in joins from different instances of the same operator (4)
EmitCount           | Prevents aliasing when receiving tuples from operators with 1:N tuple ratios (5)

3.2.2.3 Redundancy Granularity

Another limitation in our initial work is the granularity at which we introduce redundancy. As shown in Figure 3.17a and discussed in Section 3.2.1, our previous work partitions a stream operator's instances into active and passive partitions, where replicated tuples are sent to both an active and a passive partition. However, this is limiting, as initial and replicated tuples can only execute on different subsets of the instances. For example, in Figure 3.17a, the primary copy of a tuple can only execute on two out of three instances, while a replicated copy can only execute on one out of three instances. We generalized the granularity such that copies of a tuple are not restricted in where they may execute. Therefore, we instead allow any copy of a tuple to go to any available instance. The benefits are twofold. First, this allows more granularity in the amount of redundancy, as we can set it at a tuple level. Secondly, this could allow for improved methods for load balancing and data distribution due to the increased resource (instance) availability. We illustrate this in Figure 3.17b, where both primary and redundant copies are sent to any instance.

Figure 3.17: Fine-grained redundancy increases the flexibility for distributing redundant tuples ((a) instance granularity; (b) tuple granularity)

Our actual implementation guarantees that copies of tuples are sent to the "next" operator instance based on the instance IDs. We illustrate an example in Figure 3.17b. If a tuple would normally be sent to B1, the redundant copies would be sent to B2. Under Heron's execution model, each operator instance is aware of its own ID and has a list of the target instances (targetInstanceList). For shuffle grouping, Dynamo hashes the parent tuple ID as defined by Dynamo (discussed in Section 3.2.2.2): target = targetInstanceList[hash(TupleID_parent)]. The replicated tuple is sent to the next instance in the list of target instances: target = targetInstanceList[hash(TupleID_parent)+1]. For fields grouping, Dynamo hashes the user-defined fields: target = targetInstanceList[hash(user-defined fields)]. Likewise, the replicated tuple is sent to the next instance: target = targetInstanceList[hash(user-defined fields)+1]. The tuple ID fields and hashes are set up such that the products of downstream operators go to the same instances to allow deduplication.
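A minimal sketch of this target-selection rule follows. It assumes a non-negative hash and wraps indices with a modulo, and the names (positiveHash, selectTargets) are illustrative rather than Dynamo's actual code.

import java.util.List;

public class TargetSelection {
    // Non-negative hash of a grouping key: the tuple ID for shuffle grouping,
    // or the concatenated user-defined fields for fields grouping.
    static int positiveHash(String key) {
        return key.hashCode() & 0x7fffffff;
    }

    // Returns {primaryIndex, replicaIndex} into targetInstanceList.
    static int[] selectTargets(String groupingKey, List<String> targetInstanceList) {
        int n = targetInstanceList.size();
        int primary = positiveHash(groupingKey) % n;
        int replica = (primary + 1) % n; // the replica goes to the "next" instance
        return new int[] { primary, replica };
    }

    public static void main(String[] args) {
        List<String> instances = List.of("B1", "B2", "B3", "B4", "B5");
        int[] targets = selectTargets("tuple-2", instances);
        System.out.println("primary: " + instances.get(targets[0])
                + ", replica: " + instances.get(targets[1]));
    }
}

Because every sender applies the same deterministic rule to the same key, both copies of a downstream result converge on the same instance, which is what makes deduplication possible.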
3.2.2.4 Experimental Evaluation

In this section we detail our methodology and results for an experimental evaluation that determines the effectiveness of generalized redundancy injection against computational performance faults. Our results show that granular redundancy for a generalized stream processing model is still effective, with up to 63.3% reductions in tail latency and 29.7% to 94.53% fewer missed deadlines when duplicating every tuple [54, 55]. We first detail the evaluation benchmark, followed by our experiment trials. Then, we discuss the fault injection parameters, experimental setup, and metrics before discussing the results.

For our experiments, we use topologies from Intel's Storm Benchmark [2]. We use the topologies that do not use Apache Trident: the DataClean, RollingWordCount, WordCount, Grep, PageView, RollingSort, SOL, and UniqueVisitor topologies. We exclude the Trident topologies as Trident imposes a different processing model than the original Storm/Heron model. However, the work here may be applied to the Trident model in future work. For each topology, we use the operators as-is due to Dynamo's compatibility with Heron's API. All topologies have a similar pattern of spout → bolt1 → bolt2. For example, the wordcount topology is composed of: SentenceSpout → SplitSentenceBolt → WordCountBolt. Each topology is configured to run three containers with a maximum of 500 concurrent tuples in flight, and processes 7000 tuples per trial. We configure each spout with a parallelism of one and configure each bolt with a parallelism of four instances. Finally, we calculate the maximum input rate for each topology (as reported in Table 3.4) by measuring the average per-tuple execution time.

Table 3.4: Measured topology input rates for generalized stream processing model experiments

Topology         | Tuples per second
-----------------|------------------
DataClean        | 46
RollingWordCount | 42
WordCount        | 44
Grep             | 118
PageViewCount    | 48
RollingSort      | 136
SOL              | 48
UniqueVisitor    | 48

Using the aforementioned benchmarks, we run a variety of trials. We first run a baseline topology with faults injected and no redundancy. Then, we enable a uniform level of redundancy for all stream operators across a range of values (12.5%, 25%, 50%, and 100%). We compare the performance at the different redundancy factors with the baseline's performance. Next, we vary the redundancy factor at a single operator at a time over a smaller range (25% and 100%). The goal here is to study whether redundancy is more effective for particular operators. Finally, we measure the effective performance overhead of Dynamo-enabled topologies and compare that to the baseline.

Next, we discuss the fault injection parameters. We again inject computational performance faults with characteristics driven by garbage collection pauses as reported in [36]. While garbage collection was the initial motivation, our work applies to computational performance faults caused by other factors as well, such as increased context switching due to resource contention. The main change would be in varying the fault injection parameters. For these experiments where we inject faults, the mean time between faults is exponentially distributed with a mean of one second, and the magnitude is Gaussian distributed with a mean of 150 ms. These parameters are within the range of garbage collection pauses as reported in [36].

We run our experiment trials on a cluster of three HP ProLiant DL380 Gen8 servers. Each contains two Intel Xeon E5-2650 processors, where each processor contains eight cores running at 2.00 GHz. We disable hyperthreading, Turbo Boost, power management, and frequency scaling to provide consistent and predictable performance.
Furthermore, 59 weconfigureeachserverwith48GBofmemory(24GBperNUMAnode). Wedeploy Zookeeper, Mesos, and Aurora across the cluster (Section 1.3). We also deploy Heron using the provided Aurora scheduler and using Zookeeper for its state storage. Finally, wedeployApacheKafkaasthedatasourceforsomebenchmarks. The primary metrics are tail latency, average latency, and deadline misses. The tail latency is the 99% percentile of the latencies for all tuples while the deadline is defined as + 2 (: mean, : standard deviation) of the baseline trial. Although these experiments do not target throughput as a primary metric, we do measure it to confirm thatourmethodologydoesnotdetrimentallyaffectit. Results We now present the results from our experimental evaluation. Our first experiments comparetheredundancyeffectivenesstoabaselinewhentheredundancyrateisuniform acrossalloperators. Wevarytheredundancyrateacrossarangeofvalues(12.5%,25%, 50%, and 100%). First, we present our results in Figures 3.18 and 3.19. Then we also tabulateresults,includingdeadlinemisses,foreachtopologyinTable 3.5. InFigure3.18weillustratetheimprovementsintaillatency,withafocusonthe90% andhigherlatencies. Curvesclosertotheleft-handsidearebetter;inotherwords,when the tail latency is lower. For all topologies, there is a clear improvement in the 99% tail latencies. As shown in Table 3.5, the best case reduction in tail latencies ranges from 16.3%to63.3%wheneverytupleisduplicated. In Figure 3.19, we illustrate the average performance. A lower mean (represented by the dashed line) is better. In these figures, we can see that the IQR range becomes smaller which implies less variation in the performance. Furthermore, for topologies 60 such as Grep (Figure 3.19d) and RollingSort (Figure 3.19f) we see that the mean drasti- callydrops,withthemeanlatencyonly35.62%and43.82%ofthebaseline,respectively. Themeanlatencyimprovementsforalleighttopologiesvaryfromabout9%to64%. 200 300 400 500 600 700 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (a)DataClean 200 300 400 500 600 700 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (b)RollingWordCount 200 300 400 500 600 700 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (c)WordCount 0 200 400 600 800 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (d)Grep 200 400 600 800 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (e)PageViewCount 0 200 400 600 800 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (f)RollingSort 100 200 300 400 500 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (g)SOL 200 400 600 800 Tuple latency (ms) 0.96 0.98 1.00 Percentile 0% 12.5% 25% 50% 100% (h)UniqueVisitor Figure3.18: Comparisonofimprovedtail-latencies(uniformredundancy) Wetabulatethenumberofmisseddeadlines,inadditiontoothermetrics,inTable3.5. Here, wedefinethedeadlineas + 2. Deadlinemissesrepresentameasureofhow often real-time performance constraints are violated. As expected, higher amounts of redundancyimproveboththemeanlatencyandamountofmisseddeadlinesastheprob- abilityofacomputationalperformancefaulthaltingprogressisreduced. Inthebestcase for the Grep topology, we observe a 94.53% reduction in the number of missed dead- lines. 
[Figure 3.19: Comparison of Dynamo versus baseline runtime latency (uniform redundancy); panels (a)-(h) show latency (ms) distributions for each topology at baseline and 12.5%, 25%, 50%, and 100% redundancy.]

From these results, we can conclude that redundancy through duplicating tuples can indeed increase the resiliency to computational performance faults, drastically in some cases. Furthermore, we can see that the additional overhead from redundant computation is outweighed by the increased resiliency, as the performance is not diminished. In fact, we observe that the mean latency decreases in most cases. Finally, in addition to the increased performance in latency and reduction in missed deadlines, we also note that redundancy can be effective in reducing the tail latency percentiles.
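A rough calculation shows why duplication compresses the tail. If a single copy of a tuple is delayed past the deadline with probability p, and delays strike the two copies independently, then the tuple is late only when both copies are:

\[ P(\text{miss, one copy}) = p, \qquad P(\text{miss, two independent copies}) = p^{2} . \]

For example, p = 0.05 per copy yields 0.0025 with duplication. This idealized independence argument is ours; in practice the copies share cluster resources and are not fully independent, so the realized gains are smaller, as the measurements above reflect.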
Table 3.5: Performance comparison across redundancy amounts for topologies (uniform at all operators)

  Topology          Trial           Mean (ms)  Missed     Missed Deadline  Tail Latency   Tail Latency
                                               Deadlines  Reduction (%)    (99%, ms)      Reduction (%)
  DataClean         0% (baseline)   46.618     636         0.000           255.337         0.000
                    12.5%           47.906     641        -0.786           268.510        -5.159
                    25%             46.177     597         6.132           249.529         2.275
                    50%             44.022     522        17.925           251.842         1.369
                    100%            42.422     392        38.365           213.650        16.326
  RollingWordCount  0% (baseline)   71.677     423         0.000           320.705         0.000
                    12.5%           70.717     399         5.674           319.419         0.401
                    25%             68.222     353        16.548           301.993         5.835
                    50%             64.493     258        39.007           282.032        12.059
                    100%            64.044     221        47.754           278.638        13.117
  WordCount         0% (baseline)   69.513     422         0.000           299.963         0.000
                    12.5%           68.715     374        11.374           289.475         3.496
                    25%             67.615     356        15.640           298.565         0.466
                    50%             66.372     349        17.299           293.521         2.148
                    100%            61.569     221        47.630           241.235        19.578
  Grep              0% (baseline)   21.144     621         0.000           190.606         0.000
                    12.5%           19.697     540        13.043           184.004         3.464
                    25%             18.117     451        27.375           179.991         5.569
                    50%             13.605     319        48.631           165.011        13.428
                    100%             7.532      34        94.525            69.879        63.338
  PageViewCount     0% (baseline)   51.700     546         0.000           278.821         0.000
                    12.5%           52.794     560        -2.564           291.870        -4.680
                    25%             49.584     460        15.751           262.812         5.742
                    50%             45.839     345        36.813           240.924        13.592
                    100%            44.462     298        45.421           221.871        20.425
  RollingSort       0% (baseline)   21.887     524         0.000           195.249         0.000
                    12.5%           23.167     516         1.527           221.616       -13.504
                    25%             17.818     396        24.427           182.325         6.619
                    50%             16.075     339        35.305           181.270         7.160
                    100%             9.593     100        80.916           123.750        36.619
  SOL               0% (baseline)   44.159     660         0.000           238.374         0.000
                    12.5%           44.682     672        -1.818           247.731        -3.925
                    25%             43.112     655         0.758           241.130        -1.156
                    50%             40.675     539        18.333           232.383         2.513
                    100%            39.958     464        29.697           212.593        10.815
  UniqueVisitor     0% (baseline)   52.602     583         0.000           269.784         0.000
                    12.5%           49.960     513        12.007           263.447         2.349
                    25%             48.485     468        19.726           249.872         7.381
                    50%             45.605     384        34.134           241.278        10.566
                    100%            44.717     288        50.600           217.984        19.200

Another measure of real-time performance is the throughput. Our work in this paper does not target throughput as a primary metric, but we evaluate it to confirm that our methodology does not detrimentally affect it. We tabulate the measured throughput in Table 3.6. Here, we observe that the throughput remains relatively consistent when redundancy is introduced. In fact, some cases show a slight increase in throughput as the redundancy amount increases.

Table 3.6: Average throughput for topologies (tuples/sec)

  Trial   DataClean  RollingWordCount  WordCount  Grep   PageViewCount  RollingSort  SOL   UniqueVisitor
  0%      44.8       43.6              45.5       109.6  46.6           138.8        50.1  46.6
  12.5%   44.8       43.6              45.6       109.4  46.8           139.4        50.0  46.5
  25%     44.7       43.5              45.5       109.7  46.6           139.7        50.1  46.6
  50%     44.7       43.6              45.5       110.3  46.6           139.8        50.1  46.7
  100%    44.8       43.6              45.6       109.8  46.7           142.3        50.1  46.6

For the next set of experiments, we study the redundancy effects at different operators. As the execution time for each operator is not uniform, redundancy may have stronger or weaker effects at each stage. Thus, it would be wasteful to duplicate every tuple by the same amount at every operator. We study the effect of duplicating once every tuple and once every four tuples at the spout, first bolt (bolt1), and second bolt (bolt2) individually. This is in contrast to the previous set of trials, where the redundancy amount was set uniformly for all operators. We illustrate our results in Figure 3.20. Here, we plot the number of missed deadlines when duplicating by the various amounts at each operator. Fewer missed deadlines are better.

From these plots, we can draw a striking conclusion: for all of the topologies in this study, tuple redundancy is most effective when applied to the tuples leaving the spout. Duplicating at the spout means the computation is redundant at the first bolt. There are two reasons for this. First, redundancy at the first bolt means that there are still downstream operators to filter out straggling data elements. Second, bolt1 commonly does more work relative to the other operators. More work to complete implies higher operator latencies, which result in increased susceptibility to computational performance faults due to a smaller margin before the deadlines. To take the WordCount application as an example, bolt1 is the SplitSentenceBolt, which has to split sentence strings into words, while bolt2 is a simple counter. For the remaining operators, the number of missed deadlines is roughly constant when varying redundancy at bolt2, but slightly increasing when varying it at bolt1 for some topologies. This implies that duplicating tuples going to bolt2 may be detrimental. For some of these topologies, the bolt1 → bolt2 transition is where a single input tuple results in n outgoing tuples. This results in a multiplicative effect for computation at bolt2, as the determination to duplicate or not is made per input tuple. Therefore, it is clear from these results that care should be taken when determining where to apply redundancy and by what factor.

Furthermore, we note that the improvements in performance do not come without cost, as there is additional overhead when duplicating tuples. We characterize those overheads by measuring the number of tuples that each operator receives and the actual CPU time consumed when redundancy is applied uniformly through the topology.
Heron provides measurements of the consumed CPU time as recorded by performance counters in the JVMs executing each stream operator instance. The JVM measurements capture the CPU time across all CPUs that execute the JVM for all stream operator instances. The former metric characterizes the extra communication overhead on the system, while the latter characterizes the computation overhead.

[Figure 3.20: Deadlines missed when replicating at a single operator; panels (a)-(h) plot deadlines missed at the spout, bolt1, and bolt2 for 0%, 25%, and 100% redundancy for each topology.]

We include results for the DataClean topology in Figure 3.21. Results for other topologies are similar. We observe that the communication overhead grows quickly as the redundancy amount goes up, with a worst case of 400% (Figure 3.21a), as expected. This is the maximum overhead possible and only occurs when a tuple is consistently selected for redundancy. Our scheme duplicates a tuple once, leading to 2x the number of tuples. However, the downstream operator (with two parallel copies of the tuple) may also duplicate its own outputs, thereby leading to a worst-case 4x number of tuples. However, many of the tuples are discarded, as we only process the first unique tuple that is received. This is highlighted in Figure 3.21b. It is remarkable that, at 100% redundancy, the communication overhead at bolt2 increases by 4x while the actual CPU resource usage increases by only 1.74x. Likewise, at bolt1, the communication overhead is 2x while the actual CPU usage increases by 1.33x. From these results, we can conclude that, for this study, the computational resource overhead is not unreasonable for the given increases in resiliency and real-time performance.

[Figure 3.21: Comparison of overhead for bolts across redundancy levels (uniform redundancy); (a) execute count (tuples received at each operator) and (b) CPU time, for bolt1 and bolt2 at 0% through 100% redundancy.]
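The multiplicative effect can be stated generally, under the simplifying assumption (ours, generalizing the two-bolt case above) that every stage duplicates with the same redundancy fraction r and no copies are filtered:

\[ \text{worst-case tuple copies arriving at stage } k \;\le\; (1+r)^{k}, \qquad r = 1 \;\Rightarrow\; 2^{k}, \]

which gives the 2x at bolt1 (k = 1) and the 4x at bolt2 (k = 2) observed here. Because duplicates are in fact filtered at each stage, the computation overhead grows much more slowly, as Figure 3.21b shows.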
3.2.3 Progress Tracking Granularity

While we are able to effectively apply redundancy to stream processing, Heron's execution model imposes its own limitations, specifically in the tracking of tuple progress through a topology. We first define the tuples that spouts emit as root tuples and the tuples in flight within the topology as intermediate tuples, as illustrated in Figure 3.22. Heron is aware of unique root tuples, but assumes only one copy of each intermediate tuple. This causes the limitation that although we inject redundancy through replicated tuples, Heron is not aware of this. Therefore, every intermediate tuple, whether a replica copy or the original, is treated as a unique tuple, which limits the potential resiliency and performance gains. Furthermore, this means that our redundancy scheme is ineffective against communication faults as long as the redundancy relies on Heron's progress tracking.

[Figure 3.22: Root and intermediate tuple example; a SentenceSpout emits the root tuple "Hello USC world" to a SplitBolt, whose intermediate tuples ("Hello", "USC", "world") flow to CountBolt instances, producing Hello:1, world:1, USC:1.]

We illustrate an example of the weakness due to reliance on Heron's progress tracking in Figure 3.23 using three scenarios. The topology in the first scenario has one stage, while the latter two scenarios have two stages. In the first scenario, redundancy is unable to provide any benefit because Heron is unaware of the two redundant copies of the intermediate tuple. It therefore waits for both operator instances to complete, as seen in Figure 3.23a. This would not be the case if the execution framework were aware that the two intermediate tuples are redundant, as shown in Figure 3.23b. However, this weakness may be masked if there is a downstream operator, as illustrated in the second scenario. The third scenario illustrates the largest difference: Heron's progress tracking is incapable of handling any failures in communication, unlike fine-grained, redundancy-aware progress tracking. Heron instead will time out the root tuple and resend it for the topology to process again, incurring a substantial performance penalty due to the timeout and repeated work.

In summary, the limitation is primarily due to the coarse granularity of Heron's acknowledgments, which are unaware of redundant copies of intermediate tuples that may be in flight in the topology. We discuss details of this scheme in Section 3.2.3.1. However, we can resolve this by implementing our own fine-grained progress tracking. We present details on how we achieved this in Section 3.2.3.2.

[Figure 3.23: Fine-grained acknowledgments increase redundancy effectiveness; (a) Heron's progress tracking is not aware of redundant tuples, leading to inefficient use of redundancy (unnecessary wait time for Bolt1-2, and timeouts that replay tuples already processed by Bolt1-1 and Bolt2); (b) custom acknowledgments with awareness of unique tuples complete a tuple as soon as the first copy finishes, increasing the resiliency and performance gains from redundancy.]

3.2.3.1 Coarse-Grained Progress Tracking

As discussed in Section 3.2.3, coarse-grained progress tracking can limit the effectiveness of redundancy in increasing the resiliency to both performance and communication faults. Heron's current progress tracking is coarse-grained. However, the interface for user-level logic is simple; we summarize it in Table 3.7. At spouts, the three methods are emit(tuple, messageId), ack(messageId), and fail(messageId). User logic injects root tuples into the topology through the emit(tuple, messageId) method. Heron then distributes the intermediate tuples and tracks their progress through the topology.
Once all operators have processed the intermediate tuples of a given root tuple, Heron informs the user by calling the ack(messageId) method with the same messageId that the user used in the emit() call. If a root tuple fails to complete processing within a user-specified timeout, or if a bolt explicitly fails an intermediate tuple, then Heron will call the fail(messageId) method, again with the unique identifier that the user initially used.

Table 3.7: Heron provides a simple interface for progress tracking

  Operator  Method                              Description
  Spouts    emit(tuple, messageId)              called by user logic to send a tuple
            ack(messageId)                      called by Heron to notify the user that the tuple has completed processing
            fail(messageId)                     called by Heron to notify the user that the tuple has failed or timed out
  Bolts     emit(incomingTuple, outgoingTuple)  send an intermediate tuple and add it to the tuple tree
            ack(incomingTuple)                  notify Heron that this intermediate tuple has been processed
            fail(incomingTuple)                 notify Heron that this intermediate tuple failed

Heron tracks progress at the stream managers. Heron stream operator instances (both bolts and spouts) all have a local stream manager (Section 2.1). Each stream manager uses tuple trees to track progress. We illustrate the lifetime of a tuple tree in Figure 3.24. The local Heron stream manager creates a tuple tree when a spout emits a root tuple. As the local stream manager delivers intermediate tuples to the stream managers serving the bolt instances, it adds a tuple node to the tree. When the bolts acknowledge the tuple, the bolt's stream manager informs the spout's stream manager, which then removes the tuple node from the tree. Once the tuple tree is empty, Heron calls the ack() method for the relevant spout. However, we highlight that tuple trees are not aware of unique copies of intermediate tuples that may be injected into the topology. As discussed previously, this can limit the effectiveness of redundancy.

[Figure 3.24: Heron tracks acknowledgments through the use of a tuple tree; the tree for the root tuple "Hello USC world" grows as Bolt 1 emits intermediate tuples and shrinks as Bolt 1 and Bolt 2 acknowledge them.]
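To make the interface of Table 3.7 concrete, user-level spout logic that replays failed root tuples could look roughly like the sketch below. This is a hypothetical, self-contained illustration of the callback flow, not Heron's actual classes; the Emitter interface stands in for the output collector.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Hypothetical sketch of user-level spout logic against the
  // emit/ack/fail callbacks summarized in Table 3.7.
  public class ReplayingSpoutSketch {
      interface Emitter { void emit(String tuple, long messageId); }

      private final Emitter out;
      private final Map<Long, String> pending = new ConcurrentHashMap<>();
      private long nextId = 0;

      ReplayingSpoutSketch(Emitter out) { this.out = out; }

      // User logic injects a root tuple and remembers it until acked.
      void emitRoot(String tuple) {
          long id = nextId++;
          pending.put(id, tuple);
          out.emit(tuple, id);
      }

      // Heron calls this once every intermediate tuple of the root is processed.
      void ack(long messageId) { pending.remove(messageId); }

      // Heron calls this on timeout or explicit failure: replay the root tuple.
      void fail(long messageId) {
          String tuple = pending.get(messageId);
          if (tuple != null) out.emit(tuple, messageId);
      }

      public static void main(String[] args) {
          ReplayingSpoutSketch spout = new ReplayingSpoutSketch(
              (tuple, id) -> System.out.println("emit \"" + tuple + "\" id=" + id));
          spout.emitRoot("Hello USC world");
          spout.fail(0);   // simulated timeout: the root tuple is replayed
          spout.ack(0);    // eventually completes
      }
  }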
3.2.3.2 Fine-Grained Progress Tracking through Custom Acknowledgments

To address the weaknesses due to Heron's coarse-grained progress tracking, we designed and implemented fine-grained progress tracking that is aware of potential redundant copies of in-flight, intermediate tuples. We accomplish this by first designing a unique identifier for each intermediate tuple. Then, we add an additional acknowledgment layer between the user-level logic and Heron in order to track the progress. We discuss details on the identifiers and our acknowledgment layer in this section.

The initial problem is that although our redundancy scheme is aware of unique independent tuples, Heron is not. Fortunately, we already have a unique identifier due to our need to filter redundant copies at downstream operators. We discuss the details of the identifiers in Section 3.2.2.2.

We implement our fine-grained progress tracking through the addition of a layer to both spouts and bolts that is solely dedicated to tracking and sending acknowledgments. Figure 3.25 illustrates this. When a spout emits a tuple, the acknowledgment layer generates a unique root identifier and creates an emit set and an ack set. The emit set keeps track of outstanding pending intermediate tuples, while the ack set keeps track of received acknowledgments. When bolts acknowledge a tuple, the acknowledgment layer sends a message directly to the spout instance that emitted the root tuple. The spout then removes the pending intermediate tuple from its emit set. Likewise, when bolts emit an anchored tuple, the acknowledgment layer sends a message to the spout notifying it of the new emit. The spout then adds the intermediate tuple identifier to the emit set. When the set is empty, the spout's acknowledgment layer notifies the user logic. The spout's acknowledgment layer also keeps track of tuples that time out. We illustrate a simple example of this mechanism at work in Figure 3.26.

[Figure 3.25: The modular approach enables easy addition of a fine-grained progress tracking layer; (a) spout implementation and (b) bolt implementation, with the custom acknowledgment layer sitting between the user logic (fault injector, Dynamo, metrics collector) and Twitter Heron.]

[Figure 3.26: Custom acknowledgments enable fine-grained progress tracking; as tuples are emitted to operators B and C and then acknowledged, the emit set grows from {B} to {B, C} and drains to {}, while the ack set fills correspondingly.]
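The per-root-tuple bookkeeping of Figure 3.26 can be sketched as follows; the class and method names are ours, and the messaging between bolt and spout instances is elided.

  import java.util.HashSet;
  import java.util.Set;

  // Sketch of the acknowledgment layer's emit-set/ack-set bookkeeping
  // for one root tuple (cf. Figure 3.26). Transport is elided.
  public class RootTupleTrackerSketch {
      private final Set<String> emitSet = new HashSet<>(); // pending intermediates
      private final Set<String> ackSet  = new HashSet<>(); // acknowledgments seen

      // A bolt emitted an anchored intermediate tuple with this identifier.
      void onEmit(String tupleId) {
          if (!ackSet.contains(tupleId)) emitSet.add(tupleId);
      }

      // A bolt acknowledged an intermediate tuple; redundant copies carrying
      // the same identifier collapse into one entry here.
      void onAck(String tupleId) {
          ackSet.add(tupleId);
          emitSet.remove(tupleId);
      }

      // The root tuple completes once no intermediate tuples are outstanding.
      boolean isComplete() { return emitSet.isEmpty(); }

      public static void main(String[] args) {
          RootTupleTrackerSketch t = new RootTupleTrackerSketch();
          t.onEmit("B"); t.onEmit("C"); t.onEmit("C");  // redundant copy of C collapses
          t.onAck("B"); t.onAck("C"); t.onAck("C");     // duplicate ack is a no-op
          System.out.println("complete: " + t.isComplete());  // true
      }
  }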
3.2.3.3 Experimental Evaluation

In this section, we detail our experimental methodology and results for studying how fine-grained progress tracking increases the effectiveness of our generalized redundancy scheme [56]. While redundancy using Heron's coarse-grained progress tracking is still effective against computational performance faults, we show that fine-grained progress tracking enables resiliency to a wider class of faults, including communication and failed-instance faults (Section 3.2.3). Furthermore, we characterize the real-time performance improvements over a wider range of redundancies. Finally, we also show that redundancy is effective at increasing the resiliency to both recoverable and aggregated persistent failed-instance faults.

We use topologies from Intel's Storm Benchmark, similarly to the study in Section 3.2.2.4. The topologies in the benchmark are: DataClean, Grep, PageViewCount, SOL, UniqueVisitor, and WordCount. Each of the topologies has a similar pattern of spout → bolt1 → bolt2; for example, the WordCount topology is composed of: SentenceSpout → SplitSentenceBolt → WordCountBolt.

In the case of the Grep benchmark, we make small modifications to address limitations of the original implementation. The original Grep topology (KafkaSpout → FindMatchingSentence → CountMatchingSentence), as defined by the Intel Storm benchmark, is not scalable. The reason for this is that the FindMatchingSentence → CountMatchingSentence grouping is a fields grouping on the first field from the FindMatchingSentence bolt. However, the FindMatchingSentence bolt only emits a value of "1" whenever it finds a match. This results in all tuples going to the same CountMatchingSentence instance, thereby creating a bottleneck. In practice, this means that the topology as-is is unable to scale, even with higher-parallelism operators. Therefore, for the purposes of our study, we modify the Grep topology to use shuffle grouping at the FindMatchingSentence → CountMatchingSentence relation. We refer to this modified topology as GrepShuffle in this paper. In Figure 3.27 we illustrate the difference between the two benchmarks. On the left, the Grep topology quickly saturates due to the bottleneck. In comparison, the GrepShuffle topology is able to scale and maintain a reliable throughput over time.

[Figure 3.27: Limited scalability of the Grep topology; throughput (tuples/second) over time for Grep versus GrepShuffle.]

Furthermore, for these experiments we increase the scale of the topologies by increasing the number of containers to five and increasing the number of instances per bolt to 33. This results in 67 instances, a similar scale to related work in the field [25]. We use USC's HPC to support the experiments at this scale, as discussed in Section 1.3.3. More precisely, we submit our jobs to run on Lenovo NeXtScale nx360 M5 nodes, each configured with two Intel Xeon E5-2640 v3 processors, 64 GB of memory, and 1.79 TB of disk space, and connected using both gigabit Ethernet and 56.6-gigabit FDR InfiniBand. Each CPU core runs at 2.60 GHz. We use six nodes, where one node is dedicated to cluster management and five nodes are used for computation. This results in 80 available CPU cores (16 per node). Our workload requires 67 cores for the operator instances (one per instance), five cores for the stream managers, one core for the topology manager, and five cores for the metrics managers, for a total of 78 of the 80 CPU cores. Each topology processes 220,800 tuples per trial. We deploy Apache Zookeeper, Apache Mesos, and Apache Aurora as the support framework for Heron. Finally, we set the maximum input rate for each topology as reported in Table 3.8.

Table 3.8: Maximum topology input rates for generalized stream processing model experiments with fine-grained progress tracking

  Topology            Tuples per second
  DataClean           368
  RollingWordCount    336
  WordCount           352
  Grep                944
  PageViewCount       384
  RollingSort         1088
  SOL                 384
  UniqueVisitor       384

Our experiments study three general areas. First, we compare our fine-grained progress tracking with the coarse-grained tracking. Then, we study increases in real-time performance resiliency when using the new fine-grained progress tracking compared to the generalized redundancy injection as studied in Section 3.2.2.4. Finally, as fine-grained progress tracking enables redundancy to be effective against communication and failed-instance faults, we study those as well.

We use slightly different metrics for each of the three general areas. For the progress tracking comparison, we primarily use tail and mean per-tuple latencies. However, we also study the throughput as a secondary metric, and we compare overheads by measuring the CPU time used. For the real-time performance resiliency studies, we compare the number of missed deadlines, mean latency, and tail latency. For the communication and failed-instance faults, we use the tuple fail rate as the primary metric. This metric measures how many tuples fail and must be replayed by the topology, thereby substantially increasing the per-tuple latency. Finally, we again measure the computation and communication overheads using the CPU time and the number of tuples processed, respectively.

For comparing the two progress tracking schemes, we use the DataClean topology as our benchmark. Furthermore, we allow the system latency to reach a steady state by discarding the first 9,000 tuples to avoid start-up effects. We illustrate our results in Figure 3.28. The original scheme has a mean of 6.28 ms and a 99% tail latency of 9.71 ms, while our fine-grained system has a mean of 4.27 ms and a tail latency of 6.70 ms. Here, it is clear that although we implement our progress tracking at a higher level, we do not observe a decrease in performance. On the contrary, we observe a reduction in the mean tuple latency of 32% and a reduction in the 99% tail latency of approximately 30%.
This is contrary to expectations, as the original implementation is designed for high performance. In short, the stream managers, which track acknowledgments, are implemented in C++ specifically for performance reasons. An acknowledgment from a bolt instance would take many hops: bolt instance → local stream manager → spout's stream manager, and eventually to the spout. The difference in our implementation is that spouts directly track acknowledgments from bolt instances, so an acknowledgment goes directly from the bolt instance to the spout. The latency improvements may be due to fewer hops for acknowledgment and emitted-tuple messages, as the bolt instances communicate directly with the spouts themselves. We do measure a 78% increase in CPU usage at the spout due to the progress tracking and the threads that listen for messages from bolt instances. However, that amounts to only a 19% increase in total CPU usage for the whole topology. We also measure a 1%-2% reduction in throughput. Further effects of the implementation differences may become clearer at higher scales. The reductions in throughput could be due to the increased load at the spouts resulting in slower message injection into the topology. The improvements further imply that the implementation of the fine-grained progress tracking has a lower overhead than the existing Heron progress tracking scheme implemented at the stream managers, albeit at the cost of slightly reduced throughput and increased CPU load at the spouts. The improvements show that this fine-grained progress tracking scheme can trade increased CPU load at the spouts for reduced tuple latencies for these workloads.

[Figure 3.28: Fine-grained acknowledgments reduce mean and tail latencies.]

Next, we evaluate whether there are any improvements in the real-time performance and resiliency to computational performance faults compared to previous work. We inject faults with an exponentially distributed frequency, where the mean period is one second, and a Gaussian distributed magnitude with a mean of 150 ms, to compare with existing work as presented in Section 3.2.2.4 [55]. However, we only introduce redundancy at the spout instead of uniformly throughout the topology. This was done to measure the redundancy effects at a single operator, but may also result in a reduced load compared to the prior work. Our results show that the improvements to the progress tracking enable further gains in real-time resiliency despite less redundancy. In Figure 3.29 we plot the reductions in the number of missed deadlines and mean latencies. Overall, we show that redundancy and our fine-grained progress tracking can result in 22% to 93% fewer missed deadlines and 16% to 72% shorter mean latencies. This is an improvement over the maximum reductions of 50% fewer missed deadlines in previous work for comparable topologies.

[Figure 3.29: Redundancy increases performance resiliency for all topologies; reductions in missed deadlines and mean latency across redundancy percentages for DataClean, SOL, WordCount, PageView, UniqueVisitor, and RollingWordCount.]

Figures 3.30a-3.30b illustrate the tail latencies for the DataClean and WordCount topologies. We observe 3% to 58.9% reductions in tail latency. The main reason for the improvements in real-time resiliency is the fine-grained acknowledgments. This implementation allows the system to avoid waiting for acknowledgments from redundant copies of tuples, thereby improving the latency.
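Stated simply: if the two copies of a tuple finish at times L1 and L2, coarse-grained tracking completes the tuple at the later of the two, while redundancy-aware tracking completes it at the earlier:

\[ L_{\text{coarse}} = \max(L_1, L_2), \qquad L_{\text{fine}} = \min(L_1, L_2), \qquad \mathbb{E}[\min(L_1, L_2)] \le \mathbb{E}[L_1] \le \mathbb{E}[\max(L_1, L_2)] , \]

so duplication can only help once the tracking layer recognizes the copies as interchangeable.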
Furthermore, while the initial comparison shows that fine-grained progress tracking also provides an additional reduction in latency, we account for those effects, as the improvements affect the baseline 0% redundancy executions as well.

[Figure 3.30: Larger improvements in tail latency as redundancy increases; (a) tail latency for DataClean and (b) tail latency for WordCount across 0% to 100% redundancy, with the 99th percentile marked.]

Our next experiments target a fault type that has not been studied in the context of real-time stream processing: failed communication faults. Per the discussion in Section 3.1.2, we study the more probable correlated-occurrence model. We vary the characteristics of the injected faults with MTBFs of 20, 40, and 60 seconds and MTTRs of 500, 1000, and 1500 milliseconds. The MTBF is exponentially distributed, while the MTTR is Gaussian distributed. Exponential distributions are commonly used for fault tolerance studies due to the memoryless property, meaning that the time to the next fault at any instant does not depend on the already elapsed time. We then inject redundancy at the spout outputs, meaning that computation at the first bolt is redundant. We set the redundancy to 25%, 50%, 75%, and 100%. We plot our results in Figure 3.31. Each row represents the results for a topology and varies the MTBF, and each line in the plots represents a different MTTR. Here, it is clear that increasing redundancy is effective in reducing the failed tuple rate. For the DataClean topology, 100% redundancy can reduce the failed tuples by more than 51%, while the reduction for WordCount is more modest, with a best case of 14%. The reasons for this lie in the workload characteristics, specifically the tuple ratio. The tuple ratio is the ratio of input to output tuples for each bolt. DataClean has a 1:1 ratio, meaning that bolts send one tuple for every tuple received. WordCount has a 1:N ratio, as it emits N tuples (where N is the number of words in each sentence) for each sentence tuple received. This results in an increase in the exposure to potential faults, as there are more intermediate tuples to process. This effect shows in both the failure rate and the improvements in resiliency.

Finally, we study the resource overheads caused by the redundancy framework. We measure the CPU time across every CPU as recorded by the JVMs for all stream operators as a measure of the computational overhead, and the number of tuples sent as a measure of the communication overhead. Our results show that the computational overhead scales more slowly than the communication overhead. For example, we measure a computational overhead of 25.7% and a communication overhead of 100% at 100% redundancy. The overheads are consistent with expectations, as the communication overhead directly scales with the amount of redundant tuples injected.

[Figure 3.31: Failed tuple percentages across communication error parameters; tuple fail rate versus redundancy percentage for WordCount and DataClean at MTBFs of 20, 40, and 60 seconds and MTTRs of 500 ms, 1000 ms, and 1500 ms.]

[Figure 3.32: Overhead for computation scales slower than for communication; total computation and communication overheads at 0% to 100% redundancy.]
Our results show a clear trend: higher levels of redundancy result in stronger real-time resiliency. However, the resiliency does come with an overhead cost (35% computation and 100% communication overhead for up to a 58% reduction in tail latencies and a 51% increase in resiliency to communication faults). While the trade-off of increased resource overhead for increased performance resiliency may not be appropriate for all applications, there are cases where it makes sense. For instance, at the scales at which corporations run, increased sales and revenue due to higher customer satisfaction may outweigh the cost of additional compute resources.

We next present data from our study of performance and overhead effects over a wider range of redundancy percentages. The faults in this study are computational performance faults with an exponentially distributed MTBF of one second and a Gaussian distributed MTTR with a mean of 150 ms. We illustrate the results in Figure 3.33, where Figures 3.33a through 3.33c show the improvements in our primary performance metrics (mean latency, 99% tail latency, and number of missed deadlines) and Figures 3.33d and 3.33e show the computation and communication overheads.

For all three performance metrics, we first note that there is a point of diminishing gains at approximately 100% redundancy. At this point, every tuple is fully redundant, meaning that every tuple can tolerate a single fault. Redundancy levels exceeding 100% would be able to handle more than one fault per intermediate tuple; however, in our study the probability of multiple faults affecting the same intermediate tuple is low. Furthermore, the reductions in the number of missed deadlines and mean latency are linear in the 0% to 100% range, while the tail latency reductions are polynomial. In the 100% to 180% range, the reductions in the number of missed deadlines, tail latency, and mean latency are linear but with a shallower slope. The exception in this range is the tail latency reductions for the WordCount topology, which again follow a polynomial trend instead of a linear one. This can be attributed to the differences in the complexity of the WordCount topology, as it has a higher operator input/output ratio (1:N) compared to the other topologies (1:1). We study this difference in more depth in the failed-instance fault study later in this section. The results also clearly show that the improvements in mean latency and the number of missed deadlines are similar for all topologies in our study.

[Figure 3.33: Performance improvements and resource overheads over redundancies; (a) mean latency, (b) 99% tail latency, and (c) missed deadline reductions, plus (d) computation and (e) communication overheads, for UniqueVisitor, DataClean, SOL, PageViewCount, GrepShuffle, and WordCount at 0% to 180% redundancy.]

However, in the case of the 99% tail latency, the WordCount topology has a smaller reduction compared to the other topologies.
The primary reason for this is the unique data distribution pattern that WordCount has relative to the other topologies. In the other topologies, all operators have a 1:1 data ratio. WordCount has a 1:N ratio, where N is the number of words in a given sentence tuple. This results in a high fan-out of intermediate tuples, thereby increasing the exposure to faults at downstream operator instances. This occurs because of the stream processing model: a root tuple is not fully processed until every single intermediate tuple in the tuple tree is processed. Furthermore, a 1:1 ratio means that intermediate tuples are only vulnerable to faults at a single operator instance, while a 1:N ratio results in vulnerability to faults at N operator instances. The effects of this exposure are further observed when studying persistent failed-instance faults later in this section (Figure 3.35).

In Figures 3.33d and 3.33e we see that the computation and communication overheads can be substantial if high resiliency is desired. In Figure 3.33e, we show that the communication overhead is linear in the redundancy level. This is intuitive: no matter the redundancy level, additional tuples will be generated and distributed to parallel instances. However, the computation overhead shown in Figure 3.33d, while linear, scales more slowly than the communication overhead. Even at 185% redundancy, we still have less than 100% compute overhead, as redundant tuples get filtered out downstream.

Our next set of results studies the effectiveness of redundancy in increasing resiliency to failed-instance faults. These are faults where particular instances of the operators in a stream processing application fail, as introduced in Section 3.1.3. We study two models for these faults: (1) a single-shot, recoverable failure and (2) persistent aggregated failures. The first models a single instance failure that eventually recovers and continues processing. The latter models cases where more serious errors cause multiple instances to fail and not recover; an example cause of this type is failure of the underlying hardware.

Our primary metric for this study is the number of missed deadlines. We illustrate our results for the single failed-instance experiments in Figure 3.34. We test three different recovery times (10 seconds, 60 seconds, and 120 seconds) and vary the time to the failure (30 seconds, 60 seconds, and 120 seconds). It is clear that redundancy is effective for all topologies. In each case, the reduction is linear in the amount of redundancy. Furthermore, we note that 100% redundancy can drive the number of missed deadlines to zero. We also note that the 10-second MTTRs are tightly grouped, as expected. The numbers of missed deadlines for the baseline (0% redundancy) are also ordered based on the MTTRs. Therefore, the magnitude of the failed instances dictates the reduction percentages, as the number of misses is driven to zero at 100%.

We illustrate our results for the aggregated persistent failed instances in Figure 3.35. In these experiments, we allow up to 22 randomly selected bolt instances to fail. We randomly generate three failure profiles (Profiles 1, 2, and 3). Here, our primary metric is again the number of missed deadlines.

We first note that the WordCount topology is most affected by aggregate failed instances. In fact, the topology does not complete processing until redundancy reaches 100% or more. Therefore, we cannot compute the relative reduction, as the deadline is determined by the performance at 0% redundancy and WordCount fails to complete at that level. However, again it is clear that redundancy is highly effective in reducing the number of missed deadlines for the other topologies.
There are fewer misses as the redundancy level increases. At 25% redundancy, the reductions range from 22.24% to 46.65%, while at 100% redundancy, the reductions range from 90.47% to 100%.

Finally, we observe an interesting trend for the topologies with fields grouping (DataClean, PageViewCount, and UniqueVisitor). Each of these topologies processes tuples of website accesses, where each tuple has a user, URL, status, and user ID field. DataClean uses fields grouping based on the status field, while PageViewCount and UniqueVisitor group tuples based on the URL field. The data in Figure 3.35 shows that each of these topologies has failure profiles that decrease the performance more than others. For DataClean, that profile is Profile 1, while the PageViewCount and UniqueVisitor topologies are more susceptible to Profiles 2 and 3. This can be explained as follows: in fields grouping, Heron distributes tuples based on fields in each tuple. If certain keys show up more frequently than others, the bolt instances responsible for those tuples are more important for the progress of the overall stream. Therefore, if those instances were to fail, the topology would suffer more deteriorated performance. Likewise, if an instance that processes a less frequent key were to fail, then the topology performance would deteriorate less.

[Figure 3.34: Redundancy increases resiliency to recoverable single-occurrence failed instances; panels (a)-(f): DataClean, GrepShuffle, PageViewCount, SOL, UniqueVisitor, and WordCount.]

[Figure 3.35: Redundancy increases resiliency to aggregated persistent failed instances; panels (a)-(e): DataClean, GrepShuffle, PageViewCount, SOL, and UniqueVisitor.]

Overall, we see that redundancy is indeed effective in increasing the resiliency to aggregated persistent failed instances. In five of the topologies, redundancy reduces the number of missed deadlines by 90.47% to 100% of the baseline misses. Furthermore, the last topology (WordCount) is able to finish execution at 100% redundancy.

Our last study looks at the susceptibility of certain topologies to faults and their reactivity to redundancy injection compared to other topologies. In the discussions of Figures 3.33 and 3.35, we presented a simple argument for why WordCount is different: its operator input/output ratio is 1:N instead of the 1:1 of the other topologies. This results in increased exposure, as each incoming root tuple to the topology has more nodes in the intermediate tuple tree. We test this by using aggregated persistent failed-instance fault injection experiments similar to the work in Figure 3.35. We then add two modified WordCount topologies. In the first, the spout emits a sentence tuple with 11 words such that the words are distributed evenly across each counter bolt. In the second, the spout emits at a finer granularity by emitting a single word at a time while using the same words as the first topology. The words are therefore still evenly distributed across each counter bolt. The first version thus has an operator input/output ratio of 1:11, while the second has a 1:1 ratio similar to the other topologies. We then execute each topology, inject aggregated persistent failed-instance faults, and record the trials that successfully complete execution. We tabulate our results in Table 3.9. It is clear that the higher input/output ratios affect the two sentence-based topologies to the point that they fail to make progress at redundancy levels where the topology that uses single words still completes. For the original WordCount, 150% redundancy is required to make progress and complete execution, while 140% redundancy is needed for the long-sentence topology. Conversely, the single-word topology is able to complete execution even at low levels of redundancy.
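A back-of-the-envelope model (ours, not derived from the measurements) makes this sensitivity to the ratio explicit. If each intermediate tuple independently encounters a fault with probability p, a root tuple that fans out into N intermediate tuples completes fault-free with probability

\[ P(\text{root tuple fault-free}) = (1-p)^{N} , \]

which decays quickly in N: for p = 0.01, this is 0.99 for the single-word topology (N = 1) but only about 0.90 for the eleven-word sentences (N = 11).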
Based on these results, we conclude that extra care should be taken to minimize operator input/output ratios to reduce exposure to intermittent runtime faults. This is consistent with similar claims in related work that high fan-out increases the number of potential bottlenecks as a cause of stragglers [16].

Table 3.9: Higher operator input/output ratios increase exposure to runtime faults (✗: fails execution; ✓: completes execution)

  WordCount      Operator Input/Output Ratio   Redundancy level: 0%   140%   150%
  Original       1:Random                                        ✗     ✗      ✓
  SingleWord     1:1                                             ✓     ✓      ✓
  LongSentence   1:11                                            ✗     ✓      ✓

Chapter 4

Matching Fault Resiliency to Runtime Load and Resource Availability

While redundancy is proven to both improve performance resiliency and potentially increase the real-time performance, the work in Chapter 3 is limited to static redundancy levels. In other words, the redundancy does not change over the lifetime of the application. However, real-time stream processing workloads tend to be multimodal. As a result, it can be difficult to tune and design system deployments. For example, a "tweet" processing application will have a higher load when users are active during the day and a lower load at night when users are sleeping. While one can tune for the worst-case scenarios, this makes inefficient use of resources in times of low utilization. Furthermore, stream processing applications can have dynamic real-time requirements. For example, an application processing frames from video streams may have tighter delays for feeds with a higher frame-per-second rate.

We therefore extend our work to study dynamic resource usage by tuning the redundancy of stream processing applications at runtime through a feedback mechanism. Previous studies [50, 51] implemented a dynamic scheme for scheduling resources to match runtime loads. In our initial work, we assume a preexisting model where the given application has been characterized and profiled. We then introduce a runtime monitor that makes use of this model and the runtime metrics to make decisions on the redundancy, as illustrated in Figure 4.1. Our later work studies a more dynamic and generalizable approach, where we introduce a runtime redundancy manager that uses only metrics from a running application to tune and seek out the proper amount of redundancy.

[Figure 4.1: A monitor can make use of runtime metrics feedback and an existing performance model to dynamically manage resources; the monitor consumes runtime metrics from the topology and the performance model, and outputs a resource allocation.]

We organize this chapter as follows: Section 4.1 details our initial feasibility and experimental study of dynamic redundancy and resource allocation on a restricted stream processing model [53]. Then, in Section 4.2, we detail and evaluate our improved feedback scheme that works in conjunction with the fine-grained and generalized redundancy model as presented in Sections 3.2.2 and 3.2.3.

4.1 Dynamic Allocation for Simplified Stream Processing Model and Injection

Our initial work in this area is directly related to our solution that injects redundancy into a restricted stream processing model in Section 3.2.1. We can utilize a simple heuristic for determining the redundancy factor due to the assumptions in the model. The restricted stream processing redundancy model assumes that stream operators are partitioned into active and passive instances and that the ratio between the two determines the redundancy factor. The redundancy model also assumes that the number of active instances is sufficient to maintain the real-time performance for a given stream load. Therefore, the runtime monitor simply needs to calculate the partition for each operator.
Our monitor uses a simple algorithm (Algorithm 1) to determine the number of required instances for each operator [53]. The runtime performance monitor uses this algorithm to recompute the number of active and passive instances whenever the running application notifies it of a mode change. We first initialize the final aggregating bolt to a parallelism of one. We then iterate starting at the spout. At each iteration of the algorithm, we calculate the sender latency (milliseconds per tuple) as the sender latency per instance divided by the number of sender instances. The receiver instance parallelism is the ceiling of the receiving operator's per-instance latency divided by the sender latency. For example, consider a topology where a spout with an input rate of one tuple per 200 milliseconds sends to a bolt with an execution latency of 300 milliseconds. If the spout has a parallelism of two, the consolidated spout rate is 200 ms / 2 = 100 ms. The number of bolt instances required is then ceil(300 ms / 100 ms) = 3 instances.

Algorithm 1: GetModelBasedAllocation(model)

  Get mode/spoutRate from spout;
  Get model for reported mode;
  Init processedNodeCount, newAllocation;
  Init curNode ← spout, senderPeriod ← spout rate, senderParallelism ← spoutParallelism;
  while processedNodeCount < #totalNodes - 2 do
      curNode ← (curNode → target);
      Get curNodeLatency from model;
      Scale sender latency by sender parallelism;
      currentNodePar ← ceil(curNodeLatency / senderPeriod);
      Write to newAllocation;
      Update senderPeriod with curNodeLatency;
      Update senderParallelism with currentNodePar;
      Increment processedNodeCount;
  end
  return newAllocation;
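A direct transcription of this heuristic into runnable form might look like the following sketch, where the model is reduced to per-instance latencies along a linear operator chain. It reproduces the worked example above and is our simplification, not the monitor's actual source (in particular, it omits the final aggregating bolt that Algorithm 1 handles separately).

  // Sketch of Algorithm 1's core loop: compute per-bolt parallelism for a
  // linear topology from profiled per-instance latencies (ms per tuple).
  public class ModelBasedAllocationSketch {
      static int[] allocate(double spoutPeriodMs, int spoutParallelism,
                            double[] boltLatenciesMs) {
          int[] allocation = new int[boltLatenciesMs.length];
          double senderPeriod = spoutPeriodMs;
          int senderParallelism = spoutParallelism;
          for (int i = 0; i < boltLatenciesMs.length; i++) {
              // Consolidated sender period shrinks with sender parallelism.
              double effectivePeriod = senderPeriod / senderParallelism;
              allocation[i] = (int) Math.ceil(boltLatenciesMs[i] / effectivePeriod);
              senderPeriod = boltLatenciesMs[i];
              senderParallelism = allocation[i];
          }
          return allocation;
      }

      public static void main(String[] args) {
          // Spout: one tuple per 200 ms at parallelism 2; bolt: 300 ms/tuple.
          int[] allocation = allocate(200.0, 2, new double[] {300.0});
          System.out.println("bolt parallelism: " + allocation[0]);  // prints 3
      }
  }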
4.1.1 Experimental Evaluation

The main metrics for our work are the latency and the number of missed deadlines. We first detail our test application and methodology before presenting preliminary results. The trials vary the total number of instances available to the topology while using a fixed number of active instances, thereby varying the duplication ratio.

The target application is the same as used in Section 3.2.1 and is illustrated again in Figure 4.2. This application performs simple image processing by running facial detection on a stream of images using OpenCV [19]. The topology has five stages: ImageFetch, DetectFaces, ImageMark, ImageWrite, and AggBolt. The first stage is ImageFetch, a spout that emits images from a random set; it also records performance metrics for each trial. The second stage is a bolt that performs the facial detection. The third and fourth stages are bolts that mark the images and write the results to disk, respectively. Finally, the last stage is an aggregating bolt, as required by our assumptions (Section 3.2.1). The aggregating bolt also dumps other runtime metrics that the spout cannot collect. Furthermore, the difference from the static redundancy experiments is that this topology has a varying input rate, implemented by scaling the number of spout instances between one and four. We further generalize by changing the time it takes to process a tuple by varying the complexity of tuples (varying the input image size).

[Figure 4.2: Image processing workload for evaluation of the simplified stream processing model; imageFetch → detectFaces → imageMark → imageWrite → aggBolt, with metrics dumped at the spout and the aggregating bolt.]

Our evaluation methodology is similar to the method used for static redundancy as described in Chapter 3. More precisely, we measure baseline executions of the topology and compare the remaining trials to that baseline to determine possible improvements. We then inject computational performance faults at a uniform rate at all instances using our fault injector (Section 3.1.4). The fault characteristics are inspired by the observations in [36]. The mean time between faults is 2 seconds and exponentially distributed, as commonly used to model faults. The magnitude is Gaussian distributed with a mean of 250 milliseconds. This means that this workload spends 7.5% of its execution time in garbage collection.

Results

Our evaluation of the feedback mechanism focuses on deadline misses. We first determine the magnitude of insufficient resources as a problem. Then, we study the effectiveness over a range of mean times between faults. We further measure Dynamo's effectiveness over a range of total resource availabilities. Finally, we compare the effectiveness for different rate-varying applications.

[Figure 4.3: Insufficient resources cause deteriorated real-time performance; per-tuple latency (ms) over the trial for insufficient resources, Dynamo, and sufficient resources, with the deadline and spout parallelism overlaid.]

In Figure 4.3, we plot the performance for three trials. In the first, we allocate sufficient resources to the topology. For the second, we allocate resources insufficient to maintain the topology's real-time performance. The final case is a topology where Dynamo dynamically reallocates resources in real time. The load changes dynamically by varying the number of spout instances; more spouts mean a higher input rate. We can see that both the sufficient-resource and Dynamo trials maintain good real-time performance, as there are few deadline misses. However, the insufficient-resource trial misses many deadlines. This is further shown in Table 4.1. There are 28.25 times more deadlines missed when resources are insufficient compared to sufficient resources. The dynamically adjusted Dynamo scenario misses 4.58 times more deadlines.

Table 4.1: Dynamically managing resources results in similar performance to a sufficient static allocation

  Allocation Type  Missed Deadlines  Misses where latency exceeds 450 ms
  Insufficient     339               255
  Dynamo           55                0
  Sufficient       12                0

We next measure Dynamo's effectiveness in reducing the number of deadlines missed when the fault rate varies. We vary the mean time between fault injections (10, 30, 50, and 70 seconds). The magnitude is Gaussian distributed with a mean of 150 milliseconds. The results, as illustrated in Figure 4.4, show that Dynamo has a consistent improvement over the baseline execution. Overall, the number of deadlines missed decreases as the MTBF increases. We expect this, as the experiments run a fixed number of tuples, meaning that there is a fixed amount of time for faults to occur. Furthermore, less frequent faults also imply a lower chance that multiple faults afflict the same tuples. The improvements range from a 44.26% reduction when faults occur at a 30-second MTBF to 60.15% for faults at a 10-second MTBF. This demonstrates Dynamo's ability to reduce the deadlines missed over a range of fault rates.

[Figure 4.4: More frequent faults cause more missed deadlines; total missed deadlines for baseline and Dynamo at per-operator MTBFs of 10, 30, 50, and 70 seconds.]

In Figure 4.5, we compare the performance over different resource availabilities. For these trials, we vary the total number of DetectFaces bolts. Here, the maximum observed improvement is 71.40% when 18 DetectFaces bolts are available. As expected, Dynamo reduces the deadlines missed across the board. Dynamo also maintains a roughly constant number of deadlines missed. However, we see that with increasing parallelism, the number of deadlines missed goes up for the baseline scenario. This occurs because all instances are active and tuples are not duplicated.
However, because timing faults occur at the same rate across all instances, the overall tuple set has a higher exposure to possible faults. This is a key Dynamo contribution: blindly increasing resources without a framework such as Dynamo can actually be detrimental to the performance of the application.

[Figure 4.5: Increasing resources blindly can cause more missed deadlines; total missed deadlines for baseline and Dynamo at DetectFaces instance counts of 10 through 18.]

Figures 4.6a and 4.6b show the baseline and Dynamo-enabled trials for a workload with 16/4/2 bolts per DetectFaces/ImageMark/ImageWrite, respectively, with mode-varying input rates. The horizontal red lines denote the deadline. Figures 4.6c and 4.6d illustrate the performance for a workload with 16/5/2 bolts with mode changes based on the complexity of the input tuples. Note that the deadline changes relative to the mean execution time of the current mode. Qualitatively, we see that the number of points above the deadline is thinned out dramatically for both scenarios when comparing the baseline to the Dynamo trials. Quantitatively, the data in Table 4.2 supports this, with reductions of 48.61% to 73.17% fewer misses for the rate-changing application and reductions of 62.96% to 73.60% for the complexity-changing application. Furthermore, Table 4.3 shows that the standard deviation under Dynamo is consistently lower, in many cases nearly halved, relative to the baseline. This means less timing jitter for the application.

[Figure 4.6: Comparison of per-tuple performance; (a) baseline and (b) Dynamo for the rate workload (16/4/2 instances), and (c) baseline and (d) Dynamo for the complexity workload (16/5/2 instances), where instance counts are given as (DetectFaces/ImageMark/ImageWrite).]

Table 4.2: Dynamically managed redundancy reduces the number of missed deadlines

  Mode    Rate-Based                      Complexity-Based
          Base   Monitor  Reduction      Base   Monitor  Reduction
  Mode 1  148    41       72.39%         54     20       62.96%
  Mode 2  41     11       73.17%         178    47       73.60%
  Mode 3  60     18       70.00%         134    38       71.64%
  Mode 4  72     37       48.61%         89     34       61.80%
  Total   321    107      66.67%         455    139      69.45%

Table 4.3: Redundancy reduces real-time performance jitter (values in milliseconds)

  Mode    Type      Rate-Based        Complexity-Based
                    Mean    Std.      Mean    Std.
  Mode 1  Baseline  72.20   38.17     75.17   39.98
          Monitor   66.92   17.94     65.57   22.60
  Mode 2  Baseline  69.36   31.69     194.54  40.54
          Monitor   69.08   13.39     192.34  32.21
  Mode 3  Baseline  77.75   34.83     72.16   37.39
          Monitor   69.06   15.83     64.77   21.66
  Mode 4  Baseline  66.53   24.56     21.34   37.68
          Monitor   70.81   13.98     12.44   20.53

4.2 Dynamic Allocation for Generalized Stream Processing and Redundancy Injection

As our studies of how redundancy in stream processing can increase resiliency evolved, many of the assumptions in Section 4.1 no longer hold. For example, we use fine-grained redundancy instead of an active/passive partition. We also allow fields grouping and 1:N operator ratios. With this additional complexity, creating and using an accurate model for managing resiliency becomes even more difficult. Instead, we move to a more dynamic approach that can actively search for the necessary amount of redundancy.
This new scheme still uses a feedback-based approach where the topology reports metrics to a resiliency manager, as seen in Figure 4.7. This manager then uses those metrics to update the redundancy.

[Figure 4.7: Runtime metrics are used to dynamically adjust fault resiliency; the topology reports runtime metrics (e.g., tail latency, mean latency, max latency) to the resiliency manager, which sets the redundancy level.]

We illustrate our intuition in Figure 4.8. We know that the latency-versus-redundancy curve is concave upward with a single point of inflection. This is because redundancy benefits the system until the amount of redundancy causes the total load to exceed the resource capacity, at which point performance degrades due to queueing effects. Furthermore, the input rate of the topology shifts the curve left or right, while the occurrence of faults shifts the curve up or down. If we assume that application developers have specific bounds on the per-tuple latencies, then we can find the minimum required redundancy that meets those bounds. There are two cases for the relation between the latency bound and the curve: (1) the curve is wholly above the bound, as shown in Figure 4.8a, or (2) the curve is below or intersects the bound, as illustrated in Figure 4.8b. In both cases, the most efficient redundancy level is the one that minimizes the difference between the latency curve and the latency bound. In the first case, we target the minimum possible latency. Although this does not meet the bound, it is the best effort that the system can provide given the degraded state. In the second case, there are two points where the latency just barely meets the bound. Here, the system should find the point that uses the least amount of redundancy (the leftmost intersecting point) in order to minimize resource overhead.

[Figure 4.8: The efficient redundancy level varies based on workload characteristics relative to the latency bound; (a) a single best-effort point when the latency curve exceeds the bound, and (b) two efficient points when the latency curve intersects the bound, with the latency difference Δl marked.]

We present our algorithm for finding the minimum redundancy amount in Algorithm 2. The resiliency manager seen in Figure 4.7 uses this algorithm to update the redundancy every period. We model the search for the minimum redundancy as a hill-climbing problem whereby we take additive steps each period. The inputs are the period with which to update the redundancy, the latency bound, δmin, gi, gd, and the performance margin. The parameters gi and gd are gains used to calculate the amount by which to increase or decrease the redundancy, proportional to the difference between the current latency and the latency bound. δmin specifies the minimum redundancy step to take each period. perfMargin specifies a percentage of the latency bound; if the current latency is below the bound by less than that percentage, then the resiliency manager will not reduce the redundancy, thereby providing a tunable safety margin. In each period, we fetch the metrics and determine the difference Δl between the current metric and the latency bound. If that difference is negative, the system is over-performing and we can reduce the redundancy level (case 2). We also use a check to prevent changing the redundancy if Δl is small. However, if Δl is positive, then the latency is currently above the bound, and we search for the minimum latency by testing a change in redundancy level in one direction (case 1). If we improve the performance by reducing Δl, then we continue. Otherwise, we reverse the direction in which we are changing the redundancy.
We present our algorithm for finding the minimum redundancy amount in Algorithm 2. The resiliency manager seen in Figure 4.7 uses this algorithm to update the redundancy every period. We model finding the minimum redundancy as a hill-climbing problem whereby we take additive steps each period. The inputs are the period with which to update the redundancy, the latency bound, δmin, αi, αd, and the performance margin. The parameters αi and αd are gains used to calculate the amount of redundancy to increase or decrease by, proportional to the difference between the current latency and the latency bound. δmin specifies the minimum redundancy level step to change each period. perfMargin specifies a percentage of the latency bound; if the current latency is below the bound by less than that percentage, then the resiliency manager will not reduce the redundancy, thereby providing a tunable safety margin. In each period, we fetch the metrics and determine the difference ∆l between the current metric and the latency bound. If that difference is negative, the system is over-performing and we can reduce the redundancy level (case 2). We also use a check to prevent changing the redundancy if ∆l is small. However, if ∆l is positive, then the latency is currently above the bound and we search for the minimum latency by testing a change in redundancy level in one direction (case 1). If we improve the performance by reducing ∆l, then we continue. Otherwise, we reverse the direction in which we are changing the redundancy. This finds the redundancy that provides the minimum latency, or brings ∆l below zero and then finds the minimum redundancy that still meets the latency bound. Steps in redundancy level are the greater of ∆l divided by the latency bound and multiplied by the proportional gain (αi for increases or αd for decreases), or δmin. This ensures a minimal step (δmin) such that a discernible change is visible in the feedback loop. Finally, we can add safety margins to allow for some slack in performance due to system variations by targeting a latency slightly lower than that which is specified by the user.

Algorithm 2: DynamoRedundancySearch(latencyBound, period, δmin, αi, αd, perfMargin)

    init currentRedundancy, ∆l, ∆l_old
    init currentMovement = increaseRedundancy
    foreach period do
        pull metrics and calculate ∆l = currentLatency − latencyBound
        if ∆l < 0 then
            /* latency is below the bound */
            if abs(∆l) > perfMargin · latencyBound and ∆l ≠ ∆l_old then
                currentMovement = decreaseRedundancy
        else if ∆l ≠ ∆l_old then
            /* latency is above the bound and has changed: search or back off */
            if this is the first period above the bound then
                currentMovement = increaseRedundancy
            else if abs(∆l_old) > abs(∆l) then
                /* last movement moved the latency towards the bound: continue */
                currentMovement = currentMovement
            else
                /* last movement moved the latency further from the bound */
                currentMovement = reverse(currentMovement)
        ∆l_old = ∆l
        /* update redundancy */
        if currentMovement = increaseRedundancy then
            currentRedundancy = currentRedundancy + max(δmin, (∆l / latencyBound) · αi)
        else
            currentRedundancy = max(0, currentRedundancy − max(δmin, (|∆l| / latencyBound) · αd))

Next, we discuss implementation details for this system. We begin with the metrics, then work around the feedback loop as first shown in Figure 4.7. The spouts collect windowed performance metrics on the topology; the window is user-configurable. Some example metrics are 90% and 99% tail latencies, mean latency, and max latency. The spouts report these metrics using Heron's metrics collection API. The resiliency manager periodically fetches these metrics per Algorithm 2 and updates the redundancy state. The spouts then periodically fetch the updated redundancy from the resiliency manager.
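The following Python sketch is a minimal, self-contained rendering of the search loop in Algorithm 2, assuming the symbols reconstructed above (δmin, αi, αd). The fetch_latency and apply_redundancy callbacks are hypothetical stand-ins for Heron's metrics collection API and the spouts' redundancy updates; this is an illustration of the hill-climbing logic, not our actual implementation.

```python
import time

def dynamo_redundancy_search(latency_bound, period, d_min, alpha_i, alpha_d,
                             perf_margin, fetch_latency, apply_redundancy):
    """Hill-climb toward the least redundancy that meets the latency bound.

    fetch_latency() -> current windowed latency metric (e.g., 90% tail latency).
    apply_redundancy(level) -> pushes the new redundancy level to the spouts.
    Both callbacks are hypothetical stand-ins for the real metric/update paths.
    """
    redundancy = 0.0
    dl_old = None
    direction = +1  # +1 = increase redundancy, -1 = decrease

    while True:
        time.sleep(period)
        dl = fetch_latency() - latency_bound  # latency difference (delta-l)

        if dl < 0:
            # Below the bound: back off only when outside the safety margin
            # and the latency has actually changed since the last period.
            if abs(dl) > perf_margin * latency_bound and dl != dl_old:
                direction = -1
        elif dl != dl_old:  # above the bound; only adapt if latency changed
            if dl_old is None or dl_old < 0:
                direction = +1           # first period above the bound
            elif abs(dl_old) <= abs(dl):
                direction = -direction   # last step pushed us further away
            # otherwise the last step helped, so keep the same direction

        dl_old = dl
        # Proportional step with a floor of d_min so that every update makes
        # a discernible change in the feedback loop.
        gain = alpha_i if direction > 0 else alpha_d
        step = max(d_min, abs(dl) / latency_bound * gain)
        redundancy = max(0.0, redundancy + direction * step)
        apply_redundancy(redundancy)
```

The proportional step couples the reaction speed to how far the latency sits from the bound, which is why large gains produce the fast initial redundancy jumps discussed in the results below.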
4.2.1 Experimental Evaluation

We first evaluate our resiliency manager and algorithm using a static setup – both the input rates and fault injection are constant. This allows us to find the ideal amount of redundancy needed for the assumed worst-case faults. We then compare that with the dynamically-managed experiments and a baseline with no redundancy. In each trial we record the current redundancy, the feedback metric, and whether faults are currently injected during each window. We also record the same overhead metrics as in Chapter 3: computation overhead measured by CPU time and communication overhead measured by the number of emitted tuples. Our workloads consist of the Grep, WordCount, and SOL topologies. Each workload runs for approximately 18 minutes. We target bringing the 90% tail latency for one-second windows below the user-specified bound. However, per Algorithm 2, this metric can be swapped for other runtime metrics depending on goals. Finally, we utilize the same deployment configuration and platform: one spout instance and 33 instances per bolt on USC's HPC.

For the static redundancy work, we assume that the worst-case faults are computational performance faults with an exponentially distributed mean time between faults of one second and a Gaussian distributed magnitude with a mean of 450 ms. We then launch the workload with fault injection enabled for the whole trial. The resiliency manager determines the ideal amount of redundancy using an additive algorithm where the redundancy step for each update period is 10% redundancy. We then use the maximum redundancy level reached during the experiment as the static redundancy level for baseline comparison. For every topology in this work, the static redundancy level is 160%.

Then, we set the parameters for the dynamic resiliency management algorithm as follows. We set the parameter δmin to 10%. We then find the average latency difference when faults are injected with no redundancy and divide it by the static redundancy level. This results in a value of five for αi and αd.

Next, we enable dynamic fault injection. Our injection profile randomly varies the length of each injection period, the downtime between periods, and the MTBF for each injection period. The specific parameter sets for these experiments are: a 270-second injection period with a two-second MTBF, a 180-second injection period with a four-second MTBF, and a 90-second injection period with a one-second MTBF. All injection periods have a Gaussian distributed magnitude with a mean of 450 ms. This allows us to evaluate the resiliency manager in a dynamic setting. An example of a runtime issue that can cause this is if an additional workload is launched on compute nodes that are shared with stream operator instances, thereby causing resource contention. We can both trace the dynamic searching behavior and compare the results to the static setup results using this dynamic injection profile. Without a resiliency manager, developers need to assume and deploy for the worst case; here, that means setting the redundancy level for the worst case, which can be wasteful for resources.
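As a sketch of how such an injection profile can be generated, the following draws exponentially distributed inter-fault gaps and Gaussian fault magnitudes for each configured injection period. The schedule format, function name, and the 100 ms magnitude standard deviation are illustrative assumptions; only the period lengths, MTBFs, and the 450 ms mean magnitude come from the setup above.

```python
import random

def build_injection_schedule(periods, magnitude_mean_ms=450.0,
                             magnitude_std_ms=100.0, seed=None):
    """Generate (time_s, magnitude_ms) fault events for each injection period.

    periods: list of (start_s, length_s, mtbf_s) tuples, e.g. the 270 s /
    2 s-MTBF, 180 s / 4 s-MTBF, and 90 s / 1 s-MTBF periods used here.
    """
    rng = random.Random(seed)
    events = []
    for start, length, mtbf in periods:
        t = start
        while True:
            # Exponentially distributed gap yields the configured MTBF.
            t += rng.expovariate(1.0 / mtbf)
            if t >= start + length:
                break
            magnitude = max(0.0, rng.gauss(magnitude_mean_ms, magnitude_std_ms))
            events.append((t, magnitude))
    return events

# Example with hypothetical start times and fault-free downtime in between.
schedule = build_injection_schedule(
    [(60, 270, 2.0), (420, 180, 4.0), (690, 90, 1.0)], seed=42)
```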
Results

We now present the results of our experimental evaluation. We begin by illustrating the dynamic resiliency management over time. Then, we present a quantitative comparison of the data. Our results show that it is possible to dynamically search for an efficient resiliency level based on user requirements. For example, one can prioritize the highest level of resiliency and its resource overheads, or choose to tolerate lower levels of resiliency in exchange for less resource usage. For some workloads, we can get up to 98.4% fewer windows of constraint violations compared to no redundancy while using 0.13x - 0.31x of the communication overhead and 0.15x - 0.29x of the computation overhead that the static redundancy case uses.

We first present the dynamic resiliency management over time in Figure 4.9. The shaded green areas depict periods where faults are injected but the latency is still within the bound, while the red areas depict those where faults are injected and the latency exceeds the bound. Our results show that the resiliency manager is effective at reducing the number of windows where the latency constraints are violated (Figures 4.9a, 4.9b). This is seen in each injection period, where the windows with violations occur only at the beginning of each injection period, primarily while the resiliency manager is detecting and compensating for the injected faults. The resiliency manager continues to search for more efficient redundancy levels throughout the remainder of the injection period.

Furthermore, we inject three periods to show multiple transitions between states. In practice, there could be longer periods to find the efficient redundancy level for longer-lasting faults such as disk failures, failed nodes, or resource contention between cohosted jobs. The annotations show the injection parameters for each window. The faults in the third injection period cause the most severe effect on performance, while the faults in the second injection period have the smallest effect. Faults in the first injection period have moderate effects. Intuitively, our results show that the resiliency manager seeks out redundancy levels such that the injection periods with more severe faults have a higher redundancy level. We also note that in the middle injection period, faults occur so infrequently that this can cause window violations, as seen in Figure 4.9c. The resiliency manager adapts by increasing the redundancy when it detects violations. However, it detects no violations in the subsequent windows and therefore reduces the redundancy to zero to save resources. This causes another violation when faults occur again. A solution to this is to include a minimum redundancy parameter if such infrequent faults are expected.

Figure 4.9: The resiliency manager dynamically seeks out an efficient redundancy level. (a) Grep; (b) SOL; (c) WordCount. Each panel plots the windowed latency, ∆l, latency bound, and redundancy percentage over time, annotated with each injection window's length and MTBF.

Finally, we note that in some cases (WordCount's first injection period and Grep's third injection period) the resiliency manager first jumps to a higher redundancy before backing off to a lower level. There are two primary reasons for this. For Grep, the injected faults have such a strong effect that they increase ∆l dramatically. However, redundancy quickly mitigates this, and the resiliency manager detects that it can reduce the redundancy level while maintaining performance goals. For WordCount, the initial adjustment in redundancy brings the latency below the bound. However, it remains within the safety margin for about 100 seconds, after which the latency moves out of the safety margin. This allows the resiliency manager to reduce the redundancy to save resources. This could be due either to the system slowly stabilizing over that interval, or to stochastic effects, as the fault injection is randomized in its characteristics and location in the topology, which results in stronger or weaker effects on the performance.
Figure 4.10: The resiliency manager can trade off increased resiliency for lower resource overheads. (a) Grep; (b) SOL; (c) WordCount. Each panel compares the number of violated windows and the communication and CPU overhead percentages for the 0%, managed, and worst-case static configurations.

Next, we compare the dynamically managed performance with that of the static redundancy and baseline trials. The baseline trial is the one with 0% redundancy. In Figure 4.10 we illustrate the number of windows with real-time constraint violations, the computation overheads, and the communication overheads. The computation overhead compares the amount of CPU time used, as measured by the different JVMs across all CPUs used for all stream operator instances, while the communication overhead compares the number of tuples sent within the topology. These results highlight that the managed workloads are more efficient in reducing the windows where the performance constraints are violated, with a much smaller resource overhead. For example, in the Grep topology (Figure 4.10a), the managed trial reduces the window violations by 98.3% (4 vs 234 windows) while only increasing the computation overhead by 10.4% and the communication overhead by 46.0%. While the static redundancy brings the violations down to zero, this comes at a cost of 68.5% computation and 359.0% communication overhead. The SOL benchmark (Figure 4.10b) is similar, with a violated window reduction of 98.4% (4 vs 244 windows) for 17.0% computation overhead and 58.0% communication overhead. This compares to a 100% reduction in violated windows at the cost of 62.0% computation and 366.0% communication overhead for the static redundancy. The reductions for the WordCount topology are more modest, with a reduction of 98.0% (6 vs 293 windows) for 27.1% computation and 164.3% communication overhead. Again, static redundancy results in a reduction of 100%, but at drastically higher overhead costs of 94.3% computation and 529.2% communication.

The resource usage reductions for WordCount are more modest than for the other two topologies. This is due to its increased sensitivity to faults (Section 3.2.3.3), which requires a higher level of redundancy than the other two topologies need, approaching the static level of 160%. Faults cause a larger jump in latency and therefore a larger selected redundancy level, which exceeds the worst-case static level and thus deteriorates the resource reductions. However, we note that the resiliency manager is able to approach the theoretical bound in the violated windows reduction. For these experiments the theoretical bound is two to three windows, one per injection period, to allow time to adjust, as the resiliency manager has no future knowledge of when faults will occur. In some topologies, the injection period with the least frequent faults does not cause a miss, resulting in only two missed windows. The theoretical bound is not zero, as the resiliency manager is adapting to, not predicting, changes in the execution environment and fault occurrences.

There are multiple factors that can affect the resource savings for managed versus static redundancy levels. First, the savings are affected by the dynamicity of the faults: if faults occur for a larger fraction of the execution, the savings over static redundancy are reduced. Furthermore, the redundancy level that the resiliency manager selects can affect the relative savings. For instance, if the gain parameter αi is large, this results in quick reaction to any changes in the execution setting, but at the cost of increased resource usage. The third injection period in Figure 4.9c shows an example of this.
Due to the large jump in latency, the resiliency manager jumps to a high level of redundancy (197%) that exceeds the worst-case static level (160%). While the resiliency manager could search for lower levels of redundancy, it does not do so in this case, as the parameters are tuned such that the manager prioritizes having as few windows with violated constraints as possible instead of searching for lower levels and backing off if necessary.

Overall, the results we have shown here are promising. We have shown that a manager can compensate for a changing environment, such as faults causing decreased performance, by increasing redundancy. While the manager is able to seek out the most efficient redundancy level that enables performance bounds to be met, there is a response delay in finding the new redundancy level. However, the efficiency gains can be substantial (only 0.15x - 0.29x of the computation and 0.13x - 0.31x of the communication overheads that static redundancy requires). Therefore, we have shown that resiliency can be dynamically traded off for increased resource efficiency if decreased performance is acceptable in the short term.

Chapter 5

Conclusion and Future Work

In this chapter, we discuss the conclusions of our work. As part of this, we discuss the broader impacts of our work in Section 5.1, especially in the context of autonomous computing. Then, we discuss further directions for future work in Section 5.2. Finally, we conclude in Section 5.3.

5.1 Broader Impacts

Stream processing as a paradigm is increasingly important, as discussed in Sections 1.2.1 and 1.2.2. The stream processing model is especially well-suited for long-running workloads where data arrives over time, instead of being totally available at one time as is the case for batch processing. This pattern is becoming more common for a wide range of applications, such as ETL at data ingestion and data analytics on data from sources such as social media or IoT devices [48]. Furthermore, the movement of real-time workloads such as video games introduces new areas where latency requirements are key [6, 11, 42]. Stream processing can provide the low latency needed to fulfill Quality of Service (QoS) and real-time requirements [20, 37, 47].

While stream processing is a capable and easily-adaptable model, the challenge is in the management and deployment of the applications. Developers need to be aware of current and future load requirements to determine that adequate resources are available for the applications. Some works study dynamic resource allocation based on runtime metrics [50, 51], while others try to develop performance models for predicting performance and optimally placing jobs onto compute resources [40, 41]. Furthermore, care is needed to address runtime faults that can deteriorate the real-time performance. We have studied and shown that this can be accomplished statically through speculative redundancy and then dynamically managed through runtime monitoring of metrics.

In the broader computing field, a large next step would be increasing the level of autonomy in the management of these workloads. For example, instead of the difficult process of deploying statically, developers could describe and develop these applications and submit them to a system that can autonomously scale and manage resources to support real-time performance and resiliency goals. Some initial work is already underway where the idea of "healthy" systems is the goal [14, 25]. Our work could integrate nicely into this concept, where performance resiliency itself is another dimension of health.

5.2 Future Work

Redundancy is a powerful tool for increasing fault tolerance and performance resiliency.
The applications to stream processing are especially exciting given the increase in real-time workloads. In this section we present directions and opportunities for future work.

The first area for expansion is in the dynamic redundancy allocation. While we have shown that simple algorithms can effectively reduce the overhead relative to static allocations, there is room for improvement. For example, our solution involves searching for better redundancy levels based on runtime metrics. Future work could study ways to construct performance models for generalized stream processing applications. Developers could then use the models in tandem with runtime metrics to determine the proper redundancy levels more quickly.

Secondly, another open area for study is to explore the effectiveness of redundancy in increasing the resiliency to more fault classes. In this work we studied faults that directly affect the real-time performance, such as computational performance faults, failed communication, and failed stream processing instance faults. However, future work could study faults where the results themselves are incorrect. This would be especially important if similar stream processing workloads were deployed in space-bound systems, such as distributed compute resources across satellite constellations. While dual-modular redundancy can detect a single error in the result, triple-modular redundancy can correct a single error or detect two errors through a voting scheme, as illustrated in Figure 5.1 and sketched below. This has been closely studied at the hardware level [23], but not in the context of distributed stream processing [18, 39].

Figure 5.1: Triple-modular redundancy enables single error correction/double error detection [18]
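As a minimal sketch of such a voting scheme, assume each tuple is processed by three redundant operator instances and a downstream voter compares the three results. The function and types here are illustrative and are not part of our framework.

```python
from collections import Counter

def tmr_vote(results):
    """Majority-vote over three redundant results.

    Returns (value, corrected): corrected is True when exactly one copy
    disagreed and was outvoted. Raises when all three disagree, signaling
    that at least two results are erroneous (detection without correction).
    """
    assert len(results) == 3, "triple-modular redundancy expects three copies"
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value, count == 2
    raise ValueError("no majority: two or more erroneous results detected")

print(tmr_vote([7, 7, 9]))   # -> (7, True): the single error is corrected
```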
Our results show that redundancy at the discrete data element level can be highly effective in increasing the resiliency to a variety of fault types including computational performance,failedcommunication,andfailedstreamoperatorinstancefaults(78%tail latency reduction, 60% mean latency reduction, 80% missed deadline reduction). How- ever, this comes with a cost of additional resource usage (40% computational resource overhead and 100% communication overhead). We further study ways to mitigate the 112 resource overheads by opportunistically managing the redundancy level based on varia- tions in the workload and fault occurrence rates. Our findings for this work show that a dynamic resiliency manager can adapt the redundancy level at run-time to better match the actual execution setting (98%+ reduction in windows with constraint violations us- ing0.15x-0.29xcomputationand0.13x-0.31xcommunicationoverheadscomparedto static redundancy). This results in increased real-time performance over no redundancy while using less computational and communication resources than a static redundancy level. Thisworkispartoftheinitialforayintotheapplicationsofredundancyforreal-time performance reasons. The trade off of the overheads will further favor redundancy as the real-world costs of diminished performance increase while the cost of additional re- sources decrease. We believe that the increase of real-time applications in larger and more distributed systems in industry will increase the need and opportunities for redun- dancy. 113 Acronyms API ApplicationProgrammerInterface DAG DirectedAcyclicGraph DNS DomainNameService ETL Extraction,Transformation,andLoading HDFS HadoopDistributedFileSystem HPC CenterforHigh-PerformanceComputing IoT InternetofThings JVM JavaVirtualMachine MTBF MeanTimeBetweenFaults MTTR MeanTimeToRecovery SOL SpeedofLight QoS QualityofService USC UniversityofSouthernCalifornia 114 ReferenceList [1] ApacheStorm. Online: http://storm.apache.org,2017. [2] Intel Storm Benchmark. Online: https://github.com/intel-hadoop/ storm-benchmark,2017. [3] University of Southern California Center for High-Performance Computing. On- line: https://hpcc.usc.edu/,2017. [4] Apache Storm: Trident Tutorial. Online: http://storm.apache.org/ releases/1.1.1/Trident-tutorial.html,2018. [5] Google Protocol Buffers Developer Guide. Online: https://developers. google.com/protocol-buffers/docs/overviewl,2018. [6] Project xCloud: Gaming with you at the center. Online: https: //blogs.microsoft.com/blog/2018/10/08/project-xcloud- gaming-with-you-at-the-center/,2018. [7] Apache Aurora Introduction. Online: http://aurora.apache.org/ documentation/latest/getting-started/overview,2019. [8] Apache Storm: Guaranteeing Message Processing. Online: https: //storm.apache.org/releases/current/Guaranteeing- message-processing.html,2019. [9] Companies Using Apache Storm. Online: https://storm.apache.org/ Powered-By.html,2019. [10] Companies Using Apache Storm. Online: https://storm.apache.org/ Powered-By.html,2019. [11] StadiaFAQ. Online: https://support.google.com/stadia/answer/ 9338946?hl=en,2019. [12] Paul E. Ammann and John C. Knight. Data Diversity: An Approach to Software FaultTolerance. IEEETransactionsonComputers,37(4):418–425,April1988. 115 [13] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the 10th USENIX ConferenceonNetworkedSystemsDesignandImplementation,nsdi’13,pages185– 198,Berkeley,CA,USA,2013.USENIXAssociation. 
[14] Ashvin Agrawal and Avrilia Floratou. Dhalion in Action: Automatic Management of Streaming Applications. Proceedings of the VLDB Endowment, 11(12):2050–2053, August 2018.

[15] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, January 2004.

[16] Hafiz Mohsin Bashir, Abdullah Bin Faisal, M Asim Jamshed, Peter Vondras, Ali Musa Iftikhar, Ihsan Ayyub Qazi, and Fahad R. Dogar. Reducing Tail Latency Using Duplication: A Multi-Layered Approach. In Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies, CoNEXT '19, pages 246–259, New York, NY, USA, 2019. Association for Computing Machinery.

[17] Anatoliy Batyuk and Volodymyr Voityshyn. Apache Storm Based on Topology for Real-time Processing of Streaming Data from Social Networks. In 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), pages 345–349, August 2016.

[18] Melanie Berg and Kenneth LaBel. Verification of Triple Modular Redundancy (TMR) Insertion for Reliable and Trusted Systems. NASA Technical Note 20160001756, 2016.

[19] Gary Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.

[20] Jake Brutlag. Speed Matters for Google Web Search. Technical report, Google, Inc., 2009.

[21] K. Mani Chandy and Leslie Lamport. Distributed Snapshots: Determining Global States of a Distributed System. ACM Transactions on Computer Systems, pages 63–75, February 1985.

[22] Subarna Chatterjee and Christine Morin. Experimental Study on the Performance and Resource Utilization of Data Streaming Frameworks. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pages 143–152, May 2018.

[23] Chun-Kai Chang, Wenqi Yin, and Mattan Erez. Assessing the Impact of Timing Errors on HPC Applications. In SC19: International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019.

[24] Donald D. Eisenstein and Ananth V. Iyer. Garbage Collection in Chicago: A Dynamic Scheduling Model. Management Science, 43(7):922–933, 1997.

[25] Avrilia Floratou, Ashvin Agrawal, Bill Graham, Sriram Rao, and Karthik Ramasamy. Dhalion: Self-regulating Stream Processing in Heron. Proceedings of the VLDB Endowment, 10(12):1825–1836, August 2017.

[26] Maosong Fu, Ashvin Agrawal, Avrilia Floratou, Bill Graham, Andrew Jorgensen, Mark Li, Neng Lu, Karthik Ramasamy, Sriram Rao, and Cong Wang. Twitter Heron: Towards Extensible Streaming Engines. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 1165–1172, April 2017.

[27] Felix C. Gärtner. Fundamentals of Fault-tolerant Distributed Computing in Asynchronous Environments. ACM Computing Surveys, 31(1):1–26, March 1999.

[28] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI '11, pages 295–308, Berkeley, CA, USA, 2011. USENIX Association.

[29] Vasiliki Kalavri, John Liagouris, Moritz Hoffmann, Desislava Dimitrova, Matthew Forshaw, and Timothy Roscoe. Three Steps is All You Need: Fast, Accurate, Automatic Scaling Decisions for Distributed Streaming Dataflows. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 783–798, Carlsbad, CA, October 2018. USENIX Association.

[30] Supun Kamburugamuve, Karthik Ramasamy, Martin Swany, and Geoffrey Fox. Low Latency Stream Processing: Apache Heron with Infiniband & Intel Omni-Path. In Proceedings of the 10th International Conference on Utility and Cloud Computing, UCC '17, pages 101–110, New York, NY, USA, 2017. ACM.

[31] Kaoutar El Maghraoui and Travis Desell. Randomized Distributed Garbage Collection. 2003.

[32] Asterios Katsifodimos and Sebastian Schelter. Apache Flink: Stream Analytics at Scale. In 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW), pages 193–193, April 2016.

[33] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 239–250, New York, NY, USA, 2015. ACM.

[34] Manish Kumar and Chanchal Singh. Building Data Streaming Applications with Apache Kafka: Design, Develop and Streamline Applications Using Apache Kafka, Storm, Heron and Spark. Packt Publishing, 2017.

[35] Vijay P. Kumar and Sying Jyan Wang. Reliability Enhancement by Time and Space Redundancy in Multistage Interconnection Networks. IEEE Transactions on Reliability, 40(4):461–473, October 1991.

[36] Philipp Lengauer, Verena Bitto, Hanspeter Mössenböck, and Markus Weninger. A Comprehensive Java Benchmark Study on Memory and Garbage Collection Behavior of DaCapo, DaCapo Scala, and SPECjvm2008. In Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, ICPE '17, pages 3–14, New York, NY, USA, 2017. ACM.

[37] Greg Linden. Make Data Useful. Online: http://sites.google.com/site/glinden/Home/StanfordDataMining.2006-11-28.ppt, 2006.

[38] Oscar Mateo Lozano and Kazuhiro Otsuka. Simultaneous and Fast 3D Tracking of Multiple Faces in Video by GPU-based Stream Processing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 713–716, March 2008.

[39] R. E. Lyons and W. Vanderkulk. The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2):200–209, April 1962.

[40] Manu Bansal, Eyal Cidon, Arjun Balasingam, Aditya Gudipati, Christos Kozyrakis, and Sachin Katti. Trevor: Automatic Configuration and Scaling of Stream Processing Pipelines. arXiv e-prints, page arXiv:1812.09442, December 2018.

[41] Manu K. Bansal. Techniques for Building Predictable Stream Processing Pipelines. PhD dissertation, Stanford University, 2018.

[42] Lucas Matney. What Latency Feels like on Google's Stadia Cloud Gaming Platform. Online: https://techcrunch.com/2019/03/20/what-latency-feels-like-on-googles-stadia-cloud-gaming-platform/, 2019.

[43] Anh Nguyen-Tuong, David Evans, John C. Knight, Benjamin Cox, and Jack W. Davidson. Security through Redundant Data Diversity. In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN), pages 187–196, June 2008.

[44] Ritika Pandey, Akanksha Singh, Angel Kashyap, and Abhineet Anand. Comparative Study on Realtime Data Processing System. In 2019 4th International Conference on Internet of Things: Smart Innovation and Usages (IoT-SIU), pages 1–7, April 2019.

[45] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC '10, pages 11–11, Berkeley, CA, USA, 2010. USENIX Association.

[46] Zhan Qiu, Juan F. Pérez, and Peter G. Harrison. Tackling Latency via Replication in Distributed Systems. In Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, ICPE '16, pages 197–208, New York, NY, USA, 2016. ACM.

[47] Eric Schurman and Jake Brutlag. The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. In Velocity Web Performance and Operations Conference, 2009.

[48] Shusen Yang. IoT Stream Processing and Analytics in the Fog. IEEE Communications Magazine, 55(8):21–27, 2017.

[49] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.

[50] Tom Z. J. Fu, Jianbing Ding, Richard T. B. Ma, Marianne Winslett, Yin Yang, and Zhenjie Zhang. DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams. In 2015 IEEE 35th International Conference on Distributed Computing Systems, pages 411–420, June 2015.

[51] Tom Z. J. Fu, Jianbing Ding, Richard T. B. Ma, Marianne Winslett, Yin Yang, and Zhenjie Zhang. DRS: Auto-Scaling for Real-Time Stream Analytics. IEEE/ACM Transactions on Networking, 25(6):3338–3352, December 2017.

[52] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147–156, New York, NY, USA, 2014. ACM.

[53] Geoffrey Phi Calara Tran, John Paul Walters, and Stephen P. Crago. Dynamically Improving Resiliency to Timing Errors for Stream Processing Workloads. In 2017 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pages 469–476, December 2017.

[54] Geoffrey Phi Calara Tran, John Paul Walters, and Stephen P. Crago. Reducing Tail Latencies While Improving Resiliency to Timing Errors for Stream Processing Workloads. In 2018 IEEE International Conference on Services Computing (SCC), pages 278–281, July 2018.

[55] Geoffrey Phi Calara Tran, John Paul Walters, and Stephen P. Crago. Reducing Tail Latencies while Improving Resiliency to Timing Errors for Stream Processing Workloads. In 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing (UCC), pages 194–203, December 2018.

[56] Geoffrey Phi Calara Tran, John Paul Walters, and Stephen P. Crago. Increased Fault-Tolerance and Real-time Performance Resiliency for Stream Processing Workloads through Redundancy. In 2019 IEEE International Conference on Services Computing (SCC), pages 51–55, July 2019.

[57] Tri Minh Truong, Aaron Harwood, Richard O. Sinnott, and Shiping Chen. Performance Analysis of Large-Scale Distributed Stream Processing Systems on the Cloud. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 754–761, July 2018.

[58] Tri Minh Truong, Aaron Harwood, Richard O. Sinnott, and Shiping Chen. Cost-Efficient Stream Processing on the Cloud. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), pages 209–213, July 2019.

[59] Jan Sipke van der Veen, Bram van der Waaij, Elena Lazovik, Wilco Wijbrandi, and Robert J. Meijer. Dynamically Scaling Apache Storm for the Analysis of Streaming Data. In 2015 IEEE First International Conference on Big Data Computing Service and Applications, pages 154–161, March 2015.

[60] Ashish Vulimiri, Philip Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, and Scott Shenker. Low Latency via Redundancy. In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '13, pages 283–294, New York, NY, USA, 2013. ACM.

[61] Chathura Widanage, Jiayu Li, Sahil Tyagi, Ravi Teja, Bo Peng, Supun Kamburugamuve, Dan Baum, Dayle Smith, Judy Qiu, and Jon Koskey. Anomaly Detection over Streaming Data: Indy500 Case Study. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), pages 9–16, July 2019.

[62] Philip S. Wilmanns, Stefan J. Geuns, Joost P. H. M. Hausmans, and Marco J. G. Bekooij. Buffer Sizing to Reduce Interference and Increase Throughput of Real-Time Stream Processing Applications. In Proceedings of the 2015 IEEE 18th International Symposium on Real-Time Distributed Computing, ISORC '15, pages 9–18, Washington, DC, USA, 2015. IEEE Computer Society.
Abstract
The advent of faster Internet connections and affordable yet highly-performant resources in the cloud has sparked a movement towards providing dynamic and interactive services. As a result, data analytics and telemetry have become paramount to monitoring and maintaining quality-of-service in addition to business analytics. Stream processing, a model where a directed acyclic graph of operators receives and processes continuously arriving discrete elements, is well-suited for these needs. Current frameworks boast high throughput, low latency, and fault tolerance through continued execution.

However, the per-data-element latencies are important to meet real-time performance constraints. These constraints become important in the context of interactive applications and user quality-of-service guarantees. Furthermore, runtime faults can degrade the real-time performance more drastically than aggregate metrics suggest. The fault tolerance touted by state-of-the-art frameworks does not address this issue. Furthermore, the application loads for stream processing applications can be dynamic. While existing works study the runtime allocation of resources to match the load, they do so without considering the per-data-element real-time performance implications.

In this work, we address these issues by developing a framework for increasing the resiliency to intermittent run-time faults by introducing data and computation redundancy through the replication of data elements in stream processing applications with at-least-once semantics. We first study a simplified model as a proof of concept, then increase the complexity to generalize the supported applications. Our work studies the effects of a range of fault types, including computational performance, failed communication, and failed stream operator instance faults. While our results show that redundancy can be highly effective in mitigating the effects of these faults (78% tail latency reduction, 60% mean latency reduction, 80% missed deadline reduction), the gains do come with substantial overheads (40% computational resource overhead and 100% communication overhead). However, we show that it is possible to dynamically manage resiliency to both take advantage of the multi-modal behavior in the application and adapt to intermittent faults. This dynamic management can reduce the real-time violations over a baseline framework while reducing the resource overhead from a static level of redundancy injection (98%+ reduction in windows with constraint violations using 0.15x - 0.29x computation and 0.13x - 0.31x communication overheads compared to static redundancy).