MODEL-DRIVEN SITUATIONAL AWARENESS IN LARGE-SCALE, COMPLEX SYSTEMS

by

Arun A. Viswanathan

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2015

Copyright 2015 Arun A. Viswanathan

Defense Committee

Dr. Clifford Neuman (Committee Chair), Computer Science, USC/ISI
Dr. Ramesh Govindan (Committee Member), Computer Science, USC
Dr. Viktor Prasanna (External Faculty), Electrical Engineering, USC

Dedication

To my wife and best friend, Suchitra, and our little bundle of joy, Kiaan

Acknowledgments

This dissertation would not have been possible without the continuous support and encouragement of several fine folks over the last seven years. I am first and foremost grateful to my advisor Dr. Clifford Neuman for giving me the opportunity to pursue a Ph.D., and supporting me patiently throughout the process. Ted Faber from USC/Information Sciences Institute (ISI) was always available when I was in need, whether it was for discussing an idea, reviewing a writeup, or brainstorming on a presentation. His thoughtful and constructive suggestions played a significant role in influencing my thinking, and shaping several aspects of this work. Kymie Tan from the Jet Propulsion Laboratory (JPL) in Pasadena, California taught me about computer science research and writing. I cannot thank her enough for taking time off her busy schedule to meet with me and patiently address my concerns. While several people have helped me along the way, I am deeply indebted to Ted and Kymie, who have been great mentors and whose timely help, advice and encouragement have proved invaluable to me.

I have had the opportunity to meet and work with some great individuals at Division 7 (Networking) of ISI. I had a good time interacting with members of the Smart Grid Regional Demonstration Project (SGRDP) at ISI including Cliff Neuman, Goran Scuric, Anas Almajali, Greg Finn, Tanya Ryutov and Joe Touch.
Before SGRDP, I was fortunate to collaborate with several fine folks on the DETECT project including Alefiya Hussain, Jelena Mirkovic, John T. Wroclawski, Stephen Schwab, Terry Benzel, Brett Wilson, Kevin Lahey, and Bob Braden. I owe a lot to Venkata Pingali and Goran Scuric who have been instrumental in shaping my thinking in general. I will always treasure the intense conversations I had with them on a range of topics. I also had a lot of fun interacting with Ganesh Bhaskara and Abdullah Alwabel at ISI. I am thankful for timely and extensive help from the administrative staff at ISI - Alba Regalado-Palacios, Matt Binkley, Joe Kemp, Jeanine Yamazaki and Melissa Snearl-Smith. At USC, Lizsl De Leon, Jennifer Gerson, and Tracy Charles were a great help in promptly taking care of administrative issues.

I was truly fortunate to work with the fine folks at JPL, both as part of the SGRDP project, and as an intern there for two summers. The many intense discussions I had with them helped me broaden my thinking, and influenced several aspects of this dissertation. I would like to thank in no particular order Kymie Tan, Bryan Johnson, Eric Rice, Frank Kuykendall, William Kirk Reinholtz, DJ Byrne, Maddalena Jackson, Brian Cox, Christopher Dorros, Bradley Clement, and Edward G. Silber. I am also grateful for the administrative staff at JPL - Sandi M. Thomas and Lisa DeLange, who did whatever it took to make my internships possible and pleasant.

I would like to thank my committee members Dr. Ramesh Govindan and Dr. Viktor Prasanna for serving on my defense committee, and for giving me meaningful and constructive feedback on my work. I would also like to thank Dr. Ellis Horowitz and Dr. Jelena Mirkovic for graciously agreeing to serve on my thesis proposal committee.

I am thankful to my sponsors for funding me through the years.
This research was supported by the United States Department of Energy under Award Numbers DE-OE000012 with the Los Angeles Department of Water and Power (LADWP), and N66-001-10-C-20181 with the Department of Homeland Security and Space and Naval Warfare Systems Center, San Diego.[1]

I cannot forget the contributions of my former mentors Raghunath Iyer and Rahul Kharge, who largely shaped my thinking before I joined graduate school. Finally, my graduate work would have never been successful without the patient and rock-solid support of the pillars of my life - Suchitra (my wife), Leela (my mother), A.K. Viswanathan (my dad), and Meenakshi (my mother-in-law). All of this was meaningless without the greatest joy of my life - my baby Kiaan.

[1] Neither the United States Government nor any agency thereof, the Los Angeles Department of Water and Power, nor any of their employees make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or any agency thereof. Figures and descriptions are provided by the authors and used with permission.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1 Introduction
1.1 Problem
1.1.1 Detecting a Situation Over Related Events
1.1.2 Anticipating a Situation Over Isolated, Independent Events
1.2 High-level Approach
1.2.1 Behavior Models
1.2.2 Situation Models
1.3 Thesis and Contributions
1.4 Summary and Thesis Outline

Chapter 2 Background
2.1 Effective Situation Awareness
2.2 Characteristics of Large-Scale, Complex Systems
2.2.1 Nature of System
2.2.2 Nature of Data
2.3 Requirements for a Specification Language
2.4 Related Work on Extracting Insights from Data
2.4.1 Manual Solutions
2.4.2 Data-driven Solutions
2.4.3 Fully Automated Solutions

Chapter 3 Behavior Models
3.1 Overview
3.1.1 Key Insights
3.1.2 Behavior: Fundamental Semantic Construct
3.1.3 Fundamental Relationships
3.1.4 Modeling Language Design
3.2 Modeling Formalism
3.2.1 Syntax
3.2.2 Semantics
3.3 Model Examples
3.3.1 Modeling Security Threat
3.3.2 Modeling Dynamic Change
3.4 Analysis Workflow
3.5 Properties
3.5.1 Abstraction
3.5.2 Composing Models
3.6 Prototype Implementation
3.6.1 Knowledge Base
3.6.2 Data Normalizer
3.6.3 Event Storage
3.6.4 Analysis and Presentation Engine
3.7 Related Work
3.7.1 Formal Modeling Approaches
3.7.2 Tool Comparison
3.8 Summary

Chapter 4 Situation Models
4.1 Overview
4.1.1 Situation: Fundamental Semantic Construct
4.1.2 Situation Models
4.1.3 Deriving a Situation Model From Goals
4.2 Modeling Formalism
4.2.1 Background on Bayesian Belief Networks
4.2.2 Encoding a Situation Model
4.2.3 Analysis Using a Situation Model
4.3 Analysis Workflow
4.4 Discussion of Properties
4.4.1 Semantically-relevant Insights
4.4.2 Abstraction and Modeling Simplicity
4.4.3 Integration of Heterogeneous Data
4.4.4 Minimal Knowledge to Construct Models
4.4.5 Integration of Vertical and Horizontal Information
4.4.6 Explicit Requirements For Data Collection
4.5 Related Work
4.5.1 Definition of Situation
4.5.2 Fault/Attack Analysis and Bayesian Networks
4.5.3 Ontology-based Approaches
4.5.4 Goal Modeling Approaches
4.6 Summary

Chapter 5 Case Studies
5.1 Detecting Complex Behavior: DNS Attack
5.1.1 Scenario
5.1.2 Modeling Phase
5.1.3 Analysis Phase
5.1.4 Discussion
5.2 Detecting Complex Behavior: DDoS Attack
5.2.1 Background
5.2.2 Scenario
5.2.3 Modeling Phase
5.2.4 Analysis Phase
5.2.5 Discussion
5.3 Predicting a Potential Situation: DR Scenario
5.3.1 Background on Demand Response
5.3.2 Scenario
5.3.3 SAW Requirements
5.3.4 Modeling Phase
5.3.5 Analysis Phase
5.3.6 Discussion
5.4 Summary

Chapter 6 Discussion
6.1 Performance Analysis
6.2 Modeling Limitations
6.2.1 Behavior Models
6.2.2 Situation Models
6.3 Towards Extracting Accurate Insights
6.3.1 Understanding Effectiveness of Evaluations
6.3.2 Factors Affecting Accuracy of Low-level Monitors
6.3.3 Background
6.3.4 Deconstruction of Evaluation Results
6.3.5 Case Studies
6.3.6 Conclusions

Chapter 7 Conclusions and Future Work
7.1 Summary of Contributions
7.2 Concluding Remarks

Bibliography

List of Tables

3.1 Semantics of operators, behavior constraints and operator constraints in the modeling logic.
3.2 Comparison of the behavior-based Semantic Analysis Framework with four popular data analysis tools.
6.1 Potential error factors across the five evaluation phases (Fig. 6.2) that can compromise the events (Fig. 6.3) necessary for valid and consistent detection.
6.3 Summary of the efficacy of evaluations performed in the case studies.

List of Figures

1.1 High-level inputs and outputs in a specification-driven approach.
1.2 Simple example of a behavior model.
1.3 Simple example of a situation model.
3.1 The grammar for specifying a behavior model φ.
3.2 Sequence diagram of IP-interaction between four nodes. → or ← represent an IP packet between a source (s) and destination (d). An IP flow is a packet pair between s and d.
3.3 Modeling the worm infection chain over IDS alerts.
3.4 Modeling change in rate of packet streams.
3.5 Analysis Workflow Using Behavior Models
3.6 Composing two behavior models.
3.7 Prototype implementation to drive analysis using behavior models.
3.8 Performance vs. Expressiveness Tradeoff
4.1 Complex web of dependencies in large-scale, complex systems.
4.2 Simple situation model.
4.3 An Example Bayesian Network - Nodes and Edges.
4.4 An Example Bayesian Network - Initial Beliefs.
4.5 An Example Bayesian Network - Updated Beliefs.
4.6 Simple situation model encoded as a Bayesian network.
4.7 Simple situation model encoded as a Bayesian network (showing initial belief values)
4.8 Predictive analysis using the simple situation model.
4.9 Diagnostic analysis using the simple situation model.
5.1 DNS Kaminsky experiment setup.
5.2 User's high-level model of experiment behavior.
5.3 DNS KAMINSKY models complete experiment behavior.
5.4 Behavior instances satisfying the DNS KAMINSKY model.
5.5 DDOS HYP models two thresholds for detecting DDoS attacks.
5.6 Behavior instances satisfying the DDOS HYP model.
5.7 Demand response (DR) reduces peak load.
5.8 Entities within the demand response system
5.9 Threats to Demand Response
5.10 Setup for demand response case study.
5.11 DR decision loop
5.12 DR Situation Model (Part 1)
5.13 DR Situation Model (Part 2)
5.14 Bayesian Network for the DR Scenario.
5.15 Bayesian Network for the DR Scenario (with probabilities).
5.16 Situation model for customer in the DR scenario.
5.17 Bayesian network for customer in the DR Scenario.
5.18 Bayesian network for customer in the DR Scenario.
6.1 Plot of runtime against number of events for five types of behavior complexity. Behaviors containing dependent value states (dStates) result in quadratic complexity.
6.2 Factors contributing to errors across the five different phases of an anomaly detector's evaluation process.

Abstract

Situational awareness, or the knowledge of "what is going on?" to figure out "what to do?", has become a crucial driver of the decision-making necessary for effectively managing and operating large-scale, complex systems such as the smart grid. The awareness fundamentally depends on the ability of decision-making entities to convert the low-level operational data from systems into higher-level insights relevant for decision-making and response. Technological advances have enabled monitoring and collection of a wide variety of low-level operational event data from system monitors and sensors, along with several domain-independent tools (e.g. visualization, data mining) and domain-specific tools (e.g.
knowledge-driven tools, custom scripts) to assist decision-makers in extracting relevant higher-level insights from the data. But, despite the availability of data and tools to make sense of the data, recent high-profile incidents involving large-scale systems such as the North American power blackouts, the disruption of train services in Sydney, Australia, and the malicious shutting down of nuclear centrifuges in Iran, have all been linked to a lack of situational awareness of the decision-makers, which prevented them from taking proactive actions to contain the scale and impact of the incident. A key reason for the lack of situational awareness in each circumstance was the inability of decision-making entities to integrate and interpret the heterogeneous low-level information in a way semantically-relevant to their goals and objectives.

Improving the situational awareness of a decision-making entity in such systems requires capabilities to assist decision-making entities to integrate and interpret the heterogeneous event data from the system, and extract insights relevant to their goals and objectives. Specification-driven methods are a popular choice for decision-makers in large-scale, complex systems to extract high-level insights from data. In the specification-driven approach, a decision-maker writes a specification (such as a rule) to process the low-level event data, which then drives analysis over the operational event data at runtime, and results in high-level insights relevant to the decision-maker. We observe that while such approaches are popular, a fundamental problem today is with the low-level nature of the languages used to build specifications, which increases the burden for high-level decision-makers to combine and interpret information in a way relevant to their goals and objectives.

In this work, we propose a model-driven approach to enable decision-makers to write high-level specifications to drive analysis over the event data, and extract insights semantically-relevant to their goals and objectives.
Specifically, we introduce two abstractions: behavior models, and situation models. Behavior models provide effective high-level abstractions to specify complex behaviors (such as multi-step attacks, or process execution) over a sequence or group of related events. Situation models provide effective high-level abstractions to model the high-level cause-effect relationships of situations in large-scale, complex systems over isolated, independent low-level events. Decision-makers compose high-level models using the above abstractions to drive analysis over low-level data. The models capture relevant high-level knowledge at a level of abstraction relevant to a decision-making entity, and explicitly encode their high-level goals and objectives. When such a model is used as an input to the analysis process, the insights produced are semantically-relevant to a decision-making entity's goals and objectives.

The proposed modeling abstractions are expressive, simple, allow sharing and reuse of knowledge, and allow customization to suit a decision-maker's needs. We demonstrate the effectiveness of the above abstractions by applying them to a set of case studies relevant to large-scale, complex systems. First, we apply behavior models to model the complex multi-step attack behavior of a DNS cache poisoning attack, and demonstrate how such a model can be used to effectively extract insights in the form of attack instances from network events. Then, we demonstrate how behavior models can be used to rapidly compose a description of a distributed denial of service (DDoS) attack to extract DDoS attack instances from an ISP packet trace. We then demonstrate how situation models enable multiple decision-making entities involved in a demand response operation to make sense of heterogeneous, low-level facts and extract insights semantically-relevant to their high-level decision-making goals.
Overall, in this work, we fundamentally demonstrate that the introduction of simple, semantically-relevant modeling constructs is effective in enabling decision-makers in complex environments to build specifications at a higher level of abstraction, and extract insights relevant to their goals and objectives. Our model-driven approach emphasizes reuse, composability and extensibility of specifications, and thus introduces a more systematic way to build specifications, and retain expert knowledge for sharing and reuse.

Chapter 1

Introduction

Situation awareness, or the knowledge of "what is going on?" to figure out "what to do?", has become a crucial driver of the decision-making necessary for effectively managing and operating large-scale, complex systems such as the smart grid [25,30,40,52,101]. Large-scale, complex systems consist of several interconnected and interdependent subsystems and system entities that collectively support several high-level system operations. For example, the power grid consists of a complex interconnection of cyber and physical subsystems to support continuous delivery of power to consumers. Such systems routinely encounter threats to their secure and resilient operations due to cyber attacks [1,8,18,23–25,45,92,93,97,112,122,125,132], physical attacks [25,54], and low-level system faults [17,113,121]. The threats affect low-level system entities, often in isolated and unrelated parts of the system or in different subsystems, but the interconnected and interdependent nature of such systems enables low-level threats to manifest as situations affecting higher-level system operations. Decision-making entities responsible for ensuring the security and resilience of such complex systems fundamentally rely on their situation awareness to make proactive decisions, and initiate necessary actions to avert undesired consequences to the system.
Situational awareness refers to the state of knowledge of a decision-making entity, and fundamentally results from an ability to convert the low-level operational data from the system into high-level insights relevant to the decision-making entity's goals [36]. One form of operational data in large-scale, complex systems relevant for situational awareness comes in the form of timestamped events, which convey observations, notifications or inferences, from multiple sources such as system entities, system monitors, custom scripts and domain-specific tools. The insights are typically in the form of high-level situations relevant to a decision-maker, and enable them to gain an understanding of "what is happening in the system?" (the current situation(s)), and "what is likely to happen?" (the potential situation(s)) [36]. For example, in the smart grid, the "unavailability of a power reserve", "potential failure of demand-response operation to curtail load", "damage of critical equipment" and "potential violation of system stability" are situations that are directly relevant to the decision-making needs of a power system operator [51].

Advances in monitoring and communication technologies have enabled collection of a variety of low-level event data in large-scale, complex systems such as the smart grid. A fundamental challenge today in improving situation awareness is in assisting decision-making entities to integrate and interpret voluminous, heterogeneous event data from the system, and extract insights relevant to their goals and objectives [7,37,76]. For example, recent high-profile incidents involving large-scale systems such as the North American and Italian power blackouts [17,121], disruption of transportation in Sydney, Australia [113], shutting down of nuclear centrifuges in Iran [45], and data breaches across several high-profile organizations [8], have all been linked to a lack of situational awareness of the decision-makers, which prevented them from taking proactive actions to contain the scale and impact of the incident.
In each of the above incidents, low-level event data pertaining to current or developing situations was always found to be available, but the poor situational awareness resulted from the decision-makers' inability to comprehend the meaning of the low-level information in terms of insights relevant to their decision-making needs [8,45,52,113,121].

There are several ways to address the challenge of assisting decision-making entities to extract relevant high-level insights from low-level event data. Broadly speaking, the solution space for addressing the above challenge can be categorized into manual, data-driven, specification-driven, and fully automated methods. We describe the different approaches in Section 2.4, but in this work we are concerned with specification-driven approaches [27,44,84]. In the specification-driven approach, a decision-maker writes a specification to process the low-level event data. Such a specification then drives analysis over the operational event data at runtime, and results in high-level insights relevant to the decision-maker. For example, event processing languages in complex event processing systems are used to specify the rules to combine a sequence of events into higher-level events [43,123]. While these approaches are popular, a fundamental problem today is with the low-level nature of the languages used to build specifications, which increases the burden for high-level users to combine and interpret information in a way relevant to their goals and objectives. For example, the Esper language provides a SQL-like language to build specifications [43]. While the SQL-like constructs of the Esper language are powerful and general purpose, it requires a high-level decision-maker (for example, a power-system operator) to write several detailed SQL-like statements to operate over the event data, and extract insights relevant to their goals and objectives. A decision-maker's high-level intent is lost in the low-level language details.
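To make this burden concrete, consider the kind of logic a decision-maker must spell out by hand in a low-level rule just to flag a burst of failed logins. The following Python sketch is purely illustrative (the event layout, threshold, and names are hypothetical, drawn neither from Esper nor from this work); note how the high-level intent, "detect a brute-force attempt", is buried in window bookkeeping:

```python
from collections import deque

# Hypothetical low-level rule: "more than 3 failed logins from one
# source within 60 seconds".  The decision-maker's intent (detect a
# brute-force attempt) is obscured by manual window management.
def detect_bursts(events, threshold=3, window=60):
    recent = {}          # source -> deque of recent failure timestamps
    alerts = []
    for ts, source, etype in events:
        if etype != "login_failed":
            continue
        q = recent.setdefault(source, deque())
        q.append(ts)
        while q and ts - q[0] > window:   # expire entries outside window
            q.popleft()
        if len(q) > threshold:
            alerts.append((ts, source))
    return alerts

events = [(t, "10.0.0.5", "login_failed") for t in (0, 10, 20, 30, 40)]
print(detect_bursts(events))   # [(30, '10.0.0.5'), (40, '10.0.0.5')]
```

A model-driven specification, by contrast, would state only the intent and leave such windowing mechanics to the analysis engine.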
Further, this makes the specifications hard to read and maintain, and thus impossible to share and reuse. The state-of-the-art in specification-driven methods today is akin to the assembly-based programming era in computer languages a few decades ago. We need to transition from the low-level, assembly-like event processing rules to a higher-level, intuitive, domain-specific language such as MATLAB.

In this dissertation, our primary objective is to improve the state-of-the-art in specification-based approaches by enabling decision-makers within the smart grid environment to express analysis tasks at a higher level of abstraction, and thus in a way which is meaningful and useful to the decision-maker. To that end, we focus on two specific analysis tasks directly relevant to the situational awareness needs of a high-level decision-maker in the smart grid, namely, (a) detecting "interesting behaviors" from network event data, and (b) integrating low-level isolated, independent events to anticipate potential higher-level consequences. We propose two high-level modeling languages, namely behavior models and situation models, to address (a) and (b) respectively. Both languages enable decision-makers to build specifications which capture their high-level understanding directly as models over event data. The specifications generated are abstract in nature, capture relevant high-level knowledge at a level of abstraction relevant to a decision-making entity, and explicitly encode their high-level goals and objectives. When such a model is used as an input to the analysis process, the insights produced are semantically-relevant to a decision-making entity's goals and objectives.

We elaborate on the details in the remaining sections of this chapter. Section 1.1 elaborates on the two specific problems related to the analysis tasks. Section 1.2 discusses the basic ideas behind the model-driven approach, and provides a brief overview of the two modeling mechanisms proposed in the thesis. Section 1.3 presents the thesis and identifies specific contributions.
Section 1.4 summarizes the chapter and outlines the rest of the thesis.

1.1 Problem

As discussed earlier, current low-level specification-driven approaches pose challenges for decision-makers in large-scale, complex systems such as the smart grid to effectively combine and interpret heterogeneous event data from such systems. We tackle this challenge within the context of two analysis tasks relevant to the situational awareness needs of decision-making entities within a smart grid environment, namely, (a) detecting interesting situations, and (b) anticipating potential situations. In this section, we elaborate on two specific problems with the specification-based approach relevant to each of the above tasks. We begin by first defining the concepts of an event and a situation as applicable to this work.

Event - An event is a timestamped message from an event source and conveys (a) an observation about some aspect of the system and its environment (such as the current value of a parameter, or the state of an entity), or (b) a notification of some occurrence in the real system (such as a change of state in some entity, a low-level system fault or attack), or (c) an inference about some aspect of the system derived from other events (such as the inference of a high-level attack, or a potential high-level system failure).

Events originate from a variety of sources such as monitors, sensors, and high-level analysis tools. Further, events originate from sources located across the subsystems, and at various levels of system abstraction.
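As a concrete (hypothetical) illustration of this definition, such an event can be thought of as a small timestamped record; the Python sketch below is our own, with illustrative field names, not a format prescribed by this work:

```python
from dataclasses import dataclass, field

# Hypothetical event record mirroring the definition above: every event
# is timestamped, names its source, and carries an observation, a
# notification, or an inference as its payload.
@dataclass
class Event:
    timestamp: float            # e.g. seconds since the epoch
    source: str                 # e.g. "ids-7", "dr-server"
    kind: str                   # "observation" | "notification" | "inference"
    attrs: dict = field(default_factory=dict)

e = Event(1430000000.0, "ids-7", "notification",
          {"alert": "port-scan", "target": "10.1.2.3"})
print(e.kind, e.attrs["alert"])   # notification port-scan
```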
For example, events relevant to a demand-response operation in the smart grid will contain events relevant to the high-level demand-response operation, events from the demand-response server, events from the several hundreds of geographically-distributed clients (customer subsystem), the intrusion detectors located across a utility's cyber network, events from the low-level network infrastructure consisting of processes, databases, nodes, and network elements between the server and clients, the monitors watching for anomalous pricing signals from the market, the power system monitoring tools to monitor the health of physical entities such as transformers and capacitors, and high-level analysis tools such as event correlators, custom scripts and domain-specific tools.

Situation - We define a situation here as a current or potential undesired state-of-affairs in the system expressed at a level of abstraction relevant to a decision-making entity. For example, the failure of a network router is a low-level situation. Similarly, an attack on a system asset is a low-level situation. A decision-making entity responsible for higher-level
Manual approaches include the use of filtering, querying and visualization tools to assist decision-makers in extracting insights from data [56,85,115,133,135]. Data-driven approaches include the use of data mining, machine learning or statistical techniques to discover interesting patterns from data [21,22,46,62,69,79,124,139]. Fully automated approaches use complex, bottom-up models of system behavior, structure and dependencies to assist decision-makers in extracting insights from data [4,39,64,65,86]. As we elaborate further in Section 2.4, both the manual and data-driven approaches shift the burden of higher-level comprehension and reasoning to the decision-maker, and are thus less effective in improving the situational awareness of a decision-maker. The fully automated approaches are useful, but the system models are complex, often built specific to a problem, require detailed domain expertise, and are hard to build and maintain for decision-makers involved in day-to-day system operations.

In this work, we are concerned with the specification-driven approach popular across several application domains for assisting decision-making entities in making sense of heterogeneous event data [27,44,84]. As shown in Figure 1.1, in a specification-driven approach, a decision-maker writes a specification that an analysis engine uses to process the runtime event data and convert it into high-level situations relevant to the decision-maker. While these approaches are popular, a fundamental problem today is the low-level, generic and semantics-devoid nature of the languages used to build the specifications, which increases the burden on high-level decision-makers wishing to operate over data in a meaningful and useful way.

Figure 1.1: High-level inputs and outputs in a specification-driven approach.
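To illustrate the kind of specification such low-level languages admit, the sketch below is a hypothetical Python analogue (not any of the cited tools): the only construct available is a boolean match over attribute-value pairs, so any higher-level relationship between the matched events is left for the decision-maker to infer by hand.

```python
events = [
    {"etype": "PKT_IP", "sip": "10.0.0.1", "dip": "10.0.0.2", "ts": 1},
    {"etype": "PKT_IP", "sip": "10.0.0.2", "dip": "10.0.0.1", "ts": 2},
    {"etype": "LOGIN_FAIL", "user": "alice", "ts": 3},
]

def match(event, **criteria):
    """True if every attribute-value pair in the spec matches the event."""
    return all(event.get(k) == v for k, v in criteria.items())

# The decision-maker must spell out low-level attribute values by hand...
flow_pkts = [e for e in events if match(e, etype="PKT_IP", sip="10.0.0.1")]
print(len(flow_pkts))   # 1

# ...and relationships such as ordering, causality or concurrency between
# the matched events must then be reconstructed manually, in further steps.
```

The gap this thesis targets is precisely the second comment: the relationship step has no direct expression in an attribute-value language.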
Current low-level specification-driven approaches pose challenges for decision-makers in large-scale, complex systems such as the smart grid who must effectively combine and interpret heterogeneous event data from such systems. We tackle this challenge within the context of two analysis tasks relevant to the situational awareness needs of decision-making entities within a smart grid environment, namely:

a) the detection of a current situation over a sequence or group of related lower-level events, for example, detecting a high-level attack over a sequence of intrusion events from intrusion detectors; and

b) the anticipation of potential higher-level situation(s) over isolated, independent low-level events, for example, making sense of cyber events (such as anomalous pricing signals from markets, meter failure events from customers, and power system status events) to anticipate a situation leading to destabilization of the power grid.

We elaborate on the specific problems with the specification-based approach relevant to each of the above tasks in the following subsections.

1.1.1 Detecting a Situation Over Related Events

Decision-making entities often need to make sense of a set of related events to extract higher-level meaning in the form of an interesting situation. For example, a set of related intrusion alerts from distributed intrusion monitors might indicate the presence of a propagating worm. Similarly, a decision-maker in the smart grid might use a set of events from demand response servers and clients to infer the failure or success of a demand response operation.

A typical way to extract such insights using a specification-based approach involves a decision-maker writing rules using simple search and correlation constructs, such as boolean queries over attribute-value pairs, to identify relationships between events and infer higher-level meaning. For example, wireshark [133] can help identify complete or incomplete TCP flows from packet traces, and splunk [115] can help identify spurious logins from a server log.
Our study of four popular tools, discussed in Section 3.7.2, reveals that current approaches require cumbersome multi-step analyses to infer semantic relationships over related events. For example, a user analyzing a network packet trace may first have to extract individual flows by specifying specific attribute values related to each flow, and then somehow manually infer relationships like concurrency between the flows. This problem is further complicated if the user has to reason and analyze over multiple types of data.

The specific problem that needs to be addressed can be stated as follows:

Effective analysis of related events to detect complex, higher-level situations using a specification-based approach requires capabilities to assist decision-making entities to operate over such event data at higher levels of abstraction.

1.1.2 Anticipating a Situation Over Isolated, Independent Events

Consider a large-scale, complex system such as the smart grid. The smart grid will extend traditional power system boundaries to enable active participation in energy production, distribution and control by entities outside the traditional power operations realm, namely, customers, market entities and service providers [128]. The additional layers of functionality added by the smart grid increase the interconnectedness and interdependencies in the system, and increase the complexity of operating the power grid reliably. Specifically, they give rise to multiple control and error propagation channels between different subsystems, and increase the vulnerability of the system as a whole to cyber attacks or inadvertent failures. As a consequence, isolated and unrelated threats in completely independent parts of the system, or in different subsystems, can manifest as situations affecting higher-level system operations.
For example, consider a smart grid scenario where an attack leading to destabilization of the underlying power grid could be launched by malicious control of the pricing signals monitored by plug-in hybrid vehicles, causing them to simultaneously charge or discharge, thus creating sudden load fluctuations on the grid and causing the grid frequency to move out of the nominal operating range of 60 ± 0.03 Hz [41,103].

An effective awareness for a power system operator in the above scenario will require making sense of the heterogeneous, isolated and independently caused events (such as the pricing signals from markets, the meter events from customers, and power system status events) to anticipate a situation leading to destabilization of the power grid. Further, an effective awareness requires that the extracted insights be relevant to the needs of a decision-making entity. For example, for a power system operator, the event "malicious pricing signal to electric vehicle X" has by itself no meaning, but becomes very relevant when several such events are combined to anticipate a higher-level consequence to the power grid.

The problem that needs to be addressed can be stated as follows:

Effective analysis of isolated, independent low-level events to anticipate higher-level situations using a specification-based approach requires high-level abstractions to model the cause-effect relationships of situations in large-scale, complex systems. Further, such modeling must be relevant to the goals of a decision-making entity.

1.2 High-level Approach

This section first describes the key ideas behind the approach, and then provides a brief overview of the proposed modeling abstractions.

The overall objective of this work is to enable high-level decision-makers within a smart grid environment to specify their analysis tasks at a level of abstraction relevant to their decision-making needs.
As discussed earlier, we focus on two specific analysis tasks, namely,

a) the detection of a current situation over a sequence or groups of related lower-level events; and

b) the anticipation of potential higher-level situation(s) over isolated, independent low-level events.

We propose a model-driven approach to assist decision-makers in making sense of the heterogeneous low-level data, and extracting high-level insights semantically relevant to their goals and objectives. Specifically, we introduce modeling constructs that enable the decision-maker to express the above analysis tasks over low-level event data at a higher level of abstraction. We propose two modeling abstractions: behavior models, and situation models. Behavior models provide effective high-level abstractions to specify complex behaviors (such as multi-step attacks, or process execution) over a sequence or group of related events. Situation models provide effective high-level abstractions to model the high-level cause-effect relationships of situations in large-scale, complex systems over isolated, independent low-level events.

Decision-makers compose high-level models using the above abstractions to drive analysis over low-level data. The models are abstract in nature, capture relevant high-level knowledge at a level of abstraction relevant to a decision-making entity, and explicitly encode the entity's high-level goals and objectives. When such a model is used as an input to the analysis process, the insights produced are semantically relevant to a decision-making entity's goals and objectives. We next provide a brief overview of these two abstractions, and discuss the details of each in Chapter 3 and Chapter 4.

1.2.1 Behavior Models

Behavior models allow rapidly modeling system behavior over event data by providing semantic constructs to capture high-level relationships, such as causality, ordering and concurrency, between events or groups of events. Such models enable extracting relevant insights in the form of interesting behaviors relevant to a decision-making entity.

Key Idea.
One form of low-level event data represents time-ordered facts about system state, for example, a set of message events from a client-server interaction. In addition, events are related data samples and represent a variable-cardinality set of symbolic, nominal features. This allows for representing higher-level patterns that take into account more complex interrelationships between attributes. Thus, events or sequences of related events actually externalize the control flow within a system and capture the semantics of higher-level behaviors. Such behaviors can be combined further to create still higher-level behaviors. A key insight here is that higher-level understanding in networked systems can be expressed in the form of relationships between system states, simple behaviors, and complex behaviors. For example, in most situations, a typical web-server operation is better understood as a concurrent relationship between multiple HTTP sessions to a server rather than in terms of the protocol details and specific values in the packet headers. Thus, we introduce behavior as a primitive analysis construct.

Behaviors can be extended or constrained to create a behavior model, which forms an assertion about the overall behavior of the system. A behavior model can then be rapidly applied over data to validate the assertion. Behavior models so formed are abstract entities that capture the semantic essence of a particular relationship without focusing on unnecessary details or particular parameters that may vary between individual events or behaviors. They are explicitly represented and manipulated constructs within the framework, and drive analysis over data. Behavior models enable building a semantic vocabulary of interaction, allowing system operators to understand and interact with their environment directly in terms of high-level behaviors.

Simple Example. Consider a simplified IP flow in networking, where a flow is a communication between two hosts identified by their IP addresses.
For simplicity, we assume an IP flow to be broken into two states: ip_s2d denotes a packet from some source to a destination host, and ip_d2s denotes a packet from the destination back to the source. Then a valid IP flow behavior, IPFLOW, shown below, is one where ip_s2d and ip_d2s are related by their source and destination attributes, with the additional criterion that ip_d2s always occurs after ip_s2d. The behavior model (φ_ipflow) is an assertion that IPFLOW is valid. One way of formally specifying the IPFLOW model is shown in Figure 1.2.

ip_s2d = {etype=PKT_IP, sip=$$, dip=$$}
ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip}
IPFLOW = ip_s2d ~> ip_d2s

Figure 1.2: Simple example of a behavior model.

Advantages. Behavior models are abstract entities and capture the semantic essence of a particular relationship without focusing on unnecessary details or particular parameters that may vary between individual facts or behaviors. Incorporating abstract behavior models as explicitly represented and manipulated constructs provides two key benefits. First, this abstraction allows users of the framework to analyze and understand the raw data at a semantically relevant level. For example, consider the behavior model to identify a TCP port scan, which can be generically captured as relationships between packets originating from the same source to multiple destinations. Such models can be used to analyze many different datasets without any modification. Additionally, since behavior models are primitive analysis constructs, newer models are simply built by composing models already present in the knowledge base. Thus, representing analysis expertise explicitly as behavior models formalizes the semantics for data analysis in networked systems. The second key benefit is the ability to foster sharing and reuse of the knowledge embedded in explicitly represented behavior models. A well-defined shareable format for representing knowledge about networked systems data offers the prospect that many different tools can be driven by, and contribute to, a single shared knowledge base.
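As an operational reading of the leads-to relationship in Figure 1.2, the following sketch checks whether a trace of packet events contains a valid IPFLOW. This is a hypothetical Python matcher, not the dissertation's framework (the actual formalism is defined in Chapter 3): it binds the source and destination of a matching ip_s2d packet and then requires a strictly later reply with the attributes swapped.

```python
def matches_ipflow(events):
    """Check the IPFLOW behavior: an ip_s2d packet eventually
    followed by an ip_d2s packet with sip/dip swapped."""
    for i, e in enumerate(events):
        if e.get("etype") != "PKT_IP":
            continue
        # ip_s2d: bind the sip/dip wildcards ($$) from this IP packet.
        sip, dip = e["sip"], e["dip"]
        # ip_d2s must occur strictly after ip_s2d, attributes swapped.
        for later in events[i + 1:]:
            if (later.get("etype") == "PKT_IP"
                    and later["sip"] == dip and later["dip"] == sip):
                return True
    return False

trace = [
    {"etype": "PKT_IP", "sip": "10.0.0.1", "dip": "10.0.0.2"},
    {"etype": "PKT_IP", "sip": "10.0.0.2", "dip": "10.0.0.1"},
]
print(matches_ipflow(trace))        # True: a valid IPFLOW behavior
print(matches_ipflow(trace[:1]))    # False: no reply packet
```

Note how the attribute dependency (sip=$ip_s2d.dip, dip=$ip_s2d.sip) and the ordering constraint are what the behavior abstraction expresses in one line, while a hand-written matcher must encode them imperatively.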
1.2.2 Situation Models

Situation models allow explicitly capturing the meaning of low-level, isolated, unrelated events in interconnected, interdependent systems in terms of higher-level situations relevant to a decision-making entity.

Key Idea. A key insight here is that in interconnected and interdependent systems, one or more low-level situations will manifest as a higher-level situation, one or more such higher-level situations will result in further situations, and so on. For example, two independent router failures will eventually manifest as the failure of a higher-level operation to reduce load via demand response in the smart grid. While traditional bottom-up approaches capture such manifestation via complex system modeling of dependencies and structure, situation models allow capturing the meaning of low-level situations directly as the relevant high-level situations. More specifically, a situation model resembles a tree of situations. The root of the situation model tree represents a high-level situation which is directly relevant to a decision-making entity. The lower-level nodes represent situations which can lead to that situation. At the lowest level of the tree are low-level situations, such as failures of system entities and attacks against system entities.

A decision-making entity builds a situation model using his knowledge of the system, as relevant to his decision-making goals. The procedure to construct a situation model is based on the fault-tree analysis approach popular in the system engineering domain. Fault-tree analysis is a technique in which an undesired state of a system is specified, and the system is then studied in the context of its environment and operation to find all credible ways in which that state could occur. A decision-maker starts with a top-level situation relevant to his decision-making, and decomposes it into a set of lower-level situations based on his knowledge of the system. The lower-level situations are further decomposed using the same approach.
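The top-down decomposition just described can be sketched as a small tree of situation nodes, each combining its children with an AND or an OR gate in fault-tree style. This is a hypothetical Python illustration only; the dissertation's actual runtime encoding is a Bayesian dependency network, developed in Chapter 4.

```python
from dataclasses import dataclass, field

@dataclass
class Situation:
    name: str
    combine: str = "OR"              # how children manifest: "AND" or "OR"
    children: list = field(default_factory=list)

    def active(self, observed):
        """A leaf situation is active if observed; an inner node is
        active if its children combine (AND/OR) to an active state."""
        if not self.children:
            return self.name in observed
        states = [c.active(observed) for c in self.children]
        return all(states) if self.combine == "AND" else any(states)

# A hypothetical fragment of a decomposition (cf. the power-grid example):
low_freq = Situation("Freq. < 59.97", "OR", [
    Situation("sudden significant load increase"),
    Situation("sudden significant generation decrease"),
])

print(low_freq.active({"sudden significant load increase"}))  # True
```

Evaluation walks the tree bottom-up from observed low-level situations; replacing the boolean gates with conditional probabilities is what the Bayesian encoding adds.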
A situation represents AND or OR combinations of lower-level situations, as well as sequential combinations of lower-level situations, that could cause the root situation to occur. The model is then encoded for processing at runtime, and such a situation model then drives analysis over runtime data. We choose to encode the situation model as a Bayesian dependency network, as elaborated further in Chapter 4.

Simple Example. We demonstrate a simple situation model for a scenario involving the power grid. The primary function of the power system is to deliver continuous power, but a large, complex system such as the power grid faces several threats to its stability in the form of disturbances and contingencies. The power system operates stably at 60 Hz. Minor disturbances and contingencies such as generation loss cause the frequency to fluctuate,
Such a power system controller responsible for maintaining the stability of the grid needs awareness of anysituationthatcanleadtogridinstability,thatis,heneeds a) awareness of any potential situation that can cause the grid to move out of the safe operatingzone,and b) awareness of any potential situation that can prevent any corrective actions from beingtaken. Figure 1.3 showsa situationmodel with situationsat different levelsof abstraction for theabovescenario. Thesituationmodel integratesa variety of low-leveleventdata across 15 thesystem(collectedatdifferentlevelsofsystemabstraction)intermsofsituationswithin the situation tree. The nodes marked ‘UN’ represent the uncertainty in knowledge due to otheryetunknownsituationsthatcanleadtothehigh-levelsituation. Notethatthisisonly a representative model which skips many details for illustrative purposes. Further, this decomposition is specific to a decision-maker, and there could be more than one way to decomposeanysituation. Attherootofthetreeisthehigh-levelsituationofinteresttoadecision-makingentity, that is, “system frequency outside nominal range”. This situation is decomposed into two separatesituationsbasedonwhetherthefrequencyishigherorlowerthannormal. Consider the situation “Freq. < 59.97”. There are two lower-level situations which can give rise to that situation: “sudden, significant increase in load”, and “sudden, significant decrease in generation”. A “sudden, significant increase in load” can be caused by a lot of electric vehicles charging simultaneously, or due to air conditioners being turned on in summer, or for several other unknown reasons. A key idea here is that the situation tree does not claim to capture all the knowledge. Rather, it allows integrating new situations easily as and when they become known. We see that the situation model integrates diverse data at theleafnodesofthetree. 
The situation model provides a simple approach to integrate and interpret isolated, in- dependent event data in terms of high-level situations. The approach trades off modeling complexity for usability, while retaining sufficient expressiveness as demonstrated by the example. Advantages. Thereareseveraladvantagestotheaboveapproachtomodelingknowledge. First,thesituationtreeimplicitlyencodes insightsrelevanttoadecision-makergoalasthe root nodes. Second, a decision-maker can decompose the situation tree up to any level of abstraction. For instance, a power-system operator can specify situations occurring at the level of a single machine in the system, or at a level of an entire network of such 16 machines. Giventhatasingledecision-makerwillnotpossessdetailedknowledgeinlarge- scale, complex systems, the ability to abstract the details at a higher-level is incredibly useful. Further,situationmodelsbuiltbydifferentexpertscanbecombinedtoformhigher- levelsituationmodelsspecifictoadecision-makingentity. WedeferdetailstoChapter4. 1.3 ThesisandContributions Thethesisofthisworkis: Behaviormodelsandsituationmodelsareeffectivemechanismstomodelhigh-levelunder- standingofadecision-makerinasmartgridenvironment,anddriveanalysisoverlow-level event datatoextractsituationsrelevanttothegoalsofa decision-maker. Specifically, a) behavior models are effective high-level abstractions to model behaviors over a se- quenceorgroupsofrelatedlower-level events;and b) situation models are effective high-level abstractions to model relevant high-level situationsover isolated,independentlow-level events. We restrict the scope of this thesis by applying behavior models to model complex at- tackbehaviorsovernetworkpacketcaptures. Although,wefocusonthisrestrictivedomain inthethesis,themodelingformalismswepresentinChapter3arenotspecifictoeither“at- tackbehaviors”orto“networkpacketcaptures”. Thisimpliesthattheproposedformalisms havethepotentialtobeappliedandextendedtootherapplicationareas. 
We apply situation models to anticipate high-level situations relevant to decision-makers involved in managing demand response operations in the smart grid.

We demonstrate the effectiveness of these mechanisms using a set of case studies in Chapter 5. Specifically, we demonstrate the effectiveness of the proposed modeling abstractions in terms of (a) the expressiveness of the semantic constructs in capturing the high-level understanding of a decision-maker, (b) the simplicity of the abstractions in enabling rapid specification, (c) the ability to share and reuse models, and (d) the ability to customize models based on the needs of a decision-making entity.

Contributions

Specific contributions of this work include:

1. A simple logic-based modeling approach called behavior models to assist decision-making entities in specifying complex behaviors (such as multi-step attacks, or process execution) over a sequence or group of related events. Specifically, this work introduces:

(a) a simple set of well-defined semantic constructs to capture fundamental relationships of time, causality, concurrency, ordering, combinations, exclusions, dynamic change and attribute dependencies,

(b) the notion of a behavior, defined using the above relationships over events, as a fundamental semantic analysis construct,

(c) mechanisms to compose higher-level behaviors as relationships over simple and complex behaviors,

(d) mechanisms for the construction of a semantically relevant vocabulary, or a knowledge base, for interaction over event data as reusable, shareable and composable abstract models, and

(e) mechanisms for encoding semantically relevant high-level goals and objectives over semantically relevant behaviors.

2. A simple modeling approach called situation models to assist decision-making entities in the smart grid in modeling relevant high-level situations over isolated, independent low-level events.
Specifically, this work introduces:

(a) the notion of a situation as a fundamental construct for capturing an undesired state-of-affairs (either an attack or a fault) occurring at any level of system abstraction,

(b) a simple methodology (based on fault-tree and attack-tree decomposition) for specifying a high-level situation model specific to a decision-maker's goals and objectives,

(c) a methodology for anticipating potential higher-level situation(s) at runtime using situation models encoded as Bayesian networks, and

(d) mechanisms for the construction of a semantically relevant vocabulary, or a knowledge base, of situations for interaction over event data as reusable, shareable and composable situation models.

3. Demonstration that the introduction of simple, semantically relevant modeling constructs is effective in enabling decision-makers in complex environments to build specifications at a higher level of abstraction, and to extract insights relevant to their goals and objectives. Specifically,

(a) we demonstrate the effectiveness of behavior models in the modeling and analysis of complex attack scenarios over network captures, and

(b) we demonstrate the effectiveness of situation models in improving the situational awareness of decision-making entities involved in the demand response operation in the smart grid.

Overall, in this thesis, we introduce a systematic and effective way to build high-level specifications for driving analysis over event data, and to retain expert knowledge for sharing and reuse in a large-scale, complex environment like the smart grid.

1.4 Summary and Thesis Outline

In this chapter, we first observed that a fundamental challenge today in improving situation awareness involves bridging the gap between the voluminous, heterogeneous event data from system monitors and analytical tools, and the decision-maker's ability to extract insights relevant to their decision-making goals.
Our broader motivation is to assist decision-making entities in extracting relevant high-level insights (situations) from the heterogeneous low-level event data using a specification-driven approach. In this dissertation, our primary objective is to improve the state-of-the-art in specification-based methods by enabling decision-makers within the smart grid environment to express analysis tasks at a higher level of abstraction, and thus in a way which is meaningful and useful to the decision-maker. To that end, we focus on two specific analysis tasks directly relevant to the situational awareness needs of a high-level decision-maker in the smart grid, namely, (a) the detection of a current situation over a sequence or groups of related lower-level events; and (b) the anticipation of potential higher-level situation(s) over isolated, independent low-level events.

In this work, we propose a model-driven approach to assist decision-makers in making sense of the heterogeneous low-level data, and extracting high-level insights semantically relevant to their goals and objectives. Specifically, we introduce two novel abstractions: behavior models, and situation models. Behavior models allow rapidly modeling system behavior over event data by providing semantic constructs to capture high-level relationships, such as causality, ordering and concurrency, between events or groups of events. Situation models allow explicitly capturing the meaning of low-level, isolated, unrelated events in interconnected, interdependent systems in terms of higher-level situations relevant to a decision-making entity. Decision-makers compose high-level models using the above abstractions to drive analysis over low-level data. The specifications so generated are abstract in nature, capture relevant high-level knowledge at a level of abstraction relevant to a decision-making entity, and explicitly encode their high-level goals and objectives.
When such a model is used as an input to the analysis process, the insights produced are semantically relevant to a decision-making entity's goals and objectives. The proposed abstractions trade off modeling complexity for usability, while retaining sufficient expressiveness to model a wide range of scenarios.

Thesis Outline

The rest of the dissertation is organized as follows. Chapter 2 introduces the challenges involved in supporting situational awareness tasks in large-scale, complex systems, and then identifies the key requirements for a high-level specification-based approach in a smart grid-like environment. Chapter 3 introduces the behavior modeling formalism, which enables capturing domain knowledge in the form of system behaviors over related event data from the system. Chapter 4 presents the details of situation models for comprehending and reasoning over isolated and independent event data. Chapter 5 describes three case studies which demonstrate the effectiveness of the abstractions presented. Chapter 6 presents a discussion of the issues surrounding the abstractions presented, along with our current work on understanding factors relevant to the accuracy of data. Chapter 7 concludes with a summary of the work and potential future directions.

Chapter 2

Background

In this chapter, we first elaborate on the notion of effective situational awareness, and state the aspects relevant to this work. We then enumerate the characteristics of large-scale, complex systems which influence the requirements for situation awareness in such systems, followed by the key requirements for a high-level specification language in such systems. We also present a brief overview of the solutions available to a decision-maker for extracting high-level insights from data.

2.1 Effective Situation Awareness

As defined earlier, situation awareness refers to the state of knowledge of a decision-making entity, and fundamentally consists of insights about the operational system state as relevant to the decision-making entity's goals [36].
For instance, insights relevant to decision-making in large-scale, complex systems would include the detection of an anomalous system state or behavior, a prediction of a potential future system state or behavior, or identification of the root cause of a particular state or behavior.

In general, an awareness is effective if the insights are capable of driving effective decisions and actions to avert undesired consequences. Thus, an effective awareness consists of insights which are both (a) relevant, and (b) actionable. Relevant means that the generated insights are in terms of abstractions that are semantically relevant to a decision-making entity. For example, a power system operator responsible for maintaining the stability of the system is not interested in receiving a stream of attack alarms informing him that some low-level system entity is under attack, but is rather interested in knowing the meaning of those events in terms of potential stability violations to the system [51]. Actionable means that the produced insights are capable of driving actions towards averting undesired consequences. Further, actionability requires the assessment process to anticipate potential situations in an accurate and timely manner. For instance, in power-system operations, an accurate and timely prediction that a power reserve is highly likely to become unavailable, perhaps due to low-level cyber attacks against some critical system components, is an actionable insight that allows system operators to quickly take corrective actions, such as buying additional energy to account for the potential loss of reserve power.

We observe that relevancy of insights and actionability of insights are complex problems in their own right, and cannot be addressed comprehensively within a single thesis. We thus restrict our focus in this thesis to the relevancy aspect of effectiveness, and restrict the definition of an effective awareness as follows: an effective awareness consists of insights which are semantically relevant to a decision-making entity's goals and objectives.
We briefly discuss some aspects of actionability, namely, the factors concerning the accuracy of low-level events, in Section 6.3.

2.2 Characteristics of Large-Scale, Complex Systems

Large-scale, complex systems such as the smart grid present unique requirements for situation awareness which are different from those investigated in other areas such as aircraft command and control [38], emergency response [15], homeland security [107], and public health [99]. In this section, we describe the characteristics of large-scale, complex systems in terms of the system itself, and the data generated by the system. These characteristics influence the requirements for specification-driven methods in such systems, as described later in Section 2.3.

2.2.1 Nature of System

The following important properties characterize large-scale, complex systems as relevant to this thesis.

Large number of entities
Large-scale, complex systems consist of a large number of system entities. For instance, the smart grid will support entities on the order of millions.

System of systems
Large-scale, complex systems such as the smart grid comprise several subsystems and system entities that collectively support several high-level system operations (or missions). For example, the smart grid consists of generation, transmission, and distribution subsystems which work together to deliver power to consumers.

Interconnected and interdependent
System entities within a large-scale, complex system are interconnected and interdependent in a variety of ways. For instance, entities within such systems are connected via a communication network. Further, some entities depend on other entities for resources such as services, data, CPU or bandwidth.

Multiple control loops
Large-scale, complex systems do not have a central controlling entity, but rather have multiple such entities controlling the system at various levels of system abstraction. For example, the extension of the smart grid into the customer's home creates hidden control channels to some portions of the grid through technologies that enable remote control of appliances.
Increased surface area for attacks and failures
The complexity and scale of such systems give rise to multiple control and error propagation channels between different subsystems and entities within each subsystem, thereby increasing the vulnerability of the system as a whole to threats such as cyber attacks and low-level system faults.

2.2.2 Nature of the Data

Advances in monitoring and communication technologies have enabled collection of a variety of low-level event data in large-scale, complex systems such as the smart grid. Such event data:

1. is voluminous,
2. originates from heterogeneous sources within different subsystems,
3. is distributed across space and time,
4. is of uncertain nature, and
5. represents information about the system at several levels of semantic abstraction.

For example, events relevant to a demand-response operation in the smart grid will contain events relevant to the high-level demand-response operation, events from the demand-response server, events from the several hundreds of geographically distributed clients (customer subsystem), events from the low-level network infrastructure consisting of processes, databases, nodes, and network elements between the server and clients, and events from security monitors at those different levels of abstraction. Similarly, a variety of security monitors exist to monitor system health and intrusions at several levels of system abstraction (i.e., for individual devices such as smart meters, for an entire network, for an operating system, or for an application), and across its subsystems (i.e., across the distribution, transmission, generation, customer, and cyber subsystems of the smart grid).

2.3 Requirements for a Specification Language

In this section, we briefly discuss some of the requirements for a high-level specification language in a system such as the smart grid. Our requirements derive directly from the nature of the system and data discussed in the previous sections. A specification language must address at least the following requirements.
Integrate Heterogeneous Data Across Subsystems
A large-scale, complex system such as the smart grid consists of several subsystems such as cyber, physical, customer, and markets [128]. A high-level language must allow integrating and interpreting heterogeneous event data from across such diverse domains. For example, anticipating a high-level situation such as "destabilization of the power grid" will require making sense of anomalous pricing events from markets, meter failure events from customers, and power system events from the physical power grid.

Integrate Heterogeneous Data Across Levels of System Abstraction
Similar to the above requirement, a high-level language must enable integrating heterogeneous event data from across several levels of system abstraction. For example, the ability to relate a low-level disk event to a high-level process failure is important to capture dependencies between processes and resources.

Specifications Relevant to a Decision-making Entity's Goals
A system such as the smart grid has several decision-making entities, each with its own set of goals and objectives. This means that a given set of events can be interpreted differently with respect to the decision-maker and his needs. For example, in the smart grid, the loss of a router and the failure of some gateways mean the inability to schedule load reduction for a demand response operator. But for another decision-maker such as a customer, the loss of a router, or the failure of his own gateway, would imply a potential loss of revenue. Thus, a high-level specification language must support building specifications relevant to the needs of a decision-making entity.

Semantic Constructs to Capture High-level Understanding
We discuss this requirement using an example of analysis of a simple set of packet events using a low-level, simple-query-based tool such as Wireshark [133]. Consider the HTTP/TCP connection records with four sample fields shown in the table below. Suppose we need to identify all the records that belong to successful HTTP requests initiated from the client.
We need to formulate a query that will return records 1, 5, 7, 8 and records 2, 6, 9, 10 to group two successful HTTP GET flows.

id   src        dst         tcpflags   httpstatus
1    10.1.1.1   10.1.1.2    SYN
2    10.1.1.1   10.1.1.3    SYN
3    10.1.1.1   10.1.1.10   SYN
4    10.1.1.1   10.1.1.12   SYN
5    10.1.1.2   10.1.1.1    SYN-ACK
6    10.1.1.3   10.1.1.1    SYN-ACK
7    10.1.1.1   10.1.1.2    ACK
8    10.1.1.2   10.1.1.1    ACK        200
9    10.1.1.1   10.1.1.3    ACK
10   10.1.1.3   10.1.1.1    ACK        200

There is no direct way using Wireshark filters to extract this information. Typically, it will be a two-stage process. First, apply a filter to identify all destinations that the client initiated a connection with by sending a TCP SYN. Then, for each client-destination pair, filter all TCP connection setup records and HTTP status records. Thus, if there are n unique clients and n unique destinations, there will be n^2 such queries. This rapidly leads to a combinatorial explosion, especially as we vary additional fields within the query. Clearly, a high-level decision-maker cannot operate over data at this low level of abstraction.

A high-level specification language in the above case will directly enable analysis of the above dataset in terms of relationships between high-level HTTP flows instead of single packets. A high-level language must provide semantic constructs to enable specification at a level closer to the user's understanding of the system.

Share, Reuse, and Customize Knowledge
Knowledge in a complex system such as the smart grid is distributed across several decision-making entities. Today, the expertise required to write low-level specifications is high, and once such specifications are written, they are typically hard to share and reuse by other entities. Most often, low-level, specification-based approaches do not inherently support sharing and reuse, which fundamentally reduces the efficiency of the overall analysis process. A high-level specification language must allow sharing and reuse of the encoded knowledge. Further, it must allow easy customization of such knowledge specific to a decision-maker's needs.
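To make the two-stage process concrete, the following Python sketch (an illustration written for this discussion, not a tool from the thesis) performs the manual filtering over the sample records; the per-pair second stage is exactly where the n^2 query explosion arises:

```python
# Hypothetical illustration of the manual two-stage filtering described
# above, applied to the sample HTTP/TCP connection records.
records = [
    {"id": 1,  "src": "10.1.1.1", "dst": "10.1.1.2",  "tcpflags": "SYN",     "httpstatus": None},
    {"id": 2,  "src": "10.1.1.1", "dst": "10.1.1.3",  "tcpflags": "SYN",     "httpstatus": None},
    {"id": 3,  "src": "10.1.1.1", "dst": "10.1.1.10", "tcpflags": "SYN",     "httpstatus": None},
    {"id": 4,  "src": "10.1.1.1", "dst": "10.1.1.12", "tcpflags": "SYN",     "httpstatus": None},
    {"id": 5,  "src": "10.1.1.2", "dst": "10.1.1.1",  "tcpflags": "SYN-ACK", "httpstatus": None},
    {"id": 6,  "src": "10.1.1.3", "dst": "10.1.1.1",  "tcpflags": "SYN-ACK", "httpstatus": None},
    {"id": 7,  "src": "10.1.1.1", "dst": "10.1.1.2",  "tcpflags": "ACK",     "httpstatus": None},
    {"id": 8,  "src": "10.1.1.2", "dst": "10.1.1.1",  "tcpflags": "ACK",     "httpstatus": 200},
    {"id": 9,  "src": "10.1.1.1", "dst": "10.1.1.3",  "tcpflags": "ACK",     "httpstatus": None},
    {"id": 10, "src": "10.1.1.3", "dst": "10.1.1.1",  "tcpflags": "ACK",     "httpstatus": 200},
]

client = "10.1.1.1"

# Stage 1: find every destination the client sent a TCP SYN to.
dests = {r["dst"] for r in records if r["src"] == client and r["tcpflags"] == "SYN"}

# Stage 2: one filtering pass per (client, destination) pair -- with many
# clients and destinations, this is the n^2 query explosion.
flows = {}
for d in dests:
    pair = [r for r in records if {r["src"], r["dst"]} == {client, d}]
    # Keep only pairs that completed and returned an HTTP 200.
    if any(r["httpstatus"] == 200 for r in pair):
        flows[d] = [r["id"] for r in pair]

print(flows)  # e.g. {'10.1.1.2': [1, 5, 7, 8], '10.1.1.3': [2, 6, 9, 10]}
```

The two successful flows are recovered, but only by enumerating every client-destination pair by hand, which is what a high-level flow construct would avoid.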
2.4 Related Work on Extracting Insights from Data

A fundamental challenge today in improving situation awareness involves bridging the gap between the voluminous, heterogeneous event data and a decision-maker's ability to extract insights relevant to their decision-making goals [7,37,76]. There are several ways to address the challenge of assisting decision-making entities to extract relevant high-level insights from low-level event data. Broadly speaking, the solution space for addressing the above challenge can be categorized into manual, data-driven, specification-driven, and fully automated methods. We present a brief overview of the above categories, except the specification-driven category, as it is already covered in detail in Section 3.7 and Section 4.5.

2.4.1 Manual Solutions

Manual approaches include the use of filtering, querying, and visualization tools to assist decision-makers in extracting insights from data [56,85,115,133,135]. In the manual approach, a decision-maker relies on his high-level understanding of system operation and domain knowledge to search the event data for interesting patterns using simple search and correlation constructs like boolean queries, manually maps those patterns to semantically relevant concepts, and then reasons over them to gain high-level understanding. For example, a popular choice for network operators to search for known network situations is to use tools such as Wireshark [133] and Splunk [115] to manually extract relationships and patterns from network event data.

2.4.2 Data-driven Solutions

In the data-driven approach, the decision-maker relies on data mining, machine learning, or statistical techniques to discover interesting patterns from data. Such techniques have been widely used for mining patterns from system trace logs [79], inferring malware behavior [22,46], mining actionable insights from IDS alarms [62], automatically detecting configuration issues [139], mining episodes from telecommunication alarm data [69], and summarizing firewall logs [21,124].
The problem often faced here is inferring high-level insights from the produced low-level patterns, because the plethora of patterns output from such tools bears minimal or almost no semantic relevance either to the user's objectives or to the high-level system operation, due to a lack of semantic or contextual knowledge. Both the manual and data-driven approaches shift the burden of higher-level comprehension and reasoning to the decision-maker.

2.4.3 Fully Automated Solutions

The fully automated approaches use complex, bottom-up models of system behavior, structure, and dependencies to assist decision-makers in extracting insights from data [4,39,64,65,86]. The fully automated approaches are useful, but the system models are complex, often built specific to a problem, require detailed domain expertise, and are hard to build and maintain for decision-makers involved in day-to-day system operations.

Bottom-up approaches model system knowledge without regard to the goals and objectives of any particular decision-making entity. Example models in this category include models of system behavior, system structure, dependencies, and threat behavior. Bottom-up approaches can model complex system characteristics, and focus on a particular problem area or system aspect. Thus, they also tend to integrate only limited types of low-level event data into their analyses. For example, state analysis tools used in power system operations can take real-time power system events from the field, run a simulation using a model of the power system, and predict a future state of the power system. But such tools are hard to extend to accommodate different types of inputs, and thus ultimately shift the burden to the decision-maker. For instance, the state analysis tool may not factor low-level cyber events into its analyses.

Although such complex models capture finer details and can perform fine-grained analysis, we observe that the complexity involved in building and maintaining such models in complex, dynamic systems such as the smart grid outweighs the advantages.
Further, in a complex system such as the smart grid, several such models may need to be built to capture different aspects of the system, thereby increasing the challenges. Moreover, the output produced by such tools might still not be semantically relevant to the goals of a decision-making entity, as the modeling is not decision-maker specific.

Chapter 3
Behavior Models

In this chapter, we describe behavior models, which enable decision-making entities to encode their high-level understanding of system behavior as abstract models over a sequence or group of related lower-level events. Large sections of the work described here were presented at NSDI 2011 [126].

The rest of the chapter is organized as follows. Section 3.1 presents the key insights and constructs behind behavior models. Section 3.2 presents the syntax and semantics of behavior models. Section 3.3 discusses two practical behavior model examples. Section 3.4 explains how behavior models drive analysis over data. Section 3.5 highlights the various properties of behavior models. Section 3.6 discusses a prototype framework called ThirdEye which implements an engine to analyze data using behavior models. Section 3.7 compares behavior models to existing top-down modeling formalisms and tools. Finally, Section 3.8 concludes the chapter with a summary of the key ideas.

3.1 Overview

In this section, the basic insights and novel mechanisms behind behavior models are presented. We introduce notions for capturing fundamental knowledge relevant to networked systems, a user's high-level understanding of system behavior, and a user's high-level analysis objectives.

3.1.1 Key Insights

Behavior models allow rapidly modeling system behavior over event data by providing semantic constructs to capture high-level relationships such as causality, ordering, and concurrency between events or groups of events. Such models enable extracting relevant insights in the form of interesting behaviors relevant to a decision-making entity.
Decision-making entities often need to make sense of a set of related events to extract higher-level meaning in the form of an interesting behavior. For example, a set of related intrusion alerts from distributed intrusion monitors might indicate the presence of a propagating worm. Similarly, a decision-maker in the smart grid might use a set of events from demand response servers and clients to infer the failure or success of a demand response operation.

The objective is to enable semantic analysis at a level closer to the user's understanding of a system or process. Capturing semantics for a wide variety of situations relevant to networked and distributed systems requires (a) a broadly applicable definition of a fundamental unit of semantics, and (b) expressive semantic constructs independent of a specific application or domain.

The key to capturing semantics in the proposed approach is the introduction of a logic-based modeling formulation of high-level behavior abstractions as a sequence or a group of related events. This allows treating behavior representations as fundamental analysis primitives, elevating analysis to a higher semantic level of abstraction. This idea is elaborated further in Section 3.1.2.

Further, to capture semantics relevant to networked and distributed systems, we focus on providing semantic constructs to express relationships that capture the core characteristics of networked and distributed systems, that is, fundamental relationships of time, causality, concurrency, ordering, combinations, exclusions, uncertainty, dynamic change, and attribute dependencies. These relationships in addition form the fundamental knowledge within the modeling framework. Section 3.1.2 discusses the relationships and fundamental design choices in detail.

3.1.2 Behavior: A Fundamental Semantic Construct

The event data represents time-ordered facts about system state. In addition, events are dependent data samples and represent a variable-cardinality set of symbolic, nominal features.
This allows for representing higher-level patterns, taking into account more complex interrelationships between attributes. Thus, events or sequences of related events actually externalize the control flow within a system and capture the semantics of higher-level behaviors. Such behaviors can be combined further to create higher-level behaviors. A key insight here is that higher-level understanding in networked and distributed systems can be expressed in the form of relationships between system states, simple behaviors, and complex behaviors. For example, in most situations, a typical web-server operation is better understood as a concurrent relationship between multiple HTTP sessions to a server rather than the details of the protocols and specific values in the packet headers. Thus, the proposed approach introduces a behavior as a primitive semantic construct.

Behaviors can be composed, extended, or constrained to create a behavior model, which forms an assertion about the overall behavior of the system. Behavior models so formed are abstract entities that capture the semantic essence of a particular relationship without focusing on unnecessary details or particular parameters that may vary between individual events or behaviors. They are explicitly represented and manipulated constructs within the framework and drive confirmatory and exploratory analysis over data. Behavior models enable building a semantic vocabulary of interaction, allowing system operators to understand and interact with their environment directly in terms of high-level situations.

For example, consider a simplified IP flow in networking, where a flow is a communication between two hosts identified by their IP addresses. For simplicity we assume an IP flow to be broken into two states: ip_s2d denotes a packet from some source to destination host and ip_d2s denotes a packet from a destination to source.
Then, a valid IP flow behavior, IPFLOW, shown below, is one where ip_s2d and ip_d2s are related by their source and destination attributes, with the additional criterion that ip_d2s always occurs after ip_s2d. The behavior model (φ_ipflow) is an assertion that IPFLOW is valid. One way of formally specifying the IPFLOW model is:

ip_s2d = {etype=PKT_IP, sip=$$, dip=$$}
ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip}
IPFLOW = ip_s2d ~> ip_d2s

3.1.3 Fundamental Relationships

The notion of relationships can vary widely across situations, and modeling all such relationships is infeasible. We focus on the following relationships to capture the core characteristics of networked and distributed systems:

1. Causal relationships between behaviors, for example, a file being opened only if a user is authorized;
2. Partial or total ordering, for example, in-order or out-of-order arrival of packets;
3. Dynamic changes over time, for example, traffic between client and server drops after an attack on the server;
4. Concurrency of operations, for example, simultaneous web client sessions;
5. Multiple possible behaviors, for example, a polymorphic worm behavior may vary on each execution;
6. Synchronous or asynchronous operations, for example, some operations need to complete within a specific time whereas others need not;
7. Value dependencies between operations, for example, a TCP flow is valid only if the attribute values contained in the individual packets are related to each other;
8. Invariant operations, for example, some operations may always hold true;
9. Eventual operations, for example, some operations happen in the course of time.

In addition, we support traditional mechanisms, such as boolean operators and loops, for combining these relationships into complex behaviors, and mechanisms for basic counting of events and reasoning over the counts.
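To make the late-bound '$' variables in the IPFLOW model concrete, the following Python sketch (an illustration written for this discussion, with a hypothetical event format; it is not the thesis prototype) shows how $$ attributes bind on first match and how $ip_s2d.dip-style references resolve against the already-bound state:

```python
# Illustrative sketch of IPFLOW matching with late-bound attributes.
events = [
    {"etype": "PKT_IP", "sip": "10.0.0.1", "dip": "10.0.0.2"},  # e1: s2d
    {"etype": "PKT_IP", "sip": "10.0.0.2", "dip": "10.0.0.1"},  # e2: d2s
    {"etype": "PKT_IP", "sip": "10.0.0.1", "dip": "10.0.0.3"},  # e3: unanswered
]

def match_ipflow(log):
    """Find (request, reply) event pairs satisfying ip_s2d ~> ip_d2s."""
    flows = []
    for i, e in enumerate(log):
        if e["etype"] != "PKT_IP":
            continue
        # ip_s2d: sip=$$ and dip=$$ bind dynamically to this event's values.
        bound_sip, bound_dip = e["sip"], e["dip"]
        # ip_d2s must occur later (leadsto) with the bound attributes swapped:
        # sip=$ip_s2d.dip and dip=$ip_s2d.sip.
        for later in log[i + 1:]:
            if (later["etype"] == "PKT_IP"
                    and later["sip"] == bound_dip
                    and later["dip"] == bound_sip):
                flows.append((e, later))
                break
    return flows

print(len(match_ipflow(events)))  # 1 -- only e1/e2 form a valid flow
```

Because the IP addresses are bound at runtime rather than written into the model, the same specification applies unchanged to any packet log.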
We do not claim completeness of the above requirements, but the belief is that being able to express the above classes of primitive relationships and combine them to form complex relationships would suffice for a wide range of situations, a few of which we demonstrate in the case studies in Chapter 5.

3.1.4 Modeling Language Design

The following design decisions are central to the modeling approach introduced. First, the modeling combines operators from Allen's interval temporal logic [2], Lamport's Temporal Logic of Actions [77], and boolean logic. Temporal logic allows expressing the ordering of events in time without explicitly introducing time. Interval temporal logic allows expressing relationships like concurrency, overlap, and ordering between behaviors as relationships between their time intervals. Additionally, complex behaviors are easily composed from simpler ones using boolean operators.

Second, the modeling enables capturing dependency relationships between event attributes while leaving the values to be dynamically populated at runtime. Late binding enables abstract specifications that enrich the knowledge base, as they can be directly applied to a wide variety of datasets. This also enables parametrization of models during complex model composition [126].

We believe these design decisions ensure developing abstract behavior models as first-order primitives for capturing, storing, and reusing domain expertise for the analysis of networked systems. The modeling language defines formal syntax and semantics for capturing the above relationships and building models over data. The details of the modeling language are presented in Section 3.2.1 and Section 3.2.2.

3.2 Modeling Formalism

A particular execution of a networked system or process can be captured as a sequence of states, where a state is a collection of attributes and their values. A behavior (b) is a sequence of one or more related states.
A system execution is thus defined as a combination of different behaviors, and each new execution may generate a unique set of behaviors. A behavior model (φ) is a formula that makes an assertion about the overall behavior of the system.

For example, consider a simplified IP flow in networking, where a flow is a communication between two hosts identified by their IP addresses. For simplicity we assume an IP flow to be broken into two states: ip_s2d denotes a packet from some source to destination host and ip_d2s denotes a packet from a destination to source. Then, a valid IP flow behavior, IPFLOW, is one where ip_s2d and ip_d2s are related by their source and destination attributes, with the additional criterion that ip_d2s always occurs after ip_s2d. The behavior model (φ_ipflow) is an assertion that IPFLOW is valid. We discuss details of this example and extend it further in Section 3.2.2.

3.2.1 Syntax

The language grammar for defining a behavior model φ as a formula consists of five key elements, as shown in Figure 3.1: state propositions S as atomic formulae; grouping operators '(' and ')' to define sub-formulae; logical operators and temporal operators for relating sub-formulae or atomic formulae; the optional behavior constraints bcon and operator constraints opcon written within '[' and ']'; and the relational operators relop.

φ ::= '(' S | φ ')' {bcon}
    | not φ              (NEGATION)
    | φ and φ            (LOGICAL AND)
    | φ or φ             (LOGICAL OR)
    | φ xor φ            (LOGICAL XOR)
    | φ ~> (opcon) φ     (LEADSTO)
    | [] (opcon) φ       (ALWAYS)
    | φ olap (opcon) φ   (OVERLAPS)
    | φ dur (opcon) φ    (DURING)
    | φ sw (opcon) φ     (STARTSWITH)
    | φ ew (opcon) φ     (ENDSWITH)
    | φ eq (opcon) φ     (EQUALS)
bcon  ::= '[' {tc | cc} ']'
tc    ::= {at | duration | end} relop t{:t}
cc    ::= {icount | bcount | rate} relop c{:c}
opcon ::= '[' relop t{:t} ']'
relop ::= {> | < | = | ≥ | ≤ | ≠}
t     ::= [0-9]+ {s | ms}
c     ::= [0-9]+

Figure 3.1: The grammar for specifying a behavior model φ.

A state proposition, S, is an atomic formula for capturing events that satisfy specified relations between attributes and their values.
In essence, S captures states of a system or process and is the basic element of a behavior model. The most trivial behavior model is one with a single state proposition. Formally, S is represented as a finite collection of related attribute-value tuples as:

S = {(a_i, r_i, v_i) | i ∈ N, a_i ∈ A, v_i ∈ V, r_i ∈ (=, >, <, ≥, ≤, ≠)}

A is a set of string labels, such as sip, dip, etype, and V is a set of string constants, such as 10.1.1.2, /bin/sh, along with two special strings: (a) strings prefixed with '$', as in $$, $s2.dst; (b) strings with the wild-card character '*', as in /etc/pas*. Considering the previous example of IPFLOW, the state propositions ip_s2d and ip_d2s are written as:

ip_s2d = {etype=PKT_IP, sip=$$, dip=$$}
ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip}

State proposition ip_s2d contains three attributes: etype, sip, and dip. etype has a constant value PKT_IP, while the sip and dip attributes use the '$'-prefixed special variables, which are dynamically bound at runtime. State proposition ip_d2s defines the values of its sip and dip attributes as being dependent on the values of state ip_s2d. Dependent attributes along with dynamic binding of values allow leaving out details like the actual IP addresses from the specification.

The temporal operators allow expressing temporal relationships like ordering and concurrency between one or more behaviors. The linear-time temporal operator (leadsto), written as ~>, is used to express causal relationships between behaviors. The interval temporal logic operators express concurrent relationships between behaviors as relationships: (a) between their start times using sw (startswith), (b) between their end times using ew (endswith), or (c) between their durations using olap (overlap), eq (equals), and dur (during). The (always) operator, written as [], allows expressing invariant behaviors. The logical operators not, and, or, xor are supported for logical operations over behaviors and for creating complex behaviors.
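A state proposition can be read as a small predicate over an event's attribute-value pairs. The sketch below (illustrative Python with a hypothetical tuple representation; not the thesis engine, whose actual syntax is given in Figure 3.1) evaluates a proposition given as (attribute, relation, value) tuples, including the '*' wildcard:

```python
import fnmatch
import operator

# Map the relational operators of the grammar onto Python comparisons.
RELOPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
          ">=": operator.ge, "<=": operator.le, "!=": operator.ne}

def satisfies(event, proposition):
    """True iff every (attribute, relop, value) tuple holds for the event."""
    for attr, rel, val in proposition:
        if attr not in event:          # the attribute a_i must be defined
            return False
        if rel == "=" and isinstance(val, str) and "*" in val:
            # Wildcard constants such as /etc/pas* match by pattern.
            if not fnmatch.fnmatch(str(event[attr]), val):
                return False
        elif not RELOPS[rel](event[attr], val):
            return False
    return True

file_open = [("etype", "=", "FILE_OPEN"), ("path", "=", "/etc/pas*")]
event = {"etype": "FILE_OPEN", "path": "/etc/passwd", "uid": 0}
print(satisfies(event, file_open))  # True
```

An event satisfies the proposition only when every tuple's relation holds, mirroring the conjunction over (a_i, r_i, v_i) in the formal definition of S.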
Behavior constraints allow placing additional constraints on the matching behavior instances and are specified immediately following the behavior within square brackets. Constraints and their values are related using the standard relational operators. The six behavior constraints are divided into time constraints tc and count constraints cc. Time constraints allow constraining the behavior start time using at, the behavior end time using end, and the behavior duration using duration. The time value, t, for a constraint can be specified as a single positive value or as a range. Additionally, the values can be suffixed with either 's' or 'ms' to indicate seconds or milliseconds respectively. The count constraints allow constraining the number of matching behavior instances using icount, the size of each behavior instance using bcount, and the rate of events within a behavior instance using rate. Operator constraints allow specifying time bounds over the temporal operators, thus allowing their semantics to be slightly modified. The operator constraint values are specified as a single value or a range along with a relational operator. Table 3.1 presents the detailed semantics of the operators along with the behavior and operator constraints.

Expressing a behavior in the language constitutes writing sub-formulae. Behaviors are always enclosed within parentheses '(' and ')'. Simple behaviors are constructed by relating one or more state propositions using operators, while complex behaviors are constructed by relating one or more behaviors. The grammar also allows expressing complex behaviors using recursion, and we present an example in Section 3.3.1. Recursive definitions allow expressing looping behavior, for which the loop bounds can be optionally specified using the bcount behavior constraint. The current grammar does not support existential and universal quantification since such a need is not clear.

3.2.2 Semantics

Each entry in Table 3.1 lists a behavior model ψ, its meaning, and the condition under which a log L satisfies ψ (L |= ψ).

(φ): φ is a behavior. L |= ψ iff ∃ B_φ ⊆ L and |B_φ| > 0.

S: S is a state proposition defined as S = {(a_1, r_1, v_1), ..., (a_d, r_d, v_d)}. L |= ψ iff (a) |B_S| > 0, and (b) ∀e ∈ B_S, ∀i ∈ {1, ..., d}, e.a_i is defined and the values e.v_i and S.v_i satisfy relation r_i.

(not φ): The negation of the behavior is true. L |= ψ iff L does not satisfy φ, that is, |B_φ| = 0.

(φ_1 and φ_2): Both φ_1 and φ_2 are true. L |= ψ iff L |= φ_1 and L |= φ_2.

(φ_1 or φ_2): φ_1 and φ_2 are not both false simultaneously. L |= ψ iff L |= φ_1, or L |= φ_2, or L satisfies both φ_1 and φ_2.

(φ_1 xor φ_2): Either φ_1 or φ_2 is true but not both. L |= ψ iff L |= φ_1 or L |= φ_2, but not both.

(φ_1 ~> φ_2): φ_1 leadsto φ_2; that is, whenever φ_1 is satisfied, φ_2 will eventually be satisfied. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) B_φ2.starttime ≥ B_φ1.endtime.

(φ_1 ~>[≤t] φ_2): Whenever φ_1 is satisfied, φ_2 will be satisfied within t time units. L |= ψ iff (a) L |= (φ_1 ~> φ_2), and (b) B_φ2.starttime ≤ (B_φ1.endtime + t).

([] φ): φ is always satisfied, that is, satisfied by each event. L |= ψ iff ∀e ∈ L, e |= φ.

([][=t] φ): φ is always satisfied within every consecutive interval (epoch) of t time units. L |= ψ iff t > 0 and, for every consecutive interval of length t, l_t ⊆ L and l_t |= φ.

(φ_1 sw φ_2): φ_1 starts with φ_2. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) B_φ1.starttime = B_φ2.starttime.

(φ_1 sw[≥t] φ_2): φ_1 starts at least t time units after φ_2. L |= ψ iff (a) L |= (φ_1 sw φ_2), and (b) B_φ1.starttime ≥ (B_φ2.starttime + t).

(φ_1 ew φ_2): φ_1 ends with φ_2. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) B_φ1.endtime = B_φ2.endtime.

(φ_1 ew[=t] φ_2): φ_1 ends t time units after φ_2. L |= ψ iff (a) L |= (φ_1 ew φ_2), and (b) B_φ1.endtime = (B_φ2.endtime + t).

(φ_1 olap φ_2): φ_1 overlaps φ_2; that is, φ_1 starts after φ_2 starts but before φ_2 ends, and ends after φ_2 ends. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) (B_φ2.starttime < B_φ1.starttime < B_φ2.endtime) and (B_φ1.endtime > B_φ2.endtime).

(φ_1 olap[>t] φ_2): φ_1 overlaps φ_2 and the overlapping region is greater than t time units. L |= ψ iff (a) L |= (φ_1 olap φ_2), and (b) the overlap (B_φ2.endtime − B_φ1.starttime) > t.

(φ_1 eq φ_2): φ_1 equals φ_2 in duration. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) B_φ1.duration = B_φ2.duration.

(φ_1 eq[=t] φ_2): φ_1 and φ_2 are both of duration t. L |= ψ iff (a) L |= (φ_1 eq φ_2), and (b) B_φ1.duration = B_φ2.duration = t.

(φ_1 dur φ_2): φ_1 occurs during φ_2; that is, φ_1 starts after φ_2 starts and ends before φ_2 ends. L |= ψ iff (a) L |= φ_1 and L |= φ_2, (b) B_φ1[1] ≠ B_φ2[1], and (c) (B_φ1.starttime > B_φ2.starttime) and (B_φ1.endtime < B_φ2.endtime).

(φ_1 dur[=t_1:t_2] φ_2): φ_1 occurs during φ_2 with duration between t_1 and t_2. L |= ψ iff (a) L |= (φ_1 dur φ_2), and (b) t_1 ≤ B_φ1.duration ≤ t_2.

(φ)[icount = c]: The number of behavior instances satisfying φ is c. L |= ψ iff (a) L |= φ, and (b) there exist distinct B_φ^1, ..., B_φ^c ⊆ L.

(φ)[bcount = c]: Behavior instances satisfying φ are of size c. L |= ψ iff (a) L |= φ, and (b) B_φ.bcount = c.

(φ)[rate > c]: Behavior instances satisfying φ have a rate, defined as (behavior size / behavior duration), greater than c. L |= ψ iff (a) L |= φ, and (b) (B_φ.bcount / B_φ.duration) > c and B_φ.duration > 0.

(φ)[at < t]: The starting time of behavior instances satisfying φ must be less than absolute time t. L |= ψ iff (a) L |= φ, and (b) B_φ.starttime < t.

(φ)[end ≥ t]: Behavior instances satisfying φ have end time greater than or equal to absolute time t. L |= ψ iff (a) L |= φ, and (b) B_φ.endtime ≥ t.

(φ)[duration ≠ t]: Behavior instances satisfying φ are of duration ≠ t. L |= ψ iff (a) L |= φ, and (b) B_φ.duration ≠ t.

Table 3.1: Semantics of operators, behavior constraints, and operator constraints in the modeling logic.

We first define two concepts important for understanding the semantics. A sequential log (L) is a finite sequence of timestamped events L = e_1, e_2, e_3, ..., e_n such that e_i.t ≤ e_j.t, ∀i < j. A behavior instance B_φ for a behavior model φ is a sequence or group of events satisfying the behavior model φ:

B_φ = <starttime, endtime, bcount, (b_1, b_2, ..., b_k)>

where (b_1, b_2, ..., b_k) ⊆ L and each b_i could be an individual event e or another behavior instance B_φ_i. starttime = b_1.starttime is the starting time of the behavior as defined by its first element, and endtime = b_k.endtime is the ending time of the behavior as defined by its last element. bcount = k is the total number of elements in the behavior instance. All b_i's are in increasing time order of their starttime. Additionally, let B_φ.duration = (B_φ.endtime − B_φ.starttime) be the duration of the behavior instance and |B_φ| = B_φ.bcount represent the size of the behavior instance. If φ is a simple behavior, such as a state proposition S, then

B_S = <e_i1.t, e_ik.t, k, (e_i1, ..., e_ik)>

where (e_i1, ..., e_ik) ⊆ L.

Given a finite sequential log L and a user-defined behavior model φ, the goal of the analysis is to find all behavior instances (B_φ^1, B_φ^2, ...) from L that satisfy the behavior model, where satisfiability is defined as follows:

L |= φ iff ∃ B_φ ⊆ L and |B_φ| > 0

That is, the log L satisfies (|=) the behavior model φ iff there exists a behavior instance B_φ in L of finite length |B_φ|. Since φ is a composite formula created using many sub-formulae, the satisfiability of φ is determined as a function of the satisfiability of its sub-formulae. Table 3.1 defines the satisfiability criteria for sub-formulae formed using the operators and constraints. We next explain the key language ideas by defining simple models and applying them to a fictitious dataset.

Assume a packet trace of seven IP packets representing an interaction between four nodes A, B, C, and D, as shown in Figure 3.2. Let the sequential log of corresponding events be e_1, e_2, ..., e_7.

Figure 3.2: Sequence diagram of IP interaction between four nodes. → or ← represents an IP packet between a source (s) and destination (d). An IP flow is a packet pair between s and d.

Using the states ip_s2d and ip_d2s defined earlier in Section 3.2.1, IP flow behavior is written as a causal relationship between the state propositions ip_s2d and ip_d2s as IPFLOW = (ip_s2d ~> ip_d2s). There are three IP flow instances in Figure 3.2 that satisfy IPFLOW, that is, icount = 3 with bcount = 2 for each instance:

B_ipflow^1 = (e_1, e_7)
B_ipflow^2 = (e_2, e_5)
B_ipflow^3 = (e_3, e_4)

Extending the example, a complex behavior for pairs of overlapping IP flows can now be written as IPFLOW_PAIRS = (IPFLOW olap IPFLOW).
There are in all three instances of overlapping IPFLOW pairs from Figure 3.2. That is,

B_ipflow_pairs^1 = ((e_1, e_7), (e_2, e_5))
B_ipflow_pairs^2 = ((e_1, e_7), (e_3, e_4))
B_ipflow_pairs^3 = ((e_2, e_5), (e_3, e_4))

Again, icount = 3 and for each instance bcount = 2, since bcount counts the number of IPFLOW occurrences and not individual events.

We can additionally define a bad IP flow behavior, BAD_IPFLOW, as one for which there was no matching response from the destination. That is, BAD_IPFLOW = (ip_s2d ~> (not ip_d2s)). Event e_6 matches the BAD_IPFLOW model since it has no matching response. That is, B_bad_ipflow^1 = (e_6), with bcount = 1.

3.3 Model Examples

In this section, we present two examples of behavior models using practical scenarios from networked systems.

3.3.1 Modeling a Security Threat

1. scan_A = {etype=SCAN, src=$infect_A.host, dst=$$}
2. infect_A = {etype=INFECT, host=$scan_A.dst}
3. single_spread = (scan_A ~> infect_A)
4. spread_chain = (single_spread ~> spread_chain)
5. WORMSPREAD(host) = (spread_chain)

Figure 3.3: Modeling the worm infection chain over IDS alerts.

In this example, we define a behavior model of a typical worm spread detected by IDS alerts collected from multiple hosts. Assume a network with IDSes on each host reporting two types of timestamped alerts: a SCAN alert when a scan is detected by a host and an INFECT alert when the host is found infected. Assume an event log created by normalizing the alerts to two types of events with their corresponding attributes. Given the event log, our objective here is to define a behavior model to extract all possible infection chains of any length and report the hosts involved. Figure 3.3 defines the behavior model for a propagating worm using events from distributed intrusion detection monitors.

The model is defined in two stages: by first defining a single_spread behavior using events from a single host, and then defining the spread_chain as a chain of related single_spread occurrences.
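The two-stage matching can be sketched procedurally. The following Python fragment is an illustration written for this discussion (hypothetical event layout and host names; it is not the ThirdEye engine, which evaluates the declarative model of Figure 3.3 directly):

```python
# Illustrative sketch of the two-stage worm-spread matching: single_spread
# pairs a SCAN with the INFECT of the scanned host, and the chain follows
# successive spreads whose scan source is the previously infected host.
events = [
    {"etype": "SCAN",   "src": "h1", "dst": "h2"},
    {"etype": "INFECT", "host": "h2"},
    {"etype": "SCAN",   "src": "h2", "dst": "h3"},
    {"etype": "INFECT", "host": "h3"},
    {"etype": "SCAN",   "src": "h9", "dst": "h4"},  # unrelated scan, no infect
]

def single_spreads(log):
    """Stage 1: (src, dst) pairs where an INFECT for dst follows the SCAN."""
    pairs = []
    for i, e in enumerate(log):
        if e["etype"] != "SCAN":
            continue
        for later in log[i + 1:]:
            if later["etype"] == "INFECT" and later["host"] == e["dst"]:
                pairs.append((e["src"], e["dst"]))
                break
    return pairs

def spread_chain(pairs, start):
    """Stage 2: recursively extend the chain from an initially infected host."""
    for src, dst in pairs:
        if src == start:
            return [start] + spread_chain(pairs, dst)
    return [start]

print(spread_chain(single_spreads(events), "h1"))  # ['h1', 'h2', 'h3']
```

The recursion in spread_chain mirrors the recursive definition on line 4 of Figure 3.3, and the src/dst handoff between iterations mirrors the forward-dependent attribute binding described next.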
The single_spread behavior, concerning a vulnerable host A, is a sequence of two dependent and causal events: (a) a scan_A event with its src attribute pointing to an earlier infected host, followed by (b) an infect_A event with its host attribute the same as scan_A.dst.

1. IMPORT = NET.APP_PROTO.HTTP
2. http_pkt = HTTP.HTTP_PKT(sip=$$, dip=$$)
3. attack_event = {etype=DOSATTACK, src=$$, dst=http_pkt.dip}
4. http_stream_at100 = ((http_pkt)[rate=100])
5. http_stream_below50 = ((http_pkt)[rate=0:50])
6. attack_start = (http_stream_at100 ew[<= 5s] (attack_event))
7. DYNAMIC_CHANGE = (attack_start ~> http_stream_below50)

Figure 3.4: Modeling change in rate of packet streams.

A worm spread chain (spread_chain) is then simply defined by a recursive occurrence of related single_spread behaviors. Referring to the model, the forward-dependent attribute src in the definition of scan_A connects successive single_spread behaviors by requiring the src of the next scan to be the same as the previously infected host. The forward-dependent attribute src is initialized automatically the first time single_spread is parsed, by considering it to be a dynamic ($$) variable. The next iteration over spread_chain then uses the values as determined dynamically by single_spread.

3.3.2 Modeling Dynamic Change

Dynamic changes are a fundamental characteristic of networked and distributed environments. One example of a dynamic change is the change in rate of a stream of packets due to an anomalous condition such as a DoS attack. Our objective in this example is to model an expected reduction in the rate of legitimate HTTP traffic due to a DoS attack on a server. Our raw data consists of IDS DoS attack alerts and HTTP packets.

The DYNAMIC_CHANGE model, containing only the relevant aspects, is described in Figure 3.4. Line 2 defines a state capturing an HTTP packet between a source and destination. Line 3 defines a state capturing a DoS attack alert, additionally requiring the destination to be the same as the destination in the HTTP packet. Lines 4 and 5 describe the HTTP packet stream rates before and after the attack, respectively.
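The chaining semantics of the spread_chain model in Figure 3.3 can be sketched procedurally in Python. This is a hypothetical illustration only: the host names, event layout, and the rule that each scan must originate from the most recently infected host are assumptions chosen to mirror the forward-dependent src attribute.

```python
from dataclasses import dataclass

@dataclass
class Event:
    t: int
    etype: str
    src: str = ""
    dst: str = ""
    host: str = ""

def infection_chain(events, patient_zero):
    """Follow SCAN ~> INFECT links: each scan must originate from the
    most recently infected host, and the matching INFECT event names
    the scan's dst (mirroring infect_A.host = $scan_A.dst)."""
    chain = [patient_zero]
    last_t = 0
    progressed = True
    while progressed:
        progressed = False
        for i, s in enumerate(events):
            if s.etype == "SCAN" and s.src == chain[-1] and s.t > last_t:
                inf = next((e for e in events[i + 1:]
                            if e.etype == "INFECT" and e.host == s.dst), None)
                if inf:
                    chain.append(inf.host)
                    last_t = inf.t
                    progressed = True
                    break
    return chain

# Hypothetical alert log: h1 scans and infects h2, which then infects h3
events = [
    Event(1, "SCAN", src="h1", dst="h2"),
    Event(2, "INFECT", host="h2"),
    Event(3, "SCAN", src="h2", dst="h3"),
    Event(4, "INFECT", host="h3"),
]
chain = infection_chain(events, "h1")  # ['h1', 'h2', 'h3']
```

The loop plays the role of the recursive spread_chain rule: each iteration extends the chain by one single_spread occurrence until no further linked scan/infect pair exists.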
The change boundary is defined by the attack_event that is triggered once the attack starts. Since attack_event represents a single event, it has the same starttime and endtime. Line 6 uses the ew (endswith) operator to define the attack_start condition, which specifies that the http_stream_at100 behavior ends within five seconds of the attack_event. The DYNAMIC_CHANGE model is then an assertion that the HTTP stream rate reduces following the attack.

3.4 Analysis Workflow

Behavior models provide a methodology for users to capture their high-level understanding of a system as models over data. In this section, we describe a typical workflow of how such models can be used to drive analysis over data.

Figure 3.5: Analysis Workflow Using Behavior Models

The workflow consists of a modeling phase and an analysis phase.

Modeling Phase. In the modeling phase, system or domain experts encode their high-level understanding of system behavior as abstract models over data. For example, a TCP connection setup model captures the abstract behavior of a TCP connection setup as a sequence of related TCP SYN, SYN-ACK and ACK packets, and the DNS Kaminsky model, discussed in Section 5.1, captures the behavior of the DNS Kaminsky cache poisoning attack. At present, models are encoded in a simple logic-based modeling language as discussed in Section 3.1.

Analysis Phase. In the analysis phase, models drive analysis over data. Given timestamped raw data, in the form of syslogs, packet dumps, alert logs, kernel logs, or application logs, the framework enables users to analyze the data by encoding their high-level questions directly as semantically meaningful models over data. Users can use existing models or build new models to formulate analysis questions as semantically relevant assertions over data. As shown in Figure 3.5, given such a model and the raw data, confirmatory analysis helps users get answers to their questions in the form of events satisfying the models.
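As a concrete illustration of checking such an assertion over data, the rate-change condition of the DYNAMIC_CHANGE model (Figure 3.4) can be approximated with a short sketch. This is not the ThirdEye evaluation algorithm; the window size, rate thresholds, and synthetic packet timestamps are assumptions for the example.

```python
def rate(times, t0, t1):
    """Events per second among `times` falling in the interval [t0, t1)."""
    return sum(1 for t in times if t0 <= t < t1) / (t1 - t0)

def dynamic_change(http_times, attack_t, window=10.0, high=100, low=50):
    """Mirrors the DYNAMIC_CHANGE assertion: the HTTP stream runs at
    ~`high` pkts/s leading up to the attack, then drops to at most
    `low` pkts/s after it."""
    return (rate(http_times, attack_t - window, attack_t) >= high
            and rate(http_times, attack_t, attack_t + window) <= low)

# Synthetic data: 100 pkts/s for 10 s before the attack, 20 pkts/s after
before = [i * 0.01 for i in range(1000)]      # t in [0, 10)
after = [10 + i * 0.05 for i in range(200)]   # t in [10, 20)
assert dynamic_change(before + after, attack_t=10.0)
```

A real evaluator would additionally enforce the ew[<= 5s] boundary between the high-rate stream and the attack alert; the sketch only checks the before/after rate assertion.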
Specifically, given a high-level behavior model and time-ordered, multivariate, multi-type data as input, the goal of analysis is (a) to find all behaviors, that is, sequences or groups of events, relevant to the behavior model, and (b) to report all behaviors that satisfy the behavior model.

3.5 Properties

In this section, we discuss the key properties and advantages of behavior models.

3.5.1 Abstraction

Behavior models are abstract entities and capture the semantic essence of a particular relationship without focusing on unnecessary details or particular parameters that may vary between individual facts or behaviors. Incorporation of abstract behavior models as explicitly represented and manipulated constructs within the analysis framework provides two key benefits. First, this abstraction allows users of the framework to analyze and understand the raw data at a semantically relevant level. For example, consider the behavior model to identify a TCP port scan, which can be generically captured as relationships between packets originating from the same source to multiple destinations. Such models can be used to analyze many different datasets without any modification. Additionally, since behavior models are primitive analysis constructs, newer models are simply built by composing new models from behavior models present in the knowledge base. Thus, representing analysis expertise explicitly as behavior models formalizes the semantics for data analysis in networked systems.

The second key benefit is the ability to foster sharing and reuse of the knowledge embedded in explicitly represented behavior models. A well-defined, shareable format for representing knowledge about networked systems data offers the prospect that many different tools can be driven by, and contribute to, a single shared knowledge base.

3.5.2 Composing Models

Behavior models can be composed into higher-level behavior models. We use a simple example to demonstrate the ease of composing and extending existing models to define semantically relevant higher-level behavior.
We combine two models, DNSKAMINSKY and WORMSPREAD, to create a COMBINED_ATTACK scenario as shown in Figure 3.6.

1. IMPORT = NET.ATTACKS.DNSKAMINSKY, NET.ATTACKS.WORMSPREAD
2. worm_attack = WORMSPREAD.single_spread(host=$$)
3. dns_attack = DNSKAMINSKY.SUCCESS(sip=$worm_attack.host)
4. COMBINED_ATTACK = (worm_attack ~> (dns_attack))

Figure 3.6: Composing two behavior models.

Line 2 captures the behavior where a worm infects a host machine and scans and infects another host. Line 3 describes the behavior where the worm launches a DNS Kaminsky attack on some DNS server from the last infected host. We do not specify any server for the DNS Kaminsky attack due to the abstractness of the DNSKAMINSKY model, which infers the destination dynamically. Line 4 is the final behavior model combining both attacks. In line 3, we only constrain the sip and leave other attributes unspecified. This demonstrates the ability to extend the imported models with only the desired attribute values while leaving the others as defined in the imported model.

3.6 Prototype Implementation

In this section, we present a prototype implementation of a framework called ThirdEye, which captures a user's higher-level analysis intent as a behavior model, applies the model over a finite stream of events normalized from raw data, and outputs events satisfying the behavior model.

Figure 3.7: Prototype implementation to drive analysis using behavior models.
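The model composition of Figure 3.6 can be approximated in Python by treating each model as a function from a log to a list of matches and composing them with a leads-to combinator. This is a hypothetical sketch: the event types (INFECT, KAMINSKY_SUCCESS), the attribute names, and the dict-based event representation are invented for the example.

```python
def worm_attack(log, prior=None):
    """Stand-in for WORMSPREAD.single_spread: hosts found infected."""
    return [e for e in log if e["etype"] == "INFECT"]

def dns_attack(log, prior=None):
    """Stand-in for DNSKAMINSKY.SUCCESS; sip is constrained by the
    prior match (sip = $worm_attack.host), if one is supplied."""
    return [e for e in log
            if e["etype"] == "KAMINSKY_SUCCESS"
            and (prior is None or e["sip"] == prior["host"])]

def leads_to(m1, m2):
    """Compose models: every m1 match followed in time by an m2 match
    whose dependent attributes resolve against the m1 match."""
    def combined(log, prior=None):
        return [(a, b) for a in m1(log, prior)
                       for b in m2(log, a) if b["t"] > a["t"]]
    return combined

COMBINED_ATTACK = leads_to(worm_attack, dns_attack)

log = [
    {"t": 1, "etype": "INFECT", "host": "10.0.0.5"},
    {"t": 2, "etype": "KAMINSKY_SUCCESS", "sip": "10.0.0.5"},
    {"t": 3, "etype": "KAMINSKY_SUCCESS", "sip": "10.0.0.9"},
]
hits = COMBINED_ATTACK(log)  # one composite instance: infect@t=1 ~> kaminsky@t=2
```

The attack at t=3 is not reported because its sip does not resolve against the infected host, which is the attribute-binding behavior line 3 of Figure 3.6 expresses.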
Given our objective of semantic-level data analysis, we require the analysis framework to support (a) analysis of multi-type, multivariate, timestamped data, (b) defining new models by composing existing models, and (c) storage, retrieval and extensibility of domain-specific behavior models. The framework has five components as shown in Figure 3.7: the knowledge base, a data normalizer, an event storage system, an analysis engine and a presentation engine. The decoupling of behavior model specification, input processing and the analysis algorithms allows the framework to be directly applied across several different domains. Subsequent sections discuss the details of each component.

3.6.1 Knowledge Base

The knowledge base provides a namespace-based storage mechanism for behavior models and is central to providing an extensible framework. For example, our networking domain currently defines models for ipflow, tcpflow, icmpflow and udpflow. These behavior models capture common domain information and allow a user to rapidly compose higher-level models by reusing existing behavior models. Reusing a behavior model from the knowledge base constitutes importing it using its namespace and name. For example, referring to the behavior model in Figure 5.5, line 5 imports the IPFLOW model from the NET.BASE_PROTO domain. The namespace allows categorization of models into domain-specific areas while allowing composition of models across domains. We implement namespaces similarly to Java namespaces, that is, each component in the namespace corresponds to a directory name on the filesystem. This simple design ensures that the knowledge base is easily customizable and extensible.

3.6.2 Data Normalizer

The data normalizer maps a data record to the event format defined in Section 3.1.4.
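The Java-style mapping of knowledge-base namespaces to directories can be sketched as follows. The lowercase directory names and the ".model" file extension are assumptions; the prototype's actual on-disk layout may differ.

```python
import os

def model_path(namespace, root="knowledge_base"):
    """Map a dotted model namespace (e.g. NET.BASE_PROTO.IPFLOW) to a
    filesystem path, one directory per namespace component."""
    *dirs, name = namespace.split(".")
    return os.path.join(root, *(d.lower() for d in dirs),
                        name.lower() + ".model")

path = model_path("NET.BASE_PROTO.IPFLOW")
# e.g. knowledge_base/net/base_proto/ipflow.model on POSIX systems
```

An IMPORT statement in a model then reduces to resolving the dotted name to a path and loading the model definition found there, which is what makes the knowledge base extensible by simply adding files.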
Raw data accepted by the normalizer can be in the form of trace files, packet dumps, audit logs, security logs, syslogs, kernel logs or script output, with the only requirement that each data record have a timestamp and a message field. Specialized plugins in the normalizer convert each type of raw data into corresponding events. The events have a timestamp, an event type, and attribute-value pairs that encode the message field of the data record. All events of the same type encode the same attributes. Figure 3.7(b) shows a possible event format for an IP packet from a packet dump. The current normalizer supports a C-based plugin API for writing new specialized plugins. The framework includes plugins for the basic packet types of IP, TCP, UDP, ICMP and DNS, along with plugins for parsing syslog, auth and server logs.

3.6.3 Event Storage

The event storage component is responsible for storing the events from the data normalizer into a database. Every event type has a separate table, the columns of the tables correspond to the event attributes, and each row describes an event. The current implementation stores all events into a SQLite database for two reasons: (a) it provides a standard and ready-to-use interface for storing and fetching events, and (b) its server-less operation and open-source nature ensure portability on commodity systems. Our experience suggests that SQLite performs reasonably well for a large number of situations but presents challenges for complex analysis as the volume of events increases. Our future work includes investigating the scale and efficiency challenges involved in storage and retrieval of events.

3.6.4 Analysis and Presentation Engine

Given a finite sequential log L and a user-defined behavior model φ, the goal of the analysis engine is to find all behavior instances (B¹_φ, B²_φ, ...) from L that satisfy the behavior model. Let the events in L be stored internally in the event storage database E_db.
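The one-table-per-event-type storage scheme described above can be sketched with SQLite's in-memory mode. This is an illustrative sketch of the layout, not the prototype's code; it assumes every event of a type carries the same attributes, as the normalizer guarantees.

```python
import sqlite3

def store_events(db, events):
    """One table per event type; columns are the event's attributes
    and each row is one event."""
    for ev in events:
        etype = ev["etype"]
        cols = [k for k in ev if k != "etype"]
        db.execute(f"CREATE TABLE IF NOT EXISTS {etype} ({', '.join(cols)})")
        db.execute(
            f"INSERT INTO {etype} ({', '.join(cols)}) "
            f"VALUES ({', '.join('?' * len(cols))})",
            [ev[c] for c in cols],
        )

db = sqlite3.connect(":memory:")
store_events(db, [
    {"etype": "PKT_IP", "t": 1, "sip": "10.0.0.1", "dip": "10.0.0.2"},
    {"etype": "PKT_IP", "t": 2, "sip": "10.0.0.2", "dip": "10.0.0.1"},
])
rows = db.execute("SELECT sip, dip FROM PKT_IP ORDER BY t").fetchall()
```

State propositions then map naturally onto SELECT queries over the per-type tables, which is how the analysis engine fetches candidate events from E_db.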
We discuss only the key ideas behind the analysis process by describing the extraction of behavior instances satisfying the IPFLOW model defined in Section 3.2.2 from the sample data in Figure 3.2.

The behavior model φ is first internally represented in a manner similar to a compiler expression tree and is then evaluated left-to-right in a post-order fashion. The satisfiability of the behavior model is determined as a function of the satisfiability of each of the component behaviors according to the semantics defined in Table 3.1. For the IPFLOW model, the state proposition ip_s2d = {etype=PKT_IP, sip=$$, dip=$$} is evaluated first. Since it does not have any dependent attributes, its expression is converted to the query {etype=PKT_IP, sip=*, dip=*}, which is used to fetch all events in E_db matching the query. All events (e1, e2, e3, e4, e5, e6, e7) match the state ip_s2d.

Next, the proposition ip_d2s = {etype=PKT_IP, sip=$ip_s2d.dip, dip=$ip_s2d.sip} is evaluated. Its attributes depend on the attributes of state ip_s2d. So, for each event that matched ip_s2d, a corresponding query is generated by resolving the values of sip and dip using the values from the matched event. From Figure 3.2, e1 matches e7, e2 matches e5, and e3 matches e4. e5 and e6 are also possible candidates, but since e5 already matched e2, it is not paired with e6. Finally, the ~> operator is evaluated, where the satisfiability criteria described in Table 3.1 are applied and any specified operator constraints are checked. The three instances satisfying the criteria, (e1, e7), (e2, e5), and (e3, e4), are returned.

The presentation engine is responsible for extracting the output from the analysis stage and presenting it in a summarized format. We currently support printing the output in a tabular format as shown in Figure 3.7(c). We next present a brief analysis of the algorithm.

Algorithm Analysis. As described in Section 3.2.1, state propositions can contain constant attribute values (cStates), such as 10.1.1.2; dependent values (dStates), such as $s1.dip; or dynamic values (iStates), such as $$.
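The three kinds of state values and the per-match query generation for dependent states can be illustrated with a small evaluator. This is a simplified sketch, not the engine itself: events are plain dicts, queries are evaluated in memory rather than against E_db, and the one-response-per-request rule is an assumption of the example.

```python
def eval_state(events, template, bindings=None):
    """Match events against a state proposition: '$$' is a dynamic
    wildcard (iState), '$state.attr' resolves against a previously
    matched event (dState), and anything else is a constant (cState)."""
    def resolve(v):
        if v == "$$":
            return None                       # dynamic: match anything
        if isinstance(v, str) and v.startswith("$"):
            return bindings[v.split(".")[1]]  # dependent attribute
        return v                              # constant
    want = {k: resolve(v) for k, v in template.items()}
    return [e for e in events
            if all(w is None or e.get(k) == w for k, w in want.items())]

ip_s2d = {"etype": "PKT_IP", "sip": "$$", "dip": "$$"}
ip_d2s = {"etype": "PKT_IP", "sip": "$ip_s2d.dip", "dip": "$ip_s2d.sip"}

def eval_ipflow(events):
    """Post-order evaluation of (ip_s2d ~> ip_d2s): one query for
    ip_s2d, then one dependent query per matched event."""
    instances, used = [], set()
    for a in eval_state(events, ip_s2d):
        for b in eval_state(events, ip_d2s, bindings=a):
            if b["t"] > a["t"] and b["t"] not in used:
                used.add(b["t"])
                instances.append((a["t"], b["t"]))
                break
    return instances

# The seven events of Figure 3.2
log = [{"t": i + 1, "etype": "PKT_IP", "sip": s, "dip": d}
       for i, (s, d) in enumerate([("A", "B"), ("A", "C"), ("B", "D"),
                                   ("D", "B"), ("C", "A"), ("A", "D"),
                                   ("B", "A")])]
assert eval_ipflow(log) == [(1, 7), (2, 5), (3, 4)]
```

Note that the single wildcard query for ip_s2d matches all seven events, while ip_d2s triggers one resolved query per matched event, which is exactly the source of the O(N) versus O(N²) costs analyzed next.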
A simple behavior consists of a combination of these states using one or more combinations of operators and constraints. We assume a constant processing time for all operators and constraints. Then, given an input of N events, processing a state proposition can involve two important operations which influence the runtime: (i) querying using the state expression and (ii) processing the results of the query, if any. In the case of cStates and iStates, there is exactly one query made, and it generates at most N responses. Thus, the worst case for processing those N responses is O(N). In the case of a dState, given N events, there are N queries to be made and in the worst case every query may return O(N) results that have to be processed. Thus, processing dependent states involves a worst case of O(N²) operations. We present our performance results in Section 6.1.

3.7 Related Work

In this section, we discuss the context for the proposed modeling approach by first surveying related literature for formal semantic modeling formalisms used in different networking domains, and then studying the semantic capabilities of four popular data analysis tools in the networking domain.

3.7.1 Formal Modeling Approaches

Specification-based approaches are particularly appealing in various areas of networked and distributed systems due to their ability to be abstract, concise, precise, and verifiable. In formal verification of distributed and concurrent systems, a system is specified in logic and then formal reasoning is applied on the specification to verify desired properties [14, 77]. In declarative networking, a specification language, Network Datalog (NDLog) [83], allows defining high-level networking specifications for rapidly specifying, modeling, implementing, and experimenting with evolving designs for network architectures. In testbed-based experimentation, a simple set of user-supplied expectations is used to validate the expected behavior of an experiment [100].

Formal specification approaches have been well developed within the intrusion detection community and have been successfully applied to network and audit data for analysis.
In this section we first present a brief overview of four such approaches and then compare them to ThirdEye.

Roger et al. [111] leverage the idea that attack signatures are best expressed in simple temporal logic, using temporal connectives to express the ordering of events. They pose the detection problem as a model-checking problem against event logs. Naldurg et al. [102] propose another temporal-logic-based approach for real-time monitoring and detection. Their language EAGLE supports parameterized recursive equations and allows specifying signatures with complex temporal event patterns along with properties involving real time, statistics and data values. Kinder et al. [67] extend the logic CTL (Computation Tree Logic) and introduce CTPL (Computation Tree Predicate Logic) to describe malicious code as a high-level specification. Their approach allows writing specifications that capture malware variants. Ellis et al. [34] introduce a behavioral detection approach to malware by focusing on detecting patterns at higher levels of abstraction. They introduce three high-level behavioral signatures which have the ability to detect classes of worms without needing any a priori information about the worm behavior.

The proposed behavior models are comparable to the approaches of [34, 67, 102] in their use of formal logic and temporal constructs for specifications. But, in addition to providing an extended set of sophisticated, intuitive operators and constructs, the behavior models presented in this paper can be generically applied to model various scenarios over a variety of data and are easily composed into semantically relevant higher-level models. This allows creating a knowledge base to explicitly capture the domain expertise required for analyzing a large variety of operations encountered in networked and distributed systems, as shown in Section 5.
The higher-level behavioral signatures of [34], based on the network-theoretic abstract communication network (ACN), are tightly bound to networking constructs like hosts, routers, sensors and links, making them very restrictive in their ability to express general networked systems behaviors.

3.7.2 Tool Comparison

System goals:
  wireshark: Interactive analysis
  splunk: Interactive analysis
  SEC: Real-time event correlation
  Bro: High-speed, real-time monitoring
  ThirdEye: Interactive analysis

Input data:
  wireshark: Network packets
  splunk: ASCII data from any source
  SEC: ASCII data from files, stdin, pipes
  Bro: Network packets
  ThirdEye: Any type of data (with plugin)

Specification language:
  wireshark: Boolean logic
  splunk: Boolean logic
  SEC: Simple language for specifying rules
  Bro: Bro scripting language
  ThirdEye: Formal language based on temporal logic, interval temporal logic and boolean logic

Primitive constructs:
  wireshark: Boolean predicates
  splunk: Boolean predicates, unix-like pipelines and commands
  SEC: Boolean predicates, functions written in Perl
  Bro: Events (low-level or higher-level)
  ThirdEye: Behaviors (low-level or higher-level)

Semantic constructs:
  wireshark: None
  splunk: External commands can encode semantics
  SEC: Perl functions can encode semantics
  Bro: Network notions such as connections, IP addresses, ports, and network protocols
  ThirdEye: Temporal logic and interval temporal logic operators for defining behaviors

Composibility of specs:
  wireshark: None
  splunk: Queries can be recorded and then composed into other queries
  SEC: Matching events can trigger creation of new high-level events
  Bro: Policies can compose lower-level events to generate higher-level events
  ThirdEye: Behaviors can be composed into higher-level behaviors

Abstraction:
  wireshark: None
  splunk: None
  SEC: Limited
  Bro: Yes
  ThirdEye: Yes

Table 3.2: Comparison of the behavior-based Semantic Analysis Framework with four popular data analysis tools.

In this section, we study four popular analysis tools: wireshark v1.2.7 [133], splunk v4.1 [115], Simple Event Correlator (SEC) v2.5.3 [123], and Bro v1.5.2 [104], and compare them with our behavior-based semantic analysis framework (ThirdEye). Both wireshark and splunk are mainly interactive analysis tools, while Bro and SEC are real-time monitoring tools.
The behavior-based semantic analysis framework (ThirdEye) falls in the category of interactive analysis tools. The tools are compared along seven dimensions in Table 3.2: (a) high-level goals, (b) input data types, (c) analysis specification language, (d) primitive analysis constructs, (e) semantic analysis constructs, (f) ability to compose specifications, and (g) abstraction, that is, specifications in terms of relationships between data attributes.

Each paragraph below introduces an analysis framework, and the reader is directed to Table 3.2 for details. The corresponding features for our framework (ThirdEye) are introduced in Table 3.2 and explored in later sections. We have not considered SQL-based approaches on streaming data for comparison [59], since ThirdEye representations are at a higher level of abstraction than database query languages.

wireshark [133] is an open-source tool for interactive analysis of a large variety of network data from a packet capture file. Wireshark's design can be separated into the analysis framework and plugins. The analysis framework provides the ability to sift through large volumes of packets visually and provides a boolean query grammar for finding "interesting" relationships and statistical summaries over typical networking concepts, for example, rate, flows, bytes, and connections. The plugin architecture, on the other hand, is responsible for normalizing and presenting different types of packet data and protocol behavior to the analysis framework in a uniform way.

splunk [115] is a popular commercial framework for unified analysis of a large variety of data. Splunk's strength comes from its ability to index various types of data, allowing the user to sift through logs by combining search queries using boolean operations, pipes and powerful statistical and aggregation functions. Splunk supports time-based, event-based and value-based correlations and also allows combining queries into higher-level queries.
Splunk is extensible using apps, which allow encoding knowledge as queries for sharing and wider dissemination. However, it does not provide support for explicitly capturing domain expertise with semantic constructs. It does provide the ability to invoke external commands, thus providing an indirect way to incorporate explicit domain expertise into the analyses.

Simple Event Correlator (SEC) [123] is an open-source framework for rule-based event correlation. SEC reads its analysis specifications from a configuration file containing a set of event matching rules and corresponding actions. SEC processes data from log files, pipes and standard streams to trigger the configured actions on a match. It supports both time-based and event-based correlations and also allows specifying abstract rules that bind their values at runtime. SEC is more sophisticated than the previous two tools; it supports composing higher-level events by correlating low-level events, providing a framework for semantic understanding. Its rule types pair and pairwithwindow capture some of the semantics of ordering and duration. However, it lacks support for inferring interval-based temporal relationships like concurrency and overlap, and the analysis specifications in the configuration files are not intuitive for capturing and sharing domain expertise in a generic way.

Bro [104] is a high-speed intrusion detection system for checking security policy violations by passively monitoring network traffic in real time. Bro's security policies are written in the specialized Bro scripting language, which is geared towards security analysis. The language supports semantic constructs such as connections, IP addresses, ports, and various network protocols, along with various operators and functions to express different forms of network analyses. Bro has the ability to do time-based and event-based correlation. However, Bro mainly processes network packet data and uses a programming-language-based analysis approach.
ThirdEye is based on a logic-based specification approach rather than a programming-language-based specification approach like the one followed in Bro. The goal here is that behavior models should be abstract but also concise and precise, to support well-known knowledge representation and reasoning approaches. Logic is declarative and type-free, imparting formal semantics, abstract specifications, and efficient processing by analysis engines. The logic-based approach also enables building a knowledge base of behavior models to explicitly capture domain expertise that can be used to automatically reason about and infer behavior models. However, logic-based approaches are less expressive than programming languages. The expressiveness of the proposed approach is based on requirements derived from characteristics of networked systems, as discussed in Section 3.1.3.

Figure 3.8: Performance vs. Expressiveness Tradeoff

Finally, comparing the proposed modeling approach on performance versus expressiveness with other approaches, we arrive at the situation approximately represented in Figure 3.8. Though the performance of the modeling framework has not been studied explicitly, the belief is that the introduction of semantic constructs would penalize performance as compared to other, semantics-devoid approaches. But as a design choice, we trade performance for semantic expressiveness.

3.8 Summary

This chapter introduced a novel top-down, logic-based modeling approach called behavior models for integrating and interpreting dependent events in terms of high-level behaviors relevant to a decision-making entity. Behavior models drive analysis over dependent event data to extract insights in the form of behaviors. Behavior models allow rapidly modeling system behavior over event data by providing semantic constructs to capture high-level relationships such as causality, ordering and concurrency between events or groups of events.
Further, they allow decision-makers to encode semantically relevant high-level goals and objectives over semantically relevant behaviors.

We showed that behavior models provide mechanisms for the construction of a semantically relevant vocabulary, or a knowledge base for interaction over event data, as reusable, shareable and composable models. Typically, system experts rely on their intuition and experience to manually analyze and categorize scenarios and then hand-craft rules and patterns for analysis. Due to the manual and ad-hoc nature of this analysis process, there is limited extensibility and composibility of analysis strategies. Our approach is more systematic, can retain expert knowledge, and supports composing behaviors from existing models.

We presented the design of a prototype framework to drive analysis over event data using behavior models. Lastly, we saw in comparison to the related work that behavior models trade off modeling complexity for usability, while retaining sufficient expressiveness to model a wide range of scenarios.

Chapter 4

Situation Models

In this chapter, we describe situation models, which allow decision-making entities to explicitly capture the meaning of low-level isolated, unrelated events in interconnected, interdependent systems in terms of higher-level situations relevant to the decision-making entity.

The rest of the chapter is organized as follows. Section 4.1 presents the key insights and constructs behind situation models. Section 4.2 presents the Bayesian formalism used to encode situation models. Section 4.3 explains the analysis workflow of using situation models to drive analysis over data. Section 4.4 discusses the various properties of situation models. Section 4.5 compares situation models to existing top-down modeling formalisms. Finally, Section 4.6 concludes the chapter with a summary of the key ideas presented in the chapter.
4.1 Overview

In this section, we describe the key intuition behind situation models, explain the notion of a situation as the basic unit of semantic abstraction, and discuss how situations are combined into a situation model.

Consider a large-scale, complex system such as the smart grid, consisting of several different subsystems as shown in Figure 4.1. The smart grid will extend traditional power system boundaries to enable active participation in energy production, distribution and control from entities outside the traditional power operations realm, namely, from customers, market entities and service providers [128]. The additional layers of functionality added by the smart grid increase the interconnectedness and interdependencies in the system and increase the complexity of reliably operating the power grid. Specifically, they give rise to multiple control and error propagation channels between different subsystems and increase the vulnerability of the system as a whole to cyber attacks or inadvertent failures. This characteristic implies that isolated and unrelated threats in completely independent parts of the system, or in different subsystems, can manifest as situations affecting higher-level system operations. For example, attacks originating in the cyber domain, such as malicious control of the charging behavior of electric vehicles [103], or malicious remote disconnection of a large number of smart meters [3], can potentially lead to destabilization of the power grid. Figure 4.1 describes such a threat scenario.

Figure 4.1: Complex web of dependencies in large-scale, complex systems. The figure shows customers, electric vehicles, markets (a pricing server) and power system controllers connected through pricing and electrical interfaces, with the goal of maintaining system frequency stable. Threat scenario: (1) pricing server compromised; (2) electric vehicles sent low pricing signals; (3) customer vehicles charge simultaneously; (4) sudden, significant load increase; (5) system frequency drops below 59.97 Hz.
Gaining effective awareness in such systems thus requires decision-making entities to make sense of isolated and unrelated events from different parts of the system to infer potential high-level consequences. Further, effective decision-making and response in such scenarios will also require identification of a root cause at a relevant level of system abstraction.

4.1.1 Situation: Fundamental Semantic Construct

We introduce the notion of a situation as a fundamental semantic construct for comprehension and reasoning. There are several definitions of situations found in the situation awareness and high-level information fusion literature, a few of which are described and compared with our definition later in Section 4.5.

In this thesis, we define a situation as a current or potential undesired state-of-affairs in the system, expressed at a level of abstraction relevant to a decision-making entity. For example, the failure of a network router is a low-level situation. Similarly, an attack on a system asset is a low-level situation. A decision-making entity responsible for higher-level system operations is not interested in knowing about these low-level situations by themselves, but rather in the meaning of the low-level situations in terms of high-level situations relevant to the decision-making entity. For example, in power system operations, the "unavailability of a power reserve", "failure of a demand-response operation to curtail load", "damage of critical equipment" and "violation of system stability" are situations which are relevant to a power system operator [51].

Below, we enumerate a comprehensive set of example situations prominent in a large-scale, complex system such as the smart grid. The situations are broadly categorized into low-level situations and high-level situations, with further categories within those two. Low-level situations occur at the lower levels of the system abstraction, such as within the entities and networks of the system.
High-level situations occur at the highest levels of system abstraction, such as those impacting high-level system operations, and system goals and objectives. Low-level situations are the result of attacks and failures, while high-level situations are the consequences of those low-level situations.

Low-level Situations

1. Change in overall availability (available/unavailable)
   (a) of hardware, software and network assets, for example, the unavailability of an AMI meter, DRAS, customer gateway, or network router
   (b) of services and functions running on those assets, for example, the DR service crashed
   (c) of data assets, for example, meter usage data from the MDM is unavailable.

2. Change in nominal functional behavior (failure-free/failure1/failure2...)
   (a) of hardware, software and network assets, for example, OS out of memory, disk errors, a network card dropping packets
   (b) of services and functions running on those assets, for example, the "message broadcasting" function failed to send DR messages, or the MDM sending corrupted data to the DRAS

3. Change in performance characteristics
   (a) of hardware, software and network assets, for example, deviation from acceptable network latency
   (b) of services and functions running on those assets, for example, a change in service response time
   (c) of data assets, for example, deviations from the acceptable data throughput rate, or values contained in data messages being out of bounds

4. Change in security posture
   (a) of hardware, software and network assets (unauthorized access, unauthorized use, modification), for example, unauthorized access to an AMI meter, and modification of AMI meter firmware. Note that unauthorized access and unauthorized use are distinct situations.
   (b) of services and functions running on those assets (unauthorized access, unauthorized use, modification), for example, unauthorized access to assets
   (c) of data assets (unauthorized use, falsification, disclosure, modification, corruption, obstruction), for example, disclosure of AMI keys, or disclosure of customer usage data.

High-level Situations

1.
Situations relevant to the utility business
(a) Loss of revenue, for example, loss of revenue related to outage management, to customer billing, or to DR functions.

2. Situations relevant to the stability/performance of the overall system
(a) Impact to high-level system goals, for example, impact to the "total load reduction achievable" via demand response.

3. Situations relevant to high-level system operations
(a) Failure of ongoing missions
(b) Inability to instantiate/perform future missions

4. Situations relevant to system security
(a) Security of data assets (confidentiality/integrity/availability)
(b) Security of critical system assets (confidentiality/integrity/availability)

5. Situations relevant to customers
(a) Disclosure of customer data
(b) Inaccurate billing
(c) Safety
(d) Loss of power

4.1.2 Situation Models

A key insight here is that in interconnected and interdependent systems, one or more low-level situations will manifest as a higher-level situation, and one or more such higher-level situations will result in further situations, and so on. For example, two independent router failures will eventually manifest as the failure of a higher-level operation to reduce load via demand response in the smart grid. While traditional bottom-up approaches capture such manifestation via complex system modeling, such as dependencies and structure, situation models capture the meaning of low-level situations directly as the relevant high-level situations. More specifically, a situation model resembles a tree of situations. The root of the situation-model tree represents a high-level situation which is directly relevant to a decision-making entity. The lower-level nodes represent situations which can lead to that situation. At the lowest level of the tree are low-level situations, such as failures of system entities and attacks against system entities.

4.1.3 Deriving a Situation Model From Goals

A decision-making entity builds a situation model (or tree) using his knowledge of the system, as relevant to his decision-making goals.
The procedure to construct a situation model is loosely based on the fault-tree analysis (FTA) approach popular in the system engineering domain. Fault Tree Analysis (FTA) is a popular technique for the dependability modeling and evaluation of large, safety-critical systems [16]. In this technique, an undesired state of a system is specified, and the system is studied in the context of its environment and operation to find all credible ways in which the event could occur. We discuss the key differences between FTA and situation models in Section 4.5.2.

A decision-maker starts with a "root situation" relevant to his decision-making, and decomposes it into a set of lower-level situations based on his knowledge of the system. The lower-level situations are further decomposed using the same approach. A situation represents AND or OR combinations of lower-level situations, and sequential combinations of lower-level situations, that could cause the root event to occur.

Each node in the tree represents a situation, which, as defined earlier, is a current or potential undesired state-of-affairs in the system, expressed at a level of abstraction relevant to the decision-making entity. Each situation corresponds to an entity-of-interest relevant to a decision-making entity, where an entity-of-interest could be a conceptual entity (such as a high-level system state, a system function, a high-level system operation, or a grouping of entities), or a physical entity (such as a device or a service). For example, "Service X unavailable" is a situation corresponding to the entity-of-interest service X.

A lower-level situation can be shared by one or more top-level situations. This makes it possible to encode common-cause failures. For example, "bad software update to smart meter" is a situation shared by all the smart meters from the same vendor.

The lowest-level nodes in the tree represent lower-level situations, such as faults occurring in system entities, or attacks.
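As a concrete illustration of the decomposition just described, a situation model can be captured with a small tree data type. This is only a sketch: the node names below (beyond "Service X unavailable" from the text) and the gate choices are hypothetical, not taken from any model in this thesis.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Situation:
    """A node in a situation model: a named undesired state-of-affairs,
    combined from its children via an AND or OR gate (None for a leaf)."""
    name: str
    gate: Optional[str] = None            # "AND", "OR", or None for a leaf
    children: List["Situation"] = field(default_factory=list)

# A tiny illustrative model: a root situation decomposed into
# lower-level device and network situations (hypothetical names).
model = Situation("Service X unavailable", "OR", [
    Situation("Host running X crashed"),
    Situation("Network path to X down", "AND", [
        Situation("Primary router failed"),
        Situation("Backup router failed"),
    ]),
])

def leaves(node: Situation) -> List[str]:
    """The lowest-level situations: those tied to observable events."""
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaves(child)]
```

The `leaves` helper recovers exactly the lowest-level situations (faults and attacks) that must be backed by monitors or event sources.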
We define a fault as the cause of a failure; faults range from specification and design defects to physical or human factors [5]. Both attacks and faults lead to failure situations.

Each situation node at any level of the tree can be observed using one or more events, or it can be inferred from the lower-level situations. For example, the "unavailability of a device" can be observed using a monitor which pings the device periodically.

The situation model contains a special node, called an UNKNOWN node, to model unknown situations. The unknown situation is important because available knowledge in complex systems is always imperfect. Note that the unknown situation is not added to every situation node; this reflects the decision-maker's confidence in his knowledge about the system. An UNKNOWN situation cannot be observed, i.e., it cannot be attached to an observable, and it cannot be decomposed further.

A simple situation model is described in Figure 4.2. It demonstrates how the top-level situation "unavailability of a customer gateway device" is decomposed into its lower-level situations. We note that there may be many ways to decompose a top-level goal, depending on the goals of the decision-maker and his knowledge of the system.

Figure 4.2: Simple situation model (decomposing "Customer Gateway Device X unavailable" into "X operationally down" and "X overloaded", and further into shutdown, crash, software/hardware failure, and malware-infection situations, with UNKNOWN nodes at each level).

Once a situation model is built, it is encoded for processing at runtime. Such a situation model then drives analysis over runtime data. A variety of encodings are possible; we choose to encode the situation model as a Bayesian dependency network. We describe the formalism for encoding next, in Section 4.2.

4.2 Modeling Formalism

In this section, we describe the details of the formalism we use to encode situation models for runtime processing.
We first present a brief overview of Bayesian networks, followed by the procedure to encode a situation model as a Bayesian network. We then discuss how analysis is performed using the encoded situation models.

4.2.1 Background on Bayesian Belief Networks

Overview

In this thesis, we choose Bayesian Belief Networks (BN for short) as the formalism to encode our situation models. A Bayesian belief network is a graphical, probabilistic knowledge representation of a collection of variables describing some domain. The network is a directed, acyclic graph where the nodes of the BN denote the variables and the edges denote causal relationships between the variables [73].

The reasons for choosing Bayesian networks as the encoding mechanism are as follows:

1. BNs provide a formal mathematical language for reasoning under uncertainty.
2. BNs allow combining evidence from multiple, unreliable sources.
3. BNs enable conditioning on partial evidence.
4. BNs allow propagating evidence systematically through the dependencies.
5. BNs allow forward and backward reasoning.
6. BNs allow encoding several different types of dependencies.
7. BNs are a well-established formalism used across several domains.

A Simple Bayesian Network.

Figure 4.3 shows a simple Bayesian network consisting of one target node and three leaf nodes. The leaf nodes represent factors (causes) which influence the target variable (effect). As shown in Figure 4.4, each node has a set of states (values) with some initial distribution specified as a conditional probability table.

Figure 4.3: An example Bayesian network - nodes and edges.

The network estimates the likelihood of compromise of an AMI meter based on three factors:

1. the mode of access to the AMI meter (MeterAccessMode), whether local or remote;
2. the authorization status of the access (AccessAuthorization), whether authorized or unauthorized;
3. the status of the AMI configuration (AMIConfig), whether changed or unchanged.

Figure 4.4: An example Bayesian network - initial beliefs.
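To make the three-factor network concrete, the sketch below evaluates it by direct enumeration. The priors and conditional probability table are illustrative placeholders, not the actual values from the GeNIe model discussed later, so the computed numbers are for illustration only and differ from the figures.

```python
# Hypothetical priors over the three causal factors.
P_local = 0.5      # P(MeterAccessMode = local)
P_unauth = 0.1     # P(AccessAuthorization = unauthorized)
P_changed = 0.2    # P(AMIConfig = changed)

# Hypothetical CPT: P(Compromised = true | local?, unauthorized?, changed?)
cpt = {
    (True,  True,  True):  0.95, (True,  True,  False): 0.80,
    (True,  False, True):  0.30, (True,  False, False): 0.05,
    (False, True,  True):  0.90, (False, True,  False): 0.60,
    (False, False, True):  0.25, (False, False, False): 0.02,
}

def p_compromised(local=None, unauth=None, changed=None):
    """Posterior P(Compromised = true | evidence), marginalizing over
    any factor whose state is unobserved (None)."""
    num = den = 0.0
    for l in ([local] if local is not None else [True, False]):
        for u in ([unauth] if unauth is not None else [True, False]):
            for c in ([changed] if changed is not None else [True, False]):
                w = ((P_local if l else 1 - P_local) *
                     (P_unauth if u else 1 - P_unauth) *
                     (P_changed if c else 1 - P_changed))
                num += w * cpt[(l, u, c)]
                den += w
    return num / den

p_no_evidence = p_compromised()                          # nothing observed
p_local_unauth = p_compromised(local=True, unauth=True)  # partial evidence
```

With no evidence, the posterior stays near the weak prior; once local, unauthorized access is observed, it rises sharply, mirroring the belief-update behavior described for Figures 4.4 and 4.5 (the exact percentages there come from the thesis model, not this sketch).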
From Figure 4.4, we see that when nothing is known about any of the three factors, it is hard to conclude the security status of the meter. But as more information becomes available, the estimates become more meaningful. As shown in Figure 4.5, when it is known that someone accessed a meter locally, and the access was unauthorized, the likelihood of the meter being compromised increases to 80%.

Figure 4.5: An example Bayesian network - updated beliefs.

4.2.2 Encoding a Situation Model

In this section, we describe the process of converting a situation model into a Bayesian network. The conversion procedure described below is adapted from the general procedure for converting a fault tree to a Bayesian network [16].

Given a situation model, it is mapped into a binary Bayesian network, i.e., a BN in which every situation S has two admissible values: true(S), corresponding to a situation which has been observed, and false(S), corresponding to a situation which has not been observed.

As we described earlier, the situation model has AND/OR gates. Converting an AND/OR gate in the situation model is a straightforward process: for each node corresponding to an AND (respectively, OR) gate, create a CPT such that the node is true with probability 1.0 iff all incoming nodes are true (respectively, iff at least one incoming node is true), and false with probability 1.0 elsewhere [16]. The problem with this simple procedure is that when specifying CPT entries, one has to condition the value of a variable on every possible instantiation of its parent variables, resulting in a number of entries exponential in the number of parents. Noisy gates can reduce this effort by requiring a number of parameters linear in the number of parents, while keeping the ability to model uncertain relations. In our approach, we use noisy-OR and noisy-AND gates to model the corresponding OR and AND gates in the situation model.

A noisy-OR node in a Bayesian network is a generalisation of a logical OR. As in the case of the logical OR, an event Y is presumed to be false if all the conditions that cause Y are false.
However, unlike a logical OR, if one of the causes of the event Y is true, it does not necessarily imply that Y is definitely true. Formally, the noisy-OR model is defined as follows [16, 73]. Given a binary variable Y having the set of parent binary variables X_1, ..., X_n, the noisy-OR model requires specifying n parameters λ_1, ..., λ_n, where each λ_i is interpreted as the probability of Y = true given that X_i is true and all the other parents are false, i.e.,

λ_i = P(Y | ¬X_1, ..., X_i, ..., ¬X_n).

By assuming that each X_i influences Y independently of the others, the local model is completely specified if we further assume that Y is false when none of the parents is true [16, 73]. It can be shown [73] that if X is a particular instantiation of X_1, ..., X_n, and π_X is the set of true variables in X, then

P(Y | X) = 1 − ∏_{X_i ∈ π_X} (1 − λ_i).

The above noisy-OR model still does not cater to UNKNOWN situation nodes in the situation model. These can be handled by using the noisy-OR with leak model. In this model, one assumes that there is a positive probability (called the leak or background probability) of having Y true even when all parents X_i are false. This can be modeled by thinking of the influence between the X_i and Y as changed by adding an unknown parent L: the leak. In this way, P(Y | ¬X_1, ..., ¬X_n) is interpreted as P(Y | ¬X_1, ..., ¬X_n, L) = λ_0, and the earlier equation becomes

P(Y | X) = 1 − (1 − λ_0) ∏_{X_i ∈ π_X} (1 − λ_i).

A similar definition can be specified for the noisy-AND model. A noisy-AND node is the dual of a noisy-OR node: it is a generalisation of a logical AND. The noisy-AND model requires specifying n parameters λ_1, ..., λ_n, where each λ_i is interpreted as the probability of Y = true given that X_i is false and all the other parents are true (i.e., λ_i = P(Y | X_1, ..., ¬X_i, ..., X_n)). In this case,

P(Y | X) = ∏_{X_i ∉ π_X} λ_i.
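The noisy-OR (with leak) and noisy-AND formulas above translate directly into code. The following is a generic sketch of the local models, with illustrative parameter values; it also shows how the n per-parent parameters expand into the full 2^n-row CPT that would otherwise have to be written by hand.

```python
from itertools import product

def noisy_or(parents_true, lambdas, leak=0.0):
    """P(Y = true | X) = 1 - (1 - leak) * prod over true parents of (1 - lambda_i)."""
    p_off = 1.0 - leak
    for on, lam in zip(parents_true, lambdas):
        if on:
            p_off *= 1.0 - lam
    return 1.0 - p_off

def noisy_and(parents_true, lambdas):
    """Dual model: P(Y = true | X) = prod over *false* parents of lambda_i,
    where lambda_i = P(Y | X_1, ..., not X_i, ..., X_n)."""
    p_on = 1.0
    for on, lam in zip(parents_true, lambdas):
        if not on:
            p_on *= lam
    return p_on

def expand_cpt(n, lambdas, leak=0.0):
    """Expand the n linear noisy-OR parameters into the full 2^n-row CPT."""
    return {combo: noisy_or(combo, lambdas, leak)
            for combo in product([False, True], repeat=n)}

# Two parents with illustrative activation probabilities and a small leak.
cpt = expand_cpt(2, [0.9, 0.8], leak=0.05)
```

Note that with a leak of λ_0, the all-parents-false row of the CPT evaluates to exactly λ_0, matching the interpretation of the UNKNOWN parent L above.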
The procedure for converting a situation model into a Bayesian network is as follows:

a) For each leaf situation node of the model (which is not an UNKNOWN node), create a corresponding node in the Bayesian network.

b) Set the prior probability of the leaf node to true = 0.01, false = 0.99. This ensures that the default state of the situation is false. It is important to leave a small positive probability for the true state; if the prior probability of true were zero, the situation could never be set to true later.

c) For each OR node:
(a) If the OR node has an UNKNOWN node as a sub-node,
i. Create a "noisy-OR with leak" node;
ii. Set the CPT such that for each incoming node into the OR, true = 0.99, false = 0.01. This ensures that each incoming node can turn on the OR node by itself. These values can be adjusted for specific cases to model the relative strength of inputs.
iii. Set the leak probability to true = 0.01, false = 0.99.
(b) If the OR node does not have an UNKNOWN node as a sub-node,
i. Create a "noisy-OR" node;
ii. Set the CPT such that for each incoming node into the OR, true = 0.99, false = 0.01. This ensures that each incoming node can turn on the OR node by itself. These values can be adjusted for specific cases to model the relative strength of inputs.
iii. Set the leak probability to true = 0, false = 1.

d) For each AND node:
(a) If the AND node has an UNKNOWN node as a sub-node,
i. Create a "noisy-AND with leak" node;
ii. Set the CPT for each incoming node as follows:
p(ANDnode = true | innode = false, other nodes = true) = 0.01
Setting a low probability ensures that the AND node won't be true unless all the nodes are true. These values can be adjusted for specific cases to model the relative strength of inputs.
iii. Set the leak probability to true = 0.01, false = 0.99.
(b) If the AND node does not have an UNKNOWN node as a sub-node,
i. Create a "noisy-AND" node;
ii. Set the CPT for each incoming node as follows:
p(ANDnode = true | innode = false, other nodes = true) = 0.01
Setting a low probability ensures that the AND node won't be true unless all the nodes are true. These values can be adjusted for specific cases to model the relative strength of inputs.
iii. Set the leak probability to true = 0, false = 1.

e) Connect the nodes in the BN in the same way as in the situation model, but reverse the direction of the arrows.

Figure 4.6: Simple situation model encoded as a Bayesian network.

We elaborate on the above procedure by applying it to the simple situation model example described earlier in Figure 4.2. The corresponding Bayesian network is shown in Figure 4.6. Figure 4.7 shows a different view of the same Bayesian network with all the probability values. The Bayesian models described in this thesis were created using the GeNIe modeling environment developed by the Decision Systems Laboratory of the University of Pittsburgh [28].

4.2.3 Analysis Using a Situation Model

Standard analysis of a BN concerns the computation of the posterior probability of any given set of variables given some observation (the evidence), represented as the instantiation of some of the variables to one of their admissible values. This general analysis may proceed along two lines [73]:

1. A forward (or predictive) analysis, in which the probability of occurrence of any node of the network is calculated on the basis of the prior probabilities of the root nodes and the conditional dependence of each node.

2. A backward (diagnostic) analysis, in which, given some evidence, the posterior probability of any set of variables that may cause the evidence is computed.

Figure 4.7: Simple situation model encoded as a Bayesian network (showing initial belief values).

We demonstrate the two cases by applying them to the Bayesian network example presented earlier in Figure 4.6. Figure 4.8 demonstrates how observed low-level situations are used to compute a high-level consequence. Specifically, given the fact that "malware infection on X" has been reported, the probability of the root situation "Customer gateway device X unavailable" increases to 25%.

Figure 4.9 demonstrates how the same network is used to compute low-level causes given a high-level situation.
Assume that we observe that "Customer Gateway Device X is unavailable". Further, it is known that "X is NOT shutdown" and "X is NOT overloaded". This leads to a 57% probability of the potential situation "X has crashed".

Figure 4.8: Predictive analysis using the simple situation model.

Figure 4.9: Diagnostic analysis using the simple situation model.

4.3 Analysis Workflow

Situation models provide a methodology for users to capture their high-level understanding as models over data. In this section, we describe a typical workflow of how such models can be used to drive analysis over data. The workflow consists of a modeling phase and an analysis phase.

Modeling Phase. The modeling phase consists of two key steps.

Deriving a situation model from goals. In this step, the decision-maker derives a situation model depending upon his high-level goals and objectives. The decision-maker expands the situation tree up to his desired level of abstraction, using his available knowledge of the system. The situations at any level of abstraction can be observed using events from heterogeneous sources in the system. Decision-makers can use sub-models from existing situation models to compose their own models.

Encoding the situation model as a BN. The decision-maker uses the procedure described in Section 4.2.2 to encode the situation model as a Bayesian network representation. This step can also be accomplished with the help of an automated tool.

Analysis Phase. In the analysis phase, models drive analysis over data. Given timestamped event data, an analysis engine uses the Bayesian model to compute (a) the probability of the root situation given the current state of the low-level situations, and (b) the potential causes (low-level situations) of an observed high-level situation, given all the evidence available.

At each time instant t, the current set of events is gathered from all the relevant sources, and the corresponding nodes in the Bayesian situation model are set to "true". The probability of occurrence of the root situation is then computed and reported.
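The runtime loop of the analysis phase might look like the following sketch: events arriving at each instant set or reset the corresponding evidence nodes, and the root probability is recomputed each time. The event names are hypothetical, and a single noisy-OR over the leaves stands in for a full Bayesian inference engine.

```python
def noisy_or(parents_true, lambdas, leak=0.01):
    # P(root = true | leaf states), per the noisy-OR-with-leak model.
    p_off = 1.0 - leak
    for on, lam in zip(parents_true, lambdas):
        if on:
            p_off *= 1.0 - lam
    return 1.0 - p_off

def run_analysis(event_stream, leaf_names, lambdas):
    """Yield the root-situation probability at each time instant.
    Evidence persists across instants until an event resets it."""
    state = {name: False for name in leaf_names}   # default: not observed
    for events in event_stream:
        for name, value in events:                 # e.g. ("malware on X", True)
            state[name] = value
        yield noisy_or([state[n] for n in leaf_names], lambdas)

# Illustrative stream: one low-level situation reported, carried
# forward for one instant, then reset by a later event.
stream = [
    [("malware on X", True)],    # t = 0
    [],                          # t = 1: no new events, evidence persists
    [("malware on X", False)],   # t = 2: reset event arrives
]
probs = list(run_analysis(stream, ["malware on X", "X overloaded"], [0.9, 0.9]))
```

In a real deployment, the `noisy_or` call would be replaced by posterior queries against the encoded BN (e.g., via a tool such as GeNIe), and diagnostic queries for likely causes would run alongside the root computation.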
Similarly, the potential causes of an observed high-level situation are also computed. At the next time instant t+1, the nodes which were set are carried forward, unless a specific event arrives to reset them to their default state.

4.4 Discussion of Properties

In this section, we discuss some of the interesting properties of situation models which make them suitable constructs for integrating and interpreting events in large-scale, complex systems.

4.4.1 Semantically-relevant Insights

As described earlier in Section 4.1.3, situation models are derived from the goals and objectives of the decision-maker. Thus, when a model is used for analysis, the insights produced are always directly relevant to a decision-making entity's goals and objectives. Situation models provide a framework to combine information in a highly contextual manner.

4.4.2 Abstraction and Modeling Simplicity

A situation model captures knowledge in a large-scale, complex system in the form of lower-level situations manifesting into higher-level situations. The model captures the inherent interconnectedness and interdependencies in systems without requiring explicit modeling of these complex relationships. The models are thus abstract enough for decision-makers to capture sufficient information without the need to capture the gory details.

Further, the knowledge used to construct a situation model often already exists in the form of fault or attack trees. The models then only need to be encoded in a relevant form for automated analysis.

4.4.3 Integration of Heterogeneous Data

Situation models by their very nature allow combining information from a diverse variety of sources. As discussed in Section 1.1, information in the form of events could originate from low-level system entities, system monitors, or advanced analysis tools already existing in the system.

4.4.4 Minimal Knowledge to Construct Models

Building models in large-scale, complex systems requires detailed knowledge of the system, available only to a few experts.
Further, decision-making entities responsible for operations do not possess detailed knowledge of every aspect of the system and its subsystems. While they may possess detailed knowledge in some areas, they may only possess abstract knowledge in other areas, such as the dependencies between subsystems. Situation models allow knowledge to be captured easily at a level of abstraction relevant to the decision-making entity.

4.4.5 Integration of Vertical and Horizontal Information

Situation trees combine information horizontally and vertically. That is, they enable integrating information from across several layers of system abstraction (vertical), and also integrating information at the same layer from across subsystems (horizontal). For example, a cyber and a physical situation can be easily expressed within a single situation model.

This ability to integrate data from across subsystems, and at the lowest levels of system abstraction within a system, is invaluable for anticipating potential situations before they actually manifest. For example, collecting data about pricing signals sent to EVs can help anticipate potentially malicious charging behavior and its subsequent consequences on the grid. This allows a decision-maker to take actions, such as increasing generation, to prevent a potential collapse of the grid.

4.4.6 Explicit Requirements For Data Collection

An additional advantage of situation models is their ability to explicitly encode the information that needs to be collected. This is a natural byproduct of the top-down goal decomposition which creates those models. This is valuable in large-scale, complex systems, where decision-makers often do not know the right information to look for. For example, in the case of the Sydenham railway disruption incident [113], operators were not monitoring the low-level router logs which were clearly indicative of a router failure. If the operators had known that router information was to be monitored, the catastrophic failure of the entire railway control system could have been prevented.
4.5 Related Work

In this section, we compare different aspects of our situation-model approach with related work across domains. We first look at the different ways in which the fundamental notion of a situation has been defined, and point out the key differences with our definition. We then present a comparison of situation models to existing fault/attack analysis work. Finally, we compare our work to existing high-level specification-based approaches, such as the ontology-driven approach and the goal-modeling approach.

4.5.1 Definition of Situation

The notion of a situation as a fundamental construct for high-level reasoning has been defined in several ways in the literature. Here we present a brief overview of some of the popular definitions, and compare them to the definition proposed in this thesis.

Formal Definitions

The first formal specification of a situation for computer-based situation management was given by McCarthy and Hayes in their Situation Calculus, where they used first-order logic expressions to define a situation as a snapshot of a complete world state at a particular time [94].

Barwise and Perry [10] introduce the idea of situation semantics. Situations are basic and ubiquitous: we are always in some situation or other. Human cognitive activity categorizes these situations in terms of objects having attributes and standing in relations to one another at locations - connected regions of space-time.

Devlin defines a situation as "a structured part of reality that is discriminated by some agent" [31]. Such structure can be characterized in terms of the states of entities and of the relationships among them. Determining just which entities, state elements and relationships are parts of a situation, or relevant to a situation, is an important part of the problems of situation semantics and of situation assessment (level 2 data fusion).

The objective of the above work has been to define formal approaches to modeling situations. In this work, we are focused on practical approaches to assist decision-making entities in large-scale, complex systems.
The definition adopted in this thesis is carefully selected to handle cases specific to the problem at hand.

System-centric Definitions

Tadda et al. [117] define a situation as a person's world view of a collection of activities that one is aware of at an instant in time. An entity is defined as "something that has a distinct, separate existence, though it need not be a material existence". An object is "a physical entity; something that is within the grasp of the senses". A group is "a number of things being in some relation to each other". An event is "something that takes place; an occurrence at an arbitrary point in time; something that happens at a given place and time". Both entities and groups can be associated with a specific event or events. An activity is "something done as an action or a movement". Activities are composed of entities/groups related by one or more events over time and/or space.

Jakobson et al. [60] define situations as aggregated states of some objects considered at certain time moments. These objects are engaged in different class, structural, causal, spatial, temporal and other relations, forming complex multi-object systems. The situations describe not only the states of the objects, but also the states of the inter-object relations. For example, we can talk about cyber situations that are happening in a cyber terrain, i.e., situations concerning network hardware components, software assets, or the IT services provided over the hardware and software assets of the cyber terrain.

The above definitions are very generic in nature, and their purpose is to cater to a wide variety of cases across several application domains. Further, the above definitions define situations using a complex set of relationships between objects. In this work, we are focused on practical approaches to assist decision-making entities in large-scale, complex systems. The definition adopted in this thesis is carefully selected to favor simplicity over generality.

Data-centric Definitions

Lambert et al.
[75] define a situation as a collection of related spatio-temporal facts, where facts consist of relations between objects and are expressed symbolically through sentences.

Cardell-Oliver et al. [19] define a situation as an abstraction for a pattern of observations made by a distributed system such as a sensor network. A situation is a specific pattern of observations, as defined by a predicate on a set of observations selected from the full spatial-temporal map generated by a sensor network. Examples of situations include "unusually high average temperature from at least five sensors in one region of the landscape," or "the stove is on in the kitchen for more than 10 minutes with no person present."

The definition of situation presented in this work is closest to the definition proposed by Lambert et al. [75]. The definition proposed in this thesis is perhaps simpler because it does not explicitly factor in the spatial dimension.

Threat-centric Definitions

Steinberg et al. [116] define a threat situation as one in which there is some likelihood of certain types of potential events (e.g., attacks) by some agent against threatened system entities.

Steinberg's definition is very specific in defining a situation as something which results from malicious intent. The definition adopted in this work is more generic, as it defines a situation as an "undesired state of affairs", which can be the result of a low-level system fault or of an attack.

4.5.2 Fault/Attack Analysis and Bayesian Networks

In this section, we compare situation models to the popular fault/attack analysis approaches employed in several domains.

Fault Tree Analysis

Fault Tree Analysis (FTA) is a popular technique for the dependability modeling and evaluation of large, safety-critical systems [13, 16]. The technique is based on the identification of a particular undesired event to be analysed (e.g., system failure), called the Top Event (TE). The construction of the Fault Tree (FT) proceeds in a top-down fashion, from the events to their causes, until failures of basic components are reached [16].
The methodology is based on the following assumptions:

1. Events are binary (working/not-working);
2. Events are statistically independent;
3. Relationships between events and causes are represented by means of logical AND and OR gates.

The main purpose of a fault-tree analysis is to calculate the probability of the root event, using statistics or other analytical methods and incorporating actual or predicted quantitative reliability and maintainability data. When the root event is a security violation, and some of the sub-events are deliberate acts intended to achieve the root event, then the fault tree is an attack tree.

Although situation models are based on the similar idea of decomposing a top-level situation into lower-level situations, there are four key differences from the fault-tree approach:

1. Situation models (in their unencoded form) are not used for analysis directly. The approach does not require assigning any prior probabilities to leaf nodes of the situation tree during the modeling phase.

2. A fault tree allows deterministic relationships between the events and causes by means of logical AND and OR gates, while a situation model (in its Bayesian form) allows probabilistic relationships by means of noisy-OR and noisy-AND nodes. This is an important asset in modeling uncertain situations and relationships in large-scale, complex systems.

3. Situation models explicitly allow modeling of unknown situations, which is important for modeling the imperfection of knowledge in large-scale systems.

4. Situation models (in their Bayesian form) can be used for general diagnostic analysis, as demonstrated by the examples in earlier sections, while fault-tree analysis does not accommodate that.

Attack Tree Analysis

Attack trees are special cases of fault trees, and are very popular security analysis tools [42, 96, 106, 120]. In general, the security incident that is the goal of the attack is represented as the root node of the tree, and the ways that an attacker could reach that goal are iteratively and incrementally represented as branches and sub-nodes of the tree.
Each sub-node defines a subgoal, and each subgoal may have its own set of further subgoals, etc. The final nodes on the paths outward from the root, i.e., the leaf nodes, represent different ways to initiate an attack. Each node other than a leaf is either an AND-node or an OR-node. To achieve the goal represented by an AND-node, the subgoals represented by all of that node's sub-nodes must be achieved; for an OR-node, at least one of the subgoals must be achieved. Branches can be labeled with values representing difficulty, cost, or other attack attributes, so that alternative attacks can be compared.

A fundamental strength of situation models lies in the fact that the generic definition of a situation as an undesired state of affairs allows incorporating both fault and attack nodes into the situation model, thus eliminating the need for separate analyses. Further, it allows integrating security situations which otherwise could not have been integrated into a purely fault-tree-based analysis.

4.5.3 Ontology-based Approaches

Ontology-based approaches are popular top-down approaches in high-level information fusion and situation awareness applications [9, 11, 12, 32, 50, 71, 72, 89, 90]. An ontology is a formulation of the entities that are relevant to a domain, as well as the relationships between those entities. Ontologies have recently been introduced into higher-level information fusion, where they provide a mechanism for describing and reasoning about sensor data, objects, relations and general domain theories [71]. Ontologies capture potential objects and potential relations; that is to say, they do not describe what is in the world but rather what can be in the world. Ontologies, however, can be used to annotate or mark up descriptions of instances of the world in what are called instance annotations.

All the different ontology-based approaches are similar in nature.
For the sake of comparison, we elaborate on SAWA (Situation Awareness Assistant) [89], which facilitates the development of user-defined domain knowledge in the form of formal ontologies and rule sets, and then permits the application of the domain knowledge to the monitoring of relevant relations as they occur in a situation. SAWA includes tools for developing ontologies in OWL and rules in SWRL, and provides runtime components for collecting event data, storing and querying the data, monitoring relevant relations, and viewing the results through a graphical user interface. In the most general terms, SAWA is an ontology-based situation monitor. Its main goal is to monitor a "standing relation" (or a goal), i.e., a query formulated in terms of an underlying ontology. SAWA collects information (events) and invokes its inference engine, which derives whether the relation holds or not.

Ontology-based approaches are very flexible and have been applied successfully in a variety of domains. They have the potential to produce insights semantically relevant to the goals and objectives of a decision-maker. But a successful application of this approach requires capturing the domain knowledge, in terms of the objects, relationships and data, very well. This approach again leads to increased complexity in the domain of large-scale, complex systems such as the smart grid.

The whole purpose of situation models is to keep the modeling abstractions really simple, so that decision-makers can rapidly compose situation models without the need to learn the complicated concepts (stored in some ontology) relevant to a domain and then use them to specify models. There is no doubt that a full-fledged ontology-based approach might enable modeling complex relationships, but in our work we explicitly trade off this complexity for simplicity and usability.

4.5.4 Goal-modeling Approaches

Goal-modeling approaches are primarily used for requirements engineering in areas such as business analysis, stakeholder analysis, security analysis, and context analysis [33, 35, 81, 82, 136-138].
Goal modeling approaches such as i* (pronounced "i star"), KAOS and UML diagrams are modeling languages suitable for an early phase of system modeling, in order to understand the problem domain. They are used for modeling and reasoning about organizational environments and their information systems, composed of heterogeneous actors with different, often competing, goals that depend on each other to undertake their tasks and achieve these goals. In general, in such approaches goals describe the objectives that the system should achieve through the cooperation of actors in the software-to-be and in the environment [82].

Goal modeling approaches are attractive options for modeling the situation awareness needs of a decision-maker because the modeling is high-level and incorporates explicit goal representations. In a way, the goal-based decomposition of situation models could be explicitly modeled using goal-based languages. But goal-based approaches are optimized for an entirely different purpose. They provide constructs and mechanisms to measure requirement completeness with respect to a set of goals. From a situation awareness perspective, such modeling would be overkill. For example, the KAOS modeling language [108] allows modeling goals, subgoals, expectations and requirements. It allows modeling agents who are responsible for the goals and requirements. Further, it allows modeling obstacles to the achievement of goals. While the set of constructs provided by KAOS is rich, the formal analysis techniques themselves are not optimized from a situation awareness perspective.

4.6 Summary

This chapter introduced a simple modeling approach called situation models for integrating and interpreting isolated and independent events in terms of high-level situations relevant to a decision-making entity. Situation models drive analysis over event data to extract semantically relevant insights in the form of higher-level situations, or to diagnose the causes of a situation. Further, situation models directly encode the high-level goals and objectives of a decision-making entity.
They help different decision-making entities extract completely different sets of insights (as relevant to each) from the same set of low-level facts.

We discussed how situation models are derived using a top-down approach, and encoded using a Bayesian network formalism. We saw how the models can be built with high-level knowledge of the system, and how predictive and diagnostic analysis can be performed using situation models. Further, in comparison to related work, we saw the novelty of situation models in terms of their simplicity and usability.

Chapter 5

Case Studies

As stated in Chapter 1, our primary objective in this thesis is to improve the state-of-the-art in specification-based methods by enabling decision-makers within the smart grid environment to express analysis tasks at a higher level of abstraction. We focus on two analysis tasks, specifically:

a) the detection of a current situation over a sequence or groups of related lower-level events; and

b) the anticipation of potential higher-level situation(s) over isolated, independent low-level events.

Chapter 3 and Chapter 4 presented modeling abstractions to assist decision-makers in building high-level specifications that drive semantically relevant analysis over low-level event data. In this chapter, we demonstrate the effectiveness of the above abstractions by applying them to three case studies. Specifically, we present two case studies which demonstrate how behavior models can be used by decision-making entities to specify complex attack behavior over network packet events. We then demonstrate how situation models enable multiple decision-making entities involved in a demand response operation to make sense of heterogeneous, low-level facts and extract insights semantically relevant to their high-level decision-making goals.
We demonstrate the effectiveness of the proposed modeling abstractions in terms of (a) the expressiveness of semantic constructs in capturing the high-level understanding of a decision-maker, (b) the simplicity of the abstractions in enabling rapid specification, (c) the ability to share and reuse models, and (d) the ability to customize models based on the needs of a decision-making entity.

The rest of the chapter is organized as follows. Section 5.1 applies behavior models to model the complex multi-step attack behavior of a DNS cache poisoning attack, and demonstrates how such a model can be used to extract attack instances from a network packet stream. Section 5.2 applies behavior models to model a distributed denial-of-service attack, and demonstrates the use of the model by applying it to extract DDoS instances from a network packet stream. Section 5.3 applies situation models to demonstrate how the awareness of a smart grid demand response operator can be improved in a complex, large-scale system such as the smart grid.

5.1 Detecting Complex Behavior: DNS Attack

We have the following objectives in this case study:

a) Demonstrate that behavior models can be applied over dependent event data to extract insights in the form of semantically relevant behaviors.

b) Demonstrate the use of behavior models in assisting analysis over complex datasets.

c) Demonstrate that a model of complex attack behavior can be rapidly composed in a dataset-independent and system-independent way.

d) Demonstrate that existing models can be reused to build high-level complex models.

5.1.1 Scenario

DNS Kaminsky Attack. We present an experiment emulating Dan Kaminsky's popular DNS attack [63] using the Metasploit [98] framework. Referring to Figure 5.1, the attacker's objective is to poison the cache of victimns so that any requests to eby.com are redirected to a fake nameserver (fakens) instead of the real nameserver (realns). We refer the reader to [63] for a detailed understanding of the attack.

Figure 5.1: DNS Kaminsky experiment setup.
Since the attack exploits a race condition, the experiment setup has to permit successful occurrences as well as failed occurrences of the attack.

The attack begins when the attacker sends a DNS query to victimns for some non-existent domain name under eby.com. victimns (a recursive nameserver) sends the DNS query upstream to realns, and realns sends a DNS response back to victimns. The attacker, meanwhile, starts sending forged DNS responses in an attempt to poison the cache of victimns. The experiment keeps running until it succeeds in poisoning the victim nameserver's cache.

5.1.2 Modeling Phase

Abstract Model of Behavior

Figure 5.2 captures the experiment behavior as a tree of possibilities, where the nodes are the experiment states and the paths connecting the states are possible experiment behaviors. These states are not exhaustive, but they are sufficient to capture most of the semantics of the experiment. Specifically, we see that there are three possible behaviors that can lead to failures and one behavior that can lead to success.

Figure 5.2: User's high-level model of experiment behavior.

Encoding of the Model

The behavior model script is shown in Figure 5.3. Lines 2–4 define the model as DNSKAMINSKY over events of type PKT_DNS. Line 5 imports the DNSREQRES model, which already defines states and behaviors relevant to the DNS protocol.

The model header section states that the model is called DNSKAMINSKY and works on events of type PKT_DNS. We import the model for DNS request and response; thus, all states, along with their attributes, behaviors and model outputs from the DNSREQRES model, are available in the current namespace.

Lines 7–17 define five different states that are relevant to the experiment. Line 8 defines the first DNS query from attacker to victim and provides a context for further states. Line 10 defines a query from the victim to the real nameserver by requiring that the source IP address of this query be the same as the destination IP address of the previous query, and that the DNS questions of both states be identical.
This makes sure that the query forwarded by the victim nameserver is the same as the one received. Line 12 defines the response from the real nameserver to the victim nameserver. The response is related to the request in line 10 by using the state identifier of the query state, VtoR_query. To specifically distinguish this response from the attacker's response, we mention the value of the dnsauth attribute that is expected in the response. There are two cases for specifying the attacker's response. Line 14 defines the attacker's response to be the same as the real nameserver's response, except that we mention the fake nameserver as the value of the dnsauth attribute. Line 16 defines the case where the attacker's response is incorrect due to a wrongly guessed DNS transaction id. The bcount constraint specifies that any number of responses can be matched, since the attacker can send multiple forged responses. Attribute values not defined in the above states default to their definitions in DNSREQRES.

Figure 5.3: DNSKAMINSKY models the complete experiment behavior.

Lines 19–22 specify four possible behaviors corresponding to the four different paths in Figure 5.2. Line 20 uses the xor operator to merge two behavior paths. The other behaviors use the ~> operator to capture the causation between the states. Finally, the behavior model is defined in the model section using the FAILURE and SUCCESS behaviors. Referring to Figure 5.2, we see that b_1 and b_2, where b_1 is a composite of two behaviors, lead to FAILURE, and b_3 leads to SUCCESS. By default, the framework composes the final model by or'ing the behaviors specified in the model section.

5.1.3 Analysis Phase

Procedure

We run the experiment and capture all packets at the victimns machine. After running the experiment and capturing DNS packets, we normalize the last 10,000 packets to PKT_DNS events, since they contain a successful attack along with failures representative of the rest of the capture. The attack succeeds in 643 seconds and generates 949,161 PKT_DNS events.
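The race condition the model captures can be illustrated with a tiny, self-contained simulation. This is not the framework or the real attack tooling; the parameters (transaction-id space, seven forgeries per round, number of rounds) are illustrative assumptions:

```python
import random

# Hypothetical sketch of the Kaminsky race: per query round, the attacker
# blindly guesses DNS transaction ids; the round fails if realns's legitimate
# response arrives before any forgery matches the victim's txid.
random.seed(7)

def one_round(forgeries_per_round=7, id_space=65536):
    """One query round of the race."""
    txid = random.randrange(id_space)          # victim's real transaction id
    for _ in range(forgeries_per_round):
        if random.randrange(id_space) == txid: # forged response accepted
            return "SUCCESS"
    return "FAILURE"                           # realns answered first

rounds = [one_round() for _ in range(5000)]
print(rounds.count("FAILURE"), "failures,", rounds.count("SUCCESS"), "successes")
```

With a 16-bit id space and a handful of forgeries per round, almost every round fails, which is why the experiment has to run for many rounds (and why the framework output below reports hundreds of FAILURE instances per SUCCESS).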
Feeding the model and the events to the framework generates the output shown in Figure 5.4.

Results

The framework outputs one SUCCESS instance and 622 FAILURE instances, as shown in Figure 5.4. The failures clearly show the attacker's forged responses reaching the victim nameserver after the real nameserver's responses. Every reported instance consists of about 7 forged responses from the attacker.

Summary : DNSCACHEPOISON_SUCESS
========================
Total Matching Instances: 1
----------------------------------------------------------------------------------------
eventtype  | timestamp  | sipaddr   | dipaddr  | sport | dport | dnsid | dnsauth
----------------------------------------------------------------------------------------
PACKET_DNS | 1275515488 | 10.1.11.2 | 10.1.4.2 | 38323 | 53    | 59439 |
PACKET_DNS | 1275515488 | 10.1.4.2  | 10.1.6.3 | 32778 | 53    | 59439 |
PACKET_DNS | 1275515488 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 59439 | fakens.fakeeby.com
PACKET_DNS | 1275515488 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 59439 | realns.eby.com
----------------------------------------------------------------------------------------

Summary : DNSCACHEPOISON_FAILURE
========================
Total Matching Instances: 622
<truncated output>
----------------------------------------------------------------------------------------
eventtype  | timestamp  | sipaddr   | dipaddr  | sport | dport | dnsid | dnsauth
----------------------------------------------------------------------------------------
PACKET_DNS | 1275515486 | 10.1.11.2 | 10.1.4.2 | 6916  | 53    | 47217 |
PACKET_DNS | 1275515486 | 10.1.4.2  | 10.1.6.3 | 32778 | 53    | 15578 |
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 15578 | realns.eby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 47217 | fakens.fakeeby.com
----------------------------------------------------------------------------------------
PACKET_DNS | 1275515486 | 10.1.11.2 | 10.1.4.2 | 28902 | 53    | 50921 |
PACKET_DNS | 1275515486 | 10.1.4.2  | 10.1.6.3 | 32778 | 53    | 4347  |
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 4347  | realns.eby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
PACKET_DNS | 1275515486 | 10.1.6.3  | 10.1.4.2 | 53    | 32778 | 50921 | fakens.fakeeby.com
----------------------------------------------------------------------------------------
<truncated output>

Figure 5.4: Behavior instances satisfying the DNSKAMINSKY model.

5.1.4 Discussion

This case study demonstrates the ease with which the full system behavior was semantically modeled at the level of the user's understanding. The behavior of the complex attack was modeled in a few lines. The dataset collected was complex because it contained race conditions, which are hard to debug using tools such as Wireshark [133].
We showed how behavior models simplify the entire process of analysis for the decision-maker by abstracting away the gory details of analysis. Additionally, the model was composed using existing models from the knowledge base, extended with the user's context-specific values for attributes, and then validated.

5.2 Detecting Complex Behavior: DDoS Attack

We have the following objectives in this case study:

a) Demonstrate that behavior models enable extracting insights specific to a decision-making entity.

b) Demonstrate that a model of complex attack behavior (DDoS) can be rapidly composed to extract insights relevant to one's goals and objectives.

c) Demonstrate the use of behavior models in assisting analysis over voluminous, complex datasets.

5.2.1 Background

We use the DDoS scenario outlined by Hussain et al. [57] to demonstrate how behavior models can be rapidly created to reproduce results. We also discuss the syntax involved in writing a complete behavior model.

5.2.2 Scenario

Setup. In the above-referenced paper, a threshold-based heuristic was presented to identify DDoS attacks in traces captured at an ISP. The authors continuously captured packet headers using tcpdump and created trace files every two minutes. Each trace was then anonymized and processed. Attacks on a victim were identified by testing for two thresholds on the anonymized traces: (a) the number of sources that connect to the same destination within one second exceeds 60, or (b) the traffic rate exceeds 40,000 packets/sec.

5.2.3 Modeling Phase

Referring to the model script shown in Figure 5.5, lines 2–5 define the model header. Line 4 does not specify any qualifying conditions, that is, filters, for the events it can process.

1.  [header]
2.  NAMESPACE=NET.ATTACKS
3.  NAME=DDOS_HYP
4.  QUALIFIER={}
5.  IMPORT=NET.BASE_PROTO.IPFLOW
6.  [states]
7.  sA=IPFLOW.ip_s2d()
8.  sB=IPFLOW.ip_s2d(dip=$sA.dip)
9.  [behavior]
10. hyp_1=(sA)[bcount=1] ~>[<=1s] (sB)[bcount>=59]
11. hyp_2=(sA)[rate > 40000]
12. [model]
13. DDOS_HYP(timestamp,sip,dip,etype)=(hyp_1 or hyp_2)

Figure 5.5: DDOS_HYP models two thresholds for detecting DDoS attacks.

Line 5 imports the IPFLOW model from the knowledge base. Lines 7–8 define the necessary state propositions. Line 7 defines sA, a simple state which just captures an IP packet from some source to a destination. Line 8 defines a state sB with a dependency that its dip has to be equal to the dip in sA. State sA thus provides a context for sB.

Line 10 expresses the first hypothesis, that there should be more than 60 sources connecting to the same destination for an attack. We apply the ~> operator to denote that we expect sA to occur before sB. The behavior constraint bcount (refer Section 3.2.2) applied to sA limits the number of events returned to 1, whereas it is applied to sB so that at least 59 events must occur after the event matching sA occurred. Additionally, the operator constraint [<=1s] binds sA and sB to occur within a second in the order specified. Line 11 defines the second hypothesis, which requires that the packet rate be ≥ 40,000, using the rate constraint on state proposition sA. Lastly, line 13 defines the behavior model DDOS_HYP, which asserts that either hyp_1 or hyp_2 or both are valid. The four attributes timestamp, sip, dip, etype are reported in the final output.

5.2.4 Analysis Phase

Procedure

We demonstrate the advantages of behavior model-based analysis by defining a model to test for the two heuristics listed above, using 10 seconds of the trace file containing the start of an attack. We normalize the packet traces to 142,530 PKT_IP events (87,736 PKT_TCP, 8,002 PKT_UDP, 32,468 PKT_DNS and 14,324 PKT_ICMP).
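The two thresholds can also be checked directly with a small sliding-window sketch in plain Python. The event tuples `(timestamp, src, dst)` below are a hypothetical simplification, not the framework's PKT_IP events:

```python
from collections import defaultdict

# Illustrative sketch of the two DDoS thresholds from Hussain et al.:
# hyp_1: >= 60 distinct sources hit one destination within one second;
# hyp_2: the overall packet rate exceeds 40,000 packets/sec.

def ddos_hyp(events, min_sources=60, min_rate=40000):
    per_second = defaultdict(lambda: defaultdict(set))  # sec -> dst -> sources
    counts = defaultdict(int)                           # sec -> packet count
    for ts, src, dst in events:
        sec = int(ts)
        per_second[sec][dst].add(src)
        counts[sec] += 1
    victims = {dst
               for sec in per_second
               for dst, srcs in per_second[sec].items()
               if len(srcs) >= min_sources}
    rate_exceeded = any(n > min_rate for n in counts.values())
    return victims, rate_exceeded

# 60 distinct sources flooding one victim within the same second:
events = [(1025390156.0, f"10.0.{i // 256}.{i % 256}", "87.231.216.115")
          for i in range(60)]
victims, rate_exceeded = ddos_hyp(events)
print(victims, rate_exceeded)   # {'87.231.216.115'} False
```

The point of the behavior model is that this bookkeeping (grouping by second, correlating sB events to the sA context, counting) is expressed declaratively in a few DSL lines instead of imperative code like the above.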
Results

Summary : DDOS_HYP_hyp1
========================
Total Matching Instances: 2

Instance : 1 of 2 (Total Event Count: 60)
--------------------------------------------------------
timestamp  | sip            | dip            | etype
--------------------------------------------------------
State Definition: sA
1025390156 | 201.199.184.56 | 87.231.216.115 | PKT_ICMP
State Definition: ~> [<= 1 s ] sB [ ecount >= 59 ]
1025390156 | 201.199.184.56 | 87.231.216.115 | PKT_ICMP
1025390156 | 201.199.184.56 | 87.231.216.115 | PKT_ICMP
<truncated output containing remaining 57 events>

Instance : 2 of 2 (Total Event Count: 60)
--------------------------------------------------------
timestamp  | sip            | dip            | etype
--------------------------------------------------------
State Definition: sA
1025390157 | 53.232.170.113 | 87.134.184.48  | PKT_ICMP
State Definition: ~> [<= 1 s ] sB [ ecount >= 59 ]
1025390157 | 33.138.213.170 | 87.134.184.48  | PKT_ICMP
1025390157 | 33.138.213.181 | 87.134.184.48  | PKT_ICMP
<truncated output containing remaining 57 events>

Figure 5.6: Behavior instances satisfying the DDOS_HYP model.

When the model is applied to the packet trace, it produces the output shown in Figure 5.6. We see that there are two instances reported matching hypothesis hyp_1, both with 60 events within a 1-second interval. The output also shows the corresponding state or behavior definitions matching the following events.

The two destination IPs that are under attack are 87.231.216.115 and 87.134.184.48. This output is consistent with the findings reported in the original paper [57].

5.2.5 Discussion

This example clearly demonstrates the ease with which simple hypotheses can be modeled and validated. We see that users can rapidly compose a model specific to their needs, which in this case were the heuristics. The original authors wrote about 2,000 lines of C code to identify attacks. The same validation was expressed in about five lines as a behavior model. Additionally, this model can now be shared, easily modified, and extended.
5.3 Predicting a Potential Situation: DR Scenario

We have the following objectives in this case study:

a) Demonstrate that situation models help two different decision-making entities extract completely different sets of insights (as relevant to each) from the same set of low-level facts.

b) Demonstrate that situation models help integrate isolated and independent events (low-level situations) in large-scale, complex systems.

c) Demonstrate that the models can be built with high-level knowledge of the system.

d) Demonstrate that the models can integrate a diverse range of data across systems, and across several levels of system abstraction.

e) Demonstrate that the models can be used for predictive analysis as well as diagnostic analysis.

The first section presents general background information on the demand response mechanism in the smart grid. Readers familiar with DR should feel free to skip the introductory section.

5.3.1 Background on Demand Response

Power System Stability and Power Reserves. The primary function of the power system is to deliver continuous power. But a large, complex system such as the power grid faces several threats to its stability in the form of disturbances and contingencies. The power system operates stably at 60 Hz. Minor disturbances and contingencies such as generation loss cause the frequency to fluctuate, but as long as the system is able to keep the frequency from going outside 60±0.03 Hz, and can quickly recover to 60 Hz (refer figure), the system operates continuously [41].

Power reserves are the primary mechanism to handle disturbances and contingencies and keep the system operating in the nominal range. Reserves are classified as "spinning" or "non-spinning", where spinning refers to the unused but synchronized capacity of the system and non-spinning refers to the unconnected capacity. The reserves are used by various response mechanisms, such as the Governor and AGC, to adjust the generation.

Based on their type, power reserves are classified as regulating reserves and contingency reserves.
Mechanisms such as Governor response and AGC use the regulating reserves to handle normal operational disturbances in the system. Contingency reserves handle supply contingencies such as loss of generation [41,68].

Demand Response as a Power Reserve. DR focuses on reducing demand temporarily in response to a price signal or other type of incentive, particularly during the system's peak periods, as shown in Figure 5.7. End-user customers receive compensation (either through utility incentives or rate design) to reduce non-essential electricity use or to shift electric load to a different time, without necessarily reducing net usage. For example, a large customer may switch from grid-supplied electricity to backup generators when called to do so by the utility [110].

Figure 5.7: Demand response (DR) reduces peak load. (The figure plots energy consumption over time, showing the predicted consumption and the potential load shaving using DR after the current time.)

In the future, automated demand response (DR) mechanisms will be used as a spinning power reserve by utilities to automatically manage load in the system during times of contingency or peak demand. For instance, during a contingency such as a generator trip, DR will enable an intelligent system controller (or an operator) to send control commands in the form of load reduction requests to selected customers (or customer appliances), who (or which) will comply by shutting off the requested amount of load, thus providing a means to balance and stabilize the system without resorting to more expensive measures such as buying more energy. DR thus promises to be an efficient, low-cost option for utilities to ensure system stability.

Typical Demand Response Setup. Figure 5.8 shows a high-level view of the entities, and the message exchanges between the entities, involved in a typical demand response operation. Reducing load via demand response fundamentally involves sending load reduction signals (LR signals) to a set of selected customers.

The operator shown in Figure 5.8 is the decision-maker who interacts with the demand response system.
The "DR controller" consists of two key entities: (a) a prediction engine (PE), and (b) a demand response server (DRAS).

Prediction Engine: Given a geographic area as input, the PE predicts the achievable load reduction in that area by selecting a group of customers who are most likely to help achieve the reduction. For each selected customer, it outputs two attributes: (a) the size of the customer, and (b) the probability of participation.

DRAS (the DR server): Once a likely group of candidate customers is selected, the DR server sends load reduction requests to the customers to schedule a load reduction during the selected time frame. A special case of the load reduction operation is the "DR NOW" case, which requests willing customers to shed load immediately. We will use this case in our scenario described below.

The DR home gateways (smart meters in some cases) are the primary interface between the customer's load and the utility. They respond to DR requests as per the customer's preferences. They participate in load reduction by automatically instructing customer appliances to shut down load during the requested time frame.

Threats to Demand Response. Figure 5.9 shows a set of threats to demand response in general. The consequences listed are the high-level consequences of each threat. Given the complexity of the DR subsystem in terms of the number of cyber entities involved, a number of ways exist for each threat to manifest. In the scenario described below, we elaborate on the threat where a malicious entity can prevent load reduction in the system to increase costs to the utility.
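The PE's two per-customer attributes suggest a simple expected-value selection step, sketched below. The greedy rule and all customer data are our own illustrative assumptions, not the dissertation's actual prediction engine:

```python
# Hypothetical sketch of a prediction-engine selection step: pick customers,
# largest expected contribution first, until the expected load reduction
# (size_kw * participation probability) covers the target X KW.

def select_customers(candidates, target_kw):
    """Greedy selection by expected reduction, largest first."""
    ranked = sorted(candidates, key=lambda c: c["size_kw"] * c["p"], reverse=True)
    chosen, expected = [], 0.0
    for c in ranked:
        if expected >= target_kw:
            break
        chosen.append(c["id"])
        expected += c["size_kw"] * c["p"]
    return chosen, expected

candidates = [                                # invented customer data
    {"id": "G1", "size_kw": 50, "p": 0.9},    # expected 45 KW
    {"id": "G2", "size_kw": 80, "p": 0.5},    # expected 40 KW
    {"id": "G3", "size_kw": 20, "p": 0.95},   # expected 19 KW
    {"id": "G4", "size_kw": 40, "p": 0.6},    # expected 24 KW
]
chosen, expected = select_customers(candidates, target_kw=100)
print(chosen, round(expected, 1))   # ['G1', 'G2', 'G4'] 109.0
```

Because participation is probabilistic, the achieved reduction can still fall short of the expected value, which is exactly the gap the operator's decision loop below has to close by re-selecting customers.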
(Figure 5.8 shows the operator, the DR controller, customer meters/home gateways, the power grid, and the utility back-end systems, with message flows including the predicted load reduction before the DR event, "Reduce X KW at time 't' over duration 'd' for some selected region", the achieved load reduction after the DR event, DR event notification signals and acks, DR opt-in/opt-out and customer preferences, DR cancellations/modifications, real-time power measurements from PQ meters, and per-customer AMI usage measurements at 15-minute intervals.)

Figure 5.8: Entities within the demand response system.

Threat                                    High-level Situations (Consequences)
----------------------------------------  ------------------------------------------
Prevent load reduction                    Increase costs to utility; force blackouts
Create sudden load reduction              Grid instability; force blackouts
Manipulate DR scheduling                  Reduce efficiency of system
Shut down loads at user's end
  (e.g., ACs during peak summer)          Public discomfort; health and safety
Manipulate the DR pricing signals         Game the system; increase costs to utility
Disclosure of bid requests / responses    Loss of customer privacy

Figure 5.9: Threats to demand response.

5.3.2 Scenario

We will present the scenario with respect to the DR operator, but later on we will introduce analysis of information from the perspective of a customer, to demonstrate how the situation model assists different decision-making entities in making sense of the same information in a way relevant to their goals and objectives.

Scenario Setup

For this case study, we consider a simple setup, as shown in Figure 5.10, with the components involved in a typical DR operation. The DR server is responsible for sending "LR requests" to DR clients (G1–G16). The DR clients exist within the customer domain, and are arranged into four different RF mesh networks. Each RF mesh network hosts a CGR (Connected Grid Router), which acts as the interface between the wireless RF mesh network and the wired utility network. The DR clients are connected wirelessly to the CGRs (as indicated in Figure 5.10 by the dotted lines connecting the clients (G's) to the CGRs).
The HeadEnd is a device which converts messages from the DRAS into an appropriate format for the DR clients, and vice versa.

Scenario Description

Consider a scenario where it is a peak summer noon and the predicted energy consumption is high enough to stress the utility's reserves. The utility's objective is to ensure stability of the system by provisioning enough supply for the demand. The utility has two choices: (a) use DR to reduce load, or (b) buy energy for the afternoon to meet its demand. The latter is very expensive given high TOU rates, and clearly undesirable. The utility asks an operator to schedule a load reduction.

Let us assume that the operator uses the "DR NOW" mode, in which willing customers shed the requested amount of load immediately. The DR operator's objective is to reduce load by X KW within a specified time 't'. The DR operation fails if the above objective is not met.

Figure 5.10: Setup for demand response case study.

Figure 5.11 shows a simple decision loop used by the decision-maker to achieve his objectives. The decision-maker first issues a request to the prediction engine (PE) to select a set of customers for the desired load reduction of X KW. The prediction engine uses information such as the expected customer load reduction based on preferences, and the probability of reduction based on historical behavior, and returns a set GC of 'k' customers. The decision-maker asks the DR server to send 'k' load reduction signals to the 'k' selected customers. The decision-maker waits for responses from the DR clients for a pre-specified amount of time. A response from a DR client indicates that the client has shed the requested amount of load. The decision-maker evaluates the received responses to measure the amount of load shed. If insufficient responses are received, the decision-maker restarts the procedure to reduce the remaining amount of load by selecting a different set of customers. If this process exceeds the time 't' and the requested amount of load has not been shed, the DR operation fails.
Figure 5.11: DR decision loop (goal: reduce X KW within time 't'; select customers, send LR requests to customers, wait for responses, evaluate responses; retry on failures until all responses are received).

Threat Vector

The decision-maker's objective in the scenario is to ensure completion of a successful load reduction within the specified time.

In this scenario, first a set of 'k' customers (DRC[1..k]) is selected to perform the desired load reduction. The DR server sends 'k' load reduction signals to the 'k' customers, but most acks are never received within the specified timeout. Either the DR signals never reached the customers, or the signals may have reached but the acks could have been lost.

Given the system complexity in terms of the number of components along the DR path, there are many reasons why the acks may not reach the utility, a few of which are as follows:

• The internet connectivity in a region may have been temporarily down, which may cause some client gateways to not respond to events.

• The CGR connecting some of the meters may be under heavy load, causing delays and drops in the acks sent from the meters.

• A recent upgrade to customer gateways might have resulted in a misconfiguration of the gateways, resulting in all outbound acks being dropped.

• A malicious attacker (a utility competitor) might be intentionally blocking the communication with the DRAS at some intermediate internet router.

• A network router failure at the utility may result in the acks being dropped at the utility.

• An active DoS attack against the headend may cause all communication received to be dropped.

The DR server (DRAS) may issue events to a new set of customers if it does not receive ACKs within a specified period. But the DRAS has only a limited view of the system; its decision making uses only a localized view of the system. Consider the scenario where all event ACKs going back to the DRAS are lost. The DRAS will keep reissuing requests to new sets of customers and failing. Eventually, the load reduction operation will fail.
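This failure mode can be made concrete with a small simulation of the decision loop. The pool size, per-customer loads, batch size, and round structure are all illustrative assumptions, not parameters of the actual DR system:

```python
# Hypothetical sketch of the DR decision loop under total ack loss: the
# operator keeps selecting fresh customers and resending LR requests, but
# no acks ever return, so the operation exhausts its deadline and fails.

def run_dr_operation(customer_pool, target_kw, deadline_rounds, acks_lost):
    """Each round stands in for one select/send/wait/evaluate cycle."""
    shed_kw, round_no = 0.0, 0
    pool = list(customer_pool)
    while shed_kw < target_kw and round_no < deadline_rounds and pool:
        batch, pool = pool[:4], pool[4:]        # select the next k=4 customers
        for cust_kw in batch:                   # send LR request, await ack
            if not acks_lost:
                shed_kw += cust_kw              # ack received: load was shed
        round_no += 1
    return ("SUCCESS", shed_kw) if shed_kw >= target_kw else ("FAILURE", shed_kw)

pool = [25.0] * 16                              # sixteen 25 KW clients, G1..G16
print(run_dr_operation(pool, 100.0, 3, acks_lost=True))    # ('FAILURE', 0.0)
print(run_dr_operation(pool, 100.0, 3, acks_lost=False))   # ('SUCCESS', 100.0)
```

The simulation cannot tell the operator *why* the acks were lost; distinguishing among the causes listed above is precisely what the situation model in the next section is for.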
Thedecision-makeruseshislimitedviewofthesystemalongwithexperiencetomake sense of the failures and decide on subsequent course of action. But the problem is that the operator cannot possibly be aware of all possible circumstances and the implications 110 of each, thus limiting his ability to make informed decisions and/or adjustments to his subsequentactions. 5.3.3 SAWRequirements The goal of the decision-maker is to schedule a load reduction of X KW within time ‘t’. Any low-level situation that can result in a potential failure of the high-level goal is a situationisaninsightthatisusefulforthedecision-maker. Thereareatleasttwohigh-level situationsofinteresttothedecision-makerinthisscenario: 1. Noloadreductionwillbeachievedwithinthespecifiedtime. 2. Onlypartialloadreductionwillbeachievedwithinthespecifiedtime. For example, knowledgethat someDR clientswill beunreachable is a situationthat leads toonlypartialloadreductionbeingachieved. 5.3.4 ModelingPhase Given the high-level situation awareness requirements, this section describes the resulting situationmodels,andtheircorrespondingencodingasabayesiannetwork. SituationmodelfortheDRoperator We notethatthere are manypossibleways to build asituationmodelfor aparticular goal. Wearedemonstratingonesuchpossibility. Thesituationmodelissplitintotwopartsacross Figure5.12 and Figure 5.13for convenience. Recall thateach nodeof thesituationmodel representsasituationatsomelevelofabstraction. Referring toFigure5.12,therootofthe model contains the top-level situation of consequence to the decision-maker, that is, “DR failure to reduce X KW load”. A failure to reduce X KW load can manifest if either “No LRisachieved”or“PartialLRisachieved”. 111 DR failure to reduce X KW load Total LR = 0 No LR achieved by DR Total LR < X Partial LR achieved by DR OR No LR requests executed by chosen DR Clients (DRC[1..k]) Atleast 1 of 'k' chosen clients (DRC[1..k]) does NOT execute LR request DRC[1] AND DRC[2] AND ... 
Figure 5.12: DR Situation Model (Part 1).

Recall that the decision-maker selects a set of 'k' DR clients (DRC[1..k]), and sends requests to those clients. The "No LR is achieved" situation will manifest if none of the DR clients (DRC[1], DRC[2], ..., DRC[k]) executed the load reduction requests, while the "Partial LR is achieved" situation will manifest if at least one of the above clients fails to execute the request.

The situation "No LR requests executed by chosen DR clients (DRC[1..k])" is decomposed into an ANDed combination of 'k' lower-level situations, one for each DR client not executing the request. For convenience, we represent this decomposition using a "double octagon". Similarly, the situation "At least 1 of 'k' chosen clients (DRC[1..k]) does NOT execute LR request" is decomposed into an ORed combination of 'k' lower-level situations.

Figure 5.13 shows the decomposition for the generic case of "DRC[x] does not execute request". The situation "DRC[x] does not execute request" will manifest if either:

1. the DR server did not send the request to the client, or
2. the DR client did not receive the request (after it was sent by the DR server), or
3. the DR client received it but failed to execute the request.
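A situation model of this shape is essentially an AND/OR tree over lower-level situations. The following minimal sketch is our own illustration (not the thesis's implementation) of evaluating such a tree from a set of observed leaf situations:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Situation:
    """A node in a situation model: AND/OR over child situations, or a leaf."""
    name: str
    op: str = "LEAF"              # "AND", "OR", or "LEAF"
    children: List["Situation"] = field(default_factory=list)

    def holds(self, observed: set) -> bool:
        if self.op == "LEAF":
            return self.name in observed
        results = [c.holds(observed) for c in self.children]
        return all(results) if self.op == "AND" else any(results)

# "No LR" manifests if every chosen client fails (ANDed combination);
# "Partial LR" manifests if at least one chosen client fails (ORed combination).
k = 3
fails = [Situation(f"DRC[{i}] does not execute LR request") for i in range(1, k + 1)]
no_lr = Situation("No LR achieved by DR", "AND", fails)
partial_lr = Situation("Partial LR achieved by DR", "OR", fails)

observed = {"DRC[2] does not execute LR request"}
print(no_lr.holds(observed), partial_lr.holds(observed))  # → False True
```

Note how a single low-level observation about DRC[2] propagates to the "Partial LR" root but not to the "No LR" root, mirroring the decomposition described above.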
Figure 5.13: DR Situation Model (Part 2).

Consider the situation "DRC[x] does not receive LR request". This could happen if either there was "no path between the DR server and DRC[x]", or if "DRC[x] was unavailable". Given the setup described in Figure 5.10, we see that the situation "no path between the DR server and DRC[x]" could occur for one or more of the following reasons:

1. the internal network at the utility was unavailable to transmit packets, or
2. the CGR (gateway router) connecting the DR client to the network is unavailable (denoted as CGR(DRC[x])), or
3. the AMI network between the CGR and DRC[x] (denoted as AMI(DRC[x])) is unavailable.

The remaining decomposition of the above situations follows a similar line of reasoning, and can be easily followed from Figure 5.13. As we discussed earlier in Chapter 4, situation models can be decomposed to any desirable level of abstraction.

We next discuss the decomposition for the situation "DRC[x] is unavailable". The DR client can be unavailable either because it is operationally down, or it is overloaded, or for some other unknown reason (denoted using the situation UNK). As discussed previously, the unknown situation is important because available knowledge in complex systems is always imperfect.
Further, we observe that the unknown situation is not added to every situation node. This reflects the decision-maker's confidence in his knowledge about the system.

The situation "DRC[x] is operationally down" can manifest either because the node was shut down, or it crashed. The shutdown could be due to one of the following reasons:

1. a legitimate shutdown (say, a shutdown by the utility company), or
2. a shutdown by the customer, or
3. a shutdown by malicious means (say, due to malware infection), or
4. some other unknown reason.

Similarly, the situation "DRC[x] crashed" could be due to one of the following reasons:

1. an internal failure in hardware, or
2. an internal failure in software, or
3. a crash by malicious means (say, due to malware infection), or
4. some other unknown reason.

As can be seen, the situations for shutdown and crash share a common lower-level situation, namely, "DRC[x] malware infection".

It might be argued that it is not important to decompose "DRC[x] operationally down" into shutdown and crash. But sometimes it is important to determine the exact cause in order to take an appropriate action. For example, if the node crashed due to a malware infection, an appropriate action might be to quarantine the infected DR client, and prevent any further communication to and from it. Similarly, a legitimate shutdown might require simply issuing a restart command to the client.

Encoding the situation model as a Bayesian network. Figure 5.14 shows the encoding of the situation model as a Bayesian network using the procedure described earlier in Section 4.2.2. The model is partial as it expands only the relevant nodes for G1 and G4. Figure 5.15 shows another view of the same network with the probability values.

Situation model for the customer

Figure 5.16 shows the situation model for the customer. The customer is interested in any situation that will lead to a loss of revenue for him. The customer's situation model is specific to his needs and is built with the knowledge that he has.
Further, he is also restricted in the amount of runtime event information that is available to him. For instance, the customer cannot get the status of entities beyond the CGR.

Figure 5.14: Bayesian network for the DR scenario.

Figure 5.15: Bayesian network for the DR scenario (with probabilities).

Figure 5.16: Situation model for the customer in the DR scenario.

Encoding the situation model as a Bayesian network. Figure 5.17 shows the encoding of the situation model as a Bayesian network using the procedure described earlier in Section 4.2.2. The model is built for customer G4. Figure 5.18 shows another view of the same network with the probability values.

5.3.5 Analysis Phase

In this phase, we apply the Bayesian models to perform analysis at runtime. We will first simulate a set of scenarios, and use the BNs for both the operator and the customer to compute their respective top-level situations of interest. Consider that the DR operator chooses G1 and G4 as the two gateways to send load reduction requests to.

Figure 5.17: Bayesian network for the customer in the DR scenario.

Scenario 1. Consider the scenario where the following common event happens: CGR1 goes down. For the DR operator, the loss of CGR1 means that both G1 and G4 will be unreachable. Thus,

p(failure of DR to reduce load | CGR1 is unavailable) = 1

From the customer's point of view, the crash of CGR1 means that he will not be able to sync his DR preferences to the utility, which will not have a huge impact on his revenue:

p(loss of revenue for customer | CGR1 is unavailable) = 0.52

Scenario 2.
Consider the scenario where two isolated and independent events happen during the execution of DR, namely:

Figure 5.18: Bayesian network for the customer in the DR scenario.

1. G1 is shut down by the customer.
2. G4 crashes due to an internal hardware failure.

Given that both these events eventually result in the situation where G1 and G4 will not execute the DR request, for the DR operator:

p(failure of DR to reduce load | G1 shutdown, G4 crashes) = 1

From the customer's point of view, only the crash of G4 matters:

p(loss of revenue for customer | G4 crashes) = 0.71

5.3.6 Discussion

In this section, we present a detailed discussion of the effectiveness of situation models in the DR scenario. To recap, there were two decision-making entities in our case study: the DR operator at the utility, and a customer.

Extracting different insights from the same set of facts. The analysis of scenarios 1 and 2 above demonstrates how the same set of low-level facts is interpreted differently with respect to the high-level goals of the decision-maker. For the DR operator, the loss of a router and the failure of some gateways mean the inability to schedule load reduction. For the customer, the loss of a router, or the failure of his own gateway, means a potential loss of revenue.

Interpreting isolated and independent events. In the case study example we use only two gateways. But in a large-scale, complex system such as the smart grid with a million customers, there will be a million gateways. There will be several isolated and independent events happening throughout the system, and the situation model based approach will ensure that the events are all put together in a way directly relevant to the goals and objectives of a decision-maker.

Integrating data across layers of the system, and across subsystems.
In an actual demand response deployment, the DR operator will have access to the following kinds of events:

a) events relevant to the high-level demand-response operation,
b) events from the demand-response server,
c) events from the several hundreds of geographically-distributed clients (customer subsystem),
d) events from the low-level network infrastructure consisting of processes, databases, nodes, and network elements between the server and clients,
e) events from security monitors at those different levels of abstraction.

The above events correspond to different situations at several levels of abstraction. For example, the events related to the demand response operation will correspond to a high-level situation such as "DRAS does not get acks". Similarly, events related to the network elements will correspond to low-level situations such as "CGR unavailable". The situation model allows integrating information from across the layers of system operations, and as already discussed, it integrates information from across different subsystems.

Requirement of knowledge to construct the models. As we see from the situation models for the DR operator and the customer, the models differ due to the amount of knowledge possessed by the two entities. The customer builds a model based on his knowledge of the system to satisfy his goals, while the DR operator builds a model specific to his goals and objectives.

5.4 Summary

In this chapter, we presented a set of case studies to demonstrate the effectiveness of the proposed top-down modeling abstractions. To summarize, across the three case studies, we successfully demonstrated the following aspects of our approach:

a) the ability to build models with a high-level understanding,
b) the ability to integrate a diverse range of heterogeneous data from across subsystems, and across levels of system abstraction,
c) the ability to produce insights relevant to a decision-making entity,
d) simplicity of abstraction, and
e) reuse, sharing and composability of models.
Chapter 6
Discussion

In this chapter, we present a discussion of the limitations of the modeling abstractions, discuss the runtime performance of a prototype implementation for processing behavior specifications, and present our work relevant to understanding the accuracy of insights.

We choose not to elaborate on the runtime performance of situation models since our current minimal implementation is directly based on the Bayesian network implementation available via the GeNIe Java libraries [28]. Further, a meaningful performance analysis of such models would require building situation models for a real environment, followed by testing of those models over real data from the environment. At the time of submitting this thesis, such data was still not available from our sponsor projects. We thus leave a detailed performance analysis of situation models to future work.

6.1 Performance Analysis

In this section, we evaluate the performance of the prototype implementation of the engine responsible for processing behavior models.

A common approach for semantic-level analysis involves the use of custom scripts or tools encoding context-specific semantics. Since custom scripts and tools can be written using a variety of programming and optimization techniques, any evaluation of our generic framework against them would be very subjective and thus flawed. Instead, we choose to report the raw runtime performance of our prototype implementation on five basic analysis tasks over event datasets of increasing size.

Figure 6.1: Plot of runtime against number of events for five types of behavior complexity (b1 = cState; b2 = iState; b3 = iState ~> iState; b4 = iState ~> dState; b5 = iState ~> dState ~> dState ~> dState). Behaviors containing dependent value states (dStates) result in quadratic complexity.

The runtime performance of the framework depends on the language constructs, input data, analysis algorithm and implementation mechanisms used.
Since our primary focus in this paper is on enabling semantic functionality, we prototyped the framework in Python using a SQLite database as the backend for storing events. The input events used were PKT DNS events collected for the case study in Section 5.1. The performance analysis was conducted on a laptop with an Intel Pentium-M processor running at 1.86 GHz and with a memory of 2 GB.

We measure runtime as a function of two variables: (a) the number of events input to the algorithm, and (b) the behavior complexity, defined as the processing complexity of state propositions in a behavior formula. As discussed in Section 3.2.1, there are three types of state propositions based on attribute assignments: constant value attributes denoted as cState, dependent value attributes denoted as dState, and dynamic attribute values denoted as iState. These states can be combined to form five basic behaviors, each representing a basic semantic analysis task: b1 = (cState) represents extracting events with known attributes and values; b2 = (iState) represents extracting events with particular attributes but unknown values; b3 = (iState ~> iState) represents extracting causally correlated yet value-independent events; b4 = (iState ~> dState) represents extracting causally correlated and value-dependent events; and b5 = (iState ~> dState ~> dState ~> dState) represents extracting a long chain of causal events. Although we limit our analysis to the ~> operator, all operators incur uniform processing overhead in the algorithm, thus resulting in similar performance results. The chosen event set along with the behaviors are representative of a worst-case input to the framework. We measure the performance using the above behaviors over event sets in increments of 10,000 events. We stop at the event set for which the runtime exceeds 60 minutes.

The results are averaged over three runs and are shown in Figure 6.1. The plots for behaviors consisting of cStates and iStates (b1, b2 and b3) tend to be linear as discussed in Section 3.6.4.
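The linear-versus-quadratic distinction can be seen in a minimal matcher sketch. This is our own illustration, not the prototype's actual algorithm, and the event attributes (`qid`, `rid`) are hypothetical: an iState tests each event once, while a dState must compare each candidate event against all events already bound by the preceding state.

```python
def match_istate(events, attr):
    """iState: events carrying the attribute, value unconstrained -- one pass, O(n)."""
    return [e for e in events if attr in e]

def match_dstate(events, prior_matches, attr, dep_attr):
    """dState: events whose `attr` must equal the `dep_attr` of some previously
    matched event -- each candidate is checked against the prior matches, which
    is O(n^2) in the worst case when most events match."""
    return [e for e in events
            if attr in e and any(e[attr] == p[dep_attr] for p in prior_matches)]

# Worst case: every event satisfies every state proposition.
events = [{"qid": i % 50, "rid": i % 50} for i in range(1000)]
stage1 = match_istate(events, "qid")                  # b2 = (iState)
stage2 = match_dstate(events, stage1, "rid", "qid")   # b4 = (iState ~> dState)
```

Each additional dState in a chain (as in b5) filters against the previous stage's matches, so later stages see fewer candidates, which is consistent with b4 and b5 both showing quadratic rather than cubic growth.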
One would expect that behavior b5, containing three dStates, would show a significantly higher runtime than behavior b4 containing only one dState. Both show quadratic performance, since, in a chain of dependent states, the states further in the chain process fewer events than the states at the front of the chain. We thus see that runtime quickly becomes quadratic given a worst-case set of events and behaviors containing dependent state propositions. The current Python and SQLite-based implementation also adds a penalty to the framework runtime. We investigate these issues as part of our future work.

6.2 Modeling Limitations

In this section, we describe the key limitations of our modeling abstractions.

6.2.1 Behavior Models

Modeling complex attribute dependencies. The modeling lacks abstractions for modeling complex dependencies between event attributes, such as algebraic dependencies. Events show dependencies beyond simple equality between their attributes. For example, consider the simple additive relationship between the TCP sequence and TCP acknowledgement numbers of successive TCP SYN and TCP SYN-ACK packets.

Modeling unknown relationships. The current modeling language requires specification of exact relationships between behaviors and certain values related to the modeling constraints. But there may be cases where just the behaviors are known and the relationships that bind them are not known. For example, even though the TCP protocol states are well specified, there are many behaviors that occur unexpectedly. For instance, there could be many ways in which connections get closed, and some hosts always show a preference for one method over another. Exploring this behavior automatically is extremely valuable.

Formal evaluation. The current modeling language lacks a formal analysis with respect to the completeness and soundness of the language. This is a hard problem since the domain of applicability is very large and it is not clear how one might define a universe with respect to which the above properties can be computed.
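The algebraic dependency noted above under "Modeling complex attribute dependencies" can be made concrete. In a TCP three-way handshake, the SYN-ACK's acknowledgement number equals the SYN's sequence number plus one, a relationship beyond the simple attribute equality the language supports. A sketch of such a check, using hypothetical event dictionaries rather than the modeling language itself:

```python
def syn_synack_pair(syn, synack):
    """Check the additive dependency between a TCP SYN and its SYN-ACK:
    the SYN-ACK must acknowledge the SYN's sequence number + 1."""
    return (syn["flags"] == "SYN"
            and synack["flags"] == "SYN-ACK"
            and synack["ack"] == syn["seq"] + 1)

syn = {"flags": "SYN", "seq": 1000, "ack": 0}
synack = {"flags": "SYN-ACK", "seq": 5500, "ack": 1001}
print(syn_synack_pair(syn, synack))  # → True
```

The `+ 1` arithmetic is exactly what an equality-only dState proposition cannot express.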
Modeling probabilistic behavior. Although the case studies [126] suggest that a wide variety of event-based modeling is possible with the current language, the modeling language lacks abstractions for modeling probabilistic behavior. Probabilistic behaviors are essential to capture uncertainty, which is fundamental to capturing many behaviors shown by distributed systems. For example, in the current language it is difficult to capture the probabilistic address space scanning behavior of the Code Red worm.

Modeling distributions. The current modeling approach does not include an ability to model distributions over the data; for example, the distribution of inter-arrival times of network packets cannot be modeled.

6.2.2 Situation Models

Including accuracy constraints. The situation model currently lacks an ability to include accuracy constraints relevant to a decision-maker. For example, a decision-maker might want to specify the level of confidence desired in the extracted insights, since a particular confidence needs to be established before certain actions can be taken. This level of confidence would need to be trickled down to impose accuracy constraints on the low-level event data. This is a hard problem and a subject of our future research.

Including timeliness constraints. The situation model currently lacks an ability to include timeliness constraints relevant to a decision-maker. Situational awareness is only useful if the insights are produced within a relevant amount of time. Improving the quality of insights will require understanding the timeliness constraints relevant to a decision-making entity. These constraints would trickle down into constraints on the timeliness of low-level data, and eventually into constraints on the processing time available to low-level event sources.
6.3 Towards Extracting Accurate Insights

In this section, we present some of our preliminary work towards the larger challenge of extracting accurate insights to improve the effectiveness of situation awareness in large-scale systems such as the smart grid. This work was presented at RAID 2013 [127].

The accuracy of a high-level situation awareness system is completely dependent on the accuracy of the inputs supplied by the low-level sources in the system. One such prominent source is an anomaly-based intrusion detector. In this work, our objective was to understand the dependability of low-level anomaly detectors in supplying accurate inputs to a high-level situational awareness mechanism such as situation models.

This work is very relevant to our future work in the following way. A high-level specification language such as the situation model would need to include constructs to enable the specification of high-level accuracy and timeliness constraints relevant to a decision-maker. Given such a model, runtime processing of such constraint-annotated models would require a detailed understanding of the accuracy of low-level data sources to correctly process data from such sources. A reliable characterization of such low-level sources is thus required to improve the overall effectiveness of situational awareness. Our work in this section is a small step towards that larger goal. The rest of the subsections present details of our work.

6.3.1 Understanding Effectiveness of Evaluations

Anomaly-based intrusion detection has been a consistent topic of research since the inception of intrusion detection with Denning's paper in 1987 [29]. As attacks continue to display increasing adversarial sophistication and persistence, anomaly-based intrusion detection continues to appeal as a defensive technique with the potential to address zero-day exploits or "novel" adversarial tactics.
However, for anomaly-based intrusion detectors to become a viable option in mission-critical deployments, such as the primary control loops for a power grid or the command system for a spacecraft, we need to know precisely when these detectors can be depended upon and how they can fail. Such precision is particularly important when considering that the outputs of anomaly detectors are the basis for higher-level functions such as situational awareness/correlation engines or downstream diagnosis and remediation processes. Errors in detection output will inevitably propagate to exacerbate errors in the outputs of such higher-level functions, thus compromising their dependability.

Building dependable technology requires rigorous experimentation and evaluation procedures that adhere to the scientific method [91,105]. Previous research has identified the lack of rigorous and reliable evaluation strategies for assessing anomaly detector performance as posing a great challenge with respect to its dependability and its subsequent adoption into real-world operating environments [49,66,91,114]. We strongly subscribe to these statements and underscore the need to delve into the mechanics of an evaluation strategy in a way that enables us to better identify what went wrong as well as to understand how the results may have been compromised.

Objectives and Contributions. Our objective in this work is two-fold: we first explore a critical aspect of the evaluation problem, namely the error factors that influence detection performance (Sect. 6.3.2), and then present a framework of how these error factors interact with different phases of a detector evaluation strategy (Sect. 6.3.4). The factors are mined from the literature and compiled into a single representation to provide a convenient basis for understanding how error sources influence various phases in an anomaly detector evaluation regime.
Although these factors have been extensively studied in the literature, our approach for discussing them offers two advantages: (a) it allows visualization of how errors across different phases of the evaluation can compound and affect the characterization of an anomaly detector's performance, and (b) it provides a simple framework for understanding the evaluation results, such as answering why a detector detected or missed an attack, by tracing the factors backwards through the evaluation phases. In addition, as discussed further in Sect. 6.3.3, we also introduce a new error factor that has not as yet appeared in the literature, namely the stability of attack manifestation. We use the error taxonomy to build a framework for analyzing the validity and consistency arguments of evaluation results for an anomaly detector (Sect. 6.3.4).

Using the frameworks described in Sect. 6.3.2 and Sect. 6.3.4, we then focus on analyzing three case studies (Sect. 6.3.5) consisting of evaluation strategies selected from the literature, to identify: a) the "reach" of the presented results, i.e., what can or cannot be concluded from the results with respect to, for example, external validity, and b) experimental omissions or activities that introduce ambiguity, thereby compromising the integrity of the results, e.g., an inconsistent application of accuracy metrics. In doing so, we will not only be better informed regarding the real conclusions that can be drawn from published results, but also about how to improve the concomitant evaluation strategy.

6.3.2 Factors Affecting Accuracy of Low-level Monitors

In this section, we present a compilation of factors that have been identified as sources of error in the literature. Our objective is not to present a comprehensive taxonomy but rather to provide a unifying view of such factors to better support a discussion and study of the evaluation problem.
We scope our discussion in this section by focusing on evaluation factors relevant to anomaly detectors that: a) work in either the supervised, semi-supervised or unsupervised modes [20], and b) learn the nominal behavior of a system by observing data representing normal system activity, as opposed to detectors that are trained purely on anomalous activity. We also focus on accuracy metrics, namely the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), rather than other measures of detector performance such as speed and memory.

Figure 6.2: Factors contributing to errors across the five different phases of an anomaly detector's evaluation process: data collection (DC1-DC4), data preparation (DP1-DP3), training and tuning (TR1-TR4), testing (TS1-TS2), and measurement (MS1-MS2).

In Fig.
6.2, we represent the typical evaluation process of an anomaly-based intrusion detector as a high-level workflow consisting of five key phases: (1) data collection, (2) data preparation, (3) training and tuning, (4) testing, and (5) measurement. Each phase is annotated with factors that contribute errors towards the final detector performance. We briefly describe the phases (referenced by a two-letter acronym), followed by a description of the factors in each phase.

Data Collection (DC)

The first stage in the evaluation of an anomaly-based intrusion detector involves the collection of both normal and abnormal (attack) instances of data, where the resulting evaluation dataset should ideally be well labeled and characterized. The following five broad factors are known to contribute errors to the data collection phase.

Data generation (DC1) - Raw data is needed for an evaluation. Live environments that generate real data have been observed to contain noisy artifacts that introduce experimental confounds [95]. Artificially generated data may provide good control but introduces errors with respect to fidelity to real system behavior [95].

Data monitoring (DC2) - Errors can be introduced by the data monitors themselves, e.g., strace has been shown to inject strange parameter values when monitoring jobs with hundreds of spawned children [55], or when following children forked using the vfork() system call [48].

Data reduction (DC3) - Techniques employed to reduce the volume of input data, e.g., data sampling, can distort features in captured data, which in turn adversely influences the performance of anomaly detectors [88]. Ringberg et al. [109] suggest that the use of data reduction techniques can lead to poor quality data that can affect the identification of true positives in a dataset.

Data characterization (DC4) - An understanding of what a dataset contains is fundamental to evaluation [95,114].
Errors can be introduced when ground truth is poorly established [20,49,95,109,114,119], and it has been argued that even the availability of only partial ground truth is not good enough because it would make it impossible to calculate accurate FN and FP rates [109] (factor DC4.1). Similarly, a poor characterization of the anomalous-yet-benign instances in data can result in an unreliable assessment of a detector's false alarm rate [95] (factor DC4.2).

Data Preparation (DP)

Data preparation primarily refers to techniques that process the data into a form suitable for evaluation purposes, or for detector consumption. We note that, although data preparation can contribute to errors, there are cases where data preparation might be necessary to reduce a detector's error. For instance, several machine-learning based methods work better if the inputs are normalized and standardized (e.g., artificial neural networks can avoid getting stuck in local optima if the inputs are normalized).

Data sanitization (DP1) - The choice of a particular data sanitization strategy (or the lack of one) to clean the data of unwanted artifacts has been shown to significantly perturb the outcome of anomaly detectors [26].

Data partitioning (DP2) - An improper choice of the data partitioning strategy (or even of the parameter values within a particular strategy, such as the choice of k in k-fold cross-validation) can lead to an error-prone result when assessing anomaly detector performance. Kohavi et al. [70] reviewed common methods such as holdout, cross-validation, and bootstrap, and discussed the performance of each in terms of their bias and variance on different datasets.

Data conditioning (DP3) - The choice of data conditioning strategy can have implications for the performance of an anomaly detector, e.g., data transformations such as centering and scaling continuous data attributes can bias the performance of learning algorithms [134].
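The partitioning choice in DP2 can be made concrete with a minimal index-partitioning sketch (our own illustration, not tied to any particular evaluation in the literature): the choice of k directly changes how much data each fold trains and tests on, which is one source of the bias/variance trade-off Kohavi et al. discuss.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; each fold serves once as
    the test partition, with the remaining folds forming the training data."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# With n = 100: k=5 tests on 20 samples per fold, k=10 on only 10,
# trading bias of the performance estimate against its variance.
splits5 = list(kfold_indices(100, 5))
splits10 = list(kfold_indices(100, 10))
```

Note that for labeled attack/normal data a purely index-based split like this one can also skew the per-fold base rate, which connects DP2 to factor TS1.1 below.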
Training and Tuning (TR)

In the training phase, an anomaly-based intrusion detector consumes training data to generate models of nominal behavior that are used in turn to identify off-nominal events. Training data can also be used to fine-tune the parameters governing the anomaly detector's learning and modeling algorithms to enable the generation of more representative models of system behavior. Errors are introduced in the training phase due to factors influencing the training data, the learning process or the overall training strategy.

Characteristics of training data (TR1).

Representation of real-world behavior in data (TR1.1): Training data must be representative of system behavior. Real-world behavior is often dynamic and evolving in nature and, if captured inadequately, can lead to inadequate training, increased error (e.g., false alarms) and biased detector performance, i.e., the problem of concept drift [49,61,78].

Stability of training data (TR1.2): As discussed by Lee et al. [80] and Sommer and Paxson [114], the basic premise of anomaly detection rests on an assumption that there exists some stability or regularity in training data that is consistent with the normal behavior and thus distinct from abnormal behavior. Real-world data displays high variability and is rarely well behaved [95,114]. Highly variable training data can cause a detector to learn a poorly fitted baseline model, which would affect its error rate when deployed.

Attack-free training data (TR1.3): The need for attack-free training data has been identified in several papers [26,49,114]. If the training data is polluted with attacks, the detector can learn the attacks as nominal behavior, causing a probable increase in the miss rate [26].

Detector internals (TR2).

Choice of data features (TR2.1): An anomaly detector can detect attacks over multiple types of data and over different features of the data. An incorrect choice of data types or features directly affects a detector's accuracy [20,66].
Modeling formalism (TR2.2): A poor choice of modeling formalism or an inadequately complex model can affect the accuracy of a detector. For instance, n-gram models were found to better model packet payloads than the 1-gram model [130]. Kruegel et al. [74] reported good results for detecting web attacks using a linear combination of different models, with each model capturing a different aspect of web-server requests.

Learning parameters (TR2.3): Learning algorithms are influenced by their parameters [134]. Incorrect parameter choices can adversely affect detector performance. For example, in the seminal work by Forrest et al. [48], the value of the window size parameter was a deciding factor in the performance of the anomaly detector.

Online vs. offline training (TR2.4): The choice of learning strategy can have an influence on detector performance. An offline training strategy, wherein a detector is trained before deployment, can suffer from high error rates due to concept drift in dynamic environments [49]. An online learning strategy, wherein a detector continuously learns from its inputs, has been shown in some contexts to reduce the error rates [66]. However, in some cases, an online learning strategy can induce more errors in the detector's performance if the concept drift is artificially induced by an attacker.

Amount of training (TR3). The amount of training can be measured either in terms of training time or the size of the data used for training, and has been shown to be heavily correlated with detector error rates [66].

Model generation approach (TR4). Errors are introduced due to the choice of training strategy adopted for generating model instances (e.g., a one-class vs. a two-class training strategy) [20]. For example, to detect anomalies in a particular network X, a classifier could be trained using normal data from network X, or a classifier could be trained using data from another similar network Y. The two approaches result in two different classifiers with different errors.
Testing (TS)

The test phase is concerned with exercising detection capabilities on test data that ideally consists of a labeled mixture of normal and attack data sequences. The detector flags any deviations from the nominal behavior as attacks and produces a set of alarms. The test phase performance is influenced by factors related to the test data and the detector's detection strategy.

Characteristics of test data (TS1).

Ratio of attack-to-normal data (TS1.1): The base rate, or the ratio of attacks to normal data instances, can significantly bias the evaluation results of an anomaly detector to a particular dataset [6]. The attack data, if generated artificially, must be distributed realistically within the background noise [95].

Stability of attack signal (TS1.2): Current evaluation strategies implicitly assume that the attack signal itself is a stable quantity, i.e., the attack signal will manifest in a consistent way, giving evaluation results some degree of longevity beyond the evaluation instance. However, an attack signal could manifest unstably for one or both of the following reasons: 1) adversary-induced instability, wherein an attacker might distort an attack signal by generating artificial noise that makes the attack signal appear normal to a detector [47,129] (factor TS1.2.1), and 2) environment-induced instability, where an attack signal may get distorted due to variations in the operating environment (factor TS1.2.2). For example, an attack signal represented as a sequence of system calls from a process is easily perturbed by the addition of noisy system calls, injected by the process in response to variations in memory or load conditions in the underlying OS.

Detector internals (TS2).

Detection parameters (TS2.1): The performance of detection algorithms is sensitive to the choice of parameters such as detection thresholds. For example, Mahoney et al. [87] show the variation in their detector's hit and miss performance when the detection thresholds were varied for the same test dataset.
Detection parameters are either chosen manually by the evaluator [87,131] or are automatically computed at runtime by the detector [74].

Choice of similarity measure (TS2.2): It is well acknowledged that the choice of the similarity measure used to determine the magnitude of deviations from the normal profile greatly influences the accuracy of a detector [20,66].

Measurement (MS)

Given the set of detector responses from the test phase, along with ground truth established for a test corpus, the performance of the detector is measured in terms of the true positives, false positives, false negatives and true negatives. There are at least two factors that can influence the measurements.

Definition of metrics (MS1): When measuring or comparing the performance of detectors, it is crucial to understand two categories of metrics: (a) the four fundamental metrics, namely true positive (TP) or "hit", false negative (FN) or "miss", false positive (FP), and true negative (TN); and (b) the overall performance metrics such as the TP rate or the FP rate of a detector. The fundamental metrics are tied to the interpretation of detector alarms. For instance, a true positive (hit) could be defined as any single alarm from the detector over the entire duration of an attack, or as a specific alarm within a specific time window. The overall measurement of performance could be expressed as a percentage (e.g., total over the expected true positives), or may be expressed operationally (e.g., false positives/day). An improper definition of the above metrics with respect to the chosen test data and/or the operational environment can significantly bias a detector's assessment and render performance comparisons across different detectors inconclusive [95].

Definition of anomaly (MS2): Anomalies themselves possess distinctive characteristics; for example, they could be point anomalies, collective anomalies or contextual anomalies [20]. Errors are introduced when it is assumed that a detector is capable of detecting a particular kind of anomaly that is not in its repertoire [109,118,119].
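The metric-definition pitfall (MS1) and the base-rate factor (TS1.1) can both be made concrete with a small sketch. All numbers below are hypothetical, chosen only to illustrate how the same alarm stream yields different scores under different "hit" definitions, and how a low base rate dominates precision.

```python
# Hypothetical alarm timestamps (seconds) and one attack active during [10, 20).
alarms = [3, 11, 19, 42]
attack_start, attack_end = 10, 20

# Definition A of a "hit": any single alarm during the attack's duration.
hit_any = any(attack_start <= t < attack_end for t in alarms)

# Definition B: each second of the attack is a separate detection opportunity.
attack_seconds = range(attack_start, attack_end)
tp_rate_b = sum(1 for s in attack_seconds if s in alarms) / len(attack_seconds)

print(hit_any)    # True: a 100% hit rate under definition A
print(tp_rate_b)  # 0.2: only 2 of 10 attack seconds raised an alarm

# TS1.1: with a low base rate, even a strong detector yields mostly false alarms.
tpr, fpr, base_rate = 0.99, 0.01, 0.001     # assumed, not measured, values
precision = (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))
print(round(precision, 3))  # 0.09: fewer than 1 in 10 alarms is a true attack
```

Reporting "100% detection" under definition A and "20% detection" under definition B would describe the very same experiment, which is why an explicit metric definition must accompany any published rate.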
6.3.3 Background

The purpose of an evaluation is to gain insight into the workings of a detector. As Sommer and Paxson [114] state, a sound evaluation should answer the following questions: (a) What can an anomaly detector detect? (b) Why can it detect it? (c) What can it not detect, and why not? (d) How reliably does it operate? and (e) Where does it break? In addition to these questions we would also add: (f) Why does it break? We observe that in the literature, the preponderance of evaluation strategies for anomaly detectors focuses on the "what" questions, specifically, what the detector can detect. The "why" questions, however, are rarely, if ever, answered. For example, Ingham et al. [58] evaluated the performance of six anomaly detection techniques over four different datasets. A striking detail of their work lies in their evaluation of "character distribution-based" detectors over the four datasets, which resulted in a 40% true positive rate (low performance) for one of the datasets as compared to a ≥70% true positive rate for the remaining three datasets. The authors did not clarify why that particular detection strategy under-performed for one particular dataset and yet not for the other three. If we were to consider deploying such "character distribution-based" detectors within a mission-critical operational environment, such ambiguity would increase uncertainty and risk that would be difficult to tolerate. A similar comparative study of n-gram based anomaly detectors by Hadžiosmanović et al. [53] is a good example of an analysis that delves deeper into a specific "why" question. The authors focus on thoroughly explaining the detection performance of content-based anomaly detectors for a class of attacks over binary network protocols.

Error Factors. Answering why a detector did or did not detect an event of interest requires a systematic understanding of the factors that can influence a detector's performance. It has been observed that a lack of understanding of such factors can create systematic errors that will render experimental outcomes inconclusive [91].
Previous studies in evaluating anomaly detectors within the network and host-based intrusion detection space have identified several factors influencing a detector's performance, for example, the improper characterization of training data [80,114], an incorrect sampling of input data [88], the lack of ground-truth data [49,109,114], a poorly defined threat scope [114], the incorrect or insufficient definition of an anomaly [109,118,119], and so forth.

Although many of the factors that contribute to error in a detector's performance are reported in the literature, they are distributed across different domains and contexts. Consequently, it is difficult to clearly see how such error factors would integrate into and influence various phases of an evaluation regime. Given that the objectives of this paper center on understanding how the integrity of performance results can be compromised by the evaluation strategy, we are motivated to compile a framework in Sect. 6.3.2 that identifies the error factors that have been described in the literature and how they relate to various phases of a canonical evaluation regime.

Stability of Attack Manifestation. The framework in Sect. 6.3.2 also refers to an error factor that has not as yet appeared in the literature, namely the stability of attack manifestation. Anomaly detector evaluation strategies to date have consistently made the implicit assumption that attack signals will always manifest in a stable manner and can thus be consistently differentiated from normality. Consequently, when a detector is evaluated to have a 100 percent hit rate with respect to an attack, it is only by assumption that this detection result will persist against the specific attack. This observation is supported by the general absence of analyses in the current literature addressing the reliability of evaluation results beyond the evaluation instance, leaving the reader to believe that the result will remain consistent in other time instances and operational environments.
What would happen, however, should the attack change in its manifestation due to factors present in its environment? Sensors like strace, for example, are known to drop events under certain circumstances, creating spurious anomalous sequences that may perturb the manifestation of an attack signal [48].

While it is known that attacks can be manipulated by the adversary to hide intentionally in normal data [47,129], there is no study aimed at understanding whether the operating environment itself can induce hide-and-seek behavior in attacks. In current evaluation approaches, if a detector does not detect an attack, then the error (miss) is typically attributed to the detector from the evaluator's standpoint. However, this may be an incorrect attribution. Consider the scenario where the attack signal has somehow been perturbed by the environment, causing its manifestation to "disappear" from the purview of a detector. In such a circumstance, it would not be accurate to attribute the detection failure to the detector, since there was nothing there for the detector to detect. In this case the "miss" should more appropriately be attributed to the experimental design, i.e., a failure to control for confounding events.

6.3.4 Deconstruction of Evaluation Results

This section focuses on three basic questions that must be answered when considering deployment on operational systems: (1) Can anomaly detector D detect attack A? (2) Can anomaly detector D detect attack A consistently? (3) Why?

An evaluation strategy aimed at answering the questions above must provide evidence to support that (a) every "hit" or "miss" assigned to a detector is valid, i.e., the hit or miss is attributable purely to detector capability and not to any other phenomenon such as poor experimental control, and (b) the "hit" or "miss" behavior corresponding to an attack is consistent, i.e., the hit or miss result for a detector for a given attack is exhibited beyond that single attack instance.

We use the framework presented in Sect.
6.3.2 to analyze the validity and consistency arguments of evaluation results for an anomaly detector. Specifically, we (1) identify the sequence of logical events that must occur for the evaluation results to be valid and consistent (Sect. 6.3.4), (2) identify the error factors that can perturb the validity and consistency of evaluation results (Sect. 6.3.4), and (3) explain the conclusions that can be drawn from evaluation results within the error context (Sect. 6.3.4).

Validity and Consistency of Detection Results

As shown in Fig. 6.3, given an attack instance as test input, there are at least seven logical events that are necessary for reasoning about the validity and consistency of the detection result, that is, a "hit" or a "miss".

Validity. To determine that an anomaly-based intrusion detector has registered a valid hit, the following six events must occur (Fig. 6.3): (1) the attack must be deployed, (2) the attack must manifest in the evaluation data stream, (3) the attack manifestation must be present in the subset of the evaluation data (the test data) consumed by the detector, (4) the attack manifestation must be anomalous within the detector's purview, (5) the anomaly must be significant enough to be flagged by the detector, and (6) the detector response must be measured appropriately, in this case, as a "hit". Note that event 3′ is not included above as it only affects the consistency of detection.

Figure 6.3: Causal chain of logical events necessary for a "hit" or "miss" to be valid and consistent. The unshaded events lie within an evaluator's purview while the shaded events are within the detector's purview.

This logical sequence of events forms the causal backbone that enables reasoning about the validity of evaluation results. Ambiguities in any element of this sequence arguably compromise the integrity of evaluation results. Suppose, for example, that event 2 is compromised, whereby an attack is deployed but the evaluator does not check to ensure that it manifested in the evaluation data. In such a case, any detector response is suspect because the response cannot be correlated to the attack itself; there is no evidence the attack manifested in the data.

We note that the seven events in Fig. 6.3 can be divided into those that lie within the purview of the evaluator (events 1, 2, 3, 3′, 6) and those that lie within the purview of the detector (events 4, 5). This division is particularly important when analyzing the conclusions that can be drawn from evaluation results. Consider the case where a detector responds with a miss and the evaluator cannot confirm that the attack deployed actually manifested in the evaluation data (event 2). It would be incorrect to attribute the "miss" to detection capability, since the detector may have missed because there was nothing in the data for it to detect despite the deployment of the attack. The fault in this case lies with poor experimental control and does not reflect detector capability.

Assuming that all events that lie within the evaluator's purview occur as expected, two possible sequences of events can occur for a valid miss (as shown in Fig. 6.3): (a) 1→ 2→ 3→ 4→ 5a→ 6, and (b) 1→ 2→ 3→ 4a→ 5a→ 6.
Event 4a ("attack NOT anomalous within detector's purview") and event 5a ("anomaly NOT significant for detector") are the perturbed versions of events 4 and 5, respectively. Since these events lie within the detector's purview, the perturbations can be directly correlated to factors that affect detector capability, and the miss can be confidently attributed to the detector.

Consistency. From an evaluator's point of view, evaluation of detection consistency requires that the ground truth established for the evaluation corpus also include an understanding of the stability of attack manifestation. For example, if the attack signal is stable and yet detector performance varies, then the evidence may point toward poor detector capability, e.g., poor parameter value selection. However, if the attack signal is itself inconsistent, causing detector performance to vary, then the detector cannot be solely blamed for the "poor" performance. Rather, it is possible that the detector is performing perfectly in the face of signal degradation due to environmental factors. Consequently, in our analysis of detection consistency we add stability (event 3′) as an event of note, i.e., to determine that a detector is capable of consistently (and validly) detecting an attack, the following sequence of seven events (as shown in Fig. 6.3) must occur: 1→ 2→ 3→ 3′→ 4→ 5→ 6. Similarly, for a consistent (and valid) miss, one of the following two sequences of events must occur: (a) 1→ 2→ 3→ 3′→ 4→ 5a→ 6, and (b) 1→ 2→ 3→ 3′→ 4a→ 5a→ 6.

Factors Influencing Validity and Consistency

The logical sequence of events described in the previous section simply describes the events that must occur in order to conclude that a hit, for example, is indeed a valid and consistent hit, i.e., it is a true detection of an attack via an anomalous manifestation, and is detected consistently. Each event in that sequence can be compromised to, in turn, compromise the integrity of evaluation results. This section ties those events to the set of error factors that can cause such a compromise, as summarized in Table 6.1.
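The causal chain just described, together with the event-to-factor mapping summarized in Table 6.1, can be encoded in a small sketch. The encoding, dictionary, and function names below are illustrative assumptions, not a tool from this work.

```python
# Event-to-factor mapping as summarized in Table 6.1 of the text.
EVENT_FACTORS = {
    "1":  ["DC1"],                 # attack is deployed
    "2":  ["DC2", "DC3", "DC4.1"], # attack manifests in evaluation data
    "3":  ["DP1", "DP2", "DP3"],   # attack manifests in test data
    "3'": ["TS1.2.1", "TS1.2.2"],  # attack manifests stably
    "4":  ["TR2.1", "TR2.2"],      # attack anomalous within detector's purview
    "5":  ["TR1.2", "TR1.3", "TR2.3", "TR2.4",
           "TR3", "TR4", "TS2.1", "TS2.2"],  # anomaly significant for detector
    "6":  ["MS1", "MS2"],          # detector response measured correctly
}

# The only sequences with a definite assessment (cases H1, H2, M1, M2);
# everything else in Fig. 6.4 is indeterminate.
ASSESSMENTS = {
    ("1", "2", "3", "3'", "4", "5", "6"):   "valid & consistent hit (TP)",
    ("1", "2", "3", "3'", "4b", "5", "6"):  "FP",
    ("1", "2", "3", "3'", "4", "5a", "6"):  "valid & consistent miss (FN)",
    ("1", "2", "3", "3'", "4a", "5a", "6"): "valid & consistent miss (FN)",
}

def assess(sequence):
    """Classify an observed event sequence; anything else is indeterminate."""
    return ASSESSMENTS.get(tuple(sequence), "indeterminate (??)")

def suspect_factors(sequence):
    """Error factors implicated by perturbed events (suffix 'a' or 'b')."""
    factors = []
    for event in sequence:
        if event[-1] in "ab":      # a perturbed version of a base event
            factors += EVENT_FACTORS.get(event.rstrip("ab"), [])
    return factors

print(assess(["1", "2", "3", "3'", "4", "5", "6"]))    # valid & consistent hit (TP)
print(assess(["1", "2", "3", "3'a", "4b", "5", "6"]))  # indeterminate (??)
print(suspect_factors(["1", "2", "3", "3'a", "4b", "5", "6"]))
```

An evaluator could use such a table-driven check to make explicit, for every reported hit or miss, which error factors remain unexamined.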
Table 6.1: Potential error factors across the five evaluation phases (Fig. 6.2) that can compromise the events (Fig. 6.3) necessary for valid and consistent detection.

  #     Event                                              Factors influencing valid and consistent detection
  (1)   Attack is deployed.                                DC1
  (2)   Attack manifests in evaluation data.               DC2, DC3, DC4.1
  (3)   Attack manifests in test data.                     DP1, DP2, DP3
  (3′)  Attack manifests stably.                           TS1.2.1, TS1.2.2
  (4)   Attack is anomalous within detector's purview.     TR2.1, TR2.2
  (5)   Anomaly is significant for detector.               TR1.2, TR1.3, TR2.3, TR2.4, TR3, TR4, TS2.1, TS2.2
  (6)   Detector response is measured correctly.           MS1, MS2

Rationale for Choice of Factors. In Sect. 6.3.2, we enumerated 24 factors that contribute to errors across the five different phases of an anomaly detector's evaluation. Table 6.1 lists only the subset of factors that compromise events for valid and consistent detection of an attack instance, i.e., factors that affect the measurement of a "valid hit" (true positive) or a "valid miss" (false negative). Consequently, three factors, namely DC4.2, TR1.1, and TS1.1, are not included in Table 6.1. Factors DC4.2 (characterization of false alarms) and TR1.1 (representation of real-world behavior in data) only affect the false positive and true negative assessments of a detector. TS1.1 is the base-rate factor (ratio of attacks to normal samples), which affects the reliability of the overall assessment of an anomaly detector's performance but does not influence the events for valid and consistent detection of a single attack.

Description. Table 6.1 lists the events that must occur to conclude a valid and consistent detection result, along with the corresponding error factors that can compromise the events. For the first event, "Attack is deployed", the factor DC1 (data generation) is a source of error that affects the correct deployment or injection of an attack.
For the second event, "Attack manifests in evaluation data", factors DC2, DC3, DC4.1 (data monitoring, data reduction and ground truth availability, respectively) are sources of error that influence the manifestation of an attack in the raw evaluation stream. In this case, the poor use of sampling techniques, or the lack of "ground truth", can cause attack events to disappear from the evaluation corpus. Similarly, the error factors DP1, DP2, DP3 (data sanitization, partitioning and conditioning) can cause an attack to disappear from the test data stream that is consumed by the detector.

Factors TS1.2.1, TS1.2.2 (adversary-induced and environment-induced instability) cause unstable manifestation of attacks and affect event 3′ ("Attack manifests stably"). Event 3′ and its factors only affect the consistency of detection results. Error factors TR2.1, TR2.2 (choice of data features and modeling formalism, respectively) will influence the manifestation of an attack as an anomaly within the detector's purview, thus affecting event 4. For example, a detector looking at temporal features of system calls would not see attacks that manifest as an increase in system call frequency. Similarly, a detector using a 1-gram model of packet payloads will not see attacks that might require modeling the dependencies between application-level tokens contained within the packet payload.

Event 5 ("Anomaly is significant for detector") is affected by several factors related to the training and testing phases of an evaluation. Error factors TR1.2, TR1.3, TR2.3, TR2.4, TR3, TR4, TS2.1, TS2.2 (stability of training data, attack-free training data, learning parameters, online vs. offline training, the amount of training, the model generation approach, the detection parameters, and the similarity metric, respectively) will increase or decrease the measured significance of an anomaly. A detector trained over highly variable data might not be able to identify attacks as significant anomalies.
Similarly, having attacks in the training data will cause those attacks to look benign to a detector in the test phase. Detection parameters such as high anomaly thresholds or the choice of a particular scoring mechanism can also cause attack-induced anomalies to seem insignificant. We note that factors related to event 5 can heavily influence the consistency of detection. For instance, factor TR2.4 (online vs. offline learning strategy) can affect the consistency of detection by changing the detector's perception of an anomaly over time.

Finally, the factors related to the measurement phase, MS1, MS2 (definition of metrics and definition of anomaly, respectively), influence the final assessment and reporting of a valid and consistent detection performance. For instance, a mismatch between the detector's notion of a "hit" and the real definition as it relates to an attack can create non-generalizable results.

Deconstructing Hits and Misses: Understanding Results

This section discusses the insufficiency of current evaluation approaches by showing how unexplained factors across the different evaluation phases can give rise to multiple possible explanations for evaluation results, i.e., hits and misses.

Figure 6.4(a) and Fig. 6.4(b) show the possible sequences of events that would explain a hit and a miss from a detector, respectively. In the case where an attack is deployed and the detector detects the attack, Fig. 6.4(a) depicts 12 possible sequences of events that can explain the hit, labeled case H1 to case H12 and described in Table 6.2. In the case where an attack is deployed and the detector misses the attack, Fig. 6.4(b) depicts 18 possible sequences of events that can explain the miss, labeled case M1 to case M18 and described in Table 6.2. The error factors defined in Table 6.1 can be used to explain the potential causes that resulted in each alternate sequence of events identified in Fig. 6.4(a) and Fig. 6.4(b).

The goal of an evaluation is to assess the capability of the detector and not the validity of the experiment itself.
Consequently, events 4 and 5 in Fig. 6.4(a) and Fig. 6.4(b) are events that can be attributed to detector capability, while events 1, 2, 3, 3′, and 6 are attributed to experimental control. In Fig. 6.4(a) and Table 6.2 we observe only a single case (H1) that can be assessed as a valid and consistent hit. Case H2 is assessed as a false positive because the detector alarm was unrelated to the attack and there was no fault with experimental control, i.e., the attack was deployed and its manifestation in the data confirmed. Cases H3 – H12 are assessed as indeterminate (denoted by the symbol ??) since the sequence of events suggests errors both external (poor experimental control) and internal to the detector. In all cases marked indeterminate (??), it would be incorrect to conclude a hit since the attack does not manifest in the data, and thus the detector's alarm was unrelated to the attack. It would also be difficult to conclude a false alarm on the part of the detector. A false alarm occurs in the absence of an attack, and in this case poor experimental control has resulted in an alarm generated concomitantly with a deployed attack. Similarly, we observe that there are only two cases, M1 and M2, that can be assessed as a valid and consistent "miss" because these errors can be directly attributed to the detector and not to poor experimental control. All other cases, M3 – M18, are indeterminate due to errors that are external to the detector.

Figure 6.4: Deconstruction of an anomaly detector's response showing multiple possible explanations of a hit or miss. (a) Deconstruction of a valid and consistent hit; (b) deconstruction of a valid and consistent miss.

Table 6.2: Enumeration of a subset of the sequences of events from Fig. 6.4 with their correct assessments. Assessments denoted ?? are indeterminate. Refer to Fig. 6.4(a) for cases H1–H12, and Fig. 6.4(b) for cases M1–M18.

  Case     Sequence of Events                             Assessment
  H1       1→2→3→3′→4→5→6                                 Valid & consistent hit (TP)
  H2       1→2→3→3′→4b→5→6                                FP
  H3–H12   <other possible sequences from Fig. 6.4(a)>    ??
  M1       1→2→3→3′→4→5a→6                                Valid & consistent miss (FN)
  M2       1→2→3→3′→4a→5a→6                               Valid & consistent miss (FN)
  M3–M18   <other possible sequences from Fig. 6.4(b)>    ??

6.3.5 Case Studies

This section examines well-cited papers from the literature with an eye toward understanding the conclusions that can be drawn from their presented results. We apply the lessons learned (compiled in the framework described in Sect. 6.3.2 and Sect. 6.3.4), and discuss the work by: (1) Mahoney et al.
[87], (2) Wang et al. [131], and (3) Kruegel et al. [74]. The results from each study are summarized in Table 6.3.

Mahoney et al. [87] – Evaluation of NETAD

NETAD is a network-based anomaly detection system, designed to detect attacks on a per-packet basis by detecting unusual byte values occurring in network packet headers [87]. NETAD was evaluated by first training the detector offline using a subset of the 1999 DARPA dataset and then testing it using 185 detectable attacks from the dataset. A detection accuracy of 132 out of 185 was recorded when the detector was tuned for 100 false alarms. We were unable to reconcile three factors that introduced uncertainty in our assessment of the presented results, while two additional factors were found to undermine detection consistency arguments.

Some of the uncertainties that we were unable to reconcile are as follows. We can only assume that, since the well-labeled DARPA dataset was used, all 185 attacks used in the evaluation manifested in the evaluation data stream (this is only an assumption, based on McHugh's observations [95]). Some of the attacks may not have manifested in the test data stream due to the data sanitization (DP1) performed on the evaluation data stream. The sanitization involved removing uninteresting packets and setting the TTL field of IP headers to 0, as the authors believed that it was a simulation artifact that would have made detection easier. The literature suggests that data sanitization strategies can perturb detector performance [26,88]. Consequently, we were unable to ascertain in the NETAD assessment whether it was verified that (a) the filtering of packets did not adversely cause any of the 185 attacks to disappear from the test data stream, and (b) the act of setting all TTL bits to zero did not invalidate any attacks that otherwise would have been detected because they manifest as non-zero values in the TTL stream.
In the first case, we have an experimental confound in that we cannot determine if the detection of 132 attacks instead of the 185 attacks (assumed manifested in the data) was due purely to detector capability or due to data sanitization issues. In the second case, we are unsure if the evaluator's act of modifying the raw data itself may have biased the results.

We know that only header-based attacks are actually detectable by NETAD due to NETAD's choice of data features (TR2.1); however, NETAD was tested against a mixture of header-based and payload-based attacks without specifying how many of the attacks in the mixture were payload-based versus header-based. Further, we are unsure if all the header-based attacks used to test NETAD did indeed manifest as anomalies within the purview of NETAD, that is, how many of the attacks used were actually suitable for detection by the modeling formalism used by NETAD (TR2.2). Consequently, when we are presented results whereby 132 attacks were detected, we cannot determine: 1) How well did the detector detect header-based attacks? (Were all header-based attacks detected?) 2) Did the detector also detect some payload-based attacks? 3) Did payload-based attacks manifest in ways that allowed a header-based detector to detect them? and 4) What did the detector actually detect vs. what was detected by chance?

With regard to the consistency of the presented results, i.e., do the results describe the detector's capability beyond the single test instance? No, we cannot conclude that from the results of the presented work because of the training strategy used. It is known that variability in the training data (TR1.2) and the amount used (TR3) can significantly influence detector performance. Since the authors only trained on one week's worth of data, it is uncertain if the choice of another week will produce the same results. The results presented in this paper can only apply to the single evaluation instance described, and would perhaps not persist even if another sample of the same dataset were used.
In short, we cannot conclude that the results in this paper truly reflect the detector's capability and are not biased by the artifacts of poor experimental control (e.g., lack of precision in identifying the causal mechanisms behind the reported 185 attacks), and we are uncertain if the results will persist beyond the single evaluation instance.

Wang et al. [131] – Evaluation of Payload-based Detector

PAYL is a network-based anomaly detector, designed to detect attacks on a per-packet or per-connection basis by detecting anomalous variations in the 1-gram byte distribution of the payload. PAYL was evaluated over real-world data privately collected from campus web-servers and also over the DARPA 1999 dataset. The results reported were 100% hits for port 80 attacks at a 0.1% false positive rate on the DARPA dataset using the connection-based payload model. We were unable to reconcile at least two factors that introduced uncertainty in our assessment of the presented results, while three additional factors were found to undermine detection consistency arguments.

Some of the uncertainties that we were unable to reconcile are as follows. Again, we assume that since the well-labeled DARPA dataset was used, all port 80 related attacks used in the evaluation manifested in the evaluation data stream. The evaluation data stream was filtered to remove non-payload packets (DP1). As in the previous case, it is unclear whether the filtering of packets may have perturbed attack manifestations, causing them to either disappear from the test data stream or change their manifestation characteristics. Also, we are unsure if all the payload-based attacks used to test PAYL did indeed manifest as anomalies with respect to the modeling formalism used by PAYL (TR2.2). For instance, payload attacks such as those that exploit simple configuration bugs in servers using normal command sequences might not manifest as anomalous payloads within PAYL's purview.
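The kind of blind spot noted above can be illustrated with a toy sketch. The payloads and the byte-reordering "attack" below are hypothetical, and this is not PAYL's actual implementation: the point is only that any payload built by rearranging bytes drawn from normal traffic has exactly the normal 1-gram distribution, while a higher-order model can still separate it.

```python
from collections import Counter

def byte_histogram(payload: bytes):
    """1-gram model: byte-frequency distribution, order-insensitive."""
    return Counter(payload)

def bigram_histogram(payload: bytes):
    """2-gram model: adjacent byte pairs, which capture ordering."""
    return Counter(payload[i:i + 2] for i in range(len(payload) - 1))

normal = b"USER alice PASS s3cret"
# Hypothetical attack payload built purely by reordering the normal bytes.
attack = bytes(sorted(normal))

# Identical 1-gram distributions: a 1-gram model sees zero deviation.
print(byte_histogram(normal) == byte_histogram(attack))        # True

# The 2-gram distributions differ, exposing the reordered payload.
print(bigram_histogram(normal) == bigram_histogram(attack))    # False
```

This is exactly the TR2.2 concern: whether the attacks chosen for an evaluation can, even in principle, manifest as anomalies under the detector's modeling formalism.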
With regard to the consistency of the presented results, we cannot conclude that the results of the presented work describe detector capability beyond the single evaluation instance. Again, we refer to the fact that variability in the training data (TR1.2), the amount used (TR3) and the choice of learning parameters such as the clustering threshold (TR2.3) can significantly influence detector performance. Since the authors only trained on two weeks' worth of data (week 1 and week 3), would the choice of another two weeks (week 1 and week 2) produce the same results? As it stands, the results presented in this paper only apply to the single evaluation instance described, and may not have persisted even if another sample of the same dataset were used. Although the authors do mention that PAYL is also designed to work in an incremental learning mode, they did not evaluate that functionality; consequently, we cannot speak to the efficacy of the detector with respect to that mode. In short, the uncertainty lies in whether PAYL can achieve 100% detection accuracy with a low false-alarm rate consistently, even in another instance of the same dataset.

Kruegel et al. [74] – Detector for Web-based Attacks

Kruegel et al. [74] evaluated a multi-model based anomaly detector for detecting web-based attacks over individual web-server requests. The evaluation was performed over three datasets, one from a production web-server at Google, Inc. and two from web servers located at two different universities. They reported a 100% detection rate for their anomaly detector when tested against twelve attacks injected into the dataset collected from one of
The evaluation provided enough information to be certain that all attacks were injected manually into the data stream and manifested as single anomalous queries in the evaluation data stream. There was no additional filtering or sanitization performed over the attack dataset, so the attacks manifested as-is in the test data stream. Further, the provided information on the attack set used for testing is sufficient to conclude that the attacks were suitable for the modeling formalism.

Some of the uncertainties that we were unable to reconcile with respect to consistency of detection are as follows. All evaluations were performed by choosing the first 1000 queries corresponding to a web-server program to automatically build all necessary profiles and compute detection thresholds. It is not clear how increasing or decreasing the number of queries used in training, i.e., the amount of training (TR3), would bias the reported detection results. Furthermore, the detector was assessed over a test corpus that was created by injecting attacks into one of three datasets collected from a university web server. This particular dataset was earlier shown to display less variability in its characteristics compared to the other two datasets. It is not clear if similar detection performance (100% detection) can be expected if the same attacks were injected into a comparatively more variable dataset such as the Google dataset (TR1.2). It is consequently difficult to ascertain the reliability or consistency of the result beyond the exact training data and strategy used in this paper.

Summary of Results From Case Studies

The case studies discussed in the previous section elaborated on how unexplained factors across the evaluation phases affect the validity and consistency of detection results. In this section, we summarize the efficacy of the evaluations performed in the case studies by counting the multiple possible explanations for the hit and miss results presented in their respective papers, due to the unexplained factors in those evaluations. We apply the analysis developed in Sect.
6.3.4 and present results for the case studies in Table 6.3. Each row in Table 6.3 is filled in as follows: (1) For each event, we first gather the set of factors influencing validity and consistency from Table 6.1. (2) Then, for each case study (columns in Table 6.3), we record whether any of those factors were identified in our previous discussion of the case studies. There are three possibilities: (a) if NO factors were identified, one possibility is that there was enough information available to explain away the factors perturbing the corresponding event (entries labeled YES in Table 6.3); (b) if NO factors were identified, another possibility is that some assumptions were made to explain away the factors (entries labeled YES*); or (c) if ANY factors were identified, it means that there was insufficient or no information regarding those factors to confidently state that the event was unperturbed (entries labeled NOINFO). (3) We then use this information along with the framework in Fig. 6.4 to count the possible explanations for hits and misses for the case study.

From a combined perspective of valid and consistent detection, we see that for Mahoney et al. [87] and Wang et al. [131], the uncertainty in the evaluation process induces four possible explanations for a "hit" (from Fig. 6.4(a)): 1→2→3→3′→4→5→6; 1→2→3→3′→4b→5→6; 1→2→3→3′a→4b→5→6; 1→2→3a→3′a→4b→5→6. Similarly, from Fig. 6.4(b), there are six explanations for a "miss": 1→2→3→3′→4→5a→6; 1→2→3→3′→4a→5a→6; 1→2→3→3′a→4a→5a→6; 1→2→3→3′a→4b→5a→6;

Table 6.3: Summary of the efficacy of evaluations performed in the case studies.

#     Event                                    Mahoney et al. [87]   Wang et al. [131]   Kruegel et al. [74]
(1)   Attack deployed.                         YES                   YES                 YES
(2)   Attack manifests in evaluation data.     YES*                  YES*                YES
(3)   Attack manifests in test data.           NOINFO                NOINFO              YES
(3′)  Attack manifests stably.                 NOINFO                NOINFO              NOINFO
(4)   Attack is anomalous within the detector's purview.
      NOINFO                NOINFO              YES
(5)   Anomaly is significant.                  YES                   YES                 YES
(6)   Detector response is measured appropriately.   YES             YES                 YES
      Possible cases for "hit"                 4                     4                   2
      Possible cases for "miss"                6                     6                   0

1→2→3a→3′a→4a→5a→6; 1→2→3a→3′a→4b→5a→6.

We observe that the best example of a reliable evaluation is by Kruegel et al. [74], because there are only two possible explanations for a hit: 1→2→3→3′→4→5→6; 1→2→3→3′a→4b→5→6. In essence, their reported hits were all valid but cannot be concluded to be both valid and consistent. There are zero explanations for a "miss", as no misses were encountered in their evaluation.

From a consistency perspective, we observed that it was difficult in all the case studies to ascertain the consistency of the presented results beyond the exact instance of training data and strategy used.

6.3.6 Conclusions

Our objective here was to examine the mechanics of an evaluation strategy to better understand how the integrity of the results can be compromised. To that end, we explored the factors that can induce errors in the accuracy of a detector's response (Sect. 6.3.2), presented a unifying framework of how the error factors mined from the literature can interact with the different phases of a detector's evaluation to compromise the integrity of detection results (Sect. 6.3.4), and used our evaluation framework to reason about the validity and consistency of the results presented in three well-cited works from the literature (Sect. 6.3.5).

The framework of error factors presented is geared toward answering the "why" questions often missing in current evaluation strategies, e.g., why did a detector detect or miss an attack? We used it to show how and why the results presented in well-cited works can be misleading due to poor experimental control. Our contribution is a small step toward the design of rigorous assessment strategies for anomaly detectors.

Chapter 7

Conclusions and Future Work

In this chapter we summarize the key ideas presented in this thesis in support of the thesis statement discussed earlier.
We also enumerate the contributions of the work, and consider implications and opportunities for future work.

7.1 Summary of Contributions

This thesis first introduced the fundamental challenge of improving the situational awareness of decision-making entities in large-scale, networked systems such as the smart grid. Specifically, we saw that a fundamental reason for the lack of situation awareness of decision-making entities results from their inability to combine low-level information to extract insights relevant to their goals and objectives. There are several ways to address the challenge of assisting decision-making entities to extract relevant high-level insights from low-level event data, namely, manual, data-driven, specification-driven, and fully automated methods. Within the specification-driven approaches, we saw that a fundamental problem today is the low-level nature of the languages used to build specifications, which increases the burden on high-level users to combine and interpret information in a way relevant to their goals and objectives. Our primary objective was to improve the state-of-the-art in specification-based methods by enabling decision-makers within the smart grid environment to express analysis tasks at a higher level of abstraction.

We focused on two analysis tasks, specifically

a) the detection of a current situation over a sequence or groups of related lower-level events; and

b) the anticipation of potential higher-level situation(s) over isolated, independent low-level events.

The fundamental problems that needed to be addressed were stated as follows:

Problem 1 (relevant to task (a) above). Effective analysis of related events to detect complex, higher-level situations using a specification-based approach requires capabilities to assist decision-making entities to operate over such event data at higher levels of abstraction.
Problem 2 (relevant to task (b) above). Effective analysis of isolated, independent low-level events to anticipate higher-level situations using a specification-based approach requires high-level abstractions to model the cause-effect relationships of situations in large-scale, complex systems. Further, such modeling must be relevant to the goals of a decision-making entity.

To address these problems, we proposed a model-driven approach to assist decision-makers in making sense of heterogeneous low-level data, and extracting high-level insights semantically relevant to their goals and objectives. We proposed two modeling abstractions, namely, behavior models and situation models, to enable decision-makers to build high-level specifications to drive semantically relevant analysis over low-level event data.

We presented the following thesis statement: Behavior models and situation models are effective mechanisms to model the high-level understanding of a decision-maker in a smart grid environment, and drive analysis over low-level event data to extract situations relevant to the goals of a decision-maker. Specifically,

a) behavior models are effective high-level abstractions to model behaviors over a sequence or groups of related lower-level events; and

b) situation models are effective high-level abstractions to model relevant high-level situations over isolated, independent low-level events.

Novelty of approach. Chapter 3 presented a novel, logic-based modeling approach called behavior models for integrating and interpreting sequences or groups of related events in terms of high-level behaviors relevant to a decision-making entity. Behavior models allow rapid modeling of system behavior over event data by providing semantic constructs to capture high-level relationships such as causality, ordering, and concurrency between events or groups of events. They drive analysis over dependent event data to extract insights in the form of behaviors. Further, they allow decision-makers to encode semantically relevant high-level goals and objectives over semantically relevant behaviors.
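An ordering relationship of the kind behavior models capture can be illustrated with a toy sketch. This is purely illustrative Python, not the thesis's actual behavior-model syntax; the event fields and group names are assumptions.

```python
# A toy illustration of an ordering relationship between two groups of events:
# group_a "leads to" group_b when every event in group_a ends before any event
# in group_b starts (a simple happens-before ordering between two behaviors).
def leadsto(group_a, group_b):
    return max(e["end"] for e in group_a) < min(e["start"] for e in group_b)

scan  = [{"start": 1, "end": 5}, {"start": 2, "end": 6}]   # port-scan events
login = [{"start": 9, "end": 10}]                          # later login events
print(leadsto(scan, login))  # the scan behavior precedes the login behavior
```

A higher-level behavior such as "scan followed by login" can then be expressed as a relationship over the two groups rather than over individual events.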
Chapter 4 presented a novel modeling approach called situation models for integrating and interpreting isolated and independent events in terms of high-level situations relevant to a decision-making entity. Situation models fundamentally capture the cause-effect relationships between events across subsystems, and across several levels of system abstraction. Further, we showed that they do so in a way specific to the needs of a decision-making entity. At runtime, such models drive analysis over event data to extract semantically relevant insights in the form of higher-level situations.

We demonstrated the novelty of the above mechanisms, in support of the thesis statement, by comparing our approach to existing approaches in Section 3.7 and Section 4.5.

Effectiveness of approach. To demonstrate the effectiveness of the proposed mechanisms, we applied the above abstractions to real-world scenarios in Chapter 5.

In Sections 5.1 and 5.2, we demonstrated the usefulness of behavior models in the modeling and analysis of two complex attack scenarios. Specifically, we demonstrated the following:

a) Behavior models enable rapid composition of a model of attack behavior in a dataset-independent and system-independent way.

b) Behavior models enable specification of system behavior at a higher level of abstraction, and without requiring too many details.

c) Behavior models provide mechanisms to compose higher-level behaviors as relationships over simple and complex behaviors.

d) Behavior models provide mechanisms for the construction of a semantically relevant vocabulary, or a knowledge base for interaction over event data, as reusable, shareable and composable abstract models.

Section 5.3 demonstrated how situation models can be used to improve the situation awareness of an operator responsible for reducing load via demand response in the smart grid. Specifically, we demonstrated the following:

a) Situation models help two different decision-making entities, a demand response operator and a customer, extract completely different sets of insights (as relevant to each) from the same set of low-level facts.
b) Situation models help integrate isolated and independent events (low-level situations) in large-scale, complex systems.

c) Situation models can be built with high-level knowledge of the system.

d) Situation models can integrate a diverse range of data across systems, and across several levels of system abstraction.

In summary, to support the claim of effectiveness of our proposed mechanisms, the case studies demonstrated the following aspects:

a) ability to build models with high-level understanding,

b) ability to integrate a diverse range of heterogeneous data from across subsystems, and across levels of system abstraction,

c) ability to produce insights relevant to a decision-making entity,

d) simplicity of abstraction, and

e) reuse, sharing and composability of models.

The case for novelty and effectiveness presented above supports the thesis statement that a top-down, model-driven approach to situational awareness is effective in assisting decision-making entities to extract high-level insights semantically relevant to their high-level goals and objectives.

Finally, in Chapter 6 we discussed limitations of the modeling abstractions. Although the modeling abstractions are capable of modeling a wide range of scenarios, they are by no means complete. Further, we also noted that improving the situation awareness of a decision-making entity requires considering factors related to the accuracy and timeliness of data. In Section 6.3, we presented some of our ongoing work on understanding the factors that affect the accuracy of low-level anomaly detectors, which are a prominent source of low-level event data in large-scale systems. This work is very relevant to our future work in the following way. A high-level specification language such as the situation model would need to include constructs to enable specification of high-level accuracy and timeliness constraints relevant to a decision-maker.
Given such a model, runtime processing of such constraint-annotated models would require a detailed understanding of the accuracy of low-level data sources to correctly process data from those sources. A reliable characterization of such low-level sources is thus required to improve the overall effectiveness of situational awareness. Our work in Section 6.3 was a small step towards that larger goal.

To conclude, in this work we fundamentally demonstrated that the introduction of simple, semantically-relevant modeling constructs is effective in enabling decision-makers in complex environments to build specifications at a higher level of abstraction, and extract insights relevant to their goals and objectives. Our model-driven approach emphasized reuse, composability and extensibility of specifications, and thus introduced a more systematic way to build specifications and retain expert knowledge for sharing and reuse.

7.2 Concluding Remarks

An eventual goal of research in large-scale, complex systems is to build systems which can not only automatically extract insights from data, but also automatically make decisions and take responses. For example, the smart grid vision speaks of a self-healing grid, which will have the ability to provide sustained delivery of essential system services in the presence of situations such as attacks and faults, and the ability to rapidly recover from those situations to normal operation. Realizing the full vision of an operationally resilient complex system is a challenging problem, but tackling the problem of effective situation awareness is a first step in that direction. Maintaining effective situation awareness is in itself a challenging problem, with several decades of research across domains such as the military, aviation, and homeland security. In this thesis, we addressed a small yet important aspect of the problem, namely, that of extracting high-level insights semantically relevant to the goals and objectives of a decision-making entity.
We proposed a model-driven approach to automate the process of extracting insights from low-level data, and demonstrated its usefulness using case studies.

Overall, in this work, we demonstrated that the introduction of simple, semantically-relevant modeling constructs is effective in enabling decision-makers in complex environments to build specifications at a higher level of abstraction, and extract insights relevant to their goals and objectives. This thesis fundamentally demonstrated that it is not always necessary to build complex domain models to drive analysis over data and extract high-level insights. We showed that simple, high-level abstractions can effectively handle a reasonably wide variety of cases in large-scale, complex systems. We believe that the contributions of this thesis are a step in the right direction towards improving the state-of-the-art in situation awareness research in large-scale, complex systems. Lastly, we hope that our contributions serve as a catalyst for further research in this area.

Bibliography

[1] M. Abrams and J. Weiss. (2008, Aug.) Malicious Control System Cyber Security Attack Case Study: Maroochy Water Services, Australia. Tech. Paper. [Online]. Available: http://www.mitre.org/publications/technical-papers/malicious-control-system-cyber-security-attack-case-study-maroochy-water-services-australia

[2] J. Allen, "Maintaining Knowledge about Temporal Intervals," Communications of the ACM, vol. 26, no. 11, pp. 832-843, Nov. 1983.

[3] A. Almajali, E. Rice, A. Viswanathan, K. Tan, and C. Neuman, "A Systems Approach to Analysing Cyber-Physical Threats in the Smart Grid," in IEEE Smart Grid Communication. IEEE, 2013, pp. 456-461.

[4] B. J. Argauer and S. J. Yang, "VTAC: virtual terrain assisted impact assessment for cyber attacks," in Proc. SPIE, vol. 6973, 2008, pp. 69730F-69730F-12.

[5] A. Avizienis, J.-C. Laprie, B. Randell, and C.
Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan. 2004.

[6] S. Axelsson, "The Base-rate Fallacy and the Difficulty of Intrusion Detection," ACM Trans. on Info. Systems Security, vol. 3, no. 3, pp. 186-205, Aug. 2000.

[7] F. Baader, A. Bauer, P. Baumgartner, A. Cregan, A. Gabaldon, K. Ji, K. Lee, D. Rajaratnam, and R. Schwitter, "A Novel Architecture for Situation Awareness Systems," in Automated Reasoning with Analytic Tableaux and Related Methods, ser. Lecture Notes in Computer Science, M. Giese and A. Waaler, Eds. Springer Berlin Heidelberg, 2009, vol. 5607, pp. 77-92.

[8] W. Baker, A. Hutton, C. D. Hylender, J. Pamula, M. Spitler, M. Goudie, C. Novak, M. Rosen, P. Tippett, C. Chang, and J. Fisher. (2011) Verizon Data Breach Investigations Report. Verizon. [Online]. Available: http://www.verizonbusiness.com/resources/reports/rp_data-breach-investigations-report-2011_en_xg.pdf

[9] A. B. Barreto, P. C. G. Costa, and E. Yano, "A Semantic Approach to Evaluate the Impact of Cyber Actions to the Physical Domain," in Proceedings of the 7th International Conference on Semantic Technologies for Intelligence, Defense, and Security (STIDS 2012), Fairfax, VA, USA, 2012, pp. 64-71.

[10] J. Barwise and J. Perry, "Situations and Attitudes," The Journal of Philosophy, vol. 78, no. 11, pp. 668-691, 1981.

[11] N. Baumgartner and W. Retschitzegger, "A survey of upper ontologies for situation awareness," Proc. of the 4th IASTED International Conference on Knowledge Sharing and Collaborative Engineering, St. Thomas, US VI, pp. 1-9+, 2006.

[12] N. Baumgartner, W. Retschitzegger, and W. Schwinger, "A Software Architecture for Ontology-driven Situation Awareness," in Proceedings of the 2008 ACM Symposium on Applied Computing, ser. SAC '08. New York, NY, USA: ACM, 2008, pp. 2326-2330.

[13] G. Bearfield and W. Marsh, "Generalising Event Trees Using Bayesian Networks with a Case Study of Train Derailment," in Computer Safety, Reliability, and Security, ser.
Lecture Notes in Computer Science, R. Winther, B. Gran, and G. Dahll, Eds. Springer Berlin Heidelberg, 2005, vol. 3688, pp. 52-66.

[14] B. Bérard, Systems and Software Verification: Model-checking Techniques and Tools. Springer, 2001.

[15] A. Blandford and B. L. W. Wong, "Situation Awareness in Emergency Medical Dispatch," International Journal of Human-Computer Studies, vol. 61, no. 4, pp. 421-452, 2004.

[16] A. Bobbio, L. Portinale, M. Minichino, and E. Ciancamerla, "Comparing Fault Trees and Bayesian Networks for Dependability Analysis," in Computer Safety, Reliability and Security, ser. Lecture Notes in Computer Science, M. Felici and K. Kanoun, Eds. Springer Berlin Heidelberg, 1999, vol. 1698, pp. 310-322.

[17] S. Buldyrev, R. Parshani, G. Paul, H. Stanley, and S. Havlin, "Catastrophic Cascade of Failures in Interdependent Networks," Nature, vol. 464, no. 7291, pp. 1025-1028, 2010.

[18] T. Capaccio and J. Bliss. (2011) Chinese Military Suspected in Hacker Attacks on U.S. Satellites. Online Article. Bloomberg News. [Online]. Available: http://www.businessweek.com/news/2011-10-27/chinese-military-suspected-in-hacker-attacks-on-u-s-satellites.html

[19] R. Cardell-Oliver and W. Liu, "Representation and recognition of situations in sensor networks," Communications Magazine, IEEE, no. March, pp. 112-117, 2010.

[20] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, pp. 15:1-15:58, Jul. 2009.

[21] V. Chandola and V. Kumar, "Summarization - Compressing Data Into An Informative Representation," Knowledge and Information Systems, vol. 12, no. 3, pp. 355-378, 2007.

[22] M. Christodorescu, S. Jha, and C. Kruegel, "Mining Specifications of Malicious Behavior," in Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ser. ESEC-FSE '07. New York, NY, USA: ACM, 2007, pp. 5-14.

[23] M. Clayton. (2013, Jan) Energy sector cyberattacks jumped in 2012. Were utilities prepared? Online on The Christian Science Monitor.
Accessed: 7 Feb 2014. [Online]. Available: http://www.csmonitor.com/Environment/Energy-Voices/2013/0107/Energy-sector-cyberattacks-jumped-in-2012.-Were-utilities-prepared

[24] Computer Crime and Intellectual Property Section (CCIPS). (1998, March) Juvenile Computer Hacker Cuts off FAA Tower at Regional Airport. Press Release. Accessed: 5 Feb 2014. [Online]. Available: http://www.irational.org/APD/CCIPS/juvenilepld.htm

[25] P. H. Corredor and M. Ruiz, "Against All Odds," IEEE Power and Energy Magazine, vol. 9, no. 2, pp. 59-66, Mar. 2011.

[26] G. F. Cretu, A. Stavrou et al., "Casting Out Demons: Sanitizing Training Data for Anomaly Sensors," in Proc. of the IEEE Symp. on Security and Privacy. IEEE, 2008, pp. 81-95.

[27] G. Cugola and A. Margara, "Processing Flows of Information: From Data Stream to Complex Event Processing," ACM Comput. Surv., vol. 44, no. 3, pp. 15:1-15:62, Jun. 2012.

[28] Decision Systems Laboratory, University of Pittsburgh. GeNIe SMILE. Software. Accessed: 19 Jan 2015. [Online]. Available: http://dsl.sis.pitt.edu

[29] D. E. Denning, "An Intrusion-Detection Model," IEEE Trans. on Software Engineering, vol. SE-13, no. 2, pp. 222-232, Feb. 1987.

[30] Department of Energy (DoE), National Energy Technology Laboratory (NETL). (2007, Jan) A Vision for the Modern Grid. DOE - NETL. [Online]. Available: http://www.smartgrid.gov/document/vision_modern_grid

[31] K. Devlin, Logic and Information. Cambridge University Press, 1995.

[32] A. D'Amico, L. Buchanan, J. Goodall, and P. Walczak. (2009) Mission Impact of Cyber Events: Scenarios and Ontology to Express the Relationships Between Cyber Assets, Missions, and Users. Conference Paper Preprint. Applied Visions, Inc. [Online]. Available: http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA517410

[33] G. Elahi and E. Yu, "A Goal Oriented Approach for Modeling and Analyzing Security Trade-Offs," in Conceptual Modeling - ER 2007, ser. Lecture Notes in Computer Science, C. Parent, K.-D. Schewe, V. Storey, and B. Thalheim, Eds. Springer Berlin / Heidelberg, 2007, vol. 4801, pp. 375-390.

[34] D. R. Ellis, J. G. Aiken, K. S.
Attwood, and S. D. Tenaglia, "Detecting Malicious Code by Model Checking," in Proc. of the ACM Workshop on Rapid Malcode, 2004, pp. 43-53.

[35] R. Ellis-Braithwaite, R. Lock, R. Dawson, and B. Haque, "Towards an approach for analysing the strategic alignment of software requirements using quantified goal graphs," arXiv preprint arXiv:1307.2580, 2013.

[36] M. R. Endsley, "Toward a Theory of Situation Awareness in Dynamic Systems," The Journal of the Human Factors and Ergonomics Society, vol. 37, no. 1, pp. 32-64, 1995.

[37] ——, "Theoretical underpinnings of situation awareness: A critical review," in Situation Awareness Analysis and Measurement, M. R. Endsley and D. J. Garland, Eds. CRC Press, 2000, pp. 3-32.

[38] M. R. Endsley and M. D. Rodgers, "Situation Awareness Information Requirements Analysis for En Route Air Traffic Control," Human Factors and Ergonomics Society Annual Meeting Proceedings, vol. 38, pp. 71-75(5), 1994.

[39] C. Ensel, "New Approach for Automated Generation of Service Dependency Models," in LANOMS, 2001.

[40] EPRI. (2005) Transmission Fast Simulation and Modeling (T-FSM) - Functional Requirements Document. Technical Report - 1011666. Palo Alto, CA. [Online]. Available: http://www.intelligrid.info/doc/TFSMFunctionalRequirements-ER11666.pdf

[41] ——. (2009) EPRI Power Systems Dynamics Tutorial. Technical Report - 1016042. Palo Alto, CA. [Online]. Available: http://www.epri.com/abstracts/Pages/ProductAbstract.aspx?ProductId=000000000001016042

[42] EPRI/NESCOR. (2013) Attack Trees for Selected Electric Sector High Risk Failure Scenarios. [Online]. Available: http://smartgrid.epri.com/doc/NESCOR%20Attack%20Trees%2009-13%20final.pdf

[43] EsperTech. Esper: Open Source Complex Event Processing Software. Software. Accessed: 19 Jan 2015. [Online]. Available: http://esper.codehaus.org/index.html

[44] O. Etzion and P. Niblett, Event Processing in Action. Manning Publications Co., 2010.

[45] N. Falliere, L. O Murchu, and E. Chien. (2011) W32.Stuxnet Dossier v1.4. Technical White Paper. Symantec. [Online].
Available: http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/w32_stuxnet_dossier.pdf

[46] I. Firdausi, C. Lim, A. Erwin, and A. Nugroho, "Analysis of Machine Learning Techniques Used in Behavior-Based Malware Detection," in Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on, Dec. 2010, pp. 201-203.

[47] P. Fogla and W. Lee, "Evading Network Anomaly Detection Systems: Formal Reasoning and Practical Techniques," in Proc. of the 13th ACM Conf. on Comp. and Comm. Sec. (CCS). ACM, 2006, pp. 59-68.

[48] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, "A Sense of Self for Unix Processes," in Proc. of the IEEE Symp. on Security and Privacy. IEEE, 1996.

[49] C. Gates and C. Taylor, "Challenging the Anomaly Detection Paradigm: a Provocative Discussion," in Proc. of the Workshop on New Sec. Paradigms. ACM, 2006, pp. 21-29.

[50] J. R. Goodall, A. D'Amico, and J. K. Kopylec, "Camus: Automatically mapping Cyber Assets to Missions and Users," MILCOM 2009 - 2009 IEEE Military Communications Conference, pp. 1-7, Oct. 2009.

[51] M. Govindarasu, A. Hann, and P. Sauer. (2012) Cyber-physical systems security for smart grid. Future Grid Initiative White Paper. Power Systems Engineering Research Center (PSERC). [Online]. Available: http://www.pserc.wisc.edu/documents/publications/papers/fgwhitepapers/Govindarasu_Future_Grid_White_Paper_CPS_May_2012.pdf

[52] R. Guttromson, F. Greitzer, M. Paget, and A. Schur. (2007, Aug) Human Factors for Situation Assessment in Power Grid Operations. Technical Report PNNL-16780. Pacific Northwest National Laboratory (PNNL). [Online]. Available: http://www.pnl.gov/main/publications/external/technical_reports/PNNL-16780.pdf

[53] D. Hadžiosmanović, L. Simionato, D. Bolzoni, E. Zambon, and S. Etalle, "N-Gram Against the Machine: On The Feasibility of The n-gram Network Analysis for Binary Protocols," in Proc. of the 15th Intl. Conf. on Research in Attacks, Intrusions, and Defenses.
Springer-Verlag, 2012, pp. 354-373.

[54] E. Halper and M. Lifsher. (2014) Attack on electric grid raises alarm. [Online]. Available: http://articles.latimes.com/2014/feb/06/business/la-fi-grid-terror-20140207

[55] J. Horky, "Corrupted Strace Output," Bug Report, 2010. [Online]. Available: http://www.mail-archive.com/strace-devel@lists.sourceforge.net/msg01595.html

[56] M. Hubbell and J. Kepner, "Large scale network situational awareness via 3D gaming technology," in 2012 IEEE Conference on High Performance Extreme Computing. IEEE, Sep. 2012, pp. 1-5.

[57] A. Hussain, J. Heidemann, and C. Papadopoulos, "A Framework For Classifying Denial of Service Attacks," Proc. of the Conf. on Applications, Technologies, Architectures, and Protocols for Comp. Comm. - SIGCOMM, p. 99, 2003.

[58] K. L. Ingham and H. Inoue, "Comparing Anomaly Detection Techniques for HTTP," in Proc. of the 10th Intl. Conf. on Recent Advances in Intrusion Detection. Springer-Verlag, 2007, pp. 42-62.

[59] N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. Balakrishnan, U. Çetintemel, M. Cherniack, R. Tibbetts, and S. Zdonik, "Towards a Streaming SQL Standard," Proc. VLDB Endow., vol. 1, pp. 1379-1390, August 2008.

[60] G. Jakobson, "Mission cyber security situation assessment using impact dependency graphs," in 2011 Proceedings of the 14th International Conference on Information Fusion (FUSION). IEEE, 2011, pp. 1-8.

[61] H. Javitz and A. Valdes, "The SRI IDES Statistical Anomaly Detector," in Proc. of the IEEE Comp. Soc. Symp. on Research in Security and Privacy, 1991, pp. 316-326.

[62] K. Julisch and M. Dacier, "Mining intrusion detection alarms for actionable knowledge," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, New York, USA: ACM, 2002, pp. 366-375.

[63] D. Kaminsky, "Multiple DNS Implementations Vulnerable to Cache Poisoning," http://www.kb.cert.org/vuls/id/800113, 2008.

[64] S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, and P.
Bahl, "Detailed Diagnosis in Enterprise Networks," SIGCOMM Comput. Commun. Rev., vol. 39, no. 4, pp. 243-254, Aug. 2009.

[65] A. Keller and G. Kar, "Determining service dependencies in distributed systems," ICC 2001. IEEE International Conference on Communications. Conference Record (Cat. No.01CH37240), vol. 7, pp. 2084-2088, 2001.

[66] K. Killourhy and R. Maxion, "Why Did My Detector Do That?!: Predicting Keystroke-dynamics Error Rates," in Proc. of the 13th Intl. Conf. on Recent Advances in Intrusion Detection. Springer-Verlag, 2010, pp. 256-276.

[67] J. Kinder, S. Katzenbeisser, C. Schallhart, and H. Veith, "Detecting Malicious Code by Model Checking," in Intrusion and Malware Detection and Vuln. Assessment, ser. Lecture Notes in Computer Science, K. Julisch and C. Kruegel, Eds. Springer Berlin / Heidelberg, 2005, vol. 3548, pp. 174-187.

[68] B. J. Kirby. (2006, Dec.) Demand Response For Power System Reliability: FAQ. ORNL/TM-2006/565. [Online]. Available: http://certs.lbl.gov/pdf/dr-for-psr-faq.pdf

[69] M. Klemettinen, H. Mannila, and H. Toivonen, "Rule discovery in telecommunication alarm data," Journal of Network and Systems Management, vol. 7, no. 4, pp. 395-423, 1999.

[70] R. Kohavi et al., "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Intl. Joint Conf. on Artificial Intelligence, vol. 14, 1995, pp. 1137-1145.

[71] M. Kokar, C. Matheus, and K. Baclawski, "Ontology-based situation awareness," Information Fusion, vol. 10, no. 1, pp. 83-98, Jan. 2009.

[72] M. M. Kokar, C. J. Matheus, K. Baclawski, J. A. Letkowski, M. Hinman, and J. Salerno, "Use Cases for Ontologies in Information Fusion," in Lecture Notes in Economics and Mathematical Systems Series. Springer Verlag, 1995.

[73] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[74] C. Kruegel and G. Vigna, "Anomaly Detection of Web-based Attacks," in Proc. of the 10th ACM Conf. on Comp. and Comms. Security (CCS). ACM, 2003, pp. 251-261.
[75] D. Lambert, "Situations for situation awareness," in Proc. of Fusion 2001, 2001.

[76] ——, "Grand challenges of information fusion," in Proceedings of the Sixth International Conference of Information Fusion, vol. 1, July 2003, pp. 213-220.

[77] L. Lamport, "The Temporal Logic of Actions," ACM Trans. Program. Lang. Syst., vol. 16, no. 3, pp. 872-923, 1994.

[78] T. Lane and C. E. Brodley, "Approaches to Online Learning and Concept Drift for User Identification in Computer Security," in Proc. of the 4th Intl. Conf. on Knowledge Discovery and Data Mining, 1998, pp. 259-263.

[79] C. LaRosa, L. Xiong, and K. Mandelberg, "Frequent pattern mining for kernel trace data," Proceedings of the 2008 ACM Symposium on Applied Computing - SAC '08, p. 880, 2008.

[80] W. Lee and D. Xiang, "Information-theoretic Measures for Anomaly Detection," in Proc. of the IEEE Symp. on Security and Privacy, 2001, pp. 130-143.

[81] L. Liu, E. Yu, and J. Mylopoulos, "Security and Privacy Requirements Analysis Within a Social Setting," in Proceedings, 11th IEEE International Requirements Engineering Conference, Sep. 2003, pp. 151-161.

[82] L. Liu and E. Yu, "Designing information systems in social context: a goal and scenario modelling approach," Information Systems, vol. 29, no. 2, pp. 187-203, 2004.

[83] B. T. Loo, T. Condie, M. Garofalakis, D. E. Gay, J. M. Hellerstein, P. Maniatis, R. Ramakrishnan, T. Roscoe, and I. Stoica, "Declarative Networking: Language, Execution and Optimization," in Proc. of ACM SIGMOD, 2006, pp. 97-108.

[84] D. Luckham, The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Boston, MA: Addison-Wesley Longman Publishing Co., Inc., 2001.

[85] D. M., S. Bohn, A. Wynne, W. A., M. Daniel, and A. William, "Real-Time Visualization of Network Behaviors for Situational Awareness," Most, pp. 79-90, 2010.

[86] A. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and Q. Zhao, "Towards Automated Performance Diagnosis in a Large IPTV Network," in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, ser. SIGCOMM '09. New York, NY, USA: ACM, 2009, pp. 231-242.
[87] M. V. Mahoney, “Network Traffic Anomaly Detection Based on Packet Bytes,” in Proc.oftheACM Symp.on Appliedcomputing. ACM,2003,pp.346–350. [88] J.Mai,C.-N. Chuah,A.Sridharan, T.Ye, andH.Zang,“IsSampledDataSufficient for Anomaly Detection?” in Proc. of the 6th ACM SIGCOMM Conf. on Internet measurement. ACM,2006,pp.165–176. [89] C.J.Matheus,“SAWA:Anassistantforhigher-levelfusionandsituationawareness,” ProceedingsofSPIE,vol.5813,pp.75–85,2006. [90] C.Matheus,M.Kokar,K.Baclawski,andJ.Letkowski,“AnApplicationofSeman- ticWebTechnologiestoSituationAwareness,”TheSemanticWeb–ISWC2005,vol. 3729,pp.944–958,2005. [91] R. Maxion, “Making Experiments Dependable,” in Dependable and Historic Com- puting, ser. Lecture Notes in Computer Science, C. B. Jones and J. L. Lloyd, Eds. Springer,2011,vol.6875,pp.344–357. 169 [92] McAfee Foundstone Professional Services and McAfee Labs. (2011) Global Energy Cyberattacks: ”Night Dragon”. Technical White Paper. McAfee. [Online]. Available: http://www.mcafee.com/us/resources/white-papers/wp-global-energy- cyberattacks-night-dragon.pdf [93] McAfee Labs. (2010) Protecting Your Critical Assets: Lessons Learned from “Operation Aurora”. Technical White Paper. McAfee. [Online]. Available: http:// www.mcafee.com/us/resources/white-papers/wp-protecting-critical-assets.pdf [94] J. Mccarthy and P. J. Hayes, “Some PhilosophicalProblems from the Standpointof ArtificialIntelligence,”inMachineIntelligence. EdinburghUniversityPress,1969, pp.463–502. [95] J.McHugh,“TestingIntrusionDetectionSystems: ACritiqueofthe1998and1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Labora- tory,”ACM Trans.on Info.System Security,vol.3,no.4,pp.262–294,Nov2000. [96] S.McLaughlinandD.Podkuiko,“Energytheftintheadvancedmeteringinfrastruc- ture,”CriticalInformation,2010. [97] J. Meserve. (2007, Sep.) Staged cyber attack reveals vulnerability in power grid. Online Article. CNN. [Online]. 
Available: http://www.cnn.com/2007/US/09/26/ power.at.risk/ [98] “MetasploitFrameworkWebsite,”http://www.metasploit.com/. [99] P. Mirhaji, Y. F. Michea, J. Zhang, and S. W. Casscells, “Situational awareness in publichealth preparedness settings,” Proceedings of the SPIE, vol. 5778, no. 1, pp. 81–91,2005. [100] J. Mirkovic, K. Sollins, and J. Wroclawski, “Managing the Health of Security Experiments,” in Proc. of the conf. on Cyber Security Experimentation and Test. USENIX,2008,pp.7:1–7:6. [101] K.Moslehi,A.Kumar,D.Shurtleff,M.Laufenberg,A.Bose,andP.Hirsch,“Frame- work for a self-healing power grid,” in IEEE Power Engineering Society General Meeting,2005. IEEE,2005,pp.2816–2823. [102] P.Naldurg,K.Sen,andP.Thati,“ATemporalLogicBasedFrameworkforIntrusion Detection,”inProc.ofthe24thIFIPIntl.Conf.onFormalTech.forNet.&Dist.Sys., 2004. [103] C. Neuman and K. Tan, “Mediating Cyber and Physical Threat Propagation in Se- cure Smart Grid Architectures,” in 2nd IEEE International Conference on Smart GridCommunicationsSmartGridComm,Oct2011,pp.238–243. 170 [104] V.Paxson,“Bro: ASystemforDetectingNetworkIntrudersinReal-time,”Comput. Networks,vol.31,no.23-24,pp.2435–2463,1999. [105] S. Peisert and M. Bishop, “How to Design Computer Security Experiments,”in 5th WorldConf.onInformationSecurityEducation,ser.Intl.FederationforInformation Processing, L. Futcher and R. Dodge, Eds. Springer US, 2007,vol. 237, pp. 141– 148. [106] L. Pi` etre-Cambac´ ed` es and M. Bouissou, “Beyond Attack Trees: Dynamic Security ModelingwithBoolean LogicDrivenMarkovProcesses (BDMP),” 2010 European DependableComputingConference,pp.199–208,2010. [107] T.Raghu,R.Ramesh,andA.B.Whinston,“Addressingthehomelandsecurityprob- lem: Acollaborativedecision-makingframework,”JournaloftheAmericanSociety forInformationScienceandTechnology,vol.56,no.3,pp.310–324,2005. [108] Respect-IT. (2007) A KAOS Tutorial. Tutorial. [Online]. Available: http://www. objectiver.com/fileadmin/download/documents/KaosTutorial.pdf [109] H. 
Ringberg, M. Roughan, and J. Rexford, “TheNeed for Simulationin Evaluating AnomalyDetectors,” SIGCOMM Comp. Comm. Rev. (CCR), vol. 38, no. 1, pp. 55– 59,Jan2008. [110] Rocky Mountain Institute/SWEEP. (2006, Apr.) Demand Response: An Introduction – Overview of Programs, Technologies, and Lessons Learned. Technical White Paper. [Online]. Available: http://www.sgiclearinghouse.org/ LessonsLearned?q=node/2440&lb=1 [111] M. Roger and J. Goubault-Larrecq, “Log Auditing through Model-Checking,” in Proc. of the 14th IEEE Computer Security Foundations Workshop, 2001, pp. 220– 236. [112] D. Salmon, M. Zeller, A. Guzm´ an, V. Mynam, and M. Donolo. (2007) Mitigating the Aurora Vulnerability With Existing Technology. Technical White Paper. Schweitzer Engineering Laboratories, Inc. [Online]. Available: https://www.selinc. com/workarea/DownloadAsset.aspx?id=6379 [113] Signal and Control Systems Engineering Group. (2011) Sydenham Signal Box Failures on April 12th, 2011. Technical Investigations Report. Rail Corporation New South Wales (RailCorp). Australia. [Online]. Available: http://www.railcorp. info/publications [114] R.SommerandV.Paxson,“OutsidetheClosedWorld: OnUsingMachineLearning forNetwork IntrusionDetection,”in Securityand Privacy (SP), 2010 IEEE Sympo- siumon,may2010,pp.305–316. 171 [115] SplunkSoftware.Software. [Online].Available: http://www.splunk.com/ [116] A.Steinberg,“Anapproachtothreatassessment,”in20057thInternationalConfer- enceon InformationFusion. Ieee,2005,p.8pp. [117] G.G.P.TaddaandJ.J.S.Salerno,“OverviewofCyberSituationAwareness,”Cyber SituationalAwareness,vol.46,pp.15–35,2010. [118] K. M. C. Tan and R. A. Maxion, ““Why 6?” Defining the Operational Limits of Stide, an Anomaly-Based Intrusion Detector,” in Proc. of the IEEE Symp. on Secu- rityandPrivacy,2002,pp.188–201. [119] M. Tavallaee, N. Stakhanova, and A. Ghorbani, “Toward Credible Evaluation of Anomaly-Based Intrusion-Detection Methods,” IEEE Trans. 
on Systems, Man, and Cybernetics, Part C: Applications and Reviews,, vol. 40, no. 5, pp. 516–524, Sep 2010. [120] C.-w. Ten, G. Manimaran, C.-C. Liu, and S. S. Member, “Cybersecurity for Criti- cal Infrastructures: Attack and Defense Modeling,” IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 40, no. 4, pp. 853–865, Jul.2010. [121] U.S.-Canada Power System Outage Task Force. (2004) Final Report on the August 14,2003BlackoutintheUnitedStatesandCanada: Causes andRecommendations. TechnicalInvestigationsReport.[Online].Available: https://reports.energy.gov/ [122] U.S. Government Accountability Office, “TVA Needs to Address Weaknesses in Control Systems and Networks,” Report to Congressional Requesters, USGAO, 2008,http://www.gao.gov/new.items/d08526.pdf. [123] R.Vaarandi, “SEC -A LightweightEventCorrelationTool,”IEEE Workshop on IP Operationsand Management,pp.111–115,2002. [124] ——, “A data clustering algorithm for mining patterns from event logs,” in Pro- ceedings of the 2003 IEEE Workshop on IP Operations and Management IPOM. Citeseer,2003,pp.119–126. [125] Verizon RISK Team. (2013) Verizon Data Breach Investigations Report. Verizon. Accessed: 7 Feb 2014. [Online]. Available: www.verizonenterprise.com/DBIR/ 2013 [126] A. Viswanathan, A. Hussain, J. Mirkovic, S. Schwab, and J. Wroclawski, “A Se- manticFrameworkforDataAnalysisinNetworkedSystems,”in Proceedingsofthe 8th USENIX Symposium on Networked Systems Design and Implementation, ser. NSDI’11. USENIXAssociation,2011,pp.127–140. 172 [127] A. Viswanathan, K. Tan, and C. Neuman, “Deconstructing the Assessment of Anomaly-based Intrusion Detectors,” in Research in Attacks, Intrusions, and De- fenses,ser.LectureNotesinComputerScience,S.Stolfo,A.Stavrou,andC.Wright, Eds. SpringerBerlinHeidelberg,2013,vol.8145,pp.286–306. [128] D.VonDollen.(2009)ReporttoNISTontheSmartGridInteroperabilityStandards Roadmap. Technical Report. 
Electric Power Research Institute (EPRI), National Institute of Standards and Technology (NIST). [Online]. Available: http://www. smartgrid.gov/document/report nist smart grid interoperability standards roadmap [129] D. Wagner and P. Soto, “Mimicry Attacks on Host-based Intrusion Detection Sys- tems,” in Proc. of the 9th ACM Conf. on Comp. and Comm. Sec. (CCS). ACM, 2002,pp.255–264. [130] K. Wang, J. J. Parekh, and S. J. Stolfo, “Anagram: A content anomaly detector resistanttomimicryattack,”in Recent Advances in IntrusionDetection. Springer, 2006,pp.226–248. [131] K. Wang and S. Stolfo, “Anomalous Payload-Based Network Intrusion Detection,” inRecent AdvancesinIntrusionDetection,ser.LectureNotesinComputerScience, E. Jonsson,A. Valdes, and M. Almgren, Eds. Springer, 2004, vol. 3224, pp. 203– 222. [132] G. C. Wilshusen. (2012) Challenges in Securing the Electricity Grid . Testimony BeforetheCommitteeonEnergyandNaturalResources,U.S.Senate.UnitedStates GovernmentAccountabilityOffice. [Online].Available: http://www.gao.gov/assets/ 600/592508.pdf [133] WiresharkWebsite.Software.[Online].Available: http://www.wireshark.org/ [134] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. MorganKaufmann,2005. [135] Q. Wu, D. Ferebee, Y. Lin, and D. Dasgupta, “Visualization of security events us- ingan efficient correlation technique,” in 2009 IEEE Symposium on Computational IntelligenceinCyber Security,no.978. IEEE,Mar.2009,pp.61–68. [136] E.YuandL.Liu,“ModellingTrustforSystemDesignUsingthei*StrategicActors Framework,” in Trust in Cyber-societies, ser. Lecture Notes in Computer Science, R.Falcone,M.Singh,andY.-H.Tan,Eds. SpringerBerlin/Heidelberg,2001,vol. 2246,pp.175–194. [137] E. S. Yu, “Social Modeling and i*,” in Conceptual Modeling: Foundations and Applications, ser. Lecture Notes in Computer Science, A. Borgida, V. Chaudhri, P. Giorgini, and E. Yu, Eds. Springer Berlin / Heidelberg, 2009, vol. 5600, pp. 99–121. 
173 [138] E.Yu,J.Mylopoulos,andY.Lesperance,“AIModelsforBusinessProcess Reengi- neering,”IEEE Expert,vol.11,no.4,pp.16–23,Aug1996. [139] D. Yuan, Y. Xie, R. Panigrahy, J. Yang, C. Verbowski, and A. Kumar, “Context- based Online Configuration-error Detection,” in Proceedings of the 2011 USENIX conferenceonUSENIXannualtechnicalconference,ser.USENIXATC’11. Berke- ley,CA,USA:USENIXAssociation,2011,pp.28–28. 174
Abstract
Situational awareness, or knowing "what is going on" in order to figure out "what to do," has become a crucial driver of the decision-making necessary for effectively managing and operating large-scale, complex systems such as the smart grid. Such awareness fundamentally depends on the ability of decision-making entities to convert low-level operational data from systems into higher-level insights relevant for decision-making and response. Technological advances have enabled the monitoring and collection of a wide variety of low-level operational event data from system monitors and sensors, along with several domain-independent tools (e.g., visualization, data mining) and domain-specific tools (e.g., knowledge-driven tools, custom scripts) that assist decision-makers in extracting relevant higher-level insights from the data. Yet despite the availability of data and of tools to make sense of that data, recent high-profile incidents involving large-scale systems, such as the North American power blackouts, the disruption of train services in Sydney, Australia, and the malicious shutdown of nuclear centrifuges in Iran, have all been linked to a lack of situational awareness among decision-makers, which prevented them from taking proactive actions to contain the scale and impact of each incident. A key reason for the lack of situational awareness in each circumstance was the inability of decision-making entities to integrate and interpret heterogeneous low-level information in a way semantically relevant to their goals and objectives.

Improving the situational awareness of a decision-making entity in such systems requires capabilities that assist the entity in integrating and interpreting the heterogeneous event data from the system and in extracting insights relevant to its goals and objectives. Specification-driven methods are a popular choice for decision-makers in large-scale, complex systems to extract high-level insights from data.
In the specification-driven approach, a decision-maker writes a specification (such as a rule) for processing the low-level event data; the specification then drives analysis over the operational event data at runtime and yields high-level insights relevant to the decision-maker. We observe that, while such approaches are popular, a fundamental problem today is the low-level nature of the languages used to build specifications, which increases the burden on high-level decision-makers who must combine and interpret information in a way relevant to their goals and objectives.

In this work, we propose a model-driven approach that enables decision-makers to write high-level specifications to drive analysis over the event data and to extract insights semantically relevant to their goals and objectives. Specifically, we introduce two abstractions: behavior models and situation models. Behavior models provide effective high-level abstractions for specifying complex behaviors (such as multi-step attacks or process execution) over a sequence or group of related events. Situation models provide effective high-level abstractions for modeling the high-level cause-effect relationships of situations in large-scale, complex systems over isolated, independent low-level events. Decision-makers compose high-level models using these abstractions to drive analysis over low-level data. The models capture relevant knowledge at a level of abstraction suited to a decision-making entity and explicitly encode the entity's high-level goals and objectives. When such a model is used as input to the analysis process, the insights produced are semantically relevant to the decision-making entity's goals and objectives.

The proposed modeling abstractions are expressive and simple, allow sharing and reuse of knowledge, and allow customization to suit a decision-maker's needs.
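To make the idea concrete, here is a rough sketch, in ordinary Python rather than the dissertation's actual modeling language, of how a behavior model might be declared as an ordered sequence of event predicates and matched against a stream of low-level events. All names here (`Behavior`, `match`, the event fields) are illustrative assumptions, not the real framework's API.

```python
# Illustrative sketch of the specification-driven idea: a high-level
# "behavior model" is declared as an ordered list of event predicates,
# and a generic matcher drives the analysis over low-level event data.
# These names are hypothetical, not the dissertation's actual API.

from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]

class Behavior:
    """A behavior model: an ordered sequence of event predicates."""
    def __init__(self, name: str, steps: List[Predicate]):
        self.name = name
        self.steps = steps

def match(behavior: Behavior, events: List[Event]) -> List[List[Event]]:
    """Return each subsequence of `events` satisfying the model's steps
    in order (a simple, non-overlapping left-to-right scan)."""
    instances: List[List[Event]] = []
    matched: List[Event] = []
    i = 0
    for ev in events:
        if i < len(behavior.steps) and behavior.steps[i](ev):
            matched.append(ev)
            i += 1
            if i == len(behavior.steps):  # full behavior observed
                instances.append(matched)
                matched, i = [], 0
    return instances

# Toy two-step behavior: a login followed by a privilege change.
login_then_escalate = Behavior("escalation", [
    lambda e: e.get("type") == "login",
    lambda e: e.get("type") == "priv_change",
])

events = [{"type": "login"}, {"type": "read"}, {"type": "priv_change"}]
print(len(match(login_then_escalate, events)))  # prints 1
```

The point of the abstraction is that the decision-maker writes only the declarative model (the ordered predicates); a generic matcher, not the decision-maker, drives the analysis over the raw event data.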
We demonstrate the effectiveness of these abstractions by applying them to a set of case studies relevant to large-scale, complex systems. First, we apply behavior models to capture the complex multi-step behavior of a DNS cache-poisoning attack, and demonstrate how such a model can be used to effectively extract insights, in the form of attack instances, from network events. Then, we demonstrate how behavior models can be used to rapidly compose a description of a distributed denial-of-service (DDoS) attack and extract DDoS attack instances from an ISP packet trace. Finally, we demonstrate how situation models enable multiple decision-making entities involved in a demand-response operation to make sense of heterogeneous, low-level facts and to extract insights semantically relevant to their high-level decision-making goals.

Overall, this work demonstrates that introducing simple, semantically relevant modeling constructs enables decision-makers in complex environments to build specifications at a higher level of abstraction and to extract insights relevant to their goals and objectives. Our model-driven approach emphasizes reuse, composability, and extensibility of specifications, and thus introduces a more systematic way to build specifications and to retain expert knowledge for sharing and reuse.
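As a hedged illustration of the DDoS case study, and not the actual models used in the dissertation, a "many distinct sources targeting one destination" behavior could be phrased as a simple aggregate condition over grouped packet events:

```python
# Hypothetical sketch: flag destinations contacted by at least
# `threshold` distinct sources within a packet trace. This stands in
# for a behavior model over a *group* of related events, as opposed
# to the sequence-based model shown earlier.

from collections import defaultdict
from typing import Dict, List

def ddos_instances(packets: List[Dict[str, str]], threshold: int) -> List[str]:
    """Return destinations whose distinct-source count meets the threshold."""
    sources_by_dst: Dict[str, set] = defaultdict(set)
    for pkt in packets:
        sources_by_dst[pkt["dst"]].add(pkt["src"])
    return [dst for dst, srcs in sources_by_dst.items() if len(srcs) >= threshold]

# 50 distinct sources hit 192.0.2.1; one source hits 192.0.2.2.
packets = [{"src": f"10.0.0.{i}", "dst": "192.0.2.1"} for i in range(50)]
packets.append({"src": "10.0.0.1", "dst": "192.0.2.2"})
print(ddos_instances(packets, threshold=40))  # prints ['192.0.2.1']
```

The threshold, field names, and grouping rule are assumptions for the sketch; the dissertation's actual DDoS description is composed from its behavior-model abstractions.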
Asset Metadata
Creator: Viswanathan, Arun A. (author)
Core Title: Model-driven situational awareness in large-scale, complex systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science (Computer Security)
Publication Date: 02/10/2015
Defense Date: 01/15/2015
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: demand response, event processing language, formal modeling, high-level specification language, model-driven approach, OAI-PMH Harvest, situational awareness, smart grid
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Neuman, Clifford B. (committee chair), Govindan, Ramesh (committee member), Prasanna, Viktor (committee member)
Creator Email: arun.pict@gmail.com, aviswana@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-532211
Unique Identifier: UC11297799
Identifier: etd-Viswanatha-3179.pdf (filename), usctheses-c3-532211 (legacy record id)
Legacy Identifier: etd-Viswanatha-3179.pdf
Dmrecord: 532211
Document Type: Dissertation
Rights: Viswanathan, Arun A.
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA