PROVENANCE MANAGEMENT FOR DYNAMIC, DISTRIBUTED AND DATAFLOW ENVIRONMENTS

by

Jing Zhao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2012

Copyright 2012 Jing Zhao

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Contributions
  1.2 Overview

Chapter 2: Background
  2.1 Use Case Scenarios from Energy Informatics
    2.1.1 Reservoir Production Forecast
    2.1.2 Power Demand Analysis
  2.2 Existing Work on Provenance

Chapter 3: Compact Provenance Recording for Recurrent and Streaming Workflows
  3.1 Overview of Our Approach
  3.2 Working Use Case
    3.2.1 OPM-based Provenance Graph
  3.3 Provenance Template for Recurrent Workflows
    3.3.1 Provenance Storage using Templates
    3.3.2 Provenance Instance Retrieval
    3.3.3 Generating Provenance Template from a Workflow Definition
    3.3.4 Storage Cost Analysis
  3.4 Applying Provenance Template to Stream Processing Workflows
    3.4.1 Overhead Analysis
  3.5 Extending Provenance Template for Stream Processing Workflow
    3.5.1 Provenance Inference Functions
    3.5.2 Provenance Instance Retrieval
  3.6 Capturing Provenance Pattern Outliers
  3.7 Implementation
    3.7.1 Storage Structures
    3.7.2 Provenance Services and Libraries
  3.8 Evaluation
    3.8.1 Evaluation Setup
    3.8.2 Recurrent Workflows
    3.8.3 Stream Processing Workflows
  3.9 Related Work
  3.10 Conclusion

Chapter 4: Querying Provenance Information in Distributed Environments
  4.1 Overview of Our Approach
  4.2 Semantic Annotation on Provenance
  4.3 System Architecture
  4.4 Provenance Retrieval Query
  4.5 Provenance Filter Query
    4.5.1 Provenance Templates Identification
    4.5.2 Processing Provenance Filter Query in a Local Repository
    4.5.3 Processing Filter Query across Distributed Repositories
    4.5.4 Sub-Query Execution
    4.5.5 Handling Loops
  4.6 Evaluation
    4.6.1 Experiment Setup
    4.6.2 Baseline
    4.6.3 Results and Discussion
      4.6.3.1 Provenance Retrieval Queries
      4.6.3.2 Provenance Filter Queries
  4.7 Related Work
  4.8 Conclusion

Chapter 5: A Semantic-based Approach for Handling Incomplete Provenance
  5.1 Overview of Our Approach
  5.2 Semantic Annotation on Reservoir Models
  5.3 Semantic Association
    5.3.1 Detecting Associations from Historical Datasets
  5.4 Confidence of Association
  5.5 Prediction
    5.5.1 Handling Inaccurate Provenance Information
  5.6 Evaluation
    5.6.1 Experiment Setup
    5.6.2 Baseline Approaches
    5.6.3 Results and Discussion
  5.7 Related Work
  5.8 Conclusion

Chapter 6: Presenting Provenance with Appropriate Granularities
  6.1 Provenance Presentation for Smart Grid
    6.1.1 Presenting Provenance for Different User Roles
    6.1.2 Presenting Provenance for Data Quality Forensics
  6.2 Modeling Presentation Granularity for Provenance
    6.2.1 Static Granularity Definition
    6.2.2 Hybrid View
  6.3 Determining Apropos Provenance Presentation View
    6.3.1 Decomposition Approach
    6.3.2 Clustering Approach
  6.4 Related Work

Chapter 7: Conclusion
  7.1 Future Work

Bibliography

List of Tables

3.1 Properties and annotations used to describe a typical OPM graph
3.2 Variables representing average size (in bytes) of provenance properties in an OPM graph; "U"s represent UUIDs while "S"s are string types

List of Figures

1.1 Overview of provenance management
2.1 The reservoir production forecasting workflow
2.2 The USC campus power demand analysis workflow
2.3 Concepts and relationships in OPM
3.1 Recurrent and stream processing workflows in our motivating example
3.2 Sample OPM graphs for the power demand forecast results of buildings "RTH" and "EEB", which were generated by two executions of the same workflow
3.3 Provenance template for power demand forecast workflow
3.4 Provenance records in iTable and mTable
3.5 Example workflows with control flow and their corresponding provenance templates
3.6 The size of provenance information generated by the template-based approach and the baseline approach
3.7 A stream processing element which takes k streams as input and produces j streams as output
3.8 An exemplified pre-defined query skeleton used to identify input C_p event subsequences for a C_agg event according to processing pattern Rule 3.1
3.9 Steps for retrieving provenance information for a C_agg event
3.10 System architecture in the Windows Azure Cloud Platform
3.11 Time cost of three types of provenance insertion
3.12 Time cost of the three steps of provenance insertion
3.13 Provenance storage size for recurrent workflows when employing the OPM-based and the template-based approaches
3.14 Time cost for retrieving provenance information for recurrent workflow
3.15 Provenance storage size for stream processing workflows
3.16 Time cost for retrieving provenance information for stream processing workflow
4.1 A provenance template for reservoir production forecast
4.2 Overview of System Architecture
4.3 Processing of provenance retrieval query
4.4 Pseudocode of a sample provenance filter query
4.5 A provenance template graph of "Production Forecast"
4.6 The distributed scenario of the provenance template for "Production Forecast"
4.7 Query units derived from Figure 4.6 and their dependency
4.8 A provenance template with loop
4.9 Structure of the synthetic workflow
4.10 Time for processing provenance retrieval queries with different size of provenance and different database distribution
4.11 Time on local queries and network
4.12 Time performance of the provenance filter queries with different number of databases and W=2000
4.13 The ratio of retrieved provenance information in the baseline algorithm when processing the first filter query
4.14 Time performance for provenance filter queries with different size of provenance and different number of databases
4.15 Time on network for provenance filter queries
4.16 Time on local queries and network for provenance filter queries
5.1 Overview of bootstrapping and provenance prediction
5.2 Using semantic association for provenance prediction
5.3 Number of data items in sample datasets created by each process
5.4 Precision of provenance prediction for use case 1
5.5 Precision of provenance prediction for use case 2
5.6 Precision of provenance prediction for use case 2 when the existing provenance is inaccurate
6.1 A provenance graph for workflows used to forecast building and campus power consumption
6.2 Provenance graph views for different user roles

Abstract

Provenance, the derivation history of data objects, records how, when, and by whom a piece of data was created and modified. Provenance allows users to understand the context of derived data, estimate its quality for use, locate data of interest, and determine datasets affected by erroneous processes. It thus plays an important role in scientific experiments and business processes for data quality control, audit trails, and ensuring regulatory compliance.

While most previous work studies provenance only in closed and well-controlled environments (e.g., a workflow engine), challenges remain for holistic provenance management in practical and open environments, where provenance can be distributed, dynamic and diverse. For example, in the Energy Informatics domain, provenance is often collected from large-scale workflows across disciplines and organizations, and thus is usually stored in distributed repositories. However, there has been limited research on reconstruction of, and query over, distributed provenance information. Meanwhile, recurrent and stream processing workflows can generate fine-grained provenance whose size can exceed that of the original dataset, and provenance storage approaches for efficiently managing such metadata volumes have received inadequate attention in the literature. Lastly, legacy tools without automatic provenance collection are still widely used, requiring manual provenance annotation operations that leave provenance incomplete.

In this thesis, using Energy Informatics as an exemplar domain, we design and develop algorithms and systems for managing provenance in dynamic, distributed and dataflow environments, motivated by real world challenges. In particular, we make the following contributions: (1) template-based algorithms that can efficiently store provenance information for dynamic datasets, (2) algorithms for reconstructing and querying provenance graphs from distributed provenance repositories, and (3) semantic-based approaches for predicting incomplete provenance. We evaluate our research contributions with use cases from the Energy Informatics domain, including both Smart Oilfield and Smart Grid. The evaluation results demonstrate that our work achieves efficient and scalable provenance management. As future work, we also discuss key challenges and initial solutions for presenting provenance across different granularities based on its usage context information.

Chapter 1: Introduction

Provenance is the metadata that pertains to the derivation history of data objects. Provenance is often considered in the context of a workflow, a series of tasks composed together as a logical data processing application. In a workflow, the provenance may include the services and processes that were triggered during the workflow execution, the configuration and parameters for these processes, the actual data inputs that were used for that invocation, and the dependencies between processes and data objects.

Provenance records the causality and execution chain of activities that led to a piece of data being generated. It is well recognized as crucial to sound scientific experiments and to meeting regulatory business processes [26][84][76]. The availability of provenance allows users to understand the context of derived data, estimate its quality for use, locate data of interest, and determine datasets affected by erroneous processes.
Provenance is collected and used in the context of in silico experiments modeled as scientific workflows or pipelines [39], where it has gained importance since the scientific process would otherwise be opaque to the users [75]. Provenance is also growing in importance for business workflows as a means of auditing and ensuring regulatory compliance [37].

Much research has gone into modeling provenance [77][74], collecting it from workflow executions [23][24], and storing and querying over it [53][60][16]. Provenance recording systems have been developed and used in workflow frameworks such as Taverna [79] and Kepler [23]. While most existing work still focuses on algorithms and mechanisms for provenance management in a well-controlled and closed software environment (e.g., a workflow framework), it remains a challenge to offer robust and efficient provenance storage, querying and usability in real world environments such as business enterprises, where data, process and user diversity and distribution are the norm.

In this dissertation, we identify gaps in holistic provenance management for open software environments. We provide a framework of algorithms and strategies, validated by implementations and experiments, for efficient provenance storage, retrieval and query in such dynamic, distributed dataflow environments. We use the Energy Informatics domain both to motivate and to substantiate our contributions, as it reflects the challenging complexities and diversity of many scientific and enterprise domains that require provenance. Energy Informatics is a domain where information technologies are applied to integrate and optimize current energy assets. These energy assets include energy sources, energy generating and distributing infrastructures, and billing and monitoring systems [7]. The Energy Informatics application environment has characteristics that are not considered by traditional provenance management, which include:

1. Large-scale workflows distributed across disciplines and organizations: In the Energy Informatics domain, typical workflows are composed of large numbers of distributed processes and applications, and involve multiple classes of users and stakeholders, with different specializations and roles, spanning departmental or organizational boundaries [90]. For example, in Smart Oilfield, a sub-domain of Energy Informatics that applies scientific principles to forecast and optimize the production of oilfields [34], a typical oil production forecast workflow usually contains hundreds of processes and thousands of data objects distributed in different areas. Different categories of data objects, including reservoir capability descriptions, historical production records, and surface facility constraints, are integrated and analyzed to predict future oil production.

2. Pervasiveness of recurrent workflows and stream processing workflows: As data from sensors and large shared instruments in the energy industry becomes more dynamic – in terms of data sizes, arrival rates, and number of sources – workflows often need to be run repeatedly or continuously to process the large volumes of incoming data. Sometimes these workflows run hundreds of times per day on distributed infrastructure. In addition to recurrent workflows, where the same workflow runs on a schedule with different available input datasets, stream processing dataflows are also widely used to run these analyses continuously as data events arrive. For example, in the Los Angeles Smart Grid project [86], stream and complex event processing systems are employed to ingest, analyze and detect patterns of electrical power consumption over hundreds of event streams at a frequency of 1/min or higher.
Stream processing workflows are also widely used in the Smart Oilfield domain for oil/gas well surveillance and monitoring [98].

3. Diversity of applications and workflows: A typical Energy Informatics setting involves the use of diverse applications and workflows manipulated by multiple classes of users and stakeholders with different specializations and roles. Not all the workflows provide modern information functionalities such as automatic provenance collection. Legacy applications were not designed to interact with each other, so manual operations are involved in data archiving, transfer and integration. For example, engineers often have to manually copy data items from existing data volumes supported by one type of application, transform their format, and then manually paste the transformed data into a new dataset for another type of application.

Application environments having the above characteristics introduce the following challenges when incorporating provenance for the domain:

1. Distributed provenance information: While a large-scale workflow may run across organizations and disciplines, individual organizations and application domain communities usually host their own provenance repositories to store and share provenance. Alternatively, when a specialized application (e.g., a sub-workflow) within a pipeline of tasks (e.g., a workflow) is run in a specific execution environment due to resource dependencies, that environment may collect and store provenance for all data products derived by that application in a single repository. This domain affinity and resource affinity cause logically related provenance metadata to be hosted in physically distributed provenance repositories, which limits the ability of a user or application to perform queries over the complete provenance that spans multiple organizations.

2. Overwhelming provenance size: Recording provenance for dynamic dataflows, where data flows through workflows continuously, can become overwhelming in terms of storage size. For example, a Smart Power Grid workflow that we introduce in Section 2.1.2 processes events once per minute from numerous event streams to forecast peak power demand in a building. While the events themselves are small (around 100 bytes), nearly 1 GB of provenance is recorded per day. Given the continuous and long running nature of the workflow, the storage and querying of the provenance information can become untenable unless intelligently managed.

3. Incomplete provenance: In a diverse environment where provenance collection may be manual, semi-automated or automated, end users do not always have complete and accurate provenance information. Provenance information may go missing during manual operations of data archiving and integration, or it may have been manually annotated for the output data. For example, in the Smart Oilfield domain, provenance information of the original data objects has to be manually copied and linked to the data objects that are copied/pasted between legacy applications by reservoir engineers. These manual provenance annotation operations can be tedious and error-prone, and thus produce incomplete provenance information for data objects.

1.1 Contributions

In this dissertation, we propose the hypothesis that efficient provenance management can be achieved for dynamic, distributed and diverse dataflow environments. We prove this hypothesis and make the following contributions:
1. Efficient provenance storage: We design a template-based approach for identifying common patterns in the provenance structure and properties across different executions of a workflow and the different data generated by them. Novel variations of this technique are used to reduce the storage footprint of provenance collected both for recurrent executions of the same workflow with different inputs, and for stream processing workflows. Our techniques are compatible with the Open Provenance Model community standard, allowing them to be widely used with existing workflow and provenance systems. An analysis of our normalization approaches shows a typical reduction in lossless provenance storage of 50% or better compared to commonly used provenance recording techniques.

2. Querying over distributed provenance repositories: We provide mechanisms for reconstructing and querying the provenance graph from distributed provenance repositories using a limited set of operations exposed by the repository services. Specifically, we propose a classification of queries into two types based on user intent, and propose distinct solutions to answer them. The first type is the Provenance Retrieval Query, which requests the provenance graph for a data artifact (i.e., provenance appears in the SELECT clause, using a SQL analogy). The second type is the Provenance Filter Query, which uses provenance information, including ancestral processes and artifacts, as query conditions when retrieving a data artifact (i.e., provenance appears in the WHERE clause).

3. Predicting incomplete provenance: We propose a novel approach to predict incomplete provenance information. Based on the observation in practical domains that certain data items may be created by the same process, we design a semantic-based approach which utilizes whatever provenance exists to predict the missing provenance. We also account for inaccuracy in the existing provenance information during the prediction procedure. Based on the accuracy of existing provenance and the confidence values of our prediction evidence, we calculate probability values to determine the trust of our final prediction. Our evaluation shows that the average precision of our approach for the Smart Oilfield domain is above 85% when one third of the provenance information is missing.

1.2 Overview

[Figure 1.1: Overview of provenance management – the provenance lifecycle spanning template-based compact provenance recording (Chapter 3), provenance queries over distributed repositories via a provenance index service and per-repository query services (Chapter 4), and incomplete provenance recovery (Chapter 5).]

Figure 1.1 illustrates the lifecycle of the provenance information and summarizes our contributions to provenance management for dynamic, distributed dataflow environments. The lifecycle goes through stages of (1) provenance generation and storage from dynamic workflows, (2) their query and retrieval from across a distributed environment, and (3) their usage by diverse users and applications.

Detailed provenance information collected for processes executing as part of a dataflow, including their input, intermediate and output data artifacts, parameter settings of the process, and agents that control the process, is stored in individual provenance repositories.
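To make this concrete, the Python sketch below shows one plausible shape for such a repository record, reusing the artifact URIs that appear later in the Smart Grid example of Chapter 3; the class layout, field names and the invocation id are our own illustration, not the storage schema developed in later chapters.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ProvenanceRecord:
    """One process invocation as a repository might record it (illustrative only)."""
    process_id: str              # URI of this process invocation (hypothetical)
    process_type: str            # e.g., "FCST"
    agent_id: str                # agent controlling the process
    parameters: Dict[str, str]   # configuration, e.g., {"model": "DT"}
    used: List[str]              # URIs of input artifacts
    generated: List[str]         # URIs of output artifacts

record = ProvenanceRecord(
    process_id="sg:run/2012-06-02/fcst",   # hypothetical invocation id
    process_type="FCST",
    agent_id="http://aldan.usc.edu/sg",
    parameters={"model": "DT"},
    used=["sg:refcc1", "sg:refw1"],        # cleansed consumption, weather
    generated=["sg:refdf1"],               # power demand forecast
)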
Template-based compact provenance recording algorithms (Contribution 1) are designed and developed to decrease the size of the provenance storage, especially for recurrent workflows and stream processing workflows. Details of the algorithms, along with the corresponding implementation and evaluation, are presented in Chapter 3.

Applications, sub-workflows and workflows run in a wide-area distributed environment that may span multiple physical sites. Multiple provenance repositories exist in this environment, and the provenance information for different workflow or sub-workflow executions may be placed in distributed locations. The provenance repositories expose services for querying provenance. To allow lookup of provenance repositories in this distributed environment, a provenance index service is used to map from a data artifact to the repository or repositories that hold the immediate provenance details about the process that derived that artifact. Based on this architecture with the index service, in Chapter 4 we introduce algorithms for executing provenance retrieval and filter queries over provenance that is local or distributed across such repositories (Contribution 2).

When processing provenance retrieval queries for data objects with incomplete provenance information, incomplete provenance prediction algorithms are designed and used to predict the missing provenance (Contribution 3). The predicted provenance information, along with the metadata that explains the derivation of our prediction, is stored in provenance repositories for future retrieval and query. Note that our current approach cannot predict all the missing provenance details – the prediction is limited to the type of the parent process that generated the data artifact. However, an accurate prediction about the parent process is still valuable for data quality estimation. We present the semantic-based prediction algorithm in Chapter 5.

Lastly, provenance for data derived from large-scale workflows that run across organizations and disciplines can be complex. Consequently, users in different roles find its interpretation onerous unless provenance is presented in a form that is easily consumable for the given task at hand. As future work, in Chapter 6, we outline key challenges and initial solutions for determining and presenting apropos hybrid provenance views across granularities, analyzed in the context of the Smart Grid domain. We summarize our work and present other areas for future research in Chapter 7.

Chapter 2: Background

In this chapter, we first introduce two use case scenarios from the Smart Oilfield and the Smart Power Grid domains, two sub-domains of Energy Informatics that exhibit dynamic, distributed dataflow characteristics. We use these two use cases to motivate our work and validate our contributions. We then introduce related work on provenance, and discuss the gaps between existing work and the requirements of provenance management for Energy Informatics.

2.1 Use Case Scenarios from Energy Informatics

2.1.1 Reservoir Production Forecast

Our first motivating example, the reservoir production forecasting workflow, comes from the Smart Oilfield domain. Reservoir production forecasting is commonly used to predict the performance of an oil reservoir under different scenarios during its lifetime. The core part among the several sub-processes in this forecasting workflow is a complex simulation process.
The input data for the computational simulation includes reservoir deliverability data, a description of the reservoir's capacity to produce oil, the historical production data for that oil well, the surface facility constraints, and the capacities of the export system over the life of the reservoir. Each of these inputs is generated using different techniques, such as well tests, seismic and production simulations, and fluid property analysis, and requires invocation of remote applications running in specific resource environments. This requires the sub-workflows to often be distributed across different enterprise domains that offer the licenses, configurations and dependencies required by specific processes and data used in the sub-workflow. Further, manual operations may be required to exchange data between these sub-workflows, or between the reservoir workflow and other data processing operations. The whole workflow is executed periodically with different inputs and parameter settings. Figure 2.1 illustrates the reservoir forecasting process.

[Figure 2.1: The reservoir production forecasting workflow – sub-workflows composed of applications and services produce historical production/injection data, reservoir deliverability and capacity data, and surface facility constraints, which feed the reservoir forecasting simulation to produce the forecasting result.]

The outcome of the forecasting process is important for making reservoir development decisions. The quality of the input data is a significant factor in determining the quality of the forecast, and this in turn depends on the data generated by different sub-processes. The provenance of the input data to the forecasting process can be used as a reliable estimator of the data quality. However, provenance for the sub-workflows is often captured and stored in independent repositories co-located with the sub-workflow, and this requires us to retrieve and query provenance from multiple locations. For example, the provenance collected from the well test process, which is contained in the sub-workflow for generating the reservoir capacity description, is usually stored in repositories physically present near the oilfield facility, whilst the provenance collected from the forecasting simulation process is usually stored in repositories of a research lab. Users require a way to collectively query for provenance, and over provenance, across these distributed repositories. In addition, the use of manual data exchange steps (such as "copy" and "paste") in the workflow execution can lead to gaps in automated provenance capture.

In the forecasting workflow, input/output data objects are represented as reservoir models. Each reservoir model is usually a complex dataset that consists of a large number of fine-grained data items integrated from different data sources. The data quality of a reservoir model is measured based on the quality of the data items it contains. Provenance of fine-grained data items is not always available and accurate, which becomes a hurdle for achieving good data quality control. We need techniques to accurately predict missing provenance information within a diverse provenance collection environment.

2.1.2 Power Demand Analysis

Our second use case comes from the Smart Power Grid domain. Smart Power Grids are a form of Cyber Physical System where information from diverse sensors and instruments monitoring the distribution network and consumer facilities is transformed and analyzed for operational support [86].
In our use case, the USC Campus Microgrid [86] serves as a testbed for the Los Angeles Smart Grid project to investigate novel informatics and eEngineering architectures for efficiently managing power consumption.

One of the key workflows is to reliably forecast power consumption, and to initiate voluntary and direct-control actions to curtail energy use during peak load periods. Several sub-workflows are used for information integration, and for analysis and decision making by a multi-disciplinary team of power systems engineers, data mining researchers, social behavior analysts, and facility and operations managers.

A simplified version of the workflow used for campus power forecasting is shown in Figure 2.2. The Campus Power Consumption Forecast sub-workflow predicts the future electric power usage of individual buildings on campus, and aggregates these forecasts into a total consumption forecast for the campus. The building forecasts are performed using a forecasting process that uses a previously trained machine-learned prediction model, current weather and campus schedule data, along with recent timeseries data observed from the building's smart power meter and facility sensors. Two other sub-workflows are responsible for preparing these inputs. A Forecast Model Training sub-workflow generates a power consumption forecast model for each building. A timeseries of recent power consumption and equipment usage information is generated by the Building Sensor Integration sub-workflow, which normalizes and aggregates events from smart meters and sensors installed in each building. These workflows run continuously, integrating information from hundreds of sensors every few minutes and providing updated forecasts to the campus facility managers every hour. Consequently, the size of the provenance captured for these workflows grows quickly over time. We thus need compact algorithms to efficiently record and store provenance.

[Figure 2.2: The USC campus power demand analysis workflow – Building Sensor Integration produces per-building power consumption; Forecast Model Training produces a forecast model; together with weather and schedule data, the Campus Power Consumption Forecast produces per-building forecasts, which are aggregated into a campus-wide forecast.]

2.2 Existing Work on Provenance

We summarize related work on provenance management in this section and offer a more detailed review of specific related research within individual chapters. Provenance information can support a number of uses, including data quality control [57][38], audit trails [71][50], repetition of data derivation [42][13], and attribution [57]. Researchers in both the database and the e-Science communities have done extensive work in the area of provenance. In [26], one of the earliest research contributions in the area of data provenance, Buneman et al. outline the main challenges as the basic technical issues. They further provide an approach for computing provenance information from database queries in [27]. Ikeda and Widom discuss issues and challenges of building a provenance management system in [56]. Simmhan et al. survey provenance work in e-Science [84], covering multiple e-Science projects including Chimera [43], myGrid [95], CMCS [80], and ESSW [44]. A recent survey on provenance and its usage on the web can be found in [76].

Existing work discusses provenance from different perspectives, including provenance modeling, provenance collection and storage, and provenance query processing.

Provenance Models: Provenance systems often use their own custom provenance models [23][43][20].
Meanwhile, the W3C PROV Data Model [8] and the Open Provenance Model (OPM) [77] are emerging as community standards that are popularly used to describe the causal relationships and interactions between processes and derived data. Both the PROV Data Model and the Open Provenance Model use graph models to represent provenance, and are similar in many respects.

The graph model in OPM mainly defines three types of nodes: Artifact, Process and Agent, where an artifact is usually a data/physical object, a process is an action or a series of actions performed on artifacts, and an agent is a contextual entity controlling or affecting the execution of a process. The graph model also defines directional edges representing the causal relationships between these nodes, such as wasDerivedFrom and used. OPM describes the causal relationships between process executions in a workflow, and the input and output data of the process. Both control flow and data flow in the workflow can be captured. For example, in OPM, "wasTriggeredBy" and "wasControlledBy" are used as control flow relationships, while "used", "wasGeneratedBy" and "wasDerivedFrom" represent dataflows. Nodes and edges in an OPM graph are often associated with properties. Some of these are a core part of the model while others are user-defined annotations. As an example of the former, OPM allows roles – syntactic tags about the process's and artifact's involvement – to be specified on the causal edges, such as the name of an input parameter.

Figure 2.3 depicts the main concepts and relationships defined in OPM. In this and subsequent figures, we use the following convention for illustrating provenance graphs: artifacts/entities are shown as ellipses, processes/activities are shown as rectangles, agents are shown as octagons, and labeled edges between nodes depict their causal relationships.

[Figure 2.3: Concepts and relationships in OPM – Artifact, Process and Agent nodes connected by used (role), wasGeneratedBy (role), wasDerivedFrom, wasTriggeredBy, and wasControlledBy edges.]

The W3C PROV Data Model (PROV-DM) defines concepts and relationships similar to OPM, such as Entity (corresponding to Artifact) and Activity (corresponding to Process). In general, the provenance information defined in PROV-DM can be divided into six components: (1) the derivation relationship between entities and activities, along with the corresponding time information (similar to "wasGeneratedBy" in OPM), (2) the relationships describing the responsibility of agents for entity generation or activity control ("wasControlledBy" in OPM), (3) the derivations of entities from entities ("wasDerivedFrom" in OPM), (4) specialization and alternates between entities, (5) collections formed by a group of entity members, and (6) a simple annotation mechanism.

OPM is a community standard put forth by academic groups, while PROV-DM is being developed as a formal standard through wider participation among the W3C community. Since PROV-DM was still under development at the time of writing this dissertation, we use OPM as the default provenance model in our research contributions. However, their similarities ensure that our work is equally applicable to emerging provenance frameworks that adopt the PROV-DM standard once it is ratified.
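To make the OPM graph model above concrete, the following sketch encodes a two-edge fragment of a provenance graph as plain records. The class layout is our own illustration under the OPM vocabulary, not a normative OPM serialization; node ids follow the example graph in Figure 3.2 later in the thesis.

from dataclasses import dataclass

@dataclass(frozen=True)
class OpmNode:
    node_id: str         # globally unique id (a URI in practice)
    kind: str            # "Artifact", "Process", or "Agent"
    node_type: str = ""  # subtype annotation, e.g., "CONS" or "FCST"

@dataclass(frozen=True)
class OpmEdge:
    source: str          # the dependent node (effect)
    target: str          # the node it depends on (cause)
    relation: str        # "used", "wasGeneratedBy", "wasControlledBy", ...
    role: str = ""       # optional role tag on the causal edge

# Fragment: artifact DF1 was generated by process PF1, which used
# artifact W1 in the role "weather".
nodes = [
    OpmNode("DF1", "Artifact", "Power Demand FCST"),
    OpmNode("PF1", "Process", "FCST"),
    OpmNode("W1", "Artifact", "Weather"),
]
edges = [
    OpmEdge("DF1", "PF1", "wasGeneratedBy", role="forecast"),
    OpmEdge("PF1", "W1", "used", role="weather"),
]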
Provenance Collection and Storage: Provenance collection systems have been developed as an intrinsic part of workflow frameworks, such as Taverna [79], Kepler [68], and Trident [83], or as stand-alone systems such as Karma [85] that support different workflow engines. For example, the Kepler project [68] provides a workflow framework which has been used for applications in biology, environmental science, and various other domains. On top of this workflow framework, Kepler also develops a provenance framework [13][23], which can handle different provenance models and has automatic provenance collection capabilities.

Similarly, in the myGrid project, workflows are defined using the XScufl language and are executed on the Taverna workflow engine [50], a workflow engine developed for grid environments. In Taverna, provenance information is automatically logged during workflow execution. The logged provenance covers four different levels: process level, data level, organization level and knowledge level. Semantic web technologies are employed to link domain knowledge with the provenance information, so the raw provenance log is converted into domain-knowledge-enriched provenance information. While myGrid and Kepler provide provenance systems that are tightly integrated with workflow systems, Karma [85] is a stand-alone provenance collection tool, and thus can provide more flexible provenance management solutions.

The provenance information collected by such systems can grow to be large in size, particularly for recurrent workflows that are executed often. To compactly represent provenance, Chapman et al. [30] propose three techniques for compressing provenance information: factorization, structural inheritance, and predicate-based inheritance. These techniques identify data artifacts with identical or similar provenance graphs and remove redundant copies of common provenance graphs. However, the techniques proposed in [30] need to scan the whole provenance database, which is inefficient when the database is large and constantly growing – usually the case when storing provenance for large-scale stream processing workflows. Our research addresses these gaps.

Some existing work discusses provenance collection for stream processing workflows. [66] identifies provenance as key metadata for data quality control in stream environments. In [91], the authors introduce an efficient provenance tracking approach which captures the dependencies among data streams. Their approach has low overhead, but ignores the dependencies between individual stream events, which is defined as fine-grained stream provenance in [49]. In [49], Glavic et al. also outline the challenges and possible solutions for managing fine-grained stream provenance. Our work acknowledges these challenges and also focuses on fine-grained provenance tracking for streams.

Provenance Query Processing: Provenance query processing has been studied in existing work. Miles [70] defines a provenance query and describes techniques for scoped execution of provenance queries. Y. Zhao et al. [97] and Holland [55] have proposed approaches for expressing provenance queries. Heinis et al. [53] focus on mechanisms for efficient storage and querying of provenance information, using interval encoding to reduce the storage size and improve query performance over a recursive algorithm. Based on a fine-grained provenance model [17] that supports scientific workflows processing nested datasets (e.g., XML data), Anand et al. define the formal semantics of a provenance query language, QLP, and discuss efficient query processing approaches in [16].

In [60], Kementsietsidis and Wang first define the provenance backward/forward query.
The provenance backward query is defined as a query addressing what input was used to generate an output, and the provenance forward query is defined as a query identifying what output was derived from an input. Two key requirements are then discussed for both forward and backward provenance queries: duality and locality. Duality requires that a single access method can be used to evaluate both the backward and the forward queries, and locality requires that the provenance query evaluation time depend only on the size of the provenance query results rather than the total size of the provenance data. A novel index structure is utilized to improve provenance query processing in [60].
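The distinction is easy to see on a toy derivation graph. The sketch below is ours, not the index structure of [60]; it answers both query directions by walking wasDerivedFrom edges in opposite directions, and the fact that one traversal routine serves both directions is the duality property in miniature.

from collections import defaultdict

# Toy derivation edges (child, parent): "child wasDerivedFrom parent".
derived_from = [("df1", "cc1"), ("df1", "w1"), ("cc1", "c1")]

parents, children = defaultdict(list), defaultdict(list)
for child, parent in derived_from:
    parents[child].append(parent)
    children[parent].append(child)

def closure(start, neighbors):
    """Transitive closure along one edge direction."""
    seen, stack = set(), [start]
    while stack:
        for n in neighbors[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen

backward = closure("df1", parents)   # what was df1 derived from?
forward = closure("c1", children)    # what was derived from c1?
assert backward == {"cc1", "w1", "c1"}
assert forward == {"cc1", "df1"}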
Provenance query processing in distributed environments has been studied in several existing works. In [69], three advantages are claimed for distributing provenance information: low-latency access to local provenance metadata, avoiding synchronization with a central service after operations during network disconnection, and ease of maintaining user privacy. Provenance sketches based on Bloom filters are stored on local nodes and utilized to enable and optimize provenance queries across repositories. Groth discusses techniques for determining provenance information in a distributed environment in [51]; the algorithm in [51] comprises six steps: translate, filter, traverse, consolidate, pare and merge. Groth and Moreau also discuss how to represent provenance information in distributed environments using the Open Provenance Model in [52]. Gadelha et al. design the Swift parallel scripting language to analyze provenance information collected from large-scale distributed systems in [46]. Provenance generation and refinement in the context of very large-scale workflows involving thousands of computations is discussed by Kim et al. in [62]. Ellkvist et al. propose a mediator-based architecture to integrate distributed and heterogeneous provenance information [41].

Although much effort has been put into improving the performance of provenance queries, most existing work, even when discussed within a distributed environment, only provides provenance retrieval functionality based on complete and accurate provenance. There has been limited work on mechanisms for processing more complex queries such as provenance filter queries (either local or distributed). Also, a more practical provenance system needs to be able to recover incomplete provenance information.

Chapter 3: Compact Provenance Recording for Recurrent and Streaming Workflows

The growth of large-scale and dynamic data in Energy Informatics is leading to recurrent workflow executions for continuous data analysis, where the same workflow runs on a schedule with different available input datasets and parameter settings. In our Smart Power Grid domain use case, introduced in Section 2.1.2, the power consumption forecast workflow runs periodically, consuming the most recent power consumption measurement data collected from sensors and smart meters, so as to generate accurate power consumption predictions for each building on the campus. Also, in the Smart Oilfield domain, as engineers keep improving analysis models according to newly collected information about the reservoir, the reservoir production forecast workflow is executed repeatedly and the oil production scheme keeps changing based on new forecast results.

Provenance information, including details about the input datasets, intermediate results, and process parameter settings, needs to be recorded for each execution of a recurrent workflow so that domain engineers can better understand the output, identify missing data or process misconfigurations, and achieve better data quality control. However, the collected provenance information can impose a large storage overhead over time. For example, the power consumption forecast workflow in our use case is executed around 100 times per day for each building. Consequently, several hundred megabytes of provenance information is collected and stored each day when executing such a recurrent workflow for the whole campus.

Meanwhile, stream processing workflows, which are also widely deployed and used in the domain, can generate even larger volumes of provenance information. Provenance collected from a stream processing workflow captures the causal relationships between the input/output events of stream processors, and thus can be used as evidence for data quality monitoring and debugging. The size of the provenance information can increase rapidly when the frequency of incoming events gets high. For example, as we have introduced, a stream processing workflow in our Smart Power Grid use case can generate nearly 1 GB of provenance per day. A stream processing workflow usually keeps running for a long time, and in practice it is necessary to collect and store complete fine-grained provenance as evidence for data quality forensics. The size of its provenance information can therefore be overwhelming.

In this chapter, we introduce algorithms for compact provenance storage for recurrent and streaming workflows. A template-based algorithm is first designed to compress the provenance for recurrent workflows without losing any metadata. A variation of the basic template-based algorithm is then proposed and used for provenance storage of stream processing workflows. Our techniques are well suited to reconstitute the complete original provenance even at fine granularities, and the storage size scales well for provenance aggregated from recurrent and streaming workflows over time. Our work is compatible with the Open Provenance Model for broader applicability, and an evaluation of our novel lossless provenance storage approaches for applications executing over 30 days in the Smart Power Grid domain suggests a storage reduction of 50% or better.

3.1 Overview of Our Approach

In general, our work on compact provenance storage contains the following three parts:

A template-based algorithm for provenance storage of recurrent workflows: Provenance graph instances generated by different executions of the same workflow share identical or similar structures and some of their properties, such as the process and artifact types. Our algorithm captures this common provenance information and uses provenance template graphs for its representation and storage. The main storage workload is then caused only by the non-static properties of each individual provenance graph instance with respect to this common template. This greatly decreases the total size of the provenance storage.

A provenance template graph is generated from the workflow definition. In our work, we discuss how to handle control flow, such as conditional branches and loops, contained in the workflow definition when generating the template graph. Our algorithm can also be employed to compress the provenance storage for stream processing workflows.
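A rough cost model, our own simplification of the storage cost analysis in Section 3.3.4, conveys the intuition. Let s be the size of the static properties shared by all runs of a workflow, d the size of the run-specific properties, and n the number of executions:

% Back-of-the-envelope comparison (our simplification of the analysis
% in Section 3.3.4):
\[
  \text{baseline} \approx n\,(s + d),
  \qquad
  \text{template-based} \approx s + n\,d .
\]
% For large n the storage ratio tends to d/(s+d), so whenever the
% shared part satisfies s >= d, storage shrinks by 50% or more --
% consistent with the reduction reported in our evaluation.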
For both scenarios (i.e., the recurrent workflow and the stream processing workflow), we provide a detailed analysis of the storage overhead of our template-based algorithm, and compare it with the overhead of the OPM-based baseline approach.

Extending the template-based algorithm for provenance storage of stream processing workflows: A stream processing workflow can still generate a large amount of provenance information even if our basic template-based provenance storage algorithm is used. Based on the observation that many stream processing elements exhibit an implicit processing pattern which can be used to identify the causal relationship between input and output events, we extend our basic template-based algorithm by modeling these processing patterns explicitly within our template. In particular, a provenance inference function is developed based on the type of the processing element, and is integrated into the extended provenance template to express how to infer the input events from a given derived event.

Because of event delay or disorder, which is not uncommon in practice, the real execution of a stream processing element is not always consistent with its expected processing pattern. Our work thus includes a two-stage process to capture the provenance outliers, and only stores the difference between the expected and actual provenance in the repository.

Provenance reconstruction based on compact provenance storage: We have designed data structures and corresponding system architectures for storing the static (i.e., the provenance templates) and non-static (i.e., the instance-level) provenance information. For both the basic and the extended template-based provenance storage algorithms, we have also developed protocols and services to reconstruct and retrieve provenance graph instances from the compact provenance storage. We implement our template-based provenance storage on the Cloud platform.

3.2 Working Use Case

We use the Campus Power Consumption Forecast workflow from the Los Angeles Smart Grid project [86], introduced in Section 2.1.2, as our motivating example for recurrent workflows. The workflow periodically forecasts the electricity power demand for 150 buildings on the USC campus based on diverse, realtime information. Usually it is executed every 15 minutes for each building to update the forecast, and is thus executed around 14,000 times per day. Figure 3.1(a) depicts the simplified structure of the workflow. As shown in Figure 3.1(a), the workflow takes as input the most recent power consumption data ("CONS") of a building, and passes it to the "Cleanse" service for data quality checking and gap interpolation. It then uses the cleansed consumption data, along with the current weather information, as input to the forecast service ("FCST") and generates the power demand forecast for the building. This forecast is used by the university facilities to efficiently manage electricity usage and to report the forecast energy footprint of the campus to the public.

[Figure 3.1: Recurrent and stream processing workflows in our motivating example. (a) Power consumption forecast workflow (recurrent): CONS → Cleanse → Cleansed CONS → FCST (with Weather Info) → Power Demand FCST. (b) Power demand aggregating workflow (stream processing): per-meter input streams C_in(a_k, t_{n-j}), ..., C_in(a_k, t_n) are pre-processed into C_p streams and aggregated into C_agg(U, t_n, w).]

Another workflow in the project, the Building Sensor Integration workflow, processes realtime streaming information from sensors that are deployed in campus buildings.
As shown in Figure 3.1(b), for each building U on campus, the stream processing workflow utilizes a pre-processing process ("Pre-process") to collect and transform the continuous power usage information reported by the smart meters a_k (1 ≤ k ≤ m) located in U's service area. Each event in the stream (C_in(a_k, t_i) in the figure) records the kWh power consumption measurement, as well as the smart meter ID (a_k) and the timestamp (t_i) of the measurement. When an event arrives, the pre-processing program transforms the power consumption measurement value of the incoming event, and generates a new event C_p(a_k, t_i) with the same smart meter ID a_k and timestamp t_i as the input event. The new event C_p(a_k, t_i) is sent to an aggregation process ("Aggregate"), which calculates the aggregated power consumption of all the smart meters in the building within a time window (t_n − w, t_n], noted as C_agg(U, t_n, w), where w is the width of the time window. In the figure we assume the time range t_{n−j} to t_n falls within the time window. The aggregated power consumption can then be used as the input of the power demand forecast workflow depicted in Figure 3.1(a).
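A minimal sketch of this aggregation step, with hypothetical event tuples standing in for the C_p and C_agg events:

def aggregate(cp_events, t_n, w, building="U"):
    """Sum C_p events falling in the half-open window (t_n - w, t_n]."""
    total = sum(kwh for (_meter_id, t, kwh) in cp_events
                if t_n - w < t <= t_n)
    return (building, t_n, w, total)   # the C_agg(U, t_n, w) event

# C_p events as (meter id, timestamp, kWh); the values are made up.
cp_events = [("a1", 58, 1.0), ("a2", 59, 2.0), ("a1", 60, 3.0), ("a2", 45, 4.0)]
print(aggregate(cp_events, t_n=60, w=15))   # ('U', 60, 15, 6.0); t=45 excluded

Fine-grained provenance for the C_agg event must record exactly which C_p events fell inside the window – which is what the extended template of Section 3.5 lets us infer rather than store.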
3.2.1 OPM-based Provenance Graph

[Figure 3.2: Sample OPM graphs for the power demand forecast results of buildings "RTH" and "EEB", which were generated by two executions of the same workflow.]

Figure 3.2 shows two OPM-based provenance graphs for two executions of the power demand forecast workflow. The graphs describe the provenance information for two artifacts, DF1 and DF2, which are the forecast values of the power demand for buildings "EEB" and "RTH" respectively, generated by the same workflow but by different executions with different input data and parameters.

Nodes and edges in an OPM graph are associated with properties. Table 3.1 lists some of the commonly used properties and their data structures. Each node and edge has a globally unique id, a URI representing the stateful instant of a process's execution or an artifact's creation or usage. A node also has a value that provides a logical handle for the data or process concerned (e.g., the URL of an artifact, or the WSDL endpoint for a web service process). OPM allows timestamps to describe the instantaneous occurrences within a process, including the start or end of the process, and the usage or generation of artifacts. Attributes like label and type are additional annotations that users commonly specify.

As shown in Figure 3.2, the OPM-based provenance graphs include the URI of each artifact, process and agent (e.g., C1, PC1, and A1), as well as their types (e.g., "CONS", "Cleanse", and "SG System"). The "FCST" process is also annotated with a label for its configuration parameter, "DT" or "NN", which indicates its calculation model. Other information in the figure, such as timestamps and roles, is listed in Table 3.1.

Table 3.1: Properties and annotations used to describe a typical OPM graph.

  ArtifactId (URI): identifies an artifact node in the provenance graph. Examples in Figure 3.2: C1, C2, CC1.
  ProcessId (URI): identifies a process node in the provenance graph. Examples: PC1, PF1, PC2, PF2.
  AgentId (URI): identifies an agent node in the provenance graph. Examples: A1, A2.
  EdgeName (String): causal relationship type of an edge. Examples: used, wasGeneratedBy.
  Role (String): edge tags describing the role played by an artifact/process in the relationship. Examples: ccons, weather.
  Type (URI): subtypes describing the artifacts, processes, and agents. Examples: CONS, FCST.
  Value (String, URI): an application-provided logical handle for the artifact, process, and agent instances. Examples: sg:refc1, sg:refcc1.
  Label (String): a human-consumable description of the artifact, process and agent nodes. Examples: for EEB, Model DT.
  Timestamp (String): instantaneous timestamp of an artifact's creation/usage, and a process's start/end. Examples: t1, t2.

3.3 Provenance Template for Recurrent Workflows

A template-based approach forms the foundation of our work on compact provenance representation. In this section, we describe the basics of the approach and illustrate its application for efficiently recording provenance for recurrent workflows, going into finer details and extensions in later sections. Workflows are often executed repeatedly with different input data and parameters, producing similar output data types. This leads to provenance graphs that share identical or similar structures and some properties, such as types, while differing in the values of properties of process and artifact instances specific to an execution. The template approach attempts to leverage this commonality.

As shown in Figure 3.2, the two artifacts, "DF1" and "DF2", are generated by different runs of the same workflow, hence their provenance graph structures are the same: a consumption data artifact ("CONS") is pre-processed by the "Cleanse" process, and then consumed by the forecast process ("FCST"), along with the weather information, to generate the power demand forecast for the building. However, properties such as ids and values differ, and are unique to the provenance instances.

[Figure 3.3: Provenance template for the power demand forecast workflow – a schema-level graph over the node types CONS, Cleanse, Cleansed CONS, FCST, Weather, Power Demand FCST and the agent SG System, identified by template UUID 00f7a774-eb5a-4c93-9fd9-965a5172ab4c.]

Therefore, instead of storing individual provenance graph instances for each output artifact, we identify and record just one copy of the common provenance information, while separately recording the unique features of each provenance graph instance with respect to this common template. This is a form of normalization [33]. Specifically, a Provenance Template is designed to capture the common provenance information shared by data artifacts and processes that participate in a specific workflow composition.
If a workflow can be described using a directed acyclic graph (DAG) – an assumption discussed in the next sub-section – then we define a provenance template as a schema-level OPM provenance graph, G_T = <A, P, A_G, E>, that captures information shared across provenance graph instances, where:

- A denotes the list of static properties of artifact nodes in the template graph. These include the types of artifacts involved in the workflow, e.g., "CONS", and possibly their labels, e.g., "Power consumption data of the building".
- P is the list of static properties of process nodes in the template graph. They specify the type of the corresponding process, e.g., "FCST". In addition, process values and labels that are static across workflow runs can also be recorded, e.g., a web service bound to a static endpoint.
- A_G denotes the list of static properties of agent nodes in the template, specifying the types of agents controlling the processes, e.g., "SG System". Static properties of agent nodes may also include their values and labels.
- E denotes the list of static properties of edges in the template. These include causal relationship names between artifacts, processes, and agents, and the roles of the edges linked to processes.

Figure 3.3 depicts the provenance template for the provenance graph instances discussed previously in Figure 3.2. The two processes, "Cleanse" and "FCST", and the agent "SG System" all have static values which are thus included in the template. Correspondingly, the features unique to each provenance graph instance are a "diff" of Figure 3.2 and Figure 3.3, and include artifact ids, process ids, agent ids, timestamps, labels and artifact values.

It should be possible to reconstruct the original provenance graph instance from the provenance template and the features unique to the instance. However, a provenance template, by itself, can answer only a limited set of high-level provenance queries, such as What type of processes are used to generate a given type of artifact? More detailed provenance queries will require the actual instances, e.g., What ancestral artifacts were used to generate a given artifact instance? For a given provenance template, the corresponding collection of instance-level information must be available to reconstruct the original provenance graph instance. We define the instance-level provenance information as a record <I_i, I_t, A, P, G, T>, where:

- I_i is a UUID serving as the identity of the provenance graph instance record,
- I_t is the UUID of the corresponding provenance template,
- A is a list of URIs and strings which represent the ids, values, and labels of the artifact instances present in the provenance graph instance,
- P is a list of URIs and optional strings that identify the ids and non-static labels and values for the process instances that are variant across executions,
- G is a list of URIs and optional strings that identify the ids and non-static labels and values for the agent instances that are variant across executions,
- T is a list of timestamps which describe the artifact generation and usage times, and process start/end times.
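To make these two record shapes concrete, the sketch below models them as plain Python data structures. This is an illustration of the definitions above rather than the system's actual storage schema (see Section 3.7); all class and field names are our own.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TemplateNode:
    """Static properties shared by all instances of one template node."""
    node_id: str                  # UUID identifying the node within the template
    kind: str                     # "artifact" | "process" | "agent"
    type_uri: str                 # e.g. "CONS", "FCST", "SG System"
    value: Optional[str] = None   # static value, e.g. a web service endpoint
    label: Optional[str] = None   # static label, if any

@dataclass
class TemplateEdge:
    source: str                   # node_id of the source node
    sink: str                     # node_id of the sink node
    name: str                     # causal relationship, e.g. "used", "wasGeneratedBy"
    role: Optional[str] = None    # e.g. "ccons", "weather"

@dataclass
class ProvenanceTemplate:
    """Schema-level OPM graph G_T = <A, P, A_G, E>."""
    template_id: str
    nodes: list[TemplateNode] = field(default_factory=list)
    edges: list[TemplateEdge] = field(default_factory=list)

@dataclass
class InstanceRecord:
    """Instance-level record <I_i, I_t, A, P, G, T>; the lists follow the
    canonical traversal order of the template graph."""
    record_id: str          # I_i
    template_id: str        # I_t
    artifacts: list         # ids/values/labels of artifact instances
    processes: list         # ids and non-static labels/values
    agents: list
    timestamps: list        # generation/usage and start/end times
```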
3.3.1 Provenance Storage using Templates

Figure 3.4: Provenance records in iTable and mTable

The provenance template G_T is serialized into a tRecord file, stored in a data store that we call tTable, and identified using a UUID. We also develop a data store, iTable, to record the instance-level provenance information and its mapping to the provenance templates in tTable. Each record in the iTable (iRecord), when correlated with its corresponding template information in the tTable, can be used to reconstruct the original provenance graph. In particular, the lists of artifacts, processes and agents (A, P and G) are stored in the same order as a canonical traversal of the template. This allows us to implicitly determine where the instance values "fit" into the template graph.

To retrieve the instance-level provenance information for any given artifact instance (including output, intermediate, and input artifacts), we develop another reverse lookup table, mTable, to map the artifact instance IDs to the iRecord containing their derivation in the iTable. Each record in the mTable (mRecord) has two attributes: the MD5 hash code of the artifact's URI and the UUID of the iRecord which contains the instance-level provenance information of the artifact. Figure 3.4 shows two iRecords for the instance-level provenance information of the provenance graphs described in Figure 3.2, and part of the records in mTable. We can see that all the instance-level provenance information of a provenance graph instance is contained in a single iTable record. Because the instance-level provenance information of different types of workflows usually has different properties, schema-free tables (also known as NoSQL databases) are used for storage. Details of the system implementation are provided in Section 3.7.

3.3.2 Provenance Instance Retrieval

As shown in Figure 3.4, given the URI of an artifact instance, the provenance retrieval process consists of the following four steps:

1. Query the mTable by using the hash code of the artifact instance's URI, and identify the UUID of the corresponding iRecord in the iTable.
2. Use the UUID of the provenance record to query the iTable for the instance-level iRecord.
3. Use the template UUID contained in the iRecord to query the tTable and deserialize the provenance template graph.
4. Associate the corresponding properties present in the iTable provenance record with the artifact, process, and agent nodes and the causal edges in the template graph. The mapping of attribute values to locations in the template is implicit, based on the canonical traversal of the provenance template graph.

The required provenance graph instance for the given artifact is contained in this reconstructed provenance graph. It may be the whole graph if the given artifact was the output data product, or a subset if it was an intermediate one. To construct the complete provenance graph for the artifact (beyond its immediate ancestors/descendants present in this provenance graph instance), we can repeat this retrieval process recursively.
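The four steps map directly onto key-value lookups. The following sketch uses in-memory dictionaries standing in for the mTable, iTable, and tTable (the deployed system uses Azure storage, Section 3.7) and reuses the record classes sketched above; only artifact binding is shown, with processes, agents, and timestamps bound the same way.

```python
import hashlib

def md5_hex(uri: str) -> str:
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

def retrieve_provenance(artifact_uri, m_table, i_table, t_table):
    # Step 1: hash the artifact URI and look up the iRecord UUID in the mTable.
    irecord_id = m_table[md5_hex(artifact_uri)]
    # Step 2: fetch the instance-level iRecord from the iTable.
    irecord = i_table[irecord_id]
    # Step 3: fetch and deserialize the provenance template from the tTable.
    template = t_table[irecord.template_id]
    # Step 4: bind instance values onto template nodes; the mapping is
    # implicit in the canonical traversal order of the template graph.
    bound = {}
    artifact_nodes = [n for n in template.nodes if n.kind == "artifact"]
    for node, instance_values in zip(artifact_nodes, irecord.artifacts):
        bound[node.node_id] = instance_values
    # (process nodes, agent nodes, and edge timestamps are bound the same way)
    return template, bound
```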
3.3.3 Generating Provenance Template from a Workflow Definition

Workflows are typically composed as a dataflow DAG, or may have more complex control flows. For a workflow with a simple DAG structure and well-specified (typed) inputs, processes and outputs, a provenance template tRecord can be easily generated for the whole workflow by traversing its DAG specification and the properties that are statically bound. Figure 3.3 shows a sample of such a template and we do not describe it further. For workflows containing control flow, such as conditional branches and iterations, we decompose the workflow into sub-workflows that are each DAGs and generate a provenance template for each sub-workflow. (We do not consider complex workflow specifications with event handlers or fault handlers here.) Specifically, we describe two common cases where workflows have a conditional branch or an iterative loop (with the latter being a specialization of the former); a sketch of the branch case follows this sub-section.

A workflow specification containing conditional branches is decomposed into sub-workflows that cover each of the conditional branches, and the regions outside the branches. For example, in Figure 3.5(a), workflow W_1 contains a conditional branch: the artifact with type A_b is consumed by either process P_b or process P_c based on some conditions. We then decompose W_1 into three sub-workflows and generate a provenance template for each sub-workflow. As shown in the figure, provenance template T_11 covers all the artifacts and processes before the conditional branch, while provenance templates T_12 and T_13 each cover one conditional branch.

If a workflow contains an iteration, an individual provenance template is generated to represent one iteration of the loop. For example, in Figure 3.5(b), we divide the original workflow W_2 into two parts. The first part is the sub-workflow before the entry point of the loop, and we generate provenance template T_21 for it. The second part is the loop, for which we generate provenance template T_22, which only describes the provenance for one iteration. If the iteration is executed multiple times during a run of the workflow W_2, a provenance instance record will be generated in the iTable for each iteration, but they will all point to the same provenance template T_22.

We also need to distinguish the case where iterations past the first have access to data generated by the previous iteration, such as an accumulator. For example, in Figure 3.5(b), process P_b may take either of the two artifacts with types A_b and A_d as input, depending on whether it is in the first or a subsequent iteration. Thus T_22 contains A_b, and T_23 contains A_d, as the input to P_b.

Figure 3.5: Example workflows with control flow and their corresponding provenance templates: (a) provenance templates for a workflow containing a conditional branch; (b) provenance templates for a workflow containing a loop
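As a sketch of the conditional-branch case, assuming the workflow DAG is given as node and edge lists with the branch region already annotated (function and variable names are illustrative, not from the thesis):

```python
def decompose_at_branch(workflow_nodes, workflow_edges, branch_artifact, branches):
    """Split a workflow DAG into sub-workflows at a conditional branch.

    branch_artifact -- the artifact consumed by exactly one of the branches
    branches        -- list of node sets, one per conditional branch
    Returns one (nodes, edges) pair per provenance template.
    """
    def edges_within(nodes):
        return [(s, t) for (s, t) in workflow_edges if s in nodes and t in nodes]

    branch_nodes = set().union(*branches)
    # Template T_11: everything upstream of (and outside) the branches.
    prefix = set(workflow_nodes) - branch_nodes
    templates = [(prefix, edges_within(prefix))]
    # Templates T_12, T_13, ...: one per branch, each including the branch
    # artifact so the sub-workflow has a well-defined input.
    for branch in branches:
        nodes = set(branch) | {branch_artifact}
        templates.append((nodes, edges_within(nodes)))
    return templates
```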
3.3.4 Storage Cost Analysis

We provide a simple analysis of the storage costs for a traditional baseline approach for recording OPM provenance instances and compare it with our template-based approach, when collecting provenance for recurrent workflow runs. A detailed empirical analysis is provided in Section 3.8. Table 3.1 lists the basic properties that are stored for each provenance graph instance using a baseline approach. Table 3.2 lists the variables we use in our analysis to describe the average size (in bytes) of these properties. The only addition to the properties described in Table 3.1 is the need to capture the source and sink nodes (artifacts or processes) of the dependency edges. For brevity, we only consider "used" and "wasGeneratedBy" edges between artifacts and processes in our analysis and omit "wasDerivedFrom" and "wasTriggeredBy" edges.

Table 3.2: Variables representing the average size (in bytes) of provenance properties in an OPM graph ("U"s represent UUIDs while "S"s are string types)
- Artifact id: U_a; Process id: U_p; Agent id: U_g; Value: U_v; Edge name: S_e; Edge source and sink: <U_a, U_p>
- Role: S_r; Type: U_ty; Label: S_l; Timestamp: S_ts

For a workflow whose provenance graph contains n_a artifact nodes, n_p process nodes, n_g agent nodes, and n_e edges, storing the OPM provenance graph instance for a workflow execution using the baseline approach takes

B_1 = n_a (U_a + U_ty + U_v + S_l) + n_p (U_p + U_ty + U_v + S_l) + n_g (U_g + U_ty + U_v + S_l) + n_e (U_a + U_p + S_e + S_r + S_ts)

bytes, and for m workflow runs takes B_m = m · B_1.

B_1 calculates the size of the provenance information for a complete provenance graph. This kind of representation can only support provenance queries for the final outputs of the workflow. In order to support provenance queries for intermediate artifacts, an extra index structure (like the mTable in our template-based approach) is required. Alternatively, we can "split" the complete provenance graph into pieces, where each piece only captures provenance for an individual process or artifact contained in the provenance graph; a provenance record then only stores the provenance for a single process. Both methods increase the provenance storage size, thus B_1 only gives a lower bound on the size of the baseline approach.

We then calculate the provenance size of our template-based approach. For simplicity we assume that we generate one provenance template for the whole workflow, and that all the processes and agents in the workflow have static values but different labels in each workflow run. Each tRecord represents a provenance template graph and is identified using a UUID, which takes 16 bytes in compact form. Each node of the template graph has to store the types of the artifacts, processes, and agents, the static process/agent URI values, and a unique UUID for the node. Causal relationship edges are stored by recording the UUIDs of their source and sink nodes, along with the edge type and role. The total size of the tRecord, noted as Λ, is then:

Λ = 16 + n_a (16 + U_ty) + n_p (16 + U_ty + U_v) + n_g (16 + U_ty + U_v) + n_e (16 + 16 + S_e + S_r)

We use the iTable and the mTable to store the instance-level provenance information. For an iRecord, besides the instance-level provenance information, we also store two UUIDs (16 bytes each) for the identity of the iRecord and the identity of its corresponding tRecord (i.e., I_i and I_T). Thus an iRecord costs n_a (U_a + U_v + S_l) + n_p (U_p + S_l) + n_g (U_g + S_l) + n_e S_ts + 16 + 16 bytes. In the mTable, each mRecord costs 16 bytes for the MD5 hash code of an artifact URI, and 16 bytes for the UUID of the iRecord present in the iTable. In total, for m workflow executions, there are around m · n_a mRecords, one for each artifact that participates in the workflow runs. Therefore, the total storage cost of the provenance information collected from m workflow executions is

Ω_m = Λ + m [n_a (U_a + U_v + S_l) + n_p (U_p + S_l) + n_g (U_g + S_l) + n_e S_ts + 16 + 16] + m · n_a (16 + 16)
For a more intuitive comparison of the two approaches, we assign numerical values to the variables (i.e., the average byte sizes) in Table 3.2, and calculate the provenance storage size for different numbers of workflow executions (i.e., different values of m). We use our example workflow as the use case, which contains 4 artifact nodes, 2 process nodes, 1 agent node, and 7 edges. We assume each URI is 67 bytes long (the median URL length according to [6]). Each label, which gives a simple human-readable description of a provenance node, uses 20 bytes on average. We employ ISO-8601 [1] as the standard representation of the time information ("YYYY-MM-DD hh:mm:ssZ"), which takes 20 bytes.

Figure 3.6: The size of provenance information generated by the template-based approach and the baseline approach

Figure 3.6 shows the comparison result when m equals 1, 1K, 100K, and 1M. We can see that the template-based approach decreases the overall size of provenance storage whenever m is greater than 1. When m is greater than 100, the provenance size generated by the template-based approach is around 51% of the size generated by the baseline approach. This amounts to a 49% reduction in the size of provenance storage using our template approach.
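The comparison in Figure 3.6 can be reproduced mechanically from B_m and Ω_m. The sketch below does so under the byte sizes just stated; note that S_e and S_r are not given explicit values in the text, so the numbers here are placeholders and the exact ratio will vary with them.

```python
# Assumed average sizes in bytes (S_e and S_r are our placeholders; see text).
U = 67            # URI-valued properties: U_a, U_p, U_g, U_v, U_ty
S_l = S_ts = 20   # labels and ISO-8601 timestamps
S_e, S_r = 14, 10
n_a, n_p, n_g, n_e = 4, 2, 1, 7   # the example workflow

def baseline(m):
    node = U + U + U + S_l                # id + type + value + label
    edge = U + U + S_e + S_r + S_ts       # src + snk + name + role + timestamp
    return m * ((n_a + n_p + n_g) * node + n_e * edge)          # B_m

def template_based(m):
    t_record = 16 + n_a*(16 + U) + (n_p + n_g)*(16 + U + U) \
             + n_e*(16 + 16 + S_e + S_r)                        # Lambda
    i_record = n_a*(U + U + S_l) + (n_p + n_g)*(U + S_l) \
             + n_e*S_ts + 32                                    # per run
    m_records = n_a * 32                                        # per run
    return t_record + m * (i_record + m_records)                # Omega_m

for m in (1, 1_000, 100_000, 1_000_000):
    print(m, template_based(m) / baseline(m))
```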
3.4 Applying Provenance Template to Stream Processing Workflows

A template-based provenance storage approach can also be applied to a stream processing workflow or pipeline, which is usually composed by connecting multiple processing elements using event streams as channels for data flow [21][29][99]. Figure 3.7 illustrates a typical stream processing element (henceforth called a processing element), which takes k streams as input and produces j streams as output.

Figure 3.7: A stream processing element which takes k streams as input and produces j streams as output

A provenance template can be constructed for the stream processing workflow in a similar manner as before. Each input/output event stream is depicted as an artifact node in the template, and the artifact type describes the type of events generated or consumed by the stream. Process and agent nodes continue to describe the types and references (if static) of processing elements and their agents. Nodes are connected by edges which indicate causal relationships and the roles of source/target nodes in the relationships.

One key difference arises when we consider the fact that file-based workflows are composed of processes with a well-defined set of typed input and output ports, and use/generate a static number of files for each invocation of the process. Processing elements, however, may consume zero or more events from the input stream(s) in order to generate one or more events on their output streams. This makes a processing element a gray box, and the non-determinism in event consumption frequencies by processing elements poses a challenge in defining a single provenance template for an entire stream processing workflow.

Consequently, each provenance record in the iTable (and in the tTable) now only focuses on a single stream processing element, and captures the input events used by the processing element to generate a single output event on one of its output event streams. If multiple output events from the processing element share the same set of input events, we can combine their provenance records into one record. Since we focus on one output stream event for a processing element at a time in the iTable, we do not need to record the reverse mapping in the mTable.

3.4.1 Overhead Analysis

For simplicity, we analyze the overhead for storing the provenance information of m output events generated by just a single stream processing element taking k input streams and generating j output streams. Suppose every output event generated was derived from n input events on average, and a set of n input events only generates one output event on one of the output streams. In the baseline approach, we store an individual provenance graph for each output event, containing, as before, the properties listed in Table 3.1. We use the symbols listed in Table 3.2 to indicate the average size of each property. The total size of the provenance information (in bytes) for the baseline approach to store OPM provenance graphs for m events generated by the processing element is:

B′_m = m [(n + 1)(U_a + U_ty + U_v + S_l) + (U_p + U_ty + U_v + S_l) + (U_g + U_ty + U_v + S_l) + (n + 1)(U_a + U_p + S_e + S_r + S_ts)]

After a calculation similar to the previous section, the size of the provenance template consists of the UUID for the template record, the types of the input and output artifact streams, the descriptions of the processing element and the agent, and the names of the edges and their roles. In addition, a UUID is associated with each template node as before. The total size in bytes of the provenance template, Λ′, is given by:

Λ′ = 16 + (j + k)(16 + U_ty) + (16 + U_ty + U_v) + (16 + U_ty + U_v) + (j + k)(16 + 16 + S_e + S_r)

The total size of the provenance information stored using the template approach for m executions of the processing element is given by the sum of the bytes required to describe: the n input artifacts and one output artifact, including a reference to their stream description in the template using the node UUIDs; the processing element and the agent; the n + 1 edges connecting the input/output artifacts and the processing element; and a 16-byte UUID referring to the provenance template record in the tTable:

Ω′_m = Λ′ + m [(n + 1)(U_a + U_v + S_l + 16) + (U_p + S_l) + (U_g + S_l) + (n + 1) S_ts + 16]

Using the values for the variables as before, we can still achieve nearly a halving of the storage size. For example, considering a case where n = 10 and (j + k) = 5, we observe up to a 53% reduction in the storage size for the template approach.

3.5 Extending Provenance Template for Stream Processing Workflow

In our earlier template-based approach for stream processing, we record the instance-level provenance information for every event and achieve a 50% reduction in storage. However, this may still be insufficient for high frequency streams. Consider the stream processing workflow illustrated in Figure 3.1(b). Even if we only record a URI for each individual event (considering the unbounded length of stream events and the fact that events often have timestamps, in the following discussion we assume we only capture the dependency between event instances as the instance-level provenance information, and for conciseness ignore other annotated information such as labels for each individual event), we would still have to store m(j + 1) URIs, where m is the number of streams and j is the number of events in a time window, to record all the input events of the "Aggregate" process for generating a C_agg event. If the smart meters send events at a high frequency, i.e., when j is large, recording all the event URIs can lead to large provenance storage sizes.

Here, we improve our basic template model for stream processing workflows by 1) identifying processing elements that exhibit an implicit processing pattern, and 2) modeling these processing patterns explicitly within our template. As a result, rather than recording the dependency between input and output events, we can instead use the processing pattern to infer the provenance information.
For example, the power demand aggregation workflow introduced in Section 3.2 has two processing elements, "Pre-Process" and "Aggregate", that operate on events from one or more input streams (C_in and C_p, respectively) and generate events on one output stream (C_p and C_agg). For these two processing elements we observe that the following processing pattern rules hold:

(C_agg(U, t_n, w) wasDerivedFrom C_p(a_k, t_i)) ⟺ (a_k isLocatedIn U ∧ t_n − w < t_i ≤ t_n)   (3.1)

(C_p(a_k, t_i) wasDerivedFrom C_in(a_{k′}, t_{i′})) ⟺ (k = k′ ∧ i = i′)   (3.2)

Rule 3.1 tells us that when retrieving the immediate provenance information for an event C_agg(U, t_n, w), we can look up all the events C_p(a_k, t_i) whose smart meter ID a_k is located in the building U, and whose timestamp t_i is contained in the time window (t_n − w, t_n]. Similarly, according to Rule 3.2, we can determine that each event C_p(a_k, t_i) was directly derived from an incoming event C_in(a_k, t_i) having the same smart meter ID and timestamp. Therefore, for these two processing elements, it is not necessary to record the provenance for each output event derived from the input events, as it can be inferred based on the processing patterns of the processing elements. Just as templates provide a shorthand for describing common provenance attributes of a process across recurrent runs, these processing patterns provide a shorthand for functionally describing the relationships between the input and output events.

Based on these processing patterns, we can use a given event's attributes to query an event repository and identify the input events that derived this event. An assumption here is that the events are recorded in an event repository for archival or regulatory purposes, and can be queried upon. If this assumption does not hold, we can still use our basic template-based approach, i.e., store event URIs as event indicators.

The provenance storage cost can therefore be reduced to storing the descriptions of input and output streams, the static attributes of processing elements, and the processing pattern linking input and output events of each processing element. In particular, we generate an extended provenance template for each processing element whose processing pattern can be described explicitly. The extended template is defined by T = <A, P, A_G, E, I>, where

1. A, P and A_G continue to specify the types and static attributes of the artifacts, the process and the agents for the stream processing workflow.
2. E continues to be the set of edges describing causal relationships.
3. I is a provenance inference function that explicitly describes how to infer the input events from a given derived event.

The provenance inference function is the core improvement in the extended template. It takes a derived event as input and identifies the corresponding ancestral events as its provenance – in effect acting as an inverse function of the processing element. The inference function encapsulates the processing pattern rules, and is defined over the types and other static attributes of the provenance elements.
3.5.1 Provenance Inference Functions

A provenance inference function I for a processing element PE can be defined as

I_PE : E_d → {E_i}   (3.3)

where E_d denotes a derived event and {E_i} denotes the events that were used by the corresponding processing element for E_d's generation. The function can be expressed using a parameterized query that operates over specific attribute values of the derived event. Thus if the derived event E_d has attribute values {α_1, α_2, ..., α_n}, the inference function is expressed as a query Q_PE(α_{i_1}, ..., α_{i_m}), where for each i_k (1 ≤ k ≤ m) we have 1 ≤ i_k ≤ n.

The main effort in defining the inference function lies in mapping the processing patterns to the parameterized query Q_PE. To ease specification by users, we define initial skeletons for these parameterized queries for common categories of processing elements. We discuss the details in the following.

Aggregation: An aggregation processing element generates one output event using a sequence of input events from one or multiple input streams. Computation operators such as sum(), average(), max(), min(), and count() are usually contained in an aggregation element. Event sequences are usually specified in terms of a time window or an event count window over events in the stream. Based on standard window operations on streams, it is possible to provide pre-defined query structures for aggregation processing elements that can be further customized into a parameterized query.

For example, we can use the following SQL-like pseudo code to express the inference function for an aggregation processing element:

SELECT event FROM event_repositories
WHERE event.source IN {list of input streams}
AND event.timestamp <= endTime
AND event.timestamp > endTime − timeWindowWidth   (3.4)

Here the list of input streams and the time window width are statically configured by the user to describe the processing pattern of the processing element and form the parameterized query, while the end time is populated from the derived event's timestamp.

We use the "Aggregate" processing element as an example. The provenance inference function for the "Aggregate" element is defined based on processing pattern Rule 3.1. Figure 3.8 uses a parameterized SQL query to express the inference function for "Aggregate". Three attribute values are first identified for the inference function: the building ID U, the timestamp T of C_agg, and the time window width W. The three question marks contained in the query skeleton are then replaced by U, T − W, and T, respectively.

SELECT event FROM event_repository
WHERE event.type = 'CpEvents'
AND event.smartmeter IN (SELECT smartmeter FROM UtilitySmartMeters WHERE utility = ?)
AND event.timestamp > ?
AND event.timestamp <= ?

Figure 3.8: An example pre-defined query skeleton used to identify the input C_p event subsequence for a C_agg event according to processing pattern Rule 3.1
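Instantiating the skeleton in Figure 3.8 amounts to binding the derived event's attribute values to the placeholders. A sketch using sqlite3 as a stand-in event repository (table and column names follow the figure; timestamps are treated as numeric so that T − W can be computed):

```python
import sqlite3

AGG_SKELETON = """
SELECT event_id FROM event_repository
WHERE type = 'CpEvents'
  AND smartmeter IN (SELECT smartmeter FROM UtilitySmartMeters WHERE utility = ?)
  AND timestamp > ?
  AND timestamp <= ?
"""

def infer_aggregate_inputs(conn: sqlite3.Connection, c_agg: dict):
    """Inference function for the 'Aggregate' element (Rule 3.1).

    c_agg carries the derived event's attributes: the building ID U, the
    window end timestamp T, and the window width W (in seconds).
    """
    u, t, w = c_agg["utility"], c_agg["timestamp"], c_agg["window"]
    # Bind (U, T - W, T) into the three placeholders of the skeleton.
    rows = conn.execute(AGG_SKELETON, (u, t - w, t)).fetchall()
    return [row[0] for row in rows]
```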
Projection/Transform: This kind of processing element projects out part of the attributes of each incoming event, or transforms some of the attribute values. It is a unary operator, thus each output event is mapped to a single input event. We can develop a provenance inference function for a projection/transform processing element when its output event contains enough information to determine the corresponding ancestral input event. To construct the parameterized query for a projection element, users need to specify a set of projected/transformed attributes <a_1, a_2, ..., a_k> (k ≥ 1), which can be used to identify the input event. For example, based on Rule 3.2, the query skeleton for the "Pre-process" element can be expressed as:

SELECT event FROM event_repository
WHERE event.timestamp = ?
AND event.source = ?   (3.5)

For a derived C_p event, we can then use its two attribute values, smart meter ID and timestamp, as arguments for the inference function.

Filter: A filter processing element outputs an incoming event if it satisfies a set of filter conditions. It is also a unary operator. Since the output event is identical to the original input event, we can simply use a set of output event attributes to determine the input event. The corresponding query skeleton is thus similar to the one for the projection processing element.

Combined and User-Defined Processing Elements: We can construct more complicated parameterized queries by combining the corresponding query skeletons for processing elements that perform a combination of basic operations. For example, conditions may be present in processing elements for filtering event streams before aggregating them. A processing element can also contain user-defined transformation logic. The query skeletons for these processing elements are mainly based on their input/output cardinality. For a processing element with an n:1 I/O event mapping, its query skeleton is similar to an aggregation element. The query skeleton for a unary processing element is similar to a filter/projection element. The difference is that instead of following a fixed format to specify event sequence predicates or filter predicates, users can directly use the query language to define more free-style predicates.

3.5.2 Provenance Instance Retrieval

For an output event whose parent stream processing element is associated with an extended provenance template, we only store the mapping between the event and the corresponding provenance template. In our system, we develop an sTable to map the MD5 hash code of the output event ID to the ID of the provenance template. No other instance-level provenance information needs to be recorded for tracking the event dependency.

Figure 3.9: Steps for retrieving provenance information for a C_agg event

Figure 3.9 depicts the steps by which we retrieve provenance information for a C_agg event based on our provenance template approach. To identify the provenance for an output C_agg event, the retrieval service first uses the MD5 hash value of the event ID to query the sTable, and identifies the ID of the corresponding extended provenance template. It then uses this template ID to retrieve the template T from the tTable, which also contains the parameterized query corresponding to the inference function. The event repositories are queried to retrieve the output event C_agg and its attribute values, which are then combined with the parameterized query to form the actual provenance retrieval query. This query is invoked on the event repositories to return the input events C_p that were used to derive the given event C_agg. As discussed in Section 3.3.2, we can reconstruct the complete provenance graph for the given C_agg event by recursively performing a similar provenance retrieval process for each of the input C_p events, and combining all the identified provenance sub-graphs.
3.6 Capturing Provenance Pattern Outliers

Ideally, the provenance retrieved using the inference function for a processing element is consistent for all events derived by that element. However, one of the reasons for collecting provenance is to detect deviations from normal operations, such as event delay or disorder. We thus need to handle cases where the processing element execution fails to follow the expected pattern, i.e., to capture the "outlier" provenance inconsistent with the expected provenance. For example, Rule 3.1 expresses the processing pattern for the "Aggregate" processor, i.e., that all the events C_p(a_k, t_i) whose smart meter ID a_k is located in the utility U and whose timestamp t_i is contained in the time window (t_n − w, t_n] were used to derive the aggregation event C_agg(U, t_n, w). In practice, an event E_d = C_p(a_k, t_i) may arrive late (perhaps due to a transient error when connecting to the smart meter) and not be used in the aggregation result C_agg(U, t_n, w). Therefore, in the real provenance information, the delayed event E_d is not used to derive the event C_agg(U, t_n, w). However, the provenance inferred using the extended template would incorrectly contain E_d, as it would be present in the query results of the inference function. To handle this, we record the difference between the actual events that were used to derive an output event (i.e., the actual provenance information) and the expected events computed by the provenance inference function.

We capture the provenance outliers using a two-stage process. In general, for any processing element that is associated with an extended provenance template, we first record the complete instance-level provenance information for all the output events and store it in a provenance outliers database. These records are initially marked as "incoming".

Next, we distinguish provenance information that is truly an outlier from that following the expected processing pattern. A provenance compressor periodically scans the provenance records marked as "incoming" in the outliers database, and retrieves their extended provenance templates. The compressor then uses the inference function contained in each template to query the event repository and generate the expected provenance. The compressor compares the expected provenance with the actual provenance stored in the outliers database. If the two records are consistent, the actual provenance record is removed from the outliers database. Otherwise, we record the difference between the expected and actual provenance in the outlier database. Each "difference record" can be modeled as a set of <U_e, +/−> pairs, where each U_e indicates the URI of an input event, and "+/−" indicates whether we should add/remove the event from the inferred provenance record. The difference records are indexed by the output event ID.

Note that occasionally, the inconsistency between a real provenance record and a processing pattern cannot be detected in "real time". For example, when an event C_agg(U, t_n, w) was generated, the "Aggregate" processor may not have been aware of the existence of the delayed event E_d, since E_d might not have been stored in the event repository yet. Thus the inconsistency cannot be detected by the provenance compressor at that time. Consequently, we run the compressor periodically over a moving window of events such that all events being processed by the compressor have been stored for a period longer than the maximum TTL (Time To Live). (Events arriving outside the TTL will not be considered by the processing elements and will not be stored in the event repository.)
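The compressor's comparison step, and the correction applied at retrieval time, reduce to set operations over event URIs. A minimal sketch (function and variable names are ours):

```python
def difference_record(actual_inputs: set, expected_inputs: set):
    """Compare actual provenance with the pattern-inferred provenance.

    Returns a list of (event_uri, '+'/'-') pairs: '+' means the event must be
    added to the inferred record (used but not predicted), '-' means it must
    be removed (predicted but not actually used, e.g. a delayed event).
    An empty result means the execution followed the expected pattern.
    """
    diff = [(uri, '+') for uri in actual_inputs - expected_inputs]
    diff += [(uri, '-') for uri in expected_inputs - actual_inputs]
    return diff

def apply_difference(inferred: set, diff) -> set:
    """Reconstruct the actual provenance at query time (see below)."""
    actual = set(inferred)
    for event_uri, op in diff:
        (actual.add if op == '+' else actual.discard)(event_uri)
    return actual
```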
We can continue to use the approaches discussed in Section 3.5.2 to retrieve the provenance for an output event whose processing element is associated with an extended provenance template. The only difference is that once we have inferred the provenance using the template, we need to additionally query the provenance outlier database to check for the existence of a difference record for the output event. If one exists, we need to apply the difference to the inferred record to generate the actual provenance.

3.7 Implementation

We implement our template-based provenance storage on the Windows Azure Cloud platform. Clouds provide several advantages for hosting a provenance service. The schema-less NoSQL table storage provided by Azure Table Storage is well suited for storing provenance templates and instance records that have different numbers of fields. The provenance service running on Cloud virtual machines provides scalable querying through elastic virtual machines and compute co-located with the data. This also helps support a large community of users and eases collaboration across multiple institutions.

Figure 3.10 depicts the architecture of our system, which includes provenance repositories in the Windows Azure Storage, provenance services deployed in the Azure worker virtual machines, provenance collection services (i.e., the provenance collector and provenance compressor) that can run on the client or server side, and client storage libraries that can be integrated into workflow systems and/or stream processing systems.

Figure 3.10: System architecture in the Windows Azure Cloud Platform

3.7.1 Storage Structures

A NoSQL database is required to store the instance-level provenance records in the iTable, since different iTable records may contain different attributes. For this, we use the Azure Table Storage service, which provides a key-value style storage service that is schema-free and scalable. Each iTable record is represented as a group of key-value pairs, where each key/value pair specifies the name and value of an attribute. Since we use the UUID of an instance-level provenance record to query the iTable during provenance retrieval, this UUID is specified as the row key of the table, which is indexed in the Azure Table Storage for efficient searching. Similarly, we implement the mTable, sTable, and the outlier provenance database using the Azure Table Storage. Recall that the mTable and the sTable are both queried with the MD5 hash value of the output event's UUID; we thus combine them into one table in our implementation, with the hash value of the event's UUID used as the row key of the combined table.

We serialize the provenance templates (including the extended provenance templates) and store the tTable in the Azure Blob Storage service. Azure Blob Storage provides a reliable and scalable way to store flat files that can be retrieved using a unique ID. The tTable is indexed over the provenance template UUID. When a template needs to be used for provenance retrieval, the template is retrieved from the tTable and deserialized by a template parser. We also use the Azure Table Storage for the event repository, which stores each event and its attributes as a record in the table indexed by its UUID.
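For illustration, the sketch below stores and fetches an iRecord as a schema-free entity whose RowKey is the iRecord UUID, using the current azure-data-tables Python SDK. The thesis-era implementation used the original Azure Storage Client Library, so treat this as a modern equivalent with placeholder names and values.

```python
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<connection-string>")
i_table = service.create_table_if_not_exists("iTable")

# Each iRecord is a schema-free group of key-value pairs; the iRecord UUID is
# the RowKey, so retrieval lookups hit Azure Table Storage's built-in index.
i_table.create_entity({
    "PartitionKey": "powerDemandForecast",   # e.g. a per-workflow partition
    "RowKey": "d3ee3261-cec0-405c-a526-4b48a8a3d760",
    "TemplateId": "00f7a774-eb5a-4c93-9fd9-965a5172ab4c",
    "Artifacts": "CONS:C1;CleansedCONS:CC1;Weather:W1;PowerDemandFCST:DF1",
    "FCSTLabel": "Model DT",
})

record = i_table.get_entity("powerDemandForecast",
                            "d3ee3261-cec0-405c-a526-4b48a8a3d760")
```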
3.7.2 Provenance Services and Libraries

As shown in Figure 3.10, we set up three different types of services on the Azure platform. The first type of service is the provenance template registration service, via which users register new provenance templates and publish the templates into the tTable. The second is the provenance retrieval service: users submit the UUID of an output event to this service, which queries the databases that we implement in the Windows Azure Storage and reconstructs the provenance graph for the user. The third is the provenance conversion and publishing service. This service can directly publish the template-based provenance information into the provenance repositories. It can also convert OPM-based provenance information to the template-based provenance information; thus this service can be used to compress OPM-based provenance information stored in external provenance repositories.

Besides setting up the provenance conversion and publishing service, we also provide users with two provenance libraries, the provenance collector and the provenance compressor, which directly communicate with the Windows Azure Storage via Azure's Storage Client Library. Users can use these libraries in their workflow systems or stream processing systems. In particular, in a workflow system, the provenance collector collects the template-based provenance information and publishes the provenance into the iTable and the mTable. Also, as we discussed in Section 3.6, for a stream processing system, the provenance collector collects the original provenance information and stores it in the outlier provenance database, and the provenance compressor periodically scans the outlier provenance database and compresses the provenance information based on the comparison between the original provenance records and the provenance records calculated from the corresponding provenance templates.

3.8 Evaluation

In this section, we evaluate the storage scalability and query performance of our template-based storage approaches for recurrent workflows and stream processing workflows, and compare them with corresponding baseline provenance storage mechanisms.

3.8.1 Evaluation Setup

Our experiments are based on a real-world use case from the Los Angeles Smart Grid project [86] in the Energy Informatics domain. The use case contains both recurrent workflows and stream processing workflows, as discussed in Section 3.2. We evaluate our provenance storage approach for recurrent workflows by collecting provenance for a campus power demand forecast workflow used in the project. The workflow has a simplified structure as depicted in Figure 3.3, and is run periodically for each of the 146 buildings on the campus. Our experiments on a stream processing workflow use the aggregation workflow that combines information from sensors in buildings on the campus. The structure of the workflow has been described in Figure 3.1(b).

Given that the workflows and sensors used in the project are still being deployed, we use a simulation harness to simulate the execution of the tasks and processing elements in these workflows and to generate provenance information identical to the real workflows. This allows us to evaluate the provenance management services at the large scales eventually required by the project during operations. Our harness simulates execution of the power demand forecast workflow every 15 minutes for each of the 146 buildings on the campus and generates corresponding provenance information. Similarly, the simulation harness generates the provenance information for the stream processing workflow according to the sensors that have been deployed and are to be deployed in each building.
On average there are 10 sensors in each building that produce an event every minute, and the time window used by the aggregation processing element is 15 minutes.

Our experiments are performed on the provenance management services running on the Windows Azure Cloud platform. We create the iTable, mTable/sTable, and the outlier provenance database in the Azure Table Storage. The tTable is located in the Azure Blob Storage. We also deploy the event repository in the Azure Table Storage. The provenance services, including the Provenance Retrieval Service and the Provenance Publishing Service, are currently deployed into a worker role instance on the Azure platform. Windows Azure does not provide an API for users to directly retrieve the storage size of each table. As a workaround, we use different storage accounts to store the provenance information for the various approaches. This allows us to correlate the account storage usage information provided through the daily Azure billing information.

3.8.2 Recurrent Workflows

For the recurrent workflow, as discussed in Section 3.3.4, we use the OPM provenance storage approach as our baseline. We first measure the time for inserting provenance information into repositories through Cloud services. Specifically, we measure the time for three different types of provenance insertion.

1. Template publish: the template-based provenance information is directly generated by our simulation program and published into the iTable and the mTable.
2. Baseline publish: we publish the OPM provenance instances into an Azure table. Each OPM provenance instance is first represented following the OPM XML schema [3] and sent to the provenance publishing service through the network. The provenance publishing service parses the OPM XML and directly stores the OPM provenance instance into the corresponding Azure table.
3. Baseline conversion: we still transfer the OPM provenance instances to the provenance conversion/publishing service, but the service first converts the OPM provenance instances to the template-based provenance information, and then publishes the template-based information to the iTable and the mTable.

Figure 3.11: Time cost of three types of provenance insertion

We employ the simulation program to simulate one month of executions of the power demand forecast workflow, which generates around 420,000 provenance graph instances. We publish the provenance data from a desktop located on the campus. Figure 3.11 depicts the cumulative provenance insertion time as we keep publishing provenance graph instances into the Azure Cloud repositories. For all three types of insertion, the cumulative insertion time grows linearly with the number of provenance instances. Thus the time for inserting one provenance instance stays constant, and is not affected by the size of the existing storage. This result demonstrates the scalability of the Azure Table storage.

As shown in Figure 3.11, the "baseline publish" insertion takes the longest time (around 600 ms per provenance instance). The time cost for the other two types of insertion, both of which store the template-based provenance information, is much smaller, around 350 ms per instance.
In general, each provenance instance insertion can be divided into three steps: 1) the client side makes a connection and transfers data to the service, 2) the service parses the data from XML to the corresponding Azure table records, and 3) the service inserts the provenance records into the corresponding Azure tables. Figure 3.12 depicts the time cost of these three steps for all three types of provenance insertion. We can see that the main time cost is in the third step, i.e., publishing records to Azure tables. As shown in Figure 3.12, when directly publishing the OPM-based provenance instances, step 3 takes about twice the time compared with the other two types of insertion. This is close to the ratio of the provenance size between the OPM-based approach and the template-based approach (as later shown in Figure 3.13).

Figure 3.12: Time cost of the three steps of provenance insertion

Figure 3.13 shows the size of the provenance information for the baseline approach and our template-based approach. Again, we simulate the execution of the power demand forecast workflow for one month, i.e., 720 hours, and measure the provenance size at different time scales. When storing the OPM provenance graph instances, the one-month executions of the workflow generate more than 3,500 megabytes of provenance information. The template-based approach occupies less than half of that provenance storage (47%). This is consistent with our analysis in Section 3.3.4.

Figure 3.13: Provenance storage size for recurrent workflows when employing the OPM-based and the template-based approaches

For both provenance storage approaches, we also measure the average time cost for retrieving a provenance graph instance given the URI of an output artifact. Figure 3.14 illustrates the provenance retrieval time for different total sizes of the provenance information. Retrieving a provenance graph instance from the template-based provenance storage takes longer, since the process for querying the template-based provenance storage is more complicated: the provenance retrieval service has to retrieve both the provenance template and the instance-level provenance information, from the tTable, the mTable and the iTable.

As shown in Figure 3.14, the provenance retrieval time does not increase as the total size of the provenance information increases. As we introduced before, the Azure Table storage builds an extra index over the "row key" of all the table records. When retrieving a provenance instance from either the template-based storage or the OPM-based storage, every retrieval query and its corresponding sub-queries can be accelerated by this index. Thus the query performance is not greatly affected by the total size of the data in the Table storage.

Figure 3.14: Time cost for retrieving provenance information for recurrent workflows

3.8.3 Stream Processing Workflows

For the stream processing workflow, we employ our extended template approach for the provenance storage. Since the two processing elements contained in the example workflow can both be represented by provenance retrieval functions, the main storage cost is for the outlier provenance information.
Our simulation program simulates the provenance collection for different provenance exceptional ratios e. For example, when e = 5%, 5% of the sensor events will be delayed over the network and we need to capture the outlier provenance for the corresponding descendant events. We compare our approach with a baseline approach which uses event URIs to represent the input/output for each execution of a processing element.

Figure 3.15 shows the size of the provenance information for our extended template-based approach (with different values of e) and the baseline approach. We also measure the size of the events, although the event repository is not included in the provenance storage. In our experiment, we simulate the continuous execution of the stream processing workflow for 24–96 hours, and generate the provenance information for the workflow executions. For the baseline approach, the stream processing workflow may generate a large amount of provenance: 750 megabytes per day. Thus the size of the provenance information is almost half of the size of the workflow output data (i.e., the events). By using the extended template-based approach, the size of the provenance information can be reduced significantly. Recall that even for the outlier provenance information, we only record the difference between the real provenance records and the calculated provenance records. When the exceptional ratio e = 10%, we only need to record 28 megabytes of outlier provenance information per day.

Figure 3.15: Provenance storage size for stream processing workflows

We measure the time cost for retrieving a provenance graph instance for both storage approaches, and show the results in Figure 3.16. For our template-based approach, we assume the exceptional ratio e = 5%. For the baseline storage approach, because the retrieval query can be handled based on the row key of the provenance records, the query time stays constant as the total size of the provenance increases. However, when retrieving provenance instances from the template-based storage, more complicated queries, which are generated by the provenance retrieval functions, need to be processed in the event repository. Multiple event attributes are usually involved in such queries. Since the Azure Table Storage does not support building extra indices over attributes other than the row key, the time cost for handling these queries increases linearly with the total size of the event repository. As illustrated in Figure 3.16, the provenance retrieval time for the template-based storage is much larger than for the baseline storage, and grows with the size of the events and the provenance. On the Windows Azure Storage platform, our template-based approach sacrifices query performance in order to compress the provenance information. To improve the query performance, we could use a database which supports the creation of multiple indices for storing stream events (e.g., the SQL Azure database [5]).

Figure 3.16: Time cost for retrieving provenance information for stream processing workflows
3.9 Related Work

In [30], Chapman et al. propose techniques for compressing provenance information by scanning the provenance database, identifying data artifacts with identical or similar provenance graphs, and removing redundant copies of common provenance graphs shared by multiple artifacts. In a similar vein, our template-based approach also tries to compact provenance storage by capturing the common provenance information. However, in our work, since we are aware of the workflow specification a priori, we do not need to scan the complete provenance database to locate similar or identical provenance graphs. Such scanning is also not efficient for a constantly growing provenance database. Our provenance templates are statically generated from workflow specifications, and they represent the shared provenance structures and other unchanging annotations. Also, our provenance templates can support workflows with complex structures, while in [30] only simple structures like chains are supported.

In [92], Wang et al. provide an interesting model for tracking event dependencies in Century, a stream processing system for the healthcare domain. In [73], Misra et al. introduce further improvements and remaining challenges for provenance management in Century. A Time-Value-Centric (TVC) model is designed in Century, which uses the time information of events to represent a specific processing pattern in the system, i.e., each output event was generated by events whose timestamps were within some specific time window. In fact, some of our motivating stream processing elements exhibit this kind of processing pattern. Compared to TVC, our work provides a more generic solution. Provenance query functions are designed so that users can flexibly express different kinds of processing patterns and utilize these patterns to decrease the size of the instance-level provenance information. More importantly, our approach also takes into account the non-deterministic behavior of the streams and processing elements, and proposes an approach to capture outliers. This ensures that the provenance information collected by our approach always correctly reflects the historical executions.

3.10 Conclusion

In this chapter, we first highlighted the provenance management challenges caused by overwhelming provenance size. A template-based technique, which identifies the common provenance properties and structures for recurrent workflows, was then designed and used for compact provenance storage. Our analysis and evaluation results show that the template-based technique can eliminate more than half of the original OPM-based provenance size.

Based on the observation that the causal relationship between the input and output events of a stream processing element can often be inferred from its implicit processing pattern, we extended our basic template-based technique to use a provenance inference function as the explicit specification of the processing pattern. The instance-level provenance is then computed during the provenance retrieval procedure and thus can be omitted from storage. Since the computed provenance may be inconsistent with the real provenance because of processing exceptions such as event delay and disorder, we also designed a strategy to capture the difference between the expected and real provenance information. Our evaluation results demonstrate a significant decrease in provenance storage overhead.

We implemented a provenance repository on the Cloud platform based on our compact storage algorithm. We also developed a provenance query service which retrieves provenance from the repository and reconstructs OPM-based provenance graphs.
Since our techniques are compatible with the Open Provenance Model community standard, we can easily integrate them into a large-scale environment with existing provenance systems and repositories. However, more advanced algorithms are then required to enable the query and reconstruction of provenance across distributed repositories. We discuss these distributed provenance query algorithms in the next chapter.

Chapter 4

Querying Provenance Information in Distributed Environments

When provenance is collected from large-scale workflows across organizations and disciplines, two factors cause provenance information to be stored in distributed repositories. The first factor is the affinity of provenance generated for a data product with its logical domain, i.e., individual organizations and domain communities would like to host their own repositories for provenance storage for data products that fall within their area of expertise (which may be different from the "owner" or "producer" of the data products). For example, a meteorology project may collect provenance about weather measurements and forecast models within a single provenance repository shared by that collaboration, while a smart power grid workflow that makes city-scale power usage predictions using those weather forecasts as an input may collect provenance for its results in a separate repository. As another example, in our reservoir production forecast use case from the Smart Oilfield domain (introduced in Section 2.1.1), different repositories may be hosted by the facility management department and the reservoir management department for recording provenance indicating the derivation of surface facility constraints and reservoir deliverability, which are two types of input for the simulation application when calculating the future oil production.

The other factor that causes distributed provenance storage is resource affinity. Some applications in a workflow may be run in a specific environment due to resource dependencies, and the execution environment may be bound to its own provenance collection mechanism. Thus the provenance collected from these applications can be stored in different repositories, collocated with the execution environment that produced the derived data products. For example, a reservoir design workflow invokes a proprietary oilfield simulation tool that is licensed to run only within a particular cluster. The simulation environment may provide the ability to collect and store provenance for all simulation results in a local provenance repository, while provenance collected from other parts of the distributed workflow may go to other repositories.

With the proliferation of provenance repositories hosted for individual organizations or communities, there has been limited effort to reconstruct and query for and on provenance across them. Community standards like the Open Provenance Model (OPM) allow uniform interpretation and exchange of provenance metadata but do not prescribe query or service specifications to access provenance. Since data reuse and sharing across institutions is not always accompanied by passing provenance at the time of data exchange, we need to track the provenance and query for it or over it across distributed provenance repositories.

In this chapter, we present algorithms for querying over distributed provenance information, and address two common provenance query models that we formalize (illustrated with pseudo-SQL after the list):

- Provenance retrieval query: This type of query is used for reconstructing the provenance graph for a data artifact from across distributed provenance repositories. If we use a SQL analogy to describe the query, (distributed) provenance appears in the SELECT clause, and execution of the query results in the provenance being recombined and returned as the result. In our reservoir production forecast use case, a provenance retrieval query could select and return the complete provenance graph for an oilfield forecast result.

- Provenance filter query: This type of query uses provenance information as query conditions when searching for specific artifact instances of a given data type. In a filter query, (distributed) provenance information appears in the WHERE clause when using a SQL analogy, and the clause has to be evaluated over the distributed provenance when executing the query. In our use case, a provenance filter query could select the production forecast values for all forecast simulations where a specific simulation process, which is part of their provenance graph, was used.

We use the reservoir production forecast use case from Section 2.1.1 as our motivating example, and we evaluate the performance of our algorithms using synthetic workflows based on the domain use case.
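For illustration only, the two models could be written in a hypothetical SQL-like syntax as follows; PROVENANCE() and the table/attribute names are invented for this example, and the actual query interfaces are defined in the sections that follow:

SELECT PROVENANCE(f) FROM forecasts f
WHERE f.id = 'oilfield-forecast-42'

SELECT f.value FROM forecasts f
WHERE PROVENANCE(f) CONTAINS process p
AND p.type = 'Reservoir Simulator'

The first query returns a (recombined) provenance graph as its result; the second returns ordinary artifact values, with the provenance graph evaluated only as a condition.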
If we use a SQL analogy to describe the query, (distributed) provenance appears in the SELECT clause, and execution of the query results in the provenance being recombined and returned as the result. In our reservoir production forecast use case, a provenance retrieval query can be to select and return the complete provenance graph for an oilfield forecast result.

Provenance filter query: This type of query uses provenance information as query conditions when searching for specific artifact instances of a given data type. In a filter query, (distributed) provenance information appears in the WHERE clause when using a SQL analogy, and the clause has to be evaluated over the distributed provenance when executing the query. In our use case, a provenance filter query would be to select the production forecast values for all forecast simulations where a specific simulation process, which is part of its provenance graph, was used.

We use the reservoir production forecast use case from Section 2.1.1 as our motivating example, and we evaluate the performance of our algorithms using synthetic workflows based on the domain use case.

4.1 Overview of Our Approach

In general, our work in this chapter covers the following three parts:

The index service for locating provenance information: The first challenge for querying provenance in a distributed environment is to locate the provenance information, i.e., to identify which repository is used for storing the corresponding provenance graph. In our work, we develop an index service to capture and maintain the mapping between an artifact and the repository/repositories storing its immediate provenance details. Besides the metadata recording the artifact-repository mapping, the index service also stores domain metadata as ontologies to describe domain concepts, entities, and their properties and relationships. All the artifacts published in the index service are annotated with the domain metadata.

Provenance retrieval query processing: We develop an algorithm for processing the provenance retrieval query based on the provenance lookup functionality provided by the index service. To reconstruct the complete provenance graph for an artifact across repositories, our algorithm first identifies the repository that stores the provenance information collected from the parent process of the given artifact. Then, in a transitive manner, our algorithm expands the provenance graph and looks up the index service when part of the provenance is not present in the repository under current exploration. We use the Open Provenance Model as a global representation model so that the provenance sub-graphs retrieved from individual repositories can be easily integrated into a complete and unique provenance graph.

Provenance filter query processing: We formally define the provenance filter query in this chapter, and introduce algorithms for processing filter queries in both local and distributed scenarios. The query conditions contained in a filter query can be divided into two categories: 1) the reachability criteria, which specify what paths should exist in the provenance graph between the target artifact and the artifacts/processes mentioned in the filter query, and 2) the property criteria, which specify the conditions that values or properties of individual artifacts and processes in the provenance graph should satisfy.

A three-step approach is used for processing a provenance filter query in a distributed environment. According to the reachability criteria, our algorithm first checks the available provenance templates and identifies which type of workflow may be the appropriate one for generating the target artifacts.
Our algorithm then decomposes the original filter query into a set of sub-queries based on the provenance distribution across repositories. The execution dependency of the sub-queries is also determined in this step. We process sub-queries in individual repositories, transfer their results across repositories according to their dependency, and calculate the final results in the last step.

4.2 Semantic Annotation on Provenance

We use the Open Provenance Model (OPM) as a global provenance model for describing provenance information while transferring it across repositories. On top of the basic provenance information represented by OPM, we add a domain-level abstraction that uses a collection of semantic models to capture the domain knowledge, including domain concepts, domain entities, and their properties and relationships. Artifacts and processes are annotated using domain concepts, such as "Petroleum Reservoir", "Oil Well", and "Original Oil in Place (OOIP)" in the Smart Oilfield domain (see http://en.wikipedia.org/wiki/Petroleum_reservoir, http://en.wikipedia.org/wiki/Oil_well, and http://en.wikipedia.org/wiki/Oil_in_place), as well as domain entities that refer to instances of domain concepts, including physical oilfield entities like real-world oil wells, and logical entities such as simulation scenarios. We use an ontology to express the domain model.

The domain-level abstraction can also be annotated on the provenance template, which we define in Section 3.3 as an outline of the provenance graph structure, without the instance information about actual artifacts and processes. Each provenance node in a provenance template is annotated with its corresponding domain concept, but does not contain details of the data/application instance. Figure 4.1 illustrates a semantically annotated provenance template for reservoir production forecast that is an implementation of the workflow introduced in Figure 2.1. The label of a node is the semantic domain concept that it represents (e.g., the final output is annotated as Production Forecast). Edges from artifacts to processes are Was Generated By relationships, and edges from processes to artifacts are Used relationships. Some parts of this provenance template refer to sub-workflows that support the forecast simulation activity. For example, a sub-workflow composed of Deck Composition, Reservoir Simulator and Normalize takes PVT and Grid Information as input and generates Performance Curves as output.

Figure 4.1: A provenance template for reservoir production forecast

4.3 System Architecture

We illustrate the overall architecture of our system, and the environment within which it operates, in Figure 4.2. While workflows and applications are running in a distributed environment, provenance collected for these executing processes is stored in distributed repositories. The selection of a repository to store provenance may be arbitrary or user-dependent, but often, domain or resource affinity is observed, and provenance from different executions of the same abstract workflow is co-located in the same repository.
Figure 4.2: Overview of System Architecture

Provenance query services are exposed by the repositories. Each repository implementation itself may be different, e.g., a compact provenance storage may be implemented as we describe in Chapter 3, but the service provides at least the ability and interfaces for provenance retrieval and filter queries on provenance graphs that are stored locally in that repository. In the later sections, we introduce the algorithms that our provenance repository implementation uses for executing these queries over provenance graphs that are local or distributed across such repositories.

A provenance index service is created and used to enable the lookup of provenance repositories in the distributed environment. The index service maintains the mapping from an artifact to the repository/repositories that holds/hold the immediate provenance details about the artifact, i.e., the provenance details about the process that derived the artifact. In particular, the provenance index service stores metadata about the following two information models to support distributed provenance querying:

1. Artifact-Repository Mapping Metadata. This allows a lookup of the provenance repository(ies) that store provenance graphs for all data artifacts in the system using a 2-tuple model: <AID, PR>, where AID refers to the ID of the artifact and PR refers to the repositories that store the provenance information collected from the process that derived this artifact. We assume that every data artifact is assigned a global artifact ID, so the AID can be used to uniquely identify an artifact. The location of each repository is recorded as its actual service endpoint URL.

2. Domain Metadata. This utilizes semantic ontologies to describe domain concepts, entities, and their properties and relationships. Artifacts published in the provenance index service are all annotated with the domain metadata.

Given the ID of an artifact, the index service can identify where the artifact's provenance information is stored and how the provenance information can be retrieved. The index service provides detailed information indicating which provenance query service should be used and how to invoke the query service.

The index service can also be used to determine the provenance distribution of a provenance template. Recall that we have used domain concepts to annotate provenance nodes in a provenance template graph. For each domain concept, by querying the domain metadata in the provenance index service, we can identify a group of artifacts that are annotated with the domain concept. Then we query the artifact-repository mapping metadata in the index service to identify the provenance repositories that store provenance information of these artifacts. The provenance distribution of the provenance template can help us process the provenance filter query in a distributed environment. We discuss more details in Section 4.5.

4.4 Provenance Retrieval Query

In this section, we discuss how to process a provenance retrieval query in a distributed environment. In a retrieval query, a user is interested in the complete provenance information of an artifact. For example, in Figure 4.3, the user submits a retrieval query Q_1 for the provenance information of an artifact instance PF (Production Forecast), which is the final result of the reservoir simulation workflow described earlier. Suppose the user submits Q_1 to a query service S_0 (we omit S_0's repository in the figure).
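Since processing Q_1 repeatedly consults the index service of Section 4.3, we first sketch one possible shape for its lookups in Python. This is a minimal sketch under our own assumptions; the class and method names are illustrative, not our actual service interface.

    # Hypothetical sketch of the provenance index service's lookup logic.
    # Names (IndexService, register, lookup_repositories) are illustrative.
    class IndexService:
        def __init__(self):
            self.artifact_to_repos = {}     # AID -> set of repository URLs
            self.concept_to_artifacts = {}  # domain concept -> set of AIDs

        def register(self, aid, repo_url, concepts):
            # Record the <AID, PR> tuple and the domain-metadata annotations.
            self.artifact_to_repos.setdefault(aid, set()).add(repo_url)
            for c in concepts:
                self.concept_to_artifacts.setdefault(c, set()).add(aid)

        def lookup_repositories(self, aid):
            # Where is the immediate provenance of this artifact stored?
            return self.artifact_to_repos.get(aid, set())

        def repositories_for_concept(self, concept):
            # Provenance distribution of a template node: union over all
            # artifacts annotated with the given domain concept.
            repos = set()
            for aid in self.concept_to_artifacts.get(concept, set()):
                repos |= self.artifact_to_repos.get(aid, set())
            return repos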
Figure 4.3: Processing of a provenance retrieval query

To process this query, the first step is to identify the repository that stores the immediate provenance information of PF. Here the "immediate" provenance information refers to the provenance information collected from the parent process of PF, i.e., the forecast process F. By querying the provenance index service, the repository is identified as R_1. The partial provenance information of artifact PF is then obtained by querying the service of repository R_1, i.e., S_1.

To get the complete provenance graph for PF beyond its immediate parent process, S_1 explores the local transitive provenance information, i.e., the artifacts that were "Used" by the parent process and their corresponding provenance information. In our example, this leads to further provenance exploration of artifacts FC (Facility Constraints), RC (Reservoir Capability), and HPD (Historical Production Data), as well as their ancestral artifacts. For each of these artifacts, S_1 checks whether their provenance information is present in the local repository R_1. If the provenance information of an artifact is not present, S_1 marks it when returning the provenance sub-graph to S_0. S_0 then does a lookup in the provenance index service to identify the (other) repositories and continues to explore the newly identified repository. In our example, this is the case for artifacts PC (Performance Curves) and FC, which lead S_0 to explore repositories R_2 and R_3 (by invoking S_2 and S_3).

S_0 repeats this process until all the artifacts are explored or the provenance information of an artifact is not available. Care must be taken here: if the provenance information of an artifact is not present in a local repository, instead of querying the index service immediately, S_0 should first find out whether it has already been retrieved by previous sub-queries. For example, when querying the repository R_2, S_0 finds that the provenance information of artifact PVT is not stored in R_2. However, the provenance of PVT has already been retrieved from R_1 before. Thus S_0 can simply combine the two provenance sub-graphs retrieved from R_1 and R_2. A similar scenario happens for artifact DF (Deck File) when querying repository R_3.

The provenance information obtained at each exploration step is a provenance graph represented by OPM. Once all artifacts are explored, the complete provenance graph is generated as a union of the individual provenance graphs from each repository. The globally unique IDs of artifacts and the OPM as the global provenance model make it possible to integrate distributed provenance information with different provenance models. Provenance information stored in each repository can have its own provenance representation model, and data artifacts can have heterogeneous formats. By using the globally unique ID, an artifact can be uniquely identified, so the provenance index service can maintain and query its mapping to a provenance repository.
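The exploration loop just described can be summarized in a short Python sketch. Here query_local(repo, aid) is an assumed stand-in for a repository's retrieval service: it returns the local OPM sub-graph for an artifact together with the IDs of ancestor artifacts whose provenance is missing from that repository. Neither name is part of our actual implementation.

    # Hypothetical sketch of distributed provenance graph reconstruction.
    # `index` is the index service sketched earlier; query_local is an
    # assumed call against a repository's provenance query service.
    def retrieve_provenance(aid, index, query_local):
        graph = {}          # artifact ID -> OPM sub-graph fragment
        visited = set()
        frontier = [aid]
        while frontier:
            current = frontier.pop()
            if current in visited:
                continue    # provenance already retrieved earlier
            visited.add(current)
            for repo in index.lookup_repositories(current):
                # Returns the local OPM sub-graph rooted at `current`
                # plus IDs of ancestors whose provenance is elsewhere.
                subgraph, missing = query_local(repo, current)
                graph[current] = subgraph
                frontier.extend(missing)
        return graph        # the union of fragments is the OPM graph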
Also, when retrieving provenance information subsets from individual repositories, we can develop adaptors to convert provenance information represented in a localized provenance model (e.g., our template-based provenance information discussed in Chapter 3) into a provenance graph represented by OPM. In this way, we can integrate distributed and heterogeneous provenance information into a complete and unique provenance graph. Moreover, by annotating domain concepts and domain entities on artifacts and processes, we can hide the data heterogeneity and make it easier for domain users to understand the retrieved provenance information.

Our model of provenance graph reconstruction is fairly straightforward, and is used in other provenance systems such as [51][69]. Use of the index service to locate and query the provenance sub-graphs in distributed repositories is an incremental feature as compared to typical, monolithic provenance repositories where all queries operate on the same catalog.

4.5 Provenance Filter Query

A provenance filter query searches for specific data artifacts using provenance information as the filter criteria. Formally, a provenance filter query is defined as a 3-tuple <T_A, M_P, M_A> where

T_A specifies the data type of the query's target data artifact. The data type is usually expressed using the domain concept annotated on the target artifact. For example, T_A can be "Production Forecast" if we are looking for specific reservoir production forecast results.

M_P specifies a set of key-value pairs, where each key indicates an ancestral process that should appear in the provenance graph of the target artifacts, and the corresponding value denotes the query conditions on the process;

M_A specifies a set of key-value pairs, where each key indicates an artifact that should appear in the provenance graph of the target data artifact. Such an artifact can be an input artifact or an intermediate result, or the target artifact itself. The value associated with the key denotes the query conditions on the artifact.

The query conditions in this definition specify the user's search criteria on artifacts and processes that are required to appear in the provenance graph of the target artifact. Query conditions on a process mainly concern its parameter settings, as well as its controlling agents. Query conditions on an artifact mainly concern its properties and values. Since artifacts and processes are annotated with domain concepts and entities in our provenance model, these query conditions can be expressed using the semantic domain models. A query condition can also be expressed by another provenance filter query; in that case, we have a nested provenance filter query.

For example, Figure 4.4 shows the pseudocode of a sample provenance filter query. The target artifact is the "Production Forecast". Users require that the "Lab Test" process should be employed as an ancestral process. The "*" after "Was Generated By" indicates that the "Lab Test" process need not be the direct ancestor of "Production Forecast". A query condition is defined on "Lab Test" that its controlling agent should be "Jim", who may be an experienced test operator.
Similarly, users require that "Raw Historical Data" should be an ancestral artifact, and that the missing ratio of the "Raw Historical Data" should be less than 0.1.

    select "Production Forecast"
    where "Production Forecast" wasGeneratedBy*
            (select "Lab Test" where "Lab Test" wasControlledBy "Jim")
      and "Production Forecast" wasDerivedFrom*
            (select "Raw Historical Data"
             where missing ratio of "Raw Historical Data" < 0.1)
      and "Production Forecast" wasDerivedFrom
            (select "Facility Constraints"
             where "Facility Constraints" wasDerivedFrom
                 (select "Production Schedule"
                  where publishing date of "Production Schedule" is after 01/01/2011))

Figure 4.4: Pseudocode of a sample provenance filter query

In this sample query, a sub-filter-query is used to define the query condition on "Facility Constraints". According to the query condition, there should exist a path "Production Forecast" → "Facility Constraints" → "Production Schedule" in the provenance graph.

M_P and M_A contain two kinds of criteria. The first kind specifies what artifacts and processes should appear in the transitive provenance of the target artifact, and thus defines the reachability criteria for the provenance graph, i.e., there should exist paths in the provenance graph between the target artifact and the artifacts/processes mentioned in the filter query. For example, the sample query requires that, in the provenance graph, "Raw Historical Data", "Lab Test", and "Facility Constraints" should all be reachable from the target artifact "Production Forecast". Also, "Production Schedule" should be reachable from "Facility Constraints". The reachability criteria can be analyzed on a provenance template graph. The second kind of criteria specifies the conditions that values or properties of individual artifacts and processes in the provenance graph should satisfy. The query condition that the missing ratio of the "Raw Historical Data" should be less than 0.1 belongs to this kind. We call this kind of criteria the property criteria and use it to look for specific instances of artifacts and processes.

Recall that the provenance information is distributed across different repositories. Naively, we could retrieve the complete provenance graph for every instance of artifacts of type T_A using the provenance retrieval algorithm in Section 4.4, and then process the provenance filter query in the local repository. Obviously, this brute-force method is inefficient when 1) there are a large number of instances of T_A, 2) the complete provenance graph of T_A is large, or 3) the graph is distributed across a number of repositories.

In general, our approach to processing a provenance filter query includes three steps:

1. Because there may be multiple workflows that are executed to create artifacts of type T_A, we identify the appropriate provenance templates in the first step. In our approach, we use the reachability criteria contained in the filter query for provenance template identification.

2. Because the provenance information can be distributed across multiple repositories, in the second step of our approach, we decompose the original filter query into a set of sub-queries. The decomposition algorithm is based on the provenance distribution across repositories. We also determine the execution dependency of the sub-queries according to the causal relationships in the provenance graph.

3. In the third step, we evaluate sub-queries in individual repositories according to their dependency, transfer the results of sub-queries across repositories, and generate the final results.

In our system, a user submits such a filter query to a provenance query service; a sketch of how a query of this form might be represented follows below.
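To make the 3-tuple definition concrete, the sample query of Figure 4.4 might be represented in code roughly as follows. This is a minimal sketch under our own naming assumptions; FilterQuery and its fields are illustrative, not a schema our system prescribes.

    # Hypothetical in-memory representation of a provenance filter query
    # <T_A, M_P, M_A>; class and field names are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class FilterQuery:
        target_type: str                                          # T_A
        process_conditions: dict = field(default_factory=dict)    # M_P
        artifact_conditions: dict = field(default_factory=dict)   # M_A

    # The sample query of Figure 4.4: values may be property predicates
    # or nested FilterQuery objects (nested sub-filter-queries).
    sample = FilterQuery(
        target_type="Production Forecast",
        process_conditions={
            "Lab Test": {"wasControlledBy": "Jim"},
        },
        artifact_conditions={
            "Raw Historical Data": {"missing ratio <": 0.1},
            "Facility Constraints": FilterQuery(
                target_type="Facility Constraints",
                artifact_conditions={
                    "Production Schedule":
                        {"publishing date after": "01/01/2011"},
                },
            ),
        },
    )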
The provenance query service acts as a "coordinating" service, which performs the first two steps, i.e., it identifies the appropriate provenance templates and decomposes the query based on the provenance distribution. In the third step, the coordinating service invokes the corresponding remote provenance query services to evaluate sub-queries and generate the final results.

We discuss the details of our approach in the following sub-sections. We first consider a basic scenario where the provenance templates do not contain loops, and then discuss how to handle loops.

4.5.1 Provenance Template Identification

The first step in processing a filter query is to identify the provenance templates for the target artifact. Because different workflows can generate a target artifact, all the corresponding provenance templates should be considered when searching for the instances of the target artifact whose provenance graph satisfies the query conditions. As we have discussed in Section 3.3, provenance templates can be derived directly from workflow specifications. We assume we can obtain all the corresponding provenance templates (or workflow specifications) of the target artifact (e.g., all the workflow specifications are well maintained in a repository).

Among all the possible provenance templates of the target artifact, we search for templates that satisfy the reachability criteria of the filter query. By interpreting the filter query, the reachability criteria can be represented by a group of paths that should appear in the provenance template graph. For example, the reachability criteria of the sample query illustrated in Figure 4.4 require the following paths to be contained in the provenance graph:

1. "Production Forecast" → "Lab Test"

2. "Production Forecast" → "Raw Historical Data"

3. "Production Forecast" → "Facility Constraints" → "Production Schedule"

In (3) we use a path that contains three artifacts in a sequence to represent the nested sub-filter-query. We then go through all the possible provenance templates of the target artifact, filtering out templates that do not include these paths. To check whether a provenance template graph includes a path, we employ a graph database like Neo4j [2] to store the provenance template graph, and use the graph traversal API provided by the database to search for the path. We can also utilize existing techniques, such as the provenance backward query discussed in [60] or the "INCLUDE PATH" operator discussed in [58].

4.5.2 Processing a Provenance Filter Query in a Local Repository

Figure 4.5: A provenance template graph of "Production Forecast"

Before we consider the distributed scenario, we first discuss how to process a provenance filter query in a local repository. Figure 4.5 illustrates a provenance template graph that satisfies the reachability criteria of the sample query in Figure 4.4. We use compound lines to highlight the artifacts and processes referenced by query conditions. Suppose all the corresponding provenance information, i.e., all the instances of this provenance template, is stored in the same repository.

We first evaluate all the query conditions belonging to the "property criteria" on individual artifacts and processes. This can be done in parallel, as the sketch below illustrates.
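As a rough illustration, assuming a hypothetical evaluate_condition(node, condition) helper that issues the corresponding query against the local repository and returns the set of matching instance IDs, this parallel evaluation might look as follows:

    # Hypothetical sketch: evaluate all property criteria in parallel.
    # evaluate_condition is an assumed helper that queries the local
    # repository and returns a set of matching artifact/process IDs.
    from concurrent.futures import ThreadPoolExecutor

    def evaluate_property_criteria(conditions, evaluate_condition):
        # conditions maps a template node (domain concept) to the
        # property condition addressed on it.
        initial_constraint_sets = {}
        with ThreadPoolExecutor() as pool:
            futures = {node: pool.submit(evaluate_condition, node, cond)
                       for node, cond in conditions.items()}
            for node, future in futures.items():
                # Each result becomes the node's initial constraint set.
                initial_constraint_sets[node] = future.result()
        return initial_constraint_sets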
For example, we look for instances of "Raw Historical Data" whose missing ratio is less than 0.1, as well as instances of "Production Schedule" whose publishing date is after 01/01/2011. The evaluation result of a property criterion is a set of artifact instances or process executions satisfying the criterion, which can be represented as a set of IDs in our system.

We define the set of instances of an artifact/process that satisfy all the query conditions in the filter query as the constraint set of the artifact/process. The set of instances derived from the property criteria can be seen as the initial constraint set of an artifact/process. If an artifact/process is not referenced by any property criterion, a set containing all its instances is assigned as its initial constraint set.

We need to propagate the constraints of ancestral artifacts/processes to the target artifact, along all the possible paths in the provenance template graph. For example, if the initial constraint set of "Lab Test" is {L_1, L_2, L_3}, which means that three executions of the "Lab Test" process, L_1, L_2, and L_3, satisfy the query condition "was controlled by Jim", we should propagate this constraint along the path "Lab Test" ← "PVT" ← "Integration" ← "Reservoir Capability" ← "Forecast" ← "Production Forecast". Intuitively, we first identify instances of "PVT" that were generated by L_1, L_2, or L_3. Then, in a transitive way, we identify instances of "Production Forecast" that were derived from the identified instances of "PVT". Also, we need to propagate the constraint of "Lab Test" along the path "Lab Test" ← "Facility Constraints" ← "Production Forecast".

In our algorithm, we do a topological sorting of the provenance template graph: if a node X is an ancestor of node Y, X comes before Y in the sorting. We calculate the constraints of artifact/process nodes according to the sorting. For a node without incoming edges in the provenance graph, its constraint set is its initial constraint set. When we have computed the constraint set of an artifact/process, we propagate the constraint to its child nodes. This propagation is done using provenance forward queries [60]. For example, to propagate the constraint of "Lab Test" to "PVT", we issue a set of provenance forward queries originating from L_1, L_2, and L_3, to look for instances of "PVT" that were generated by L_1, L_2, or L_3. The constraint set of a child node is calculated by intersecting its initial constraint set with all the constraints propagated from its parents. The constraint set of the target artifact is the final result of the local provenance filter query.

4.5.3 Processing a Filter Query across Distributed Repositories

We discuss how to process a provenance filter query across distributed repositories in this section. We have already identified the provenance templates that satisfy the reachability criteria. We now annotate the provenance distribution on the provenance templates. Since the provenance template has been annotated with domain concepts, by querying the artifact-repository mapping metadata and the domain metadata in the index service (see Section 4.3), we can find out in which repositories the provenance information of each artifact/process is stored.

In our experience with Smart Oilfield datasets, we have observed that provenance information for specific subsets of the provenance template tends to be located in the same repository, and the distribution of the provenance for a template across repositories remains static. This is supported by the observation that the provenance information collected from a particular process (e.g., a sub-workflow) across different executions is stored in one or more pre-specified repositories.
Figure 4.6 depicts such an example. The provenance template and the filter query are the same as in Figure 4.5, except that the whole provenance graph is distributed across four different repositories: R_1, R_2, R_3 and R_4.

Figure 4.6: The distributed scenario of the provenance template for "Production Forecast"

We decompose the original query into sub-queries based on the distribution of template subsets across the repositories, and execute each sub-query in the relevant repository (if the provenance information of instances of an artifact/process is stored in multiple repositories, the sub-query is sent to all the relevant repositories). Each sub-query tries to identify the constraint set of an artifact, i.e., the set of instances of the artifact that can appear in the provenance graph of the target artifact.

Sub-queries have dependencies on each other that can be used to efficiently order their execution and avoid unnecessary sub-query executions. For example, in the scenario illustrated in Figure 4.6, as we discussed for the local-repository scenario, before we search for the instances of the target artifact "Production Forecast", we need to first identify the constraint sets of "Historical Production Data", "Reservoir Capability", and "Facility Constraints", since "Production Forecast" was directly derived from them. Because the provenance information of "Historical Production Data" and "Facility Constraints" is stored in the remote repositories R_1 and R_4, and there are query conditions addressed on their ancestors, individual sub-queries should be generated and executed in R_1 and R_4 so as to identify the constraint sets of "Historical Production Data" and "Facility Constraints". The results of these sub-queries, i.e., the constraint sets of "Historical Production Data" and "Facility Constraints", should then be propagated to R_2 and used as query conditions of the sub-query that identifies the constraint set of "Production Forecast".

The scope and order of execution of a sub-query form a query unit that can be completely executed within a single repository. The query unit consists of the repository where the sub-query will be executed, the target artifact of the sub-query, the range set that contains all the corresponding types of artifact/process nodes, and the dependency information between this sub-query and other sub-queries. Query units are identified by decomposing provenance templates based on the repositories. In our running example in Figure 4.6, since we have 4 different provenance repositories, we initially have 4 query units, one each for repositories R_1, R_2, R_3, and R_4, with target artifacts "Historical Production Data", "Production Forecast", "Performance Curves", and "Facility Constraints", respectively. The range set of each query unit consists of all the artifact/process nodes stored in the corresponding repository. For example, the range set of the query unit with target artifact "Historical Production Data" in R_1 is the set <"Historical Production Data", "Cleanse", "Raw Historical Data">. Moreover, the query unit with target artifact "Production Forecast" in R_2 depends on the other query units.
Formally, a Query Unit is defined as a 4-tuple U = <R, S, T_A, D_U> in which

R is the provenance repository where the sub-query is to be executed;

S is the range set;

T_A is the target artifact of the sub-query;

D_U is the dependency set of the query unit.

According to the above definition, the 4 initial query units can be expressed as:

1. U_1 = <R_1, <"Historical Production Data", "Cleanse", "Raw Historical Data">, "Historical Production Data", <>>

2. U_2 = <R_2, <"Production Forecast", "Forecast", "Reservoir Capability", "Integration", "PVT", "Lab Test">, "Production Forecast", <U_1, U_3, U_4>>

3. U_3 = <R_3, <"Performance Curves", "Normalize", "Simulation Results", "Reservoir Simulator", "Deck File", "Deck Composition", "Grid Information">, "Performance Curves", <U_2>>

4. U_4 = <R_4, <"Facility Constraints", "Optimization", "Facility Profiles", "Production Schedules">, "Facility Constraints", <U_3>>

However, it may happen that multiple copies of an artifact exist in different repositories. For example, the provenance information of artifact "PVT" is stored in repository R_2. Because "PVT" is also used as an input for the process "Deck Composition", whose provenance information is stored in R_3, there exists a duplicate copy of artifact "PVT" in repository R_3. Similarly, artifact "Performance Curves" also has duplicate copies in repositories R_2 and R_3. In such a case, the two query units U_2 and U_3 will depend on each other and we cannot determine the execution sequence of the two corresponding sub-queries.

Consequently, we first identify the repository that contains the parent process of this kind of artifact. The query unit of this repository is further decomposed into smaller query units in such a manner that no two artifacts that have a duplicate copy exist in the same query unit. In our example, the query units U_2 and U_3 for repositories R_2 and R_3 will be further decomposed. The query units after decomposition are:

1. U_1 = <R_1, <"Historical Production Data", "Cleanse", "Raw Historical Data">, "Historical Production Data", <>>

2. U_21 = <R_2, <"Production Forecast", "Forecast", "Reservoir Capability", "Integration">, "Production Forecast", <U_1, U_22, U_31, U_4>>

3. U_22 = <R_2, <"PVT", "Lab Test">, "PVT", <>>

4. U_31 = <R_3, <"Performance Curves", "Normalize", "Simulation Results", "Reservoir Simulator">, "Performance Curves", <U_32>>

5. U_32 = <R_3, <"Deck File", "Deck Composition", "Grid Information">, "Deck File", <U_22>>

6. U_4 = <R_4, <"Facility Constraints", "Optimization", "Facility Profiles", "Production Schedules">, "Facility Constraints", <U_32>>

Algorithm 1 illustrates the pseudocode for generating query units from a provenance template and the original filter query. Before we execute the algorithm, we first create a hash table that records the repositories where each provenance node is stored. Then, in the function Create_Query_Unit, we can check the hash table to verify whether an artifact has duplicate copies in multiple repositories. The function Create_Query_Unit runs recursively when processing an artifact that has duplicate copies in multiple repositories.
For each duplicate artifact, a new query unit is created with this artifact as the target artifact, and this query unit is added into the dependency set S_d.

Algorithm 1: Generate query units and determine their dependencies

Input: a target artifact T_A to search for, and the provenance template PT (including information about the repositories containing provenance nodes)
Output: a query unit dependency graph

    look up the provenance index service to identify the provenance repository R
        where the parent process of T_A is stored
    U = Create_Query_Unit(T_A, R, PT)
    return Create_Dependency_Graph(U)

Function: Create_Query_Unit(A, R, PT)
Input: target artifact A of the sub-query, provenance repository R where the parent
    process of A is stored, and the abstract provenance graph PT
Output: query unit U

    create and initialize the constraining set S_p = {A}, the dependency set S_d = ∅,
        and a variable S_t = ∅ (S_t stores artifact nodes that have duplicates
        in multiple repositories)
    while true do
        for each provenance node P in S_p do
            for each provenance node P' from which there exists a "WasGeneratedBy"
                    or "Used" edge P' → P in PT do
                if P' is stored in R AND there is no copy of P' in other
                        provenance repositories then
                    S_p = S_p ∪ {P'}
                else if there is a copy (or copies) of P' stored in other
                        provenance repositories then
                    S_t = S_t ∪ {P'}
                end if
            end for
        end for
        if no new node is added into S_p or S_t then
            for each provenance node P_i in S_t do
                look up the provenance index service to identify the provenance
                    repository R_i where the parent process of P_i is stored
                U_i = Create_Query_Unit(P_i, R_i, PT)
                S_d = S_d ∪ {U_i}
            end for
            return U(R, S_p, A, S_d)
        end if
    end while

Algorithm 2: The function Create_Dependency_Graph in Algorithm 1

Function: Create_Dependency_Graph(U)
Input: query unit U
Output: query unit dependency graph G(V, E)

    create empty sets V = ∅ and E = ∅, and create a queue Q
    insert U at the tail of Q
    V = V ∪ {U}
    while Q is not empty do
        retrieve the first element U_f of Q and mark U_f as "visited"
        for each query unit U_i in the dependency set of U_f do
            if U_i is not visited then
                insert U_i at the tail of Q
                V = V ∪ {U_i}
            end if
            create edge e_i = U_i → U_f, E = E ∪ {e_i}
        end for
    end while
    return G(V, E)

Once the query units are generated, we use the function Create_Dependency_Graph, defined in Algorithm 2, to compute the dependency graph between query units. Such a graph, illustrated in Figure 4.7 for our running example, determines the order of execution of the sub-queries and is generated using the dependency set information in the query units. The computational complexity of our algorithm depends largely on the process of scanning the provenance template. We adopt Breadth First Search to scan the provenance template; thus our algorithm has O(|V| + |E|) time complexity, where V is the set of artifact/process vertices and E is the set of edges of the provenance template.

Figure 4.7: Query units derived from Figure 4.6 and their dependencies

4.5.4 Sub-Query Execution

The query unit dependency graph is used to determine the sequence of the sub-query executions. In our work, the query decomposition algorithm is executed in the coordinating service, i.e., the service that receives the filter query from the user. The coordinating service does a topological sorting of the query units based on the query unit dependency graph, and then invokes the corresponding remote filter query services to execute the sub-queries according to the sorting. A sketch of this dispatch step follows below.
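As a rough illustration of this step, the following Python sketch orders and dispatches query units. Here submit_subquery(unit, inputs) is an assumed stand-in for invoking the remote filter query service of the unit's repository, and the deps attribute is our own representation of a unit's dependency set; this is a sketch under those assumptions, not our service implementation.

    # Hypothetical sketch of dispatching query units by dependency order.
    # submit_subquery(unit, inputs) stands in for invoking the remote
    # filter query service of unit's repository; `deps` is an assumed
    # attribute holding the unit's dependency set. Assumes the
    # dependency graph is acyclic (ensured by the decomposition above).
    from concurrent.futures import ThreadPoolExecutor

    def execute_query_units(units, submit_subquery):
        results, done = {}, set()
        with ThreadPoolExecutor() as pool:
            while len(done) < len(units):
                # A unit is ready once all of its dependencies are done;
                # ready units are submitted concurrently.
                ready = [u for u in units
                         if u not in done and all(d in done for d in u.deps)]
                futures = {u: pool.submit(submit_subquery, u,
                                          {d: results[d] for d in u.deps})
                           for u in ready}
                for u, f in futures.items():
                    results[u] = f.result()  # constraint set of u's target
                    done.add(u)
        return results

Processing each wave of ready units concurrently mirrors the parallel strategy described next; replacing the thread pool with one-by-one blocking calls yields the sequential mode.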
In particular, the coordinating service starts the processing from query units that have no incoming edges in the dependency graph, i.e., "leaves" of the graph. Each query unit is represented as a sub-query, which is sent to the corresponding query processing service. Note that a sub-query is also a filter query, so query services can utilize the approach discussed in Section 4.5.2 to process sub-queries. Results of sub-queries returned from remote services are sent to the corresponding query units along the edges in the dependency graph, and are represented as query conditions when we create sub-queries.

We can process sub-queries in a sequential order, i.e., we process sub-queries one by one and the invocation of query services is in a blocking mode. Because we have done a topological sorting of the dependency graph, when we prepare to process a sub-query, all its dependencies have already been satisfied. More efficiently, we can process sub-queries in parallel, i.e., we create multiple threads/processes to handle sub-queries simultaneously. When we find that the dependencies of a sub-query have been satisfied, we create a new thread/process that invokes the corresponding service to handle the sub-query immediately.

4.5.5 Handling Loops

A provenance template may contain complicated control flows, such as loops, that need to be handled for a filter query. Figure 4.8 illustrates a provenance filter query on a provenance template that contains a loop structure. After process P_2 generates artifact B, process P_3 may use B again to start a new iteration. Suppose a filter query is searching for specific instances of artifact A, and the query specifies that the complete provenance information of A should contain artifacts B, C, and D, and that every instance of B, C, and D should satisfy specific conditions.

Figure 4.8: A provenance template with a loop

When the provenance information about the artifacts and processes involved in the loop is stored in the same repository (Figure 4.8(a)), we can use the same algorithm discussed previously to process the filter query. For example, the filter query in Figure 4.8(a) is decomposed into three sub-queries, each of which is executed in an individual repository. The only difference is the execution of the sub-query on the repository R_2, which calculates the constraint set of artifact B. As in the scenario without a loop, we still propagate constraints by doing provenance forward queries along the path P_3 ← C ← P_2 ← B. But when we have computed the constraint set of B, besides propagating the constraint to P_1 in R_1, we also need to propagate the constraint back to P_3, since instances of B contained in the constraint set can also be inputs of P_3. We repeat the iteration until the newly calculated constraint set of an artifact/process is empty.

The provenance information from the loop structure may also be distributed across different repositories. As depicted in Figure 4.8(b), the information about P_2 and B is stored in R_2, while the information about P_3 and C is stored in R_4. In this distributed scenario, the propagation of constraints from B back to P_3 needs to happen across repositories.

4.6 Evaluation

We evaluate the performance of our algorithms using provenance information collected from synthetic workflows based on the Smart Oilfield domain. The provenance data collected from the workflows is stored in distributed databases. We evaluate the performance of our algorithms for both provenance retrieval and filter queries, and discuss the factors that affect the performance.
4.6.1 Experiment Setup

We create a synthetic workflow based on the pattern of a real Smart Oilfield workflow: the reservoir production forecasting workflow introduced in Section 2.1.1. Figure 4.9 shows its structure. In this figure, each node is either an artifact or a process, and the edges between nodes represent their relationships.

As we have discussed, in practice, provenance information collected from specific subsets of the provenance template (e.g., sub-workflows contained in the whole workflow) tends to be located in the same repository, and the distribution of the provenance for a template tends to be static. In our experiment, we distribute provenance according to the decomposition of the workflow: provenance of the same sub-workflow is stored in the same database. Based on different granularities, we decompose the synthetic workflow into different numbers of sub-workflows. For example, if we consider the reservoir production forecasting workflow as a composition of 5 sub-workflows, we distribute provenance into 5 different databases: provenance collected from each sub-workflow is stored in an individual database. If we further decompose each sub-workflow into several fine-grained sub-workflows, and store provenance of the fine-grained sub-workflows in different databases, we can distribute provenance across more databases. In our experiment, we simulate different provenance distribution scenarios with 5, 10, 15, and 20 databases. Figure 4.9 shows the scenario where provenance information is stored in 5 databases; different colors indicate the different databases.

Provenance information is collected from multiple executions of the workflow. In each execution, we randomly assign values for the properties/values of artifacts and the parameters of processes within a pre-specified range, so each execution has different input and generates different output. The collected information is represented using the semantic provenance model and stored in distributed databases (for simplicity, the basic provenance information is directly stored as OPM graphs, and we do not employ our compact provenance recording algorithms for compression in the experiment). The databases are located across five servers on a campus, connected by a 10 Mb/s network. Each server has 4 processors with a 2 GHz CPU and 8 GB of memory, running Linux with kernel version 2.6.18. We use MySQL 5.1 to store the provenance. Provenance query services are deployed as RESTful services on each server.

Figure 4.9: Structure of the synthetic workflow

When creating provenance filter queries for evaluation, we must carefully design the number and the distribution of query conditions so that our approach is evaluated in a genuinely distributed scenario. We try to avoid the situation where a filter query can be evaluated in a single database. For example, in the provenance distribution scenario in Figure 4.9, if all the query conditions are addressed on red nodes, the whole filter query only needs to be evaluated in the red database. To avoid this situation, we should have enough query conditions and distribute the query conditions widely across databases.

In our experiment, we create two filter queries, and evaluate both through all the different provenance distribution scenarios.
The first filter query returns around 1% of the target artifact instances as the final query results, and the second filter query returns about 5% of the target instances. Each filter query is defined such that every database is hit by at least one sub-query. Note that a database is hit by a sub-query if 1) a query condition is addressed on nodes in this database, or 2) constraints are propagated to nodes in this database. Therefore, we create a group of query conditions and address part of the conditions on leaf provenance nodes. More than 10 query conditions are created and distributed over different artifacts/processes for each query. Query conditions specify property criteria on artifact properties or process parameters. If a database is not referenced by a query condition, we make sure that constraints of other provenance nodes will be propagated to it. This can be achieved by addressing some query conditions on leaf provenance nodes: when a leaf node is referenced by a query condition, the derived constraints are propagated along all the paths from the leaf node to the target artifact, and the constraint propagation generates sub-queries that hit all the databases along the paths.

Sub-queries in both the retrieval query evaluation and the filter query evaluation are executed in a sequential order by default, unless we state explicitly that we execute them in parallel.

4.6.2 Baseline

When evaluating the performance of the provenance filter query, we compare our algorithm with a baseline approach. In the baseline, for every instance of the target artifact type, we identify its complete provenance information by sending queries to the provenance retrieval query service, and then check whether its provenance information satisfies the query conditions defined in the provenance filter query. To prune the search space, in the baseline approach, we first check each sub-provenance-graph to see if its artifacts and processes satisfy the query conditions, and if not, we do not retrieve the remaining parts of its complete provenance graph.

4.6.3 Results and Discussion

Each experiment scenario in our evaluation can be expressed using a pair of parameters (D, W), where D is the number of provenance databases and W is the number of workflow executions, i.e., the number of instances of provenance graphs. For example, (D, W) = (5, 2000) indicates that we run the workflow 2000 times and distribute the provenance information across 5 databases. The number of provenance sub-graphs in each database is then 2000. If we distribute the provenance information across more databases, the size of each sub-graph gets smaller but the number of instances in each database remains 2000. In general, D and W determine the distribution and the total number of instances of the provenance graph, respectively. In our experiments, D is varied between 5 and 20 databases, and the value of W is varied between 500 and 5000 workflow executions.

4.6.3.1 Provenance Retrieval Queries

Figure 4.10 illustrates the time performance of provenance retrieval queries with different numbers of provenance instances and databases. When the number of databases is fixed, increasing the number of provenance instances causes an increase in the query time. We also find that more databases lead to shorter query times when the number of instances of the provenance graph is fixed. When all the provenance information is stored in a single database, the query time is much longer than in the distributed scenario.
This is because when the number of nodes in the provenance graph is fixed (for a given template), increasing the number of databases decreases the number of provenance nodes stored in individual databases, thus accelerating the processing of sub-queries in each database.

Figure 4.10: Time for processing provenance retrieval queries with different sizes of provenance and different database distributions

In Figure 4.11 we also measure the time spent on network transfer, and compare it with the time spent on local queries. Each rectangle in the figure shows the measurement with a specific number of databases (5-20) and a specific number of workflow executions (2000 or 5000). We can see that the time spent on network transfer remains static at a relatively small value as we increase the number of databases. In our experiment, when we transfer a subset of the provenance graph across repositories over the network, we transfer its OPM representation, which does not impose much network load. Deploying the databases in a wide-area network may cause larger network latency between databases; a larger number of databases would then lead to an increased time cost on the network.

Figure 4.11: Time spent on local queries and the network

4.6.3.2 Provenance Filter Queries

Figure 4.12 illustrates the time performance of the two provenance filter queries when the number of workflow executions is 2000. We compare our algorithm with the baseline algorithm. In the experiment we assume we have already identified the corresponding provenance template and its provenance distribution, and thus only measure the time cost of query decomposition and sub-query executions. When executing the baseline algorithm, we pre-specify the provenance distribution and thus also omit the step of querying the index service. In our algorithm, the time spent on query decomposition varies between 3 and 5 ms, which is far less than the time spent on sub-query executions. In Figure 4.12, the sequential and parallel curves both depict the time performance of our algorithms. The "sequential" algorithm processes the sub-queries one by one by invoking filter query services; the sequence for processing sub-queries is determined by the BFS algorithm. The parallel algorithm processes sub-queries concurrently if they do not have dependencies on each other. We can see that, on average, the time of our algorithms is less than 10% of the baseline algorithm, so our algorithms achieve much better time performance.

Figure 4.12: Time performance of the provenance filter queries with different numbers of databases and W = 2000 ((a) Query 1; (b) Query 2)

Using a decentralized, parallel strategy to process the sub-queries improves the query efficiency, since several sub-queries are executed concurrently. In the scenario depicted in Figure 4.9, the maximum parallelism we can achieve is 5, i.e., we can execute at most 5 sub-queries concurrently. However, when there are dependencies between sub-queries, these sub-queries have to be executed in a particular order. For example, the sub-query that calculates the instances of the target artifact (A_1 in the figure) has to be executed after we have executed all the other sub-queries. This kind of sub-query may become a bottleneck of the total query processing. In our experiment, the sub-queries executed at the end usually take more than half of the total query time. Therefore, the improvement achieved by the parallel algorithm is limited to around 10%-40% for the two filter queries in our experiment.
Figure 4.13: The ratio of retrieved provenance information in the baseline algorithm when processing the first filter query

The query time of the baseline algorithm decreases when the number of databases increases. Recall that the baseline algorithm retrieves and checks the provenance information of each instance of the target artifact type by querying the provenance retrieval query service. Because the time spent on each individual provenance retrieval query decreases when the number of databases increases, the total time of the baseline algorithm decreases. Meanwhile, when the number of databases increases, retrieving a smaller subset of provenance information is sufficient to judge that an artifact instance does not satisfy the query conditions. For example, if all the provenance information is stored in a single database, each access of the provenance retrieval service returns the complete provenance graph, which takes a long query time in the database. But if the provenance information is distributed across multiple databases, each access of the provenance retrieval service only returns a sub-provenance-graph, and thus costs less query time; the first several sub-queries may already suffice to judge that the required artifact instances do not satisfy the query conditions. In Figure 4.13, the curve shows the percentage of the complete provenance graph that we need to retrieve for each artifact instance, on average, using the baseline algorithm before the first provenance filter query can be fully processed. For example, when we distribute the provenance information across 20 databases, the ratio is less than 0.5, i.e., we only need to retrieve less than half of the complete provenance graph for each artifact instance. The curve drops as the number of databases increases. Thus the total size of the provenance information that we need to retrieve in the baseline algorithm decreases when we distribute provenance information across more databases.

Figure 4.14: Time performance for provenance filter queries with different sizes of provenance and different numbers of databases ((a) Query 1; (b) Query 2)

Figure 4.14 depicts the time cost for provenance filter queries with different numbers of provenance instances (i.e., different values of W) and different numbers of databases. Intuitively, a provenance filter query is more complicated than a provenance retrieval query, so the time for processing filter queries is longer than the time for processing retrieval queries. Similar to provenance retrieval queries, the time cost for filter queries decreases when we keep the number of workflow executions unchanged and increase the number of databases. This is mainly because when the number of databases increases, the size of the provenance information in each database decreases; thus, the time for each sub-query decreases in each database.

Figure 4.15: Time spent on the network for provenance filter queries ((a) Query 1; (b) Query 2)

Figure 4.16: Time spent on local queries and the network for provenance filter queries ((a) Query 1; (b) Query 2)

We also measure the time cost of network and local queries in each scenario, illustrated in Figure 4.15 and Figure 4.16. We can see that when the size of the provenance information is large (e.g., W = 5000), the time cost on the network increases when provenance is distributed across more databases. Intuitively, when handling provenance filter queries, the increase in the number of databases generates more sub-queries with dependencies across databases. Because the output of a sub-query may need to be transferred to other sub-queries as their input, the volume of data that is transferred between databases across the network increases.
These factors lead to an increase in the time cost on the network. However, when we propagate constraints across databases, we only transfer a set of artifact IDs, which does not generate much network traffic. Moreover, our experiment runs in a local-area network, so the network latency is not large. Therefore, as shown in Figure 4.16, the query time is mainly spent on local queries, and the total query time decreases when we distribute provenance across more databases. In a wide-area network environment, larger network latency may affect the performance of the filter query, but the network traffic should still remain small considering the size of the constraint sets.

4.7 Related Work

We have listed existing work that studies provenance retrieval query processing, either in a single repository or in distributed environments, in Chapter 2. Our approach for processing provenance retrieval queries across distributed repositories is similar to some of the ideas presented in those works, except that we employ an index service to help identify repositories that store provenance information of interest. In addition, in our approach, we employ OPM as our global provenance schema and develop a semantic domain model as a global data representation vocabulary. This allows us to integrate information from heterogeneous provenance sources.

The provenance filter query has been mentioned before. For example, Sahoo et al. propose an example filter query in [81] when discussing semantic provenance management for e-Science. Also, Queries 5 and 6 in the First Provenance Challenge [4] can be seen as local provenance filter queries with query conditions addressed on one of the ancestors. A number of papers discuss how to process the queries in the Provenance Challenge, such as [45], [72], [32] and [85]. However, these works only provide solutions for the specific queries on a specific provenance template, which cannot be applied to the general case.

In our work, we give a formal definition of the provenance filter query, and provide a general and efficient approach for query processing. We also discuss both the local and distributed scenarios. In our approach, we first interpret the provenance filter query, extracting the reachability criteria and the property criteria. We identify provenance templates based on the reachability criteria, and annotate the provenance distribution on templates by querying the provenance index service. We then decompose a provenance filter query into sub-queries based on the distribution of provenance information, and determine dependencies between sub-queries according to the causal relationships between data objects. This general approach for processing provenance filter queries is one of our main contributions.

We propose the filter query decomposition algorithm and the sub-query processing algorithm in our work to handle queries in a distributed environment. Processing queries over distributed repositories is a classic problem in the distributed database community [94][54][64][88][89][25]. Our work annotates provenance distribution on provenance templates and utilizes the annotated template graphs to conduct the processing of filter queries.

Data warehouses have been used for information integration in large organizations [31][63][36][35]. They use a process of extract-transform-load (ETL) to query data from distributed sources in an organization, perform schema mapping from heterogeneous formats to a global schema, and load the data into a central database for analytics [63]. However, they pose several limitations when used for queries over distributed provenance. Warehouses are designed for frequently analyzed data, such as aggregates, since
Warehouses are designed for frequently analyzed data such as aggregates, since there is a cost involved with downloading the data to a central location. In the case of distributed provenance repositories, unless all information in all repositories is going to be accessed often, there is little value in centrally aggregating that information. Querying for it when necessary is a more efficient model, and it also allows new repositories to be added or removed with limited penalty. This is frequently the case when sharing data between new members of a community or between new domains necessitates sharing of provenance as well. Also, in the presence of a standard community format for provenance, such as OPM, there is no necessity for the schema mapping provided by warehouses.

4.8 Conclusion

In this chapter, we introduced our algorithms for processing provenance queries in distributed environments. We developed a provenance index service to record the mapping between artifacts and the repositories storing their immediate provenance information. Based on the provenance lookup functionality provided by the index service, we enabled the processing of two types of provenance queries in distributed environments: the retrieval query and the filter query. We presented details of the algorithms for provenance query processing, and the performance evaluations based on the Smart Oilfield use case show a marked benefit over the baseline approaches.

A complete provenance graph retrieved across repositories helps domain users get a good understanding of the workflow's big picture. However, in practice, part of the provenance information may be unavailable due to the lack of automatic provenance collection capability. This becomes a hurdle for achieving better data quality control. In the next chapter, we tackle this challenge and discuss how to recover the incomplete provenance information.

Chapter 5

A Semantic-based Approach for Handling Incomplete Provenance

As we have discussed in Chapter 1, some legacy tools that do not support automatic provenance collection are still widely used in the Energy Informatics domain. The manual provenance annotation operations, which are usually tedious and error-prone, can produce incomplete and inaccurate provenance information. For example, in the Smart Oilfield domain, reservoir engineers do not always have complete and accurate provenance information for the data items contained in a reservoir model (recall that a reservoir model is a dataset consisting of a large number of data items), although their provenance is a significant type of evidence for the model's data quality estimation.

In this chapter, we take Smart Oilfield as our motivating domain and introduce a semantic-based approach for predicting the missing provenance information for reservoir models. As defined in existing provenance models such as OPM [77], the provenance of a data artifact has multiple attributes, including the process that generated the data artifact, the agents who controlled the process, and other context information of the derivation history. As the first step of our exploration, we discuss how to predict the missing parent process of a data artifact, i.e., what type of method, application, or system was employed to create the data. The type of process can greatly affect the output quality. Therefore an accurate prediction of a data item's parent process is valuable for data quality estimation. For convenience, in the remainder of this chapter, we say two data items have the "same/identical" provenance information if they were generated by the same type of process.

In the Smart Oilfield domain, multiple data items contained in the same reservoir model often share the same provenance information.
This is supported by the observation that reservoir engineers often employ the same process to create a collection or a series of fine-grained data items, and import them into the same reservoir model. For example, in the use case we introduce in Section 2.1.1, an engineer often utilizes the same process to cleanse the production data of all the oil wells in the same block of the reservoir (a reservoir is divided into multiple blocks, each of which contains a set of wells). The cleansed production data of each well is wrapped as a data item, and is then integrated into the same reservoir forecasting model. Given this observation, when the provenance of a data item is missing, we can identify data items that share the same provenance information, and use their possibly existing provenance for our prediction.

Identifying data items with the same provenance information can be a challenge. We find that data items linked by special semantic "connections" often share the same provenance. In the example mentioned above, the production data is cleansed by the same process if the corresponding wells are located in the same block. In another example, the "original oil in place (OOIP)" estimates (see http://en.wikipedia.org/wiki/Oil_in_place) of two wells are usually computed by the same process when the wells belong to the same region. In fact, these two "connections" reveal hidden data generation patterns, which may reflect the granularity of the data generation algorithms and/or the coherence of certain types of data items. We can capture these special semantic "connections", and use them to identify data items with the same provenance information.

We detect and represent these special "connections" by using semantic associations [11][19]. In our approach, we annotate reservoir models with a semantic domain ontology, which defines domain classes (such as "Well", "Block" and "OOIP Estimate"), domain entities that are instances of domain classes, and domain relationships between classes and entities. Every data item contained in a reservoir model is annotated by a domain entity defined by the ontology. A semantic association is a sequence of relationships that interconnect two domain entities in the domain ontology. Using historical reservoir models that have complete and accurate provenance information, we discover semantic associations between the entities whose annotated data items share the same provenance.

For two data items sharing the same provenance, there may exist multiple associations connecting their annotation entities in the ontology graph. Only some of them consistently imply identical provenance and can thus be used for future provenance prediction. To distinguish these associations from other associations that do not predict provenance equivalence, we employ a statistical approach to calculate confidence values for the discovered associations. The confidence value of an association indicates the probability that two data items, whose annotation entities are connected by the association, share the same provenance. Based on the discovered associations and their confidence values, when given a reservoir model containing data items with missing provenance information, we use a voting scheme to predict the missing provenance.

In practice, the existing provenance for some of the data items can be incorrect due to provenance annotation errors caused by the carelessness of human annotators or a lack of information and knowledge. We thus design an algorithm as an extension of the voting scheme which takes into account not only the association confidence values but also the accuracy of the existing provenance information. The algorithm also outputs a probability value for each prediction result as an explicit indicator of its "trust".
5.1 Overview of Our Approach

Figure 5.1: Overview of bootstrapping and provenance prediction

We define the problem and provide an overview of our approach in this section. We consider a reservoir model as a dataset D_k = {d_1^k, d_2^k, ..., d_n^k}, which consists of data items d_i^k, 1 <= i <= n. The provenance of a data item d_i^k is defined as f_p(d_i^k), which is the information about the process that created d_i^k.

The provenance information for each data item is either available or missing. We define an indicator function f_c : d_i^k -> {0, 1}:

f_c(d_i^k) = \begin{cases} 0, & f_p(d_i^k)\ \text{is missing} \\ 1, & f_p(d_i^k)\ \text{is available} \end{cases}

This definition allows us to divide the data items in D_k into two sets, namely D_k^available = {d_i^k | f_c(d_i^k) = 1} and D_k^missing = {d_i^k | f_c(d_i^k) = 0}. Data items d_i and d_j have the same provenance if f_p(d_i) = f_p(d_j), i.e., the process creating d_i and d_j is of the same type.

The overview of our approach is illustrated in Figure 5.1. To bootstrap our system, we analyze historical datasets (D_1, D_2, ..., D_m in the figure), which are reservoir models with complete and accurate provenance information, so as to discover semantic associations that imply identical provenance. In particular, the bootstrapping contains three steps:

1) Annotation. We annotate historical datasets by using a semantic domain ontology. Every data item is annotated by a domain entity defined in the ontology. For example, in the figure, the data item d_1^1 is annotated by entity e_1, and d_2^1 is annotated by e_2.

2) Association detection. We first identify data items in each dataset that share the same provenance information. In Figure 5.1, data items d_1^1 and d_2^1 share the same provenance in dataset D_1, and data items d_5^2 and d_{n'}^2 share the same provenance in dataset D_2. We use dark colors to illustrate them and their annotation entities. Once these pairs of data items are detected, we identify the semantic associations in the ontology graph between the domain entities that the data items are annotated with. Each association is represented as a sequence of relationships. In our example, two associations A_1 and A_2 are discovered between e_1 and e_2, and A_2 also connects e_4 and e_5.

3) Confidence assignment. Not all the discovered associations can be perfectly used for future provenance prediction. For example, A_1 also connects e_3 and e_4, whose annotated data items d_1^2 and d_5^2 do not share the same provenance. For each discovered association, by analyzing all the historical datasets and provenance information, we compute confidence values to reflect the probability that two data items whose annotation entities are connected by the association share the same provenance. In our example, we suppose the confidence of association A_1 is 0.35, and the confidence of A_2 is 0.95.
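Before moving on, the following minimal Python sketch grounds the notation introduced above: a data item, the provenance function f_p, the availability indicator f_c, and the split into the available and missing subsets. The class and field names are illustrative only, not part of the actual system.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DataItem:
    item_id: str
    entity: str                        # annotation entity e_i = f_a(d)
    provenance: Optional[str] = None   # process type = f_p(d); None if missing

def f_c(d: DataItem) -> int:
    # Indicator function: 1 if provenance is available, 0 if missing.
    return 0 if d.provenance is None else 1

def split_dataset(dataset):
    # D_available and D_missing, as defined above.
    available = [d for d in dataset if f_c(d) == 1]
    missing = [d for d in dataset if f_c(d) == 0]
    return available, missing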
After the bootstrapping procedure, we utilize the discovered associations and their confidence values to predict the missing provenance for data items in a dataset (data item d_1 in dataset D in our example). Each data item in D is also annotated by a domain entity defined by our domain ontology. As illustrated in the figure, d_1 is annotated by e_6, d_2 is annotated by e_8, and d_j is annotated by e_7. For a data item with missing provenance, we identify domain entities that connect to its annotation entity through semantic associations discovered in the bootstrapping procedure. Data items annotated by these entities are likely to share the same provenance as the one with missing provenance. Over their existing provenance, we run a voting algorithm based on the association confidence values to generate our prediction. In the example, we find that e_6 is connected with e_8 by the association A_2, and connected with e_7 by A_1. Because the confidence value of A_2 is higher than that of A_1, in the voting procedure d_2 will "beat" d_j, and we take the provenance of d_2 as our prediction.

Note that the example is just a naive case and we ignore other associations and entities. In practice, since the existing provenance information may be incorrect, our algorithm also considers the accuracy of the existing provenance when generating the final prediction. We discuss the details of each step of our approach in the following sections.

5.2 Semantic Annotation on Reservoir Models

We use the same semantic domain ontology that we introduced in Section 4.2 to annotate reservoir models. Recall that the ontology defines domain classes, such as "Reservoir", "Well", and "OOIP Estimate", and their relationships, such as "WellHasOOIPEstimate" and "ReservoirContainsWell". The ontology also defines domain entities which are instances of domain classes, including physical oilfield entities such as real-world oil wells, and logical entities such as simulation scenarios.

For a reservoir model, we annotate every data item with a domain entity. For example, a data item about the OOIP estimate of a well W_1 is annotated by a domain entity o_1, which indicates an instance of the domain class "OOIP Estimate". The ontology also contains a domain entity w_1 to represent the well W_1, and w_1 and o_1 are connected by a relationship "WellHasOOIPEstimate". Data items in different reservoir models can be annotated by the same domain entity. For example, if two reservoir simulation models both contain a data item about the OOIP estimate of well W_1 (these two data items may have different values), both data items will be annotated by o_1. The annotation function is implemented as a module in the existing Integrated Asset Management (IAM) framework (http://ganges.usc.edu/wiki/Smart_Oilfield) [87], which was developed to integrate heterogeneous data representation formats used by proprietary software systems in the energy domain. We skip the details of the annotation implementation here since it is out of the scope of our work.

We use a function f_a to indicate the mapping between a data item and its annotation domain entity. We define f_a(d_i^k) = e_i if the data item d_i^k is annotated by the domain entity e_i.
5.3 Semantic Association

An ontology graph can be derived from the domain ontology, where each vertex in the graph is a domain entity, and the edges between vertices are the corresponding domain relationships between entities. The annotation procedure maps a reservoir model to a subset of the whole ontology graph: each data item is mapped to a domain entity, and the implicit semantic "connections" between data items can be represented explicitly by the paths between domain entities. This allows us to detect those special semantic "connections" that imply identical provenance information.

We use semantic associations to represent the paths between domain entities in the ontology graph. As defined in [11][19], a semantic association, denoted as A, is a sequence of relationships r_1, r_2, ..., r_n for which there exists a sequence of domain entities e_1, e_2, e_3, ..., e_n, e_{n+1} that makes the sequence e_1, r_1, e_2, r_2, e_3, ..., e_n, r_n, e_{n+1} a path in the ontology graph (similar to [11], we ignore the direction of relationships when we discuss paths in the ontology graph; thus {r_1, r_2, r_2, r_1} is a valid association in Figure 5.2). In this definition, domain entity e_1 is associated with e_{n+1} by association A, which we denote e_1 --A--> e_{n+1}.

Figure 5.2: Using semantic association for provenance prediction (r_1 = WellHasOOIPEstimate, r_2 = RegionContainsWell; in the depicted scenario, f_p(d_1^j) = f_p(d_2^j))

As we have stated, certain semantic associations lead to identical provenance information. Figure 5.2 depicts such a simple scenario. In Figure 5.2, d_1^j and d_2^j are two data items in a dataset D_j with available provenance information, and d_1^k and d_2^k are two data items in another dataset D_k, where the provenance of d_1^k is missing and the provenance of d_2^k still exists. Each data item has been annotated by a domain entity: d_1^j and d_2^j are annotated by o_1 and o_2, and d_1^k and d_2^k are annotated by o_3 and o_4, where o_1, o_2, o_3 and o_4 are OOIP estimates for wells w_1, w_2, w_3, w_4, respectively. Wells w_1 and w_2 belong to region re_1, and w_3 and w_4 belong to region re_2.

Suppose we find that d_1^j and d_2^j have the same provenance information. A semantic association between their annotation domain entities (o_1 and o_2) in the ontology graph, say A = {WellHasOOIPEstimate, RegionContainsWell, RegionContainsWell, WellHasOOIPEstimate}, is illustrated using bold lines connecting o_1 and o_2 in Figure 5.2. If we find that most other pairs of data items whose annotation entities are associated by A also have the same provenance (not depicted in Figure 5.2), we can speculate that the association A is a special association that implies identical provenance information. In fact, this association reveals an implicit data generation pattern: OOIP estimates of wells belonging to the same region are usually generated by the same kind of process. Since o_3 and o_4 are also connected by A, we can predict that the provenance of d_1^k is the same as the provenance of d_2^k. Thus we use the provenance of d_2^k as our prediction for d_1^k.

5.3.1 Detecting Associations from Historical Datasets

We analyze historical datasets to discover all the possible semantic associations that may imply identical provenance information. Each data item belonging to these datasets has complete provenance information and is semantically annotated with domain entities defined by our domain ontology.
Algorithm 3: Semantic association detection for provenance prediction
Input: a set of historical datasets H = {D_k} with complete provenance information
Output: the set Sigma of all semantic associations between domain entities whose annotated data items share identical provenance
1: Sigma <- {}
2: for each dataset D_k = {d_l^k}, D_k in H do
3:   group data items according to their provenance
4:   for each (d_i^k, d_j^k) such that f_p(d_i^k) = f_p(d_j^k) do
5:     identify domain entities e_{i'} and e_{j'} such that f_a(d_i^k) = e_{i'}, f_a(d_j^k) = e_{j'}
6:     detect all the semantic associations {A} such that e_{i'} --A--> e_{j'}
7:     Sigma <- Sigma union {A}
8:   end for
9: end for
10: return Sigma

Algorithm 3 depicts our approach. In Algorithm 3, we first group data items by their provenance information, so that data items with identical provenance are grouped together. For each pair of data items with identical provenance information, we identify their annotated domain entities, and then compute the semantic associations between the domain entities. Computing the semantic associations between two domain entities can be done by using the rho-query defined in [11][19].
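As a concrete illustration of Algorithm 3 (not the actual implementation), the following Python sketch detects associations by enumerating simple paths, up to a maximum length, between the annotation entities of same-provenance item pairs. It assumes the ontology graph is given as an adjacency list {entity: [(relation, neighbor), ...]} with every edge stored in both directions, reflecting the direction-insensitive definition of paths above.

from itertools import combinations

def paths_between(graph, src, dst, max_len=4):
    # Enumerate the relationship sequences along simple paths from src to dst.
    results, stack = [], [(src, [], {src})]
    while stack:
        node, rels, seen = stack.pop()
        if node == dst and rels:
            results.append(tuple(rels))
            continue
        if len(rels) == max_len:
            continue
        for rel, nbr in graph.get(node, []):
            if nbr not in seen:
                stack.append((nbr, rels + [rel], seen | {nbr}))
    return results

def detect_associations(historical_datasets, graph, f_a, f_p, max_len=4):
    # Algorithm 3: collect associations between entities whose annotated
    # data items share identical provenance within the same dataset.
    sigma = set()
    for dataset in historical_datasets:
        groups = {}
        for d in dataset:                     # group items by provenance
            groups.setdefault(f_p(d), []).append(d)
        for items in groups.values():
            for d_i, d_j in combinations(items, 2):
                for assoc in paths_between(graph, f_a(d_i), f_a(d_j), max_len):
                    sigma.add(assoc)
    return sigma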
Among these pairs, we count all the data item pairs (d k i , d k j ) that also satisfy f p (d k i ) = f p (d k j ). The number of these item pairs is denotedasC k identical (A;e i ). Theconfidencevalueofconf(Aje i )isthenestimatedas: conf(Aje i ) = ∑ k C k identical (A;e i ) ∑ k C k total (A;e i ) (5.3) In our bootstrapping process, after we discover a semantic associationA, for each domainentitye i thatisassociatedwithanotherentitybyA,wecomputethecorrespond- ingconf(Aje i ). 5.5 Prediction Afterthebootstrappingprocess,weusediscoveredsemanticassociationsandtheir(con- ditional) confidence values to predict missing provenance. Suppose we are to predict missing provenance for data items in a dataset D. We first annotate every data item in D with a domain entity defined by our domain ontology. Then to predict the missing provenance of a data itemd i annotated by domain entitye i , we retrieve the conditional confidencevaluesofalltheassociationsthatassociatee i withotherentities: v conf i =fconf(A 1 je i ),conf(A 2 je i ), :::,conf(A L je i )g; 116 where L is number of associations that connect e i with other entities. Note that all the conditionalconfidencevalueshavebeencalculatedinourbootstrappingprocess. Next, we spread the conditional confidence values in the ontology graph. For each association A l (1 l L), we tag the conditional confidence value conf(A l je i ), as well as the semantic association A l , to all the domain entities that are associated with e i by A l . At each domain entity e j , in case that multiple associations exist between e i and e j , we tag e j with the highest association confidence value and the corresponding association. ABreadth-Firstsearchintheontologygraphdoesthisprocess. In the dataset D, we identify all the data items satisfying two conditions: 1) their provenance information is still available (i.e., they are contained in D available ), and 2) their annotation domain entities are associated with e i by at least one A l . We use a set S i to indicate all these data items. Each such data item is then tagged by the same association and the confidence value that are tagged to its annotation domain entity. Thesedataitemsarelikelytosharethesameprovenancewithd i . Theconfidencevalues taggedtothemindicatetheprobabilityofsharingidenticalprovenance. Finally, we simply employ a voting process to select the provenance information withthehighestconfidencevalue: everydataitemtaggedbyaconfidencevaluegreater than a threshold votes to its provenance information, using its confidence value as the weightofthevote. Theprovenancewiththehighestvotesisourprediction. 5.5.1 HandlingInaccurateProvenanceInformation Theexistingprovenanceinformationisofteninaccurateduetoerrorsduringthemanual provenanceannotations. Theseerrorsmaybecausedbythecarelessnessofdomainengi- neers during the annotation, or the lack of the information and/or knowledge about the data. The voting process may filter out part of the incorrect provenance. However, the 117 votingprocess does not explicitly takes the provenance accuracy into account. Further- more,itonlygeneratesthepredictionresults,andcannotprovideadditionalinformation to help us judge whether the results are trustworthy. This “trust” information is useful especiallywhentheaccuracyoftheexistingprovenanceislow. Wethusprovideanextendedpredictionalgorithmtopredictthemissingprovenance basedonnotonlytheassociationsandtheirconfidencemeasures,butalsotheaccuracy of the existing provenance. The algorithm also calculates the trust for each prediction result. 
We first estimate the average provenance annotation accuracy, denoted by, by comparingtheannotatedprovenanceforagroupofrandomly-selectedhistoricaldatasets with the correct provenance. Then we still use the same method to identify the set of data itemsS i , which can be seen as the set of candidates for our prediction. We divide S i into subsets, noted asfs 1 i ;s 2 i ;:::;s n i g, according to the associations that are tagged to the data items. Data items tagged with the same semantic association are grouped intothesamesubset. Ineachsubsets l i ,incasethatdataitemshavedifferentprovenance information,weselecttheprovenancethatissharedbythehighestnumberofdataitems astheprovenancerepresentativeofthesubset. Andourfinalpredictionwillbeselected amongalltheprovenancerepresentativesacrosssubsets. A probability value is then calculated as the trust of the representative. The prove- nance with the highest trust is taken as our final prediction. Since the existing prove- nance may be incorrect, for calculating the prediction trust, we consider both the accu- racyofexistingprovenanceandtheassociationconfidencevaluesthataretaggedtothe dataitems. WeuseN l toindicatethetotalnumberofdataitemscontainedinsubsets l i , andfp 1 ;p 2 ;:::;p m g to indicate the set of possible provenance of data items ins l i . The corresponding number of data items for eachp k isn k , denoted byf l n (p k ) = n k , where 1 k m, and n k = N l . Suppose we have n s = max(n k ), and consequently we takep s astheprovenancerepresentativeofs l i . 118 Asthepredictiontrustofp s ,wecalculatetheprobability T(p s ) =P ( f p (d i ) =p s jf l n (p s ) =n s ) (5.4) whered i isthedataitemwiththemissingprovenanceinformation. Equation5.4calcu- lates the probability that the provenance ofd i isp s when we find that there aren s data itemsins l i sharingthesameprovenancep s . Thepredictiontrustofp s canbecalculated by P ( f p (d i ) =p s ,f l n (p s ) =n s ) P(f l n (p s ) =n s ) (5.5) InEquation5.5,P ( f p (d i ) =p s ,f l n (p s ) =n s ) istheprobabilitythattherearen s data itemsins l i whoseexistingprovenanceisidenticalwiththemissingprovenanceofd i . We assumetheexistingprovenanceofdataitemsins l i followsthemultinomialdistribution, and each data item’s existing provenance falls into the setfp 1 ;p 2 ;:::;p m g. For any singledataitemd2s l i ,theaccuracyofitsexistingprovenanceis,andtheprobability that it shares the same provenance with d i is conf(A l je i ). Thus the probability that the existing provenance of d is the same with the missing provenance of d i should be =conf(A l je i ). Wethenhave P ( f p (d i ) =p s ,f l n (p s ) =n s ) = ( N l n s ) ns (1) N l ns (5.6) where =conf(A l je i ). 119 We assume a data item’s provenance falls into the setfp 1 ;p 2 ;:::;p m g, and it has a uniformdistributionoverthem1typesofprovenanceotherthanp s . Thenwehave P ( f l n (p s ) =n s ) = ( N l n s ) ns (1) N l ns (5.7) +(m1) ( N l n s )( 1 m1 ) ns ( 1 1 m1 ) N l ns BycombiningEquation5.5,5.6,and5.7,wecalculatethepredictiontrustforp s : T(p s ) = (m1) N l 1 ns (1) N l ns (m1) N l 1 ns (1) N l ns +(1) ns (m+2) N l ns (5.8) AsaspecialcaseofEquation5.8,whenN l = n s = 1,thepredictiontrustequalsto =conf(A l je i ),i.e.,ifonlyonedataitemistaggedbytheassociationA l ,thetrust ofthedataitemaspredictioniscalculatedbymultiplying1)theprobabilitythatthedata item shares the same provenance with the prediction target with 2) the probability that theexistingprovenanceofthedataitemiscorrect. Algorithm4illustratesthepseudocodeofthenewpredictionalgorithm. 
Inthealgo- rithm, we use a set S i to store data items in D that still have provenance information and whose annotation domain entities are semantically associated with e i . We use to indicate the confidence values tagged to these data items, and use to denote the corresponding associations. Each data item d j is assigned by a conditional confidence value conf(A l je i ), in whichA l is the association connecting e i to f a (d j ). When mul- tiple associations exist betweene i andf a (d j ), the highest association confidence value is assigned. As shown in Algorithm 4, we divide the set S i into subsets, and select a 120 Algorithm4 Predicting missing provenanceof data itemd i in datasetD when existing provenanceisinaccurate Input: D available ; d i , where f c (d i ) = 0 and f a (d i ) = e i ; v conf i = fconf(A l je i )g, 1ln;accuracyofexistingprovenance Output: f p (d i ) 1: S i fg 2: initialize() = 0 3: initialize() = 0 4: foreachassociationA l ;1lndo 5: identifys =fd j g,d j 2D available ,f a (d i ) A l !f a (d j ) 6: S i S i [s 7: foreachd j insdo 8: ifconf(A l je i )>(d j )then 9: (d j ) = conf(A l je i ) 10: (d j ) =A l 11: endif 12: endfor 13: endfor 14: foreachdataitemdinS i do 15: s l i s l i [fdg,where(d) =A l 16: endfor 17: initialize =fg 18: fors l i do 19: selecttheprovenancerepresentative l 20: calculateT( l )byusingEquation5.8 21: = [f l g 22: endfor 23: return withthehighestT( ), 2 provenance representative for each subset. Equation 5.8 is used to calculate the pre- diction trust for each . The provenance information with the highest trustT( ) is our prediction. 121 5.6 Evaluation Weevaluateourapproachinthissection. Wecreatesyntheticdatasetsbasedonrealdata collected from the Smart Oilfield domain. We measure the accuracy of our approach underdifferentprovenancelossratio,andcompareourapproachwithanother2baseline approaches. Figure5.3: Numberofdataitemsinsampledatasetscreatedbyeachprocess 5.6.1 ExperimentSetup We first collect two groups of reservoir model samples from two practical use cases in theSmartOilfielddomain. Eachgroupcontains10datasetswithcompleteprovenance. Datasetsinonegroupcontainaround1000dataitemseach,whichareusedforreservoir forecasting. Datasets in the other group are for the production optimization problem, containing around 500 data items. From each group we pick one dataset and show the numberofdataitemscreatedbyeachindividualprocessinFigure5.3. The historical datasets with complete provenance information are synthesized by duplicatingtherealdatasetsintoareasonablescale,say,2,000copies. Inpractice,each data item can be created by different kinds of processes. For example, in our first sam- plegroup,algorithmsdescribedin[40][65][67][82]canallbeemployedtocalculatethe 122 same category of data. When we create the historical datasets, data items annotated by entities of the same ontology class are designed to be possibly generated by different kinds of processes. We determine the process according to entity properties and usage frequencies of different processes learnt from the sample groups. 10% of the histori- cal datasets are used as test datasets, and the remaining are used for our bootstrapping process. Ineachtestdataset,werandomlypickdataitems,droptheirprovenanceinfor- mation,andregardthedroppedprovenanceas“missingprovenance”. Fortheremaining data items, their existing provenance can also be incorrect according to the provenance annotation accuracy, which is controlled by a probability value in our experiment. 
To assign a piece of incorrect provenance to a data item, we randomly choose one of the otherpossibleprocessesthatmaybeusedforthedataitem’sgeneration. We ran our experiments on a machine with 3.06 GHz Intel Core i3 CPU and 4GB memory. While the offline bootstrapping process usually takes several minutes, which is proportional to the size of the historical datasets, the prediction step of our approach respondswithinsecondswhenpredictingthemissingprovenanceforadataitem. 5.6.2 BaselineApproaches Wecompareourapproachtotwoapproaches: Baseline1: For a data item annotated by an entitye i , we predict its missing prove- nance by simply selecting the generation process which were most frequently used to create data items annotated by e i in the historical datasets. I.e, we predict the missing provenanceofadataitemsimplybasedonitssemanticannotation. Thisisasimpleand directapproach,butwidelyusedbydomainengineersinpractice. Baseline2: Wepredictprovenanceonlyconsideringprovenancesimilaritybetween domain entity pairs. Suppose we are predicting missing provenance for a data item d i contained in the datasetD, and the domain entity annotated tod i ise i . For every other 123 entitye j intheontology, thisapproachcountsthetotalnumberoftimesthate i ande j ’s annotated data items share the same provenance when contained in the same historical dataset. We express the counting number as C e i (e j ). Then in the dataset D, among thedataitemswhoseprovenanceisavailable,weselecttheprovenanceofthedataitem whose annotation entity has the biggest C e i as our prediction. This approach derives fromtheobservationthatdataitemsmaysharetheidenticalprovenanceinformation. It utilizes the semantic annotations to capture the identical provenance, but without intro- ducingthesemanticassociations. 5.6.3 ResultsandDiscussion Aspreviouslymentioned,wetakendatasetsashistoricaldata,wherenvariesfrom500 to 2000. We calculate the prediction accuracy of different approaches under different provenancelossratiosbycomparingourprovenancepredictionwiththegroundtruth. 124 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Precision Provenance Loss Ratio asso baseline2 baseline1 (a) 500historicaldatasets 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 N(unknown) / N(incorrect) Precision Provenance Loss Ratio asso baseline2 baseline1 unknown (b) 1000historicaldatasets 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Precision Provenance Loss Ratio asso baseline2 baseline1 (c) 2000historicaldatasets Figure5.4: Precisionofprovenancepredictionforusecase1 125 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Precision Provenance Loss Ratio asso baseline2 baseline1 (a) 500historicaldatasets 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Precision Provenance Loss Ratio asso baseline2 baseline1 (b) 1000historicaldatasets 10% 20% 30% 40% 50% 60% 70% 80% 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Precision Provenance Loss Ratio asso baseline2 baseline1 (c) 2000historicaldatasets Figure5.5: Precisionofprovenancepredictionforusecase2 126 Figure5.4and5.5showstheevaluationresultsontwousecaseswhenalltheexisting provenance is accurate. The voting process is used to predict the missing provenance for both use cases, and the voting threshold is set to 0.75. The curve “asso” illustrates the prediction accuracy of our approach, which has the highest precision among all the prediction methods in both cases. 
In general, our approach can achieve an average precisionabove85%whenonethirdoftheprovenanceinformationismissing. Generally,“baseline1”hasarelativelyfixedprecisionandisnotaffectedbychanges intheprovenancelossratio. Thisisbecause“baseline1”onlytakesthemost-frequently- usedprocessasprediction. Sincevariousprocessescanbeemployedtocreatethesame categoryofdataitems,thisapproachcannotachieveagoodaccuracy. Theprecisionoftheothertwoapproachesdeclineswhenthepercentageofthemiss- ing provenance information increases. Intuitively, when we are predicting the miss- ing provenance for a data item, with more provenance information missing, the prob- ability that all the relevant data items lose their provenance information grows. If all the relevant data items that share the same provenance with the given data item lose theirprovenance,wemarkthisprovenanceinformationas“unknown”. The“unknown” provenancecausesincorrectpredictionssinceinthatcaseourapproachcannotfindany existing provenance as prediction. We measure the proportion of “unknown” prove- nancetoincorrectpredictionswithrespecttothe“asso”curveincaseof1000historical datasets. The “unknown” curve in Figure 5.4(b) illustrates the increasing proportion of “unknown”provenanceastheprovenancelossratioincreases. Notethatwecanstilluse provenance predicted by “baseline1” as a reasonable guess for the “unknown” prove- nance. The precision of our approach is better than “baseline2”. In “baseline2”, we look for entity pairs whose annotated historical data items share the same provenance infor- mation. Becauseeachhistoricaldatasetonly“covers”partoftheentitiesdefinedbythe 127 domainontology,thisapproachcannotfindallthepossibleentitypairswiththe“same- provenance”connection. Thuswemaymissusefuldataitemsthatsharethesameprove- nancewiththegivendataitem. Ourapproachabstractsthe“same-provenance”connec- tionsassemanticassociations. Whenwefindtwodataitemshavethesameprovenance information, specific semantic associations between their annotation domain entities reveal the data generation pattern which derives identical provenance, and we can use thisrevealedpatterntospeculateidenticalprovenancebetweendataitemsannotatedby otherentities. 128 0.0 0.1 0.2 0.3 0.4 0.5 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Precision Inaccuracy Lost ratio = 10% Lost ratio = 20% Lost ratio = 30% Lost ratio = 40% Lost ratio = 50% (a) 500historicaldatasets 0.0 0.1 0.2 0.3 0.4 0.5 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Precision Inaccuracy Lost ratio = 10% Lost ratio = 20% Lost ratio = 30% Lost ratio = 40% Lost ratio = 50% (b) 1000historicaldatasets 0.0 0.1 0.2 0.3 0.4 0.5 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Precision Inaccuracy Lost ratio = 10% Lost ratio = 20% Lost ratio = 30% Lost ratio = 40% Lost ratio = 50% (c) 2000historicaldatasets Figure 5.6: Precision of provenance prediction for use case 2 when the existing prove- nanceisinaccurate 129 Figure 5.6 shows the prediction precision for use case 2 when part of the existing provenance is incorrect. Algorithm 4 is used for prediction. In the figure, the axis of “inaccuracy” indicates the ratio of the data items with incorrect existing provenance. Each line illustrates the change of the prediction precision when the provenance lost ratioisfixedandtheinaccuracyoftheexistingprovenanceisincreasing. Theprecision declines when the inaccuracy ratio increases. Compared with the lost ratio, the prove- nance inaccuracy can cause faster precision decline. 
For example, in Figure 5.6(c), if we increase the loss ratio from 10% to 50%, the decline of the prediction precision is at most 0.16 (when the inaccuracy is fixed to 0.5); while the same change of the prove- nance inaccuracy can cause an at least 0.17 (when lost ratio is 10%), and at most 0.33 (whenlostratiois50%)dropoftheprecision. Inthemeanwhile,wecanfindthatwhen thelostratiogetshigher,theprecisiondeclinesfasterwiththeincreaseoftheinaccuracy ratio. This is because when we have smaller size of existing provenance with a higher lost ratio, the incorrect provenance may have a higher probability to be selected as a prediction“representative”,thustheinaccuracydoesmore“harm”toourprediction. 5.7 RelatedWork Semantic Web techniques have been used in provenance management. For example, Zhao et al. discussed the application of semantic Web techniques for managing and querying provenance information as a part of the myGrid project [95][96]. A semantic provenance model is proposed in their work and domain knowledge is annotated and linkedtoprovenanceinformation. Sahooetal. discussessemanticprovenancemanage- ment for e-Science in [81]. We also employ semantic Web techniques in our work: a domain ontology is used to annotate data items and specific semantic associations are detectedintheontologygraphtoinferidenticalprovenance. 130 Semantic association has been discussed in [19] and [11], in which two entities are semantically associated if they are “semantically connected” or “semantically similar”. In [19] Anyanwu et al. define semantic associations and discuss queries of semantic associations. Rankingofsemanticassociationsisfurtherdiscussedin[11][18][12][59]. Semantic association is used by applications such as documents ranking in [10]. In our work, we use semantic associations to reveal the hidden semantic “connections” betweendataitemsthatsharethesameprovenanceinformation. In general, our problem can be seen as an application of the Statistical Relational Learning[47]inaspecificdomain. InthisSRLproblem,theontologygraphrepresents the relational data model, while an implicit graphical model encodes the probabilistic relationshipsbetweentheprovenanceofdifferenttypesofdataitems. Thebootstrapping step in our approach learns the probability values by analyzing the historical datasets. And we utilize the probabilistic model to predict the missing provenance. The imple- mentation of our approach can thus also be built based on existing frameworks such as SPARQL-ML[61]. 5.8 Conclusion Inthischapter,weintroducedourworkforpredictingmissingprovenanceinformation. Basedontheobservationthatdataitemswithspecificsemantic“connections”mayshare thesameprovenance,ourapproachannotatesdataitemswithdomainentitiesdefinedin a domain ontology, and represent these “connections” as semantic associations in the ontology graph. By analyzing annotated historical datasets with complete provenance information,wecapturedspecificsemanticassociationsthatmayimplyidenticalprove- nance. A statistical analysis is applied to assign probability values to the discovered 131 associations,whichindicatetheconfidenceofeachassociationwhenitisusedforfuture provenanceprediction. We developed a voting algorithm which utilizes the semantic associations and their confidence measures to predict the missing provenance information. We also extended the algorithm and take into account the accuracy of the existing provenance. A proba- bilityvalueiscalculatedasthetrustofeachpredictionresult. 
Ourevaluationshowsthat the average precision of our approach is above 85% when one third of the provenance informationismissing. 132 Chapter6 PresentingProvenancewith AppropriateGranularities Asdataflowsthroughandisderivedfromworkflowsexecutedacrossorganizationsand disciplines,provenancemaybecollectedandreconstructedfromdifferentorchestration andexecutionframeworks. Provenancefordataderivedfromlarge-scaleworkflowscan be complex. Hence, understanding and interpreting raw provenance is challenging for userswhomayconsumeitfordiversepurposes. Often, provenanceitselfiscollectedatdifferentgranularitiesdependingontheexe- cutionframeworkinquestion. Thisprovidesanatural“grouping”structureforpresent- ing provenance. For example, for a workflow composed of multiple web services, the workflowmanagementsystemmaycollectcoarse-grainedprovenancethatdescribesthe dataflowandcontrolflowatthegranularityofthewebserviceinvocations(i.e. aservice invocationisaprovenanceprocess). Further,withinanindividualwebservice,detailed provenancemaybecollectedtodescribetheexecutionlogicoftheservice(i.e. eachexe- cutionstepintheserviceisanprocess). Furthermore,moredetailedprovenancemaybe collected on system and OS calls within each execution step. Besides this hierarchical structure, also termed as a “vertical” view of provenance, there is also a “horizontal” view that arises from data flowing from one workflow invocation to another, possible controlledbydifferentusersandorganizations. 133 Thesehorizontalandverticalprovenancestructuresarecommon,butpresentprove- nance from the perspective of the “composer” of the workflow rather than the “con- sumer” of the provenance. An appropriate granularity or view of provenance should be presented to users based on the current task at hand and situation of interest. For example, when using provenance for data quality debugging, fine-grained provenance informationneedstobeprovidedfordataobjectsandprocessesthathaveahighimpact on quality, whilst other provenance entities are masked. Users with different roles may alsobeinterestedindifferentviewsofprovenance. Forexample,whenbrowsingprove- nance,businessmanagersmaybeonlyinterestedinthehigh-levelbusinessflows,while domainengineersareinterestedinthedetailedstepsandtheexecutionlogicinthework- flow. Theselogicallevelsmaynotnecessarilycorrespondtoauniformlayerinthehor- izontal provenance structure but rather span structural levels. An effective provenance presentation approach is thus required to make provenance usable and accessible. This should determine the suitable view or granularity for provenance based on the context ofusage. In this chapter, as our vision for the future work, we motivate the need for a prove- nance presentation model and framework to support complex, inter-disciplinary work- flows, and use Smart Grid as an exemplar domain. We discuss key challenges for pre- senting provenance across different granularities to support data quality forensics for diverse users. These approaches go beyond “horizontal” provenance traversal across and “vertical” navigation into workflow traces, and include hybrid-granularity prove- nance views that slice across vertical layers and horizontal boundaries and allow navi- gation across granularities. We also offer potential modeling and algorithmic solutions tothisproblem. 134 6.1 ProvenancePresentationforSmartGrid A provenance graph collected for the ecosystem of workflows used for campus power forecasting,whichweintroduceinSection2.1.2,isshowninFigure6.1. 
Comparedwith thegeneralworkflowdescriptioninFigure2.2,moredetailsaredescribedinFigure6.1 for each sub-workflows. In the Campus Power Consumption Forecast sub-workflow, future electric power usage is predicted for each individual building on campus (“BF” inFigure6.1),andtheusageforecastsfrom150buildingsareaggregated(“AG”)into a total consumption forecast (“CF”) for the campus. A forecasting process (“F”) con- sumes a power consumption forecast model (“FM”), current weather (“W”), campus schedule (“SC”), and recent timeseries data (“BC”) observed from the building’s smart power meter and facility sensors, to generate the building power usage forecasts. The Forecast Model Training sub-workflow generates the forecast model for each building. Its input dataset includes historical daily power usage reported by smart meters (“H”), historical weather information from online web services (“HW”), and historical facil- ity scheduling information from the campus calendar (“HS”). After an annotation and transformationprocess(“A”),thesedatasetsareaggregated(“AG”)intoatrainingdataset (“TR”).ThistrainingdatasetisconsumedbyaniterativeMapReducedataflowthatuses regression tree learning [14][93] to generate the forecast model (“FM”). The Building Sensor Integration sub-workflow generates a timeseries of recent power consumption and equipment usage information. Events from smart meters and sensors installed in each building are normalized (“PP”) and aggregated (“AG”) in this sub-workflow in a continuous and periodical manner. Information from smart meters and sensors is inte- grated every few minutes and the updated forecasts are provided to the campus facility managerseveryhour. 135 M R S …... H A AH AG HS A AHS HW A AHW SubM FM W SC PP C C C AG BC F BF BF …... AG CF Forecast Model Training Building Sensor Integration Campus Power Consumption Forecast MapReduce Learning H: historical power consumption data HS: historical schedule information HW: historical weather information A: annotation AH: annotated historical power consumption data AHS: annotated historical schedule information AHW: annotated historical weather information S: power consumption collected from smart meter TR: training dataset M: map SubM: sub-model for forecasting R: reduce FM: forecast model SC: schedule information W: weather information F: forecast AG: aggregation PP: pre-process C: power consumption after transformation and annotation BC: power consumption for a building BF: power consumption forecast for a building CF: power consumption forecast for the campus …... S PP S PP TR Figure 6.1: A provenance graph for workflows used to forecast building and campus powerconsumption. 6.1.1 PresentingProvenanceforDifferentUserRoles Three types of users consume the provenance information for situation awareness and data quality forensics in our use case: the software architect, the data analyst, and the campus facility operator. The data analyst designs the machine learning algorithm for generating the forecast model. She is interested in provenance about the execution of the forecast model training and the campus power consumption forecast workflow so that she can verify the quality of the forecast model. However, the analyst is not con- cerned with the computation scalability or reliability of the software system. Thus she does not need the detailed provenance for the MapReduce learning process. 
The soft- ware architect, on the other hand, is responsible for ensuring the performance of the software infrastructure and particularly to ensure that the data flows between different physical and organizational boundaries occur as expected. Hence, he is interested in the remote MapReduce execution, the data aggregation for the MapReduce and in the 136 building sensor integration, and information flow to the forecasting model. The facil- ity operator is responsible for the accurate operations of the physical smart meters and sensorsinbuildings,andutilizingtheconsumptionpredictionstocontrolbuildingoper- ations,suchaschangingthesetpointtemperatureofthermostatsordutycyclingtheA/C units from the campus control center. Therefore, she requires the detailed provenance indicatingthevaluesreportedbythemetersandsensors,andtheirimpactontheforecast consumptionforbuildingandthecampus. Inthesescenarios,provenanceisusedtoval- idatethatthecomplexsystemisoperatingasexpected,andtorapidly“debug”thecyber physical system in case discrepancies are detected. Oftentimes, when problems cannot be resolved immediately by a single user, provenance also assists in drilling down for furthercollaborativeforensics. Based on the above analysis, Figure 6.2 shows three different provenance graphs for these user roles. These provenance graphs can be seen as different “views” for the original graph in Figure 6.1. In each provenance view, the provenance information of relevance to the particular user is shown with higher fidelity to scope the user activ- ity, and a coarser granularity of information is substituted for other parts of the graph. For example, since the facility operator is responsible for physical operations, detailed provenanceinformationispresentedforsensordataandforecastresultsfromthebuild- ingsensorintegrationandforecastworkflowsintheirview,whiledetailsoftheforecast modeltrainingandinformationaggregationaspectsofprovenancearemasked. 137 H A AH AG HS A AHS HW A AHW FM W SC BC F BF BF …... AG CF …... MapReduce Learning Building Sensor Integration TR (a) Provenancegraphforthedataanalystrole M R H HS HW SubM FM W SC BC Campus Power Consumption Forecast CF Annotation & Aggregation S S S …... Building Sensor Integration BC …... BC TR (b) Provenancegraphforthesoftwarearchitectrole S S S BF BF …... AG CF …... …... Building Power Consumption Forecast (c) Provenance graph for the facility operator role S1,1 S1,2 S1,20 BF1 AG CF S7,1 S7,2 S7,25 BF7 BF(other) …... …... Building Power Consumption Forecast Building Power Consumption Forecast Power Consumption Forecast for Other Bld. (d) Provenance graph for data quality forensics by facilityoperator Figure6.2: Provenancegraphviewsfordifferentuserroles 6.1.2 PresentingProvenanceforDataQualityForensics In our use case, the main usage of provenance information is for data quality forensics and situation awareness by the different user roles. The simplified provenance graph 138 in Figure 6.1 only shows the forecasting for a single building. In practice, a complete execution for the campus performs forecasting for150 buildings, which leads to a complex provenance graph with several thousand provenance nodes. This graph also growsastheforecastingisupdatedeachhour. Directlypresentingthiscompleteprove- nancegraphmakesitchallengingforuserstoperformdataqualityforensicstodetermine qualityproblems. In fact, data objects and processes in the workflow exert varying quality impacts on the final output. 
Here, we use quality impact to indicate how the quality of a process or a data object in the provenance graph affects the quality of the output artifact. Any change in the quality of a process or data object with high quality impact causes a significantchangeintheoutputdataquality. Incontrast,thequalitychangeofprocesses and data objects with low quality impact may not cause a notable quality change of the output. This guides domain users on what (high impact) processes and data objects in the provenance graph they need to exercise more quality control upon to improve the output quality. Knowledge of the quality impact thus should be used to direct the presentation of provenance information, so that we can achieve a more efficient usage of the provenance information when debugging or analyzing for causes of poor output quality. Therefore, in addition to user roles, we also need to consider the usage requirement for the provenance presentation. Particularly, for the data quality forensics, provenance presentation should be guided by the quality impact of provenance subjects. Users can determinethe quality impactbased on their domainknowledge. Forexample, thequal- ity of the power consumption forecast of the whole campus is impacted more by the forecast of a large building than that of a small building. Also, some processes and input/intermediate data objects, such as the model training processes, have a higher quality impact on the forecast than others. Thus these processes and data objects with 139 highqualityimpactshouldbepresentedwithafine-grainedgranularity,whileotherscan becollapsedandpresentedatacoarsergranularity. Figure6.2(d)illustratesaprovenanceviewforthefacilityoperatorwhichreflectsthe granularity requirement based on the quality impact. The provenance graph highlights theprovenancetraceforcalculatingtheconsumptionforecastofBuilding1and7since they have the highest quality impact. Other provenance information is compacted into twoprovenancenodes,“PowerConsumptionForecastforOtherBld.” and“BF(other)”, andtheirdetailsareomitted. 6.2 ModelingPresentationGranularityforProvenance Thefirstproblemwefaceistodesignaprovenancemodelwithsupportforpresentation granularity. Such a model should allow the co-existence of provenance with different customizedgranularities. Themodelshouldalsoofferadequatesupportforexploration beyondaparticularviewtodrilldownintospecificdetails. So,itshouldallowmapping betweendifferentprovenancecomponentsacrossgranularities. Thefollowingstrategies canbeconsideredforthemodeldesign. 6.2.1 StaticGranularityDefinition A provenance model that supports static granularity can explicitly define uniform hier- archical granularities. This allows all provenance information to be structures using homogeneous granularity associations across vertical layers. For example, a layered provenance model is defined in the REDUX system [20], where provenance informa- tion is divided into four levels: the Abstract level, the Service Instantiation level, the Data Instantiation level, and the Runtime Specification level. Each layer offers a par- ticular level of abstraction, and lower layers contain more detailed information. The 140 Abstract level depicts the most coarse-grained information: it only captures the types of the activities in the workflow and links/relationships among the activities or ports of activities. The Service Instantiation level represents an instance of the abstract level by instantiating the activity classes. 
The Data Instantiation level provides additional infor- mationbyspecifyingtheinputdata,runtimeparametersettings,andexecutionbranches. TheRuntimeSpecificationlevel,whichisthemostdetailedlevel,offersruntimespecific information such as the start and end time of workflow/activity execution, intermediate results,internalstateofeachactivity,andeveninformationaboutthemachinestowhich each activity was assigned for execution. Such an approach is common in other prove- nance models too [28][15], and provenance graphs spanning different workflows are naturallylinkedateachlayerthroughshareddataobjects. Staticallyincorporatinggranularitylevelswithaprovenancemodelcanmakeiteas- ier to configure tools for provenance query and presentation. However, its main limi- tation is its lack of flexibility. The granularity association often reflects user’s view on the provenance information, thus for different applications where users have different provenanceusagerequirements,differentnumbersanddefinitionsofprovenancelayers may be required. For a large-scale workflow containing processes across institutes and disciplines, it is difficult to apply a provenance model with homogeneous granularity association. 6.2.2 HybridView Thereareacoupleofalternativeapproachestodesignamoreflexibleprovenancemodel tosupportcustomizedprovenancepresentationviews. 1. Application-specific granularities. For different applications/processes involved intheworkflow,usersshouldhavetheflexibilitytodefinethespecificprovenance granularitiesbasedontheirenduseanduserrole,alongwiththeassociateddetails 141 abouttheprovenancecollectedtopopulatethem. Thegranularityinformationcan then be annotated with the provenance information, or provenance at different granularities is recorded as alternate “Accounts”, as defined in Open Provenance Model [77] and the W3C Provenance Data Model [8]. The latter, however, can cause a state space explosion by recording an account for every type of user role andlevelofdetailrequired,andendupduplicatingsignificantinformation. New provenance granularities can also be defined and specified during runtime, i.e.,whenprovenanceinformationispresentedtoendusers. Thisusuallyhappens whentheprovenancepresentationmodulefindsitnecessarytoclusterfine-grained provenanceinformationasnewcompositemodules. 2. Hybridprovenanceviews. Foralarge-scaleworkflowconsistingofmultipleappli- cations and processes, since individual applications may have their own prove- nance granularity association, there can be no global granularity for the prove- nance presentation for the workflow. Meanwhile, users may be interested only in part of the provenance information, which should be presented at a detailed low level, whilst the other parts of the provenance information can be presented at a high level. Therefore the complete provenance graph should be presented across granularities, i.e., weshouldenableahybrid-granularityprovenancepresentation mechanism. Theprovenancemodelneedstoallowuserstotagorannotateprocessesanddata objects with information that can be used to determine the presentation granular- ities. For example, the quality impact can be annotated to individual processes and data objects as a type of their property in our use case. The intentions of processescanalsobeannotatedandusedforidentifyingthesemanticconnections 142 between processes. These tags may span across traditional levels and be used for decomposingorclusteringprovenancesubjectstogenerateahybridview. 
6.3 Determining Apropos Provenance Presentation View

In this section, we discuss potential approaches for determining the suitable provenance view that should be presented to a particular user based on the provenance model. We try to provide an overview of possible strategies that can be employed, but omit algorithmic details that will depend on specific provenance architectures. In general, these strategies can be classified into a decomposition or a clustering approach, relative to the initial provenance information available.

6.3.1 Decomposition Approach

A decomposition approach is well suited in the presence of granularities clearly defined in the provenance model, either as static layers or as flexible application-specific ones. These present a natural progression through discrete granularities that allows the coarsest granularity encompassing the starting provenance artifact or process to be presented as a default starting point. In a static layered model, this could be the workflow level; for flexible accounts, this could be the account with the fewest provenance nodes. When the workflow is composed of distributed processes with heterogeneous granularity associations, each individual activity is initially presented at the coarsest granularity to provide a "big picture" of the provenance graph. Sometimes, the coarsest layer may not be the most appropriate granularity for the given provenance usage context information. Instead, we may first check whether it is necessary to drill down to a lower level with more detailed information, so as to satisfy the provenance usage requirement and meet the user's interest. If necessary, we use the corresponding finer-grained provenance subgraph or account to replace the coarse-grained provenance node. This sort of breadth-first traversal approach can be used to navigate through each discrete granularity for each activity until a suitable configuration is reached. The eventual presentation may be a combination of fine-grained and coarse-grained provenance for different sections of the graph, highlighting the most important provenance information. In addition, users can also explore horizontally within the same level to view neighboring activities and entities, and navigate deeper to a finer granularity.
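A minimal sketch of this drill-down loop is given below. It reuses the hypothetical ProvSubject class from the previous sketch, and the needs_detail predicate stands in for the usage-context check; both are assumptions, not our implementation:

```python
# Breadth-first drill-down: start from the coarsest nodes and replace a
# composite by its finer-grained members whenever more detail is needed.
from collections import deque

def decompose_view(roots, needs_detail):
    view, queue = [], deque(roots)
    while queue:
        node = queue.popleft()
        if node.is_composite() and needs_detail(node):
            queue.extend(node.expand())   # drill down one level
        else:
            view.append(node)             # keep at current granularity
    return view

# Example: drill into any node whose annotated quality impact exceeds 0.5.
high_impact = lambda n: n.tags.get("quality_impact", 0.0) > 0.5
campus = ProvSubject("campus", "process", {"quality_impact": 0.9})
campus.members = [ProvSubject("bld1", "process", {"quality_impact": 0.9}),
                  ProvSubject("bld3", "process", {"quality_impact": 0.2})]
print([n.subject_id for n in decompose_view([campus], high_impact)])
# ['bld1', 'bld3']
```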
A key challenge here is to identify the most appropriate provenance presentation granularity for each activity, and to decide whether further traversal to lower-level details is required. This decision should be based on the provenance usage context information, which indicates a user's interest in the provenance information and their intention for its usage. Two types of information can be used as context information and captured by the presentation model:

1) Provenance end use, which specifies the activity for which provenance is used, such as data quality forensics or software system debugging. This reveals the level as well as the subset of provenance information that may be useful for the end users. For example, as discussed earlier for Figure 6.2(d), when using provenance information for forecast quality checks, provenance subjects with high quality impact are of particular interest. This end use can narrow down the subset of provenance to the forecasting activity, as well as highlight provenance for building forecasts with higher quality impact at a finer granularity for further analysis.

2) User profile, which describes the role of the user consuming provenance. This may include the user's affiliation, business level, associated projects, and expertise, and can be used to identify a user's interest in the provenance information. For example, we have illustrated the three views of the same provenance graph that are of interest to the user roles of the data analyst, the software architect, and the facility operator.

Based on the context information, the mechanism for determining the presentation granularity can be modeled as a function that takes the properties of the corresponding provenance subject (e.g., the quality impact of a process) and the provenance usage context information as input, and computes the granularity at which the provenance information should be presented as output. Different methods can be employed to implement this. For example, a set of rules can be defined to specify the mapping between <provenance properties, usage context> and the provenance presentation granularity. A sample rule can be: "if the provenance is used for data quality forensics and an activity has top-N quality impact (as defined by some metric), the activity should be presented at the most detailed granularity". The mapping between <provenance properties, usage context> and presentation granularity can also be learnt from the user's provenance usage/browsing history. Evolutionary learning processes can be used so that the granularity is determined in a user-interactive way.
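A rule-table sketch of such a function is shown below. The three granularity levels, the rule shape, and the top-N test are assumptions made for illustration; as noted above, the mapping could equally be learnt from browsing history:

```python
# A sketch of the granularity function: map <provenance properties,
# usage context> to a presentation level via simple hand-written rules.
COARSE, MEDIUM, FINE = "coarse", "medium", "fine"

def presentation_granularity(subject_props, usage_context, top_n=2):
    if usage_context.get("end_use") == "data-quality-forensics":
        # Sample rule from the text: top-N quality impact => most detail.
        if subject_props.get("quality_impact_rank", float("inf")) <= top_n:
            return FINE
        return COARSE
    if usage_context.get("user_role") == "software-architect":
        return MEDIUM              # e.g., component-level detail
    return COARSE                  # default: the "big picture"

ctx = {"end_use": "data-quality-forensics"}
print(presentation_granularity({"quality_impact_rank": 1}, ctx))   # fine
print(presentation_granularity({"quality_impact_rank": 7}, ctx))   # coarse
```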
6.3.2 Clustering Approach

In some cases, the existing provenance information does not have discrete granularity levels specified in the model, and these have to be inferred from the fine-grained provenance information that is collected. The hybrid provenance view models discussed earlier are such an example; alternatively, there may be a separate specification of provenance presentation to overlay on top of existing fine-grained provenance. For example, the provenance system may collect provenance such as that shown in Figure 6.1, without the dashed boxes, while the user may wish to be presented with a view similar to Figure 6.2.

A clustering approach can be applied for provenance presentation in such scenarios. In general, this approach incrementally clusters the initial fine-grained provenance information so that groups of low-level provenance nodes are combined and replaced by new higher-level nodes. Some existing work has already discussed problems in this direction. For example, in the Zoom system [22], virtual composite step-classes, or sub-workflows, are built for workflow provenance to hide complexity from users and allow users to focus on a higher level of abstraction. In [22], a user view is created by combining "non-relevant" modules with "relevant" modules, where the "relevant" modules can be seen as important modules in the workflow about which users are more concerned.

Multiple issues should be considered when using a clustering approach to generate a composite provenance view. The clustering strategy needs to clearly identify what fine-grained provenance information can be combined into a composite module. Semantic annotations on the provenance model can be used to drive this clustering, and the hybrid provenance view approach leverages this. It allows multiple annotations corresponding to different user roles or end uses to be tagged onto different provenance subjects, and these can form the basis for the clustering. They form a superset of the rules used by Zoom to specify the target of the composition: each composite module can contain at most one "relevant" module, and there should be as few composite modules as possible that contain only "non-relevant" modules. Also, in practice, a clustering strategy may consider the semantic connections of the relevant provenance subjects. For example, in our use case, a sequence of Map/Reduce activities can be more tightly coupled and given a higher priority for combination from a data analyst's perspective, since they are consecutive steps of the parallel regression. This requires mechanisms such as the calculation of connectivity power to be designed.
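The sketch below illustrates the Zoom-style composition constraint on a simplified linear workflow. The greedy attachment of "non-relevant" modules to the nearest "relevant" one is our own simplification for illustration, not the algorithm from [22]; real provenance graphs would need the connectivity analysis noted above:

```python
# Greedy clustering sketch: walk an ordered list of modules and start a new
# composite whenever a second "relevant" module would otherwise be added,
# so each composite holds at most one relevant module (the Zoom constraint).
def cluster_modules(modules, relevant):
    composites, current = [], []
    for m in modules:
        if m in relevant and any(x in relevant for x in current):
            composites.append(current)
            current = []
        current.append(m)
    if current:
        composites.append(current)
    return composites

steps = ["ingest", "clean", "map", "reduce", "publish"]
print(cluster_modules(steps, relevant={"map", "reduce"}))
# [['ingest', 'clean', 'map'], ['reduce', 'publish']]
```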
6.4 Related Work

Provenance granularity is supported by existing provenance models and systems. For example, both the Open Provenance Model [77] and the W3C provenance data model [8] allow users to use the "Account"/"AccountEntity" concepts to bundle up a set of fine-grained provenance subjects and generate a higher-level provenance description. The W3C provenance data model also contains the "Collections" mechanism to describe the "part-of" relationships between entities, which can then be used to indicate data object granularities.

Many existing provenance systems have layered provenance models that describe and present provenance information with static, pre-defined granularities [20][28][15]. For example, the Kepler workflow engine [13] also supports four levels of provenance information: the process level, the data level, the organization level and the knowledge level. Although these different levels are more often used for describing provenance information for the workflow from different aspects, they can also be utilized for presenting provenance across granularities. As we have discussed, these provenance models and systems with static granularity specifications may lack the flexibility to support provenance presentation for a large-scale workflow.

Some existing work has already discussed the importance of, as well as explored methods for, presenting provenance with suitable views to end users. This work includes the Zoom [22] system, which we discussed in Section 6.3. Also, in [48], Gibson et al. leverage named graphs and extend the SPARQL query language to enable the flexible creation and control of user-oriented views over detailed provenance information. This work provides a useful mechanism for users to specify the clustering of fine-grained provenance records, but it does not provide strategies for identifying the appropriate presentation granularities based on the provenance usage context.

Chapter 7

Conclusion

The focus of our work has been provenance management in dynamic, distributed and diverse dataflow environments. By using Energy Informatics as our exemplar domain (including Smart Oilfield and Smart Grid), we discussed the main challenges for provenance management in practice, and presented algorithms and approaches to achieve efficient provenance recording, storage, retrieval, and query. In particular, our work includes:

Efficient provenance storage: We discussed template-based techniques to efficiently manage the large volume of provenance information generated by recurrent and streaming workflows. We built basic provenance templates that capture common provenance properties and structures for recurrent workflows, and extended them to stream processing workflows. In addition, we introduced novel methods using explicit specification of processing patterns to omit instance-level provenance information when streamed events are persisted in a repository. The provenance query functions based on the patterns act as proxies that evaluate to ancestral events. We handled the possibility of non-deterministic behavior of the stream pipeline using an approach for tracking provenance outliers that are not consistent with the default processing patterns.

Querying over distributed provenance repositories: We introduced our work on querying distributed provenance information. Our provenance model incorporates the Open Provenance Model integrated within semantic domain models. By developing a provenance index service, we enabled the processing of provenance queries across distributed repositories. We presented algorithms to address both the provenance retrieval query and the provenance filter query. Performance evaluations of these algorithms show a marked benefit over the baseline approaches, and highlight the challenges and benefits of distributing provenance across multiple repositories.

Predicting incomplete provenance: We designed semantic-based approaches for predicting missing provenance information. Our approach utilized semantic associations to capture hidden semantic "connections" between fine-grained data items sharing identical provenance. By analyzing historical datasets annotated by a domain ontology, we detected specific semantic associations in the ontology graph that may imply identical provenance. A statistical approach was implemented to measure the confidence of semantic associations. We utilized the discovered associations to identify data items that are likely to share the same provenance as a given data item, and used a voting algorithm based on the confidence values to predict the missing provenance. We also considered the possible inaccuracy of existing provenance information, and provided an algorithm to calculate the trust of the predicted results.

These contributions have led to an improved understanding of the requirements and challenges of provenance management for open software environments. Novel problems have been highlighted which had not been considered by traditional provenance systems. We have provided a holistic approach for provenance management, and have developed innovative solutions which make a practical impact on important domains such as Energy Informatics.

7.1 Future Work

We have discussed one of our main directions for future work in Chapter 6, where we used an example from the Smart Grid domain to demonstrate the necessity of presenting provenance at appropriate granularities according to provenance usage context information. We have also discussed the main challenges and potential modeling and algorithmic solutions for this problem. Besides the work on provenance presentation, we also propose the following problems as future work.

New Techniques and Applications of Provenance Usage and Mining: Although provenance has been widely recognized as being crucial for the management of scientific workflows and business processes, it is still being used in a simple and intuitive way, such as being directly presented to end users as graphs. In our future work, we plan to develop more advanced applications for provenance usage in data quality estimation and forensics. In particular, there are opportunities to design algorithms to mine provenance information so as to detect processing patterns that have a higher probability of raising data quality problems. The results of the provenance mining can also be used in stream processing workflows for detecting errors in a real-time manner.

Provenance can also be used for service/process selection and composition. In the Energy Informatics domain, the same type of data artifact can be created by different methods. By comparing the individual provenance graphs of data artifacts, we can find groups of services/processes that were employed for the creation of the same type of data, as well as identify their advantages and drawbacks. This in turn will help us identify the suitable services/processes for future use.
Privacy and Security: Important privacy and security concerns arise from sharing provenance across institution and discipline boundaries. As we discussed in Chapter 4, provenance information is often shared and queried in a distributed environment across organizations and disciplines. Private information may be disclosed along with the shared provenance. For example, fine-grained provenance information about the electrical power consumption of a user's house may reveal the living patterns of the user, and thus should be protected from being shared with the public out of concern for the user's privacy. Some existing work already discusses privacy-enhanced provenance information [78]. There is potential for research into developing and integrating a scalable access control mechanism into our existing provenance management system.

Provenance for Cloud Computing: As more and more processes and applications are executed on cloud platforms, provenance is becoming an important issue for cloud infrastructures. In a cloud platform, provenance information is useful for data quality control, result repeatability, and the monitoring of resources, both physical and virtual [9]. Our proposed approaches can be extended to workflow frameworks executing in the cloud, which can handle larger-scale and more dynamic resources and achieve better scalability.

Bibliography

[1] ISO 8601:2004. http://www.iso.org/iso/catalogue_detail?csnumber=40874.
[2] Neo4j: The Graph Database. http://neo4j.org.
[3] Open Provenance Model (OPM) XML Schema Specification. http://openprovenance.org/model/opmx.
[4] Provenance Challenge Series. http://twiki.ipaw.info/bin/view/Challenge/.
[5] SQL Azure Database. http://msdn.microsoft.com/en-us/library/windowsazure/ee336279.aspx.
[6] Supermind consulting. http://www.supermind.org/.
[7] USC Center for Energy Informatics. http://cei.usc.edu.
[8] W3C Provenance Data Model. http://www.w3.org/TR/prov-dm/.
[9] I. M. Abbadi and J. Lyle. Challenges for provenance in cloud computing. In Proceedings of the Third USENIX Workshop on the Theory and Practice of Provenance, TaPP '11, 2011.
[10] B. Aleman-Meza, I. B. Arpinar, M. V. Nural, and A. P. Sheth. Ranking documents semantically using ontological relationships. In Proceedings of the 2010 IEEE Fourth International Conference on Semantic Computing, ICSC '10, 2010.
[11] B. Aleman-Meza, C. Halaschek, I. B. Arpinar, and A. Sheth. Context-aware semantic association ranking. In International Workshop on Semantic Web and Databases, SWDB '03, 2003.
[12] B. Aleman-Meza, C. Halaschek-Wiener, I. B. Arpinar, C. Ramakrishnan, and A. Sheth. Ranking complex relationships on the semantic web. IEEE Internet Computing, 9:37–44, 2005.
[13] I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the Kepler scientific workflow system. In Proceedings of the 2006 International Conference on Provenance and Annotation of Data, IPAW '06, pages 118–132, 2006.
[14] S. Aman, Y. Simmhan, and V. K. Prasanna. Improving energy use forecast for campus micro-grids using indirect indicators. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 389–397, 2011.
[15] M. K. Anand, S. Bowers, and B. Ludäscher. A navigation model for exploring scientific workflow provenance graphs. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS '09, pages 2:1–2:10, 2009.
[16] M. K. Anand, S. Bowers, and B. Ludäscher. Techniques for efficiently querying scientific workflow provenance graphs. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 287–298, 2010.
[17] M. K. Anand, S. Bowers, T. McPhillips, and B. Ludäscher. Efficient provenance storage over nested data collections. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT '09, pages 958–969, 2009.
[18] K. Anyanwu, A. Maduko, and A. Sheth. SemRank: ranking complex relationship search results on the semantic web. In Proceedings of the 14th International Conference on World Wide Web, WWW '05, pages 117–127, 2005.
[19] K. Anyanwu and A. Sheth. ρ-queries: enabling querying for semantic associations on the semantic web. In Proceedings of the 12th International Conference on World Wide Web, WWW '03, pages 690–699, 2003.
[20] R. S. Barga and L. A. Digiampietri. Automatic capture and efficient storage of e-science experiment provenance. Concurrency and Computation: Practice and Experience, 20(5):419–429, April 2008.
[21] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 1093–1104, 2010.
[22] O. Biton, S. Cohen-Boulakia, S. B. Davidson, and C. S. Hara. Querying and managing provenance through user views in scientific workflows. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE '08, pages 1072–1081, 2008.
[23] S. Bowers, T. M. McPhillips, and B. Ludäscher. Provenance in collection-oriented scientific workflows. Concurrency and Computation: Practice and Experience, 20(5):519–529, April 2008.
[24] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 539–550, 2006.
[25] P. Buneman, G. Cong, W. Fan, and A. Kementsietsidis. Using partial evaluation in distributed query evaluation. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pages 211–222, 2006.
[26] P. Buneman, S. Khanna, and W. C. Tan. Data provenance: Some basic issues. In Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2000, pages 87–93, 2000.
[27] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In Proceedings of the 8th International Conference on Database Theory, ICDT '01, pages 316–330, 2001.
[28] B. Cao, B. Plale, G. Subramanian, E. Robertson, and Y. Simmhan. Provenance information model of Karma version 3. In Proceedings of the 2009 Congress on Services-I, SERVICES '09, pages 348–351, 2009.
[29] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the First Biennial Conference on Innovative Data Systems Research, CIDR '03, 2003.
[30] A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 993–1006, 2008.
[31] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, March 1997.
[32] B. Clifford, I. Foster, J.-S. Voeckler, M. Wilde, and Y. Zhao. Tracking provenance in a virtual data grid. Concurrency and Computation: Practice and Experience, 20(5):565–575, April 2008.
[33] E. F. Codd. Recent investigations in relational database systems. In IFIP Congress, pages 1017–1021, 1974.
[34] B. C. Craft, M. Hawkins, and R. E. Terry. Applied Petroleum Reservoir Engineering. Prentice Hall, 2nd edition, 1990.
[35] Y. Cui and J. Widom. Practical lineage tracing in data warehouses. In Proceedings of the 16th International Conference on Data Engineering, ICDE '00, pages 367–378, 2000.
[36] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal, 12(1):41–58, May 2003.
[37] F. Curbera, Y. Doganata, A. Martens, N. K. Mukhi, and A. Slominski. Business provenance — a technology to increase traceability of end-to-end operations. In Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008, Part I on On the Move to Meaningful Internet Systems, OTM '08, pages 100–119, 2008.
[38] P. P. da Silva, D. L. McGuinness, and R. McCool. Knowledge provenance infrastructure. IEEE Data Engineering Bulletin, 26(4):26–32, December 2003.
[39] E. Deelman, D. Gannon, M. S. Shields, and I. Taylor. Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, 2009.
[40] U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. Neural-network based sensitivity analysis for injector-producer relationship identification. In Intelligent Energy Conference and Exhibition, 2008.
[41] T. Ellqvist, D. Koop, J. Freire, C. Silva, and L. Stromback. Using mediation to achieve provenance interoperability. In Proceedings of the 2009 Congress on Services-I, SERVICES '09, pages 291–298, 2009.
[42] I. Foster. The virtual data grid: a new model and architecture for data-intensive collaboration. In Proceedings of the 15th International Conference on Scientific and Statistical Database Management, SSDBM '03, pages 11–11, 2003.
[43] I. Foster, J.-S. Vöckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. In Proceedings of the 14th International Conference on Scientific and Statistical Database Management, SSDBM '02, pages 37–46, 2002.
[44] J. Frew and R. Bose. Earth System Science Workbench: A data management infrastructure for earth science products. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management, SSDBM '01, pages 180–189, 2001.
[45] J. Frew, D. Metzger, and P. Slaughter. Automatic capture and reconstruction of computational provenance. Concurrency and Computation: Practice and Experience, 20(5):485–496, April 2008.
[46] L. Gadelha, B. Clifford, M. Mattoso, M. Wilde, and I. Foster. Provenance management in Swift. Future Generation Computer Systems, 27(6):775–780, 2011.
[47] L. Getoor and B. Taskar. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[48] T. Gibson, K. Schuchardt, and E. Stephan. Application of named graphs towards custom provenance views. In First Workshop on Theory and Practice of Provenance, TAPP '09, pages 5:1–5:5, 2009.
[49] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul. The case for fine-grained stream provenance. In P. M. Fischer, H. Höpfner, J. K. 0002, D. Nicklas, B. Seeger, T. Umblia, and M. Virgin, editors, BTW Workshops, pages 58–61. Technische Universität Kaiserslautern, 2011.
[50] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis, D. Marvin, L. Moreau, and T. Oinn. Provenance of e-science experiments - experience from bioinformatics. OST e-Science Second All Hands Meeting 2003, AHM '03, 4:223–226, 2003.
[51] P. Groth. A distributed algorithm for determining the provenance of data. In Proceedings of the 2008 Fourth IEEE International Conference on eScience, ESCIENCE '08, pages 166–173, 2008.
[52] P. Groth and L. Moreau. Representing distributed systems using the open provenance model. Future Generation Computer Systems, 27(6):757–765, June 2011.
[53] T. Heinis and G. Alonso. Efficient lineage tracking for scientific workflows. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1007–1018, 2008.
[54] A. Hevner and S. Yao. Query processing in distributed database system. IEEE Transactions on Software Engineering, SE-5(3):177–187, May 1979.
[55] D. A. Holland, U. Braun, D. Maclean, K.-K. Muniswamy-Reddy, and M. I. Seltzer. Choosing a data model and query language for provenance. In International Provenance and Annotation Workshop, IPAW '08, 2008.
[56] R. Ikeda and J. Widom. Panda: a system for provenance and data. In Proceedings of the 2nd Conference on Theory and Practice of Provenance, TAPP '10, pages 5–5, 2010.
[57] H. V. Jagadish and F. Olken. Database management for life sciences research. SIGMOD Record, 33(2):15–20, 2004.
[58] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, pages 951–962, 2010.
[59] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. NAGA: Searching and ranking knowledge. In 2008 IEEE 24th International Conference on Data Engineering, ICDE '08, pages 953–962, 2008.
[60] A. Kementsietsidis and M. Wang. Provenance query evaluation: what's so special about it? In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 681–690, 2009.
[61] C. Kiefer, A. Bernstein, and A. Locher. Adding data mining support to SPARQL via statistical relational learning methods. In 5th European Semantic Web Conference, ESWC '08, pages 478–492, 2008.
[62] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar. Provenance trails in the Wings-Pegasus system. Concurrency and Computation: Practice and Experience, 20(5):587–597, April 2008.
[63] R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit. Wiley, 2004.
[64] D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, December 2000.
[65] H. Lee, K. Yao, O. Okpani, A. Nakano, and I. Ershaghi. Identifying injector-producer relationship in waterflood using hybrid constrained nonlinear optimization. In SPE Western Regional Meeting, 2010.
[66] H.-S. Lim, Y.-S. Moon, and E. Bertino. Research issues in data provenance for streaming environments. In Proceedings of the 2nd SIGSPATIAL ACM GIS 2009 International Workshop on Security and Privacy in GIS and LBS, SPRINGL '09, pages 58–62, 2009.
[67] F. Liu, J. Mendel, and A. Nejad. Forecasting injector/producer relationships from production and injection rates using an extended Kalman filter. In SPE Annual Technical Conference and Exhibition, 2007.
[68] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system: Research articles. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, August 2006.
[69] T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributed data provenance. In Proceedings of the 2010 IEEE Sixth International Conference on e-Science, ESCIENCE '10, pages 190–197, 2010.
[70] S. Miles. Electronically querying for the provenance of entities. In International Provenance and Annotation Workshop, IPAW '06, 2006.
[71] S. Miles, P. Groth, M. Branco, and L. Moreau. The requirements of using provenance in e-science experiments. Journal of Grid Computing, 5:1–25, 2007.
[72] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting causal graphs from an open provenance data model. Concurrency and Computation: Practice and Experience, 20(5):577–586, April 2008.
[73] A. Misra, M. Blount, A. Kementsietsidis, D. Sow, and M. Wang. Advances and challenges for scalable provenance in stream processing systems. In International Provenance and Annotation Workshop, IPAW '08, pages 253–265, 2008.
[74] P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 299–310, 2010.
[75] P. Missier, S. Sahoo, J. Zhao, C. Goble, and A. Sheth. Janus: From workflows to semantic provenance and linked open data. In International Provenance and Annotation Workshop, IPAW '10, pages 129–141, 2010.
[76] L. Moreau. The foundations for provenance on the web. Foundations and Trends in Web Science, 2:99–241, February 2010.
[77] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. Van den Bussche. The open provenance model core specification (v1.1). Future Generation Computer Systems, 27(6):743–756, June 2011.
[78] Q. Ni, S. Xu, E. Bertino, R. Sandhu, and W. Han. An access control language for a general provenance model. In Proceedings of the 6th VLDB Workshop on Secure Data Management, SDM '09, pages 68–88, 2009.
[79] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, November 2004.
[80] C. Pancerella et al. Metadata in the collaboratory for multi-scale chemical science. In Dublin Core Conference, 2003.
[81] S. S. Sahoo, A. Sheth, and C. Henson. Semantic provenance for eScience: Managing the deluge of scientific data. IEEE Internet Computing, 12(4):46–54, July 2008.
[82] M. Sayarpour, E. Zuluaga, C. Kabir, and L. W. Lake. The use of capacitance-resistance models for rapid estimation of waterflood performance and optimization. Journal of Petroleum Science and Engineering, 69, 2009.
[83] Y. Simmhan and R. Barga. Analysis of approaches for supporting the open provenance model: A case study of the Trident workflow workbench. Future Generation Computer Systems, 27(6):790–796, 2011.
[84] Y. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, September 2005.
[85] Y. Simmhan, B. Plale, and D. Gannon. Query capabilities of the Karma provenance framework. Concurrency and Computation: Practice and Experience, 20(5):441–451, April 2008.
[86] Y. Simmhan, V. Prasanna, S. Aman, S. Natarajan, W. Yin, and Q. Zhou. Towards data-driven demand-response optimization in a campus microgrid. In Third ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, BuildSys '11, 2011.
[87] R. Soma, A. Bakshi, and V. Prasanna. A semantic framework for integrated asset management. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid '07, 2007.
[88] D. Suciu. Query decomposition and view maintenance for query languages for unstructured data. In Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB '96, pages 227–238, 1996.
[89] D. Suciu. Distributed query evaluation on semistructured data. ACM Transactions on Database Systems, 27(1):1–62, March 2002.
[90] G. Thakur and A. Satter. Integrated Waterflood Asset Management. PennWell Books, 1998.
[91] N. N. Vijayakumar and B. Plale. Towards low overhead provenance tracking in near real-time stream filtering. In International Provenance and Annotation Workshop, IPAW '06, 2006.
[92] M. Wang, M. Blount, J. Davis, A. Misra, and D. Sow. A time-and-value centric provenance model and architecture for medical event streams. In Proceedings of the 1st ACM SIGMOBILE International Workshop on Systems and Networking Support for Healthcare and Assisted Living Environments, HealthNet '07, pages 95–100, 2007.
[93] W. Yin, Y. Simmhan, and V. Prasanna. Scalable regression tree learning on Hadoop using OpenPlanet. In International Workshop on MapReduce and its Applications, MAPREDUCE '12, 2012.
[94] C. T. Yu and C. C. Chang. Distributed query processing. ACM Computing Surveys, 16(4):399–433, December 1984.
[95] J. Zhao, C. Goble, M. Greenwood, C. Wroe, and R. Stevens. Annotating, linking and browsing provenance logs for e-science. In International Semantic Web Conference (ISWC) Workshop on Retrieval of Scientific Data, 2003.
[96] J. Zhao, C. Wroe, C. Goble, R. Stevens, S. Bechhofer, D. Quan, and M. Greenwood. Using semantic web technologies for representing e-science provenance. In International Semantic Web Conference, ISWC '04, 2004.
[97] Y. Zhao and S. Lu. A logic programming approach to scientific workflow provenance querying. In International Provenance and Annotation Workshop, IPAW '08, 2008.
[98] T. Zhu, A. Bakshi, V. Prasanna, R. Cutler, and S. Fanty. Semantic web technologies for event modeling and analysis: A well surveillance use case. In SPE Intelligent Energy Conference, 2010.
[99] D. Zinn, Y. Simmhan, M. Giakkoupis, Q. Hart, T. McPhillips, B. Ludäscher, and V. Prasanna. Towards reliable, performant workflows for streaming applications on cloud platforms. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid '11, 2011.
Abstract

Provenance, the derivation history of data objects, records how, when, and by whom a piece of data was created and modified. Provenance allows users to understand the context of derived data, estimate its quality for use, locate data of interest, and determine the datasets affected by erroneous processes. It thus plays an important role in scientific experiments and business processes for data quality control, audit trails, and ensuring regulatory compliance.

While most previous work studies provenance only in a closed and well-controlled environment (e.g., a workflow engine), challenges remain for holistic provenance management in practical and open environments, where provenance can be distributed, dynamic and diverse. For example, in the Energy Informatics domain, provenance is often collected from large-scale workflows that span disciplines and organizations, and thus is usually stored in distributed repositories. However, there has been limited research on the reconstruction of and query over distributed provenance information. Meanwhile, recurrent and stream processing workflows can generate fine-grained provenance of overwhelming size that can be larger than the original dataset. Provenance storage approaches for efficiently managing such metadata volumes have not received adequate focus in the literature. Lastly, the fact that legacy tools without automatic provenance collection functionality are still widely used leads to a requirement for manual provenance annotation, which causes provenance to be incomplete.

In this thesis, using Energy Informatics as an exemplar domain, we design and develop algorithms and systems for managing provenance in dynamic, distributed and dataflow environments, motivated by real-world challenges. In particular, we make the following contributions: (1) template-based algorithms that can efficiently store provenance information for dynamic datasets, (2) algorithms for reconstructing and querying provenance graphs from distributed provenance repositories, and (3) semantic-based approaches for predicting incomplete provenance. We evaluate our research contributions with use cases from the Energy Informatics domain, including both Smart Oilfield and Smart Grid. The evaluation results demonstrate that our work can achieve efficient and scalable provenance management. As future work, we also discuss key challenges and initial solutions for presenting provenance across different granularities based on its usage context information.