A COMPLEX EVENT PROCESSING FRAMEWORK FOR FAST DATA MANAGEMENT

by

Qunzhi Zhou

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2014

Copyright 2014 Qunzhi Zhou

Table of Contents

List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Background and Literature Review
  2.1 Application Domain
    2.1.1 Smart Grid Evolution
    2.1.2 Demand Response Optimization in Smart Grid
    2.1.3 Information Integration in Smart Grid
    2.1.4 USC Campus Micro Grid Testbed
  2.2 Related Technologies
    2.2.1 Semantic Web
    2.2.2 Complex Event Processing
Chapter 3: Semantic Complex Event Processing
  3.1 Problem Motivation
  3.2 Approach Overview
  3.3 Event and Query Model
    3.3.1 Semantic Event Model
    3.3.2 Semantic Query Model
      3.3.2.1 Syntactic Filtering Query
      3.3.2.2 Syntactic Aggregation Query
      3.3.2.3 Syntactic Sequence Query
      3.3.2.4 Semantic Filtering Query
      3.3.2.5 Semantic Aggregation Query
      3.3.2.6 Semantic Sequence Query
  3.4 Processing Model
    3.4.1 Compile-time Semantic Processing
    3.4.2 Runtime Semantic Processing
      3.4.2.1 Baseline Approach
      3.4.2.2 Event Buffering
      3.4.2.3 Semantic Caching
  3.5 Implementation
  3.6 Related Work
  3.7 Conclusions
Chapter 4: Resilient Complex Event Processing
  4.1 Problem Motivation
  4.2 Approach Overview
  4.3 Event and Query Model
  4.4 Processing Model
    4.4.1 Archive Query Processing
      4.4.1.1 Naïve Event Replay
      4.4.1.2 Plain Query Rewriting
      4.4.1.3 Hybrid Rewriting and Replay
    4.4.2 Integrated Query Processing
      4.4.2.1 Query Plan for Zero Gap Streams
      4.4.2.2 Query Plan for Non-Zero Gap Streams
      4.4.2.3 Query Plan for Fail-Fast Scenario
  4.5 Implementation
  4.6 Related Work
  4.7 Conclusions
Chapter 5: Stateful Complex Event Processing
  5.1 Problem Motivation
  5.2 Approach Overview
  5.3 Query Algebra
    5.3.1 Data Stream and Domain Context
    5.3.2 Query Operations
    5.3.3 Query Properties
      5.3.3.1 Query Statefulness
      5.3.3.2 Query Subsumption
  5.4 Query Model
    5.4.1 Query Syntax
    5.4.2 Hierarchical Query Paradigm
  5.5 Processing Model
    5.5.1 Online Query Processing
    5.5.2 On-demand Query Processing
  5.6 Implementation
  5.7 Related Work
  5.8 Conclusions
Chapter 6: Quantitative Evaluation
  6.1 Semantic Complex Event Processing
  6.2 Resilient Complex Event Processing
  6.3 Stateful Complex Event Processing
    6.3.1 Online Query Processing
    6.3.2 On-demand Query Processing
Chapter 7: Empirical Evaluation: Dynamic Demand Response in Micro Grid
  7.1 Approach Overview
    7.1.1 State of the Art
    7.1.2 Event-Driven Demand Response in Micro Grid
  7.2 Smart Grid Information Space
  7.3 Smart Grid Domain Ontology Model
    7.3.1 Model Architecture
    7.3.2 Model Components
    7.3.3 Model Relations
  7.4 Dynamic Demand Response Query Taxonomy
    7.4.1 End-Use Purpose Dimension
      7.4.1.1 Monitoring Pattern
      7.4.1.2 Prediction Pattern
      7.4.1.3 Curtailment Pattern
    7.4.2 Spatial Scale
      7.4.2.1 Physical Space and Equipment
      7.4.2.2 Virtual Space
    7.4.3 Temporal Scale
      7.4.3.1 Frequency
      7.4.3.2 Latency
    7.4.4 Representation
    7.4.5 Life Cycle
    7.4.6 Adaptivity
  7.5 USC Micro Grid Experiments
    7.5.1 Events and Ontologies
    7.5.2 Queries and Empirical Evaluations
  7.6 Related Work
  7.7 Conclusions
Chapter 8: Dissertation Conclusions
Bibliography

List of Figures

1.1 Fast Data Management – Big Picture
1.2 Problem Space
2.1 Complex Event Processing Application
2.2 Basic Complex Event Processing Concepts
3.1 Semantic CEP versus Traditional CEP
3.2 Semantic-enriched AirflowReport Event
3.3 Semantic Subquery Graph (Query 3.5)
3.4 Semantic Subquery Property Path Sharing
3.5 SCEP Architecture Overview. Events are processed in an asynchronous pipeline model. Raw events arriving on streams are semantically annotated, and pipelined through a semantic filter and a CEP kernel.
4.1 Integrated Query Processing over End-to-End Event Streams
4.2 Rule-based SCEP to SPARQL Query Rewriting
4.3 End-to-End Event Stream Configurations. X axis shows time. Top and bottom dots are events available to realtime and database engines. Vertical dotted lines are realtime and archive event stream boundaries.
4.4 Integrated Query Plans over End-to-End Event Streams
4.5 SCEPter Architecture Overview. SCEP queries are performed seamlessly over realtime and persistent event streams.
5.1 Hybrid Online and On-demand Querying over Data Streams
5.2 Semantic-enriched AirflowReport Data Tuple with Native and Foreign Attributes
5.3 Stream Containment
5.4 Matches, Violation and State of Stream Queries
5.5 Stateful Online Query Plan
5.6 Stateful Complex Event Processing Architecture
6.1 SCEPter Realtime Query Performance
6.2 SCEPter Integrated Query Performance in Fail-Fast Scenarios
6.3 H2O Online and On-demand Query Performance
7.1 Event-Driven Demand Response in Micro Grid
7.2 Interplay between Information Concept Spaces that are Relevant to Smart Grid Applications
7.3 Electrical Equipment Ontology
7.4 Infrastructure Ontology
7.5 Integrated Domain Ontology
7.6 Top-level Orthogonal Dimensions of the D²R Pattern Taxonomy
7.7 End-Use Purpose Dimension of the D²R Pattern Taxonomy
7.8 Spatial and Temporal Dimensions of the D²R Pattern Taxonomy
7.9 Representation, Life Cycle and Adaptivity Dimensions of the D²R Pattern Taxonomy
7.10 Experiment Results

Abstract

Emerging applications in domains like Smart Grid, e-commerce and financial services exemplify the need to manage Fast Data – Big Data with an emphasis on data Velocity. Utility companies, social media and financial institutions often need to analyze data streams arriving at a high rate for business operations and innovative services. For example, dynamic demand response, realtime advertising, online retail and algorithmic trading attempt to leverage high-frequency meter and sensor readings, advertisement auctions, purchasing behaviors and stock ticks, respectively, to make timely decisions.

Existing Big Data management systems, however, are mostly Volume-centric. Specialized technologies, including distributed RDBMS, Hadoop and NoSQL databases, were developed for scalable and reliable storage of data sets as large as terabytes and even petabytes in volume.
These systems provide programming and query primitives, and high cumulative I/O read performance, to facilitate large-scale computation over persistent or slow-changing data on durable storage.

Complex Event Processing (CEP), on the other hand, is a promising paradigm for managing Fast Data. CEP is recognized for online analytics of data that arrive continuously from ubiquitous, always-on sensors and digital event streams. CEP systems are designed to perform high-throughput online queries over high-rate or constantly-changing data, typically leveraging in-memory query matching algorithms to minimize interactions with persistent volumes.

Fast Data applications, while emphasizing data Velocity, do not preclude high data Variety and large Volume as well. Applications with such multi-dimensional characteristics require certain distinctive capabilities that go beyond traditional CEP systems. One is the need to process query patterns over heterogeneous information spaces that may span multiple domains such as engineering, social community and public policy. CEP queries have to abstract away these domain complexities and allow users to define accessible knowledge-based analytics. Second is the capability to support queries over a continuous timespan covering past, present and future for resilient analytics. This requires seamless query processing across the boundary of realtime data streams and persistent data volumes. Third is the capability to perform on-demand queries on the fly that complement pre-defined online queries. Given the transient nature of Fast Data – data arrive and leave streams at high velocity – on-demand queries should be processed with in-flight data management that obviates the need to persistently store everything.

In this dissertation, we describe a Complex Event Processing framework for Fast Data management that considers all the 3-V dimensions. Specifically, we extend state-of-the-art CEP systems and make the following contributions: (1) Semantic Complex Event Processing for high-level query modeling and processing in diverse information spaces, hiding data Variety; (2) Resilient Complex Event Processing across the boundary of high-Velocity data streams and persistent data storage; (3) Stateful Complex Event Processing for hybrid online and on-demand queries over transient data Volumes on streams. We perform both quantitative and empirical evaluations of our approaches, using real-world applications from the Smart Grid domain. The evaluation results verify the efficacy of the proposed framework and confirm the performance benefits of the optimization techniques.

Chapter 1
Introduction

Fast Data, a Big Data variant that emphasizes high data Velocity, has recently attracted increasing attention [28]. Organizations such as utility companies, social media and financial institutions often face scenarios where they need to process continuously arriving data for business operations and innovative services. These applications include dynamic demand response optimization in Smart Grid [123, 124], realtime advertising and online retail in e-commerce [26, 109], algorithmic trading in financial services [34] and so on. They correlate realtime data such as meter and sensor readings, online advertisements, digital shopping operations and stock ticks, respectively, to make timely analyses and decisions. The underlying data, streaming from diverse sources, may also vary in structure and semantic meaning.

Existing Big Data management systems have mostly addressed the 3 V's of Big Data [74] – Volume, Variety and Velocity – from the Volume perspective. Specialized technologies including Hadoop [97] and NoSQL [66] databases have been developed to provide programming and query primitives that allow scalable processing of data sets as large as terabytes and even petabytes in volume.
They adopt a key-value pair data model and leverage key space distribution to scale out storage and computation tasks over clusters of commodity servers. Their high read performance and availability make them suitable for applications that perform write-once-read-many operations over data sets of large Volume but relatively low Velocity.

Complex Event Processing (CEP), on the other hand, is a promising paradigm for managing Fast Data. CEP is recognized for online analytics over data that arrives continuously from ubiquitous, always-on sensors and digital event streams [30]. CEP allows compositions of events with specified constraints, also called event patterns or complex events, to be detected by examining thousands of event streams in realtime for situation awareness. In particular, CEP adopts high-performance in-memory pattern matching algorithms to deal with data Velocity and achieve high processing throughput. As a result, CEP has grown popular for operational intelligence, where online pattern detection drives realtime response, and is used in domains ranging from mobile computing [117] to RFID data management [113, 118], algorithmic financial trading [34] and healthcare monitoring [111].

Figure 1.1: Fast Data Management – Big Picture (online queries: persistent interests, continuous evaluation; on-demand queries: transient interests, ad-hoc evaluation; a Fast Data query engine handles high data Velocity, Variety and Volume over data streams and data archives)

Fast Data applications, which deal with data Variety and Volume in addition to high data Velocity, motivate certain distinctive capabilities beyond traditional CEP systems, as shown in Figure 1.1. One is the need to process query patterns over heterogeneous information spaces that may involve multiple domains in engineering, social community, public policy and so on. CEP queries have to abstract away these domain complexities and offer the diverse users of analytics accessible knowledge on system behaviors.
Often, semantic concepts and relations from the domains are combined to obtain this abstraction [37, 119, 57].

Two is the capability to perform queries and analyze results seamlessly over end-to-end data streams – from network to storage – for resilient operational analysis. In real-world applications, business-critical data such as meter readings and stock ticks are usually archived for regulatory compliance. The capability of querying across the boundaries of realtime and archived data helps adapt to lazy-definition and fail-fast conditions. Lazy-definition is a common need where users define online queries after the time of interest has passed; it is a means of validating hindsight through post-hoc analysis and using it to inform future event patterns. A fail-fast condition, on the other hand, refers to infrequent but critical faults in the information infrastructure that cause online queries to fail. Consequently, the CEP queries have to be performed over event streams persisted to durable storage, in a latency-sensitive and efficient manner.

Three is the capability to perform concurrent online and on-demand queries over data streams. Online queries supported by existing CEP systems represent static or persistent user interests that need to be evaluated continuously as new data appears, to trigger realtime actions. On-demand queries, as supported in traditional databases, represent user interests that need to be answered in an ad-hoc manner, usually using stored historical data. One key notion of Fast Data is the dynamic or transient data life-cycle: data arrives at a high rate, and its analytic value also fades away at a high rate. This characteristic offers the opportunity for on-demand querying of Fast Data with in-flight volume management, which obviates the need to persistently store everything and the overhead of interacting with a persistent store.
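The contrast between continuous online queries and ad-hoc on-demand queries over in-flight data can be sketched in a few lines (a toy illustration under assumed names and thresholds, not the framework developed in this dissertation):

```python
from collections import deque

class ToyStreamEngine:
    """Toy sketch: a continuous online query vs. an ad-hoc on-demand query."""

    def __init__(self, window_size=5):
        self.window = deque(maxlen=window_size)  # transient in-flight state
        self.alerts = []                         # online query output

    def on_event(self, reading):
        # Online query: evaluated continuously, on every arriving event.
        self.window.append(reading)
        if reading > 100:                        # e.g. an assumed load-spike threshold
            self.alerts.append(reading)

    def on_demand(self):
        # On-demand query: answered from in-flight state, with no durable store.
        return sum(self.window) / len(self.window)

engine = ToyStreamEngine()
for r in [90, 95, 120, 80, 110]:
    engine.on_event(r)

print(engine.alerts)       # spikes detected online: [120, 110]
print(engine.on_demand())  # ad-hoc average over the retained window: 99.0
```

The point of the sketch is that both query kinds share the same transient window: once events age out of it, they are gone, which is exactly why in-flight volume management matters.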
In this dissertation, we describe a Semantic, Resilient and Stateful Complex Event Processing framework for Fast Data management that considers all the 3-V dimensions. We model Fast Data as realtime (dynamic) and archived (static) event streams, classify queries as continuous online queries and ad-hoc on-demand queries, and explore the problem space shown in Figure 1.2. Specifically, we extend the state of the art in Complex Event Processing and make the following contributions.

Figure 1.2: Problem Space (axes: Variety, from syntactic to semantic, and Velocity, from static to dynamic; databases cover persistent data with ad-hoc queries, traditional CEP covers transient data with continuous queries, and the Semantic Web covers semantic data)

• Semantic Complex Event Processing to hide data Variety. We discuss data Variety in motivating Fast Data applications and introduce a Semantic Complex Event Processing (SCEP) framework for high-level query processing over data streams. Data Variety, as a part of domain knowledge, is modeled in ontologies using Semantic Web technologies. Based on the domain ontologies, we introduce a semantic-enriched event and query model as an extension to traditional relational models to support abstract query specification, shielding data heterogeneity from end users. Online query processing techniques that use optimizations such as query rewriting, event buffering and semantic caching are proposed to mitigate severe performance overheads.

• Resilient Complex Event Processing across high-Velocity and static data. For resilient analytics over data that spans past (in a static or slow-changing history store), present and future (on realtime streams), we present SCEPter to uniformly process SCEP queries across the data boundaries. The SCEP query model is extended to operate seamlessly over end-to-end event streams from network to storage. We discuss approaches to process SCEP queries over event archives, including plain query rewriting, naive event replay and their hybrid, which leverages the arbitrage between the two.
Integrated query plans are analyzed in the context of temporal gaps that may exist between the data boundaries, so as to return consistent results.

• Stateful Complex Event Processing for on-demand query evaluation over transient data Volumes. Motivated by applications from diverse domains such as Smart Grid and e-commerce, we introduce Stateful Complex Event Processing to enable Hybrid Online and On-demand (H2O) queries over data streams. H2O inherits CEP systems' capability of high-throughput online query processing and also supports in-flight ad-hoc query evaluation. Specifically, we develop a formal query algebra to capture the statefulness and subsumption semantics of queries, which forms the foundation of a hierarchical query paradigm. A unified query model is proposed based on the query algebra for online and on-demand query specification. Online query states are leveraged as a means of dynamic data life-cycle management and view materialization to facilitate on-demand query evaluation over streams.

• Quantitative and Empirical Evaluations. We implement the proposed approaches and evaluate their performance quantitatively using benchmark data from the Smart Grid domain. We also apply the developed prototype system to enable Dynamic Demand Response (D²R) optimization in the USC campus Micro Grid. The objective of D²R is to manage demand-side power load in response to supply conditions in realtime. Traditional DR approaches require advance planning, hours or days ahead, and operate on a broadcast principle that reaches all customers. D²R, on the other hand, leverages Fast Data available in the Smart Grid, including realtime meter and sensor readings, event schedules and other digital data streams, to understand dynamic energy consumption and respond with precise curtailment actions, with low latency and high relevance.
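The curtailment side of such a D²R decision can be sketched as a toy greedy planner (purely illustrative; the load names, numbers and greedy strategy are assumptions, not the dissertation's algorithm):

```python
def curtailment_plan(demand_kw, supply_kw, loads):
    """Toy D2R sketch: given forecast demand exceeding supply, pick
    sheddable loads greedily (largest first) until the excess is covered."""
    excess = demand_kw - supply_kw
    plan = []
    for name, kw in sorted(loads.items(), key=lambda kv: -kv[1]):
        if excess <= 0:
            break
        plan.append(name)   # schedule this load for curtailment
        excess -= kw
    return plan

# Hypothetical scenario: forecast 20.5 MW demand against 20.0 MW supply.
print(curtailment_plan(20500, 20000,
                       {"hvac_zone_a": 300, "hvac_zone_b": 250, "lighting": 100}))
# → ['hvac_zone_a', 'hvac_zone_b']
```

In the actual framework, the demand forecast and the set of sheddable loads would themselves be outputs of CEP queries over meter and sensor streams; the planner above only stands in for the final "respond with precise curtailment actions" step.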
Chapter 2
Background and Literature Review

In this chapter, we describe the main motivating application of our work – Dynamic Demand Response (D²R) in Smart Grid – and review the state of the art in relevant technologies, including Semantic Web and Complex Event Processing.

2.1 Application Domain

2.1.1 Smart Grid Evolution

Smart Grid refers to the modernization of electric systems through the integration of digital and information technologies [19]. Smart Grid technology allows for more efficient and reliable use of power grids by monitoring, protecting and automatically optimizing the operations of its interconnected elements – from central and distributed generators through the transmission and distribution network, to building automation systems, energy storage and end-use equipment.

In the past few years, major efforts have been made in the development of Smart Grid infrastructures to provide valuable information that utilities can access to improve their services. Conventional meters, which record accumulated power usage on a monthly basis, are being replaced by smart meters that can report power usage and quality in nearly realtime. In Europe, for example, Italy and Sweden are approaching 100 percent deployment of smart meters for consumers. In the U.S., the largest municipal utility, the Los Angeles Department of Water and Power (LADWP), has begun to expand its advanced metering infrastructure (AMI) serving its commercial and industrial customers [19]. Smart appliances for households and commercial buildings, such as air-conditioners, clothes dryers and washing machines, as well as plug-in electric vehicles (PEVs), are also in development. These new-generation appliances can talk to the grid and decide how best to operate, automatically scheduling their activity at strategic times based on available power generation. The deployment of such smart hardware will also accelerate as these devices become even more affordable [27].
Making the most of information from the Smart Grid increasingly requires dealing with Fast Data – data with high Velocity, Volume and Variety. With high-frequency smart meters and a variety of other sources, utilities will collect and monitor power usage information at exponentially growing rate and volume. Data Variety, on the other hand, signifies the increasing range of data types and complexities: data is collected not only from “traditional” sources like meters, equipment and control systems but also from weather forecasting systems, cameras and online web services. The variety of data is likely to become increasingly pertinent to utilities as they begin to incorporate data from other relevant domains, such as social media events, as part of their decision making and planning processes [29].

It is a necessity for the energy industry to turn to techniques that enable analyzing data on the fly, using tailored algorithms to model and process a variety of data in its native format, and applying optimizations to digest incoming data of high volume and speed. Many existing Big Data management systems can process vast volumes of information if there is sufficient time for the processing to complete. But many business operations in the Smart Grid require realtime decision making, such as outage prevention, equipment health monitoring and dynamic demand response, which we elaborate as an example application in the next section. In these applications, utilities have to leverage realtime data processing and analytics while bringing together various data sources.

2.1.2 Demand Response Optimization in Smart Grid

Demand Response (DR) is one of the cornerstone applications of the power grid [100]. DR deals with demand-side power load management in response to supply conditions. The main problem of demand response is predicting peak load and identifying opportunities
The benefits that DR offers are twofold: (1) it reduces the maxi- mum power generation capacity required by a utility to avoid blackouts or brownouts, and (2) it avoids starting and stopping power generating units by shaping the power usage to remain relatively constant over time. Existing DR approaches are typically done by static planning: (1) using a priori com- mitment by consumers to directly control end-use equipment for load shedding during a power shortage, or (2) setting price that varies by season or time of the day, offering incentives to consumers for pro-actively adjusting energy consumption, thus reducing their electricity bill and, at a broader level, contributing to global energy conservation. These approaches work under the observation that traditionally electricity demand has been relatively static and cyclical, with diurnal load patterns observed across a 24 hour period, and seasonal patterns seen across a calendar year. Given the Smart Grid environment with dynamic demand profiles and realtime mon- itoring data, a more effective DR strategy is to leverage advanced analytics to support data driven decision making and planning. The notion of Dynamic Demand Response (D 2 R) [93] uses near-realtime information to understand dynamic energy consumption situations, and responds with precise curtailment actions, with low latency and high rel- evance. D 2 R requires an integrated view of Smart Grid data across disparate domains while provides insight into power grid operations and assets, enabling utilities to take adaptive and fine-grained load control in addition to coarse schedules. 9 2.1.3 Information Integration in Smart Grid Understanding the variety of data and providing information integration is a prerequi- site for implementing Smart Grid capabilities. 
New components and improvements to energy information systems such as dynamic demand response will rely on data from existing and many new information sources within the continuously evolving Smart Grid infrastructure. These applications need to interpret information semantics in a common way in order to ensure that data can be exchanged and shared, and that intelligent activ- ities can be carried out in an efficient and cost effective manner. The ad-hoc point-to-point integration between pairs of devices or applications is no longer sufficient to handle data varieties in Smart Grid. At the utility level, there are many attempts to show information integration through smart meters with a consumer- facing interface, or a home energy monitoring system. A common form of such inter- faces provides a simple line chart or histogram of energy usage over various time inter- vals: minutes, hours, days, or weeks. Examples include home energy monitoring sys- tems such as Google’s PowerMeter application [5], and building information manage- ment systems such as the Pulse energy management software [15] by Small Energy Group, and AgileWaves [1]. While this direction is promising and many applications have reported positive initial results, they tend to focus singularly on data visualization. The existing systems do not have an architecture that facilitates plug-and-play, such that a developer could quickly integrate new software components to consume available data. Efforts have also been made to leverage standards to enable Smart Grid partici- pants to drive commoditization of the components and span energy information from the “micro” (i.e., for the power domain) to the “macro” (i.e., for multiple domains and users). Standards provide common protocols, syntax and data models that can be used by the various elements of the Smart Grid to work together. 
This in turn enables Smart Grid participants to focus on innovative applications, analytics, decision support facilities, etc., that create value for various customer segments. The Smart Grid standards space spans multiple domains, from electric power generation to information technology, and involves a number of organizations: the International Electrotechnical Commission (IEC), the Electric Power Research Institute (EPRI), the World Wide Web Consortium (W3C), the National Institute of Standards and Technology (NIST), and others [37]. In the last decade, the power industry has made great efforts in creating a Common Information Model (CIM) to resolve semantic inconsistency issues. The IEC 61970 and IEC 61968 [7] series of standards define data exchange specifications of CIM so that interoperability between various systems and applications can be achieved. Today, CIM is widely accepted by both vendors and customers.

There have been recent developments in modular and extensible semantic-level Smart Grid information integration models based on existing standards. In [37], the authors propose a shared ontology model to provide common semantics for Smart Grid applications. The ontology captures concepts governed by business semantics and by engineering and scientific principles, transforming existing standards, such as CIM, into a uniform conceptual model. To make semantic modeling accessible to domain experts, the authors developed the Semantic Application Design Language (SADL), a controlled-English language with an associated environment for building semantic models.

2.1.4 USC Campus Micro Grid Testbed

The data integration and processing techniques introduced in this thesis were originally motivated and evaluated by dynamic demand response applications within the University of Southern California (USC) campus Micro Grid [98, 122, 123]. The USC campus serves as a testbed for the Los Angeles Smart Grid project to experiment with and evaluate demand response technologies.
The USC campus encompasses many of the features that make up a diverse city like Los Angeles, making it suitable as a Micro Grid. It is the largest private customer of LADWP, with an annual consumption of 155 GWh and an average load of 20 MW. The campus is diverse, in both demographics and buildings. With 33,000 students and 13,000 faculty and staff spread over 300 acres containing classrooms, residence halls, offices, labs, hospitals, restaurants, public transit and even a gas station, it forms a "city within a city". The 100+ major buildings are between 2 and 90 years old with varied electrical and heating/cooling facilities. Two power vaults route power from LADWP and a co-generation chiller is available for energy storage. USC Facilities and Management Services (FMS) maintains a relatively "smart" electrical and equipment infrastructure. It has the ability to measure energy usage by building at minute intervals, with the possibility of zone- or room-level measurement for a third of the buildings and indirect calculation of equipment-level usage. Their control center aggregates data across all buildings, and can centrally control or override HVAC (heating, ventilation and air conditioning) equipment that consumes up to 50% of the total campus power. However, many of these features are only used through manual intervention when demand optimization is required, and automated intelligence for demand response is lacking. These features make the USC campus a ready, instrumented Smart Grid environment for conducting controlled and calibrated demand response experiments end to end [98]. Besides the available data collection and control facilities, there is also the flexibility of trying emerging Smart Grid sensors and instruments from third-party vendors on the campus for fine-grained and richer sources of data and points of control. The goal is to eventually scale out the successful models that work at the campus scale to a city scale.
2.2 Related Technologies

The data management framework proposed in this dissertation supports dynamic demand response in the Smart Grid based on Semantic Web information integration and Complex Event Processing. It correlates high-Velocity data from the Smart Grid and other relevant domains, interprets semantic data Varieties and integrates persistent data Volumes for timely demand prediction and load curtailment. In the following, we review the main concepts and the state of the art in relevant technologies.

2.2.1 Semantic Web

The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information are well defined to be both human understandable and machine processable [106]. It envisions the usage of the web as a universal medium for data, information and knowledge exchange. To achieve this vision, the World Wide Web Consortium (W3C) has proposed a set of standards, enabling technologies and tools to process the semantics of information. Key specifications include the Resource Description Framework (RDF) [16] and a variety of data interchange formats such as RDF Schema (RDFS) and the Web Ontology Language (OWL). The Resource Description Framework (RDF) is a family of W3C specifications originally designed as a metadata data model. Over the past decade, it has grown into a general method for modeling information through a variety of syntax formats. RDF is based on the idea of making statements or descriptions about resources. Statements about resources are presented in a normalized form of subject-predicate-object triples. The subject represents the resource, and the predicate denotes a property of the resource, or a relationship between the subject and the object. RDF Schema (RDFS) is an extensible knowledge representation language used to denote RDF vocabularies which structure RDF resources [107]. A query language called SPARQL [20] has been developed to query RDF data.
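The subject-predicate-object triple model can be illustrated with a minimal in-memory sketch. This is not an RDF library: the store, the match helper and its wildcard convention are hypothetical stand-ins for what an RDF store and a single SPARQL triple pattern express; the identifiers are borrowed from this chapter's Micro Grid examples.

```python
# A tiny set of RDF-style statements: each is a
# (subject, predicate, object) triple.
triples = {
    ("ee:D105VOLUME", "rdf:type", "ee:AirflowSensor"),
    ("ee:D105VOLUME", "ee:hasLocation", "bd:RTH105"),
    ("bd:RTH105", "rdf:type", "bd:Office"),
}

def match(pattern, store):
    """Return triples matching a pattern; None acts as a wildcard,
    roughly playing the role of a SPARQL variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which resources are typed as airflow sensors?
hits = match((None, "rdf:type", "ee:AirflowSensor"), triples)
```

A real SPARQL engine additionally joins many such patterns on shared variables and applies schema-level inference; this sketch only shows the single-pattern lookup that those features build on.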
OWL [13, 107] is a family of knowledge representation languages endorsed by W3C for authoring ontologies. An ontology is a formal representation of a set of concepts, instances and their relations in a domain. It provides a shared vocabulary which can be used to model a domain. OWL is based on RDF and provides more built-in vocabulary to describe relationships and concepts. As it is compatible with RDF, OWL ontologies can be serialized using RDF/XML syntax, while many of the RDF tools, storage and querying methods such as SPARQL can be reused for OWL data. W3C introduced three variants of the OWL specification with different levels of expressiveness: OWL Lite, OWL DL and OWL Full. Semantic Web information integration is an active research area that has been studied in various research and application domains, including databases [71, 90], web services [76, 32], health care [108, 59], the oil industry [101] and transportation [91]. The general objective is to facilitate interoperability of information systems and to share information sources that are often heterogeneous and distributed [92]. Contemporary approaches for semantic interoperability can be classified into two categories. The first is so-called Brute-force Data Conversion (BF) [62], which directly implements all necessary data transformations manually. This approach may require a large number of transformation agreements to be hardcoded that are difficult to maintain [92]. The other approach is Global Standardization, in which different information systems agree on a uniform standard. This causes the semantic differences to disappear, so there is no need for data transforms between components. Unfortunately, such standardization is usually infeasible for many domains due to organizational or operational reasons [119]. The Semantic Web provides an extensible framework that allows information to be shared and reused across application and domain boundaries using ontologies.
The shortcomings of the traditional approaches can be overcome by declaratively describing data semantics using ontologies and separating knowledge representation from data transformation [62]. A widely adopted approach is to allow information sources to describe their vocabularies of information independently using ontologies. Inter-ontology mappings and reasoning services are then applied to identify semantically corresponding terms of component ontologies, e.g., which terms are semantically equal or similar. Numerous research projects utilize ontologies to represent semantic data varieties and facilitate information integration. For example, in [59] the authors describe the use of Semantic Web technologies for sharing knowledge in healthcare. It combines relational databases and ontologies to extract knowledge that is not explicitly declared within the database. An ontology representation of the UMLS (Unified Medical Language System) represents the basic medical concepts, and mappings and inference over the semantic knowledge are done to query and update heterogeneous databases. In [91], we developed a traffic modeling and simulation framework into which more focused simulators can be integrated. A transportation domain ontology is used as the common modeling language and data exchange model for integrated simulation. In [122], we developed a Micro Grid ontology model to capture the domain concepts and entities associated with the demand response application on the USC campus. The ontologies are organized in a modular fashion, allowing components from relevant domains to be linked together.

2.2.2 Complex Event Processing

[Figure 2.1: Complex Event Processing Application — pattern queries are matched against event sources to produce realtime detections]

As shown in Figure 2.1, Complex Event Processing (CEP) deals with detecting realtime situations, represented as composite event patterns, from among event streams that originate from various sources.
The RAPIDE simulation research project [53] introduced CEP in 1993. In CEP, a primitive event is defined as an occurrence of interest at a point in time, and a complex event, or event pattern query, is defined as a combination of primitive events [79], as shown in Figure 2.2. CEP has been developed with a focus on realtime pattern matching over high-rate data streams, which is considered one major limitation of traditional database management systems (DBMS). In recent years, CEP has received increasing attention in the research community, motivated by its applications in a variety of domains including financial services [34, 81], health care [47], sensor networks [56, 114] and RFID data management [113, 112]. Research prototypes such as SNOOP [45], Cayuga [54, 30] and SASE [116, 65, 55] were proposed, and several commercial systems such as ruleCore [17], Oracle CEP server [12], the Sybase Coral8 engine [18] and Esper [4] are available.

[Figure 2.2: Basic Complex Event Processing Concepts — (a) a primitive event e(t) at a time point t; (b) a complex event pattern {e(ti) | t1 < ti < t2} over a time interval]

Among existing systems, Snoop [45] is an early prototype built over active databases. The authors treated updates to active databases as events and observed that the detection of composite database events leads to monotonically increasing storage overhead, as previous occurrences of events cannot be deleted. To overcome this problem, they introduced the concept of event selection and consumption policies, i.e., so-called parameter contexts, for precisely restricting event occurrences. They identified parameter contexts such as recent, chronicle and contiguous and developed algorithms for detecting composite events in all parameter contexts. Cayuga [54, 30] is a high-performance, single-server CEP system designed to efficiently support a large number of concurrent queries over event streams.
In Cayuga, event streams are modeled as infinite sequences of relational tuples with interval-based timestamps. Cayuga defines an event algebra that contains the operators selection, projection, renaming, union, conditional sequence and iteration. Event patterns defined using the algebra are detected by incremental matching algorithms based on Nondeterministic Finite Automata (NFA). To achieve high performance, Cayuga also uses custom heap management, indexing of operator predicates and reuse of shared automaton instances. Other CEP systems include SASE [116, 65, 55], Siddhi [102] and Esper [4]. These systems focus on extending the event selection and consumption policies, query operators and matching algorithms to enable expressive query patterns and their efficient execution. They often provide strongly-typed SQL-like declarative query languages with built-in operators to support temporal sequence, window and Kleene Closure patterns. However, these existing systems only consider online query processing over realtime events and are subject to low-level query specification over syntactic event models. These limitations expose distinctive gaps with respect to the requirements of emerging Fast Data applications, as we will describe in later chapters.

Chapter 3
Semantic Complex Event Processing

Emerging applications in Cyber Physical Systems (CPS) [89] have been pushing the boundaries of Complex Event Processing to incorporate data Variety and enable information integration. In a CPS domain, the operation and optimization of interconnected physical infrastructure are based on information analysis performed on cyberinfrastructure. Pervasive sensors monitoring physical systems like traffic networks [44] and the Smart Grid [25] generate event data that vary in data sources, data structures and meanings.
Such heterogeneous data needs to be integrated and correlated with supplemental domain knowledge to offer insight into system behavior for operational decisions, say, to change traffic signaling or to shape consumers' power demand. Traditional CEP systems expose clear gaps in capturing data varieties in query specifications, and in their robust execution on streams. Most existing CEP engines process relational events and queries, exposing users to underlying data heterogeneity [65, 30]. Recently, C-SPARQL [39] and ETALIS [38] introduced event semantics into CEP for abstract query specification, allowing background knowledge to be combined with realtime events. But these are semantic-centric solutions, leveraging inference engines to model and process queries over events and domain knowledge. Many native CEP temporal patterns, such as Kleene Closure, and matching policies for event selection and consumption [49, 58] are not supported. Consequently, they lack the power and scalability of full CEP systems that exploit query patterns, processing techniques and algorithms designed for event stream applications. In this chapter, we describe a Semantic Complex Event Processing (SCEP) framework that extends traditional CEP systems to support high-level query specifications over data streams, shielding underlying data Variety. Our main contributions are summarized as follows:

• We introduce semantically enriched event and query models to support both semantic and native CEP query predicates (§ 3.3). Query constraints are specified over raw data streams and linked domain ontologies, allowing uniform queries to operate over a diverse and evolving information environment.

• We describe techniques for online SCEP query processing (§ 3.4), using optimizations including compile-time query rewriting, event buffering and semantic caching to mitigate the overhead introduced by data semantics.
• We implement the proposed SCEP models and online query processing techniques in a prototype system (§ 3.5). The quantitative and empirical evaluations are summarized in later chapters (§ 6 and § 7).

3.1 Problem Motivation

The Smart Grid is an exemplar of Cyber Physical Systems and forms an application domain for Complex Event Processing. The Smart Grid deploys sensors and communication devices in the generation, transmission and distribution networks of the power grid, and utilizes operational analytics over data arriving from these for intelligent power grid management [95]. Demand Response optimization (DR) is a cornerstone power grid application that attempts to prevent mismatches between power generation and consumer demand by shaping the demand, through consumer notifications and incentives, and thus protects the grid's reliability against blackouts. DR decisions are usually made statically, hours or days ahead for all customers within the utility, using fixed schedules and time-of-use pricing to encourage consumer energy curtailment [25]. In the Smart Grid, realtime monitoring information offers the opportunity to make DR decisions dynamically and by targeting specific customers or equipment fitting usage profiles, for improved operational efficiency. Complex Event Processing is a natural tool to implement Dynamic Demand Response (D2R) in the Smart Grid. As introduced in Section 2.2, CEP is a technique for detecting patterns of events over one or more realtime event streams, usually with temporal clauses that capture event ordering or range constraints. It can be used to examine meter and sensor readings and other relevant data streams in the Smart Grid to detect patterns that predict imminent energy demand or locate customers with profiles that match curtailment opportunities, and to support operational DR decisions on customer notification or power generation.
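As a minimal sketch of the kind of moving-window pattern a D2R CEP query might evaluate, the following flags whenever the average of the last few demand readings exceeds a threshold. The readings, window size and 500-unit threshold are invented for illustration; a real engine would express this declaratively and handle out-of-order, high-rate streams.

```python
from collections import deque

def window_average_alerts(readings, window=3, threshold=500.0):
    """Yield (timestamp, avg) whenever the average of the last
    `window` readings exceeds `threshold` — a toy stand-in for a
    CEP aggregation-over-window pattern."""
    buf = deque(maxlen=window)  # sliding window of recent readings
    for ts, demand in readings:
        buf.append(demand)
        if len(buf) == window:
            avg = sum(buf) / window
            if avg > threshold:
                yield ts, avg

# Hypothetical (timestamp, demand) stream from a single meter.
stream = [(1, 480.0), (2, 505.0), (3, 530.0), (4, 560.0)]
alerts = list(window_average_alerts(stream))
# The first possible window closes at ts=3 with average 505.0.
```

In a production CEP system the same logic would be a one-line windowed aggregation query; the point here is only the incremental, per-event evaluation over a bounded buffer.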
However, the pervasive information heterogeneities in the Smart Grid CPS domain present novel requirements, in both modeling and runtime capabilities, to contemporary CEP systems. We illustrate these requirements using dynamic demand response scenarios within the USC campus Micro Grid testbed. First, the system must deal with diverse information sources and multi-disciplinary participants. Meters, sensors and other instruments that monitor the physical grid infrastructure and produce an avalanche of data come from different vendors and adopt various data structures and syntax. Besides, information from other relevant domains, including spatial locations, organizations, class schedules, weather conditions and so on, is also correlated for decision making [119]. This information heterogeneity makes it challenging for users, such as facility managers, department coordinators and end-use consumers, to define meaningful energy-use patterns, and motivates a shift toward capturing pattern definitions at higher levels of abstraction. In addition, executing these high-level queries on realtime data streams should not introduce undue latencies and overheads that obviate their benefits. Second, the physical infrastructure of the Smart Grid CPS domain is subject to continuous evolution, given the emerging nature of the technologies. For example, USC, as the largest private power consumer in Los Angeles, has over 60,000 students, faculty and staff spread over 170 buildings [120]. This means that infrastructure is constantly being upgraded and consumers change every year. Of late, an average of two new buildings are built each year on campus, each with hundreds of sensors and equipment. Ambient sensors such as temperature, airflow and CO2 sensors are deployed at the room level to monitor conditions. Likewise, around 19,000 new students enroll at USC each year, which induces changes in power usage profiles in dormitories and classrooms.
Smart Grid applications, specifically D2R CEP queries, need to sustainably adapt to or be shielded from these changes in the information space with low overhead.

3.2 Approach Overview

We propose a Semantic Complex Event Processing (SCEP) framework, using a combination of Complex Event Processing and Semantic Web technologies to query data with both high Velocity and Variety. SCEP is designed as a generic framework based on formal models, despite its motivation from D2R applications in the Smart Grid. Other potential application domains include e-commerce, digital healthcare and supply chain management. These domains also feature realtime data processing with diverse information spaces and multi-disciplinary users.

[Figure 3.1: Semantic CEP versus Traditional CEP — traditional CEP applies syntactic matching to relational pattern queries specified online, while semantic CEP adds semantic annotation and semantic filtering stages over semantic event and domain models (with model updates) before syntactic matching of semantic pattern queries]

As shown in Figure 3.1, the SCEP framework captures data varieties and domain context using semantic ontology models. User queries are specified over the event and domain models at a high level rather than over raw data streams. Specifically, the semantically enriched event and query models offer the following advantages: Interoperability. As mentioned in Section 3.1, the broad space of software and hardware vendors in the Smart Grid means that different standards need to co-exist. This extends to data formats and meanings. For example, airflow sensors on campus use different variants of the airflow attribute, such as flowrate and airvolume, in their raw data schemas. With traditional CEP systems, pattern designers are exposed to the structural and syntactic heterogeneity of events and have to rewrite the same query for different data streams whose formats may vary.
The semantic CEP model helps capture these distinctions, for example using the owl:sameAs relation, in domain ontologies and allows a unified conceptual query specification over heterogeneous event schemas without in-depth knowledge of standards. This also reduces the complexity of operational debugging by having a smaller set of conceptual patterns. Expressivity. Traditional CEP systems process events solely based on the attributes they possess in the event tuples. This prohibits users from specifying meaningful query constraints over metadata that is not physically present in raw event tuples. By linking tuple attributes to metadata captured as part of the semantic ontology, query constraints can then be defined on related domain concepts and entities. This significantly enhances the power of an event pattern specification in detecting very precise situations while eliminating false positives. Accessibility. Defining DR event patterns over domain ontologies shields users from lower-level details of data streams and their changes. For example, in the USC campus Micro Grid, we can easily define pattern queries that apply only to meeting rooms, even if the end user is not aware of individual sensor locations and which buildings have meeting rooms, let alone the sensors that are deployed in those rooms.

3.3 Event and Query Model

Allowing users to specify event query patterns at the domain level first requires the events themselves to be enriched with domain context [123]. We introduce such a semantic event model followed by the query specification.

3.3.1 Semantic Event Model

Events in traditional CEP systems are treated as syntactic data and represented in various forms such as relational tuples, XML, JSON and POJOs [96]. Here we adopt the tuple representation [30], i.e., timestamped name-value attribute pairs, for raw events transferred on the network.
For example, an HVAC airflow measurement (AirflowReport) event may consist of three tuple attributes: sensorID, flowrate and timestamp. Events that are formed in other structural formats can easily be transformed to the tuple representation.

Data Tuple := {⟨name, value⟩*; timestamp}

With a syntactic, raw event model and in the absence of an external, static knowledge base, users have to be aware of low-level details of, say, the Smart Grid physical infrastructure to define meaningful patterns. Consider an example D2R scenario in the USC Micro Grid: a facility manager on the campus needs to curtail power load in response to a utility request. He first considers reducing power consumption in Office rooms where the airflow of the HVAC unit exceeds GreenBuildingAirflow. Here, Office and GreenBuildingAirflow are both domain concepts not present in the raw events. Users would have to know the variants of data schemas and the types of rooms each sensor is monitoring to define the query. To provide a higher-level abstraction for user queries, we leverage OWL ontologies (Web Ontology Language, http://www.w3.org/TR/owl-ref) to capture the domain entities associated with raw events. The ontologies are organized in a modular fashion, allowing components from related domains to be linked together [122]. Raw events are enriched with semantic context by mapping their tuple attributes to ontology entities as shown in Figure 3.2, and these semantics may be either static or dynamic. Static semantics are domain entities linked to the event stream schema, i.e., tuple attribute names. For example, the attribute name airflow for an AirflowReport event has multiple structurally different, but conceptually equivalent, names such as flowrate and airvolume. These equivalences are captured as part of the HVAC ontology model with namespace hvc using the owl:sameAs relation to allow pattern specifications that mask the data diversity, and they remain static.
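The effect of such owl:sameAs equivalences can be sketched as an offline normalization of schema variants to one canonical attribute name. The attribute names below come from this section's example; the dictionary-based mapping is a hypothetical stand-in for what ontology reasoning over the HVAC model would derive.

```python
# Stand-in for equivalences an ontology reasoner would infer from
# owl:sameAs assertions: variant attribute name -> canonical name.
SAME_AS = {"airvolume": "flowrate", "airflow": "flowrate"}

def normalize(event):
    """Rewrite tuple attribute names to their canonical form, leaving
    unknown attributes untouched."""
    return {SAME_AS.get(name, name): value for name, value in event.items()}

# Two raw AirflowReport tuples from vendors with different schemas.
e1 = {"sensorID": "D105VOL", "airvolume": 510.0, "timestamp": "2012-05-04T09:30"}
e2 = {"sensorID": "D106VOL", "flowrate": 495.0, "timestamp": "2012-05-04T09:31"}

# After normalization, a single query over `flowrate` covers both.
n1, n2 = normalize(e1), normalize(e2)
```

Because the mapping depends only on the stream schema, not on individual event values, it can be computed once at query compile time, which is exactly why these semantics are called static.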
Dynamic semantics relate to the actual values of event attributes that may vary for each event. For example, in Figure 3.2, the materialized semantic event relates the tuple attribute sensorID with domain concepts in the ontology such as sensors (ee:D105VOLUME), buildings (bd:RTH105), room types (bd:Office), measured variables (ee:AirflowSensor), and organizations (org:EEDepartment). Note that bd:belongsTo is defined as a transitive property.

[Figure 3.2: Semantic-enriched AirflowReport Event — a raw event tuple (sensorID, timestamp, flowrate) is materialized via context mapping into a semantic event linked to the domain ontologies through properties such as evt:hasSource, evt:hasTime, hvc:airflow, ee:hasLocation, rdf:type and bd:belongsTo]

Event semantics allow users to query events based on domain context that may not be directly present in the raw event structure but rather is inferred from background knowledge. For example, in the D2R scenario described previously, the facility manager, who is interested in HVAC events from a certain type of physical space, can navigate the semantic relations captured in the domain ontologies from the sensorID attribute to the location and location type.

3.3.2 Semantic Query Model

We propose a two-segment SCEP query model that decouples the semantic and structural CEP predicates. Besides simplifying specification by the user, this also allows for pipelined execution. The general query structure is as follows:

SCEP Query :=
  PREFIX <ontology name spaces> .
  SELECT <output event stream definition> .
  FROM <input event stream definition> .
  WHERE [Semantic subquery]* | [CEP subquery]?
The PREFIX clause defines domain ontology namespaces that can be referenced in both semantic and CEP subqueries. In particular, it enables qualified access to static event semantics in CEP subqueries. The SELECT clause projects properties of matching events, such as attributes or aggregation functions, to an output event; optional subexpressions with the keyword AS can be used to rename output event attributes. The input event definition labeled by the keyword FROM associates input event variables with named streams. For example, event ?e from an airflow measurement stream AirflowReport is declared as:

FROM (?e, AirflowReport)

Finally, the WHERE clause specifies a pipeline of semantic and CEP subqueries to qualify events. Specifically, the semantic subquery segment specifies semantic filtering constraints over raw event tuples and the linked domain ontologies. Semantic subqueries are represented using the SPARQL OWL ontology query language (http://www.w3.org/TR/rdf-sparql-query), and can be modeled as a query graph with multiple property paths (http://www.w3.org/TR/sparql11-property-paths). Here, we assume each subquery operates on just one event variable, if present; support for semantic correlation across multiple events is left as future work. Semantic subqueries are labeled using the keyword PATH as:

Semantic Subquery := PATH <SPARQL triple patterns>
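The way a PATH clause's triple patterns qualify an event can be sketched as conjunctive matching with variable bindings over the event's enriched triples. This is a drastically simplified, hypothetical matcher — strings beginning with "?" act as variables, and no OWL inference (e.g., transitivity of bd:belongsTo) is applied:

```python
def bind(pattern, triple, env):
    """Try to extend the bindings `env` so that `pattern` matches
    `triple`; return the extended bindings, or None on mismatch."""
    env = dict(env)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable term
            if env.get(p, t) != t:       # bound to a different value?
                return None
            env[p] = t
        elif p != t:                     # constant term must match
            return None
    return env

def solve(patterns, triples, env=None):
    """Return all variable bindings satisfying every pattern."""
    if env is None:
        env = {}
    if not patterns:
        return [env]
    results = []
    for triple in triples:
        new_env = bind(patterns[0], triple, env)
        if new_env is not None:
            results.extend(solve(patterns[1:], triples, new_env))
    return results

# Triples for one hypothetical enriched event plus linked ontology facts.
triples = [
    ("evt:1", "evt:hasSource", "ee:D105VOLUME"),
    ("ee:D105VOLUME", "bd:hasLocation", "bd:RTH105"),
    ("bd:RTH105", "rdf:type", "bd:Office"),
]
# A PATH-like conjunction: events sourced from a sensor located in an Office.
patterns = [
    ("?e", "evt:hasSource", "?src"),
    ("?src", "bd:hasLocation", "?loc"),
    ("?loc", "rdf:type", "bd:Office"),
]
solutions = solve(patterns, triples)
```

A production SPARQL engine adds indexing, join reordering and reasoning; the sketch only conveys that an event qualifies when at least one consistent set of bindings exists.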
To facilitate later discussion, we discretize composite CEP query operators present in existing CEP query languages into unit operators, each representing a single constraint from the following categories: non-correlation, value-based correlation and time/length- based correlation. These help model composite CEP patterns such as basic, threshold and temporal patterns introduced in [58]. Specifically, we adopt the following syntax for CEP subqueries: CEP Subquery ∶= [FILTER <non-correlation constraint>] * [JOIN <value-based correlation>] * [SEQ <temporal order correlation>]? [WINDOW <temporal range correlation>]? ... Non-correlation operator FILTER defines constraints for individual events based on attribute values; correlation operators define constraints across multiple events based on non-temporal attributes (e.g.,JOIN) or temporal attributes (e.g.,SEQ andWINDOW). We illustrate the various query constructs of the above model using examples from the campus Micro Grid, which are also reused in later sections. Specifically, we com- pare three common types of CEP queries including simple filtering, aggregation and sequence queries with their semantic-enriched queries. The examples use events from the HV AC airflow measurement stream and ontology namespaces defined below [122]. We ignore the sharedFROM clause andPREFIX namespaces in the example queries. 28 PREFIX bd:<http://cei.usc.edu/Building.owl#> PREFIX ee:<http://cei.usc.edu/Equipment.owl#> PREFIX hvc:<http://cei.usc.edu/HVAC.owl#> PREFIX org:<http://cei.usc.edu/Organization.owl#> PREFIX evt:<http://cei.usc.edu/Event.owl#> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 3.3.2.1 Syntactic Filtering Query Our SCEP model supports plain CEP queries, a simple form of which has just filtering constraints. For example, Query 3.1 Report the monitoring sensor and flow rate when the inbound airflow of a space exceeds 500 cfm (cubic feet per minute). 
SELECT ?e.sensorID, ?e.flowrate FILTER (?e.flowrate > 500) The syntactic constraint can be extended to incorporate static event semantics that leverage domain knowledge for diversity integration. For example, flowrate attribute is equivalent to its variant airflow captured in the HV AC ontology, to give: FILTER (?e.hvc:airflow > 500) 3.3.2.2 Syntactic Aggregation Query Aggregation functions such as average and sum can be computed over specific event attributes, by grouping events into moving windows, and matched continuously. For example, Query 3.2 Report the 5-minute average inbound airflow of a space when it is greater than 500 cfm. 29 SELECT AVG(?e.flowrate) > 500 AS avg WINDOW (?e * , 5min) 3.3.2.3 Syntactic Sequence Query CEP queries can also assert temporal ordering over events present in moving windows. For example, Query 3.3 Report the monitoring sensor when the airflow of a space is greater than 500 cfm and increased by 50 cfm within 5 minute. SELECT ?e1.sensorID FILTER (?e1.flowrate > 500) JOIN (?e2.sensorID = ?e1.sensorID) JOIN (?e2.flowrate - ?e1.flowrate > 50) SEQ (?e1, ?e2) WINDOW (?e1, ?e2, 5min) 3.3.2.4 Semantic Filtering Query A semantic filtering query places additional semantic constraints over events, besides CEP filtering subqueries. For example, a department energy coordinator can extend Query 3.1 with dynamic semantics as: Query 3.4 Report the monitoring sensor and flowrate when the airflow in an Office of EE department exceeds 500 cfm. Without knowing details of sensor deployment in physical spaces, the above query can be specified by simply extending Query 3.1 to include the following semantic sub- query, specified in SPARQL: 30 PATH f ?e evt:hasSource ?src . ?src rdf:type ee:AirflowSensor . ?src bd:hasLocation ?loc . ?loc rdf:type bd:Office . 
?loc bd:belongsTo org:EEDepartment g To illustrate the intuitiveness of SCEP query further, we leverage the domain con- cept, GreenOfficeAirflow, defined in the HV AC ontology instead of a static threshold value: Query 3.5 Report the monitoring sensor and flow rate when the airflow in an Office room of EE department exceeds GreenOfficeAirflow. In this case, we remove the CEP subquery and update the semantic subquery as following, PATH f ?e evt:hasSource ?src . ?src rdf:type ee:AirflowSensor . ?src bd:hasLocation ?loc . ?loc rdf:type bd:Office . ?loc bd:belongsTo org:EEDepartment . ?e hvc:airflow ?rate . hvc:GreenOfficeAirflow hvc:hasValue ?goa . FILTER (?rate > ?goa) g Note the FILTER operator in the semantic subqueries, i.e., PATH clauses, is a native SPARQL operator that is different from the CEPFILTER operator. 31 3.3.2.5 Semantic Aggregation Query A semantic aggregation query applies aggregation functions specified in the SELECT clause over events that match both semantic and CEP subquery constraints. For exam- ple, a variant of Query 3.2 can be: Query 3.6 Report the 5-minute average inbound airflow of a Office in EE department when it is greater than 500 cfm. This is identical to Query 3.6, using theAVG aggregation function, and including the semantic subquery from Query 3.4. 3.3.2.6 Semantic Sequence Query Similarly, a semantic sequence query specifies additional semantic constraints over events that fulfill the sequence CEP constraints. For example, Query 3.7 Detect the sequence of airflow events as constrained in Query 3.3 but only for MeetingRooms. Here, the CEP subquery is identical to Query 3.3 which uses SEQ clause to order events in 5-minute time windows. In addition, two semantic subqueries one for each component event in the sequence need to be specified. For example, the semantic sub- query for event ?e1 is similar to that of Query 3.4 as: PATH f ?e1 evt:hasSource ?src . ?src rdf:type ee:AirflowSensor . ?src bd:hasLocation ?loc . 
?loc rdf:type bd:MeetingRoom }

We also specify an analogous semantic subquery for ?e2 with the same constraint.

The above scenarios illustrate intuitive SCEP query specification in a Micro Grid environment. The high-level abstraction that utilizes semantic ontology models precludes the need to know the details of event data varieties or the Micro Grid infrastructure. As discussed later, event semantics are processed both offline and online for precise query matching.

3.4 Processing Model

In this section, we describe approaches to process SCEP queries over realtime event streams. Semantic SPARQL queries are expensive to evaluate, even for in-memory processing [60]. To reduce query overhead, we evaluate static and dynamic event semantics at query compile time (when the queries are provided by the user) and at runtime (when the events arrive), respectively.

3.4.1 Compile-time Semantic Processing

SCEP query predicates that are defined only on static event semantics, i.e., domain concepts associated with event stream schemas, can be evaluated offline since the results will not change for individual events. We apply compile-time query optimization in three steps: semantic pruning, migration and normalization.

Semantic pruning and migration reduce the complexities of semantic subqueries. In the pruning step, SPARQL property paths that originate from ontology constants such as classes and instances are executed in advance and the results are substituted into the query clauses. For example, in Query 3.5 the property path from the hvc:GreenOfficeAirflow concept can be replaced by its literal value 500, and the semantic subquery can be rewritten as:

PATH {
?e evt:hasSource ?src .
?src rdf:type ee:AirflowSensor .
?src bd:hasLocation ?loc .
?loc rdf:type bd:Office .
?loc bd:belongsTo org:EEDepartment .
?e hvc:airflow ?rate .
FILTER (?rate > 500)
}

In addition, semantic migration transforms one-hop property paths that originate from event variables into (more efficient) CEP subquery clauses. For example, the last two triple patterns in the above semantic subquery can be completely transformed into a CEP subquery constraint similar to Query 3.1:

FILTER (?e.hvc:airflow > 500)

Further, semantic normalization eliminates all static semantics from CEP subqueries. Semantic predicates and concepts present in CEP subqueries are normalized to standard terms based on the ontologies, so that inferencing is done once initially and not repeated for each arriving event. For example, airflow is semantically equivalent to flowrate and airvolume for the same concept, any of which may be used in a CEP subquery. A static semantic inference using the domain ontologies can normalize these alternative attributes offline to one standard term, say flowrate. As a result, Query 3.5 is compiled into the following query for processing:

PATH {
?e evt:hasSource ?src .
?src rdf:type ee:AirflowSensor .
?src bd:hasLocation ?loc .
?loc rdf:type bd:Office .
?loc bd:belongsTo org:EEDepartment
} | FILTER (?e.flowrate > 500)

This compile-time query optimization reduces the number of predicates in the semantic subqueries and eliminates semantic predicates in CEP subqueries, thus avoiding repetitive reasoning and evaluation at runtime.

3.4.2 Runtime Semantic Processing

We adopt an asynchronous pipeline architecture to process SCEP query segments at runtime. Event tuples that arrive on streams are annotated and linked with domain ontologies as shown in Figure 3.2. Semantically enriched events are passed to a semantic filter module which evaluates semantic subqueries. Events that satisfy the semantic constraints are passed on to a CEP engine that evaluates CEP subqueries. Here we focus on semantic subquery processing at runtime, since CEP subqueries can be processed by a traditional CEP kernel.
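The asynchronous pipeline described above can be sketched as three stages connected by queues. This is a minimal illustration, not the actual implementation: the `annotate`, `semantic_filter` and `cep_match` callables are hypothetical stand-ins for the annotator, semantic filter and CEP kernel modules.

```python
from queue import Queue
from threading import Thread

def run_pipeline(events, annotate, semantic_filter, cep_match):
    """Chain the three SCEP runtime stages over a finite event stream.

    Each stage runs in its own thread and communicates via a queue,
    mirroring the asynchronous pipeline: raw tuples are annotated,
    semantically filtered, and finally matched by the CEP kernel.
    """
    q1, q2, out = Queue(), Queue(), []
    DONE = object()  # sentinel marking the end of the stream

    def stage_annotate():
        for e in events:
            q1.put(annotate(e))
        q1.put(DONE)

    def stage_filter():
        while True:
            e = q1.get()
            if e is DONE:
                q2.put(DONE)
                return
            if semantic_filter(e):   # drop events failing semantic subqueries
                q2.put(e)

    def stage_cep():
        while True:
            e = q2.get()
            if e is DONE:
                return
            m = cep_match(e)         # evaluate CEP subqueries
            if m is not None:
                out.append(m)

    threads = [Thread(target=t) for t in (stage_annotate, stage_filter, stage_cep)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because the stages are decoupled by queues, a slow semantic filter does not block event annotation, which is the motivation for the asynchronous design.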
3.4.2.1 Baseline Approach

A straightforward approach is to evaluate the semantic subqueries upon the arrival of each new event. However, evaluating a SPARQL query requires costly inference and self-join operations over the entire knowledge base. In general, a SPARQL query with a single property path requires (n − 1) self-joins over the ontology, where n is the property path length [121]. We observed evaluation throughput flattening at ~80 events per second in our experiments (§ 6.1). To mitigate this, we propose two optimization approaches for runtime semantic subquery processing, both of which are implemented in our architecture.

3.4.2.2 Event Buffering

The idea of buffering data streams for lazy query processing has been used before, e.g., in XML stream querying [64] and T-REX event processing [48]. However, this approach has been studied primarily in the context of syntactic data and queries. Specifically, in T-REX, lazy processing based on buffers is considered less effective than automata-based eager evaluation for CEP sequence patterns.

For a SCEP semantic subquery, since the time taken to evaluate one event instance is similar to that of evaluating a small set of events, due to a high static query processing overhead, the event buffering approach is expected to be very effective. Hence, instead of evaluating semantic subqueries for each event upon its arrival, we buffer events that arrive within a (configurable) time interval and perform the query collectively on this batch of events. The obvious side-effect of this approach is the introduction of a pattern detection delay that, in the worst case, equals the buffer's time interval. So, the interval length should be small enough that user applications can tolerate the delay. As long as the query processing time for a batch of events is less than the buffer interval, the query throughput can keep up with the input event rate.
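The buffering scheme can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: events arrive as (timestamp, event) pairs in time order, and `evaluate_batch` is a hypothetical stand-in for evaluating the semantic subquery over a whole batch at once.

```python
def buffered_evaluate(timed_events, interval, evaluate_batch):
    """Group (timestamp, event) pairs into consecutive buffer intervals
    and evaluate the semantic subquery once per batch instead of once
    per event. Returns the events that passed, in arrival order."""
    passed, batch, window_end = [], [], None
    for t, e in timed_events:
        if window_end is None:
            window_end = t + interval
        if t >= window_end:                  # flush the buffer at interval end
            passed.extend(evaluate_batch(batch))
            batch, window_end = [], t + interval
        batch.append(e)
    if batch:                                # flush the final partial buffer
        passed.extend(evaluate_batch(batch))
    return passed
```

The per-batch call amortizes the static SPARQL evaluation overhead across all events in the buffer, at the cost of a detection delay bounded by `interval`.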
However, as shown in our experiments (§ 6.1), this performance benefit is subject to the input event rate itself. When the input event rate is very small, the buffered event batch may contain just one event, degenerating to the baseline. Conversely, if the input rate is very high, the time to perform the semantic query for the batch of events can exceed the buffer window, causing the query throughput to fall below the input rate. Thus the choice of event buffer interval is a trade-off between query latency and the maximum throughput that can be achieved.

3.4.2.3 Semantic Caching

Another optimization we propose is caching semantic subquery results, similar to the caching mechanisms employed to speed up memory, database and web data accesses.

[Figure 3.3: Semantic Subquery Graph (Query 3.5)]

The key intuition for semantic subquery caching is that multiple event tuples may share the same value for an attribute, and evaluation results for queries specified on those values can be reused. Considering the semantic event materialized from an event tuple as an RDF tree graph, rooted at the event URI with edges linked to property nodes (Figure 3.2), we have the following definition:

Definition 3.1 The event root properties of a semantic event are the ontology properties directly materialized from tuple attribute values.

For example, in Figure 3.2, the event root properties are ee:D105VOLUME, 2012-05-04T09:30 and 510.0.
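The materialization of a raw tuple into a semantic event rooted at the event URI can be sketched as below. The schema mapping and property names are illustrative assumptions modeled on the examples in this chapter, not the system's actual annotation file format.

```python
def materialize(tuple_row, schema_map, event_uri):
    """Translate a raw event tuple into RDF-style triples rooted at the
    event URI. `schema_map` maps each tuple attribute to its ontology
    property (e.g. sensorID -> evt:hasSource). The objects of these
    triples are the event root properties of Definition 3.1."""
    triples = []
    for attr, value in tuple_row.items():
        prop = schema_map[attr]
        triples.append((event_uri, prop, value))
    return triples
```

Linking these triples with the domain ontologies then yields the semantic event over which the subqueries are evaluated.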
On the other hand, as shown in Figure 3.3, a SPARQL semantic subquery consisting of multiple property paths can also be modeled as a tree graph whose root nodes are the event variables present in the query (e.g., ?e in Query 3.5), whose inner nodes are property variables (e.g., ?src and ?loc), and whose leaf nodes are constants including literals, ontology classes and instances (e.g., bd:Office and hvc:GreenOfficeAirflow) [123]. We evaluate a semantic subquery over an event to discover whether the event tree leads to the same set of leaf nodes as the query tree. Specifically:

Definition 3.2 The query root properties of a semantic event for a given query are its event root properties that are evaluated by the query.

Obviously, events that have the same query root properties share the same subquery evaluation results. As a simple example, for Query 3.5, different airflow measurement events with the same sensorID would have the same originating location type, and thus return the same boolean result for the semantic query: is the event from an Office that belongs to the EEDepartment?

Algorithm 1 Semantic Query Caching
Require: Cache table ht initialized for query Q, domain ontologies O
Ensure: Evaluation result v for query Q
1: while receiving semantic event e do
2:   Compute cache key k for e
3:   v ← ht.get(k)
4:   if v != null then
5:     Return v
6:   else
7:     v ← Evaluate(Q, e, O)
8:     Update {k, v} to ht based on LRU policy
9:     Return v
10:  end if
11: end while

The semantic cache is implemented using linked hashmaps that are maintained per semantic subquery. Their keys are formed using a canonical combination of the event root properties, and their values are the boolean results of the semantic query evaluation. The cache has a fixed number of entries, and we use a Least Recently Used (LRU) algorithm for cache updates.
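The per-subquery cache described above can be sketched as follows. This is a hedged illustration of the same idea, not the Java implementation: the `evaluate` callable and `key_fn` (which canonicalizes the query root properties into a cache key) are hypothetical interfaces.

```python
from collections import OrderedDict

class SemanticQueryCache:
    """Per-subquery result cache keyed on an event's query root
    properties, with LRU eviction (cf. Algorithm 1)."""

    def __init__(self, query, ontology, evaluate, capacity=1024):
        self.query, self.ontology, self.evaluate = query, ontology, evaluate
        self.capacity, self.table = capacity, OrderedDict()
        self.misses = 0  # number of full subquery evaluations performed

    def lookup(self, event, key_fn):
        key = key_fn(event)                  # canonical root-property key
        if key in self.table:
            self.table.move_to_end(key)      # refresh LRU position
            return self.table[key]
        self.misses += 1
        v = self.evaluate(self.query, event, self.ontology)
        self.table[key] = v
        if len(self.table) > self.capacity:  # evict least recently used
            self.table.popitem(last=False)
        return v
```

Events that share query root properties (e.g., the same sensorID for Query 3.5) hit the cache and skip the expensive SPARQL evaluation entirely.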
Denoting the boolean function that evaluates a query Q over event e and domain ontologies O as Evaluate(Q, e, O), the pseudocode for semantic query caching for semantic subquery processing is shown in Algorithm 1.

In addition to sharing query results between events, it also makes sense to share path evaluations between queries in scenarios where a number of queries share a smaller set of property paths.

[Figure 3.4: Semantic Subquery Property Path Sharing]

Consider the queries shown in Figure 3.4, where property paths are shared between multiple queries Q1, Q2 and Q3. It should be noted that property paths with leaf nodes such as bd:Office and bd:MeetingRoom, which are semantically disjoint, only need to be evaluated once for events with the same query root property. Based on this observation, we can cache query results at a more fine-grained level. Specifically, we maintain cache tables for individual property paths; the pseudocode for semantic path caching is shown in Algorithm 2.

Both event buffering and semantic query caching are applied in our prototype system for online SCEP query processing. Their performance results are reported in Chapter 6 together with other quantitative evaluations.

3.5 Implementation

Figure 3.5 shows our implementation of a Semantic Complex Event Processing system that incorporates the semantic event and query models we have introduced and validates the processing techniques and optimizations we have discussed.
Algorithm 2 Semantic Path Caching
Require: Cache tables ht_i initialized for paths P_i (i = 1 to n) of query Q; dht_ij (j = 1 to m) identified as disjoint cache tables of P_i; domain ontologies O
Ensure: Evaluation result v for query Q
1: while receiving semantic event e do
2:   Compute cache keys k_i of e for P_i (i = 1 to n)
3:   v ← true
4:   for i = 1 to n do
5:     v_i ← ht_i.get(k_i)
6:     if v_i = true then
7:       Continue
8:     else if v_i = false then
9:       v ← false
10:      Return v
11:    else
12:      v_i ← Evaluate(P_i, e, O)
13:      Update {k_i, v_i} to ht_i based on LRU policy
14:      Update {k_i, !v_i} to dht_ij (j = 1 to m) based on LRU policy
15:    end if
16:  end for
17:  Return v
18: end while

The SCEP realtime query engine is implemented fully in Java and built around an existing open-source CEP engine, Siddhi [102]. Siddhi is an exemplar of a traditional CEP engine that uses a tuple-based event model and supports common CEP operators such as filtering, aggregation and sequence over event streams. Our system complements the CEP kernel to support semantic CEP query processing using several modules, viz., the domain ontology model, stream and query manager, semantic annotator, and semantic filter. Event streams are pipelined through the semantic annotator, semantic filter and Siddhi CEP engine, which operate asynchronously.

[Figure 3.5: SCEP Architecture Overview. Events are processed in an asynchronous pipeline model. Raw events arriving on streams are semantically annotated, and pipelined through a semantic filter and a CEP kernel.]

The domain ontology model, represented in OWL, forms a knowledge base that captures domain concepts and their relations to support semantic query specification and processing.
For the campus Micro Grid domain, we organize ontologies in a modular fashion for extensibility [122]. Domain ontologies are loaded once into a Jena⁴ in-memory ontology query engine as a base model, which can be combined with semantic event instances for runtime query evaluation. The stream and query manager processes static event semantics present in event schemas and user queries, as discussed in Section 3.4.1. The normalized event stream schemas and subqueries are submitted for runtime processing. The semantic annotator module translates raw event tuples arriving at runtime into semantic events, based on an annotation file that describes mappings from the raw tuple schema to semantic properties. The semantic filter module processes dynamic event semantics, performing semantic subquery evaluation over input events. If an event satisfies a semantic subquery, it is passed to the CEP kernel's input streams to evaluate the corresponding CEP subqueries; otherwise, the event is dropped. The semantic filter implements the runtime query optimizations described in Section 3.4.2 for improved performance. Finally, the CEP kernel processes the CEP subqueries present in SCEP queries. The interfaces to the CEP kernel are generalized to allow CEP engines other than Siddhi to be used instead, if they provide unique capabilities.

⁴ Jena Framework, jena.sourceforge.net

3.6 Related Work

Our work in SCEP falls in the space of stream and Complex Event Processing [50, 52, 21, 30, 35]. These systems offer intuitive languages to query data streams and react to continuous data with low latency. For example, Borealis [50] and Storm [21] support data stream processing using a boxes-and-arrows workflow-based model. CEP systems [30, 35], as an evolution, typically follow a SQL-like query syntax and allow explicit specification of temporal patterns, such as sequences and time windows, in addition to relational patterns.
Existing CEP systems have focused on optimizations of syntactic and temporal pattern modeling and matching. Cayuga [30] leverages an eager (incremental) Nondeterministic Finite Automata (NFA) algorithm to process event sequences within moving windows. T-REX [48] compares eager and lazy (buffering) evaluations of CEP queries in realtime. In [94], the authors discuss query rewriting techniques for two commonly used CEP patterns, namely all and sequence, into subpatterns for parallel execution, where possible. Our SCEP system is a natural extension of traditional CEP systems in which CEP kernels are used to process CEP subqueries, while we focus on extending traditional CEP systems to further deal with data varieties.

One approach to addressing data variety in stream applications is the use of schema mappings [85]. However, this point-to-point degree of integration is neither scalable nor sustainable for broad and constantly changing information spaces like the Smart Grid. Examples [39, 38, 104] discuss incorporating extensible semantic data models into stream and Complex Event Processing. Specifically, C-SPARQL [39] extends the SPARQL language with window and aggregation clauses to support RDF stream processing. ETALIS [38] is a rule-based deductive system that acts as a unified execution engine for temporal pattern matching and semantic reasoning. Both event patterns and semantic knowledge are transformed into Prolog rules used for pattern derivation and detection. These are semantic-centric solutions, leveraging inference engines to model and process queries over data streams and domain knowledge. Many native CEP temporal patterns, such as Kleene Closure, and matching policies for event selection and consumption [49, 58] are not considered. Consequently, they lack the power and scalability of full CEP systems that exploit patterns, techniques and algorithms designed for streaming applications.
Rather than adopting a bespoke solution that departs from traditional CEP systems, our SCEP model integrates the native constructs of both CEP and semantic languages for intuitive query specification. More practically, it also allows rapid construction of a framework to process semantically enriched CEP queries using existing tooling, and improves performance through the proposed optimizations.

3.7 Conclusions

In this chapter, we introduced a Semantic Complex Event Processing (SCEP) framework to support online query processing over data streams while shielding underlying data Varieties from end users. The semantically enriched CEP model allows users in emerging Fast Data application domains to easily specify high-level query patterns over diverse information spaces and knowledge bases. We have illustrated the value of this model using the Smart Grid domain as a case study. Further, we have proposed approaches for translating the SCEP model into practice, along with optimizations to address the limitations of semantic query processing in realtime. These have been implemented in the SCEP realtime query engine to validate our design, and our experiments presented in Chapter 6 highlight the performance improvements.

Chapter 4
Resilient Complex Event Processing

Real-world Fast Data applications often require correlating data from high-Velocity streams with data from persistent stores for resilient operational analytics. This helps applications adapt to lazy-definition and fail-fast conditions. Lazy-definition is a common scenario where users define queries after the time of interest has passed. This is a means of validating hindsight through post hoc analysis and using it to inform future information patterns. The fail-fast condition, on the other hand, refers to infrequent but critical faults in the cyber-infrastructure that cause realtime queries to fail.
Consequently, queries will have to be performed over data streams persisted to durable storage in a latency-sensitive and efficient manner. Existing Complex Event Processing systems like the one previously described, however, have focused only on query processing over realtime events, either with or without consideration of data Variety.

Performing CEP queries across end-to-end event streams – from network to archive – has recently attracted research interest. Proposed solutions include the use of active databases [33], which leverage relational query engines to process time-varying data that persists, using triggers and incremental queries to match new events. DataCell [78] even layers in-memory tables on top of database kernels to handle realtime events. These systems sacrifice the expressivity of CEP queries and introduce latency into matched results. Alternatively, a recency-based CEP model was proposed in [87] to support a happen-before relation which links live streams with persistent events. However, the recency model is limited to correlating patterns present in realtime and archived streams separately, rather than uniformly processing queries over them.

In this chapter, we extend the SCEP framework described in Chapter 3 and introduce SCEPter, which uniformly processes SCEP queries across the boundary of realtime and persistent event streams. Our work supports lazy specification of queries for online and post-hoc analytics, and resilient query execution for operational decision support, while mitigating severe performance overheads through optimizations. Specifically, our contributions include:

• We extend the SCEP model to provide additional query predicates that allow queries to operate over realtime and persistent event streams seamlessly (§ 4.3).

• We discuss SCEP query processing over persistent event streams (§ 4.4.1) and across stream boundaries (§ 4.4.2).
Approaches including naïve event replay, plain query rewriting and their hybrid are compared for low processing latency and resiliency in the presence of temporal gaps in end-to-end event streams.

• We implement the extended SCEP query model and processing techniques within SCEPter (§ 4.5) for quantitative evaluation.

4.1 Problem Motivation

Resilient Complex Event Processing over end-to-end event streams is also motivated and evaluated with the Dynamic Demand Response (D²R) application in the Smart Grid CPS domain, as part of the US Department of Energy sponsored Los Angeles Smart Grid project [99]. We illustrate motivating scenarios in the University of Southern California's campus Micro Grid testbed. Firstly, the operational needs of demand response may not tolerate failures in the CEP system that permanently miss patterns due to runtime hardware or software faults. Secondly, given the exploratory needs of this emerging domain, not all query patterns may have been defined a priori, before the events of interest happen (lazy-definition). On the other hand, events generated in the Smart Grid are often archived for regulatory compliance [82] and data mining [36]. This provides an opportunity to enhance the robustness and flexibility of the CEP system if it can seamlessly perform the same query across both incoming events and those that have been archived. This introduces problems in translating a temporal pattern definition into queries that operate both exclusively on, and spanning, online streams and static storage, and in executing them efficiently to mitigate recovery time from faults.

To facilitate later discussion, consider the following use case, extended from the SCEP scenario described in Section 3.3: a facility manager in the Micro Grid needs to curtail load in response to a utility request at noon. The manager first considers office rooms where the airflow of the HVAC unit exceeds 500 cfm, but wants this query to be evaluated from 9AM this morning.
She then reduces the fan speed of these HVAC units to limit the load. Here, besides the semantic abstractions, detecting historical occurrences of the pattern (since 9AM) allows rooms that have been overcooled for an extended period to be identified for curtailment.

4.2 Approach Overview

Figure 4.1 depicts the general scenario of SCEP querying across realtime and persistent events. Events streaming from data sources are reliably forked and passed both to an event database for durable storage and to a SCEP realtime query engine, often running on different machines for robustness. Event query patterns are uniformly defined by users over past, present and future events using domain-level predicates and concepts. The realtime query engine and the database cooperatively detect patterns spanning the end-to-end event streams and return results to users.

[Figure 4.1: Integrated Query Processing over End-to-End Event Streams]

Notably, integrated querying over the end-to-end event streams has to be cognizant of the time boundary between realtime and archived events. The results must be consistent, i.e., as if they were all performed on realtime events, with ordering preserved and no missing or duplicate matches. This is non-trivial when it cannot be guaranteed that an event is present exactly once in either the event stream or the database. A temporal gap (negative) or overlap (positive) at the point where the event stream is forked for archiving (Figure 4.1) can cause missing or duplicate results.

4.3 Event and Query Model

We leverage the same event model and extend the SCEP query model described in Section 3.3 to support queries that operate across realtime and persistent event streams. The general query structure is:

SCEP Query ∶= PREFIX <ontology name spaces> .
SELECT <output event stream definition> .
FROM <input event stream definition> .
WHERE [Semantic subquery]* | [CEP subquery]?
WITHIN <query boundary> .

Specifically, the WITHIN clause specifies the lower and upper temporal bounds of the events of interest, which may include historical events. For example, the default boundary for SCEP queries that match patterns over only present and future events is:

WITHIN [now, )

while a query range that starts at a time in the past and spans all future events may be given by:

WITHIN [2012-05-07T09:00, )

When the query range overlaps with a time in the past, the query needs to be executed over both realtime and persistent event streams; otherwise it is executed only over realtime event streams.

4.4 Processing Model

In this section, we first discuss SCEP query processing over persistent event archives and then describe approaches to evaluate queries across the boundaries of realtime and persistent events.

4.4.1 Archive Query Processing

We propose a hybrid solution that marries the best of the realtime query engine and the database for SCEP query processing over event archives. Here, we assume that events arriving from streams have been persisted to a database that can be queried on demand. This forms the foundation of SCEP query processing over end-to-end event streams, discussed later.

4.4.1.1 Naïve Event Replay

Naïve event replay is a simple and direct approach where all events that occurred in the past and fall within the time range defined by the WITHIN clause are extracted from the database and then replayed to the realtime query engine as a precursor of the realtime events. This requires minimal query processing capability from the archive database and leaves the realtime engine to actually perform the event pattern query as previously discussed. This model needs limited additional tooling to implement and is used by some existing CEP systems like Oracle Complex Event Processing [12].
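Naïve replay can be sketched as below. The `archive_query(lower_bound)` and `engine.process(event)` interfaces are hypothetical stand-ins for the archive database fetch and the realtime engine's per-event evaluation.

```python
def naive_replay(archive_query, lower_bound, live_stream, engine):
    """Evaluate a query whose WITHIN bound starts in the past by first
    replaying archived events into the realtime engine as a precursor
    of the live stream, then continuing with the realtime events."""
    matches = []
    for event in archive_query(lower_bound):   # precursor: archived events
        matches.extend(engine.process(event))
    for event in live_stream:                  # then the realtime stream
        matches.extend(engine.process(event))
    return matches
```

Because archived and live events pass through the same engine, match semantics are identical across the boundary; the cost is that every archived event is evaluated one at a time.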
However, the performance of naïve event replay falls short when SCEP queries operate over longer event histories, which forces more events to be materialized. The performance, as measured by the latency and throughput of matched patterns, depends on the ability of the realtime SCEP engine to manage a burst of archived events. While CEP systems can handle millions of events per second for syntactic queries, semantic CEP query processing in realtime is much more expensive. This may prove infeasible when the historical time range of the SCEP queries is large.

4.4.1.2 Plain Query Rewriting

An alternative approach is to push the SCEP queries to the archive database. The use of Semantic Web ontologies requires that the event database be an RDF store and support SPARQL. Transforming CEP query expressions into native SPARQL expressions is non-trivial. However, our CEP subquery model, which leverages unit operators, facilitates rule-based mappings from CEP subquery clauses to SPARQL property paths that can be evaluated on the database. Example SCEP rewriting rules for an RDF event store are described below.

The PREFIX namespaces and PATH semantic subqueries of a SCEP query already conform to SPARQL syntax and need no further transformation. We have the following NATIVE rule:

Rule 4.1 (NATIVE) Namespaces and all property paths of semantic subqueries are retained unchanged in the target SPARQL query.

Rewriting declarations such as the SELECT output event stream definition, FROM input event stream definition and WITHIN query boundary clauses using SPARQL predicates is also straightforward:

Rule 4.2 (SELECT) A SELECT clause in a SCEP query maps to a SPARQL SELECT clause with the corresponding property variables and aggregation functions, as well as property paths which lead event variables to the selected properties in the ontology.
Rule 4.3 (FROM) The input events definition maps to a set of property paths in the target SPARQL query which lead event variables to the corresponding source streams captured in the ontology.

Rule 4.4 (WITHIN) A query boundary declaration maps to property paths that link event variables with their timestamp ontology properties and have SPARQL filters specifying the lower and upper bounds.

A CEP subquery may consist of multiple FILTER clauses for non-correlation constraints, each of which can be written in SPARQL as follows:

Rule 4.5 (FILTER) A CEP subquery FILTER clause maps to a property path in the target SPARQL query which specifies a SPARQL filter to evaluate the same constraint on the mapped ontology property.

There exist various correlation operators, including JOIN, SEQ, WINDOW and so on, which need specific rewriting rules. In particular, to rewrite CEP JOIN clauses that specify value-based correlation constraints we have:

Rule 4.6 (JOIN) A JOIN clause in a CEP subquery maps to a set of property paths which lead the event variables to their corresponding ontology properties and have one SPARQL filter to specify the correlation between them.

Time-based correlation operators have rules that map their constraints to the semantic event time properties stored in the archive. For example, we have the following rule to transform SEQ expressions into SPARQL triple patterns:

Rule 4.7 (SEQ) A SEQ clause maps to a set of property paths which lead the event variables to their timestamp properties and have SPARQL filters to compare the timestamps for sequence ordering.

Different types of WINDOW constraints map to different SPARQL property path structures. Specifically, a time window that correlates multiple events can be interpreted as timestamp comparisons between events. Let ?e_first be the first event and ?e_last be the last event in the window, and w be the window width.
A sliding time window is enforced using the following rule:

Rule 4.8 (WINDOW) A multi-variant time WINDOW maps to property paths that use a SPARQL filter to evaluate the condition ?e_last.timestamp − ?e_first.timestamp ≤ w.

A query rewriting example, which leverages the above rules to rewrite the CEP subquery of Query 3.3 and Query 3.7 in Section 3.3, is shown in Figure 4.2. Notably, specific rewriting rules need to be defined for new operators supported by CEP subqueries, based on their definitions.

[Figure 4.2: Rule-based SCEP to SPARQL Query Rewriting. Each CEP clause of Query 3.3, e.g., FILTER (?e1.flowrate > 500), JOIN (?e2.sensorID = ?e1.sensorID), SEQ (?e1, ?e2) and WINDOW (?e1, ?e2, 5min), is mapped by Rules 4.5–4.8 to SPARQL property paths and filters over evt:hasValue, evt:hasSource and evt:hasTime.]

4.4.1.3 Hybrid Rewriting and Replay

Plain query rewriting and naïve event replay process different SCEP query clauses with variable efficiency. In particular, query rewriting may benefit from batch evaluation of SCEP queries in the database, since each target SPARQL query is executed just once on the past events, rather than once per event. However, certain CEP subquery clauses, like correlation constraints within moving windows, can introduce severe overheads when executed on a batch. This is due to repeated and unnecessary join operations that consider all events in the batch rather than events in dynamic windows. Some of these clauses can actually be executed more efficiently by the realtime query engine with event replay.

We partition query clauses that can be more efficiently evaluated by either the realtime query engine or the event store, based on the query clause types described in Section 3.3. In general, we would like non-correlation constraints that evaluate events uniformly, such as semantic subqueries and CEP FILTER constraints, to be processed in the database. Specifically, semantic subquery evaluation has a high static overhead caused by inferencing over the knowledge base. This can be mitigated when the subqueries are evaluated only once, in batch, for all historical events. On the other hand, value- and time-based correlation operators should be executed in the realtime query engine. For example,
We partition query clauses between the realtime query engine and the event store based on which can evaluate them more efficiently, using the query clause types described in Section 3.3. In general, we would like non-correlation constraints that evaluate events uniformly, such as semantic subqueries and CEP FILTER constraints, to be processed in the database. Specifically, semantic subquery evaluation has a high static overhead caused by inferencing over the knowledge base. This can be mitigated when the subqueries are evaluated only once for all history events in batch. On the other hand, value- and time-based correlation operators should be executed in the realtime query engine. For example, after rewriting, SEQ constraints can be very expensive when evaluated on the history dataset due to excessive and unnecessary join operations over the time property that consider all events in the batch.

Given that the different query engines offer variable performance benefits for the SCEP query clauses, we can use a hybrid model that leverages this arbitrage. Here, we perform partial query rewrites, enforcing only some rules, and use the resulting events as a stream on which the SCEP queries are further applied. Specifically, we rewrite all SCEP query clauses, except for those that perform correlation, into SPARQL. The partial SPARQL query is executed (efficiently) by the database to materialize a pre-filtered historical event stream that is replayed and evaluated efficiently by just the CEP kernel (without semantics) to complete the SCEP pattern match. This hybrid approach executes costly event correlations within moving windows in the realtime query engine rather than over the entire event dataset in the database, as done by the plain rewriting. This also has benefits over the naïve replay.
Firstly, the SPARQL pre-filtering produces fewer events for replay; secondly, and more importantly, the expensive semantic subqueries are processed in a batch in the database rather than per event in the replay.

4.4.2 Integrated Query Processing

Figure 4.3: End-to-End Event Stream Configurations: (a) Zero Gap Stream, (b) Negative Gap Stream, (c) Positive Gap Stream. The X axis shows time. Top and bottom dots are events available to the realtime and database engines. Vertical dotted lines are the realtime and archive event stream boundaries.

The SCEP query processing models operate over realtime and archived event streams independently, using the same query model. Another consideration in performing seamless queries over end-to-end streams is planning these independent queries over realtime and persistent events to provide results in order, without duplicate or missing matches. In particular, we focus on queries whose time range spans across the boundary between realtime and persisted events, and when a temporal gap exists between them.

Figure 4.3 shows possible end-to-end event stream configurations based on the boundaries of events available to the realtime and the archive query engines. Without loss of generality, consider a SCEP query Q which operates on a logical event stream F. S is the realtime stream, which is a sub stream of F at time t_0^S, while D represents the persisted stream, also a sub stream of F. Let the timestamp of the latest event available in D at t_0^S be t_0^D; the first event observed by the realtime engine after t_0^S be e_0, the second event be e_1, and so on; the events before e_0 are e_{-1}, e_{-2}, ....
Due to the way stream F is forked to D and S, it is possible that (1) S ⊕ D = F, (2) S ∩ D ≠ ∅, or (3) S ∪ D ⊊ F, these cases being mutually exclusive. They refer to zero, negative and positive gaps as shown in Figure 4.3, and the objective is to compute a consistent query result R under these conditions.

4.4.2.1 Query Plan for Zero Gap Streams

Figure 4.4: Integrated Query Plans over End-to-End Event Streams: (a) Zero Gap Scenario, (b) Negative Gap Scenario, (c) Failure Recovery Scenario.

In an ideal zero time gap situation, events passing through the realtime query engine reach the event store immediately, i.e., every event is visible either to the realtime query engine or to the database. Integrated query plans for different SCEP queries on such streams are discussed first.

Consider an SCEP query Q without a WINDOW clause (e.g., Query 3.4 in Section 3.3). At time t_0^S we apply Q simultaneously to the realtime stream S and the persisted stream D. Let the subset of patterns detected on S be R_S and the patterns detected on D be R_D. The integrated query result is simply R = R_S ∪ R_D, and it guarantees no duplicate or missing patterns.

For an SCEP query Q with a WINDOW clause W_Q of length W_Q^length, applying Q on S and D retrieves patterns R_S and R_D which only contain events observed on S or D respectively. However, valid patterns that require component events from both the realtime and persisted streams are missed in R_S ∪ R_D since the time window can span across them.
Let the missing pattern set be R_boundary, and define a boundary window (Figure 4.4(a)) on the D and S sides respectively as:

W_boundary^D = (t_0^S − W_Q^length, t_0^S)    W_boundary^S = [t_0^S, t_0^S + W_Q^length)

Let r_boundary ∈ R_boundary be one missing pattern, C = {c_i | i = 0, ..., n; n > 0} be the component events of r_boundary in time order, and c_i^timestamp be the timestamp of event c_i. To be missing, pattern r_boundary must satisfy these necessary and sufficient constraints:

c_0^timestamp ∈ W_boundary^D and c_n^timestamp ∈ W_boundary^S

We hence have the following query plan. Firstly, query Q is extended to Q′ by adding query clauses that represent the above temporal constraints over events c_0 and c_n. From time t_0^S, when query Q is applied on S, our system waits for a time period of W_Q^length to ensure all events in the boundary window W_boundary^S have reached the database. It then lazily executes Q′ over D to retrieve R_boundary. The integrated query results are given by R = R_S ∪ R_D ∪ R_boundary, ordered by the timestamp of the last component event in the query pattern.

4.4.2.2 Query Plan for Non-Zero Gap Streams

When streams have a negative gap between realtime and archived events (Figure 4.3(b)), some events observed by the realtime query engine after t_0^S were already available in the database by that time. This can lead to duplicate matching patterns being present in R = R_S ∪ R_D. We handle this by adding a filter clause to the archive query to only consider events with timestamps less than t_0^S (Figure 4.4(b)), but otherwise use the same plan as for zero gap.

With a positive gap between events in the realtime and persisted streams (Figure 4.3(c)), we have an in-flight event set M = {e_i | i = −j+1, ..., −1} of events that were previously seen by the realtime query engine but are not yet in the database at time t_0^S. When applying query Q at t_0^S, these events are neither visible to the realtime query engine nor to the database, causing patterns to be missed.
We have the following approach to detect the missing patterns. For queries without a WINDOW clause, at time t_0^S we execute Q on stream S to detect the pattern set R_S. The system waits for a time period t_0^S − t_0^D and lazily executes Q on the database to get a result set R′_D. The integrated result is given by R = R_S ∪ R′_D. For queries with WINDOW clauses, the query plan is modified similarly to the zero gap case to consider missed patterns in windows that span the boundary between realtime and archived events.

4.4.2.3 Query Plan for Fail-Fast Scenario

One of the objectives for SCEP queries over past and future events is to adapt to fail-fast conditions. SCEP systems can be subject to hardware or software failures which, if unaddressed, can lead to false pattern detections. SCEP systems are often used in mission-critical realtime applications that cannot tolerate such faults and require high availability, both by mitigating failures and by providing recovery mechanisms.

As shown in Figure 4.4(c), the SCEP failure recovery problem can be reduced to the end-to-end event stream query problem. Events passed through a realtime stream are archived in a durable database. In this setting, we consider two boundaries between the realtime and archived events: t_failure, the time when the realtime system fails, and t_recover, the time when the system is back online. There exist three event segments as events pass through: event set S_failure, which is not observed by the realtime query engine during the downtime; S_pre-failure, which is processed by the realtime query engine before the failure; and S_recover, which is processed by the engine after the system resumes. Obviously, S_failure can be retrieved from the archive, and there may exist a positive, negative or zero gap at the boundary time points t_failure and t_recover.
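The segmentation around a failure can be sketched directly from the two boundary timestamps. The sketch below is a minimal illustration, not the SCEP implementation; events are modelled simply as timestamps, and the half-open boundary handling is an assumption of the sketch.

```python
def segment_stream(event_times, t_failure, t_recover):
    """Split a time-ordered stream into the segments of Figure 4.4(c):
    S_pre-failure (processed before the crash), S_failure (unobserved
    during downtime, retrievable from the archive) and S_recover
    (processed after the restart)."""
    pre = [t for t in event_times if t < t_failure]
    lost = [t for t in event_times if t_failure <= t < t_recover]
    post = [t for t in event_times if t >= t_recover]
    return pre, lost, post
```

For example, with a failure at time 4 and recovery at time 8, events at times 5 and 7 fall into the archived S_failure segment and must be recovered from the database.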
Given a query Q submitted at t_0^S, the objective of failure recovery is to reconstruct the result set R, which is the expected query result of Q if the system had not failed. Let the patterns detected by the realtime query engine on S_pre-failure be R_pre-failure, the patterns detected on S_recover be R_recover, and the patterns retrieved from the archive S_failure be R_failure. In the simplest case, Q is a query without a WINDOW clause and there is a zero gap between the realtime and archive data at t_failure and t_recover. The reconstructed query result is R = R_pre-failure ∪ R_recover ∪ R_failure. If Q has a WINDOW clause, let R_boundary^failure and R_boundary^recover be the missing patterns at the boundaries of t_failure and t_recover, respectively. The complete pattern set is then R = R_pre-failure ∪ R_recover ∪ R_failure ∪ R_boundary^failure ∪ R_boundary^recover, where R_boundary^failure and R_boundary^recover can be computed using the approaches discussed in the previous sections.

Figure 4.5: SCEPter Architecture Overview. SCEP queries are performed seamlessly over realtime and persistent event streams.

4.5 Implementation

SCEPter is our implementation of a semantic complex event processing system over end-to-end event streams that incorporates the event and query models we have introduced and validates the processing techniques we have discussed. Specifically, SCEPter extends the realtime SCEP query engine described in Section 3.5 with the semantic archive subsystem and the integrated query planner, as shown in Figure 4.5.
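The archive subsystem described next exposes events through a SPARQL REST service. As a hedged illustration of such an interaction, the sketch below builds (but does not send) a query request following the SPARQL 1.1 Protocol's form-encoded `query` parameter; the endpoint URL and the example query are placeholder assumptions, not SCEPter's actual configuration.

```python
# Hedged sketch: construct a SPARQL-over-HTTP POST request for an RDF
# store. The form-encoded `query` parameter follows the SPARQL 1.1
# Protocol; the endpoint URL below is an assumed placeholder.
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, sparql):
    data = urlencode({"query": sparql}).encode("utf-8")
    return Request(endpoint, data=data, headers={
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json"})

req = build_sparql_request("http://localhost:8000/sparql/",
                           "SELECT ?e WHERE { ?e ?p ?o } LIMIT 10")
```

Sending the request (e.g., via `urllib.request.urlopen`) would return JSON-formatted SPARQL results that the archive query manager can parse into replayable events.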
Semantic Archive Subsystem: The archive subsystem persists a fork of the incoming event streams to a semantic database and manages SCEP queries over them. We use the 4Store RDF database as our repository due to its scalability [67]. 4Store offers a SPARQL REST service for event insertion and querying. The archive query manager creates partial SPARQL queries from registered SCEP queries using the rewriting rules discussed in Section 4.4.1.2 and implements hybrid SCEP query evaluation over event archives as described in Section 4.4.1.3.

Integrated Query Planner: The query planner coordinates the realtime query engine and the archive subsystem to retrieve integrated query results over end-to-end streams based on query range specifications and stream configurations. When a SCEP query is registered with SCEPter, the archive query is generated and executed immediately or lazily, depending on the gap width between realtime and archived events. The gap width is currently set statically, but is configurable. The planner buffers and combines the independent results into consistent, ordered matches on the output stream.

4.6 Related Work

We discussed a Complex Event Processing based approach for integrated query processing over realtime data streams and persistent data stores. Relevant work includes systems that leverage databases to manage and query dynamic data. Realtime databases, for example, iteratively process transactions over constantly changing data with constraints on their required completion time. Techniques like scheduling, buffer and cache management [70, 69] are used to manage the temporal consistency and deadlines of query results. Active databases are another extension to process time-varying data. ECA rules and trigger mechanisms [115] were defined to support standing queries, with efforts to optimize such continuous queries.
For example, Tapestry [103] converts a standing query in an active database into an incremental query that finds new matches to the original query as data is added to the database. However, even traditional CEP query models are more expressive than schedules and triggers in terms of specifying temporal constraints. Also, executing window-based correlations in databases performs poorly, with long query latencies.

Processing continuous queries across realtime and history data, correlated using time windows, has also been recognized in recent CEP research. DataCell [78] exploits relational models specifically for stream processing. Incoming data tuples are cached into baskets, i.e., in-memory tables, queried in batch and flushed from these temporary tables to the underlying database. The basket concept resembles the window operator in CEP queries. This approach potentially allows unified querying on streams that also includes historical data, but is distinct from our SCEP model, where we treat the database as a logical extension of the stream back in time rather than as a static data source to perform a join. Pattern correlation queries (PCQ) [87] define the semantics of a recency-based CEP model over live and archived event streams. The recency clause in PCQ is essentially a happened-before relation which specifies the temporal distance between patterns in live streams and in archived streams. It focuses on correlating patterns that are present either in realtime data streams or in history data. Our system, however, considers a superset of the problem, processing uniform queries across stream boundaries. We hence also analyze the impact of potential temporal gaps that exist between realtime and persistent streams.

4.7 Conclusions

In this chapter, we extend the Semantic Complex Event Processing framework to support resilient query processing over end-to-end event streams from network to storage.
This allows users to specify unified query patterns for data that span the past (in persistent archives), the present and the future (in high-Velocity streams). We demonstrate the value of this model for lazy-definition and fail-fast in Smart Grid D²R use cases. We describe alternative approaches, including naïve event replay, plain query rewriting, and hybrid rewriting and replay, to perform SCEP queries over persistent event streams. We also discuss integrated query planning with respect to temporal gaps that may exist across realtime and persistent stream boundaries. These approaches are implemented in the SCEPter prototype system. We performed experiments to validate our design and the system performance, as summarized in the quantitative evaluation chapter (§ 6).

Chapter 5
Stateful Complex Event Processing

Besides dealing with data Variety and Velocity, another challenge for Complex Event Processing is to support on-demand queries over Fast Data Volume [125]. Existing data management systems often adopt a multi-temperature architecture. Data that is frequently accessed is kept on fast storage (hot data, in-memory cache), less-frequently accessed data is placed on slower storage (warm data, SSD/spinning disk), and infrequently accessed data is stored on the slowest storage (cold data, tapes). Data streams are hot data that require on-the-fly analysis and management. Online queries over data streams, which are supported by existing CEP systems, represent static or persistent user interests that are evaluated continuously to trigger realtime actions. On the other hand, on-demand queries, which are supported in traditional databases [40], represent user interests that need to be answered in an ad-hoc manner, usually using the stored history volume. One key aspect of Fast Data is its transient nature or dynamic data life-cycle, i.e., data arrives at a high rate and the analytic value of data also fades away at a high rate.
This characteristic offers the opportunity for on-demand querying of Fast Data with in-flight volume management, which obviates the need to persistently store everything and the overhead of interacting with a persistent store. For instance, consider an online retail scenario where large quantities of web orders are placed and processed concurrently. Shopping operation events such as order creation, payment processing, email notification and so on are correlated for online Service Level Agreement (SLA) monitoring and on-demand status checking. Such operation events, however, may no longer require realtime access and analysis after the associated orders have been fulfilled.

In this chapter, we propose Stateful Complex Event Processing to support Hybrid Online and On-demand (H2O) queries over Fast Data. H2O is motivated by applications in diverse domains ranging from e-commerce to Smart Grid. It inherits CEP systems' capability for efficient online query processing, while leveraging online query states as a means of dynamic view materialization for in-flight on-demand query evaluation. Specifically, our main contributions are:

• Introducing a formal query algebra that generalizes the basic query operations over Fast Data, and captures the statefulness and subsumption properties of queries (§ 5.3).

• Reforming the SCEP query structure described in Section 3.3 for unified online and on-demand query specification over data streams, and introducing a hierarchical query paradigm illustrated using real-world examples (§ 5.4).

• Describing processing approaches for the proposed query paradigm (§ 5.5). A prototype implementation is provided (§ 5.6) and experimental evaluations are reported (§ 6).

5.1 Problem Motivation

The motivation behind the hybrid online and on-demand query support for Fast Data is providing users with both happened-after (online) and happened-before (on-demand) forms of situation awareness in realtime applications.
Online queries detect instantaneous occurrences of pre-defined situations over data as it streams in, to drive timely actions. On-demand queries, on the other hand, discover post-defined situations from history data streams and sources, which offer data volume for analysis. We describe a few compelling scenarios from the e-commerce and energy industries which demonstrate the need for such hybrid query capabilities.

Online retail or digital shopping [109] plays a vital role in today's retail industry. It allows consumers to directly buy (physical or virtual) goods and services from vendors over the Internet. Consider a simplified online retail process which consists of a sequential set of tasks for the online retailer: creating a web order, checking inventory, processing delivery, processing payment, and sending email notifications. An order has to be fulfilled within a limited time window which may range from a few minutes to a few days. Online queries can be used to monitor the order's completion or report violations of the Service Level Agreement (SLA), in realtime. On the other hand, during the intermediate steps of processing an order, the customer (or rather, the order status web form) may issue on-demand queries to check the status of partially fulfilled orders, and vendors may need to aggregate orders at a specific state, say, the order created step, to determine subsequent actions required to meet their SLAs. As such, both these classes of online and on-demand queries operate over the same corpus of order status activities and events that are generated in realtime, as the order processing workflow proceeds.

Realtime advertising [26], considered the next breakthrough in mobile advertising, employs a publish-subscribe model to allow advertisers to gain instant access to their desired target audiences.
Advertisers can specify online queries to manage the life-cycle of published advertisements (ads) by automatically expiring them after certain conditions are met, say, a certain campaign time period has passed or a bid quota is reached. Similarly, users (or their apps) can also specify online queries to discover new ads that match their interests in realtime. On-demand query evaluation enters the picture when, say, users subscribe to topics of interest after relevant advertisements have already been published. Here, in order to maximize the impact of the realtime ad campaign and target the right demographic, the subscription will need to be matched against ads published in the recent past (and still active), rather than just future campaigns.

Dynamic Demand Response (D²R), as mentioned earlier, is a novel technique in Smart Grid for identifying opportunities for demand-side energy curtailment and efficiency by analyzing high volume readings from building and utility area sensors, such as smart meters and HVAC sensors. These readings are delivered over the network to perform realtime demand prediction and respond with customized curtailment suggestions to consumers [99]. In a campus Micro Grid, for example, online queries can correlate observations from power meters, occupancy sensors and class schedules to detect when energy is being wasted in unoccupied rooms, and pro-actively turn off unused lighting, ventilation and smart appliances [123]. However, not all events and curtailment patterns can be defined a priori. This requires correlation of partially matched online queries with ad-hoc events, on-demand, for operational intelligence.

The characteristics of the above applications can be generalized from both query and data perspectives. Firstly, they all require modeling persistent user interests as continuous online queries and ad-hoc interests as on-demand queries.
Secondly, data present in these applications have dynamic life-cycles within which realtime analytics are required.

5.2 Approach Overview

Figure 5.1 depicts our approach to a hierarchical online and on-demand query paradigm over Fast Data. Data that originate from different sources arrive continuously on input streams. Domain and data Varieties are captured as Semantic Web ontologies, which are referenced by users when defining queries and by the system when evaluating them. Users pre-define online queries to detect data patterns or compute aggregations, as and when the query is matched against the input data streams.

Figure 5.1: Hybrid Online and On-demand Querying over Data Streams

The in-progress partial matches of an online query comprise a materialized dynamic view of the input data streams. These views capture data patterns that may develop into matches of the online queries, which are considered data of interest. On-demand queries can be performed on top of these materialized views, rather than the entire history data set, to extract information of interest in an ad-hoc manner.

We propose to extend Complex Event Processing (CEP) to implement the above hybrid query paradigm. As described earlier, CEP is an advanced stream processing model designed to process online queries or event patterns represented using SQL-like query languages, with explicit support for temporal operators. In the previous chapter, we discussed integrating semantic domain knowledge with CEP to deal with data varieties. The challenge here is that traditional CEP systems do not manage history events for ad-hoc access. However, CEP systems, due to the nature of their evaluation of temporal queries, natively perform incremental evaluation of online queries.
As a result, they maintain intermediate query states, i.e., partial matches, that can be leveraged as a means of view materialization to enable on-demand query analysis.

It may also seem appealing to achieve hybrid querying over Fast Data using a database-based approach. Many existing database systems, both SQL-based and distributed NoSQL ones, support efficient on-demand queries with high I/O read performance. Also, realtime and active databases have proposed techniques to perform standing queries over dynamic data through schedules or triggers. But, compared to database-based solutions, the CEP-based approach we adopt offers several advantages: (1) Intuitive query specification. Unlike relational and NoSQL query models, CEP provides explicit temporal operators such as time window and event sequence to correlate time-ordered data, and is intuitive for both online and on-demand query specification on streams. (2) High performance online query processing. CEP systems adopt an in-memory processing model to perform online queries. Designated algorithms such as Non-deterministic Finite Automata (NFA) are used to achieve high query throughput. Systems such as Esper [4] and Siddhi [102] can process millions of events per second. Realtime and active databases can perform standing queries, but do not scale to such high transaction rates. (3) Dynamic view materialization. In our proposed hierarchical query paradigm, the data for executing on-demand queries is managed by high performance online queries as part of their matching states. Since the online queries have window boundaries, the duration of the intermediate states materialized and retained in memory is limited to the window length. This limits the history duration over which on-demand queries can be executed. In addition, it allows query-based rather than static schema-based view materialization. In return, diverse data and users may be supported.
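The third advantage can be illustrated with the online-retail scenario from Section 5.1: an online SLA query retains a partial match per in-progress order, and an on-demand status query reads that state directly instead of consulting a history store. The step names, window handling and class design below are assumptions of this sketch, not the SCEP implementation.

```python
# Sketch of partial matches as a materialized view: an online query tracks
# each order's progress through a step sequence within a time window, and
# an on-demand query is answered from the retained partial-match state.

STEPS = ["created", "inventory", "delivery", "payment", "notified"]

class OrderMonitor:
    def __init__(self, window):
        self.window = window
        self.partial = {}  # order_id -> (start_time, steps seen so far)

    def on_event(self, order_id, step, ts):
        """Online query: advance the partial match for this order."""
        start, seen = self.partial.setdefault(order_id, (ts, []))
        seen.append(step)
        if seen == STEPS:                # full pattern: order fulfilled
            del self.partial[order_id]
            return "fulfilled"
        if ts - start > self.window:     # window expired: SLA violation
            del self.partial[order_id]
            return "sla_violation"
        return None

    def orders_at(self, step):
        """On-demand query over the materialized partial matches."""
        return [oid for oid, (_, seen) in self.partial.items()
                if seen[-1] == step]
```

Once an order is fulfilled or violates its SLA, its state is discarded, so the in-memory view only spans the dynamic life-cycle of the data, as described above.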
5.3 Query Algebra

In this section we define a formal query algebra that generalizes the basic transform operations on data streams and discuss their properties, based on our prior work discussed in Chapter 3. In particular, the algebra generalizes the CEP operators and the semantic filter used in the SCEP query model. The algebra provides the foundation for specifying uniform online and on-demand queries, and their hierarchical processing over Fast Data.

5.3.1 Data Stream and Domain Context

We abstract Fast Data, in the context of our query processing, as being in two layers: the data streams arriving over the network, and prior metadata or static knowledge that captures the domain context. One main difference in query processing over data streams as opposed to databases is that data elements in a stream are observed and processed in temporal order rather than in batch. We define the atomic element of a data stream as a data tuple. A data tuple emanates from a data source and is transferred over the network, and contains a set of attributes, one among which is a designated temporal attribute used for ordering tuples within a stream. We reuse the key-value pair tuple representation introduced in Section 3.3.1:

Data Tuple := {⟨name, value⟩*, timestamp}

Based on the above definition, we formally define a data stream as follows:

Definition 5.1 A time-ordered set of data tuples S is said to be a data stream, or simply a stream. A stream S of tuples with attribute names {a_1, a_2, ...} is denoted as S(a_1, a_2, ...).
Figure 5.2: Semantic-enriched AirflowReport Data Tuple with Native and Foreign Attributes

For example, an airflow measurement data stream AF in Smart Grid may be denoted as AF(sensorID, flowrate, timestamp), and a digital shopping data stream DS may be denoted as DS(orderID, operation, timestamp).

We introduce a designated operator to compute a data stream from an arbitrary data set, or from a set of data sets. Let E = {e_i | i = 1, ..., n} be a set of data tuples with arbitrary order; applying the operator to E yields the data stream S that consists of all data tuples in E in time order, and the reverse operation relaxes the ordering constraint and returns the tuple set of S. Obviously, composing the two operations on E returns the original tuple set. Similarly, for a set of tuple sets {E_i | i = 1, ..., n}, applying the operator to the whole collection and then relaxing the order yields the union ⋃_{i=1}^{n} E_i. In addition, we denote by T(e_i) the temporal attribute of a data tuple e_i ∈ E, by T̂(E) the temporal attribute with the maximum (newest) timestamp value among the data tuples in E, and by Ť(E) the one with the minimum temporal attribute (oldest timestamp).

Domain context is another component of the data model. It is essentially the background knowledge we modeled using ontologies in Chapter 3. Specifically, domain context captures the domain concepts and entities associated with data tuples, either directly or indirectly. For example, in the Smart Grid domain, domain context may capture the variant terms for the attribute flowrate of airflow measurement events, such as airflow, airrate and so on. It may also capture domain entities that are not explicitly present in data streams, such as the sensor location, location type and location owner, as shown in Figure 5.2. We define the mappings from the tuple's attribute space to the domain context space as context mappings.
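A minimal sketch of the data tuple representation and the ordering/relaxing operator pair described above (the operator glyphs do not survive in this transcript, so plain function names are used; the dict encoding of tuples is an assumption of the sketch):

```python
# Sketch: a data tuple is a set of name/value pairs plus a designated
# timestamp attribute; the stream operator orders a tuple set by time,
# and its reverse relaxes the ordering back to a set of tuples.

def to_stream(tuple_set):
    """Order an arbitrary set of data tuples by the temporal attribute."""
    return sorted(tuple_set, key=lambda t: t["timestamp"])

def to_set(stream):
    """Relax the ordering constraint: return the tuple set of a stream."""
    return list(stream)

E = [{"sensorID": "D105VOL", "flowrate": 510.0, "timestamp": 5},
     {"sensorID": "D105VOL", "flowrate": 498.0, "timestamp": 2}]
S = to_stream(E)
```

Composing the two functions recovers the same tuples, mirroring the identity noted above; only the ordering differs between the set and stream views.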
To differentiate the attributes present in data tuples from domain context, we have the following definition:

Definition 5.2 The named attributes explicitly present in data tuples are said to be the native attributes of the tuple. The metadata in the domain context implicitly associated with the native attributes are said to be foreign attributes of the data tuple.

In the data graph shown in Figure 5.2, we see that native attributes are one-hop paths that link the data tuple to the corresponding attribute values, while foreign attributes are paths that contain a context mapping. Given P as one attribute path of a data tuple e, we denote by P(e) the corresponding attribute value. Based on this notion, we define the following comparison relations for data tuples:

Definition 5.3 Data tuple e is said to equal e′ with respect to attribute path P, denoted as e =_P e′, if P(e) = P(e′). Further, e is said to equal e′, denoted as e ≡ e′, if e =_P e′ for all P.

Figure 5.3 depicts some useful stream relations. In particular, a sub stream of a stream S contains an ordered subset of the tuples of S, while a child stream of S from time t_s to t_e is a special sub stream of S that contains all data tuples of S from t_s to t_e. In addition, we have S = S_{−∞}^{+∞}. In later discussions, we omit the symbols −∞ and +∞ in stream denotations. The formal definitions of the above stream relations are given below:

Figure 5.3: Stream Containment

Definition 5.4 A stream S′ is said to be contained by stream S if for every tuple e′ ∈ S′ we can find e ∈ S where e ≡ e′. S is said to be a super stream of S′, denoted as S ⊇ S′, and S′ is a sub stream of S, denoted as S′ ⊆ S.

Definition 5.5 A sub stream S′ of S that contains every e ∈ S where T(e) ≤ t_e and T(e) ≥ t_s is said to be a child stream of S, denoted as S′ = S_{t_s}^{t_e} and S_{t_s}^{t_e} < S. S is the parent stream of S_{t_s}^{t_e}, denoted as S > S_{t_s}^{t_e}.
In particular, S_t^{+∞} is said to be the upstream of S and S_{−∞}^{t} is said to be the downstream of S at time t.

5.3.2 Query Operations

Based on the above definitions, we generalize the unit query operations to transform Fast Data streams. In general, a stream query extracts sub streams from the original input data streams and projects them to an output data stream. Due to the similarities between the stream data model and the relational model, we borrow relational operations and reinterpret them in the context of stream processing. This approach is also used in [40, 49].

Projection (π). Similar to relational projection, a stream projection is a unary operation that can be written as π_{a_1,...,a_n}(S), which gives a result data stream obtained when all tuples in the original stream S are limited to the attribute subset {a_1,...,a_n}.

Rename (ρ). A rename operation is a unary operation that can be written as ρ_{a/b}(S), which gives a resultant data stream identical to S except that attribute b in all tuples of S is renamed to attribute a.

Selection (σ). A selection is a unary operation written as σ_φ(S), which gives a filtered data stream that only contains tuples in the original stream S for which φ holds. In particular, φ is a propositional formula over both native attributes and foreign attributes of data tuples.

Join (⋈). A join is a binary operation written as R ⋈ S, which gives a data stream that contains all combinations of tuples in the original streams R and S that fulfill the join predicate. The join predicate is a correlation constraint over the tuple attributes of streams R and S. One key difference between the stream algebra and relational algebra lies in the evaluation of Join operations, since it may deal with the temporal attribute and leverage the implicit ordering property of data tuples on streams.
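The stateless unit operations above can be sketched over Python generators, treating a stream as a time-ordered iterator of dict-shaped tuples. This is an illustrative sketch only; the attribute names and the sample tuples are hypothetical:

```python
from typing import Callable, Dict, Iterator, List

Tuple_ = Dict[str, object]  # a data tuple as attribute-name/value pairs
Stream = Iterator[Tuple_]   # a stream is a time-ordered tuple iterator

def project(attrs: List[str], s: Stream) -> Stream:
    """Projection: limit every tuple to the attribute subset."""
    for e in s:
        yield {a: e[a] for a in attrs}

def rename(a: str, b: str, s: Stream) -> Stream:
    """Rename: attribute b becomes attribute a in every tuple."""
    for e in s:
        e = dict(e)
        e[a] = e.pop(b)
        yield e

def select(phi: Callable[[Tuple_], bool], s: Stream) -> Stream:
    """Selection: keep only tuples for which phi holds."""
    return (e for e in s if phi(e))

# Usage on a hypothetical meter-reading stream
mr = iter([{"sensorId": "s1", "reading": 42.0, "timestamp": 1},
           {"sensorId": "s2", "reading": 7.5, "timestamp": 2}])
out = list(project(["sensorId"], select(lambda e: e["reading"] > 10, mr)))
print(out)  # [{'sensorId': 's1'}]
```

Because these operators decide each tuple in isolation, they never need to retain earlier tuples, which is exactly the statelessness property formalized later in Definition 5.10.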
In particular, we classify Join operations in stream processing as being temporal joins, written as R ⋈_T S, where the join predicate is based exclusively on the temporal attribute of data tuples, and otherwise as attribute joins, written as R ⋈_P S. Two common temporal join operations are the window join and the interval join, for which the join predicates are T(r) − T(s) ≤ W and T(r) − T(s) ≥ I respectively, where r ∈ R, s ∈ S, and W and I are time periods referred to as the window length and interval length. As we will discuss later, due to the ordering property of data streams, evaluations of certain join operations, such as the window join, for unobserved future data on streams are predictable at certain time points. This property is crucial for stream processing since it allows the incremental "forgetting" of history data that cannot match a join constraint during online query processing.

Various composite query clauses or constraints can be defined using the above unit operation set. In this chapter, we normalize a stream query as a set of query constraints composed by logical AND relations. Otherwise, the query can be decomposed into multiple independent queries in the conjunctive form.

Figure 5.4: Matches, Violation and State of Stream Queries

5.3.3 Query Properties

To facilitate a hierarchical query paradigm that supports both online and on-demand queries, we explore the properties and relations of queries defined using the above operations, especially from the temporal perspective.

5.3.3.1 Query Statefulness

Firstly, we consider the matching results of a single query constraint defined using one stream operation. Intuitively, a match can occur before, across or after a given time point.
As shown in Figure 5.4, we have the following notions for query matches with respect to time t:

Definition 5.6 A sub stream M of stream S is said to be a match of a constraint C on S, denoted as M ∈ C(S), if M fulfills C and for every e ∈ M, M − {e} does not fulfill C. Further, if M also satisfies T(e) ≤ t for all e ∈ M, then we say M is a developed match of C at t, denoted as M ∈ C(S, ←t). On the other hand, if T(e) ≥ t for all e ∈ M, then we say M is an undeveloped match of C at t, denoted as M ∈ C(S, t→). Otherwise M is said to be a developing match of C at t, denoted as M ∈ C(S, ↔t).

We have:

C(S) = C(S, ←t) ∪ C(S, t→) ∪ C(S, ↔t)    (5.1)

As the opposite of a match, we define the violation of a query constraint as follows:

Definition 5.7 Given a sub stream V and t = T̂(V), V is said to be a violation of constraint C on stream S at violation time t if for all streams S′ we have V ∉ C(ω(S^t ∪ S′_t)).

A violation indicates that at the violation time, in any arbitrary downstream of the original input streams, we cannot find a supplementary tuple set to make the violation a match of the constraint. Intuitively, an intermediate matching state of a constraint at a point in time should not be a violation but a sub stream that can potentially develop into a match later. Assuming we already knew the developing matches of a constraint at time t, we define the post state of the constraint at t as follows:

Definition 5.8 Let N be the upstream of M, where M ∈ C(S, ↔t) is a developing match at time t; N is said to be a post state of C on stream S at time t.

However, at a point in time the developing matches of a constraint may not be decided, since the downstreams are unknown. It is necessary to manage a temporary storage of all non-violations which may develop into matches given proper downstreams.
Hence, we have the following definition for the prior state or, simply, state:

Definition 5.9 A sub stream A is said to be a prior state, or state, of a constraint C on stream S at time t, denoted as A ∈ C(S, t), if A ∉ C(S), A is not a violation of C, and for every e ∉ A with e ∈ S^t we have that A ∪ {e} is a violation of C.

Based on the above definitions, we can classify the unit query operations defined previously as being either stateless or stateful:

Definition 5.10 A query operator is stateless if, given any stream S, for every state A of any constraint C defined using the operator we have A = ∅; otherwise the operator is said to be stateful. Constraints defined using stateless operators are said to be stateless constraints, and constraints defined using stateful operators are said to be stateful.

Among the unit query operations, Projection (π), Selection (σ) and Rename (ρ) are stateless, while Join (⋈) is stateful. We can prove this by contradiction. For example, for the unary operator Selection (σ), we have:

Proof 5.1 Suppose Selection (σ) is stateful. Based on Definition 5.10, there is a state A for some Selection constraint C at time t that satisfies A ≠ ∅. Since A is a non-violation of C, based on Definition 5.7, there exists a downstream of the original input stream from which we can find a supplementary tuple set to make A a match of C. The match size is greater than one, which contradicts the fact that σ is a unary operator.

Further, stateful operators can be classified based on how the state tuple sets are updated. For any new tuples arriving from downstreams, states of operations like the attribute join and interval join always advance or keep, i.e., the state tuple sets either have new component tuples added or remain unchanged. However, for operations like the window join, a new data tuple may prove an earlier state to be a violation that cannot be satisfied by future data tuples, using the ordering property of streams.
We hence have the following:

Definition 5.11 A stateful query operation is progressive if its evaluation does not invalidate established states; otherwise it is called regressive.

We can easily extend the above statefulness concepts from individual query operations and constraints to composite queries. Equation 5.1 becomes:

Q(S) = Q(S, ←t) ∪ Q(S, t→) ∪ Q(S, ↔t)    (5.2)

Obviously, a query match needs, and only needs, to satisfy all component constraints of the query according to the conjunctive form, and a query is stateful if and only if it contains stateful constraints. We can also derive the following theorem:

Theorem 5.1 Given query Q, stream S and time t, we have Q(S) − Q(S, ←t) ⊆ Q(S, t) ∪ S_t.

The theorem indicates that all unknown matches of a query Q on a stream S at time t (developing matches and undeveloped matches) can be derived from the query states at t and the tuples arriving on the stream after t. In other words, it is sufficient to persist only the current query states, rather than the entire observed history of data tuples, for online query evaluation. The proof of the theorem is as follows:

Proof 5.2 From Equation 5.2, we have Q(S) − Q(S, ←t) = Q(S, t→) ∪ Q(S, ↔t). Based on the definition of undeveloped match, we have Q(S, t→) ⊆ S_t. Hence we need and only need to prove Q(S, ↔t) ⊆ Q(S, t) ∪ S_t, i.e., for all M ∈ Q(S, ↔t) we have M − S_t ∈ Q(S, t). Suppose not. Assume, on the contrary, that there exists M ∈ Q(S, ↔t) with M − S_t ∉ Q(S, t). Obviously we have M ∈ Q(S). Denote M_t = M ∩ S_t and M^t = M − S_t, so that M = M^t ∪ M_t. Based on the definition of developing match, we have M^t ≠ ∅, M^t ∉ Q(S), and M^t is a non-violation of Q. From the supposition, we have M^t ∉ Q(S, t), i.e., M^t is not a state of Q at t. Based on the definition of state, M^t must then satisfy one of the following: M^t is a violation of Q, or there exists an event e with e ∉ M^t and e ∈ S^t such that M^t ∪ {e} is a non-violation of Q. The former case contradicts the assumption that M^t ∪ M_t = M is a match of Q.
In the latter case, we have that (M^t ∪ {e}) ∪ M_t = M ∪ {e} is a match of Q, which also contradicts M ∈ Q(S). Hence the supposition M − S_t ∉ Q(S, t) is false and the theorem is true.

Existing stream processing systems implement online query matching algorithms that leverage the property generalized by Theorem 5.1. The theorem enables incremental online query processing and also forms the foundation for dynamic state management for on-demand query evaluation over streams.

5.3.3.2 Query Subsumption

In addition to the statefulness property of individual queries, we study the relations between the matches and states of different queries, and introduce the subsumption (i.e., containment) relation. As before, we first consider an individual query constraint and define the following:

Definition 5.12 A query constraint C is said to subsume, or contain, another constraint C′ if for every stream S and M′ ∈ C′(S) we have that M′ is a state of C on S at time T̂(M′). C is said to be a super constraint of C′, denoted as C ⊇ C′, and C′ is said to be a sub constraint of C, denoted as C′ ⊆ C.

As is evident, a super constraint is more relaxed than its sub constraints. For example, σ(?w.price > 50) subsumes, or is a super constraint of, σ(?w.price > 100); σ(?a.category rdf:type auto:automobile) is a super constraint of σ(?a.category rdf:type auto:SUV); ⋈window(?m, ?n, 10m) is a super constraint of ⋈window(?m, ?n, 5m); and ⋈window(?w, ?d, ?p, ?e, 10m) is a super constraint of ⋈window(?w, ?d, 10m).

We extend the subsumption relation from query constraints to queries as follows:

Definition 5.13 A query Q is said to subsume (or contain) another query Q′ if for every stream S and M′ ∈ Q′(S) we have that M′ is a state of Q on S at time T̂(M′). Q is said to be a super query of Q′, denoted as Q ⊇ Q′, and Q′ to be a sub query of Q, denoted as Q′ ⊆ Q.

The subsumption relation is transitive and reflexive, and we have the following theorem:

Theorem 5.2 Given queries Q and Q′, we have Q′ ⊆ Q if for every C ∈ Q there exists C′ ∈ Q′ such that C′ ⊆ C.
The theorem indicates that query Q subsumes query Q′ if every constraint in Q has a sub constraint in Q′. In particular, if all non-temporal constraints in Q exist in Q′, we say Q temporally subsumes Q′. The proof of Theorem 5.2 is as follows:

Proof 5.3 Let {C_i | i = 1,...,n} be the constraint set of query Q, and let C′_i be the corresponding sub constraint of C_i in query Q′. Suppose not; then we have a match M′ ∈ Q′(S) that is a violation of Q at T̂(M′). Denote M′_i ∈ M′ as the match of constraint C′_i. Since Q is the conjunction of the constraints C_i, there must exist C′_k in {C′_i | i = 1,...,n} such that M′_k is a violation of C_k. This contradicts the assumption that C′_k is a sub constraint of C_k.

The subsumption or containment relation between multiple queries can be leveraged in a hierarchical query paradigm. In particular, on-demand queries that provide additional insight into the intermediate states of online queries may be subsumed by the online queries. The hierarchical query paradigm with subsumption query grouping allows a single super query to manage online states or views for the sub queries to be answered on-demand. We denote an on-demand query Q′ performed over an online query Q on stream S at time t as Q′(ω(Q(S, t))).

5.4 Query Model

5.4.1 Query Syntax

The semantic event model described in Chapter 3 conforms to the two-layer Fast Data definition. We reformulate the SCEP query model described in Section 3.3.2, based on the unit operations generalized in the query algebra, to facilitate both online and on-demand query specifications.
The query structure becomes:

H2O Query :=
  SELECT <output stream definition>
  FROM <input stream definition>
  [WINDOW <time window correlation>]*
  [INTERVAL <time interval correlation>]*
  [JOIN <non-temporal correlation>]*
  [FILTER <non-correlation constraint>]*

where the SELECT clause defines the output stream by projecting input tuple attributes to output tuple attributes, and the FILTER clause specifies constraints on individual data tuples and performs the selection operation. Finally, the JOIN, WINDOW and INTERVAL clauses define attribute, time window and interval based join operations across multiple data tuples. In the following, we describe the FILTER selection and the WINDOW and INTERVAL join clauses in more detail.

A FILTER clause can be used to define both syntactic and semantic constraints on individual data tuples. As described previously, the key-value pair tuple representation normalizes the structural format of data on streams. Data varieties are accommodated by allowing names and values, which are either literals or domain concepts, to have related variants, concepts and entities captured in the backend domain context. Since we model domain context using Semantic Web ontologies, FILTER expressions are allowed to embed SPARQL triple patterns to specify semantic constraints over the ontologies, as shown in the example queries later.

A WINDOW clause describes an ordered sequence of data tuples within a temporal range between the first and last tuples. For example, WINDOW(A, B, C, 10min) defines a sequence of tuples A, B and C with the temporal join predicate T(C) − T(A) < 10min. It also supports Kleene closure of the sequence. For example, WINDOW(A*, B, C, 10min) defines a sequence of tuples that can consist of any number of A tuples followed by B and ended by C within a 10min window.

An INTERVAL pattern describes an ordered sequence of data tuples at a specified time interval.
For example, INTERVAL(A, B, 10min) defines a sequence of data tuples where tuple B must be observed 10min after A was observed. Similarly, INTERVAL also supports Kleene closure. It is also worth noting that we can define more complicated temporal join relations using nested WINDOW and INTERVAL clauses. For example, we can have WINDOW(A*, INTERVAL(B, C, 3min), 10min).

We illustrate the various query constructs of the above model using examples from the Smart Grid, online retail and realtime advertising application domains. These use data tuples from the meter reading stream MR(sensorId, reading, timestamp), the digital shopping event stream DS(orderId, category, operation, timestamp) and the advertisement stream AD(itemId, category, zip, price, timestamp).

Query 5.1 Filtering. Notify when there is a second-hand automobile for sale within the zip code 90007 and under $5,000.

SELECT ?a
FROM (?a, AD)
FILTER (?a.category rdf:type auto:automobile)
FILTER (?a.zip geo:hasNeighbour geo:zip90007)
FILTER (?a.price < 5000)

Query 5.1 shows both semantic and syntactic selection constraints specified using FILTER expressions. In particular, the semantic constraints are represented by SPARQL triple patterns. The root of a SPARQL pattern, such as ?a.category, is a foreign attribute of the data tuples. The named semantic predicates and concepts are captured in the backend domain ontologies.

Query 5.2 Aggregation. Report the average power demand of office rooms every 10 minutes.

SELECT AVG(?m.reading)
FROM (?m, MR)
FILTER (?m.sensorId ee:hasLocation ?loc . ?loc rdf:type bd:Office)
WINDOW (?m*, 10min)

Query 5.2 shows the use of an aggregation function to project the output stream from a regressive (Definition 5.11) input data WINDOW. The WINDOW expression essentially performs a self temporal join over the input data stream.

Query 5.3 Join. Report orders that are fulfilled and confirmed with email notifications within 1 hour.
SELECT ?w
FROM (?w, DS)
FROM (?e, DS)
FILTER (?w.operation = orderCreation)
FILTER (?e.operation = emailNotice)
FILTER (?w.category rdf:type ds:digitalProduct)
JOIN (?w.orderId = ?e.orderId)
WINDOW (?w, ?e, 1hour)

Query 5.3 shows the combined use of a progressive attribute join constraint defined using JOIN and a regressive temporal join constraint defined using WINDOW. The query is also defined at a high level over ontologies to deal with the data and domain varieties associated with the data tuples. In particular, the digital shopping ontology with namespace ds: may capture domain concepts such as ds:ebook, ds:game and ds:movie as subclasses of ds:digitalProduct. Orders for all these categories satisfy the semantic FILTER constraint defined in the query.

Query 5.4 Sequence. Report all phone orders which are successfully fulfilled within 1 hour following a sequence of operations including web order creation (orderCreation), delivery, payment and email notification (emailNotice).

SELECT ?w
FROM (?w, DS)
FROM (?d, DS)
FROM (?p, DS)
FROM (?e, DS)
FILTER (?w.operation = orderCreation)
FILTER (?d.operation = delivery)
FILTER (?p.operation = payment)
FILTER (?e.operation = emailNotice)
FILTER (?w.category rdf:type ds:phone)
JOIN (?d.orderId = ?w.orderId)
JOIN (?p.orderId = ?w.orderId)
JOIN (?e.orderId = ?w.orderId)
WINDOW (?w, ?d, ?p, ?e, 1hour)

Query 5.4 is a more complicated example of a sequence query pattern defined using the H2O query model. The query correlates four types of data tuples from the DS stream to determine order completeness. Obviously, Query 5.3 can also be used to monitor order completeness. However, Query 5.4 captures each order processing step and gives additional insight into the process, which can be analyzed on-demand. In the next section, we use the above example queries to discuss the query statefulness and subsumption properties within a hierarchical query paradigm.
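The syntactic subsumption test of Theorem 5.2 can be sketched by representing each query as a set of (operator, subject, predicate, value) constraints. The relaxation rules shown here (a wider window subsumes a narrower one, a weaker threshold subsumes a stronger one) are simplified assumptions for illustration, not the system's full constraint logic:

```python
from typing import List, NamedTuple

class Constraint(NamedTuple):
    op: str       # "FILTER", "WINDOW", "JOIN", ...
    subject: str  # tuple attribute path, e.g. "?w.price"
    pred: str     # comparison predicate, e.g. ">" or "window"
    value: float  # literal bound (threshold, or window length in seconds)

def subsumes_constraint(c: Constraint, c2: Constraint) -> bool:
    """True if c is a super (more relaxed) constraint of c2."""
    if (c.op, c.subject, c.pred) != (c2.op, c2.subject, c2.pred):
        return False
    if c.pred == ">":        # (?x > 50) subsumes (?x > 100)
        return c.value <= c2.value
    if c.pred == "window":   # a 10min window subsumes a 5min window
        return c.value >= c2.value
    return c.value == c2.value

def subsumes(q: List[Constraint], q2: List[Constraint]) -> bool:
    """Theorem 5.2: Q subsumes Q' if every constraint in Q
    has a sub constraint in Q'."""
    return all(any(subsumes_constraint(c, c2) for c2 in q2) for c in q)

q_super = [Constraint("FILTER", "?w.price", ">", 50),
           Constraint("WINDOW", "?w,?e", "window", 600)]
q_sub = [Constraint("FILTER", "?w.price", ">", 100),
         Constraint("WINDOW", "?w,?e", "window", 300)]
print(subsumes(q_super, q_sub))  # True: the relaxed query subsumes the stricter one
```

A grouping component could use such a check to route an incoming on-demand query to an already-running online super query instead of starting a new one.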
5.4.2 Hierarchical Query Paradigm

We propose a hierarchical query paradigm for online and on-demand query processing over data streams. The query language described above provides a unified syntax to specify both online and on-demand queries. Different from traditional online queries that monitor input data streams continuously, on-demand queries need to be evaluated over volumes of history data in an exploratory manner. However, even when supporting on-demand queries, we still want to process data tuples on-the-fly and avoid interactions with a persistent data store where all history data is kept. Realtime data is transient, and the life-cycle of data tuples on streams is dynamic. Rather than persisting the entire data set observed on streams to support on-demand analysis, we propose to use online query states that continuously track the life-cycles of data tuples. The intermediate matching states of online queries then consist of live data. The hierarchical query paradigm essentially allows online queries to materialize dynamic views of data on streams, over which on-demand queries can be performed.

We use the example queries described in Section 5.4.1 to illustrate the semantics of hierarchical queries. Among these queries, Query 5.1 is a stateless query since it only consists of stateless constraints. According to the query state definition (Definition 5.10), when this query is processed online, its query states are always empty. On the other hand, Query 5.2 is stateful. When this query is evaluated online, history meter readings that fall within the 10 minutes before the current time are "persisted" temporarily for online query evaluation and form the matching states. An on-demand query submitted against the online query at a certain point in time essentially analyzes the data tuples that potentially match the online query. Consider the following on-demand query:

Query 5.5 Report the maximum power demand of office rooms in the last 5 minutes.
SELECT MAX(?m.reading)
FROM (?m, MR)
FILTER (?m.sensorId ee:hasLocation ?loc . ?loc rdf:type bd:Office)
WINDOW (?m*, 5min)

Based on Definition 5.13, Query 5.5 is a sub query of Query 5.2. The result of Query 5.5 at any point in time t is a subset of the state of Query 5.2 at t. It is hence sufficient to manage only the state of the online super query, rather than all history data tuples, to answer the on-demand sub query.

Consider another stateful query, Query 5.4. When it is evaluated online, the query states represent incomplete orders and consist of partially matched data tuple sequences. An on-demand sub query that evaluates part of the processing sequence, for example Query 5.6 shown below, may be performed over the online query to analyze the orders accumulated at a particular step at a point in time.

Query 5.6 Report all phone orders which are currently pending on payment processing in the last hour.

SELECT ?w
FROM (?w, DS)
FROM (?d, DS)
FILTER (?w.operation = orderCreation)
FILTER (?d.operation = delivery)
FILTER (?w.category rdf:type ds:phone)
JOIN (?d.orderId = ?w.orderId)
WINDOW (?w, ?d, 1hour)

Figure 5.5: Stateful Online Query Plan

5.5 Processing Model

In this section, we describe the approaches for processing H2O queries over data streams. Compared to traditional CEP systems, H2O performs high-throughput online queries and also manages intermediate query result states which on-demand queries can access. Compared to databases, H2O leverages online queries, rather than schema-based relational operations, to dynamically provision and update stream views for on-demand querying.

5.5.1 Online Query Processing

We leverage the Non-deterministic Finite Automata (NFA) algorithm for online query processing. As shown in Figure 5.5, online queries are compiled to stateful query plans modeled as state machines.
The state machine consists of state evaluation nodes and transition edges. Each state evaluation node represents a component tuple type in the sequence of data tuples specified in the query. For example, Figure 5.5 shows a sequence of data tuples (A, B, C*, D). A state transition is triggered when the current state evaluation is satisfied. The query constraints associated with an individual state node are compiled dynamically into a linear evaluation path of WINDOW, INTERVAL, JOIN and FILTER expressions.

In a state evaluation path, WINDOW expressions are evaluated first, since they process the common temporal attribute of all data tuples and are regressive. Evaluating a WINDOW constraint for an arbitrary data tuple from the input data stream may invalidate established states, or partial matches. For example, consider Query 5.2, which has a self-join window constraint WINDOW(A*, 10min). When the newest input data tuple is more than 10 minutes later than the earliest data tuple in a state, the state needs to be reformed by dropping the earliest data tuple, regardless of whether the new tuple is of type A or not. Similarly, for a constraint like WINDOW(A, B, 10min), when the newest input data tuple is more than 10 minutes later than tuple A in a query state, the state needs to be dropped regardless of other properties of the new data tuple. FILTER clauses are evaluated last, since they may require time and memory expensive computations such as processing SPARQL queries.

The state evaluation paths for different queries are managed in a single state evaluation graph. Query constraints defined using the various operators such as FILTER, WINDOW and JOIN are essentially triple patterns consisting of a tuple attribute as the subject, a comparison expression as the predicate and a literal value as the object. Shared evaluations are achieved by grouping the constraints based on subject and operator pairs.
For example, assume we have multiple queries evaluating the price attribute of digital shopping data tuples against different literal numbers with the greater-than comparison predicate. The evaluation of all these constraints can be performed with one binary search if the literal numbers are sorted when the evaluation paths are generated.

One main difference between our query model and traditional CEP systems is that our model allows the embedding of semantic FILTER expressions to deal with data and domain varieties. In the SCEP query processing model discussed in Chapter 3, we described data buffering and caching optimizations for processing online semantic queries. Here we extend these techniques to eliminate semantic query processing at runtime with compile-time semantic indexing.

Semantic indexing mitigates the overhead introduced by semantic FILTER constraints by indexing the result space of the semantic query patterns at compile time. The goal is to pre-process semantic FILTER expressions and sort their result spaces for fast runtime lookup. The tuple attribute indexed for a SPARQL query expression is the query root property (Definition 3.2), which is essentially the native tuple attribute mapped to the foreign attribute evaluated by the query expression. For example, in Figure 3.2, the query root property for SPARQL expressions that evaluate sensor location types is sensorID. When such semantic constraints are present in online queries, at compile time the system performs the semantic query over the domain ontologies and creates a sorted list of the sensorIDs of sensors from the particular location type. At runtime, instead of performing semantic inferencing over ontologies, the system simply performs a binary lookup in the index pertinent to the expression. It is worth noting that semantic indexing is a space-time tradeoff optimization that requires the system to store the entire result space of the semantic FILTERs.
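The compile-time index and its runtime lookup can be sketched as below. The function standing in for SPARQL evaluation and the sample sensor IDs are hypothetical placeholders for the real ontology store:

```python
import bisect

def build_semantic_index(run_sparql, pattern: str) -> list:
    """Compile time: evaluate the semantic FILTER pattern over the
    ontologies once and keep the sorted result space."""
    return sorted(run_sparql(pattern))

def semantic_match(index: list, key: str) -> bool:
    """Runtime: replace semantic inferencing with a binary lookup."""
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key

# Hypothetical stand-in for SPARQL evaluation over the domain ontologies:
# all sensors located in rooms of type bd:Office.
office_sensors = lambda pattern: {"sensor42", "sensor07", "sensor19"}
idx = build_semantic_index(office_sensors, "?loc rdf:type bd:Office")

print(semantic_match(idx, "sensor19"))  # True: tuple passes the FILTER
print(semantic_match(idx, "sensor99"))  # False: not an office sensor
```

The per-tuple cost drops from an ontology inference to an O(log n) lookup, at the price of storing the full result space, which is the space-time tradeoff noted above.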
In Section 6.3, we evaluate the performance benefits of semantic indexing compared to the baseline approach.

5.5.2 On-demand Query Processing

In the hierarchical query paradigm, on-demand queries are performed over the dynamic stream views that consist of online query states. Online queries hence also serve the purpose of in-flight data volume management. As discussed in Section 5.5.1, query states are updated dynamically when a query state transition occurs.

Applying on-demand queries over online query states is similar to querying materialized views in relational databases. The baseline approach to on-demand query processing is to iterate over the current state set of the target online query. One can improve on-demand sub query performance by indexing proper tuple attributes when states are updated. In particular, to answer temporal sub queries on-demand, we implement state indexing. The system maintains index maps whose keys are the state labels and whose values are the data tuples that reach the corresponding states. With the help of the state index, a mere index lookup is required to answer an on-demand query over its temporal super query. In Section 6.3, we compare the performance of state indexing and the baseline approach in terms of query latency.

5.6 Implementation

Figure 5.6 shows our implementation of a Stateful Complex Event Processing system to support H2O queries. We extend an existing CEP system to process semantic FILTERs and manage query states for on-demand analysis. The system consists of three main components: the domain ontology model, the query manager and the stream processing engine. It processes two types of input streams, the data stream and the query stream. In particular, we model online query Creation, Update and Deletion, as well as Retrieval (on-demand querying), as CRUD operation events on the query stream.
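State indexing can be sketched as a map from state labels to the partial matches currently parked at each NFA state; the order event and the state labels used here are hypothetical illustrations:

```python
from collections import defaultdict

class StateIndex:
    """Index map: state label -> partial matches currently at that state."""
    def __init__(self):
        self.index = defaultdict(list)

    def transition(self, match: dict, old_label, new_label: str):
        """Move a partial match between states as the NFA advances."""
        if old_label is not None:
            self.index[old_label].remove(match)
        self.index[new_label].append(match)

    def on_demand(self, label: str) -> list:
        """Answer a temporal sub query with a single index lookup."""
        return list(self.index[label])

idx = StateIndex()
order = {"orderId": "o1", "operation": "orderCreation"}
idx.transition(order, None, "awaiting_delivery")                # ?w matched
idx.transition(order, "awaiting_delivery", "awaiting_payment")  # ?d matched

# A Query 5.6 style lookup: orders currently pending payment
pending = idx.on_demand("awaiting_payment")
print(len(pending))  # 1
```

Because the index is maintained incrementally at each state transition, the on-demand lookup avoids iterating over the full state set of the online super query.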
Figure 5.6: Stateful Complex Event Processing Architecture

The domain ontology model, represented in the Semantic Web OWL ontology language, forms the knowledge base of domain concepts and relations to support semantic FILTER specification and evaluation. The query manager module performs operations upon the query events it receives. It generates and updates the query plans executed by the stream processing engine. The stream processing engine implements the core algorithms and data structures for online query processing, state table management and on-demand query answering as described in Section 5.5. Specifically, the online queries are compiled to state machines handled by the state manager. Query constraints, including stateless FILTER constraints and stateful JOIN, INTERVAL and WINDOW constraints, are dynamically updated in the state evaluation graph. The state index provides fast state lookups for on-demand queries as described in Section 5.5.2.

5.7 Related Work

Stateful CEP described in this chapter is rooted in the area of stream and Complex Event Processing [44, 21, 52, 65, 30, 4]. Stream processing systems such as InfoSphere [44] and Storm [21] often offer workflow-based computation models that allow data to "flow" through a network of processing tasks for analysis. Complex Event Processing (CEP) systems such as SASE [65, 55] and Esper [4] typically provide SQL-like query languages that employ explicit temporal operators for data pattern specification. These systems implement designated algorithms to match temporal patterns on streams. For example, Nondeterministic Finite Automata (NFA) and their variants are commonly used to detect data sequences.
Continuing work on stream and event processing systems has mostly focused on improving online query performance [41, 80]. For example, in [68], the author discusses mechanisms to parallelize single queries by partitioning state automata. Stormy [80] borrows ideas from distributed storage systems and uses consistent hashing to assign stream queries and route data tuples to distributed computing nodes. Our work in this chapter, motivated by Fast Data applications, does not focus solely on the Velocity aspect of data streams but also considers data Variety and especially data Volume. We generalize a formal query algebra that takes the Variety and Volume aspects of data streams into consideration.

We enable semantic selection in traditional CEP queries to deal with data Variety. The problem of processing diverse data streams was discussed in [61, 39, 38, 104]. Specifically, in [61], the authors discuss schema-mapping based approaches. Such point-to-point mappings, however, do not scale for application domains that feature fast-evolving infrastructure and multidisciplinary users, like Smart Grid. Recently, semantic model based approaches [39, 38] have been introduced. In these systems, semantic domain ontologies represented using RDF are used to provide a holistic view of the heterogeneous information space. In particular, C-SPARQL [39] directly extends SPARQL, the language designed to query static RDF data stores, with temporal window and aggregation clauses to support RDF stream processing. ETALIS [38], on the other hand, transforms SPARQL queries together with temporal patterns into Prolog rules. It leverages a rule-based inference engine for pattern detection. Our system is similar to the model based approaches in terms of leveraging semantic ontologies to capture data and domain varieties. However, we model semantic constraints as selection operations that can be executed by native CEP systems.
This allows leveraging all native CEP query constructs and algorithms with an added semantic filtering capability. Motivated by real-world applications, we additionally propose to support both online and on-demand queries over the dynamic Volume of streams without interactions with a persistent data store. On-demand queries are usually supported by databases that execute ad-hoc user queries over a batch of data. Database systems have made efforts to process online queries over time-varying data concurrently. Active databases [115, 103] and realtime databases [70, 69] leverage ECA rules or schedules to process queries over constantly changing data in tables. Stream databases such as [51, 77] offer powerful query languages with sliding windows and sequence operators. However, on-disk databases are not suitable for processing high-rate data streams due to the limitations of disk I/O performance. In-memory databases [63, 75, 73] primarily rely on main memory for data storage. These systems support much higher transaction rates than on-disk databases. However, existing systems mostly adopt schema-based tables for data storage and relational algebra for query specification. Compared to these systems, our CEP-based stream processing system uses a uniform temporal query model for online and on-demand query specification over streams. The hierarchical on-demand query paradigm allows query-based view materialization, so that volumes can be created dynamically and managed in-flight.

5.8 Conclusions

In this chapter, we described a Stateful Complex Event Processing framework that supports on-demand queries over the transient Volume of Fast Data. Our work was motivated by applications in diverse domains ranging from e-commerce to the Smart Grid. We introduced a formal query algebra to capture the stateful and subsumption properties of stream queries, which forms the foundation of a hierarchical online and on-demand query paradigm.
The proposed system inherits traditional CEP systems’ capability of efficiently processing online queries, while also leveraging online query states as a means of dynamic volume management for on-demand query evaluation. We have illustrated the value of this model using use cases from the Smart Grid and e-commerce domains. Further, we proposed approaches for translating the model into practice, along with optimizations to improve online and on-demand query performance. These have been implemented in the H2O system prototype to validate our approaches.

Chapter 6 Quantitative Evaluation

We quantitatively validate the performance benefits of the techniques that we have proposed and implemented within SCEPter, our Semantic Complex Event Processing framework for querying realtime and persistent data streams, and H2O, the Stateful Complex Event Processing system that supports both online and on-demand queries. We use example queries, real-world events and domain ontologies from the USC campus Micro Grid in these experiments. Raw event data is collected from HVAC systems in the Micro Grid over time to create a large corpus of events that are then used to simulate input event streams. This protects the operational Micro Grid applications during the experiments while providing a realistic environment. It also allows for the added ability to evaluate different data Velocities in a reproducible manner. The events in the streams have the same schema as shown in Figure 3.2. All experiments were performed three times and the average values are reported here.

6.1 Semantic Complex Event Processing

We run SCEPter on a 12-core 2.8GHz AMD Opteron server with 32 GB of physical memory, running Windows Server 2008 R2 and using 64-bit Java JDK v1.6. The 4Store database runs within a Linux virtual machine (due to OS dependency) on the same machine with exclusive access to 1 CPU core and 8 GB of RAM, and is accessed by SCEPter over the local network port.
In the first set of experiments, we evaluate the time performance of SCEPter’s realtime query processing engine using SCEP queries from Section 3.3. Specifically, we study the throughput of the system and the processing latency per event as the input event rate increases under the following settings: (1) processing queries with only CEP subqueries (i.e., Queries 3.1–3.3 in Section 3.3) as a benchmark; (2) processing full SCEP queries with both semantic and CEP subqueries (i.e., Queries 3.4–3.7 in Section 3.3) using the baseline approach; (3) processing full SCEP queries with only event buffering; (4) processing full SCEP queries with only query caching; and (5) processing full SCEP queries with both buffering and caching optimizations. We observed that the realtime performance is similar for queries that have the same semantic subqueries but different CEP subqueries, due to the pipelined processing architecture. So, for brevity, we only report results for Queries 3.1 and 3.4 here. Figure 6.1(a) shows the output throughput (events/sec) on the Y axis and the input event rate on the X axis for CEP Query 3.1 and full SCEP Query 3.4. Ideally, the output throughput should match the input event rate. This occurs only for the CEP query (solid black line), where the CEP kernel keeps up with input event rates up to the 100,000 events/sec that we tested. For full SCEP queries, the baseline approach has a low peak throughput of ∼80 events/sec (solid yellow). The performance of event buffering is subject to the input event rate, as we posit in Section 3.4.2.2. Increasing the buffer window size from 1 sec (solid blue triangle) to 2 sec (solid blue square) improves the peak throughput, but it still converges to a lower value as the input rate increases further, since the time to batch process the buffer eventually outstrips the input rate.
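The buffering and caching optimizations compared in these settings can be sketched as follows. This is a minimal, hypothetical illustration: in SCEPter the semantic filter is a SPARQL query against the ontology store, and the key is the event attribute that determines the semantic result; the placeholder predicate below stands in for that lookup.

```python
from functools import lru_cache

# Placeholder for an expensive semantic (ontology) lookup keyed by an event
# attribute; here even-numbered sensor ids "match". Purely illustrative.
def semantic_filter(sensor_id):
    return sensor_id % 2 == 0

# Caching: memoize semantic results per unique key, so repeated events from
# the same source skip the expensive lookup.
cached_filter = lru_cache(maxsize=1024)(semantic_filter)

# Buffering: batch events for a window, then evaluate the semantic filter
# once per unique key in the batch instead of once per event.
def process_buffer(events):                      # events: [(sensor_id, value), ...]
    unique_ids = {sid for sid, _ in events}
    allowed = {sid for sid in unique_ids if cached_filter(sid)}
    return [(sid, v) for sid, v in events if sid in allowed]
```

With three events but only two distinct sensors, the semantic lookup runs twice rather than three times; combining both optimizations compounds the savings.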
But the throughput benefit of caching is retained as the input rate increases, and the peak stays constant at 170 events/sec and 1,200 events/sec as the cache capacity (i.e., the fraction of unique key variants held in the cache) is varied from 5% (dashed green circle) to 25% (dashed green cross), respectively. Finally, combining buffering and caching compounds their individual performance benefits, as we see from the dotted orange lines.

Figure 6.1: SCEPter Realtime Query Performance. (a) Realtime Query Throughput; (b) Realtime Query Latency per Input Event. Note that semantic annotator and CEP kernel times are ∼0.

We drill down into the processing times spent within each module in SCEPter’s pipeline. Figure 6.1(b) shows the processing time of the semantic annotator, semantic filter and the CEP engine under different conditions. The first two columns show the latencies when only the cache optimization is enabled. In these cases, the latencies do not depend on the input event rate. When the query cache is disabled or missed, we observe that a majority of the time is spent in the semantic filter module, while the relative times for the annotator and CEP engine are negligible. This confirms that the semantic queries are the most expensive operation; with a cache hit, all three times are fractional. The last four columns show the latencies when only the buffer optimization is enabled.
When the input event rate increases, the number of events accumulated in a buffer window increases, causing the batch processing time and the per-event latency to increase. If the buffer can be processed within the buffer window duration, the system can keep up with the input stream. For example, with a 2-second buffer and a 1,000 events/second input rate, the per-event latency is less than 1 ms and it takes 2 seconds to process the 2,000 events received in the 2-second buffer window. But with a 2-second buffer and a 3,000 events/second input rate, the per-event latency is 1.7 ms and it takes 10.2 seconds to process the 6,000 events seen in the 2-second window. As a result, the query processing throughput falls below the input event rate.

6.2 Resilient Complex Event Processing

In the second set of experiments, we evaluate SCEPter’s performance for integrated querying over realtime and archived events. Specifically, we introduce a downtime in the input event stream passed to SCEPter, from which it recovers using integrated querying over the event archive. The comparison metrics used are initial recovery latency, the time to process the first archived event after the fault; catchup duration, the time to process all events archived during the downtime; and catchup throughput, the rate at which the archived events are processed. Once recovery is done, realtime stream processing resumes.
We compare these metrics for different strategies: query rewriting, naïve event replay, and hybrid rewriting and replay (hybrid), with stream downtime durations from 10 minutes to 115 minutes. We also examine the impact of the 4Store archive size on performance by using two different archive capacities that store the last 60 minutes and 120 minutes of input events, respectively.

Figure 6.2: SCEPter Integrated Query Performance in Fail-Fast Scenarios. (a) Initial Recovery Latency; (b) Catchup Duration; (c) Catchup Throughput.

Figures 6.2(a), 6.2(b) and 6.2(c) plot these three recovery metrics for Query 3.4 in Section 3.3, with 6,000 HVAC sensors each sending one event per minute to the input event stream. We see that naïve event replay takes the longest time to recover (orange dashed lines), taking around 15 seconds for initial recovery from a 10-minute stream failure over a 120-minute rolling archive and around 36 seconds to catch up. Here, the realtime SCEP processing time is higher than that of the database query, since the latter simply extracts raw events. We also see the impact of archive size, as the 60-minute archive takes only 9 seconds to recover, reflecting the limited ability to index semantic databases, which causes costly table scans.
Since the archive capacity limits the duration of downtime that can be recovered from, using a 60-minute archive prevents recovery from downtimes that exceed 60 minutes. The hybrid (green dotted lines) and plain rewriting (blue solid lines) approaches evaluate either part or all of their SCEP query expressions in the database, so their initial recovery times almost equal their catchup durations; the CEP engine does little to no work. The hybrid approach has a marginally lower catchup duration by leveraging the native benefits of both the CEP and semantic query engines. As the downtime increases, naïve replay gets progressively worse, since it extracts ever larger numbers of events from the database for processing by the realtime SCEP engine. This impact is much smaller in the hybrid and plain rewriting approaches, allowing them to scale better. Both the naïve replay and hybrid approaches process CEP subqueries using the realtime CEP engine, which performs equally well for different types of CEP subqueries. Hence their behaviors for Queries 3.4–3.7 described in Section 3.3 are similar. However, plain rewriting transforms CEP subqueries to SPARQL. We see that for SCEP queries with sequence and aggregation CEP subqueries, like Queries 3.6 and 3.7, which have window and correlation operators, the plain rewriting approach performs much worse, with recovery latencies greater than 10 minutes even for short downtimes. As discussed in Section 4.4.1, it is possible to rewrite correlation constraints as SPARQL patterns. However, this causes repetitive, costly self-joins over the entire data set. Increasing the archive capacity exacerbates this problem. The hybrid approach, on the other hand, side-steps this issue by pushing CEP subqueries to the realtime SCEP engine.
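The hybrid strategy's division of labor can be illustrated with a minimal sketch. The data shapes and the "rising value" temporal pattern below are made up for illustration; in SCEPter the semantic selection runs as SPARQL against the 4Store archive and the temporal pattern runs in the realtime CEP engine.

```python
def hybrid_recover(archive, semantic_pred, window):
    """Recover missed matches after a downtime: evaluate the semantic subquery
    database-side, then replay only the surviving raw events through the
    temporal pattern matcher."""
    # Step 1 (database side): semantic selection over archived events,
    # simulated here by a plain predicate over (sensor, timestamp, value).
    survivors = [e for e in archive if semantic_pred(e)]

    # Step 2 (CEP side): a toy temporal pattern -- value rising across
    # consecutive events of the same sensor within `window` time units.
    matches, last = [], {}
    for sensor, ts, value in survivors:
        if sensor in last:
            prev_ts, prev_val = last[sensor]
            if ts - prev_ts <= window and value > prev_val:
                matches.append((sensor, prev_ts, ts))
        last[sensor] = (ts, value)
    return matches
```

Because the semantic predicate prunes the archive before replay, the CEP engine sees far fewer events than naïve replay, which mirrors the scaling behavior observed above.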
6.3 Stateful Complex Event Processing

We validate the feasibility of the H2O query model for online and on-demand query processing, and also evaluate the performance benefits of the optimization techniques that have been proposed and implemented. We perform the experiments on a 4-core 2.4GHz Intel Xeon server with 16 GB of physical memory, running Red Hat Linux 4.1.2 and using 64-bit Java JDK v1.6.

6.3.1 Online Query Processing

We evaluate the time performance of H2O online query processing using queries created from Query 5.2. Specifically, query instances are created using Query 5.2 as a template with different filter constraints. We study the throughput of the system as the number of user queries increases under the following settings: (1) processing online queries with the baseline approach; (2) processing online queries with the semantic indexing optimization. Figure 6.3(a) shows the maximum online query processing throughput (data tuples/sec) on the Y axis and the number of Query 5.2 variants on the X axis. As we can see, the processing throughput decreases as the query set size increases, i.e., when more user queries need to be processed concurrently. The throughput flattens as the query set size grows, since more queries share the same filter constraints. In particular, when there is only one user query, the baseline approach has a peak throughput of ∼2,000 data tuples per second, while with the semantic indexing optimization, the online query throughput peaks at around 120K data tuples per second. This is because the time-expensive semantic inferences are performed at compile time rather than at runtime.
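The semantic indexing idea can be sketched as follows: expand a semantic concept into the concrete types it subsumes once, at query-compile time, so that runtime filtering reduces to a set-membership test. The subclass hierarchy below is a toy stand-in for the OWL ontology, with hypothetical names.

```python
# Toy subclass hierarchy standing in for the OWL ontology (hypothetical names).
SUBCLASSES = {
    "Sensor": ["TemperatureSensor", "CO2Sensor"],
    "TemperatureSensor": [],
    "CO2Sensor": [],
}

def compile_semantic_index(concept):
    """Transitively expand `concept` into all types it subsumes. This is done
    once, at query-compile time, which is where the inference cost is paid."""
    closure, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in closure:
            closure.add(c)
            stack.extend(SUBCLASSES.get(c, []))
    return closure

def make_semantic_filter(concept):
    index = compile_semantic_index(concept)
    return lambda tuple_type: tuple_type in index  # O(1) set lookup at runtime
```

Per-tuple work is then a constant-time lookup regardless of ontology depth, which is consistent with the roughly 60x throughput gap reported above.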
Figure 6.3: H2O Online and On-demand Query Performance. (a) Online Query Throughput; (b) On-demand Query Latency.

6.3.2 On-demand Query Processing

In this set of experiments, we evaluate the time performance of H2O’s on-demand query processing, using Query 5.4 as the online query and Query 5.6 as the on-demand query. Since an on-demand query evaluates the query states of a target online query in an ad-hoc manner, the on-demand query latency largely depends on the size of the online query states. Hence, we study the on-demand query latency as the size of the online query states increases and compare the following scenarios: (1) processing on-demand queries with the baseline approach; (2) processing on-demand queries using state indexes. Figure 6.3(b) shows the on-demand query processing latency in nanoseconds (ns) on the Y axis and the state set size on the X axis for Query 5.6. It shows that the query latency increases as the number of query states increases. The baseline approach has higher latencies under the same conditions, while with state indexing, the on-demand query latency is relatively low and has a more gradual rate of increase. With a high input data rate that leads to more intermediate states created by the online query, we expect the on-demand query latency with the state indexing optimization to be much lower than that of the baseline approach.

Chapter 7 Empirical Evaluation: Dynamic Demand Response in Micro Grid

Dynamic Demand Response (D²R) in the USC campus Micro Grid encompasses many features that make it a representative Fast Data application.
USC maintains a relatively “smart” electrical and equipment infrastructure that collects fine-grained measurement data at high Velocity suitable for online analysis. It has the ability to measure energy usage in 100+ major buildings at second-level intervals, with the possibility of room-level HVAC and presence measurements for a third of the buildings. The data sources are distributed in a Variety of physical and logical spaces managed by diverse application participants such as facility managers, building and department coordinators, students and staff. The D²R application also requires correlating realtime information from other relevant domains, such as weather forecasts and class and event schedules, for better demand prediction and curtailment planning. The large Volumes of data from the Micro Grid are also persisted in durable storage by USC Facilities and Management Services (FMS) for regulatory compliance and post-hoc analysis. In previous chapters, we described how our Complex Event Processing framework can help correlate continuous data in the Micro Grid and how it can perform query analysis to detect DR situations. For example, a semantic CEP query can be defined to detect a load curtailment opportunity for a temperature reset in a classroom if the current temperature measurement is less than 72°F and no classes are scheduled. Such insight into ongoing situations enables timely and opportunistic DR curtailment responses. While the application of CEP for demand response optimization is innovative, we have so far only discussed anecdotal applications that are tightly scoped to narrow scenarios and examples. In particular, there is a lack of detailed explanation pertaining to the relevant domain ontologies and the categories of query patterns that can benefit demand-side management in DR. There is also a lack of accessible means for defining them at a higher level of abstraction.
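The classroom temperature-reset pattern just described can be sketched in plain code as follows. The event and schedule shapes are hypothetical; in the framework this logic is expressed as a semantic CEP query over the domain ontologies rather than hand-written code.

```python
# Hypothetical shapes: temp_events = [(room, timestamp, temp_f), ...],
# schedule = [(room, class_start, class_end), ...].

def curtailment_opportunities(temp_events, schedule):
    """Rooms eligible for a temperature reset: a reading below 72 F and no
    class scheduled in that room at the reading's timestamp."""
    hits = []
    for room, ts, temp_f in temp_events:
        in_class = any(start <= ts < end
                       for r, start, end in schedule if r == room)
        if temp_f < 72.0 and not in_class:
            hits.append((room, ts))
    return hits
```

The semantic layer's contribution is that "classroom", "temperature measurement" and "class schedule" are ontology concepts, so the query author never enumerates sensor ids or room lists by hand.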
In this chapter, we explore the D²R application in depth using the proposed CEP-based Fast Data query framework. Specifically, we make the following contributions:

1. Exploring the diverse domain knowledge relevant to demand response and the use of Semantic Web technologies to capture it in a modular and extensible architecture (§ 7.3).

2. Discussing a taxonomy of event pattern queries to guide different aspects of DR, along with examples for demand management in the USC campus Micro Grid (§ 7.4).

3. Evaluating the efficacy of event-based D²R by presenting pattern detection statistics from the USC campus Micro Grid experiments (§ 7.5).

7.1 Approach Overview

7.1.1 State of the Art

Existing DR strategies available within the USC campus Micro Grid include direct control strategies, like Global Temperature Reset (GTR) and Duty Cycling, and voluntary curtailment strategies through email notifications sent to building occupants. Currently, these strategies are scheduled at pre-determined time periods, days ahead of time, based on historical power usage trends. However, these strategies can also be initiated based on near-realtime energy usage conditions and supplemented with nimble strategies that leverage dynamic demand reduction opportunities on campus. Realtime building, equipment and environment monitoring information from the campus BAN, the campus schedule, facility details and weather forecasts can be analyzed to detect additional curtailment opportunities.

7.1.2 Event-Driven Demand Response in Micro Grid

Continuous, time-series data from sensors and other information sources in the Micro Grid can be abstracted as event streams. For example, an event stream may be comprised of the timestamped kWh energy usage of an HVAC unit in a particular room, reported every minute through the BAN. Weather conditions for a particular zipcode, provided every hour by the NOAA web service [11], can also form an event stream.
Dynamic DR situations of interest can be modeled as combinations of these event occurrences, i.e., as event patterns. As we discussed before, defining query patterns over raw event data can be tedious, especially in an information-rich domain such as a Smart Grid. Our semantic CEP framework facilitates event query specification using high-level domain concepts. This makes it more user-friendly compared to existing CEP systems, which process events as relational data tuples and require precise knowledge of the underlying information infrastructure.

Figure 7.1 shows the proposed architecture of event-driven DR in the campus Micro Grid. The semantic CEP engine matches query patterns in realtime by correlating events from various sources in the Micro Grid infrastructure. Supply-side information is provided a priori by the utility. Detected patterns can be converted to operational actions by an automated rule engine. Actions can include direct control of equipment, such as GTR and duty cycling, sending notifications to DR participants, or CEP engine configurations such as activating more aggressive curtailment queries, and so on. Here we focus on the information space and event patterns that are relevant for dynamic demand response optimization in the Micro Grid.

Figure 7.1: Event-Driven Demand Response in Micro Grid

7.2 Smart Grid Information Space

Smart Grid applications depend on the availability of relevant information for their effective operation. A variety of information can be leveraged in the context of demand response optimization, and these may be considered in the SCEP model for this application.

Realtime Monitoring.
Power consumption details, collected from smart meters and other sensing devices, enable us to improve the accuracy of forecast models by correcting errors in the prediction model and improving the model within a short cycle. They also help monitor the response to curtailment strategies that are initiated, and actively tune the responses. Different frequencies of information collection can be attempted to trade off accuracy against the cost of collection. For example, smart meters can collect and report power consumption information as frequently as once per minute, though once per 15 minutes is more often used. The Complex Event Processing module can make use of the time-series measurement readings to find interesting patterns that help reduce consumption at the observed location. Power consumption information can be useful either at the individual consumer level or when aggregated over neighborhoods.

Infrastructure Configuration. Besides information about the power grid infrastructure, such as the distribution network, substations and feeders, it is also useful to model information about the environmental infrastructure at the city and consumer scale, since it influences power usage. Information at the level of individual buildings may provide features such as building structure, orientation (for sunlight) and equipment installation. At the macro scale, the layout of road networks as well as traffic flow can provide pertinent knowledge. For example, traffic information from road networks provides interesting insights into how consumption will be affected: when traffic congestion is high in the evening, household power demand shifts based on the number of people staying at home.

Participant Behavior. Consumer behavior provides valuable insight into electricity consumption and helps in understanding power usage patterns.
For instance, a customer’s billing information over a period of time can be used to predict his or her electricity consumption for the next billing cycle. Similarly, customer demographics help us understand how consumption varies from one demographic to another, as well as find the similarities between them for clustering response strategies. Apart from this information, social network feeds can be used to understand how a person’s actions influence the people around them. A person might be motivated to cut down electricity consumption by reading about friends’ savings through energy conservation. This information helps in finding groups that actively perform energy conservation and may be early adopters of new tools. These feeds, when combined with location details that may be available from mobile phone GPS units, provide us with “human sensors” to report environmental information.

Event Schedules. Scheduling information provides knowledge about a future occurrence ahead of time. This information enables us to estimate the demand at a particular venue based on the type of event scheduled, as well as on the number of people expected to attend the event. Schedule information about individual people as well as facilities is useful. A person planning a vacation from work may indirectly indicate that they will not be at home either, thus predicting lower demand while also eliminating a source of demand reduction during curtailments in that period.

Natural Conditions. Environmental factors, such as weather and seasonal changes, help in determining the electricity consumption pattern in a region during a particular weather condition. For instance, when the outside temperature is around 60°F, the chillers inside a building will be set to higher temperatures, thereby causing a drop in electricity consumption.
Similarly, an impending serious weather event, such as a heatwave or a thunderstorm, may indicate a demand pattern different from usual. These details may, once again, be at different spatial and temporal scales, and may also include future events. As shown in Figure 7.2, it is interesting to see how each of these information sources is related to, or influences, the other information sources. It can also be seen that electrical equipment is installed on various infrastructures, and people have a direct impact on how this equipment is used, which ultimately determines the total consumption at a place at any point in time.

Figure 7.2: Interplay between Information Concept Spaces that are Relevant to Smart Grid Applications

7.3 Smart Grid Domain Ontology Model

We use Semantic Web ontologies to model domain knowledge in the Micro Grid, providing an integrated information view for the demand response application.

7.3.1 Model Architecture

The Smart Grid domain, being diverse in nature, involves a wide range of concepts from various domains. It is not possible to build from scratch one single model that encapsulates all the relevant concepts. Hence, our approach is to identify well-defined and understood ontologies in the candidate domains and integrate these by filling in the gaps. This modular and extensible strategy leverages the features provided by Semantic Web technologies. It allows us to build models on top of domain expertise, provide familiar conceptual terms for users and potentially leverage existing tools for knowledge sharing and reuse.
Depending on the level of knowledge representation present in these domains, we may have access to: (1) complete ontologies that capture all concepts required by the Smart Grid applications; (2) partially complete ontologies with some concepts or relationships missing; (3) the absence of an ontology but the existence of well-defined metadata schemas; or (4) a simple glossary of terms without a well-defined structure or semantics. Each of these requires a different level of intervention on our part. This includes identifying common or related concepts across domains and introducing relationships between them, introducing new, relevant concepts that are missing from a domain ontology, mapping existing metadata schemas to an ontology framework, or constructing a new domain ontology from the domain dictionary. Our Smart Grid information model is represented using the Web Ontology Language (OWL), one of the standards for knowledge representation. We have retained the namespaces of all the component ontologies we have reused, and for the ontologies and concepts we have introduced we maintain our own namespace. The ontologies were integrated using Protege [14], and the instances were populated using the Jena [8] Semantic Web framework for Java. The ontology schema as well as the instance data were stored in a MySQL database using the Jena API, and querying was performed using SPARQL [20].

7.3.2 Model Components

The various component ontologies do not exist in isolation. The relationships between concepts from the individual ontologies have been carefully established so that they form a single coherent ontology that serves as the Smart Grid information model. Instead of developing the component ontologies from scratch, we have reused some very well developed and standard ontologies for each domain.

Electrical Equipment Ontology. The main domain ontology we are interested in is the one pertaining to electrical equipment and electrical measurements.
This information, being the crux of the Smart Grid, needs to be captured in the information model. The International Electrotechnical Commission’s Common Information Model (CIM) [2] is a standard that describes the components of a power system at the distribution level and defines the information exchange between them. We are interested in the equipment on the consumer side and how much consumption that equipment records at each point in time. CIM describes these domain features in a structural form but does not describe their semantics. Hence, we transform the CIM standard into an ontology representation tailored to our needs. The ontology captures different types of equipment, as well as the measurement units used by this equipment. Figure 7.3 shows the different categories of equipment, like Lighting, Refrigeration, Sensor, and so on. Each of the categories has subcategories or specializations of equipment. For example, CO₂ sensors are a type of sensing element that helps in detecting the CO₂ level in an area at a point in time.

Figure 7.3: Electrical Equipment Ontology

Organization Ontology. It is essential to classify different classes of organizations, since their electricity consumption will differ. For instance, the consumption pattern of an airline is going to be different from that of an educational institution. The presence of this added information induces prior knowledge in the forecast models as to what the consumption pattern will be for each category of organization. Along with the organization information, it is also essential to capture the people involved in the organization as well as their roles within it. Information about people and their respective roles is relevant since it also helps in understanding consumption patterns. It also helps with responses to requests for curtailing power consumption.
For instance, the holidays for an organization or its departments may depend on its type, while the response of its members to demand reduction may depend on who in the organization sends such a request (facility manager, head of the organization, immediate supervisor, etc.). These relevant concepts are included in the DBPedia Ontology [3], and we have reused this ontology in our information model to capture the corresponding information.

Infrastructure Ontology. The Smart Grid information model also captures environmental concepts, including transportation networks, buildings and so on, aside from the power grid infrastructure. These concepts improve demand response applications by bringing in context about the type of infrastructure that consumes electricity. For instance, an office building that has 20 floors will consume more electricity than an office building with 5 floors; likewise, the traffic on freeways helps evaluate shifts in demand. The DBPedia ontology integrated into our model covers a broad range of infrastructure-specific concepts while at the same time providing specializations of various infrastructures, like office buildings, hospitals, etc.

Weather Ontology. Weather information is one of the most crucial parts of our information model, helping us understand the electricity consumption pattern in a particular geography. We integrated the NNEW Weather Ontology [10], which uses the SWEET 2.0 [22], JMBL [9] and WordNet [24] ontologies in a coherent manner, to provide a rich set of vocabularies for defining various weather phenomena. The SWEET ontology captures concepts pertaining to the earth sciences, like physical phenomena, space, human activities, etc., whereas WordNet provides a large domain-independent lexical database. The NNEW ontology makes an attempt to reuse the concepts defined in the other ontologies and carefully extends those concepts which are essential to describe weather phenomena.
NNEW covers low-level domain-specific concepts such as thunderstorms, hurricanes, and precipitation, as well as high-level concepts that are not domain specific, such as phenomena. While not all concepts from the domain are required by Smart Grid applications, we do not modify these ontologies, which allows us to use and update them consistently; only the relevant parts of an ontology need to be populated with instances and used in queries and inferencing.

Spatial Ontology. Power consumption is linked to specific equipment or infrastructure at a spatial location drawing power from the feeders supplying that location. But usage is also influenced by people, whose locations change, and by external influences like weather that have regional impact. The fact that a building is part of a city's downtown suggests that the building will experience a decrease in demand during evenings and weekends. We also go beyond latitude and longitude: addresses or ZIP codes are sometimes available, and we may also have point coordinates or regions. These need to be captured in the ontology so that we can perform inferencing and geo-spatial queries at a later point in time. For example, mobile phones may report the location of a person but also add an error boundary that places them within a broader circle. Just like the organizational and infrastructure concepts, the spatial concepts are covered by the DBPedia Ontology, and the fact that some of the basic relationships among them are already established makes it a much better choice than other, isolated ontologies. Figure 7.4 shows a small snapshot of the DBPedia Ontology. The concepts are shown on the left side, while the relationships between concepts are shown using arrows. For instance, person and means of transportation share a relationship showing how people use different means of transportation to commute.
It can also be seen that the ontology relates the different infrastructures that are available to how people make use of them.

Figure 7.4: Infrastructure Ontology

Temporal Ontology. Power consumption happens over time, and demand response applications specifically attempt to learn from past consumption patterns to predict and control future consumption. Scheduling information of infrastructure, electrical equipment, and individual people is relevant to understanding how much electricity demand there is going to be. For example, the fact that an air conditioner is scheduled to run every day for a certain period of time at a predefined temperature gives a sense of how much consumption there will be for the scheduled period. The W3C Calendar Ontology [23] provides the vocabularies to capture scheduling and calendar-related information. The ontology integrates iCalendar [6], a widely used format for sending meeting requests, with other Semantic Web data.

7.3.3 Model Relations

Since the component ontologies pertain to domain-specific concepts, it is necessary to integrate them into one single information model, the Smart Grid Information Model. A simple union of the concepts from the various domains will not suffice. It requires establishing concise and meaningful relationships between domains so that we can perform complex queries and inferencing. While a structural schema can help us perform queries (even complex ones), semantic inferencing is possible only if adequate relationships across domains exist, i.e., we perform knowledge capture, not just information capture. The inferences help uncover patterns that are less intuitive and also help improve the performance of the entire architecture. Figure 7.5 shows some of the inter-domain relationships we have established, as well as how the key concepts in one domain relate to key concepts in another.
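To make the flavor of such cross-domain relationships concrete, the toy sketch below joins a spatial relationship (building located in a place) with a weather relationship (place experiencing a phenomenon). The entity and predicate names are hypothetical, and a plain Python set stands in for the RDF triple store that our model actually uses.

```python
# Toy in-memory "triple store" of (subject, predicate, object) facts.
# Names are illustrative; the real model stores RDF and is queried via SPARQL.
triples = {
    ("MHP", "locatedIn", "Downtown"),
    ("SAL", "locatedIn", "UniversityPark"),
    ("Downtown", "experiences", "HeatWave"),
    ("UniversityPark", "experiences", "Clear"),
}

def infrastructure_under(condition):
    """Join two cross-domain relationships: place -> weather phenomenon
    (weather domain) and infrastructure -> place (spatial domain)."""
    places = {s for (s, p, o) in triples
              if p == "experiences" and o == condition}
    return {s for (s, p, o) in triples
            if p == "locatedIn" and o in places}
```

Here infrastructure_under("HeatWave") returns the buildings in places experiencing a heat wave; the same two-hop join shape underlies the inter-domain queries enabled by the relationships in Figure 7.5.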
A place, which is part of the spatial ontology, experiences certain weather conditions, which are part of the weather ontology; hence we have established a relationship between place and weather phenomena. This relationship lets us query for infrastructure in places that are experiencing certain weather conditions. Similarly, infrastructure at various places has many pieces of electrical equipment installed. It is essential to establish relationships between these concepts even though they are part of two separate domains. These relationships help in understanding the consumption pattern of a particular infrastructure, as well as consumption at a larger scale, such as that of an area, since infrastructure is also related to places. Similarly, the scheduling information mentioned before can be related to people, infrastructure, or equipment. All these concepts are part of different component ontologies, but we have established relationships corresponding to a person's schedule, a venue's schedule, or an equipment's operating schedule, both to capture the correct relationship and to provide a platform for meaningful inferencing.

Apart from the relationships we have identified, deeper linking between concepts from different domains can also occur. This linking will be performed as we attempt to capture more knowledge about the different domains. Although we do not yet have relationships in this category, we will be able to add them selectively as the need arises over time.

Figure 7.5: Integrated Domain Ontology

7.4 Dynamic Demand Response Query Taxonomy

The potential space of D2R event patterns is enormous. Without investigation and classification, it is onerous for operators to go beyond simple patterns and exploit the expressivity of semantic CEP patterns for the different aspects of dynamic demand response.
We offer a taxonomy of D2R event patterns motivated by scenarios and semantic concepts observed in the USC Micro Grid, but generalizable to other environments. Figure 7.6 shows the top-level orthogonal dimensions of this taxonomy, the key characteristics to consider when defining a D2R pattern. Patterns are not exclusive to one dimension but have a specific feature for each dimension.

1. DR Pattern
   1.1 End-use Purpose
   1.2 Spatial Scale
   1.3 Temporal Scale
   1.4 Representation
   1.5 Life Cycle
   1.6 Adaptivity

Figure 7.6: Top-level Orthogonal Dimensions of the D2R Pattern Taxonomy

7.4.1 End-Use Purpose Dimension

Patterns can be categorized based on the objective of their eventual use, as shown in Figure 7.7. These categories are typically exclusive. The obvious examples are curtailment patterns that identify curtailment opportunities, which may detect transient power wastage or trigger direct and voluntary curtailment actions. However, patterns can also play a role in situation monitoring and early warning. Meter readings can be aggregated to monitor demand levels, and indirect influencers of power usage can be used to predict demand trends. These monitoring and prediction patterns can trigger control and notification actions or initiate detection of specific curtailment patterns. This enables incremental and opportunistic DR curtailment.

7.4.1.1 Monitoring Pattern

Patterns in this category evaluate demand profiles of spaces and equipment at fine granularity by analyzing and aggregating meter and sensor data. Below is a sample monitoring pattern:

Example 1. Power used by building “MHP”, averaged over 5 minutes, exceeds a given pre-peak load of 27 kW.
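Operationally, this pattern is a threshold test over a 5-minute sliding average of meter readings. Before giving its SCEP form below, here is a minimal Python sketch of the same check; the class name and event format are illustrative, not part of the SCEPter implementation.

```python
from collections import deque

class SlidingAvgMonitor:
    """Fires when the average of the kW readings seen within the trailing
    time window exceeds a threshold (cf. Example 1: 5 min / 27 kW)."""

    def __init__(self, window_s, threshold_kw):
        self.window_s = window_s
        self.threshold_kw = threshold_kw
        self.buf = deque()  # (timestamp_s, kw) pairs inside the window

    def on_event(self, t, kw):
        self.buf.append((t, kw))
        # Evict readings that have slid out of the window.
        while self.buf and self.buf[0][0] <= t - self.window_s:
            self.buf.popleft()
        avg = sum(v for _, v in self.buf) / len(self.buf)
        return avg > self.threshold_kw
```

Each incoming event advances the window by one event, which is exactly the sliding-window evaluation frequency discussed later in Section 7.4.3.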
1.1 End-use Purpose
   1.1.1 Monitoring Pattern
      1.1.1.1 Demand Monitoring
      1.1.1.2 Response Monitoring
   1.1.2 Prediction Pattern
      1.1.2.1 Direct Prediction
      1.1.2.2 Indirect Prediction
   1.1.3 Curtailment Pattern
      1.1.3.1 Shave Opportunity
      1.1.3.2 Shift Opportunity
      1.1.3.3 Shape Opportunity

Figure 7.7: End-Use Purpose Dimension of the D2R Pattern Taxonomy

Let ?m represent events from the meter's kW interval measurement stream; the SELECT and WHERE clauses of the corresponding SCEP query are:

SELECT AVG(?m)>27 AS avg
WHERE PATH { ?m evt:hasSource ?src .
             ?src bd:hasLocation bd:MHP }
| WINDOW (?m*, 5min)

The above query monitors the power consumption of a certain space and detects when curtailment needs to be performed. It helps initiate low-latency curtailment strategies, such as changing the setpoint of a variable frequency drive unit in the building where the pattern was seen, to avoid a peak demand.

We further classify monitoring patterns as demand monitoring and response monitoring patterns (Figure 7.7). Example 1 is a demand monitoring pattern. Response monitoring patterns evaluate the effectiveness of a curtailment operation, and can be used to determine if a more aggressive curtailment strategy is required. An example response monitoring situation is:

Example 2. 15 minutes after a global temperature reset (GTR) operation was performed in “MHP”, the building's power consumption remains greater than 30 kW.

A sequence CEP pattern can be used to detect such an insufficient curtailment situation and trigger further actions such as HVAC unit duty cycling.

7.4.1.2 Prediction Pattern

Traditional demand prediction models are ill-suited to energy forecasting at fine temporal and spatial scales, particularly as consumption profiles change [43]. On a campus Micro Grid, dynamic events like the scheduling or cancellation of classes, space occupancy changes, and holidays can help predict power consumption trends [31].
Prediction patterns are categorized as direct and indirect predictions (Figure 7.7). Direct predictions forecast demand solely from prior energy consumption using time-series models or historical baselines. Alternatively, indirect predictions combine demand influencers to predict future changes in demand.

Example 3. Power usage (reading) in an empty computer lab is currently less than 0.5 kW, and a class is scheduled in 1 hour.

Semantic subqueries can be defined for the class schedule and meter measurement streams to filter events based on the location types and pass qualified events, denoted ?m and ?c, to the following CEP subquery for correlation:

FILTER (?m.reading < 0.5)
SEQ (?m, ?c)
WINDOW (?m, ?c, 3600sec)

7.4.1.3 Curtailment Pattern

Curtailment patterns identify distributed and dynamic curtailment opportunities that supplement traditional scheduled or voluntary curtailments. These patterns can be defined by DR participants ranging from facility managers to department coordinators to end users. Curtailment patterns can be classified by their actions as shave, shift, or shape (Figure 7.7). Shave patterns detect non-critical or wasteful power usage that could be avoided, as in Example 4.

Example 4. The temperature in a meeting room is lower than 73°F when it is unoccupied.

Shift patterns identify non-urgent power demand from certain equipment that can be rescheduled to off-peak periods. Such equipment may include washing machines and campus EVs. Lastly, shape patterns flatten demand curves by dynamically placing HVAC units, for example, on a duty cycle, as shown in Example 5.

Example 5. More than 6 fan coils are operating concurrently in building MHP during peak hours.

7.4.2 Spatial Scale

D2R query patterns are usually associated with a spatial dimension. This dimension helps identify the target spatial entity, either physical or virtual, on which some end-use action is required (Figure 7.8).
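One way to picture this spatial dimension is as a containment hierarchy in which events roll up from equipment and rooms to buildings and the campus, and a pattern is evaluated at a chosen level. The sketch below uses hypothetical room and building names; the real hierarchy comes from the spatial ontology described earlier.

```python
# Hypothetical containment hierarchy: child entity -> enclosing entity.
parent = {
    "Room101": "MHP", "Room102": "MHP", "LabA": "SAL",
    "MHP": "Campus", "SAL": "Campus",
}

def ancestors(entity):
    """Yield the entity and every spatial entity enclosing it."""
    while entity is not None:
        yield entity
        entity = parent.get(entity)

def demand_at(scale, readings):
    """Sum kW readings whose source falls, directly or transitively,
    inside the given spatial entity (room, building, or campus)."""
    return sum(kw for src, kw in readings if scale in ancestors(src))
```

The same stream of room-level readings can thus serve a building manager (demand_at("MHP", ...)) and a campus manager (demand_at("Campus", ...)) without changing the underlying events.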
7.4.2.1 Physical Space and Equipment

Physical power grid infrastructure entities include the campus, buildings, rooms, and individual equipment. The spatial granularity may vary for different DR participants or end uses. For example, campus managers can specify campus-level monitoring patterns to trigger global curtailment operations, while building managers may define room- or equipment-level demand prediction and curtailment patterns. Physical objects can be further classified as stationary and mobile. The latter include EVs and portable appliances, and benefit from the transparency offered by semantic patterns in masking variation in their physical event streams based on their location.

1.2 Spatial Scale
   1.2.1 Physical Space & Infrastructure
   1.2.2 Electric Equipment
      1.2.2.1 Stationary
      1.2.2.2 Mobile
   1.2.3 Virtual Space
      1.2.3.1 Organization
      1.2.3.2 Group
1.3 Temporal Scale
   1.3.1 Frequency
      1.3.1.1 Sliding Window
      1.3.1.2 Batch Window
   1.3.2 Latency
      1.3.2.1 Immediate/Zero
      1.3.2.2 Future/Positive

Figure 7.8: Spatial and Temporal Dimensions of the D2R Pattern Taxonomy

7.4.2.2 Virtual Space

DR patterns can also be defined for virtual spaces or objects such as organizations and customer segments. Virtual spaces may be physically contiguous, such as a department located in neighboring buildings, or scattered, such as a customer segment that is environmentally conscious. For example, upon detecting the situation described in Example 6 over the (virtual) department space, the department's coordinator can be notified to initiate local curtailment strategies within the department. Let ?m be events from the meter measurement stream; the SELECT and WHERE clauses of the corresponding SCEP query are shown below:

Example 6. The total power demand from the Electrical Engineering department exceeds 600 kW.

SELECT SUM(?m)>600 AS sum
WHERE PATH { ?m evt:hasSource ?src .
             ?src bd:hasLocation ?loc .
             ?loc bd:belongsTo org:EEDepartment }

7.4.3 Temporal Scale

The interval nature of events means that D2R queries have temporal properties, such as the frequency of evaluating the queries and the latency of the response after pattern detection, as shown in Figure 7.8.

7.4.3.1 Frequency

D2R queries may perform aggregation and correlation at different frequencies, determined by their window constraints; for example, we have sliding and batch window constraints. For queries with sliding WINDOW constraints, such as the query for Example 1, events are aggregated or correlated by gradually moving the window in single-event increments. For queries with batch window constraints, events are processed by moving the window in discrete, non-overlapping time/event blocks. A batch window is useful, for example, when we want to monitor a building's aggregated consumption every hour.

7.4.3.2 Latency

The latency of a D2R query is the difference between the time of its detection and the time of its consequence. Most patterns, including the monitoring and curtailment patterns, have immediate impact; they have zero latency. A prediction pattern, however, has a positive latency, as it is anticipatory and detects a future situation. A curtailment pattern may also have a positive latency when it is used to schedule a future curtailment operation rather than trigger one immediately.

7.4.4 Representation

As shown in Figure 7.9, D2R patterns are specified at different abstraction levels, primarily determined by the underlying event models. Using traditional CEP systems alone, syntactic patterns have to be defined over raw data streams. The event attributes can be either crisp values or fuzzy concepts, depending on the uncertainty in matching; this has been explored in other literature [34, 56]. As we described previously, our SCEP framework allows users to define semantic-level queries over one or more domain ontologies to shield users from the underlying data Variety.
Examples 1–6 illustrate such situations.

1.4 Representation
   1.4.1 Syntactic Pattern
      1.4.1.1 Crisp Value
      1.4.1.2 Fuzzy Value
   1.4.2 Semantic Pattern
      1.4.2.1 Single Domain
      1.4.2.2 Cross Domain
1.5 Life Cycle
   1.5.1 Persistent
   1.5.2 Scheduled
   1.5.3 On-demand
1.6 Adaptivity
   1.6.1 Static
   1.6.2 Dynamic

Figure 7.9: Representation, Life Cycle and Adaptivity Dimensions of the D2R Pattern Taxonomy

7.4.5 Life Cycle

The life cycle of a D2R query pattern is the time period during which it is active for the given application. As shown in Figure 7.9, some patterns may run persistently, some may only be active for scheduled periods, and others may be activated on demand (say, by other patterns that were detected). Most monitoring and prediction patterns are persistent. However, a curtailment pattern is meaningful only when there is a potential peak load to be handled, for example, after receiving a DR request from the utility or after a set of less aggressive curtailment operations has failed to achieve the power reduction goals. Since there is a resource cost associated with keeping patterns active, these patterns are active on schedule or on demand.

7.4.6 Adaptivity

D2R query patterns may be categorized by how often they evolve over time. Some queries may be static and need not change after they are initially introduced. However, some patterns, such as prediction patterns, may be affected by changes in the power grid infrastructure and consumer behavior. A novel area of research is to mine historical event streams to automate the process of defining interesting patterns, allowing patterns to self-adapt.

7.5 USC Micro Grid Experiments

The D2R CEP query taxonomy was informed by DR approaches that were investigated in the USC campus Micro Grid. Events representing different dimensions in the taxonomy were implemented and their efficacy was evaluated within the campus. We present those results here.
We use SCEPter, our semantic Complex Event Processing engine, to detect query patterns defined over a selected set of event streams in the campus BAN. These patterns span different DR end uses: monitoring, prediction, and curtailment. The experiments were conducted over a 4-day period on campus.

7.5.1 Events and Ontologies

The USC Micro Grid event streams used in our experiments are:

• Meter Measurement. Events from smart meters which measure buildings' kW loads.
• Fan Coil Status. Events from HVAC sensors which report the operational status of fan coils: “1” means ON and “0” means OFF.
• Class Schedule. Data from a calendar schedule service which generates a classroom schedule event an hour before a class begins.
• Room Temperature. Measurements from room-level space temperature sensors.
• Room Occupancy. Events from room occupancy sensors that provide boolean readings.

Events from the same type of sources are pushed to a single logical stream. The campus Micro Grid domain ontologies capture the properties of and relationships between physical space, electric equipment, and organizations on campus.

7.5.2 Queries and Empirical Evaluations

The four D2R pattern queries introduced in Section 7.4 are evaluated over the above event streams. Specifically, we analyze the detection of the patterns for Example 1 (average power consumption exceeds a peak load), Example 4 (space temperature of an unoccupied room less than 73°F), Example 5 (more than six fan coils concurrently active), and Example 6 (load of the EE department exceeds 600 kW). The experiments were conducted from Friday May 4th to Tuesday May 7th, 2012. Figure 7.10 shows the detection of these four patterns over the six event streams during that period. The detection frequency of some patterns was limited since this period coincided with the final exam week, when classes and DR curtailment were not actively scheduled.
Figure 7.10: Experiment Results (detections of Patterns 1, 4, 5 and 6, Friday 05/04 through Tuesday 05/08)

In Figure 7.10, pattern 1's detection indicates that the power consumption of the MHP building exceeded its pre-peak threshold from around 8:20 AM to 4:00 PM on Friday and from around 8:40 AM to 5:00 PM on Monday. The power load of MHP during weekends is below the pre-peak threshold because the building is primarily used for teaching. However, we observe from pattern 6 that the power consumption of the EE department exceeds its pre-peak threshold even on the weekend. Detection of these patterns helped the facility managers decide when and where to curtail energy use on campus; these patterns do not activate actual curtailments yet, but offer insight into the potential to do so.

Patterns 4 and 5 show opportunities for curtailment. From pattern 5, we know that more than 6 fan coils in MHP operate concurrently from about 8:00 AM to 5:00 PM on weekdays. By duty cycling the operation of the fan coils during this period, we can flatten the demand curve. In a separate experiment, we observed over 27% curtailment in peak demand by duty cycling fan coils in MHP. Pattern 4 monitors a meeting room in the EE department. Several group meetings were scheduled on Friday and Monday. We observed that people leave the room without resetting the thermostat, causing power wastage while the room is unoccupied, which is most of the time, especially over the weekend. These patterns and situations are detected in realtime, which helps facilitate fine-grained, timely, and intelligent DR strategies. The action rule engine shown in Figure 7.1 is responsible for mapping detected patterns to operational actions and for helping complete the event-based D2R control loop.
This will offer an accurate estimate of the improvement in curtailment response using the dynamic event-based demand response approach as compared to static schedules.

7.6 Related Work

Existing DR strategies use incentive-based and time-based programs. Incentive-based programs such as dynamic pricing offer benefits to customers who perform voluntary curtailment. This requires manual intervention by customers, and the outcome is less reliable. The Open Automated Demand Response Communications Specifications (OpenADR) model [105, 83] is increasingly used to communicate pricing signals to customers in realtime. These signals are mapped to operation modes of building control systems through production rules. Our work supplements this approach by providing the capability to correlate heterogeneous events to initiate and target curtailment strategies.

Time-based demand schedules are commonly used for DR in power grids. These approaches model DR as a mathematical optimization problem, maximizing the user's or the utility's benefit. In [88], the authors discuss optimal schedules of generation units and demand-side reserves, formulating the objective function as a two-stage stochastic programming model. In [84], the authors propose DR models for a single household that schedules appliance activities to minimize the user's bills. In the Micro Grid scenario, the authors of [42, 46] propose models for computing the optimum energy plan, i.e., the amount of power to be purchased, sold, transferred, and stored in the Micro Grid over a time period to minimize the total operation cost. Nevertheless, these DR approaches rely on accurate mathematical modeling, which requires in-depth knowledge of the system and is not sustainable as the power grid evolves with the deployment of new appliances and information sources. Unpredictable events that influence power consumption also occur dynamically.
An opportunistic DR paradigm driven by realtime monitoring data can hence supplement these existing approaches.

Complex Event Processing itself has received much attention in a variety of domains [34, 56], and there is increasing interest in using CEP for Smart Grid applications. In [72], the authors propose a CEP-based approach to detect building occupancy changes for energy saving. However, the occupancy change patterns and rules were specified over low-level data tuples. In [110], the authors discuss the vision of using CEP over linked Smart Grid data in general, but these are anecdotal rather than comprehensive uses of CEP. The authors of [86] introduced a light management system for smart offices using the ETALIS [38] stream reasoning and CEP engine. Their system reasons about user activities in realtime, such as when users leave the office, for efficient lighting control. This is supported by ontologies that capture semantic relations between multiple movement sensors and lighting actuators. As discussed in Section 3.6, ETALIS differs from the approaches we developed for SCEPter, as it adopts rule-based reasoning, a bespoke solution that departs from traditional CEP systems, for event pattern modeling and matching. Nevertheless, a comprehensive analysis of query patterns to guide event-based Smart Grid D2R application development is missing. To our knowledge, our work is among the first efforts to analyze and implement semantic CEP for demand response applications at a Micro Grid scale.

7.7 Conclusions

In this chapter, we discussed the use of semantic CEP for dynamic demand response optimization in a campus Micro Grid. By incorporating realtime sensing data and domain knowledge as semantic events, our approach enables DR end-use needs, such as monitoring, prediction, and curtailment, to be intuitively modeled as high-level patterns without knowledge of the raw events.
Our taxonomy, informed and validated by DR techniques in the Micro Grid, offers a structure for operators to develop their own suite of DR patterns for their service area. Semantic CEP offers a powerful online data analysis tool for achieving more accurate and timely DR, but this requires an in-depth study of its real-world use; our work is a step toward translating this potential into reality. For future work, we believe the ability to automatically mine for self-adaptive query patterns can lead to a paradigm shift in informatics-driven demand management for a reliable and efficient Smart Grid.

Chapter 8 Dissertation Conclusions

The focus of our work is leveraging Complex Event Processing (CEP) for Fast Data management. Motivated by applications in the Smart Grid and e-commerce domains, we explored the problem space within the 3-V dimensions of Fast Data. In particular, we studied Semantic CEP with respect to data Variety, Resilient CEP with respect to data Velocity, and Stateful CEP with respect to data Volume.

We discussed the data Variety present in motivating Fast Data applications and introduced a Semantic Complex Event Processing (SCEP) framework for high-level query processing over heterogeneous data streams. We model realtime data as timestamped event tuples and model the underlying data Variety using Semantic Web domain ontologies. We described semantically enriched CEP event and query models to support domain-level query specification, hiding data Variety from end users. Online query processing techniques using optimizations like query rewriting, event buffering, and semantic caching were discussed to mitigate severe performance overheads.

For resilient analytics over Fast Data that span the past (in a static or slow-changing history store), present, and future (on high-Velocity realtime streams), we presented SCEPter to uniformly process SCEP queries across these data boundaries.
The SCEP query model was extended to seamlessly operate over end-to-end event streams, from network to storage. We discussed approaches for performing SCEP queries over event archives using plain query rewriting, naïve event replay, and a hybrid that leverages the arbitrage between them. Integrated query plans were analyzed in the context of the temporal gaps that may exist between data boundaries to return consistent results.

Motivated by applications from the Smart Grid and e-commerce domains, we introduced Stateful Complex Event Processing to support Hybrid Online and On-demand (H2O) queries over the transient, dynamic Volume of Fast Data. H2O inherits CEP systems' capability for high-throughput online query processing, and also supports on-the-fly ad-hoc query evaluation. We developed a formal query algebra to capture the statefulness and subsumption semantics of CEP queries, which forms the foundation of a hierarchical query paradigm. A unified query model was proposed based on this query algebra for online and on-demand query specification. Online query states are leveraged as a means of dynamic volume management and view materialization to facilitate on-demand query evaluation.

We implemented the proposed models and techniques in prototype systems, and quantitatively evaluated their performance using benchmark data from the Smart Grid domain. We applied the proposed framework to Dynamic Demand Response (D2R) optimization in the USC campus Micro Grid to manage demand-side power load in response to supply conditions in realtime. Traditional demand response approaches require advance planning, hours or days ahead, and operate on a broadcast principle that reaches all customers. In contrast, D2R leverages Fast Data, including realtime meter and sensor readings, event schedules, and other digital data streams, to understand dynamic energy consumption and respond with precise curtailment actions, with low latency and high relevance.
Besides the work presented in this thesis, we propose the following problems as future work:

• Inexact Complex Event Processing over Uncertain Data. Uncertainty is an intrinsic feature of real-world applications, where potentially incomplete, unreliable, and even incorrect information exists, yet query patterns over such data need to be matched within certain bounds. Existing CEP systems only support precise pattern matching, without any leeway to relax pattern constraints. As part of future work, we may work on an event and query model that correctly captures the uncertainties of events and query patterns. Traditional online query algorithms need to be augmented for incremental uncertainty reasoning in addition to exact pattern matching.

• Adaptive Complex Event Processing with Online Learning. Cyber-physical systems often evolve continuously, yet query patterns defined based on prior domain knowledge need to adapt to such changes. For example, in dynamic demand response optimization for the Smart Grid, the accuracy of query patterns that predict power demand may be affected by changes in the grid infrastructure and consumer behavior. A novel area of research is to mine historical event streams or perform online learning to allow query patterns to self-adapt.

Bibliography

[1] Agilewaves. http://www.agilewaves.com.
[2] CIM Standards. http://www.iec.ch/smartgrid/standards.
[3] DBPedia. http://wiki.dbpedia.org/Ontology.
[4] Esper Complex Event Processing. http://esper.codehaus.org.
[5] Google Power Meter. http://www.google.com/powermeter.
[6] iCalendar. http://www.ietf.org/rfc/rfc2445.txt.
[7] International Electrotechnical Commission. http://www.iec.ch.
[8] Jena Framework. http://jena.sourceforge.net.
[9] JMBL. https://wiki.ucar.edu/display/NNEWD/JMBL.
[10] NNEW Weather Ontology. https://wiki.ucar.edu/display/NNEWD.
[11] NOAA Weather Service. http://graphical.weather.gov.
[12] Oracle Complex Event Processing.
http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html.
[13] OWL Web Ontology Language Reference. http://www.w3.org/TR/owl-ref.
[14] Protege Ontology Editor and Framework. http://protege.stanford.edu.
[15] Pulse Energy. http://www.pulseenergy.com.
[16] Resource Description Framework (RDF). http://www.w3.org/RDF.
[17] RuleCore Complex Event Processing. http://www.rulecore.com.
[18] SAP Sybase Event Stream Processor. http://www.sybase.com/products/financialservicessolutions/complex-event-processing.
[19] Smart Grid. http://en.wikipedia.org/wiki/Smart_grid.
[20] SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query.
[21] Storm Distributed Realtime Computation System. http://storm.incubator.apache.org.
[22] SWEET 2.0 Ontology. http://sweet.jpl.nasa.gov/ontology.
[23] W3C Calendar Ontology. http://www.w3.org/TR/rdfcal.
[24] WordNet Ontology. http://www.w3.org/TR/wordnet-rdf.
[25] FERC assessment of demand response and advanced metering. Federal Energy Regulatory Commission Staff Report, 2008.
[26] The arrival of real-time bidding. Google White Paper, 2011.
[27] Big data and utility analytics for smart grid. SAS Research Excerpt, 2013.
[28] Big data gets real-time: Oracle fast data. Oracle White Paper, 2013.
[29] Managing big data for smart grids and smart meters. IBM White Paper, 2013.
[30] A. Demers, J. Gehrke, et al. Cayuga: A general purpose event monitoring system. In CIDR, 2007.
[31] S. Aman, Y. Simmhan, and V. Prasanna. Improving energy use forecast for campus micro-grids using indirect indicators. In International Workshop on Domain Driven Data Mining, 2011.
[32] C. Acuna, E. Marcos, J. Gomez, and C. Bussler. Toward web portals integration through semantic web services. In International Conference on Next Generation Web Services Practices, 2005.
[33] R. Adaikkalavan and S. Chakravarthy. SnoopIB: Interval-based event specification and detection for active databases.
Data and Knowledge Engineering Jour- nal, 2006. [34] A. Adi, D. Botzer, G. Nechushtai, and G. Sharon. Complex event processing for financial services. In IEEE Services Computing Workshops, 2006. 132 [35] M. Akdere, U. C ¸ etintemel, and N. Tatbul. Plan-based complex event detection across distributed sources. Proc. VLDB Endow., 1(1):66–77, Aug. 2008. [36] S. Aman, Y . Simmhan, and V . Prasanna. Improving energy use forecast for cam- pus micro-grids using indirect indicators. In International Workshop on Domain Driven Data Mining (DDDM), 2011. [37] J. L. Andrew Crapo, Xiaofeng Wang and R. Larson. The semantically enabled smart grid. In Grid-Interop, 2009. [38] D. Anicic, S. Rudolph, P. Fodor, and N. Stojanovic. Stream reasoning and com- plex event processing in etalis. Semantic Web Journal, 2012. [39] B. Francesco, B. Daniele and et al. An execution environment for c-sparql queries. In EDBT, 2010. [40] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In ACM SIGMOD-SIGACT-SIGART Symposium on Princi- ples of Database Systems, PODS ’02, 2002. [41] N. Backman, R. Fonseca, and U. C ¸ etintemel. Managing parallelism for stream processing in the cloud. In the 1st International Workshop on Hot Topics in Cloud Data Processing, HotCDP ’12, pages 1:1–1:5, New York, NY , USA, 2012. ACM. [42] A. Bagherian and S. Tafreshi. A developed energy management system for a microgrid in the competitive electricity market. In PowerTech, IEEE Bucharest, 2009. [43] S. Bhattacharyya and G. Timilsina. Energy demand models for policy formula- tion. The World Bank, Policy Research Working paper, 2009. [44] A. Biem and et al. IBM infosphere streams for scalable, real-time intelligent transportation services. In SIGMOD, 2010. [45] S. Chakravarthy and D. Mishra. Snoop: an expressive event specification lan- guage for active databases. Data Knowl. Eng., 14:1–26, November 1994. [46] S. Choi, S. Park, D.-J. Kang, S. jae Han, and H.-M. Kim. 
A microgrid energy management system for inducing optimal demand response. In IEEE Interna- tional Conference on Smart Grid Communications, 2011. [47] G. E. Churcher and J. Foley. Applying complex event processing and extending sensor web enablement to a health care sensor network architecture. In Sensor Systems and Software, volume 24. 2010. 133 [48] G. Cugola and A. Margara. Complex event processing with T-REX. Journal of Systems and Software, 85(8):1709–1728, 2012. [49] G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys, 44(3), June 2012. [50] D. Abadi and D. Carney et al. Aurora: a data stream management system. In SIGMOD, 2003. [51] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams – a new class of data management applications. In VLDB, 2002. [52] D. J. Abadi, Y . Ahmad, M. Balazinska, M. Cherniack, J. Hwang, W. Lindner et al. The design of the Borealis stream processing engine. In CIDR, 2005. [53] L. David and F. Brian. Complex event processing in distributed systems. Techni- cal report, Stanford University, 1998. [54] A. Demers, J. Gehrke, M. Hong, M. Riedewald, and W. White. Towards expres- sive publish/subscribe systems. In EDBT, pages 627–644, 2006. [55] Y . Diao, N. Immerman, and D. Gyllstrom. SASE+: An agile language for Kleene closure over event streams. Technical report, UMass, 2007. [56] J. Dunkel. On complex event processing for sensor networks. In International Symposium on Autonomous Decentralized Systems, 2009. [57] A. Erick, K. Martin, and M. Vladimir. Biological knowledge management: the emerging role of the semantic web technologies. Briefings in Bioinformatics, 2009. [58] O. Etzion and P. Niblett. Event Processing in Action. Manning Publications, 1st edition, 2010. [59] N. Fabiane Bizinella and M. Lincoln A. 
Knowledge sharing and information integration in healthcare using ontologies and deductive databases. Medinfo 11 (Part 1), pages 62–66, August 2004. [60] D. C. Faye, O. Cure, and G. Blin. A survey of rdf storage approaches. ARIMA Journal, 15:11–35, 2012. [61] P. M. Fischer, K. S. Esmaili, and R. J. Miller. Stream schema: Providing and exploiting static metadata for data stream processing. In International Conference on Extending Database Technology, 2010. 134 [62] M. S. Gannon T. and et. al. Semantic Information Integration in the Large: Adapt- ability, Extensibility, and Scalability of the Context Mediation Approach. Tech- nical report, 2005. [63] H. Garcia-Molina and K. Salem. Main memory database systems: An overview. IEEE Trans. on Knowl. and Data Eng., 1992. [64] T. J. Green, A. Gupta, G. Miklau, M. Onizuka, and D. Suciu. Processing xml streams with deterministic automata and stream indexes. ACM Trans. Database Syst., 29(4):752–788, 2004. [65] D. Gyllstrom, E. Wu, and et. al. SASE: Complex event processing over streams. In the 3rd Biennial Conference on Innovative Data Systems Research, 2007. [66] J. Han, E. Haihong, G. Le, and J. Du. Survey on nosql database. In the 6th International Conference on Pervasive Computing and Applications (ICPCA), pages 363–366, 2011. [67] S. Harris and et al. 4store: The design and implementation of a clustered rdf store. In SSWS, 2009. [68] M. Hirzel. Partition and compose: Parallel complex event processing. In Pro- ceedings of the 6th ACM International Conference on Distributed Event-Based Systems, 2012. [69] W. Kang, S. Son, and J. Stankovic. Design, implementation, and evaluation of a qos-aware real-time embedded database. IEEE Transactions on Computers, 61(1):45–59, 2012. [70] B. Kao and H. Garcia-molina. An overview of real-time database systems. In Advances in Real-Time Systems, 1994. [71] W. Kim and J. Seo. Classifying schematic and data heterogeneity in multi- database systems. Computer, 24(12):12–18, Dec 1991. [72] L. 
Renners, R. Bruns and J. Dunkel. Situation-aware energy control by combining simple sensors and complex event processing. In Workshop on AI Problems and Approaches for Intelligent Environments, 2012. [73] T. Lahiri, M.-A. Neimat, and S. Folkman. Oracle timesten: An in-memory database for enterpise. In IEEE Data Engineering Bulletin, 2013. [74] D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, 2001. 135 [75] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman, J. Patel, and M. Zwilling. High-performance concurrency control mechanisms for main-memory databases. In VLDB, 2011. [76] J. Lastra and I. Delamer. Semantic web services in factory automation: funda- mental insights and research roadmap. IEEE Transactions on Industrial Infor- matics, 2(1):1–11, Feb 2006. [77] Y .-N. Law, H. Wang, and C. Zaniolo. Query languages and data models for database sequences and data streams. In VLDB, 2004. [78] E. Liarou, R. Goncalves, and S. Idreos. Exploiting the power of relational- databases for efficient stream processing. In EDBT, 2009. [79] M. Liu, E. Rundensteiner, D. Dougherty, C. Gupta, S. Wang, I. Ari, and A. Mehta. Neel: The nested complex event language for real-time event analytics. In Enabling Real-Time Business Intelligence, volume 84 of Lecture Notes in Busi- ness Information Processing, pages 116–132. Springer Berlin Heidelberg, 2011. [80] S. Loesing, M. Hentschel, T. Kraska, and D. Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, pages 55–60, New York, NY , USA, 2012. ACM. [81] Y . Magid, A. Adi, M. Barnea, D. Botzer, and E. Rabinovich. Application gener- ation framework for real-time complex event processing. In IEEE International Conference on Computer Software and Applications, 2008. [82] C. Martinez, H. Huang, and R. Guttromson. Archiving and management of power systems data for real-time performance monitoring platform. 
Technical Report 15036, PNNL, 2005. [83] J. Mathieu, P. Price, S. Kiliccote, and M. Piette. Quantifying changes in building electricity use, with application to demand response. IEEE Transactions on Smart Grid, 2011. [84] A.-H. Mohsenian-Rad and A. Leon-Garcia. Optimal residential load control with price prediction in real-time electricity pricing environments. IEEE Transactions on Smart Grid, 1, 2010. [85] B. Mozafari, K. Zeng, L. D’antoni, and C. Zaniolo. High-performance complex event processing over hierarchical data. ACM Trans. Database Syst., 2013. [86] N. Stojanovic, D. Milenovic and et al. An intelligent event-driven approach for efficient energy consumption in commercial buildings. In ACM conference on Distributed event-based system, 2011. 136 [87] N.Dindar, M. Fischer and et al. Efficiently correlating complex eventsover live and archived data streams. In DEBS, 2011. [88] M. Parvania and M. Fotuhi-Firuzabad. Demand response scheduling by stochastic SCUC. IEEE Transactions on Smart Grid, 2010. [89] R. Poovendran, K. Sampigethaya, S. K. S. Gupta, I. Lee, K. V . Prasad, D. Cor- man, and J. Paunicka. Special issue on cyber-physical systems. Proceedings of the IEEE, 2012. [90] R. A. Pottinger and P. A. Bernstein. Merging models based on given correspon- dences. In VLDB, 2003. [91] Q. Zhou, A. Bakshi, V . Prasanna and R. Soma. Towards an integrated modeling and simulation framework for freight transportation in metropolitan areas. In IEEE International Conference on Information Reuse and Integration, 2008. [92] Q. Zhou, Y . Simmhan and V . Prasanna. Towards an inexact semantic complex event processing framework. In ACM International Conference on Distributed Event Based Systems, 2011. [93] Q. Zhou, Y . Simmhan and V . Prasanna. On using semantic complex event pro- cessing for dynamic demand response optimization. Technical report, Computer Science Department, University of Southern California, 2012. [94] E. Rabinovich, O. Etzion, and A. Gal. 
Pattern rewriting framework for event processing optimization. In ACM International Conference on Distributed Event Based Systems, 2011. [95] S. D. Ramchurn, P. Vytelingum, A. Rogers, and N. R. Jennings. Putting the ’smarts’ into the smart grid: a grand challenge for artificial intelligence. Commun. ACM, 55(4):86–97, Apr. 2012. [96] S. Rozsnyai, J. Schiefer, and A. Schatten. Concepts and models for typing events for event-based systems. In ACM International Conference on Distributed Event Based Systems, 2007. [97] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, 2010. [98] Y . Simmhan, S. Aman, B. Cao, M. Giakkoupis, A. Kumbhare, Q. Zhou, D. Paul, C. Fern, A. Sharma, and V . Prasanna. An informatics approach to demand response optimization in smart grids. NATURAL GAS, 31:60, 2011. 137 [99] Y . Simmhan, S. Aman, A. Kumbhare, R. Liu, S. Stevens, Q. Zhou, and V . Prasanna. Cloud-based software platform for data-driven smart grid manage- ment. CiSE, 2013. [100] Y . Simmhan, V . Prasanna, S. Aman, S. Natarajan, W. Yin, and Q. Zhou. Toward data-driven demand-response optimization in a campus microgrid. In ACM Work- shop on Embedded Sensing Systems for Energy-Efficiency in Buildings, 2011. [101] R. Soma, A. Bakshi, and V . Prasanna. A semantic framework for integrated asset management in smart oilfields. In 7th IEEE International Symposium on Cluster Computing and the Grid, pages 119–126, May 2007. [102] S. Suhothayan, K. Gajasinghe, and et al. Siddhi: A second look at complex event processing architectures. In ACM GCE Workshop, 2011. http://siddhi. sourceforge.net. [103] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append- only databases. In SIGMOD, 1992. [104] K. Teymourian and A. Paschke. Enabling knowledge-based complex event pro- cessing. In EDBT/ICDT Workshops, 2010. [105] G. Thatikar, J. Mathieu, M. Piette, and S. 
Kiliccote. Open automated demand response technologies for dynamic pricing and smart grid. In Grid-Interop Con- ference, 2010. [106] P. Tzanetos, A. K. Dimitrios, P. C. Sotiris, and S. P. Theodore. Towards web 3.0: A unifying architecture for next generation web applications. In Handbook of Research on Web 2.0, 3.0 and X.0: Technologies, 2009. [107] F. van Harmelen, F. van Harmelen, V . Lifschitz, and B. Porter. Handbook of Knowledge Representation. Elsevier Science, 2007. [108] J. Van Niekerk and K. Griffiths. Advancing health care management with the semantic web. In the 3rd International Conference on Broadband Communica- tions, Information Technology Biomedical Applications, pages 373 –375, 2008. [109] R. V on Ammon, C. Emmersberger, T. Ertlmaier, O. Etzion, T. Paulus, and F. Springer. Existing and future standards for event-driven business process man- agement. In ACM International Conference on Distributed Event-Based Systems, 2009. [110] W. Andreas, A. Darko and et al. Linked data and complex event processing for the smart energy grid. In Linked Data in the Future Internet at the Future Internet Assembly, 2010. 138 [111] D. Wang, E. A. Rundensteiner, and R. T. Ellison, III. Active complex event processing over event streams. Proc. VLDB Endow., 4(10):634–645, July 2011. [112] F. Wang, S. Liu, and P. Liu. Complex rfid event processing. VLDB J., 18:913– 931, August 2009. [113] F. Wang, S. Liu, P. Liu, and Y . Bai. Bridging physical and virtual worlds: Com- plex event processing for rfid data streams. In EDBT 2006, pages 588–607, 2006. [114] W. Wang, J. Sung, and D. Kim. Complex event processing in epc sensor network middleware for both rfid and wsn. In IEEE International Symposium on Object Oriented Real-Time Distributed Computing, 2008. [115] J. Widom and S. Finklestein. Set-oriented production rules in relational database systems. In SIGMOD, pages 259–270, 1990. [116] E. Wu. High-performance complex event processing over streams. 
In SIGMOD, pages 407–418, 2006. [117] Y . Xiao, W. Li, M. Siekkinen, P. Savolainen, A. Yla-Jaaski, and P. Hui. Power management for wireless data transmission using complex event processing. IEEE Transactions on Computers, 61(12):1765–1777, 2012. [118] J. Xingyi, L. Xiaodong, K. Ning, and Y . Baoping. Efficient complex event pro- cessing over rfid data stream. In 7th IEEE/ACIS International Conference on Computer and Information Science, pages 75 –81, May 2008. [119] Y . Simmhan, Q. Zhou and V . Prasanna. Chapter: Semantic Information Inte- gration for Smart Grid Applications. Green IT: Technologies and Applications, 2011. [120] Y . Simmhan, S. Aman et al. An informatics approach to demand response opti- mization in Smart Grids. Technical report, University of Southern California, 2011. [121] Z. Zhang. Ontology query languages for the semantic web. Technical report, Computer Science, University of Georgia, 2005. [122] Q. Zhou, S. Natarajan, Y . Simmhan, and V . Prasanna. Semantic information modeling for emerging applications in smart grid. In ITNG, 2012. [123] Q. Zhou, Y . Simmhan, and V . Prasanna. Incorporating semantic knowledge into dynamic data processing for smart power grids. In International Conference on Semantic Web (ISWC), 2012. 139 [124] Q. Zhou, Y . Simmhan, and V . Prasanna. On using complex event processing for dynamic demand response optimization in microgrid. In IEEE Green Energy and Systems Conference, 2013. [125] Q. Zhou, Y . Simmhan, and V . Prasanna. Towards hybrid online on-demand query- ing of realtime data with stateful complex event processing. In IEEE International Conference on Big Data, 2013. 140
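The inexact-matching direction above can be made concrete with a small sketch. In the code below, every name, the confidence field, and the independence assumption behind the joint-confidence product are our own illustrative choices, not part of the thesis: a partial sequence match is kept alive only while its accumulated confidence stays above a user-given threshold, so uncertainty reasoning happens incrementally alongside pattern matching.

```python
# Sketch of inexact sequence matching over uncertain events (hypothetical model).
# Each event carries a confidence in [0, 1]; a partial match survives only while
# its joint confidence (product of per-event confidences, assuming independence)
# stays at or above `min_confidence`. A real engine would also bound partial
# matches with time windows to keep state from growing without limit.

from dataclasses import dataclass

@dataclass
class Event:
    etype: str         # event type, e.g. "HighLoad"
    payload: dict      # attribute values
    confidence: float  # probability that the event actually occurred

def match_sequence(events, pattern, min_confidence=0.5):
    """Return (matched_events, joint_confidence) tuples for every occurrence of
    `pattern` (a list of event types, skip-till-next-match semantics) whose
    joint confidence is at least `min_confidence`."""
    partial = []   # live partial matches: (next_pattern_index, events, joint_conf)
    results = []
    for ev in events:
        new_partial = []
        for idx, matched, conf in partial:
            if ev.etype == pattern[idx]:
                joint = conf * ev.confidence
                if joint >= min_confidence:          # prune hopeless matches early
                    if idx + 1 == len(pattern):
                        results.append((matched + [ev], joint))
                    else:
                        new_partial.append((idx + 1, matched + [ev], joint))
            new_partial.append((idx, matched, conf))  # skip-till-next-match: keep waiting
        # Start a fresh partial match if the event fits the first pattern step.
        if ev.etype == pattern[0] and ev.confidence >= min_confidence:
            if len(pattern) == 1:
                results.append(([ev], ev.confidence))
            else:
                new_partial.append((1, [ev], ev.confidence))
        partial = new_partial
    return results

stream = [Event("HighLoad", {}, 0.9), Event("Noise", {}, 0.4),
          Event("PriceSpike", {}, 0.8)]
matches = match_sequence(stream, ["HighLoad", "PriceSpike"], min_confidence=0.6)
```

Here the single match survives because 0.9 x 0.8 = 0.72 clears the 0.6 threshold; lowering either event's confidence below that bound would prune the partial match before it completes.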
Abstract
Emerging applications in domains like Smart Grid, e-commerce and financial services exemplify the need to manage Fast Data—Big Data with an emphasis on data Velocity. Utility companies, social media and financial institutions often need to analyze data streams arriving at a high rate for business operations and innovative services. For example, dynamic demand response, realtime advertising, online retail and algorithmic trading attempt to leverage high-frequency meter and sensor readings, advertisement auctions, purchasing behaviors and stock ticks, respectively, to make timely decisions.

Existing Big Data management systems, however, are mostly Volume-centric. Specialized technologies including distributed RDBMS, Hadoop and NoSQL databases were developed for scalable and reliable storage of data sets as large as terabytes and even petabytes in volume. These systems provide programming and query primitives, and high cumulative I/O read performance, to facilitate large-scale computation over persistent or slow-changing data on durable storage.

Complex Event Processing (CEP), on the other hand, is a promising paradigm for managing Fast Data. CEP is recognized for online analytics of data that arrive continuously from ubiquitous, always-on sensors and digital event streams. CEP systems are designed to perform high-throughput online queries over high-rate or constantly-changing data, typically leveraging in-memory query matching algorithms to minimize interactions with persistent volumes.

Fast Data applications, while emphasizing data Velocity, do not preclude high data Variety and large Volume as well. Applications with such multi-dimensional characteristics require certain distinctive capabilities that go beyond traditional CEP systems. The first is the need to process query patterns over heterogeneous information spaces that may span multiple domains such as engineering, social community and public policy. CEP queries have to abstract away these domain complexities and allow users to define accessible knowledge-based analytics. The second is the capability to support queries over a continuous timespace that spans past, present and future for resilient analytics. This requires seamless query processing across the boundary between realtime data streams and persistent data volumes. The third is the capability to perform on-demand queries on the fly that complement pre-defined online queries. Given the transient nature of Fast Data, where data arrive and leave streams at high velocity, on-demand queries should be processed with in-flight data management that obviates the need to persistently store everything.

In this dissertation, we describe a Complex Event Processing framework for Fast Data management, considering all the 3-V dimensions. Specifically, we extend state-of-the-art CEP systems and make the following contributions: (1) Semantic Complex Event Processing for high-level query modeling and processing in diverse information spaces, hiding data Variety
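To make the in-memory query matching described above concrete, the following minimal sketch evaluates a sliding-window aggregation of the kind a CEP engine runs continuously without touching durable storage. The class name, window size, and threshold factor are illustrative assumptions, not an API from the dissertation:

```python
# Minimal sketch of an in-memory sliding-window query: keep only the most
# recent readings in memory and flag each new reading that exceeds the
# running window average by a configurable factor.

from collections import deque

class SlidingWindowAverage:
    """Retain the last `window` readings; `push` returns True when the new
    value exceeds `factor` times the current window average."""
    def __init__(self, window=4, factor=1.5):
        self.buf = deque(maxlen=window)  # deque evicts the oldest reading itself
        self.factor = factor

    def push(self, value):
        alert = bool(self.buf) and value > self.factor * (sum(self.buf) / len(self.buf))
        self.buf.append(value)
        return alert

q = SlidingWindowAverage(window=3, factor=1.5)
alerts = [q.push(v) for v in [10, 11, 10, 30, 12]]
```

Only the reading of 30 fires an alert: it exceeds 1.5 times the prior window average of roughly 10.3, while all other readings stay within bounds. State is bounded by the window size, which is what lets such queries run at stream rates entirely in memory.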
Asset Metadata
Creator: Zhou, Qunzhi (author)
Core Title: A complex event processing framework for fast data management
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 10/20/2014
Defense Date: 05/14/2014
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: big data, complex event processing, data management, data stream processing, Fast Data, OAI-PMH Harvest, smart grid
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Prasanna, Viktor K. (committee chair), Simmhan, Yogesh (committee chair), Horowitz, Ellis (committee member), Ioannou, Petros (committee member)
Creator Email: qunzhizh@usc.edu, zhouqunzhi@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-507095
Unique identifier: UC11297486
Identifier: etd-ZhouQunzhi-3020.pdf (filename), usctheses-c3-507095 (legacy record id)
Legacy Identifier: etd-ZhouQunzhi-3020.pdf
Dmrecord: 507095
Document Type: Dissertation
Rights: Zhou, Qunzhi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA