Complex pattern search in sequential data
COMPLEX PATTERN SEARCH IN SEQUENTIAL DATA

by Leila Kaghazian

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2008

Copyright 2008 Leila Kaghazian

Dedication

To My Parents

Acknowledgements

Writing this thesis is the conclusion of one of the most important chapters of my life. The next chapter will be making a difference in the world, even a small one, using this PhD. However, before I close this chapter, I need to thank the many people who made so many differences in me, made me who I am, and supported me in achieving my goals.

I was lucky to work with more than one advisor during my study at USC. I would like to thank Dennis McLeod, my official advisor for this thesis. Dennis was the ideal advisor, giving me both freedom and good advice. He taught me the importance of original work, deep thought, and initiating a new line of research. I am as much indebted to him for his understanding of my personal circumstances as I am for his intellectual contribution to my academic growth. Although Dennis' interests moved away from the subject of my research during my study, he gave me the freedom to pursue my interests and provided me with many of the resources I needed, and beyond. I would also like to thank Barry Boehm, Najmedin Meshkati and Wei-Min Shen for agreeing to serve on my committee.

I had the honor to work with Hayward Alker in my first year at USC. "Alker was a leading scholar on world order and international conflict resolution, interests grounded in his Quaker faith and belief in the possibility of achieving peace". He taught me to bring a mathematics background to the social sciences. Cyrus Shahabi supervised my research in the haptic data domain for one semester. I am very thankful for his academic and personal support.

On the personal side, I wish to thank my parents, Najibeh Parvizian and Mohsen Kaghazian, for all they have done for me; it takes many years to realize how much of what you are depends on your parents. I salute my parents for their great intuition and vision in my education. Their first goal was to provide me with the best education, and I am so happy that it paid off. Their unconditional love and prayers have given me the strength to persevere. I wish to thank my younger brothers Amir Mohsen and Ehsan for all the happy and sad moments of our lives that we shared. I missed them a lot during my study.

I am indebted to my friends Reza Sadri, for co-advising my research, and Amir Zarkesh, for his valuable input during my PhD study. Special thanks go to Abtin Afshar Naderi, who walked me into the computer science field and bore with me for hours and hours to teach me the alphabet of software and hardware. He was the one who supervised my first database project. I also enjoyed all the discussions we had about philosophy, poetry and educational psychology, which had a big impact on my life.

I have had so many friends in my life with whom I shared so many good moments. I can't mention all of them, but I would like to thank Payman Arabshahi and Roshanak Roshandel for their continuous unconditional friendship, and Mehdi Shariat Panahi, Mahnaz Karimi, Mohammad Rahimi, Tahmineh Akbarnejad, Mohammad Kolahdouzan, Maryam Hesabgar, Shahab Ghoreishi, Ameneh Eslami, Abbas Nasiraie, Mahya Dahaghin and Hadi Moradi for their support. My oldest friend from the first grade, Arezoo Gheysarieh, my high school friends, Maryam Emadzadeh and Elham Labbaf, and my college friend Gita Moazami are among the people I will never forget for the happy days of my life I spent with them. I salute my second grade teacher, Mrs. Hariri, for kindling the love of learning in my heart. Very special thanks go to Mrs.
Shamsi Madani, who had the biggest impact on my life through her life-brightening advice. She taught me to look at the world from a different angle. I would also like to thank Joel Pelcyger, the founder of PS #1 elementary school in Santa Monica, for the great moments my son had in his school in the past two years, which gave me enough comfort to focus on my research.

Last but not least, I would like to thank my best eternal friend and husband Iman, whose unconditional love and support have been the wind beneath my wings, and my cute son Farid Hossein, who made our life full of love and joy. We were always wondering if our son would have any effect on our careers. The answer is: sure, he will make them much stronger. I wish to thank them for what they offer me every wonderful moment: the essence of life.

Finally, my thanks go out to all those who helped me to get to this point, those whom I mentioned and those whom I forgot. Thank you all!

Table of Contents

Dedication
Acknowledgements
Abstract
Chapter 1 Introduction
    1.1 Problem Definition
    1.2 Motivation
    1.3 Solution Approach in Brief
    1.4 Principal Contributions
    1.5 Outline of the Thesis
Chapter 2 Related Work
    2.1 SQL-TS
    2.2 OPS
    2.3 Frequent Pattern Search
    2.4 Data Streams
    2.5 Time Series
    2.6 Event Processing
    2.7 Overview
Chapter 3 RSPS: Recursive Sequential Pattern Search
    3.1 Patterns with Nested Star
    3.2 Run time support for nested stars
    3.3 Proposed algorithm for patterns with nested stars RSPS
        3.3.1 Finding shift and next for the nested star case
    3.4 Computation of shift(j) and next(j) from G^j_p
        3.4.1 Shift
        3.4.2 Next
Chapter 4 Multiple Pattern Search
    4.1 Multiple Concurrent Conjunctive Pattern Search (MCCPS)
        4.1.1 MCCPS for patterns without recurring elements
        4.1.2 Single input data, multiple concurrent conjunctive patterns with * elements
    4.2 Multiple Input Data, One Pattern Search
        4.2.1 Multiple input data, single pattern without recurrent (*) element
        4.2.2 Multiple input data, single pattern with recurrent (*) element
Chapter 5 Empirical Evaluation
    5.1 Introduction
        5.1.1 Stock Market Data
        5.1.2 Network Data
    5.2 RSPS for One Dimensional Data
        5.2.1 Stock Market Data
        5.2.2 Synthetic Pattern Generator
        5.2.3 Network Data
    5.3 MCCPS Result
Chapter 6 Conclusion and Future Work
    6.1 Conclusion Remarks
    6.2 Future Challenges
        6.2.1 Calculating the buffer size for data streams
        6.2.2 Pattern Generalization
Bibliography
Appendix SQL-TS Syntax

List Of Tables

4.1 Input Example
4.2 Example 4: Illustration of the process of search for two conjunctive patterns over a given input sequence
4.3 Multi Pattern Search Illustration
4.4 Illustration of the search process for Example 4, with different given input sequence
4.5 A successful search procedure for the Example 5
5.1 Performance for selected companies for a given query (Example 6)
5.2 RSPS Impressive Speedup
5.3 RSPS performance over the network data
5.4 Performance for selected companies for a given query (Example 6)

List Of Figures

1.1 Example of Double Bottom in stock market data (ProphetFinance.com)
3.1 Illustration of nested recurring pattern in the input data in Example 3
3.2 State model for Example 3
3.3 RSPS Algorithm
3.4 Interdependency Graph
3.5 Possible transition among elements of a pattern
3.6 Extra possible transition among elements of a pattern for the case of nested star
3.7 General Implication Graph
3.8 Implication Graph for mismatch in element 6
5.1 Illustration of Network Data
5.2 Illustration of Network Data
5.3 32 consecutive bump shape fluctuations found in the DJIA data are shown in red (dark).
5.4 A closer look at the red (dark) square highlighted in Figure 5.3 which shows one of the matches
5.5 Comparison between naive search and RSPS with speedup = 3
5.6 Comparison between the naive search and RSPS with speedup = 27
5.7 How Shift and Next speed up the search
5.8 Illustration of MCCPS for 2 patterns
5.9 Comparison between naive search and MCCPS
5.10 Illustration of MCCPS Pattern Matching
6.1 Example of a set of queries
6.2 Pattern Example

Abstract

The need to search for complex and recurring patterns in database sequences, data streams and graphs is shared by many applications. Challenges in this problem include searching through large volumes of data in some database sequences, dealing with real-time data within a limited time frame, and the complexity of relations between tree-structured data. Feasible methods to search for patterns of interest, for data analysis purposes, will have to address these issues. In this thesis, we investigate the design and optimization of constructs that enable SQL to express complex patterns. In particular, we propose the Recursive Sequential Pattern Search algorithm (RSPS), which is inspired by the KMP (Knuth-Morris-Pratt) string matching algorithm.
RSPS exploits the inter-dependencies between elements of a sequential pattern to minimize repeated passes over the same data. Moreover, we propose another novel algorithm, MCCPS (Multiple Concurrent Conjunctive Pattern Search), to look for complex patterns in single- and multi-dimensional data. Performance gains derived from a set of experiments and a sensitivity analysis for RSPS and MCCPS are also discussed. Our results demonstrate dramatic speedup in search, of up to two orders of magnitude.

Chapter 1 Introduction

1.1 Problem Definition

Many applications in commercial and scientific domains share the need for processing and analyzing sequential or stream data. Examples include the analysis of data from sensor networks, stock market data, telecommunications data, earthquake monitoring data, INSAR satellite data, etc. (Agrawal et al., 1993) (M. J. A. Berry, 1997) (Edwards & Magee, 1997) (Mesrobian et al., 1994). Sometimes, the only feasible way to make sense of large volumes of data is to search for patterns of interest. This is especially difficult when the patterns of interest are complex. Traditional constructs available in SQL cannot express these rich patterns. Facilities like datablades have increased the expressive power of database query languages, but there are still applications that need a more expressive language for describing their patterns of interest. Another limitation of most of these applications is that data is processed on the fly and there is a limited buffer for keeping the history of the time series; therefore, we need an implementation of the pattern detection mechanism that does not require keeping the entire history of the sequence in fast memory.
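
The limited-buffer constraint described above can be illustrated with a small sketch. It is not the mechanism proposed in this thesis; it only shows, under simplifying assumptions, how a pattern can be detected on a stream while keeping a fixed number of recent values in memory. The function name `rising_runs` and the pattern it detects (a run of strictly increasing values) are hypothetical choices made for illustration.

```python
from collections import deque


def rising_runs(stream, length=3):
    """Yield each index at which `length` consecutive strict increases end.

    Only the last `length + 1` values are kept, so memory use does not
    grow with the size of the stream (hypothetical illustration).
    """
    window = deque(maxlen=length + 1)  # bounded history buffer
    for i, value in enumerate(stream):
        window.append(value)           # the oldest value is dropped automatically
        if len(window) == window.maxlen:
            vals = list(window)
            if all(a < b for a, b in zip(vals, vals[1:])):
                yield i


if __name__ == "__main__":
    # The input can be any iterator, e.g. values arriving one at a time.
    readings = iter([5, 1, 2, 3, 4, 2, 3])
    print(list(rising_runs(readings, length=3)))
```

The buffer size here is fixed by the pattern length; patterns with recurring (star) elements do not have such an obvious bound, which is one reason buffer sizing is a non-trivial question for stream data.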

Looking to extend SQL with the ability to query time-series databases with more flexibility and power than Informix datablades (Software, 1998) and SRQL (Ramakrishnan, 1998), Sadri et al. (Sadri et al., 2001) introduced an extension of SQL, SQL-TS, to express sequential patterns, and studied how to optimize search queries for this language. They exploit the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. While the technique proposed in (Sadri et al., 2001) and (Sadri et al., 2004) is powerful enough to find many types of patterns, it lacks several crucial capabilities:

• It lacks the power necessary for expressing key kinds of interesting queries. For instance, it is not designed to search for patterns that include nested stars (a recurring pattern inside another recurring pattern).
• It does not provide any efficient technique to search for multiple concurrent patterns in a given time series.
• It cannot search for a given pattern in multi-dimensional data such as trees, graphs, etc.

In this research we introduce a formal method to optimize complex pattern search in multi-dimensional data. We propose Recursive Sequential Pattern Search (RSPS), a general algorithm that gives SQL-TS the power to look for more complex patterns, such as recurrent nested star patterns, over sequential data. RSPS provides a general framework to search for any pattern at the SQL level. In addition, we define a new construct for SQL-TS, "AND AS", which gives SQL-TS the capability to look for multiple concurrent conjunctive patterns in a given time series.

1.2 Motivation

We are motivated by the desire to investigate and discover useful knowledge from sequential databases and real-time data streams. Application areas include, to name a few, stock market data, auction market data, fraud detection, customer behavior, and network behavior.

As an example, consider the three favorable chart patterns that investors usually look for: the "Cup and Handle", the "Double Bottom" and the "Flat Base". An example of a stock that formed a Double Bottom pattern before breaking out to new 52-week highs is NVR Incorporated in 2002, illustrated in Figure 1.1. As one of the interesting applications of this research, the RSPS algorithm is able to catch recurrent occurrences of double bottom patterns over a specific time period.

Figure 1.1: Example of Double Bottom in stock market data (ProphetFinance.com)

Another interesting application of this research is fraud detection. With credit card and identity theft, insurance fraud, cellular phone fraud and other types of crime costing institutions billions of dollars every year in e-commerce, a better solution for recognizing customer purchase patterns is a must in all sectors of web services. The major challenge is to design a fast and reliable fraud detection engine that can recognize a given case as fraud instantly. Most of the research in the field of fraud detection can be categorized into three major sectors: Evidence Extraction, Link Discovery, and Pattern Recognition. Evidence Extraction refers to extending information extraction technology from its current ability to accurately extract named entities (e.g., people, places, etc.) and their attributes to the ability to extract relevant relationships between entities and the attributes of these relationships. Link Discovery refers to developing the capability to discover related entities, additional attributes, and other relevant relationships from available source material in the context of a particular scenario of interest. Pattern Recognition refers to developing the data mining technologies needed to enable a system to learn from example instances consisting of data about entities, relationships, and their attributes.
These technologies would likely include pattern representations and languages as well as algorithms for learning patterns represented in these languages. They would also include data representations that provide scalability with respect to pattern size and complexity as well as with respect to the vast amounts of available data. Researchers have begun to explore promising new techniques for relational classification and for learning probabilistic relational models.

SQL-TS provides a simple but robust and understandable language to represent such patterns, while RSPS provides suitable techniques for pattern search in the Pattern Recognition domain. What is needed is a new suite of relational learning techniques that would make it possible to represent and learn models of relational data, from the naturally occurring data, without complex transformations. We believe that adding the RSPS and MCCPS algorithms to the SQL-TS backbone provides a unique environment and capability for fast recognition of a fraud pattern. Patterns of interest may exist at different degrees of structural complexity, including single attributes, single attributes with set-valued variables, groups of nodes, and layered patterns combining entities at different levels of abstraction. Entities and links comprising pattern components may be single types (e.g., person to person), multiple entity types with single link types (e.g., people linked by telephone calls), multiple entity and link types (e.g., people, businesses, accounts, and locations linked by telephone calls, financial transactions, and meetings), or multiple entities with multiple endpoint links (e.g., employees of a structured organization). Temporal issues, including prediction, classification of time series (regular time-based data) and time sequences (ordered sets of events), event detection and regime shifting, and concept drift in the learned patterns, are also important.
All of the above-mentioned types of patterns are representable in SQL-TS. Case-based pattern recognition using SQL-TS (with the support of RSPS and MCCPS) consists of the following steps:

1. A database of all time series of previously recognized fraud patterns is collected. Each sequence represents a fraud case.
2. All cases are clustered to ensure that the cases in a cluster are close enough to each other. This step eliminates the redundant information provided by similar cases and produces a rich set of distinct fraud cases. The choice of clustering technique used in this stage is subjective and is beyond the scope of this thesis.
3. Within each cluster, a set of features is extracted for every point of each time series. A universal feature set is generated, which captures all major cases belonging to a given cluster.
4. RSPS is run to find cases similar to each cluster representative in a large database.

1.3 Solution Approach in Brief

Following is a brief summary of RSPS. Given an input stream and a sequential query, suppose that, while searching for the sequential pattern over the input stream, a mismatch occurs at the j-th position of the pattern. Speedup is achieved by tracking two items, shift(j) and next(j), that help reset the position trackers (i and j) to optimal values after the mismatch. shift(j) determines how far the pattern should be advanced in the input, and next(j) determines from which element in the pattern the checking of conditions should be resumed after the shift. To compute shift(j) and next(j), the RSPS algorithm begins by capturing all the logical relations among pairs of pattern elements using a positive precondition logic matrix θ and a negative precondition logic matrix φ. These matrices are of size m, where m is the length of the search pattern. The elements θ_jk and φ_jk are only defined for j ≥ k; thus θ and φ are lower-triangular matrices of size m. They are defined as follows:

\[
\theta_{jk} =
\begin{cases}
1 & \text{if } p_j \Rightarrow p_k \wedge p_j \neq F \\
0 & \text{if } p_j \Rightarrow \neg p_k \\
U & \text{otherwise}
\end{cases}
\]

\[
\phi_{jk} =
\begin{cases}
1 & \text{if } \neg p_j \Rightarrow p_k \\
0 & \text{if } \neg p_j \Rightarrow \neg p_k \wedge p_j \neq T \\
U & \text{otherwise}
\end{cases}
\]

in which p_i is the predicate at location i. From the matrices θ and φ, a matrix S is derived that describes the logical relationships between whole patterns; next and shift are then derived from S.

For instance, consider a pattern with the following predicates as its elements:

\[
p_1 = 5 < x < 9, \quad
p_2 = x > 10, \quad
p_3 = x < 20, \quad
p_4 = 21 < x < 50, \quad
p_5 = x < 2, \quad
p_6 = x < 5
\]

The matrices θ and φ for this pattern would be:

\[
\theta =
\begin{bmatrix}
1 &   &   &   &   &   \\
0 & 1 &   &   &   &   \\
U & U & 1 &   &   &   \\
0 & 1 & 0 & 1 &   &   \\
0 & 0 & 1 & 0 & 1 &   \\
0 & 0 & 1 & 0 & U & 1
\end{bmatrix}
\qquad
\phi =
\begin{bmatrix}
0 &   &   &   &   &   \\
U & 0 &   &   &   &   \\
0 & 1 & 0 &   &   &   \\
U & U & U & 0 &   &   \\
U & U & U & U & 0 &   \\
U & U & U & U & 0 & 0
\end{bmatrix}
\]

For example, θ_42 = 1 because 21 < x < 50 implies x > 10, while θ_65 = U because x < 5 neither implies nor excludes x < 2.

To find multiple concurrent conjunctive patterns, we then combine these precondition matrices and derive new matrices that capture the interdependencies between the elements of each pattern, as well as the interdependencies between elements of different patterns of interest. These matrices are used to pre-calculate the values of shift and next for any possible failure during the search. These pre-calculated values are later used at run time to expedite the search process.

1.4 Principal Contributions

Key contributions and novelties include:

• Investigating the design and optimization of constructs that enable SQL to express complex patterns.
• Proposing the RSPS (Recursive Sequential Pattern Search) algorithm, which is inspired by the KMP (Knuth-Morris-Pratt) string matching algorithm and exploits the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data.
• Proposing the Multiple Concurrent Conjunctive Pattern Search (MCCPS) algorithm as a search technique for two or more concurrent conjunctive patterns over the same sequential data. This algorithm is also able to look for concurrent patterns with recurrent elements.
• Defining a new construct for SQL-TS based on MCCPS, called "AND AS", which enables SQL-TS to look for multiple concurrent patterns.
• Examining these contributions through empirical and analytic exploration in the stock market and network database domains.

1.5 Outline of the Thesis

The rest of this thesis is organized as follows. In Chapter 2, we briefly review SQL-TS and the OPS algorithm as part of the related work. We do not address them in detail, as they have already been discussed in (Sadri et al., 2001). We also look at related research in the area of pattern search in different domains. We propose and explain the RSPS (Recursive Sequential Pattern Search) algorithm in Chapter 3. Next, in Chapter 4, we explain the Multiple Concurrent Conjunctive Pattern Search (MCCPS) technique, in which we extend RSPS to search for concurrent conjunctive patterns over one or multiple input data sequences. Chapter 5 presents the empirical evaluation. Finally, Chapter 6 covers the conclusion, followed by the references at the end of this thesis.

Chapter 2 Related Work

In this chapter, we briefly review SQL-TS and OPS, then describe other related research in this area and explain the novelty of our research and how it differs.

2.1 SQL-TS

Structured Query Language for Time Series (SQL-TS), introduced by Sadri et al. (Sadri et al., 2001), adds simple constructs to SQL for specifying complex sequential patterns. SQL-TS is identical to SQL, except for the following additions to the FROM clause:

• a SEQUENCE BY clause specifying the sequencing attributes, and
• a CLUSTER BY clause specifying the grouping attributes, similar to GROUP BY. Each group indicates a separate sequence.

For example, suppose we have the following table of HTTP requests over the network:

    CREATE TABLE network (
        srcIP Varchar(15),
        srcPort Varchar(5),
        destIP Varchar(15),
        destPort Varchar(5),
        packet Integer (50),
        date Date)

To find the maximal periods, at one-day intervals, in which the number of incoming packets jumps by more than 30%, and to return the source IP address and these periods, we can write the SQL-TS query of Example 1.

Example 1. Using the FROM clause to define a pattern.

    SELECT X.srcIP, X.date, Z.previous.date
    FROM network
        CLUSTER BY srcIP
        SEQUENCE BY date
        AS (X, *Y, Z)
    WHERE Y.packet > Y.previous.packet
        AND Z.previous.packet > 1.30 * X.packet

The AS clause, which in SQL is mostly used to assign aliases to table names, is used here to specify a sequence of tuple variables from the specified table. Tuple variables from this sequence can be used in the WHERE clause to specify the conditions that express the pattern, and in the SELECT clause to specify the output.

Here is another example, running SQL-TS over a data stream. Suppose that we have a stream containing the bids of ongoing auctions, as follows:

    auction_id : id of the specific auctioned item
    amount     : amount of the bid
    time       : timestamp

To find three consecutive bids, each with an increase of more than 20%, we can write the SQL-TS query of Example 2.

Example 2. Finding three consecutive bids with more than 20% increase.

    SELECT T.auction_id, T.amount, T.time
    FROM bids
        CLUSTER BY auction_id
        SEQUENCE BY time
        AS (X, Y, Z, T)
    WHERE 1.2 * X.amount < Y.amount
        AND 1.2 * Y.amount < Z.amount
        AND 1.2 * Z.amount < T.amount

A key feature of SQL-TS is its ability to express recurring patterns by using a star operator. However, the star operator can be applied only to simple patterns, and not to complex patterns that contain sub-patterns. Our approach supports recurring complex patterns, as detailed in Chapter 3.

Since SQL-TS is more powerful and more appropriate for our purposes than other similar works (Ramakrishnan et al., 1998), (Parker, 1990), (Perng & Parker, 1998), we used SQL-TS as the basis of this research. In addition, we added a new construct to SQL-TS to address multi-input cases.

2.2 OPS

Since finding sequential patterns in databases is somewhat similar to finding phrases in text, the optimization techniques in OPS were inspired by string-matching algorithms. Among the well-known string-matching algorithms with the best average-case complexity, the Karp-Rabin algorithm (Karp & Rabin, 1987), the Boyer-Moore pattern matcher (Boyer & Moore, 1977) and the KMP algorithm (Knuth et al., 1977) were the possible choices as the basis for OPS and, accordingly, RSPS. Exhaustive experiments (Wright et al., 1998) show that, in general, KMP has the best performance. Because of its good performance, its independence from the alphabet size, and mostly because it generalizes to problems such as real-time string matching (Gusfield, 1997), KMP provides a natural basis for dealing with the more general problem of optimizing database queries on sequences. This is a major generalization that presents difficult challenges: rather than searching for strings of letters (usually from a finite alphabet), we now have to search for sequences of structured tuples qualified by arbitrary expressions of propositional predicates involving arithmetic and aggregates.
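
For readers less familiar with KMP, the following is a minimal sketch of the standard textbook algorithm on plain strings. It is not OPS or RSPS: there, the equality test between characters is replaced by predicates over tuples, and the table is replaced by the shift and next values. The function names `failure_table` and `kmp_search` are illustrative assumptions. The point to notice is that the input index only moves forward, which is what makes the approach attractive for streams.

```python
def failure_table(pattern):
    """fail[j] is the length of the longest proper prefix of pattern[:j + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]            # fall back to a shorter matching prefix
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    j = 0                              # number of pattern elements matched so far
    for i, ch in enumerate(text):      # i never moves backwards
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]            # reuse what is already known to match
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = fail[j - 1]
    return matches


if __name__ == "__main__":
    print(failure_table("ababaca"))
    print(kmp_search("abababab", "abab"))
```

After a mismatch, the table tells the matcher how much of the already-scanned input can be reused, which plays a role loosely analogous to the pair shift(j) and next(j) discussed in this section.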

By extending the KMP text matching algorithm (Knuth et al., 1977), Sadri et al. (Sadri et al., 2001) introduced Optimal Pattern Search (OPS) in order to optimize sequential queries in SQL-TS.

Following is a brief summary of OPS. Given an input stream and a sequential query, suppose that, while searching for the sequential pattern on the input stream, a mismatch occurs at the j-th position of the pattern. Speedup is achieved by tracking two items, shift(j) and next(j), that help reset the position trackers (i and j) to optimal values after the mismatch. shift(j) determines how far the pattern should be advanced in the input, and next(j) determines from which element in the pattern the checking of conditions should be resumed after the shift. To compute shift(j) and next(j), the OPS algorithm begins by capturing all the logical relations among pairs of pattern elements using a positive precondition logic matrix θ and a negative precondition logic matrix φ. These matrices are of size m, where m is the length of the search pattern. The elements θ_jk and φ_jk are only defined for j ≥ k; thus θ and φ are lower-triangular matrices of size m. They are defined as follows:

\[
\theta_{jk} =
\begin{cases}
1 & \text{if } p_j \Rightarrow p_k \wedge p_j \neq F \\
0 & \text{if } p_j \Rightarrow \neg p_k \\
U & \text{otherwise}
\end{cases}
\]

\[
\phi_{jk} =
\begin{cases}
1 & \text{if } \neg p_j \Rightarrow p_k \\
0 & \text{if } \neg p_j \Rightarrow \neg p_k \wedge p_j \neq T \\
U & \text{otherwise}
\end{cases}
\]

in which p_i is the predicate at location i. From the matrices θ and φ, a matrix S is derived that describes the logical relationships between whole patterns; next and shift are then derived from S.

The research most relevant to our work is the OPS algorithm presented in (Sadri et al., 2001). Sadri et al. (Sadri et al., 2001) proposed SQL-TS, an extension of the SQL language that serves many applications; however, it lacks the power to look for complex patterns.
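
The definitions of θ and φ given earlier in this section can be checked mechanically on the six-predicate example of Section 1.3. The sketch below is only an illustration and makes a strong simplifying assumption that is not part of OPS: every predicate is a comparison of a single real variable x against constants, so implications can be decided exactly by testing each constant, a midpoint of each gap between constants, and one point beyond each end. All function names are hypothetical.

```python
def sample_points(constants):
    """Points covering every constant and every gap between constants."""
    cs = sorted(set(constants))
    points = [cs[0] - 1.0, cs[-1] + 1.0]              # beyond both ends
    points += [float(c) for c in cs]                  # the constants themselves
    points += [(a + b) / 2.0 for a, b in zip(cs, cs[1:])]  # inside each gap
    return points


def negate(p):
    return lambda x: not p(x)


def implies(p, q, points):
    """True if p(x) => q(x) holds at every sample point."""
    return all((not p(x)) or q(x) for x in points)


def theta_entry(pj, pk, points):
    satisfiable = any(pj(x) for x in points)          # p_j is not F
    if satisfiable and implies(pj, pk, points):
        return 1
    if implies(pj, negate(pk), points):
        return 0
    return "U"


def phi_entry(pj, pk, points):
    not_pj = negate(pj)
    if implies(not_pj, pk, points):
        return 1
    falsifiable = any(not pj(x) for x in points)      # p_j is not T
    if falsifiable and implies(not_pj, negate(pk), points):
        return 0
    return "U"


def precondition_matrices(preds, points):
    """Lower-triangular theta and phi as lists of rows (entries with k <= j)."""
    m = len(preds)
    theta = [[theta_entry(preds[j], preds[k], points) for k in range(j + 1)]
             for j in range(m)]
    phi = [[phi_entry(preds[j], preds[k], points) for k in range(j + 1)]
           for j in range(m)]
    return theta, phi


# The six predicates p_1 .. p_6 of the example in Section 1.3.
preds = [
    lambda x: 5 < x < 9,
    lambda x: x > 10,
    lambda x: x < 20,
    lambda x: 21 < x < 50,
    lambda x: x < 2,
    lambda x: x < 5,
]
points = sample_points([2, 5, 9, 10, 20, 21, 50])
theta, phi = precondition_matrices(preds, points)

if __name__ == "__main__":
    for row in theta:
        print(row)
    for row in phi:
        print(row)
```

Under these assumptions the computed rows agree with the θ and φ matrices shown for that example. Predicates involving several attributes, arithmetic or aggregates would need a real implication test in place of the sampling trick used here.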

Inspired by their idea, we extended their algorithm and propose the RSPS algorithm, a more expressive algorithm that can look for complex patterns, including sophisticated recurrent patterns, over sequential data. We are also aware of the research presented in (Harada, 2004), which we believe closely replicates the approach of Sadri et al. In that paper, the authors employ the Boyer-Moore algorithm instead of the KMP algorithm and claim that their results show an impressive speedup over data streams compared to the naive approach and the OPS algorithm. However, other solid literature on string search (Gusfield, 1997) shows that the Boyer-Moore algorithm is not applicable to stream data, so the main approach of that paper remains in question.

2.3 Frequent Pattern Search

Sequential pattern search is an important problem with broad applications, including the analysis of customer purchase behavior, web access patterns, scientific experiments, disease treatments, patient databases, natural disasters, DNA sequences, network data, etc. Such problems have attracted researchers from different communities. A major portion of the research in this area has focused on discovering frequent patterns in sequential data (such as time series). The main focus of these works is to discover frequent patterns through approximation, transformation (Agrawal et al., 1993) and statistical inference. See, for example, the approach taken by artificial intelligence researchers (Keogh & Smyth, 1997) (Smyth, 1997) (Das et al., 1995).

In the database context, where the input data is usually much larger, the problem has been studied in a number of publications (Agrawal et al., 1995) (Agrawal & Srikant, 1995) (Das et al., 1995) (Mannila et al., 1996). Generally, in these techniques, event sequences are searched for frequent patterns of events.
These patterns have a simple structure (essentially a partial order) whose total time span is constrained by a window given by the user. The technique of generating candidate patterns from sub-patterns, together with a sliding window method, is shown to provide effective algorithms.

The work in (Tsong-Li Wang et al., 1994) also deals with the discovery of sequential patterns, but it is significantly different from our work. In (Tsong-Li Wang et al., 1994) the patterns considered are specific regular expressions, with a distance metric used as a dissimilarity measure when comparing two sequences. The proposed approach is mainly tailored to the discovery of patterns in protein databases. We note that the concept of distance used in (Tsong-Li Wang et al., 1994) is essentially an approximation measure and hence differs from the temporal distance between events specified by our constraints.

In (Wang & Tan, 1996) a scenario is considered where sequential patterns have previously been discovered and an update is subsequently made to the database. An incremental discovery algorithm is proposed that updates the discovery results by considering only the affected part of the database.

Jagadish et al. (Jagadish et al., 1995) present a domain-dependent framework for posing similarity queries. The suggested method consists of a pattern language P, a transformation rule language T, a query language L, and a similarity model. A sequence S1 is said to be similar to an object S2 if S2 can be reduced to S1 by a sequence of transformations defined in L.

Das et al. (Das et al., 1995) present a method for rule discovery from time series data. They slide a window over the data, determine the class of each subsequence, and then find common episodes in the resulting representation. Unlike some other works, they do not seek a global model for the time series; instead they search for local patterns in a relatively non-parametric manner.

Agrawal et al. present a new way to find similar time series.
In their work, which is used in the IBM Quest Miner suite, they introduced a model of time-series similarity (Agrawal et al., 1995). In this model two time series are considered similar if they have enough non-overlapping, time-ordered pieces (sub-series) that are similar. The amplitude of one of the two time series may be scaled by any suitable factor and its offset adjusted appropriately before matching. Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers.

Agrawal et al. also present a faster way to cluster time series (Agrawal & Srikant, 1995). They map the time domain to the frequency domain using the DFT (Discrete Fourier Transform). Because the distance between two signals is the same in both domains, they use the distance in the frequency domain. To make the similarity search easier they use only the first few DFT coefficients, and they show that this works well.

Keogh and Smyth (Keogh & Smyth, 1997) (Smyth, 1997) proposed a probabilistic approach for subsequence matching in databases. They use piecewise linear segmentation as the underlying representation; local features (such as peaks, troughs, and plateaus) are defined using a prior distribution on the expected deformation from a basic template. They use a computationally efficient and flexible approach based on bottom-up merging of local segments in a hierarchical multi-scale segmentation, where at each step the local segments whose merging leads to the least increase in error are merged. They also use a heuristic to find K, the number of clusters.

Berndt (Berndt, 1994) uses a dynamic time warping approach to allow for elasticity in the temporal axis when matching a query Q to a reference sequence S.

Our research is essentially different from all of the above-mentioned work, since we are not looking for frequent patterns in time series; rather, we are interested in the exact match of a pattern at the SQL-TS level.
In addition, none of these works looks at the information hidden in the pattern; rather, they focus on the structure of the data.

2.4 Data Streams

During the last few years there has been great interest in processing streaming data. STREAM (Babcock et al., 2002) (Widom & Babu, 2001) is a data stream processing project whose focus is on computing approximate results and on understanding how to run queries efficiently in a bounded amount of memory. The Aurora system (Carney et al., 2002) allows users to specify quality-of-service requirements for queries, and then uses those specifications to determine how and when to shed load. Other recent research has focused on developing algorithms to perform specific functions on sequenced data. Gehrke et al. (Gehrke et al., 2001) consider the problem of computing correlated aggregate queries over streams, and present techniques for obtaining approximate answers in a single pass.

Yang and Widom (Yang & Widom, 2001b) (Yang & Widom, 2001a) discuss data structures for computing and maintaining aggregates over streams. Finally, there has been a spate of recent work on multi-query optimization, especially from the group at IIT-Bombay (Gupta et al., 2001) (Roy et al., 2000) (Sellis, 1998). Multi-query optimization typically shares relational sub-expressions that appear in the plans of multiple (snapshot) queries. The Telegraph and TelegraphCQ projects (Chandrasekaran et al., 2003) have developed a suite of novel technologies for continuously adaptive query processing and for meeting the challenges that arise in handling large numbers of continuous queries over high-volume, highly variable data streams.

2.5 Time Series

Time series data are being generated at an unprecedented speed from almost every application domain.
As a consequence, in the last decade there has been a dramatic increase in interest in querying and mining such data, which in turn has resulted in a large number of works introducing new methodologies for indexing, classification, clustering and approximation of time series (Agrawal et al., 1993) (Agrawal & Srikant, 1995).

Key aspects for achieving effectiveness and efficiency when managing time series data are representation methods, similarity measures and pattern discovery. In the following we review the related work in each group and explain which aspects of our work differentiate it from this research. Many of these works and some of their extensions have been widely cited in the literature and applied to facilitate query processing and data mining of time series data.

Giugno and Shasha, in (Giugno & Shasha, 2002) and (Garey & Johnson., 1979), designed a set of primitives for time series analysis in a wide variety of application domains. Their work includes data reduction (such as FFT, wavelets and sketches), data structure design, and temporal and spatial data analysis. They try to uncover the facts hidden under a phenomenon and to detect bursts within several window sizes over a sequence of data.

Sun et al. (Sun & Fang, 2008) introduced a low-cost representation for similarity search of time-series patterns based on the Minimum Bounding Rectangle.

While all of the above-mentioned research is related to time series, none of it addresses the core of this thesis. First, we are not trying to reduce the dimension of time series. Second, we look for one or more patterns of interest in a time series. Finally, we aim to provide such a technique at the SQL level.

2.6 Event Processing

Event processing plays an increasingly important role in constructing enterprise applications that can immediately react to business-critical events.
Various technologies have been proposed in recent years, such as event processing, data streams and asynchronous messaging. Most of these technologies share a common processing model and differ only in target workload, including query language features and consistency requirements. Several recent works have addressed various issues in event processing.

There has been some work in active databases on implementing complex events. Examples are ODE (Gehani et al., 1992), SAMOS (Gatziu & Dittrich, 1993), and TREPL (Motakis & Zaniolo, 1997). These systems are designed for tracking complex events.

In (Barga, 2007) the author presents an overview and discusses the foundations of CEDR, an event streaming system that embraces a temporal stream model to unify and further enrich query language features, handle imperfections in event delivery, define correctness guarantees, and define operator semantics.

BiCEP (Bizarro, 2007) is a new project being started at the University of Coimbra to benchmark Complex Event Processing (CEP) systems. In (Buchmann, 2007) infrastructures for smart cities are considered a potential application for event-based computing; event services are a crucial part of such infrastructure. Other recent works, such as (Chandy et al., 2007), (Huh, 2007), (Terfloth et al., 2007) and (Urban et al., 2007), address various problems in event processing such as parallel processing, providing an XML framework, and database issues. Here we use similar ideas to detect complex patterns in sequential data, but our approach is not similar to any of these works.

2.7 Overview

In a departure from the above-mentioned algorithms and methods, our proposed RSPS algorithm is a pattern detection mechanism that is not bound to keeping the whole history of the sequence, and that optimizes the search by exploiting the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data.
We do not intend to transform the representation of time series; rather, we keep them as they are stored in a database and only retrieve them for pattern matching. In addition, the methodology of distance measurement provided in this thesis is radically different from all of these techniques. We use a general predicate language to describe a pattern; hence, any subsequence of the input that satisfies such a pattern is considered a match.

This thesis is distinguished from the state of the art in time series analysis by the following factors. First, we are not trying to reduce the dimension of time series. Second, we look for a pattern of interest in a time series. Finally, we aim to provide such a technique at the SQL level.

To the best of our knowledge, none of the above-mentioned algorithms in data streams and pattern search has addressed a method even similar to RSPS or MCCPS.

Chapter 3

RSPS: Recursive Sequential Pattern Search

In this chapter we explain Recursive Sequential Pattern Search (RSPS) and present a general algorithm which gives SQL-TS the capability to look for more complex patterns, such as nested-star patterns. RSPS provides a general framework to search for any pattern at the SQL level.

3.1 Patterns with Nested Star

An important advantage of the RSPS algorithm is that it can be easily generalized to handle input patterns which, in SQL-TS, are expressed using the star. In general, a star such as *Y denotes a maximal sequence of one or more (not zero or more!) tuples that satisfy all the applicable conditions. For example, if p_j is

    t_i.price < t_{i-1}.price

then *p_j matches sequences of records with decreasing prices. Now consider a more general example with the following predicates:

    p_1(t) = t_i.price < t_{i-1}.price
    p_2(t) = t_i.price > t_{i-1}.price

Then *(*p_1, *p_2) matches sequences of records with recurring patterns of decreasing prices followed by a period of increasing prices.
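The star semantics described above (a maximal run of one or more tuples satisfying a predicate) can be sketched in Python. This is only an illustration under our own assumptions: the helper name `maximal_runs`, the 0-based indexing, and the sample price list are hypothetical and are not part of SQL-TS or of the RSPS implementation.

```python
from typing import Callable, List, Tuple


def maximal_runs(
    prices: List[float], pred: Callable[[float, float], bool]
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs (0-based) of maximal runs
    in which pred(prices[i], prices[i - 1]) holds for every i in the run.
    Each run has length >= 1, mirroring "one or more" star semantics."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i in range(1, len(prices)):
        if pred(prices[i], prices[i - 1]):
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the run ended at the previous tuple
            start = None
    if start is not None:
        runs.append((start, len(prices) - 1))
    return runs


# Hypothetical sample data: *p with p = "price decreased".
decreasing = lambda cur, prev: cur < prev
sample = [31, 29, 27, 26, 27, 28, 32, 31, 29]
print(maximal_runs(sample, decreasing))  # [(1, 3), (7, 8)]
```

Each returned pair is one possible match of a single star element; a nested star such as *(*p_1, *p_2) would correspond to alternating runs of the two predicates.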
For instance, take the following SQL-TS example.

Example 3. Suppose we are interested in finding occurrences of the following pattern in the Intel stock price: an increasing period leading to repeated occurrences of "a price between 30 and 40, followed by a period of decreasing prices, followed by a period of increasing prices", followed by another period of decreasing prices leading to a price below 25. The query written in SQL-TS is:

    SELECT X.next.date, X.next.price,
           S.previous.date, S.previous.price
    FROM quote
        CLUSTER BY name,
        SEQUENCE BY date
        AS (*X, *(Y, *Z, *T), *U, V)
    WHERE X.name = "Intel"
        AND X.price > X.previous.price
        AND 30 < Y.price
        AND Y.price < 40
        AND Z.price < Z.previous.price
        AND T.price > T.previous.price
        AND U.price < U.previous.price
        AND V.price < 25

Therefore our pattern predicates (on input tuple t) are:

    p1(t) = (t.price > t.previous.price)
    p2(t) = (30 < t.price < 40)
    p3(t) = (t.price < t.previous.price)
    p4(t) = (t.price > t.previous.price)
    p5(t) = (t.price < t.previous.price)
    p6(t) = (t.price < 25)

The calculation of the logic matrices θ and φ remains unchanged in the presence of nested-star patterns; thus, the formulas given in Section 2.2 are still used. However, the calculation of the arrays next and shift must be generalized for nested-star patterns, as described next.

At runtime we maintain an array of counters (one per pattern element) to keep track of the cumulative number of input objects that have matched the pattern sequence so far. For example, if the first pattern element is a star that matched five elements in the input, the second pattern element is a non-star matching only one input element, and the third element is a star matching two input elements, we will have count_1 = 5, count_2 = 6 and count_3 = 8.

3.2 Run time support for nested stars

As mentioned earlier, at runtime we maintain an array of counters to keep track of the cumulative number of input objects that have matched the pattern sequence so far.
Each element of this array is itself an array, because a star pattern can match different parts of the input stream in a single run, and we need a counter to keep track of the number of matched elements for each part. For instance, suppose that the previous query is applied to an input stream with the following sequence for t.price:

    26, 28, 29, 31, 29, 27, 26, 27, 28, 32, 31, 29, 27, 26, 27, 28, 26, 25, 24

[Figure 3.1: Illustration of nested recurring pattern in the input data in Example 3]

Figure 3.1 illustrates the nested recurring pattern of the above data. After running the query, the array of counters will contain the following values:

    count_1 = 3
    count_2 = 4; 10
    count_3 = 7; 14
    count_4 = 9; 16
    count_5 = 18
    count_6 = 19

[Figure 3.2: State model for Example 3, with states X, Y, Z, T, U, V]

3.3 Proposed algorithm for patterns with nested stars (RSPS)

As described, some of the counters have more than one cumulative value. The algorithm for patterns with nested stars employs these values. We represent a pattern with a finite state model in which the elements of the pattern are the states of the model. Stars and nested stars are coded in the state transitions. Figure 3.2 illustrates a state diagram for Example 3. To develop the RSPS algorithm, the next step is to create an adjacency matrix based on the state model of the pattern.
The following adjacency matrix represents the state model of Example 3:

\[
A =
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\]

In the above matrix, the elements of the pattern (X, Y, Z, T, U, V) are represented as (1, 2, 3, 4, 5, 6) respectively. Hence A(1,1) = 1 means that p_1 (the first element of the pattern) is a star element. If A(j,k) = 1 and k < j, then p_j is the last element of a nested-star sub-pattern. For instance, in the above adjacency matrix the 4th row represents the state T. A(4,4) = 1 means that T is a star element. Furthermore, since A(4,2) = 1 in the same row, we conclude that T is the last element of a nested-star sub-pattern starting from Y. In the RSPS algorithm, for each row j we define R_j, which includes every k (k ≤ j) such that A(j,k) = 1. For instance, in Example 3, R_4 = {2, 4}. In Figure 3.3 we describe the RSPS algorithm in detail.

    OPS* Algorithm: OPS*(i, j)
    i ← 1; j ← 1
    WHILE ((j ≤ m) and (i ≤ n))
            /* m is the length of the pattern and n is the length of the input data */
        R_j = {k | k ≤ j and A(j, k) = 1}
            /* R_j represents all possible nested-star elements of sub-patterns that start at k and end at j */
        IF the current input element satisfies the pattern, THEN
            i ← i + 1
            if R_j is empty
                j ← j + 1
            elseif R_j is not empty and ~Sat(R_j, i)
                    /* Sat(R_j, i) returns the set of pattern elements in R_j which satisfy i */
                j ← j + 1
            else
                j ← max(Sat(R_j, i))
        OTHERWISE (i.e. when the current input element does not satisfy the pattern)
            If R_j is empty or (j = max(R_j) and p_j is tested for the first time) then
                • reset j (the index in the pattern) to next(j), and
                • reset i (the index in the input) from i, count, shift(j), next(j) and min
                  [the exact formula is not recoverable from the source]
            If R_j is not empty and R_j − max(R_j) is empty then
                • j ← j + 1
            If R_j is not empty and j = max(R_j) and R_j − max(R_j) is not empty then
                • j ← max(R_j − max(R_j))

Figure 3.3: RSPS Algorithm
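The definition of R_j can be checked mechanically against the adjacency matrix of Example 3. The following is a minimal Python sketch; the function name `back_sets` and the dictionary representation are our own hypothetical choices, and 1-based indices are used only to match the notation of the text.

```python
# Adjacency matrix of the state model for Example 3 (states X, Y, Z, T, U, V).
A = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]


def back_sets(adj):
    """Return {j: R_j} with R_j = {k | k <= j and A(j, k) = 1} (1-based)."""
    return {
        j: {k for k in range(1, j + 1) if adj[j - 1][k - 1] == 1}
        for j in range(1, len(adj) + 1)
    }


R = back_sets(A)
star_elements = [j for j, ks in R.items() if j in ks]             # self loops
nested_star_ends = [j for j, ks in R.items() if any(k < j for k in ks)]

print(R[4])              # {2, 4}
print(star_elements)     # [1, 3, 4, 5]
print(nested_star_ends)  # [4]
```

Under this reading, a self loop (j in R_j) marks a star element, and any k < j in R_j marks j as the last element of a nested-star sub-pattern that starts at k.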
The difference between the state models in OPS and RSPS is that the RSPS model may include right-to-left transitions, whereas the OPS state model contains only left-to-right or self-loop transitions. Note that in both OPS and RSPS, left-to-right transitions occur only between adjacent states.

To complete the RSPS algorithm, we must now specify the computation of shift(j) and next(j) in the presence of nested stars.

3.3.1 Finding shift and next for the nested star case

Consider the sample graph in Figure 3.4, based on the matrix θ (excluding the main diagonal):

[Figure 3.4: Interdependency Graph over the nodes θ_21, θ_31, θ_32, θ_41, θ_42, θ_43]

The entry θ_jk in our matrix correlates pattern predicate p_j with p_k, k < j, when these are evaluated on the same input element. Therefore, we can picture the simultaneous processing of the input on the original pattern and on the same pattern shifted back by j − k. Thus the arcs between nodes in our matrix above show the combined transitions in the original pattern and in the shifted pattern. In particular, consider θ_jk where neither p_k nor p_j is a star predicate; then after success in p_j and p_k, we have a transition to p_{j+1} in the original pattern and to p_{k+1} in the shifted pattern. This transition is represented by an arc θ_jk → θ_{j+1,k+1}. However, if p_j is a star predicate while p_k is not, then the success of both will move p_k to p_{k+1} but leave p_j unchanged; this is represented by the arc θ_jk → θ_{j,k+1}.

In the nested-star situation there is another possible arc, a back edge, which arises when the last element of the nested-star sub-pattern satisfies the previous input element but not the current one. In this case, before going forward to match the input element with the next pattern element, the algorithm evaluates the input element against the first element of the nested-star sub-pattern, so the graph will have a back edge.
In general, it is clear that only some of the arcs listed in the matrix above represent valid transitions and should be considered. The set of valid transitions also depends on the values of θ. In particular, since all the predicates in the pattern must be satisfied by the shifted input, every θ_jk = 0 entry must be removed together with all its incoming and departing arcs; we only retain entries that are either 1 or U. Considering all possible situations, and assuming that all the neighbors are non-zero entries, the following table lists the transitions that are needed when building the graph.

Figure 3.5: Possible transitions among elements of a pattern

    Case  p_j       p_k       θ_jk  Arcs leaving θ_jk
    1     star      star      U     → θ_{j,k+1},  ↓ θ_{j+1,k},  ↘ θ_{j+1,k+1}
    2     star      star      1     ↓ θ_{j+1,k},  ↘ θ_{j+1,k+1}
    3     non-star  non-star        ↘ θ_{j+1,k+1}
    4     star      non-star        → θ_{j,k+1},  ↘ θ_{j+1,k+1}
    5     non-star  star            ↓ θ_{j+1,k},  ↘ θ_{j+1,k+1}

In case 2 there is no arc to θ_{j,k+1} because θ_jk = 1; thus all input tuples that satisfy p_j must also satisfy p_k.

In the presence of a nested star, when element j is the last element of the nested-star sub-pattern, the following transitions are added to the graph.

Figure 3.6: Extra possible transitions among elements of a pattern for the case of nested star (y = length(nested-star sub-pattern) − 1)

    Case  p_j       p_k       θ_jk  Arcs
    6     star      star      U     arcs of case 1, plus a back edge to θ_{j−y,1}
    7     star      star      1     arcs of case 2, plus a back edge to θ_{j−y,1}
    8     non-star  non-star        arcs of case 3, plus a back edge to θ_{j−y,1}
    9     star      non-star        arcs of case 4, plus a back edge to θ_{j−y,1}
    10    non-star  star            arcs of case 5, plus a back edge to θ_{j−y,1}

These rules assume that the end nodes of the arcs have value U or 1; when such nodes have value 0, the incoming arcs are dropped.

The directed graph produced by this construction is called the Implication Graph for pattern sequence P, and is denoted G_p. For each value of j this graph must be further modified with entries from φ to account for the fact that the j-th element of the pattern failed on the input. Therefore, we replace the j-th row of G_p (i.e., the row that starts with θ_{j,1}) with the j-th row of matrix φ, and remove all rows and arcs after j. In addition, we recompute the arcs from row j − 1 to row j according to the new values of the elements in row j. Thus, if element k is a star, there are up to two arcs from θ_{j−1,k} to row j: one to φ_jk and one to φ_{j,k+1}. If element k is not a star, then there is only one arc from θ_{j−1,k} to row j, and it goes to φ_{j,k+1}. Furthermore, all the original G_p entries in rows up to and including j − 1 remain unchanged, and so do all arcs leading to entries in these rows. Again we assume that the end nodes of the arcs are either U or 1; when such nodes are 0, the incoming arcs are dropped.

The resulting graph is called the Implication Graph for pattern element j, denoted G_p^j; this graph is used to compute shift(j) and next(j).
For instance, let us go back to Example 3 and calculate the matrices θ and φ:

\[
\theta =
\begin{bmatrix}
1 \\
U & 1 \\
0 & U & 1 \\
1 & U & 0 & 1 \\
0 & U & 1 & 0 & 1 \\
U & 0 & U & U & U & 1
\end{bmatrix}
\qquad
\phi =
\begin{bmatrix}
0 \\
U & 0 \\
U & U & 0 \\
0 & U & U & 0 \\
U & U & 0 & U & 0 \\
U & U & U & U & U & 0
\end{bmatrix}
\]

[Figure 3.7: General Implication Graph G_p]

Since p_1, p_3 and p_5 are star predicates, p_4 is a nested-star predicate, and p_2 and p_6 are not star predicates, the Implication Graph for our pattern sequence is as illustrated in Figure 3.7.

Suppose the current input element fails on the sixth element of the pattern, so we need to build G_p^6. We replace row 6 of G_p with row 6 of φ and update the outgoing arcs from row 5 to the new row 6. Figure 3.8 illustrates G_p^6.

[Figure 3.8: Implication Graph G_p^6 for a mismatch in element 6]

Consider now the node θ_41 in this graph. Observe that there are several paths consisting of either 1 nodes or U nodes that take us to nodes in the last row of the matrix. Therefore, the input shifted by 3 can succeed along any of these paths. However, there is no path to the last row starting from node θ_31; thus, 2 is not a possible shift. Also, there is no path to the last row starting from θ_21 or θ_11; thus smaller shifts can never succeed. Therefore, we conclude that shift(6) = 3.

3.4 Computation of shift(j) and next(j) from G_p^j

In the following we provide detailed steps to calculate shift(j) and next(j). As mentioned earlier, shift(j) determines how far we need to shift the pattern after a mismatch occurs, and next(j) determines where to set the pointer in the pattern when resuming the search.
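As a cross-check on the matrix θ given for Example 3, the definition of θ_jk can be evaluated by brute force. The sketch below is an illustration only and rests on an assumption of ours: implication between predicates is tested over a finite integer grid of (price, previous price) pairs, which is merely a proxy for a real implication test and is not how OPS or RSPS derive the matrices. The names `preds`, `GRID` and `theta_entry` are hypothetical.

```python
from itertools import product

# Predicates of Example 3 on (price, previous price).
preds = [
    lambda p, q: p > q,         # p1: increasing
    lambda p, q: 30 < p < 40,   # p2: price between 30 and 40
    lambda p, q: p < q,         # p3: decreasing
    lambda p, q: p > q,         # p4: increasing
    lambda p, q: p < q,         # p5: decreasing
    lambda p, q: p < 25,        # p6: price below 25
]

# Assumption: a small integer grid stands in for the full price domain.
GRID = list(product(range(15, 51), repeat=2))


def theta_entry(pj, pk):
    """1 if pj => pk and pj is satisfiable; 0 if pj => not pk; else U."""
    sat = [pt for pt in GRID if pj(*pt)]
    if sat and all(pk(*pt) for pt in sat):
        return "1"
    if all(not pk(*pt) for pt in sat):
        return "0"
    return "U"


# Lower-triangular matrix: row j holds theta_{j,1} .. theta_{j,j}.
theta = [
    [theta_entry(preds[j], preds[k]) for k in range(j + 1)]
    for j in range(len(preds))
]
for row in theta:
    print(" ".join(row))
```

On this grid the printed rows agree with the θ matrix of Example 3. A production system would need a sound implication test over the predicate language rather than sampling.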
3.4.1 Shift

In general we define shift(j) as follows. Let P denote the search pattern, and let

    σ(j) = {s | there exists a path from θ_{s+1,1} to a node in the last row of G_p^j}.

Then,

• if the set σ(j) is not empty, then shift(j) = min(σ(j))
• if the set σ(j) is empty and φ_j1 ≠ 0, then shift(j) = j − 1
• if the set σ(j) is empty and φ_j1 = 0, then shift(j) = j.

3.4.2 Next

Multiple paths leading to the last row were acceptable for shift, but they are not acceptable for next, since next must return a value that uniquely determines the point from which the search must be resumed. Therefore, let us say that a node in our G_p^j graph is deterministic if there is exactly one arc leaving this node and the end node of this arc has value 1 (thus a deterministic node cannot take us to a U node or to several 1 nodes). We start from θ_{shift(j)+1,1}, and if this node is not deterministic, we set next(j) = 1. Otherwise, we move to the unique successor of this deterministic node and repeat the test. When the first non-deterministic node is found in this recursive process, next(j) is set to the value of its column. If the search takes us to the last row of G_p^j, none of the input elements previously visited needs to be tested again; thus next(j) = j − shift(j).

For the example at hand, there is a non-zero path from node θ_41 to φ_63, thus shift(6) = 3. We now consider θ_41 = 1 and see that this is not a deterministic node, since more than one arc leaves it: one back edge to θ_21 and one arc to θ_52. Thus, we conclude that next(6) = 1.

Although the Implication Graph for RSPS may have some back edges, the computation of shift(j) and next(j) is based on the same formula as in the star algorithm. Suppose that there is a path from θ_{j−y,1} to the last row of G_p^j. Also assume that there is a back edge from θ_{j−2,1} to θ_{j−y,1} and that there is a path from θ_{j−2,1} to the last row.
Thus σ(j) = {j − y, j − 2} (y > 2) and shift(j) = min(σ(j)) = j − y. So the existence of a back edge in the Implication Graph has no impact on the calculation of shift(j), and therefore none on next(j).

Chapter 4

Multiple Pattern Search

Looking for complex patterns over structured and semi-structured data has many scientific and commercial applications. Examples include employing graphs to represent shapes or images in computer vision, or to represent relations in social networks and auction markets. Although looking for the occurrence of a subgraph in a set of graphs is known to be NP-complete (Garey & Johnson., 1979), there has been a great deal of research on reducing the space and time needed to search in graphs.

RSPS dealt with finding the time period(s) in which one recursive sequential pattern occurs over a single sequence of data. In this chapter we first briefly discuss other possible scenarios, which involve more than one input data sequence, more than one pattern, or both. Then we discuss two of these cases in detail. There are five different scenarios for looking for patterns in a set of inputs:

1. Single input data, multiple concurrent disjunctive patterns
2. Single input data, multiple concurrent conjunctive patterns
3. Multiple input data, one pattern; to look for a pattern occurrence in multiple input data at the same time
4. Multiple input data, multiple concurrent disjunctive patterns
5. Multiple input data, multiple concurrent conjunctive patterns

Among these scenarios, in this research we focus on the second and the third. In the following sections we discuss the novel techniques that deal with these cases. The performance of the proposed techniques is evaluated in the next chapter.

4.1 Multiple Concurrent Conjunctive Pattern Search (MCCPS)

Consider cases in which we are interested in knowing whether two or more conjunctive patterns occur concurrently over the same sequential data.
Here is a very simple example. We are trying to find the period(s) of time over the oil price data in which the following two patterns occur at the same time. The first pattern consists of a period of equal or rising prices, followed by a period of falling prices, and finally followed by another period of rising prices.

The query written in SQL-TS is:

    SELECT X.name,
           FIRST(X).date AS sdate,
           LAST(Z).date AS edate
    FROM quote
        CLUSTER BY name,
        SEQUENCE BY date
        AS (*X, *Y, *Z)
    WHERE X.price >= X.previous.price
        AND Y.price < Y.previous.price
        AND Z.price > Z.previous.price

Therefore our pattern predicates (on input tuple t) are:

    p1(t) = (t.price >= t.previous.price)
    p2(t) = (t.price < t.previous.price)
    p3(t) = (t.price > t.previous.price)

The second pattern of interest is: find the pattern consisting of a period of prices between $2.00 and $10.00.

    SELECT S.name,
           FIRST(S).date AS sdate,
           LAST(S).date AS edate
    FROM quote
        CLUSTER BY name,
        SEQUENCE BY date
        AS (*S)
    WHERE S.price > 2
        AND S.price < 10.00

This is important because the search enables us to look for a specific price behavior such that a specific price range falls within that period of time. Now suppose the input data is as follows:

    Input order   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
    Input value   1  8  4  4  1  3  6  7  10  8   6   5   4   6   8
    Pattern 1              1  2  3  3  3  3
    Pattern 2                    1  1  1

Table 4.1: Input Example

In this example the 4th to 9th input elements satisfy the first pattern, while the second pattern is satisfied from the 6th to the 8th element. So we can conclude that there is a period of time in which the oil price follows the specific behavior we are looking for, and that in a sub-period of this period the price fluctuates between $2.00 and $10.00 per barrel (obviously this is an imaginary price for oil these days!).

Our goal now is to define a procedure that speeds up this search considerably.
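The intervals claimed for Table 4.1 can be verified directly against the predicates. The following Python sketch is only a check of the worked example, not part of MCCPS; the variable names and the 1-based position bookkeeping are our own hypothetical choices.

```python
# Input values of Table 4.1 (position 1 is values[0]).
values = [1, 8, 4, 4, 1, 3, 6, 7, 10, 8, 6, 5, 4, 6, 8]

# Predicates of the first pattern on (current, previous) and of the second on current.
p1 = lambda cur, prev: cur >= prev   # equal or rising
p2 = lambda cur, prev: cur < prev    # falling
p3 = lambda cur, prev: cur > prev    # rising
q1 = lambda cur: 2 < cur < 10        # price between $2.00 and $10.00

# Pattern-element labels from Table 4.1, keyed by 1-based input position.
pattern1 = {4: p1, 5: p2, 6: p3, 7: p3, 8: p3, 9: p3}
pattern2_positions = [6, 7, 8]

pattern1_ok = all(
    pred(values[pos - 1], values[pos - 2]) for pos, pred in pattern1.items()
)
pattern2_ok = all(q1(values[pos - 1]) for pos in pattern2_positions)

# The second interval is maximal: q1 fails just before and just after it.
q1_fails_outside = not q1(values[5 - 1]) and not q1(values[9 - 1])

print(pattern1_ok, pattern2_ok, q1_fails_outside)  # True True True
```

Checking each pattern separately in this way corresponds to the two-pass search that the single-pass approach is meant to avoid.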
The naive search would go over the input data twice, looking for each pattern separately, and, if both patterns are satisfied over some periods of time, would then try to determine whether one of these periods is a sub-sequence of the other. In MCCPS we try to capture the interdependency between the elements of different patterns in order to find the answer in one pass. Thus the MCCPS algorithm begins by capturing the logical relations among pairs of elements of the two patterns, as well as the logical relations between the elements of each pattern. The positive precondition logic between the elements is represented by the matrix α, and the negative precondition logic by the matrix β. The size of these two matrices is

    (length(P_1) + length(P_2) + ... + length(P_n)) × (length(P_1) + length(P_2) + ... + length(P_n))

where P_1, P_2, ..., P_n are the patterns of interest whose concurrent occurrence over the same input data we are looking for. The values of these matrices are calculated as follows:

\[
\alpha_{row,column} =
\begin{cases}
1 & \text{if } p_{row} \Rightarrow p_{column} \wedge p_{row} \not\equiv F \\
0 & \text{if } p_{row} \Rightarrow \neg p_{column} \\
U & \text{otherwise}
\end{cases}
\qquad
\beta_{row,column} =
\begin{cases}
1 & \text{if } \neg p_{row} \Rightarrow p_{column} \\
0 & \text{if } \neg p_{row} \Rightarrow \neg p_{column} \wedge p_{row} \not\equiv T \\
U & \text{otherwise}
\end{cases}
\]

Therefore the matrices α and β for the last example are:

\[
\alpha =
\begin{array}{c|cccc}
    & p_1 & p_2 & p_3 & q_1 \\ \hline
p_1 & 1   &     &     & U \\
p_2 & 0   & 1   &     & U \\
p_3 & U   & U   & 1   & U \\
q_1 & U   & U   & U   & 1
\end{array}
\qquad
\beta =
\begin{array}{c|cccc}
    & p_1 & p_2 & p_3 & q_1 \\ \hline
p_1 & 1   & 1   &     & U \\
p_2 & 1   & 1   &     & U \\
p_3 & U   & U   & 1   & U \\
q_1 & U   & U   & U   & 1
\end{array}
\]

When comparing elements of the same pattern in these matrices, we are not interested in the value of α(P_j, P_k) if j < k (as we never shift the pattern to the left against the input data). Hence there are (m1(m1 − 1)/2 + m2(m2 − 1)/2) comparisons that we can skip, where m1 is the length of pattern P and m2 is the length of pattern Q.
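The matrix size and the number of skippable comparisons are simple arithmetic, sketched below in Python. The function names are hypothetical, and extending the count from two patterns to a list of n patterns is our own assumption, made by applying the same m(m − 1)/2 term to each pattern.

```python
from typing import List


def matrix_side(lengths: List[int]) -> int:
    """Side length of alpha and beta: the sum of all pattern lengths."""
    return sum(lengths)


def skipped_comparisons(lengths: List[int]) -> int:
    """Entries above the diagonal within each pattern's own block,
    i.e. m(m - 1)/2 per pattern of length m."""
    return sum(m * (m - 1) // 2 for m in lengths)


# Oil price example: P has 3 elements, Q has 1 element.
print(matrix_side([3, 1]), skipped_comparisons([3, 1]))  # 4 3

# Two patterns of lengths 3 and 2.
print(matrix_side([3, 2]), skipped_comparisons([3, 2]))  # 5 4
```

For the oil price example this gives a 4 × 4 matrix in which 3 intra-pattern entries need not be computed.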
The matrices α and β contain information about both the interdependencies between the elements of different patterns and the intra-dependencies between the elements of each pattern. As already discussed for the RSPS algorithm, we can calculate shift and next for each pattern using some of the values in α and β. By looking at the matrix α we can conclude the following:

• If α(p_j1, q_j2) = 1 and α(q_j2, p_j1) = 1, both p_j1 and q_j2 imply the same condition. Therefore, if during the search process we are examining an input data element against these two elements at the same time, matching the input against one of them is enough.

• If α(p_j1, q_j2) = 0, then α(q_j2, p_j1) = 0 too, and if an input element satisfies one of them it will not satisfy the other, or if it does not satisfy one of them it will satisfy the other; thus trying one of these elements against the input data is enough.

• If α(p_j1, q_j2) = U and α(q_j2, p_j1) = U, then an input data element that satisfies the element of the first pattern may or may not satisfy the second one, so we need to try the input against both.

• If α(p_j1, q_j2) = 1 and α(q_j2, p_j1) = U, or α(p_j1, q_j2) = U and α(q_j2, p_j1) = 1, we examine the input data against p_j1 and can skip trying the input against q_j2, or vice versa.

By looking at the elements of the matrices α and β we can pre-calculate the interdependencies between the elements of the patterns. This pre-calculation helps us determine whether the concurrent occurrence of the patterns is possible at all and, if it is, at which point the comparison of each pattern against the input data should begin. For the sake of simplicity, let us assume that we are dealing with one sequential input and two patterns. There are three possible scenarios for these two patterns.

1. The whole or a sub-pattern of one of the patterns implies the second pattern.
In this case, we only need to match the input data against the first pattern, and there is no need to try the second pattern.
2. The two patterns are contradictory, so they will never be satisfied at the same time.
3. There is no logical relationship between the two patterns, so we need to examine both of them against the input data.

We can generalize the above conditions to the case of more than two patterns. In the matrix $\alpha$, consider any $j1$ such that for all $x$, $j1 \le x \le m_1$ ($m_1 = \mathrm{length}(P)$), and all $j2$, $1 \le j2 \le m_2$ ($m_2 = \mathrm{length}(Q)$), the following function is evaluated:

    FOR (j1 <= m1 AND 1 <= j2 <= m2)
        IF alpha(p_j1, q_j2) = 1
            j1 = j1 + 1
            j2 = j2 + 1
        ELSE
            RETURN FALSE
    RETURN TRUE

If the above function returns TRUE, then $Q$ will be satisfied during the interval of the input that satisfies $P$.

4.1.1 MCCPS for patterns without recurring elements

In this section we provide the MCCPS algorithm. To draw a clear picture of how the algorithm works, we begin with a simple example in which there is no recurrent element in any of the patterns of interest.

Example 4. Find three consecutive days in the closing price of "Yahoo" in which a rise is followed by a decrease, followed by another rise. Check whether, in any two consecutive days of these periods, the price is between $10.00 and $20.00 and then falls to exactly $9.00.
    SELECT X.date AS start_date, Z.date AS end_date
    FROM quote
      CLUSTER BY name,
      SEQUENCE BY date
      AS (X, Y, Z) AND AS (R, S)
    WHERE X.name = 'YAHOO'
      AND ((X.price > X.previous.price
            AND Y.price < X.price
            AND Z.price > Y.price)
       AND (R.price >= 10
            AND R.price <= 20
            AND S.price = 9))

So we are looking for the concurrent occurrence of two patterns with the following predicates:

    P: p1(t) = (t.price > t.previous.price)
       p2(t) = (t.price < t.previous.price)
       p3(t) = (t.price > t.previous.price)
    Q: q1(t) = (t.price >= 10 AND t.price <= 20)
       q2(t) = (t.price = 9)

Thus the matrices $\alpha$ and $\beta$ are as follows:

$$\alpha=\begin{array}{c|ccccc} & p_1 & p_2 & p_3 & q_1 & q_2\\ \hline p_1 & 1 & & & U & U\\ p_2 & 0 & 1 & & U & U\\ p_3 & 1 & 0 & 1 & U & U\\ q_1 & U & U & U & 1 & \\ q_2 & U & U & U & 0 & 1\end{array}$$

$$\beta=\begin{array}{c|ccccc} & p_1 & p_2 & p_3 & q_1 & q_2\\ \hline p_1 & 0 & & & U & U\\ p_2 & U & 0 & & U & U\\ p_3 & 0 & U & 0 & U & U\\ q_1 & U & U & U & 0 & \\ q_2 & U & U & U & U & 0\end{array}$$

The MCCPS search begins by looking in the matrix $\alpha$ to find whether there is a $j1$ ($1 \le j1 \le m_1 - m_2 + 1$) such that

$$C \equiv \alpha(p_{j1}, q_1) \wedge \alpha(p_{j1+1}, q_2) \wedge \cdots \wedge \alpha(p_{j1+m_2-1}, q_{m_2})$$

is equal to 1 or $U$.

• If $C$ is equal to 0, $P$ and $Q$ will never happen concurrently over the same input data.
• If $C$ returns 1, any period of the input that satisfies $P$ will satisfy $Q$ as well, so there is no need to run the search for $Q$.
• If $C$ returns $U$ and $m_1 = m_2$, we build an ordered list. If $\alpha(p_1, q_1)$ is equal to 1, we add only $p_1$ to the list. If it is equal to $U$, we look at $\alpha(q_1, p_1)$: if that is equal to 1 we add $q_1$ to the list, and if it is equal to $U$ we add both $p_1$ and $q_1$ to the list. We then move on to $\alpha(p_2, q_2)$, and so on. At run time, instead of matching the input against all the elements of $P$ and $Q$, we only need to match the input data against this ordered list.

Now we consider the case in which the search process fails while matching $P$ and $Q$ against the input data.
If $C$ returns 1, we run the normal RSPS for $P$: we calculate shift and next for $P$, advance $P$ over the input data, and resume the search starting at the next element of the pattern.

In the case of $C$ returning $U$, if the patterns have the same length and the failure happens while matching the input data against the ordered list described above, we look at the failed element. If it is a $p_{j1}$ element, we shift both patterns by $shift(j1)$ and resume the search from the $next(shift(j1))$ element of the ordered list. If the failing element is $q_{j2}$, we do the same for $q_{j2}$. (The steps needed to calculate shift and next for all the elements of each pattern were already described in detail for the RSPS algorithm.)

The last and most complex scenario is when $C$ returns $U$ and the lengths of the patterns are different.

• We begin the search by comparing the input data against the first element of both patterns.
• If it matches both, we proceed to the next input element and the next pattern elements.
• If the input data satisfies the element of the longer pattern but does not satisfy the element of the shorter pattern, we calculate $shift(q_{j2})$. If $shift(q_{j2}) \le m_1 - j1$, we shift the failed pattern by $shift(q_{j2})$, set $j2 = next(shift(q_{j2}))$, and the search proceeds. If $shift(q_{j2}) > m_1 - j1$, the search has failed and we resume it from the next input element.
• If the input data does not satisfy the element of the longer pattern, we shift both patterns by $shift(p_{j1})$ and resume the search by matching the first element of the shorter pattern against the $(i - j1 + shift(p_{j1}) + 1)$-th element of the input. We continue matching the input data from there against the elements of the shorter pattern until we reach the $(i - j1 + shift(p_{j1}) + next(j1))$-th input element. At this point the input data should be matched against this element of the longer pattern as well (this is when $j2 = next(shift(p_{j1}))$).
• If the current data element does not satisfy any of the patterns, we calculate $shift(j_x)$ for all patterns and shift all the patterns by $\max(shift(j_x))$. We resume the search from $next(shift(j_x))$ for the pattern with the maximum $shift(j)$, and from $i - j_x + shift(j_x) + 1$ for the rest of the patterns.

Let us get back to our example. After computing the matrices $\alpha$ and $\beta$ for the query, we look at the result of the function $C$, which in this case is $U$. The lengths of the two patterns are not equal, so we start the search by comparing the first element of the input with $p_1$ and with $q_1$ and proceed from there. Table 4.2 shows how the search continues.

Input order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
Input value | 15 | 16 | 15 | 20 | 15 | 9 | 10 | 9 | 12 | 8 | 6

j1: fail 1 2 3 1 2 3 fail fail
j2: 1 fail fail 1 2
j1: 1 2 2 3
j2: 1 fail 1 2
j1: 2 3
j2: 1 fail
j1: 1 2
j2: 1 fail
j1: 2 3
j2: 1 2

Table 4.2: Example 4: Illustration of the process of search for two conjunctive patterns over a given input sequence

As shown in Table 4.2, the second element of the input is the first to satisfy both $p_1$ and $q_1$. Then we move to the 3rd element of the input, which satisfies $p_2$ but fails at $q_2$. We already know that $shift_q(2) = 1$ and $next_q(shift(2)) = 1$, and because $m_1 - j1 + 1 = 3 - 2 + 1 \ge m_2 - j2 + 1 = 2 - 1 + 1$, we advance $Q$ against the data by one element and resume the search for $Q$ by setting $j2 = 1$. The next input element satisfies $p_3$ but fails at $q_2$ again. This time, because $m_1 - j1 + 1 = 3 - 3 + 1 < m_2 - j2 + 1 = 2 - 1 + 1$, we do not shift the pattern $Q$ against the input data; instead we resume the whole search for $P$ and $Q$ from the next input element. Both $p_1$ and $q_1$ are satisfied by the 4th element of the input data. Then we move to the 5th element of the input data and the second element of both patterns. $p_2$ is satisfied by the 5th element, but the search fails at the 5th element for $q_2$.
Because $m_1 - j1 + 1 = 3 - 2 + 1 \ge m_2 - j2 + 1 = 2 - 1 + 1$, we resume with $j2 = 1$, which is satisfied by the 5th input element. Now we try $p_3$ and $q_2$ against the 6th element of the input, and both are satisfied. Therefore the search succeeds, and one answer for the query is $i = [4, 6]$. We resume the search starting at the 7th element of the input and continue.

4.1.2 Single input data, multiple concurrent conjunctive patterns with * elements

In this section, we discuss the case in which we have a single input sequence and are looking for the concurrent occurrence of multiple conjunctive patterns with at least one recurrent (*) element in one of the patterns. This case is much more complicated than the previous one, as the lengths of the patterns are of no help. The reason is that a series of input elements that satisfy one star element in one pattern could satisfy more than one consecutive element in the other patterns. The following example clarifies this idea.

Example 5: Find patterns in the closing price of the "YAHOO" stock consisting of a period of rising or equal prices, followed by a period of falling prices, followed by another period of rising prices. Find periods concurrent with these patterns in which the price fluctuates between $2.00 and $10.00 for a period of time, followed by a rising period between $10.00 and $20.00.

    SELECT FIRST(X).date AS start_date, LAST(Z).date AS end_date,
           FIRST(R).date AS sdate, LAST(S).date AS edate
    FROM quote
      CLUSTER BY name,
      SEQUENCE BY date
      AS (*X, *Y, *Z) AND AS (*R, *S)
    WHERE X.name = 'YAHOO'
      AND ((X.price >= X.previous.price
            AND Y.price < Y.previous.price
            AND Z.price > Z.previous.price)
       AND (R.price > 2
            AND R.price < 10
            AND S.price > S.previous.price
            AND S.price > 10
            AND S.price < 20))

As indicated by Example 5, the two patterns of interest are as follows:

    P: p1(X) = (X.price >= X.previous.price)
       p2(Y) = (Y.price < Y.previous.price)
       p3(Z) = (Z.price > Z.previous.price)
    Q: q1(R) = (R.price > 2 AND R.price < 10)
       q2(S) = (S.price > S.previous.price AND S.price > 10 AND S.price < 20)

Therefore we can calculate the matrices $\alpha$ and $\beta$:

$$\alpha=\begin{array}{c|ccccc} & p_1 & p_2 & p_3 & q_1 & q_2\\ \hline p_1 & 1 & & & U{\downarrow} & U\\ p_2 & 0 & 1 & & U{\downarrow} & 0\\ p_3 & U & U & 1 & U{\rightarrow} & U\\ q_1 & U{\rightarrow} & U{\rightarrow} & U{\downarrow} & 1 & \\ q_2 & U & 0 & U & 0 & 1\end{array}$$

$$\beta=\begin{array}{c|ccccc} & p_1 & p_2 & p_3 & q_1 & q_2\\ \hline p_1 & 0 & & & U & U\\ p_2 & 1 & 0 & & U & U\\ p_3 & U & U & 0 & U & 1\\ q_1 & U & U & U & 0 & \\ q_2 & U & U & U & U & 0\end{array}$$

Since there are some * elements in the above patterns, for some of the input elements we have more than one pattern element to examine. As mentioned, the lengths of the patterns do not help us to predict which pattern will match more input elements; however, we are interested in the periods of time in which all patterns occur concurrently. Thus, when the first element of one of the patterns starts to be matched by the input data, the search proceeds only if there is a path in the matrix $\alpha$ from that element to the last element of the same pattern. For instance, in the above example, if the element $p_1$ is the first to be satisfied by the input, we look at the implication matrix above to see whether there is any path from $\alpha_{p_1,q_1}$ to $\alpha_{p_3,q_2}$. By path we mean any path from one matrix element to another that does not pass through any "0" value. If there is no such path, $P$ and $Q$ will never occur concurrently over the same input data, so we conclude without any matching that this search will fail.
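The path test described above can be sketched as a small reachability check. The thesis does not spell out which moves a path may take, so this sketch assumes that from cell (p_j1, q_j2) a path may advance P, advance Q, or advance both, and that a cell holding 0 is blocked. The function name `has_path` and the list `example5_cross` (the P-by-Q block of α as read from the matrix of Example 5) are illustrative only.

```python
from collections import deque

U = "U"


def has_path(cross, start, goal):
    """Breadth-first search over the P x Q block of alpha.

    cross[i][j] holds alpha(p_{i+1}, q_{j+1}); cells equal to 0 are blocked.
    Assumed moves: advance P, advance Q, or advance both at once.
    """
    rows, cols = len(cross), len(cross[0])
    if cross[start[0]][start[1]] == 0 or cross[goal[0]][goal[1]] == 0:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        i, j = queue.popleft()
        if (i, j) == goal:
            return True
        for di, dj in ((1, 0), (0, 1), (1, 1)):
            ni, nj = i + di, j + dj
            if ni < rows and nj < cols and cross[ni][nj] != 0 and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return False


# Rows p1..p3, columns q1..q2, as read from the alpha matrix of Example 5.
example5_cross = [[U, U],
                  [U, 0],
                  [U, U]]

# A hypothetical block in which p2 contradicts every element of Q.
blocked = [[U, U],
           [0, 0],
           [U, U]]

print(has_path(example5_cross, (0, 0), (2, 1)))
print(has_path(blocked, (0, 0), (2, 1)))
```

Under these assumptions the first call finds the path through the q1 column and then across to q2, while the second finds none, which corresponds to concluding without any matching that the search will fail.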
Input order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Input value | 1 | 8 | 4 | 4 | 1 | 3 | 6 | 7 | 11 | 12 | 20 | 19
j1 | fail | 1 | 2 | fail | 2 | 3 | 3 | 3 | 3 | 3 | 3 | fail
j2 | fail | 1 | 1 | 1 | fail | 1 | 1 | 1 | 2 | 2 | fail | fail

j1: 1
j2: 1
P: p2 p2 p3
Q: q1 q2

Table 4.3: Multi Pattern Search Illustration

Likewise, if the input element satisfies $q_1$, we look at the implication matrix to check whether there is any path from $\alpha_{q_1,p_1}$ to $\alpha_{q_2,p_3}$ and then proceed with the search. If any $\alpha_{q_{j2},p_{j1}}$ on this path is equal to 1, matching the input data against that $q_{j2}$ is enough, and there is no need to match the same input element against $p_{j1}$. If the next input element does not match $q_{j2}$, we match it against the next possible option.

For instance, assume that the input data is as shown in Table 4.3. The 4th input element satisfies both $p_1$ and $q_1$. We look at $\alpha_{p_1,q_1}$ in the implication matrix, and there is a path from $\alpha_{p_1,q_1}$ to $\alpha_{p_3,q_2}$, so the search proceeds. The next input element does not satisfy $q_2$; therefore the search for $Q$ resumes by shifting $Q$ by $Shift_p(2)$ and continuing from $next(Shift_p(2))$, while the search for $P$ continues. The implication matrix shows that there is a path from $\alpha_{p_2,q_1}$ to $\alpha_{p_3,q_2}$, so the search proceeds. The 6th element satisfies $p_3$ and $q_1$. At this point, because $p_3$ is a * element and there is a path from $\alpha_{p_3,q_1}$ to $\alpha_{p_3,q_2}$, the search still proceeds. The 7th and 8th elements still satisfy both $p_3$ and $q_1$. The 9th and 10th elements satisfy $p_3$ and $q_2$. The 11th element satisfies $p_3$ but not $q_2$. Therefore, because $j2 = m_2$, the search continues only for $P$ until it fails. The answer for the query is $i = [4, 11]$.

Suppose that for the same query the input data sequence is as shown in Table 4.4. This time the search starts from $q_1$ and proceeds by checking whether there is a path from $\alpha_{q_1,p_1}$ to $\alpha_{q_2,p_3}$. The rest is as described above.
Input order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Input value | 1 | 8 | 4 | 4 | 5 | 3 | 6 | 7 | 11 | 12 | 19 | 18

P: p2 p2 p3
Q: q1 q2

Table 4.4: Illustration of the search process for Example 4, with different given input sequence

4.2 Multiple Input Data, One Pattern Search

Looking for the concurrent occurrence of a pattern of interest over multiple input sequences is an interesting case to investigate. Assume that we are looking for a time period in which a double bottom pattern occurred in the closing prices of major oil companies such as Shell, OXY, Chevron and Mobil. Such data would be valuable information for stock market analysts, as they could look for a common reason for this behavior in the stock prices of companies in the same category. For instance, say that we have the following table of closing prices for stocks:

    CREATE TABLE quote (name Varchar(8), date Date, price Integer)

And the pattern of interest is as follows:

Example 5: We are looking for the concurrent occurrence of the following pattern in the stock prices of "IBM" and "Intel": a small pattern consisting of a period of rising prices, followed by a period of falling prices, followed by another period of rising prices.

We first create a view on a self join of the quote table:

    CREATE VIEW temp AS
    SELECT q1.price AS int_price, q2.price AS ibm_price, q1.date AS date
    FROM quote q1, quote q2
    WHERE q1.date = q2.date
      AND q1.name = 'INTEL'
      AND q2.name = 'IBM'

Now we search this view for the desired pattern.
Note that we do not need a CLUSTER BY clause, since we have already filtered out the stocks of Intel and IBM:

    SELECT FIRST(X).date AS sdate, LAST(Z).date AS edate
    FROM temp
      SEQUENCE BY date
      AS (*X, *Y, *Z)
    WHERE X.ibm_price > X.previous.ibm_price
      AND X.int_price > X.previous.int_price
      AND Y.ibm_price < Y.previous.ibm_price
      AND Y.int_price < Y.previous.int_price
      AND Z.ibm_price > Z.previous.ibm_price
      AND Z.int_price > Z.previous.int_price

Expediting the search process for this case does not involve any new algorithm. The only technique is to employ the RSPS algorithm and proceed with the search.

4.2.1 Multiple input data, single pattern without recurrent (*) element

If there is no * among the pattern elements, we start the search by looking at the first input element of each of the sequences and matching them against the first element of the pattern. If all of them match, we advance to the second element of each sequence and match them against the second element of the pattern. The search proceeds at the same pace until it succeeds. If any failure occurs during the search process, for example the 4th element of the input does not match the 2nd element of the pattern, we advance the pattern against all the input sequences by $Shift(2)$ for the pattern $P$ and resume the search from the $next(shift(2))$-th element of all the input sequences.

4.2.2 Multiple input data, single pattern with recurrent (*) element

In this case, we start the search by matching the first elements of all the inputs against the first element of the pattern. If any of them fails, we proceed by comparing the first element of the pattern against the second elements of all the input sequences. If the first element of the pattern is a star element and all the input elements match it, we proceed with the first pattern element and the next set of input elements.
If any of the inputs fails against the current pattern element, then, since the current element is a star element, we match that input element against the second element of the pattern and proceed. The answer for the search is a period of time in which all the input sequences satisfy the whole pattern. Table 4.5 illustrates the successful search procedure for Example 5. If during the search process any of the input sequences fails to satisfy $p_j$, we look at the $Shift$ of the current $j$ for all the sequences and advance the pattern against all the data sequences by the minimum of the calculated $Shift$ values. The search is resumed from the elements of the input sequences that are now at the position given by the minimum of the $next$ values for all the mentioned $Shift$s.

Input order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
IBM | 48 | 49 | 51 | 50 | 49 | 48 | 49 | 51 | 52 | 53
j | | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 3 | 3
Intel | 22 | 23 | 21 | 20 | 21 | 23 | 24 | 25 | 27 | 28
j | | 1 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3
Microsoft | 39 | 40 | 43 | 44 | 45 | 44 | 46 | 47 | 49 | 51
j | | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | 3

Table 4.5: A successful search procedure for the Example 5

Chapter 5
Empirical Evaluation

To assess performance, we count the number of passes over the same input element while it is tested against the pattern(s) of interest, for RSPS, MCCPS and the naive approaches. The speedups obtained range from modest (a simple search pattern without any recurring sub-pattern) to dramatic (more than two orders of magnitude, obtained on complex patterns found in actual applications). We run RSPS and MCCPS over two kinds of datasets: stock market data and network data. Moreover, we examine the scalability and robustness of RSPS and MCCPS when the pattern of interest changes.

5.1 Introduction

Our technique has been investigated in the context of stock market data and network data. In the following, we explain each domain in more detail, together with our findings in each field. We explain the intention of each experiment and describe which aspects of RSPS and MCCPS are covered by each experiment.
Stock Market: Initially we test the power of RSPS and MCCPS over stock market data. In addition, we have developed a synthetic pattern generator that generates a set of patterns around a seed pattern, in order to measure the sensitivity of RSPS and MCCPS to the given pattern(s). The primary objectives of this set of experiments are:

• To determine whether RSPS and MCCPS are indeed successful in finding patterns in the most sought-after and best-known data available in the financial world: stock market data.
• To evaluate the performance improvement, measured as the speedup of RSPS and MCCPS compared to conventional techniques. By speedup we mean the ratio of the numbers of comparisons needed to find a pattern.
• To demonstrate that RSPS and MCCPS improve the search process even in cases in which the search is bound to fail.
• To illustrate that the performance of RSPS and MCCPS is no worse than that of conventional techniques even if there is no dependency among the pattern elements at all.

Network Data: The objectives of the experiments in this domain are:

• To test the effectiveness of RSPS and MCCPS when the data shows self-similarity and recursion.
• To demonstrate that the RSPS and MCCPS techniques improve the search for complex patterns dramatically across different data sets.

5.1.1 Stock Market Data

A stock market (or equity market) is a private or public market for the trading of company stock and derivatives of company stock at an agreed price; these include securities listed on a stock exchange as well as those only traded privately. The expression stock market refers to the market that enables the trading of company stocks (collective shares), other securities, and derivatives.

The purpose of a stock exchange is to facilitate the exchange of securities between buyers and sellers, thus providing a marketplace (virtual or real). The exchanges provide real-time trading information on the listed securities, facilitating price discovery.
The value recorded for a stock on each day represents the average value of that share on the given day.

5.1.2 Network Data

Understanding the nature of traffic in high-speed, high-bandwidth communications is essential for engineering and performance evaluation. Hence, finding patterns is essential for modeling network behavior. Examples of these patterns are similar to those in the stock market, including Triple Bottom (three equal lows followed by a breakout above a certain level).

One of the functions of a Network Management Platform (NMP) is to log network events, alarms, and statistics. The database that holds this data can become extremely large. For example, the Spectrum NMP collects live data for, say, 24 hours and then unloads the data to an off-line archival database in order to make room for the next 24 hours of data. The archival database may contain valuable information about the overall characteristics of the network. We believe that data mining tools may be used on this data to uncover useful information for network administrators and managers.

In the following we briefly describe a sample of network data logged by the Spectrum NMP. This data was collected over 18 weeks, from October 16, 1994 to February 12, 1995, on the Cabletron corporate network. There are 16849 entries, representing measurements taken roughly every 10 minutes for 18 weeks. This network has a router with 16 ports connected to 16 links. The packet traffic of each port is investigated independently. There are 16 ports $P_n$ on the router that connect to 16 links, which in turn connect to 16 Ethernet subnets ($S_n$). Note that the traffic has to flow through the router ports in order to reach the 16 subnets. Thus, we can observe the traffic that flows through the ports of the router to get some idea of what is happening on each subnet, as well as on the network. For each port, the packet rate is measured.
This measure shows the rate, in packets per minute, at which packets were sent through a port during a 10-minute period. For example, 1178.89 means that an average of 1178.89 packets per minute passed through the port during a 10-minute period. As a sample we consider data from port P1. There are more than 16000 data points averaged over ten-minute intervals; averaging over each hour gives 2808 hourly values. Figure 1 shows part of the Cabletron corporate network. There are three independent variables:

• Load: a measure of the percentage of bandwidth utilization of a port during a 10-minute period. For example, 22.23 means that 22.23% of the total bandwidth was used during a 10-minute period.
• Packet Rate: a measure of the rate at which packets are moving through a port per minute. For example, 1178.89 means that an average of 1178.89 packets per minute passed through the port during a 10-minute period.
• Collision Rate: a measure of the number of packets, during a 10-minute period, that were sent through a port over the link but collided with other packets. For example, 0.14 means that 0.14 percent of the packets transmitted during the last 10 minutes did not go through.

Figure 5.1 shows part of the Cabletron corporate network, and Figure 5.2 illustrates the nature of the network data. As Figure 5.2 shows, the data exhibits self-similarity. Self-similar data has the potential to support queries with recurring patterns (*).

5.2 RSPS for One Dimensional Data

As mentioned earlier, to measure performance we count the number of passes over the same input element while it is tested against a pattern element when running RSPS, and compare the result with the number of passes in the naive search algorithm. The speedups obtained range from modest, for a simple search pattern without any recurring sub-pattern, to more than two orders of magnitude, for the complex patterns found in actual applications.
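The measurement described above can be made concrete with a short sketch of the naive baseline and the speedup ratio. This is an illustration under stated assumptions, not the code used for the experiments: it handles only fixed-length patterns without * elements, each pattern element is assumed to be a predicate over the current and previous value, and the names `naive_search`, `speedup`, `rise` and `fall` are hypothetical.

```python
def naive_search(values, pattern):
    """Try every start position and count element-versus-predicate comparisons.

    Returns (matches, comparisons), where each match is an inclusive
    (first_index, last_index) pair into values.
    """
    m = len(pattern)
    matches, comparisons = [], 0
    # Start at index 1 so that every tested element has a predecessor.
    for start in range(1, len(values) - m + 1):
        for j, predicate in enumerate(pattern):
            i = start + j
            comparisons += 1
            if not predicate(values[i], values[i - 1]):
                break
        else:
            # The inner loop finished without a failure: full match.
            matches.append((start, start + m - 1))
    return matches, comparisons


def speedup(baseline_comparisons, optimized_comparisons):
    """Ratio of comparison counts, following the definition of speedup used here."""
    return baseline_comparisons / optimized_comparisons


# Hypothetical pattern: a rise, then a fall, then a rise.
rise = lambda cur, prev: cur > prev
fall = lambda cur, prev: cur < prev

matches, count = naive_search([1, 2, 1, 2, 1, 2], [rise, fall, rise])
print(matches, count)
```

An optimized search such as RSPS would be instrumented in the same way, and the two counts passed to `speedup`.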
5.2.1 Stock Market Data

In the stock market, there is a set of common chart patterns that can be very useful for technical analysis. Examples of such chart patterns are Double Bottom (two consecutive local minima that are roughly equal, with a moderate peak in between), Triple Top (three equal highs followed by a break below a specific price) and Ascending Triangle (a bullish formation that usually forms during an uptrend as a continuation pattern). In the following we show the power of RSPS in finding such patterns.

[Figure 5.1: Illustration of Network Data (three panels plotted against Time)]

[Figure 5.2: Illustration of Network Data]

For instance, we ran the following patterns, which reflect the search for a set of repeated consecutive relaxed double bottoms in the stock market data of a given company. Example 6 looks for consecutive intervals consisting of a period of high fluctuation (more than 1% up and down) followed by a period of steady rise in the stock price, in the DJIA (Dow Jones Industrial Average) index for 1975-2000.

Example 6. Pattern of consecutive intervals consisting of a period of high fluctuation (more than 1% up and down) followed by a period of steady rise.

    SELECT X.NEXT.date, X.NEXT.price, T.previous.date, T.previous.price
    FROM djia
      SEQUENCE BY date
      AS (*X, *(*Y, *Z), T)
    WHERE X.price >= X.previous.price
      AND Y.price < 0.99 * Y.previous.price
      AND Z.price > 1.01 * Z.previous.price
      AND T.price <= 1.01 * T.previous.price

While running RSPS, we found 32 matches over 25 years of DJIA data. Figures 5.3 and 5.4 show the patterns that occurred around December 2000.
The following are interesting observations from our experiment.

[Figure 5.3: 32 consecutive bump shape fluctuations found in the DJIA data are shown in red (dark).]

• The RSPS speedup depends on the nature of the pattern query and of the input itself. More interdependencies between pattern elements make it possible to gain more speedup through RSPS.
• RSPS improves the search speed even when there is no match for a given query. This case is very interesting to study when we want to make sure that there is no occurrence of a given pattern query in a sequence or data stream.

[Figure 5.4: A closer look at the red (dark) square highlighted in Figure 5.3, which shows one of the matches]

• When there is no interdependency between the pattern elements, the performance of RSPS gets close to that of the naive search.
• RSPS pattern generalization provides a simple mechanism for relaxing the query when the pattern does not occur in the data.

We also ran the following patterns, which reflect the search for a set of repeated consecutive relaxed double bottoms in the stock market data of a given company.

Example 7: Pattern of repeated consecutive relaxed double bottoms in the stock market data of a given company.
    SELECT X.NEXT.date, X.NEXT.price, S.previous.date, S.previous.price
    FROM company
      SEQUENCE BY date
      AS *(X, *Y, *Z, *T, *U, *V, *W, *R, S)
    WHERE X.price >= 0.98 * X.previous.price
      AND Y.price < 0.98 * Y.previous.price
      AND 0.98 * Z.previous.price < Z.price
      AND Z.price < 1.02 * Z.previous.price
      AND T.price > 1.02 * T.previous.price
      AND 0.98 * U.previous.price < U.price
      AND U.price < 1.02 * U.previous.price
      AND V.price < 0.98 * V.previous.price
      AND 0.98 * W.previous.price < W.price
      AND W.price < 1.02 * W.previous.price
      AND R.price > 1.02 * R.previous.price
      AND S.price <= 1.02 * S.previous.price

Moreover, we ran RSPS for a query similar to Example 7, as follows:

Example 8:

    SELECT X.NEXT.date, X.NEXT.price, S.previous.date, S.previous.price
    FROM company
      SEQUENCE BY date
      AS (X, *Y, *Z, *T, *U, *V, *W, *R, S)
    WHERE X.price >= 0.98 * X.previous.price
      AND Y.price < 0.98 * Y.previous.price
      AND 0.98 * Z.previous.price < Z.price
      AND Z.price < 1.02 * Z.previous.price
      AND T.price > 1.02 * T.previous.price
      AND 0.98 * U.previous.price < U.price
      AND U.price < 1.02 * U.previous.price
      AND V.price < 0.98 * V.previous.price
      AND 0.98 * W.previous.price < W.price
      AND W.price < 1.02 * W.previous.price
      AND R.price > 1.02 * R.previous.price
      AND S.price <= 1.02 * S.previous.price

Note that there is no (*) in front of the pattern in the last example.

Company Ticker | Sequence Length | RSPS Speedup (Example 8) | Number of matches | RSPS Speedup (Example 6) | Number of matches
DELL | 4169 | 3.12 | 14 | 3.94 | 0
EBAY | 1615 | 3.00 | 6 | 3.28 | 0
IBM | 10863 | 12.32 | 54 | 20.08 | 2
GE | 10863 | 9.35 | 78 | 16.21 | 4
COKE | 3743 | 7.55 | 20 | 9.77 | 0
PEPSI | 1480 | 7.68 | 12 | 8.35 | 2
SONY | 5520 | 8.20 | 29 | 11.2 | 1
WMART | 8204 | 5.04 | 72 | 8.04 | 3
DIJ | 6000 | 85.22 | 14 | 92 | 1

Table 5.1: Performance for selected companies for a given query (Example 6)

Table 5.1 compares the OPS and RSPS speedups for these patterns. As Table 5.1 illustrates, the speedups we obtained from running several queries were up to 100 times.
In the following set of graphs we compare the performance of RSPS with the naive search in a couple of case studies.

1) Figure 5.5 compares the number of comparisons between pattern and input when the speedup is equal to 3.
2) Figure 5.6 compares the number of comparisons between pattern and input, as in the previous case, when the speedup is equal to 27.
3) Figure 5.7 illustrates the power of RSPS in detail and shows how Next and Shift speed up the search process compared to the naive search.

[Figure 5.5: Comparison between naive search and RSPS with speedup = 3 (panels: Naive Search Path, OPS* Search Path; axes i and j)]

[Figure 5.6: Comparison between the naive search and RSPS with speedup = 27 (panels: Naive Search Path, OPS* Search Path; axes i and j)]

5.2.2 Synthetic Pattern Generator

To evaluate the power of RSPS, we employed a simulator capable of generating complex queries around a given seed query. In our simulator, a user can modify the number of elements in the query, the length of the query, and the parameters in each element of the pattern. For instance, assume the user is interested in Example 6. We treat this query as the seed query and generate a set of queries around it.
We can modify Example 8 by changing numbers to parameters as follows:

    SELECT X.NEXT.date, X.NEXT.price, S.previous.date, S.previous.price
    FROM company
      SEQUENCE BY date
      AS *(X, *Y, *Z, *T, *U)
    WHERE X.price >= A * X.previous.price
      AND Y.price < B * Y.previous.price
      AND C * Z.previous.price < Z.price
      AND Z.price < D * Z.previous.price
      AND T.price > E * T.previous.price
      AND U.price <= F * U.previous.price
      AND V.price < G * V.previous.price
      AND I * W.previous.price < W.price
      AND W.price < J * W.previous.price
      AND R.price > K * R.previous.price
      AND S.price <= L * S.previous.price

[Figure 5.7: How Shift and Next speed up the search (panels: Naive Search Path, OPS Search Path; axes i and j)]

Now the user can pick part of this query, drop or add (*), and change the parameters (A, B, ...). The following is a sample of a modified version of Example 8.

Example 8 (modified):

    SELECT X.NEXT.date, X.NEXT.price, S.previous.date, S.previous.price
    FROM company
      SEQUENCE BY date
      AS *(X, *Y, *Z, *T)
    WHERE X.price >= 0.90 * X.previous.price
      AND Y.price < 0.88 * Y.previous.price
      AND 0.98 * Z.previous.price < Z.price
      AND Z.price < 1.05 * Z.previous.price
      AND T.price > 1.05 * T.previous.price
      AND V.price < 0.97 * V.previous.price
      AND 0.96 * W.previous.price < W.price
      AND W.price < 1.01 * W.previous.price
      AND R.price > 1.03 * R.previous.price
      AND S.price <= 1.02 * S.previous.price

We ran the simulator to generate 100 queries and ran RSPS to find each pattern. The mean, min, max and standard deviation of the RSPS speedup over the naive search are given in Table 5.2. As Table 5.2 shows, RSPS performance may vary dramatically with the query pattern (for instance, from about 200 down to 37 on the DIJ data).
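The generator and the summary statistics described above can be sketched as follows. This is only an assumed reading of the simulator: it perturbs the numeric thresholds of a seed query and does not model adding or dropping elements or (*). The threshold names A to F follow the parameterized query, the seed values 0.98 and 1.02 come from Example 8, and `generate_queries`, `summarize` and the `jitter` width are hypothetical.

```python
import random
import statistics

# Seed thresholds taken from Example 8; the letters follow the parameterized query.
SEED_THRESHOLDS = {"A": 0.98, "B": 0.98, "C": 0.98, "D": 1.02, "E": 1.02, "F": 1.02}


def generate_queries(seed_thresholds, count, jitter=0.05, rng=None):
    """Return `count` threshold sets, each perturbed around the seed values.

    jitter is an assumed maximum absolute perturbation per threshold.
    """
    rng = rng or random.Random(0)  # fixed seed so that runs are repeatable
    queries = []
    for _ in range(count):
        queries.append({
            name: round(value + rng.uniform(-jitter, jitter), 2)
            for name, value in seed_thresholds.items()
        })
    return queries


def summarize(speedups):
    """Mean, min, max and sample standard deviation of measured speedups."""
    return {
        "mean": statistics.mean(speedups),
        "min": min(speedups),
        "max": max(speedups),
        "stdev": statistics.stdev(speedups),
    }


queries = generate_queries(SEED_THRESHOLDS, 100)
print(len(queries), queries[0])
print(summarize([1.0, 2.0, 3.0]))
```

Each generated threshold set would be substituted into the parameterized query, the search run once per query, and the resulting speedups passed to `summarize` to obtain one row of statistics per ticker.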
Ticker   Length   Speedup   Matches   RSPS on 100 simulated queries
                                      Mean     Min      Max       Stdv
DELL     4169     3.94      0         28.03    6.64     73.61     22.22
EBAY     1615     3.28      0         11.36    2.79     29.53     7.25
IBM      10863    20.08     2         66.43    11.16    139.90    44.60
GE       10863    16.21     4         58.55    9.27     150.94    43.59
COKE     3743     9.77      0         48.84    8.57     154.52    40.00
PEPSI    1480     8.35      2         28.55    8.59     62.80     15.75
SONY     5520     11.2      1         27.58    4.89     85.16     20.44
WMART    8204     8.04      3         64.94    9.35     214.27    55.72
DIJ      6000     92        1         86.93    37.11    201.20    43.97
Table 5.2: RSPS Impressive Speedup
5.2.3 Network Data
For this experiment we used a sample of network data. As with the stock market data, we ran the simulator to generate 100 queries for a triple top query and ran RSPS to find the pattern. The mean and variance of the RSPS speedup over naive search are illustrated in Table 5.3. As Table 5.3 shows, the speedups we obtained from running several queries were up to 25 times. While we observed various speedups across all data, we only show the results of running RSPS on selected ports with better speedup. As with the stock market data, RSPS performance may vary radically with the pattern query and the data itself.
Data               Port #   Length   Speedup (Triple Bottom)   Speedup (Double Top)
Load               3        8000     13.0                      24.6
Packet Rate        4        8000     21.6                      24.1
Packet Collision   16       8000     4.4                       13.5
Table 5.3: RSPS performance over the network data
5.3 MCCPS Result
We applied Multiple Concurrent Conjunctive Pattern Search (MCCPS) to the same set of datasets. Figure 5.8 illustrates the MCCPS output. The blue (black) line represents the input and the red (thick black) line highlights patterns in the input. There are two patterns in this example. The green line in both graphs shows where the two patterns are satisfied simultaneously. MCCPS finds the query matches in one run.
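To make the notion of two patterns being satisfied simultaneously concrete, the following Python sketch shows the naive baseline that MCCPS is compared against: each pattern is searched independently in its own series, and the match positions are then intersected. It is not MCCPS itself. The function names, the representation of a pattern as a list of predicates over (previous value, current value), the absence of (*) elements, and the choice to treat "simultaneous" as "starting at the same index" are all simplifying assumptions made here.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A pattern element is a predicate over (previous value, current value).
Predicate = Callable[[float, float], bool]


def naive_matches(series: Sequence[float],
                  pattern: Sequence[Predicate]) -> Tuple[Set[int], int]:
    """Try every start position; return (matching start indices, comparisons made)."""
    hits: Set[int] = set()
    comparisons = 0
    # Start at 1 because every element needs a previous value.
    for start in range(1, len(series) - len(pattern) + 1):
        matched = True
        for offset, predicate in enumerate(pattern):
            comparisons += 1
            if not predicate(series[start + offset - 1], series[start + offset]):
                matched = False
                break
        if matched:
            hits.add(start)
    return hits, comparisons


def concurrent_matches(pairs: List[Tuple[Sequence[float], Sequence[Predicate]]]
                       ) -> Tuple[Set[int], int]:
    """Intersect the independent match sets of several (series, pattern) pairs."""
    common = None
    total = 0
    for series, pattern in pairs:
        hits, comparisons = naive_matches(series, pattern)
        total += comparisons
        common = hits if common is None else common & hits
    return (common or set()), total


def rise(prev: float, cur: float) -> bool:
    return cur > prev


def fall(prev: float, cur: float) -> bool:
    return cur < prev


series_a = [100, 103, 101, 104, 100]
series_b = [10, 9, 8, 9, 7]
both, cost = concurrent_matches([(series_a, [rise, fall]), (series_b, [fall])])
```

The comparison count returned here is the quantity that a concurrent search would aim to reduce, since this baseline pays the full cost of each pattern separately.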
[Figure 5.8: Illustration of MCCPS for 2 patterns.]
Our experiments show that MCCPS improves the search by up to two orders of magnitude. For instance, we applied MCCPS to the following queries.
Example 9:
    SELECT X.NEXT.date, X.NEXT.price,
           S.previous.date, S.previous.price
    FROM company
    SEQUENCE BY date
    AS (X, *Y, *Z, *T, *U, *V, *W, *R, S)
    WHERE X.price >= 0.98 * X.previous.price
    AND Y.price < 0.98 * Y.previous.price
    AND 0.98 * Z.previous.price < Z.price
    AND Z.price < 1.02 * Z.previous.price
    AND T.price > 1.02 * T.previous.price
    AND 0.98 * U.previous.price < U.price
    AND U.price < 1.02 * U.previous.price
    AND V.price < 0.98 * V.previous.price
    AND 0.98 * W.previous.price < W.price
    AND W.price < 1.02 * W.previous.price
    AND R.price > 1.02 * R.previous.price
    AND S.price <= 1.02 * S.previous.price
[Figure 5.9: Comparison between naive search and MCCPS. Axes: i, j.]
and
Example 10:
    SELECT X.NEXT.date, X.NEXT.price,
           S.previous.date, S.previous.price
    FROM company
    SEQUENCE BY date
    AS (X, *Y, *Z)
    WHERE X.price >= 0.98 * X.previous.price
    AND Y.price < 0.98 * Y.previous.price
    AND 0.98 * Z.previous.price < Z.price
    AND Z.price < 1.02 * Z.previous.price
    AND T.price > 1.02 * T.previous.price
Company   Sequence Length   MCCPS Speedup
DELL      4169              6.1
EBAY      1615              3.4
IBM       10863             19.3
GE        10863             16.4
COKE      3743              11.77
PEPSI     1480              8.4
SONY      5520              10.7
WMART     8204              8.1
DIJ       6000              94.6
Table 5.4: Performance for selected companies for a given query (Example 6)
Table 5.4 illustrates this performance. Figure 5.9 illustrates the number of comparisons made by RSPS and by naive search when we run RSPS independently for each pattern.
MCCPS gains even more speedup because it uses not only the internal relationships among the elements of each pattern but also the relations between elements of the two patterns. Figure 5.10 illustrates the pattern matching capability of MCCPS when looking for simultaneous occurrences of two patterns. In this particular example MCCPS is 18 times faster than naive search.
[Figure 5.10: Illustration of MCCPS Pattern Matching. Axes: i, j.]
Chapter 6
Conclusion and Future Work
In this chapter we conclude this dissertation and discuss the future work related to the core of this research.
6.1 Concluding Remarks
Many applications in the commercial or scientific domains share the need for processing and analyzing sequential or stream data. Examples include analysis of data gathered from sensor networks, the stock market, telecommunications networks, seismic activity, and remote sensing. At times, the only feasible way to understand large volumes of data is to search for patterns of interest. This is an especially difficult task when the patterns of interest are complex in nature, in the sense that traditional constructs available in SQL may be unable to express these rich patterns. Facilities such as datablades have increased the expressive power of database query languages. However, many applications remain that need a more expressive language for describing their patterns of interest. Another challenge with most such complex applications is that data needs to be processed on the fly. The limited buffer available for keeping the history of the time series is thus another problem that needs to be addressed. An implementation of the pattern detection mechanism may be required to work without keeping the entire history of the sequence in fast memory. In this thesis we investigated the design and optimization of constructs that enable SQL to express complex patterns.
We proposed the RSPS (Recursive Sequential Pattern Search) algorithm, which is inspired by the KMP (Knuth-Morris-Pratt) string matching algorithm and exploits the inter-dependencies between the elements of a sequential pattern to minimize repeated passes over the same data. MCCPS (Multiple Concurrent Conjunctive Pattern Search) extends RSPS to search for two or more conjunctive patterns occurring concurrently over the same sequential data. In addition, we added a novel technique to MCCPS that lets it search for multiple concurrent conjunctive patterns with recurrent (*) elements. Moreover, we proposed a new approach that employs RSPS to detect concurrent occurrences of a single pattern over multiple input data. These contributions were investigated and proven through empirical and analytic exploration in the stock market and network data domains.
6.2 Future Challenges
There are several directions and issues for future work. Two of the more immediate issues that we plan to address are buffer size for data streams and pattern generalization.
6.2.1 Calculating the buffer size for data streams
Most streaming applications have a limited buffer for transient data, so to make RSPS applicable to streaming applications we need to address the buffer size issue. Since RSPS minimizes repeated passes over the same data by pre-computing shift(j) and next(j), in every pass we can remove some of the input data from the buffer and read the same amount of new incoming data into it. Our goal at this stage will be to find an optimal buffer size for useful pattern classes.
6.2.2 Pattern Generalization
Consider a set of patterns $P = \{P_1, P_2, \cdots, P_K\}$ of interest, and suppose we wish to detect any occurrences of patterns similar to this set $P$ in a collection of much longer time series $S = \{S_1, S_2, \cdots, S_T\}$. We will assume that $P$ is a set of univariate time series.
Although a univariate time series data set is usually given as a single column of numbers, time is in fact an implicit variable in the time series.
[Figure 6.1: Example of a set of queries.]
Figure 6.2 provides a simple example of such patterns $P = \{P_1, P_2, P_3\}$ and their shapes. For instance, a physician might be able to recognize these patterns after research and study, but it is very hard to recognize such patterns on the fly. In real data sets, such patterns are more complex and much longer. For simplicity, however, we only illustrate a simple and short example.
For a given pattern $P_i = \langle x^i_1, x^i_2, \cdots, x^i_{T_i} \rangle$, for each point $x^i_j$ we define the following set of features, which we denote by $F_i = \langle f^i_1, f^i_2, \cdots, f^i_{T_i} \rangle$. A similar set of features has also been introduced in (Perng et al., 2000), but with a different use and concept. For a given point $x^i_j$ in a given pattern $P_i$, the set of features $f^i_j = \{f1^i_j, f2^i_j, \cdots, f9^i_j\}$ is defined as:
• $x^i_j$, the value on the x axis (usually the time axis) of the $j$th point in pattern $i$.
• $y^i_j$, the value on the y axis of the $j$th point in pattern $i$.
• $Dx^i_j = x^i_j - x^i_{j-1}$, which in most cases is equal to 1
• $Dy^i_j = y^i_j - y^i_{j-1}$
• $Rx^i_j = x^i_j / x^i_{j-1}$
• $Ry^i_j = y^i_j / y^i_{j-1}$
• $RDx^i_j = Dx^i_{j+1} / Dx^i_j = (x^i_{j+1} - x^i_j) / (x^i_j - x^i_{j-1})$
• $RDy^i_j = Dy^i_{j+1} / Dy^i_j = (y^i_{j+1} - y^i_j) / (y^i_j - y^i_{j-1})$
• $RD^i_j = Dy^i_j / Dx^i_j = (y^i_j - y^i_{j-1}) / (x^i_j - x^i_{j-1})$
• $RD2^i_j = (RD^i_{j+1} - RD^i_j) / (x^i_j - x^i_{j-1})$
These features basically indicate the relations among local points in a pattern. The advantage of these features is that, if they are chosen correctly, they can represent different types of scaling of a pattern.
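The feature definitions above translate directly into code. The sketch below is an illustration only: the function name `local_features`, the dictionary keys, and the representation of a pattern as a list of (x, y) pairs are choices made here. The sample points are the three values implied by the thesis's worked example for j = 2, where $y_1 = 12$ and $y_3 = 6$ are inferred from $Dy = 2$ and $RDy = -4$ rather than stated explicitly.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def local_features(points: List[Point], j: int) -> Dict[str, float]:
    """Features of the j-th point (1-based, as in the text).

    Needs both neighbours, so 2 <= j <= len(points) - 1. Ratios are undefined
    (ZeroDivisionError) when a denominator such as x_{j-1}, y_{j-1} or Dy_j is zero.
    """
    if not 2 <= j <= len(points) - 1:
        raise ValueError("j must have a previous and a next point")
    (x_prev, y_prev), (x, y), (x_next, y_next) = points[j - 2], points[j - 1], points[j]
    dx, dy = x - x_prev, y - y_prev
    dx_next, dy_next = x_next - x, y_next - y
    rd = dy / dx                    # local slope at point j
    rd_next = dy_next / dx_next     # local slope at point j + 1
    return {
        "x": x,
        "y": y,
        "Dx": dx,
        "Dy": dy,
        "Rx": x / x_prev,
        "Ry": y / y_prev,
        "RDx": dx_next / dx,
        "RDy": dy_next / dy,
        "RD": rd,
        "RD2": (rd_next - rd) / dx,
    }


# Points 1..3 of the first pattern, inferred from the worked example for j = 2.
pattern_1 = [(1, 12), (2, 14), (3, 6)]
features = local_features(pattern_1, 2)
```

Because the ratio-based features divide by neighbouring differences, a flat segment (zero $Dy$) needs special handling in practice; the sketch simply lets the error surface.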
The following are the features for $j = 2$ in Figure 6.2, which we denote by $f^1_2 = \{f1^1_2, f2^1_2, \cdots, f9^1_2\}$:
• $x^1_2 = 2$, $y^1_2 = 14$
• $Dx^1_2 = x^1_2 - x^1_1 = 1$
• $Dy^1_2 = y^1_2 - y^1_1 = 2$
• $Rx^1_2 = x^1_2 / x^1_1 = 2$
• $Ry^1_2 = y^1_2 / y^1_1 = 1.16$
• $RDx^1_2 = Dx^1_3 / Dx^1_2 = (x^1_3 - x^1_2) / (x^1_2 - x^1_1) = 1$
• $RDy^1_2 = Dy^1_3 / Dy^1_2 = (y^1_3 - y^1_2) / (y^1_2 - y^1_1) = -4$
• $RD^1_2 = Dy^1_2 / Dx^1_2 = (y^1_2 - y^1_1) / (x^1_2 - x^1_1) = 2$
• $RD2^1_2 = (RD^1_3 - RD^1_2) / (x^1_2 - x^1_1) = (-8 - 2) / 1 = -10$
[Figure 6.2: Pattern Example.]
These features are useful not only for representing a time series; they can also be used to define a query as input for RSPS. The main idea is to discover the best set of features that explains all patterns (at a given time). For instance, in our example, looking at the rate of deviation could be enough to extract a simple model that covers all instances, rather than looking at the amplitude of the variable over time.
Bibliography
Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. The Fourth International Conference on Foundations of Data Organization and Algorithms.
Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling and translation in time series databases. VLDB.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. ICDE.
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. ACM SIGACT-SIGMOD-SIGART.
Barga, Roger (2007). Consistent streaming through time: A vision for event stream processing. CIDR '07, Conference on Innovative Database Research.
Berndt, D.J. (1994). Using dynamic time warping to find patterns in time series. AAAI workshop on KDD.
Bizarro, Pedro (2007). Bicep - benchmarking complex event processing systems. Event Processing.
Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Boyer, R., & Moore, S. (1977). A fast string searching algorithm. Communications of the Association for Computing Machinery, 20, 762–772.
Buchmann, Alejandro P. (2007). Infrastructure for smart cities: The killer application for event-based computing. Event Processing. Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., & Zdonik, S. (2002). Monitoring streams: A new class of data management applications. VLDB.
Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M., Hellerstein, J., Hong, W., Krishnamurthy, S., Madden, S., Raman, V., Reiss, F., & Shah, M. (2003). Telegraphcq: Continuous dataflow processing for an uncertain world. CIDR.
Chandy, Mani, Etzion, Opher, von Ammon, Rainer, & Niblett, Peter (2007). 07191 summary – event processing. Event Processing. Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Das, G., Lin, K., Mannila, H., Renganathan, G., & Smyth, P. (1995). Rule discovery for time series. KDD.
Edwards, R.D., & Magee, J. (1997). Technical analysis of stock trends. AMACOM.
Garey, M., & Johnson, D. (1979). Computers and intractability: A guide to the theory of NP-completeness. Freeman and Company.
Gatziu, S., & Dittrich, K. R. (1993). Events in an object-oriented database system. Proceedings of the First Intl. Conference on Rules in Database Systems.
Gehani, N. H., Jagadish, H. V., & Shmueli, O. (1992). Composite event specification in active databases: Model and implementation. Proceedings of the 18th International Conference on Very Large Data Bases.
Gehrke, J.E., Korn, F., & Srivastava, D. (2001). On computing correlated aggregates over continual data streams.
ACM SIGMOD International Conference on Management of Data.
Giugno, R., & Shasha, D. (2002). Graphgrep: A fast and universal method for querying graphs. 16th International Conference on Pattern Recognition.
Gupta, A., Sudarshan, S., & Viswanathan, S. (2001). Query scheduling in multi query optimization. IDEAS.
Gusfield, D. (1997). Algorithms on strings, trees and sequences. Cambridge University Press.
Harada, L. (2004). Detection of complex temporal patterns over data streams. Information Systems, 29, 439–459.
Huh, Eui-Nam (2007). Sensor event processing on grid. Event Processing. Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Jagadish, H.V., Mendelzon, A., & Milo, T. (1995). Similarity-based queries. PODS.
Karp, R., & Rabin, M. (1987). Efficient randomized pattern matching algorithms. IBM Journal of Research and Development, 31, 249–260.
Keogh, E., & Smyth, P. (1997). A probabilistic approach to fast pattern matching in time series databases. KDD.
Knuth, D.E., Morris, J. H., & Pratt, V. R. (1977). Fast pattern matching in strings. SIAM Journal on Computing, 6, 323–350.
Berry, M. J. A., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. John Wiley.
Mannila, H., Toivonen, H., & Verkamo, A. I. (1996). Discovering generalized episodes using minimal occurrences. KDD.
Mesrobian, E., Muntz, R.R., Santos, J.R., Shek, E.C., Mechoso, C.R., Farrara, J.D., & Stolorz, P. (1994). Extracting spatio-temporal patterns from geoscience datasets. IEEE Workshop on Visualization and Machine Vision.
Motakis, I., & Zaniolo, C. (1997). Temporal aggregation in active databases. Int. Conf. on the Management of Data.
Parker, D. S. (1990). Stream data analysis in prolog. In L. Sterling (Ed.), The practice of prolog. MIT Press.
Perng, C.S., & Parker, D.S. (1998). Sql/lpp: a time series extension of sql based on limited patience patterns (Technical Report). UCLA Computer Science Dept.
Perng, C., Wang, H., Zhang, S. R., & Parker, D. (2000). Landmarks: A new model for similarity-based pattern querying in time series databases. ICDE.
Ramakrishnan, R. (1998). Srql: sorted relational query language. SSDBM.
Ramakrishnan, R., Donjerkovic, D., Ranganathan, A., Beyer, K., & Krishnaprasad, M. (1998). Srql: Sorted relational query language.
Roy, P., Seshadri, A., Sudarshan, A., & Bhobhe, S. (2000). Efficient and extensible algorithms for multi query optimization. SIGMOD.
Sadri, R., Zaniolo, C., Zarkesh, A., & Adibi, J. (2001). Optimization of pattern matching queries on database sequences. PODS.
Sadri, R., Zaniolo, C., Zarkesh, A., & Adibi, J. I. (2004). Expressing and optimizing sequence queries in database systems. ACM Transactions on Database Systems.
Sellis, T. (1988). Multiple query optimization. ACM Transactions on Database Systems, 13, 23–52.
Smyth, P. (1997). Clustering sequences using hidden markov models. In Advances in neural information processing.
Informix Software (1998). Managing time-series data in financial applications.
Sun, M., & Fang, J. (2008). A low-cost representation for similarity search of time-series pattern based on minimum bounding rectangle. Congress on Image and Signal Processing.
Terfloth, Kirsten, Hahn, Katharina, & Voisard, Agnes (2007). On the cost of shifting event processing within wireless environments. Event Processing. Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Tsong-Li Wang, J., Chim, G., Marr, T.G., Shapiro, B. A., Shasha, D., & Zhang, K. (1994). Combinatorial pattern discovery for scientific data: Some preliminary results. SIGMOD.
Urban, Susan, Dietrich, Suzanne, & Chen, Yi (2007). An xml framework for integrating continuous queries, composite event detection, and database condition monitoring for multiple data streams. Event Processing.
Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany.
Wang, T., & Tan, J. (1996). Incremental discovery of sequential patterns. Research Issues on Data Mining and Knowledge Discovery.
Widom, J., & Babu, S. (2001). Continuous queries over data streams. SIGMOD.
Wright, C.A., Cumberland, L., & Feng, Y. (1998). Performance comparison between five string pattern matching algorithms.
Yang, J., & Widom, J. (2001a). Incremental computation and maintenance of temporal aggregates. ICDE.
Yang, J., & Widom, J. (2001b). Temporal view self maintenance. TODS.
Appendix
SQL-TS Syntax
Since AXL is being considered as the framework for implementing SQL-TS, we extend the current syntax of AXL to cover SQL-TS programs and queries. The main difference is in the FROM clause. In the productions below, [ ] encloses optional items, { } encloses items that may repeat, | separates alternatives, and quoted braces "{" and "}" are literal.
⟨program⟩ → {⟨dec⟩} {⟨statement⟩;}
⟨dec⟩ → ⟨vdec⟩ | ⟨aggrdec⟩
⟨statement⟩ → ⟨select⟩ | ⟨insert⟩ | ⟨update⟩ | ⟨delete⟩ | ⟨load⟩
⟨vdec⟩ → TABLE ⟨id⟩ (⟨columns⟩ ⟨keydec⟩) [⟨scope⟩] [AS ⟨query⟩] ;
⟨columns⟩ → ⟨column⟩ {, ⟨column⟩}
⟨column⟩ → ⟨id⟩ ⟨type⟩
⟨type⟩ → INT | REAL | CHAR(⟨num⟩) | REF(⟨id⟩)
⟨keydec⟩ → {, KEY (⟨id⟩ {, ⟨id⟩})}
⟨scope⟩ → PERSISTENT | LOCAL | MEMORY
⟨aggrdec⟩ → AGGREGATE ⟨id⟩ (⟨columns⟩) : ⟨ret-type⟩ ⟨body⟩
⟨ret-type⟩ → ⟨type⟩ | (⟨columns⟩)
⟨body⟩ → "{" {⟨vdec⟩}
             INITIALIZE: [{⟨statement⟩;}]
             ITERATE: {⟨statement⟩;}
             [TERMINATE: {⟨statement⟩;}] "}"
       | "{" {⟨vdec⟩} {⟨statement⟩;} "}"
The FROM clause is the only major difference between SQL-TS and AXL. We specify the CLUSTER BY and SEQUENCE BY clauses in the FROM clause. We also specify the sequential patterns in the FROM clause.
⟨query-block⟩ → SELECT [DISTINCT] ⟨hxp⟩ {, ⟨hxp⟩}
                FROM ⟨squn⟩ {, ⟨squn⟩}
                [WHERE ⟨exp⟩]
                [GROUP BY ⟨exp⟩ {, ⟨exp⟩}]
                [HAVING ⟨exp⟩]
Note that when the SEQUENCE BY and CLUSTER BY clauses are empty, the AS can work for making aliases.
⟨squn⟩ →
    ⟨qun⟩ [CLUSTER BY ⟨exp⟩ {, ⟨exp⟩}]
          [SEQUENCE BY ⟨exp⟩ {, ⟨exp⟩}]
          [⟨ascls⟩]
⟨ascls⟩ → ⟨asmem⟩ | ⟨asmem⟩ (AND | OR) ⟨ascls⟩
⟨asmem⟩ → AS ⟨sid⟩ | (⟨sid⟩; ⟨sid⟩) [AS ⟨sid⟩ | (⟨sid⟩ {, ⟨sid⟩})]
⟨qun⟩ → ⟨id⟩ | TABLE(⟨udf⟩) | (⟨query⟩) ⟨qun-alias⟩
⟨sid⟩ → ⟨id⟩ | * ⟨id⟩
⟨hxp⟩ → ⟨exp⟩ [[AS] ⟨hxp-alias⟩] | [⟨id⟩.] *
⟨hxp-alias⟩ → ⟨id⟩ | (⟨id⟩ {, ⟨id⟩})
⟨qun-alias⟩ → [AS] ⟨id⟩ [(⟨id⟩ {, ⟨id⟩})]
⟨udf⟩ → ⟨id⟩ ( [⟨exp⟩ {, ⟨exp⟩}] )
⟨select⟩ → ⟨query⟩ [⟨order-clause⟩]
⟨order-clause⟩ → ORDER BY ⟨exp⟩ [ASC | DSC] {, ⟨exp⟩ [ASC | DSC]}
⟨query⟩ → ⟨query⟩ (UNION | INTERSECT | EXCEPT) [ALL] ⟨query-block⟩
⟨delete⟩ → DELETE FROM ⟨id⟩ [WHERE ⟨exp⟩]
⟨insert⟩ → INSERT INTO ⟨id⟩ ⟨query⟩
⟨update⟩ → UPDATE ⟨id⟩ SET ⟨updates⟩ [WHERE ⟨exp⟩]
⟨updates⟩ → ⟨id⟩ = ⟨exp⟩ {, ⟨id⟩ = ⟨exp⟩}
⟨load⟩ → LOAD FROM ⟨id⟩ INTO ⟨id⟩
⟨exp⟩ → NIL
      | ⟨num⟩
      | ⟨float⟩
      | ⟨string⟩
      | ⟨id⟩ [.⟨id⟩]
      | ⟨ref⟩
      | ⟨exp⟩ (+ | - | * | / | %) ⟨exp⟩
      | ⟨exp⟩ (= | != | < | <= | > | >=) ⟨exp⟩
      | ⟨exp⟩ (AND | OR | IN | NOT IN) ⟨exp⟩
      | EXISTS ⟨exp⟩
      | (max | min | count | sum | avg) (⟨exp⟩)
      | ⟨udf⟩ [-> ⟨id⟩]
      | ⟨case-exp⟩
      | (⟨exp⟩)
      | (⟨query⟩)
      | "{" {⟨vdec⟩} {⟨exp⟩;} "}"
⟨ref⟩ → ⟨ref⟩ -> ⟨id⟩ | ⟨id⟩ [.⟨id⟩] -> ⟨id⟩
⟨case-exp⟩ → CASE ⟨exp⟩ ⟨when-exp-list⟩ [ELSE ⟨exp⟩] END
           | CASE ⟨when-exp-list⟩ [ELSE ⟨exp⟩] END
⟨when-exp-list⟩ → WHEN ⟨exp⟩ THEN ⟨exp⟩ {WHEN ⟨exp⟩ THEN ⟨exp⟩}
⟨id⟩ → ⟨letter⟩ {⟨letter⟩ | ⟨digit⟩}
⟨letter⟩ → (a-z | A-Z)
⟨digit⟩ → (0-9)
Abstract
The need to search for complex and recurring patterns in database sequences, data streams and graphs is shared by many applications. Challenges in this problem include searching through large volumes of data in some database sequences, dealing with real-time data within a limited time frame, and the complexity of relations between tree-structured data. Feasible methods to search for patterns of interest, for data analysis purposes, will have to address these issues. In this thesis, we investigate the design and optimization of constructs that enable SQL to express complex patterns. In particular, we propose the Recursive Sequential Pattern Search algorithm (RSPS), which is inspired by the KMP (Knuth-Morris-Pratt) string matching algorithm. RSPS exploits the inter-dependencies between elements of a sequential pattern to minimize repeated passes over the same data. Moreover, we propose another novel algorithm, MCCPS (Multiple Concurrent Conjunctive Pattern Search), to look for complex patterns in single- and multi-dimensional data. Performance gains derived from a set of experiments and a sensitivity analysis for RSPS and MCCPS are also discussed. Our results demonstrate dramatic speedup in search, of up to two orders of magnitude.
Asset Metadata
Creator: Kaghazian, Leila (author)
Core Title: Complex pattern search in sequential data
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/11/2008
Defense Date: 06/24/2008
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: data mining, OAI-PMH Harvest, optimal search, pattern, pattern search, sequential data
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: McLeod, Dennis (committee chair), Boehm, Barry W. (committee member), Meshkati, Najmedin (committee member)
Creator Email: kaghazia@usc.edu, leila_kaghazian@hotmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m1568
Unique identifier: UC1210849
Identifier: etd-Kaghazian-2313 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-107605 (legacy record id), usctheses-m1568 (legacy record id)
Legacy Identifier: etd-Kaghazian-2313.pdf
Dmrecord: 107605
Document Type: Dissertation
Rights: Kaghazian, Leila
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu