Scalable Data Integration Under Constraints

by George Konstantinidis

Submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2015

Alas for those that never sing, but die with all their music in them! — O. W. Holmes

To my parents

Acknowledgements

As happens with every journey, the quest that brought me to this moment is filled with adventures, challenges and happy memories. First and foremost, I would like to thank my supervisor Prof. José Luis Ambite for his guidance and support. He has been a great mentor and a source of inspiration. He has taught me to approach every problem with boldness and enthusiasm, and that no problem is too hard to address. Every single one of our discussions has been truly stimulating and productive, and he has always steered our research in the right direction. I would like to thank Prof. Craig Knoblock for his valuable comments on my dissertation. I would like to thank Prof. Cyrus Shahabi and Prof. Daniel O'Leary for their feedback and their participation in my dissertation committee, and Prof. Phokion Kolaitis for being on my thesis proposal committee and providing valuable feedback during the early stages of this dissertation.

To my Los Angeles friends: Andreas, Antonis, Aphrodite, Celso, Christos, Dimitris, Farhang, Jorge, Katie, Kyriakos, Melina, Nasia, Obi, Shaunak, Smriti, Tania, Valia, and Vasilis, thank you for your love and support. I would like to thank Prof. Dimitris Papadias, Nikos, and the rest of the Hong Kong group for their support and the great memories during my visit to the Hong Kong University of Science and Technology. Special thanks to my brother Stathis and my parents Theo and Katia, for always being supportive of me.

Los Angeles, 30 June 2015. G. K.

Abstract

We witness an explosion of available data in all areas of human activity, from large scientific experiments, to medical data, to distributed sensors, to social media. Integrating data from disparate sources can lead to novel insights across scientific, industrial, and governmental domains. In this thesis we consider the problem of scalable data integration, in a setting that consists of: 1) a set of distributed and heterogeneous data sources, modeled as relational databases, 2) a mediating (or target) relational schema, exposed to the user as a global query interface, 3) a set of integrity constraints, expressed in the language of Tuple-Generating Dependencies (TGDs) and known as target TGDs, which is used to model complex relationships among the target relations, and 4) a set of schema mappings, which are used to connect the source and the target schemas. In this setting the main problem of data integration, and the focus of this thesis, is answering conjunctive queries and unions of conjunctive queries (UCQs). We also address the closely related problem of query containment under constraints, which studies when two queries have subset-related sets of answers under the constraints. These problems are well studied and crucial to the areas of Database Theory, Database Systems, Knowledge Representation & Reasoning, Ontologies and the Semantic Web.
The two main approaches to data integration, in our setting, are: 1) data exchange, which focuses on materializing a target database instance, by running the chase algorithm and transforming the source data into a centralized warehouse appropriate for answering queries, and 2) virtual data integration, which assumes that the data is left in the sources and consists of rewriting, at query time, the conjunctive queries over the target schema into a maximally-contained rewriting over the data sources that answers our query.

In this thesis, we make significant contributions to the problems of scalable virtual data integration, data exchange, query answering and query containment with and without the presence of integrity constraints. First, we develop a scalable algorithm for virtual data integration without target constraints. The novel insight of our approach is to look at the problem from a graph perspective and compactly represent overlapping graph patterns in the schema mappings. This, together with other optimizations, results in an experimental performance that rewrites queries in the presence of 10000 sources in under a second, about two orders of magnitude faster than state-of-the-art algorithms. We also extend this algorithm to address UCQs and multiple queries as inputs, in an optimized way. We then turn our focus to richer mediator languages and present a solution for UCQ query containment under linear weakly-acyclic constraints. Our solution employs, again, our compact graph modeling and indexing, and outperforms the brute-force approaches by two orders of magnitude. Subsequently, we examine the chase algorithm and develop an optimization that produces equivalent but smaller target databases, with the same (polynomial) data complexity as the standard chase. We implement and apply this result to the problem of query answering using views under linear weakly-acyclic constraints. For the latter setting, our experiments show that the size of the frugal chase is much smaller than that of the standard chase (and very close to the core), which we outperform by almost two orders of magnitude. Lastly, we develop a theoretical solution for query answering under non-chase-terminating constraints (linear TGDs), where we "combine" the chase with query rewriting. We employ the chase algorithm but prune its output independently of any query, and we design a novel approach to "minimize" any query into a rewriting that gives the answers on the pruned chase. We prove that our rewriting is exponentially smaller than other related approaches.

Contents

Acknowledgements
Abstract
List of Figures
1 Introduction
2 Background
2.1 Queries
2.2 Mappings in Data Integration
2.2.1 Supporting GLAV through LAV mappings
2.3 Containment, Equivalence and Homomorphisms
2.4 Answering Relational Queries Using Views
2.5 Gaifman Graphs
2.6 Data Integration and Exchange under Constraints
2.6.1 Tuple-Generating Dependencies
2.6.2 Weakly acyclic TGDs
2.6.3 The Chase
2.6.4 Data Exchange
2.6.5 The Chase Graph
2.6.6 Containment and Maximally-Contained Rewritings Under Constraints
3 Scalable Query Rewriting: A Graph-based Approach
3.1 The Query Rewriting Problem
3.2 Queries and Views as Graphs
3.2.1 Predicate Join Patterns
3.2.2 Information Boxes
3.2.3 Partial Conjunctive Rewritings
3.3 Graph-based Query Rewriting
3.3.1 Source Preprocessing
3.3.2 Query Reformulation
3.4 Experimental Evaluation
3.4.1 Star Queries
3.4.2 Chain Queries
3.4.3 Chain Queries using 10000 Views
3.5 Extension to Constants
3.6 Extension to Answering UCQs and Multiple Queries Using Views
3.6.1 Overlapping Queries
3.6.2 The MGQR Algorithm
3.6.3 Graph Modeling of UCQs
3.6.4 Multi-query rewriting
3.6.5 Experimental Results
3.7 Discussion and Related Work
3.7.1 MiniCon Phase One
3.7.2 MiniCon Phase Two
3.7.3 MCDSAT
3.7.4 GQR vs MiniCon and MCDSAT
3.7.5 Hypergraphs and Hypertree Decompositions
4 Scalable Containment for UCQs under Constraints
4.1 The UCQ Containment Problem
4.2 Graph-Based Modeling
4.3 UCQ Containment under Constraints
4.3.1 Algorithm for Chasing
4.3.2 Algorithm for Containment
4.4 Experimental Evaluation
4.5 Discussion and Related Work
5 Optimizing the Chase: Scalable Data Integration under Constraints
5.1 Chase in Virtual Data Integration
5.2 The Frugal Chase
5.3 Procedural Frugal Chase
5.4 Compact Frugal Chase For Query Rewriting Under Constraints
5.4.1 Graph Modeling
5.4.2 Graph-based Query Rewriting (GQR)
5.4.3 Compact frugal chase
5.5 Experimental Evaluation
5.5.1 Chain Queries, Constraints and Views
5.5.2 Star Constraints and Views
5.6 Discussion and Related Work
6 Pruning the Infinite Chase
6.1 Answering a fixed query using LTGDs
6.2 Query Contraction
6.2.1 Rewriting size
6.3 Discussion & Related work
7 Conclusions and Future Work
Bibliography

List of Figures

3.1 Query Q, and sources S1-S7, as graphs.
3.2 Predicate Join Patterns.
3.3 Infobox for a variable node. The node is existential and is attached to its predicate node on the edge with label 3 (this variable is the third argument of the corresponding atom). We can find this specific PJ in three views, so there are three sourceboxes in the infobox. The two join descriptions in the sourcebox S6 tell us that this variable, in view S6, joins with the second argument of P4 and the second argument of P2.
3.4 All existing source PJs for the predicates of the query Q. For presentation purposes the infoboxes of the first variable of P1 are omitted. The partial conjunctive rewritings (view heads) that each PJ maintains are "descriptions" of the current PJ and are candidate parts of the final conjunctive rewritings. During our preprocessing phase, PJs for P4 (which also exists in the sources) would also be constructed, but they are omitted here (as they are dropped by our later phase). Using these constructs we can compactly represent all 8 occurrences of P1, P2 and P3 in the sources S1-S7 with the 6 PJs presented here.
3.5 All source PJs that cover P1 are combined with all PJs that cover P2. As Alg. 2 suggests, we try all combinations of CPJs that cover the query. The figure does not show the un-combinable pairs (e.g., the PJ in (a) with the PJ in (f)). Note that when combining two nodes we eliminate the join descriptions we just satisfied. In the current example, PJ (d) is combined with both (c) and (g), which alternatively cover the same part of the query. The union of the resulting rewritings of (h) and (i) is our solution.
3.6 Repeated Predicates. In (b) only the PJs for P1 are shown; these "capture" all five occurrences of P1 in the sources S1, S2 and S3 of (a).
3.7 Repeated predicates in the query (returning variable names are shown in Q's graph for convenience). For the sources of Fig. 3.6(a), the algorithm "instantiates" the matching PJs (Fig. 3.6(b)) for every occurrence of the same query predicate.
3.8 (a) Average time and size of rewritings for star queries. GQR time does not take into account the source preprocessing time, while gqr+p does. (b) Average time and size of rewritings for chain queries.
3.9 Average reformulation time for 10 chain queries. Preprocessing time is not included in the plot.
3.10 Average size of rewriting for the chain queries of Fig. 3.9. Queries q1, q2, q3 and q9 don't produce any rewritings.
3.11 Ten chain queries on views constructed from an increasing predicate space. The upper bunch of straight lines gives the total time for the queries, while the lower dotted part gives only the reformulation time.
3.12 Number of rewritings for the 10 queries of Fig. 3.11.
3.13 (a) Queries q3, q4, q5, and (b) sources S4-S7, as graphs. (c) Query and (d) source Predicate Join Patterns. (e) Infoboxes for a query and a view variable node. The variable in the upper box is existential and appears in two queries, q3 and q4. The join description associated with q3 states that the variable joins, in q3, with the first argument of P3 and the first argument of P2.
3.14 The PJs covering P1 are first combined with the PJs covering P2, and then with the PJs covering P3. The union of the complete rewritings (marked with ?) is the solution.
3.15 Average online time and number of rewritings for 10 chain queries over up to 1000 views. The MGQR multi-query rewriting algorithm outperforms GQR rewriting queries one by one. Offline times are the same for both algorithms (not shown).
4.1 (a) The conjunctive queries in Q4, and in chase(Q4), as graphs. (b) Predicate Join Patterns (PJs) for all the queries in Q4.
4.2 (a) All the PJs of Q4 with their infoboxes. (b) All the PJs of chase(Q4) (the result of the compact chase algorithm on the PJs of Q4). In the two figures we see the infoboxes for all variable nodes. For example, we can find the PJ for P (bottom right corner of figure b) in three queries: q^4_1, q^4_2, and q^4_3. The two join descriptions related to q^4_2 in the PJ for P tell us that its variable, in query q^4_2, joins with the second argument of HCD and the second argument of TP. In figure (b) all the pre-existing infoboxes have been updated by our algorithm to reflect the new chased set of queries.
4.3 Checking containment for two UCQs of 700 queries each, under various numbers of constraints. The containment check fails in all cases. We ran each experiment 5 times and took the average times.
4.4 Checking containment for two UCQs with 500 and 1000 queries, respectively, under various numbers of constraints. The containment check always succeeds. We ran each experiment 5 times and took the average times.
5.1 Redundancy in the Chase.
5.2 Frugal Chase: Partially Satisfiable Set.
5.3 (a) Query q. (b) Sources S1-S3 as graphs. (c) Joinbox for a variable node. (d) View PJs for S1-S3 with their joinboxes. (e) Constraint c1 as a graph. (f) PJs resulting from chasing PJ TreatsPatient (TP) with the constraint c1. (g) Merging existing and inferred PJs. (h) Chased view PJs.
5.4 Graph-based Query Rewriting.
5.5 Total time for 20 chain queries, chasing 300 chain views with up to 200 constraints.
5.6 Number of PJs produced for the queries of Fig. 5.5.
5.7 Average query rewriting time for the queries of Fig. 5.5.
5.8 Average number of rewritings GQR produces for the queries of Fig. 5.5.
5.9 Average time of chasing 1000 star views, with up to 250 star constraints.
5.10 Average time of chasing 20 star views, with up to 250 star constraints.

1 Introduction

Information Integration, the problem of combining information from multiple, distributed, heterogeneous sources, is of critical importance as the amount of available information continues to grow at an accelerated pace. The need to integrate data is ubiquitous across domains, including biology, genetics, medical research, electronic health records, government, geospatial data, intelligence and law enforcement, sensor fusion, and mobile services, among many others. Information integration is a complex problem that requires solutions to many subproblems, including source discovery [29, 26, 49], source modeling and schema mapping [121, 25, 114, 57, 48, 11], and querying sources in an integrated fashion [70, 85, 55, 13, 53, 62]. In this thesis, we focus on querying, usually described using the term data integration. Although there has been significant progress on the theory of data integration (e.g., [70, 85, 62, 50]), much work is still needed to achieve practical solutions that can scale to the size required by many real problems.

From the architectural perspective, data integration can be accomplished through data warehousing or virtual integration. In the warehouse approach, the integrator cleans, reconciles and materializes (loads) the data from the sources into a single data repository (e.g., a relational database). The data management field studying this version of the problem is known as data exchange [55]. In the virtual integration approach, the data remains at the original sources and the system accesses and integrates the data from the sources in response to user queries, in real time. These approaches have complementary properties. The warehouse approach provides more efficient query answering for queries that require large amounts of data to be processed simultaneously. However, the data in the warehouse may become stale; it is only as recent as the last update cycle. The virtual integration approach always accesses the latest data from the sources, so it is necessary when data is updated often, when having the latest data is critical, or when organizational or legal constraints prevent the incorporation of data sources into a warehouse.
These two approaches are not mutually exclusive. Some data can be materialized locally and some accessed at run time from remote sources. The computational machinery needed to integrate the data in either the virtual or the warehouse approach can be designed to overlap substantially.

A data integration system (also called a mediator [132]) integrates information from multiple heterogeneous sources by defining a global schema (also known as a domain, target, or mediated schema) and defining mappings between the sources' schemas and the global schema. Often the global schema is further constrained by a set of constraints (or an ontology). Formally, a data integration/exchange setting ⟨S, T, Σ_st, Σ_t⟩ is a tuple where S is a set of heterogeneous schemas from multiple sources, T is the target schema, which must satisfy a set of target constraints/dependencies Σ_t, and Σ_st is a set of schema mappings, also known as views, that transform the data in schemas S into schema T. The main problem in this setting, and the focus of this thesis, is conjunctive query answering (we focus on conjunctive queries, which correspond to select-project-join queries and lie at the core of every query language), also known as answering queries using views under dependencies [68], or computing certain answers under constraints [55, 4, 46]: the user poses a query over target schema T and the system needs to provide the so-called certain answers to the query using the data from the sources S, such that all constraints Σ_st and Σ_t are satisfied.

Data exchange and virtual integration are the two main approaches for computing the certain answers. Data exchange achieves this by materializing a target database, obtaining data from the sources using (satisfying) the schema mappings Σ_st and satisfying the target constraints Σ_t. Then, the certain answers to a user query can be computed by directly evaluating the query over the materialized target database. In virtual integration, the mediator uses the schema mappings and target constraints to rewrite the target user query into a query that only uses terms from the source schemas but produces all the certain answers when evaluated directly over the sources.

Mappings in Σ_st, between the sources' schemas and the mediator schema, are typically expressed as logical formulas, in particular in the form of logical implications of conjunctive formulas, called tuple-generating dependencies (TGDs) [23]. Since these schema mappings connect the source and the global schemas, they are usually called source-to-target TGDs (st-TGDs) [55]. Three main types of such schema mappings have been studied. In the Global-as-View (GAV) [60, 3, 58, 126, 61] approach, each global predicate is defined by a formula (view) over source predicates. Conversely, in the Local-as-View (LAV) [92, 89, 113, 52, 84] approach, each source predicate is defined by a conjunctive formula over global predicates. Finally, these two approaches are superseded by the GLAV approach, where a conjunctive formula over the source predicates (the rule antecedent) is mapped to a conjunctive formula over the global predicates (the rule consequent) [59, 55].

Target constraints in Σ_t are also given as tuple-generating dependencies, which however involve only global relations. TGDs are very expressive, but query answering, and rewriting, under general target TGDs is undecidable [23]. However, there are syntactic restrictions, such as the sets of dependencies considered in this thesis, for which query answering is decidable, and even tractable in data complexity.
We begin by focusing on virtual integration, where usually there are no target constraints, i.e., Σ_t = ∅. As we will see in Sect. 2.2.1, answering queries using GAV mappings in Σ_st is relatively straightforward, and GLAV mappings can be reduced to a set of GAV and a set of LAV mappings. Hence, we focus on LAV schema mappings, which constitute the algorithmically challenging part of the problem. This setting forms the core problem of data integration, called query rewriting or answering queries using views [70, 85], and is the focus of Chapter 3.

The complexity of query rewriting using LAV views, in its simplest form of relational conjunctive queries and views, is already NP-hard [92]. Moreover, the number of rewritings that a single input query can reformulate into can grow exponentially with the number of views. Nevertheless, there have been several efforts to produce algorithms with good performance in practice, namely MiniCon [112] and MCDSAT [17]. In Chapter 3 we consider LAV query rewriting from a graph perspective, which allowed us to gain a deeper insight into the problem and led us to design an approach, called Graph-based Query Rewriting (GQR), that compactly represents and indexes common patterns in the mappings and (optionally) pushes some computation to an off-line phase. This approach resulted in an experimental performance about two orders of magnitude faster than state-of-the-art algorithms such as MiniCon and MCDSAT, with GQR rewriting queries using over 10000 views, producing thousands of rewritings, within seconds.

In Sect. 3.6, we extend our results to address the problem of answering multiple conjunctive queries and unions of conjunctive queries (UCQs) using views. UCQs constitute an important class of structured queries; they correspond to the select-project-join-union fragment of structured languages like SPARQL and SQL. UCQ and multi-query containment is an important problem in query optimization, data integration and ontology-based data access. Since rewriting one conjunctive query using views is an NP-hard problem, we develop an approach where answering n queries takes less than n times the cost of answering one query, by compactly representing and indexing common patterns in the input queries and the views. Our experimental results show a promising speedup of about 30%.

We then turn our focus to richer mediator languages; that is, we consider target constraints in Σ_t. Recently, there has been significant interest in approaches that incorporate intensional knowledge on a database schema, in the form of ontologies, constraints or dependencies. Relevant research has focused on specific types of constraints that provide a good trade-off between expressiveness and complexity [36, 31, 32, 55, 56, 46].

We will focus on two major classes of decidable sets of target TGDs. The first is (using terminology from [105, 22]) the finite expansion sets (FES) of rules, or chase-terminating sets. For query answering under sets of constraints in FES, we can perform a special kind of reasoning on the underlying data in order to "complete" it with respect to the constraints. In other words, given a database instance I, a query q and a set of FES constraints Σ, we can produce a new database I′ that contains I and adheres to Σ.
Thereafter, we can completely ignore the constraints in Σ (since essentially we have "compiled" all the inferred knowledge of Σ inside the database I, resulting in I′) and reduce answering the query q to just evaluating it as a normal relational query over I′. The reasoning we perform in order to compile Σ into I (resulting in I′) is known as the chase [1, 24, 23] in databases, or forward chaining in knowledge representation [119] and logic programming [95]. Thus, finite expansion sets are exactly the sets of constraints for which the chase terminates (after a finite number of steps). For data integration/exchange settings, the chase technique is a form of forward chaining that "fetches" data from the sources through the mappings Σ_st to a target database, and also "completes" this database w.r.t. the target constraints Σ_t. Again, one can then disregard the constraints and do query answering over the completed database. Under several classes of chase-terminating constraints, we subsequently consider query containment (Chapter 4), query answering and the chase algorithm, and virtual data integration (Chapter 5).

In Chapter 4 we present a scalable approach for UCQ containment with or without chase-terminating TGDs. Testing query containment (and query equivalence) is fundamental to database and knowledge representation systems. It is central to query optimization and minimization [39] and to data integration problems such as view-based query answering [88]. The classic containment problem, without the presence of constraints, is, given a database schema R and two queries Q1, Q2 on R, to decide whether for all database instances I of R it is the case that Q1(I) ⊆ Q2(I), with Q(I) being the set of answer tuples of a query Q on database I. Deciding containment for two conjunctive queries is an NP-complete problem [39], which can be solved by finding a homomorphism that maps one query into the other. A UCQ Q1 is contained in another UCQ Q2 if every conjunctive query in Q1 is contained in at least one conjunctive query of Q2 [118]. Hence, an NP-complete mapping-detection procedure has to be repeated (in the worst case) for all pairs of conjunctive queries in the two unions. By exploiting our graph-based representation of formulas, already developed in Chapter 3, our algorithms exploit overlapping parts of the conjunctive queries in a union so as to prune the number of candidate containment mappings. Moreover, in the presence of integrity constraints Σ, it holds that Q1 ⊆_Σ Q2 iff chase_Σ(Q1) ⊆ Q2; that is, in order to decide containment we can chase all conjunctive queries in Q1 and then check for regular containment, through containment mappings.

In Chapter 4, we focus on linear (or LAV) weakly-acyclic constraints [6]. This type of constraint is chase-terminating and first-order rewritable (first-order rewritable languages are discussed later); it is a superset of the useful class of weakly-acyclic inclusion dependencies, as well as of the class of LAV TGDs with no existential variables (LAV full TGDs); it includes practically interesting languages like RDF/S (http://www.w3.org/RDF/); and it has good computational properties in data integration, data exchange [64, 4], and in inconsistent and incomplete databases [6]. In Sect. 4.3.1, we develop a compact graph-based chase algorithm in order to perform the chase reasoning on the candidate "containee" UCQ, and then proceed with the aforementioned containment algorithm.
Experimentally, these algorithms yield a two-orders-of-magnitude performance improvement.

In Chapter 5, we first focus on data exchange and query answering under general chase-terminating target TGDs. In particular, we examine the standard chase algorithm thoroughly and develop a novel, optimized version of it, called the frugal chase, usable in data integration, data exchange, or incomplete-database settings with GLAV mappings and GLAV target constraints. Instead of adding the entire consequent to a solution when a chase rule is applicable, as the standard chase does, the frugal chase avoids adding provably redundant atoms (Sect. 5.2). We prove that the frugal chase produces equivalent, yet smaller (in number of tuples), results (called universal solutions), with the same data complexity as the standard chase. We also present a procedural version of the frugal chase (Sect. 5.3), and a compact graph-based version of the latter adapted to our GQR rewriting algorithm (Sect. 5.4), significantly extending the compact chase algorithm of Sect. 4.3.1.

We apply this compact frugal chase to the problem of virtual data integration under constraints. For virtual integration with target constraints, Afrati and Kiourtis [4] used the chase algorithm in a novel way by "compiling" the target constraints (specifically, LAV weakly-acyclic TGDs) into the LAV mappings, to reduce the problem to view-based query rewriting without constraints. In this work, we present an algorithm for query rewriting under LAV weakly-acyclic target TGDs, building on [4], GQR, and our optimized chase. This algorithm uses the compact frugal chase to efficiently compile the constraints into the mappings (à la [4]), and then efficiently does query rewriting (using GQR). Our algorithm experimentally performs about two orders of magnitude faster than running the standard chase on each view individually and then applying query rewriting using views. Our approach scales to larger numbers of constraints and views and produces smaller, but equivalent, UCQ rewritings (containing fewer and shorter conjunctive queries) that compute the certain answers. For our experimental setting, the size of the frugal-chased mappings is very close to the core [56, 47] (i.e., the globally minimized chased mappings). Nevertheless, our compact algorithm achieves this output almost three orders of magnitude faster than the core chase [47], since we do not rely on minimization.

The second class of constraints we consider are those for which the chase might not terminate (so one cannot transform our original database I into an I′ adhering to Σ), but there exists an algorithm which can compile the inferred knowledge of the constraints into the query itself. That is, given a database instance I, a query q and a set of constraints Σ, we can rewrite q into q_Σ such that simply evaluating q_Σ over I answers our original query under the constraints. The technique that allows us to rewrite the query w.r.t. the constraints is known as query expansion [36], based on backward chaining [119] and resolution/unification [18, 19]. Query expansion has been extensively studied in the context of Ontology-Based Data Access (OBDA) and Integration (OBDI) [36, 116, 109], where the constraints have the form of ontologies.
Special interest has been given to classes of constraints for which the query expansion is a UCQ; these constraint languages are called first-order rewritable (FO-rewritable), and the rewriting can be translated to an SQL query, which is useful in practical systems. The sets of constraints for which the query expansion terminates are also called finite unification sets, or FUS [105]. The major concern with these languages is that query expansion can rewrite a conjunctive user query into an exponentially larger (in the worst case) query that takes into account the constraints imposed on the target schema. This exponential rewriting/expansion is usually so large that it is infeasible to evaluate, and much attention has been devoted either to minimizing it or to devising alternative answering approaches based on the chase algorithm.

In this vein, we conclude this thesis by presenting a novel theoretical result, in Chapter 6, where we study conjunctive query answering under linear tuple-generating dependencies. Linear TGDs, an FO-rewritable language, generalize the language of inclusion dependencies and constitute the core of the DL-Lite family of languages, which underpins the W3C-recommended web ontology standard, OWL2-QL. For linear TGDs, the chase fails to be of direct use, since it does not terminate. We propose a combined approach to query answering: we chase the data up to a stopping point, which is independent of any query, and then rewrite the original query into a rewriting that can be issued directly on our "pruned" chase and which is much smaller than in previous traditional and combined approaches. We prove that for any query we can get its results by issuing an equivalent (to the original conjunctive query) union of conjunctive queries on a fixed portion of the infinite chase that depends only on the constraints. This allows us to chase the data up to a point independently of any query, in effect pruning the rest of the infinite chase, and then just "minimize" any incoming query at runtime, such that it can "fit" inside our pruned chase. We compute the rewriting with the help of a small "prototypical" database, which we chase at query time. Our prototypical database, which is essentially only the "frozen" schema of our original database, contains only "pattern" facts rather than actual tuples, and hence could be several orders of magnitude smaller than the actual database. To account for all the answers of our original query on the infinite chase, there is more than one query we need to issue on our pruned chase. Nevertheless, unlike the traditional UCQ rewriting of the query, which is of a large exponential size, the UCQ rewriting needed to get the certain answers using our pruned chase result is much smaller in size (though still exponential).

Concluding, our results are crucial to the success of data exchange and virtual integration. For the first time we present a virtual integration algorithm which scales to thousands of sources; this is necessary for domains with many (and usually small) sources of information, such as biological data integration, or sensor and mobile networks (where one could have a global interface to query data on thousands of devices).
Our multi-query answering using views and containment algorithms are useful in several integration contexts: (1) systems that serve multiple users, each issuing different queries simultaneously, (2) systems where users issue unions of conjunctive queries (UCQs), and (3) systems enhanced with integrity constraints, as in ontology-based data integration (OBDI) [36, 111, 109, 116], where the user writes a query over a global schema/ontology and the system computes an expanded query, which is a UCQ, still over the global ontology terms, but whose size, as discussed, may be exponential. Then, the integration system will rewrite this UCQ into another UCQ that uses only the sources' schemas. In all these cases, it is important to efficiently process multiple queries over large numbers of sources. Our frugal chase algorithm substantially improves over the standard chase, allowing the efficient construction of warehouses that do not contain "much" redundant information. Experimentally, most of our algorithms outperform related approaches by orders of magnitude. On the theory side, we present a combined approach whose data preprocessing step does not depend on the query and whose rewriting output size is practically useful, bringing the rigorous results of query answering and data integration theory into the practical realm.

2 Background

In this chapter we present some important notions and concepts in relational and ontology-based data integration and provide some illustrative examples.

2.1 Queries

We use the well-known notions of constants, variables, predicates, terms, and atoms of first-order logic. We use safe conjunctive queries; these are rules of the form q(x̄) ← p1(ȳ1), ..., pn(ȳn), where q, p1, ..., pn are predicates of some finite arity and x̄, ȳ1, ..., ȳn are tuples of variables or constants. Conjunctions in our queries and rules may be denoted with ',' or '∧'. We define the body of the query to be body(q) = {p1(ȳ1), ..., pn(ȳn)}. Any non-empty subset of body(q) is called a subgoal of q. A singleton subgoal is an atomic subgoal. The consequent of the rule, q(x̄), is the head of the query. For a query or a set of atoms S, we denote by vars(S), cons(S), and terms(S) the set of variables, constants, and terms of S, respectively (e.g., vars(q) is the set of all query variables; when clear from the context, we use q to refer to either the datalog rule, the set of atoms of the rule, or just the head). By removing one or more atoms from a query body we get a subquery. Predicates appearing in the body stand for relations of a database D, while the head represents the answer relation of the query over D. The result of evaluating the query q over the database D is denoted q(D). The query being "safe" means that vars(head(q)) ⊆ vars(body(q)). All variables in the head are called head, distinguished, or returning variables, while the variables appearing only in the body (i.e., those in ∪_{i=1..n} ȳi \ x̄) are called existential variables. We call the type of a variable its property of being distinguished or existential. We denote constants by putting them in quotes (e.g., "2012"). Joins are denoted by repetition of variables. A union of conjunctive queries (UCQ) is a set of rules with the same head. The result of evaluating a UCQ Q over a database I is Q(I) = ∪_{q∈Q} q(I). The following query asks for doctors treating patients with chronic diseases:
q(doc, dis) ← TreatsPatient(doc, pat), HasChronicDisease(pat, dis)

A view V is a named query. The result set of a view is called the extension of the view.

2.2 Mappings in Data Integration

An important decision in the design of a data integration system is the choice of descriptions (or mappings) that relate the mediator's global schema to the local schemas used by the sources. The two main approaches here, as discussed in the introduction, are Global-as-View (GAV) and Local-as-View (LAV).

In the Global-as-View approach, each mediator relation (or predicate) is defined by a logical formula (a view) involving source predicates. Hence the following GAV mapping relates the mediator relation TreatsPatient to two source tables, SourcePatientIDs and SourceDoctorsWithPatientIDs:

TreatsPatient(doc, pat) ← SourcePatientIDs(pat, pid), SourceDoctorsWithPatientIDs(doc, pid)

In the Local-as-View approach each source predicate is defined by a logical formula (view) over mediator predicates. For example, consider the following LAV rules, describing sources S1 and S2 that provide data about medical records. S1 contains doctors that treat patients with a chronic disease. S2 contains doctors, patients and clinics where the doctor is responsible for discharging the patient from the clinic.

S1(doctor, disease) → TreatsPatient(doctor, patient), HasChronicDisease(patient, disease)
S2(doctor, patient, clinic) → DischargesPatientFromClinic(doctor, patient, clinic)

For historical reasons and similarity to datalog [128, 129], in LAV rules the (single source predicate) antecedent is called the head and the consequent the body of the rule (although the implication, as we will shortly discuss, goes in the other direction than it does for queries and GAV rules). Recall that global predicates are virtual; the actual data resides in the sources. In our example, S1 contains only references to the doctors and the diseases they treat, but not to the patients that have these diseases; this information conceptually exists in the body but is not provided by the source (perhaps due to privacy concerns). In effect, 'patient' is an existential variable in S1.

Logical implication (→) in a mapping indicates that the source(s) (antecedent) contains tuples that satisfy the logical formula in the consequent (mediator predicates), but it does not contain all such tuples (i.e., the open-world assumption [2]). In other words, a view V on the global schema D could be incomplete, in the sense that its extension could be only a subset of the relation V(D). A complete source, one that contains all possible tuples satisfying the formula, is defined with equivalence (↔) (the closed-world assumption [2]). In most cases of virtual data integration (e.g., web sources, peer data, or medical records) the open-world assumption is much more prevalent. We can describe the constraints that a source satisfies, but we can rarely be assured that the source is complete. For example, we don't expect S2 to provide all possible tuples of doctors, patients and clinics, but rather a subset specific to the source (e.g., for a region or a reporting agency). In this thesis, we focus on open-world rewritings.
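To ground these definitions, the following is a minimal sketch (in Python; it is not code from this thesis) of how conjunctive queries and LAV views can be represented as plain data structures. All names here (Atom, Query, LavView, existential_vars) are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Atom:
    pred: str                # predicate name, e.g. "TreatsPatient"
    args: Tuple[str, ...]    # variables, or quoted constants like '"2012"'

@dataclass
class Query:
    head: Atom               # the answer relation, e.g. q(doc, dis)
    body: List[Atom]         # conjunction of subgoals

@dataclass
class LavView:
    head: Atom               # the single source predicate (the LAV "head")
    body: List[Atom]         # conjunction over mediator predicates

# The LAV sources S1 and S2 above; 'patient' occurs only in the body of S1,
# so it is an existential variable of that view.
S1 = LavView(Atom("S1", ("doctor", "disease")),
             [Atom("TreatsPatient", ("doctor", "patient")),
              Atom("HasChronicDisease", ("patient", "disease"))])
S2 = LavView(Atom("S2", ("doctor", "patient", "clinic")),
             [Atom("DischargesPatientFromClinic", ("doctor", "patient", "clinic"))])

def existential_vars(v):
    """Variables appearing in the body of a view/query but not in its head."""
    return {t for a in v.body for t in a.args} - set(v.head.args)

assert existential_vars(S1) == {"patient"}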
The GAV and LAV approaches have complementary properties (see [130] for a detailed comparison). The main advantage of GAV is that query reformulation is generally much simpler. Since the user query is expressed in mediator predicates, query reformulation reduces to substituting each query predicate by its GAV definition and simplifying the resulting query. The disadvantage of GAV is that the addition of a new source to the system implies reviewing all the GAV definitions and possibly changing many of them to include the new source predicate. This makes the approach quite cumbersome in environments where discovering new sources is common. The advantage of the LAV approach is its scalability with respect to the addition of new sources. Since each source is described independently, new sources can be added in a plug-and-play fashion, without disrupting the existing mediator models and with less effort. Another advantage is that the sources can be described in great detail, not only by using the appropriate mediator predicates but also by adding appropriate constraints. The disadvantage of LAV is that the query reformulation algorithms are less straightforward, and the characteristics of the source description language must be chosen carefully in order to be computationally practical.

A generalization of both the GAV and LAV approaches is GLAV [59], where entire compositions of source relations are expressed as queries over the mediator relations. Known also as source-to-target tuple-generating dependencies (s-t TGDs [55]), these formulas have the form φ_S(x̄, ȳ) → ψ_G(x̄, z̄), with φ_S and ψ_G conjunctive formulas over the source and global predicates, respectively. Note that the formal logical meaning of the above expression is ∀x̄, ȳ φ_S(x̄, ȳ) → ∃z̄ ψ_G(x̄, z̄), but we choose to syntactically neglect the quantifiers, for brevity (similarly in the cases of GAV and LAV). We intend to use all the different kinds of mappings; we show how a solution for LAV mappings supports GLAV as well.

2.2.1 Supporting GLAV through LAV mappings

The approach to supporting GLAV s-t TGDs (similar to [59]) is as follows:

1. Split the GLAV rules into LAV and GAV rules using intermediate predicates,
2. rewrite the query using the LAV rules (the output rewritings will be in terms of the temporary predicates),
3. unfold the rewritings using the GAV rules (by substituting the temporary predicates by their definitions), and
4. minimize the resulting rewritings.

The following example illustrates the splitting of the first rule below (GLAV) into the two rules that follow it (one LAV and one GAV):

V4(doctor, clinic) ∧ V3(doctor, disease) → TreatsPatient(doctor, patient) ∧ HasChronicDisease(patient, disease) ∧ DischargesPatientFromClinic(doctor, patient, clinic)

Temp(doctor, clinic, disease) → TreatsPatient(doctor, patient) ∧ HasChronicDisease(patient, disease) ∧ DischargesPatientFromClinic(doctor, patient, clinic)

V4(doctor, clinic) ∧ V3(doctor, disease) ↔ Temp(doctor, clinic, disease)

Hence, in the rest of this thesis we will deal only with LAV s-t TGDs (but also with GLAV target TGDs).
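Under the same assumptions as the previous sketch (the Atom class), the following illustrates step 1 of the splitting procedure above. The choice of exporting exactly the variables shared by the two sides of the rule is our reading of the example, not a prescription from the thesis.

def split_glav(antecedent, consequent, temp_name="Temp"):
    """Split a GLAV rule antecedent -> consequent by routing the variables
    shared by both sides through an intermediate predicate. Returns the pair
    (gav_rule, lav_rule): antecedent <-> Temp(...) and Temp(...) -> consequent."""
    shared = sorted({t for a in antecedent for t in a.args} &
                    {t for a in consequent for t in a.args})
    temp = Atom(temp_name, tuple(shared))
    return (antecedent, temp), (temp, consequent)

# Splitting the example GLAV rule above:
gav, lav = split_glav(
    [Atom("V4", ("doctor", "clinic")), Atom("V3", ("doctor", "disease"))],
    [Atom("TreatsPatient", ("doctor", "patient")),
     Atom("HasChronicDisease", ("patient", "disease")),
     Atom("DischargesPatientFromClinic", ("doctor", "patient", "clinic"))])
# lav pairs Temp(clinic, disease, doctor) with the three mediator atoms,
# matching the intermediate rules shown above up to argument order.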
2.3 Containment, Equivalence and Homomorphisms

Query answering and rewriting is closely related to the problem of query containment [92].

Definition 1. Query Containment: For all database queries Q1 and Q2, Q2 is contained in Q1, denoted Q2 ⊆ Q1, iff for all databases D the result of evaluating Q2 on D, denoted Q2(D), is contained in the result of evaluating Q1; that is, Q2(D) ⊆ Q1(D).

Definition 2. Query Equivalence: For all database queries Q1 and Q2, Q1 is equivalent to Q2, denoted Q1 ≡ Q2, iff Q1 ⊆ Q2 and Q2 ⊆ Q1.

Definition 3. Minimal Query: For all conjunctive queries Q, Q is minimal iff there is no subquery Q′ of Q s.t. Q ≡ Q′.

Definition 4. Homomorphism: Given two sets of atoms S1, S2, a homomorphism from S1 to S2 is a function h: terms(S1) → terms(S2) such that: (1) h is the identity on constants, i.e., h(c) = c, and (2) for all atoms A ∈ S1, h(A) ∈ S2 (a homomorphism h is extended over atoms, sets of atoms, and queries in the obvious manner). Given two queries or instances A, B, if there exist two homomorphisms h1: A → B and h2: B → A, then A and B are homomorphically equivalent (denoted A ≡ B). Two homomorphically equivalent instances give the same tuples of constants as answers to any query.

For conjunctive queries over the same schema, query containment can be proven using containment mappings (which are special homomorphisms) [1, 39]. It holds that Q2 ⊆ Q1 iff there is a containment mapping from Q1 to Q2 [39]. Deciding containment for two conjunctive queries is an NP-complete problem [39], which is essentially the complexity of deciding the existence of a homomorphism.

Definition 5. Containment Mappings: Given two conjunctive queries Q1, Q2, a containment mapping from Q1 to Q2 is a homomorphism h: body(Q1) → body(Q2) s.t. h(head(Q1)) = head(Q2).

The following query asks for doctors treating patients with chronic diseases:

q1(doc, dis) ← TreatsPatient(doc, pat), HasChronicDisease(pat, dis)
To align the problem with the recent frontier of data integration research we should mention current works that build and extend on LAV rewriting algorithms (e.g., [94, 5, 75]) or use their intuitions [136, 15]). There are also recent works that build on LAV rewriting foundations (e.g., in composition of schema mappings [15, 13] or in uncertain data integration [7]) or assume the off-the-self usage of a LAV rewriting algorithm (e.g., [15, 14]). It is critical for all data 11 Chapter 2. Background integration systems to be able to support a large number of sources, so our solution focuses on scalability to the number of views. From the query optimization perspective, views are previous queries who have been already evaluated and their answers have been materialized. In this context, systems rewrite sub- sequent queries substituting a part of the original query with some of the views, so as to come up with an equivalent optimized query. In our context, of data integration, multiple heterogenous sources (as web sites, databases, peer data etc.) are integrated under a global query interface. These sources are usually presented as relational schemas and the system offers a virtual mediated schema to the user for posing queries. Then, the system needs to rewrite user queries as queries over the sources’ schemas only. Moreover, some sources might be incomplete; hence the system needs to produce query rewritings that instead of equivalent are maximally-contained. Definition 6. Maximally-Contained Rewriting: Given a query Q and a set of (G)LAV schema mappingsMÆ {M 1 ,..., M n }, then Q 0 is a maximally-contained rewriting of Q usingM if: (1) Q 0 only contains source predicates, (2) Q 0 µ Q, and (3) there is no query Q 00 containing only source predicates (i.e., no other rewriting of Q usingM ), such that Q 0 µ Q 00 µ Q and Q 00 Q. As we will see later, in a data-exchange setting [55], a maximally-contained rewriting of a query q computes the certain answers of q with respect to the source database [50, 55, 4] (this holds even in the presence of dependencies, whenever a maximally-contained rewriting exists). To ground the above definitions, assume the user asks for doctors treating chronic diseases and the clinics that they work (discharge patients from): q(d, c) à TreatsPatient(d,x), HasChronicDisease(x,y), DischargesPatientFromClinic(d,z,c) As well, consider available the two LAV sources S 1 , S 2 presented in Sect. 2.2: S 1 (doctor, disease) ! TreatsPatient(doctor, patient), HasChronicDisease(patient,disease) S 2 (doctor, patient, clinic) ! DischargesPatientFromClinic(doctor, patient, clinic) A rewriting of q is: q 0 (d, c) à S 1 (d, y), S 2 (d, z, c) Intuitively, we can get the chronic diseases’ doctors data from S 1 and join it on “doctor” with the discharge information from S 2 . In this example, the selection of which views are relevant to answer the user query and the reformulation process was quite simple since there were only two views. We just built the conjunctive query q 0 using the two views and tested that q 0 µ q. Here, since S 1 and S 2 are the only sources available, q 0 is the best we can do and (as we explain below) it is a maximally-contained rewriting of q using S 1 and S 2 . Note that in our example above we needed to check containment of q 0 defined over the source’s schemaV ={S 1 ,S 2 }, in q which is defined over the mediator predicatesR ={TreatsPatient, HasChronicDisease, DischargesPatientFromClinic}. 
Since containment mappings is an al- gorithm for same-schema queries, we cannot use it directly. In order to check whether a rewriting q 0 (defined overV ) is contained in a query q (defined over schemaR), we need 12 2.5. Gaifman Graphs to get the rewriting’s unfolding [92], i.e., unfold the atoms ofV in the body of q 0 with their definitions (which are overR). The new query unfold(q 0 ) is defined over R, and we can use containment mappings to check wether unfold(q 0 )µ q. Here, unfold(q 0 )à TreatsPatient(d,p), HasChronicDisease(p,y), DischargesPatientFromClinic(d,z,c), which is contained in q. In general, there may be a large number of sources and the number of possible rewritings that would need to be considered, unfolded and tested for containment grows exponentially to this number of sources. Containment is an expensive procedure and hence algorithms dealing with query rewriting, try to reduce the number of candidate rewritings. In Chapter 3, we discuss our approach to look at the relational query rewriting problem using LAV sources from a graph perspective and be able to gain a better insight and develop an algorithm which outperforms the state-of-the-art by close to two orders of magnitude. 2.5 Gaifman Graphs For a conjunction of atoms B (a query or a database instance), the Gaifman graph of nulls [56] is an undirected graph with nodes being the existential variables (labeled nulls) of B; edges between nodes exist if the existential variables occur in the same atom in B. Dually, the Gaifman graph of facts of B, denoted grf(B), is the graph whose nodes are the atoms of B and two nodes are connected if they share existential variables/labeled nulls. Note that parts of B that are connected only through constants (as well as distinguished variables for TGDs), constitute different connected components of the Gaifman graph of facts. For G i a connected component of grf(B), V (G i ) is the set of all the facts (i.e., the nodes) in G i . For any conjunction of atoms B we denote the decomposition of B to (the facts of) its connected components {G i ,...,G n } as the set of sets (of facts) dec(B)Æ {V (G i ),...,V (G n )}. 2.6 Data Integration and Exchange under Constraints In this section we briefly discuss the sets of constraints we are considering, which are decidable for query answering. We also present a common tool, called the chase, that is used as an algorithm in query answering under dependencies. We then look its connection to query containment and query answering using views. 2.6.1 Tuple-Generating Dependencies Various forms of constraints have been studied in the literature. Our focus, Tuple Generating Dependencies (TGDs) [1, 24], is a generalization of the language inclusion of dependencies. TGDs can be seen as Datalog rules that allow for value invention [1]; they allow for existential variables in the consequent of the rule which “invents” values. Because of their structure they are also known as existential or89¡ r ul es [21, 20]. Syntactically TGDs are the same as s-t TGDS, introduced in Sect. 2.2, but they are completely defined on the global schema: 13 Chapter 2. Background 8~ x,~ y Á G (~ x,~ y)!9~ z à G (~ x,~ z) In general, query answering, containment and rewriting under TGDs is undecidable [23, 38]. Nevertheless several syntactic restrictions have been studied that provide expressive and useful fragments of TGDs, which are decidable in the problems above, and some times even computationally tractable (in data complexity [131]). 
Note that apart from TGDs there is another important class of dependencies in the literature, the equality-generating dependencies (or EGDs) [38], which are generalizations of functional dependencies. In this thesis we focus on TGDs and indicate how our results could be extended to cover EGDs as well.

Recently, there has been a lot of active research on these constraint languages, which has spread across multiple areas in Databases, Knowledge Representation, the Semantic Web and Web Engineering. In particular, these fragments include several "interesting" ontology languages that provide a sweet spot between expressivity and computational complexity, such as RDFS [28] and the DL-Lite family [36], which includes OWL2-QL [104]. RDFS and OWL2 are W3C recommendations for structuring and reasoning with data on the web (http://www.w3.org/standards/semanticweb/). These languages provide convenient modeling constructs, such as class hierarchies, property domains and ranges, and disjointness constraints. Datalog+/- [32] is another recently proposed family of languages that uses TGDs. It captures most of the languages above and extends them in different directions that are more expressive yet still decidable. In the context of database systems, useful classes of decidable constraints include inclusion dependencies, full TGDs [24], acyclic inclusion dependencies [45], weakly acyclic dependencies [55] (generalizing the former two), as well as further generalizations of weakly acyclic dependencies, such as super-weakly-acyclic [101] and jointly-acyclic [83] dependencies. A recent survey on decidable cases for query answering was presented in [105, 22], where the authors identify three abstract classes of decidable TGD constraints:

• Finite expansion sets (FES) of rules. The following are FES sets: Datalog rules (range-restricted), weakly-acyclic constraints, jointly-acyclic constraints, and others. The defining characteristic of these rules is that the chase algorithm (see Sect. 2.6.3) terminates under these dependencies. We will refer to these constraints as chase-terminating.

• Finite unification sets (FUS) of rules. The following are FUS sets: inclusion dependencies, DL-Lite, linear Datalog [1], sticky TGDs [32], and others. The defining characteristic of these languages is that they are FO-rewritable, that is, their reasoning tasks can be implemented as relational calculus queries (i.e., SQL), and thus are efficiently supported by existing relational database technology. FO-rewritability means that the query expansion phase terminates.

• Bounded treewidth sets (BTS) of rules. The following are BTS sets: guarded and weakly guarded rules [30], jointly-guarded rules [83], and others. These languages have the property that, while the chase and the query expansion do not always terminate, there is a bounded part of the (infinite) chase that is sufficient for query answering.

These three abstract classes of rules are not recognizable (i.e., it is in general undecidable whether a set of rules belongs to one of these classes). However, all the other concrete subclasses mentioned are efficiently recognizable. In this thesis, we make several contributions across the range of all these categories of classes.

2.6.2 Weakly Acyclic TGDs

A special class of particular interest in this thesis is the class of weakly acyclic constraints (wa-TGDs) [55].

Definition 7. Weakly Acyclic TGDs [55]: Let Σ be a set of TGDs over schema R = {R1, R2, ..., Rn}.
Construct a directed graph, called the dependency graph, as follows: (1) there is a node for every pair (Ri, A) with A an attribute of Ri; call such a pair (Ri, A) a position; (2) add edges as follows: for every TGD ∀x̄, z̄ φ(x̄, z̄) → ∃ȳ ψ(x̄, ȳ) in Σ, for every x in x̄ that occurs in ψ, and for every occurrence of x in φ in position (Ri, Ai):

• for every occurrence of x in ψ in position (Rj, Bk), add an edge (Ri, Ai) → (Rj, Bk) (if it does not already exist);

• in addition, for every existentially quantified variable y and for every occurrence of y in ψ in position (Rt, Cm), add a special edge (Ri, Ai) →* (Rt, Cm) (if it does not already exist).

Then Σ is weakly acyclic (wa) if the dependency graph has no cycle going through a special edge. LAV Weakly Acyclic TGDs are wa-TGDs that have a single predicate in the antecedent (also known as simple or linear wa-TGDs).

2.6.3 The Chase

We compactly present the necessary definitions behind the chase procedure [24, 55]. The chase is useful for reasoning with dependencies. There are multiple variations of the chase, the two most prevalent being the oblivious and the standard chase. Given a conjunction of atoms B (we can run the chase both on database instances and on queries) and a TGD σ = ∀x̄, ȳ φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄), σ is obliviously applicable to B with antecedent homomorphism h iff h is a homomorphism from φ to B (intuitively, the antecedent holds in B). We say that σ is standardly applicable to B with antecedent homomorphism h iff h is a homomorphism from φ to B such that h cannot be extended to cover ψ, that is, there is no extension h' of h that maps φ(x̄, ȳ) ∧ ψ(x̄, z̄) to B (intuitively, the consequent is not already satisfied). In both cases, if the TGD σ is applicable, we apply σ by adding its consequent to B. Formally, an oblivious (or standard) chase step adds ψ(h(x̄), f(z̄)) to B whenever σ is obliviously (or standardly) applicable to B with antecedent homomorphism h, where f creates "fresh" variables, known as skolems or labeled nulls, for all the existential variables z̄. The oblivious (or standard) chase is an exhaustive series of oblivious (or standard) chase steps, and it may be finite or infinite depending on the constraints. We denote by chase(B, Σ) or chase_Σ(B) (or just chase(B) if Σ is clear from the context) the result of chasing B with all constraints in Σ, and we use this notation for both the standard and the oblivious chase, unless otherwise specified. We assume a fair and deterministic execution of the chase; that is, we construct the (possibly infinite) chase graph in a breadth-first manner, choosing our constraints and skolems according to a lexicographic or another well-founded order.

Consider, for example, the following rule:

r1: ∀x, y Doctor(x) ∧ HasPatient(x, y) → ∃z HasChronicDisease(y, z)

and consider the query q_c(patient) ← Doctor(x), HasPatient(x, patient). Applying the chase on the body of this query with respect to r1 yields the equivalent (under the constraint) query q'_c(patient) ← Doctor(x), HasPatient(x, patient), HasChronicDisease(patient, z_fresh).

Query answering using chase-terminating constraints can be done by first chasing the data using the constraints and then proceeding with normal query answering over the chase. Intuitively, this is equivalent to "completing" the data: adding all inferred knowledge in order to answer the query. This method has been studied extensively in the area of data exchange.
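To make the chase step concrete, the following is a minimal Python sketch of one breadth-first round of oblivious chase steps over a set of facts. The atom encoding, helper names, and the labeled-null naming scheme are our own illustration, not the thesis implementation:

```python
import itertools

# Atoms are (predicate, args) tuples; a TGD is (antecedent, consequent,
# existential_vars).  Labeled nulls are fresh strings "_N0", "_N1", ...

def homomorphisms(atoms, facts, h=None):
    """Enumerate all homomorphisms mapping every atom in `atoms` into
    `facts`, extending the partial mapping h."""
    h = h or {}
    if not atoms:
        yield dict(h)
        return
    pred, args = atoms[0]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        h2 = dict(h)
        if all(h2.setdefault(a, b) == b for a, b in zip(args, fargs)):
            yield from homomorphisms(atoms[1:], facts, h2)

_fresh = itertools.count()

def oblivious_chase_round(facts, tgds):
    """Fire every TGD once per antecedent homomorphism (one oblivious round)."""
    new = set()
    for antecedent, consequent, evars in tgds:
        for h in homomorphisms(antecedent, facts):
            g = dict(h)
            for z in evars:                 # invent one labeled null per existential
                g[z] = "_N%d" % next(_fresh)
            new |= {(p, tuple(g[a] for a in args)) for p, args in consequent}
    return facts | new

# Chasing the body of q_c with r1 adds HasChronicDisease(p1, _N0):
r1 = ([("Doctor", ("x",)), ("HasPatient", ("x", "y"))],
      [("HasChronicDisease", ("y", "z"))], ("z",))
B = {("Doctor", ("d1",)), ("HasPatient", ("d1", "p1"))}
print(oblivious_chase_round(B, [r1]))
```

A full chase would repeat such rounds until no new facts are produced (or forever, for non-terminating constraint sets); the standard chase would additionally skip a step whenever the antecedent homomorphism can already be extended to the consequent.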
2.6.4 Data Exchange

Given a finite source database instance I, a set of s-t TGDs Σst, and a set of target dependencies Σt, the data exchange problem [55] is to find a finite target database instance J, called a solution, such that I, J satisfy Σst and J satisfies Σt. The certain answers of a query q on the target schema, obtained using the source instance I under the constraints Σ = Σt ∪ Σst, denoted by certain(q, I, Σ), is the set of all tuples t of constants from I such that, for every solution J, t ∈ q(J) [55]. For certain classes of TGDs we can reach a representative solution for the entire space of solutions, called a universal solution, which has homomorphisms to all other solutions, and certain(q, I, Σ) can be computed by issuing q on it [55]. Notice that "all tuples of constants" means that, from all answers of q(J), we remove the tuples that contain non-constants (i.e., nulls); the latter operation is denoted as q↓(J), and also as ans(q, J). Out of all homomorphisms from a query to an instance, those that give tuples of constants as answers are the answering homomorphisms. The chase, chase(I, Σ), is a sound algorithm for finding universal solutions. The chase with general target TGDs might not terminate, so relevant research has focused on identifying chase-terminating classes of TGDs, such as wa-TGDs, for which the chase is both a sound and a complete algorithm for computing universal solutions.

2.6.5 The Chase Graph

The chase graph for a database instance B and a set of TGDs Σ is a directed graph whose nodes are the elements of chase(B, Σ), with an arrow from a to b iff b is obtained from a by an application of a TGD. For brevity, we will use interchangeably the notion of a node representing an atom and the atom itself, as well as collections of chase facts and the chase subgraphs that they define. For linear TGDs, the chase graph is essentially a forest of trees, each tree having as its root a fact of B. Every atom a ∈ chase(B, Σ) is on a chase path, denoted π(a), which ends in a. We define the root of π(a) as the fact r of B such that the direct descendant of r on π(a) is not a fact of B (intuitively, we consider as the root the "deepest" of the original facts of B on π(a)). The root atom is denoted by root(π(a)), or simply root(a), since every atom a is on a path with a single root. Notice that we might use the symbol π as a particular path's identifier, while π(a) is a function that identifies the path containing its input atom a. For a set of atoms s in the same subtree, we will use π(s) to denote the path that ends at the highest-level common ancestor a of all atoms in s (and starts at root(a)). Sometimes it will be convenient to consider a fragment of a path π below a certain fact t; we will then talk about the path π starting at t.

The level (or derivation level) of an atom in the chase is defined as follows. All atoms in B have level 0. When a TGD is applicable with an antecedent homomorphism h such that the highest level of any atom in the image of h is k, then the added consequents of the constraint's application have level k + 1. For linear TGDs, the derivation level of an atom is the depth of the corresponding node in the chase tree (with the root having depth 0).

2.6.6 Containment and Maximally-Contained Rewritings Under Constraints

Definition 8. Containment under Dependencies:
A query Q2 is contained in a query Q1 under a set of TGD constraints Σ, denoted Q2 ⊆_Σ Q1, iff for all databases D that satisfy Σ, Q2(D) ⊆ Q1(D).

The next theorem holds.

Theorem 1. Containment using the chase [30]: For all conjunctive queries Q1, Q2 and all sets of FES TGD constraints Σ, Q2 ⊆_Σ Q1 iff there is a homomorphism that maps body(Q1) onto the chase of body(Q2) and the head of Q1 to the head of Q2; that is, Q2 ⊆_Σ Q1 iff chase(Q2) ⊆ Q1.

The above theorem tells us that in order to check containment under constraints, we can chase the candidate containee query and then check for regular conjunctive query containment (e.g., through containment mappings).

Definition 9. Equivalence under Dependencies: Query Q1 is Σ-equivalent to query Q2 under a set of constraints Σ, denoted Q1 ≡_Σ Q2, iff Q1 ⊆_Σ Q2 and Q2 ⊆_Σ Q1. A query Q is minimal under a set of dependencies, denoted Σ-minimal, if there is no query Q' which is a subquery of Q with Q ≡_Σ Q'.

3 Scalable Query Rewriting: A Graph-based Approach

In this chapter we consider the problem of answering conjunctive queries using views, which is important for data integration, query optimization, and data warehouses. We consider its simplest form, conjunctive queries and views, which is already NP-complete. Our context is data integration, so we search for maximally-contained rewritings. By looking at the problem from a graph perspective we are able to gain a better insight and develop an algorithm which compactly represents common patterns in the source descriptions. This representation speeds up tremendously the computation of homomorphisms (a core mechanism in query rewriting), and (optionally) pushes some computation offline. This, together with other optimizations, results in an experimental performance about two orders of magnitude faster than current state-of-the-art algorithms, rewriting queries using over 10000 views within seconds.

Given a conjunctive query Q over a database schema D and a set of view definitions V1, ..., Vn over the same schema, the problem that we study is to find answers to Q using only V1, ..., Vn. Generally, the number of rewritings that a single input query can reformulate to can grow exponentially with the number of views, as the problem (which is NP-complete [92]) involves checking containment mappings (not to be confused with source descriptions, which are mappings from the sources' schema to that of the mediator) from subgoals of the query to candidate rewritings in the cross-product of the views. Previous algorithms (such as MiniCon [112] and MCDSAT [17]) have exploited the join conditions of the variables within a query and the views, so as to prune the number of unnecessary mappings to irrelevant views while searching for rewritings. We pursue this intuition further. The key idea behind our algorithm (called Graph-based Query Rewriting, or GQR) is to compactly represent common subexpressions in the views; at the same time we treat each subgoal atomically (as the bucket algorithm [87] does) while taking into account (as MiniCon does) the way each of the query variables interacts with the available view patterns, and the way these view patterns interact with each other.
Contrary to previous algorithms, however, we do not try to a priori map entire "chunks" of the query to (each one of) the views; rather, this mapping comes out naturally as we incrementally combine (query-relevant) atomic view subgoals into larger ones. Consequently, the second phase of our algorithm needs to combine much fewer, and truly relevant, view patterns, building a whole batch of rewritings right away. Our specific contributions are the following:

• We present an approach which decomposes the query and the views into simple atomic subgoals and depicts them as graphs; we abstract from the variable names by having only two types of variable nodes: distinguished and existential.

• This decomposition makes it easier to identify the same graph patterns across sources and represent them compactly. This can be done offline, as a view-preprocessing phase which is independent of the user query and can be performed at any time the sources become available to the system, thereby speeding up the system's online performance.

• Subsequently, we devise a query reformulation phase where query graph patterns are mapped (through much fewer containment mappings) to our compact representation of view patterns. By bookkeeping some information on our variable nodes (regarding their join conditions), we can combine the different view subgraphs into larger ones, progressively covering larger parts of the query.

• During the above phase each graph "carries" a part of the rewriting, so we incrementally build up the rewriting as we combine graphs; we conclude with a maximally-contained rewriting that uses only view relations, while at the same time trying to minimize its size.

• Our compact form of representing the pieces of the views also allows us to reject an entire batch of irrelevant views (and candidate rewritings) by "failing" to map onto a view pattern. This also allows the algorithm to "fail fast" as soon as a query subgoal cannot map to any of the (few) view patterns.

• These characteristics make our algorithm perform close to two orders of magnitude faster than the current state-of-the-art algorithm, MCDSAT [17]. We exhibit our performance by reformulating queries using up to 10000 views.

• We present MGQR (Multiple Graph-based Query Rewriting), for scalable rewriting of multiple queries and unions of queries in the presence of large numbers of views. This algorithm extends the GQR approach in two significant ways. First, MGQR finds common graph patterns both in the queries and in the views, compactly representing and indexing these patterns, while carefully keeping track of which patterns are relevant for which queries. Second, the graph patterns are combined incrementally in such a way that multiple views are used to cover multiple queries simultaneously.

3.1 The Query Rewriting Problem

This section formally defines our problem after going through some necessary preliminary definitions, which are additional to those of Chapter 2. A view V is a named query. The result set of a view is called the extension of the view. In the context of data integration a view can be incomplete, in the sense that its extension may be only a subset of the relation V(D).
Users pose conjunctive queries over D, and the system needs to rewrite or reformulate these queries into a union of conjunctive queries (UCQ) that only uses the views' head predicates, so as to obtain all of the tuples in Q(D) that are available in the sources. We will refer to this UCQ as the query rewriting, while an individual conjunctive query in a rewriting will be called a conjunctive rewriting [17]. As mentioned, in this thesis we focus on open-world maximally-contained rewritings.

To speed up the actual query evaluation on the sources, query rewriting algorithms pay attention to finding minimal conjunctive rewritings inside the rewriting R. More formally, we can define a "minimization" ordering ≤_m as follows:

Definition 10. Minimal conjunctive rewriting: For all conjunctive queries R1, R2: R1 ≤_m R2 iff

• R1 ≡ R2, and

• there exists a set U ⊆ vars(body(R2)) and an isomorphism i : vars(body(R1)) → U (extended in the obvious manner to atoms) such that (1) for all atoms A ∈ body(R1) it holds that i(A) ∈ body(R2), and (2) i(head(R1)) = head(R2).

The problem of minimizing a rewriting is NP-complete [92], and therefore most algorithms produce a number of non-minimal conjunctive rewritings in their solutions. An additional problem related to minimality is that of redundant rewritings, when more than one equivalent conjunctive rewriting exists in the same UCQ rewriting. Our algorithm produces fewer conjunctive rewritings than the current state-of-the-art algorithm, but we also suffer from redundant and non-minimal ones.

A containment mapping from a query Q to a rewriting R can also be seen as a covering of Q by (the views in) R. Similarly, we can define:

Definition 11. Covering: For all queries Q, for all views V, for all subgoals g_q ∈ body(Q), for all subgoals g_v ∈ body(V), and for all partial homomorphisms φ : vars(Q) → vars(V), we say that a view subgoal g_v covers a subgoal g_q of Q with φ iff:

• φ(g_q) = g_v, and

• for all x ∈ vars(g_q), if x is distinguished then φ(x) ∈ vars(g_v) is distinguished.

The intuition behind the second part of the definition is that whenever a part of a query needs a value, one cannot cover that part with a view that does not explicitly provide this value. On occasion, we might abuse the above definition to say that a greater subgoal, or even V itself, covers g_q with φ (since these coverings involve trivial extensions of φ). For all variables x ∈ g_q and y ∈ g_v, we say that x maps onto y (or y covers x) iff, for a covering involving φ, φ(x) = y.

To ground these definitions, consider the following example. Assume that we have two sources, S1 and S2, that provide information about road traffic and routes (identified by a unique id). S1 contains ids of routes one should avoid, i.e., routes for which there is at least one alternative route with less traffic. S2 contains points of intersection between two routes. The contents of these sources are modeled respectively by the following two LAV rules (or views):

S1(r1) → AltRoutes(r1, r2), LessTraffic(r2, r1)
S2(r3, r4, p1) → ConnectingRoutes(r3, r4, p1)

Assume the user asks for all avoidable routes and all exit points from these routes:

q(x, p) ← AltRoutes(x, y), LessTraffic(y, x), ConnectingRoutes(x, z, p)

The rewriting of q is:

q'(x, p) ← S1(x), S2(x, f, p)

In this example, the selection of relevant views to answer the user query and the reformulation process were quite simple, since there were only two views.
We just built a conjunctive query q' using the two views and tested that unfold(q') ⊆ q, where unfold(q') is:

q''(x, p) ← AltRoutes(x, f1), LessTraffic(f1, x), ConnectingRoutes(x, f, p)

However, in general, there could be many more conjunctive rewritings in the rewriting than q', and, as mentioned, checking containment is an expensive procedure. Notice that in order to construct a conjunctive rewriting contained in the query, we select views that have relevant predicates, i.e., for which there is a mapping (covering) from a query atom to a view atom. Coverings are essentially components of the final containment mapping from the query to the (unfolding of the) combination of views that forms a conjunctive rewriting. Paying close attention to how we construct coverings and select views can help us avoid building conjunctive rewritings which are not contained in the query. The second part of Def. 11 states that coverings should map distinguished query variables to distinguished view ones, as the containment definition demands. In our example, had either S1 or S2 had its first attribute (i.e., r1 or r3) missing from its head, it would have been useless. In effect, we want the variable x of q to map onto a distinguished variable in a relevant view.

In addition to their definition, coverings should adhere to one more constraint. Consider q1, which asks for avoidable routes and all exit points from these routes to some alternative route with less traffic:

q1(x, p) ← AltRoutes(x, y), LessTraffic(y, x), ConnectingRoutes(x, y, p)

Now the query demands that the second argument of ConnectingRoutes is joined with one of x's alternative routes. This is impossible to answer given S1 and S2, as S1 does not provide x's alternative routes (i.e., r2 in its definition). The property revealed here is that whenever an existential variable y in the query maps onto an existential variable in a view, this view can be used for a rewriting only if it covers all predicates that mention y in the query. This property is referred to as (clause C2 in) Property 1 in MiniCon [112]. This is also the basic idea of the MiniCon algorithm: trying to map all query predicates of q1 to all possible views, it will notice that the existential query variable y maps onto r2 in S1; since r2 is existential, it needs to go back to the query and check whether all predicates mentioning y can be covered by S1. Here, ConnectingRoutes(x, y, p) cannot.

We notice that there is duplicate work being done in this process. First, MiniCon performs this procedure for every query predicate; this means that if q1 had multiple occurrences of AltRoutes it would try to use S1 multiple times and fail (although, as the authors of [112] note, certain repeated predicates can be ruled out of consideration). Second, MiniCon would try to do this for every possible view, even for those that contain the same pattern as S1, such as S3 below, which offers avoidable routes where an accident has also recently occurred:

S3(r1) → AltRoutes(r1, r2), LessTraffic(r2, r1), RoutesWithAccidents(r1)

S3 cannot be used for q1 as it violates MiniCon's Property 1, again due to its second variable, r2, being existential and the atom ConnectingRoutes not being covered. Our idea is to avoid this redundant work by compactly representing all occurrences of the same view pattern. To this end we use the graph representation of queries and views presented subsequently.
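Before moving to the graph representation, the sketch below (our own encoding, with atoms as (predicate, args) tuples) makes the C2 check concrete. It tests a necessary condition: once a partial covering φ sends a query variable onto an existential view variable, every query atom mentioning that variable must at least be mappable into the same view. The full MiniCon construction additionally requires a single consistent extension of φ over all those atoms; this simplified test already rejects S1 (and S3) for q1 above.

```python
def extend(phi, q_atom, v_atom):
    """Try to extend phi so that it maps q_atom onto v_atom; None on a clash."""
    (qp, qargs), (vp, vargs) = q_atom, v_atom
    if qp != vp or len(qargs) != len(vargs):
        return None
    out = dict(phi)
    for a, b in zip(qargs, vargs):
        if out.setdefault(a, b) != b:
            return None
    return out

def respects_c2(query_atoms, view_atoms, view_distinguished, phi):
    """Necessary condition for MiniCon's Property 1 (clause C2)."""
    for x, vx in phi.items():
        if vx in view_distinguished:
            continue              # join can be performed outside the view
        for qa in (a for a in query_atoms if x in a[1]):
            if not any(extend(phi, qa, va) is not None for va in view_atoms):
                return False      # some atom mentioning x is uncoverable here
    return True

# q1(x,p) <- AltRoutes(x,y), LessTraffic(y,x), ConnectingRoutes(x,y,p)
q1 = [("AltRoutes", ("x", "y")), ("LessTraffic", ("y", "x")),
      ("ConnectingRoutes", ("x", "y", "p"))]
# Covering AltRoutes with S1's AltRoutes(r1,r2) gives phi = {x: r1, y: r2},
# and r2 is existential in S1 (only r1 appears in S1's head):
s1 = [("AltRoutes", ("r1", "r2")), ("LessTraffic", ("r2", "r1"))]
print(respects_c2(q1, s1, {"r1"}, {"x": "r1", "y": "r2"}))  # False
```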
3.2 Queries and Views as Graphs

Our graph representation of conjunctive queries is inspired by previous graph-based knowledge representation approaches (see conceptual graphs in [43]). Predicates and their arguments correspond to graph nodes. Predicate nodes are labeled with the name of the predicate, and they are connected through edges to their arguments. Shared variables between atoms result in shared variable nodes, directly connected to predicate nodes. We need to keep track of the arguments' order inside an atom. Therefore, we equip our edges with integer labels that stand for the variables' positions within the atom's parentheses. In effect, an edge labeled "1" will be attached to the node representing the leftmost variable within an atom's parentheses, etc. Thus, we can discard variables' names; from our perspective we only need to remember a variable's type (i.e., whether the i-th variable of a specific atom is distinguished or not within the query or the view). This choice can be justified upon examination of Def. 11: the only knowledge we require for deciding on a covering is the types of the variables involved. Distinguished variable nodes are depicted with a circle, while existential ones are depicted with a distinct marked symbol. Using these constructs, the query:

Q(x1, x2) ← P1(x1, y, z), P2(y, z), P3(y, x2)

corresponds to the graph seen in Fig. 3.1(a). Fig. 3.1(b) shows the graph alter egos of the following 7 LAV source descriptions:

S1(x, y, z, g, f) → P1(x, y, z), P4(g, f)
S2(a, b) → P4(b, a)
S3(c, d) → P2(c, d)
S4(e, h) → P3(e, h)
S5(i, k, j) → P1(i, k, x), P4(j, x)
S6(l, m, n, o) → P1(l, n, x), P4(m, x), P2(o, x)
S7(t, w, u) → P1(t, u, x), P3(x, w)

Figure 3.1: Query Q and sources S1–S7 as graphs.

3.2.1 Predicate Join Patterns

Our algorithm consists of mapping subgraphs of the query to subgraphs of the sources, and to this end the smallest subgraphs we consider represent one atom's "pattern": they consist of one central predicate node and its (existential or distinguished) variable nodes. These primitive graphs are called predicate join patterns (or PJs) for the predicate they contain. Fig. 3.2(a) shows all predicate joins that the query Q contains (i.e., all the query PJs). We will refer to subgoals greater than simple PJs as compound predicate join patterns, or CPJs (PJs are also CPJs, although atomic ones). We can now restate Def. 11 using our graph terminology: a view CPJ covers a query CPJ if there is a graph homomorphism h from the query graph to the view graph such that (1) h is the identity on predicate nodes and labeled edges, and (2) if a query variable node u is distinguished then h(u) is also distinguished. For the query PJ for P1 in Fig. 3.2(a), all PJs that can potentially cover it appear in Fig. 3.2(b)-(e).

Figure 3.2: Predicate Join Patterns.

Notice that under this perspective:

• Given two specific PJs A and B, we can check whether A covers B in linear time (since the edges of the two PJs are labeled, we can write them down as strings, hash and compare them, modulo the type of the variable nodes).

• Given a specific query PJ A, there is an exponential number of PJs that can cover it (in effect, their number is 2^d, with d being the number of existential variables in A).

A critical feature that boosts our algorithm's performance is that the patterns of subgoals, as graphs, repeat themselves across different source descriptions. Therefore, we choose to compactly represent each such distinct view subgoal with the same CPJ.
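A minimal sketch of this sharing (the encoding below is ours, not the thesis code): a PJ is identified by its predicate name plus, per argument position, a flag telling whether the variable there is distinguished ('d') or existential ('e'). Identical patterns across views collide on the same key, and the covering test of Def. 11 becomes a linear scan over two keys.

```python
def pj_key(pred, args, distinguished):
    """Canonical key of a predicate join pattern: name + per-position type."""
    return (pred,) + tuple('d' if a in distinguished else 'e' for a in args)

def covers(view_key, query_key):
    """Def. 11 over keys: same predicate, and every distinguished query
    position must face a distinguished view position."""
    return (view_key[0] == query_key[0] and
            all(q != 'd' or v == 'd'
                for q, v in zip(query_key[1:], view_key[1:])))

# P1 in S5, S6 and S7 yields the same pattern, so one PJ represents all three:
buckets = {}
for view, args, dist in [('S5', ('i', 'k', 'x'), {'i', 'k', 'j'}),
                         ('S6', ('l', 'n', 'x'), {'l', 'm', 'n', 'o'}),
                         ('S7', ('t', 'u', 'x'), {'t', 'w', 'u'})]:
    buckets.setdefault(pj_key('P1', args, dist), []).append(view)
print(buckets)   # {('P1', 'd', 'd', 'e'): ['S5', 'S6', 'S7']}
```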
Sharing patterns in this way has a tremendous advantage (as also discussed in Sect. 3.7): mappings from a query PJ (or CPJ) to a view are computed just once, instead of every time this subgoal is met in a source description (with the exception of repeated predicates, which are addressed in Sect. 3.3.2). Nevertheless, the "join conditions" for a particular PJ within each view are different; more "bookkeeping" is needed to capture this. In Sect. 3.2.2 we describe a conceptual data structure that takes care of all the "bookkeeping". At this point, we should note that our graph constructs resemble some relevant and well-studied concepts from the literature, namely hypergraphs and hyperedges [1], discussed in Sect. 3.7.

3.2.2 Information Boxes

Each variable node of a PJ holds within it information about the other PJs this variable (directly) connects to within a query or view. To retain this information we use a conceptual data structure called an information box (or infobox). Each infobox is attached to a variable v. Fig. 3.3 shows an example infobox for a variable node. A view (or source) PJ represents a specific subgoal pattern found in multiple sources. Therefore, we want to document all the joins a variable participates in, for every view. Hence, our infobox contains a list of the views that this PJ appears in; for each of these views, we maintain a structure that we call a sourcebox (also seen in Fig. 3.3), where we record information about the other PJs that v is connected to. In effect, we need to mark which PJ, and on which edge of this PJ, v is attached to in that particular view. We call this information a join description of a variable within a sourcebox (inside an infobox, attached to a PJ's variable). For ease of representation, we will denote each join description of a variable v in a sourcebox with the name of the other predicate, superscripted with the number of the edge v is attached to on this other predicate (Sect. 3.3.2 clarifies this in the face of repeated predicates). Fig. 3.4 shows, for all predicates of Q, all the different PJs that appear in sources S1–S7 with their infoboxes. Note that the infoboxes belonging to a query PJ contain only one sourcebox (that of the query), which in turn contains the join descriptions of the variables in the query. We omit the infoboxes of Q, as its joins can easily be seen from Fig. 3.1(a).

Figure 3.3: Infobox for a variable node. The node is existential and is attached to its predicate node by the edge labeled 3 (this variable is the third argument of the corresponding atom). We can find this specific PJ in three views, so there are three sourceboxes in the infobox. The two join descriptions in the sourcebox for S6 tell us that this variable, in view S6, joins with the second argument of P4 and the second argument of P2.
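The following is a small sketch of this bookkeeping (class and method names are ours); a join description is encoded positionally as a (predicate, edge-label) pair, so ('P4', 2) reads "joins the 2nd argument of P4":

```python
from collections import defaultdict

class InfoBox:
    """Attached to one variable node of a shared PJ: one sourcebox per view
    the pattern occurs in, each holding that variable's join descriptions."""
    def __init__(self):
        self.sourceboxes = defaultdict(set)   # view -> {(predicate, position)}

    def add_join(self, view, pred, pos):
        self.sourceboxes[view].add((pred, pos))

    def views_satisfying(self, required):
        """Views whose sourcebox contains all required join descriptions
        (the test behind the pruning of Sect. 3.3.2)."""
        return [v for v, joins in self.sourceboxes.items()
                if required <= joins]

# The third variable of the shared P1 pattern (cf. Fig. 3.3):
ib = InfoBox()
ib.add_join('S5', 'P4', 2)
ib.add_join('S6', 'P4', 2); ib.add_join('S6', 'P2', 2)
ib.add_join('S7', 'P3', 1)
# If a query requires this variable to join P2's 2nd argument, only S6 qualifies:
print(ib.views_satisfying({('P2', 2)}))   # ['S6']
```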
3.2.3 Partial Conjunctive Rewritings

Our compact representation of patterns allows us another major improvement. Previous algorithms would first finish a view-query mapping discovery phase and then go on to gather up all "relevant" view subgoals to form a (conjunctive) rewriting. Our approach exploits the insight the algorithm gains during this phase so as to start building the correct rewriting right away. At different steps of our algorithm, each source CPJ covers a certain part of the query, and within this source CPJ (and only in source CPJs) we maintain a list of conjunctions of atoms which are candidate parts of the final conjunctive rewritings that will cover this part of the query. We call these partial conjunctive rewritings. These conjunctions maintain information about which view variables out of the views' heads are used and which of them are equated, i.e., joined (if at all). For example, the PJ in Fig. 3.4(a) contains the partial rewriting S1(P1^1, P1^2, P1^3).

3.3 Graph-based Query Rewriting

Our solution is divided into two phases. Initially, we process all view descriptions and construct all source PJs. In our second phase, we start by matching each atomic query subgoal (i.e., PJ) to the source PJs, and we go on by combining the relevant source PJs to form larger subgraphs (CPJs) that cover larger "underlying" query subgoals. We continue combining source CPJs, and during this bottom-up combination we also combine their partial rewritings by taking their cross-product and either merging some of their atoms or equating some of their variables. Note that we might also need to "drop" some rewritings along the way, if they are no longer feasible (see Sect. 3.3.2). We continue until we cover the entire query graph, at which point we have incrementally built the maximally-contained and complete rewriting.

3.3.1 Source Preprocessing

Given the set of view descriptions, and considering them as graphs, we break them down into atomic PJs by splitting each graph on the shared variable nodes. On top of the PJ generation, we construct infoboxes for every variable node in those PJs. Moreover, for every PJ we generate a list of partial conjunctive rewritings, each containing the head of a different view this PJ can be found in. Fig. 3.4 shows all the source PJs as constructed by the preprocessing phase. For presentation purposes, in the partial conjunctive rewritings the view heads include only the arguments that we are actually using at this phase. Similarly to the naming scheme of the join descriptions, we use a positional scheme for uniquely naming variables in a partial conjunctive rewriting. For instance, S7.P3^2 is the 2nd argument of PJ P3 in source S7 (Sect. 3.3.2 explains how we disambiguate occurrences of the same predicate in a given view). This naming policy is beneficial, as it allows us to save valuable variable-substitution time.

An advantage of our approach is that our preprocessing phase does not need any information from the query, as it was designed to involve only the views. Typically, a data integration system has access to all sources and view descriptions a priori (before any query). In such a case, the views can be preprocessed offline and the PJs can be stored until a query arrives. The details of this algorithm are rather obvious and omitted, but it can easily be verified that this preprocessing phase has polynomial complexity in the number and length of the views. Nonetheless, one can create more sophisticated indices on the sources. As seen in Fig. 3.2, there are 2^d potential query PJs that a source PJ can cover, with d being the number of distinguished variable nodes in the source PJ. For our implementation we chose to generate those indices: for every source PJ we construct all the (exponentially many) potential query PJs that the former could cover.
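A sketch of this index generation, under the same key encoding as before (our own illustration): a distinguished source position may face either a distinguished or an existential query variable, while an existential source position can only cover an existential one, giving the 2^d keys mentioned above.

```python
from itertools import product

def coverable_query_keys(source_key):
    """All query PJ keys a source PJ key can cover (2^d of them, for d
    distinguished positions in the source pattern)."""
    pred, types = source_key[0], source_key[1:]
    options = [('d', 'e') if t == 'd' else ('e',) for t in types]
    return [(pred,) + combo for combo in product(*options)]

index = {}
for source_key, views in [(('P1', 'd', 'd', 'e'), ['S5', 'S6', 'S7'])]:
    for qkey in coverable_query_keys(source_key):
        index.setdefault(qkey, []).extend(views)

# Online, one lookup fetches the covering source PJs for a query pattern,
# e.g. the query PJ for P1 in Q (x1 distinguished, y and z existential):
print(index[('P1', 'd', 'e', 'e')])   # ['S5', 'S6', 'S7']
```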
Consequently, given a query PJ, we are able to efficiently retrieve the source PJs that cover it. The payoff of all indexing choices depends on the actual data integration application (e.g., whether the preprocessing is an offline procedure, or how large d grows per atom, etc.). Nevertheless, any offline cost is amortized over the system's live usage.

Figure 3.4: All existing source PJs for the predicates of the query Q. For presentation purposes, the infoboxes of the first variable of P1 are omitted. The partial conjunctive rewritings (view heads) each PJ maintains are "descriptions" of the current PJ and are candidate parts of the final conjunctive rewritings. During our preprocessing phase, PJs for P4 (which also exists in the sources) would also be constructed, but they are omitted here (as they are dropped by our later phase). Using these constructs we can compactly represent all 8 occurrences of P1, P2 and P3 in the sources S1–S7 with the 6 PJs presented here.

Moreover, the experiments of Sect. 3.4 show a good performance of our algorithm even when the preprocessing time is taken into account.

Algorithm 1 GQR
Input: A query Q
Output: A set of rewritings for the query
1: for all predicate join patterns PJ_q in the query do
2:   SetP ← retrievePJSet(PJ_q)
3:   if SetP is empty then
4:     FAIL
5:   else
6:     add SetP to S   // S is the set of all CPJ sets
7: repeat
8:   select and remove A, B ∈ S
9:   C ← combineSets(A, B)
10:   if C is empty then
11:     FAIL
12:   add C to S
13: until all elements in S are chosen
14: return rewritings in S

Algorithm 2 combineSets
Input: sets of CPJs A, B
Output: a set of CPJs, combinations of A and B
1: for all pairs (a, b) ∈ A × B do
2:   c ← combineCPJs(a, b)
3:   if c is not empty then
4:     add c to C
5: return C

3.3.2 Query Reformulation

Our main procedure, called GQR (Graph-based Query Rewriting) and shown in Algorithm 1, retrieves all the alternative source PJs for each query PJ and stores them (as a set of CPJs) in S, which is a set of sets of CPJs, initially empty (lines 1-6). As we go on (line 7 and below), we remove any two CPJ sets from S (this choice is arbitrary; nevertheless, a heuristic order of combination of CPJ sets could be imposed for more efficiency), combine all their elements in pairs (as Algorithm 2 shows), construct a set containing larger CPJs (which cover the union of the underlying query PJs), and put it back in S. This procedure goes on until we cover the entire query (combining all sets of CPJs), or until we fail to combine two sets of CPJs, which means there is no combination of views to cover the underlying query subgoals. If none of the pairs in Alg. 2 has a "legitimate" combination, as explained subsequently, the sets of CPJs fail to be combined (line 11 of Alg. 1). Fig. 3.5(c),(g) shows the combination of PJs for P1 with PJs for P2, and in turn their combination with the PJ for P3 (Fig. 3.5(h),(i)) to cover the whole query.

Figure 3.5: All source PJs that cover P1 are combined with all PJs that cover P2. As Alg. 2 suggests, we try all combinations of CPJs that cover the query. The figure does not show the un-combinable pairs (e.g., the PJ in (a) with the PJ in (f)). Note that when combining two nodes we eliminate the join descriptions we just satisfied. In the current example, PJ (d) is combined with both (c) and (g), which alternatively cover the same part of the query. The union of the resulting rewritings of (h) and (i) is our solution.
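As a compact illustration of this control flow, the following is a Python rendering of the loop of Algorithms 1 and 2 (Python 3.8+). Here retrieve_pj_set and combine_cpjs are stand-ins, passed in as functions, for the procedures described above and below; this is a sketch of the skeleton, not the thesis implementation.

```python
def gqr(query_pjs, retrieve_pj_set, combine_cpjs):
    """Skeleton of Algorithm 1: one bucket per query PJ, then pairwise folding."""
    buckets = []                          # S: one set of covering CPJs per query PJ
    for pj_q in query_pjs:
        cpjs = retrieve_pj_set(pj_q)      # Algorithm 3
        if not cpjs:
            return None                   # FAIL: some query subgoal is uncoverable
        buckets.append(cpjs)
    while len(buckets) > 1:               # Algorithm 2, applied until one set is left
        a, b = buckets.pop(), buckets.pop()
        combined = [c for x in a for y in b
                    if (c := combine_cpjs(x, y)) is not None]
        if not combined:
            return None                   # FAIL: no legitimate combination
        buckets.append(combined)
    return buckets[0]                     # CPJs carrying the conjunctive rewritings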
Join preservation

While we combine graphs, we concatenate the partial conjunctive rewritings that they contain. When these rewritings contain the same view, the newly formed rewriting either uses this view twice or only once, depending on whether the query joins are preserved across the mapping to this source.

Definition 12. Join preservation: For all source PJs PJ_A, PJ_B, for all query PJs Q_A, Q_B where PJ_A covers Q_A and PJ_B covers Q_B, and for all views V that contain both PJ_A and PJ_B, we say that V preserves the joins of Q_A and Q_B w.r.t. PJ_A and PJ_B iff for all join variables u between Q_A and Q_B:

• if a is the variable node u maps onto in PJ_A and b is the variable node u maps onto in PJ_B, then a and b are of the same type, and both a's and b's infoboxes contain a sourcebox for V in which a has a join description for b and b has a join description for a, and

• there exists a u such that a and b are existential, or (1) a and b are essentially the same variable of V (in the same position) and (2) for all variables a' of PJ_A and b' of PJ_B such that no join variables of the query map onto a' or b', either a' and b' are don't-care variables (they are distinguished in V but no distinguished variable of the query maps onto them), or (without loss of generality) a' covers a distinguished query variable and b' is a don't-care variable.

Intuitively, when a view preserves the query joins with respect to two PJs, our rewriting can use the view head only once (using both the source PJs) to cover the two underlying query subgoals; this is actually necessary when the PJs cover an existential join variable with an existential view one (as in Property 1 of MiniCon [112]). S6, for example, preserves the query join between P1^3 and P2^2 with respect to the PJs of Fig. 3.4(b) and (c). If, on the other hand, all view variables that cover join variables are distinguished, Def. 12 states the conditions (these are relaxed constraints; we are investigating more cases under which a rewriting could be minimized) under which using the view head two times would be correct but not minimal according to Def. 10. For example, consider the following query and view:

q(x, y, z) ← p1(x, y, z), p2(y, z, w)
v(a, b, c, d) → p1(a, b, c), p2(b, c, d)

A naive covering of p1 and p2 would use v two times, coming up with the conjunctive rewriting:

r(x, y, z) ← v(x, y, z, f'), v(f'', y, z, f''')

If we enforce Def. 12 when combining the view patterns for p1(a, b, c) and p2(b, c, d), we notice that b and c do fall in the same positions in v, and while the first pattern (p1) covers more query variables (it uses a to cover x), the second one has don't cares (d in p2 is a don't care, as it is rewritten as f'''). In this case we can merge the two occurrences of v (using their most general unifier) and come up with a more minimal rewriting: r(x, y, z) ← v(x, y, z, f'''). This optimization is discussed in Sect. 4 of [92].

It is important to notice the dual function of Def. 12. On one hand, given two source PJs that cover an existential join with existential variables, a view must preserve the join in order to be used. If, on the other hand, all source variables that cover query joins are distinguished, the view can preserve the join and be used a minimal number of times.
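A sketch of the merge in this example (the encoding is ours: fresh don't-care variables are written f0, f1, f2, standing in for the primed f variables above): two atoms over the same view head are unified position-wise, a don't care absorbing whatever the other atom provides.

```python
def is_dont_care(term):
    return term.startswith('f')           # naming convention of this sketch only

def merge_view_atoms(args1, args2):
    """Unify two uses of the same view head; None if two required (non-
    don't-care) terms clash at a position, in which case both atoms
    must be kept in the rewriting."""
    merged = []
    for a, b in zip(args1, args2):
        if a == b or is_dont_care(b):
            merged.append(a)
        elif is_dont_care(a):
            merged.append(b)
        else:
            return None
    return merged

# v(x,y,z,f0) and v(f1,y,z,f2) merge into the single atom v(x,y,z,f0):
print(merge_view_atoms(['x', 'y', 'z', 'f0'], ['f1', 'y', 'z', 'f2']))
```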
Algorithm 3 retrievePJSet
Input: a predicate join pattern PJ_q in the query
Output: the set of source PJs that cover PJ_q
1: for all source PJs PJ_s that cover PJ_q do
2:   OkToAdd ← true
3:   for all variable nodes u in PJ_q do
4:     v ← variable of PJ_s that u maps onto
5:     if v is existential then
6:       if u is existential then
7:         for all sourceboxes S in v's infobox do
8:           if joins of u ⊄ joins in S then
9:             drop S from PJ_s
10:      if some of PJ_s's infoboxes became empty then
11:        OkToAdd ← false
12:        break
13:  if OkToAdd then
14:    link PJ_s to PJ_q
15:    add PJ_s to C
16: Prime members of C returned in the past
17: return C

Retrieving source PJs

After the source indexing phase, our first job when a query is given to the system is to construct PJs and information boxes for all its predicates. We then retrieve, with the use of Alg. 3, all the relevant source PJs that cover each query PJ (i.e., the output of the algorithm is a set of PJs). For an input query PJ, line 1 of Alg. 3 iterates over all (existing) view PJs that cover the input. Moreover, as already discussed, if both a query variable and the view variable it maps to (u and v, respectively) are existential, we will not be able to use this view PJ if the view cannot preserve the join patterns of the query under any circumstances. Therefore, if the view variable v is existential, we have to inspect every sourcebox of v's infobox (lines 7-9) and verify that all join descriptions of u are included there. If we find a sourcebox that breaks this requirement, we drop this sourcebox from every infobox of this view PJ, and we delete the partial conjunctive rewriting that mentions the corresponding view as well (line 9). We do all this as if this source PJ pattern never appeared in that view (for the specific subgoal of the query, the specific view subgoal that this source PJ represents is useless). For example, when the query PJ for P1, shown in Fig. 3.2(a), is given to function retrievePJSet, the latter will consider the preconstructed PJs shown in Fig. 3.4(a),(b); for the one in Fig. 3.4(b), line 9 will drop the sourceboxes and the partial conjunctive rewritings related to S5 and S7, since only the sourcebox for S6 describes the query joins. This fail-fast behavior allows us to keep only the necessary view references in a PJ (which, in the above example, is S6). Moreover, if none of the views can cover a query subgoal, the PJ itself is ignored, leading to a huge time saving, as (1) a dropped view PJ means that a significant number of source pieces/partial rewritings are ignored, and (2) if we ignore all source PJs that could cover a specific query PJ, the algorithm fails instantly. For example, consider the query PJ for P3 in Fig. 3.2(a), which joins existentially with P1 and P2 on its 1st argument. In order to use the view PJ of Fig. 3.4(e) (which also has its first node existential) to cover it, we need to make sure that the PJ of Fig. 3.4(e) includes at least one sourcebox (attached to its 1st variable) which contains all the query join descriptions. However, the only sourcebox in that infobox is for S7, and it does not describe the query join with P2. Therefore the sourcebox for S7 is dropped, and as the view PJ remains empty, it is never returned by Alg. 3. On the other hand, if some PJs go through this procedure and get returned, they are truly relevant and have a high probability of generating rewritings.
For our query of Fig. 3.1, retrievePJSet will be called three times (each time it will iterate over one column of Fig. 3.4): for P1 it will return the PJs in Fig. 3.5(a) and (e), for P2 those of Fig. 3.5(b) and (f), and for P3 it returns the PJ shown in Fig. 3.5(d). Note that Alg. 3 also marks (line 14) which variables correspond to returning variables of the query (in an effort to be user-friendly, we maintain the original names of the query variables). This is done by equating some of the partial conjunctive rewritings' variables to distinguished query variables (as seen in Fig. 3.5(a), (d) and (e), we include some "equating" predicates in the rewriting). Also, notice that since our example query Q does not contain P4, none of the PJs for P4 is retrieved. Line 16 of the algorithm is explained in Sect. 3.3.2.

Notice that Alg. 3 returns a set of PJs which alternatively cover the same query PJ. Furthermore, the different sets that Alg. 3 returns cover different (and all) subgoals of the query. That makes it easy to straightforwardly combine these sets to cover larger parts of the query. We treat the elements of each set as CPJs (initially they are atomic CPJs, i.e., PJs), and we pairwise combine all the elements (CPJs) of two sets that cover two query subgoals; we construct a resulting set containing larger combined CPJs. The latter alternatively cover the larger underlying query subgoal. The next section describes how we combine two CPJs into a greater one.

Combination of CPJs

Algorithm 4 takes two CPJs a and b and returns their combination CPJ. If the underlying query subgoals that these CPJs cover do not join with each other, our algorithm just cross-products the partial conjunctive rewritings these CPJs contain and returns a greater CPJ containing all PJs in both CPJs. If, on the other hand, underlying joins exist, procedure lookupJoins in line 1 returns all pairs of variables (v_a, v_b), with v_a in a and v_b in b, that cover the join variables. For example, for the PJs of Fig. 3.5(a) and (b), lookupJoins will return two pairs of variables (for the two joins in the query between P1 and P2): (S6.P1^3, S6.P2^2) and (S6.P1^2, S6.P2^1). Next, we want to enforce Def. 12 and "merge" the view heads in the partial conjunctive rewritings that preserve the join (lines 9 and 17), or equate some of them so as to satisfy the join (line 19). By merging two atoms of the same predicate, we keep the predicate (i.e., the view head) only once and merge their arguments. Otherwise, we consider their conjunction and equate the variables that cover the query join variable. Notice that we can only enforce joins on view variables of the same type (line 4); we either join existential variables within the same source, or join/equate distinguished variables across sources.

Algorithm 4 combineCPJs
Input: two CPJs a, b
Output: a CPJ, combination of a and b
1: for all joins J in lookupJoins(a, b) do
2:   v_a ← get from J the variable in a
3:   v_b ← get from J the variable in b
4:   if type of v_a ≠ type of v_b then
5:     return ∅   // ∅ means un-combinable
6:   else if v_a is existential then
7:     for all sourceboxes S in v_a's infobox do
8:       if S contains a join description for v_b then
9:         markForMerge(S, v_a, v_b)
10:      else
11:        drop S from a and b
12:    if v_a's infobox = ∅ then
13:      return ∅
14:  else
15:    for all pairs of sourceboxes (s_a, s_b) ∈ (infobox of v_a) × (infobox of v_b) do
16:      if s_a preserves the joins of the query w.r.t. v_a and v_b then
17:        markForMerge(s_a, v_a, v_b)   // preservation implies that s_a = s_b
18:      else
19:        markForEquate(s_a, v_a, s_b, v_b)
20: crw ← crossproductRewritings(a, b)
21: enforceMergeAndEquations(crw)
22: c ← mergeGraphsUpdateInfoboxes()
23: return c
If the view variables that cover the join are existential, "merging" is our only option; if v_a contains sourceboxes for different sources than v_b does, or if their common sourceboxes regard views that do not preserve the join, we drop these sourceboxes (line 11) and the corresponding partial conjunctive rewritings. If by this dropping we "empty" a PJ of source references, it is no longer useful to us, and so a and b are un-combinable (line 13). This "pruning" is similar to the one in Alg. 3 and can happen often. On the other hand, if the view variables that cover the query join are distinguished, we either merge the partial conjunctive rewritings on the corresponding view head (in case the view satisfies Def. 12) or we equate v_a and v_b in the two view heads (line 19). Finally, we consider the cross product of the remaining partial conjunctive rewritings, creating larger ones, and we iterate over them to enforce all the merging and equating we just noted down.

For example, in Fig. 3.5, when our algorithm examines the existential join between the PJs shown in (a) and (b), it marks S6 for merging, since S6 preserves the join. At another iteration of line 1, regarding the same PJs but for the distinguished join this time, we need to equate two variables of S6 (namely, P1^2 with P2^1). At the end, both these decisions are enforced, as seen in the partial conjunctive rewriting of Fig. 3.5(c). As we go on, Algorithm 1 combines all elements of S and produces C, the set containing the CPJs seen in Fig. 3.5(h) and (i). Their conjunctive rewritings are the conjunctive rewritings of the solution, and their union is the maximally-contained and complete rewriting:

Q(x1, x2) ← S6(x1, _, y, y), S4(y, x2)
Q(x1, x2) ← S1(x1, y, z, _, _), S3(y, z), S4(y, x2)

Figure 3.6: Repeated Predicates. In (b) only the PJs for P1 are shown; these "capture" all five occurrences of P1 in the sources S1, S2 and S3 of (a).

Repeated Predicates

In general, repeated predicates in the view descriptions should be treated separately; we create multiple PJs per predicate per source and name these PJs differently. We can then create and use infoboxes as described so far. For a query containing the corresponding predicate, the algorithm will try to use all these alternative PJs and combine them with PJs that cover other parts of the query, so as to capture the multiplicity of the predicate in the sources. An important note is that we only need to maintain different PJs for predicates within the same source, but not across sources; hence the maximum number of different source PJs for the same predicate is the maximum number of times a predicate repeats itself within any single source. Fig. 3.6(b) shows how we can use two different PJs to hold the information needed for all five occurrences of P1 in Fig. 3.6(a). In the face of multiple occurrences of the same predicate in the query, it is suitable to imagine all PJs discussed so far as classes of PJs: we instantiate the set of PJs that cover a specific predicate as many times as the predicate appears in the query.
Each time we instantiate the same PJ, we "prime" the sources appearing in the partial rewritings so as to know that we are calling the same source a second, different time (and, as our argument names are positional, "priming" the sources allows us to differentiate between two instances of the same variable in the same source). Line 16 of Algorithm 3 does exactly that. Having said the above, Fig. 3.7 shows all the PJs for the sources of Fig. 3.6 created for the query Q of Fig. 3.7. Rewritings for this query will come out of the 4 possible combinations of these 4 instantiations of the two PJs (of Fig. 3.6).

Figure 3.7: Repeated predicates in the query (returning variable names are shown in Q's graph for convenience). For the sources of Fig. 3.6(a), the algorithm "instantiates" the matching PJs (Fig. 3.6(b)) for every occurrence of the same query predicate.

GQR Correctness

Below we give a proof sketch that our algorithm is correct; for soundness we show that any conjunctive rewriting in our output is contained in the query, and for completeness that, for any possible conjunctive rewriting of the query, we always produce a conjunctive rewriting which contains it.

Theorem 2. Given a conjunctive query Q and conjunctive views V1, ..., Vn, the GQR algorithm produces a UCQ that is a maximally-contained and complete rewriting of Q using V1, ..., Vn.

Proof. Soundness: Consider an output conjunctive rewriting r of our algorithm. We need to show that unfold(r) = r' ⊆ Q. This means that there is a containment mapping from Q to r'. Keep in mind that the atoms in r', however, exist in different views (not in the same one, as the containment mapping definition demands). It is not difficult to see that our approach "constructs" the containment mapping through coverings; each covering is a mapping from a subpart of the query to a part of r'. Considering multiple such coverings gives us our containment mapping from Q to the combination of views that r contains.

Completeness: Consider a conjunctive rewriting p that is an element of a problem's maximally-contained and complete rewriting. There is a containment mapping from Q to unfold(p). This means that, depicted as graphs, there is a graph homomorphism h1 from the query to some of the PJs for the predicates that constitute unfold(p). Breaking up the homomorphism per individual target PJ means that there are coverings from parts of the query to these PJs (these coverings are not affected if we compactly represent these PJs, gathering them up in patterns as in Section 3.3.1). It is not hard to verify that if such coverings exist, our algorithm will consider them when looking at the different view PJs; hence it will produce a rewriting r for the cross-product of the views these PJs belong to, for which it holds that there exists a containment mapping (and a graph homomorphism) h2 : vars(Q) → vars(unfold(r)). Moreover, for all variables q_v of Q which map onto distinguished variables d in those view PJs of r (i.e., h2(q_v) = d), it also holds that h1 maps q_v onto the same node d in the PJs of unfold(p). Hence, whenever r (which is written over view head predicates) has a distinguished variable in some view in its body, p has the same variable in the same position in the same view (modulo renaming). Hence, there is a containment mapping from r to p, which means that p ⊆ r.
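For concreteness, a small checker (our own, useful for testing the soundness direction) that searches for a containment mapping from the query body into an unfolded rewriting body, with the head variables fixed to themselves; constants and more elaborate unfoldings would need additional handling.

```python
def containment_mapping(q_atoms, r_atoms, mapping):
    """Backtracking search for a containment mapping extending `mapping`."""
    if not q_atoms:
        return mapping
    pred, args = q_atoms[0]
    for rpred, rargs in r_atoms:
        if rpred != pred or len(rargs) != len(args):
            continue
        m = dict(mapping)
        if all(m.setdefault(a, b) == b for a, b in zip(args, rargs)):
            result = containment_mapping(q_atoms[1:], r_atoms, m)
            if result is not None:
                return result
    return None

# Q(x1,x2) <- P1(x1,y,z), P2(y,z), P3(y,x2) versus the unfolding of the first
# solution rewriting, S6(x1,_,y,y), S4(y,x2), with fresh variables m and w:
Q = [('P1', ('x1', 'y', 'z')), ('P2', ('y', 'z')), ('P3', ('y', 'x2'))]
R = [('P1', ('x1', 'y', 'w')), ('P4', ('m', 'w')),
     ('P2', ('y', 'w')), ('P3', ('y', 'x2'))]
print(containment_mapping(Q, R, {'x1': 'x1', 'x2': 'x2'}))
# {'x1': 'x1', 'x2': 'x2', 'y': 'y', 'z': 'w'}
```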
3.4 Experimental Evaluation

To evaluate our approach we compared against the most efficient (to the best of our knowledge) state-of-the-art algorithm, MCDSAT [17] (which outperformed MiniCon [112]). We show our performance on two kinds of queries/views: star and chain queries. In all cases GQR outperforms MCDSAT, by close to two orders of magnitude. For producing our queries and views we used a random query generator (kindly provided to us by R. Pottinger; it is the same one she used for the original experimental evaluation of MiniCon).

Figure 3.8: (a) Average time and size of rewritings for star queries. The GQR time does not take the source preprocessing time into account, while gqr+p does. (b) Average time and size of rewritings for chain queries.

3.4.1 Star Queries

We generated 100 star queries and a dataset of 140 views for each query. We created a space of 8 predicate names, out of which each query or view randomly chooses 5 to populate its body; it can choose the same one up to 5 times (for instantiating repeated predicates). Each atom has 4 randomly generated variables, and each query and view has 4 distinguished variables. We measured the performance of each of the queries, scaling from 0 to 140 views. We ran our experiments on a cluster of 2GHz processors, each with 1Gb of memory; each processor was allocated all 140 runs for one query, and we enforced a 24-hour wall time for that job to finish. Fig. 3.8(a) shows the average query reformulation time for the 99 queries which met the time and memory bounds. On this set of experiments we perform 32 times faster than MCDSAT. Fig. 3.8(a) also shows how the average number of conjunctive rewritings grows with respect to the number of views. As we can see in the figure, the preprocessing phase (for this small set of views) does not add much to our algorithm; the query reformulation phase dominates the time as the number of views increases.

3.4.2 Chain Queries

For the chain queries we again generated 100 queries and a dataset of 140 views for each query. Our generator could now choose 8 body predicates (on a chain) for any rule, out of a pool of 20 predicates of arity 4. Up to 5 predicates in each rule can be the same. Input queries have 10 distinguished variables. With this experiment we wanted to avoid measuring exponential response times simply because the size of the rewriting grows exponentially. Hence, trying to create a "phase transition point", we generated the first 80 views of all our view sets with 10 distinguished variables, and each additional view (up to 140) with only 3 distinguished variables. This causes the number of conjunctive rewritings to grow exponentially up to 80 views, but the rate becomes much slower from there on. This trend in the number of rewritings can be seen in Fig. 3.8(b).
This can be seen both when there are no rewritings at all and after the point of 80 views; the reformulation time clearly depends more on the size of the output than on the size of the problem. As also seen in Fig. 3.8(b), GQR produces fewer conjunctive rewritings than MCDSAT. An example query where MCDSAT produced a redundant rewriting is the following. Given the query and view:

q(x_1, x_2, x_3) ← p_1(x_0, x_1, x_2, x_3), p_2(x_1)
v(y_1, y_2, y_3, y_4, y_5, y_6, y_7) → p_1(y_2, y_3, y_9, y_10), p_1(y_4, y_5, y_6, y_7), p_2(y_1)

MCDSAT produces both rewritings below, while GQR produces only the second (which contains the first):

q(x_1, x_2, x_3) ← v(x_1, f_1, f_2, x_0, x_1, x_2, x_3)
q(x_1, x_2, x_3) ← v(f_0, f_1, f_2, x_0, x_1, x_2, x_3), v(x_1, f_8, f_9, f_10, f_11, f_12, f_13)

3.4.3 Chain Queries using 10000 Views

In Fig. 3.9 we show the times for 10 chain queries with the very same experimental setting as the previous subsection, using 10000 views. As previously, the first 80 views for each query have 10 distinguished variables, and the rest only 3. However, as the figure shows, our "phase transition point" did not work well for the queries that had already produced some rewritings within the first 80 views (namely, q0, q4, q5, q6, q7 and q8). The number of conjunctive rewritings for these queries grows exponentially with the number of views. On the other hand, 4 queries did not produce rewritings up to 80 views, and they also did not produce any from that point onward (as the number of distinguished variables of all views after 80 is too constraining to, e.g., cover an atom that has 4 distinguished variables). Nevertheless this figure serves our point. The unfinished queries crashed on a number of views that caused too many conjunctive rewritings, as Fig. 3.10 shows. In this setting, our algorithm runs in less than a second for queries that don't produce rewritings for 10000 views, while it produces 250000 conjunctive rewritings over 10000 views in 44 seconds (for q4).

Figure 3.10: Average size of the rewriting for the chain queries of Fig. 3.9 (number of conjunctive rewritings vs. number of views). Queries q1, q2, q3 and q9 don't produce any rewritings.

Lastly, in an effort to keep the number of conjunctive rewritings low, we decided to set the predicate space (out of which we populate our rules) to equal the number of views at each point. Fig. 3.11 shows how ten chain queries performed with and without the preprocessing times. As seen from the graph, the exponential burden of solving the problem lies in the preprocessing phase, while the online reformulation time is less than a second for all queries, even the ones producing thousands of rewritings (Fig. 3.12 shows the number of rewritings for these queries).

Figure 3.11: Ten chain queries on views constructed from an increasing predicate space (time in ms vs. number of views; predicate space equal to the number of views, 8 predicates per body, 4 variables per predicate, up to 5 repeated predicates). The upper bunch of straight lines gives the total time for the queries, while the lower dotted part gives only the reformulation time.

Figure 3.12: Number of rewritings for the 10 queries of Fig. 3.11.

3.5 Extension to Constants

In order to support constants in our views and queries, few changes are needed. Constants in our formulas behave for the most part as distinguished variables, so we use the same kind of nodes (i.e., circles) to represent these terms. We extend the definition of a covering in order to take constants into account.

Definition 13 (Covering). For all queries Q, for all views V, for all subgoals g_q ∈ body(Q), for all subgoals g_v ∈ body(V), and for all mappings φ : terms(g_q) → terms(g_v), we say that a view subgoal g_v covers a subgoal g_q of Q with φ iff:

• (a mapping exists:) φ(g_q) = g_v,
• (covering distinguished variables:) for all x ∈ vars(g_q), if x is distinguished then either φ(x) ∈ vars(g_v) is a distinguished variable in the view, or φ(x) ∈ cons(g_v),
• (covering constants:) for all c ∈ cons(g_q), either φ(c) = c, or φ(c) ∈ vars(g_v) is a distinguished variable in the view.
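Definition 13 translates directly into a per-subgoal test. The following Python sketch is only illustrative and uses our own encoding (terms tagged as distinguished, existential, or constant), not GQR's actual data structures:

def covers(g_v, g_q):
    """Return a mapping phi under which the view subgoal g_v covers the
    query subgoal g_q per Def. 13, or None. A term is a tagged pair such
    as ("dist", "x"), ("exist", "y") or ("const", "CS548")."""
    (pred_q, terms_q), (pred_v, terms_v) = g_q, g_v
    if pred_q != pred_v or len(terms_q) != len(terms_v):
        return None
    phi = {}
    for tq, tv in zip(terms_q, terms_v):
        if phi.get(tq, tv) != tv:         # phi must be a function
            return None
        kind_q, kind_v = tq[0], tv[0]
        if kind_q == "dist" and kind_v == "exist":
            return None                   # distinguished needs a dist. variable or a constant
        if kind_q == "const" and not (tv == tq or kind_v == "dist"):
            return None                   # constants map to themselves or a dist. variable
        phi[tq] = tv
    return phi

# A query constant covered by a distinguished view variable:
g_q = ("p", (("dist", "x"), ("const", "5")))
g_v = ("p", (("dist", "a"), ("dist", "b")))
print(covers(g_v, g_q))
# {('dist', 'x'): ('dist', 'a'), ('const', '5'): ('dist', 'b')}

As discussed next, when a constant is covered by a distinguished view variable the partial rewriting must additionally equate that view's returning variable to the constant.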
The joinboxes for constant nodes differ in two ways from the joinboxes of variables. First, a joinbox for a constant node contains the constant value. Second, while it still contains a list of all sources that this PJ appears in, it does not contain join descriptions for this node for any of those sources; constants just restrict the values of relation attributes and are not considered to participate in any joins (even if the same constant appears twice in a formula, we do not consider this a join). We allow mappings between query and source PJs to happen as if constants were distinguished variables, as stated in Def. 11, taking special care only when mapping a constant query node to a constant view node; in that case the nodes should contain the same constant value, otherwise the mapping fails. Similar to the case of distinguished variables, during mappings between PJs that contain constants we might have to add an equate predicate in the partial rewriting of a source PJ that covers a constant with a distinguished variable, or to equate a returning variable of a conjunctive rewriting to a constant (if that distinguished query variable got covered by the specific constant). To make this change to our framework clear we give two examples.

Assume that we have two sources, S_1 and S_2, that provide information about students (identified by a unique id). S_1 contains ids of students that are or have been enrolled in a course with code "CS548" during some semester t_1. S_2 contains graduate students and their first semester in the graduate program. The contents of these sources are modeled respectively by the two following LAV views:

S_1(s_1) → Student(s_1), Enrolled(s_1, t_1, "CS548")
S_2(s_2, t_2) → Graduate(s_2, t_2)

Assume the user asks for all graduate students enrolled in the course "CS548":

q(x) ← Enrolled(x, t_1, "CS548"), Graduate(x, t_2)

The maximally-contained rewriting of q is:
q'(x) ← S_1(x), S_2(x, f)

Also, consider S_3, which is specific to a graduate student with id "9517":

S_3(t_2) → Graduate("9517", t_2)

Including S_3 in our example, the maximally-contained rewriting of q is the union:

q'(x) ← S_1(x), S_2(x, f)
q''("9517") ← S_1("9517"), S_3(f)

3.6 Extension to Answering UCQs and Multiple Queries Using Views

In order to address the multi-query rewriting problem we leverage insights revealed by our work on GQR. In GQR we used a graph representation of views, which we also adopt in this section for our input queries. Our algorithm, MGQR (Multiple Graph-based Query Rewriting), scalably rewrites multiple queries in the presence of large numbers of views. It extends the GQR approach in two significant ways. First, MGQR finds common graph patterns both in the queries and in the views, compactly representing and indexing these patterns, but carefully keeping track of which patterns are relevant for which queries. Second, the graph patterns are combined incrementally in a way that multiple views are used to cover multiple queries simultaneously. Our experimental results show a promising speedup of rewriting the user queries in batch versus rewriting the user queries one by one.

3.6.1 Overlapping Queries

Having multiple input queries can introduce redundancies in a query rewriting algorithm in a way very similar to having multiple views. Recall the earlier example describing medical records sources. S_1 contains doctors that treat patients with a chronic disease. S_2 contains doctors, patients and clinics where the doctor is responsible for discharging the patient from the clinic.

S_1: V1(doctor, disease) → TreatsPatient(doctor, patient), HasChronicDisease(patient, disease)
S_2: V2(doctor, patient, clinic) → DischargesPatientFromClinic(doctor, patient, clinic)

Also consider q_3 below, which asks for doctors that treat patients with chronic diseases, and the clinics where they discharge those same patients from:

q_3(d, c) ← TreatsPatient(d, x), HasChronicDisease(x, y), DischargesPatientFromClinic(d, x, c)

Query q_3 requires that the second argument of DischargesPatientFromClinic is joined with the patients that are treated for chronic diseases. This is impossible to answer given S_1 and S_2, as S_1 does not provide the patients. Recall that this is the basic idea of the MiniCon algorithm: trying to map all query predicates of q_3 to all possible views, it will notice that the existential query variable x in the query maps to patient in S_1; since patient is existential, it needs to go back to the query and check whether all predicates mentioning x can be covered by S_1. Here DischargesPatientFromClinic(d, x, c) cannot. First, it would try to do this mapping and "backtracking" for every possible view, even for those that contain the same pattern as S_1, like S_3 below, which provides surgeons and the (chronic) diseases of the patients they treat:

S_3: V3(doctor, disease) → TreatsPatient(doctor, patient), HasChronicDisease(patient, disease), Surgeon(doctor)

S_3 cannot be used for q_3 as it violates MiniCon's Property 1, again due to patient being existential and DischargesPatientFromClinic not being covered.
Second, any "one-by-one" algorithm would check this property for every possible query, regardless of the overlap with previous queries, as in q_4:

q_4(d, c) ← TreatsPatient(d, x), DischargesPatientFromClinic(d, x, c)

S_1 and S_3 cannot be used for q_4 for exactly the same reason as for q_3: patient is existential and DischargesPatientFromClinic is not covered. Despite the multiple occurrences of TreatsPatient across different input queries, any "one-by-one" query rewriting algorithm would try to use S_1 and S_3 multiple times (and fail for all queries that join TreatsPatient with DischargesPatientFromClinic in the way shown in q_3 and q_4). Note that these redundancies hold even for successful rewritings. If S_1 and S_3 did cover DischargesPatientFromClinic, in order to use them a "one-by-one" algorithm would make the same steps for both q_3 and q_4. Our idea is to avoid this redundant work by compactly representing all occurrences of the same query or view pattern, extending GQR to handle multiple queries; we describe our solution in Sect. 3.6.2.

Figure 3.13: (a) Queries q_3, q_4, q_5, and (b) sources S_4-S_7 as graphs. Query (c) and Source (d) Predicate Join Patterns. (e) Infoboxes for a query and a view variable node. The variable in the upper box is existential and appears in two queries, q_3 and q_4. The join description associated with q_3 states that the variable joins, in q_3, with the first argument of P_3 and the first argument of P_2.

3.6.2 The MGQR Algorithm

Our solution is divided into an offline phase, which preprocesses all the views, and an online phase, which produces rewritings in the face of a set of input queries. The offline phase is very similar to GQR's preprocessing phase, but with the additional feature that each source graph PJ maintains a list of "candidate" parts of the final conjunctive rewritings per query (each of which will eventually be "responsible" for covering this part of the query); again these are called partial conjunctive rewritings. Our online phase has three stages. First, we represent the queries as graphs and find common patterns. Second, for every query graph pattern (which now represents pieces of multiple queries), we retrieve the preconstructed view patterns that cover it (which in turn represent pieces of multiple views). Third, we incrementally combine the view patterns into larger ones, progressively covering the underlying queries. Consequently, we naturally come up with a "batch" of contained rewritings, using multiple views to cover multiple queries at the same time. We will use the following queries and views to illustrate our algorithm:

q_3(x, y) ← P_1(x, z), P_2(y, z), P_3(z)
q_4(x, y) ← P_1(x, z), P_2(y, z)
q_5(w) ← P_2(y, w), P_3(w)

S_4(x, y) → P_1(x, y)
S_5(x, y) → P_2(x, y)
S_6(z) → P_3(z), P_1(z, x), P_2(z, y)
S_7(x, y) → P_1(x, z), P_2(y, z)
3.6.3 Graph Modeling of UCQs

Our graph representation of queries is similar to our modeling for views presented in Sect. 3.2. Queries q_3, q_4, q_5 correspond to the graphs in Fig. 3.13(a). The graphs for the source descriptions S_4, S_6 and S_7 appear in Fig. 3.13(b). Fig. 3.13(c) shows all predicate join patterns that the query q_3 contains (i.e., the query PJs for q_3). For the query PJ for P_1 seen in Fig. 3.13(c), all source PJs that could potentially cover it (cf. Def. 5) appear in Fig. 3.13(d). Unless our sources contain one of these two patterns, any query that contains this variation of P_1 will fail (immediately) to be rewritten.

As in GQR's modeling of views, we attach to each variable node an infobox. A variable's infobox contains a list of queries/views in which this PJ appears and, for each such rule, the variable's join descriptions, which record which other PJs this variable (directly) joins to within the specific query/view. Fig. 3.13(e) shows two example infoboxes, for one query and one view variable node.

Figure 3.14: The PJs covering P_1 are first combined with the PJs covering P_2, and then with the PJs covering P_3. The union of the complete rewritings (marked with a star) is the solution.

The upper level of Fig. 3.14 also shows all the different PJs that appear in queries q_3-q_5 with their infoboxes (aggregating information from all queries where they appear). All the different source PJs, relevant to the query ones, that appear in sources S_4-S_7, with their infoboxes, are shown on the middle level of the same figure. Additionally, at different steps of our algorithm each source graph, consisting of PJs, covers a certain part of the queries, and within this graph we maintain a list of "candidate" parts of the final conjunctive rewritings per query (each of which will eventually be "responsible" for covering this part of the query); we call these partial conjunctive rewritings.
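To make this bookkeeping concrete, here is one possible Python rendering of PJs and infoboxes. The field names are our own, chosen for exposition; this is a sketch of the idea rather than the system's actual implementation, and the q_4 entry below is purely illustrative.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A join description points at another PJ and an argument position;
# ("P3", 1) reads "joins with the first argument of P3".
JoinDesc = Tuple[str, int]

@dataclass
class Infobox:
    # per query (or view) in which this PJ occurs, the joins of this
    # variable node within that rule
    joins: Dict[str, List[JoinDesc]] = field(default_factory=dict)

@dataclass
class VariableNode:
    existential: bool
    infobox: Infobox = field(default_factory=Infobox)

@dataclass
class PJ:
    predicate: str
    nodes: List[VariableNode]                    # one node per argument position
    # partial conjunctive rewritings per query, filled in while combining
    partial_rewritings: Dict[str, list] = field(default_factory=dict)

# The existential node of Fig. 3.13(e): it occurs in q3 and q4 and, in q3,
# joins with the first argument of P3 and the first argument of P2.
node = VariableNode(existential=True)
node.infobox.joins["q3"] = [("P3", 1), ("P2", 1)]
node.infobox.joins["q4"] = [("P2", 1)]           # illustrative entry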
3.6.4 Multi-query Rewriting

The three stages of MGQR's online phase correspond to the three horizontal levels of Fig. 3.14. First, MGQR constructs unique PJs and infoboxes for all common patterns that appear across the queries. This procedure can be implemented in time polynomial in the number and the length of the queries (we omit its description for space). Second, having constructed and indexed all source PJs offline, MGQR can efficiently retrieve the source PJs that cover our query PJs at runtime. All source PJs that cover a query PJ form a set (each "bubble" in the second level of Fig. 3.14), whose elements contain alternative partial rewritings for the pieces of the queries represented by the query PJ. At the third level of Fig. 3.14, MGQR combines these sets into larger ones, combining their partial rewritings and progressively covering larger subgoals of the underlying queries.

Retrieving source patterns

MGQR uses Algorithm 5 to retrieve all the relevant source PJs that cover each query PJ (i.e., the output of the algorithm is a set of PJs). For an input query PJ, line 1 of Alg. 5 iterates over all (existing) view PJs that cover the input. Moreover, due to MiniCon's Property 1 discussed in Sect. 2.4, if both a query variable and its mapped view variable (u_q and v_s, correspondingly, in our pseudocode) are existential, we won't be able to use this view PJ for a particular query if the view cannot "preserve" the join patterns of the query. Depending on this, all the sources in a source PJ's infobox are associated with some or all of the queries in the underlying query PJ.

When considering a source variable node and its covered query variable, we examine all pairs of sources and queries that the corresponding infoboxes contain; if the variables are existential (line 7) we need to make sure that each source in the source PJ (PJ_s in our algorithm) will only be used to cover underlying queries for which it describes their joins. Therefore, for every source S in v_s's infobox and every query q_i in u_q's infobox, we have to verify that all join descriptions of u_q are included in the sources. If we find a source S that breaks this requirement, we mark that q_i and S are uncombinable in PJ_s (line 9). On the other hand, if S can cover q_i on this variable's joins (line 13), or it doesn't need to because u_q and v_s are distinguished (line 16), we associate S with q_i. Note that further checks on a different variable node might break a previous association of a query with a source on PJ_s (we do this in line 9 as well). Association, apart from the creation of a relative pointer, means that we create a partial conjunctive rewriting that uses a source for a specific query. In the set of source PJs that is retrieved for query PJ P_1 in Fig. 3.14, source S_7 does not cover the existential joins of query q_3 and hence it is only associated with q_4.

If a source cannot be associated with any query for a certain query PJ (e.g., source S_6 in Fig. 3.14 for P_1), we drop this source from every infobox of this source PJ, and we delete the partial conjunctive rewriting that mentions the corresponding view as well (lines 17-18). We do all that as if this source PJ pattern never appeared in that view (for the specific subgoal of this query PJ's queries, the specific view subgoal that this source PJ represents is useless). Moreover, if some query (of the query PJ) does not get associated with any view for PJ_s, this means the specific query subgoal (that PJ_q represents), and consequently the entire query, cannot be rewritten. Hence, we destroy all the information for this query in the system, as if this query never existed in our input (line 23). This fail-fast behavior allows us to keep only the necessary query/view references in a PJ. Moreover, if none of the views can cover a query PJ, the source PJ itself is ignored and never returned, leading to faster reformulation performance as (1) a dropped view PJ means that a significant number of source pieces/partial rewritings are ignored, and (2) if we ignore all source PJs that could cover a specific query PJ, the algorithm fails instantly for the queries of this query PJ. On the other hand, if some PJs go through this procedure and get returned, they are really relevant and have a high possibility of generating rewritings for their associated queries.

The algorithm also addresses repeated predicates, i.e., selfjoins, in the input queries (line 29).
In the face of multiple occurrences of the same predicate in a query, it is convenient to consider all source PJs discussed so far as classes of PJs: we instantiate the set of source PJs that cover a specific predicate as many times as the predicate appears in the algorithm's input. Each time we instantiate the same PJ we "prime" the sources appearing in the partial rewritings, so as to record that we are calling the same source a second, different time. This modeling is a natural extension of our approach for repeated predicates in the views (described earlier in this chapter). We omit further discussion of repeated predicates in the same query or view due to space limitations.

Algorithm 5 Retrieve Source PJ Sets for Input Queries
Input: Predicate join patterns PJ_q in the queries
Output: Set of source PJs that "alternatively" cover each PJ_q.
1: for all PJ_s, source PJs that cover PJ_q, do
2:   OkToAdd ← true
3:   for all variable nodes u_q in PJ_q do
4:     v_s ← the variable of PJ_s that u_q maps onto
5:     for all sources S in v_s's infobox do
6:       for all queries q_i in PJ_q do
7:         if v_s is existential and u_q is existential then
8:           if joins in u_q for q_i ⊈ joins in S then
9:             mark q_i and S as uncombinable in PJ_s
10:          else
11:            if q_i and S are not marked as uncombinable in PJ_s then
12:              if joins in u_q for q_i ⊆ joins in S then
13:                associate S with q_i
14:        else
15:          if q_i and S are not marked as uncombinable in PJ_s then
16:            associate S with q_i
17:   if there is some source S marked as uncombinable with all queries of PJ_q then
18:     drop S from PJ_s
19:     if some of PJ_s's infoboxes became empty then
20:       OkToAdd ← false
21:       break
22:   if some queries are not associated with any source then
23:     drop all those queries from all query PJs and from all returned cover-sets of source PJs
24:     OkToAdd ← false
25:     break
26:   if OkToAdd then
27:     add PJ_s to C
28: if we have seen the input query PJ_q before, i.e., it is a repeated pattern (selfjoin), then
29:   rename (i.e., prime) the elements of C (which have also been returned in the past)
30: return C

Alg. 5 returns a set of PJs which alternatively cover the same query PJ. Also, different queries of the query PJ could be associated with different source PJs in the returned set. Furthermore, the different sets that Alg. 5 returns cover different (and all) subgoals of the queries. Next we want to combine these sets to cover larger parts of the queries.

Combination of Source Graph Patterns

To combine two sets of source graphs that cover two different query PJs, and consequently two different sets of pieces of queries, MGQR uses Algorithm 6. We want to combine elements of these sets that cover the same queries (there is no need to try to combine a PJ that covers a part of q_3 with a PJ that covers a part of q_4 but not q_3). In fact, if one of the two sets covers queries that are not covered by the other set, we copy the corresponding source PJs directly into the resulting set (lines 7-11). In line 10, in case these PJs also cover queries common between the two sets, we keep a copy of them in the original set so we can go on and combine it; this copy is now associated only with the common queries (in essence we "project out" of the source PJs the non-common queries before we combine them). Thus, we only combine elements of these sets (i.e., source PJs) if they cover common queries (lines 12-17).
The third level of Fig. 3.14 shows this procedure; the all-distinguished-variable source PJs for P_1 and P_2 are combined on their common queries (i.e., q_3 and q_4), while a copy of the source PJs for P_2 associated with q_5 is directly passed on to the resulting set.

Algorithm 6 Combine Source PJ Sets
Input: Two sets of source PJs: A, B
Output: 1) Set C combining A and B, with partial rewritings. 2) Complete rewritings R found so far.
1: for all queries q_i covered by set A (or set B) do
2:   if q_i is completely covered then
3:     add its rewritings to R
4:     delete this query from set A (resp. set B) and any associated PJs
5: if some source PJs are "empty" of queries then
6:   delete them from the set
7: for all queries q covered by exactly one set out of A and B do
8:   for all PJs PJ_i (elements of A or B) associated with q do
9:     create a copy of PJ_i, associate it only with q, and put it in set C
10:    if PJ_i is also associated with some queries that exist in both A and B then
11:      the copy of PJ_i remaining in A or B should be associated only with the queries common between A and B (drop information about other queries)
12: for all queries q common to sets A and B do
13:   for all PJ_a, elements of A associated with q, do
14:     for all PJ_b, elements of B associated with q, do
15:       combine(PJ_a, PJ_b)
16:       if the combination is successful then
17:         put the result in C
18: return C, R

Lines 1-6 of Alg. 6 check whether some queries have been completed so far, in which case we output their rewritings and delete their PJs (if the PJs are not associated with any other "active" rewriting). We omit the procedure that combines two specific graph patterns (line 15), but we should state that it does so based on the underlying query joins, combining/merging the source PJs' partial conjunctive rewritings, per query, into larger ones, eventually producing the maximally-contained rewritings of each query. This combination could also fail (due to existential-distinguished patterns or due to violation of the corresponding underlying query joins), in which case nothing is added to the resulting set in line 17. If combining two sets returns an empty set, the two query PJs and all their associated queries instantly fail. The order in which we combine the retrieved sets of PJs is currently driven by a simple heuristic: prefer to combine the sets that share the biggest number of queries.
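This greedy combination order is easy to state in code. In the Python sketch below (ours, for illustration only) each retrieved set is assumed to carry the names of the queries it covers in a `queries` attribute, and `combine_sets` stands in for Algorithm 6; for simplicity the sketch ignores Algorithm 6's stream of complete rewritings R.

def pick_next_pair(pj_sets):
    """Return the indices of the two PJ sets sharing the most queries."""
    best, best_overlap = None, -1
    for i in range(len(pj_sets)):
        for j in range(i + 1, len(pj_sets)):
            overlap = len(pj_sets[i].queries & pj_sets[j].queries)
            if overlap > best_overlap:
                best, best_overlap = (i, j), overlap
    return best

def combine_all(pj_sets, combine_sets):
    """Repeatedly combine the most-overlapping pair until one set remains;
    combine_sets plays the role of Algorithm 6 and returns the merged set."""
    sets = list(pj_sets)
    while len(sets) > 1:
        i, j = pick_next_pair(sets)
        merged = combine_sets(sets[i], sets[j])
        sets = [s for k, s in enumerate(sets) if k not in (i, j)]
        sets.append(merged)
    return sets[0]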
3.6.5 Experimental Results

To evaluate our multiple-query rewriting algorithm, we generated 100 experiments, each one testing a multi-query input on a set of views. Each input had 10 chain queries, and each view set had up to 1000 chain views. Each query/view had 8 predicates, out of which up to 4 could be repeated. Each atom had 4 randomly generated variables and each query had 10 distinguished variables. (The initial version of the query/view generator was kindly provided to us by Rachel Pottinger; it is the same one she used for the experiments of MiniCon.) We ran our experiments on a cluster of 2 GHz processors, each with 2 GB of memory, and gave each processor one (out of the one hundred) experiment: one multi-query and one view set. Each processor ran the experiments between 0 and 1000 views (in increments of 50 views at a time) in two settings: (1) using our MGQR multi-query "batch" rewriting algorithm, and (2) using GQR to rewrite the queries one by one.

Since we compute all rewritings, we want to avoid problems that produce an exponential number of rewritings or produce no rewritings. In the first case, the times would be dominated by the exponential output, and in the second case, our algorithms prove unsatisfiability extremely fast. So, neither of those settings would be interesting. Hence, we try to find the "phase transition" of the query rewriting problem, where a number of rewritings are produced that are hard to find. To simulate this condition, we generated the first 180 views of all our view sets with 10 distinguished variables, and each additional view (up to 1000) with only 3 distinguished variables. We created the bodies of our views by randomly choosing 8 predicates out of an increasing predicate space. These two choices led to view sets that initially generate an exponential number of rewritings, but subsequently the number of rewritings grows very slowly as the number of views grows (see Fig. 3.15).

Initially, we generated the user queries randomly, as we did for the views. To produce overlap in the user queries, we used a space of 20 predicate names out of which each query randomly chooses 8 to populate its body. We hoped to create enough overlap in the queries to demonstrate the benefits of our algorithm. However, it turned out that this was not adequate. In each multi-query experiment only one or two, out of the ten, queries would generate rewritings. As both GQR and MGQR fail very quickly when there are no rewritings, both algorithms had similar performance. In fact, we were only measuring the rewriting time for one or two queries, making GQR and MGQR indistinguishable. Thus, we decided to generate the user queries based on combinations of the queries that did produce rewritings in our early experiments. Specifically, we chose ten multi-query sets that produced a substantial amount of rewritings and kept only the two queries per set that produced rewritings. We replicated these queries in order to grow them back to ten queries per set. For each set, we also deleted two predicates out of each query body (different predicates each time). This way the queries of each set, now of length 6, were all different, but overlapping (and at the same time rewritable).

Figure 3.15: Average online time and number of rewritings for 10 chain queries over up to 1000 views (time in ms and number of conjunctive rewritings vs. number of views). The MGQR multi-query rewriting algorithm outperforms GQR rewriting queries one by one. Offline times are the same for both algorithms (not shown).

Fig. 3.15 shows the average online times that these 10 multi-query sets, each having 10 conjunctive queries, took to reformulate over different numbers of sources, up to 1000. Batch MGQR outperforms the sum of the one-by-one rewritings by GQR by a factor of approximately 1.5 (for the 1000-view problem, GQR takes 4265 ms and MGQR takes 2798 ms). Fig. 3.15 also shows that the number of rewritings grows up to 34771 for the 1000-view problem, up from 27370 rewritings for the 200-view problems. Note that in the region where the number of rewritings grows slowly, both GQR and MGQR times are proportional to the number of rewritings, instead of depending on the number of sources available. In summary, both GQR and MGQR can rewrite multiple queries over large numbers of views, producing tens of thousands of rewritings, in under a few seconds.
MGQR outperforms GQR on multiple-query problems when there is meaningful overlap between the queries.

3.7 Discussion and Related Work

Early approaches to query rewriting involve algorithms such as the bucket algorithm [87] and the inverse rules algorithm [51]. A more efficient approach was proposed in 2001 by the MiniCon [112] algorithm. Similar to our notion of CPJs, MiniCon devises MCDs (MiniCon Descriptions), which are coverings as defined in Def. 11 (they are defined per distinct pair of query and view subgoals) with the additional property that they always consider as a whole the existentially chained pieces of the query whenever one existential query variable is mapped to an existential one in the view (Property 1 as discussed in Sect. 3.1). Minimal MCDs are MCDs which don't contain anything more than these existentially chained pieces. In the following, we briefly describe the two phases of the MiniCon algorithm and give our algorithm's advantages against each one of its steps. It is important to notice that MCDSAT, which exhibited better performance than MiniCon [17], is essentially the MiniCon algorithm cast as a satisfiability problem; comparing our algorithm against MiniCon thus also reveals our advantages against the foundations of MCDSAT.

3.7.1 MiniCon Phase One

Before considering a covering of an atomic query subgoal g_q with a view V, MiniCon uses a head homomorphism h : vars(head(V)) → vars(head(V)) to possibly equate some variables in the head of V (for all variables x, if x is existential then h(x) = x, and if it is distinguished, then h(x) is distinguished and h(h(x)) = h(x)). It can then look for a homomorphism φ so as to cover g_q with an atomic subgoal h(g_v) ∈ h(V). Note that in the original MiniCon algorithm the authors suggest a search over the entire space of all least-restrictive head homomorphisms h and mappings φ such that φ(g_q) = h(g_v) ∈ h(V) (STEP 1). Subsequently, for each g_q, g_v and pair of functions h, φ that come out of STEP 1, the algorithm produces a minimal MCD (STEP 2). Formally, an MCD M is a tuple ⟨V, φ, h, G_q⟩ where (a) h(V) covers G_q ⊆ body(Q) (containing g_q) with φ, and (b) for all x, if φ(x) is an existential variable, then all atomic subgoals g_i ∈ body(Q) that mention x are in G_q. An MCD M is minimal if (c) it covers g_q together with the minimum additional subset of body(Q) needed to satisfy (a) and (b) (since MCDs are essentially descriptions of coverings, we abuse terminology and say that an MCD covers a subgoal). A minimal MCD intuitively covers only one connected subgraph of the query graph which adheres to property (b) above (clause C_2 of Property 1 in [112]).

3.7.2 MiniCon Phase Two

In its second phase the MiniCon algorithm needs to choose sets of MCDs which cover mutually exclusive parts of the query and whose union covers the entire query (STEP 3). There then follows a lightweight generation of a conjunctive rewriting for each MCD combination (STEP 4). For each conjunctive rewriting that it produces, the analogue of our Def. 12 is employed in order to "minimize" some of the rewritings (STEP 5).

3.7.3 MCDSAT

In MCDSAT, the authors cast MiniCon's first phase (the MCD generation problem) into a propositional theory whose models constitute the MCDs. This theory is then compiled into a normal form called d-DNNF that implements model enumeration in polynomial time in the size of the compiled theory.
STEP 1 and properties (a), (b) and (c) of STEP 2 result in clauses of that theory, which is further extended with more optimization clauses. The result is an efficient algorithm for generating MCDs. Nevertheless, as we discuss below, we believe that we have a better encoding of the problem than MiniCon's steps 1 and 2, which MCDSAT also employs. For the second phase, that of MCD combination and rewriting generation, MCDSAT considers either a traditional implementation of MiniCon's second phase or yet another satisfiability approach: some additional clauses are devised in order to present an extended logical theory whose models are in correspondence with the rewritings. Note that, although the authors present the experimental performance of the compilation times of these extended theories, for our own evaluation we consider (and exhibit much better performance on) the times to get the actual rewritings themselves. Nonetheless, a contribution of these compiled extended theories is that they serve as compact repositories of rewritings; once compiled, one can get the rewritings in time polynomial in the size of the compiled theory.

3.7.4 GQR vs MiniCon and MCDSAT

Our encoding of the problem exhibits several advantages against steps STEP 1-STEP 5. Firstly, we don't need to explicitly deal with finding h in STEP 1, since we abstract from variable names. Returning variables are equated implicitly as we build up our CPJs. Moreover, MiniCon iterates over every pair of input subgoals g_q and g_v. In combination with STEP 2(b), this implies that some query subgoals will be considered more than once. In effect, while in STEP 2 the query subgoals g_i are included in an MCD for input g_q (that came out of STEP 1), subsequent iterations will result in them forming the same MCD (among others) when they are themselves considered as the input subgoals of STEP 1 (although the algorithm suggests such a brute-force approach, it is indeed possible that the original MiniCon implementation did prune out some redundant MCDs; the MCDSAT satisfiability perspective most probably also avoids this redundancy). GQR, on the other hand, considers each query subgoal exactly once and thus avoids these redundant mappings.

A major advantage of our encoding is that while STEP 2 is considered for every distinct atomic view subgoal of every view, we consider PJs; being compact representations of view patterns, they are dramatically fewer than the distinct view subgoals. Nevertheless this does not mean "more work" for a subsequent phase. On the contrary, while iterating over this smaller number of constructs we avoid the heavy computation of STEP 2. We essentially avoid "tracking" each individual existentially chained piece in a view for every query subgoal we map onto a part of it (as STEP 2(b) suggests). The same piece patterns are considered just once for all sources, at the time we combine their individual parts (PJs and CPJs). This design also benefits our performance in the face of repeated predicates in a view. Consider the view V_1(x, g, f) → P_1(x, y), P_2(y, z), P_3(z, g), P_3(z, f) and the query Q(a, b) ← P_1(a, y), P_2(y, z), P_3(z, b). Here MiniCon will most probably (to the best of our knowledge) create two MCDs for P_3, each time recomputing the mapping of P_1(a, y), P_2(y, z) to the view. On the other hand, we will consider this mapping just once.
Moreover, as our previous argument exposed, this join pattern of P_1 and P_2 will at the same time be detected in every source it appears in (at the same step we also rule out of future consideration all the non-combinable sources that contain these patterns). We would like to point out that we do consider the whole set of mappings from every atomic query subgoal to all the atomic view subgoals per view, in a way very similar to STEP 1. We do this, however, in our preprocessing phase, where we additionally compute the PJ infoboxes. As a result we are left with drastically fewer constructs to deal with, and in a more straightforward and less costly manner than STEP 2. Moreover, our algorithm exhibits additional advantages in that its second phase (the CPJ combination) already encapsulates the entire second phase of MiniCon and MCDSAT. And it does so even more efficiently; STEP 3 of the related algorithms needs to explicitly choose "mutually exclusive" MCDs in order to provide non-redundant rewritings, whereas in our case all sets of CPJs (as seen in Sect. 3.6.4) are mutually exclusive. We do need to combine all their elements pairwise; however, we can fail fast when one such pair is un-combinable, as the reader can also see in Sect. 3.6.4. The formulation of the actual rewritings (STEP 4 for MiniCon) is embodied in our second phase and is done incrementally (through the partial rewritings) during the combination of the CPJs into larger graphs. STEP 5 is also smoothly incorporated into this incremental building of a rewriting through the implementation of Def. 12. In conclusion, PJs (and CPJs) seem to be better constructs than MCDs for encoding the problem. This is also why we believe our solution performs better than the other algorithms.

3.7.5 Hypergraphs and Hypertree Decompositions

As already mentioned, our graphs and PJs resemble some relevant and well-studied concepts from the literature, namely hypergraphs and hyperedges [1], respectively. Nevertheless, we find our own notation more suitable for focusing on the type of the query's variables as well as for making their join descriptions explicit. Since hypergraphs were developed for representing single queries, it is less convenient (although entirely symmetrical to our approach) to attach infoboxes to them, or to have hyperedges represent information across multiple views. In [63], hypertree decompositions were devised; these are hierarchical join patterns which combine hyperedges bottom-up into ever larger query fragments. Nevertheless, the focus of this representation is different; each vertex of these constructs can represent a whole set of atoms and/or variables, and each variable and atom induces a connected subtree. On the other hand, looking at our own incremental build-up of the view CPJs as a tree of combinations, we differentiate; we keep some compact "hyperedges" per vertex, and each parent vertex contains all and exactly its children. Again, it is possible that our CPJ combination could be translated into a variation of hypertree decompositions, and we plan to further investigate this perspective in the future.

4 Scalable Containment for UCQs under Constraints

In this chapter, we consider the problem of query containment under ontological constraints, such as those of RDFS.
Query containment, i.e., deciding whether the answers of a given query are always contained in the answers of another query, is an important problem for areas such as database theory and knowledge representation, with applications to data integration, query optimization and minimization. We consider unions of conjunctive queries, which constitute the core of structured query languages such as SPARQL and SQL. We also consider ontological constraints or axioms, expressed in the language of Tuple-Generating Dependencies [24, 1]. TGDs capture RDF/S and fragments of Description Logics. We consider classes of TGDs for which the chase is known to terminate. Query containment under chase-terminating axioms can be decided by first running the chase on one of the two queries and then relying on classic relational containment. When considering unions of conjunctive queries, classic algorithms for both the chase and the containment phases suffer from a large degree of redundancy. We leverage our graph-based modeling of rules, presented in the previous chapter, which represents multiple queries in a compact form by exploiting shared patterns amongst them. We develop two new graph-based algorithms: one for chasing multiple queries, and one for checking containment amongst multiple formulas. Our graph-based chase algorithm can run directly on PJs, our graphs for compactly representing multiple queries, and outputs PJs that represent the chased queries. Our graph-based containment algorithm receives two unions of PJs, representing the two UCQs to be tested for containment, and efficiently computes compact partial homomorphisms among the pieces of multiple queries represented as one graph. As a result we couple the phases of both the chase and regular containment, and end up with a faster and more scalable algorithm. Our experiments show a speedup of close to two orders of magnitude.

TGDs are a generalization of the language of inclusion dependencies. As discussed, query containment and answering under general TGDs is undecidable [23], so efforts have been made to devise syntactic restrictions of TGDs for which these problems are decidable and/or tractable. These fragments are usually studied along with relevant reasoning algorithms that allow us to solve the aforementioned problems. The reasoning algorithm that we employ here is the well-known chase algorithm [8, 98]. The chase is a tool that allows query answering over incomplete databases (w.r.t. a set of decidable TGDs) by "completing" the missing data. The chase is also useful for checking containment; for all conjunctive queries q_1, q_2 and for all sets Σ of TGD constraints, it holds that q_1 is contained in q_2 under the constraints iff the chase of q_1 with Σ is contained in q_2. Hence, once we have an algorithm for containment of conjunctive queries, we can reuse it for queries under dependencies by first running the chase on one of the two queries [72, 106, 30].
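As a point of reference for the compact algorithm developed in this chapter, a naive single step of the chase for single-atom-antecedent (LAV) TGDs fits in a few lines of Python. This sketch is ours and makes simplifying assumptions: atoms are (predicate, terms) pairs, every TGD term is a variable, and variables of the antecedent are universal while the remaining ones are existential and receive fresh labeled nulls.

import itertools

_fresh = itertools.count()

def chase_step(body, tgd):
    """Fire one LAV TGD on every matching atom of a query body and return
    the atoms to be added. A full restricted chase would first try to
    extend the homomorphism over the consequents (the compact algorithm
    of Sect. 4.3 does exactly this on PJs); here we only skip consequent
    atoms that are literally present already."""
    (pred_a, terms_a), consequents = tgd
    new_atoms = []
    for (pred, terms) in body:
        if pred != pred_a or len(terms) != len(terms_a):
            continue
        h = {}  # homomorphism from the antecedent onto this atom
        if not all(h.setdefault(ta, t) == t for ta, t in zip(terms_a, terms)):
            continue
        for (pred_c, terms_c) in consequents:
            # existential variables get fresh labeled nulls, shared
            # across the consequents of this one firing
            img = tuple(h.setdefault(t, f"_n{next(_fresh)}") for t in terms_c)
            atom = (pred_c, img)
            if atom not in body and atom not in new_atoms:
                new_atoms.append(atom)
    return new_atoms

# A domain constraint in the style of RDFS: R(x, y) -> C(x)
tgd = (("R", ("x", "y")), [("C", ("x",))])
print(chase_step([("R", ("a", "b"))], tgd))   # [('C', ('a',))]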
We specifically focus on weakly acyclic sets of constraints which only have a single atom in the antecedent. Weakly acyclic LAV constraints capture useful web ontology languages like RDF/S (http://www.w3.org/RDF/), and are known to have good properties in data integration and exchange [4] and in inconsistent databases [6].

The contributions of this chapter are the following:

• We leverage our previously introduced (Chapter 3) graph modeling of rules, which can represent multiple queries in compact graph structures, exploiting overlapping parts of these queries (Sect. 4.2).
• This modeling allows us to create an efficient index for the predicates in a UCQ. Through this index we can map (or fail to map) a certain predicate onto a set of predicates/queries at once. Moreover, for a certain predicate pattern we keep pointers to other joined predicates in the UCQ.
• This allows us to compactly chase a UCQ, by triggering constraints for a set of queries in batch, and by adding multiple consequents across queries in a single step. This compact chase algorithm runs much faster than the classic chase algorithm, and in addition it results in a compact representation of the chased queries.
• This compact graph representation of UCQs allows us to compute homomorphisms among multiple queries in batches, saving the redundant cost of checking each predicate of each individual query.

These designs result in a faster, more scalable containment under constraints. We experimentally show (Sect. 4.4) that we can check containment among UCQs of several hundreds of queries, under hundreds of constraints, while being two orders of magnitude faster than the classic solutions.

4.1 The UCQ Containment Problem

We consider UCQs, and we will formally refer to a UCQ as the "query", while an individual conjunctive query in a UCQ will be explicitly called so. We denote conjunctive queries with lower-case letters (e.g., q), while UCQs use upper-case letters (e.g., Q). Consider the following rules, which capture "domain" and "range" properties in RDF/S, as well as "subclass" relations. Constraint c_1 states that the domain and the range of the TreatsPatient relation are Doctors and Patients, respectively. Constraint c_2 states that Doctors are ClinicEmployees.

c_1: ∀x, y TreatsPatient(x, y) → Doctor(x), Patient(y)
c_2: ∀x Doctor(x) → ClinicEmployee(x)

Consider the following UCQ Q_4:

q_4(d) ← TreatsPatient(d, p), Surgeon(d)
q_4(d) ← TreatsPatient(d, p), HasChronicDisease(p, dis)
q_4(d) ← TreatsPatient(d, p), Doctor(d)

Without considering the constraints, no query in Q_4 is contained in query q_5 below:

q_5(doc) ← ClinicEmployee(doc), Doctor(doc)

However, under the constraints c_1 and c_2, the entire UCQ Q_4 is contained in q_5. This can be seen by chasing Q_4 and noticing that there exist containment mappings from q_5 to each one of the queries in chase(Q_4):

chase(q_4)(d) ← TreatsPatient(d, p), Surgeon(d), Doctor(d), Patient(p), ClinicEmployee(d)
chase(q_4)(d) ← TreatsPatient(d, p), HasChronicDisease(p, dis), Doctor(d), Patient(p), ClinicEmployee(d)
chase(q_4)(d) ← TreatsPatient(d, p), Doctor(d), Patient(p), ClinicEmployee(d)

The standard chase algorithm repeatedly finds homomorphisms from the rules' antecedents into the queries. We first notice that there is a redundancy here. In effect, the chase algorithm will separately consider the three occurrences of the predicate TreatsPatient in Q_4 and add the consequents of c_1 each time. Similarly, for every different occurrence of the Doctor predicate across all conjunctive queries, rule c_2 will be applied. In this chapter we adopt the graph-based modeling of queries introduced in GQR, which can compactly represent different occurrences of the same predicate across multiple rules (in Chapter 3 this is done for views).
This yields an optimized chase algorithm that detects homomorphisms and chases parts of multiple queries in a single rule application. Moreover, this algorithm adds consequents in the same graph-based form, resulting in a compact representation of the chased queries. The compact output of our chase algorithm is tailored towards, and proves particularly useful for, optimizing the relational containment algorithm as well. Classic containment of chase(Q_4) in q_5 suffers from the same redundancies as the chase, since the algorithm has to iterate over all different predicates of chase(Q_4) in order to find mappings for the predicates of q_5, considering the same patterns multiple times. This redundancy would, symmetrically, be worse in the case that q_5 was a UCQ: all candidate "containing" queries would be checked (no matter how overlapping they are) until we find one that maps to a particular query in chase(Q_4). As our experimental evaluation shows, our compact representation speeds up containment under constraints significantly.

Figure 4.1: (a) The conjunctive queries in Q_4, and in chase(Q_4), as graphs. (b) Predicate Join Patterns (PJs) for all the queries in Q_4.

4.2 Graph-Based Modeling

Using our graph modeling from Chapter 3, the queries of Q_4 from the previous section (with abbreviated predicate names for brevity, e.g., q_4(doc) ← TP(doc, p), S(doc)) correspond to the graphs seen on the left part of Fig. 4.1(a). The right part of Fig. 4.1(a) shows the graph representation (again with abbreviated predicate names) of the conjunctive queries in chase(Q_4). Fig. 4.1(b) shows all predicate join patterns that query Q_4 contains. Fig. 4.2 shows PJs with their infoboxes. Recall that a variable's infobox contains a list of the queries that this PJ appears in and, for each such query, the variable's join descriptions. This way we record which other PJs this variable (directly) joins to within any of the queries this pattern appears in. Fig. 4.2(a) shows, for all predicates of Q_4, all the different PJs (with their infoboxes) that appear in its conjunctive queries (we assume the queries are named q_4^1, q_4^2, and q_4^3).

Figure 4.2: (a) All the PJs of Q_4 with their infoboxes. (b) All the PJs of chase(Q_4) (the result of the compact chase algorithm on the PJs of Q_4). In the two figures we see the infoboxes of all variable nodes. For example, we can find the PJ for P (bottom right corner of figure (b)) in three queries: q_4^1, q_4^2, and q_4^3. The two join descriptions related to q_4^2 in the PJ for P tell us that its variable, in query q_4^2, joins with the second argument of HCD and the second argument of TP. In figure (b) all the pre-existing infoboxes have been updated by our algorithm to reflect the new chased set of queries.

4.3 UCQ Containment under Constraints

Given the set of queries, and considering them as graphs, we break them down into atomic PJs by splitting each graph on the shared variable nodes. On top of the PJ generation we construct infoboxes for every variable node in those PJs. Fig. 4.2(a) shows all the query PJs as constructed by this phase. The details of the PJ construction algorithm are straightforward and omitted, but it can easily be verified that this phase has complexity polynomial in the number and length of the queries. On top of our PJ generation we create a simple index on the queries, by creating a hashtable on every different pattern (PJ) so we can retrieve it efficiently. As noted, for a specific pattern we might retrieve more than one PJ if we have repeated predicates within the conjunctive queries themselves.
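This index can be as simple as a hash table keyed by a canonical form of the pattern. The Python sketch below is our own simplified encoding (the real index stores the richer PJ structures described above): keys record the predicate plus, per argument position, whether the node is distinguished or existential, and a key may map to several PJs when the same pattern repeats within one query (selfjoins).

from collections import defaultdict, namedtuple

PJStub = namedtuple("PJStub", "predicate node_kinds")   # e.g. ("TP", ("d", "e"))

def pattern_key(predicate, node_kinds):
    # repeated variables within an atom would also need to be encoded
    # in the key; we omit that here for brevity
    return (predicate, tuple(node_kinds))

class PJIndex:
    def __init__(self):
        self._table = defaultdict(list)

    def add(self, pj):
        self._table[pattern_key(pj.predicate, pj.node_kinds)].append(pj)

    def retrieve(self, predicate, node_kinds):
        # may return several PJs (repeated patterns), or [] on failure
        return self._table.get(pattern_key(predicate, node_kinds), [])

idx = PJIndex()
idx.add(PJStub("TP", ("d", "e")))
print(idx.retrieve("TP", ("d", "e")))
# [PJStub(predicate='TP', node_kinds=('d', 'e'))]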
4.3.1 Algorithm for Chasing

Algorithm 7 is used for chasing a union of queries using a set of weakly acyclic LAV dependencies. In line 1 the algorithm iterates over all constraints, and in line 2 it finds all the different antecedents in the queries. SetA in line 2 contains all the different patterns (PJs) for the antecedent predicate. As mentioned, SetA could contain the same exact pattern twice, in order to distinguish a predicate with the same pattern repeating itself within one query. Line 3 checks that occurrences of the antecedent have indeed been found in the queries (otherwise we can go to the next constraint).

Algorithm 7 Compact Chase
Input: A UCQ query Q, a set of LAV-WA constraints Σ
Output: A set of PJs representing chase_Σ(Q)
1: for all σ ∈ Σ : P(~x, ~z) → C_1(~x, ~y), C_2(~x, ~y), ..., C_n(~x, ~y) do
2:   SetA ← RetrievePJwithPredicate(P) //occurrences of P(~x, ~z) in the query
3:   if SetA is empty then
4:     continue to the next constraint (line 1)
5:   else
6:     if σ has been triggered for all PJs in SetA then
7:       continue to the next constraint (line 1)
8:   for all PJs PJ_a ∈ SetA do
9:     for all i, take consequent C_i of σ do
10:      SetC_i ← RetrievePJSet(C_i) //occurrences of C_i(~x, ~y) in the queries
11:      for all PJs PJ_Ci ∈ SetC_i do
12:        for all common conjunctive queries q in PJ_a and PJ_Ci do
13:          if checkQueryPJCanBeUsed(q, PJ_Ci, C_i, PJ_a) then
14:            mark C_i as already satisfied in q
15:    if there is a consequent C_i with an unsatisfied query q then
16:      mark all other consequents as unsatisfied for query q
17:    for all i, take consequent C_i of σ do
18:      construct a new query PJ holding information for all conjunctive queries q not satisfied in C_i
19:      add the new query PJ to our index and update the queryboxes of the other PJs

For every distinct PJ, we trigger every applicable constraint only once; if the consequents have been added for this specific PJ, we will not apply the rule again. This is checked in line 6.

In line 8 we iterate over all candidate antecedent PJs, PJ_a, in our UCQ, and in the following lines we try to see whether the consequents of the rule are already implied for some of the conjunctive queries in the infoboxes of PJ_a (those are the queries that the pattern PJ_a appears in). For some of those we might find PJs that imply them, and for some others we might have to create new PJs to stand for the consequent predicates. For each consequent C_i in the rule, we retrieve all PJs, PJ_Ci, that match this consequent, and we get all the common queries between PJ_a and PJ_Ci (lines 11, 12). Depending on the different join descriptions of these queries inside PJ_Ci, the already existing PJ PJ_Ci might be useful to "stand as" C_i for some of those queries, and for some it might not. This is what the call to checkQueryPJCanBeUsed (shown in Alg. 8) decides. Alg. 8 essentially checks that, for a specific query, the joins of C_i are described in the query's boxes inside PJ_Ci. This guarantees that, for the specific query, C_i homomorphically maps to PJ_Ci.
Algorithm 8 checkQueryPJCanBeUsed(q, PJ_Ci, C_i, PJ_a)
Input: A conjunctive query q, a preexisting query PJ PJ_Ci, a constraint PJ C_i, the query pattern PJ_a which unified with the constraint antecedent
Output: true if PJ_Ci already implies C_i
1: for all edges k of PJ_Ci get node N_k do
2:   box_q ← the query box for q from N_k's infobox
3:   M_k ← the node in edge k of C_i
4:   box_c ← the constraint box in M_k's infobox
5:   if M_k is distinguished in the constraint then
6:     if joins with the antecedent in box_c ⊈ joins in box_q then
7:       return false
8:   else
9:     if joins in box_c ⊈ joins in box_q then
10:      return false
11: return true

Notice that, later, some other PJ_Cj might fail to cover another consequent, C_j, of the same constraint for the same query. This renders the entire mapping infeasible. The intuition is that when we mapped PJ_Ci to C_i (by checking the inclusion of their joins), we believed that there is a homomorphism from all the consequents to the PJs of the specific conjunctive query. However, C_j spoils that. Our algorithm remembers that fact, backtracks, and "cancels" all previous associations. For ease of presentation we included this in lines 15-16 of Alg. 7, even though in our implementation this check is done at the same time as the mappings, using some pointers and data structures in order to remember "what" to cancel. Lastly, in line 17, we construct new PJs to hold information for all queries and consequents left unsatisfied, essentially compactly adding consequents to the original queries.

We would like to point out line 6 in Alg. 8. This basically relaxes the demand that all joins of a variable in the constraint need to be in the query, if that variable is a distinguished variable in the constraint. If a variable is a distinguished variable in the consequent, this means it belongs to the constraint antecedent as well. When we unify the constraint antecedent with a predicate p in a query, all the consequents of the constraint mentioning this variable will automatically map it to the same variable. As an example, consider chasing q_6 with c_3 below:

c_3: ∀x, y TreatsPatient(x, y) → Doctor(x), Surgeon(x), Patient(y)
q_6(d) ← TreatsPatient(d, p), Doctor(d)

There is no homomorphism from the consequents of c_3 to q_6; nevertheless, we don't have to add the Doctor predicate again to q_6. The fact that x is in the antecedent of the rule means that there is only one value it can take when we add it to q_6 (and that is d, since TreatsPatient(x, y) unified with TreatsPatient(d, p)). Hence, if the Doctor in the query already joins with TreatsPatient on d, we don't have to look to satisfy the other joins in the constraint. Running our compact chase algorithm on the query PJs of Fig. 4.2(a) with constraints c_1 and c_2 results in the query PJs of Fig. 4.2(b).
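The per-node inclusion test at the heart of Algorithm 8 (and of Algorithm 10 below) is a one-line set comparison once the boxes are materialized. The following Python sketch uses our own simplified encoding of boxes as collections of (predicate, argument-position) join descriptions; it is illustrative, not the system's implementation.

def check_query_pj_can_be_used(query_boxes, constraint_boxes, distinguished):
    """Sketch of Algorithm 8 over parallel per-edge lists: query_boxes[k]
    holds the joins recorded for the query in node N_k of PJ_Ci;
    constraint_boxes[k] holds the joins of node M_k of consequent C_i,
    split into those shared with the antecedent and all of them;
    distinguished[k] says whether M_k is distinguished in the constraint.
    Distinguished nodes only need their joins with the antecedent,
    mirroring line 6 of Alg. 8."""
    for box_q, box_c, dist in zip(query_boxes, constraint_boxes, distinguished):
        required = box_c["with_antecedent"] if dist else box_c["all"]
        if not set(required) <= set(box_q):
            return False
    return True

# In the c_3/q_6 example above, the Doctor node is distinguished in c_3 and
# its only join with the antecedent is on TreatsPatient's first argument,
# which q_6's Doctor node already has, so Doctor is recognized as satisfied:
print(check_query_pj_can_be_used(
    query_boxes=[[("TP", 1)]],
    constraint_boxes=[{"with_antecedent": [("TP", 1)],
                       "all": [("TP", 1), ("S", 1)]}],
    distinguished=[True]))   # True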
If we fail to retrieve any PJ in Q_2 having the same pattern as PJ_Q1, the algorithm instantly fails (line 4), since this means that the queries of PJ_Q1 cannot be covered (i.e., proven contained). If we do retrieve some PJs, we try to use them to prove containment for all the queries in the infoboxes of PJ_Q1 that have not already been satisfied (line 6). For all such conjunctive queries q_1 in PJ_Q1, we iterate over the retrieved PJs of Q_2 and the queries they contain (lines 7-8).

Algorithm 9 checkContainment({PJs in Q_1}, {PJs in Q_2})
Input: Two sets of PJs representing the PJs of two UCQs Q_1 and Q_2 resp.
Output: true if Q_1 ⊆ Q_2
1: queriesLeft ← all conjunctive queries in Q_1
2: for all PJ_Q1 ∈ {PJs in Q_1} do
3:   PJsInQ2ForQ1 ← RetrievePJs(PJ_Q1, {PJs in Q_2})
4:   if PJsInQ2ForQ1 is empty then
5:     return false
6:   for all conjunctive queries q_1 in the intersection of queriesLeft and PJ_Q1 do
7:     for all PJ_Q2 ∈ PJsInQ2ForQ1 do
8:       for all conjunctive queries q_2 in PJ_Q2 do
9:         for all edges k of PJ_Q2 get node N_k do
10:          box_q2 ← the query box for q_2 from N_k's infobox
11:          M_k ← node in edge k of PJ_Q1
12:          box_q1 ← the query box for q_1 from M_k's infobox
13:          if joins in box_q2 ⊈ joins in box_q1 then
14:            continue queries in PJ_Q2  //goto line 8
15:          else
16:            for all joined/neighbour PJs NPJ_Q2 in {PJs in Q_2} as described in box_q2 do
17:              if there is no joined/neighbour PJ NPJ_Q1 in {PJs in Q_1} as described in box_q1, s.t. checkPJQ2mapsOnPJQ1(NPJ_Q2, NPJ_Q1, q_1, q_2) then
18:                continue queries in PJ_Q2  //goto line 8
19:        queriesLeft = queriesLeft \ q_1  //reaching here means q_2 maps to q_1, so q_1 is contained
20:        if queriesLeft is empty then
21:          return true
22:        else
23:          continue queries in PJ_Q1  //goto line 6
24: return false

Inside the for loop of line 8 we consider whether a specific PJ_Q2, for one of the queries it contains, say q_2, can map onto query q_1 in PJ_Q1. This is done by looking at all variables of PJ_Q2 and getting the joins described for q_2 (line 10). If those joins are not contained in those of q_1 in PJ_Q1, q_2 cannot map to q_1, so we should try the next candidate containing query in the infobox of PJ_Q2 (lines 13-14). Else, if the joins are contained in the information related to q_1 in the infobox of PJ_Q1, this is an indication that q_2 might map onto q_1. However, we are not done yet, since we need to make sure that the other (joined) predicates of q_2 in PJ_Q2 can themselves map to q_1. Lines 16-18 describe this; we follow the joins described in the corresponding infoboxes of PJ_Q2 and PJ_Q1 and make sure the "neighboring" PJs also map to each other for q_2 and q_1. In fact, since we are looking to map q_2 into q_1, only one (line 17) of the neighbors of PJ_Q1 is sufficient to cover a neighbor of PJ_Q2 (essentially this says that q_1 can have more joins). The aforementioned check involves a call to Alg. 10, which is almost identical to Alg. 8 and always checks that the neighbors indeed map on all variables for queries q_1, q_2 (variable types do not matter here as they do in Alg. 8). Line 19 is in the for loop of line 8, and if we reach it, it means q_1 was satisfied (otherwise previous lines would jump back to the beginning of the loop and exhaust it).

² We are using the terms "containee" and "containing" to ease the presentation even if containment might fail for some queries.
Hence, in lines 19-23 we remove q_1 from our list and go to line 6 to continue with the next containee conjunctive query.

Algorithm 10 checkPJQ2mapsOnPJQ1(PJ_Q2, PJ_Q1, q_1, q_2)
Input: Conjunctive "containee" query q_1, and conjunctive "containing" query q_2, a "containee" query PJ PJ_Q1, and a "containing" query PJ PJ_Q2
Output: true if q_2 in PJ_Q2 maps to q_1 in PJ_Q1
1: for all edges k of PJ_Q2 get node N_k do
2:   box_q2 ← the query box for q_2 from N_k's infobox
3:   M_k ← node in edge k of PJ_Q1
4:   box_q1 ← the query box for q_1 from M_k's infobox
5:   if joins in box_q2 ⊈ joins in box_q1 then
6:     return false
7: return true

4.4 Experimental Evaluation

We evaluated our approach by comparing against our implementation of the brute-force algorithms for the chase and for UCQ containment using containment mappings. To the best of our knowledge there is no other optimized algorithm available for the definitions we introduced in Section 4.1. The classic chase algorithm is straightforward: we take all containee query predicates and, for all those for which a chase rule is applicable (i.e., there is a homomorphism from the antecedent to the query predicate that cannot be extended over the consequents), we add the (image of the) consequents to the query. The classic containment algorithm that we implemented takes all chased containee query predicates and then looks at the first predicate of every containing query until it finds one that can map onto a containee predicate. For a query to be containing, all its predicates must map to the containee, so we choose the first one as a "seed" for the homomorphism. For that first containing predicate and a given containee query, if we do not find a mapping we check the next containee predicate of the same query. If we do, then we check that the rest of the containing query has an extended containment mapping to the containee query; if not, we again move to the next containee predicate of the same query. Alg. 11 describes this procedure.

Algorithm 11 checkBruteForceContainment(Q_1, Q_2)
Input: A "containee" UCQ query Q_1, and a "containing" UCQ query Q_2
Output: true if Q_1 ⊆ Q_2
1: for all conjunctive queries q_1 ∈ Q_1 do
2:   for all conjunctive queries q_2 ∈ Q_2 do
3:     firstAtom ← first atom of q_2
4:     for all atoms p_1 ∈ q_1 do
5:       if exists a containment mapping from firstAtom to p_1 then
6:         if exists an extension of the containment mapping from the rest of q_2 to q_1 then
7:           continue with the next containee query  //goto line 1
8:         else
9:           continue with the next containee predicate  //goto line 4
10:      else
11:        continue with the next containee predicate  //goto line 4
12:  return false  //if we reach here there is some q_1 which could not be mapped by any q_2
13: return true

We used the random-data generator from Chapter 3 to produce 1000 chain queries (queries where each predicate joins with the next one). We created a space of 20 predicate names, out of which each conjunctive query randomly chooses 8 to populate its body; it can choose the same one up to 5 times (for instantiating repeated predicates). Each atom has 4 randomly generated variables. We generated the first 80 queries with 10 head variables, and the rest with just 3. Having fewer distinguished variables among the queries in general makes the containment problem harder, as containment mappings need to map distinguished to distinguished variables.
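For concreteness, the following is a minimal Java sketch of such a chain-query generator. The class and member names are ours (hypothetical), the parameters (20 predicate names, bodies of 8 atoms, at most 5 repetitions of a predicate, 4 variables per atom) follow the description above, and we assume each atom shares its first variable with its predecessor and its last variable with its successor, which is one natural reading of "each predicate joins with the next one".

    import java.util.*;

    // Sketch of the chain-query generator described above (names are ours).
    public class ChainQueryGenerator {
        static final int PREDICATES = 20, BODY_SIZE = 8, MAX_REPEATS = 5, ARITY = 4;
        final Random rnd = new Random();

        record Atom(String predicate, List<String> args) {}

        // Generates one chain-query body: each atom shares its last
        // variable with the first variable of the next atom.
        List<Atom> generateBody() {
            List<Atom> body = new ArrayList<>();
            Map<String, Integer> used = new HashMap<>();
            int fresh = 0;
            String joinVar = "x" + fresh++;
            while (body.size() < BODY_SIZE) {
                String pred = "P" + rnd.nextInt(PREDICATES);
                if (used.merge(pred, 1, Integer::sum) > MAX_REPEATS) continue; // cap repeats
                List<String> args = new ArrayList<>();
                args.add(joinVar);                        // join with the previous atom
                for (int i = 1; i < ARITY - 1; i++) args.add("x" + fresh++);
                joinVar = "x" + fresh++;                  // join with the next atom
                args.add(joinVar);
                body.add(new Atom(pred, args));
            }
            return body;
        }
    }

A head for the generated query would then be formed by choosing 10 (or 3) of the generated variables as distinguished, per the setup above.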
For generating our constraints we wrote a weakly-acyclic constraint generator, and we generated 200 constraints. Each constraint had a single antecedent predicate with 4 head variables, and 4 consequent predicates (with 4 variables each) joined in a chain. In order to generate the predicates and variables of the constraints we, again, chose randomly from the same predicate and variable space as in the queries. Each constraint could have up to 3 repeated predicates. We ran our experiments on a MacBook with a 2.3GHz processor. We implemented our code in Java and gave 2GB of RAM to the running environment.

We ran two sets of experiments. For the first run we randomly chose two sets of 700 queries out of the same 1000 above. We ended up with two UCQs with no containment between them. This fact does not change even when we chase the containee UCQ and add more predicates to it (when q_2 ⊈ q_1, then chase(q_2) ⊈ q_1 as well). We ran containment checks for these two sets of 700 queries under several numbers of constraints. Fig. 4.3 shows our results. As the number of constraints grows, our total time becomes about two orders of magnitude better than that of the classic algorithms. Our total time is divided into the graph chase time and the graph containment time. The latter is the time it takes for Algorithm 9 to check containment after the queries have been chased; it hence gives a feeling of how our algorithm behaves when we have no constraints but rather "longer" (chased) containee queries.

[Figure 4.3: Checking containment for two UCQs of 700 queries each, under various numbers of constraints. The containment check fails for all cases. We ran each experiment 5 times and took the average times.]

From the graph of Fig. 4.3 we see that when: 1) there is no containment among the UCQs, 2) there are no constraints (interpreting each point of the x-axis as a point with no constraints but with longer "chased" containee queries), and 3) the queries in the containee UCQ are not much "longer" than the containing queries (x ≤ 150), the classic algorithm for containment seems to perform better. This is because it is sufficient for one query to be proved non-contained and the classic algorithm stops. The classic containment algorithm pays the full cost when there are actually containments for every query; it then has to check all of them. Nevertheless, we see that as the queries become larger (e.g., after chasing them with 150 constraints in this setting), our graph containment starts performing better (still with no constraints and no containment). Our graph-based approach for just the containment induces a cost of generating graphs which does not pay off when the algorithms fail fast to prove containment, since the containee queries are "short". Nevertheless, our compact format of the containee queries pays off as they get longer (x ≥ 150).

For our second experiment we wanted to see how the algorithms perform when there are containment mappings (Fig. 4.4). Hence, for our containee UCQ we randomly chose 500 queries from our original set of 1000 queries, and we checked for containment against the latter (containment always exists).

[Figure 4.4: Checking containment for two UCQs with 500 and 1000 queries resp. under various numbers of constraints. The containment check always succeeds. We ran each experiment 5 times and took the average times.]

Here our graph-based approach outperforms the classic
algorithm's times in all phases of the problem, again by about two orders of magnitude. Interestingly, it seems that once we have the chased containee queries as graphs, our containment time remains roughly constant, which means that the algorithm efficiently navigates through our compact graphs and finds the same containments in the same time even though the length of the queries grows. This is still true when ignoring the constraints and assuming relatively short (x > 30) chased queries as our input.

The dominating time for the algorithms in both figures is the time to chase the queries. This is because, while in the containment case it is sufficient for one query to fail, in the chase case one needs to consider all predicates of all containee queries. Our compact chase algorithm does this much faster and hence is a clear win in all interesting cases in both figures. Moreover, when we consider constraints as part of the problem, our algorithms combined run much faster than the classic ones in almost all cases (with more than 30 constraints). An additional advantage of our approach in the presence of constraints is that our chase algorithm outputs our graphs right away (so, in a sense, we get those for free for the containment phase).

4.5 Discussion and Related Work

The problem of query containment has been thoroughly studied, starting with the seminal work of Chandra and Merlin [39], who proved that conjunctive query containment, without constraints, is an NP-complete problem. Containment of conjunctive queries with comparison predicates is Π_2^p-complete [73], and containment of datalog programs is undecidable [41]. Containment of conjunctive queries under unrestricted functional and inclusion dependencies is undecidable [103]. Starting with the work of Johnson and Klug [72], a number of decidable combinations have been explored, using the chase [99] as the core reasoning tool. Ensuring termination of the chase by imposing syntactic restrictions on the form of the constraints has been particularly fruitful [30, 31, 32, 123]. We follow on this work by presenting an algorithm for query containment under LAV weakly-acyclic dependencies.

We have presented a radically improved (by two orders of magnitude) solution to the problem of UCQ containment under weakly acyclic LAV dependencies, for which the chase terminates, and which can represent practically important languages such as RDFS. Inspired by our previous work on query rewriting (Chapter 3), we achieve significant scalability by exploiting common patterns in the constraints and queries. Thus, we provide a practical algorithm to reason about and optimize query evaluation in the semantic web. We are working towards extending the languages of supported constraints. Extending to TGDs with more than one predicate in the antecedent should be relatively straightforward, since we already have a graph-based method for computing homomorphisms among conjunctive formulas. Extending to non-chase-terminating cases of constraints would need a preprocessing of the containing query as well, with algorithms that look more like perfect reformulation [36] rather than the chase. In line with our previous work in Chapter 3, we would like to extend the algorithms presented here to support query answering using views under constraints.

5 Optimizing the Chase: Scalable Data Integration under Constraints

In this chapter we deal with virtual data integration and data exchange under constraints/dependencies.
We address these problems individually and under a unifying lens. In data exchange the problem is how to materialize a target database instance, satisfying the source-to-target and target dependencies, that provides the certain answers. In data integration, the problem is how to rewrite a query over the target schema into a query over the source schemas that provides the certain answers. In both of these problems we make use of the chase algorithm, the main tool to reason with dependencies. Our first contribution is to introduce the frugal chase, which produces smaller universal solutions than the standard chase, while remaining polynomial in data complexity. Often, in the presence of existential variables/labeled nulls, some of the consequences of a chase rule are already implied in the database, and these facts can be reused. That is, although the rule is not satisfied (i.e., there is no homomorphism from the consequent to the database instance that is an extension of the antecedent homomorphism), the consequent might be partially satisfiable; adding a subset of the consequent's atoms can construct such a homomorphism from the entire consequent. We introduce the frugal chase, which is equivalent to the standard chase, but results in smaller universal solutions. We use the frugal chase to scale up query answering using views under LAV weakly acyclic target constraints, a useful language capturing RDF/S. The latter problem can be reduced to query rewriting using views without constraints by chasing the source-to-target mappings with the target constraints. We construct a compact graph-based representation of the mappings and the constraints and develop an efficient algorithm to run the frugal chase on this representation. We show experimentally that our approach scales to large problems, speeding up the compilation of the dependencies into the mappings by close to 2 and 3 orders of magnitude, compared to the standard and the core chase, respectively. Compared to the standard chase, we improve online query rewriting time by a factor of 3, while producing equivalent, but smaller, rewritings of the original query.

In data exchange, the main technique to reason with constraints is the chase [23], a form of forward chaining that "fetches" data from the sources through the mappings to a target database, and also "completes" this database w.r.t. the target constraints. One can then disregard the constraints and do query answering over the completed database. The frugal chase is an optimization of the chase algorithm usable in data exchange with GLAV mappings and standard-chase-terminating (e.g., weakly-acyclic) target TGDs.

In virtual integration (which is achieved through query rewriting), the initial focus has been on settings without target constraints, i.e., Σ_t = ∅, and mappings with only one source predicate in the rule antecedent, called Local-as-View (LAV) mappings [88, 53]. LAV mappings expose the core challenges in query rewriting, since they contain joins of existential variables. For virtual integration with target constraints, Afrati and Kiourtis [4] used the chase algorithm in a novel way by "compiling" the target constraints (specifically, LAV weakly-acyclic TGDs) into the LAV mappings to reduce the problem to view-based query rewriting without constraints.
In this chapter, we present an algorithm for query rewriting under LAV weakly-acyclic target TGDs, building on [4], Chapter 3, and our optimized chase. This type of constraint is first-order rewritable, includes practically interesting languages like RDF/S, and has good computational properties in data integration, data exchange [64, 4], and in inconsistent and incomplete databases [6]. In particular, this chapter presents two main contributions:

1. The frugal chase. We develop a novel, optimized version of the standard chase, usable in data integration, data exchange, or incomplete database settings, with GLAV mappings and GLAV target constraints. Instead of adding the entire consequent to a solution when a chase rule is applicable, as in the standard chase, the frugal chase avoids adding provably redundant atoms (Sect. 5.2). We prove that the frugal chase results in equivalent, yet smaller in size (number of tuples), universal solutions with the same data complexity as the standard chase. We also present a procedural version of the frugal chase (Sect. 5.3), and a compact version of the latter adapted to our GQR rewriting algorithm (Sect. 5.4).

2. A scalable conjunctive query rewriting algorithm for LAV mappings under weakly-acyclic LAV target constraints. This algorithm uses the compact frugal chase to efficiently compile the constraints into the mappings (à la [4]), and then efficiently do query rewriting (using GQR). Our compact frugal chase identifies common patterns across the mappings and "factorizes" them using a graph representation. This allows us to trigger a constraint for multiple views at the same time, and to add consequent predicates across multiple views at once. Identifying and indexing common patterns and chasing the mappings can be performed offline in a precompilation phase, independently of user queries, thereby speeding up the system's online query performance. Our compact graph representation of the mappings is particularly tailored for GQR, optimizing our solution even more. Our algorithm experimentally performs about 2 orders of magnitude faster than running the standard chase on each view individually and then applying query rewriting using views. Our approach scales to larger numbers of constraints and views and produces smaller, but equivalent, UCQ rewritings (containing fewer and shorter conjunctive queries) that compute the certain answers. For our experimental setting, the size of the frugal-chased mappings is very close to the core [56, 47] (i.e., the globally minimized chased mappings). Nevertheless, our compact algorithm achieves this output almost 3 orders of magnitude faster than the core chase [47], since we do not rely on minimization.

5.1 Chase in Virtual Data Integration

In order to give a context for our chase algorithm, we apply it to reduce the query rewriting with constraints problem to query rewriting without target constraints, and then subsequently produce maximally-contained rewritings that compute the certain answers. We are interested in first-order rewritings (i.e., expressible in SQL), since we want to develop a practical solution that leverages scalable relational technology. When considering target constraints in data integration, maximally-contained first-order rewritings do not always exist. As in the previous chapter, we focus on weakly-acyclic LAV constraints, which are first-order rewritable [64, 4]. We discuss other cases of first-order rewritable constraints in Section 5.6.
A maximally-contained UCQ rewriting under LAV wa-TGDs, which computes the certain answers, can be obtained by first chasing the views with the target constraints, and then applying a query answering using views algorithm (without constraints) [4].

Theorem 3 (Chasing the Views [4]). Given a query Q on a schema R, a set of LAV schema mappings V = {V_1, ..., V_n} on R, a source instance I under V and a set of weakly-acyclic LAV TGD constraints Σ on R, the set of certain answers certain(Q, I) is computed by the UCQ maximally-contained rewriting of Q using {V′_1, ..., V′_n}, where each V′_i ∈ {V′_1, ..., V′_n} is produced by running the standard chase on the consequent of V_i using Σ.

Consider the following LAV rules describing sources S_1-S_4 of medical data. S_1 contains physicians that treat patients with a chronic disease. S_2 records the physician responsible for discharging a patient from a clinic. S_3 is the same as S_1 but physicians are typed as Doctors. S_4 provides Surgeons.

S_1(d, s) → TreatsPatient(d, p), HasChronicDisease(p, s)
S_2(d, p, c) → DischargesPatientFromClinic(d, p, c)
S_3(d, s) → TreatsPatient(d, p), HasChronicDisease(p, s), Doctor(d)
S_4(d) → Surgeon(d)

The maximally-contained (with no constraints) rewriting of q (presented in the beginning of this section) is: q′(d) ← S_3(d, s), S_2(d, z, c). Now, consider the following (RDF/S) constraints that capture "domain" and "range" properties, and "subclass" relations. Constraint c_1 states that the domain of TreatsPatient is Doctor and the range is Patient. Constraint c_2 states that Surgeons are Doctors (as for queries, our notation for constraints and views omits quantifiers).

c_1: TreatsPatient(x, y) → Doctor(x), Patient(y)
c_2: Surgeon(x) → Doctor(x)

Theorem 3 guarantees that we can answer query q, using sources S_1-S_4, by first chasing the consequents of the views and then looking for maximally-contained rewritings of the query using the chased views. Running the chase on S_1-S_4 using c_1 and c_2 yields:

S′_1(d, s) → TreatsPatient(d, p), HasChronicDisease(p, s), Doctor(d), Patient(p)
S′_2(d, p, c) → DischargesPatientFromClinic(d, p, c)
S′_3(d, s) → TreatsPatient(d, p), HasChronicDisease(p, s), Doctor(d), Patient(p)
S′_4(d) → Surgeon(d), Doctor(d)

The maximally-contained rewriting of q using S′_1-S′_4 is the UCQ: q′(d) ← (S_1(d, x), S_2(d, y, z)) ∨ (S_3(d, u), S_2(d, v, w)) ∨ (S_4(d), S_2(d, s, t)).

This approach was employed in [4] by running the standard chase on the views and using the Minicon algorithm [112] for query rewriting. In this work, we chase the views using our optimized frugal chase (Sect. 5.2). Moreover, we develop a graph-based compact version of the frugal chase (Sect. 5.4) optimized for running on multiple views simultaneously, tailored to be input directly into GQR for fast query rewriting. Running the frugal chase rather than the standard one produces shorter (but equivalent) mappings (in number of predicates/joins), which in turn produce fewer and shorter (but equivalent) conjunctive queries in the final UCQ rewriting. As an example of our approach, consider the mapping S_3 and constraint c_3 below, which states that an individual with a chronic disease must be a patient treated by a doctor:

c_3: HasChronicDisease(pat, dis) → TreatsPatient(doc, pat), Doctor(doc), Patient(pat)

Since there is no homomorphism that maps the consequent of the rule to the consequent of the view, the standard chase produces:

S″_3(d, s) →
TreatsPatient(d, p), HasChronicDisease(p, s), Doctor(d), TreatsPatient(d2, p), Doctor(d2), Patient(p)

Our algorithm produces a shorter, yet equivalent, mapping:

S‴_3(d, s) → TreatsPatient(d, p), HasChronicDisease(p, s), Doctor(d), Patient(p)

Consider query q_2(s) ← TreatsPatient(d, p), HasChronicDisease(p, s). Using S″_3 this query gives the UCQ rewriting: q′_2(s) ← S_3(d, s) ∨ S_3(d2, s). One of the two elements of q′_2 is redundant. Minimizing the output of a query rewriting algorithm is an orthogonal NP-hard problem [88]. Query rewriting using the frugal-chased mapping S‴_3 avoids this redundancy, without running minimization, leading to the smaller (and faster to evaluate) rewriting: q″_2(s) ← S_3(d, s).

5.2 The Frugal Chase

Consider the simple data exchange scenario of Fig. 5.1 with a source described by S_3 and a target constraint c_3 (cf. Sect. 4.1). Existential variables in the constraint introduce additional labeled null tuples which are redundant (nulls do not participate in the certain answers of a query). The bottom 3 rows of TreatsPatient and Doctor can be removed. Our frugal chase avoids adding such redundant facts.

[Figure 5.1: Redundancy in the Chase.]

Before we define the frugal chase, let us introduce some useful notions. Let B be a database instance (or any other conjunction of atoms, such as the consequent of a TGD). The Gaifman graph of nulls [56] is an undirected graph whose nodes are the existential variables (labeled nulls) of B; edges between nodes exist if the variables co-occur in an atom of B. Dually, the Gaifman graph of facts of B, denoted grf(B), is the graph whose nodes are the atoms of B, with two nodes connected if they share existential variables/labeled nulls. Note that parts of B that are connected only through constants (as well as distinguished variables for TGDs) constitute different connected components of the Gaifman graph of facts. For G_i a connected component of grf(B), V(G_i) is the set of all the facts (i.e., the nodes) in G_i. For any conjunction of atoms B we denote the decomposition of B into (the facts of) its connected components {G_1, ..., G_n} as the set of sets (of facts) dec(B) = {V(G_1), ..., V(G_n)}. Fig. 5.2 shows in dotted circles the different connected components of TGD c and instance B.

A constraint can be decomposable to an equivalent set of "simpler" constraints, each with a different element of its decomposition as consequent. For example, c_3 in Fig. 5.1 can be broken down into two constraints with consequents {TreatsPatient(d,p), Doctor(d)} and {Patient(p)} respectively. For Fig. 5.1, applying the standard chase using the new set of constraints would also avoid the redundancies presented. Nevertheless, our frugal chase produces smaller chase results, even with non-decomposable constraints.
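To make the decomposition dec(B) concrete, here is a minimal Java sketch (all type and method names are ours, not our actual implementation) that groups the atoms of an instance into the connected components of its Gaifman graph of facts using union-find. Only shared labeled nulls induce edges, so atoms connected only through constants correctly land in different components.

    import java.util.*;

    // Sketch: computing dec(B), the connected components of the Gaifman
    // graph of facts grf(B). Two atoms are connected iff they share an
    // existential variable / labeled null; constants induce no edges.
    public class GaifmanDecomposition {
        record Atom(String predicate, List<String> args, Set<String> nulls) {}

        static Collection<List<Atom>> dec(List<Atom> instance) {
            Map<String, Integer> firstHolder = new HashMap<>(); // null -> first atom seen
            int[] parent = new int[instance.size()];
            for (int i = 0; i < parent.length; i++) parent[i] = i;
            for (int i = 0; i < instance.size(); i++)
                for (String n : instance.get(i).nulls()) {
                    Integer j = firstHolder.putIfAbsent(n, i);
                    if (j != null) union(parent, i, j); // shared null: same component
                }
            Map<Integer, List<Atom>> comps = new LinkedHashMap<>();
            for (int i = 0; i < instance.size(); i++)
                comps.computeIfAbsent(find(parent, i), k -> new ArrayList<>())
                     .add(instance.get(i));
            return comps.values();
        }

        static int find(int[] p, int i) { return p[i] == i ? i : (p[i] = find(p, p[i])); }
        static void union(int[] p, int i, int j) { p[find(p, i)] = find(p, j); }
    }

Running dec on the instance of Fig. 5.1, for example, would separate the facts reachable from each other through nulls from those linked only by the constants of the source tuples.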
Informally, a set of predicates is partially satisfiable, and not added during the frugal chase application, in two cases: if it is a connected component of the constraint and is mapped to the instance as a whole, or if its image on the database instance is a complete connected component of the instance. Any union of such partially satisfiable sets is a partially satisfiable set, as long as the individual satisfying mappings agree on their common arguments.

Definition 14 (Partially Satisfiable Set). Let σ be a TGD constraint ∀x̄, ȳ, φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄) and B an instance s.t. there is an antecedent homomorphism h that maps φ(x̄, ȳ) to B. A set of atoms S ⊆ ψ is partially satisfiable for h if there exists an extension of h, h′, called a satisfying homomorphism for S, s.t. h′(S) ⊆ B and for each S_i ∈ dec(S), either:
1. (a) for all existential variables z̄_i in S_i, h′(z_i) is an existential variable (or labeled null) in B, and (b) h′(S_i) ∈ dec(B), i.e., the image of S_i is an entire connected component of B; or
2. S_i ∈ dec(ψ), i.e., the mapped set S_i is actually an entire connected component of the constraint consequent.

We illustrate Def. 14 with the example in Fig. 5.2, which shows how the frugal chase applies constraint c on instance B (Φ is a conjunction of atoms, c and d are constants, x_i and z_i are variables, n_i are labeled nulls, and the P_i are relations). Consider whether the set of predicates S = {P_1(x_1, z_1), P_2(z_1, z_2), P_4(x_1, z_3), P_5(z_3, z_4)}, a subset of the consequent of constraint c, is partially satisfiable. The decomposition dec(S) has two elements (connected components): S_1 = {P_1(x_1, z_1), P_2(z_1, z_2)} and S_2 = {P_4(x_1, z_3), P_5(z_3, z_4)}. Per Def. 14, for our S there exists a homomorphism h′ which extends the one from the antecedent (h), such that h′(S) ⊆ B. Moreover, S_1 is an entire connected component of the constraint (falling in case 2 of Def. 14), while for S_2 all its existential variables are mapped through h′ (and in particular g_2) to existential variables (case 1(a) of Def. 14), and h′(S_2) is a connected component of B (case 1(b) of Def. 14). So S is partially satisfiable.

Note that P_6(x_2, z_4) in the constraint could be partially satisfiable by mapping it to P_6(d, n_5) in B; however, not at the same time as {P_4(x_1, z_3), P_5(z_3, z_4)}, which maps z_4 to a different variable than n_5 (according to Def. 14 there needs to be a single satisfying homomorphism h′ which maps all elements of a partially satisfiable set). Hence, the set of partially satisfiable predicates of a constraint might not be unique; in Fig. 5.2 there are two equivalent alternative frugal chases, given by h′ and h″. Function f in h″ maps variables that are not in the domain of the satisfying homomorphism to fresh names. This non-uniqueness does not cause a problem for the end chase result; as we prove later, all frugal chase results are universal solutions. Currently our algorithms choose one arbitrary partially satisfiable set (our implementation chooses the first it discovers). Nevertheless, one could develop heuristics to materialize the preferred alternative, e.g., the shorter chase, or one depending on the application.

The frugal chase is applicable whenever not all atoms of the consequent are in a partially satisfiable set; by Lemma 4 this is equivalent to standard chase applicability (proof omitted due to space).

Lemma 4.
Let σ be a TGD constraint ∀x̄, ȳ, φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄) and B an instance. σ is applicable to B, per the standard chase, iff there exists an antecedent homomorphism h from φ(x̄, ȳ) to B and there exists one consequent predicate P_σ(x̄_P, z̄_P) ∈ ψ(x̄, z̄) (with x̄_P ⊆ x̄, z̄_P ⊆ z̄) which is not partially satisfiable for h.

[Figure 5.2: Frugal Chase: Partially Satisfiable Set.]

The application step of the frugal chase rule adds all such non-partially-satisfiable atoms of the consequent to our database instance. Formally, for S a partially satisfied set of predicates of the consequent ψ with satisfying homomorphism h′, ψ′(x̄, z̄) = ψ(x̄, z̄) \ S is the set of all non-partially-satisfiable atoms w.r.t. S. The frugal chase step adds H(ψ′(x̄, z̄)) to our database, where the applicable homomorphism, H, is an extension of h′, mapping variables not appearing in h′ to new "fresh" labeled nulls. In the example of Fig. 5.2, if we choose to satisfy {P_4(x_1, z_3), P_5(z_3, z_4)}, then H = h′ and only the two non-partially-satisfiable atoms P_3(n_3, n_4) and P_6(d, n_5) are added to our database (compared to the entire consequent of c that the standard chase would add). The resulting instance after the frugal chase step is equivalent (but smaller in size) to the one produced by the standard chase.

Theorem 5. For all instances B, and sets of TGDs Σ, the frugal chase step is applicable to B with antecedent homomorphism h iff the standard chase step is applicable with h. Moreover, if B′ is the instance after the application of the standard chase step for h and σ ∈ Σ, and B″ is the instance after the application of the frugal chase step for h and σ, then B′ and B″ are homomorphically equivalent, and they both satisfy σ (for h).

Proof. We prove the theorem for Def. 15, since it is equivalent to Def. 14 by Th. 7. The frugal chase step is applicable iff the standard chase step is, by Lemma 4. We first show that there exist homomorphisms that map B′ to B″ and vice versa, assuming σ is applicable; this means there is a homomorphism h from the antecedent of σ, say φ(x̄, ȳ), to B s.t. it cannot be extended over the consequent, ψ(x̄, z̄). We know by Lemma 4 that there is a conjunct with at least one atom in σ, ψ_2(x̄, z̄), that is not partially satisfiable for h. If ψ_2 is the entire consequent of the constraint, then the applicable homomorphism by definition will be the same as the standard chase application homomorphism (modulo names of fresh variables). This means that B′ and B″ are homomorphically equivalent (in fact, they are isomorphic, since there is no essential optimization to the result of the frugal chase step).
We now examine the case where a part of the constraint, say ψ_1, is partially satisfiable, while another part ψ_2 is not. Conjunct ψ_1 can have three categories of variables/constants: (1) Those that are antecedent variables or constants in the constraint, i.e., belong to x̄ (we abuse notation and regard constants as being in x̄). (2) Those existential variables, ē_ψ2, that (a) fall under case 2 of our definition, i.e., they map to constants/distinguished variables or to existential variables that join with a part of the database which is not a partially satisfiable image, together with the variables that (b) belong to the same Gaifman connected component as the variables in (a). We denote by ē_ψ2B the images of ē_ψ2 in the database. All atoms containing ē_ψ2 in the constraint should be partially satisfiable, hence the variables of ē_ψ2 exist only in the conjunction ψ_1 in the constraint. (3) Those existential variables that map to variables that exist only in images of partially satisfiable atoms and are not in ē_ψ2, i.e., they are not contained in any "case 2 predicate" above; we denote these variables in the constraint as ē_ψ1. We denote by ē_ψ1B the images of ē_ψ1 in the database. All atoms containing ē_ψ1B are images of partially satisfiable atoms, hence ē_ψ1B exists only in the image of ψ_1 in the database.

Hence, our constraint has the form σ: φ(x̄, ȳ) → ψ_1(x̄, ē_ψ1, ē_ψ2), ψ_2(x̄, ē_ψ1, r̄_c), with r̄_c the rest of the variables, belonging solely to ψ_2. In general, our database instance has the form (reusing φ and ψ_1 as the images of the antecedent and the partially satisfiable conjunction, respectively): {φ(x̄_B, ȳ_B), ψ_1(x̄_B, ē_ψ1B, ē_ψ2B), ψ_3(x̄_B, ē_ψ2B, r̄_B, ȳ_B)}, with ψ_3 denoting the rest of the atoms in the database (if any). Terms in x̄_B and ȳ_B are the images of x̄ and ȳ respectively, and r̄_B contains the rest of the terms in the database. Moreover, by definition there is no constraint predicate that contains variables from both ē_ψ1 and ē_ψ2. Hence, we can "break" ψ_1 in the constraint into two parts, one containing ē_ψ1 and one containing ē_ψ2 (one of the two parts might be empty). Our constraint can now be written as σ: φ(x̄, ȳ) → ψ_11(x̄, ē_ψ1), ψ_12(x̄, ē_ψ2), ψ_2(x̄, ē_ψ1, r̄_c). Similarly, our instance becomes: {φ(x̄_B, ȳ_B), ψ_11(x̄_B, ē_ψ1B), ψ_12(x̄_B, ē_ψ2B), ψ_3(x̄_B, ē_ψ2B, r̄_B, ȳ_B)}.

The standard chase run on this database with σ will produce the instance: {φ(x̄_B, ȳ_B), ψ_11(x̄_B, ē_ψ1B), ψ_12(x̄_B, ē_ψ2B), ψ_3(x̄_B, ē_ψ2B, r̄_B, ȳ_B), ψ_11(x̄_B, ē_ψ1), ψ_12(x̄_B, ē_ψ2), ψ_2(x̄_B, ē_ψ1, r̄_c)}. The frugal chase will produce {φ(x̄_B, ȳ_B), ψ_11(x̄_B, ē_ψ1B), ψ_12(x̄_B, ē_ψ2B), ψ_3(x̄_B, ē_ψ2B, r̄_B, ȳ_B), ψ_2(x̄_B, ē_ψ1B, r̄_c)}. To verify that the two instances are equivalent, the only non-obvious fact is that ē_ψ1B can map to ē_ψ1. Since case 1 of Def. 15 states that, in the same positions in which two atoms in the database contain h′(z), the corresponding atoms in the constraint contain z, it means that the restriction of h′ on these variables is one-to-one, and its inverse can map h′(z) to z. Hence, the two instances are homomorphically equivalent.
Moreover, if the databases are views, the part containing distinguished variables (ψ_12) maps to itself and the homomorphisms that prove equivalence are containment mappings. Lastly, the fact that B″ satisfies the constraint is directly proven by using the applicable homomorphism that we used to construct B″.

As in the standard chase, the frugal chase is an exhaustive series of frugal chase application steps. Since the output of each frugal chase step might be smaller, this might also lead to a smaller number of chase steps overall (since redundant predicates triggering the application of subsequent constraints might not appear at all during the frugal chase). This optimizes our proposed solution even more.

Theorem 6. For all instances B, and sets of TGDs Σ, the frugal chase terminates for all instances and constraints for which the standard chase terminates, producing a universal solution.

Proof. Let c_1, c_2, ... be an ordering of constraints (possibly repeating, with different antecedent homomorphisms) in the frugal chase application. Let fchase_ci(B) and chase_ci(B) be the results of the frugal and the standard chase, resp., on B with c_i. fchase_c1(B) and chase_c1(B) are homomorphically equivalent, by Th. 5. Let g_1 be the homomorphism from chase_c1(B) to fchase_c1(B). Let a constraint c_j be satisfied, per the standard chase, on chase_c1(B), with satisfying homomorphism g_2. Composing g_2 with g_1 proves that the constraint is also satisfied on fchase_c1(B). Hence, fchase_c1(B) triggers at most as many constraints as chase_c1(B) does. Also, assuming c_2 is applicable to fchase_c1(B), it is also applicable to chase_c1(B), with the same antecedent homomorphism. Moreover, after the application of c_2, fchase_c2(fchase_c1(B)) is homomorphically equivalent to chase_c2(chase_c1(B)). Hence, inductively, if the standard chase terminates on B with Σ so does the frugal chase, with fchase_Σ(B) satisfying all constraints and being a solution. In order to show that it is universal, i.e., has homomorphisms to all other solutions, we note that it has a homomorphism to chase_Σ(B), which in turn has homomorphisms to all other solutions.

The intuition behind our frugal chase is taking care that, when adding a non-partially-satisfiable predicate, we do not introduce a join between this predicate and some relation, or some constant, in B which is not in the constraint (such a join would not be introduced by the standard chase). This case is avoided when satisfying an entire connected component of the constraint, since there is nothing else (existentially) joining with that component in the consequent to be added (e.g., predicates P_1 and P_2 of c in Fig. 5.2). If our set of satisfied predicates (e.g., P_4 and P_5 of c in Fig. 5.2) is not an entire connected component of the consequent, its image has to be; otherwise whatever these predicates join with in the constraint (which we will add to our instance) will "accidentally" join with whatever joins with their image. For example, consider:

c_4: P(x), R(x, z) → P_1(x, y, z), P_2(y, w), P_4(y)
B = {P(d), R(d, c), P_1(d, y_1, c), P_2(y_1, w_1), P_3(y_1)}

Running the standard chase with c_4 on B produces:

B′ = {P(d), R(d, c), P_1(d, y_1, c), P_2(y_1, w_1), P_3(y_1), P_1(d, y, z), P_2(y, w), P_4(y)}

Had we assumed partial satisfiability for both P_1, P_2 we would get:

B″ = {P(d), R(d, c), P_1(d, y_1, c), P_2(y_1, w_1), P_3(y_1), P_4(y_1)}.
B″ and B′ are not equivalent, since there is no homomorphism from B″ to B′: the join of P_3-P_4 in B″ cannot be preserved. If P_3 were missing from B (we would fall in case 1: all joins of y_1 in B would be with images of partially satisfiable predicates) or if P_4 were missing from c_4 (we would fall in case 2: the entire connected component in the constraint is partially satisfiable), then B′ and B″ would end up either both missing P_3 or both missing P_4, in which case they would be homomorphically equivalent.

Complexity. The problem of deciding whether the standard chase is applicable on an instance is polynomial in data complexity [55]. The difference in our case is essentially case 1(b) of Def. 14, which introduces a polynomial traversal of the corresponding connected component of the database, for all S_i. So the frugal chase remains polynomial in data complexity. Notice that our definitions care for only one (arbitrarily chosen) partially satisfiable set. Potentially, we could exhaustively examine all subsets of a constraint's consequent, still remaining polynomial in data complexity. Moreover, as our experiments attest for LAV constraints, checking partial satisfiability and running the frugal chase is faster than the standard chase in practice.

5.3 Procedural Frugal Chase

In preparation to use the frugal chase in query rewriting using GQR, we present an alternative definition that examines each atom separately and decides whether it is partially satisfiable.

Definition 15. Let σ be a TGD constraint ∀x̄, ȳ, φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄), and B an instance s.t. there exists a homomorphism h: φ(x̄, ȳ) → B. For all atoms P(x̄_P, z̄_P) ∈ ψ(x̄, z̄), P(x̄_P, z̄_P) is partially satisfiable for h if there exists an extension of h, h′, called a satisfying homomorphism for P, s.t. h′(P) ⊆ B and for all z ∈ z̄_P:
1. if h′(z) is an existential variable (labeled null for instances) then for every atom R_B in B that contains h′(z), there is an atom R_C in the constraint that contains z, in the same argument positions, s.t.: (a) R_C is partially satisfiable for h′ (its satisfying homomorphism is an extension of h′), (b) R_B is the image of R_C through the satisfying homomorphism for R_C, and (c) for all R(x̄, z̄) ∈ ψ(x̄, z̄) that contain z, if R is partially satisfiable for h, it is also partially satisfiable for h′ (which, recursively, means it uses an extension of h′);
2. if h′(z) is (a) a constant (for instances or mappings), or (b) a distinguished variable (for mappings), or (c) an existential variable (labeled null for instances) which does not fall into case (1) above (i.e., it joins with at least one atom which is not the image of a partially satisfiable atom that joins with P on z in the same argument positions), then all atoms in the connected component of grf(ψ) that P is in are partially satisfiable for h′ (which means they use extensions of h′).

Theorem 7. A set S of atoms is partially satisfiable per Def. 14 iff every atom in S is partially satisfiable per Def. 15.

Proof. It is not hard to see that cases 1(a), 1(b) and 2 of Def. 15 correspond to the same cases of Def. 14. However, Def. 15 considers "atomic" satisfying homomorphisms over single atoms. We need to make sure that these homomorphisms can be unified to construct one from the entire set of partially satisfiable atoms (i.e., h′ of Def. 14). Since every atomic satisfying homomorphism in Def.
15 extends h (which maps the antecedent variables to B), we really need to examine only existential variables. Notice that all partially satisfiable atoms (per Def. 15) that share existential variables need to fall into the same case of Def. 15. For partially satisfiable atoms in the constraint that share existential variables and fall in case 2 of Def. 15, their satisfying homomorphisms agree and are essentially the same, since the predicates belong to the same connected component of grf(ψ). For partially satisfiable atoms that share existential variables and fall in case 1 of Def. 15, case 1(c) takes care that their homomorphisms agree on their common values.

With respect to Def. 15, we define an applicable homomorphism, the frugal chase step and the frugal chase similarly to Def. 14. To illustrate Def. 15, we run the frugal chase on a view (instead of an instance). For views, distinguished variables are interpreted as constants in our homomorphisms, for both the standard and the frugal chase. Consider the following non-decomposable constraint:

c_5: P(x, v), R(v, t) → P_1(x, y, z), P_2(y, w)

and view S_5: S_5(x_1) → P(x_1, v_1), R(v_1, t_1), P_1(x_1, y_1, z_1), P_2(x_1, w_1).

The standard chase algorithm run with c_5 on S_5 will produce:

S′_5(x_1) → P(x_1, v_1), R(v_1, t_1), P_1(x_1, y_1, z_1), P_2(x_1, w_1), P_1(x_1, y, z), P_2(y, w)

Nevertheless, a shorter, equivalent mapping per our chase is:

S″_5(x_1) → P(x_1, v_1), R(v_1, t_1), P_1(x_1, y_1, z_1), P_2(x_1, w_1), P_2(y_1, w)

When considering whether P_1(x, y, z) in c_5 is partially satisfiable, we notice that y and z can only map to don't-care variables in S_5, namely y_1 and z_1 respectively (hence case 1(b) of Def. 15 trivially applies). Case 1(c) of Def. 15 dictates that we can partially satisfy P_1(x, y, z) as long as y and z are not being mapped to different variables than y_1 and z_1 in some other atom's satisfying homomorphism. This means that P_2(y, w) in c_5 cannot be partially satisfiable when P_1(x, y, z) is, since in order to partially satisfy P_2(y, w) we would need to map y to a different variable, namely x_1. In fact, if we examine P_2(y, w) for partial satisfaction before we examine P_1(x, y, z), we find another reason for which P_2(y, w) cannot be partially satisfiable: variable y can only map to the distinguished variable x_1 in S_5 and, per case 2, we check whether this mapping can be extended to cover the atoms of c_5 joining (directly or indirectly) with the existential variables of P_2(y, w), in effect P_1(x, y, z). Such an extension cannot happen, as for P_1 in c_5, y has to map to y_1 in S_5 rather than x_1. Hence, P_2(y, w) in c_5 cannot be partially satisfiable for S_5; only P_1(x, y, z) can, and P_2 needs to be added explicitly when applying the rule.
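The easiest instance of these definitions can be illustrated in code. The sketch below (hypothetical types and names, not our actual implementation) tests only the simple case where a consequent atom shares no existential variable with any other consequent atom, so that it forms a singleton connected component of grf(ψ) and case 2 of Def. 14 applies; atoms failing this conservative test would simply be added explicitly, as the standard chase would do, so the result is a correct (if less trimmed) chase.

    import java.util.*;

    // Sketch of the easiest partial-satisfiability test: a consequent atom
    // whose existential variables occur in no other consequent atom is its
    // own connected component of grf(psi), so by case 2 of Def. 14 it is
    // partially satisfiable exactly when the antecedent homomorphism h
    // extends to map it into the view body.
    public class EasyPartialSatisfiability {
        record Atom(String pred, List<String> args) {}

        static boolean singletonComponentSatisfiable(Atom c, List<Atom> otherConsequents,
                                                     Set<String> existentials,
                                                     Map<String, String> h,
                                                     List<Atom> viewBody) {
            // The atom must be its own connected component of the consequent.
            for (String v : c.args())
                if (existentials.contains(v))
                    for (Atom other : otherConsequents)
                        if (other.args().contains(v)) return false;
            // Look for an image in the view consistent with an extension of h.
            candidates:
            for (Atom img : viewBody) {
                if (!img.pred().equals(c.pred())) continue;
                Map<String, String> ext = new HashMap<>(h);
                for (int i = 0; i < c.args().size(); i++) {
                    String prev = ext.putIfAbsent(c.args().get(i), img.args().get(i));
                    if (prev != null && !prev.equals(img.args().get(i))) continue candidates;
                }
                return true; // h extends over c; nothing needs to be added for it
            }
            return false;
        }
    }

A full implementation must additionally handle the recursive interplay of cases 1(a)-(c) and 2 of Def. 15, which is what our graph-based algorithms in Sect. 5.4 do.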
& '" #$ %" $ !" !" !" !" !" !" !" )*( " !" %" & !" & '" -./0/123"&45.67"$89 " *419:.2/1:"* !" *;297<"&45.67"$89 " =<> " =7> " =;> " =?> " =0> " & '" $ " )*( !" #$ %"" & !" )*( !" #$ %"" !" Figure 5.3: (a) Query q. (b) Sources S 1 -S 3 as graphs. (c) Joinbox for a variable node. (d) View PJs for S 1 -S 3 with their joinboxes. (e) Constraint c 1 as a graph. (f ) PJs resulting from chasing PJ TreatsPatient (TP) with the constraint c1. (g) Merging existing and inferred PJs. (h) Chased view PJs. The frugal chase is applicable to all cases of TGD languages in which the standard chase applies. It yields smaller universal solutions in data exchange, smaller chased mappings in data integration (which lead to smaller rewritings using these mappings), smaller database instances in incomplete databases, and smaller chased queries for query containment (ucq containment under constraints can sometimes be done by chasing one ucq and relying on classic containment [72, 80]). However, our chase does not produce a minimal solution, i.e., the core [56]. Using the frugal chase, one could explore all combinations of partially satisfiable atoms and keep the maximum set, but even then, pre-existing redundancies in the instance or the constraints are not accounted for. 5.4 Compact Frugal Chase For Query Rewriting Under Constraints In this section, we present our approach to compute the maximally-contained UCQ rewriting of a target query under target LAV weakly-acyclic dependencies and LAV views (i.e., the perfect rewriting [16]), using the frugal chase. First, we show (Th. 8) that our algorithm computes the certain answers (this is the equivalent of Th. 3 [4] for the frugal chase). Second, we describe a graph representation for queries, mappings (views), and now constraints, that we introduced in GQR (see Chapter 3). This representation allows us to compactly represent common subexpressions in the views and constraints, and efficiently reason with them. Third, we briefly describe the GQR query rewriting approach. Finally, we describe a compact graph- based version of the frugal chase that radically improves the compilation of the constraints into the (graph representation of) the views. Theorem 8. Given a query Q on schema T , a set of LAV schema mappingsV Æ {V 1 ,...,V n }, a source instance I underV , and a set of LAV weakly-acyclic TGD constraints§, the set of certain answers cer t ai n(Q, I ) is computed by the UCQ maximally-contained rewriting of Q using {V 1 0 ,...,V n 0 }, where each V i 0 is produced by running the frugal chase on V i using§. 82 5.4. Compact Frugal Chase For Query Rewriting Under Constraints Proof. Consider the set of viewsV 00 Æ {V 00 1 ,...,V 00 n }, which are taken by running the standard chase on the consequent of V i using§. We know by Th. 3 that a maximally-contained rewriting of Q usingV 00 (denoted MC R(Q,V 00 )) produces the certain answers of the query. The view sets V 00 andV 0 are equivalent since for every view in one set there is an equivalent one in the other, by Th. 6. Two equivalent sets of views produce equivalent maximally contained rewritings, hence in our case MC R(Q,V 00 )´ MC R(Q,V 0 ). This means that for each conjunctive query r 00 2 MC R(Q,V 00 ) there is a r 0 2 MC R(Q,V 0 ), s.t. r 00 µ r 0 . Let a tuple t2 cer t ai n(Q, I ) and so t2 MC R(Q,V 00 )(I ) and in particular t2 r 00 (I ). Then t2 r 0 (I ) and hence t2 MC R(Q,V 0 ). So every certain answer is computed by MC R(Q,V 0 ). 
Symmetrically, for any tuple t we obtain from MCR(Q, V′), the same reasoning shows that t ∈ MCR(Q, V″)(I), and hence it is a certain answer.

5.4.1 Graph Modeling

We represent queries, mappings, and constraints as graphs (extending the representation of GQR to constraints; see Sections 3.2 and 3.6.3). Figs. 5.3(a), (b) and (e) show the graph representation of a query q (q(d, c) ← D(d), DPFC(d, z, c)), LAV views S_1, S_2 and S_3, and constraint c_1 (cf. Sect. 4.1), respectively. Fig. 5.3(d) shows all PJs for sources S_1, S_2 and S_3. Fig. 5.3(c) shows the joinbox for the second variable of the TP (TreatsPatient) PJ, which records that this variable joins with the first argument of HCD in S_1 and in S_3. The 6 predicates in the sources of Fig. 5.3(b) correspond to only 4 PJs in Fig. 5.3(d). PJs can be constructed straightforwardly in polynomial time by scanning each LAV view and its joins, and hashing the different patterns encountered.

5.4.2 Graph-based Query Rewriting (GQR)

In its original version, GQR has a pre-processing/offline phase, where it extracts the PJs from the views and indexes them. At query time, GQR (see Chapter 3 for a detailed description) processes the user query one subgoal at a time. It retrieves the view PJs that match each query subgoal and incrementally combines the retrieved view PJs to form larger subgraphs that cover larger portions of the query, building the maximally-contained rewriting incrementally. For example, given the query in Fig. 5.3(a) and the PJs in Fig. 5.3(h), Fig. 5.4(a) shows GQR retrieving the PJs corresponding to query predicates D and DPFC, and combining them into a single graph (Fig. 5.4(c)). Since the combined graph covers the query, the process terminates and outputs the logical rewritings (Fig. 5.4(c)).

[Figure 5.4: Graph-based Query Rewriting.]

In this section, GQR takes as input not the original PJs from the sources, but the PJs chased with the target constraints using the compact frugal chase described next. As discussed, we chase the LAV wa-TGDs into the mappings and reduce the problem to query rewriting (without constraints) using the chased views. Fig. 5.3(h) shows the view PJs (resulting from the frugal chase) of views S′_1, S′_2 and S′_3 with constraint c_1 (from Section 4.1). The overall result is an efficient algorithm for query rewriting under constraints.

5.4.3 Compact frugal chase

This section presents our compact frugal chase implementation for chasing a set of LAV mappings using a set of LAV weakly acyclic dependencies. Instead of considering every view subgoal, our compact frugal chase considers the distinct patterns (PJs) that all views contain. We start by finding mappings of the antecedent to atomic PJs that repeat themselves across views and hence compactly represent pieces of multiple views (which imply the same constraint antecedent). Finding a homomorphism to such a pattern means triggering the constraint for all the views represented by this pattern. Then we recursively map all partially satisfiable subgoals of a rule's consequent to the compact graph representation of these atoms in the mappings.
When we apply the chase step we add compact consequent patterns (constraint PJs) that represent the addition of a predicate to multiple views simultaneously. We end up with a compact PJ representation of the chased mappings, shorter than the standard-chased ones, which leads to smaller yet equivalent rewritings of the input queries using these mappings. As noted, the output of our compact chase algorithm uses the same PJ representation as its input, and is particularly tailored for GQR, improving our solution even more. Our initial implementation does not include constants in the queries, the views or the constraints.

Alg. 12 executes all constraints repeatedly, until it converges and stops adding new predicates. It takes as input the view PJs, constructed and indexed, representing our original set of views, and the set of constraints. The output of the algorithm is the set of chased PJs. The algorithm first maps the antecedent predicate to a view PJ in order to trigger the constraint for all views in this view PJ. Then, it examines each consequent predicate, for all applicable views triggering the constraint; some existing view PJ (that matches a consequent PJ's pattern) might contain some of the relevant views and might prove the corresponding consequent predicate already partially satisfiable. For the applicable views that are left not satisfying a consequent predicate, we have to add a copy of the consequent PJ as a new view PJ to our index (our set of view PJs will now correspond to the longer, frugal-chased mappings). Using this control flow, Alg. 12 is our high-level algorithm and keeps track of the constraints, and the PJs that satisfy them, for specific views. To check if a specific view PJ is the image of a constraint PJ, for one of the views it contains, it calls upon Alg. 13, which in turn checks the cases of Def. 15 by calling Alg. 15; the latter iterates over connected components of the views or the constraint, respectively. Algorithm 15 calls Alg. 14 which, after examining whether a constraint and a view PJ have been processed before, calls Alg. 13 again. Hence, algorithms 13, 14 and 15 are mutually recursive.

In line 2, Alg. 12 iterates over all constraints, and in line 3 finds all the different view PJs that match the antecedent (denoted PJ^V_ant), for a specific constraint. For each PJ^V_ant, line 5 constructs new PJs for all constraint consequents. For each consequent PJ, "initialization" means the following (pseudo-code omitted due to space; a hedged sketch is given after this list):
1) Depending on the pattern of PJ^V_ant, the variable nodes in a consequent PJ become distinguished or existential (initially, constraint node types are undetermined). Fig. 5.3(f) shows that when we trigger constraint c_1 for the antecedent PJ for TP, the variables of the constraint (originally of unknown (?) type) take a specific type. We change to distinguished all constraint variable nodes that have been "unified" with distinguished view variables; the rest are existential.
2) For all views triggering the constraint (ViewSet), we include joinboxes in the consequent PJ. Such joinboxes include all joins to other consequents (which eventually become part of every view), as well as the joins inherited from any variable shared with PJ^V_ant.
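The following is a hedged Java sketch of this initialization; all type and field names are ours (hypothetical), and join descriptions are abstracted to plain strings rather than our actual joinbox structures.

    import java.util.*;

    // Sketch of consequent-PJ initialization: step 1 types the nodes from
    // the antecedent unification; step 2 seeds a joinbox per triggering
    // view with joins to sibling consequents and joins inherited from
    // variables shared with the antecedent PJ.
    public class ConsequentPJInit {
        enum NodeType { UNDETERMINED, DISTINGUISHED, EXISTENTIAL }

        static class Node {
            String var;
            NodeType type = NodeType.UNDETERMINED;
            Map<String, Set<String>> joinboxes = new HashMap<>(); // view -> join descriptions
        }

        static void initialize(List<Node> consequentNodes,
                               Set<String> unifiedWithDistinguished, // from the antecedent PJ
                               Set<String> viewSet,
                               Map<String, Set<String>> antecedentJoins,
                               Map<String, Set<String>> siblingConsequentJoins) {
            for (Node n : consequentNodes) {
                // Step 1: nodes unified with distinguished view variables
                // become distinguished; the rest become existential.
                n.type = unifiedWithDistinguished.contains(n.var)
                        ? NodeType.DISTINGUISHED : NodeType.EXISTENTIAL;
                // Step 2: seed one joinbox per triggering view.
                Set<String> joins = new HashSet<>();
                joins.addAll(siblingConsequentJoins.getOrDefault(n.var, Set.of()));
                joins.addAll(antecedentJoins.getOrDefault(n.var, Set.of()));
                for (String view : viewSet)
                    n.joinboxes.put(view, new HashSet<>(joins));
            }
        }
    }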
In effect, when we trigger constraint c_1 of Fig. 5.3(e) for the view PJ TP in Fig. 5.3(d), after line 5 we have constructed the two consequent PJs shown in Fig. 5.3(f), which capture the addition of predicates "P" and "D" in the views they contain. Subsequently we need to examine whether, for some of these views, there are pre-existing PJs satisfying the consequent PJ's predicates. To check this, for every consequent PJ, PJ^C_ci, we query our index for view PJs that capture its pattern (it could be a "more" general pattern, i.e., one that has distinguished variables in places where the consequent PJ has existential ones) (lines 6-7). Retrieving such a view PJ, we need to examine it for all views common to the "triggering" views (line 8). The main method for checking whether the predicate of PJ^C_ci is partially satisfied by a relevant view s in a view PJ, PJ^V_ci, is isPJSatisfiable (line 16), which is a recursive implementation of Def. 15. Due to its recursive nature, some previous calls of isPJSatisfiable might have traversed the connected component of the constraint and might have already memoized the specific combination of PJs and view currently in question as "satisfiable" or "unsatisfiable". If we find this combination satisfiable, we avoid calling the method again (line 9), and we call updateJoinboxes (line 10), which updates several join pointers on our view and constraint PJs. It first transfers the joinbox information for the specific view s from the consequent to the view PJ. Fig. 5.3(g) shows that the pre-existing PJ for Doctor in Fig. 5.3(d) already satisfies the Doctor predicate for S_3, and hence S_3 will be deleted from the consequent PJ of Fig. 5.3(f). (An additional optimization of our algorithm, shown in the figure, is that if the Doctor PJ of Fig. 5.3(f) remains unsatisfiable until the end for some views (e.g., S_1), instead of adding it to our index as an additional PJ we can merge it with an existing one (Fig. 5.3(g)).) Moreover, updateJoinboxes updates the joinboxes of all other consequent PJs, s.t. if they are not satisfied and end up as new PJs in our index, their joins already point to PJ^V_ci for this predicate and view.

Algorithm 12 Compact Frugal Chase
Input: An indexed set of view PJs, a set of LAV-WA constraints Σ
Output: Chased PJs after the frugal chase converges for Σ
 1: while the index of view PJs changes do
 2:   for all σ ∈ Σ : P(~x,~y) → C_1(~x,~z), ..., C_n(~x,~z) do
 3:     for all PJ^V_ant ← antecedent PJs in the views do
 4:       ViewSet ← views in PJ^V_ant not yet used to trigger σ
 5:       initialize constraint PJs for ViewSet
 6:       for all constraint PJs PJ^C_ci do
 7:         for all PJ^V_ci ← view PJs capturing the pattern of PJ^C_ci do
 8:           for all views s common to ViewSet and PJ^V_ci do
 9:             if PJ^V_ci is marked as satisfiable for σ, PJ^C_ci and s then
10:               updateJoinboxes(PJ^V_ant, PJ^V_ci, PJ^C_ci, σ, s)
11:               if PJ^C_ci is "satisfied" for all views in ViewSet then
12:                 continue to the next consequent PJ (line 6)
13:               continue (line 8)
14:             mark PJ^V_ci as satisfiable for σ, PJ^C_ci and s
15:             ST ← UNDECIDED
16:             if isPJSatisfiable(PJ^V_ci, PJ^C_ci, s, ST) then
17:               updateJoinboxes(PJ^V_ant, PJ^V_ci, PJ^C_ci, σ, s)
18:               if PJ^C_ci has been "satisfied" for all views in ViewSet then
19:                 continue to the next consequent PJ (line 6)
20:               continue (line 8)
21:             else
22:               mark PJ^V_ci as unsatisfiable for σ, PJ^C_ci and s
23:       add PJ^C_ci to the index, update joinboxes of affected PJs

If, after updateJoinboxes (which has just deleted the satisfied view from the consequent PJ), the consequent PJ PJ^C_ci becomes empty, this means that the corresponding consequent predicate is partially satisfiable by existing view PJs for all applicable views. Hence, in line 12, the algorithm jumps to the next consequent PJ. Otherwise, in line 13, the algorithm goes on examining the next applicable view s for PJ^V_ci and PJ^C_ci, i.e., it jumps to line 8. If we have no prior information for s, PJ^C_ci and PJ^V_ci, we call isPJSatisfiable in line 16. A global variable ST (initially set to "undecided" at line 15) maintains which of the two cases of Def. 15 we are currently checking. If we find a partial satisfaction, lines 17-20 repeat the steps described in lines 10-13. Since our algorithms visit joined predicates recursively, we must avoid falling into an infinite cycle by visiting again the same PJs, for the same view, as the ones currently under consideration. In line 14, starting in good faith, we mark our pair of PJs and view as already visited and satisfiable, in order to avoid an infinite recursion. If the pair fails to prove partial satisfiability, we reverse this decision in line 22. At the end of Alg. 12 (line 23), we add to our index each consequent PJ with the information for the views left in it (i.e., those views not partially satisfying the consequent predicate). In Fig. 5.3(f) both sources remain in the consequent PJ for Patient, and this entire PJ is added to our output (Fig. 5.3(h)). If a new PJ gets added to our index, all view PJs containing the same views (and joining with the new predicate) update their joinboxes accordingly (Fig. 5.3(h)).
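At a coarse grain, the control flow of Alg. 12 can be sketched as follows. This is a simplification, with the matching and satisfiability machinery injected as hypothetical parameters; the real algorithm additionally maintains joinboxes and the satisfiable/unsatisfiable marks of lines 9, 14 and 22.

def compact_frugal_chase(view_pjs, constraints, antecedent_matches,
                         consequent_pjs_for, partially_satisfiable):
    """Apply every constraint to a fixpoint; a consequent PJ is added only
    for the views for which no existing view PJ partially satisfies it."""
    changed = True
    while changed:                                             # line 1
        changed = False
        for sigma in constraints:                              # line 2
            for pj_ant in antecedent_matches(sigma, view_pjs):       # line 3
                for pj_c in consequent_pjs_for(sigma, pj_ant):       # lines 5-6
                    remaining = [s for s in pj_c['views']
                                 if not partially_satisfiable(pj_c, s, view_pjs)]
                    if remaining:                              # line 23
                        pj_c['views'] = remaining
                        view_pjs.append(pj_c)
                        changed = True
    return view_pjs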
Algorithm 13 isPJSatisfiable(PJ^V_ci, PJ^C_ci, s, ST)
Input: A pre-existing view PJ PJ^V_ci, a constraint PJ PJ^C_ci, a view s, and the state ST of this recursion so far
Output: true if the predicate in PJ^C_ci for view s is partially satisfiable in PJ^V_ci
 1: if PJ^V_ci is marked unsatisfiable for σ, PJ^C_ci and view s then
 2:   return false
 3: for all V_k ← node on edge k of PJ^V_ci do
 4:   joins^V_k ← the joinbox joins for s from V_k
 5:   C_k ← node on edge k of PJ^C_ci
 6:   joins^C_k ← the joinbox joins in C_k
 7:   if C_k is an antecedent variable in the constraint then
 8:     if the joins with the antecedent in joins^C_k ⊈ joins^V_k then
 9:       return false
10:   else
11:     if V_k is a distinguished variable then
12:       if ST == 1 then
13:         return false
14:       ST ← 2
15:       if !checkCase2(joins^V_k, joins^C_k, s, ST) then
16:         return false
17:     else
18:       switch (ST)
19:       case (2):
20:         if !checkCase2(joins^V_k, joins^C_k, s, ST) then
21:           return false
22:       case (1):
23:         if !checkCase1(joins^V_k, joins^C_k, s, ST) then
24:           return false
25:       case (UNDECIDED):
26:         ST ← 2
27:         if !checkCase2(joins^V_k, joins^C_k, s, ST) then
28:           ST ← 1
29:           if !checkCase1(joins^V_k, joins^C_k, s, ST) then
30:             return false
31:   if ST == 1 then
32:     for all PJ^C_cj ← joined predicates in joins^C_k do
33:       if PJ^C_cj is marked as satisfiable for s, but maps V_k to a different variable than C_k then
34:         return false
35: return true

Checking partial satisfiability. In order to check whether PJ^V_ci can partially satisfy PJ^C_ci for s, Alg. 13 is called. It returns false if at some other point this combination has been marked as unsatisfiable (lines 1-2). As dictated by Def. 15, the partial satisfiability check happens on a variable-by-variable basis (line 3), for the specific view s, and Alg. 13 returns true if no variable check fails (line 35). If the constraint variable C_k is an antecedent variable, we only need to check that the corresponding view PJ variable, V_k, joins with the antecedent view PJ in the positions where C_k joins with the antecedent constraint PJ (lines 7-9), i.e., that the variables satisfy the constraint antecedent homomorphism. If C_k is existential but V_k is distinguished, we have to be in case 2 (lines 11-16). We check this by calling, in line 15, checkCase2 (shown in Alg. 15), which essentially checks that the predicates joined to the constraint variable can be mapped to some of the joined predicates of V_k. This ends up recursively considering the connected component of the constraint that C_k is in. On the other hand, checkCase1, omitted due to space but almost identical to Alg. 15 (just the outer for loops are reversed), checks that all joined predicates of V_k, i.e., the corresponding connected component of the view, are images of constraint predicates joining with C_k.
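The nested-loop shape of these two checks can be summarized in a few lines. This is only a sketch: `matches` stands in for the recursive call of Alg. 14, and the state bookkeeping is omitted.

def check_case2(view_joins, constraint_joins, matches):
    """Case 2: every join of the constraint variable must be matched by
    some join of the view variable."""
    return all(any(matches(jv, jc) for jv in view_joins)
               for jc in constraint_joins)

def check_case1(view_joins, constraint_joins, matches):
    """Case 1: conversely, every join of the view variable must be the
    image of some join of the constraint variable."""
    return all(any(matches(jv, jc) for jc in constraint_joins)
               for jv in view_joins)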
Algorithm 14 joinsSatisfiedRecursively(jd^V, jd^C, s, ST)
Input: Join description jd^V of a view PJ variable, join description jd^C of a constraint variable, view s, state ST of the recursion
Output: true if the joined predicate described by jd^V is the satisfying homomorphism's image of the joined predicate jd^C
 1: if jd^V == jd^C then   // the joins are to the same predicate name at the same position
 2:   neighborVPJ ← joined PJ in the view described by jd^V
 3:   neighborCPJ ← joined PJ in the constraint described by jd^C
 4:   if neighborVPJ has been marked as satisfiable for neighborCPJ for view s then
 5:     return true
 6:   if neighborVPJ is marked as unsatisfiable for neighborCPJ for view s then
 7:     return false
 8:   mark neighborVPJ as satisfiable for neighborCPJ for view s
 9:   if isPJSatisfiable(neighborVPJ, neighborCPJ, s, ST) then
10:     return true
11:   mark neighborVPJ as unsatisfiable for neighborCPJ for view s
12: return false

If V_k is existential, then depending on the current state (lines 18-24) we check that PJ^V_ci can still partially satisfy PJ^C_ci, in accordance with the corresponding case of the definition. If ST is "undecided" we check both cases (lines 25-30). After these checks, and if we are left in the state of case 1, we test whether case 1(c) of Def. 15 is satisfied by checking our memoization structures to see whether the same constraint variable has been mapped somewhere else in the same view; if so, we fail (lines 31-34). Algorithm 15 and the method checkCase1 call Alg. 14, which ends up calling isPJSatisfiable recursively for the joined predicates of our variables (the neighbor PJs). Alg. 14 marks the PJs as satisfying (line 8) to avoid an infinite recursion (line 4 guarantees that); this marking is reversed in line 11 if proven wrong.

Algorithm 15 checkCase2(joins^V_k, joins^C_k, s, ST)
Output: true if every joined predicate in joins^C_k is partially satisfiable with a predicate in joins^V_k
 1: for all join descriptions jd^C in joins^C_k do
 2:   for all join descriptions jd^V in joins^V_k do
 3:     if joinsSatisfiedRecursively(jd^V, jd^C, s, ST) then
 4:       continue outer for (line 1)
 5:   return false   // there is one join in the constraint not satisfied by any join in the view
 6: return true

Alg. 13 implements exactly Definition 15 for view s and the predicates in PJ^V_ci and PJ^C_ci. In order to prove that our compact chase outputs PJs which capture the frugal-chased views, we point out that the algorithm always terminates, since: 1) marking the PJs considered avoids infinite recursion, hence every consequent PJ gets satisfied or added; 2) once finished processing a constraint we do not trigger it again for the same predicate; and 3) since our constraints are chase-terminating, and we proved the equivalence of the frugal chase to the standard one, we cannot be triggering constraints indefinitely: at some point we will find all consequent predicates partially satisfiable and stop adding new ones.
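The cycle-breaking device shared by algorithms 13 and 14, provisionally marking a pair as satisfiable before recursing and demoting the mark on failure, can be isolated in a generic sketch (hypothetical names; the real algorithms key their marks on PJ pairs per constraint and view):

def satisfiable(pair, expand, memo=None):
    """memo caches pair -> True/False; the provisional True mark lets a
    recursive visit that cycles back to `pair` succeed instead of looping."""
    memo = {} if memo is None else memo
    if pair in memo:
        return memo[pair]
    memo[pair] = True                      # mark "in good faith" (cf. line 14)
    if all(satisfiable(p, expand, memo) for p in expand(pair)):
        return True
    memo[pair] = False                     # reverse the optimistic decision
    return False

# A two-element cycle: each pair requires the other; the provisional marks
# close the loop and the check succeeds.
deps = {'a': ['b'], 'b': ['a']}
print(satisfiable('a', lambda p: deps[p]))   # True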
Note that our implementation finds a partially satisfiable set by keeping the first partially satisfiable atom it discovers, together with all those compatible with it. Summarizing, the following holds:

Theorem 9. Given a set of view PJs representing our original LAV mappings, and a set of LAV-WA TGD constraints, Alg. 12 always terminates and produces a set of PJs which represent the set of mappings produced after running the frugal chase on each of the view formulas using the constraints.

5.5 Experimental Evaluation

To evaluate our approach we compare our compact frugal chase algorithm to the standard chase, as well as to the parallel and core chase [47], in the context of compiling LAV wa-TGD constraints into LAV mappings. Every parallel chase step decides all applicable constraints on B first, and then adds their consequents to B. Each core chase step is a parallel step followed by minimization of the result. First, we run our compact graph chase on a set of conjunctive LAV mappings using a set of LAV weakly acyclic constraints, feeding the resulting PJs to GQR to rewrite a conjunctive query (using the PJs representing the chased mappings). Second, we have implemented and run the standard, parallel, and core chase algorithms in order to compile the same sets of constraints into our set of mappings, as in [4]. We compute PJs for the standard-, parallel-, and core-chased mappings. In all cases the chased PJs are input directly to GQR for query rewriting. Our compact chase outperforms the standard and the core chase by close to 2 and 3 orders of magnitude respectively, while our output remains very close to the core.

For producing our queries, constraints, and views, we wrote a random query generator, extending the one used in MiniCon [112], which is highly customizable, as we discuss next. In fact, it can generate queries, views, and constraints which capture most cases of the mapping scenarios identified in [10] (note, however, that our prototype implementation does not consider constants).

5.5.1 Chain Queries, Constraints and Views

Initially we generated 20 datasets of chain queries, chain constraints, and chain views. We used a space of 300 predicates to generate our LAV weakly-acyclic constraints and views, with each constraint/view having 5 predicates joined in a chain. The first atom in our constraints is the antecedent. Each predicate has length 4, and each constraint/view can have up to 3 repeated predicates. Each view has 4 distinguished variables. Additionally, in order to get constraints more relevant to the views, we generated 10% of our constraints from a smaller subspace of our space of predicates, of size 60. Also, we constructed around 10% of our views by taking one constraint and dropping one of its atoms, causing our constraints to most likely have all atoms except this one partially satisfiable on these views (unless they map their existential variables to distinguished view variables, causing case 2 of Def. 15). Lastly, we generated our 20 queries by randomly selecting 3 constraint antecedent atoms and one of these extra non-partially-satisfiable atoms that the constraints contain; we do so in order not to penalize the standard and parallel chases by querying redundant atoms that they will add (rather, the "extra" atom that the query contains will probably be added by all chase algorithms, including the frugal).

Figure 5.5: Total time for 20 chain queries, chasing 300 chain views with up to 200 constraints.
Figure 5.6: Number of PJs produced for the queries of Fig. 5.5.
Figure 5.7: Average query rewriting time for the queries of Fig. 5.5.
Figure 5.8: Average number of rewritings GQR produces for the queries of Fig. 5.5.

We run our experiments on a cluster of 2GHz processors, each with 2GB of memory. We allocate to each processor 300 views and 300 constraints, and one hour of wall time for that job to finish. Due to the density of our data in this setting, most of the standard/parallel and core chase runs died after reaching 100 constraints (only 7 made it to 140), while for the compact chase all of them reached 140 constraints and 14/20 completed with 180 constraints, which indicates that the compact chase scaled almost twice as far. Fig. 5.5 shows the average total time for running the chases and rewriting the queries (note that the figure averages over the successful queries at each point, as discussed above). As seen in the figure, the compact algorithm outperforms the standard and parallel chases by close to two orders of magnitude, and the core by close to three. Moreover, as Fig. 5.6 shows, the number of PJs that the compact chase produces for 100 constraints, for which all queries succeeded in all frameworks, is the same as the core and is consistently less than the standard/parallel as the number of views increases (for 100 constraints and 300 views the compact-chased mappings have 5895 predicates while the standard-chased ones have 6232). In fact, the frugal chase computes almost the same output as the core chase, and does so very efficiently, without minimization. This leads to (equivalent but) fewer conjunctive rewritings for both the frugal and the core chase as output of the query rewriting problem, as Fig. 5.8 shows. In particular, for 300 views and 100 constraints our system produced around 30% fewer query rewritings than the standard/parallel chases, and only 13% more than the core chase. Lastly, the sets of PJs in Fig. 5.6 are rather small for GQR, so it reformulates the queries extremely fast (in milliseconds). Fig. 5.7 shows that the reformulation time for the compact-chased PJs (and the core-chased PJs) is considerably (more than 3x) faster than for the standard/parallel-chased ones, with the gap between them increasing with the number of views (the core time in Fig. 5.7 for 250 views seems to be an outlier). A side note of our setting is that the parallel chase, which is just a specific ordering of the standard chase execution steps, is slower than the standard chase (see the related work for the relevant discussion).
5.5.2 Star Constraints and Views

For our second experiment we evaluated our algorithm when it degenerates to the standard chase and has the same output, i.e., when no predicate is partially satisfiable. We did not measure query rewriting: since both the compact and the standard chase produce the same chased mappings in the form of PJs, GQR takes the same time to rewrite these for any query, yielding identical rewritings.

We created a space with 500 predicate names, out of which we populate our views and constraints. Each atom has 5 variables, and each view/constraint can have up to 3 occurrences of the same predicate. In each formula, one predicate is the "star", joining with all other predicates, which do not join directly with each other. Using this setting we created 20 datasets of 1000 views and 300 constraints. Each formula has 5 predicates (one of these five predicates in a constraint is the antecedent). Additionally, each view has 5 distinguished variables, choosing most of them from the star predicate; this essentially introduces distinguished variables into the connected component of the view and obliges our algorithm to fall into case 2 of our definition, i.e., to map the constraint consequent in its entirety or not at all (since the consequents are "shorter" than the source descriptions). Since the space of predicates is rather sparse, this setting causes the constraints to be unsatisfiable. Moreover, it is also difficult for this dataset to produce redundant predicates. This means that the minimization the core chase runs is wasted time. Each dataset was tested with up to 1000 views and 300 constraints: each processor was allocated all 1000 views and a specific number of constraints, scaling from 20 to 300, and we enforced one hour of wall time.

Figure 5.9: Average time of chasing 1000 star views, with up to 250 star constraints.

Fig. 5.9 shows the average time for the compact chase versus the standard, the parallel, and the core chase on all 1000 views. Note that for this experiment the size of the output blows up exponentially as the number of constraints grows: the number of predicates goes from around 5,500 in the original mappings to approximately 180,000 for 250 constraints. This exhausts all algorithms: for the compact chase, all experiments run out of time/memory, for 1000 views, between 250 and 300 constraints, while the other chases crash around 180 constraints. As the figure shows, the compact frugal chase scales to around 30% more constraints while being close to two orders of magnitude faster than the standard chase and almost three compared to the core.

Figure 5.10: Average time of chasing 20 star views, with up to 250 star constraints.

Additionally, note Fig. 5.10, which shows the compact chase performance in the same setting but scaling the constraints for only 20 views. Such a small database essentially reduces the advantage of our compact representation: the 20 view formulas generated from our sparse domain have negligible overlap.
Nevertheless, our atom-oriented implementation of the frugal chase gives a speedup of about one order of magnitude compared to the standard/parallel chases and close to two compared to the core. The size of the chased mappings ranges from hundreds of predicates for 180 constraints to thousands for 250 constraints.

5.6 Discussion and Related Work

Early approaches dealing with relational query rewriting, with no constraints, involve algorithms such as the bucket algorithm [70] and the inverse rules algorithm [53]. Modifications of the inverse rules algorithm for full dependencies or functional dependencies were presented in [53]. Koch [74] presented another variation of inverse rules for answering queries using views under GLAV dependencies (for dependencies common to both source and global schemas). A more efficient approach to view-based query answering was proposed in 2001 by the MiniCon algorithm [112]. MCDSAT [17] exhibited better performance than MiniCon, and is essentially the MiniCon algorithm cast as a satisfiability problem. In this chapter we employ GQR, which significantly outperformed MCDSAT.

There are approaches to data integration and query answering under constraints for which the chase does not terminate; rather, one has to perform query expansion using the constraints. The family of DL-Lite languages [16] is a famous such first-order rewritable class of constraints, which underlies the semantic web language OWL2-QL. Extensions of DL-Lite include flavors of the Datalog+/- family of languages [31]. Query expansion rewrites the query, using the constraints, into a UCQ, before even considering the views or data. Although the LAV wa-TGDs used in this work are both FO-rewritable and chase-terminating, we focus on running the chase on the views without needing to take the query into account, in order to speed up the system's online performance. There are flavors of Datalog+/-, such as the guarded fragment [33], for which the chase does not terminate, but there is a finite part of the infinite chase sufficient for query answering, again once a specific query is given. Our frugal chase could substitute for the chase algorithms in these fragments.

Chase&Backchase (C&B) [46] is a technique for query minimization and for finding equivalent rewritings, by first chasing all constraints and mappings and then minimizing the chase. Afrati et al. [4] presented an optimized version for finding equivalent rewritings, as well as maximally-contained ones, as discussed. In this work we outperformed the latter approach (running the standard chase and then doing query rewriting). Additionally, the frugal chase can replace the chase in the C&B algorithm for finding equivalent rewritings, and be even more advantageous (being shorter than the standard chase) for query minimization. Our frugal chase can also be used for query containment; in Chapter 4 we used the standard chase and a compact graph-based algorithm for UCQ containment.

The chase algorithm has been studied in multiple variations in the recent literature (for overviews see [107, 67]), which guarantee chase termination for one or all instances, under different classes of constraints, producing universal solutions. These include the naive oblivious chase [33], the Skolem oblivious chase [101], and the parallel and core chase [47]. Whenever the oblivious and naive chases terminate: 1) the standard chase also terminates, and 2) they produce longer solutions than the standard chase [107, 67].
This justifies our choice of studying the standard chase, since we focus on producing shorter solutions. The parallel chase is just a specific ordering of standard chase step executions; in particular, each parallel chase step decides all applicable constraints on the instance as it has been formed so far and, only after considering all constraints, adds all consequents before going to the next step. Notice that this might penalize the parallel chase, as is evident from our experiments: the "incremental" addition of consequents that the standard chase performs might prove some "parallel" rules satisfiable, and hence avoid applying them. Each core chase step is a parallel chase step followed by minimization of the produced instance. As its name suggests, the core chase leads to a minimal universal solution, i.e., the core, as opposed to the frugal chase of Chapter 5, which leads to smaller but not necessarily minimal solutions. The price paid by the core chase, however, is the extra minimization (whose equivalent decision problem is NP-hard), as indicated by our experiments. Nevertheless, the core chase is more "general" than the standard and the frugal chase, since it is a complete algorithm for producing universal solutions under more general classes of constraints [47]. Notice that the "building block" of the core and parallel chases is the standard chase, which we could replace with the frugal chase or our pruned chase in order to implement a variety of algorithms.

Other data exchange literature that computes the core universal solution includes [56, 64, 65]. The space of universal solutions (such as those of our frugal chase) which lie in between the core and the standard chase has been discussed by Henrich et al. [71]. Approaches in [124, 102] construct mappings and constraints that use negation to avoid adding redundant predicates to a solution. We plan to evaluate our approach in these settings and to extend our algorithms to produce the core by leveraging our compact graph-based format.

In this chapter, we have presented several contributions. First, we introduced the frugal chase, which produces smaller universal solutions than the standard chase. Second, we developed an efficient compact version of the frugal chase. Third, we used this frugal chase to scale up query rewriting under constraints, for the case of LAV mappings and LAV weakly acyclic constraints. As our experiments show, we gain the additional expressivity of using dependencies very cheaply: in essence, our algorithm only pays the cost of relational query answering using our preprocessed views, which, due to our optimized shorter chase, our compact format, and our indexing, becomes very efficient. In future work, we plan to explore extensions of our chase that would compute the core solution, as well as to evaluate our system both in data exchange scenarios (i.e., chasing instances rather than mappings) and with other chase-terminating classes of constraints, including limited interactions of TGDs and EGDs.

6 Pruning the Infinite Chase

In this chapter we develop the theoretical foundations for a "combined" approach to query answering under constraints. We focus on First-Order (FO) rewritable, but not chase-terminating, constraints; in particular, linear TGDs, which generalize Inclusion Dependencies and capture the "essential" part of practically useful languages, such as DL-Lite/OWL2-QL (a W3C recommendation).
In order to account for these constraints, traditional query answering approaches devise a rewriting of the original query that encapsulates the reasoning of the constraints (since the "alternative" approach of running the chase on the data does not terminate). Such a rewriting is usually of exponential size and its evaluation is computationally wasteful. In order to reduce the size of the rewriting, several solutions have been proposed. The combined approach to query answering under constraints refers to performing a "chase-like" preprocessing on the data and the constraints such that, at query time, there is a smaller query rewriting, w.r.t. the constraints, that gives the correct answers over the preprocessed data.

In this thesis, we propose a novel and much simpler combined approach to query answering. In particular, we show how to use only a finite portion of the infinite chase and prove that one can chase the data up to a point independently of any query, in effect pruning the rest of the infinite chase, and then just pre-process any incoming query at runtime, such that it can "fit" inside our pruned chase. In order to check how a query "fits" or matches the infinite chase, we develop a novel notion of a representative "prototypical" database, and we chase, at query time, only a few prototypical tuples for every relation of our schema. We do this at query time to leverage results that allow us to stop the prototypical chase after a finite chase whose size depends on the query. We can then examine the homomorphisms of the query on this chase and "simulate" its matches on the infinite chase. This effectively allows us to "prune" or "minimize" the query, a process which we call query contraction. To account for all the answers of our original query on the infinite chase, there may be more than one contracted query that we need to issue on our pruned chase. Nevertheless, unlike the traditional or the combined UCQ rewriting of the query, which is of a large exponential size, the UCQ rewriting we need to get the certain answers using our pruned chase result is much smaller in size.

Conjunctive query answering under linear TGDs is, in general, PSPACE-complete; it is in NP when the constraints are fixed, and in AC^0 when both the constraints and the query are fixed [34]. The latter case regards the so-called data complexity of the problem and is of particular interest, since the queries are usually assumed to be small and the ontologies/constraints relatively stable over time. It also means that linear TGDs are FO-rewritable: query answering can be done by rewriting the original query (independently of the data), using the dependencies, into a new union of conjunctive queries (i.e., a UCQ) which can be issued directly on our original data. This UCQ rewriting of the query can have a size exponential in the size of the original query, and much attention has been devoted to efficiently producing smaller but equivalent rewritings. However, the problem of evaluating such large UCQs is still a burden to practical query answering and data integration systems. An alternative technique for query answering under constraints is to run the chase algorithm on the data using the dependencies, and thereafter issue the original query on the chased result. For linear TGDs (and even for their subset, Inclusion Dependencies) the chase algorithm does not always terminate.
However, it has been proven that any given query can be answered on a finite initial part of the infinite chase, which depends on the query and the constraints. Essentially, this means that one would have to run the chase up to a finite point for every new query that comes into the system (or devise an incremental approach to do so), and then issue the query on this finite chase part. Nevertheless, the chase is an expensive procedure, and chasing the entire data is practically unacceptable at query time.

In this chapter we present a combined approach to query rewriting that results in smaller query rewritings than other approaches. We prove that for any query we can get its results by issuing a union of conjunctive queries (a UCQ), equivalent to the original conjunctive query, on a fixed portion of the infinite chase that depends only on the constraints. This allows us to chase the data up to a point independently of any query, in effect pruning the rest of the infinite chase, and then just "minimize" any incoming query at runtime, such that it can "fit" inside our pruned chase. We compute the rewriting with the help of a small "prototypical" database, which we chase at query time. Our prototypical database, which is essentially only the "frozen" schema of our original database, contains only "pattern" facts rather than actual tuples, and hence could be several orders of magnitude smaller than the actual database in practice. To account for all the answers of our original query on the infinite chase, there is more than one query that we need to issue on our pruned chase. Nevertheless, unlike the traditional UCQ rewriting of the query, which is of a large exponential size, the UCQ rewriting we need to get the certain answers using our pruned chase result is much smaller in size (but still exponential). Note that, usually, the cost of evaluating a rewriting dominates the computational cost of query answering under constraints as the data grows (see, e.g., [110]), and the time of computing the rewriting itself is of secondary importance. Hence, although we also focus on efficiently producing the rewriting, it is the output size of the rewriting algorithm that is of prominent value.

We intend to "shorten" a query in such a way that it has the same answers on our pruned chase as the original query would have on the infinite chase. However, this "shortening" is not unique and we have to produce a UCQ rewriting. This rewriting is still smaller than the UCQs produced by alternative approaches, as noted in Sect. 3.7. We focus on linear TGDs (LTGDs), which can express the "core" part of the DL-Lite [16] family of languages. In particular, the language of linear TGDs, together with negative constraints and non-conflicting key dependencies [34], captures all known DL-Lite variations. We will only deal with query answering in the face of linear TGDs, as it is the part that presents the essential challenges in query answering. As known from prior work, all conjunctive queries can be answered on an initial part of the infinite chase that depends on the length of the query. The next subsection revisits these results from our new perspective of blocks (periodic parts) of the infinite chase, giving better bounds on the length of the chase needed to answer a particular query. In order to simplify presentation, we are going to use LTGDs that have only one atom in the consequent. Any set of LTGDs can be rewritten in this form, and all our results hold for rules with multiple atoms in the consequent.
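For instance, one standard construction (sketched here with a hypothetical fresh auxiliary predicate Aux; the text does not commit to a particular normalization) splits a multi-atom consequent by first collecting all consequent variables into an auxiliary atom:

σ : R(x, y) → ∃z S(x, z) ∧ T(z, y)

becomes the three single-atom rules

R(x, y) → ∃z Aux(x, y, z),
Aux(x, y, z) → S(x, z),
Aux(x, y, z) → T(z, y).

Each resulting rule is again linear, and certain answers over the original schema are preserved.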
6.1 Answering a fixed query using LTGDs

For certain classes of TGDs, such as Inclusion Dependencies and linear TGDs, the infinite chase graph exhibits a so-called periodicity. This has been noticed before in [34, 33, 72], but we formalize it here using the notion of "blocks" of a chase graph. First, we define the notion of two equivalent atoms in the chase.

Definition 16. For all database instances B, for all sets of linear TGDs Σ, for all paths π in chase(B,Σ), for all triples of atoms t, r_1, r_2 on π, s.t. level(t) ≤ level(r_1) < level(r_2), we say that r_1 and r_2 are equivalent for ancestor t (or simply, for t) iff (1) their incoming edges are labeled by the same constraints, and (2) they are isomorphic, with (3) the isomorphism being the identity on all terms of t that appear in r_1 or r_2.

Two sets of atoms of the chase are equivalent for a common ancestor t iff there is a single isomorphism i that proves one-to-one equivalence of their elements for t, per Def. 16. Notice that the term ancestor of an atom a describes all atoms on a path π which precede a, as well as a itself. The lowest-level atom on a chase path starting at an atom t that is equivalent to a previous one on the same path will be called the frontier atom for π and t. The chase tree below the frontier atom becomes "periodic".

Definition 17. For all database instances B, for all sets of linear TGDs Σ, for all chase paths π of chase(B,Σ), for all atoms t ∈ π, we say that an atom r_2 is the frontier atom for π and t, denoted frontier(π, t), iff (1) r_2 is equivalent, for t, to an atom r_1 lying on π, and (2) for all pairs r′_1, r′_2 of atoms on π with level(r′_1) < level(r′_2), r′_1 equivalent to r′_2 for ancestor t, and r′_2 ≠ r_2, it holds that level(r′_2) > level(r_2).

The atom frontier(π, root(π)) is the first frontier atom on a chase path π. In this case we simply write frontier(π). Informally, a frontier atom denotes the beginning of a "new" period of a chase path. The following lemma can be generalized to state that if r_1 is an equivalent ancestor of frontier(π, t), then each subpath starting at frontier(π, t) is isomorphic (with the isomorphism being the identity on terms of t) to some subpath starting at r_1.

Lemma 10. For all database instances B, for all sets of linear TGDs Σ, for all facts t, r_1, r_2 in chase(B,Σ) for which r_1, r_2 are equivalent for t, for all linear TGD constraints σ applicable to r_1 and producing r′_1, σ is also applicable to r_2, producing r′_2, and it holds that r′_1 is equivalent to r′_2 for ancestor t. Moreover, the sets {r_1, r′_1} and {r_2, r′_2} are equivalent for ancestor t.

Proof. Note that two equivalent atoms are facts of the same relation (and have the same incoming constraints in the chase graph). Moreover, let i be the isomorphism that maps r_1 to r_2, and let h be the homomorphism that maps the antecedent atom p of the constraint σ : p(~x,~y) → ψ(~x,~z) to r_1; then i(h(p(~x,~y))) is a homomorphism that maps p to r_2. The immediate chase descendants of r_1 after the application of σ are the set of atoms in ψ(h(~x), f(~z)), while those of r_2 are ψ(i(h(~x)), f(~z)).
For all atoms in the first conjunction there is an isomorphic one in the second and vice versa; these two predicates are also equivalent, since (1) f uses fresh names and (2) h(~x) and i(h(~x)) agree on the original terms of t.

The length of a path π starting at a predicate and ending in frontier(π) is called the "period" of π. For all infinite paths of a linear TGD chase graph (in fact, for all paths longer than the bounds we give below) a frontier atom exists. This was shown in [34] for Guarded TGDs (and a similar bound was shown in [72] for Inclusion Dependencies), using a combinatorial argument that bounds the occurrences of non-isomorphic (in particular, non-equivalent) atoms on any chase path. In fact, this is a bound on the period size for any path of the chase. Guarded TGDs (GTGDs) generalize linear TGDs, and the following result can be derived from the analysis in [34].

Lemma 11 ([34]). For all database instances B, for all sets of linear TGDs Σ, for all oblivious chase paths π of chase(B,Σ), with t = root(π), the period of π, i.e., the length of π from t to frontier(π, t), if the latter exists, is no longer than |Σ| × (2w)^w, where w is the maximum arity of any predicate in the constraints.

An arbitrarily long (or infinite) path will have many "frontier atoms", appearing periodically. Given a path π and a tuple t, we inductively define frontier^k(π, t) as follows:
• frontier^1(π, t) = frontier(π, t)
• frontier^k(π, t) = frontier(π, frontier^{k−1}(π, t)), for integer k ≥ 2

In order to compute frontier^k(π, t) we consider atom frontier^{k−1}(π, t) as the root of the path π and look for its own frontier. For finite paths a frontier atom does not always exist, in which case frontier^k(π, t) might be undefined. If t = root(π), we simply write frontier^k(π). Based on this definition, when using a chase algorithm that has period l, we can state that the distance between two subsequent frontier atoms on the same path is no longer than |Σ| × l. We call the index k the block level of the frontier atom.

Intuitively, a set of frontier atoms of the same block level defines a "block" in the chase. Formally, for a set of facts B and a set of linear TGDs Σ, block_1(B,Σ) = { b | b ∈ chase(B,Σ) and, if frontier^1(π(b)) is defined, level(b) < level(frontier^1(π(b))) }. For all k > 1, we inductively define block_k(B,Σ) = { b | b ∈ chase(B,Σ) s.t. (1) frontier^{k−1}(π(b)) is defined, (2) level(frontier^{k−1}(π(b))) ≤ level(b), and (3) for all chase paths π_i that b belongs to, if frontier^k(π_i) is defined, level(b) < level(frontier^k(π_i)) }.

For a database instance B and a set of linear TGDs Σ, block_1(B,Σ) is the initial portion of the chase, which contains all atoms in B and all atoms of the chase up to and excluding the "first" frontier atoms. Similarly, the second block of the chase contains, for all paths π, all frontier^1(π) atoms and their descendants up to and excluding the second "layer" of frontier atoms, frontier^2(π). Notice that for a finite chase graph there is a maximum k after which block_k(B,Σ) = ∅. Also, chase(B,Σ) = ∪_{i=1}^{∞} block_i(B,Σ). Given an instance B, a set of linear TGDs Σ, and an integer k, we define the pruned chase at block level k as p-chase_k(B,Σ) = ∪_{i=1}^{k} block_i(B,Σ), with p-chase_∞(B,Σ) = chase(B,Σ).
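Frontier detection, as used in the block definitions above, can be sketched directly from Defs. 16 and 17. This is a simplification: atoms are assumed to carry their predicate name, the constraint labeling their incoming edge, and their terms, and condition (3) of Def. 16 is approximated by the per-position check below.

def equivalent_for(t, r1, r2):
    """r1 and r2 equivalent for ancestor t (Def. 16): same incoming
    constraint, isomorphic, identity on the terms of t they contain."""
    if r1['pred'] != r2['pred'] or r1['rule'] != r2['rule']:
        return False
    t_terms, iso = set(t['terms']), {}
    for x, y in zip(r1['terms'], r2['terms']):
        if (x in t_terms or y in t_terms) and x != y:
            return False                    # not the identity on terms of t
        if iso.setdefault(x, y) != y:
            return False                    # not a function
    return len(set(iso.values())) == len(iso)   # injective, hence isomorphic

def frontier(path, t):
    """frontier(path, t): lowest-level atom after t that is equivalent, for
    t, to an earlier atom of the same path (Def. 17); None if none exists."""
    start = path.index(t)
    for j in range(start + 1, len(path)):
        if any(equivalent_for(t, path[i], path[j]) for i in range(start, j)):
            return path[j]
    return None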
A first result that we can establish states that each chase block has a homomorphism to its previous block (and thus, to block_1), which maintains any original constants of B.

Theorem 12. For all database instances B, for all sets of linear TGDs Σ, for all integers k ≥ 2, if block_k(B,Σ) ≠ ∅, there exists a homomorphism h_k s.t. for all a ∈ block_k(B,Σ), h_k(a) ∈ block_{k−1}(B,Σ) and a is equivalent to h_k(a) for ancestor frontier^{k−1}(π(a)).

Proof. The atoms in block_k(B,Σ) are descendants of the frontier^{k−1}(π) atoms, for all paths π. We will examine individually the subgraphs of block_k(B,Σ) that have the same ancestor frontier^{k−1}(π_i), for a path π_i (which starts from a fact of B and ends in frontier^{k−1}(π_i)). Such a subgraph of block_k(B,Σ) is a tree rooted at frontier^{k−1}(π_i). We claim that for such a tree t there exists an isomorphism i s.t. for all atoms a in t, i(a) ∈ block_{k−1}(B,Σ) and a is equivalent to i(a) for ancestor frontier^{k−1}(π_i). We prove the claim by induction on the height of the tree. If the height is 1 (the subgraph contains only the frontier atom) the claim holds by the definition of frontier atoms. Let the claim hold for trees of height n, and let the height of the tree t be n+1. Let t′ be the subtree of t of height n. By the induction hypothesis, there is an isomorphism i which maps t′ onto block_{k−1}(B,Σ). Let S = {s_1, s_2, ..., s_m} be the set of atoms in t \ t′, i.e., the immediate descendants of the leaves of t′ (which are the leaves of t). By Lemma 10 there is an equivalence between each s_j ∈ S (for 1 ≤ j ≤ m) and an immediate descendant of a leaf of i(t′). Moreover, by the same lemma, we can extend i to create an isomorphism i′ that covers the entire tree t. To prove the claim, we still need to argue that each i′(s_j) is in block_{k−1}(B,Σ). If for some 1 ≤ j ≤ m, i′(s_j) ∉ block_{k−1}(B,Σ), this means that i′(s_j) is a frontier atom, since its predecessor in i′(t′) is in block_{k−1}(B,Σ) by the induction hypothesis. Thus, i′(s_j) is equivalent to a previous atom, say F, on its path. We distinguish two cases. (1) If F ∈ block_{k−2}(B,Σ), then F is an ancestor of frontier^{k−2}(π_i), and so an ancestor of frontier^{k−1}(π_i). Therefore, F is an ancestor of s_j. Also, F is equivalent to i′(s_j), which in turn is equivalent to s_j. Hence, s_j is equivalent to F and s_j is also a frontier atom. This is a contradiction, since s_j ∈ block_k(B,Σ). (2) If F is an ancestor of i′(s_j) but within block_{k−1}(B,Σ), then, by reasoning with the fact that t and i′(t) are isomorphic, F has an equivalent atom in t′ which is an ancestor of s_j, making the latter a frontier atom; contradiction. Hence the claim holds. This means that there are individual isomorphisms that map each existentially connected subgraph of block_k to its previous block. Taking the union of all such isomorphisms as functions yields a homomorphism, since the only terms shared in the domains of those isomorphisms are (1) terms from B, or (2) labeled nulls that come from a common ancestor frontier^{k−1}(π_j). On all those terms the isomorphisms are the identity, and so they agree.

Hence, for a conjunctive query q and for any query homomorphism h that maps q within a chase block of level k, we can compose h with a series of homomorphisms (in fact, isomorphisms) h_k, h_{k−1}, ..., h_2, where each h_i maps block_i into block_{i−1}, s.t. the composed image h_2(h_3(· · · h_k(h(q)) · · · )) lies in block_1. This shows that there exists a homomorphism into the first block (and, in fact, into any block) that yields the same tuples of constants as answers to q as h does on the k-th block.
Theorem 12 implies that any query which is fully answerable within single blocks, in other words, a query whose image on the chase does not span multiple blocks, such as an atomic query, can be answered in p-chase_1(B,Σ) (the latter result about atomic queries can also be derived from [34], where it is proven for GTGDs). Another implication of Theorem 12, intuitively, is that we can "slide" all query homomorphisms onto the chase "upwards", s.t. the lowest-level atom of the image of any homomorphism now maps onto the first block, and the rest of the image atoms are shifted accordingly. Formally, the next theorem holds.

Theorem 13. For all database instances B, for all sets of linear TGDs Σ, for all subgraphs δ of chase(B,Σ), there is a homomorphism h that maps every atom of δ to chase(B,Σ), with the following properties: (1) h is the identity on all terms of B appearing in δ or h(δ), (2) for all "lowest level" atoms a_min ∈ δ (atoms a_min are those for which, for all b ∈ δ, level(a_min) ≤ level(b)), h(a_min) ∈ block_1(B,Σ), and (3) for each atom b ∈ δ with b ∈ block_j(B,Σ), with j > 1, h(b) ∈ block_{j−m}(B,Σ), where m is the block that the atoms a_min belong to, i.e., a_min ∈ block_m(B,Σ).

Proof. As in the proof of Theorem 12, we are going to consider disjoint subgraphs of δ. Let k be the highest block level any atom in δ belongs to. Let δ_1, δ_2, ..., δ_n be the Gaifman components of δ. If the theorem holds for each δ_i, we can take the union of the homomorphisms h_i that map each δ_i to the first (k−m) blocks. This constructs a homomorphism, since the domain of each h_i shares only terms of B with the domains of the other homomorphisms, and all h_i agree on those terms by property (1) of this theorem. Moreover, the constructed homomorphism does not exceed block level (k−m), and hence the theorem holds. The argument proving that the theorem holds for each δ_i uses induction on the height λ of δ_i. If λ = 1, δ_i contains a single node and the theorem holds by Theorem 12. Suppose the theorem holds for λ = n, and let S = {s_1, s_2, ..., s_m} be the immediate descendants of the leaves of δ_i. Using the same argument as in the proof of Theorem 12, we can extend the homomorphism to cover S, and prove that the image atoms of S are within the (k−m) block-level bounds.

Theorem 13 shows that in order to answer any conjunctive query we only need to consider a first part of the chase, and in particular at most k blocks, where k is the number of blocks between the lowest-level and highest-level atoms in the image δ of any query homomorphism. Subsequently we are going to see how we can further "compress" the image of any query homomorphism by mapping any part of the chase into subsequent blocks. The next lemma states that we can always satisfy a join between two chase atoms by considering only two subsequent blocks.

Lemma 14. For all database instances B, for all sets of linear TGDs Σ, for all pairs of atoms a_1, a_2 s.t. a_1 ∈ block_i(B,Σ) and a_2 ∈ block_j(B,Σ) with i ≤ j, there is a homomorphism h that maps a_2 to an atom h(a_2) ∈ block_l(B,Σ) with i ≤ l ≤ i+1, s.t. h is the identity (1) on terms of B appearing in a_2 or h(a_2), and (2) on all join terms, i.e., terms appearing in both a_1 and a_2.

Proof. If the atoms are in the same or in subsequent blocks, the lemma holds trivially. Let j − i ≥ 2 and let {t_1, t_2, ..., t_n} be the terms appearing in both a_1 and a_2.
These terms belong in frontier^{j−1}(π(a_2)), since they must be inherited by a_2 through its ancestors. Moreover, these terms have to be present in all the ancestors of frontier^{j−1}(π(a_2)) up to frontier^i(π(a_2)), which marks the beginning of a branch in block level i+1, since a_1, which is in block level i, already contains these terms. It holds that for f = frontier^{j−1}(π(a_2)), which is in the j-th block, there is an atom a ∈ block_{j−1}(B,Σ) s.t. f and a are equivalent for ancestor frontier^{j−2}(π(a_2)) (which, as stated, contains all joined terms), by the definition of the frontier f. Hence a contains the terms {t_1, t_2, ..., t_n} in the same positions as f does. By using Lemma 10 and induction on the length of the path from f to a_2, we can prove that there is a descendant of a in block_{j−1}(B,Σ) which is equivalent to a_2 for frontier^{j−2}(π(a_2)), and that there is a homomorphism with the properties of the lemma from a_2 to this atom in block_{j−1}(B,Σ). By applying the above reasoning repeatedly, we can prove the claim of the lemma.

As a corollary of Theorem 13 and Lemma 14 we get that we can always answer a query with two atoms on the first two blocks of the chase. We will generalize this result to show that for all conjunctive queries q we can get the certain answers of q by only querying the first |q| blocks of the chase, i.e., p-chase_{|q|}(B,Σ), where |q| is the size (number of atoms) of the input query. Since for every query q the size of the image of q under any query homomorphism cannot exceed |q|, this result comes as a corollary of the following theorem.

Theorem 15. For all database instances B, for all sets of linear TGDs Σ, for all subgraphs δ of chase(B,Σ), there is a homomorphism h that maps every atom of δ to chase(B,Σ), with the following properties: (1) h is the identity on all terms of B appearing in δ or h(δ), and (2) for all atoms b ∈ δ, h(b) ∈ p-chase_{|δ|}(B,Σ).

Proof. Again, we are going to consider disjoint subgraphs of δ. Let δ_1, δ_2, ..., δ_n be the Gaifman components of δ. For each δ_i with 1 ≤ i ≤ n, the lemma can be proven by induction on the size of δ_i. If |δ_i| = 1 the lemma holds trivially by Theorem 13. If |δ_i| = 2, we can map the two predicates to the same or subsequent blocks by Lemma 14, and compose this with a homomorphism that maps these blocks to the first two blocks, by Theorem 13. Let |δ_i| = n, and suppose, by the induction hypothesis, that the lemma holds for any chase tree δ′_i ⊂ δ_i with |δ′_i| = n−1; that is, there is an h, identity on terms of B, s.t. h(δ′_i) ⊆ p-chase_{n−1}(B,Σ). We argue that we can extend h to map the atom a ∈ δ_i \ δ′_i into p-chase_n(B,Σ). The reasoning is very similar to that of the proof of Lemma 14: informally, the join variables shared between δ′_i and a have to come from an ancestor of a, which itself has an equivalent predicate in a previous block, allowing us to "shift" a accordingly. Since the lemma holds for all Gaifman components of δ_i, we can take the union of the corresponding homomorphisms to prove it for δ.

Corollary 16. For all database instances B, for all sets of linear TGDs Σ, for all conjunctive queries q, certain(q,B,Σ) = q↓(chase(B,Σ)) = q↓(p-chase_{|q|}(B,Σ)).

Using the upper bound of Lemma 11 on the length of a chase path containing a k-th frontier atom, we can easily derive an upper bound on the sizes (lengths of paths) of the blocks in Theorem 13.
This sketches the proof of the following theorem, which can also be derived from the proof of the similar result for GTGDs in [34].

Theorem 17 ([34]). For all database instances B, for all sets of linear TGDs Σ, for all chase trees T that belong in chase(B,Σ), for all subgraphs δ of T, there is a homomorphism h that maps every atom of δ to T, with the following properties: (1) h is the identity on all terms of B appearing in δ or h(δ), and (2) for all atoms p_i of δ, level(h(p_i)) ≤ |δ| × |Σ| × (2w)^w.

To clarify the difference of our results from Theorem 17, notice that Theorem 13 and Corollary 16 provide a procedurally-produced, worst-case optimal bound on the chase part needed to answer a query (by detecting frontier atoms and blocks). In effect, by detecting the frontier of each path we "prune" the chase at exactly the point needed, rather than relying on a combinatorially-produced upper bound on the number of steps that one needs to run the chase for (for each path), as in Theorem 17. This comes at the additional cost of looking for an equivalence of each new atom that the chase adds to an atom earlier on its path. Nevertheless, this approach can prune the chase much earlier than the combinatorial upper bound in many cases, saving significant chase-running time. In the next subsection we give a tighter bound on the number of chase blocks needed to answer a particular query; that is, we improve the results from [34] for the case of linear TGDs.

6.2 Query Contraction

Of special importance in our approach is the set of atoms which are common ancestors of the frontier atoms. The highest-level common ancestor of a set of atoms s, hca(s), is the "deepest" atom in the chase tree that precedes all atoms in s. Notice that frontier^k(π, t) belongs in block_{k+1}, and it is actually the "first" atom (the atom with the lowest derivation level) on π that belongs to block_{k+1}. The atom hca(frontier^k(π, t)) belongs to block_k, and it is the "last" atom (the atom with the highest derivation level) on π that belongs to block_k.

We now define how to rewrite a query so that it is answerable within a fixed number of blocks k. We call this technique query contraction; intuitively, it proceeds as follows: (1) it considers all query homomorphisms onto the chase; (2) for those homomorphisms whose image goes over the k-th frontier atoms, it prunes the query part whose image goes over the frontier atoms, substituting it with the hca of the frontier atoms that were exceeded, creating a new conjunctive rewriting; (3) the result is the union of the conjunctive rewritings created in the previous step.

Definition 18. For all database instances B, for all sets of linear TGDs Σ, for all conjunctive queries q with |q| ≥ 2, for all k ≥ 1, the contraction of the query within block k is the UCQ contract_k(q,B,Σ) = ∪_{i=1}^{n} q_i, where n is the number of answering homomorphisms h_1, h_2, ..., h_n from q to p-chase_{|q|}(B,Σ), and for each h_i, q_i = φ({h_i(q \ s)} ∧ hca(frontier^k(π(h_i(s))))), where s is the maximal subset of atoms of the query s.t. h_i(s) ∈ block_j(B,Σ) for j > k, and φ is a renaming function over the terms of its argument, defined as follows:
• φ is the inverse of h_i, h_i^{-1}, when defined, and
• for all terms y s.t. there is no x with h_i(x) = y, φ(y) is a "fresh" new skolem variable, and
• for all terms x, y s.t. x = y, φ(x) = φ(y) is the lexicographically smaller term among h_i^{-1}(x) and h_i^{-1}(y), if they exist.
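Def. 18 translates into the following sketch. All helpers are hypothetical placeholders for machinery developed in this chapter: enumerating the answering homomorphisms, locating blocks and hca atoms, and the renaming φ.

def contract(query_atoms, k, homomorphisms, block_of, hca_of_frontier, phi):
    """Returns the UCQ of Def. 18: one contracted conjunctive query per
    answering homomorphism into the pruned chase p-chase_|q|."""
    rewritings = []
    for h in homomorphisms(query_atoms):
        s = [a for a in query_atoms if block_of(h, a) > k]   # image beyond block k
        kept = [h[a] for a in query_atoms if a not in s]
        if s:
            kept.append(hca_of_frontier(h, s, k))  # substitute pruned part by hca
        rewritings.append(phi(h, kept))            # rename back via phi (Def. 18)
    return rewritings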
Next we prove that, in order to get the certain answers of the query on the pruned chase at block level k, we can contract the query within block k, and the resulting query is equivalent, w.r.t. the constraints, to the original query.

Theorem 18. For all database instances B, for all sets of linear TGDs Σ, for all conjunctive queries q, for all k ≥ 1, contract_k(q,B,Σ) ≡_Σ q.

Proof. (⇒) We first prove that contract_k(q,B,Σ) ⊆_Σ q. For all k ≥ 1, for all q_i ∈ contract_k(q,B,Σ), consider q′_i = φ(h_i(q) ∧ hca(frontier^k(π(h_i(s))))). Query q′_i differs from q_i only in that it contains the set of atoms φ(h_i(s)), which are all chase descendants of φ(frontier^k(π(h_i(s)))) and hence of φ(hca(frontier^k(π(h_i(s))))), which however belongs in q_i. Thus, chase(q_i,B,Σ) ≡ chase(q′_i,B,Σ), which means q_i ≡_Σ q′_i. Moreover, we see that there is a containment mapping that maps q onto q′_i, namely φ∘h_i. Hence q′_i ⊆ q, and since q′_i ≡_Σ q_i, it holds that q_i ⊆_Σ q. For all k ≥ 1 this holds for all q_i ∈ contract_k(q,B,Σ), and therefore, for all k ≥ 1, contract_k(q,B,Σ) ⊆_Σ q.

(⇐) We prove that q ⊆_Σ contract_k(q,B,Σ). Let tuple t ∈ certain(q,B,Σ); then there is a homomorphism h_i : q → chase(B,Σ) that binds the distinguished variables of q onto the terms of t. For every such h_i, q_i ∈ contract_k(q,B,Σ) binds the same variables to the same tuple by construction; hence t ∈ contract_k(q,B,Σ)(chase(B,Σ)). Therefore, for all k ≥ 1, q ⊆ contract_k(q,B,Σ) and hence q ⊆_Σ contract_k(q,B,Σ).

Theorem 18 proves that, given a pruned infinite chase forest at block level k, there is a rewriting of the query, no longer than k, that can provide the certain answers within the pruned chase. However, Def. 18 relies on actually running the chase for |q| blocks in order to find the contraction, and this defeats our purpose: if we had p-chase_{|q|}(B,Σ) we would be able to answer q directly on it.

Next, we devise an algorithm that computes the query contraction without relying on computing p-chase_{|q|}(B,Σ). Rather, we will define a new "prototypical" instance B′, much smaller than our original instance. Computing the query contraction for B′ is more efficient, and we will prove that this "prototypical" query contraction provides the query contraction needed to get the certain answers of the query on our pruned chase.

The intuition behind our prototypical instance is that it is "representative" of our original database. It holds the minimum number of facts s.t. for every fact in our database there is an isomorphic fact in our prototypical instance. These "prototypical" facts contain a new kind of term which behaves both as a variable and as a constant. In effect, we define an infinitely enumerable set of prototypical constants C = {c_1, c_2, ...}, which we use as terms in our prototypical facts.

We will also use a "prototypical" version of the input conjunctive query. Given a conjunctive query q, we define a special renaming function, an isomorphic mapping prot : constants(q) → C. Function prot is extended to be the identity on variables. We also extend the notation prot in the natural way over queries, and define prot(q) as the prototypical version of q. Since prot is an isomorphism, its inverse exists, and prot^{-1}(prot(q)) = q. If we try to extend the mapping prot to apply to instances in the obvious way, we will end up with an instance isomorphic to B, of the same size. Rather, we would like to capture and create a "prototype" of just the patterns of repeated values among the tuples of B.
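That construction, formalized in Def. 19 below, admits a one-pass sketch over the data (hypothetical encoding: an instance is a dict from relation names to sets of tuples):

from itertools import count

_fresh = count(1)

def _pattern(tup):
    """Canonical repetition pattern of a tuple, e.g. (a, b, a) -> (0, 1, 0)."""
    first = {}
    return tuple(first.setdefault(v, len(first)) for v in tup)

def prototypical_instance(instance):
    """One prototypical fact, over fresh prototypical constants c_i, per
    distinct repetition pattern occurring in each relation of the instance."""
    proto = {}
    for rel, tuples in instance.items():
        seen, proto[rel] = set(), set()
        for tup in tuples:
            p = _pattern(tup)
            if p not in seen:
                seen.add(p)
                names = {}
                proto[rel].add(tuple(
                    names.setdefault(i, "c%d" % next(_fresh)) for i in p))
    return proto

# R's two tuples share the pattern (0, 0, 1), so prot(B) keeps a single
# fact of the form R(c1, c1, c2).
print(prototypical_instance({"R": {(1, 1, 2), (3, 3, 4)}}))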
If we tried to extend the mapping prot to apply to instances in the obvious way, we would end up with an instance isomorphic to B, of the same size. Rather, we would like to capture and create a "prototype" of just the patterns of repeated values in the tuples of B. We will use boolean queries with varying repeated-variable patterns in order to recognize the repeated patterns of constants in an instance. A pattern of variables for a relation R of arity n is a tuple x⃗ of (possibly repeated) variables of cardinality n. Using this notion we subsequently define the prototypical version of an instance (although we use the same function prot over instances, we overload its semantics as explained in Def. 19).

Definition 19. For all instances B, we define the prototypical instance for B, prot(B), as follows. For all relations R of B, for all possible patterns of variables x⃗ for R, if the boolean query q() ← R(x⃗) is true on B, then R(prot(x⃗)) ∈ prot(B). Moreover, each time we call prot on the variables of a fact of R in order to create the prototypical version of the fact, we map these variables to new elements of C.

A prototypical instance involves identifying the unique patterns of constants in the facts of our original instance and creating a unique prototypical isomorphic fact for each such occurrence. For example, if B contains the facts R(a, b), R(c, d) and R(e, e), then prot(B) contains R(c_1, c_2), covering the first two facts, and R(c_3, c_3), covering the third. This construction can be performed in a preprocessing phase, prior to any query or constraints. Intuitively, the prototypical version of an instance relies on "freezing" the schema relations of our database, creating a first set of facts, and adding facts to account for more "restrictive" tuples, i.e., tuples in our original database that contain repeated values.
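Definition 19 can be read operationally: scan each relation, record the equality pattern of each tuple, and emit one fresh prototypical fact per distinct pattern. A minimal sketch under the same illustrative conventions as above, with B modeled as a dict from relation names to lists of tuples (our modeling, purely for illustration):

```python
from itertools import count

def prot_instance(B):
    """Sketch of Definition 19. For each relation R and each distinct
    equality pattern among R's tuples (e.g. (a,b) and (c,d) share the
    pattern (0,1), while (e,e) has (0,0)), emit one fact over fresh
    prototypical constants exhibiting the same pattern."""
    fresh = (f"c{i}" for i in count(1))
    proto = {R: [] for R in B}
    for R, tuples in B.items():
        seen = set()
        for tup in tuples:
            firsts = {}
            pattern = tuple(firsts.setdefault(v, len(firsts)) for v in tup)
            if pattern in seen:
                continue            # this repetition pattern is covered
            seen.add(pattern)
            # fresh prototypical constants, reused across repeated positions
            names, fact = {}, []
            for p in pattern:
                if p not in names:
                    names[p] = next(fresh)
                fact.append(names[p])
            proto[R].append(tuple(fact))
    return proto

# e.g. prot_instance({'R': [('a','b'), ('c','d'), ('e','e')]})
#      -> {'R': [('c1','c2'), ('c3','c3')]}
```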
Before we give the main result of this section, we have to extend all notions involving homomorphisms discussed so far in order to account for prototypical constants. In particular, a homomorphism h (such as a containment mapping or an answering homomorphism) must map a prototypical constant onto a prototypical constant (not necessarily the same), and for c_1, c_2 ∈ C with c_1 ≠ c_2, h(c_1) ≠ h(c_2).

Given a conjunctive query q, an instance B, and a set of linear TGDs Σ, we can define the prototypical query contraction for q as contract_k(prot(q), prot(B), Σ). Note that this query contraction on prot(B) is much easier to compute than that of the original query on B, since it involves chasing the much smaller prototypical instance. Next, we prove that if we substitute the original term names into the prototypical contraction we get a rewriting equivalent to the original query contraction.

Theorem 19. For all database instances B, for all sets of linear TGDs Σ, for all conjunctive queries q, for all k ≥ 1, contract_k(q, B, Σ) ≡ prot⁻(contract_k(prot(q), prot(B), Σ)), where prot⁻ is the identity on undefined inputs.

Proof. (⇒) For all k ≥ 1, let q_k ∈ contract_k(q, B, Σ). This means that there is a homomorphism h_q : q_k → p-chase_k(B, Σ). For all s_i ∈ dec(h_q(q_k)), that is, for every connected component of the Gaifman graph of the image h_q(q_k), consider the "root" atom r_i ∈ s_i, i.e., the atom r_i for which, for all atoms b ∈ s_i with r_i ≠ b, level(r_i) ≤ level(b) (this must be a unique atom for each tree s_i). It holds that p-chase_k(prot(r_i), Σ) is isomorphic and equivalent to a subgraph of p-chase_k(prot(B), Σ). Abusing notation, we will write p-chase_k(prot(r_i), Σ) ⊆ p-chase_k(prot(B), Σ). Also, there is an isomorphism f_i from s_i to p-chase_k(prot(r_i), Σ), s.t. the graph f_i(s_i) is a tree, subgraph of the graph of p-chase_k(prot(r_i), Σ), and for all terms t ∈ terms(q_k), f_i(h_q(t)) is a prototypical constant iff h_q(t) is a constant (iff t is a constant or a distinguished variable); else, f_i(h_q(t)) is a labeled null. Notice that there might be two different such mappings f_i and f_j s.t. f_i(c) ≠ f_j(c) for a particular constant c ∈ const(h_q(q_k)), but c can be considered a different "node" in each graph in which it appears (s_i or s_j). Hence, skipping some technical details, we can consider the "union" f of all the f_i, which is a graph isomorphism. This isomorphism intuitively constructs a prototypical graph for the image of the query, but "breaks down" joins on constants across atoms in this image, constructing different prototypical constants (however, repetitions of constants within single atoms are maintained when substituting with prototypical constants). We can define a homomorphism h_p from the terms of prot(q_k) to ⋃_i p-chase_k(prot(r_i), Σ) by mapping every term t_p ∈ prot(q_k) to f(h_q(prot⁻(t_p))). Such a homomorphism exists, since all prototypical constants c in prot(q_k) came from query constants, and hence (1) prot⁻(c) and h_q(prot⁻(c)) are constants, and f(h_q(prot⁻(c))) is a prototypical constant, and (2) ∀ prototypical constants c_1, c_2 of prot(q_k) with c_1 ≠ c_2, prot⁻(c_1) ≠ prot⁻(c_2) ⇒ h_q(prot⁻(c_1)) ≠ h_q(prot⁻(c_2)) ⇒ f(h_q(prot⁻(c_1))) ≠ f(h_q(prot⁻(c_2))). Moreover, h_p provides a homomorphism from prot(q_k) to p-chase_k(prot(B), Σ). So, ∃ q′ ∈ contract_k(prot(q), prot(B), Σ) that adheres to Def. 18, and |q′| ≤ k. Also, there is no pruning or substitution with an hca atom that happens for q′, and by Def. 18, q′ = φ(h_p(prot(q_k))). We subsequently prove that φ = h_p⁻. The second bullet of Def. 18 does not apply for φ in our case, since there are no new terms in the query apart from the image of h_p. The third bullet does not apply either, as: (1) for different prototypical constants we have different images by definition, and (2) suppose that for two labeled nulls x, y in the terms of h_p(prot(q_k)), x = y; that is, for x_1, y_1 ∈ terms(q_k), x = h_p(prot(x_1)) = y = h_p(prot(y_1)). It holds that x = h_p(prot(x_1)) = f(h_q(prot⁻(prot(x_1)))) = f(h_q(x_1)) and y = f(h_q(y_1)). So f(h_q(x_1)) = f(h_q(y_1)), and since f is an isomorphism on labeled nulls, h_q(x_1) = h_q(y_1). This means that x_1 = y_1, as by construction of q_k ∈ contract_k(q, B, Σ) the corresponding φ function would have equated x_1 and y_1 in q_k. Hence, q′ = φ(h_p(prot(q_k))) = h_p⁻(h_p(prot(q_k))) = prot(q_k) ⇒ prot⁻(q′) = prot⁻(prot(q_k)) = q_k. Hence, for every q_k ∈ contract_k(q, B, Σ), q_k = prot⁻(q′) also belongs in prot⁻(contract_k(prot(q), prot(B), Σ)). Therefore, for all k ≥ 1, contract_k(q, B, Σ) ⊆ prot⁻(contract_k(prot(q), prot(B), Σ)).

(⇐) Let t ∈ prot⁻(contract_k(prot(q), prot(B), Σ))(chase(B, Σ)). This means that t ∈ prot⁻(contract_k(prot(q), prot(B), Σ))(p-chase_k(B, Σ)). Therefore there exists a query q″ ∈ prot⁻(contract_k(prot(q), prot(B), Σ)) s.t. ∃ hom. h : q″ → p-chase_k(B, Σ) with h(head(q″)) = t. There is a query q′ ∈ contract_k(prot(q), prot(B), Σ) s.t. q″ = prot⁻(q′). Moreover, q′ must have been created, by Def. 18, due to a hom. h_p of the prototypical version of the original query q, h_p : prot(q) → chase(prot(B), Σ), s.t.
q′ = φ({h_p(prot(q) \ prot(s))} ∧ hca(frontier_k(π(h_p(prot(s)))))), where φ and prot(s) are as defined in Def. 18. Hence there is an answering homomorphism from q″ = prot⁻(φ(h_p(prot(q) \ prot(s)) ∧ hca(frontier_k(π(h_p(prot(s))))))) to p-chase_k(B, Σ) that binds the distinguished variables to t. Also, the query prot⁻(φ(h_p(prot(q)) ∧ hca(frontier_k(π(h_p(prot(s))))))) has a homomorphism to chase(B, Σ) that binds the distinguished variables to t, since hca(frontier_k(π(h_p(prot(s))))) is a chase ancestor of h_p(prot(s)) and any homomorphism from q″ would be extensible to cover the latter. Since all distinguished variables are in prot⁻(φ(h_p(prot(q)))), we can "drop" the hca atom and consider q‴ = prot⁻(φ(h_p(prot(q)))), from which there is a homomorphism h_a to chase(B, Σ) s.t. h_a(head(q‴)) = t. Since φ(h_p(·)) will only equate some of the terms in prot(q), it holds that q‴ ⊆ q. Also, t ∈ q‴(chase(B, Σ)) and hence t ∈ q(chase(B, Σ)), that is, t ∈ certain(q, B, Σ), and by Theorem 18 it holds that t ∈ contract_k(q, B, Σ)(chase(B, Σ)). Thus, we proved that prot⁻(contract_k(prot(q), prot(B), Σ)) ⊆ contract_k(q, B, Σ).

Corollary 20. For all database instances B, for all sets of linear TGDs Σ, for all conjunctive queries q, certain(q, B, Σ) = prot⁻(contract_k(prot(q), prot(B), Σ))(p-chase_k(q, B, Σ)), where prot⁻ is the identity on undefined inputs.

6.2.1 Rewriting size

Consider an existentially connected query q. It is apparent that in order to compute the contraction of q, we need to find an answering homomorphism of prot(q) onto a single chase tree of p-chase_{|q|}(prot(B), Σ). If every fact in prot(B) is a root in this chase forest, we have at most |prot(B)| chase trees (in practice, due to loops in the constraint reasoning, this number will be smaller). For all homomorphisms h_i from q to a tree t ⊆ p-chase_{|q|}(prot(B), Σ), we produce a rewriting q_i by substituting the atoms that exceed the frontier predicates with the parents of the latter. For a specific tree t, the number of query atoms mapped below the frontiers is bounded by the number of subsets of the query, 2^|q|. For each such subset s, the number of different mappings the elements of s can have below the frontier atoms is determined by the pattern of the lowest common ancestor a_s of the frontier atoms (since it is a_s that determines the parent predicates). The lowest common ancestor a_s should "generate" all the joins of the covered atoms in s, as well as propagate any joined terms from the rest of the query, q \ s, which is mapped above the frontiers. The number of joined terms shared between s and q \ s is at most the maximum arity w of any atom. Hence a_s can have w! different patterns containing the w joined terms, so the number of rewritings that s can create when mapped on t is w!. Hence the number of rewritings for a single chase tree t is w! × 2^|q|, and the total number of rewritings for q is |prot(B)| × w! × 2^|q|. In fact, we can further improve this bound by considering that homomorphisms on different trees for which a_s is the same atom will create the same rewriting. This latter number is bounded by the maximum number r of ancestors any atom can have. Thus, an upper bound for the number of rewritings is also r × w! × 2^|q|.
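To get a feel for these bounds, a short computation comparing r × w! × 2^|q| with the r^|q| bound of the combined approach discussed next; the sample values are ours, purely illustrative.

```python
from math import factorial

def our_bound(r, w, q):      # r x w! x 2^|q|   (Section 6.2.1)
    return r * factorial(w) * 2 ** q

def combined_bound(r, q):    # r^|q|, the combined approach of [82]
    return r ** q

# illustrative numbers: r = 10 distinct ancestors, binary atoms (w = 2)
for q in (3, 5, 8):
    print(q, our_bound(r=10, w=2, q=q), combined_bound(r=10, q=q))
# |q|=3: 160 vs 1000; |q|=5: 640 vs 100000; |q|=8: 5120 vs 100000000
```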
Our rewriting is reminiscent of query expansion [35], where the algorithm exhaustively substitutes pieces of a query with an ancestor, creating a rewriting, and then repeats until no new queries are created. It is also similar to a part of the combined approach proposed by [82], which considers only unary and binary query predicates, and where a preprocessing step substitutes all binary query predicates with all their ancestors in all possible ways. The latter approach creates the smallest UCQ rewriting of all related approaches, with size r^|q|, where r is the maximum number of distinct ancestors of each atom. However, that approach focuses on the DL-Lite language, which contains only binary predicates. To the best of our knowledge, there is no combined approach designed for more expressive languages with atoms of arbitrary arity, as we consider here. In the case of binary predicates the size of our rewriting becomes 2r × 2^|q|, which is considerably smaller than r^|q|.

6.3 Discussion & Related work

For a discussion of the chase algorithm see Sect. 5.6. Query rewriting has been extensively studied in the context of query optimization [40], database design [133, 127, 134, 125], and data integration and exchange [70, 55]. Most of the research has considered source and global schemas over the relational [70, 85] and XML [108, 100, 37] data models. More recently, there has been significant interest in ontology-based data integration (OBDI), that is, approaches that incorporate intensional knowledge on the global schema, as specific types of constraints that provide a good trade-off between expressiveness and computational complexity [36, 111, 85, 109, 31, 32, 55, 56, 46]. This problem has also been studied as answering queries using views under dependencies [68, 5], or computing certain answers under constraints [55, 4, 46].

Query answering under expressive sets of constraints has also been studied in the scope of the Carin family [91, 93] of languages, which combine conjunctive queries, datalog rules, and description logics. The Carin language has also been employed in the context of data integration [86]. The description logics in the Carin framework include, among other constructs, universal, existential and number restrictions. These are coupled with datalog rules that can contain description-logic concepts in their antecedents. The Carin framework focuses on examining the interaction of these elements in a language such that decidability is maintained [90]. For several combinations of these elements, Carin provides customized query answering algorithms, which combine forward chaining (the algorithm generates a set of propagation rules that produce a set of "completions", reminiscent of the chase procedure) and query resolution (which is reminiscent of query rewriting).

Most techniques for query answering under non-chase-terminating constraints deal with pure query expansion/rewriting; that is, they rewrite the original user query and the constraints into a new query that can be directly evaluated over the data. Approaches and systems in this category include [111, 16, 117, 115, 44, 76, 66, 110, 109, 54]. All the approaches that aim at constructing pure FO rewritings suffer from a large exponential blowup. As we discussed in Section 6.2.1, there have been combined approaches that preprocess the data and then rewrite the query. The approach in [82, 97] suggests the reuse of labeled nulls when applying a chase rule. This creates a finite instance, which, however, can give spurious answers to queries. To overcome this, the rewriting at query time includes parts that explicitly filter out results that are not certain answers.
As discussed, using the pruned chase we produce an exponentially shorter rewriting than the aforementioned approach (which, in addition, is limited to unary and binary predicates). A slightly different idea than the one in [82] is suggested in [96], where the filtering of the spurious answers takes place in a user-defined function inside an RDBMS. This function runs in polynomial time, but the system might still have to run it exponentially many times. We have presented a combined approach to query rewriting which produces smaller runtime rewritings than related approaches. To the best of our knowledge, it remains an open problem whether there can exist a combined approach that has polynomial rewritings.

7 Conclusions and Future Work

In this thesis we presented several solutions for scalable data integration, data exchange, query answering and query containment under constraints. First, we presented GQR, a scalable query rewriting algorithm that computes maximally-contained rewritings in an incremental, bottom-up fashion. Using a graph perspective of queries and views, it finds and indexes common patterns in the views, making rewriting more efficient. Optionally, this view preprocessing/indexing can be done offline, thereby speeding up online performance even more. Our bottom-up rewriting process has a fail-fast behavior. In our experiments, GQR is about 2 orders of magnitude faster than the state of the art and scales up to 10000 views.

Moreover, we have presented MGQR, a scalable algorithm for multiple query rewriting that exploits common patterns across both queries and source descriptions. Our initial experiments are promising, and demonstrate that MGQR can take advantage of the overlap in the user queries. Although multi-query processing has been studied in traditional relational database systems [120], relevant algorithms and systems in data integration have focused on rewriting a single user query using the views. MGQR is particularly useful in the area of ontology-based data integration, where the problem is usually divided into two separate phases. The first phase leverages techniques from the area of query answering under constraints, ignoring the integration setting and assuming a centralized and incomplete (violating the constraints) database underneath. In the second phase, the integration proper occurs. For example, in [111] the user writes a query over a global schema/ontology. Then, the system rewrites the original query by compiling the inferences embodied in the ontology into an expanded query, which is a UCQ, still over the global ontology terms, whose size may be exponential in the worst case. Then, the integration system will rewrite this UCQ into another UCQ that uses only the source relations. In all these cases, it is important to efficiently process multiple queries over large numbers of sources.

In the face of target constraints, we first studied the problem of query containment. Depending on the constraint language used, different techniques are used in order to implement query answering and containment. These techniques involve preprocessing of the data, the views or the query. We studied constraints known as tuple-generating dependencies (TGDs) [1, 24], for which query answering is in general undecidable [1, 24]. There are, however, syntactic restrictions of TGDs that ensure decidability (and in some cases, good complexity) of query answering (and query rewriting).
Given the great practical significance of the problem, considerable effort has been devoted to finding tractable classes of queries and constraints. In particular, for the case of containment, most efforts focus on identifying syntactic restrictions of queries for which polynomial-time algorithms for containment, equivalence and minimization exist [9, 72, 27]. However, less attention has been paid to developing optimized algorithms for conjunctive query containment per se, which is the focus of our work. We presented a solution that decides containment between unions of hundreds of queries, in the presence of chase-terminating constraints (in particular, LAV weakly acyclic TGDs), in seconds, outperforming the brute-force approaches by two orders of magnitude.

For all cases of chase-terminating constraints, we made a core contribution to all solutions that employ the chase algorithm, by introducing the frugal chase, an algorithm that runs with the same data complexity as the standard chase but produces much smaller output (universal solutions). This output is homomorphically equivalent to the standard chased universal solution. In addition, in virtual data integration it is not possible to transfer all data to the mediator (as in a data exchange setting [55]), create a data warehouse and then run the chase over the newly created mediator database. Rather, for the case of LAV weakly acyclic TGDs, we chose to "chase" the views; that is, we compile the ontology inferences into the logical formulas describing the sources. We then proceed (in the second phase of OBDI, as described above) to rewrite the query using the new set of views. We are aware of only one work that employed this approach [5], using an off-the-shelf query rewriting algorithm for LAV views for the same language of constraints, which we outperformed by orders of magnitude. Moreover, for that particular setting, our frugal chase algorithm produced chased views whose size is very close to the minimal ones (the core universal solutions). In order to run the frugal chase on the views, we devised a graph-based frugal chase algorithm which outperforms the standard chase by close to two orders of magnitude, and the core chase by almost three.

The chase algorithm does not always terminate (even for decidable TGD languages). For example, under DL-Lite/OWL 2 QL [36] axioms (a decidable language mostly expressible in TGDs) the chase does not terminate (i.e., it is infinite). For such languages there are other reasoning algorithms (e.g., perfect reformulation [36]) for query answering and containment. In other languages (such as guarded TGDs [34] and variants of Datalog+/- [31]), only a finite part of the infinite chase is needed, whose size depends on the query. In our last chapter we tackled the problem of the infinite chase by introducing the notion of the pruned chase, at a certain chase block, and that of query contraction, within a certain block. For the language of linear (or LAV) TGDs, which captures a significant part of the aforementioned languages, we proved that the query contraction is a rewriting of the original query that gives the certain answers when evaluated on the pruned chase, and that it is exponentially smaller than all related approaches.

We believe that GQR/MGQR, the frugal chase, and our technique for query contraction in the face of infinite chases open several areas of future work.
Currently, GQR, in its graph combination phase, checks whether graphs are compatible and, if so, generates the logical form of the partial rewritings. However, this may lead to wasted effort if a particular combined graph eventually reaches a dead end (fails to combine later). It would be interesting to explore the efficiency gains of running GQR graph combination to completion, without explicitly computing partial rewritings, and then computing the rewritings in a second pass over the final graph (in a "lazy" fashion). Also, the current algorithm picks an arbitrary order for the bottom-up combination of CPJs. One could investigate heuristic orderings that could speed up the process even further, by failing faster and/or computing a smaller number of intermediate partial rewritings. We also believe that there is a lot of potential in exploring the phase transitions in the space of rewritings, in investigating the nature of minimal rewritings, and in encoding additional minimizations.

We would like to point out the possibility of extending our query/view language to include interpreted predicates. This could happen along the lines of [135, 69], which showed that proving containment of queries with interpreted predicates can be divided into two reasoning tasks: (1) proving containment of the ordinary predicates, and (2) proving a formula over the theory of the interpreted predicates. For example, if the interpreted predicates are arithmetic comparisons, the formula may be to check whether x ≤ y ∧ y ≤ z ⇒ x ≤ z is true. We think this approach can be adapted to work in an incremental fashion during graph combination in GQR. We also believe that we can use the insights of our graph-based representation to define a novel SAT encoding that may lead to even faster algorithms, given the significant optimization effort invested in state-of-the-art SAT solvers.

Regarding the frugal chase, it is interesting to explore how our algorithm can be designed to produce core universal solutions. The size of the output of the frugal chase, for our set of experiments, is very close to the core, without paying the minimization cost that the core chase has to pay. In our opinion, extending our solution to produce the core with less effort than full minimization is a very promising direction for future research.

We envision two future directions regarding our work on the infinite chase and query contraction. First, our query contraction technique uses ideas from the query-dependent finite chase approach of [34]. In particular, in order to compute the query contraction for the case of LAV TGDs, we chase a prototypical database which is smaller but representative of our original database. We believe that the same technique can be used to answer queries on a pruned chase instance for richer constraint languages, such as guarded TGDs [34], for which the query-dependent finite chase is also an applicable technique. Second, in the problem of virtual data integration under these kinds of constraints (non-chase-terminating, but FO-rewritable), it is possible that one can find a set of mappings that encapsulate the constraints and give the maximally-contained rewritings of the query. Most probably, one could use the chase algorithm again (as in the case of VDI under LAV weakly acyclic constraints) to come up with this new set of mappings, which however might be infinite. Employing our ideas of query contraction could allow us to prune these infinite mappings, just as we did with the infinite universal solution, and reduce the problem to query answering using views.
Note that what lies behind our compact graph representation is a very efficient algorithm for computing homomorphisms between multiple (overlapping) formulas. Homomorphisms play an important role in database theory as well as in knowledge representation. In areas like automated reasoning and logic programming, forward-chaining reasoning relies on unification between atoms. Unifications are renamings of variables, strongly related to homomorphisms; in fact, a unification degenerates into a homomorphism when one of the unified atoms is a ground atom. Forward chaining in this "data-oriented" context relies on looking for homomorphisms from the antecedents of our rules to our data. Our modeling can be of valuable use to this and other classical AI techniques. Our graph encoding of queries and views is perfectly fitted for capturing overlapping parts of multiple rules. We have already exhibited this multiple times: for multiple views and multiple queries, as well as for constraints. We believe that, as in the case of UCQ answering using views, our compact representation of queries could be reusable in different problems across databases, knowledge representation and logic. It is interesting to explore how to apply our modeling to efficiently solving problems such as UCQ answering, containment, and minimization under richer classes of constraints, as well as multiple-rule resolution (evaluation) of logic programs, etc. In particular, we would like to connect our results with other closely related approaches such as hypergraph decomposition [1] and piece resolution [21].

Our graph modeling in this thesis is reminiscent of, and in part inspired by, Sowa's Simple Conceptual Graphs (SGs) [122], which are used to represent entities and relations between them, and can be translated to conjunctive queries. Sound and complete forward-chaining reasoning in the language of SGs (and conceptual graph rules, or CGs) is obtained through a kind of graph homomorphism called projection [42]. Our modeling is much more compact than SGs and also allows for incremental and modular homomorphism detection. In the context of databases, homomorphism-based forward chaining has been studied under the chase, an important tool used for several interesting inference tasks such as querying incomplete data, data integration and data exchange under constraints, which we elaborated on and improved upon in this thesis. Hence, we believe that our graph-based modeling and our chase-related results can be beneficial to multiple areas of Artificial Intelligence and Databases. Moreover, our approach has the potential to move much of the complexity of the related problems to an offline preprocessing phase, since sources and their schema mappings, as well as constraints, are usually known a priori. We believe that further investigation to connect these results could provide a deeper insight into the ongoing "unifying" effort between Databases and AI.

The results of this thesis have been published in [78, 79, 77, 80, 81, 12].

Bibliography

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] Serge Abiteboul and Oliver M. Duschka. Complexity of answering queries using materialized views. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 254–263, Seattle, Washington, 1998.
[3] Sibel Adali, Kasim Selcuk Candan, Yannis Papakonstantinou, and V. S. Subrahmanian. Query caching and optimization in distributed mediator systems. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(2):137–148, June 1996.
[4] F. N. Afrati and N. Kiourtis. Computing certain answers in the presence of dependencies. Information Systems, 35(2):149–169, 2010.
[5] Foto Afrati and Nikos Kiourtis. Query answering using views in the presence of dependencies. New Trends in Information Integration (NTII), pages 8–11, 2008.
[6] Foto N. Afrati and Phokion G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In Procs. ICDT, pages 31–41, 2009.
[7] Parag Agrawal, Anish Das Sarma, Jeffrey Ullman, and Jennifer Widom. Foundations of uncertain-data integration. Proceedings of the VLDB Endowment, 3(1):1–24, 2010.
[8] Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman. The theory of joins in relational databases. ACM Transactions on Database Systems (TODS), 4(3):297–314, 1979.
[9] Alfred V. Aho, Yehoshua Sagiv, and Jeffrey D. Ullman. Efficient optimization of a class of relational expressions. ACM Transactions on Database Systems (TODS), 4(4):435–454, 1979.
[10] Bogdan Alexe, Wang-Chiew Tan, and Yannis Velegrakis. STBenchmark: towards a benchmark for mapping systems. Proceedings of the VLDB Endowment, 1(1):230–244, 2008.
[11] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang Chiew Tan. Designing and refining schema mappings via data examples. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2011), pages 133–144, Athens, Greece, 2011.
[12] Jose Luis Ambite, Marcelo Tallis, Kathryn Alpert, David B. Keator, Margaret King, Drew Landis, George Konstantinidis, Vince D. Calhoun, Steven G. Potkin, Jessica A. Turner, et al. SchizConnect: Virtual data integration in neuroimaging. In Data Integration in the Life Sciences, pages 37–51. Springer, 2015.
[13] Marcelo Arenas, Jorge Pérez, Juan Reutter, and Cristian Riveros. Composition and inversion of schema mappings. ACM SIGMOD Record, 38(3):17, December 2010.
[14] Marcelo Arenas, Jorge Pérez, Juan L. Reutter, and Cristian Riveros. Foundations of schema mapping management. In PODS '10: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 227–238, New York, NY, USA, 2010. ACM.
[15] Patricia C. Arocena, Ariel Fuxman, and Renée J. Miller. Composing local-as-view mappings: closure and applications. In Proceedings of the 13th International Conference on Database Theory, pages 209–218. ACM, 2010.
[16] A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev. The DL-Lite family and relations. J. of Artificial Intelligence Research, 36:1–69, 2009.
[17] Yolifé Arvelo, Blai Bonet, and María Esther Vidal. Compilation of query-rewriting problems into tractable fragments of propositional logic. In AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence, pages 225–230. AAAI Press, 2006.
[18] F. Baader. Unification theory. Word Equations and Related Topics, pages 151–170, 1992.
[19] L. Bachmair and H. Ganzinger. Resolution theorem proving. Handbook of Automated Reasoning, 1:19–99, 2001.
[20] J.-F. Baget, M. Leclère, M.-L. Mugnier, et al. Walking the decidability line for rules with existential variables. In Proc. 12th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR'10), pages 466–476, 2010.
[21] J.-F. Baget, M. Leclère, M.-L. Mugnier, E. Salvat, et al. Extending decidable cases for rules with existential variables. In Proc. of IJCAI, volume 9, pages 677–682, 2009.
[22] J.-F. Baget, M.-L. Mugnier, and M. Thomazo. Towards farsighted dependencies for existential rules. Web Reasoning and Rule Systems, pages 30–45, 2011.
[23] C. Beeri and M. Vardi. The implication problem for data dependencies. Automata, Languages and Programming, pages 73–85, 1981.
[24] C. Beeri and M. Y. Vardi. A proof procedure for data dependencies. Journal of the ACM (JACM), 31(4):718–741, 1984.
[25] Zohra Bellahsene, Angela Bonifati, and Erhard Rahm. Schema Matching and Mapping. Springer, 1st edition, 2011.
[26] Jiten Bhagat, Franck Tanoh, Eric Nzuobontane, Thomas Laurent, Jerzy Orlowski, Marco Roos, Katy Wolstencroft, Sergejs Aleksejevs, Robert Stevens, Steve Pettifer, Rodrigo Lopez, and Carole A. Goble. BioCatalogue: a universal catalogue of web services for the life sciences. Nucleic Acids Research, 38(suppl 2):W689–W694, 2010.
[27] Joachim Biskup, Pratul Dublish, and Yehoshua Sagiv. Optimization of a subclass of conjunctive queries. Acta Informatica, 32(1):1–26, 1995.
[28] Dan Brickley and R. V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation 10 February 2004. http://www.w3.org/TR/rdf-schema/, 2004.
[29] Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. WebTables: exploring the power of tables on the web. Proc. VLDB Endow., 1:538–549, August 2008.
[30] A. Calì, G. Gottlob, and M. Kifer. Taming the infinite chase: Query answering under expressive relational constraints. Proc. of KR, pages 70–80, 2008.
[31] A. Calì, G. Gottlob, T. Lukasiewicz, B. Marnette, and A. Pieris. Datalog+/-: A family of logical knowledge representation and query languages for new applications. In Logic in Computer Science (LICS), 2010 25th Annual IEEE Symposium on, pages 228–242. IEEE, 2010.
[32] A. Calì, G. Gottlob, T. Lukasiewicz, and A. Pieris. A logical toolbox for ontological reasoning. SIGMOD Record, 40(3):5, 2011.
[33] Andrea Calì, Georg Gottlob, and Michael Kifer. Taming the infinite chase: Query answering under expressive relational constraints. JAIR, 48:115–174, 2013.
[34] Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz. A general datalog-based framework for tractable query answering over ontologies. Web Semantics: Science, Services and Agents on the World Wide Web, 14:57–83, 2012.
[35] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. DL-Lite: Tractable description logics for ontologies. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 602. AAAI Press, 2005.
[36] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning, 39(3):385–429, 2007.
[37] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Answering regular path queries using views. In Proceedings of the 16th IEEE International Conference on Data Engineering, pages 389–398, San Diego, CA, 2000.
[38] Ashok K. Chandra, Harry R. Lewis, and Johann A. Makowsky. Embedded implicational dependencies and their inference problem. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, STOC '81, pages 342–354, New York, NY, USA, 1981. ACM.
[39] Ashok K. Chandra and Philip M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In Proceedings of the 9th ACM Symposium on Theory of Computing (STOC), pages 77–90, Boulder, Colorado, 1977.
[40] Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. Optimizing queries with materialized views. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, pages 190–200. IEEE, 1995.
[41] Surajit Chaudhuri and Moshe Y. Vardi. On the equivalence of recursive and nonrecursive datalog programs. In Proceedings of the Eleventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 55–66, San Diego, CA, 1992.
[42] Michel Chein and Marie-Laure Mugnier. Conceptual graphs: Fundamental notions. In Revue d'intelligence artificielle, 1992.
[43] Michel Chein and Marie-Laure Mugnier. Graph-based Knowledge Representation: Computational Foundations of Conceptual Graphs. Springer Publishing Company, Incorporated, 2008.
[44] Alexandros Chortaras, Despoina Trivela, and Giorgos Stamou. Optimized query rewriting for OWL 2 QL. In Automated Deduction – CADE-23, pages 192–206. Springer, 2011.
[45] Stavros S. Cosmadakis and Paris C. Kanellakis. Functional and inclusion dependencies: a graph theoretic approach. In Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, PODS '84, pages 29–37, New York, NY, USA, 1984. ACM.
[46] A. Deutsch, L. Popa, and V. Tannen. Query reformulation with constraints. ACM SIGMOD Record, 35(1):65–73, 2006.
[47] Alin Deutsch, Alan Nash, and Jeff Remmel. The chase revisited. In PODS, pages 149–158, 2008.
[48] Anhai Doan, Pedro Domingos, and Alon Halevy. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning, 50:279–301, March 2003.
[49] Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, and Jun Zhang. Similarity search for web services. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 372–383. VLDB Endowment, 2004.
[50] Oliver M. Duschka. Query Planning and Optimization in Information Integration. PhD thesis, Department of Computer Science, Stanford University, 1997.
[51] Oliver M. Duschka and Michael R. Genesereth. Answering recursive queries using views. In PODS '97: Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 109–116, New York, NY, USA, 1997. ACM.
[52] Oliver M. Duschka and Michael R. Genesereth. Query planning in Infomaster. In 12th ACM Symposium on Applied Computing, San Jose, CA, 1997.
[53] Oliver M. Duschka, Michael R. Genesereth, and Alon Y. Levy. Recursive query plans for data integration. Journal of Logic Programming, 43(1):49–73, 2000.
[54] Thomas Eiter, Magdalena Ortiz, Mantas Simkus, Trung-Kien Tran, and Guohui Xiao. Query rewriting for Horn-SHIQ plus rules. In AAAI, 2012.
[55] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.
[56] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM Transactions on Database Systems (TODS), 30(1):174–210, 2005.
[57] Ronald Fagin, Laura M. Haas, Mauricio A. Hernández, Renée J. Miller, Lucian Popa, and Yannis Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications - Essays in Honor of John Mylopoulos, pages 198–236, 2009.
[58] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. Answering queries using OQL view expressions. In Workshop on Materialized Views: Techniques and Applications, pages 84–90, Montreal, Canada, June 1996. SIGMOD.
[59] Marc Friedman, Alon Y. Levy, and Todd D. Millstein. Navigational plans for data integration. In AAAI, pages 67–73, 1999.
[60] H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. Integrating and accessing heterogeneous information sources in TSIMMIS. In Proceedings of the AAAI Symposium on Information Gathering, pages 61–64, March 1995.
[61] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2):117–132, 1997.
[62] Michael R. Genesereth. Data Integration: The Relational Logic Approach. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[63] G. Gottlob, N. Leone, and F. Scarcello. Hypertree decompositions and tractable queries. In Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 21–32. ACM, 1999.
[64] Georg Gottlob. Computing cores for data exchange: new algorithms and practical solutions. In Proceedings of the 24th ACM PODS, pages 148–159. ACM, 2005.
[65] Georg Gottlob and Alan Nash. Efficient core computation in data exchange. Journal of the ACM (JACM), 55(2):9, 2008.
[66] Georg Gottlob, Giorgio Orsi, and Andreas Pieris. Ontological queries: Rewriting and optimization. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 2–13. IEEE, 2011.
[67] Sergio Greco, Cristian Molinaro, and Francesca Spezzano. Incomplete data and data dependencies in relational databases. Synthesis Lectures on Data Management, 4(5):1–123, 2012.
[68] Jarek Gryz. Query rewriting using views in the presence of functional and inclusion dependencies. Information Systems, 24(7):597–612, 1999.
[69] Ashish Gupta, Yehoshua Sagiv, Jeffrey D. Ullman, and Jennifer Widom. Constraint checking with partial information. In Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 45–55, Minneapolis, Minnesota, 24–26 May 1994.
[70] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
[71] André Hernich, Leonid Libkin, and Nicole Schweikardt. Closed world data exchange. ACM Transactions on Database Systems (TODS), 36(2):14, 2011.
[72] David S. Johnson and Anthony Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. Journal of Computer and System Sciences, 28(1):167–189, 1984.
[73] Anthony Klug. On conjunctive queries containing inequalities. Journal of the ACM, 35(1), January 1988.
[74] Christoph Koch. Query rewriting with symmetric constraints. AI Communications, 17(2):41–55, 2004.
[75] H. Kondylakis and D. Plexousakis. Enabling ontology evolution in data integration. In Proceedings of the 2010 EDBT Workshops, pages 1–7. ACM, 2010.
[76] Mélanie König, Michel Leclère, Marie-Laure Mugnier, and Michaël Thomazo. A sound and complete backward chaining algorithm for existential rules. Springer, 2012.
[77] G. Konstantinidis. Towards scalable data integration under constraints. In Proceedings of the EDBT/ICDT PhD Workshop, 2012.
[78] G. Konstantinidis and J. L. Ambite. Scalable query rewriting: a graph-based approach. In Proceedings of the 2011 International Conference on Management of Data, SIGMOD '11, pages 97–108. ACM, 2011.
[79] George Konstantinidis and José Luis Ambite. Optimizing query rewriting for multiple queries. In Proceedings of the 9th Workshop on Information Integration on the Web (IIWeb 2012), Scottsdale, Arizona, 2012.
[80] George Konstantinidis and José Luis Ambite. Scalable containment for unions of conjunctive queries under constraints. In Proceedings of the Fifth Workshop on Semantic Web Information Management, page 4. ACM, 2013.
[81] George Konstantinidis and José Luis Ambite. Optimizing the chase: scalable data integration under constraints. Proceedings of the VLDB Endowment, 7(14):1869–1880, 2014.
[82] Roman Kontchakov, Carsten Lutz, David Toman, Frank Wolter, and Michael Zakharyaschev. The combined approach to query answering in DL-Lite. In KR, 2010.
[83] M. Krötzsch and S. Rudolph. Extending decidable existential rules by joining acyclicity and guardedness. IJCAI'11, 2011.
[84] Eric Lambrecht, Subbarao Kambhampati, and Senthil Gnanaprakasam. Optimizing recursive information gathering plans. 1999.
[85] Maurizio Lenzerini. Data integration: A theoretical perspective. In Lucian Popa, editor, PODS, pages 233–246. ACM, 2002.
[86] Alon Levy. The Information Manifold approach to data integration. IEEE Intelligent Systems, 13(5):12–16, 1998.
[87] Alon Levy, Anand Rajaraman, and Joann Ordille. Querying heterogeneous information sources using source descriptions. pages 251–262, 1996.
[88] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In PODS, pages 95–104, San Jose, California, 1995.
[89] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Query-answering algorithms for information agents. pages 40–47, 1996.
[90] Alon Y. Levy and Marie-Christine Rousset. The limits on combining recursive Horn rules with description logics. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 1, pages 577–584. AAAI Press, 1996.
[91] Alon Y. Levy and Marie-Christine Rousset. Carin: A representation language combining Horn rules and description logics. 1996.
[92] A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views (extended abstract). In Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 95–104. ACM, 1995.
[93] A. Y. Levy and M.-C. Rousset. Combining Horn rules and description logics in Carin. Artificial Intelligence, 104(1-2):165–209, 1998.
[94] Václáv Lín, Vasilis Vassalos, and Prodromos Malakasiotis. MiniCount: Efficient rewriting of COUNT-queries using views. In 22nd International Conference on Data Engineering (ICDE '06), pages 1–1, 2006.
[95] J. W. Lloyd. Foundations of Logic Programming. Symbolic Computation: Artificial Intelligence, 1987.
[96] Carsten Lutz, Inanç Seylan, David Toman, and Frank Wolter. The combined approach to OBDA: Taming role hierarchies using filters. In The Semantic Web – ISWC 2013, pages 314–330. Springer, 2013.
[97] Carsten Lutz, David Toman, and Frank Wolter. Conjunctive query answering in the description logic EL using a relational database system. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 2070–2075. Morgan Kaufmann Publishers Inc., 2009.
[98] David Maier. The Theory of Relational Databases, volume 11. Computer Science Press, Rockville, 1983.
[99] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. Testing implications of data dependencies. ACM Transactions on Database Systems, 4(4):455–469, December 1979.
[100] Bhushan Mandhani and Dan Suciu. Query caching and view selection for XML databases. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 469–480, Trondheim, Norway, 2005.
[101] Bruno Marnette. Generalized schema-mappings: from termination to tractability. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '09, pages 13–22, New York, NY, USA, 2009. ACM.
[102] Giansalvatore Mecca, Paolo Papotti, and Salvatore Raunich. Core schema mappings. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 655–668. ACM, 2009.
[103] J. C. Mitchell. The implication problem for functional and inclusion dependencies. Information and Control, 56:112–138, 1983.
[104] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, and Carsten Lutz. OWL 2 Web Ontology Language Profiles, W3C Recommendation 27 October 2009. http://www.w3.org/TR/owl-profiles/, 2009.
[105] M.-L. Mugnier. Ontological query answering with existential rules. Web Reasoning and Rule Systems, pages 2–23, 2011.
[106] Alan Nash, Alin Deutsch, and Jeffrey Remmel. Data Exchange, Data Integration, and Chase. TR CS2006-0859, UCSD, 2006.
[107] Adrian Constantin Onet. The chase procedure and its applications. PhD thesis, Concordia University, 2012.
[108] Yannis Papakonstantinou, Vinayak Borkar, Maxim Orgiyan, Kostas Stathatos, Lucian Suta, Vasilis Vassalos, and Pavel Velikhov. XML queries and algebra in the Enosys integration platform. Data and Knowledge Engineering Journal, 44(3):299–322, March 2003.
[109] Héctor Pérez-Urbina, Ian Horrocks, and Boris Motik. Tractable query answering and rewriting under description logic constraints. Journal of Applied Logic, 8(2):186–209, 2010.
[110] Héctor Pérez-Urbina, Edgar Rodríguez-Díaz, Michael Grove, George Konstantinidis, and Evren Sirin. Evaluation of query rewriting approaches for OWL 2. In Joint Workshop on Scalable and High-Performance Semantic Web Systems (SSWS+HPCSW 2012), page 32, 2012.
[111] Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. Linking data to ontologies. Journal on Data Semantics, X:133–173, 2008.
[112] Rachel Pottinger and Alon Halevy. MiniCon: a scalable algorithm for answering queries using views. The VLDB Journal, 2001.
[113] Xiaolei Qian. Query folding. In Proceedings of the 12th International Conference on Data Engineering, New Orleans, Louisiana, February 1996.
[114] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4), December 2001.
[115] Riccardo Rosati. Prexto: Query rewriting under extensional constraints in DL-Lite. In The Semantic Web: Research and Applications, pages 360–374. Springer, 2012.
[116] Riccardo Rosati and Alessandro Almatelli. Improving query answering over DL-Lite ontologies. In Proceedings of the 12th International Conference on the Principles of Knowledge Representation and Reasoning, pages 290–300, Toronto, Canada, 2010.
[117] Riccardo Rosati and Alessandro Almatelli. Improving query answering over DL-Lite ontologies. In Proceedings of the 12th International Conference on the Principles of Knowledge Representation and Reasoning, pages 290–300, Toronto, Canada, 2010.
[118] Y. Sagiv and M. Yannakakis. Equivalences among relational expressions with the union and difference operators. Journal of the ACM (JACM), 27(4):633–655, 1980.
[119] E. Salvat and M.-L. Mugnier. Sound and complete forward and backward chainings of graph rules. In Conceptual Structures: Knowledge Representation as Interlingua, pages 248–262, 1996.
[120] Timos K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, March 1988.
[121] Pavel Shvaiko and Jérôme Euzenat. A survey of schema-based matching approaches. Journal on Data Semantics IV, 3730:146–171, 2005.
[122] John F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. 1983.
[123] Francesca Spezzano and Sergio Greco. Chase termination: A constraints rewriting approach. PVLDB, 3(1):93–104, 2010.
[124] Balder ten Cate, Laura Chiticariu, Phokion Kolaitis, and Wang-Chiew Tan. Laconic schema mappings: Computing the core with SQL queries. PVLDB, 2(1):1006–1017, 2009.
[125] Dimitri Theodoratos, Spyros Ligoudistianos, and Timos Sellis. Designing the global data warehouse with SPJ views. Lecture Notes in Computer Science, 1626:180–194, 1999.
[126] Mary Tork Roth, Manish Arya, Laura M. Haas, Michael J. Carey, William Cody, Ron Fagin, Peter M. Schwarz, John Thomas, and Edward L. Wimmers. The Garlic project. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(2):557–558, 1996.
[127] Odysseas G. Tsatalos, Marvin H. Solomon, and Yannis E. Ioannidis. The GMAP: A versatile tool for physical data independence. VLDB Journal: Very Large Data Bases, 5(2):101–118, April 1996.
[128] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, volume 1. Computer Science Press, Rockville, Maryland, 1988.
[129] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, volume 2. Computer Science Press, Rockville, Maryland, 1989.
[130] Jeffrey D. Ullman. Information integration using logical views. In Proceedings of the Sixth International Conference on Database Theory, pages 19–40, Delphi, Greece, January 1997.
[131] M. Y. Vardi. The complexity of relational query languages. In Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing, pages 137–146. ACM, 1982.
[132] Gio Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25(3):38–49, March 1992.
[133] H. Z. Yang and Per-Åke Larson. Query transformation for PSJ-queries. 1987.
[134] Jian Yang, Kamalakar Karlapalem, and Qing Li. Algorithms for materialized view design in data warehousing environment. pages 136–145, 1997.
[135] Xubo Zhang and Z. Meral Ozsoyoglu. On efficient reasoning with implication constraints. In Proceedings of the Third International Conference on Deductive and Object-Oriented Databases, pages 236–252, Phoenix, AZ, 1993.
[136] Linhua Zhou, Huajun Chen, Yu Zhang, and Chunying Zhou. A semantic mapping system for bridging the gap between relational database and Semantic Web. American Association for Artificial Intelligence (www.aaai.org), 2008.