ON INCREMENTAL UPDATE PROPAGATION BETWEEN OBJECT-BASED DATABASES

by

Ti-Pin Chang

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science)

May 1994

Copyright 1994 Ti-Pin Chang

UMI Number: DP22878. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Ti-Pin (Ben) Chang under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dedication

This dissertation is dedicated to my parents, [names illegible], for their monotonic incremental updates of their expectations on me that brought me this far.

Acknowledgements

I wish to express my deepest appreciation to Professor Richard Hull for serving as my advisor throughout my doctoral study. His inexhaustible patience with my imprecise intuitions and his theoretical meticulousness once the ideas were reified ensured the soundness and completeness of my work.
As an advisor, his enthusiasm toward his subject and his students, and his standards for an “acceptable result”, are more than any graduate student could dream of.

I also wish to express my gratitude to Professor Dave Wile, who accepted me into the Information Sciences Institute (ISI). As the co-advisor, his guidance on the thesis and his inspiration on the implementation issues helped to make the BIRD prototype possible.

I would like to thank the Information Sciences Institute, which provided the financial support and unrestricted use of state-of-the-art facilities for the first two years of my doctoral study. Among the graduate students working at the institute, few are granted the extensive freedom I was allowed in pursuing my thesis research. My thanks also go to everyone in the Software Sciences Division. In particular, many thanks to Dr. Don Cohen for his proofreading of the thesis, his extensive technical support for the implementation, and unreserved academic debates (ranging from the justification of OIDs to my preference of programming language).

Special thanks go to Professor Seymour Ginsburg, who first brought me to the world of precise thinking and mathematical reasoning. His classes transformed me from a down-to-earth engineer with little theoretical training into a Ph.D. whose thesis contains 40 pages of formal proofs and ugly symbols.

Last but not least, I am grateful to my wife, [name illegible], for enduring my night-owl working schedule and standing by me during the darkest days of my thesis writing. Also, her delicious cooking and non-stop logistic support gave me more time to struggle with the lemmas and theorems.

Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Heterogeneous Database Updates
1.1 Problem Statement
1.2 Examples of Semantic Correspondence
1.2.1 Value-Based Correspondence
1.2.2 Introduction of Object Identifiers (OIDs)
1.2.3 Ambiguous Update
1.3 The Proposed Solution
1.3.1 System Architecture
1.3.2 Scenario for Establishing Semantic Correspondence
1.3.3 Methodology to Resolve Ambiguity
1.4 Dissertation Structure
2 Tools Used By BIRD
2.1 AP5, the platform
2.1.1 Data Definition Language
2.1.2 Data Manipulation Language
2.1.3 Active Rules
2.1.4 Semantics of Consistency Rules
2.2 IFO-, the Data Base Model
2.3 ILOG, the Linkage Language
2.3.1 Evaluation of ILOG programs
2.3.2 Physical vs. Logical Instances
2.3.3 Restrictions for nrecILOG-
2.4 POPART, the Compiler
2.4.1 BNF', the grammar description language
2.4.2 Syntax Directed Experts
3 Solution of the Value-Based Case
3.1 Abstract Linkage Description (ALD)
3.2 Compilation of an ALD
3.3 Remote Transaction Control (RemTrans)
3.3.1 Motivation of RemTrans
3.3.2 Managing the Communications
3.4 Socket and RPC, the Communication Mechanism
3.5 Linkage Implicit Constraints (LICs)
4 Uni-directional Propagation of Incremental Updates Involving OID
4.1 Entity Equivalence Specification (EES)
4.2 OID Association Problem
4.3 Augmentation of Source Schema
4.4 Compilation of ILOG Invention Rules
4.5 Machine Dependent Object Translation (MDOT) Mechanism
4.5.1 Translation of Machine Dependent Object
4.5.2 Synthesis of OID Invention and MDOT
4.6 Static vs. Dynamic Semantics in ALD
5 Bi-directional Propagation of Incremental Updates
5.1 Preprocessing of EES in Bi-Directional Cases
5.2 Witness Update Problem
5.3 Definition of Witness Generator
5.4 Logical and Physical Views of a Semantic Correspondence
5.5 Equivalent and Mutually Recursive Classes
5.6 Recapitulation of the BIRD System
6 How to Find Witness Generators
6.1 Overview of Automatic Generation of Witness Generators
6.2 Removal of EES
6.2.1 Well-foundedness of an ALD
6.2.2 Reduction and Expansion Operations
6.3 WGG Expansion History
6.4 The WGG Algorithm
6.5 The Incompleteness of the WGG Algorithm
6.5.1 Well-founded ALD
6.5.2 Non-Well-founded ALD
7 Formal Discussion on WGG Expansion
7.1 Structured WGG Expansion History
7.2 The Soundness of WGG Algorithm
7.3 OID Component
7.4 Decidability of Halting for WGG Expansion
7.5 WGGhalt Algorithm
8 Handling Ambiguity
8.1 Motivating Examples
8.2 Overview on Ambiguity
8.2.1 Ways the Information Can Overlap
8.2.2 Overlapping Schemas and Update Authorities
8.3 Ambiguity Resolving Methodology (ARM)
8.4 Ambiguous Examples Revisited
8.4.1 Partial Overlapping Characterized by Projection
8.4.2 Partial Overlapping Characterized by Selection
8.4.3 More Intricate Sharing of Information
9 Related Research
9.1 Heterogeneous Databases
9.1.1 Federated Database Architecture
9.1.2 Database Schema Integration
9.1.3 WorldBase
9.2 View Update Problem
9.3 Schema Restructuring/Transformation Languages
9.4 Update Propagation for Interoperating Databases
9.4.1 Update Propagation by Transaction Processing
9.4.2 Update Propagation by Active Database Technology
10 Conclusions and Directions for Future Research
Appendix A Syntax of ALD
Appendix B Formal Semantics of nrecILOG- and ALDs
Appendix C Proofs for the EES Removal
Appendix D Detailed Proofs for Chapter 7
Appendix E Detailed Solution of Example “Requisition”
E.1 The ALD for the equivalent subschemas
E.2 Local Experts
E.3 Example Scenarios

List Of Tables

1.1 Road map of the thesis
6.1 The WGG expansion history for the “Segment Flight” example
6.2 The WGG expansion history for the “More than Ga” example
6.3 Expansion history for the “Cross Product” example
6.4 Expansion history of the “Recursive OID Creation” example
7.1 Two homomorphic expansion histories
7.2 Example homomorphism for the two expansion histories

List Of Figures

1.1 Transition between instance pairs in a semantic correspondence
1.2 Schemas and instances of Example “Segment Flight”
1.3 Architecture of the BIRD system
2.1 Example IFO Schema
2.2 Example instance of an IFO schema
2.3 Example skolemize pre-instances
2.4 Resulting instances of two different mappings
3.1 A naive scenario of using “remote-execute”
3.2 Example scenario for phase transitions under RemTrans
3.3 System diagram of the Communication Subsystem
3.4 Linkage Implicit Constraints (LICs)
3.5 Dynamic behaviors of LICs
4.1 Schemas and instances for “City”
4.2 Schemas and instances of Example “Segment Flight”
4.3 The semantics of OID invention rules
4.4 Scenario for incremental OID creation
4.5 Scenarios for incremental OID destruction
4.6 MDOT snapshots for the “Segment-Flight” example
4.7 Scenarios of OID invention under MDOT mechanism
4.8 Dynamic behavior differences for “City” example
5.1 Naive compilation for bi-direction ALD involving OIDs
5.2 Scenario using the “naive” concrete linkages
5.3 Witness pair for the “Segment Flight” example
5.4 Different levels of abstraction in the BIRD system
5.5 Different entity type relationships specified by EES and invention rules
5.6 Refined system architecture for the BIRD system
6.1 SLD Expansion for “Segment Flight”
6.2 Schemas of Example “Itinerary”
6.3 Schemas for the “More than Gra” example
7.1 Schemas and instances for “Enrollment”
7.2 OID components for the “Enrollment” example
7.3 Lemma OID Component Condensation
7.4 Lemma OID Component
7.5 Information transfer between OID components
7.6 Homomorphism Lemma
7.7 MGU Look Ahead Lemma
7.8 Uselessness Lemma
8.1 Schemas and instances for “Employee”
8.2 Schemas for “Requisition/Purchase Order”
8.3 Possible ways of arranging the purchase orders
8.4 Schemas and instances for “Airport”
8.5 Ambiguous Semantic Correspondence
8.6 “Update authorities under BIRD's framework”
8.7 “Schema Surgery”
8.8 Transformation of An Ambiguous Semantic Correspondence
8.9 New Architecture Incorporating Expert Systems and BIRD System
8.10 The “Employee” schemas after “schema surgery”
8.11 The three scenarios shown in “Requisition” with “subquantities”
8.12 The “Requisition” schemas after “schema surgery”
E.1 Scenario of “Don wants 8 more Macs”
E.2 Scenario of “Move 1 Mac from Ben's to Sherry's”

Abstract

With the advent of the “information superhighway”, database interoperation is emerging as one of the most important topics in database research in the 90's. Most previous academic and commercial work in this area (e.g., schema integration, federated databases) has focused on providing read-only access to data in diverse databases. The research in this thesis addresses a fundamentally different issue, that of incrementally propagating updates between databases that hold overlapping information.

A primary focus of the research is on the impact of object identifiers (OIDs). The presence of OIDs complicates the situation because the meaning of an OID is local to its own database, and sometimes the object classes in one database do not correspond directly to the object classes in the second database. Results presented in this thesis show that (1) in the context of uni-directional incremental update propagation involving OIDs, auxiliary witness relations are needed, and (2) for the bi-directional case, maintaining the contents of the witness relations is quite subtle; a mechanism is described in this thesis to construct new rules, called witness generators, that can be used to properly update the witness relations.
This research also develops a prototype system called BIRD (Bi-directional Incremental Revising of Data) that offers one solution to the problem of providing incremental update propagation in the presence of OIDs. BIRD uses a high-level database query language for specifying the correspondence between two databases, and uses active database technology to perform incremental update propagation. One of the major contributions of the BIRD system is the development of the Witness Generator Generator (WGG) algorithm that constructs witness generators from a user-specified Abstract Linkage Description (ALD). This algorithm is based on a variation of SLD-resolution.

A theoretical analysis of the algorithm is also presented in this thesis. The analysis demonstrates that (a) the algorithm is sound and (b) the termination of the algorithm on a given input is decidable.

Chapter 1
Heterogeneous Database Updates

With the advent of the “information highway”, database interoperation is emerging as one of the most important topics in database research in the 90's. Most previous academic and commercial work in this area (e.g., schema integration, federated databases) has focused on providing read-only access to data in diverse databases. The research in this thesis addresses a fundamentally different issue, that of incrementally propagating updates between databases that hold overlapping information.

This research develops a prototype system called BIRD (Bi-directional Incremental Revision of Data) that offers one solution to the problem of providing incremental update propagation. BIRD uses a high-level database query language for specifying the correspondence between two databases, and uses active database technology to perform incremental update propagation. Both practical and theoretical tools were used in the development of the prototype.

Perhaps the most interesting technical issue addressed by BIRD stems from object-orientation.
Object-oriented databases have object identifiers (OIDs) which serve to identify objects uniquely within a database. Their presence complicates the situation because (1) the meaning of an OID is local to its own database, and (2) sometimes the object classes in one database do not correspond directly to the object classes in the second database. BIRD uses an adaptation of SLD-resolution when analyzing user-specified correspondences, in order to support incremental update propagation in this context.

1.1 Problem Statement

In particular, this research concerns the following problem.

Problem Statement: Find mechanisms to:
• describe the semantic correspondence between two databases, and
• maintain that semantic correspondence incrementally.

Desiderata:
object-orientation: The two databases may use entity types, i.e. classes whose members are represented by object identifiers (OIDs).
symmetry: Each database has the authority to make autonomous updates to its own data.

The first part of the problem addressed in this research is to find a way to describe the “semantic correspondence”. Intuitively, a semantic correspondence is a set of corresponding instances of two databases. It serves as a static constraint imposed on the two databases, so that only the “acceptable” instance pairs in the semantic correspondence can co-exist in the two databases at any given time.

Definition: Semantic Correspondence
A semantic correspondence $SC_{AB}$ between two databases A and B is a decidable subset of $Inst(A) \times Inst(B)$, where $Inst(X)$ denotes the set of instances of database X. □

The second part of the problem is how to maintain that semantic correspondence dynamically in the context of incremental updates. Figure 1.1 illustrates a scenario which highlights this issue. In particular, databases A and B are initially in the “acceptable” instance pair $(I_2, J_1)$ in the semantic correspondence (lines between instances indicate acceptable pairs).
Suppose an incremental update $\Delta_A$ changes A from $I_2$ to $I_5$. The question is: how can the system translate the update of database A into an update $\Delta_B$ of database B, so that, after applying $\Delta_B$ to B, the two databases reach another “acceptable” instance pair $(I_5, J_2)$ in the semantic correspondence?

[Figure 1.1: Transition between instance pairs in a semantic correspondence. The possible instances of A ($I_1$ to $I_6$) are linked to the possible instances of B ($J_1$ to $J_6$).]

The basis of the research is that the semantic correspondence is specified using a high-level language, and then an automatic mechanism translates that semantic correspondence into a set of active database rules that maintain the semantic correspondence by propagating updates incrementally.

For the remainder of the chapter, Section 1.2 presents several examples that illustrate different aspects of semantic correspondence. Section 1.3 then presents an overview of the BIRD system. Finally, Section 1.4 draws a road map for the bulk of the research.

1.2 Examples of Semantic Correspondence

The solution provided in this thesis focuses only on the family of semantic correspondences that can be expressed by two nrecILOG- programs (Section 2.3) specifying the instance mappings from database A to database B and vice versa. That is, the semantic correspondence $SC_{AB}$ specified by two nrecILOG- programs $P_{A \to B}$ and $P_{B \to A}$ is

$SC_{AB} = \{ (I, J) \mid P_{A \to B}(I) = J \wedge P_{B \to A}(J) = I \}$

The following subsections each present an example illustrating a different aspect of the problem. Within each example an “acceptable instance pair” is presented according to the application requirement.

In the following discussion we use the terms “instance” and “state” of a database interchangeably.

1.2.1 Value-Based Correspondence

We first consider a very simple example dealing with two different views of the same information describing the telephone number, employee, and office assignments in databases A and B respectively.
It illustrates “The many forms of a single fact” [Ken89] or the semantic relativism [HM81] which is commonly found in a heterogeneous database environment.

Example 1.1: “Office”
In database A the data is recorded in two relations. person-no(p, no) records that a person p has a telephone number no, and no-office(no, off) records that a telephone with number no is in office off. In database B the data is stored in relation in-office(p, off), recording that a person p is in office off, and in relation office-no(off, no), recording that an office off has a telephone number no.

For the information kept in both databases, it is assumed that there is a one-one correspondence between a telephone number and an office. The following is a possible pair of corresponding instances for A and B.

A.person-no:
    Name     Telephone
    Ben      123
    Sherry   456

A.no-office:
    Telephone   Office
    123         SAL250
    456         SAL130

B.in-office:
    Name     Office
    Ben      SAL250
    Sherry   SAL130

B.office-no:
    Office   Telephone
    SAL250   123
    SAL130   456

The application further requires that the two views be materialized - i.e. the data must be represented explicitly - because they are queried quite often at both sites, and communication between the two sites is not very reliable. □

As this example shows, the correspondence between these two schemas is fairly straightforward. In fact, the instances of B can be expressed by a nrecILOG- program:

    in-office(person,office) :- person-no(person,no), no-office(no,office);
    office-no(office,no) :- no-office(no,office);

Similarly, the instances of A can be expressed in terms of B by the following program.

    person-no(person,no) :- in-office(person,office), office-no(office,no);
    no-office(no,office) :- office-no(office,no);

As shown above, in the value-based case, a nrecILOG- program looks very much like a Datalog program or a conjunctive query, and can be used to express the semantic correspondence between two databases.
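The two programs of the “Office” example, together with the membership test for the semantic correspondence from Section 1.2, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the BIRD implementation: relations are modelled as sets of tuples, and the names `a_to_b`, `b_to_a` and `acceptable` are hypothetical. It also recomputes the target instance from scratch, whereas the thesis is concerned with propagating updates incrementally.

```python
from typing import Dict, Set, Tuple

# Assumption: a database instance is a dict from relation name to a set of tuples.
Relation = Set[Tuple[str, ...]]
Instance = Dict[str, Relation]


def a_to_b(a: Instance) -> Instance:
    """Mapping from A to B: the first two rules of the "Office" example."""
    in_office = {
        (person, office)
        for (person, no) in a["person-no"]
        for (no2, office) in a["no-office"]
        if no == no2  # join on the telephone number
    }
    office_no = {(office, no) for (no, office) in a["no-office"]}
    return {"in-office": in_office, "office-no": office_no}


def b_to_a(b: Instance) -> Instance:
    """Mapping from B to A: the second pair of rules of the example."""
    person_no = {
        (person, no)
        for (person, office) in b["in-office"]
        for (office2, no) in b["office-no"]
        if office == office2  # join on the office
    }
    no_office = {(no, office) for (office, no) in b["office-no"]}
    return {"person-no": person_no, "no-office": no_office}


def acceptable(a: Instance, b: Instance) -> bool:
    """(a, b) is in the correspondence iff each mapping yields the other side."""
    return a_to_b(a) == b and b_to_a(b) == a


inst_a: Instance = {
    "person-no": {("Ben", "123"), ("Sherry", "456")},
    "no-office": {("123", "SAL250"), ("456", "SAL130")},
}
inst_b = a_to_b(inst_a)

# An update on A (a hypothetical new person "Don" sharing Ben's telephone).
updated_a: Instance = {
    "person-no": inst_a["person-no"] | {("Don", "123")},
    "no-office": set(inst_a["no-office"]),
}

print(acceptable(inst_a, inst_b))                # the original pair
print(acceptable(updated_a, inst_b))             # B is stale after the update
print(acceptable(updated_a, a_to_b(updated_a)))  # B brought back in line
```

The stale pair in the second check is exactly the situation an update propagation mechanism has to repair; here it is repaired by full recomputation only to keep the sketch short.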
1.2.2 Introduction of Object Identifiers (OIDs)

The semantic correspondence is often not as straightforward as in the previous example. In particular, “Object Identifiers” (OIDs) may be used within one or both databases. These OIDs are normally inaccessible or meaningless outside the database in which they exist. This usually makes the specification and maintenance of the semantic correspondence more difficult.

For instance, a graphic interface may be used to visually depict the conceptual objects stored in an underlying application database. Typically a graphical interface also maintains an internal database that holds information about the things it is representing. In this case there is a natural correspondence between the objects stored in the application database and their graphic representations.

To demonstrate how the correspondence descriptions are expressed in the presence of OIDs, consider the following example:

Example 1.2: “Segment Flight”
Figure 1.2 shows the schemas for databases A and B, which hold equivalent but structurally different information about airline flights. These are depicted using the IFO- database model (Section 2.2), which can be viewed as the structural portion of an object-oriented data model (i.e., no methods).

[Figure 1.2: Schemas and instances of Example “Segment Flight”. Database A has entity type Segment with attributes has-node (node) and segment-info (time, price). Database B has entity type Flight with attributes flight-info (city, time) and price.]

A.has-node:
    Segment   node
    s1        S.F.
    s2        N.Y.

A.segment-info:
    Segment   time    price
    s1        10:00   65
    s1        14:00   80
    s2        09:00   150

B.flight-info:
    Flight   city   time
    f1       S.F.   10:00
    f2       S.F.   14:00
    f3       N.Y.   09:00

B.price:
    Flight   price
    f1       65
    f2       80
    f3       150

In the B database, the entity type Flight holds OIDs for flights. It has two attributes: price, recording the prices of the flights, and flight-info, indicating the destination (see footnote 1) and time of the flights.
On the other hand, the A database models the same information within the framework of a graph representation, where each OID in Segment corresponds to a possible (single-stop) route between L.A. and a destination node (city). As shown in the above figure, even if there are two flights f1 and f2 to San Francisco, there is only one segment OID s1 corresponding to these flights. The attribute segment-info of Segment records the corresponding times and prices of the flights. □

Footnote 1: To simplify the example, we assume that all flights start from L.A.

The main difference between nrecILOG- and Datalog is the ability to specify OID creation in the program. The following nrecILOG- program specifies the instance mapping from A to B for the “Segment Flight” example.

    i-flight[*,s,t] :- segment-info(s,t,p);                 // S1
    Flight(f) :- i-flight[f,s,t];                           // S2
    flight-info(f,c,t) :- i-flight[f,s,t], has-node(s,c);   // S3
    price(f,p) :- i-flight[f,s,t], segment-info(s,t,p);     // S4

In the above nrecILOG- program, the OID invention is specified by the “invention rule” S1, which has an “invention predicate” i-flight in the rule head. Intuitively, the rule specifies that an OID is created for each distinct (s,t) pair in segment-info. The “*” sign in the rule head indicates that an OID will be invented with a distinct (s,t) pair as its witness. Notice that the particular p in the segment-info relation is irrelevant to the OID creation. With the OIDs invented and stored in i-flight, rule S2 populates the entity type Flight with these OIDs.

Similarly, the following nrecILOG- program specifies the instance mapping from B to A.

    i-segment[*,c] :- flight-info(f,c,t);                                     // R1
    Segment(s) :- i-segment[s,c];                                             // R2
    has-node(s,c) :- i-segment[s,c];                                          // R3
    segment-info(s,t,p) :- i-segment[s,c], flight-info(f,c,t), price(f,p);   // R4

An interesting issue that arises with OIDs is that the value of an OID in a local database may change from one session to the next [EK91]. As a result, a second database will not know about the change in the value of the OID.
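The effect of rules S1 to S4, and the session-local nature of the OIDs they create, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the ILOG evaluator or the BIRD compiler: the names `invent_flights`, `value_view` and `oid_stream` are hypothetical, and OIDs are modelled as plain strings drawn from a generator.

```python
import itertools
from typing import Dict, Iterator, Set, Tuple

SegmentInfo = Set[Tuple[str, str, int]]  # (segment OID, time, price)
HasNode = Set[Tuple[str, str]]           # (segment OID, city)


def oid_stream(prefix: str, start: int) -> Iterator[str]:
    """Hypothetical source of fresh OIDs; each session may use a different one."""
    return (f"{prefix}{n}" for n in itertools.count(start))


def invent_flights(segment_info: SegmentInfo, has_node: HasNode,
                   fresh: Iterator[str]) -> Dict[str, object]:
    """Evaluate rules S1-S4 on an instance of database A."""
    # S1: invent one OID per distinct (s, t) witness; the price p is irrelevant.
    i_flight: Dict[Tuple[str, str], str] = {}
    for s, t, _p in sorted(segment_info):
        if (s, t) not in i_flight:
            i_flight[(s, t)] = next(fresh)
    # S2: populate the entity type Flight with the invented OIDs.
    flight = set(i_flight.values())
    # S3: flight-info(f, c, t) joins the witness relation with has-node.
    flight_info = {
        (f, c, t)
        for (s, t), f in i_flight.items()
        for (s2, c) in has_node
        if s == s2
    }
    # S4: price(f, p) joins the witness relation with segment-info.
    price = {(i_flight[(s, t)], p) for (s, t, p) in segment_info}
    return {"i-flight": i_flight, "Flight": flight,
            "flight-info": flight_info, "price": price}


def value_view(b: Dict[str, object]) -> Set[Tuple[str, str, int]]:
    """Project the OIDs away, keeping only the (city, time, price) values."""
    return {
        (c, t, p)
        for (f, c, t) in b["flight-info"]
        for (f2, p) in b["price"]
        if f == f2
    }


has_node: HasNode = {("s1", "S.F."), ("s2", "N.Y.")}
segment_info: SegmentInfo = {("s1", "10:00", 65), ("s1", "14:00", 80),
                             ("s2", "09:00", 150)}

# Two "sessions" that draw their OIDs from different streams.
session1 = invent_flights(segment_info, has_node, oid_stream("f", 1))
session2 = invent_flights(segment_info, has_node, oid_stream("oid-", 100))

print(sorted(session1["Flight"]), sorted(session2["Flight"]))
print(value_view(session1) == value_view(session2))
```

The two sessions produce different OID values but the same information once the OIDs are projected away, which is why an OID cannot simply be handed to the other database as if it were an ordinary value.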
In order to provide a “stronger notion of identity” at a more global level, the application must provide a Machine Dependent Object translation mechanism to translate a local OID value into an im m utable OID so that this OID can be shared between the databases A and B. 1.2.3 A m b ig u o u s U p d a te In the examples presented in the previous subsections, the two databases hold exactly the same amount of information. However, in reality, databases may hold informa tion that only partially overlaps. W ith the presence of non-equivalent information, there is ambiguity in how an update from one database should be propagated to the other. The next example highlights one form of partial overlapping, in which one data base overlaps on the selection portion of the other database. / / R1 // R2 // R3 // R4 7 E x a m p le 1.3: “Football” This example concerns the situation where the provost of USC has a database (*4) which records student information university-wide, and the football team coach has a database (B) about the team members. The student information in the provost’s database is kept in the following two relations Name Team Member? grade: Name GPA And the coach has a database keeping information about his team members in the following relations Name assignm ent: Name position Statically speaking, given an instance of the coach’s database, there can be in finitely many provost’s database instances associated with it, as long as the player parts of student information in the two database instances m atch up. Let us now consider the dynamic behaviors of the two databases after encoun tering some incremental updates. Suppose students “John” and “Mary” are in the provost’s database with the following instance. s tu d e n t: Name Team Member? 
        John    yes
        Mary    no
    grade:
        Name    GPA
        John    2.45
        Mary    3.95

And the corresponding player information in the coach’s database is

    team:
        Name
        John
    assignment:
        Name    position
        John    Quarter Back

Now what happens if the coach decides to kick “John” off the team? What is the corresponding update to the USC student database? Following are some of the possibilities:

1) Delete “John” from the student relation, i.e. DELETE student("John", yes).

2) Modify the student relation, i.e. replace student("John", yes) by student("John", no).

3) Kick “John” out of USC (do 1) if his GPA is lower than 2.5, otherwise leave him in (do 2).

Different policies are possible because the correspondence between the instances of A and B is not one-one, which is the essence of the ambiguity.

Another interesting policy issue involves the ability to refuse to allow a transaction to occur in the originating database. Consider what happens if the provost wants to drop student “John” because of his low GPA. The following is one of the possible corresponding updates to the coach’s database.

4) Depending on which position “John” plays, the coach can overrule the provost’s decision; in this case quarterback may be too important a position to ignore, so the almighty coach simply rejects the provost’s decision and aborts the transaction. □

From the possibilities above we can see that the corresponding updates may be ambiguous, and may even depend on a policy at the remote database which could change at any time.

1.3 The Proposed Solution

In this section we present an overview of the Bird system, an overall solution to support incremental updates between two object-based databases holding overlapping information. We begin by stating five key design issues that influenced the development of Bird.
(1) The system should provide a high level language to describe the semantic correspondence, so that the DBA can easily describe it in a highly abstract way without worrying about the low level details.

(2) A compiler is needed to translate the high level language description into an underlying concrete procedural linkage that incrementally maintains the semantic correspondence.

(3) As mentioned before, local OIDs may be dynamic and are only meaningful within the domain of the local database. Therefore the solution must provide a mechanism to encode each local OID into an immutable OID.

(4) The general solution needs to provide an architecture and a run time environment to orchestrate the incremental updates and interactions between the databases.

(5) In the case of ambiguous semantic correspondence, a methodology is needed to resolve the ambiguity in the semantic correspondence between the two databases, so that Bird can be applied in the solution.

[Figure 1.3: Architecture of the Bird system. Each of the two sites comprises RemTrans, an AP5 Active DBS, an MDOT Mechanism and a Communication Subsystem; the ALD is fed to the ALD Compiler.]

1.3.1 System Architecture

The architecture of Bird is depicted in Figure 1.3. The architecture is symmetric and allows both databases to initiate incremental updates. Its design follows the principle of layered architecture [Nut92] to minimize the complexity. The architecture of Bird consists of the following components.

ALD Compiler: To describe the underlying semantic correspondence to the Bird system, the DBA first writes a file called the Abstract Linkage Description (ALD). The ALD specifies the abstract linkage by using rules in nrecILOG-, a variant of non-recursive Datalog extended to support OID invention (see Section 2.3).
The ALD is then fed into the ALD compiler, which is implemented with the Syntax Directed Experts of POPART (see Section 2.4) using program transformation techniques. The compiler translates the abstract linkages described in the ALD into two sets of concrete linkages, E_A and E_B. Each concrete linkage consists of some auxiliary schema definitions and a set of production rules, both expressed in the AP5 database language (see Section 2.1). These rules react to changes reported by one database to update the other incrementally.

RemTrans Mechanism: The RemTrans mechanism is responsible for remote transaction management between the two databases. It shields the system-level heterogeneity of the underlying active databases from the user. It also serves as the interface of the Bird system to the user-issued incremental updates. The functionalities of the RemTrans mechanism are discussed in Section 3.3.

AP5 Active Database System (ADBS): The actual database operations and the production rule firings are performed by the AP5 active database system. As shown in the architecture, the concrete linkages E_A and E_B interact with their host AP5 ADBSs and translate the user-issued updates to the other database incrementally.

Machine Dependent Object Translation Mechanism (MDOT): The Machine Dependent Object Translation (MDOT) mechanism serves as a filter between the AP5 database system and the communication subsystem; it encodes each outgoing OID into an immutable OID, and decodes each incoming immutable OID into its local value, so that local OIDs can be shared among databases. MDOT will be discussed in Section 4.5.

Communication Subsystem: To simplify the communication between databases, the Communication Subsystem is designed to accomplish the message-passing paradigm through the socket mechanism in the UNIX system.
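The encode/decode role described for MDOT above can be pictured with a small translation table. The sketch below is only a guess at the idea, written in Python: the class name `OidTranslator`, its methods, and the `site:number` format of the immutable OID are hypothetical and are not taken from the Bird implementation, whose actual design is the subject of Section 4.5.

```python
class OidTranslator:
    """Hypothetical two-way table between local OIDs and immutable OIDs."""

    def __init__(self, site):
        self.site = site          # name of the owning database, e.g. "A"
        self._next = 0
        self._to_global = {}      # local OID -> immutable OID
        self._to_local = {}       # immutable OID -> local OID

    def encode(self, local_oid):
        """Outgoing direction: give a local OID a stable immutable OID."""
        if local_oid not in self._to_global:
            self._next += 1
            global_oid = f"{self.site}:{self._next}"
            self._to_global[local_oid] = global_oid
            self._to_local[global_oid] = local_oid
        return self._to_global[local_oid]

    def decode(self, global_oid, new_local_oid=None):
        """Incoming direction: map an immutable OID to its local value.

        On first sight the caller supplies a freshly created local OID.
        """
        if global_oid not in self._to_local:
            if new_local_oid is None:
                raise KeyError(global_oid)
            self._to_local[global_oid] = new_local_oid
            self._to_global[new_local_oid] = global_oid
        return self._to_local[global_oid]

    def rebind(self, old_local, new_local):
        """A local OID changed between sessions; the immutable OID is kept."""
        global_oid = self._to_global.pop(old_local)
        self._to_global[new_local] = global_oid
        self._to_local[global_oid] = new_local
```

The point of the design is that only the table changes when a local OID value changes; the identifier exchanged between the two databases stays fixed.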
1.3.2 Scenario for Establishing Semantic Correspondence

The following scenario is used to establish a connection between two databases using the Bird system:

Step 1) Construct ALD: The DBA first determines the semantic correspondence defined by the application, and then specifies the semantic correspondence in the form of an ALD, L.

Step 2) Compile ALD: The ALD, L, is fed into the Bird compiler. The compiler generates two sets of concrete linkages, E_A and E_B, for A and B, respectively.

Step 3) Load: The concrete linkage E_A is then loaded into database A and E_B is loaded into database B.

Step 4) Run: After loading, databases A and B use E_A and E_B, respectively, to interact with the underlying Bird architecture to maintain the semantic correspondence incrementally.

In the above, all but Step 1 are automatically carried out by Bird.

1.3.3 Methodology to Resolve Ambiguity

Intuitively speaking, the approach is to separate the ambiguity from the inter-database communication, by isolating and/or creating subschemas of databases A and B such that the subschemas hold equivalent information. This permits Bird to be applied to them in order to maintain OID creation/deletion between the two databases, and leaves the problem of updating the remaining parts of A and B to the local databases. As will be seen, creation of the subschemas may be the result of isolating portions of the original schemas and/or augmenting the original schemas with derived data.

1.4 Dissertation Structure

In the body of this dissertation, we consider progressively more difficult cases of incremental update propagation.
To understand the progression, we categorize the problem domain along three dimensions: (a) whether the linkage established is uni-directional or bi-directional, (b) the presence or absence of OIDs in the overlapping data, and (c) whether the two databases are holding equivalent information or non-equivalent overlapping information (i.e., whether the semantic correspondence is one-one or ambiguous).

                   Uni-directional    Bi-directional
    Value-Based    Chapter 3          Chapter 3
    OID-Based      Chapter 4          Chapters 5, 6, 7

    Table 1.1: Road map of the thesis

Using “Football” as an example, dimension (a) concerns whether the linkage is only for a one-way translation of incremental updates from the provost’s database to the coach’s database, or a two-way translation of incremental updates from both databases. In the former case, only the provost has the right to modify the student data, and the (team membership portion of the) coach’s database serves as a virtual view; in the latter case, both the provost and the coach can update their databases and the changes will be automatically mapped to the other side.

Stemming from dimensions (a) and (b), there are four possible combinations of solutions. As indicated in Table 1.1 above, they are discussed from Chapter 3 to Chapter 7 in the context of one-one semantic correspondences. Stemming from dimension (c), Chapter 3 to Chapter 7 discuss the case when the two databases hold equivalent information, i.e., one database is expressible by the other database. The case when the two databases hold non-equivalent information is discussed in Chapter 8.

Chapter 3 covers both the uni-directional and bi-directional solutions for the value-based cases. The abstract linkage description (ALD) is introduced in Section 3.1. Then Section 3.2 presents the compilation of ILOG non-invention rules. Finally, the RemTrans subsystem is discussed in Section 3.3.
Chapter 4 presents the solutions for establishing uni-directional linkage involving OID creation. In particular, the Entity Equivalence Specification (EES) in an ALD, which specifies an equivalence relationship between two entity types in the two databases, is first introduced in Section 4.1. Then the chapter discusses the OID association problem (Section 4.2), which motivates the need to augment the schema of A by adding the witness relations (Section 4.3), the compilation of ILOG invention rules (Section 4.4), the MDOT mechanism (Section 4.5), and static and dynamic semantics in ALD (Section 4.6).

The bi-directional propagation of incremental updates is discussed in Chapter 5. Section 5.1 discusses the preprocessing of the EES in the bi-directional case. Section 5.2 highlights the witness update problem, a problem that arises in the bi-directional case when updates in the local database need to be propagated to the witness relations stored in the remote database. This motivates the need for the witness generators (Section 5.3), the nrecILOG- rules derived from the user-specified ALD. In Section 5.4, the logical and physical perspectives of semantic correspondences are discussed. Section 5.5 presents the justification for using EES instead of nrecILOG- invention rules to specify the equivalence relationship between two entity types. Finally, Section 5.6 summarizes the development and major components of the Bird system.

The automatic generation of witness generators from an ALD is a non-trivial task. One of the main contributions of this thesis is the development of the Witness Generator Generator (WGG) algorithm, which automatically generates witness generators from a user-given ALD. The WGG algorithm is presented in Chapter 6. However, not every execution of WGG halts. Chapter 7 focuses on the formal discussion of the WGG algorithm and demonstrates that: (1) the WGG algorithm is sound, and (2) the termination of a WGG execution is decidable.
Chapter 2 gives a brief description of all the tools used in building Bird. In Chapter 9 we examine research related to our problem. Finally, Chapter 10 discusses some directions for future research.

Chapter 2

Tools Used By Bird

This chapter briefly introduces the tools used to implement the Bird system. It will serve as a reference for later discussions of the various components of the Bird system.

2.1 AP5, the platform

The current implementation of Bird is built on top of the AP5 active database system, an extension to Common Lisp developed at USC/Information Sciences Institute. AP5 provides the rule-based “engine” of Bird, which propagates updates between two databases. This section describes only the features of AP5 which are relevant to the Bird system. For more details of AP5, the reader should refer to [Coh86, Coh87].

2.1.1 Data Definition Language

AP5 supports a data model similar to the Entity-Relationship model [Che76]. Data in AP5 are stored in relations as in the relational model. A type in AP5 is a unary relation holding a set of objects. OIDs can be generated by calling the Make-DBObject macro.

AP5 distinguishes three kinds of relations: transition relations, derived relations and computed relations. The population of a transition relation is explicitly determined by the insertion operations (e.g. ++ R(1,5)) and deletion operations (e.g. -- R(1,5)), and also implicitly by rule firings. The contents of a derived relation are defined by a computation based on other existing relations. A computed relation is a relation defined by a Lisp computation that does not depend on the contents of the database. The following three relation declarations are examples of a transition relation, a derived relation, and a computed relation respectively.

    (DEFRELATION room-mate :arity 2 :types (person person))

    (DEFRELATION class-mate :arity 2 :types (person person)
      :DEFINITION ((x y) S.T.
                    (E (class) (AND (in-class x class) (in-class y class)))))

    (DEFRELATION even-number-list :arity 1 :types (integer)
      :COMPUTATION (CODED ((x) S.T. (AND (listp x) (every #'evenp x)))))

In the above, room-mate is a transition relation with arity two. class-mate is derived from an AP5 query defined in the :DEFINITION part of the relation definition. Neither the derived nor the computed relations in AP5 (class-mate and even-number-list) physically hold values. A computed relation in AP5 serves more like a predicate; even-number-list(x) is true if and only if x is a list containing only even integers.

2.1.2 Data Manipulation Language

In AP5, the conditions of the database can be tested by well-formed formulas (wffs). Wffs are expressions built from primitive relation predicates, logic operators (NOT, AND, OR, IMPLIES, EQUIV, XOR), quantifiers (A and E represent $\forall$ and $\exists$), variables and Lisp expressions. For example, the AP5 wff

    (A (x) (IMPLIES (good-student x) (high-GPA x)))

is equivalent to the following first order logic sentence [End72]

    $\forall x\,(\text{good-student}(x) \rightarrow \text{high-GPA}(x))$

And the AP5 wff

    (E (x y person) (AND (office person x)
                         (office person y)
                         (NOT (eql x y))))

is equivalent to

    $\exists x, y, p\,(\text{office}(p, x) \wedge \text{office}(p, y) \wedge x \neq y)$

A description is an expression of the form (vars S.T. wff), where vars is a list of free variables in wff. AP5 queries are formed by combining various query operators and descriptions. The following are some examples of AP5 queries.

    (?? (E (s) S.T. (AND (GPA s g) (> g 3.75))))
    ; return TRUE if there is a student with GPA higher than 3.75,
    ; return FALSE otherwise

    (LISTOF (s) S.T. (AND (GPA s g) (> g 3.75)))
    ; return a list of high GPA students

    (THEONLY (w) S.T.
                 (has-wife 'Bill w) IFMANY (report-scandal 'Bill))
    ; return the only wife of 'Bill, if many found then call report-scandal

The data can be updated by the insertion operator ++ and the deletion operator --. Updates can be grouped into an atomic transaction, which delays the rule firings until all updates in the atomic transaction have been completed. The following atomic transaction inserts a tuple in office and deletes a student from student.

    (atomic (++ office 'John 'SAL200);
            (-- student 'Marry));

2.1.3 Active Rules

There are two kinds of rules in AP5: consistency rules and automation rules. To support efficient rule triggering, AP5 compiles active rules into a network similar to that of the RETE algorithm [For79] commonly used for implementing expert systems [Coh89]. Since only the consistency rules are used in the Bird system, AP5 automation rules are not discussed here.

In AP5, the firing of consistency rules occurs as part of an atomic transaction. Consistency rules guarantee that certain conditions are satisfied by every state of the database. Each rule expresses one such condition, typically in the form of a wff. Every atomic transition must satisfy every such condition (or be altered to satisfy them all); otherwise it aborts.

A rule may include, in addition to the condition, a consistency restoration program. The consistency restoration program can read both the original state and the invalid state which it must try to repair. That is, temporal references can be used to monitor the changes in the current database state. For example, (PREVIOUSLY wff) refers to the database state before the atomic transition started. (START wff) is the same as (AND wff (NOT (PREVIOUSLY wff))), i.e. it tests whether the wff has become true since the beginning of the transaction. And (CHANGE wff) is shorthand for (NOT (eql wff (PREVIOUSLY wff))).
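The three temporal references can be paraphrased in Python as functions over a pair of database states. This is only an illustration of the definitions given above, not AP5 code: representing a state as a set of tuples and a wff as a Boolean function of a state are assumptions of the sketch.

```python
def previously(wff, old_state, new_state):
    """(PREVIOUSLY wff): evaluate the wff in the state before the transition."""
    return wff(old_state)

def start(wff, old_state, new_state):
    """(START wff): the wff holds now but did not hold before."""
    return wff(new_state) and not previously(wff, old_state, new_state)

def change(wff, old_state, new_state):
    """(CHANGE wff): the truth value of the wff differs between the two states."""
    return wff(new_state) != previously(wff, old_state, new_state)

# Hypothetical states before and after an atomic transition.
OLD = {("has-office", "John", "SAL200")}
NEW = OLD | {("has-office", "John", "SAL300")}

def in_sal300(db):
    return ("has-office", "John", "SAL300") in db

def in_sal200(db):
    return ("has-office", "John", "SAL200") in db
```

Here `start(in_sal300, OLD, NEW)` holds because the tuple was inserted by the transition, while `start(in_sal200, OLD, NEW)` does not, since that tuple was already present.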
The following is an example of an AP5 consistency checker, a special form of consistency rule used by the Bird system. The consistency checker demands that for any person there can be at most one office assignment. The repair action is a Lisp function repair-for-multiple-office-assignment, which simply reports the error and aborts the transaction.

    (DEFCONSISTENCYCHECKER only-one-office-per-person
      ((p o) S.T. (AND (START (has-office p o))
                       (E (o2) S.T. (AND (has-office p o2)
                                         (NOT (eql o o2))))))
      repair-for-multiple-office-assignment)

    (DEFUN repair-for-multiple-office-assignment (p o)
      (ERROR-MESSAGE "Person ~A has multiple office assignments ~A" p o)
      (ABORT-TRANSACTION))

2.1.4 Semantics of Consistency Rules

The semantics of consistency rule application is based on accumulation [HJ91]; the final delta committed as the result of an atomic transaction includes (1) the initial updates in the atomic transaction and (2) the accumulation of updates proposed by consistency rule firings.

Rule firings in AP5 are organized into consistency cycles. Initially, the updates in the atomic transaction are put into a delta list, then the rule firing mechanism starts the first consistency cycle. Conceptually, at each consistency cycle the consistency rules fire concurrently, based on the initial state of the database and the delta list accumulated from previous cycles. Then the updates of the triggered rules are appended to the delta list. This completes one consistency cycle and the database is ready to start the next consistency cycle. Inconsistent deltas, i.e. deltas containing an insertion and a deletion of the same tuple, are not allowed in AP5. Once inconsistent deltas are found in the delta list, the transaction is aborted. The database iterates through consistency cycles until no updates are added to the accumulated delta.
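The cycle just described can be summarized as a fixpoint loop. The Python sketch below is a simplification under stated assumptions and is not how AP5 is implemented (AP5 compiles rules into a RETE-like network): a state is a set of tuples, a delta is a set of ("+", tuple) or ("-", tuple) pairs, a rule is a function from the initial state and the accumulated delta to proposed updates, and the `max_cycles` guard is an addition of this sketch.

```python
class Abort(Exception):
    """Raised when the atomic transaction must be aborted."""

def check_consistent(delta):
    """Reject a delta that both inserts and deletes the same tuple."""
    for op, fact in delta:
        opposite = "-" if op == "+" else "+"
        if (opposite, fact) in delta:
            raise Abort(f"inconsistent delta on {fact}")

def run_transaction(state, initial_updates, rules, max_cycles=100):
    """Accumulate rule-proposed updates cycle by cycle, then commit."""
    delta = set(initial_updates)
    for _ in range(max_cycles):
        check_consistent(delta)
        # One consistency cycle: every rule sees the initial state and
        # the delta accumulated so far.
        proposed = set()
        for rule in rules:
            proposed |= set(rule(state, frozenset(delta)))
        if proposed <= delta:
            # Nothing new was added: commit the accumulated delta.
            inserts = {fact for op, fact in delta if op == "+"}
            deletes = {fact for op, fact in delta if op == "-"}
            return (set(state) | inserts) - deletes
        delta |= proposed
    raise Abort("no fixpoint within max_cycles")

def drop_assignment_with_team(state, delta):
    """Hypothetical rule: deleting a team member deletes the assignment too."""
    for op, fact in delta:
        if op == "-" and fact[0] == "team":
            for other in state:
                if other[0] == "assignment" and other[1] == fact[1]:
                    yield ("-", other)
```

With the state {("team", "John"), ("assignment", "John", "Quarter Back")} and the initial update ("-", ("team", "John")), the first cycle adds the deletion of the assignment tuple and the second cycle adds nothing, so the accumulated delta is committed.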
The accumulated delta is committed to the database only if no more rules fire in the last consistency cycle; otherwise the atomic transaction is aborted.

2.2 IFO-, the Data Base Model

The data model used in this thesis is based on IFO- [AH87]. It is a simple subset of the IFO database model, a formally defined data model containing the fundamental structural components of many semantic data models. IFO- is slightly richer than the Entity-Relationship model, and is easily represented by AP5. Every IFO- schema can be described by a directed graph with various types of vertices and edges.

[Figure 2.1: Example IFO schema. Abstract types Enroll and Course; the vertices enroll-student (with a STRING holding the student name) and enroll-course; attributes course-no and book-used (book name) from Course to STRING.]

There are two kinds of atomic types in IFO-. The first is called printable, and is represented by a square node with the name of the type in it. The second is called abstract, and is represented by a diamond-shaped box; it corresponds to the non-printable abstract objects with no underlying structure (in other words, to OIDs).

[Figure 2.2: Example instance of an IFO schema, with tables for Enroll, enroll-student, enroll-course, Course, course-no and book-used.]

For example, Enroll and Course in Figure 2.1 are abstract types, while the square boxes with STRING in them are printables.

To construct a complex type out of the existing ones, in IFO- the ⊗-vertex can be used to form a new type from the Cartesian product of other existing types. For example, the ⊗-vertices labeled enroll-student and enroll-course in Figure 2.1 are two new types: the Cartesian product of Enroll and STRING, and the Cartesian product of Enroll and Course, respectively.
Also, the attribute edges can be used to represent functional relationships as in the Functional Data Model [Shi81]. Attributes can be single-valued, represented using arrows, or multi-valued, represented using double-headed arrows. In Figure 2.1, course-no is a single-valued attribute from Course to STRING and book-used defines a multi-valued attribute from Course to STRING.

In the current implementation of Bird, the IFO- data model is used only for the conceptual schema design. Also, ISA relationships, while supported in IFO-, will not be permitted in the bulk of this thesis.

There exists a simple and natural mapping from an IFO- schema to a set of AP5 relations plus some functional and inclusion dependencies [Mai83]. For example, the schema in Figure 2.1 can be implemented using the relations shown in Figure 2.2, and the following dependencies

    $\texttt{course-no}[1] \subseteq \texttt{Course}[1]$
    $\texttt{enroll-student}[1] \subseteq \texttt{Enroll}[1]$
    $\texttt{enroll-course}[1] \subseteq \texttt{Enroll}[1]$
    $\texttt{enroll-course}[2] \subseteq \texttt{Course}[1]$
    $\texttt{course-no}: [1] \rightarrow [2]$

An example instance is also shown in Figure 2.2, which describes the enrollment information of a student Ben registered in two courses: CS101 and EE101.

2.3 ILOG, the Linkage Language

The ILOG language [HWW90, HY90] is a declarative language in the style of Datalog, modified to be used for querying, schema translation, and schema augmentation in the context of object-based data models. Derived from the ILOG language, nrecILOG- has no unions (i.e., more than one rule defining a single relation), no negation, and no recursion in the rules.

In the Bird system, a semantic correspondence is described by two nrecILOG- programs; each describes the schema translation from one database to the other. To see how the schema translation is specified using nrecILOG-, consider the schemas of Example 1.2 shown in Figure 1.2. The following nrecILOG- program specifies the translation from A to B.
(Our syntax differs slightly from that of [HY90].)

    i-flight[*,s,t] :- A.segment-info(s,t,p);                    // S1
    B.Flight(f) :- i-flight[f,s,t];                              // S2
    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);    // S3
    B.price(f,p) :- i-flight[f,s,t], A.segment-info(s,t,p);      // S4

Notice that for the sake of clarity, we prefix the predicates in B and A with B. and A. respectively. The above “program” looks very much like a Datalog program except for rule S1, called the OID invention rule. The rule head of S1, i-flight, is the intermediate relation that serves as a “scratch paper” to record the association between the OIDs invented and their witness values in the source database. These intermediate relations in a program are not part of the source schema or the target schema. According to the original semantics presented in [HY90], they are not included in the output of the translation process. Later we will see in Section 4.3 that the Bird system augments the source schema with these intermediate relations to support incremental update propagation involving OIDs.

    i-flight
        Flight         Segment    time
        f(s1,10:00)    s1         10:00
        f(s1,14:00)    s1         14:00
        f(s2,09:00)    s2         09:00

    B.Flight
        Flight
        f(s1,10:00)
        f(s1,14:00)
        f(s2,09:00)

    B.flight-info
        Flight         city    time
        f(s1,10:00)    S.F.    10:00
        f(s1,14:00)    S.F.    14:00
        f(s2,09:00)    N.Y.    09:00

    B.price
        Flight         amount
        f(s1,10:00)    65
        f(s1,14:00)    80
        f(s2,09:00)    150

    Figure 2.3: Example skolemized pre-instances

2.3.1 Evaluation of ILOG programs

The formal semantics of ILOG programs are presented in Appendix B. To see how an instance of the source schema is translated into an instance of the target schema, we use the instance of the A database depicted in Figure 1.2 as the source instance and follow the three-step process described in [HY90].
(1) Skolemize program $P_{A \rightarrow B}$: The nrecILOG- program is first “skolemized” into

    i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);               // S1
    B.Flight(f) :- i-flight[f,s,t];                              // S2
    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);    // S3
    B.price(f,p) :- i-flight[f,s,t], A.segment-info(s,t,p);      // S4

That is, the * symbol in the intermediate relation i-flight of the invention rule S1 is replaced by a skolem function f(s,t). This corresponds to the intuition that a new OID should be created for each (s,t) pair satisfying the condition of the rule body of S1.

(2) Compute the skolemized pre-instance: Use the instance of A shown in Figure 1.2 as the source and run the skolemized program as if it were an ordinary Datalog program. The resulting “skolemized pre-instance” for database B and the intermediate relation i-flight are illustrated in Figure 2.3. Notice that the skolem terms of the instance in the figure are the “logical OIDs”. Each logical OID represents a distinct (physical) OID value. In the next step, the logical OIDs will be replaced by distinct (physical) OIDs.

(3) Compute the instance from the pre-instance: Define a mapping $\psi$ that maps each different skolem term in the skolemized pre-instance to a unique OID. If the mapping $\psi$ is

    $\psi = \{\, f(s_1, 10{:}00) \rightarrow f_1,\ f(s_1, 14{:}00) \rightarrow f_2,\ f(s_2, 09{:}00) \rightarrow f_3 \,\}$

then the resulting instance for B is the one depicted in Figure 1.2.

Notice that the choice of $\psi$ in Step (3) above is non-deterministic. However, since OIDs in a database instance can be viewed as “place holders” or “pointers”, the specific choice of OID values is irrelevant. That is, a different $\psi$ yields a different but “equivalent” (up to permutation on OIDs) instance for B. This motivates the definitions of physical and logical instances.

2.3.2 Physical vs.
Logical Instances

Because of the non-deterministic nature of substituting OIDs for the skolem terms in Step (3) above, given a specific source instance the output of the translation may differ depending on the specific $\psi$ mapping chosen. (Similar non-determinism occurs with all languages mapping to object-based instances; see also [AK89, AB91, Day89].) As shown in Figure 2.4, the two output instances resulting from two different choices of the $\psi$ mapping are “isomorphic” but different. To overcome this difficulty, ILOG separates the notions of Physical and Logical Instances.

Footnote 1: In the paper [HY90], a physical instance is called the pre-instance and a logical instance is called the instance.

[Figure 2.4: Resulting instances of two different $\psi$ mappings. The first mapping sends f(s1,10:00), f(s1,14:00) and f(s2,09:00) to f4, f5 and f6; the second sends them to f38, f49 and f250. Each panel shows i-flight, B.Flight, B.flight-info and B.price populated with the corresponding OIDs.]

The instances discussed above are technically called “physical instances”. This is because the OIDs used are concrete, physical OIDs. In contrast, logical instances are defined based on the following notion of OID-equivalence.

Definition: OID-equivalent, Logical Instance. Let S be a schema. Then two physical instances I, J of S are OID-equivalent, denoted $I \equiv J$, if there exists a permutation $\sigma$ on the domain of OIDs such that

    $\sigma(I) = J$

The logical instance represented by I is the OID-equivalence class containing I, denoted by $[I]$.
□

With this notion of logical instance based on OID-equivalence classes, the choice of specific OIDs during the skolem term substitution step (3) is irrelevant. That is, given a source instance I and an nrecILOG- program P, the output defined in [HY90] is always $[P(I)]$, where the particular choice of OIDs for $P(I)$ is irrelevant.

However, as will be seen in Chapters 4 and 5, the specific choice of physical OIDs is important in the context of incremental update propagation between databases. To avoid confusion, the term “instance” used in the discussion, unless otherwise specified, refers to a physical instance.

2.3.3 Restrictions for nrecILOG-

The nrecILOG- language is a restricted member of the general ILOG language family. The following is a list of restrictions on rules in an nrecILOG- program.

(1) No negation: each rule has only positive conjunctive predicates in its rule body.

(2) No union: each rule in an nrecILOG- program has a distinct predicate in the rule head. Thus, we can use $\mathcal{R}_r$ to represent the rule in the program with r appearing in the rule head.

(3) Only inventional intermediate relations: there is no need for intermediate predicates aside from the invention predicates.

(4) Rule body: only the predicates from the source database and/or the invention predicates can appear in a rule body. Also, an invention predicate can appear in a rule body only if the OID variable of the invention predicate appears in the rule head.

(5) Rule head: the rule head of a rule is either an invention predicate or a predicate from the destination database.

(6) No repeated variables in the rule head: if there are repeated variables in the rule head of $\mathcal{R}_r$, then, since there is no union in the program, the program can be transformed into an equivalent one in which each occurrence of the r atom in the program is isomorphic to the rule head of $\mathcal{R}_r$.
Also, restrictions (3), (4) and (5) ensure that there is no recursion in an nrecILOG- program. Restriction (6) is mainly syntactic sugar to simplify the formal discussion.

Notice that usually the invention rules are first used to create untyped OIDs, then the intermediate relations appearing in the rule heads are used to populate the OIDs into an entity type in the target database. In the program above, this occurs in S1 and S2. For the sake of brevity, rules S1 and S2 can be abbreviated as

    i-flight[*Flight,s,t] :- segment-info(s,t,p);

This specifies implicitly that the OIDs invented are objects of the entity type Flight. This shorthand will be used when specifying ALDs in later discussion.

2.4 POPART, the Compiler

The ALD compiler is constructed using POPART (Producers Of Parsers And Related Tools) [Wil, Wil87], developed at USC/Information Sciences Institute. POPART can be used to generate a lexical analyzer, parser, unparser, pattern matcher, and program transformation system for the ALD compiler. Because POPART is a grammar-driven programming environment generator, the ALD compiler can be easily generated from a high-level grammar description.

2.4.1 BNF’, the grammar description language

The grammar of the underlying language in POPART is specified in BNF’. It is a BNF-like grammar description language extended to allow regular expression constructs (optional presence, alternation, iterated patterns, operator precedence patterns, and pattern expressions) in the language. For example, the following is a piece of the grammar specification of the nrecILOG- rules in BNF’.
    ilog-rule := head * body;
    head := intermediate | invention-intermediate | relation;
    body := intermediate-or-relation ~ ;
    intermediate-or-relation := intermediate | relation;
    invention-intermediate := rel#invention '[' * { rel#constraint > ( var-or-const#witness ']' ;
    intermediate := rel#intermediate '[' ( var-or-const#obj-witness ~ ;
    relation := rel#relation '(' var-or-const#type

In the grammar specification, terminals are quoted and nonterminals are unquoted symbols. <non-terminal> means “one or more occurrences of <non-terminal> separated by commas”. The # sign in a non-terminal is used to name and distinguish different pattern variables with the same syntactic type on the right hand side of a production. For example, the two pattern variables of the type rel appearing in the definition of invention-intermediate above can be distinguished by rel#invention and rel#constraint.

2.4.2 Syntax Directed Experts

The syntax directed experts in POPART allow the compilation of an ALD at the level of abstract syntax. Each syntax directed expert is specified as a sequence of production rules with patterns to be matched in the left hand side of the rules. Once a pattern in the left hand side of a production rule is matched against the parse tree of the source grammar, the activity in the right hand side of the production rule is activated.

Metavariables in a pattern are indicated using the ! sign, and optional variables are indicated using the !? sign. The !! sign is used to match any subsequence of an iterated field of the same syntactic type. Therefore,

    "begin !!Statement end"

can be used to match the begin-end statement block in PASCAL, and

    IF !Predicate THEN !Statement#true ELSE !?Statement#false

can be used to match an IF statement with the optional ELSE part.
The activity of the expert's right hand sides distinguishes the three types of syntax directed experts.

Action routines: For these, the right hand sides consist of Common Lisp code that can refer to the pattern variables that were matched. The following is an action routine in the ALD compiler that generates a list of the free variables in a nrecILOG¬ rule, i.e., the variables that appear in the rule body but not in the rule head.

    (defaction-routine find-freevars-in-rule
      :from-grammar ALD
      :nonterminals (ilog-rule)
      :rules (
        ("!head#lhs :- !body#rhs"
         (progn
           (remove-duplicates
             (set-difference (find-vars tr::body#rhs)
                             (find-vars tr::head#lhs)
                             :test #'equal)
             :test #'equal)))
      ))

Transformers: For these, the right hand sides are patterns of the same language as the left hand sides. They serve as the rewriting rules of the underlying grammar. The following is a rewriting rule to simplify wffs in an AP5-lisp program. In particular, it removes every empty existential quantifier and every "and" operator with a single operand from the program.

    (poe:deftransformer Simplify-wff
      :grammar AP5-lisp
      :nonterminals (lisp-list)
      :rules (
        ("(E nil !lisp-list)" :==> "!lisp-list")
        ("(and !lisp-list)"   :==> "!lisp-list")
      )
      :otherwise-action poerptree)

Translators: These are used to transform the matched parse tree into a parse tree of the destination grammar. Thus the right hand sides are patterns in the destination grammar. The following is a translator which translates the rule body of an nrecILOG¬ rule into a conjunctive clause in AP5-lisp.
    (deftranslator body-to-lisp-program
      :from-grammar ALD
      :to-grammar AP5-lisp
      :from-nonterminals (body)
      :to-nonterminals (program)
      :rules (
        ("!!intermediate-or-relation#one-item" :==> "(AND !!program#one-item)")
      ))

For example, the rule body of the ILOG rule

    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);

is translated into the lisp list

    (AND i-flight[f,s,t] A.has-node(s,c))

Chapter 3

Solution of the Value-Based Case

This chapter presents the framework for maintaining the uni-directional and bi-directional semantic correspondence for the value-based case under the general architecture of Bird illustrated in Section 1.3. This chapter also serves as the starting point for the object-based context to be presented in Chapters 4 and 5.

Each component of the Bird system used for maintaining the value-based semantic correspondence is discussed in one section: Section 3.1 describes the abstract linkage description (ALD), Section 3.2 discusses the compilation of ILOG non-invention rules, Section 3.3 presents the RemTrans mechanism which orchestrates the interactions between the two databases, and Section 3.4 briefly describes the communication subsystem in this framework. Finally, Section 3.5 discusses the side-effects on the behaviors of the two databases after the semantic correspondence has been established.

Throughout this chapter the "Office" example (Example 1.1) introduced in Chapter 1 will be used to illustrate various parts of the Bird system.

3.1 Abstract Linkage Description (ALD)

In Bird, the semantic correspondence is described by a text-based file called the abstract linkage description (ALD). The formal semantics of the ALD are summarized in Appendix B.
Formally speaking, an ALD $L$ for the bi-directional value-based case is a four-tuple $(A_L, B_L, P_{A \mapsto B}, P_{B \mapsto A})$ where

• $A_L$ specifies the schema of database A.
• $B_L$ specifies the schema of database B.
• $P_{A \mapsto B}$ is a nrecILOG¬ program specifying the schema restructuring mapping that maps instances of A to instances of B.
• $P_{B \mapsto A}$ is a nrecILOG¬ program specifying the schema restructuring mapping that maps instances of B to instances of A.

$L$ specifies the semantic correspondence

$$\{(I, J) \mid (P_{A \mapsto B}(I) = J) \wedge (P_{B \mapsto A}(J) = I)\}$$

For the case of a value-based uni-directional ALD $L$, the only difference is that $L$ does not have the $P_{B \mapsto A}$ part; i.e., the ALD provides only the one-way mapping from $A_L$ to $B_L$. Other than that, the compilation and loading process are the same as in the bi-directional case. Since only the value-based cases are discussed in this chapter, the two nrecILOG¬ programs $P_{A \mapsto B}$ and $P_{B \mapsto A}$ in an ALD contain only non-invention nrecILOG¬ rules.

With the ALD, the DBA can give a high level description of the semantic correspondence without worrying about the low level details of how incremental updates are propagated and synchronized between the two databases. To specify the semantic correspondence, the DBA defines the bi-directional linkage in a one-direction-at-a-time fashion; i.e., the DBA first takes the viewpoint of the A database and describes the mapping from instances of A to instances of B, then switches to the B side and specifies the mapping from instances of B to instances of A. This approach distinguishes Bird from most of the approaches taken by research into the "view update" problem [BS81, DB82, GPZ88]. The formal syntax of ALD is presented in Appendix A.
Example 3.1: The ALD for the "Office" example is

    (abs-linkage Office
      A-Schema (
        person-no(String, String);   // This is a comment line
        no-office(String, String);
      )
      B-Schema (
        in-office(String, String);
        office-no(String, String);
      )
      A-to-B (
        B.in-office(p,o) :- A.person-no(p,n), A.no-office(n,o);   // R1
        B.office-no(o,n) :- A.no-office(n,o);                     // R2
      )
      B-to-A (
        A.person-no(p,n) :- B.in-office(p,o), B.office-no(o,n);   // R3
        A.no-office(n,o) :- B.office-no(o,n);                     // R4
      )
    )

where the four-tuple $(A_L, B_L, P_{A \mapsto B}, P_{B \mapsto A})$ is represented by the A-Schema, B-Schema, A-to-B, and B-to-A parts above. □

As shown in the above example, the ALD adopts the C++ style comment token "//". Whenever this token appears (unless it is inside a string), everything to the end of the current line is a comment.

3.2 Compilation of an ALD

With the DBA-defined ALD as input, the ALD compiler translates the "abstract linkage" defined in the ALD into two "concrete linkages", $\mathcal{L}_A$ and $\mathcal{L}_B$, which contain production rules (see footnote 1) for the underlying active DBMSs of databases A and B, respectively. Since there are no invention rules in a value-based ALD, in this chapter we only describe the translation of nrecILOG¬ non-invention rules. The translation of invention rules will be presented in Section 4.4.

Footnote 1: In later chapters (4 and 5), for the case of an OID-based semantic correspondence, the compiler will incorporate some "auxiliary relations" besides those defined in the A-Schema and B-Schema sections. Then the resulting $(\mathcal{L}_A, \mathcal{L}_B)$ contains not only sets of production rules but also the schema definition for the "auxiliary relations".

Notice that for a non-invention nrecILOG¬ rule r, the predicates in the rule body of r all come from one database, while the rule head predicate of r comes from the other database. For the sake of the discussion, we call the former the "source database" and the latter the "destination database" of rule r. For example, each of the predicates in the body of

    B.in-office(p,o) :- A.person-no(p,n), A.no-office(n,o);

comes from A, while the rule head is a predicate from B. Thus the source and the destination of the above rule are A and B, respectively.

Each nrecILOG¬ rule r is translated by the ALD compiler into two production rules, $r_p$ and $r_d$, of the underlying active database (see footnote 2). $r_p$ is loaded into the source database of r to monitor conditions in the source database violating the constraint implied by r, and $r_d$ is loaded into the destination database of r to check whether the constraint specified by r is satisfied after the rule head of r is modified. The details of $r_p$ and $r_d$ are now presented.

Footnote 2: In the current implementation, the Syntax Experts in POPART are used to perform the translation. The result of the compilation is a set of AP5 production rules, or more precisely, a set of Consistency Checkers in AP5.

Population Rule $r_p$: This resides in the source database of r to monitor any incremental update violating the constraint implied by r. If the violation is due to some updates to the rule body of r, then $r_p$ is triggered and is responsible for propagating updates to the rule head in the target database by inserting update requests into a data structure Remote-Agenda. Remote-Agenda serves as a buffer holding updates proposed by various population rules. The actual execution of the updates in Remote-Agenda is postponed until all rules in the source database stop firing. This is mainly to avoid deadlock and will be elaborated in Section 3.3 when the RemTrans mechanism is introduced.

Delay Checking Rule $r_d$: This resides in the destination database of r to ensure the constraint specified by r after the rule head of r has been modified. $r_d$ will be triggered whenever the rule head of r is modified. The action of $r_d$ inserts a query into a data structure Delay-Checking-List asking whether the constraint of r is violated. The actual checking of this query is delayed until the whole interaction between the source and destination databases has settled down. The details of this "wait till everything has settled down, then check" process will be given in Section 3.3.

As described above, the behavioral aspect of a nrecILOG¬ rule is divided into two parts; each part is handled by one production rule.

Example 3.2: We now use the "Office" example to illustrate the translation of nrecILOG¬ non-invention rules. The production rules generated by the compiler are presented here in pseudo code for easier understanding. As indicated in the ALD, the two rules in A-to-B are named R1 and R2, and the two rules in B-to-A are named R3 and R4. The nrecILOG¬ rule

    A.person-no(p,n) :- B.in-office(p,o), B.office-no(o,n);   // R3

is translated into the following two production rules.

    Rule Name: populate-A.person-no
    Trigger:   any update changing the truth value of
               (∃o B.in-office(p,o) ∧ B.office-no(o,n))
               with the variable bindings of p, n; i.e., a change made to the
               rule body that is used to populate the rule head.
    Action:    IF (∃o B.in-office(p,o) ∧ B.office-no(o,n)) is becoming true
               THEN append "INSERT A.person-no(p,n)" to Remote-Agenda
               ELSE append "DELETE A.person-no(p,n)" to Remote-Agenda

    Rule Name: delay-checking-A.person-no
    Trigger:   any update changing the truth value of A.person-no(p,n)
               with the variable bindings of p, n; i.e., a change made to the
               rule head.
    Action:    IF A.person-no(p,n) is becoming true
               THEN append "Does there exist an office o such that
                    (B.in-office(p,o) ∧ B.office-no(o,n)) is TRUE?"
                    to Delay-Checking-List
               ELSE append "For each office o, is
                    (B.in-office(p,o) ∧ B.office-no(o,n)) FALSE?"
                    to Delay-Checking-List

Notice that the action part of rule populate-A.person-no does not execute the insertion/deletion on A.person-no itself; instead, it appends a request for the insertion/deletion to the data structure Remote-Agenda.
Also, in the action part of rule delay-checking-A.person-no, the action simply adds a query to another data structure, Delay-Checking-List. The reason is that the production rule is located in one database, while its action involves two databases. If the action part of a production rule were allowed to directly update the remote database, this might result in a deadlock. Remote-Agenda and Delay-Checking-List are used by the RemTrans mechanism to buffer remote requests proposed by various production rules, so that remote update requests are sent to the remote database in a batch mode. The details of the RemTrans mechanism will be presented in Section 3.3.

The remaining three rules can be translated in a similar fashion. Since a population rule resides in the source database of its nrecILOG¬ rule and a delay checking rule resides in the destination database, the resulting concrete linkages for this example are:

$\mathcal{L}_A$: It contains production rules populate-B.in-office, populate-B.office-no, delay-checking-A.person-no, and delay-checking-A.no-office.

$\mathcal{L}_B$: It contains production rules populate-A.person-no, populate-A.no-office, delay-checking-B.in-office, and delay-checking-B.office-no. □

3.3 Remote Transaction Control (RemTrans)

Once the resulting $\mathcal{L}_A$ and $\mathcal{L}_B$ are loaded into A and B, respectively, any incremental update to one database may trigger some production rules in the concrete linkage of the local database. The repair actions of those triggered rules may propose updates to the remote database, which may further trigger some production rules in the concrete linkage of the remote database. Therefore, it is conceivable that there is a back-and-forth process of update exchanges before both databases finally commit. In the Bird system, the RemTrans mechanism helps to orchestrate this "negotiation" process between the two databases.

[Figure 3.1: A naive scenario of using "remote-execute". Initial instances: A.person-no (Name, Telephone) = {(Ben, 123), (Sherry, 456)}; A.no-office (Telephone, Office) = {(123, SAL250), (456, SAL130)}; B.in-office (Name, Office) = {(Ben, SAL250), (Sherry, SAL130)}; B.office-no (Office, Telephone) = {(SAL250, 123), (SAL130, 456)}. Steps: (1) the user issues the updates insert A.person-no(John,007) and insert A.no-office(007,SAL250) on A; (2) database A is locked and the updates are executed; (3) production rules populate-B.in-office and populate-B.office-no are fired; (4) the remote updates insert B.in-office(John,SAL250) and insert B.office-no(SAL250,007) are sent to B; (5) database B is locked and the updates are executed; (6) production rules populate-A.person-no and populate-A.no-office are fired; (7) the remote updates insert A.person-no(John,123) and insert A.person-no(Ben,007) are sent to A: deadlock.]

3.3.1 Motivation of RemTrans

A naive way to allow the ACTION part of a production rule to manipulate data in a remote database is to simply provide a "remote-execute" procedure that can execute commands updating data in the remote database. With this "remote-execute" procedure, a production rule can manipulate the remote data directly. Unfortunately, this approach might result in a deadlock. In the following, an update scenario of the "Office" example using "remote-execute" is presented to illustrate this.

Consider the scenario depicted in Figure 3.1. Assume the user issues a transaction in A (1). For most database systems, a database is first locked before a transaction starts. Therefore, A is first locked before the transaction is executed (2). Suppose that during the execution of the user update some production rules in database A are triggered (3). As a result, the production rules use the "remote-execute" function call to execute updates in the remote database B (4). After locking database B (5), the remote updates are executed. Suppose that the updates in (5) further trigger some production rules in B (6). When the production rules in B try to call "remote-execute" to perform updates back in the A database (7), a deadlock happens. This is because the transaction on the A database has not completed and the A database is still locked.
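The locking problem in this scenario can be reproduced in a few lines. The Python sketch below is a toy model and not Bird code: each database is reduced to one non-reentrant lock, and the class and function names are assumptions made for illustration. The first function replays steps (2), (5) and (7) of Figure 3.1 with a non-blocking acquire, so that the would-be deadlock is reported instead of hanging the program. The second function shows the alternative used by the population rules of Example 3.2, which only buffer their requests in a Remote-Agenda.

```python
import threading


class Database:
    """Toy stand-in for one database: a single lock plus a request buffer."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()   # non-reentrant, like an exclusive database lock
        self.remote_agenda = []        # buffered update requests for the other database


def naive_remote_execute():
    """Replay steps (2), (5) and (7) of Figure 3.1 with direct remote updates."""
    a, b = Database("A"), Database("B")
    a.lock.acquire()                            # (2) A is locked for the user transaction
    b.lock.acquire()                            # (5) B is locked to run A's remote updates
    # (7) A rule fired in B needs A's lock, which the unfinished transaction still holds.
    acquired = a.lock.acquire(blocking=False)   # a blocking acquire would wait forever
    if acquired:
        a.lock.release()
    b.lock.release()
    a.lock.release()
    return "ok" if acquired else "deadlock"


def buffered_requests():
    """Rules fired in A record proposals instead of touching B directly."""
    a = Database("A")
    with a.lock:
        a.remote_agenda.append(("INSERT", "B.in-office", ("John", "SAL250")))
        a.remote_agenda.append(("INSERT", "B.office-no", ("SAL250", "007")))
        batch = list(a.remote_agenda)   # shipped as one message once rule firing stops
        a.remote_agenda.clear()
    return batch
```

In this model `naive_remote_execute()` reports a deadlock, whereas `buffered_requests()` never needs the other database's lock while rules are firing; how the buffered batch is then exchanged is the subject of the RemTrans protocol.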
Besides the need to prevent deadlock, there is also the need for a protocol between the two databases to synchronize and simulate the human negotiation process: one database proposes some update to the other database, then the other database may counter-propose some update back to the original database. That is, there may be a back-and-forth proposal/counter-proposal exchange during the negotiation process. In the end, the accumulated results of the negotiation can be committed to the databases only if both parties are satisfied with what they have. If either party is unhappy with the result, the updates for both of the databases must be undone.

3.3.2 Managing the Communications

RemTrans in Bird is the mechanism preventing deadlock and synchronizing interactions between databases. When a user issues an update on one database, the RemTrans mechanism wraps the whole interaction between the two databases into one "global transaction", so that a user-issued incremental update and its corresponding actions issued by the production rules in both databases are committed or aborted as one transaction.

When the user on the local database issues a transaction through RemTrans, the RemTrans mechanism first creates two processes, RemTrans_A and RemTrans_B, simultaneously on the two databases. Then the two databases are locked. After that, interactions between databases A and B are taken over by the RemTrans mechanism. The two databases will be unlocked when the transactions of the two databases are both committed or both aborted.

Without loss of generality, assume that the user issues a transaction T through function REMOTE-TRANS of RemTrans at database A. After the user issues REMOTE-TRANS(T), the RemTrans mechanism goes through the following phases to complete the "remote transaction". Notice that there are two checking phases below.
The local process enters the Self Checking Phase when there are no remote updates in the Remote-Agenda, i.e., all the production rules in the local process are satisfied. The local process enters the Requested Checking Phase when it receives the "I am satisfied. How about you?" message from the remote process.

Initialization Phase: RemTrans_A first creates a remote process RemTrans_B by the RPC mechanism as its negotiation counterpart in database B. Instead of starting from the Initialization Phase, the newly created RemTrans_B starts directly from the Listening Phase.

Execution Phase: RemTrans_A then performs the transaction T on database A. The execution of T may trigger some population and delay-checking production rules in $\mathcal{L}_A$. Recall from the previous section that the ACTION parts of production rules generated by the Bird compiler may contain the following actions:

• The ACTION part of a population rule will append some remote update requests to the data structure Remote-Agenda.
• The ACTION part of a delay checking rule will append some remote predicate query requests to the data structure Delay-Checking-List.

If there is no request in Remote-Agenda, then RemTrans_A jumps to the Self Checking Phase. Otherwise it moves to the Proposing Phase.

Proposing Phase: Pack the remote update requests stored in the Remote-Agenda into a message and send it to database B by issuing

    SEND-MESSAGE(B, "Please execute (Remote-Agenda)")

Listening Phase: RemTrans_A then waits for any counter-proposal or reply from database B. If the reply obtained by RECEIVE-MESSAGE(B) is

• "Please evaluate (Remote-Query)", then evaluate the Remote-Query against the current state of database A, return the TRUE/FALSE result to database B, and return to the Listening Phase.
• "Please execute (New-Proposal)", then set T = New-Proposal and go back to the Execution Phase.
• "Abort Global Transaction!", then abort and undo the RemTrans_A transaction and exit.
• "I am satisfied. How about you?", then go to the Requested Checking Phase.
• "Global Transaction Committed", then commit the transaction and exit.

Self Checking Phase: Pack the remote queries in Delay-Checking-List into a message and issue

    SEND-MESSAGE(B, "Please evaluate (Delay-Checking-List)")

If the reply is

• TRUE, then issue SEND-MESSAGE(B, "I am satisfied. How about you?") and go to the Listening Phase.
• FALSE, then issue SEND-MESSAGE(B, "Abort Global Transaction!"), undo the transaction and exit.

Requested Checking Phase: Unlike the Self Checking Phase, this phase occurs only after the other side has performed its self checking and the local database has received the "I am satisfied. How about you?" message. First pack the queries in Delay-Checking-List into a message and issue

    SEND-MESSAGE(B, "Please evaluate (Delay-Checking-List)")

If the reply is

• TRUE, then issue SEND-MESSAGE(B, "Global Transaction Committed"), commit the transaction and exit.
• FALSE, then issue SEND-MESSAGE(B, "Abort Global Transaction!"), undo the transaction and exit.

A scenario depicting the phase transitions under RemTrans is shown in Figure 3.2. Under the RemTrans mechanism all proposed updates to the remote database are first buffered, then packed into a message (not as a transaction) and sent to the RemTrans server on the other side. In this fashion, a deadlock situation will not happen during the interaction between databases A and B.

[Figure 3.2: Example scenario for phase transitions under RemTrans. Database A passes through the Initial, Execution, Proposing, Listening and Requested Checking Phases and then commits; database B passes through the Initial, Listening, Execution, Proposing, Listening, Execution (no more rule firings), Self Checking and Listening Phases and then commits. The messages exchanged are update proposals, remote queries, "I am satisfied. How about you?" and the final commit notice.]

3.4 Socket and RPC, the Communication Mechanism

The communication subsystem in Bird is built on top of the BSD socket [Bac86] but can be easily modified for other communication paradigms. Figure 3.3 illustrates the system diagram of the communication subsystem. In particular, each database has a running process, the post-office, connected with the other post-offices through a BSD socket. The unit of communication is a message of the form (Sender, Receiver, Content). The post-office maintains a message-pool containing the incoming messages from other remote post-offices.

The communication subsystem provides services through the following two virtual functions.

send-message(Receiver, Content): This function encodes the Receiver and Content into a message, then sends it to the post-office of the destination host.

receive-message(Receiver): This function returns a message previously stored in the message-pool whose receiver field is Receiver.

[Figure 3.3: System diagram of the Communication Subsystem. On each of databases A and B, RemTrans calls send-message and receive-message on the local post-office; the two post-offices are connected by a BSD socket.]

As shown in Figure 3.3, the RemTrans mechanism uses the two virtual functions to perform inter-database communications.

3.5 Linkage Implicit Constraints (LICs)

By definition, a semantic correspondence specifies a set of "acceptable instance pairs" that can co-exist in the two databases. It is conceivable that not all instances of one database have corresponding instances in the other database. That is, it is possible that the semantic correspondence does not correspond to a total mapping from Inst(A) to Inst(B).
As a result, maintaining a semantic correspondence may have a side-effect on each local database, as if new constraints were introduced on the databases which eliminate the instances not appearing in the semantic correspondence. Figure 3.4 illustrates this situation. Here each bold dot represents an instance of database A or database B. Only some instances of A lie within the first column of $SC_{AB}$, and likewise only some instances of B lie within the second column of $SC_{AB}$. These constraints are called the linkage implicit constraints (LICs).

[Figure 3.4: Linkage Implicit Constraints (LICs). After $SC_{AB}$ is maintained by Bird, the instances of database A are restricted from Inst(A) to $\pi_1(SC_{AB})$ and the instances of database B from Inst(B) to $\pi_2(SC_{AB})$.]

[Figure 3.5: Dynamic behaviors of LICs. Starting from an instance pair in the semantic correspondence, a user-issued update $\Delta$ on database A is followed by propagation rule repairs on Inst(A) and Inst(B).]

Example 3.3: Continuing with the "Office" example, assume that initially there are no constraints or production rules in A and B. Then the concrete linkages $\mathcal{L}_A$ and $\mathcal{L}_B$ presented in Example 3.2 are loaded into databases A and B, respectively. The current instances of the two databases are:

    A.person-no:   Person   No          A.no-office:   No    Office
                   Ben      123                        123   SAL250
                   Sherry   456                        456   SAL130

    B.in-office:   Person   Office      B.office-no:   Office   No
                   Ben      SAL250                     SAL250   123
                   Sherry   SAL130                     SAL130   456

Assume the user at database A issues the following transaction:

    Transaction{
      INSERT person-no(Dave,789);
    }

According to the current database instance, no (propagation) production rule is triggered. Therefore, RemTrans_A goes to the Self Checking Phase and sends the remote query

    "Please evaluate (does there exist an office o such that in-office(Dave,o) and office-no(o,789))"

to database B. The remote query returns false, which means that the static constraint implied by the nrecILOG¬ rule

    A.person-no(p,n) :- in-office(p,o), office-no(o,n);

has been violated by the user update. Therefore the whole transaction is aborted and undone.
As a matter of fact, once the linkage has been established, database A will be affected as if a new inclusion dependency

$$\text{A.person-no}[2] \subseteq \text{A.no-office}[1]$$

had been added to the database. Since there is no office associated with telephone number 789, the user update INSERT person-no(Dave,789) is rejected as if it violated the above LIC. □

Now let us consider the dynamic aspect of LICs. The propagation of incremental updates in the Bird system is driven by the firings of the production rules in the concrete linkages. Importantly, because of the delay checking rules, at the end of each non-aborting RemTrans transaction the two databases necessarily reach a new instance pair in the semantic correspondence. However, as mentioned in Section 3.3, during the negotiation process of the RemTrans transaction both databases may go through several intermediate states. This is illustrated in Figure 3.5: even though the initial update $\Delta$ by the user may not put the database into a legal state in the semantic correspondence, the negotiation process between the two databases may eventually lead to an instance pair $(I', J')$ in the semantic correspondence.

Example 3.4: Refer to the scenario shown in Figure 3.1. Suppose the user issues

    Transaction{
      insert A.person-no(John,007);
      insert A.no-office(007,SAL250);
    }

through RemTrans, so that there is no deadlock. Then, after (7), all the nrecILOG¬ rules are satisfied and all the updates are committed. However, besides the initial update issued by the user, the interaction between A and B adds two more insertions

    insert A.person-no(John,123);
    insert A.person-no(Ben,007);

to the total delta. This acts as if the user's transaction violated an implicational constraint [Fag82]

$$\forall x, y, z, w\; \bigl(\text{A.person-no}(x, y) \wedge \text{A.no-office}(y, z) \wedge \text{A.no-office}(w, z)\bigr) \rightarrow \text{A.person-no}(x, w)$$

and the additional deltas are repairs proposed by this LIC.
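The two LICs of Examples 3.3 and 3.4 can be evaluated directly on the A-side relations. The Python sketch below is a hand-written check for these two constraints only; it is not how Bird derives or enforces LICs (that happens implicitly through the production rules and RemTrans), and the function names and the tuple encoding of the relations are assumptions made for illustration.

```python
def inclusion_violations(person_no, no_office):
    """Telephone numbers in A.person-no[2] that are missing from A.no-office[1]."""
    return {n for (_, n) in person_no} - {n for (n, _) in no_office}


def implied_repairs(person_no, no_office):
    """Tuples that the implicational constraint forces into A.person-no.

    person-no(x,y) and no-office(y,z) and no-office(w,z) imply person-no(x,w);
    the rule is applied repeatedly until nothing new is derived.
    """
    closed = set(person_no)
    while True:
        required = {
            (x, w)
            for (x, y) in closed
            for (y2, z) in no_office if y2 == y
            for (w, z2) in no_office if z2 == z
        }
        missing = required - closed
        if not missing:
            return closed - set(person_no)
        closed |= missing


# Initial A-side instance of the Office example.
PERSON_NO = {("Ben", "123"), ("Sherry", "456")}
NO_OFFICE = {("123", "SAL250"), ("456", "SAL130")}
```

With these definitions, adding (Dave, 789) to `PERSON_NO` yields the violation set {"789"}, matching the rejection in Example 3.3, and applying `implied_repairs` after the two insertions of Example 3.4 yields exactly the two extra tuples (John, 123) and (Ben, 007).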
□

There are two important open problems concerning LICs: (1) Given a semantic correspondence SC, what is the set of LICs implied by SC? (2) Given a semantic correspondence SC and the associated LICs, what output will Bird produce if the user update initially yields an instance violating some LIC? The answers to these questions can help the DBA to understand the side effects and semantic changes incurred by the semantic correspondence. However, these problems are outside the scope of this thesis.

Chapter 4

Uni-directional Propagation of Incremental Updates Involving OID

In this chapter, the capabilities of Bird are upgraded by adding the ability to maintain uni-directional semantic correspondence involving OIDs. Although most of the framework introduced in Chapter 3 can be used in the presence of OIDs, there are major differences between a uni-directional value-based case and a uni-directional object-based case. To begin, a new section, Entity-Equivalence, is added to ALDs, specifying the following information.

Entity Equivalence Specification (EES): In a semantic correspondence involving OIDs, the Bird system allows the DBA to specify pairs of entity types in the two databases that are "equivalent," i.e., they hold sets of OIDs standing in a one-one correspondence with each other. This specification is called the Entity Equivalence Specification (EES) in the ALD. The semantics of the EES and a rewriting system translating entries in an EES into nrecILOG¬ rules are presented in Section 4.1.

With the presence of OIDs, the nrecILOG¬ program in an ALD contains invention rules specifying relationships between entity sets. As a result the following fundamental problem arises.

OID association problem: To support incremental OID creation/destruction, the Bird system needs to carry information associating remote OIDs in the target database with their local "witness" values in the source database. This problem is described in Section 4.2.
The following extensions of the Bird system introduced in Chapter 3 are developed to remedy this problem.

Augmentation of source schema during the ALD compilation: For each EES entry and ILOG invention rule in $P_{A \mapsto B}$ of the ALD, the ALD compiler will augment the A schema to include the intermediate relation defined by the rule or the EES entry (Section 4.3).

Translation of ILOG invention rules: The production rules used for invention rules need to create OIDs in the remote database and remember the association between the OIDs invented and their witness values. The translation (Section 4.4) is more involved because of (1) remote OID invention/deletion and (2) communication of OIDs across the border of databases.

Machine Dependent Object Translation (MDOT): When introducing OIDs in a semantic correspondence, the Bird system needs to communicate OIDs between databases. Since OIDs are only meaningful within the local database, a machine dependent object translation mechanism is needed to encode local OIDs into global or "immutable" OIDs (Section 4.5).

When introducing the first two extensions above, we ignore the problems raised by the locality of OIDs. In particular, we assume that all OIDs are global or immutable until Section 4.5, when the MDOT mechanism is introduced.

Finally, in Section 4.6 at the end of this chapter, the subtle difference between the static and dynamic semantics of ALDs is discussed. This difference is mainly due to the non-deterministic nature of assigning the OIDs at run time: if an object with OID o is destroyed and then created again, the new OID is not necessarily o even though it is created with the same witness value.
This phenomenon raises the profound question of whether the identification of an object can be truly represented by its surrogates Throughout this chapter, two examples will be used to illustrate various com ponents of our approach. The first example illustrates the situation when the pairs of entity sets from the two databases correspond exactly; the second one illustrates cases where a pair of entity sets in the two databases correspond in a less direct fashion. 46 Database A Database B C't Node ir S g y ^ 0 " nrrmfe node-label A.c-name B.node-label City name C | L.A. c2 S.F. c3 N.Y. Node label nl L.A. n2 S.F. n3 N.Y. Figure 4.1: Schemas and instances for “City” V E x a m p le 4.1: “ City” In this example, database A contains entity type A . City and its attribute A . c-name; database B contains the entity type B.Node and its attribute B.node-label. There is a one-one correspondence between the cities in A.City and the nodes in B.Node. The schemas and example instances are depicted in Figure 4.1. The following ALD specifies th at correspondence. (abs-linkage city A-schema ( City(entity); c-name(City,string); ) B-schema ( Node(entity); node-label(Node,string); ) Entity-Equivalence ( A.City = B.Node; //El ) A-to-B ( B.node-label(o,1) A.City=Node[c,o], A.c-name(c,l); // R1 ) ) Notice that in the above ALD, Entity-Equivalence is a new section. It con tains an entity equivalence specification (EES), with El specifying th at A.City and 47 Database A Database B Segment (node) ■pl s t r i n g ! has-node segment-info I STRING I I STRING I (time) (price) A.has-node Segment node S.F. *2........ N.Y. A.segment-info Segment time price S1 10:00 65 sl 14:00 80 *2 09:00 150 Flight flight-info ™ § ] fSTR fN G l (time) B.flight-info STRING I Flight city time f. S.F. 10:00 h S.F. 14:00 b N.Y. 09:00 B.price Flight price f. 65 b 80 ............ 150 Figure 4.2: Schemas and instances of Example “Segment Flight” B .Node are “equivalent” entity types. 
Associated with E1 is one system-maintained intermediate relation (see footnote 1), A.City=Node, storing the matching relationship between the cities in A.City and their “equivalent” counterparts among the nodes in B.Node. The details of the EES are discussed in the next section. □

Footnote 1: Later, in Chapter 5, when bi-directional ALDs are discussed, there are two duplicate copies of the system-maintained intermediate relation associated with an EES entry. The two intermediate relations are stored in databases A and B, respectively.

Example 4.2: “Segment Flight”
Continuing with Example 1.2, the following is the uni-directional ALD for the “Segment Flight” example. In Chapter 5, this example will be revisited for the bi-directional case.

    (abs-linkage segment-flight
      A-schema (
        Segment(entity);
        has-node(Segment,string);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        flight-info(Flight,string,string);
        price(Flight,string);
      )
      A-to-B (
        A.i-flight[*,s,t] :- A.segment-info(s,t,p);                 // S1
        B.Flight(f) :- A.i-flight[f,s,t];                           // S2
        B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c); // S3
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);   // S4
      )
    )
□

In the above ALD, rule S1 is a nrecILOG¬ invention rule. Notice that for a nrecILOG¬ invention rule, the source and destination databases are the same. That is, the intermediate relation A.i-flight (which is not included in the A schema) is assigned to be located in database A. With the intermediate relation materialized (Section 4.3) and treated as an ordinary relation, each predicate in the rule body of a nrecILOG¬ rule is located in the source database (A), and each rule head, except for those of the invention rules, is located in the destination database (B).

From the two examples above, it looks convenient and straightforward to use nrecILOG¬ invention rules to describe the semantic correspondence in the object-based case.
However, we will see in a later discussion that the invention of OIDs across the border of a database needs special treatment.

4.1 Entity Equivalence Specification (EES)

For the object-based semantic correspondence, the DBA can use the Entity Equivalence Specification (EES) in an ALD to specify the equivalence relationship between an entity type in A and an entity type in B. In the “City” example above, there is an EES entry

    A.City = B.Node  // E1

in the Entity-Equivalence section of the ALD. Intuitively speaking, E1 specifies that there is a one-one correspondence between OIDs in A.City and OIDs in B.Node. Associated with E1 is the system-maintained EES relation A.City=Node, recording the matching relationship between OIDs of the two entity types.

The semantics of the EES is very similar, at least for the uni-directional case, to the semantics of nrecILOG¬ invention rules. As a matter of fact, in Bird, the EES is implemented by the following rewriting rules, which translate an ALD L containing EES entries into another ALD, denoted EES-to-ILOG(L), replacing each EES relation by a nrecILOG¬ invention rule.

Rewriting System 4.1: For each EES entry A.R = B.S in an ALD, the EES translator EES-to-ILOG rewrites the $P_{A \mapsto B}$ program in the ALD in the following ways.

(1) The following ILOG rule is generated and added to $P_{A \mapsto B}$:

    i-S[* S,r] :- A.R(r);

(2) Each occurrence of the predicate A.R=S[x,y] in the ALD is substituted by i-S[y,x].
□

In the above rewriting, (1) ensures that whenever there is an OID r in A.R, there is a unique OID s in B.S corresponding to it. Since the intermediate relation i-S actually holds the matching correspondence between OIDs in A.R and OIDs in B.S, it can play the role of the EES relation A.R=S. (2) is just a syntactic modification substituting i-S for A.R=S.
It can easily be shown that EES-to-ILOG(L) emulates the semantics of the original L, and that the semantic correspondences specified by EES-to-ILOG(L) and L are the same.

Example 4.3: For the “City” example, EES-to-ILOG(L) is derived from L by replacing the original $P_{A \mapsto B}$ with

    A-to-B (
      B.node-label(o,l) :- A.i-node[o,c], A.c-name(c,l);  // R1'
      A.i-node[*,c] :- A.City(c);                         // R2'
      B.Node(n) :- A.i-node[n,c];                         // R3'
    )

Notice that in the above, R2' and R3' are added because of rewriting rule (1), and R1 was rewritten into R1' by (2). □

From the rewriting process above, it seems that invention rules alone are sufficient to express the entity equivalence relationships between entity types. The justification for providing a second way, other than nrecILOG¬ invention rules, to specify the entity equivalence relationship is not clear for the uni-directional case. However, in Section 5.5 we will see that by writing an EES entry, semantically speaking, the DBA means more than just the OID creation/destruction behaviors between two entity types. As a rule of thumb, for the time being, whenever there is a one-one correspondence between OIDs of two entity types in the two databases, the DBA should use an EES to articulate the relationship in the ALD.

For the current implementation, entries in an EES are restricted in the following ways. These restrictions apply to both uni-directional and bi-directional cases.

• Each entry contains exactly one entity type in A and one entity type in B.
• Each equivalence class specified by the entries in an EES cannot contain more than two entity types.

For the remainder of this chapter, unless otherwise specified, we assume that the ALDs under discussion are first pre-processed by EES-to-ILOG, so that $P_{A \mapsto B}$ contains only nrecILOG¬ rules with no EES relations appearing in the rule bodies.
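The uni-directional rewriting of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Bird implementation: rules are handled as plain strings, the function name ees_to_ilog is hypothetical, the intermediate relation is named after S in lower case (as in the “City” example), and the invention rule is emitted in its expanded two-rule form.

```python
import re


def ees_to_ilog(ees_entries, a_to_b_rules):
    """Return the A-to-B rules with every EES predicate rewritten away.

    ees_entries  -- list of (R, S) pairs, one per EES entry "A.R = B.S"
    a_to_b_rules -- list of rule strings in the ALD syntax used above
    """
    rewritten = list(a_to_b_rules)
    for r_type, s_type in ees_entries:
        # Assumption: the intermediate relation is named after S in lower
        # case, as in the "City" example (Node -> A.i-node).
        witness = "A.i-" + s_type.lower()
        # Rewriting rule (2): A.R=S[x,y] becomes i-S[y,x]; the two
        # arguments are swapped.
        ees_predicate = re.compile(
            r"A\." + re.escape(r_type) + "=" + re.escape(s_type)
            + r"\[\s*([^,\]\s]+)\s*,\s*([^,\]\s]+)\s*\]"
        )
        rewritten = [
            ees_predicate.sub(witness + r"[\2,\1]", rule) for rule in rewritten
        ]
        # Rewriting rule (1): add the invention rule, written here as the
        # two rules it abbreviates.
        rewritten.append(f"{witness}[*,r] :- A.{r_type}(r);")
        rewritten.append(f"B.{s_type}(s) :- {witness}[s,r];")
    return rewritten


city_rules = ["B.node-label(o,l) :- A.City=Node[c,o], A.c-name(c,l);"]
for rule in ees_to_ilog([("City", "Node")], city_rules):
    print(rule)
```

Applied to the “City” ALD, the sketch yields the rewritten label rule followed by the two added rules, up to the choice of variable names.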
4.2 OID Association Problem

This section explores the OID association problem that arises when OIDs in the target database are destroyed incrementally by updates propagating from the remote database. This motivates the need for the schema augmentation process (Section 4.3), which materializes the intermediate relations appearing in the ALD.

First, let us recall the OID invention process of the ILOG language presented in Section 2.3, using the schemas in the “City” example.

(1) Skolemize program $P_{A \mapsto B}$: The ILOG program $P_{A \mapsto B}$ is first skolemized into

    A-to-B (
      A.i-node[f(c),c] :- A.City(c);                      // R1
      B.node-label(o,l) :- A.i-node[o,c], A.c-name(c,l);  // R2
    )

That is, the * sign in the intermediate relation i-node of the invention rule R1 is substituted by the skolem term f(c).

(2) Compute the skolemized pre-instance: Use the instance of database A shown in Figure 4.1 as the source and run the skolemized $P_{A \mapsto B}$ as if it were an ordinary Datalog program. The resulting skolemized pre-instance for database B and the intermediate relation A.i-node are

    A.i-node
    Node   City
    f(c1)  c1
    f(c2)  c2
    f(c3)  c3

    B.node-label
    Node   label
    f(c1)  L.A.
    f(c2)  S.F.
    f(c3)  N.Y.

(3) Compute the instance from the pre-instance: Substitute each distinct skolem term in the skolemized pre-instance with a unique OID value. Assume that the mapping is

    f(c1) -> n1
    f(c2) -> n2
    f(c3) -> n3

Then the resulting instance for B is the one depicted in Figure 4.1.

Recall that, in the original ILOG semantics, the intermediate relations served only as “scratch paper”. They are not part of the output instance and are thrown away after the instance transformation process terminates. However, as we will see below, when dealing with incremental OID deletions, the association between OIDs and their witness values stored in the intermediate relations is indispensable.
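The three steps above can be illustrated with a small Python sketch for the “City” program. It is an assumption-laden illustration rather than the ILOG evaluator: relations are sets of tuples, a skolem term f(c) is represented by the tuple ("f", c), and the names invent_oids and oid_for_witness are hypothetical. The last helper shows why keeping i-node matters: with it, the OID belonging to a witness is found by a simple lookup.

```python
from itertools import count


def invent_oids(cities, c_name, fresh_oids):
    """Evaluate the skolemized "City" program and assign OIDs to skolem terms."""
    # Steps (1) and (2): run the skolemized rules as ordinary Datalog.
    # A skolem term f(c) is represented by the tuple ("f", c).
    i_node_pre = {(("f", c), c) for c in cities}                 # rule R1
    label_pre = {(o, l) for (o, c) in i_node_pre
                 for (c2, l) in c_name if c2 == c}               # rule R2
    # Step (3): replace each distinct skolem term by a unique OID.
    # Any one-to-one assignment is acceptable; sorting only makes this
    # sketch deterministic.
    skolem_terms = sorted({o for (o, _) in i_node_pre})
    oid_of = {term: next(fresh_oids) for term in skolem_terms}
    i_node = {(oid_of[o], c) for (o, c) in i_node_pre}
    node_label = {(oid_of[o], l) for (o, l) in label_pre}
    return i_node, node_label


def oid_for_witness(i_node, witness):
    """Look up the B-side OID of a witness value in the stored i-node relation."""
    return next(oid for (oid, c) in i_node if c == witness)


cities = {"c1", "c2", "c3"}
c_name = {("c1", "L.A."), ("c2", "S.F."), ("c3", "N.Y.")}
i_node, node_label = invent_oids(cities, c_name, (f"n{k}" for k in count(1)))
print(sorted(node_label))
print(oid_for_witness(i_node, "c1"))
```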
Consider the following deletion scenario for the “City” example. Suppose a user at database A issues the transaction

    { delete City(c1); delete c-name(c1,L.A.); }

According to R1 of $P_{A \mapsto B}$, the OID of B.Node associated with the witness c1, or more precisely with the logical OID f(c1), should be deleted from the B database. However, as shown in [CH94], in the absence of other information, the problem of finding the correct OID is essentially equivalent to the graph isomorphism problem [GJ79] and thus probably intractable. The solution is to store the intermediate relations (e.g., i-node in this example) in the source database A. By [CH94], with the intermediate relations carrying the OID association information, the problem of finding the correct OID can be solved in polynomial time.

4.3 Augmentation of Source Schema

Continuing with the discussion in the previous section, to keep this OID/witness association information available in database A, for each nrecILOG¬ invention rule r in $P_{A \mapsto B}$, the ALD compiler generates a relation definition for the intermediate relation appearing in the rule head of r and puts it into the concrete linkage $\Sigma_A$. These new relations generated by the ALD compiler are called the witness relations, and the family of these is denoted by auxA. After loading $\Sigma_A$, these witness relations are materialized in database A. As will be shown in the next section, the production rules generated from the invention rules in $P_{A \mapsto B}$ maintain the association information in the witness relations incrementally.

For the “City” example, since there is only one invention rule R1, with A.i-node as rule head, auxA contains one witness relation, and the following AP5 relation definition is generated by the ALD compiler and put into $\Sigma_A$.

    (DEFRELATION i-node :TYPES (Entity City))

Similarly, for the “Segment-Flight” example, auxA contains the relation i-flight, and the following AP5 relation definition is generated and put into $\Sigma_A$.
    (DEFRELATION i-flight :TYPES (Entity Entity String))

[Figure 4.3: The semantics of OID invention rules. To compute $J_B$, $P_{A \mapsto B}$ first computes $I_{auxA}$. With $I_{\bar{A}} = I_{auxA} \cup I_A$, it then computes $J_B$ and throws $I_{auxA}$ away. This is why each execution of $P_{A \mapsto B}$ will produce an equivalent but not equal physical image.]

After being loaded into the A database, these witness relations are materialized and serve as ordinary relations in A, except that they are accessible only by the production rules generated from the invention rules in the ALD, e.g., R1 for the “City” example and S1 for the “Segment-Flight” example.

The details of the interaction between the production rules generated from nrecILOG¬ invention rules and the witness relations will be given in the next section. Conceptually speaking, with the presence of witness relations, the semantics of $P_{A \mapsto B}$ can be viewed as the two-phase process (see footnote 2) depicted in Figure 4.3; i.e., program $P_{A \mapsto B}$ can be divided into two parts: $P^{inv}_{A \mapsto B}$, containing all the invention rules in $P_{A \mapsto B}$, and $P^{non}_{A \mapsto B}$, containing all the non-invention rules in $P_{A \mapsto B}$. To compute the target instance $J_B$ from the source instance $I_A$, the Bird system first computes the witness $I_{auxA}$ (the physical instance of the witness relations) by

    $P^{inv}_{A \mapsto B}(I_A) = I_{auxA}$

Then the target instance can be computed by

    $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) = J_B$

Footnote 2: This relies heavily on the restriction that there is no recursion in a nrecILOG¬ program.

The notion of witness can be defined formally as follows.

Definition: Witness
Given an ILOG program $P_{A \mapsto B}$ and instances $I_A$, $J_B$ of databases A and B such that $[P_{A \mapsto B}(I_A)] = [J_B]$, or $P_{A \mapsto B}(I_A) \equiv J_B$; that is, the resulting (physical) instance of $P_{A \mapsto B}(I_A)$ is OID-equivalent to $J_B$. Writing $P^{inv}_{A \mapsto B}$ and $P^{non}_{A \mapsto B}$ for the invention rules and the non-invention rules of $P_{A \mapsto B}$, respectively, an instance $I_{auxA}$ over auxA is a potential witness from $I_A$ to $J_B$ if $P^{inv}_{A \mapsto B}(I_A) \equiv I_{auxA}$ and $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) \equiv J_B$. A potential witness $I_{auxA}$ is a witness from $I_A$ to $J_B$ if $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) = J_B$. (The notion of witness from $J_B$ to $I_A$ in the bi-directional case is defined analogously.) □

Using the source and target instances depicted in Figure 4.1, the corresponding witness $I_{auxA}$ for the “City” example is

    A.i-node
    Node  City
    n1    c1
    n2    c2
    n3    c3

For the “Segment-Flight” example, with the source and target instances in Figure 4.2, the witness $I_{auxA}$ is

    A.i-flight
    Flight  Segment  Time
    f1      s1       10:00
    f2      s1       14:00
    f3      s2       09:00

Due to the OID invention semantics of ILOG, instances of the witness relations satisfy certain functional dependencies.

Definition: Skolem Functional Dependency
Let r be a relation with arity n. A skolem functional dependency over r is denoted SFD(r). For each instance I (see footnote 3), $I(r) \models SFD(r)$ iff $I(r) \models 1 \to \{2, \ldots, n\}$ and $I(r) \models \{2, \ldots, n\} \to 1$. □

Footnote 3: The notation used here is that of conventional functional dependencies, using coordinate positions to indicate attributes.

As stated in the next lemma, a special property of the nrecILOG¬ OID invention process implies that each witness relation r satisfies SFD(r).

Lemma 4.3.1: Let $I_{auxA}$ be a witness. For each intermediate relation r in auxA, $I_{auxA} \models SFD(r)$; that is, $I_{auxA}$ satisfies the skolem functional dependency of r.

Proof: It follows immediately from the semantics of OID invention of an invention rule in ILOG. □

4.4 Compilation of ILOG Invention Rules

Continuing with the discussion of witness relations in the previous section, we now consider the translation of an ILOG invention rule r into a population rule $r_p$ (see footnote 4).
The trigger part of $r_p$ is translated in a fashion similar to that of an ILOG non-invention rule; it monitors the conjunction predicate derived from the rule body of r against changes in the source database of r. It is the action part of $r_p$ that needs special treatment to handle OID creations in the remote database. We now use the invention rule R1 of the “City” example (generated from E1)

    A.i-node[* Node,c] :- A.City(c);  // R1

as an example. As mentioned in Section 2.3, R1 is actually shorthand for the following two rules.

    A.i-node[*,c] :- A.City(c);  // R1-1
    B.Node(n) :- A.i-node[n,c];  // R1-2

In particular, R1-2 can be translated as a non-invention rule, as presented in Section 3.2; R1-1 is translated into the following production rule, which is put into $\Sigma_A$.

    Rule Name: populate-A.i-node
    Trigger: any update changing the truth value of A.City(c), with the variable binding of c.
    Action:
      IF A.City(c) is becoming true
      THEN create a new OID in B, and update A.i-node as follows:
        1.1 Call remote procedure create-OID in B to create a new OID O_B and have it sent back to A
        1.2 insert A.i-node(O_B,c)
      ELSE find the OID to be destroyed in A.i-node, update A.i-node, and send a deletion request to B as follows:
        2.1 Search A.i-node for the OID O_B associated with the witness value c
        2.2 delete A.i-node(O_B,c)

Footnote 4: Since the rule head of r, a witness relation, is transparent to the rest of the database and can only be modified by $r_p$, there is no need for delay checking rules. This is true even in the bi-directional case.

Notice that, after performing the insertion (1.2) or deletion (2.2) in the action above, the production rule maintaining R1-2 will be triggered and will propose an insertion or deletion to B.Node.

Similarly, for the “Segment Flight” example, the invention rule

    A.i-flight[*,s,t] :- A.segment-info(s,t,p);  // S1

is translated into the following production rule, which is put into $\Sigma_A$.

    Rule Name: populate-A.i-flight
    Trigger: any update changing the truth value of (∃p A.segment-info(s,t,p)), with the variable bindings of s and t.
    Action:
      IF the trigger condition is becoming true
      THEN create a new OID in B, and update A.i-flight as follows:
        1.1 Call remote procedure create-OID in B to create a new OID O_B and have it sent back to A
        1.2 insert A.i-flight(O_B,s,t)
      ELSE find the OID to be destroyed in A.i-flight, update A.i-flight, and send a deletion request to B as follows:
        2.1 Search A.i-flight for the OID O_B associated with the witness value (s,t)
        2.2 delete A.i-flight(O_B,s,t)

[Figure 4.4: Scenario for incremental OID creation. Database B side: RemTrans_B, Listening Phase, then Execution Phase, executing the requests in Remote-Agenda: insert B.Flight(f49); insert B.flight-info(f49,D.C.,8:00); insert B.price(f49,250).]

The following example demonstrates how these production rules create and delete OIDs and properly maintain the witness relations.

Example 4.4: Continuing with the “Segment-Flight” example, assume that the two databases are in the states depicted in Figure 4.2, and that the concrete linkage is generated and loaded as described. Suppose a user at database A wants to add one more segment s38 to database A and issues

    Transaction { insert A.has-node(s38,D.C.); insert A.segment-info(s38,8:00,250); }

The resulting firing sequence of the production rules is depicted in Figure 4.4. Then assume the user wants to undo the previous update by issuing

    Transaction { delete A.has-node(s38,D.C.); delete A.segment-info(s38,8:00,250); }

The resulting firing sequence of the production rules is depicted in Figure 4.5. After the RemTrans transaction commits, databases A and B return to the original states shown in Figure 4.2.
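The action of a population rule such as populate-A.i-flight can be sketched as follows. This is a conceptual illustration only, under several assumptions: the class name PopulationRule and its methods are hypothetical, the remote procedure create-OID is replaced by a local callable, and a change in the truth value of the existential trigger condition is detected with a simple support counter, which is one possible technique and not necessarily what Bird does.

```python
class PopulationRule:
    """Maintains one witness relation (e.g. A.i-flight) for one invention rule."""

    def __init__(self, create_remote_oid):
        # Stand-in for the remote procedure create-OID in database B.
        self.create_remote_oid = create_remote_oid
        self.witness = {}   # witness value, e.g. (s, t) -> OID created in B
        self.support = {}   # witness value -> number of body tuples deriving it

    def body_tuple_inserted(self, witness_value):
        """Call when a body tuple with this witness value is inserted in A."""
        self.support[witness_value] = self.support.get(witness_value, 0) + 1
        if self.support[witness_value] > 1:
            return None                      # truth value unchanged: no firing
        oid = self.create_remote_oid()       # step 1.1
        self.witness[witness_value] = oid    # step 1.2
        return ("insert", oid)

    def body_tuple_deleted(self, witness_value):
        """Call when a body tuple with this witness value is deleted from A."""
        self.support[witness_value] -= 1
        if self.support[witness_value] > 0:
            return None                      # condition still true: no firing
        del self.support[witness_value]
        oid = self.witness.pop(witness_value)  # steps 2.1 and 2.2
        return ("delete", oid)


oids = iter(["f49"])
rule = PopulationRule(lambda: next(oids))
print(rule.body_tuple_inserted(("s38", "8:00")))
print(rule.body_tuple_deleted(("s38", "8:00")))
```

The returned pair stands for the change to the witness relation that subsequently triggers the non-invention rules.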
□

[Figure 4.4, database A side: RemTrans_A]

    Execution Phase. The following rules are fired:
      A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);
        1.1 Call create-OID in B and get f49
        1.2 Insert A.i-flight[f49,s38,8:00]
        1.3 Append "insert B.Flight(f49)" to Remote-Agenda
      B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
        Append "insert B.flight-info(f49,D.C.,8:00)" to Remote-Agenda
      B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);
        Append "insert B.price(f49,250)" to Remote-Agenda
    Proposing Phase: Send Remote-Agenda to B

[Figure 4.5: Scenarios for incremental OID destruction. Database B side: RemTrans_B, Listening Phase, then Execution Phase, executing the requests in Remote-Agenda: delete B.Flight(f49); delete B.flight-info(f49,D.C.,8:00); delete B.price(f49,250).]

Notice that the action of an OID invention rule requires that newly created OIDs in B be passed to and stored in A. If pure OIDs were used, a problem might arise: the OIDs may not be immutable. For example, B can change its OIDs due to an automatic background garbage collection procedure that is independent of A. If pure OIDs were sent to A, this could result in dangling OID references. For this reason, the translation of ILOG invention rules described above serves only as a conceptual model. The actual MDOT mechanism used in Bird to provide the sharing of OIDs, in the context of incremental OID creations/destructions, is presented in the next section.

4.5 Machine Dependent Object Translation (MDOT) Mechanism

As stated in [EK91], the value of OIDs, or machine dependent objects (MDOs), in a local database may change even within a session.
In order to provide a “stronger notion of identity” at a global level, Bird has the Machine Dependent Object Translation (MDOT) mechanism to translate local OIDs into immutable OIDs so that they can be shared among databases.

[Figure 4.5, database A side: RemTrans_A]

    Execution Phase. The following rules are fired:
      A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);
        2.1 Found f49 associated with (s38,8:00)
        2.2 Delete A.i-flight[f49,s38,8:00]
        2.3 Append "delete B.Flight(f49)" to Remote-Agenda
      B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
        Append "delete B.flight-info(f49,D.C.,8:00)" to Remote-Agenda
      B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);
        Append "delete B.price(f49,250)" to Remote-Agenda
    Proposing Phase: Send Remote-Agenda to B

In the following, Section 4.5.1 introduces the functionalities of the MDOT mechanism, which include

• the translation of a local OID to an immutable OID before sending the local OID to a foreign database
• the translation of an immutable OID back to its local value when receiving it from a foreign database.

The translation of nrecILOG¬ invention rules introduced in Section 4.4 serves only as a conceptual model. The actual actions of the production rules implementing nrecILOG¬ invention rules incorporate the MDOT mechanism to achieve the sharing of OIDs. Section 4.5.2 elaborates on this collaboration between the remote OID creations of the invention production rules and the MDOT mechanism.

4.5.1 Translation of Machine Dependent Object

Each OID traveling beyond the boundary of a database is encoded by the MDOT mechanism as [host,id], where host is the database where it originates and id is a unique number for this OID in the host database. To record the necessary information for the machine dependent objects present in a database, each database maintains a variable MDO-count and a relation MDOTT (for MDOT table).
MDO-count is used to assign a unique number to each machine dependent object originating in this database. For each machine dependent object that travels into or out of the database, MDOTT records its originating host (host), its identification number (id), and its local value substitution (value) in the database. This may include the OIDs created in the local database and the OIDs created in other databases.

Figure 4.6 illustrates a snapshot of the MDOT tables and the witness instances of the “Segment-Flight” example, assuming that the databases are in the states shown in Figure 4.2. The translation process can be divided into a two-way communication process, the encoding and decoding of MDOs, described below.

[Figure 4.6: MDOT snapshots for the “Segment-Flight” example]

    Database A (MDO counter = 2)

    A.i-flight
    Flight  witness 1  witness 2
    f1      s1         10:00
    f2      s1         14:00
    f3      s2         09:00

    A.MDOTT
    host  id  local value
    A     1   s1
    A     2   s2
    B     1   [B,1]
    B     2   [B,2]
    B     3   [B,3]

    Database B (MDO counter = 3)

    B.i-segment
    Segment  witness
    s1       S.F.
    s2       N.Y.

    B.MDOTT
    host  id  local value
    A     1   [A,1]
    A     2   [A,2]
    B     1   f1
    B     2   f2
    B     3   f3

    Translation paths shown in the figure:
      local value s1 in A -> (look up A.MDOTT and encode) -> [A,1] -> (look up B.MDOTT and decode) -> local value [A,1] in B
      local value [B,3] in A -> [B,3] -> local value f3 in B

Encoding MDOs: Whenever a local OID O_local is to be sent to another database, do the following:

    IF there is a tuple MDOTT(local-host-name, id-no, O_local) in the MDOT table
    THEN substitute O_local by [local-host-name, id-no]
    ELSE increment MDO-count,
         insert MDOTT(local-host-name, MDO-count, O_local) into MDOTT, and
         substitute O_local by [local-host-name, MDO-count]

For example, with the A.MDOTT instance shown in Figure 4.6, if A wants to send s1 to B, then s1 is substituted by [A,1] before it is sent. However, if A wants to send [B,3] to B, the encoding [B,3] itself is sent.
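The encoding procedure above, together with the decoding procedure described next, can be sketched as a small table class. This is a minimal sketch under assumptions, not the Bird code: the class name MDOTable and its methods are hypothetical, an encoding [host,id] is represented by a Python tuple, and the table is indexed in both directions so that a foreign encoding stored as a local value (such as [B,3] in A) is sent unchanged.

```python
class MDOTable:
    """One database's MDO-count and MDOTT relation."""

    def __init__(self, host, rows=(), count=0):
        self.host = host
        self.count = count    # MDO-count
        self.local_of = {}    # (host, id) -> local value
        self.key_of = {}      # local value -> (host, id)
        for h, i, local in rows:
            self._record((h, i), local)

    def _record(self, key, local):
        self.local_of[key] = local
        self.key_of[local] = key

    def encode(self, local):
        """Translate a local value before sending it to another database."""
        if local not in self.key_of:
            # First time this local OID leaves the database: number it.
            self.count += 1
            self._record((self.host, self.count), local)
        return self.key_of[local]

    def decode(self, key):
        """Translate a received encoding (host, id) into a local value."""
        if key not in self.local_of:
            # Unknown foreign object: the encoding itself is its local value.
            self._record(key, key)
        return self.local_of[key]


# Tables in the state of Figure 4.6.
a = MDOTable("A", [("A", 1, "s1"), ("A", 2, "s2"), ("B", 1, ("B", 1)),
                   ("B", 2, ("B", 2)), ("B", 3, ("B", 3))], count=2)
b = MDOTable("B", [("A", 1, ("A", 1)), ("A", 2, ("A", 2)), ("B", 1, "f1"),
                   ("B", 2, "f2"), ("B", 3, "f3")], count=3)
print(a.encode("s1"))
print(b.decode(a.encode(("B", 3))))
```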
Decoding MDOs: Whenever an encoding [host-name, id-no] is received from another database, do the following:

    IF there is a tuple MDOTT(host-name, id-no, local-value) in the MDOT table
    THEN substitute [host-name, id-no] by local-value
    ELSE insert MDOTT(host-name, id-no, [host-name, id-no]) into the MDOT table

For example, if B receives [A,1] from A, since there is a tuple B.MDOTT(A,1,[A,1]) in the MDOT table, the encoding [A,1] itself is used in database B. If B receives [B,3], it is substituted by f3.

With the translation scheme and the MDOT table available, in case a local OID, say o17, changes to o'17, the entry in A.MDOTT recording o17 will also change to o'17. That is, the new A.MDOTT now becomes

    A.MDOTT
    host  id  value
    A     1   o'17

Whenever A communicates with a remote database, the same encoding [A,1] is always used, independent of the change of o17.

Though the MDOT mechanism shown in this section does not contain typing information for the OIDs, it can easily be extended to do so.

4.5.2 Synthesis of OID Invention and MDOT

Recall the translation of an ILOG invention rule presented in Section 4.4. When the population rule of an ILOG invention rule is fired, conceptually speaking, its action performs the following:

(1) Invoke a remote procedure call to create an OID O_remote in the remote database,
(2) Receive the remote OID O_remote as the return value of the remote procedure call, and
(3) Store O_remote and its witness values in the witness relation.
In the two scenarios shown in Figures 4.4 and 4.5, step (1) happens during the execution phase of a RemTrans transaction. In most active object-based database implementations, creating an object is regarded as a database event and may trigger some production rule in the remote database. As a result, a deadlock situation similar to the scenario depicted in Figure 3.1 may happen. Also, in the real implementation, (2) and (3) require the remote OID to travel across the border of a database. From the discussion in Section 4.5.1, we know that special treatment is needed to incorporate this remote OID invention process within the general MDOT framework.

Consider the scenario depicted in Figure 4.7, which portrays the process of OID invention under the framework of the MDOT mechanism. Shown in the figure is the actual creation process of the OID f49 in B.Flight first presented in Figure 4.4 (Section 4.4).

[Figure 4.7: Scenarios of OID invention under MDOT mechanism]

    RemTrans_A
      Execution Phase:
        A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p)
          (1) Remote procedure call get_next_count; get return value = 4
          (2) Append "create_oid=o; Insert MDOTT(B,4,o)" to Remote_Agenda
          (3) Insert A.MDOTT(B,4,[B,4])
      Proposing Phase: Send Remote_Agenda to B

    RemTrans_B
      Listening Phase:
        get_next_count: Increment MDO counter; Return MDO counter
      Execution Phase:
        Executing "create_oid=o; Insert MDOTT(B,n,o)"
          (1) Create new OID value = f49
          (2) Insert B.MDOTT(B,4,f49)

In the above figure, assume the user at database A issues a RemTrans transaction. During the execution phase of RemTrans_A, the incremental update triggers the population rule $r_p$ generated from

    A.i-flight[*,s,t] :- A.segment-info(s,t,p);

The action of $r_p$ first invokes a remote procedure call get-next-count in database B, which increments the MDO-count in B and returns its value, 4, to A (1).
Then it appends an update request to Remote-Agenda, requesting the B database to create a new OID and to recognize this new OID value as the 4th machine dependent object created in B (2). Finally, the action of $r_p$ puts the encoding [B,4] of this remote OID into the MDOT table in database A (3). In the Proposing Phase, RemTrans_A sends the update requests stored in Remote-Agenda to B (4). After receiving and executing the update request from A, RemTrans_B creates a new OID, f49, and inserts B.MDOTT(B,4,f49) into its own MDOT table (5). After that, the resulting MDOT tables for databases A and B are

    A.MDOTT
    host  id  value
    B     4   [B,4]

    B.MDOTT
    host  id  value
    B     4   f49

4.6 Static vs. Dynamic Semantics in ALD

This section revisits the “City” example and investigates the subtlety in choosing different witnesses for the invention of remote OIDs in a nrecILOG¬ invention rule.

Assume that in the “City” example, A.c-name is the key for entity type A.City. We can rewrite $P_{A \mapsto B}$ into $P'_{A \mapsto B}$ by substituting R1 by

    A.i-node[* Node,n] :- A.c-name(c,n);  // X1

That is, instead of using the cities in A.City as the witnesses, the nodes in B.Node are invented with the city names in A.c-name as witnesses. Statically speaking, given a source instance I, the two programs always give the same target instance on the logical instance level (Section 2.3.2). That is,

    $P'_{A \mapsto B}(I) \equiv P_{A \mapsto B}(I)$, or $[P'_{A \mapsto B}(I)] = [P_{A \mapsto B}(I)]$

However, the concrete linkages generated from the two nrecILOG¬ programs (with R1 and with X1) may behave differently. Consider the two scenarios depicted in Figure 4.8.

[Figure 4.8: Dynamic behavior differences for “City” example]

    Initial states

    A.c-name            B.node-label
    City  name          Node  label
    c1    L.A.          n1    L.A.
    c2    S.F.          n2    S.F.
    c3    N.Y.          n3    N.Y.

    Transaction at A: delete A.c-name(c1,L.A.); insert A.c-name(c1,D.C.)

    Scenario (1), with A.i-node[* Node,c] :- A.City(c):
      delete B.node-label(n1,L.A.)
      insert B.node-label(n1,D.C.)

    Scenario (2), with A.i-node[* Node,n] :- A.c-name(c,n):
      delete B.Node(n1); delete B.node-label(n1,L.A.)
      insert B.Node(n250); insert B.node-label(n250,D.C.)

Initially, the two databases are in the states depicted at the top of the figure. Assume a user at A issues the transaction

    Transaction { delete A.c-name(c1,L.A.); insert A.c-name(c1,D.C.); }

Then, in scenario (1), running the concrete linkage generated from $P_{A \mapsto B}$, the corresponding updates sent to B are

    delete B.node-label(n1,L.A.);
    insert B.node-label(n1,D.C.);

which in essence modify the attribute B.node-label of the node n1 from L.A. to D.C. On the other hand, in scenario (2), running the concrete linkage generated from $P'_{A \mapsto B}$, the updates propagated to the B database include

    delete B.Node(n1);
    delete B.node-label(n1,L.A.);
    insert B.Node(n250);
    insert B.node-label(n250,D.C.);

Now the OID n1 is first deleted, and then a new OID n250 is created with the new label D.C.

Notice that the two resulting target instances in the two scenarios are equivalent as logical instances. But suppose B.Node has another attribute, B.color, with the following instance initially.

    B.color
    Node  color
    n1    green
    n2    yellow
    n3    red

Then, in scenario (1), the color of the node (n1) that corresponds to city c1 will still be green; in scenario (2), the color information of the node (n250) that corresponds to city c1 is either lost or contains a dangling OID reference.

This phenomenon is due to the different ways of identifying remote objects with nrecILOG¬; as in the two scenarios above, even though $P_{A \mapsto B}$ and $P'_{A \mapsto B}$ compute the same (logical) target instances, on the physical instance level the concrete linkages associated by Bird with these programs have different dynamic behavior. In $P_{A \mapsto B}$, the identity of the node n1 is “witnessed” by the city OID c1; therefore, changing the attributes of c1 does not affect the existence of n1. In contrast, in $P'_{A \mapsto B}$ the focus is on the names of cities.
The existence of node n1 depends on the name of the city, L.A., in A. As a result, changing the name of city c1 results first in the destruction of n1 and second in the creation of another node n250 that corresponds to the new name D.C. of city c1.

For the application in the “City” example, $P_{A \mapsto B}$ describes the semantic correspondence of the application more precisely, because no matter what the name of a city is, there will be a node on the GUI representing that city. Therefore, the correspondence is between the OIDs of A.City and the OIDs of B.Node.

A complete investigation of this phenomenon is beyond the scope of this thesis. Practically speaking, it is important for DBAs to be aware of this subtlety and to ensure that the nrecILOG¬ invention rules used are those that best express the semantics of the application.

Chapter 5
Bi-directional Propagation of Incremental Updates

With the framework introduced in Chapters 3 and 4, we are now ready to present the full capabilities of the Bird system to maintain bi-directional semantic correspondences involving OIDs.

First, Section 5.1 discusses the preprocessing of EES in the bi-directional case. Then Section 5.2 discusses the witness update problem, which motivates the need for witness generators; these are special rules that populate the witness relations in the remote database. Section 5.3 gives a formal definition of the witness generators. An algorithm, called Witness Generator Generator (WGG), which generates the witness generators from a DBA-given ALD, will be presented in the next chapter. Section 5.5 presents the justification for using EES to specify entity equivalence relationships, and the subtle semantics implied by the EES that cannot be expressed by nrecILOG¬ invention rules alone. Finally, Section 5.6 summarizes the development and the major components of the Bird system introduced so far.
The “City” and “Segment-Flight” examples introduced in Chapter 4 will again be used in this chapter. However, the difference is that the applications now require the two databases A and B to be treated symmetrically; i.e., both databases can issue incremental updates that propagate to the other database. For this reason, the ALDs include two nrecILOG¬ programs ($P_{A \mapsto B}$ and $P_{B \mapsto A}$) specifying the instance mappings in both directions.

Example 5.1: “City”, the bi-directional case
Continuing with the “City” example introduced in Chapter 4, the following ALD specifies the bi-directional semantic correspondence for the “City” example.

    (abs-linkage city
      A-schema (
        City(entity);
        c-name(City,string);
      )
      B-schema (
        Node(entity);
        node-label(Node,string);
      )
      Entity-Equivalence (
        A.City = B.Node;                                       // E1
      )
      A-to-B (
        B.node-label(n,l) :- A.City=Node[c,n], A.c-name(c,l);  // R1
      )
      B-to-A (
        A.c-name(c,l) :- B.City=Node[c,n], B.node-label(n,l);  // S1
      )
    )
□

Example 5.2: “Segment Flight”, the bi-directional case
Continuing with the “Segment-Flight” example introduced in Chapter 4, the following ALD specifies the bi-directional semantic correspondence for the “Segment-Flight” example.

    (abs-linkage segment-flight
      A-schema (
        Segment(entity);
        has-node(Segment,string);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        flight-info(Flight,string,string);
        price(Flight,string);
      )
      A-to-B (
        A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);           // R1
        B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);  // R2
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);    // R3
      )
      B-to-A (
        B.i-segment[* Segment,c] :- B.flight-info(f,c,t);            // S1
        A.has-node(s,c) :- B.i-segment[s,c];                         // S2
        A.segment-info(s,t,p) :- B.i-segment[s,c],
                                 B.flight-info(f,c,t), B.price(f,p); // S3
      )
    )
□

For the discussion in this chapter, it is assumed that the MDOT mechanism introduced in Section 4.5 is being used, so that all OIDs are global, immutable, and can be passed between databases.
5.1 Preprocessing of EES in Bi-Directional Cases

As shown in Section 4.1, the EES can be used to express a one-one correspondence relationship between a pair of entity types in the two databases. For the bi-directional cases, consider the ALD in Example 5.1 above. Notice that, unlike the ALD in Example 4.1, the ALD above has a B-to-A section. Also, the EES relation B.City=Node is used in S1, recording the matching OID pairs of A.City and B.Node. As indicated by its prefix, B.City=Node is a relation located in database B. That is, for each EES entry in a bi-directional ALD, the Bird system maintains two duplicated copies of the same EES relation, one in each database.

Similar to the uni-directional cases, the EES in a bi-directional ALD L is first translated by the following rewriting system, then compiled by the ALD compiler. The result of the translation, denoted EES-to-ILOG(L), is an ALD whose $P_{A \mapsto B}$ and $P_{B \mapsto A}$ do not contain any EES relation predicate.

Rewriting System 5.1
For each EES entry A.R = B.S in a bi-directional ALD, the EES translator EES-to-ILOG rewrites the $P_{A \mapsto B}$ and $P_{B \mapsto A}$ programs in the ALD as follows.

(1) The following ILOG rules will be generated and added into $P_{A \mapsto B}$ and $P_{B \mapsto A}$, respectively.

    A.i-S[* S,r] :- A.R(r);
    B.i-R[* R,s] :- B.S(s);

(2) Each occurrence of the predicate A.R=S[x,y] in $P_{A \mapsto B}$ will be substituted by i-S[y,x], and each occurrence of the predicate B.R=S[x,y] in $P_{B \mapsto A}$ will be substituted by i-R[x,y].
□

In the above, (1) and (2) are simply the rules of Rewriting System 4.1 applied symmetrically to both $P_{A \mapsto B}$ and $P_{B \mapsto A}$.

Example 5.3: For the ALD L of the bi-directional "City" example, the resulting ALD, EES-to-ILOG(L), of the EES rewriting process is

    Entity-Equivalence (
      A.City = B.Node;                                        // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);      // R1
      A.i-node[* Node,c] :- A.City(c);                        // R2
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);      // S1
      B.i-city[* City,o] :- B.Node(o);                        // S2
    )

By rewriting rule (2), the EES relation predicates A.City=Node and B.City=Node in R1 and S1 are replaced by A.i-node and B.i-city, respectively. And, by rewriting rule (1), R2 and S2 are added. □

[Figure 5.1: Naive compilation for bi-directional ALD involving OIDs]

For the remainder of this chapter, unless otherwise specified, we assume that the ALDs under discussion have first been pre-processed by EES-to-ILOG, so that their $P_{A \mapsto B}$ and $P_{B \mapsto A}$ do not contain any EES relation predicate.

5.2 Witness Update Problem

As discussed in Section 4.3, for the uni-directional case, the schema of A is augmented with witness relations to store the association between the OIDs defined logically by $P_{A \mapsto B}$ and the physical OIDs of B that correspond to them. For the same reason, in the bi-directional case, witness relations (denoted auxA and auxB, respectively) are added to both A and B. The resulting schemas are denoted $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$.

For a bi-directional ALD, a naive way to maintain the semantic correspondence is to simply compile both $P_{A \mapsto B}$ and $P_{B \mapsto A}$ as in the uni-directional case. The resulting concrete linkages can then translate incremental updates in both directions. Unfortunately, as indicated in Figure 5.1, there is a potential problem: although $I_{auxA}$ can be used to propagate updates to $J_B$, there are no rules in A to populate the witness relations in auxB, nor are there rules in B to populate auxA.
This is the witness update problem. To see how this leads to a problem under incremental updates, consider the following scenario for the "Segment Flight" example.

[Figure 5.2: Scenario using the "naive" concrete linkages]

Suppose the ALD in Example 5.2 is compiled by the "naive compiler" mentioned above. Assume that after loading the concrete linkages, the two databases are in the state of equilibrium illustrated in Figure 5.3 and all the production rules are satisfied. Then a user at A wants to add one more segment by issuing the following transaction.

    Transaction{
      Insert A.Segment(s38);
      Insert A.segment-info(s38,8:00,250);
      Insert A.has-dest(s38,D.C.);
    }

As we can see in Figure 5.2, after executing the user-issued transaction and assuming that f49 is the new OID created in B, three production rules in A are fired, and RemTransA sends the following update requests to B:

    insert B.Flight(f49);
    insert B.flight-info(f49,D.C.,8:00);
    insert B.price(f49,250);

After executing the update requests, three rules in B are triggered.
These request a new OID from A, say s13, and propose

    Transaction{
      Insert A.Segment(s13);
      Insert A.segment-info(s13,8:00,250);
      Insert A.has-dest(s13,D.C.);
    }

back to A. At this point, things have already gone wrong! A new OID s13 has been created for the same segment represented by s38. This abnormality is called the witness update problem. Analogous problems may arise when information associated with a given OID is modified.

The root of the problem is that there are no rules in A to populate the association information (s38, D.C.) into B.i-segment and so prevent unnecessary rule firings in B that "re-invent" s38. The solution to this problem is to augment $P_{A \mapsto B}$ by adding a nrecILOG- rule

    B.i-segment[s,c] :- A.has-node(s,c);

With this rule added, after executing the user-issued update, the new rule will be fired and adds one more update request

    Insert B.i-segment(s38,D.C.);

to be sent to B. Now, due to this newly added rule, the association between s38 and its witness D.C. will be established in B.i-segment. As a result, all rules in B are satisfied and the transaction is finally committed.

Similarly, the rule

    A.i-flight[f,s,t] :- B.i-segment[s,d], B.flight-info(f,d,t), B.price(f,p);

must be added to $P_{B \mapsto A}$, propagating updates of database B to the witness relations in A.

The two rules just given are called the witness generators. In the next section, the witness generator is formally defined. In general, finding witness generators is a non-trivial task. Chapter 6 presents a procedure for finding them, and Chapter 7 presents a theoretical analysis of when the procedure halts.

5.3 Definition of Witness Generator

As pointed out in the previous section, in order to resolve the witness update problem, the ALD compiler generates a set of nrecILOG- rules, called the witness generators, and adds them to the ALD during compilation. In this section, the formal definition of the witness generator is presented.
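Before the formal definition, the scenario of Section 5.2 can be replayed in a small Python sketch. It models only the invention rule S1 in B and the witness relation B.i-segment; the function names, the OID counter starting at 13 (chosen to mirror s13 in the scenario), and the set-of-tuples representation are assumptions made for illustration, not the Bird implementation.

```python
from itertools import count


def fire_i_segment(flight_info, i_segment, fresh_oids):
    """S1: B.i-segment[* Segment,c] :- B.flight-info(f,c,t).

    Invent a new Segment OID for every city c in flight-info that has
    no entry in the witness relation yet. Returns the invented pairs.
    """
    known = {c for (_s, c) in i_segment}
    invented = []
    for (_f, c, _t) in sorted(flight_info):
        if c not in known:
            s = next(fresh_oids)
            i_segment.add((s, c))
            known.add(c)
            invented.append((s, c))
    return invented


def propagate_insert(segment, dest, with_witness_generator):
    """Replay the insertion of one segment at A and the reaction at B."""
    # Update requests sent by A: a new flight f49 to `dest` at 8:00.
    flight_info = {("f49", dest, "8:00")}
    i_segment = set()
    if with_witness_generator:
        # Witness generator: B.i-segment[s,c] :- A.has-node(s,c)
        i_segment.add((segment, dest))
    fresh = (f"s{n}" for n in count(13))  # hypothetical OID supply
    return fire_i_segment(flight_info, i_segment, fresh)


naive = propagate_insert("s38", "D.C.", with_witness_generator=False)
fixed = propagate_insert("s38", "D.C.", with_witness_generator=True)
```

In the naive run B invents a second OID for the segment that A already knows as s38; once the witness generator has populated B.i-segment, no invention rule fires.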
Intuitively speaking, the witness relations serve as the glue combining the remote physical OIDs and their corresponding logical witness values, so that the local database can determine how to translate updates to the specific remote OIDs. As defined in Section 4.3, the information stored in the witness relations is called a witness. In the context of bi-directional semantic correspondence, the witnesses stored at databases A and B co-exist as a pair.

Definition: Witness Pair
Let L be an ALD specifying a semantic correspondence $SC_{AB}$, and let $(I_A, J_B) \in SC_{AB}$ be an instance pair in the semantic correspondence. The two instances $I_{auxA}$ over auxA and $J_{auxB}$ over auxB are a witness pair for $(I_A, J_B)$ if

(1) $I_{auxA}$ is a witness from $I_A$ to $J_B$ and $J_{auxB}$ is a witness from $J_B$ to $I_A$; that is,

$$\begin{cases} P_{A \mapsto B}(I_A) = I_{auxA}, \text{ and } P_{A \mapsto B}(I_A \cup I_{auxA}) = J_B \\ P_{B \mapsto A}(J_B) = J_{auxB}, \text{ and } P_{B \mapsto A}(J_B \cup J_{auxB}) = I_A \end{cases}$$

(2) For each pair of intermediate relations B.iP and A.iQ resulting from the EES-to-ILOG translation of an EES entry A.P = B.Q, the two relations B.iP and A.iQ are inverse to each other. That is,

$$\forall p, q:\; iP(p,q) \in J_{auxB} \iff iQ(q,p) \in I_{auxA}$$
□

A key aspect of the approach taken by Bird is to maintain the witness pair $(I_{auxA}, J_{auxB})$ along with $(I_A, J_B)$. In this way the remote (physical) instance $J_B$ can be computed locally at A by

$$J_B = P_{A \mapsto B}(I_A \cup I_{auxA})$$

That is, the A database holds enough information to reconstruct $J_B$. Similarly, the instance $I_A$ can be computed locally at B by

$$I_A = P_{B \mapsto A}(J_B \cup J_{auxB})$$

Based on the notion of witness pair, the witness generator can be formally defined as follows.
Definition: Witness Generator
Given an ALD L containing two nrecILOG- programs $P_{A \mapsto B}$ and $P_{B \mapsto A}$, a nrecILOG- program $P_{A \mapsto auxB}$ is said to be the witness generator from A to auxB if, for each $(I_A, J_B) \in SC_{(P_{A \mapsto B}, P_{B \mapsto A})}$ with the witness pair $(I_{auxA}, J_{auxB})$,

$$J_{auxB} = P_{A \mapsto auxB}(I_A \cup I_{auxA})$$

Similarly, a nrecILOG- program $P_{B \mapsto auxA}$ is said to be the witness generator from B to auxA if

$$I_{auxA} = P_{B \mapsto auxA}(J_B \cup J_{auxB})$$

The rules in the witness generators are called the witness rules. □

Example 5.4: Continuing with the "Segment Flight" example (Example 5.2), from the ALD we know auxA = { i-flight } and auxB = { i-segment }. The witness pair $(I_{auxA}, J_{auxB})$ and the two instances $I_A$ and $J_B$ are depicted in Figure 5.3. As shown in the previous section, the witness rule $WG_{i\text{-}segment}$ is

    B.i-segment[s,d] :- A.has-dest(s,d), A.segment-info(s,t,p);

and the witness rule $WG_{i\text{-}flight}$ is

    A.i-flight[f,s,t] :- B.i-segment[s,d], B.flight-info(f,d,t), B.price(f,p);

Then $J_{auxB}.\text{i-segment} = WG_{i\text{-}segment}(I_A \cup I_{auxA})$, and $I_{auxA}.\text{i-flight} = WG_{i\text{-}flight}(J_B \cup J_{auxB})$. □

Figure 5.3: Witness pair for the "Segment Flight" example

    Database A, instance I_A
      A.segment-info (Segment, time, price): (s1, 10:00, 65), (s1, 14:00, 80), (s2, 09:00, 150)
      A.has-node (Segment, node):            (s1, S.F.), (s2, N.Y.)
    Database B, instance J_B
      B.flight-info (Flight, city, time):    (f1, S.F., 10:00), (f2, S.F., 14:00), (f3, N.Y., 09:00)
      B.price (Flight, price):               (f1, 65), (f2, 80), (f3, 150)
    Witness pair
      I_auxA: A.i-flight (Flight, Segment, time): (f1, s1, 10:00), (f2, s1, 14:00), (f3, s2, 09:00)
      J_auxB: B.i-segment (Segment, node):        (s1, S.F.), (s2, N.Y.)

5.4 Logical and Physical Views of a Semantic Correspondence

As mentioned in Section 2.3, due to the non-deterministic nature of OID invention, the original semantics presented in [HY90] is defined on the notion of "logical instance", i.e. the OID-equivalence class of database instances.
Therefore, when the DBA writes an ALD L involving OID invention/destruction, what L actually describes is the semantic correspondence $SC_{AB}$ between logical instances of A and logical instances of B.

Definition: Semantic Correspondence Involving OIDs
Let L be an ALD defined between two databases A and B. Let the two nrecILOG- programs in L be $P_{A \mapsto B}$ and $P_{B \mapsto A}$, containing some invention rules. Then the semantic correspondence $SC_{AB}$ defined by L is

$$\{\, ([I],[J]) \mid ([P_{A \mapsto B}(I)] = [J]) \wedge ([P_{B \mapsto A}(J)] = [I]) \,\}$$
□

[Figure 5.4: Different levels of abstraction in the BIRD system. The ALD written by the DBA describes the SC at the "logical instance" level; the ALD compiler produces an augmented ALD describing the SC at the "physical instance" level.]

In practice, however, logical instances are more a concept for achieving "OID independence" than a reality. In most database implementations, only the physical instances are stored in the database. The witness relations (Section 4.3) and witness generators can be viewed as the extra information needed to bridge the gap between the logical and physical aspects of an ALD. That is, to maintain the semantic correspondence between two databases physically, the Bird system needs: (1) the witness relations, keeping the OID association information, and (2) the witness generators, to properly maintain the witness relations. This motivates the following definition.

Definition: Extended Semantic Correspondence
Let L be an ALD specifying a semantic correspondence $SC_{AB}$. Let auxA and auxB be the schemas of the witness relations appearing in $P_{A \mapsto B}$ and $P_{B \mapsto A}$ respectively. Let $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$. Then the extended semantic correspondence $SC_{\hat{A}\hat{B}}$ is

$$\{\, (I_{\hat{A}}, J_{\hat{B}}) \mid (I_{\hat{A}} = I_A \cup P_{A \mapsto B}(I_A)) \wedge (J_{\hat{B}} = J_B \cup P_{B \mapsto A}(J_B)) \wedge (P_{A \mapsto B}(I_{\hat{A}}) = J_B) \wedge (P_{B \mapsto A}(J_{\hat{B}}) = I_A) \,\}$$
□

Figure 5.4 shows how the Bird system helps the DBA to specify and maintain a semantic correspondence by supporting both aspects of a semantic correspondence.
As indicated in the figure, the DBA specifies a semantic correspondence at the abstract level of "logical instances"; the ALD compiler then translates this "logical level" specification into the concrete linkages, so that the extended semantic correspondence $SC_{\hat{A}\hat{B}}$ can be maintained efficiently at the "physical level".

5.5 Equivalent and Mutually Recursive Classes

This section discusses the justification for using EES to specify the equivalence relationship between two entity types, and the subtle semantics implied by the EES that cannot be expressed by nrecILOG- invention rules alone.

Consider the bi-directional "City" example. An "ILOG-activist" may argue that the following nrecILOG- programs alone express the semantic correspondence more naturally. (These are virtually the same nrecILOG- programs as in the ALD generated by EES-to-ILOG.)

    A-to-B (
      A.i-node[* Node,c] :- A.City(c);                        \\ T1
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);      \\ T2
    )
    B-to-A (
      B.i-city[* City,n] :- B.Node(n);                        \\ U1
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);      \\ U2
    )

However, the DBA should use the EES instead of the conventional nrecILOG- invention rules, for the following reasons.

[Figure 5.5: Different entity type relationships specified by EES and invention rules. Panels: (a) Four possible OID mappings of the "City" example by EES; (b) Example OID mapping recorded in i-Node and i-City; (c) Possible OID mappings of the "City" example by invention rules.]

To articulate the equivalent relationship between matching entity types: The EES gives a more direct indication that there is a one-one correspondence between the two entity types, which cannot be articulated by using nrecILOG- invention rules alone.
As we have seen in the "City" example, an EES entry

    A.City = B.Node;

in the ALD directly points out that there is a one-one correspondence between OIDs in A.City and OIDs in B.Node. Suppose there are two OIDs $c_1, c_2$ in A.City and two OIDs $n_1, n_2$ in B.Node; then Figure 5.5(a) shows schematically two different possibilities for the correspondence. On the other hand, the correspondence specified by the nrecILOG- invention rules

    i-node[* Node,c] :- A.City(c);      \\ T1
    i-city[* City,n] :- B.Node(n);      \\ U1

is not necessarily one-one. Figure 5.5(b) illustrates example instances of the two witness relations. In it, i-node records the mapping from $c_1, c_2$ to $n_1, n_2$ respectively; in contrast, i-city records the mapping from $n_1, n_2$ to $c_2, c_1$ respectively. In fact, as indicated schematically in Figure 5.5(c), there are four possible ways in which two nodes $n_1, n_2$ and two cities $c_1, c_2$ can link up.

To avoid an endless loop in the WGG algorithm: The two invention rules T1 and U1 form a recursive OID invention loop between A.City and B.Node. That is, T1 says that the witness for the invention of an OID in B.Node is an OID in A.City; U1 says that the witness for the invention of an OID in A.City is an OID in B.Node. As we will see in Section 6.5, using nrecILOG- invention rules recursively to specify the equivalence of two entity types will result in an endless loop for the WGG algorithm.

Later, in Chapter 6, we will see that the ALD compiler uses the entity equivalence information in the Entity-Equivalence section, so that the proper witness generators can be generated for the two nrecILOG- invention rules generated by rewriting rule (1) of Rewriting System 5.1.

5.6 Recapitulation of the Bird System

This section summarizes the major components of the Bird system presented so far. The refined system architecture for the Bird system is shown in Figure 5.6.
To specify a semantic correspondence, the DBA first writes an ALD specifying the correspondence between the logical instances of A and the logical instances of B. An ALD includes the following parts.

A-Schema: The schema description of A.
B-Schema: The schema description of B.
Entity-Equivalence: Entries specifying the one-one correspondence relationships between entity types in A and entity types in B.
A-to-B: A nrecILOG- program specifying the mapping from the logical instances of A to the logical instances of B.
B-to-A: A nrecILOG- program specifying the mapping from the logical instances of B to the logical instances of A. For a uni-directional semantic correspondence, the ALD does not contain this part.

[Figure 5.6: Refined system architecture for the Bird system]

With the ALD as input, the ALD compiler performs the following tasks:

(1) Preprocessing of EES: Each EES entry in the ALD is translated according to Rewriting System 5.1 presented in Section 5.1.

(2) Witness relations augmentation: The ALD compiler generates relation definitions for the intermediate relations appearing in the rule heads of production rules in the ALD (Section 4.3). These relations, called the witness relations, are to be materialized once the concrete linkages are loaded in the databases.
(3) Adding Witness Generators to the ALD: The WGG algorithm (to be presented in Chapter 6) is used to generate a witness generator for each invention rule in the ALD. These witness generators (Section 5.3), which are non-invention nrecILOG- rules with an intermediate relation appearing in the rule head, are added to the ALD.

(4) Generating Concrete Linkages: The compiler then generates two concrete linkages, for databases A and B respectively. The concrete linkages include the relation definitions generated in (2), and population and delay-checking rules (Section 3.2) for the two databases. The translations of nrecILOG- non-invention and invention rules are presented in Sections 3.2 and 4.4 respectively.

After loading the concrete linkages, the two databases A and B are ready to propagate incremental updates. An overview of the interactions between layers of the Bird system to translate and propagate a user-given incremental update $\Delta_A$ ($\Delta_B$) into $\Delta_B$ ($\Delta_A$) was presented in Section 1.3.1.

Chapter 6
How to Find Witness Generators

One of the main contributions of this research is the development of the Witness Generator Generator (WGG) algorithm, which generates correct witness generators from a user-specified ALD. This chapter focuses on the intuition behind the algorithm and on informal discussion of it. A formal discussion of the soundness and the decidability of halting of WGG will be presented in Chapter 7.

In Section 6.1, an informal description of WGG is given to highlight the intuition behind the procedure. Section 6.2 first presents the notion of "well-foundedness" of an ALD; it then describes a technique for removing the inherited non-well-foundedness in ALDs containing EES definitions. Section 6.3 presents some formal definitions needed in Chapter 7 for the proof of WGG soundness. Section 6.4 describes the WGG procedure. Unfortunately, the WGG algorithm does not always terminate.
Finally, in Section 6.5 we present two families of ALDs that may lead to endless WGG execution. These demonstrate the incompleteness of the WGG algorithm, and motivate the need for a further study of its termination behavior in Chapter 7.

For the discussion in this chapter, the examples "City" (Example 5.1) and "Segment Flight" (Example 5.2) will again be used.

6.1 Overview of Automatic Generation of Witness Generators

In this section, we present the intuition behind automatic generation of witness generators. In the following discussion, two examples are presented; the first demonstrates how the witness rule for an intermediate relation generated by the EES-to-ILOG translation can be generated, and the second shows the construction of the witness rule for an ordinary intermediate relation. For the sake of discussion, we define the Skolem operator on an ALD as follows.

Definition: Skolem(L)
Let L be an ALD; then Skolem(L) is the ALD in which each nrecILOG- invention rule

    i-R[* R, w] :- ...;

is first expanded into the non-shorthand form

    i-R[*, w] :- ...;
    R(r) :- i-R[r, w];

Then each non-shorthand invention rule is skolemized (Section 2.3.1) into

    i-R[f_{i-R}(w), w] :- ...;

with a distinct skolem function $f_{i\text{-}R}$. □

The next example walks through the "City" example and shows how we can use "common sense" to infer the witness rule for the intermediate relation B.i-city derived from the EES entry A.City = B.Node.

Example 6.1: The ALD L presented in Example 5.1 can be transformed into a new ALD, L' = EES-to-ILOG(L), as follows.

    Entity-Equivalence (
      A.City = B.Node;                                        // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);      // R1
      A.i-node[* Node,c] :- A.City(c);                        // R2
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);      // S1
      B.i-city[* City,o] :- B.Node(o);                        // S2
    )

L' can then be skolemized into L'' = Skolem(L') as follows.
    Entity-Equivalence (
      A.City = B.Node;                                        // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);      // R1
      A.i-node[f(c),c] :- A.City(c);                          // R2'
      B.Node(n) :- A.i-node[n,c];                             // R3'
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);      // S1
      B.i-city[g(o),o] :- B.Node(o);                          // S2'
      A.City(c) :- B.i-city[c,o];                             // S3'
    )

By the definition of witness generator in Section 5.3, a witness generator $\mathcal{R}$ for B.i-city satisfies

$$\forall (I_{\hat{A}}, J_{\hat{B}}) \in SC_{\hat{A}\hat{B}}:\; J_{\hat{B}}.\text{i-city} = \mathcal{R}(I_{\hat{A}})$$

From the semantics of EES we know the following is true.

$$\forall c, n:\; \text{B.i-city}[c,n] \iff \text{A.i-node}[n,c]$$

Thus, the witness generator for i-city is

    B.i-city[x,y] :- A.i-node[y,x]

In fact, we can generalize this result: for each EES entry A.P = B.Q, the witness generators for B.iP and A.iQ are

    B.iP[x,y] :- A.iQ[y,x];

and

    A.iQ[x,y] :- B.iP[y,x];

respectively. □

[Figure 6.1: SLD Expansion for "Segment Flight" (table with columns STEP, MGU, Goal Set)]

The next example demonstrates how to construct a witness rule for an intermediate relation that is not associated with any EES entry.

Example 6.2: Let the ALD shown in Example 5.2 for the "Segment Flight" example be L. L can be skolemized into L' = Skolem(L) as follows.
    A-to-B (
      A.i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);                  // R1
      B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);       // R2
      B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);         // R3
    )
    B-to-A (
      B.i-segment[g(c),c] :- B.flight-info(f,c,t);                      // S1
      A.has-node(s,c) :- B.i-segment[s,c];                              // S2
      A.segment-info(s,t,p) :- B.i-segment[s,c],                        // S3
                               B.flight-info(f,c,t), B.price(f,p);
    )

Let A.i-flight[f,s,t] be an arbitrary tuple in $I_{\hat{A}}$. Let us "infer", from the above nrecILOG- programs, what else is true along with the fact

    A.i-flight[f,s,t] ∈ $I_{\hat{A}}$

According to rule R1

    A.i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);

and the fact that there is no union in the nrecILOG- programs $P_{A \mapsto B}$ and $P_{B \mapsto A}$, there must be a tuple A.segment-info(s,t,p) in $I_{\hat{A}}$ for some price p, and f is equal to the skolem term f(s,t). Next, from rule S3

    A.segment-info(s,t,p) :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);

we know there must be three tuples B.i-segment[s,c], B.flight-info(f',c,t), and B.price(f',p) in $J_{\hat{B}}$ for some c, f'.

In essence, the "common sense" reasoning can be carried out as an SLD-resolution style expansion. This is shown pictorially in Figure 6.1. In particular, at step 1 the initial OID variable f is substituted by the skolem term f(s,t). As the expansion goes on, at step 3 f is further substituted, along with the substitution of s by g(c), by f(g(c),t). Along the expansion, there is another variable f' which is also substituted by the same skolem term f(g(c),t) at step 4.

Intuitively speaking, from the above observation we know that variable f must be equal to variable f'. Furthermore, by observing the expansion, we know the goal set at step 2 has the following properties: (1) it contains the variable f', (2) each predicate in it comes from database B, and (3) it contains both witness variables or their equivalents, t and s', of A.
i-flight[f,s,t]. Then the witness rule can be constructed by using the goal set at step 2 as the rule body and A.i-flight[f,s,t] as the rule head, with equivalent variables equated; i.e., the witness generator for i-flight is

    A.i-flight[f,s,t] :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p)
□

From the two examples above, we can now summarize the "common sense" reasoning as follows. To find a witness rule for an intermediate relation r, consider the following two cases.

Case 1) r is defined in an EES entry R = S.
Directly construct the witness rule for r as

    r[x,y] :- s[y,x]

Case 2) r is not defined in any EES entry.
Then do an SLD-resolution style expansion as follows.

(1) Initially construct a goal set containing an atom r[o,w], where o is the OID variable and w is the witness variables.
(2) Expand the goal set as if doing an SLD-expansion, until there exists a variable o' such that both o and o' are substituted by the same skolem term.
(3) Then construct the witness rule $\mathcal{R}$ using r[o,w] as the rule head. The body of $\mathcal{R}$ is the goal set which contains: (a) only atoms from the other database, and (b) variables equivalent to o and w.

The WGG algorithm is essentially based on the above "common sense" reasoning. (Footnote 1: As will be shown later, the way a witness generator is constructed by the WGG algorithm is more complicated than (3) above. For some ALDs, (3) does not always find a goal set satisfying conditions (a) and (b).)

6.2 Removal of EES

As mentioned in Section 5.5, the EES expresses the one-one correspondence relationship between entity types, and the EES-to-ILOG transformation (Section 5.1) defines the semantics of an EES entry in terms of nrecILOG- invention rules. In particular, an EES entry

    A.P = B.Q;

is translated into two nrecILOG- invention rules

    A.iQ[* Q,p] :- A.P(p);
    B.iP[* P,q] :- B.Q(q);

However, the two rules above involve a recursive invention loop; i.e., an OID p of A.P is the witness of the invention of an OID q of B.Q and vice versa. This makes the ALD "non-well-founded" (Section 6.2.1). As will be shown in Section 6.5.2, non-well-founded ALDs result in endless executions of the WGG algorithm.

To circumvent the non-well-foundedness inherited by ALDs containing EES, two rewriting operators, reduce and expand, are proposed (Section 6.2.2). In essence, an ALD L is first "reduced" into a well-founded ALD reduce(L); then the witness generators for reduce(L) are "expanded" into witness generators for the original ALD L.

For the discussion in this section, the following example will be used.

Example 6.3: "Itinerary"
This is almost the same as the "Segment Flight" example (Example 5.2), except that A.Node and B.City are entity types. Furthermore, there is a one-one correspondence between nodes in A.Node and cities in B.City. The schemas are shown in Figure 6.2.

[Figure 6.2: Schemas of Example "Itinerary"]

The semantic correspondence can be specified by the following ALD L, which has already been translated by EES-to-ILOG.

    (abs-linkage itinerary
      A-schema (
        Segment(entity);
        Node(entity);
        has-node(Segment,Node);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        City(entity);
        flight-info(Flight,City,string);
        price(Flight,string);
      )
      Entity-Equivalence (
        A.Node = B.City;                                                   // E1
      )
      A-to-B (
        A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);                 // R1
        A.i-city[* City,o] :- A.Node(o);                                   // R2
        B.flight-info(f,o,t) :- A.i-flight[f,s,t], A.has-node(s,o);        // R3
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);          // R4
      )
      B-to-A (
        B.i-segment[* Segment,c] :- B.flight-info(f,c,t);                  // S1
        B.i-node[* Node,c] :- B.City(c);                                   // S2
        A.has-node(s,o) :- B.i-segment[s,c], B.i-node[o,c];                // S3
        A.segment-info(s,t,p) :- B.i-segment[s,c],                         // S4
                                 B.flight-info(f,c,t), B.price(f,p);
      )
    )

Notice that EES entry E1 specifies the equivalence relationship between A.Node and B.City. □

6.2.1 Well-foundedness of an ALD

Simply put, an ALD is "well-founded" if every entity type appearing in the ALD has a finite rank, defined as follows.

Definition: Rank
Let L be an ALD. The rank function on the types in schema(L) is defined as follows.

$$rank(\tau) = \begin{cases} 0 & \text{if } \tau \text{ is a base type} \\ Max\{rank(\tau') \mid \tau' \in W\} + 1 & \text{if } \tau \text{ is invented by } i[*,..] \text{ and } W \text{ is the types of the witness columns in } i[*,..] \end{cases}$$

The rank number of L is then $rank(L) = Max\{rank(\tau) \mid \tau \in schema(L)\}$. □

Definition: Well-foundedness
An ALD L is well-founded if for each type $\tau$ in schema(L), $rank(\tau)$ is defined. □

Example 6.4: The ALD shown in Example 5.2 for the "Segment Flight" example has a rank of 2. This is computed as follows. By the invention rule

    B.i-segment[* Segment,c] :- B.flight-info(f,c,t);

we know

    rank(A.Segment) = rank(String) + 1 = 1

Then from the invention rule

    A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);

we know

    rank(B.Flight) = rank(A.Segment) + 1 = 2

Thus, by definition, the ALD is well-founded.

However, for the ALD L presented in Example 6.3, rank(L) is not well-defined. This is due to the A.i-city / B.i-node OID invention loop. Suppose rank(A.Node) = i.
From

    A.i-city[* City,o] :- A.Node(o);      // R2

we know rank(B.City) = i + 1. Then by

    B.i-node[* Node,c] :- B.City(c);      // S2

we have rank(A.Node) = i + 1 + 1 = i + 2, which contradicts rank(A.Node) = i. □

6.2.2 Reduction and Expansion Operations

Let L be an ALD containing the EES entry A.P = B.Q. Recall from Section 6.1 that the witness generators for A.iQ and B.iP can be simply constructed as

    A.iQ[q,p] :- B.iP[p,q];
    B.iP[p,q] :- A.iQ[q,p];

However, as will be shown later, the WGG algorithm fails to find the witness generators for those intermediate relations of L that do not appear in the EES. This is mainly because of the inherited non-well-foundedness resulting from the transformation EES-to-ILOG.

In this subsection, we introduce a technique that removes this inherited non-well-foundedness, so that the WGG algorithm can be applied to find the remaining witness generators. The technique is based on two operators: reduce and expand. To find the witness generator $WG_r$ for an intermediate relation r not appearing in the EES, the ALD compiler performs the following:

(1) Apply the reduce operator to L. In this new ALD, reduce(L), the entity types A.P and B.Q are "reduced" into printables.

(2) With reduce(L) and reduce(r) as input, run the WGG algorithm to find the witness generator, $WG_{reduce(r)}$, for the "reduced" intermediate relation reduce(r) in reduce(L).

(3) Apply the expand operator to $WG_{reduce(r)}$, and return $expand(WG_{reduce(r)})$ as the output; i.e., $expand(WG_{reduce(r)})$ is the witness generator for r w.r.t. L.

Definition: reduce
Let L be an ALD which has been pre-processed by EES-to-ILOG. Let $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$ be the extended schemas for databases A and B, respectively. For each EES entry A.P = B.Q in L, the result of reduce(L) is defined as follows.

Schemas: $reduce(\hat{A})$ contains all the relations in $\hat{A}$ except the intermediate relation A.iQ derived from an EES entry A.P = B.Q in L.
reduce(B̂) contains all the relations in B̂ except the intermediate relation B.iP derived from an EES entry A.P = B.Q in L.

Then define a new printable type PQ-value. The columns of reduce(Â) that in Â had entity type A.P are given type PQ-value in reduce(Â), and the columns of reduce(B̂) that in B̂ had entity type B.Q are given type PQ-value in reduce(B̂).

Programs: For each EES entry A.P = B.Q in L, replace the two rules

  A.iQ[* Q,p] :- A.P(p)   in P_{A→B}   and   B.iP[* P,q] :- B.Q(q)   in P_{B→A}

by

  A.P(x) :- B.Q(x)   in reduce(P_{A→B})   and   B.Q(x) :- A.P(x)   in reduce(P_{B→A})

respectively. For a rule R_r in P_{A→B}, where r is not defined in the EES, construct reduce(R_r) in reduce(P_{A→B}) as follows. For each iQ[q,p], do the following:

(1) replace each atom A.iQ[q,p] in R_r.body by a new atom A.P(x_pq), where x_pq is a new variable, and

(2) replace each p, q in R_r by the new variable x_pq.

Each rule in reduce(P_{B→A}) can be constructed in the same fashion. □

The next definition describes how to construct the reduced instance pair, (reduce(I_Â), reduce(J_B̂)) ∈ Inst(reduce(Â)) × Inst(reduce(B̂)), from an instance pair (I_Â, J_B̂) ∈ Inst(Â) × Inst(B̂).

Definition: Reduced Instance Pair
Let L be an ALD, and (I_Â, J_B̂) be an instance pair in the extended semantic correspondence SC_{ÂB̂}. Let A.P = B.Q be an EES entry in L. From the semantics of EES, we know that for each p, q

  iP[p,q] ∈ J_B̂.iP  ⟺  iQ[q,p] ∈ I_Â.iQ

The reduce operations on OIDs p and q are defined as follows: reduce(p) = reduce(q) = x_pq, where x_pq is a new value of type PQ-value. For every other value t ∈ adom(I_Â ∪ J_B̂), define reduce(t) = t. Then reduce(I_Â) and reduce(J_B̂) can be constructed as follows.

(1) Remove A.iQ from I_Â and B.iP from J_B̂.

(2) All other relations r in Â are replaced by reduce(I_Â.r) in reduce(I_Â), and all other relations s in B̂ are replaced by reduce(J_B̂.s) in reduce(J_B̂).
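The construction of a reduced instance pair can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instances are modeled as dictionaries from relation names to sets of tuples, OIDs and printable values are plain strings, and the function name `reduce_instance_pair` and the `pq:` encoding of the new PQ-value are hypothetical choices, not part of the ALD compiler.

```python
def reduce_instance_pair(inst_a, inst_b, i_q, i_p):
    """Reduce an instance pair for a single EES entry A.P = B.Q.

    inst_a, inst_b: dict mapping relation name to a set of tuples.
    i_q: name of the intermediate relation A.iQ, holding tuples (q, p).
    i_p: name of the intermediate relation B.iP, holding tuples (p, q).
    """
    # Every matched OID pair (p, q) collapses into one fresh PQ-value x_pq.
    fresh = {}
    for q, p in inst_a[i_q]:
        value = f"pq:{p}:{q}"  # hypothetical encoding of the new value
        fresh[p] = value
        fresh[q] = value

    def reduce_tuple(t):
        # reduce(t) = t for every value that is not a matched OID
        return tuple(fresh.get(v, v) for v in t)

    def reduce_instance(inst, dropped):
        # Step (1): drop A.iQ / B.iP.  Step (2): reduce all other relations.
        return {
            name: {reduce_tuple(t) for t in tuples}
            for name, tuples in inst.items()
            if name != dropped
        }

    return reduce_instance(inst_a, i_q), reduce_instance(inst_b, i_p)


# A tiny made-up instance pair for the EES entry A.Node = B.City.
inst_a = {
    "A.Node": {("n1",)},
    "A.has-node": {("s1", "n1")},
    "A.i-city": {("c1", "n1")},   # A.iQ[q, p]
}
inst_b = {
    "B.City": {("c1",)},
    "B.flight-info": {("f1", "c1", "9am")},
    "B.i-node": {("n1", "c1")},   # B.iP[p, q]
}
red_a, red_b = reduce_instance_pair(inst_a, inst_b, "A.i-city", "B.i-node")
```

After the call, A.Node and B.City hold the same printable value, which is the sense in which the two entity types are "reduced" into printables.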
□

Example 6.5:
The instance pair shown in Figure 4.2 for the "Segment Flight" example is a valid instance pair for the "Itinerary" example after reduction. □

The next lemma states that for any instance pair (I_Â, J_B̂) in an extended semantic correspondence SC_{ÂB̂} defined by an ALD L, the reduced instance pair (reduce(I_Â), reduce(J_B̂)) is in the reduced semantic correspondence SC_{reduce(ÂB̂)}.

Lemma 6.2.1: Reduced Semantic Correspondence
Let L be an ALD defining an extended semantic correspondence SC_{ÂB̂}. Let the extended semantic correspondence defined by reduce(L) be SC_{reduce(ÂB̂)}. Then,

  (I_Â, J_B̂) ∈ SC_{ÂB̂}  ⟹  (reduce(I_Â), reduce(J_B̂)) ∈ SC_{reduce(ÂB̂)}  □

After finding witness generator WG_reduce(r) for intermediate relation reduce(r) in the reduced ALD, reduce(L), the witness generator WG_r for r of the original ALD is constructed by "expanding" WG_reduce(r) as follows.

Definition: expand(WG_r')
Let WG_reduce(r) be a witness generator of reduce(r) under reduce(L). W.l.o.g. assume reduce(r) ∈ reduce(Â); then expand(WG_reduce(r)) can be constructed as follows. For each variable x_pq in Var(WG_reduce(r)) of type PQ-value, derived from an EES entry A.P = B.Q in L, do the following:

(1) add one atom B.iP[p,q] to the rule body, where p and q are new variables not used in WG_reduce(r),

(2) replace each occurrence of x_pq in the rule body by q,

(3) replace each occurrence of x_pq in the rule head, if any, by p.

expand(WG_reduce(r)) for reduce(r) in reduce(B̂) is defined analogously. □

The next lemma shows that the expanded witness generator is indeed a witness rule in the original semantic correspondence.
Lemma 6.2.2: Expanded Witness Rule
Let L be an ALD defining a semantic correspondence SC_L. Let the semantic correspondence defined by reduce(L) be SC_reduce(L). If WG_r' is a witness rule for r' in SC_reduce(L), then expand(WG_r') is a witness rule for r in SC_L. □

Example 6.6:
Recall that witness generator WG_A.i-flight for the "Segment Flight" example, constructed in the previous section, was

  A.i-flight[f,s,t] :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);

Then the witness rule for A.i-flight for the "Itinerary" example is, in fact, expand(WG_A.i-flight). That is,

  A.i-flight[f,s,t] :- B.i-segment[s,y], B.flight-info(f,y,t), B.price(f,p), B.i-node[x,y];
□

6.3 WGG Expansion History

In this section, some basic definitions about the expansion process are presented. These definitions will be used in the description of the WGG algorithm. Later, in Chapter 7, they will be used to prove various results regarding the execution of the WGG algorithm.

Recall the "common sense" reasoning of Example 6.2 in the previous section. As shown in Figure 6.1, the expansion can choose the "most promising" atom in a goal set to expand first. This decision on the sequence of atom expansions, called the computation rule in [Llo87], is based on some intuition.

Remark 6.3.1: For the formal development we insist that the WGG algorithm constructs the expansion history in a systematic, layer by layer fashion to simplify the proof of soundness. In practice, however, heuristics might be developed to expand only the "most promising" portions of the tree. ◁

In contrast, the expansion in the WGG algorithm is performed in a more systematic way, in which atoms in a goal set are expanded according to a flag called the schema tag.
The value of the schema tag alternates from "aux" to "base" after all the atoms of the intermediate relations are expanded; after all the atoms of the base relations are expanded, the value of the schema tag switches back from "base" to "aux".

Definition: Schema Tag
t is a schema tag if t ∈ {"base", "aux"}. Let G be a set of atom occurrences; then G is

  "base" congruous if G contains only base relation atoms from the same database;
  "aux" congruous if G may contain atoms of both base and intermediate relations from the same database. □

Next we define the WGG expansion history; this records the execution history of the WGG algorithm.

Remark 6.3.2: For the discussion in this section, we assume that the ALDs have first been translated by EES-to-ILOG, then rewritten by reduce, so that they are well-founded. ◁

For the sake of brevity, the following notations are used. Given substitutions σ^1, .., σ^n, we use σ^[1,n] as a shorthand for σ^1 · σ^2 ··· σ^n. Since each rule in an ALD has a distinct relation name in the rule head, R_r represents the rule in skolem(L) with relation r appearing in the rule head, and Var(R) represents the vector of variables appearing in R. Also, we use R.head-term to represent the terms/variables appearing in the rule head and R.body to represent the body of rule R.

Definition: WGG Expansion History
Let L be an ALD, t be a schema tag and G be a set of atom occurrences that is t congruous. The WGG expansion history under L with initial goal set G and initial schema tag t, denoted as WGG_L(G, t), is an infinite sequence {(G_i, θ_i, t_i), i ≥ 0} of three-tuples, where G_i, θ_i, and t_i are called the goal set, the substitution, and the schema tag at step i, respectively. Initially, let

  t_0 = t
  G_0 = G
  θ_0 = ∅

The schema tag at step i ≥ 1 is

  t_i = "base"   if t_{i-1} = "aux"
  t_i = "aux"    if t_{i-1} = "base"

Later, we will see that, in the expansion history, G_i is always t_i congruous.
The goal sets in the expansion history are constructed recursively as follows. Let the goal set at level i ≥ 0 be

  G_i = r_i^1(u_i^1), .., r_i^{l_i}(u_i^{l_i}),   l_i ≥ 1

Recall that in [Llo87], before unifying a goal r_i^j(u_i^j) and a program clause R_{r_i^j}, the variables in R_{r_i^j} are renamed, so that the clause does not contain any variable that already appeared in the derivation up to r_i^j(u_i^j). The renaming process is called standardizing the variables apart. In the following, we use R̃_{r_i^j} to denote the result of R_{r_i^j} after standardizing the variables apart. The atoms in G_i are expanded according to the following two cases.

Case 1) t_i = "base"
G_{i+1} and θ_{i+1} are constructed by going through l_i micro steps. Speaking intuitively, each micro step j generates: (1) a substitution σ_{i+1}^j which is the most general unifier (MGU) [Llo87] of atom r_i^j(u_i^j) and the rule head of R̃_{r_i^j}, and (2) a temporary goal set g_{i+1}^j which is the result of expanding r_i^j(u_i^j) into the rule body of R̃_{r_i^j}.

Initially, let g_{i+1}^0 = G_i and σ_{i+1}^0 = ∅. Each micro step j, j ∈ [1, l_i], generates

  σ_{i+1}^j = mgu(u_i^j · σ_{i+1}^[0,j-1], R̃_{r_i^j}.head-term)
  g_{i+1}^j = [R̃_{r_i^1}.body, .., R̃_{r_i^j}.body, r_i^{j+1}(u_i^{j+1}), .., r_i^{l_i}(u_i^{l_i})] · σ_{i+1}^[0,j]

Then we can construct:

  G_{i+1} = g_{i+1}^{l_i}
  θ_{i+1} = σ_{i+1}^[0,l_i]

Case 2) t_i = "aux"
G_{i+1} and θ_{i+1} are constructed in the same way as in Case 1) except that for a micro step j ∈ [1, l_i], if r_i^j is not an intermediate relation then assign g_{i+1}^j = g_{i+1}^{j-1} (with g_{i+1}^0 = G_i) and σ_{i+1}^j = ∅. □

The next example illustrates the details of micro step expansions using the "Segment Flight" example. It goes through the first two micro steps in level 3 of the WGG expansion history.

Example 6.7:
Let L be the ALD of the "Segment Flight" example and H = WGG_L({A.i-flight[f,s,t]}, "aux") be the expansion history. At level 3, we have

  G_3 = {B.flight-info(f'',c,t'), B.flight-info(f',c,t), B.price(f',p)}
  t_3 = "base"

We now begin level 4 by expanding each atom in G_3 from left to right.
For micro step 1 (writing f1, s1, c1, t1 for the renamed variables of the rule),

  r_3^1(u_3^1) = B.flight-info(f'',c,t')
  R_{r^1} = B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
  R̃_{r^1} = B.flight-info(f1,c1,t1) :- A.i-flight[f1,s1,t1], A.has-node(s1,c1);
  σ_4^1 = mgu((f'',c,t'), (f1,c1,t1))
  g_4^1 = {A.i-flight[f'',s1,t'], A.has-node(s1,c), B.flight-info(f',c,t), B.price(f',p)}

For micro step 2 (writing f2, s2, c2, t2 for the renamed variables),

  r_3^2(u_3^2) · σ_4^1 = B.flight-info(f',c,t)
  R_{r^2} = B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
  R̃_{r^2} = B.flight-info(f2,c2,t2) :- A.i-flight[f2,s2,t2], A.has-node(s2,c2);
  σ_4^2 = mgu((f',c,t), (f2,c2,t2)) = {f2/f', c2/c, t2/t}
  g_4^2 = {A.i-flight[f'',s1,t'], A.has-node(s1,c), A.i-flight[f',s2,t], A.has-node(s2,c), B.price(f',p)}

Notice that, even though the two rules R_{r^1} and R_{r^2} are the same, their renamed versions R̃_{r^1} and R̃_{r^2} are quite different. □

Later, in Section 7.1, we will show that the micro step expansions within a level are order independent. That is, the WGG algorithm can choose any ordering for the micro step expansions and the result will be the same. However, for the algorithm discussed in this chapter, we assume that, at any level i, the atoms are always expanded from left to right.

We end this section by defining the notions of answer and answer variable in an expansion history. As will be seen in the next section, once an answer variable is found in the expansion history, WGG will stop the expansion and will generate a witness generator.

Definition: Answer and Answer Variable
Let G_0 = {r[o,w]} be a singleton initial goal set where r is an intermediate relation. Let H = WGG_L(G_0, "aux") be the expansion history of G_0 under ALD L. Then at level i > 0, a substitution item (o'/s) ∈ θ[0,i] is an answer for H if

  o' ≠ o   and   o · θ[0,i] = o' · θ[0,i]

Variable o' is called the answer variable in H. □

6.4 The WGG Algorithm

In this section, we present the Witness Generator Generator (WGG) algorithm that automatically generates the correct witness rules for a user-given ALD.
Remark 6.4.1: As shown in Example 6.1, there is a simple and direct way to construct a witness rule for an intermediate relation r derived from an EES entry. Also, in Section 6.2, two operators were introduced: reduce "reduces" an ALD containing recursive EES invention rules into an equivalent well-founded ALD, and expand "expands" the witness generators for the reduced ALD into witness generators for the original ALD. For the sake of brevity, we assume that the input ALD L to the WGG algorithm is the result of reduce(EES-to-ILOG(L')), for some ALD L'. Furthermore, the output witness generators will be expanded automatically. Thus, the discussion in the remainder of this chapter will focus on the WGG algorithm. ◁

The algorithm is based loosely on SLD-resolution [Llo87] and the Skolem Functional Dependency property defined in Section 4.3. It takes an ALD L and an intermediate relation name r in schema(L) as its input. If it halts, the output is a witness generator for intermediate relation r.

Algorithm: WGG
INPUT: (L, r) where L is a user-defined ALD, and r is an intermediate relation from auxA or auxB. W.l.o.g., in the following we assume that r ∈ auxA.

(1) Initialization Phase

  (1.a) Recursion Step
  Recursively call WGG to generate witness generators of intermediate relations with ranks less than rank(r).

  (1.b) Expansion Initialization Step
  Compute L' = Skolem(L). Let r[f_r(v), v] be the rule head of invention rule R_r in L'. Set the initial goal set G_0 = {r[o,w]}, where o and w are new variables not appearing in Var(L). Furthermore, v and w are isomorphic.

(2) Expanding Phase
Let i = 1. Repeat the following steps until the test in (2.b) becomes true.

  (2.a) SLD Derivation Step
  Compute the i-th level of the WGG expansion history WGG_{L'}(G_0, "aux"). Let the resulting goal set and substitution set be G_i and θ_i, respectively.
  (2.b) Checking Step
  Check if there is an answer variable, o', in Var(θ[0,i]). If found, stop the expanding phase and go to the next phase; otherwise set i = i + 1 and go to (2.a).

(3) Output Phase
Assume that at level s WGG finds an answer variable o'. Furthermore, o' (an OID variable of B) appears in a goal set G_a such that G_a contains only atoms from database B and a is minimum.

Since w may contain OID variables, three sets of atom occurrences are constructed to record the skolemization process of w from level 2 to level a as follows (footnote 2: we abuse notation slightly by using Var(t) to denote the set of variables occurring, possibly nested within skolem terms, in t).

  B_1  = {s[x,v] | s[x,v] ∈ G_{4k+2}, 0 ≤ k ≤ ⌊a/4⌋ − 1, x ∈ Var(o · θ[0,4k+2])}
  B_2  = {s[x,v] | s[x,v] ∈ G_{4k+4}, 0 ≤ k ≤ ⌊a/4⌋ − 1, x ∈ Var(o · θ[0,4k+4])}
  B_2' = ⋃ over s[x,v] ∈ B_2 of (WG_s.body) · mgu(s[x,v], WG_s.head)

In particular, B_1 and B_2 contain the atoms from auxA and auxB, respectively, which contribute to the skolemization of w. B_2' is derived from B_2 by expanding each atom s[x,v] in B_2 against the witness generator for s, so that B_2' contains only atoms from database A. Then the witness generator can be constructed as

  WG_r.head = G_0 ν
  WG_r.body = (G_a ∪ B_1 ∪ B_2') ν

where ν = {(x/y) | x, y ∈ Var(G_0 ∪ B_1 ∪ B_2), x · θ[0,s] = y · θ[0,s], x ≠ y}. □

The next example uses the "Segment Flight" example to illustrate the algorithm step by step.

Example 6.8:
With the ALD L and relation name A.i-flight as input, WGG goes through the following steps.

(1) Initialization Phase

  (1.a) Since rank(A.i-flight) = 2 and rank(B.i-segment) = 1, the witness generator for B.i-segment is first computed as follows.

    B.i-segment[s,c] :- A.i-flight[f,s,t], A.has-node(s,c)

  (1.b) The ALD is first skolemized as shown in Example 6.2.
  Then the initial goal set is set to be G_0 = {A.i-flight[f,s,t]}.

(2) Expanding Phase
Consider the expansion history WGG_L(G_0, "aux") depicted in Table 6.1.

  level | Tag    | G_i                                                                                                                             | θ[0,i]
  0     | "aux"  | A.i-flight[f,s,t]                                                                                                               | ∅
  1     | "base" | A.segment-info(s,t,p)                                                                                                           | {..., f/f(s,t), ...}
  2     | "aux"  | B.i-segment[s,c], B.flight-info(f',c,t), B.price(f',p)                                                                          | {...}
  3     | "base" | B.flight-info(f'',c,t'), B.flight-info(f',c,t), B.price(f',p)                                                                   | {..., s/g(c), f/f(g(c),t), ...}
  4     | "aux"  | A.i-flight[f'',s',t'], A.has-node(s',c), A.i-flight[f',s'',t], A.has-node(s'',c), A.i-flight[f',s''',t], A.segment-info(s''',t'',p) | {...}
  5     | "base" | A.has-node(s'',c)                                                                                                               | {..., f''/f(s',t'), f'/f(s'',t), s'''/s'', t''/t, ...}
  6     | "aux"  | B.i-segment[s'',c]                                                                                                              | {...}
  7     | "base" | ...                                                                                                                             | {..., f'/f(g(c),t), ...}

  Table 6.1: The WGG expansion history for the "Segment Flight" example

At step 7 in the above expansion history, substitution θ[0,7] contains an answer f'/f(g(c),t). Thus, the WGG algorithm exits the expanding phase and moves on to the output phase.

(3) Output Phase
By observing the table above, we know the answer variable f' first appears in

  G_2 = {B.i-segment[s,c], B.flight-info(f',c,t), B.price(f',p)}

Furthermore, G_2 contains only atoms from database B. Thus a = 2, and the three sets of atoms are B_1 = B_2 = B_2' = ∅. Then, we can construct the witness generator WG_A.i-flight as

  WG_A.i-flight.head = A.i-flight[f,s,t] ν
  WG_A.i-flight.body = (B.i-segment[s,c], B.flight-info(f',c,t), B.price(f',p)) ν

where ν = {f'/f}. That is, the witness generator constructed is

  A.i-flight[f,s,t] :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);
□

The alert reader may have noticed that the rule bodies of the witness generators constructed in Examples 6.2 and 6.8 are all derived from G_a, a = 2. It seems redundant that, in the output phase of WGG, the rule body is constructed from (G_a ∪ B_1 ∪ B_2')ν. However, the next example shows a case in which G_a alone cannot be the body of the witness generator.
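For the simple case of Example 6.8, where B_1 = B_2' = ∅, the output phase can be sketched in Python as follows. This is a minimal illustration under several assumptions: atoms are (relation, argument-tuple) pairs whose arguments are variable names, skolem terms are nested tuples, `theta` lists only the substitution items visible in Table 6.1 and is taken to be fully composed, and ν is oriented so that body-only variables are renamed to head variables. The function names are hypothetical.

```python
def image(var, theta):
    """Image of a variable under the accumulated substitution."""
    return theta.get(var, var)


def variables(atoms):
    """Variables of a list of atoms, in order of first occurrence."""
    seen = []
    for _, args in atoms:
        for a in args:
            if a not in seen:
                seen.append(a)
    return seen


def build_nu(head, body, theta):
    """Map each body-only variable to a head variable with the same image.

    The orientation (body variable renamed to head variable) is an
    assumption of this sketch; the definition only states x != y.
    """
    head_vars = variables([head])
    nu = {}
    for x in variables(body):
        if x in head_vars:
            continue
        for y in head_vars:
            if image(x, theta) == image(y, theta):
                nu[x] = y
                break
    return nu


def rename(atom, nu):
    rel, args = atom
    return (rel, tuple(nu.get(a, a) for a in args))


def witness_generator(head, body, theta):
    """Apply nu to the head G_0 and to the body G_a (B_1, B_2' empty)."""
    nu = build_nu(head, body, theta)
    return rename(head, nu), [rename(a, nu) for a in body]


# Data of Example 6.8: G_0, G_2 and the visible items of theta[0,7].
f_term = ("f", ("g", "c"), "t")          # the skolem term f(g(c), t)
theta = {"f": f_term, "f'": f_term, "s": ("g", "c")}
g0 = ("A.i-flight", ("f", "s", "t"))
g2 = [
    ("B.i-segment", ("s", "c")),
    ("B.flight-info", ("f'", "c", "t")),
    ("B.price", ("f'", "p")),
]
wg_head, wg_body = witness_generator(g0, g2, theta)
```

With these inputs the only renaming found is f'/f, and the resulting head and body coincide with the witness generator stated at the end of Example 6.8.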
[Figure 6.3: Schemas for the "More than G_a" example (database A with A1, A2, A3, A4; database B)]

Example 6.9: "More than G_a"
Consider the schemas for databases A and B depicted in Figure 6.3 and the following ALD specifying the semantic correspondence between the two schemas.

(abs-linkage more-than-Ga
  A-Schema (
    A1(Entity);
    A2(Entity);
    A3(Entity,string);
    A4(Entity,string);
  )
  B-Schema (
    B1(Entity);
    B2(string);
    B3(Entity,string);
  )
  A-to-B (
    I1[*, w1] :- A2(w1);
    B1(w2) :- I1[w2,w1];
    B2(w0) :- A3(w1,w0);
    B3(w2,w0) :- I1[w2,w1], A4(w1,w0);
  )
  B-to-A (
    A1(w3) :- J1[w3,w2];
    A2(w1) :- J2[w1,w0];
    A3(w3,w0) :- J1[w3,w2], B3(w2,w0);
    A4(w1,w0) :- J2[w1,w0];
    J2[*, w0] :- B2(w0);
    J1[*, w2] :- B1(w2);
  )
)

The above ALD L is well-founded, and rank(L) = 3. Table 6.2 illustrates the WGG expansion history WGG_L({J1[o,w2]}, "aux"). Notice that there is an answer o'/f(g(h(w0))) found at level 13. However, G_a = G_6 = {A.A3[o',w0]} does not even contain variable w2 of the rule head, G_0. Thus, G_a alone cannot serve as the rule body of the witness generator. Let the witness generator for intermediate relation J2 (rank(J2) = 2) constructed in the recursion step be

  B.J2[w1,w0] :- A.I1[w2,w1], A.A4(w1,w0);

In the output phase, the three sets of atoms are constructed as follows.

  B_1  = G_2 = {A.I1[w2,w1]}
  B_2  = G_4 = {B.J2[w1,w0]}
  B_2' = {A.I1[w2',w1], A.A4(w1,w0)}

And the witness generator, WG_J1, is constructed as:

  B.J1[o,w2] :- A.A3(o,w0), A.I1[w2,w1], A.I1[w2',w1], A.A4(w1,w0);

whose rule body consists of G_6, G_2 and the rule body of WG_J2. □

Later, in Section 7.2, we will prove that, once an answer is found in the expansion history, the rule constructed as the output is indeed a witness generator.

6.5 The Incompleteness of the WGG Algorithm

As will be proved in Chapter 7, the WGG algorithm is "sound"; i.e., the output of a WGG execution is indeed a correct witness generator.
Let us now consider the "completeness" of the WGG algorithm; i.e., with an arbitrary user-given ALD, will the WGG algorithm always produce the witness generators? Unfortunately, the answer is no; there are cases when the WGG executions will enter endless loops and never terminate.

  level | Tag    | G_i                                            | θ[0,i]
  0     | "aux"  | B.J1[o,w2]                                     | ∅
  1     | "base" | B.B1(w2)                                       | {..., o/f(w2)}
  2     | "aux"  | A.I1[w2,w1]                                    | {...}
  3     | "base" | A.A2(w1)                                       | {..., o/f(g(w1))}
  4     | "aux"  | B.J2[w1,w0]                                    | {...}
  5     | "base" | B.B2(w0)                                       | {..., o/f(g(h(w0)))}
  6     | "aux"  | A.A3(o',w0)                                    | {...}
  7     | "base" | A.A3(o',w0)                                    | {...}
  8     | "aux"  | B.J1[o',w2']                                   | {...}
  9     | "base" | B.B1(w2'), B.B3(w2',w0)                        | {..., o'/f(w2')}
  10    | "aux"  | A.I1[w2',w1'], A.I1[w2',w1''], A.A4[w1'',w0]   | {...}
  11    | "base" | A.A2(w1'), A.A2(w1'), J2[w1',w0]               | {.., o'/f(g(w1'))}
  12    | "aux"  | B.J2[w1',w0'], B.J2[w1',w0''], J2[w1',w0]      | {...}
  13    | "base" | ...                                            | {..., o'/f(g(h(w0)))}

  Table 6.2: The WGG expansion history for the "More than G_a" example

In the next two subsections, two families of ALDs are presented which lead to endless executions of the WGG algorithm. These examples motivate the need for a further investigation of the termination behavior of WGG executions in Chapter 7.

6.5.1 Well-founded ALD

Consider the expansion history, WGG_L({r[x,w]}, "aux"), for ALD L. Suppose that at level 1 the witness variables, w, are divided among predicates which do not have any overlapping variables. The next example illustrates such an expansion history and shows that, even for a well-founded ALD, if the witness variables w are divided into predicates with non-overlapping variables, the expansion history WGG_L(G_0, "aux") does not contain an answer. Therefore, the WGG algorithm does not halt.

Example 6.10: "Cross Product"
The schema for database A contains two unary relations of strings, A.R1 and A.R2. The schema for database B contains one entity type B.Node and two binary label relations B.L1 and B.L2 holding the label information for OIDs in B.Node.
The ALD L for this example is:

(abs-linkage cross-product
  A-schema (
    R1 (string);
    R2 (string);
  )
  B-schema (
    node (entity);
    L1 (node,string);
    L2 (node,string);
  )
  B-to-A (
    R1(l) :- L1(n,l);
    R2(l) :- L2(n,l);
  )
  A-to-B (
    i-node[* node,l1,l2] :- R1(l1), R2(l2);
    L1(n,l1) :- i-node[n,l1,l2];
    L2(n,l2) :- i-node[n,l1,l2];
  )
)

By observing the above ALD, we know that L is well-founded, and rank(L) = 1. Furthermore, from the above ALD, we know that an instance pair (I, J) is in the semantic correspondence SC_AB if and only if for each tuple (a, b) ∈ I.R1 × I.R2 there is a unique OID n ∈ J.node with labels (n, a) ∈ J.L1 and (n, b) ∈ J.L2.

Now consider the expansion history WGG_L({i-node[x,l1,l2]}, "aux") shown in Table 6.3.

  level | G_i                                         | θ[0,i]
  0     | A.i-node[x,l1,l2]                           | ∅
  1     | A.R1(l1), A.R2(l2)                          | {..., x/f(l1,l2)}
  2     | B.L1(n,l1), B.L2(n',l2)                     |
  3     | B.L1(n,l1), B.L2(n',l2)                     |
  4     | A.i-node[n,l1,l2'], A.i-node[n',l1',l2]     |
  5     | A.R1(l1), A.R2(l2'), A.R1(l1'), A.R2(l2)    | {..., n/f(l1,l2'), n'/f(l1',l2)}

  Table 6.3: Expansion history for the "Cross Product" example

From the above table, we can see that at level 1 the two witness variables, l1 and l2, in the initial goal set are split into two predicates, R1(l1) and R2(l2), containing no overlapping variables. Later in the expansion, the two "expansion subtrees" descended from R1(l1) and R2(l2) inherit either l1 or l2, but not both variables, from their ancestors. That is, no matter how deep the expansion goes, there is always one witness variable missing in either chunk of the expansion subtree. As a result, the substitutions do not contain an answer, which would have the skolem term f(l1,l2) in the RHS. Therefore, the expansion process goes on forever. □

6.5.2 Non-Well-founded ALD

Here we examine the case where the input ALD is not well-founded. The WGG algorithm keeps on expanding because there is a loop of OID invention.
As a result, the OID variable, o, appearing in the initial goal set keeps on being substituted by skolem terms of increasing depth. The other OID variables, appearing later in the expansion, can never "catch up" with the ever-changing substitutions of o.

Example 6.11: "Recursive OID Creation"
The ALD below is non-well-founded, since entity type R is witnessed by entity type S, and entity type S is witnessed by entity type R. This forms a loop of OID invention.

(abs-linkage Recursive-OID-Creation
  A-schema (
    R(entity);
  )
  B-schema (
    S(entity);
  )
  B-to-A (
    B.i-R[* R,s] :- B.S(s);   // R1
  )
  A-to-B (
    A.i-S[* S,r] :- A.R(r);   // S1
  )
)

  level | G_i           | θ[0,i]
  0     | B.i-R[x,s]    | ∅
  1     | B.S(s)        | {..., x/f(s)}
  2     | A.i-S[s,r]    | {...}
  3     | A.R(r)        | {..., s/g(r), x/fg(r)}
  4     | B.i-R[r,s']   | {...}
  5     | B.S(s')       | {..., r/f(s'), s/gf(s'), x/fgf(s')}
  6     | A.i-S[s',r']  | {...}
  7     | A.R(r')       | {..., s'/g(r'), r/fg(r'), s/gfg(r'), x/fgfg(r')}

  Table 6.4: Expansion history of the "Recursive OID Creation" example

Recall the EES-to-ILOG transformation introduced in Section 5.1. The two nrecILOG programs P_{A→B} and P_{B→A} above may be the result of an EES entry A.R = B.S translated by EES-to-ILOG. However, as pointed out in Section 5.5, the two invention rules R1 and S1 alone cannot specify the equivalent relationship between A.R and B.S, unless there is an EES entry that explicitly declares it. Also, based on the definition of rank (Section 6.2.1), unless there is an EES entry declaring the equivalent relationship, neither rank(R) nor rank(S) can be defined.

Next, we will show that the WGG algorithm fails to find a witness generator for i-R and runs into an endless loop. Let us follow the WGG algorithm with i-R and the above ALD as input.

(1) Initialization Phase
First, skolemize P_{A→B} and P_{B→A} to form P'_{A→B} and P'_{B→A} as follows.

  B-to-A (
    i-R[f(s),s] :- S(s);
    R(r) :- i-R[r,s];
  )
  A-to-B (
    i-S[g(r),r] :- R(r);
    S(s) :- i-S[s,r];
  )

and initial goal set G_0 = {i-R[x,s]}.
(2) Expanding Phase
The expansion history of WGG_L(G_0, "aux") is illustrated in Table 6.4. From the accumulated substitution shown in the table, we know that OID variable x is always skolemized two levels deeper than any OID variable (r, r', ...) of type R. This is mainly due to the OID invention loop R ↦ i-S ↦ S ↦ i-R ↦ R ↦ ···. As a result, the WGG algorithm will never terminate.

Should there be an EES entry A.R = B.S declaring the equivalence of B.i-R and A.i-S, then WGG knows that the two skolem functions f and g are actually inverse to each other. Then, the witness generator can be constructed as

  A.i-S[x,y] :- B.i-R[y,x];
□

Chapter 7
Formal Discussion on WGG Expansion

In this chapter, we focus on the formal discussion of the WGG expansion history and demonstrate: (1) the WGG algorithm presented in Section 6.4 is sound, i.e., the rule generated by the WGG algorithm is a valid witness generator, and (2) the termination of WGG execution is decidable for well-founded ALDs. In particular, given a well-founded ALD, we can construct a bound B with the property that the WGG expansion of the ALD yields an answer if and only if it yields an answer within B layers. The theoretical bound B is triply exponential in the size of the ALD.

Section 7.1 presents the structured WGG expansion history, a more structured expansion than the one presented in Section 6.3. Throughout this chapter the structured WGG expansion history will be used to prove various results. In Section 7.2, the proof of the soundness of the WGG algorithm is presented. The theoretical result on termination is mainly based on the notion of OID component, a special property of the expansion history. Section 7.3 discusses various topics surrounding OID components in an expansion history. Then the decidability of WGG termination is presented in Section 7.4.
Finally, in Section 7.5 we present a modified version of the WGG algorithm, the WGG_halt algorithm, that produces the same witness rules as the WGG algorithm and is guaranteed to terminate. Due to their length and technical intricacy, most of the proofs in this chapter are removed from the discussion and presented in Appendix D. Throughout this chapter, the following example will be used.

Example 7.1: "Enrollment"
In this example the two databases A and B record the same information about student enrollments. However, as depicted in Figure 7.1, the schemas of the two databases are quite different.

[Figure 7.1: Schemas and instances for "Enrollment". Instances shown: A.enroll-student = {(E1, Ben), (E2, Ben)}; A.enroll-course = {(E1, C1), (E2, C2)}; A.course-no = {(C1, EE101), (C2, CS101)}; B.student-name = {(S1, Ben)}; B.enrollment = {(S1, CS101), (S1, EE101)}]

In particular, the enrollment information in database A is recorded in an entity type, A.Enroll, while the same information is represented by a binary relation, B.enrollment, in database B. In database A, each student is represented as a string value in A.enroll-student, while the student information in database B is kept in entity type B.Student, with attribute B.student-name. Conversely, the course information associated with an enrollment is recorded in B as string values in B.enrollment, while the same information is modeled as an entity type A.Course with the attribute A.course-no in A. The following ALD specifies the semantic correspondence required by the application.
(abs-linkage Enrollment
  A-schema (
    Course(Entity);
    course-no(Course,String);
    Enroll(Entity);
    enroll-student(Enroll,String);
    enroll-course(Enroll,Course);
  )
  B-schema (
    Student(Entity);
    student-name(Student,String);
    enrollment(Student,String);
  )
  A-to-B (
    A.i-student[* Student,sn] :- A.enroll-student(Ae,sn);
    B.student-name(Bs,sn) :- A.i-student[Bs,sn];
    B.enrollment(Bs,cn) :- A.course-no(Ac,cn),
                           A.enroll-course(Ae,Ac),
                           A.enroll-student(Ae,sn),
                           A.i-student[Bs,sn];
  )
  B-to-A (
    B.i-course[* Course,cn] :- B.enrollment(Bs,cn);
    A.course-no(Ac,cn) :- B.i-course[Ac,cn];
    B.i-enroll[* Enroll,Bs,cn] :- B.enrollment(Bs,cn);
    A.enroll-student(Ae,sn) :- B.i-enroll[Ae,Bs,cn],
                               B.student-name(Bs,sn);
    A.enroll-course(Ae,Ac) :- B.i-course[Ac,cn],
                              B.i-enroll[Ae,Bs,cn];
  )
)
□

7.1 Structured WGG Expansion History

For the formal discussion in this chapter, the Structured WGG Expansion History is designed to provide technical sugaring and to simplify the proofs. In essence, the structured WGG expansion differs from the WGG expansion introduced in Section 6.3 in the following ways: (1) it imposes an ordering or "index" on the variables in VAR, (2) the unification always returns variables with lower index, and (3) it provides a systematic way to perform the standardizing-the-variables-apart process, so that the renamed rules are independent of the order of micro-step expansions.

Definition: index(x)
We assume that the variables in VAR have a total order isomorphic to the (positive and negative) integers. The sequence number of variable x is denoted by index(x). □

Definition: Proper Substitution
A substitution θ is said to be proper if

  ∀ (x/y) ∈ θ, y ∈ VAR ⟹ index(x) > index(y). □

The unification function mgu(t_1, t_2) used in the structured WGG expansion is required to be proper.

Definition: Act On
Let G be a set of atom occurrences and θ be a proper substitution. We say θ acts on G if Gθ ≠ G. Also, θ|_G denotes the substitution items (x/t) in θ where x occurs in G.
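The three notions just defined (proper substitution, "acts on", and the restriction θ|_G) can be illustrated with a small Python sketch. The representation is an assumption made for the illustration: variables carry an explicit integer index, skolem terms are function names with argument tuples, a goal set is an ordered list of atom occurrences, and substitutions are applied in a single pass. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str
    index: int          # position in the assumed total order on VAR


@dataclass(frozen=True)
class Skolem:
    fn: str
    args: tuple


def apply_term(term, theta):
    """Apply substitution theta (a dict from Var to term) to one term."""
    if isinstance(term, Var):
        return theta.get(term, term)
    if isinstance(term, Skolem):
        return Skolem(term.fn, tuple(apply_term(a, theta) for a in term.args))
    return term


def apply_goal(goal, theta):
    return [(rel, tuple(apply_term(a, theta) for a in args)) for rel, args in goal]


def term_vars(term):
    """Variables of a term, including those nested inside skolem terms."""
    if isinstance(term, Var):
        return {term}
    if isinstance(term, Skolem):
        found = set()
        for a in term.args:
            found |= term_vars(a)
        return found
    return set()


def is_proper(theta):
    # Only variable-to-variable items are constrained by the definition.
    return all(x.index > y.index for x, y in theta.items() if isinstance(y, Var))


def acts_on(theta, goal):
    return apply_goal(goal, theta) != goal


def restrict(theta, goal):
    """theta|_G: the items (x/t) of theta whose variable x occurs in G."""
    occurring = set()
    for _, args in goal:
        for a in args:
            occurring |= term_vars(a)
    return {x: t for x, t in theta.items() if x in occurring}


# Illustration with indexed variables in the style of the examples below.
v3, v8, v10 = Var("v3", 3), Var("v8", 8), Var("v10", 10)
v21, v27 = Var("v21", 21), Var("v27", 27)
sigma = {v27: v3, v8: Skolem("f", (v3,))}
goal_course = [("B.i-course", (v8, v3))]
goal_student = [("B.student-name", (v21, v10))]
```

Here `sigma` is proper because its only variable-to-variable item maps index 27 to index 3; it acts on `goal_course` (through v8) but not on `goal_student`.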
□

The substitution items generated in the expansion process can be categorized as follows.

Definition: Variable-Pure, Skolemization, and Trivial Substitution
Let $\theta$ be a substitution, and $G$ be a set of atom occurrences. $\theta$ is said to be a
• variable-pure substitution if for each $(x/y) \in \theta$, $y \in VAR$,
• skolemization substitution if for each $(x/y) \in \theta$, $y$ is a skolem term,
• trivial substitution to $G$ if $\theta$ does not act on $G$. □

Notation: We abuse notation slightly by using $Var(S)$ to denote the set of variables occurring (possibly nested within skolem terms) in $S$; $S$ may be a set of atom occurrences, a subtree of an expansion history, or even a, possibly nested, skolem term. ◁

Recall that the WGG expansion introduced in Section 6.3 did not specify how rules were renamed when variables are standardized apart before unification. In contrast, the renaming process in the structured WGG expansion is carefully designed. The variables of the rule $\mathcal{R}_{r_i^j}$, in the expansion of atom $r_i^j(u_i^j)$, are renamed according to $i$ and $j$; i.e., the level number and the physical position of $r_i^j(u_i^j)$ in $G_i$. As will be shown, the renamed rules are independent of the order of atom expansions at each level.

For the sake of brevity, in the following discussion, we assume that all the variables appearing in an ALD $L$ have indices less than zero, and that the variables in the initial goal set, $G_0$, of the structured expansion history start with index one.

Let the goal set at level $i$, $i \geq 0$, be
$G_i = r_i^1(u_i^1), .., r_i^{l_i}(u_i^{l_i}),\ l_i \geq 1$
Let $Var(\mathcal{R}_{r_i^j}) = (v_1, .., v_m)$. Then we define the renaming substitution $\delta_{r_i^j}$ and the renamed rule $\overline{\mathcal{R}}_{r_i^j}$ for atom $r_i^j(u_i^j)$ to be
$\delta_{r_i^j} = \{(x/y) \mid x = v_k,\ k \in [1,m],\ \text{and}\ index(y) = \sum_{p=0}^{i-1}\sum_{q=1}^{l_p} \|Var(\mathcal{R}_{r_p^q})\| + \sum_{q=1}^{j-1} \|Var(\mathcal{R}_{r_i^q})\| + k + \text{Max-Index}(Var(G_0))\}$
$\overline{\mathcal{R}}_{r_i^j} = \mathcal{R}_{r_i^j} \cdot \delta_{r_i^j}$

In essence, $\delta_{r_i^j}$ is designed to replace each variable $x \in Var(\mathcal{R}_{r_i^j})$ with a new variable $y$ whose index number is computed by adding up:
• $\sum_{p=0}^{i-1}\sum_{q=1}^{l_p} \|Var(\mathcal{R}_{r_p^q})\|$, the total number of variables used in the previous expansions from level 0 to $i-1$
• $\sum_{q=1}^{j-1} \|Var(\mathcal{R}_{r_i^q})\|$, the number of variables that would be used if the expansions at level $i$ were performed from left to right up to atom $r_i^j(u_i^j)$
• $k$, the position of $x$ in $Var(\mathcal{R}_{r_i^j})$
• $\text{Max-Index}(Var(G_0))$, the maximum index of the variables used in the initial goal set $G_0$

With this systematic renaming apart of variables, the resulting rule $\overline{\mathcal{R}}_{r_i^j}$ is always the same no matter what micro-step expansion order is chosen at each layer. In the following discussion, unless otherwise specified, the term “WGG expansion history” refers to the structured WGG expansion history. The following example illustrates two micro-steps at level 5 of the structured expansion history for the “Enrollment” example.

Example 7.2:
Let $L$ be the ALD of the “Enrollment” example and $H = WGG_L(\{$A.i-enroll[v1, v2, v3]$\}$, “aux”) be the expansion history. At the end of micro-expansions of level 4, we have
$G_4$ = { B.i-course[v8, v3], B.i-course[v8, v17], B.i-enroll[v9, v18, v17], B.i-enroll[v9, v21, v22], B.student-name(v21, v10), B.i-enroll[v12, v25, v26], B.student-name(v25, v10) }
We now begin level 5 by expanding each atom in $G_4$ from left to right. For micro-step 1, we have
$r_4^1(u_4^1)$ = B.i-course[v8, v3]
$\mathcal{R}_{r_4^1}$ = B.i-course[f(cn), cn] :- B.enrollment(Bs, cn)
$\delta_{r_4^1}$ = { cn/v27, Bs/v28 }
$\overline{\mathcal{R}}_{r_4^1}$ = B.i-course[f(v27), v27] :- B.enrollment(v28, v27)
$\sigma_5^1$ = mgu([f(v27), v27], (v8, v3)) = { v27/v3, v8/f(v3) }
$g_5^1$ = { B.enrollment(v28, v3), B.i-course[f(v3), v17], B.i-enroll[v9, v18, v17], B.i-enroll[v9, v21, v22], B.student-name(v21, v10), B.i-enroll[v12, v25, v26], B.student-name(v25, v10) }
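As a hedged illustration of the proper unification that drives these micro-steps, the following Python sketch unifies two argument lists and always binds the variable with the higher index to the term with the lower index. The encoding is an assumption made for this sketch only and is not taken from the dissertation: a variable v_n is the integer n, a skolem term f(t) is the tuple ("f", t), and `proper_mgu` is a hypothetical helper name.

```python
# Minimal sketch of a "proper" most general unifier (assumed encoding):
#   variable v_n  -> the int n (its index)
#   skolem f(t..) -> the tuple ("f", t, ...)

def walk(t, s):
    """Follow variable bindings in substitution s until an unbound term is reached."""
    while isinstance(t, int) and t in s:
        t = s[t]
    return t

def resolve(t, s):
    """Apply s exhaustively to term t, including inside skolem terms."""
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0],) + tuple(resolve(a, s) for a in t[1:])
    return t

def occurs(v, t, s):
    """Occurs check: does variable v appear in term t under s?"""
    t = walk(t, s)
    if isinstance(t, int):
        return t == v
    return any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s):
    """Unify terms a and b, extending s in place; return False on failure."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return True
    if isinstance(a, int) and isinstance(b, int):
        s[max(a, b)] = min(a, b)      # properness: higher index is replaced
        return True
    if isinstance(a, int) or isinstance(b, int):
        v, t = (a, b) if isinstance(a, int) else (b, a)
        if occurs(v, t, s):
            return False
        s[v] = t                      # bind an OID variable to a skolem term
        return True
    if a[0] != b[0] or len(a) != len(b):
        return False
    return all(unify(x, y, s) for x, y in zip(a[1:], b[1:]))

def proper_mgu(args1, args2):
    """Return the resolved substitution as {variable: term}, or None on failure."""
    if len(args1) != len(args2):
        return None
    s = {}
    for x, y in zip(args1, args2):
        if not unify(x, y, s):
            return None
    return {v: resolve(t, s) for v, t in s.items()}
```

Under this encoding, `proper_mgu([("f", 27), 27], [8, 3])` reproduces the substitution { v27/v3, v8/f(v3) } of micro-step 1. Because of the lower-index preference, the composed result does not depend on the order in which the atoms are expanded.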
For micro-step 2, we have
$r_4^2(u_4^2)\sigma_5^1$ = B.i-course[f(v3), v17]
$\mathcal{R}_{r_4^2}$ = B.i-course[f(cn), cn] :- B.enrollment(Bs, cn)
$\delta_{r_4^2}$ = { cn/v29, Bs/v30 }
$\overline{\mathcal{R}}_{r_4^2}$ = B.i-course[f(v29), v29] :- B.enrollment(v30, v29)
$\sigma_5^2$ = mgu([f(v29), v29], [f(v3), v17]) = { v29/v3, v17/v3 }
$g_5^2$ = { B.enrollment(v28, v3), B.enrollment(v30, v3), B.i-enroll[v9, v18, v3], B.i-enroll[v9, v21, v22], B.student-name(v21, v10), B.i-enroll[v12, v25, v26], B.student-name(v25, v10) }

Notice that, at micro-step 2, $r_4^2(u_4^2)\sigma_5^1$ contains a skolem term f(v3). This is because atoms $r_4^1(u_4^1)$ and $r_4^2(u_4^2)$ share the same OID variable v8, and there is a substitution item v8/f(v3) $\in \sigma_5^1$. Notice also that, because of this skolem term f(v3), the MGU $\sigma_5^2$ generated at micro-step 2 contains a non-trivial substitution item v17/v3, which changes atom $r_4^3(u_4^3)$ from B.i-enroll[v9, v18, v17] to B.i-enroll[v9, v18, v3] in $g_5^2$. □

With the renaming and unification processes described above, we next show that the expansion of micro-steps is order independent; i.e., the structured WGG expansion history will be the same no matter in what order the atoms at any level are expanded.

Definition: Micro-Step Permutation
Let $G_i = r_i^1(u_i^1), .., r_i^{l_i}(u_i^{l_i}),\ l_i \geq 1$ be the goal set at level $i$ in an expansion history. For a permutation $\pi$ on $[1,l_i]$, let $G_{i+1}^{\pi}$ and $\theta_{i+1}^{\pi}$ represent the goal set and the substitution generated from $G_i$ based on the expansion of $r_i^{\pi(j)}(u_i^{\pi(j)})$ at micro-step $j \in [1,l_i]$. □

Lemma 7.1.1: Order Independence of Micro-Steps
Let $G_i = r_i^1(u_i^1), .., r_i^{l_i}(u_i^{l_i}),\ l_i \geq 1$ be the goal set at level $i$ in an expansion history. Then for any two permutations $\pi$ and $\pi'$ on $[1,l_i]$ we have
$G_{i+1}^{\pi} = G_{i+1}^{\pi'}$
$\theta_{i+1}^{\pi} = \theta_{i+1}^{\pi'}$ □

In the next example, we change the order of the two micro-step expansions of Example 7.2 to illustrate Lemma 7.1.1.

Example 7.3:
Continuing with Example 7.2, assume that at level 5 the expansion first expands $r_4^2(u_4^2)$ at micro-step 1 and then expands $r_4^1(u_4^1)$ at micro-step 2.
That is, for micro-step 1, we have
$r_4^2(u_4^2)$ = B.i-course[v8, v17]
$\mathcal{R}_{r_4^2}$ = B.i-course[f(cn), cn] :- B.enrollment(Bs, cn)
$\delta_{r_4^2}$ = { cn/v29, Bs/v30 }
$\overline{\mathcal{R}}_{r_4^2}$ = B.i-course[f(v29), v29] :- B.enrollment(v30, v29)
$\sigma_5^1$ = mgu([f(v29), v29], [v8, v17]) = { v29/v17, v8/f(v17) }
$g_5^1$ = { B.i-course[f(v17), v3], B.enrollment(v30, v17), B.i-enroll[v9, v18, v17], B.i-enroll[v9, v21, v22], B.student-name(v21, v10), B.i-enroll[v12, v25, v26], B.student-name(v25, v10) }

And, for micro-step 2, we have
$r_4^1(u_4^1)\sigma_5^1$ = B.i-course[f(v17), v3]
$\mathcal{R}_{r_4^1}$ = B.i-course[f(cn), cn] :- B.enrollment(Bs, cn)
$\delta_{r_4^1}$ = { cn/v27, Bs/v28 }
$\overline{\mathcal{R}}_{r_4^1}$ = B.i-course[f(v27), v27] :- B.enrollment(v28, v27)
$\sigma_5^2$ = mgu([f(v27), v27], (f(v17), v3)) = { v27/v3, v17/v3 }
$g_5^2$ = { B.enrollment(v28, v3), B.enrollment(v30, v3), B.i-enroll[v9, v18, v3], B.i-enroll[v9, v21, v22], B.student-name(v21, v10), B.i-enroll[v12, v25, v26], B.student-name(v25, v10) }

Because of the special design of our variable renaming process, the renaming substitutions, $\delta_{r_4^1}$ and $\delta_{r_4^2}$, and the renamed ILOG rules, $\overline{\mathcal{R}}_{r_4^1}$ and $\overline{\mathcal{R}}_{r_4^2}$, are the same as in Example 7.2, even though the expansion sequence is different. Also, because the unification function always prefers variables with smaller index, the resulting mgu $\sigma_5^1\sigma_5^2$ and the resulting goal set (up to micro-step 2) $g_5^2$ are the same as in Example 7.2. □

Based on the notion of structured WGG expansion, we define the following shorthands and notations.

Definition: Tree and MGU of an Expansion History
Let $H = WGG_L(G_0, t_0)$ be the expansion history of an ALD $L$.
The Tree of $H$, denoted as $Tree(H)$, is a Directed Acyclic Graph (DAG) $(V,E)$, where
$V = \bigcup_{i=0}^{\infty} G_i$ (where each $G_i$ is viewed as a set of atom occurrences)
$E = \{[r_i^j(u_i^j),\ r_{i+1}^k(u_{i+1}^k)] \mid r_{i+1}^k(u_{i+1}^k) \in \overline{\mathcal{R}}_{r_i^j}.body \cdot \theta_{i+1}\}$
If the initial goal set, $G_0$, is a singleton, we call $H$ a singleton WGG expansion history; if $G_0$ has more than one atom occurrence, then $Tree(H)$ is in fact a forest. We abuse notation slightly by referring to such trees and forests as trees in this context.

For the sake of brevity, the following shorthand will be used. Let $j \geq i \geq 0$,
• $Tree_{[i,j]}(H)$ denotes the subtree of $Tree(H)$ restricted on $\bigcup_{k=i}^{j} G_k$; i.e., the expansion tree from level $i$ to level $j$ of $H$
• $Tree_i(H)$ denotes $Tree_{[i,i]}(H)$
• $MGU_{[i,j]}(H)$ denotes $\theta_i \cdots \theta_j$, i.e., the accumulated substitution resulting from the most general unification process from level $i$ to level $j$ in $H$
• $MGU_i(H)$ denotes $MGU_{[i,i]}(H)$.
Let $C \subseteq Tree_i(H)$ be a set of atom occurrences in level $i \geq 0$, then
• $\overline{C}$ denotes $Tree_i(H) - C$
• $Tree^{C}_{[j,k]}(H)$ denotes the subtree of $Tree_{[j,k]}(H)$ that is descended from $C$.
Let $\{D_1, D_2, D_3\}$ be a partition of $Tree_i(H)$, then
• $MGU_j^{D_1[D_2]D_3}(H)$ denotes the substitutions generated from micro-step $\|Tree_{j-1}^{D_1}(H)\| + 1$ to $\|Tree_{j-1}^{D_2}(H)\|$ at step $j$ based on the expansion sequence $Tree_{j-1}^{D_1}\, Tree_{j-1}^{D_2}\, Tree_{j-1}^{D_3}$. If $D_1$ or $D_3 = \emptyset$, then they are omitted in the notation.
• $MGU_{[j,k]}^{D_1[D_2]D_3}(H)$ denotes $MGU_j^{D_1[D_2]D_3}(H) \cdots MGU_k^{D_1[D_2]D_3}(H)$.
When understood from the context, the $H$ argument in the above expressions can be omitted. □

The next lemma describes how the variables are “inherited” in the expansion history from goal sets in the previous levels.

Lemma 7.1.2: Variable Inheritance
Let $H$ be a WGG expansion history, and $r(u) \in Tree_i$ be an atom occurrence at level $i$. Let $V_r^{other} = \{x \mid x \in Var(\overline{\mathcal{R}}_r.body) - Var(\overline{\mathcal{R}}_r.head)\}$ be the set of existential variables in the body of the renamed rule $\overline{\mathcal{R}}_r$.
Then
(a) $Var(Tree_{i+1}^{\{r(u)\}}) = Var(u \cdot MGU_{i+1}) \cup V_r^{other} = \{t \mid t \in u \cdot MGU_{i+1}\ \text{and}\ t\ \text{is not a skolem term}\} \cup V_r^{other}$
That is, the variables in $Tree_{i+1}^{\{r(u)\}}$ include the variables in $V_r^{other}$ and the variables “inherited” from its ancestor $r(u)$, subject to being changed by $MGU_{i+1}$ and excluding those OID variables replaced by $MGU_{i+1}$ with skolem terms.
(b) for any variable $x \in Var(Tree_i(H))$, if $rank(x) = m \geq 1$, then $MaxRank(Var(x \cdot MGU_{[i+1,i+2]})) = m - 1$ □

As will be shown later, the only way that $MGU_{i+1}(H)$ can substitute a variable $x$ in $Var(Tree_i(H))$ by a distinct variable $y$ (where $index(x) > index(y)$) is when there exist an OID variable $o$, an intermediate relation $I$, and two atom occurrences $I[o,..,x,..]$ and $I[o,..,y,..]$, where $x$ and $y$ are in the same position. In this case we say the two variables $x$ and $y$ are “equivalent” witness variables.

Definition: Witness Equivalence Class
Let $I[o,w_1], I[o,w_2] \in Tree_i(H)$ be two atom occurrences at level $i$ for some intermediate relation $I$. Let $x$ be the variable in $w_1$ in the same position as the variable $y$ in $w_2$, and $x \neq y$. Then we say $x$ and $y$ are equivalent witness variables. A witness equivalent class of $Tree_i(H)$ is a class of equivalent witness variables in $Var(Tree_i(H))$. □

Example 7.4:
As we will see in Section 7.3, the expansion history shown in Figure 7.2 of Example 7.5 has two witness equivalent classes: {v3, v17, v22} and {v18, v21} of $Tree_4$. □

We end this section by summarizing, in the next lemma, the form of substitutions generated in the WGG expansion. In particular, part (a) of the lemma states that, at level $i$, if the schema tag $t_i$ = “base” (i.e., $G_i$ contains only atoms from the base relations in A or B), then the resulting mgu $MGU_{i+1}(H)$ contains only variable-pure substitution items of the form $x/y$, where $x$ is a new variable in a renamed rule, and $y$ is an old variable in $Var(G_i)$. Hence, $MGU_{i+1}(H)$ does not act on $G_i$.
This is mainly because (1) the unification process always generates proper substitutions and (2) each variable in $G_i$ has index less than the variables in the renamed rule.

Part (b) states that, if $t_i$ = “aux”, the expansion of atoms of intermediate relations generates some non-trivial substitutions $\theta^s$ and $\theta^e$; $\theta^s$ contains the skolemization substitution items of $MGU_{i+1}(H)$, and $\theta^e$ contains the variable-pure substitution items of $MGU_{i+1}(H)$. Recall from Examples 7.2 and 7.3 that the expansion of the two micro-steps generates the following non-trivial substitutions:
$\theta^s$ = { v8/f(v3) }
$\theta^e$ = { v17/v3 }
Later we will see that it is the accumulation of these $\theta^s$ and $\theta^e$ during the WGG expansion that helps to find an answer.

Lemma 7.1.3: Substitution Partition
Let $H$ be an expansion history. Then for each expansion step $i \geq 0$
(a) if $t_i$ = “base”, then $MGU_{i+1}(H)$ is a variable-pure substitution that is trivial to $Tree_i(H)$. In particular, for each $(x/y) \in MGU_{i+1}(H)$, $x$ is a new variable in some renamed rule, and $y$ is in $Var(Tree_i(H))$.
(b) if $t_i$ = “aux”, then $MGU_{i+1}(H) = \theta^s \cup \theta^e \cup \theta^t$, where $\theta^s$, $\theta^e$, $\theta^t$ are a skolemization, variable-pure, and trivial substitution to $Tree_i(H)$, respectively. In particular,
$\theta^e = \{(x/y) \mid C\ \text{is an equivalent witness class of}\ Tree_i,\ x,y \in C,\ x \neq y,\ \text{and}\ index(y) = MinIndex(C)\}$
$\theta^s = \{(o/f_r(u \cdot \theta^e)) \mid r[o,u] \in Tree_i(H),\ r\ \text{is an intermediate relation, and}\ f_r\ \text{is the skolem function associated with}\ r\}$
$\theta^t$ is a trivial variable-pure substitution similar to the one defined in (a) □

7.2 The Soundness of WGG Algorithm

This section presents one of the main theorems of this chapter, the Soundness Theorem. It shows that the output of the WGG algorithm, introduced in Section 6.4, is indeed a valid witness generator. This result is based on the Embedding Lemma. Before stating the lemma, we need to define the “variable assignment” (or the “embedding”) on an expansion tree.

Definition: Variable Assignment (V.A.)
A substitution $\alpha = \{v_1/t_1, ..., v_n/t_n\}$ is a Variable Assignment (V.A.) if, for each $i \in [1,n]$, $t_i \in Dom$. That is, a V.A. can be viewed as a mapping from variables in $VAR$ to values in $Dom$. Let $(I_A, J_B)$ be an instance pair in the extended semantic correspondence $SC_{AB}$ and $I_{all} = I_A \cup J_B$. V.A. $\alpha$ can be extended to skolem terms w.r.t. $I_{all}$ in the following way:
$f(t)\alpha = o$ if $f$ is the skolem function for $r$, $r[o,w] \in I_{all}$, and $w = t\alpha$;
$f(t)\alpha$ is undefined otherwise. □

Lemma 7.2.1: Embedding
Let $H = WGG_L(G_0, t_0)$ be an expansion history of an ALD $L$, and $(I_A, J_B)$ be an instance pair in the extended semantic correspondence $SC_{AB}$ defined by $L$. Let $I_{all} = I_A \cup J_B$. If there is a V.A. $\alpha$ w.r.t. $I_{all}$ such that $G_i\alpha \subseteq I_{all}$, $i \geq 0$, then for any $k \geq i$ there exists a V.A. $\beta$ w.r.t. $I_{all}$ such that
$\beta$ is an extension of $\alpha$ w.r.t. $I_{all}$
$G_j\beta \subseteq I_{all},\ j \in [0,k]$
$\forall x \in Var(Tree_{[0,k]}(H)),\ x\beta = x \cdot MGU_{[0,k]}(H)\beta$ □

The Embedding Lemma states that if there is an embedding from one of the goal sets in expansion history $H$ to an instance pair $(I_A, J_B)$ in the extended semantic correspondence, then there exists an extension of this embedding that maps each goal set of $H$ into the instance pair. The lemma serves as a bridge between the expansion history and the rule (the witness generator) generated by the WGG algorithm. Next, we are ready to prove the soundness of the WGG algorithm.

Theorem 7.2.1: Soundness
Let $WG_r$ be the rule generated by the WGG algorithm with input $(L,r)$, where $L$ is a well-founded ALD and $r$ is an intermediate relation appearing in $L$. Then $WG_r$ is a witness generator for $r$.

Proof: W.l.o.g., assume $r$ is an intermediate relation in $aux_A$, and $H = WGG_L(r[o,w],$ “aux”$)$ is the expansion history. Suppose the WGG algorithm stops at level $s > 0$ and constructs $WG_r$ as follows.
$WG_r.head = G_0\nu$
$WG_r.body = (G_a \cup B_1 \cup B'_2)\nu$
where $\nu$, $G_a$, $B_1$ and $B_2$ are constructed as described in the algorithm.
Let $I_{all} = I_A \cup J_B$. Since $B'_2$ contains the rule bodies of witness generators for atoms in $B_2$, by the definition of witness generator and the semantics of nrecILOG¬, it is sufficient to prove the theorem by showing that
$\forall (I_A, J_B) \in SC_{AB},\ I_A.r = WG'_r(I_{all})$
where
$WG'_r.head = G_0\nu$
$WG'_r.body = (G_a \cup B_1 \cup B_2)\nu$
That is, $WG'_r$ is the “mutated” witness generator for $r$, obtained by using $B_2$, instead of $B'_2$, in its rule body.

($\subseteq$) $I_A.r \subseteq WG'_r(I_{all})$
Let $r[\hat{o},\hat{w}]$ be a tuple in $I_A.r$. Then there exists a V.A. $\alpha$ w.r.t. $I_{all}$ such that
$r[\hat{o},\hat{w}] = r[o,w]\alpha = G_0\alpha \in I_{all}$ (1)
From (1) and the Embedding Lemma, there exists another V.A. $\beta$ w.r.t. $I_{all}$ such that
$\beta$ is an extension of $\alpha$ (2)
$Tree_{[0,s]}\beta \subseteq I_{all}$ (3)
$\forall x \in Var(Tree_{[0,s]}(H)),\ x \cdot MGU_{[0,s]}(H)\beta = x\beta$ (4)
From (4) and the idempotency of $\nu \subseteq MGU_{[0,s]}$, we know
$\forall x \in Var(Tree_{[0,s]}),\ x \cdot \nu\beta = x \cdot \beta$ (5)
From (1), (2), (3), and (5), we have
$G_0 \cdot \nu\beta = G_0 \cdot \beta = r[\hat{o},\hat{w}] \in I_A.r$
$(G_a \cup B_1 \cup B_2) \cdot \nu\beta \subseteq Tree_{[0,s]} \cdot \beta \subseteq I_{all}$ (6)
Finally, by the semantics of nrecILOG¬ and (6), we know
$r[\hat{o},\hat{w}] = G_0 \cdot \nu\beta \in WG'_r(I_{all})$

($\supseteq$) $I_A.r \supseteq WG'_r(I_{all})$
Let $r[\hat{o},\hat{w}] \in WG'_r(I_{all})$. Then, by the semantics of nrecILOG¬, there exists a V.A. $\alpha$ w.r.t. $I_{all}$ such that
$WG'_r.head \cdot \alpha = G_0\nu \cdot \alpha = r[o,w]\nu \cdot \alpha = r[\hat{o},\hat{w}]$ (7)
$WG'_r.body \cdot \alpha = (G_a \cup B_1 \cup B_2)\nu \cdot \alpha \subseteq I_{all}$ (8)
Let $(\nu\alpha)|_{Var(G_a)}$ denote V.A. $\nu\alpha$ restricted to the variables in $Var(G_a)$. By (8), we know $G_a(\nu\alpha)|_{Var(G_a)} \subseteq I_{all}$. Therefore, by the Embedding Lemma, there is a V.A. $\beta$ such that
$\beta$ is an extension of $(\nu\alpha)|_{Var(G_a)}$ w.r.t. $I_{all}$ (9)
$G_i\beta \subseteq I_{all},\ i \in [0,s]$ (10)
$\forall x \in Var(Tree_{[0,s]}(H)),\ x \cdot MGU_{[0,s]}(H)\beta = x\beta$ (11)

Remark 7.2.2: The Embedding Lemma only works when the given V.A., $\alpha$, maps one goal set to $I_{all}$. Here, however, $\nu\alpha$ maps goal set $G_a$ and two sets $B_1$ and $B_2$ of intermediate atoms to $I_{all}$. Therefore the extension V.A., $\beta$, does not necessarily agree with $\nu\alpha$ on $B_1$ and $B_2$.
Also, since $w$ may contain OID variables appearing in $B_1 \cup B_2$ but not in $G_a$, it is possible that $r[o,w]\beta \neq r[\hat{o},\hat{w}]$. ◁

To prove $r[\hat{o},\hat{w}] = r[o,w]\beta \in I_{all}.r$, we first show that
$(G_a \cup B_1 \cup B_2)\beta = (G_a \cup B_1 \cup B_2)\nu\alpha$ (12)
For the sake of brevity, let $S_i = Tree_i(H) \cap (G_a \cup B_1 \cup B_2)$, and $S_{[i,j]} = \bigcup_{k \in [i,j]} S_k$. We prove (12) by induction and show that, for each $i \in [2,s]$,
$S_{[i,s]}\nu\alpha = S_{[i,s]}\beta$ (13)
Since $\beta$ is an extension of $(\nu\alpha)|_{Var(G_a)}$, (13) holds when $i = s$. Assume (13) also holds when $i = k$, $2 < k \leq s$, but does not hold when $i = k-1$. That is, there is an atom $s[x,v] \in S_{k-1}$ such that
$s[x,v]\nu\alpha \neq s[x,v]\beta$ (14)
By the Variable Inheritance Lemma, we know $v \subseteq Var(S_{[k,s]})$. By our induction hypothesis, we know $v\nu\alpha = v\beta$. Let $v\nu\alpha = v\beta = \hat{v}$. Then (8), (10), and (14) imply that there are two distinct tuples $s[\hat{x},\hat{v}]$ and $s[\hat{x}',\hat{v}]$ in $I_{all}$ with $\hat{x} \neq \hat{x}'$. But this violates the Skolem Function Dependency property of $I_{all}.s$, since $s$ is an intermediate relation.

By the Variable Inheritance Lemma and the construction of $G_a$, $B_1$ and $B_2$, we know
$Var(r[o,w]\nu) \subseteq Var((G_a \cup B_1 \cup B_2)\nu)$
Therefore, from (12), we know
$r[\hat{o},\hat{w}] = r[o,w]\nu\alpha = r[o,w]\beta \in I_A.r$ □

7.3 OID Component

As discussed in the preceding section, non-trivial variable-pure substitution items can only occur among variables of a witness equivalence class, whose members are variables appearing in atoms “linked” by identical OID variables. This leads to an important notion of WGG expansion histories, called the OID Component. Recall that, in Example 7.2, the expansion of atoms B.i-course[v8, v3] and B.i-course[v8, v17] with overlapping OID variable v8 can generate a non-trivial variable-pure substitution item (v17/v3). In fact, the transitive closure of atoms with overlapping OID variables, called an OID component, can be viewed as a unit of expansion in a WGG expansion history. This is based on the following definition.

Definition: OID Reachable, OID Component
Let $H = WGG_L(G_0, t_0)$ be an expansion history.
Two atom occurrences $r(u), s(v) \in Tree_i(H)$ are said to be OID linked, denoted $r(u) \overset{oid}{\approx} s(v)$, if $\exists x \in u \cap v$ with $rank(x) \geq 1$. Variable $x$ is called an OID link between atoms $r(u)$ and $s(v)$. Also, we define $\overset{oid}{\approx}$ to be reflexive; i.e., $r(u) \overset{oid}{\approx} r(u)$ for each atom occurrence $r(u)$ in $Tree_i(H)$. Atoms $r(u)$ and $s(v)$ are OID reachable, denoted $r(u) \overset{oid}{\approx}{}^{*} s(v)$, if there exist $r_0(u_0), .., r_n(u_n) \in Tree_i(H)$ such that $r_0(u_0) = r(u)$, $r_n(u_n) = s(v)$, and $r_k(u_k) \overset{oid}{\approx} r_{k+1}(u_{k+1})$ for $k \in [0,n-1]$. Let $C \subseteq Tree_i(H)$ be a set of atom occurrences at level $i$. $C$ is an OID component iff $C$ is a connected component under OID reachability. □

Example 7.5:
Consider the expansion history $WGG_L(\{$A.i-enroll[v1, v2, v3]$\}$, “aux”) of the “Enrollment” example. The 14 OID components, labeled $C_0, \cdots, C_{13}$, in $Tree_{[0,5]}(H)$ are shown in Figure 7.2. □

Definition: Atomic Ancestor
An atom $r_i^j(u_i^j) \in Tree_i$ is an atomic ancestor of an OID component $C \subseteq Tree_k$, $k > i$, if $C \subseteq Tree_k^{\{r_i^j(u_i^j)\}}$. □

Example 7.6:
Atom B.i-enroll[v1, v2, v3] of $C_0$ in Figure 7.2 is the atomic ancestor of all the OID components beneath it. However, none of the atoms in [...] is an atomic ancestor of $C_5$, $C_6$, or $C_7$. □

By definition, each atom in an OID component is OID reachable from the other members of the OID component. As suggested by the Substitution Partition Lemma, expansion of the atoms in an OID component may generate non-trivial variable-pure substitutions which have an impact on future expansion. Thus, intuitively speaking, an OID component serves as a natural unit of expansion and provides an abstract view of the WGG expansion history. That is, instead of focusing on each individual atom, it helps the analysis of expansion trees by “condensing” them into nodes of OID components.

[Figure 7.2: OID components for the “Enrollment” example (levels 0 to 5 of the expansion tree, grouped into OID components)]

Definition: Condensation
Let $G = (V,E)$ be a graph, and $V'$ be a partition of $V$. Then the condensation of $G$ based on $V'$, denoted by $G[V']$, is defined as $(V',E')$ where
$E' = \{(V_1,V_2) \mid V_1, V_2 \in V',\ v_1 \in V_1,\ v_2 \in V_2,\ (v_1,v_2) \in E\}$ □

[Figure 7.3: Lemma OID Component Condensation, panels (a) and (b)]

Definition: OID Component Condensation
Let $H$ be an expansion history of an ALD $L$. The OID component condensation of $Tree_{[i,j]}(H)$, $j \geq i \geq 0$, denoted $[Tree_{[i,j]}(H)]$, is a condensation by the partition:
$\{C \mid C \subseteq Tree_k(H),\ C\ \text{is an OID component},\ k \in [i,j]\}$ □

Figure 7.2 of Example 7.5 shows the condensation of $WGG_L(\{$A.i-enroll[v1, v2, v3]$\}$, “aux”) for the “Enrollment” example. As depicted in Figure 7.3, the next lemma states that (a) the descendants of two non-OID reachable atoms are not OID reachable, and (b) the OID component condensation of an expansion tree is a tree.

Lemma 7.3.1: OID Component Condensation
Let $H$ be an expansion history of an ALD $L$.
Then
(a) For each $r(u), r'(v) \in Tree_i(H)$, if $r(u) \overset{oid}{\not\approx}{}^{*} r'(v)$ then $\forall j > 0$, $Tree_{i+j}^{\{r(u)\}}(H) \overset{oid}{\not\approx}{}^{*} Tree_{i+j}^{\{r'(v)\}}(H)$
(b) The OID component condensation $[Tree_{[i,j]}(H)]$ of $Tree_{[i,j]}(H)$, $j \geq i \geq 0$, is a tree/forest. □

[Figure 7.4: Lemma OID Component. Panel annotations: “OID link decreases 1 after every 2 steps”; “less than m*2+2 steps”]

The next lemma and Figure 7.4 describe various properties of OID components in a WGG expansion history. In particular, part (a) states that the maximum rank of the OID links between descendants of any two atoms will always decrease by 1 after 2 levels of expansion until the maximum rank reaches 0. Part (b) gives a bound for the size of any OID component. It is the size of the biggest $(2 \times m + 2)$th level descendant of any singleton atom in the expansion history, where $m$ is the rank of the ALD. Finally, (c) establishes a bound on the number of non-isomorphic OID components in the expansion history. The proof of (c) is included here, since it describes how to construct the bound. This bound will later be used to calculate how quickly a WGG expansion leads to an answer, if it ever leads to one.

Lemma 7.3.2: OID Component
Let $L$ be a well-founded ALD with maximum rank $m$, and $H = WGG_L(G_0, t_0)$ be an arbitrary singleton expansion history. Then at any expansion level $i$, we have:
(a) Let $C \subseteq Tree_i(H)$ be a set of atom occurrences at level $i$, and
$V_k = Var(Tree_k^{C}) \cap Var(Tree_k^{\overline{C}}),\ k \geq i$
If $Maxrank(V_i) = n \geq 1$ then $Maxrank(V_{i+2}) = n - 1$
(b) The size of each OID component in $Tree(H)$ is bounded by
$Max\{\|Tree^{\{r(u)\}}_{m*2+2}(H)\| \mid r(u) \in Tree(H)\}$
(c) There is a number, denoted $\#OIDComp(H)$, which bounds the number, up to variable isomorphism, of all possible OID components occurring in $Tree_{[0,\infty]}(H)$.

Proof: For the proofs of parts (a) and (b), refer to Appendix D. Part (c) follows directly from (b) and the fact that the schema of $L$ is finite. The specific bound can be computed as follows.
Let $\text{Max-Rank}(L)$ be $m$ and the number of relations in $Schema(L)$ be $N$. Let the maximum number of columns for a relation in $Schema(L)$ be $C$ and the maximum number of atoms in the right hand side of an ILOG rule in $L$ be $R$. Then from part (b) of this lemma, we know the number of atoms in an OID component is bounded by
$R^{2*m+2}$
and the number of different combinations of predicates in an OID component is bounded by
$N^{R^{2*m+2}}$
That is, there are at most $N^{R^{2*m+2}} \times C$ column positions to fill with different variables and therefore to form a different OID component. Since under variable isomorphism only the equality relationship among variables matters, not the specific variable chosen, for the first column position there is only one possible variable pattern. For the second column position there can be two possibilities: choose the same variable as the first column or choose a different one. For the third column position there are at most three possibilities: choose the same as the first, choose the same as the second, or choose a different one. We can carry out the column filling process for each of the $N^{R^{2*m+2}} \times C$ column positions, and the total number of possibilities is bounded by
$\#OIDComp(H) = (N^{R^{2*m+2}} \times C)!$
This is the bound for the number of possible OID components under a given ALD, modulo variable isomorphism. □

Example 7.7:
Recall Example 7.5. As indicated in Figure 7.2, the OID component condensation of the goal sets is indeed a tree.
To illustrate part (a) of the OID Component Lemma, let
$C$ = { A.course-no(v8, v3), A.enroll-course(v9, v8) } $\subseteq Tree_3(H)$
$\overline{C}$ = { A.enroll-student(v9, v10), A.enroll-student(v12, v10) } $\subseteq Tree_3(H)$
then
$\text{Max-Rank}(Var(C) \cap Var(\overline{C})) = \text{Max-Rank}(\{v9\}) = 2$
The descendants of $C$ and $\overline{C}$ at level 5 are
$Tree_5^{C}(H)$ = { B.enrollment(v28, v3), B.enrollment(v30, v3), B.enrollment(v18, v17) }
$Tree_5^{\overline{C}}(H)$ = { B.enrollment(v18, v17), B.student-name(v21, v10), B.enrollment(v25, v26), B.student-name(v25, v10) }
and
$\text{Max-Rank}(Var(Tree_5^{C}(H)) \cap Var(Tree_5^{\overline{C}}(H))) = \text{Max-Rank}(\{v18\}) = 1$
Actually, if $C$ is any set of atom occurrences in $Tree_3(H)$, not necessarily an OID component, the above result remains the same. That is, the maximum rank of the OID linkages between any two atoms will always decrease by one after two levels of expansion.

Since the maximum rank $m$ for the ALD is 2, the size of any OID component, according to part (b) of Lemma 7.3.2, is bounded by the size of the 6th ($2 * 2 + 2$) level descendant of any singleton atom. As shown in Figure 7.2, every OID component has a singleton ancestor less than 6 levels away.

According to Lemma 7.3.2(c), the number of possible OID components under variable isomorphism is bounded by
$(N^{R^{2*m+2}} \times C)!$
And the parameters for the “Enrollment” example are
m = 2, the rank of the ALD
N = 11, the total number of relations in Schema(L)
R = 4, the maximum number of atoms in the right hand side of a rule
C = 3, the maximum number of columns in a relation □

[Figure 7.5: Information transfer between OID components]

Let $C \subseteq Tree_i(H)$ be an OID component. Although $C$ can be viewed as a unit of expansion, the expansion of $C$ is far from being isolated from the other atoms in $\overline{C}$; the expansion of $\overline{C}$ may generate substitutions affecting atoms in $C$, which affect further expansion of $C$.
This phenomenon is articulated in the next lemma. In particular, it states that the only way descendants of $\overline{C}$, $Tree_{i+k}^{\overline{C}}(H)$, $k > 0$, can affect descendants of $C$, $Tree_{i+k}^{C}(H)$, is by generating substitutions acting on the variables appearing in both $Var(Tree_{i+k}^{C}(H))$ and $Var(Tree_{i+k}^{\overline{C}}(H))$. Figure 7.5 illustrates this lemma.

Lemma 7.3.3: Information Transfer between OID Components
Let $H$ be an expansion history for a well-founded ALD $L$, and $C \subseteq Tree_i(H)$ be a set of atom occurrences at level $i \geq 0$ such that $C \overset{oid}{\not\approx}{}^{*} \overline{C}$. And let
$V = Var(C) \cap Var(\overline{C})$
This contains only non-OID variables. For each level $j$ ($j > i$), let $(x/y)$ be a substitution item in $MGU_j^{C[\overline{C}]}(H)$. If $(x/y)$ acts on $Tree_j^{C}(H)$ then $(x/y)$ acts on $V \cdot MGU_{[i+1,j]}(H)$. This also contains only non-OID variables. □

7.4 Decidability of Halting for WGG Expansion

This section presents another main theorem of this chapter, Theorem 7.4.1. It states that the termination of the WGG algorithm on a given well-founded ALD is decidable; there exists a number $\mathcal{B}$ such that if WGG finds a witness generator, it must find it within $\mathcal{B}$ levels of expansion. This result is based on the following key lemmas.

Homomorphism Lemma: If there is a homomorphism between the initial goal sets of two expansion histories $H_1$ and $H_2$, then, for any $k > 0$, there is a homomorphism between $Tree_{[0,k]}(H_1)$ and $Tree_{[0,k]}(H_2)$, and between $MGU_{[0,k]}(H_1)$ and $MGU_{[0,k]}(H_2)$.
MGU Look Ahead Lemma: If we apply the variable-pure substitution generated up to level $j$ in advance to the goal set $G_i$ ($i < j$) to form $G'_i$ and continue the expansion based on $G'_i$, then the following are true: (a) the variable-pure substitutions generated in the new expansion up to level $j$ do not act on $G'_i$, (b) after level $j$ the new expansion generates goal sets and mgu's exactly as the original expansion, (c) the OID components in $G'_i$ expand independently of each other until level $j$, and, finally, (d) each OID component at level $i$ expands independently from the rest; as a result the mgu's of $H'$ generated from levels $i+1$ to $j$ are equal to the union of the mgu's generated by each OID component expanding in isolation.

Uselessness Lemma: This lemma states the condition under which a set $C$ of atom occurrences in a goal set can be regarded as “useless”; that is, the expansion of $C$ will not generate an answer. This lemma also serves as a safety valve against endless expansion; it states a condition under which the WGG expansion will never yield an answer and the algorithm enters an endless loop.

Intuitively, the Uselessness Lemma is based in part on the reasoning of the pumping lemma for context-free languages. Here the tree used is the OID condensation of $Tree(H)$. The development is much more intricate than the pumping lemma because: (1) homomorphisms must be used to identify nodes of the tree, and (2) special arguments are needed to ensure that the expansions of various parts are independent of the other parts (i.e., to ensure a kind of context-freeness).

The substitutions generated from a WGG expansion can be viewed as the accumulated knowledge about the relationships among variables appearing in the expansion. As the expansion proceeds, various variables are equated by variable-pure substitution items. These equated variables form a Variable Equivalent Class.
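The grouping of equated variables can be sketched with a small union-find structure. This is only an illustrative sketch: the function name `variable_equivalence_classes` and the encoding of a variable $u_n$ as the integer n are assumptions made here. The input items are the variable-pure items of $\theta_1$ and $\theta_2$ for history $H$ in Table 7.1 of Example 7.8; the skolemization item u1/f(u2,u6) is left out because it is not variable-pure.

```python
from collections import defaultdict

def variable_equivalence_classes(items, variables=()):
    """Group variables equated by variable-pure items (x, y), read as x/y.

    Variables are ints (their index). Returns {representative: set_of_members},
    where the representative is the member with the smallest index.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            # keep the smaller index as root, mirroring proper substitutions
            parent[max(rx, ry)] = min(rx, ry)

    for v in variables:          # register variables that are never equated
        find(v)
    for x, y in items:
        union(x, y)

    classes = defaultdict(set)
    for v in list(parent):
        classes[find(v)].add(v)
    return dict(classes)

# Variable-pure items of theta_1 and theta_2 for H in Table 7.1 (u_n -> n).
theta_items = [(4, 1), (5, 2), (8, 1), (9, 3),
               (3, 2), (10, 6), (12, 2), (13, 6), (14, 2), (15, 6)]
vecs = variable_equivalence_classes(theta_items, variables=range(1, 16))
```

With this input, `vecs` groups the variables into {u1,u4,u8}, {u2,u3,u5,u9,u12,u14}, {u6,u10,u13,u15}, {u7} and {u11}, each keyed by its smallest index. That smallest-index representative is the variable a homomorphism has to map to.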
Definition: Variable Equivalent Class (VEC)
Let $\theta$ be a proper substitution. Two variables $x$ and $y$ are equivalent w.r.t. $\theta$ if either $x/y$ or $y/x$ is in $\theta$. By definition, a variable $x$ is equivalent to itself. A Variable Equivalent Class (VEC) w.r.t. $\theta$ is a class of equivalent variables w.r.t. $\theta$. □

Because of the special property of the mgu process, only the variables with smaller index in a variable-pure substitution can be passed down to the next level. Due to this technicality, the homomorphism on expansion histories defined below has to map each variable of the source expansion history to the equivalent variable with smallest index in a VEC of the destination history.

Definition: Homomorphism
Let $L$ be a well-founded ALD and $\rho$ be a variable mapping from a set of atom occurrences $G_0$ to another set of atom occurrences $G'_0$; that is,
$G_0\rho = G'_0$
Let $H = WGG_L(G_0, t_0)$ and $H' = WGG_L(G'_0, t_0)$ be two expansion histories. Then the homomorphism from $H$ to $H'$ derived from $\rho$ is a sequence of variable mappings $\{\rho_0, \rho_1, ...\}$ constructed as follows. For each $i \geq 0$, let $\hat{\rho}_{i+1}$ be a variable mapping from $\overline{\mathcal{R}}_{r_i^1}, .., \overline{\mathcal{R}}_{r_i^{l_i}}$ to $\overline{\mathcal{R}}'_{r_i^1}, .., \overline{\mathcal{R}}'_{r_i^{l_i}}$; i.e., $\hat{\rho}_{i+1}$ is a variable mapping from the renamed rules of $Tree_i(H)$ to the renamed rules of $Tree_i(H')$. Let $\rho_0 = \rho$. Then $\rho_{i+1}$ is defined as
$\rho_{i+1} = \hat{\rho}_{i+1} \cup \{x \mapsto y_{min} \mid (x \mapsto y) \in \rho_i;\ y, y_{min}\ \text{are in the same VEC}\ C\ \text{by}\ MGU_{[0,i+1]}(H');\ y_{min}\ \text{is the variable with minimum index in}\ C\}$
The homomorphism from $H$ to $H'$ up to level $k$, denoted as $\rho_{[0,k]}$, is the sequence $\{\rho_0, .., \rho_k\}$. □

Before we can make a connection between two expansion histories, $H$ and $H'$, with homomorphic initial goal sets, we need to define the $\otimes$ operator.
In essence, the $\otimes$ operator maps $MGU(H)$ to $MGU(H')$. Note that in the following definition, $\otimes$ ensures the properness of $MGU(H')$.

Table 7.1: Two homomorphic expansion histories

Expansion history H:
level | $G_i$ | $\theta_i$
0 | A.enroll-student(u1, u2), A.enroll-student(u1, u3) | ∅
1 | B.i-enroll[u1, u2, u6], B.student-name(u2, u7), B.i-enroll[u1, u3, u10], B.student-name(u3, u11) | u4/u1, u5/u2, u8/u1, u9/u3
2 | B.enrollment(u2, u6), B.student-name(u2, u7), B.enrollment(u2, u6), B.student-name(u2, u11) | u1/f(u2, u6), u3/u2, u10/u6, u12/u2, u13/u6, u14/u2, u15/u6

Expansion history H':
level | $G'_i$ | $\theta'_i$
0 | A.enroll-student(v5, v6), A.enroll-student(v5, v4) | ∅
1 | B.i-enroll[v5, v6, v9], B.student-name(v6, v10), B.i-enroll[v5, v4, v13], B.student-name(v4, v14) | v7/v5, v8/v6, v11/v5, v12/v4
2 | B.enrollment(v4, v9), B.student-name(v4, v10), B.enrollment(v4, v9), B.student-name(v4, v14) | v5/f(v4, v9), v6/v4, v13/v9, v15/v4, v16/v9, v17/v4, v18/v9

Table 7.2: Example homomorphism for the two expansion histories

$\rho_0$ | u1 ↦ v5, u2 ↦ v6, u3 ↦ v4
$\rho_1$ | u1 ↦ v5, u2 ↦ v6, u3 ↦ v4, u4 ↦ v7, u5 ↦ v8, u6 ↦ v9, u7 ↦ v10, u8 ↦ v11, u9 ↦ v12, u10 ↦ v13, u11 ↦ v14
$\rho_2$ | u1 ↦ v5, u2 ↦ v4, u3 ↦ v4, u4 ↦ v5, u5 ↦ v4, u6 ↦ v9, u7 ↦ v10, u8 ↦ v5, u9 ↦ v4, u10 ↦ v9, u11 ↦ v14, u12 ↦ v15, u13 ↦ v16, u14 ↦ v17, u15 ↦ v18

Definition: $\otimes$
Let $H = WGG_L(G_0, t_0)$ and $H' = WGG_L(G'_0, t_0)$ be two expansion histories of ALD $L$, and $\rho_{[0,k]}$ be a homomorphism from $H$ to $H'$ up to level $k$. Then
$MGU_{[0,k]}(H) \otimes \rho_{[0,k]} = \{x\rho_i/t\rho_k \mid (x/t) \in MGU_{[0,k]}(H)\} - \{x/x \mid x \in VAR\}$ □

[Figure 7.6: Homomorphism Lemma. For $H = WGG_L(G_0, t)$ and $H' = WGG_L(G'_0, t)$: $Tree_i(H)\rho_i = Tree_i(H')$ and $MGU_{[0,i]}(H) \otimes \rho_{[0,i]} = MGU_{[0,i]}(H')$]

Example 7.8:
Consider the following two homomorphic expansion histories of the “Enrollment” example.
$H = WGG_L(\{$A.enroll-student($u_1$, $u_2$), A.enroll-student($u_1$, $u_3$)$\}$, “base”)
$H' = WGG_L(\{$A.enroll-student($v_5$, $v_6$), A.enroll-student($v_5$, $v_4$)$\}$, “base”)

Note that $\rho : G_0 \to G'_0$, and $\rho = \{u_1 \mapsto v_5,\ u_2 \mapsto v_6,\ u_3 \mapsto v_4\}$.

The two expansion histories are shown in Table 7.1 above. The VECs defined by $\theta_{[0,2]}$ are:
$\{u_4, u_1, u_8\}$ $\{u_5, u_2, u_3, u_9, u_{12}, u_{14}\}$ $\{u_6, u_{10}, u_{13}, u_{15}\}$ $\{u_7\}$ $\{u_{11}\}$
And the VECs defined by $\theta'_{[0,2]}$ are:
$\{v_8, v_6, v_{12}, v_4, v_{15}, v_{17}\}$ $\{v_{13}, v_9, v_{16}, v_{18}\}$ $\{v_{10}\}$ $\{v_{14}\}$
The homomorphism from $H$ to $H'$ up to level 2 is shown in Table 7.2. By applying the homomorphism $\rho_{[0,2]}$ to the MGUs of expansion history $H$, we have

$$\theta_{[0,2]} \otimes \rho_{[0,2]} = \theta'_{[0,2]}$$ □

Lemma 7.4.1: Homomorphism
Let $H = WGG_L(G_0, t)$ and $H' = WGG_L(G'_0, t)$ be two expansion histories of a well-founded ALD $L$. If there exists a variable mapping $\rho$ from $G_0$ to $G'_0$, and the homomorphism from $H$ to $H'$ derived from $\rho$ is $\{\rho_0, \rho_1, \ldots\}$, then for each $i \ge 0$

$$Tree_i(H)\rho_i = Tree_i(H')$$
$$MGU_{[0,i]}(H) \otimes \rho_{[0,i]} = MGU_{[0,i]}(H')$$ □

The Alter operator constructs a new WGG expansion history $H'$ from an expansion history $H$ by: (1) applying a variable-pure substitution, $\theta$, to $G_i$ of $H$ and (2) continuing the expansion from level $i$ with $G_i \cdot \theta$. $H'$ and $H$ are equivalent up to level $i$.

Definition: History Alteration Operator
Let $H = WGG_L(G_0, t_0) = \{(G_i, \theta_i, t_i),\ i \ge 0\}$ be a WGG expansion history.
The history alteration operation on $H$ at level $i \ge 0$ with a proper variable-pure substitution $\theta$, denoted $Alter(H, \theta, i)$, is the infinite sequence $\{(G'_j, \theta'_j, t_j),\ j \ge 0\}$ where

$$G'_j = \begin{cases} G_j & j \in [0, i-1] \\ G_i\theta & j = i \\ \text{constructed from the expansion level using } G'_{j-1} & j > i \end{cases}$$

$$\theta'_j = \begin{cases} \theta_j & j \in [0, i-1] \\ \theta_i\theta & j = i \\ \text{constructed from the expansion level using } G'_{j-1} & j > i \end{cases}$$ □

Example 7.9: Continuing with Example 7.2, we can construct a new expansion history

$$H' = Alter(H, \{v_{17}/v_3\}, 4)$$

In the new expansion history, $H'$, at level 5 after the second micro-step, we have

$G'_5 = \{$B.enrollment($v_{28}$, $v_3$), B.enrollment($v_{30}$, $v_3$), B.i-enroll[$v_9$, $v_{19}$, $v_3$], B.i-enroll[$v_9$, $v_{21}$, $v_{22}$], B.student-name($v_{21}$, $v_{10}$), B.i-enroll[$v_{12}$, $v_{25}$, $v_6$], B.student-name($v_{25}$, $v_{10}$)$\}$
$\theta'_5 = \{v_{27}/v_3,\ v_8/f(v_3),\ v_{29}/v_3\}$

[Figure 7.7: MGU Look Ahead Lemma. The figure compares $H$ with the altered history $H' = Alter(H, \xi, i)$ and illustrates parts (a) to (d) of the lemma.]

Note that in the original expansion history $H$, the substitution item $(v_{17}/v_3)$ occurs at level 5. By applying $(v_{17}/v_3)$ in advance to the goal set of level 4, the resulting mgu at level 5 of the new expansion is the same as in the original expansion except that $(v_{17}/v_3)$ is not included in $MGU_5(H')$. A generalization of this phenomenon is demonstrated in the MGU Look Ahead Lemma. □

Lemma 7.4.2: MGU Look Ahead
Let $H = WGG_L(G_0, t_0)$ be an expansion history of an ALD $L$. For some $j > i \ge 0$, let

$$\xi = \text{var-pure}(MGU_{[i+1,j]}(H))\,|_{Tree_i(H)}$$

be the variable-pure substitution items in $MGU_{[i+1,j]}(H)$ that act on $Tree_i(H)$, and let

$$H' = Alter(H, \xi, i)$$

be the altered expansion history obtained by applying $\xi$ in advance to the goal set of level $i$.
Then we have
(a) $\text{var-pure}(MGU_{[i+1,j]}(H'))$ does not act on $Tree_i(H') = Tree_i(H)\xi$.
(b) $\forall k > j$, $Tree_k(H) = Tree_k(H')$ and $MGU_{k+1}(H) = MGU_{k+1}(H')$. That is, the expansions after level $j$ are the same for both histories.
(c) $\forall k > j$, $MGU_{[i+1,k]}(H) = \xi \cdot MGU_{[i+1,k]}(H')$.
(d) Assume $Tree_i(H')$ contains $n$ OID components $C_1, \ldots, C_n$. Then

$$MGU_{[i+1,j]}(H') = \bigcup_{k=1}^{n} MGU^{C_k}_{[i+1,j]}(H')$$

That is, each OID component expands independently. □

[Figure 7.8: Uselessness Lemma. The figure shows $H = WGG(\{r[o, w]\}, \text{aux})$ with $\xi = \text{var-pure}(MGU_{[0,l]}(H))$ and illustrates parts (a) and (b) of the lemma.]

Lemma 7.4.3: Uselessness
Let $H = WGG_L(\{r[o, w]\}, \text{“aux”})$ be the expansion history of a well-founded ALD $L$, where $r$ is an intermediate relation appearing in $L$. At level $l > 0$ let $\xi = \text{var-pure}(MGU_{[0,l]}(H))$.

(a) Let $C \subseteq Tree_i(H)$ and $D \subseteq Tree_j(H)$ $(l > j > i)$ be two OID components such that $C$ is an ancestor of $D$. If there is an isomorphism $\rho$ between $C\xi$ and $D\xi$, then $MGU^{D}_{[j+1,l]}(H)$ does not act on $Tree_j(H)$.

(b) Let $E \subseteq Tree_k(H)$, $C \subseteq Tree_i(H)$, and $D \subseteq Tree_j(H)$ $(i < j < k < l)$ be three OID components such that
  $C$ is an ancestor of $D$,
  $D$ is an ancestor of $E$,
  $k - j > 2 * rank(o)$.
Let $o \cdot MGU_{[0,k]}(H) \cdot MGU^{E}_{[k+1,l]}(H) = s$, and let $\rho$ and $\varrho$ be two isomorphisms such that
  $D\xi \cong C\xi$ (via $\rho$) and $s\rho = s$,
  $E\xi \cong D\xi$ (via $\varrho$) and $s\varrho = s$.
If there is no answer in $MGU_{[0,k]}(H)$, then there is no answer in $MGU_{[0,k]}(H) \cdot MGU^{E}_{[k+1,l]}(H)$. □

Armed with the Uselessness Lemma, we are ready to show that there exists a bound on how quickly a WGG expansion finds an answer, if it finds one at all.

Theorem 7.4.1: Bound for WGG Expansion
Let $L$ be a well-founded ALD, and $r$ be an intermediate relation. Let $H = WGG_L(\{r[o, w]\}, \text{“aux”})$ be a WGG expansion history.
There exists a number $B$ that depends on $L$, such that if there is no answer found in $MGU_{[0,B]}(H)$, then for any $n > B$ there is no answer in $MGU_{[0,n]}(H)$.

Proof: Let $B = 2 \times (N R^{2m} + 2 \times C)! + 2 * rank(r)$, where $N$, $R$, $m$, and $C$ are defined as in the proof of the OID Component Lemma (c). Assume that there is no answer found in $MGU_{[0,B]}(H)$. We prove by induction that there will be no answer found no matter how deep the expansion goes.

Let $Tree_k = E_1 \cup \cdots \cup E_n$, where $E_l$, $l \in [1, n]$, are distinct OID components. Assume at level $k \ge B$ there is no answer found in $MGU_{[0,k]}(H)$. Consider level $k + 1$ of the expansion history. By the OID Component Lemma, we know there are at most $(N R^{2m} + 2 \times C)!$ possible OID components up to variable isomorphism. Since $k \ge B$ and by the definition of $B$, we know that, along the path from each $E_l$ $(l \in [1, n])$ to the root of the OID component condensation $[Tree_{[0,k]}(H)]$, there exist two OID components $C_l \subseteq Tree_i$ and $D_l \subseteq Tree_j$ $(i < j < k)$ and two homomorphisms $\rho_l$ and $\varrho_l$ which satisfy the condition stated in the Uselessness Lemma (b). Let $o \cdot MGU_{[0,k]}(H) \cdot MGU^{E_l}_{k+1}(H) = s$. That is,
  $C_l$ is an ancestor of $D_l$, and $D_l$ is an ancestor of $E_l$,
  $k - j > 2 * rank(r)$,
  $D_l\xi \cong C_l\xi$, and $s\rho_l = s$,
  $E_l\xi \cong D_l\xi$, and $s\varrho_l = s$.
Then, by the Uselessness Lemma (b), we know that there is no answer in

$$MGU_{[0,k]} \cdot MGU^{E_l}_{k+1} \qquad (15)$$

Furthermore, by the Uselessness Lemma (a), we know that $MGU^{E_l}_{k+1}$ does not act on $Tree_k$, i.e.,

$$MGU_{k+1} = MGU^{E_1}_{k+1} \cup \cdots \cup MGU^{E_n}_{k+1} \qquad (16)$$

Thus, by (15) and (16), we know that

$$MGU_{[0,k+1]} = MGU_{[0,k]} \cdot MGU_{k+1} = MGU_{[0,k]} \cdot (MGU^{E_1}_{k+1} \cup \cdots \cup MGU^{E_n}_{k+1})$$

does not contain an answer. □

7.5 $WGG^{halt}$ Algorithm

From the above theorem we know that, given an ALD and an intermediate relation name, there exists a bound on the number of steps within which the WGG expansion can find an answer. However, the theoretical bound is astronomical ($(N R^{2m} + 2 \times C)! + 2 * rank(r) + 2$), and is far too large for any practical implementation.
In this section, we describe a modified version of the WGG algorithm, the $WGG^{halt}$ algorithm. It guarantees the following: (1) it produces the same witness generators as the original WGG algorithm and (2) it terminates if there is no hope for any further expansion. Theorem 7.5.1 proves the above claims.

Basically, the $WGG^{halt}$ algorithm is the WGG algorithm with two extensions:

• In the Initialization Phase, the following step is added.
  (1.c) If $L$ is not well-founded, then output “I Give Up!!” and terminate the algorithm.

• In the Expanding Phase, the following step is added.
  (2.c) Uselessness Checking Step
  For each atom $s(v) \in Tree_i$ Do
    if $s(v)$ has three ancestor OID components $C$, $D$, $E$ which satisfy the condition in the Uselessness Lemma (b), then mark $s(v)$ “useless”.
  If each atom in $Tree_i$ is marked “useless”, then output “I Give Up!!” and terminate the algorithm.

The next example revisits the well-founded ALD presented in Section 6.5.1, which causes an endless loop in the WGG expansion, to demonstrate how $WGG^{halt}$ terminates.

Example 7.10: “Cross Product” Revisited
Continuing with Example 6.10, at level 8, the goal set will contain 4 atoms of A.i-node. That is,

$G_8 = \{$A.i-node[$n_1$, $l1$, $l2''$], A.i-node[$n_2$, $l1'$, $l2'$], A.i-node[$n_3$, $l1'$, $l2'''$], A.i-node[$n_4$, $l1'''$, $l2$]$\}$

Since there is no OID link between any two atoms in $G_8$, each atom in $G_8$ is a singleton OID component. Furthermore, each atom in $G_4$ and $G_0$ is a singleton OID component. The OID component hierarchy based on $G_0 \leftrightarrow G_4 \leftrightarrow G_8$ satisfies the condition stated in the Uselessness Lemma (b). As a result, the $WGG^{halt}$ algorithm marks each atom in $G_8$ “useless”, and terminates the expansion by outputting an “I Give Up!!” message.

The expansion only has to go 8 levels deep. Compared with the theoretical bound¹ $(6 \cdot 2^{2*1} + 2 * 3)! + 2 * 1 + 4$ predicted by Theorem 7.5.1, the $WGG^{halt}$ algorithm is of great practical value.
□

The next theorem proves our claims about the $WGG^{halt}$ algorithm.

Theorem 7.5.1: $WGG^{halt}$
Let $L$ be an ALD, and $r$ be an intermediate relation in $Schema(L)$. Then
(a) $WGG^{halt}(L, r)$ always terminates.
(b) $$WGG^{halt}(L, r) = \begin{cases} WGG(L, r) & \text{if } WGG(L, r) \text{ terminates} \\ \text{“I Give Up”} & \text{if } WGG(L, r) \text{ does not terminate} \end{cases}$$

¹ At the time of writing this, the poor writer has not got enough funding to purchase a calculator capable of computing this number.

Proof: Let $H = WGG_L(\{r[o, w]\}, \text{“aux”})$ be the corresponding singleton WGG expansion history.

To prove (a), by the OID Component Lemma (d), we know the number of OID components under isomorphism is finite. That is, sooner or later along each path in the OID component condensation of $Tree_{[0,\text{very-big}]}(H)$ there is an isomorphic ancestor/descendant OID component triple $(C, D, E)$ that satisfies the condition stated in the Uselessness Lemma (b). Therefore, each atom will eventually be marked “useless” and the algorithm terminates.

To prove (b), observe that the only difference between the WGG and the $WGG^{halt}$ algorithms is the additional steps (1.c) and (2.c). If $WGG^{halt}(L, r)$ generates a witness rule $\mathcal{R}_r$, clearly $WGG(L, r)$ also outputs $\mathcal{R}_r$. Thus, it is sufficient to prove the theorem by showing that if, at level $n$ of the expansion, $WGG^{halt}(L, r)$ outputs “I Give Up”, then for any $k > n$, $MGU_{[0,k]}(H)$ does not contain an answer.

From the $WGG^{halt}$ algorithm, we know that for each atom $s(v)$ in $Tree_n(H)$ there exist three ancestor OID components $C$, $D$, $E$ that satisfy the condition stated in the Uselessness Lemma (b). Then, following an argument similar to the one in the proof of Theorem 7.4.1, we know that for any $k > n$ there is no answer in $MGU_{[0,k]}(H)$. □

Chapter 8
Handling Ambiguity

As mentioned in Section 1.2.3, in real-world applications, databases may hold non-equivalent information and have ambiguous semantic correspondences.
Though not all ambiguity has a solution, in this chapter we propose a general framework that can handle many cases of ambiguity.

In most cases, ambiguity arises when the information stored in the two databases is not equivalent; i.e., the complete contents of one database cannot be expressed in terms of the contents of the other database. Sometimes the “information” that is non-expressible by the remote database may involve human decisions at the local databases. For example, in the “Football” example the school policy may require the following: once a player is kicked out of the team, the decision on the player’s student status is made by the school discipline committee. Obviously, such a decision is not computable.

Speaking intuitively, Bird’s approach is to separate the ambiguity from the inter-database communication. This is achieved by isolating and/or creating subschemas of databases A and B such that the subschemas hold equivalent information. This permits Bird to be applied to the subschemas in order to maintain OID creation/deletion between the two databases, and leaves the problem of updating the remaining parts of A and B to the local databases. As will be seen, creation of the subschemas may be the result of isolating portions of the original schema and/or augmenting the original schemas with derived data.

In the following sections, two more examples of ambiguous semantic correspondence are first presented in Section 8.1. Section 8.2 gives an overview of ambiguity where the various ways of information sharing and different update authorities are discussed. Then a methodology is proposed in Section 8.3 which transforms a pair of databases with an ambiguous semantic correspondence into a pair of databases with identified sub-databases having non-ambiguous semantic correspondence. Finally, solutions for the examples presented in Section 1.2.3 and Section 8.1 are presented in Section 8.4.
8.1 Motivating Examples

As shown in Section 1.2.3, the “Football” example illustrates one kind of ambiguous semantic correspondence. In this section two more examples are presented; each demonstrates a different aspect of ambiguity. At the end of this section, a third example is presented to clarify the subtle distinction between ambiguity and insufficient information in a semantic correspondence.

The first example highlights the kind of ambiguity that arises when only some portions of schema A and schema B carry overlapping information. Due to the non-overlapping portions of the two schemas, the semantic correspondence is ambiguous.

Example 8.1: “Employee”
In a corporation, the personnel department maintains a database A and the financial department maintains a database B. Both databases keep information about the employees working in the corporation. The schemas and example instances for the two databases are given in Figure 8.1.

[Figure 8.1: Schemas and instances for “Employee”. Database A: entity type Employee with attributes name (STRING) and education (STRING). Database B: entity type Employee with attributes name (STRING) and salary (STRING). The example instances list the employees A. Smith and J. Bach, with educations MBA/MFA or CPA/BFA in A and salaries 70,000/45,000 or 90,000/25,000 in B.]

Database A in the personnel department has entity type A.Employee with the attributes A.name and A.education for the names and education qualifications of the employees. On the other hand, to facilitate the distribution of pay checks, database B has entity type B.Employee with the attributes B.name and B.salary for the names and the salaries of the employees. As indicated in Figure 8.1, each of the 4 possible pairings of the instances $I_1$, $I_2$ and $J_1$, $J_2$ is an acceptable instance pair in the semantic correspondence.

In general, given an instance of A, there can be infinitely many corresponding instances of B in the semantic correspondence. This is because there is no salary information in database A to single out the specific instance of B in the semantic correspondence. Similarly, there is no education information in database B. Therefore, there is a many-to-many mapping between the instances of A and B in the semantic correspondence. □

The next example of ambiguity is taken from a real-world application, in which the information between the two databases overlaps in an obscure way. This example also highlights the need for adding some bookkeeping information to resolve ambiguity.

Example 8.2: “Requisition”
The Information Sciences Institute (ISI) is an organization within the University of Southern California (USC). Being an organization within USC, ISI makes all its purchases through the Purchasing Department of USC. To handle the requisition processing, ISI has a requisition system, the REQ system, and to handle the purchase order processing, USC has a purchase order system, the P.O. system. The schemas for the REQ and the P.O. Systems (or the A and B databases in Bird’s terminology) are given in Figure 8.2. Figure 8.3 shows several example instances in a pictorial way.

[Figure 8.2: Schemas for “Requisition/Purchase Order”. Database A (Requisition): REQ (req-no: STRING) linked by req-include to RE (re-no: STRING, item-name: STRING), linked by re-include to RSE (rse-no: STRING, requester: STRING, qty: INTEGER). Database B (Purchase): PO (po-no: STRING) linked by po-include to PE (pe-no: STRING, obj-name: STRING, unit-price: REAL, qty: INTEGER, distribution: req-no (STRING) and qty (INTEGER)).]

Consider Scenario (3). In it there are two requisitions: REQ#1 contains one request by Rick for two laser printers, and REQ#2 contains one request by Dave for three Macintosh computers. On the other hand, the Purchasing Department of USC arranges them into two purchase orders: P.O.#1 purchases two laser printers and two Macintosh computers from “Ben’s Computer Warehouse”, and P.O.#2 buys one Macintosh from “Sherry’s Mac World”.

In particular, in database A each OID in A.REQ corresponds to a requisition form which includes (A.req-include) several requisition entries represented by OIDs in A.RE. Each requisition entry in A.RE records a request for a specific item with name A.item-name. A requisition entry may include (A.re-include) several requisition subentries for different requesters (A.requester) and quantities (A.qty) of the same item. Each OID in A.REQ, A.RE, and A.RSE can be uniquely identified by the attributes A.req-no, A.re-no, and A.rse-no, respectively.

In the B database, each OID in B.PO represents a purchase order form uniquely identified by a purchase order number stored in B.po-no. A purchase order includes (B.po-include) several purchase entries; each is represented by an OID in B.PE with key attribute B.pe-no. Associated with a purchase entry, the following information is included: the item name B.item-name, unit price B.unit-price, quantity B.qty, and the distribution information B.distribution, which specifies how the purchase entry is to be distributed across the different requisitions.

The goal is to establish the correspondence between the requisition data at ISI and the purchase order information at USC, so that once a requisition is keyed into the REQ System at ISI, USC’s P.O. System will be notified and properly updated. Similarly, any update of the purchase status in the P.O. System will cause the corresponding updates to be sent back to ISI automatically, so that the REQ System can be updated accordingly.

[Figure 8.3: Possible ways of arranging the purchase orders. In every scenario REQ#1 is Rick’s request for 2 LP and REQ#2 is Dave’s request for 3 Mac. Scenario (1): P.O.#1 to Ben’s Computer Warehouse for 3 Mac and 2 LP. Scenario (2): P.O.#1 to Ben’s Computer Warehouse for 2 LP; P.O.#2 to Sherry’s Mac World for 3 Mac. Scenario (3): P.O.#1 to Ben’s Computer Warehouse for 2 Mac and 2 LP; P.O.#2 to Sherry’s Mac World for 1 Mac.]

The problem of ambiguity occurs when the number of items to buy in an entry in a requisition may be split and contained in multiple purchase orders. Similarly, entries from different requisitions may be merged into one purchase order entry for the reason of a “good bargain price”.

Consider the following scenarios. Suppose that at ISI, Rick wants to buy two laser printers and Dave wants to buy three Macintoshes. As indicated in Figure 8.3,¹ the following are some possible ways the Purchasing Department of USC could arrange their purchase orders.

(1) A single purchase order to Ben’s Computer Warehouse for two laser printers and three Macintoshes.

(2) Two purchase orders; one to Ben’s Computer Warehouse for two laser printers, and one to Sherry’s Macintosh World for three Macintoshes.

(3) Two purchase orders; one to Ben’s Computer Warehouse for two “Buy one Macintosh get one LP free!” packages, one to Sherry’s Macintosh World for one Macintosh. This is because Ben’s Computer Warehouse has a sale on the Macintosh and laser printer package and Sherry’s Macintosh World has the lowest price for a single Macintosh.

(4) If there is another requisition from, say, the Philosophy Department for 10 Macintoshes and 10 laser printers, then there are even more combinations....

¹ To simplify the discussion, as shown in the figure, there is only one requisition entry in each of the requisitions REQ#1 and REQ#2. In general, each requisition may contain more than one entry; e.g., REQ#1 may also contain an entry from Joe requesting 10 laptop computers.

As a result, there is a many-to-many mapping between states of the requisitions database and states of the purchase orders database. The only a priori correspondence is that the total number of items in the requisitions must equal the total of that item in the purchase orders. How these items are split and merged is based on some dynamic information (e.g. “good bargain price” in the current market), or even the intuition of the purchaser. □

Our final example illustrates the difference between ambiguity and insufficient information in a semantic correspondence.

Example 8.3: “Airports with Insufficient Information”
The application in this example concerns the information about airports in the two databases shown in Figure 8.4.

[Figure 8.4: Schemas and instances for “Airport”. Database A: entity type Airport with attribute name (full), STRING; example instance: Los Angeles, John Wayne Airport. Database B: entity type Airport with attribute name (abbreviation); example instance: LAX, JWA.]

In each database, airports are represented as an entity type with one attribute. The only difference is that in A the attribute A.name stores the full name of the airports, and in B the attribute B.name records the Federal Aviation Administration (FAA) abbreviation of the airport.

Though there is an official abbreviation for each airport, if the list of abbreviations is not available, it is essentially impossible for a computer to determine the abbreviation of an airport name. □

[Figure 8.5: Ambiguous Semantic Correspondence. Panels (a) and (b) relate $Inst(A)$ and $Inst(B)$ through the projections $\pi_1(SC_{AB})$ and $\pi_2(SC_{AB})$.]

This situation is very different from Examples 8.1, 1.3, and 8.2. Here there is indeed a one-one correspondence between instances of A and B.
The difficulty in finding an instance of B corresponding to an instance in A stems from a lack of information (an FAA full-name/abbreviation conversion table) rather than the existence of multiple “acceptable instance pairs”.

8.2 Overview on Ambiguity

By definition, ambiguity arises when the semantic correspondence specified by the application is not one-one, as illustrated in Figure 8.5(a). To maintain such a semantic correspondence, the system will have trouble deciding how to translate an incremental update to the remote database. This situation is illustrated in Figure 8.5(b); when the user issues an update in the A database, as indicated in the figure, it is unclear which of $J_2$, $J_3$, or $J_4$ should be the resulting instance of B. It is thus unclear what update should be transmitted to B.

In this section we first look into the various ways that the information in the two databases can overlap (Section 8.2.1), then discuss the issues of update authorities in the two databases (Section 8.2.2).

8.2.1 Ways the Information Can Overlap

We now categorize different ways two databases can hold overlapping but not equivalent information, based on the experience we gained through the examination of several real-world examples.

Partial overlapping by projection: This is a common cause for ambiguity, in which portions of the two schemas do not overlap. In particular, part of the schema in A does not appear in B and/or vice versa. In the “Employee” example (Example 8.1), the information stored in A.education is relevant only to database A. Similarly, the information kept in B.salary is not represented in A. As a result, any instance pair of A and B with the isomorphic projection on the employees and their names is in the semantic correspondence. In this case, the subschemas of A and B that hold equivalent information can be obtained as projections of A and B.
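The projection case can be made concrete with a short Python sketch. It is a simplification under stated assumptions: instances are modelled as lists of records without OIDs, the correspondence is reduced to equality of the projections on name, and the helper names `project` and `acceptable_pair` as well as the instance `J3` are hypothetical. The values for $I_1$, $I_2$, $J_1$, $J_2$ follow the “Employee” example.

```python
def project(instance, attrs):
    """Project a list of records onto attrs, ignoring everything else."""
    return frozenset(tuple(rec[a] for a in attrs) for rec in instance)


def acceptable_pair(inst_a, inst_b, shared=("name",)):
    """True when both instances agree on the shared (overlapping) attributes."""
    return project(inst_a, shared) == project(inst_b, shared)


# Instances of database A (name, education).
A = {
    "I1": [{"name": "A. Smith", "education": "MBA"},
           {"name": "J. Bach", "education": "MFA"}],
    "I2": [{"name": "A. Smith", "education": "CPA"},
           {"name": "J. Bach", "education": "BFA"}],
}
# Instances of database B (name, salary); J3 is a made-up counterexample.
B = {
    "J1": [{"name": "A. Smith", "salary": "70,000"},
           {"name": "J. Bach", "salary": "45,000"}],
    "J2": [{"name": "A. Smith", "salary": "90,000"},
           {"name": "J. Bach", "salary": "25,000"}],
    "J3": [{"name": "A. Smith", "salary": "70,000"}],
}

pairs = [(a, b) for a in sorted(A) for b in sorted(B)
         if acceptable_pair(A[a], B[b])]
print(pairs)
```

All four pairings of $I_1$, $I_2$ with $J_1$, $J_2$ pass the check, which is exactly the many-to-many behaviour the example describes, while the projections themselves are in one-one correspondence.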
Partial overlapping by selection: In the “Football” example (Example 1.3), an instance in the coach’s database can correspond to infinitely many instances in the provost’s database because their information only overlaps for those students who are also football players. This kind of ambiguity is due to the fact that the overlapping part of the coach’s database is a selection of a portion of the provost’s database.

More intricate sharing of information: Sometimes the sharing of information is implicit and cannot be fully understood unless some bookkeeping information is added to one or both databases. Without this bookkeeping information the local database cannot decide which instance of the remote database it should correspond to.

For example, in the “Requisition” example (Example 8.2), the vital information on how requisitions are distributed across purchase orders, and how they correspond to one another, is missing from both of the databases. In other words, some of the crucial linkage information is not shared.

Also in the “Football” example, suppose the policy demands that when a player is fired by the coach, his fate depends on his GPA. However, there is no information about the players’ GPAs in the coach’s database to decide what corresponding update should be sent to the provost’s database.

8.2.2 Overlapping Schemas and Update Authorities

As mentioned above, in the context of an ambiguous semantic correspondence, the data stored in schemas A and B may overlap in various ways. This subsection addresses another issue occurring in semantic correspondences involving databases holding non-equivalent information: different update authorities on the overlapping and non-overlapping portions of the two databases.

In the discussion, $A^{over}$ and $A^{sep}$ are used to loosely denote the “overlapping part” and the “non-overlapping part” of the data stored in database A.
$B^{over}$ and $B^{sep}$ are defined analogously.²

² A more precise definition for the notions of “overlapping part” and “non-overlapping part” will be presented in the next section.

Under Bird’s framework, we make the assumption that a user at A has the authority to modify both $A^{over}$ and $A^{sep}$, and a user at B has the authority to modify both $B^{over}$ and $B^{sep}$. Consider the scenario shown in Figure 8.6. When a user $U_A$ updates the data in A, the portion of the update on $A^{over}$ will be propagated to $B^{over}$, and therefore indirectly modifies the data in $B^{over}$. In a way, this can be viewed as if user $U_A$ extends his/her update authority to database B through the linkage between $A^{over}$ and $B^{over}$. Similarly, a user $U_B$ can extend his/her update authority to database A through the linkage between $B^{over}$ and $A^{over}$.

[Figure 8.6: “Update authorities under Bird’s framework”. User $U_A$’s authority extends to $B^{over}$ on the B side, and user $U_B$’s authority extends to $A^{over}$ on the A side.]

An issue arises when a user makes an update to $A^{over}$ and then indirectly to $B^{over}$: the resulting $B^{over}$ may be inconsistent with the state of $B^{sep}$. For example, $B^{sep}$ may contain dangling OID references similar to the situation depicted in scenario (2) in Section 4.6. If such an inconsistency occurs, we assume that database B has an expert (either human or program) to fix it. By doing so, database B may update $B^{over}$, and therefore update $A^{over}$. As a result, $A^{over}$ may be in an inconsistent state with $A^{sep}$. This may lead to a cycle similar to a human negotiation process in which proposals and counter proposals are exchanged until either both sides reach an agreement or the negotiation is aborted.

In the next section, a methodology is presented to resolve ambiguity. This involves the following activities.
• Identifying and constructing $(A^{over}, A^{sep})$ in database A, and $(B^{over}, B^{sep})$ in database B

• Applying Bird to maintain the correspondence between $A^{over}$ and $B^{over}$

• Developing the local experts maintaining the correspondences between $A^{over}$ and $A^{sep}$ in A, and between $B^{over}$ and $B^{sep}$ in B

8.3 Ambiguity Resolving Methodology (ARM)

In this section, we present the Ambiguity Resolving Methodology (ARM), which transforms a pair of databases with an ambiguous semantic correspondence into a pair of databases with identified sub-databases having non-ambiguous semantic correspondence.

The result of the transformation is shown in Figure 8.7.

[Figure 8.7: “Schema Surgery”. In each database the original schema remains visible to the user, local experts maintain the local semantic correspondences $SC_{AA^{eq}}$ and $SC_{BB^{eq}}$, and Bird maintains the correspondence $SC_{A^{eq}B^{eq}}$. After schema surgery, the following conditions are satisfied: $SC_{A^{eq}B^{eq}}$ is one-one, and $(I, J) \in SC_{AB}$ if and only if there exist $I_{A^{eq}}$, $J_{B^{eq}}$ such that $(I, I_{A^{eq}}) \in SC_{AA^{eq}}$, $(J, J_{B^{eq}}) \in SC_{BB^{eq}}$, and $(I_{A^{eq}}, J_{B^{eq}}) \in SC_{A^{eq}B^{eq}}$.]

In particular, for database A, a new schema³ $A^{eq}$ is formed by projecting portions of A plus adding some new data. Similarly, a new schema $B^{eq}$ can be formed in database B, so that there is a non-ambiguous semantic correspondence $SC_{A^{eq}B^{eq}}$ between $A^{eq}$ and $B^{eq}$. Then the resulting schemas for databases A and B are

$$\hat{A} = A \cup A^{eq}, \quad \text{and} \quad \hat{B} = B \cup B^{eq}$$

Intuitively speaking, $A^{eq}$ and $B^{eq}$ are the keys to isolating the overlapping parts of A and B. Since $A^{eq}$ and $B^{eq}$ now keep the overlapping portions of databases A and B, Bird can be applied on $A^{eq}$ and $B^{eq}$. Locally on database A, some mechanism must be devised to maintain the local correspondence $SC_{AA^{eq}}$ between the original schema A and the equivalent schema $A^{eq}$. $SC_{BB^{eq}}$ on database B is maintained in the same fashion.

³ Here $A^{eq}$ is the schema component that stores the information of $A^{over}$. Similarly, $B^{eq}$ is the schema component that stores the information of $B^{over}$.
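The decomposition described in this section can be tried out on finite toy relations. The Python sketch below is only an illustration: the instance labels (I1, E1, F1, J1, ...) and the helper names `is_one_one` and `compose` are hypothetical and do not come from the methodology itself. It composes two local correspondences with a one-one remote correspondence and shows that the composed correspondence between A and B can still be many-to-many.

```python
from itertools import product


def is_one_one(relation):
    """A finite relation is one-one if no left or right element repeats."""
    lefts = [a for a, _ in relation]
    rights = [b for _, b in relation]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


def compose(sc_a_aeq, sc_aeq_beq, sc_b_beq):
    """All (I, J) linked through some pair (I_eq, J_eq) in the remote relation."""
    return {
        (i, j)
        for (i, i_eq), (j, j_eq) in product(sc_a_aeq, sc_b_beq)
        if (i_eq, j_eq) in sc_aeq_beq
    }


# Local correspondence on A: I1 and I2 share the same equivalent part E1.
sc_a_aeq = {("I1", "E1"), ("I2", "E1"), ("I3", "E2")}
# Local correspondence on B: J1 and J2 share the same equivalent part F1.
sc_b_beq = {("J1", "F1"), ("J2", "F1"), ("J3", "F2")}
# Remote correspondence between the equivalent subschemas (one-one).
sc_aeq_beq = {("E1", "F1"), ("E2", "F2")}

sc_ab = compose(sc_a_aeq, sc_aeq_beq, sc_b_beq)
print(sorted(sc_ab))
print(is_one_one(sc_aeq_beq), is_one_one(sc_ab))  # True False
```

In this toy setting all of the ambiguity sits in the two local relations, which is the part left to the local mechanisms, while the relation that would be handed to Bird is one-one.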
[Figure 8.8: Transformation of an Ambiguous Semantic Correspondence. The correspondence between $Inst(A)$ and $Inst(B)$ is split into the local correspondence $SC_{AA^{eq}}$ maintained by the local expert on A, the remote correspondence $SC_{A^{eq}B^{eq}}$ between $Inst(A^{eq})$ and $Inst(B^{eq})$ maintained by Bird, and the local correspondence $SC_{BB^{eq}}$ maintained by the local expert on B.]

This separation of local semantic correspondences and remote semantic correspondences allows the DBA to separate the issues of propagating OID creation/destruction from the issues of ambiguity, and to use Bird to address the former.

When encountering an ambiguous semantic correspondence $SC_{AB}$, the DBA should first carefully study the semantics of the two databases, then perform the following activities.

(1) Determine how information overlaps
Study the semantics of the application, and determine the causes of the ambiguity according to the classifications listed in Section 8.2.1.

(2) Perform the “schema surgery”
As indicated in Figure 8.7, the DBA modifies the schemas A and B by adding new data into A and B so that the following conditions are satisfied.

• There exist subschemas $A^{eq}$ of $\hat{A}$ and $B^{eq}$ of $\hat{B}$ which have a one-one semantic correspondence $SC_{A^{eq}B^{eq}}$ between them, and $SC_{A^{eq}B^{eq}}$ can be expressed by nrecILOG¬ and EES.

• The following relationship exists among $SC_{AB}$, $SC_{AA^{eq}}$, $SC_{BB^{eq}}$ and $SC_{A^{eq}B^{eq}}$:

$$\forall (I, J)\,[\,(I, J) \in SC_{AB} \iff (\exists I_{A^{eq}}, J_{B^{eq}} :\ (I, I_{A^{eq}}) \in SC_{AA^{eq}} \wedge (J, J_{B^{eq}}) \in SC_{BB^{eq}} \wedge (I_{A^{eq}}, J_{B^{eq}}) \in SC_{A^{eq}B^{eq}})\,]$$

The actual techniques used to derive $\hat{A}$ and $\hat{B}$ will be presented in the next section, where the ambiguous semantic correspondence examples presented in Sections 1.2.3 and 8.1 are solved. In most cases the construction of $\hat{A}$ and $\hat{B}$ will be incremental, and will consider isolated portions of A and B.

(3) Maintain the local correspondences
Simultaneously with Step (2) above, the DBA designs two local mechanisms/experts, $Expert_A$ and $Expert_B$, on databases A and B.
These experts maintain the local correspondences $SC_{AA^{eq}}$ and $SC_{BB^{eq}}$ defined in (2). Notice that $SC_{AA^{eq}}$ and $SC_{BB^{eq}}$ may not be one-one, but since each local expert now maintains the correspondence between two schemas located in the same database, there are various ways to implement the local experts. These may include the derived data mechanism in the DBMS, an ad hoc program, an expert system, or even a human operator making some "politically correct" decisions through a graphic interface.

(4) Write an ALD for the equivalent subschemas
Since the construction of $A^{eq}$ and $B^{eq}$ ensures that the semantic correspondence $SC_{A^{eq}B^{eq}}$ can be expressed by nrecILOG¬ and EES, the DBA can now write the ALD specifying the unambiguous semantic correspondence $SC_{A^{eq}B^{eq}}$, and then follow the same steps as in Section 1.3.2 to maintain $SC_{AB}$ using BIRD, as indicated in Figures 8.7 and 8.8.

According to the above framework, to resolve the ambiguity a DBA must construct a good design of (1) the equivalent subschemas, $A^{eq}$ and $B^{eq}$, and (2) the local experts, $Expert_A$ and $Expert_B$, so that the desired semantics can be obtained while incurring an acceptable amount of overhead.

[Figure 8.9: New Architecture Incorporating Expert Systems and BIRD System. On each database an application expert monitors the local database instance and adjusts the instance according to its best knowledge about the application requirement; the BIRD system passes intermediate results back and forth between the two databases.]

The architecture in Figure 8.9 can be viewed as an integration between the local expert and the BIRD system. As indicated there, under this architecture the BIRD system is used as a vehicle to convey the intermediate results (stored in the forms of $A^{eq}$ and $B^{eq}$) between the two databases.
The local experts on both databases keep monitoring these intermediate results and try to update the remainder of their respective databases according to the best of their knowledge about the application requirement. This process simulates the human negotiation process in which one side makes a proposal, and, after studying the proposal, the other side makes a counter proposal. This process may repeat several times before both parties reach an agreement (or an acceptable instance pair in $SC_{AB}$). In this scenario, the BIRD system serves as a messenger passing the relevant portions of proposals and counter proposals back and forth. From our experience with various examples, we believe that for most ambiguous semantic correspondence cases this architecture provides a workable solution.

8.4 Ambiguous Examples Revisited

In this section, we revisit the three examples presented in Sections 1.2.3 and 8.1, and illustrate how the framework presented above can lead to workable solutions. For the sake of brevity, the local experts, $Expert_A$ and $Expert_B$, presented in the discussion are expressed in an English-like pseudo-code.

[Figure 8.10: The "Employee" schemas after "schema surgery". Database A holds Employee with name and education; database B holds Employee with name and salary.]

8.4.1 Partial Overlapping Characterized by Projection

We illustrate this situation with the "Employee" example (Example 8.1). We now follow the framework presented in the previous section step by step.

(1) Determine how information overlaps
As noted above, the ambiguity in the "Employee" example stems from the fact that the A.education information is not representable in the B database, and B.salary is not representable in the A database.
According to the classifications in Section 8.2.1 this is a typical case of "partial overlapping by projection", i.e., the overlapping information can be isolated using projection.

(2) Perform the "schema surgery"
From the above analysis, we can let the modified schemas be the same as A and B respectively, and construct
$A^{eq}$ = {A.Employee, A.name}
$B^{eq}$ = {B.Employee, B.name}
The resulting schemas are depicted in Figure 8.10.

(3) Maintain the local correspondences
Since the equivalent subschemas are just part of the original ones, the local correspondences in the two databases can be maintained by production rules that insert default or null values into the non-overlapped part of the schema, or display messages to the human operator requesting the missing values.

(4) Write an ALD for the equivalent subschemas
Then the DBA can write the following simple ALD specifying the semantic correspondence between the two new schemas.

(abs-linkage employee
    A-schema (
        Employee(Entity);
        name(Employee, string);
    )
    B-schema (
        Employee(entity);
        name(Employee, string);
    )
    Entity-Equivalence (
        A.Employee = B.Employee;
    )
    A-to-B (
        B.name(b,n) <- A.Employee=Employee[a,b], A.name(a,n);
    )
    B-to-A (
        A.name(a,n) <- B.Employee=Employee[a,b], B.name(b,n);
    )
)

8.4.2 Partial Overlapping Characterized by Selection

This section revisits the "Football" example (Example 1.3), where the ambiguity arises from both "partial overlapping by projection" and "partial overlapping by selection".

(1) Determine how information overlaps
The causes of the ambiguity are two-fold: (1) only the relations A.student and B.team contain information related to each other, while the information stored in A.grade and B.assignment is unique to A and B respectively; (2) the coach's database can be viewed as a selection view from the provost's database.
(2) Perform the "schema surgery"
The DBA can create a new relation football-student: Name. It is derived from A.student to record the students with "Team Member = True". Then
$A^{eq}$ = {A.football-student}
$B^{eq}$ = {B.team}

(3) Maintain the local correspondences
There are several ways to maintain the local correspondences. In the following, we present an example design using expert systems on both databases that can be used to enforce each of the policies presented in Example 1.3.

$Expert_A$: If there is a remote update proposed from the coach's database which requests to delete a student s from A.football-student, then the derived relation mechanism will automatically delete the student from A.student. Corresponding to the policies, $Expert_A$ can be coded to enforce
Policy (1): doing nothing,
Policy (2): insert A.student(s, "no"),
Policy (3): if student s has a GPA greater than 2.5 then insert A.student(s, "no").

$Expert_B$: If there is a remote update proposed from the provost's database which requests to delete a player p from B.team, then try to enforce
Policy (4): if p plays the position of "Quarter Back" then abort.

(4) Write an ALD for the equivalent subschemas
The following trivial ALD will provide a mapping between $A^{eq}$ and $B^{eq}$.

(abs-linkage Football
    A-Schema // the provost's database
    (
        football-student(String);
    )
    B-Schema // the coach's database
    (
        team(String);
    )
    A-to-B (
        B.team(n) <- A.football-student(n);
    )
    B-to-A (
        A.football-student(n) <- B.team(n);
    )
)

8.4.3 More Intricate Sharing of Information

In this section we revisit the "Requisition" example (Example 8.2) to investigate yet another category of ambiguity in which the sharing of information is implicit and cannot be fully understood unless some bookkeeping information is added. This requisition vs.
purchase order situation is a typical example showing different authorities among heterogeneous databases, in which the databases are just part of a managerial system controlled by policies or human judgement. In such a case, a system solution usually cannot rule out human involvement. As indicated in the schema surgery process below, the human decision is captured by the bookkeeping information maintained in the additional schema. Because of their length and technical tedium, details of the solution and two update scenarios are presented in Appendix E.

(1) Determine how information overlaps
Recall the three possible ways of arranging the example requisitions and purchase orders illustrated in Figure 8.3. The correspondence is ambiguous because, given an instance of the REQ database, there may be several ways to split and merge requisitions into purchase orders; on the other hand, given an instance of the P.O. database, even with the information in B.distribution, there are several ways to "assemble back" each requisition. This is mainly because B.distribution does not contain the detailed information of how requisition subentries are contained in the requisitions.

The key information missing in both of the databases is specifically how the quantity of an item in a requisition is split into one or more "subquantities", and how the quantity of an item in a purchase order sums up from different "subquantities".

[Figure 8.11: The three scenarios shown in "Requisition" with "subquantities". In each of Scenarios (1), (2) and (3), the requisitions (REQ#1, REQ#2) are linked to the purchase orders (P.O.#1, P.O.#2) through subquantities of the LP and Mac items.]
Thus one solution is to augment the original schemas to record the bookkeeping information of how the requisition and purchase "subquantities" are connected. To see how these subquantities can help to "glue" requisitions and purchase orders together, Figure 8.11 illustrates the same scenarios presented in Figure 8.3 with subquantities added in both databases. As illustrated in Figure 8.11 there is a straightforward one-one correspondence between requisition subquantities and purchase subquantities.

(2) Perform the "schema surgery"
To accommodate the bookkeeping information about the "subquantities" in both databases, schema A is augmented by adding entity type A.rsq, the requisition subquantity, and schema B is augmented with an entity type B.psq, the purchase subquantity. Some bookkeeping information surrounding the two entity types is also added. The resulting schemas are illustrated in Figure 8.12.

[Figure 8.12: The "Requisition" schemas after "schema surgery". Database A (Requisition) gains the subquantity components rsq-req-no, rsq-po-no, rsq-item-name and rsq-qty, together with re-contain; database B (Purchase) gains psq-req-no, psq-po-no, psq-item-name and psq-qty, together with pe-contain.]

Notice that the schema components inside the two shaded boxes(4) in Figure 8.12 are added to the original schema. They record information about the requisition and purchase subquantities. Also, in database A, A.request and A.qty are not included in $A^{eq}$ because A.request contains information relevant only to A, and A.qty can be viewed as the summation of the quantity information in A.rsq-qty.
Similarly, in B, B.unit-price, B.distribution, and B.qty are not included in $B^{eq}$.

Notice that in the above design, $A^{eq}$ and $B^{eq}$ can be viewed as derived and/or imported data. They serve as the gateways through which the incremental updates can be propagated. They are designed to contain only the information needed by the other database and to minimize the overhead.

(4) Actually, A.re-contain and B.pe-contain are also new schema components added during the schema surgery. Here, they are not included in the equivalent subschemas to simplify the discussion.

(3) Maintain the local correspondences
In essence, whenever a new requisition entity re is created in database A, $Expert_A$ creates a new requisition subquantity rsq in A.rsq with the attribute values of A.rsq-req-no, A.rsq-item-name, and A.rsq-qty set to the corresponding attribute values of re. Also, $Expert_A$ updates A.re-contain, linking rsq to re. As a result, the BIRD system propagates these updates to database B and creates a corresponding new purchase subquantity psq.

At this moment, $Expert_B$ is aware of the creation of psq and knows that a new requisition, encoded in the form of a subquantity, has just been issued. It then makes various arrangements to accommodate this requisition. This may involve: (1) creating new purchase orders, (2) creating more purchase subquantities and distributing the number of items (recorded in B.psq-qty) among these purchase subquantities, and (3) linking (by updating B.pe-contain) these purchase subquantities to various purchase entries according to the best of its knowledge about purchasing. These arrangements, or updates, will be propagated back to database A, where $Expert_A$ will update A.re-contain to reflect the up-to-date purchasing arrangements of $Expert_B$.

In a nutshell, the information encoded in $A^{eq}$ and $B^{eq}$ describes the intricate relationship among requisitions and purchase orders.
The two local experts communicate and negotiate through this gateway, with the help of BIRD propagating incremental updates between them, so that an acceptable solution/state can be reached that satisfies the requirements of both experts.

(4) Write an ALD for the equivalent subschemas
The ALD linking the two equivalent schemas $A^{eq}$ and $B^{eq}$ is presented in Appendix E.

Chapter 9
Related Research

In this chapter we briefly survey some research that is related to the issues addressed by this thesis.

Section 9.1 examines some of the most prominent works on database interoperation. All of this work is focused on providing read-only access to data located in foreign databases. The two databases in BIRD can be regarded as views to each other. Section 9.2 compares the research in the view update area with the approach taken by BIRD. In Section 9.3 several schema restructuring languages are discussed. They are compared with the nrecILOG¬ language in the context of semantic correspondence specification. Section 9.4 compares the BIRD system with other approaches using active database techniques. Particular emphasis is placed on the application of active database techniques in supporting database interoperation.

9.1 Heterogeneous Databases

The main objective of database interoperation research is to develop a system and a set of user aids which hide the various types of heterogeneity (e.g. heterogeneity in data model, data semantics, database facilities, hardware, and operating systems) and facilitate the sharing of data distributed among a set of heterogeneous databases.

However, most of the previous research focuses primarily on providing retrieval access to multiple databases, or between heterogeneous databases, and does not address the issues of propagating updates between databases holding interrelated data.
One of the few works that does address update propagation will be discussed in Section 9.4.

9.1.1 Federated Database Architecture

A Federated Database System (FDBS) is a collection of independent database systems united into a federation in order to share and exchange information. Most of the approaches in this area try to provide a framework that minimizes the heterogeneity in the database systems while providing autonomy at the same time. The main emphasis is on gaining read access to non-local data; propagation of updates is generally not considered.

In particular, BIRD uses a high-level language to describe the correspondence between the two databases. The high-level description is then translated into active database rules to maintain that correspondence incrementally. In contrast, the federated database architecture provides a framework for establishing linkages, but does not give much guidance as to (1) what particular language is to be used to describe the correspondence, and (2) how the correspondence is maintained.

[SL90] provides a survey of FDBSs. A five-level schema architecture for federated databases is presented there, which includes:

Local Schema: This is the conceptual schema expressed in the native data model of the local component DBMS.

Component Schema: A component schema is derived by translating the local schema into a Canonical Data Model (CDM) to describe the divergent local schemas in a unified data model.

Export Schema: An export schema represents the portion of data that a component wishes to share with other components. It can also be used to facilitate access control and ensure local autonomy.

Federated Schema: This is the result of integrating multiple export schemas. It also includes the information on the data distribution. There may be several federated schemas for different classes of federation users with different applications and activities.
External Schema: An external schema defines a schema for a user and/or application, to facilitate customization, additional integrity constraints, or access control.

The ARM methodology presented in Chapter 8 can be viewed as a two-level schema architecture. However the emphasis is quite different; the added layer of $A^{eq}$ and $B^{eq}$ is mainly to facilitate the update propagation between the two databases, and should be transparent to the users at both databases.

With the terminology defined in [SL90], the two databases maintained by BIRD can be regarded as a tightly coupled federation, since the DBAs have the responsibility for creating and maintaining the linkage. Furthermore, it is possible to integrate the BIRD system into the framework of the federated database architecture in the following ways.
• To maintain semantic correspondences among export schemas in a federation.
• To facilitate the translation between layers in the five-level schema architecture.

9.1.2 Database Schema Integration

Another school of thought in the field of heterogeneous databases is to integrate local databases into a non-redundant, unified representation of the global data, so that all applications can operate against this integrated view of data. The underlying goal of database schema integration is to minimize the problems of duplication, multiple updates, and inconsistencies across applications.

A survey of this area may be found in [BLN86]. That survey distinguishes two contexts in which schema integration is used: view integration and database integration.

View integration is performed during the database design process to form a global conceptual description of a proposed database. [BL84] describes a methodology to perform view integration in the Entity Relationship Model [Che76]. The methodology includes the following activities.
• Conflicts Analysis: This involves Naming Conflicts Analysis, to discover and resolve the naming conflicts between objects in the schemas, and Modeling Compatibility Analysis, to resolve the different representations and modelings of the same data object. The schemas are transformed to resolve all conflicts found.
• Merging: After resolving the conflicts, the schemas are compatible and can be superimposed.
• Enrichment and Restructuring: At this stage, interschema properties such as inclusion dependencies and functional dependencies can be gathered. The knowledge about the semantics acquired in the previous activities can be incorporated in the global schema. Also, redundant schema components can be identified. After that, a further schema restructuring can be performed to increase the clarity and simplicity of the representation of the global schema.

Database integration concerns the problem of integrating local schemas of existing databases into a global schema. [DH84] describes a versatile view definition facility for the functional data model and illustrates how database integration can be achieved using views and generalization. A query modification algorithm is also presented so that queries against the global schema can be modified into queries against the local schemas.

Different from the database schema integration approaches, BIRD uses a concrete, largely declarative language to specify the relationship between alternative representations, while many schema integration techniques surveyed in [BLN86] use procedural languages based on local structural schema manipulations. Also, in BIRD there is no global schema defined to include both of the two underlying databases. Instead, the "sharing" is in the form of deltas; incremental updates on the overlapped schemas $A^{eq}$ and $B^{eq}$ are translated/shared by the two databases through the definition of the semantic correspondence.
Also, since the overlapped schemas are materialized, there is no need for query modification between global and local data.

9.1.3 WorldBase

The WorldBase system [Wid90, WHW89, WHW90] is designed to provide support for persistence and sharing of possibly overlapping, possibly inconsistent information stored in a family of underlying heterogeneous databases. A world is essentially a materialized database view constructed using portions of one or more of the underlying databases, and/or other worlds. A world specification describes the schema and properties of information in worlds. A primary application of WorldBase is to share information between workstations.

In order to facilitate information sharing, mechanisms for the following operations are provided in WorldBase.

Selection: The user can extract portions of information from a database by providing a closure specification.

Transformation: Worlds with different schemas are transformed into compatible schemas by specifying the correspondences between the source and target schemas using ILOG*, a sibling of the nrecILOG¬ language in the generic ILOG family of transformation languages.

Merging: WorldBase supports several features for specifying how multiple worlds $W_1, W_2, \ldots$ with compatible schemas can be merged into a new world $W$. Especially important here is the fact that, with the object equivalence specification, the user can specify when the persistent objects in different worlds refer to the same conceptual object. Also, WorldBase enforces the Most Natural Merge (MNM) on the resulting $W$. MNM is the most restrictive family of constraints on $W$ such that no constraint will be violated when merging $W_1, W_2, \ldots$

There is a strong analogy between worlds in WorldBase and files in most word-processing environments; a user can load a world into his/her workstation, modify it (by selection, transformation, and merging), then save it back to the disk.
Fundamental differences between WorldBase and BIRD include the following.
• The philosophy behind WorldBase is to help a workstation user obtain the information he/she wants from the databases on the network. In particular, one emphasis of WorldBase is on read-only access and on the autonomy of users of the system. On the other hand, BIRD is focused on maintaining semantic correspondence by incremental update propagation. That is, when a semantic correspondence is maintained by BIRD, the two underlying databases are tightly coupled through the two overlapping schemas $A^{eq}$ and $B^{eq}$. As a result, the behavior of the local database may be affected by the foreign database (e.g. as the result of updates to the foreign database, and more subtly, as a result of the LICs discussed in Section 3.5). In this case, autonomy is deliberately violated in BIRD.
• As mentioned above, in WorldBase a world $W$ can be derived from other worlds $W_1, W_2, \ldots$ by performing the selection, transformation, and merging operations. However there is no incremental update propagation from the source worlds $W_1, W_2, \ldots$ to the destination $W$, nor between worlds $W_i$ and $W_j$. In order to get the most up-to-date data, the user must constantly perform the derivation operations. Also, unlike a bi-directional semantic correspondence maintained by BIRD, updates to a world $W$ will not propagate back to the worlds $W_1, W_2, \ldots$
• In WorldBase, through the closure specification, a user can select portions of data from different databases into a world. Mechanisms for data access have not been studied in this thesis, although the BIRD framework could support this.

9.2 View Update Problem

In essence, the two databases in the BIRD system can be regarded as two sets of views to each other.
The problem of update propagation from one database to another is therefore analogous to the view update problem; i.e., how to translate an update expressed against a view into updates against the underlying database.

The framework described in [Kel86] presents a method to help the DBA select a specific view update translator through a sequence of questions and answers in a dialog at view definition time. The dialog is a set of algorithms that obtain the semantics necessary for disambiguating view update translation by asking the DBA a sequence of questions. The output of the dialog is a valid translator. Furthermore, the translators generated by the dialog satisfy the five additional criteria presented in [Kel85], which include: (1) "no database side effects", (2) "only one step changes", (3) "minimal change: no unnecessary changes", (4) "minimal change: replacements cannot be simplified", and (5) "minimal change: no delete-insert pairs". The resulting translators are stored along with their view definitions. Then at run time, updates against a view are translated by the translator of the view into updates against the base data.

The TAILOR system developed in [SLW88] is a semi-automatic tool for resolving ambiguous view updates by using both the syntactic information and the semantic information related to the view definition. The syntactic information is derived from the syntax of the view definition. The semantic information includes various constraints and dependencies of the data, and the semantics of the database application.

When defining a view the DBA provides both the view definition and some application semantics. Then at run time, TAILOR translates the updates to the view based on the semantic information available, and/or asks the user for further information if necessary.

The differences between BIRD and these semi-automatic approaches are
• In BIRD, the DBA provides the mappings for both directions in an ALD.
Each of the two mappings is an nrecILOG¬ program that populates the destination database from the source database as a whole; i.e., it circumvents the view update problem for each of the view relations by providing a mapping from the destination database to the entire source database.
• As mentioned in Section 5.2, both the witnesses and the witness generators are essential to the proper maintenance of bi-directional semantic correspondences involving OIDs. As a result, none of the frameworks proposed in [Kel86] and [SLW88] can be used in the context of view definition involving OIDs.

9.3 Schema Restructuring/Transformation Languages

In BIRD the language nrecILOG¬ is chosen to describe the one-one semantic correspondence between two databases. In this section we explore other object-based database query languages that may also be used for this purpose. The focus of the discussion is on the languages that have the capability of OID creation.

Generally speaking, these query languages can be roughly categorized into two groups: (1) procedural languages and (2) declarative languages.

Procedural restructuring/transformation languages
Languages that fall into this category include CO2 for the O2 object-oriented data model [LR89, LRV88], TAXIS [MB90], OODAPLEX [Day89] from the functional data model, etc. In general these languages are closely related to the object-oriented programming language paradigm, and therefore have expressive power close to that of a general programming language. OID creation is usually accomplished by a "new" construct as in the C++ programming language.

With its greater expressive power, a procedural language can be used to express a family of semantic correspondences much richer than nrecILOG¬ can. However, the extra expressive power makes the inferencing of witness generators essentially equivalent to the view update problem.
As indicated in [Kel85], the general view update problem is not computable, and therefore cannot be fully automated. That is, the DBA has to define the inverse of the view. This usually means ad hoc coding and buggy programs, which is exactly what BIRD is designed to avoid.

Declarative restructuring/transformation languages
Another group of languages is more declarative in flavor. This group includes the ILOG family of languages and the language IQL [AK89].

Simply put, the IQL language "is inflationary Datalog with negation combined with set/tuple types, invention of new OIDs, and a weak form of assignment." [AK89] As indicated in [AK89], OID creations in IQL are used for the following three critical purposes: (1) new objects may be part of the result and need OIDs to represent them, (2) to manipulate sets, and (3) to obtain completeness in expressing computable database queries.

Sets, like tuples in a relation and OIDs in an entity type, are first-class objects in IQL. In particular, each set is represented by an OID. Let s be the OID of a set; then $\hat{s}$ denotes the value of the set s. The following rule assigns the cars owned by Rick to a set represented by OID p.

$\hat{p}(c) \leftarrow$ Owns-Car("Rick", c)

The dereferencing of, and assignment to, OIDs representing sets enable IQL to manipulate sets within its rule-based framework. Also, the type definition in IQL can be recursive. The following type definition(1) defines a class Graph-Trunk which is a two-tuple with the first column of type Node and the second column a set of Graph-Trunk.

TYPE Graph-Trunk = [Node, { Graph-Trunk }]

The main difference between the semantics of ILOG and IQL lies in the way OIDs are created; IQL uses a variation of the invention rules of detDL [AV88a, AV88b], while OID creation in ILOG is simulated by assigning distinct OIDs to the Skolem function terms in the computation result of Datalog-like rules.
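The Skolem-term view of OID creation can be illustrated with a small sketch. It assumes only what the paragraph above states: each distinct Skolem function term receives its own OID, and the same term always denotes the same OID. The class name, the functor names `f_player` and `f_coach`, and the `oidN` naming scheme are hypothetical; this is not the ILOG or BIRD implementation.

```python
import itertools


class SkolemOidAllocator:
    """Maps each distinct Skolem term (functor plus arguments) to one OID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._oids = {}  # (functor, args) -> OID string

    def oid_for(self, functor, *args):
        term = (functor, args)
        if term not in self._oids:
            # First occurrence of this term: invent a fresh OID.
            self._oids[term] = f"oid{next(self._counter)}"
        return self._oids[term]


alloc = SkolemOidAllocator()

# Hypothetical rule result: one player object per name derived by a rule.
# "Rick" is derived twice but must denote a single object.
derived_names = ["Rick", "Dave", "Rick"]
player_oids = [alloc.oid_for("f_player", name) for name in derived_names]

# A different Skolem functor applied to the same argument is a different term.
coach_oid = alloc.oid_for("f_coach", "Rick")
```

Because the OID is a function of the term, re-running the rules over the same facts reproduces the same objects. This is one way to read the contrast with a procedural "new" construct, which creates a fresh object on every call.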
The foundation that nrecILOG¬ has in logic programming enables BIRD to do a simple SLD-style inferencing in order to derive witness generators from the ALD. It is conceivable that such an inferencing on an ALD constructed using IQL programs, if possible, will be more complicated.

(1) For the sake of simplicity, the notation used here is slightly different from [AK89].

9.4 Update Propagation for Interoperating Databases

In a way, the ALD specified by the DBA can be viewed as a high-level description of constraints over two databases. The ALD is then translated into active database rules to maintain the constraints by propagating incremental updates. In this section, we briefly survey other research that focuses on maintaining inter-database properties by update propagation. These approaches can be loosely classified into two groups: extension of transaction processing techniques to the multidatabase environment (Section 9.4.1), and application of active database technology (Section 9.4.2).

9.4.1 Update Propagation by Transaction Processing

[SRK92] proposes a framework, based on the concept of polytransactions, which allows the DBA to describe and maintain the consistency of interdependent data in an environment consisting of multiple interoperating databases. The inter-database dependencies are stored in the Interdatabase Dependency Schema (IDS) containing a set of Database Dependency Descriptors ($D^3$s). A $D^3$ is a five-tuple $(S, U, P, C, A)$ where $S$ is the set of source data objects, $U$ is the target data object, $P$ is the interdatabase dependency predicate specifying the relationship between $S$ and $U$, $C$ is the mutual consistency predicate that specifies consistency requirements and defines when $P$ must be satisfied, and $A$ is a collection of consistency restoration procedures. The IDS is shared by a set of databases in a multidatabase system architecture.
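To make the roles of the five components concrete, the following sketch models a descriptor as a plain record and applies it to a dictionary of values. It is only an illustration under stated assumptions: the field names, the `enforce` helper, the price objects, and the tolerance used in `c_required` are all hypothetical, and nothing here reproduces the actual syntax or semantics defined in [SRK92].

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

State = Dict[str, float]  # hypothetical: data object name -> current value


@dataclass(frozen=True)
class DependencyDescriptor:
    sources: FrozenSet[str]                            # S: source data objects
    target: str                                        # U: target data object
    dependency: Callable[[State], bool]                # P: dependency predicate
    consistency_required: Callable[[State], bool]      # C: when P must hold
    restorers: Tuple[Callable[[State], State], ...]    # A: restoration procedures


def enforce(descriptor: DependencyDescriptor, state: State) -> State:
    """Run the restoration procedures only when C demands P and P fails."""
    if descriptor.consistency_required(state) and not descriptor.dependency(state):
        for restore in descriptor.restorers:
            state = restore(state)
    return state


# Hypothetical instance: a copy of a price kept in a second database.
def p_equal(state: State) -> bool:
    return state["price_b"] == state["price_a"]


def c_required(state: State) -> bool:
    # Assumed tolerance: consistency is demanded once the copy is off by > 10%.
    return abs(state["price_a"] - state["price_b"]) > 0.10 * state["price_b"]


def copy_price(state: State) -> State:
    return {**state, "price_b": state["price_a"]}


d3 = DependencyDescriptor(
    sources=frozenset({"price_a"}),
    target="price_b",
    dependency=p_equal,
    consistency_required=c_required,
    restorers=(copy_price,),
)

small_change = enforce(d3, {"price_a": 105.0, "price_b": 100.0})
large_change = enforce(d3, {"price_a": 120.0, "price_b": 100.0})
```

With these assumed values, the small change leaves the copy untouched because C does not yet require P, while the large change triggers the restoration procedure.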
Notice that the mutual consistency predicate C may be an expression containing (1) temporal consistency terms, i.e., the point in time when the related data objects must be consistent, and (2) data state consistency terms, i.e., "to what degree" the related data may be allowed to diverge (e.g., whenever the price of a flight is changed by more than 10%, or the enrollment data have been modified by 10 update operations).

When a user submits a transaction T in one of the databases, the polytransaction T+ is computed as a "transitive closure" of T with respect to the IDS, so that each inter-database dependency in the IDS can be satisfied. The resulting T+, a tree of subtransactions, is then scheduled for execution. Each subtransaction t in the polytransaction can have the following execution modes.

Coupled: The parent transaction of t must wait until t completes before it can proceed.
Decoupled: The parent transaction first schedules the execution of t, then proceeds without waiting for t to complete.
Vital: The parent transaction will abort if t fails.
Non-vital: The parent transaction may survive the failure of t.

Unlike the D3 proposed in [SRK92], the ALD in Bird is more declarative and does not require the DBA to specify consistency restoration procedures. The active database rules generated by the ALD compiler provide the functionality of the consistency restoration procedures and maintain the interdatabase dependencies or correspondence according to the semantics of the nrecILOG- rules in the ALD.

The simple RemTrans transaction mechanism requires that the user-issued updates and the resulting update propagations be executed in a single transaction across the two databases. In contrast, the framework presented in [SRK92] provides a more flexible transaction model, in which data can be temporarily inconsistent within specified limits.
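The polytransaction construction of [SRK92] summarized above can be sketched roughly as follows. This is a simplified illustration and not the algorithm of [SRK92]: the class names `D3` and `SubTx`, the functions `polytransaction` and `objects`, and the decision to fire every dependency unconditionally (ignoring the consistency predicate C) are assumptions of the sketch. It only shows how following D3 descriptors from an updated object yields a tree of subtransactions, with a visited set standing in for the closure computation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class D3:
    """A dependency descriptor (S, U, P, C, A); P, C and A are opaque here."""
    sources: frozenset      # S: source data objects
    target: str             # U: target data object
    predicate: str = ""     # P: dependency predicate (not evaluated)
    consistency: str = ""   # C: mutual consistency predicate (not evaluated)
    procedures: tuple = ()  # A: consistency restoration procedures


@dataclass
class SubTx:
    """One node of the polytransaction tree."""
    obj: str
    coupled: bool = True    # parent waits for this subtransaction
    vital: bool = True      # parent aborts if this subtransaction fails
    children: list = field(default_factory=list)


def polytransaction(updated_obj, ids, seen=None):
    """Build the tree of subtransactions triggered by updating `updated_obj`.

    Every descriptor whose source set contains the updated object spawns a
    child subtransaction on its target. `seen` prevents revisiting an object,
    so cyclic dependencies terminate.
    """
    if seen is None:
        seen = {updated_obj}
    node = SubTx(updated_obj)
    for d3 in ids:
        if updated_obj in d3.sources and d3.target not in seen:
            seen.add(d3.target)
            node.children.append(polytransaction(d3.target, ids, seen))
    return node


def objects(node):
    """Pre-order list of the data objects touched by the polytransaction."""
    result = [node.obj]
    for child in node.children:
        result.extend(objects(child))
    return result
```

In a fuller model each child would also carry its own coupled/decoupled and vital/non-vital settings, and C would decide whether a dependency needs to be restored at all; both are left out to keep the sketch short.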
Also, several execution modes (coupled, decoupled, vital, and non-vital) are provided to specify how a child transaction is related to its parent transaction.

9.4.2 Update Propagation by Active Database Technology

As indicated in [HW92, SJGP90], the production rules of active database technology [GHJ93, SRH90, MD89, Han89, SLR88] can be used to enforce constraints, monitor data access and evolution, maintain derived data, enforce protection schemes, maintain version histories, etc. Several investigations propose compiling a high-level description into active database rules for incremental view maintenance [CW92a, CW91] or for consistency in a multidatabase environment [CW92b, CW90].

In a way, the ALD can be considered a static specification of the constraints across the two databases. Like the Bird system, the framework presented in [CW92b] maintains consistency across databases by means of production rules and persistent queues. In [CW92b], users define constraints in a high-level, SQL-based declarative language. These consistency specifications are then translated into production rules that maintain the consistency incrementally.

Although the Bird system and the one proposed in [CW92b] both use production rules to maintain consistency across databases incrementally, there are significant differences between the two approaches. On the syntactic level, [CW92b] uses an SQL-like high-level language to specify rules, while Bird uses nrecILOG-. In the following discussion we call these rules the specification rules, to distinguish them from the production rules into which they are compiled. We now summarize the differences.

• The data model in [CW92b] is not object-based, whereas Bird can maintain a semantic correspondence in an object-based data model. A main contribution of Bird is its support for incremental update propagation in the presence of OID creation and deletion.
• The two systems are based on different design philosophies: [CW92b] focuses on using active database technology to maintain consistency and resolve semantic heterogeneity across databases, while Bird allows the DBA to specify and maintain a "one-one semantic correspondence" between (portions of) two databases. Under Bird, the ALD specifies how (portions of) each database can be mapped to the other. As a result, the linkage established by Bird is more strongly coupled than that of [CW92b].

• When the LHS of a specification rule is modified by incremental updates, the production rules generated by the Bird compiler and those proposed in [CW92b] behave differently. The latter try to maintain the constraint specified by the specification rule by padding the rule head with NULL values; the former do nothing until the end of the transaction and then check whether the constraint has been violated. If so, the whole transaction is aborted.

• [CW92b] adopts a very flexible mechanism for communication between databases, based on persistent queues and asynchronous message passing. Bird is more focused, and requires a single transaction across the two databases in order to support update propagation.

Chapter 10
Conclusions and Directions for Future Research

This thesis is one of the first investigations into automating update propagation between databases with overlapping information. This chapter summarizes the development of the Bird system and discusses possible extensions and directions for future research.

The Bird system proposed in this thesis provides an overall solution to the problem stated in Section 1.1. In particular, the framework allows the DBA to describe a semantic correspondence in the form of an ALD (Section 3.1).
Then a compiler (Sections 3.2 and 4.4) is used to translate the ALD into two concrete linkages, E_A and E_B, containing active database rules that maintain the semantic correspondence by propagating updates incrementally. Also, a running environment is developed in Bird to (1) support the propagation of incremental updates involving OIDs, and (2) orchestrate the interactions between the two databases, so that each database has the authority to initiate autonomous updates.

The most important contributions of this research include: (1) a clear articulation of the issues raised by the presence of OIDs in connection with incremental update propagation, (2) the use of witnesses (Section 4.3) to bridge the gap between the "logical" and "physical" aspects (Sections 2.3 and 5.4) of uni-directional incremental update propagation, (3) the development of the WGG algorithm (Section 6.4), which permits the use of witnesses in the context of bi-directional update propagation, and (4) the establishment of the soundness and the decidability of halting of the WGG algorithm (Chapter 7).

Of course, several key assumptions facilitated these results. Follow-on research that relaxes and generalizes these assumptions will be necessary before practical systems allowing heterogeneous update can be developed.

Correspondence Description: In the current implementation of Bird, the ALD is expressed in nrecILOG-. Future versions of Bird should extend the expressive power of the language to include union and negation. Also, it would be desirable to extend the underlying data model to include ISA relationships, so that a larger family of semantic correspondences can be specified in an ALD. Unfortunately, it is not obvious how to extend the current WGG algorithm to incorporate them.

Correspondence Maintenance: The current framework only maintains a semantic correspondence between two databases.
It is conceivable that the framework can be extended to maintain several semantic correspondences concurrently among a group of databases. This semantic correspondence "network" can be used to maintain "the many forms of a single fact" among a group of heterogeneous databases.

As proposed in [SM91b, BG91, SM91a, SM89], the schema of a database can be extended to incorporate the semantics of the data. With this extra semantic information, software tools can be constructed to assist the DBA in the various activities associated with establishing a semantic correspondence. Possible functions of such tools include the following.

• Helping the DBA identify the relevant portions of the two schemas.
• Helping the DBA perform the "schema surgery" described in Section 8.3.
• Verifying the correctness of an ALD.
• Analyzing and predicting the impact of the semantic correspondence on the two databases.

Autonomy: As mentioned in Section 3.5 and Section 8.2.2, the maintenance of a semantic correspondence may have a global impact on the behavior of the two local databases. One manifestation of this global impact is in the form of Linkage Implicit Constraints (LICs). Due to the implications of the LICs, each local database gives up its autonomy to some extent once the semantic correspondence is established. It would be helpful to the DBA if the closure of the LICs could be computed from the constraints originally in the two databases and the ALD. This information can help the DBA to better assess the impact of the semantic correspondence and to better understand the implications of the linkage for the local databases.

Also, the RemTrans mechanism in Bird can be extended to provide an explanation at run time to the remote database of why a RemTrans transaction was aborted by the local database.
For example, in the "Football" example, when the coach tries to delete a player from his database and fails every time, it might be helpful to have the message

Violation of remote constraint: No-One-Can-Fire-the-Provost's-Son. Transaction aborted.

displayed on the coach's screen.

Open Questions: This research leaves a number of questions unanswered. These include:

• Do all well-founded ALDs have witness generators?
• Is there a complete WGG algorithm that finds witness generators for all well-founded ALDs?
• Is there a way to find witness generators other than SLD-resolution style expansion?

Generalizations: The examples presented in this thesis are only indicative of the problems that can be solved by Bird. It is our belief that Bird can be applied in many related areas in the field of database interoperation. Potential applications include:

• As discussed in Section 9.1.1, Bird may be used to maintain the mappings between layers in the five-level schema architecture of the Federated Database System proposed in [SL90].
• In the "Requisition" example, the solution utilizes Bird to send proposals between two database systems. In fact, this architecture can be applied in the general Office Automation arena, which generates constant inter-departmental traffic of electronic forms (and paper). In general, the ALD can be extended to include one bi-directional and two uni-directional semantic correspondences. The bi-directional semantic correspondence maintains the information shared by both departments; the two uni-directional semantic correspondences serve as "wish lists" or "For-Your-Information" notices to the remote department.
• In the general areas of Cooperative Computer Aided Design and Computer Supported Cooperative Work, Bird can be used to propagate incremental updates of several design groups.
Appendix A
Syntax of ALD

The syntax of ALD presented below is written in the BNF language of POPART introduced in Section 2.4.1.

correspondence-spec := '( 'abs-linkage id#abs-name { schema-definition } { EES } { B-to-auxA } { A-to-auxB } ( B-to-A A-to-B | A-to-B B-to-A ) ') ;
schema-definition := { A-schema } { B-schema } { auxA-schema } { auxB-schema } ;
A-schema := 'A-schema '( schema-info#A ^ '; { '; } ') ;
B-schema := 'B-schema '( schema-info#B ^ '; { '; } ') ;
auxA-schema := 'auxA-schema '( schema-info#auxA ^ '; { '; } ') ;
auxB-schema := 'auxB-schema '( schema-info#auxB ^ '; { '; } ') ;
schema-info := ( relation-definition ) || ;
relation-definition := rel#def-name '( id#tuple-type
EES := 'Entity-Equivalence '( EES-entry ^ '; { '; } ') ;
EES-entry := rel#A '= rel#B ;
B-to-A := 'B-to-A '( ilog-rule#B-to-A ^ '; { '; } ') ;
A-to-B := 'A-to-B '( ilog-rule#A-to-B ^ '; { '; } ') ;
B-to-auxA := 'B-to-auxA '( ilog-rule#B-to-auxA ^ '; { '; } ') ;
A-to-auxB := 'A-to-auxB '( ilog-rule#A-to-auxB ^ '; { '; } ') ;
ilog-rule := head body ;
head := intermediate | invention-intermediate | relation || ;
body := intermediate-or-relation ^ ', ;
intermediate-or-relation := intermediate | relation || ;
invention-intermediate := rel#invention '[ '* { rel#constraint } ( var-or-const#witness ^ ', ) '] ;
intermediate := rel#intermediate '[ ( var-or-const#obj-witness ', ) '] ;
relation := rel#relation '( var-or-const#type ^ ') ;
rel := { rel-host } rel-name ;
rel-host := id#rel-host ;
rel-name := id#rel-name ;
var-or-const := const | var || ;
const := string | integer | bool || ;
bool := 'T | 'nil ;
string := lexeme <| stringp ;
integer := lexeme <| integerp ;
id := lexeme <| alphanumeric ;
var := skolem-term | variable || ;
skolem-term := id '( var ^ ') ;
variable := lexeme <| alphanumeric

Appendix B
Formal Semantics of nrecILOG and ALDs

This appendix summarizes the formal semantics of the ALD presented in this thesis.
In the presence of OIDs, two instances are considered OID-equivalent if they are identical up to a one-one mapping on OIDs. In most object-oriented restructuring languages, the output is defined as a family of instances that are all OID-equivalent. This family of instances is called the "logical instance", in contrast to a "physical instance" stored in a database.

Definition: OID-equivalent, Logical Instance
Let S be a schema. Two physical instances I, J ∈ Inst(S) are OID-equivalent, denoted I ≡ J, if there exists a permutation σ on the domain of OIDs such that
σ(I) = J
The logical instance represented by I is the OID-equivalence class containing I, denoted by [I]. □

Definition: Logical Instance Mapping Specified by a nrecILOG- program
Let P be a nrecILOG- program, and let S and T be the source and target schemas of P. The function defined by P is a mapping from [Inst(S)] to [Inst(T)], i.e., from the logical instances of S to the logical instances of T. However, the computation and behavior of P are specified in terms of physical instances of S, by the following three-step process.
(1) Construct the "Skolemization" Skolem(P) of P (Section 2.3.1).
(2) For a logical instance [I] of S, compute Skolem(P)(I) using conventional logic programming to obtain a "Skolemized" physical instance, i.e., a physical instance that may contain skolem terms.
(3) Let ψ be a function that maps each distinct skolem term in Skolem(P)(I) to a unique OID. Then the image P([I]) is [ψ(Skolem(P)(I))].
As shown in [HY90], since the output P([I]) is a logical instance, the specific choice of the skolem term substitution ψ is irrelevant; that is, as long as each distinct skolem term is replaced by a unique OID, the resulting logical instances are the same. □

The Abstract Linkage Description (ALD) is a text description of the semantic correspondence between two databases A and B.
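Returning to the OID-equivalence definition given earlier in this appendix, the following Python sketch checks whether two small physical instances belong to the same logical instance. The representation is an assumption made for illustration only: an instance is a set of (relation_name, argument_tuple) facts, and OIDs are strings beginning with "#". The search tries every one-one mapping between the OIDs of the two instances, so it is exponential and only suitable for toy examples; any such one-one mapping between two finite OID sets of equal size can be extended to a permutation on the whole OID domain, which is what the definition asks for.

```python
from itertools import permutations


def oids(instance):
    """Sorted list of the OIDs (strings starting with '#') used in an instance."""
    return sorted({
        value
        for _, args in instance
        for value in args
        if isinstance(value, str) and value.startswith("#")
    })


def oid_equivalent(inst_i, inst_j):
    """Return True if some one-one renaming of OIDs maps inst_i onto inst_j."""
    oids_i, oids_j = oids(inst_i), oids(inst_j)
    if len(oids_i) != len(oids_j) or len(inst_i) != len(inst_j):
        return False
    for candidate in permutations(oids_j):
        sigma = dict(zip(oids_i, candidate))
        # Apply sigma to every argument; non-OID values are left unchanged.
        image = {
            (rel, tuple(sigma.get(v, v) for v in args))
            for rel, args in inst_i
        }
        if image == inst_j:
            return True
    return False
```

Two instances that differ only in which OIDs name the same objects are reported as equivalent, while instances whose OIDs play structurally different roles are not.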
Definition: Abstract Linkage Description (ALD)
An ALD L is a five-tuple (A, B, EES, P_{A→B}, P_{B→A}), where
A is the schema definition for database A,
B is the schema definition for database B,
EES specifies a family of one-one relationships between entity types in A and entity types in B,
P_{A→B} is a nrecILOG- program specifying the instance mapping from A to B,
P_{B→A} is a nrecILOG- program specifying the instance mapping from B to A. □

According to the original semantics presented in [HY90], the output of a nrecILOG- program is a "logical instance". As a result, the semantic correspondence specified by an ALD is defined in the context of logical instances as follows.

Definition: Semantic Correspondence
The semantic correspondence SC_AB specified by an ALD L = (A, B, EES, P_{A→B}, P_{B→A}) is a set of logical instance pairs in [Inst(A)] × [Inst(B)]:
SC_AB = {([I], [J]) | (P_{A→B}([I]) = [J]) ∧ (P_{B→A}([J]) = [I])} □

One of the major contributions of Bird is the use of the notion of witness to bridge the gap between the semantics of nrecILOG-, defined in terms of logical instances, and the physical instances implemented in the real system.

Definition: Witness
Let P_{A→B} be an ILOG program specifying a mapping from schema A to schema B. Let I_A and J_B be instances of A and B respectively, with [P_{A→B}(I_A)] = [J_B], or P_{A→B}(I_A) ≡ J_B. That is, the resulting (physical) instance P_{A→B}(I_A) is OID-equivalent to J_B. Then an instance I_auxA over auxA is a potential witness from I_A to J_B if P_{A→B}(I_A) = I_auxA and P_{A→B}(I_A ∪ I_auxA) ≡ J_B. A potential witness I_auxA is a witness from I_A to J_B if P_{A→B}(I_A ∪ I_auxA) = J_B. The notion of witness from J_B to I_A is defined analogously. □

As mentioned in Chapter 5, in the bi-directional semantic correspondence case, each of the two databases has to maintain a local witness to match the current physical state of the other database.
Thus, the witnesses stored in the two databases co-exist as a pair.

Definition: Witness Pair
Let L be an ALD specifying a semantic correspondence SC_AB, and let ([I_A], [J_B]) ∈ SC_AB be an instance pair in the semantic correspondence. The two instances I_auxA over auxA and J_auxB over auxB are a witness pair for (I_A, J_B) if
(1) I_auxA is a witness from I_A to J_B and J_auxB is a witness from J_B to I_A, that is,
P_{A→B}(I_A) = I_auxA, and P_{A→B}(I_A ∪ I_auxA) = J_B
P_{B→A}(J_B) = J_auxB, and P_{B→A}(J_B ∪ J_auxB) = I_A
(2) for each pair of intermediate relations B.iP and A.iQ resulting from the EES-to-ILOG translation of an EES entry A.P = B.Q, the two relations B.iP and A.iQ are inverse to each other. That is,
iP(p, q) ∈ J_auxB ⟺ iQ(q, p) ∈ I_auxA □

Though the two nrecILOG- programs P_{A→B} and P_{B→A} in an ALD specify instance mappings at the logical instance level, the actual semantic correspondence maintained by Bird is defined as follows.

Definition: Extended Semantic Correspondence
Let L be an ALD that defines a semantic correspondence SC_AB. Let auxA and auxB be the schemas of the witness relations appearing in P_{A→B} and P_{B→A} of L respectively. Let Â = A ∪ auxA and B̂ = B ∪ auxB be the two extended schemas of A and B with the witness relations added. Then the extended semantic correspondence defined by L is
SC_ÂB̂ = {(I_A ∪ I_auxA, J_B ∪ J_auxB) | ([I_A], [J_B]) ∈ SC_AB, and (I_auxA, J_auxB) is a witness pair for (I_A, J_B)} □

Appendix C
Proofs for the EES Removal

This appendix presents the proofs for the various results in the EES removal process presented in Section 6.2.
L e m m a 6 .2 .1 (R e d u c e d S e m a n tic C o r r e sp o n d e n c e ) Let L be an ALD specifying a semantic correspondence1 SC l ■ Let the semantic correspondence defined by reduce(L) be SCredU C e(L)- If (7^, Jg) is in the extended semantic correspondence of SC l , then r e d u c e ( I J g ) is in the extended sem antic correspondence of SCredU C e(L) P ro o f: Let = reduce(I^), Jg, = reduce(Jg), I'atl = I U Jg,, and Iau = U Jg. It is sufficient to show th a t, for each rule 7tri E reduce{pA^>B), I'all.r' = 7Zri(I^,). Case 1) r' derived from r defined in an EES entry A.S = B.R in L. From the reduce construct, we know 1Zr> is B.R(x) : — A.S(x); Clearly, for each xrs of type RS-value S(xrS) € reduce(I^.S) R(xrS) e reduce(Jg.S) Thus Iall-Y = Rr‘{Jj\i) 1For the sake of brevity, the semantic correspondence defined by L is denoted here as SCl , instead of SC ab ■ And the semantic correspondence defined by reduce(L) is then SCreduce(L)- 187 Case 2) r' is not derived from any EES relation. Since I'aU.r' = reduce(Ian.r) = reduce(TZr(I^)), it is sufficient to show that re d u c e ( 7Zr (I^ )) = 7Zri(I^,) Q ) — * Let r(tS') € 7Zr>(/^,). Then there exists a V.A. /? such that 7?.,./.head/? = r(w') 7?.r/.body/? C I We can construct a V.A. a on Var(1Zr) from /? as follows. For each a ; € Var(lZr), x m ust satisfy one of the three cases. Subcase 1) if a; is of type A.P defined in an EES entry A .P = B.Q; then there is a variable xpq of type PQ-value in Var(7£r» ) which replaces x. Let xa = p where p is of type A.P and reduce(p) = xpq(3. Subcase 2) if x is of type B.Q defined in an EES entry A.P = B.Q; then there is a variable xpq of type PQ-value in Var(/1Zr> ) which replaces x. Let xa = p where p is of type B.Q and reduce(p) = xpq/3. Subcase 3) x is not OID of any EES relation, let xa = x/3. From the reduce construct and the definition of a , we know T^.bodyo: C 1^. Therefore, 7?.r .headai = r(w) € Iati reduce(r(w)) = r(w') T hat is, r(w') € reduce(7^r(I^)). 
( Q Let r(w') € t'educe(7ZT(Ij4)). Then there exists a V.A. a such that 7£r .heada! = r(iy) ( E Iau-T 7 2 .r .bodyo! C / | reduce[r{w)) = r{w') 188 We can construct a V.A. (3 on Var(7Z.r> ) from a as follows. For each x € Var(7ir> ), if x is of some type PQ-value derived from an EES entry, A.P = B.Q, in L, then there exists a variable y € Var(lZT) such that y is of type A.P or B. Q. Let x/3 = reduce(ya). And let x(3 = xa otherwise. From the reduce construct and the definition of /? we know 7£r'.body/? C Ifr 7£r/.head/? = reduce(7Zr.hea,da) = r(w' ) e I'all.r' □ L e m m a 6.2.2 (E x p a n d e d W itn e s s R u le ) Let L be an ALD specifying semantic correspondence SCl- Let the sem antic corre spondence defined by reduce(L) be S C r e d Uc e ( L ) - If W G r> is a witness rule for r' in SCredU ce(L)> then expand(WGr> ) is a witness rule for r in SCl- P ro o f: Let I = reduce(I^), Jg, = reduce(Jg), I ‘ all = U Jg ,, and Iaii — /4 U Jg. Let W G r = expand(WGri). W.l.o.g., assume r € auxA. We now show that IA.r = W G r( J6 ) (2 ) Let r[u>] G W G r(Jg). Then there exists a V.A. a such th at WGy. head a = r[tt>] W Gy.bodya C Jg We can construct a V.A. /3 on Var(W G T > ) from a as follows. For each x G Var(W G r'), x m ust satisfy one of two cases: Subcase 1) x is of type PQ-value derived from an EES entry A.P = B.Q in L Then there is a variable y of type A.P or B.Q in Var(W G r) which replaces x in the expand construction. Let x(3 = reduce(ya). Subcase 2) otherwise, let xfl = xa. 189 From the definition of expand and the definition of 0, we know W G r> .body/3 C Jg,. Since W G r> is a witness rule for r' we have VFGv.head/? = r[w'] E IA,.r' reduce{r[w\) = From the definition of reduce, it follows that r[u>] E IA.r. ( Q ^ —* — i Let r[w\ E IA.r and reduce(r[w]) = r[u>']. Since W G r< is a witness rule for r', there exists a V.A. 0 such that VFGv.head/? = r[w'} E IA>^' W G r>.body0 C Jg, We can construct a V.A. a on Var(W GT) from 0 as follows. 
For each x E Var{W G T ), x m ust satisfy one of the three cases. Subcase 1) x is of type A.P defined in an EES entry A.P = B.Q in L. Then there is a variable xvq of type PQ-value in Var(7Zri) which replaces x. Let xa = p where p is of type A.P and reduce(p) — xpq0. Subcase 2) x is of type B.Q defined in an EES entry A.P = B.Q in L. Then there is a variable xpq of type PQ-value in Var(7Z-ri) which replaces x. Let xa = q where q is of type B.Q and reduce(q) = xpq0. Subcase 3) x is not OID of any EES relation, let xa = x0. Since W G r‘.body/? C Jg,, from the expand construct and the definition of a , we know W G .bodya C Jg . Therefore, 7^r .heada = r(w) E Iaii reduce(r(w)) = r( w') From the definition of reduce it follows that r(w) E (kFG>(Jg)). □ 190 A p p e n d ix D Detailed Proofs for Chapter 7 This appendix provides the proofs for most of the results in C hapter 7. L e m m a 7.1.1: Order Independence of Micro Steps Let Gi = r'KiTj1) , .., /,• > 1 be the goal set at level i in an expansion history. Then for any two perm utations 7 r and 7r' on [1, /,] we have G r T T ("lit' i + 1 — L r t-+ 1 j u i = P ro o f: The m ain reason that this result holds is because of (1) the careful choice of standardizing the variable apart functions and (2) the fact that the substitutions generated are proper. The proof used here is reminiscent of the “Switching Lemma” of [Llo87]. To prove this lemma, it is sufficient to show th at the following two micro step expansion orderings yield the same Gi+i and 0i.fi. (a) at each micro step, the atoms are expanded according to a perm utation ir where ri (uiS) is expanded at micro step jf, and rt i(uit) is expanded at micro step j -f 1 and s,t e [ M i ] , j € [1,/,- - 1]. (b) same as (a) except, at micro step j, is expanded, and, at micro step j + 1, rt(uis) is expanded. T hat is, the order of rf(iiis) and r ^ u /) are switched. 
191 For (a) the expansion scenario is 1 expanding a 3 = mgu{uiSa^x'3~1^ ^ 7?.,.?.head-term) g3 = r | ( ^ ) , expanding erJ+1 = mgu(uita^l^ \ H rt .head-term) g3+l = (...,7Zrf, •■jTZrt, ...)cr[liJ+ l1 where cr^x,3~x^ are the m gu’s obtained from micro steps 1 through j — 1. For (b) the expansion scenario is 93~X = (—,**?(£•'),■■, ^ i ( i ^ , - - - ) crC l’ J_1] expanding {33+l = mgu(ui 7Zrt.head-term) g 3 = (..., r|(uV2 , . . , ^ r?,...)cr[ 1-i-1 ]/?i+i expanding /3 3 = m gu(uiS < T l1’ J~1 ]/?J+1,7?.^. head-term) *+ ' = •?(ui'‘))<Ttl- i+l1 From the two expansion scenarios we know (uiscr^l,3~x^)a3 = (T ^.head-term jer7 (1) (uita^1'3~^)a 3 a'3 + 1 = (7£rj.head-term)cr?+ 1 = (7?.rt.head-term)<TJ(T -'+1 (2) (uitcr^ 1, 3~1^)/33+l = (7?.rt.head-term )/?-?+1 (3) ( u i V ^ _1^)/?J+1/?J = (7?.,.? .head-term )/3 3 = (7£r?.head-term)/7-?+1/?-? (4) Equation 2 holds because TZrt .head-term uses no variables occurring in a 3 or cr3+1 , and likewise for (4). Since ft3 + 1 is a mgu unifying u/a^1,3-1^ and TZrt.head-term (3), and a 3 < r3+x unifies u^cr^1’ 3" 1] and 7Zrt.head-term (2), we have a 3 a 3 + 1 = (33+l a, for some a (5) 1W .l.o.g.1 in this illustration, we assume that ir(j) = s < 7r(j 4- 1) = t; i.e. r|(u*) is to the left of r*(uj) in Gj. 192 Then, since /3J+1 does not act on 7£r? .head-term, we have 0 Zrf .head-term)er = .head-term)/3J+1cr = (7?.r*.head-term)cr-7 cr- 7+ 1 = ( u .V 1 ’ ■ 7 - 1 l)crJcr7+ 1 = (tf;S < 7 f1’ - 7-1})/?-7+1< T Thus a is a unifier of 7Zr?.head-term and (ttiS cr[1,- ?-1 i)/3-’+1. Also by (4), is the mgu for 7£rs.head-term and { v t * ’ J -fi)/?-7+1. It follows that cr = ftcr', for some cr' (6) Substitute (6) into (5) we have and by symmetry, there is a cr" such that cr" = f t Since the mgu only generates proper substitutions, we know aJ a J + 1 = □ L e m m a 7 .1 .2 : Variable Inheritance Let H be an WGG expansion history, and r(u ) G Tree,- be an atom occurrence at level i. 
Let I/other = {x \ x 6 Far(7?.r .body) — V a r(7?.,..head)} be the set of existential variables in the body of the renam ed rule H r. Then (a) Var{Tree\T +f )}) = Var(u ■ MGUi+a)U V other = {t | t £ u • MGUi+\ and t is not a skolem term} p 'o th e r T hat is, the variables in T r e e } ^ includes variables in Vother and variables “inherited” , subject to be changed by MGUi+i and excluding those OID vari ables replaced by MGUi+i into skolem term s, from its ancestor r(£t). 193 (b) for any variable x € Var(Treei(H )) if rank(x) = m > 1, then M axRank(Var(x • MGU[i+iti+ 2])) = m — 1 P ro o f: To-prove part (a), using the Order Independence of Micro Steps Lemma, we m ay assume that atom r(u ) is expanded in the first micro step. Then from the idem potency of substitutions, we know T ree};(“)} = (K r.body • er-+1)of+1 • • • crj^ 1 = (ftr .body • a}+x)< r}+ x • < r? +1 • • • a ^ x = ( £ r .body • cr}+x)MGUi+x (7) Let 7?.r .head = r(v). Consider the following two cases. Case 1) t{ = “base” Since each nrecILOG” non-invention rule is safe[U1188], v C Var(7?.r .body). Also, each variable in v has larger index than the variables in u, and the mgu is always proper. Thus, the substitution 2 generated in the first micro step is 3 cr}+ x = mgu(v, u) = {v i — » u} T hat is, u C V ar (7^r .body<7l+1). Since aj+x does not act on the existential variables yother jn body. This implies Var(Kr. body aj+x) = H u V other (8) Since the variables in I/other do not appear in other atoms and their corresponding renam ed rules, the substitutions generated in the remaining micro steps do not act on y other. From (7) and (8), this implies Var(Tree\l( ^ }) = Var((u U V other) ■ MGUi+x) = Var(u ■ MGUi+x) U y other = {f | / € w • MGUi+i and t is a variable.} U y other 2For the sake of brevity, we use “t? ► u” to denote substitution {(x/y) | x G v, y € u] 3Based on the restrictions presented in Section 2.3.3, we assume each variable in the rule head of Ttr is distinct. 
194 Case 2) t{ — “aux” If r is not an interm ediate relation, then atom r(u ) is not expanded. Clearly (a) holds. Assume r is an interm ediate relation. T hat is, u = (o, w ) for some OID variable o and witness vector w. Let v = (f(w '), w'). Since each variable in v has bigger index than the variables in u, we have a i + i = mgu((f(w'),w'), (o,w)) = {w' w ,o/f(w )} Since w' C I/ar(7?.r .body), we know w = iyVl+1 C Var{7 2 .r .body • crl+1) Var(o ■ al+l) = Var(f(w)) — w (9) Then we have Vra r(T re e [ji)) = Var(7lr.body • crj+ 1 ■ ■ ■ a ^ ) = Var{{w' U Vother) • MGU i + 1 = Var{(w) ■ MGUi+i) U V otheT (10) = Var{(o, w ) • MGUi+i) U F other (by (9)) = Var(u ■ MGUi+\) U V other = {t | t £ u • MGUi+i and t is a variable.} U I/other (by (10)) To prove part (b), assume U = “base”. From the proof of part (a) of this lemma, we know x-MGUi+i = x € Var(Treei+ i), and x-MGUi+z = f{w) becomes a skolem term of variables w. Then, part (b) follows directly from the definition of rank. The case for t{ = “aux” can be shown analogously. □ L em m a 7.1.3: Substitution Partition Let H be an expansion history. Then for each expansion step i > 0 (a) if t{ = “base”, then MGUi+i(H) is a variable-pure substitution th at is trivial to Treei(H). In particular, for each (x /y ) £ MGUi+i(H), a: is a new variable in some renamed rule, and y is in Var(Treei(H)). 195 (b) if t{ = “aux” , then MGUi+\{H) = 9sU0eU8t, where 6 s, 6 e, 6 * are a skolemization, variable-pure, and trivial substitution to Treei(H), respectively. In particular, 6 e = {(x/y) | C is an equivalent witness class of Tree x, y € C, x 7^ y, and index(y) = MinIndex(C)} 6 s = {(o /fr(u • 9e)) | r[o, u] < E Tree{(H),r is an interm ediate relation, and f r is the skolem function associated with r} 6 l is a trivial variable-pure substitution similar to the one defined in (a) P ro o f: Let Gi — { r* (u j),..., We prove (a) by induction on the micro steps and show th at for each micro step k 6 [0, L] #!+? 
= {* /i/|s € U Var(Hr^),y eV ar(G i)} (11) <7=0,A : Clearly, (11) holds when k = 0. Assume (11) also holds at micro step k > 0. Now consider micro step k + 1. By definition, we know 9 i+1 = ( ^ ri.body,..,7erA.body,rf+1(uf+1) ,..,r |i(u|,))a-S+f1 9 i+\ = (ftrj .body, .body, 7l rk+i .b o d y ,.., r-^u^ ) ) ^ ? 1 ^ 1 where is derived from g *+ 1 by unifying r*+1(«f+1)<rj°’ ^ and 7£rfc+i.head with m gu ■ Since £ ,• = “base” , r^+1(uf+1) in Gi is a predicate of a base relation rf +1. Let the renam ed rule 7£r* + 1 be rf+1(0?+1) : Then by the unification process and (11), we know Ojff = m g u ( u ^ - 0 ? * \ v ^ ) = mgu(u^+l,v^+1) Since each variable x 6 uf+1 is a new variable and index(x) > Max-index(uf+1) we know - *?+1> 196 Then 0j+i+1^ = satisfies (11). W hen tj — “aux”, according to the WGG expansion, for each micro step k £ [1, /,■ ], if atom rk(uk) is a predicate of a base relation, then c r^ = 0 and gk +l = 1 1. Thus, to prove (b), we only consider those micro steps when r f,k £ [0,/,•], is an interm ediate relation and show 0 1 = {(x/y) | C is an equivalent witness class of (rl(« l), rk(uk)), x,y £ C,x 7 ^ y, and index(y) = M inlndex(C)} < H = i(°/fr 9 (u ■ 9%)) | r?[o,u] £ Treei(H),q £ [l,fc] (12) f r 9 is the skolem function associated with r f } d\ is a trivial variable-pure substitution Clearly, (12) holds when k = 0. Assume it also holds for k £ [0, U — 1]. Let the skolemized and renamed rule 1Zrk+ i at micro step k + 1 be ' f H [/(“ )> 5] ■ Consider the two possibilities for atom in gf+1- Case 1) = r ^ '[ o ,S | That is, the OID variable o has not been skolemized by cr&j .. Then Since v are new variables in the renamed rules and mgu always generates proper substitution, we have vf+i = { ( ° //( “ ))} Since {v i-» «} only acts on the new variables in the renamed rule, let ( n+, = n [ O ' k + i = e k U { » * - > « } Then (12) holds. 
Case 2) rf = rk[f(u),u'] T hat is, the OID variable in u\ has been skolemized by <xj+i Since v are new 197 variables appearing in the rename rule, then unifying r*[f(u), v?] with r* [f(v), v\ will yield 4 °f+i = rngu(rf[f(u),u'],r!?{f(v),v\) = {v i — > (u ■ m g u ( u u))} U mgu(u', u) Since v are new variables and both u and u' contains corresponding variables in the same witness equivalent classes, we can set , 6 k+\ = ek u * -+ (“ ‘ rn g u {u u))} Then (12) holds. □ L e m m a 7.2.1: Embedding Let H — WGGi,{Go,to) be an expansion history of an ALD L , and be an instance pair in the extended semantic correspondence SC^g defined by L. Let I all = I £ U Jb - If there is a V.A. a w.r.t. Iaii such th at G icx. C Iau,i > 0, then for any k > i there exist a V.A. /? w.r.t. Iau such that the condition above. □ S u b le m m a D .0.1: Embedding Let Go, G \,... be the sequence of goal sets and 6 1 , 6 2 ,... be the sequence of m gu’s appearing in some WGG expansion history. Let Gi and G,+i,i > 0, be two goal sets. Then 4By our assumptions in Section 2.3.3, we assume that for each invention rule,i[*,u] • • •, the variables in v are distinct. (3 is an extension of a w .r.t. Iau G jP Q Ia u J G [0,A r] Vs € Var(Tree[Q M (H)), x(3 = xM GU[o,k](H)0 P ro o f: By repeated application of part (a) of Sublemma D.0.1 below from level i to k, and part (b) from level * — 1 to 0, we can construct the V.A. /9 that satisfies 198 (a) if there exists a V.A. a such that Gia C Ian, then there exists some V.A. a' such that a' is a skolem extension of a w.r.t. Iaii Gi+1a' C Iau (13) V y E Var(Gi),ya = yOi+xa' (b) if there exists a V.A. a' such that Gi+\Gt' C / a», then there exists some V.A. a which satisfies (13). P ro o f: Let Gi = rj (u j) ,. . . , r\' {u\') be an goal set at level i > 0. We prove (a) by induction on the micro steps and show th at, for each micro step j E [0, Z < ], there exists a V.A. cV (a 0 — o) such that cV is an extension of a w.r.t. 
Iaii 9i+laJ Q 10 .1 1 (14) V y G Var(Gi), ya = ycrf^a 3 Then set a ' = a 1 '. Clearly, since o 0 = o, we know (14) holds when j = 0. Assume (14) also holds at micro step j E [0, /,• — 1]. Now consider micro step j + 1. By definition, we know 9 i+ i = (ftrj.body, body, rj+ 1(ui+1) , .., ^ ‘(u^ ) ) ^ 1 9 l t l = (7^ri .body,..,7Zrj .body, 7 ^ + i .body,.., rj'(u/*))a-[°f where g{+l is derived from g{ + 1 by unifying rf+1(ui+1)cr|^.’ f and 7£rj+i .head with mgu 1. T hat is, (rf+1 (u-+1)cr|”f ) • aiH = Tl^+i .head • cr#* (15) By our induction hypothesis (14), we have (r?+1( ^ +1)<rj°f )a* G g3i + 1 a 3 C Iall (16) Since ( ^ +1(wi+1)crj+f) is an atom of relation rf+1, and 'R-rj + 1 is the only rule in the ALD th at specifies the population of r^+1, therefore, by the semantics of nrecILOG- , we know ( r ? + , ( « J + > M ) . ^ g f l r , * , ( h „ ) 199 i.e., there exists a V.A. 0 such that .head • 0 = (rf+1(u^+1)<rj°f )aj € Iaii (17) Tl^+i .body • 0 C Ialt (18) Since all variables in the renamed rule are new ones, i.e., Var(TZr j+ i)n Var(gj+1) = 0, we can define a new V.A. 7 = a - 7 U 0 (19) Then, from (19), we have g j+ i'lQ Iaii by (14) -head • 7 C Iall by (18) (20) ( r ^ 1( ^ +1)< rjy)7 = 7 ?-ri+i .head • 7 € Iaii by (17) From (20) and (15) we know both 7 and unifies ^;+1(w;+1)^j+i^ and 7Ztj+ 1 .head. Then by the definition of mgu, there exists another V.A. a J+1 such th at 7 = a f ^ a 3+l ( 2 1 ) Then we have ff& V +1 = (S?+, - {r?+ ,(S?+V l + f } U K r;>..body) • 0 & V + 1 = («?+, - M +‘ ( ^ * R ^ 1} U R ^ .b o d y ) • 7 £ 9i+\l U .body • 7 C IaU by (20) and for each variable y € Var(Go = < /f+1), ya 3 + 1 = y ai+1 a by (14) = y ■ ai+1 1 by (19) = y ■ , f o ,i] . i+i j+i ui+\ °i + 1 a by (21) = y- ai+1 a We can prove part (b) by induction backward on the micro steps (i.e., from micro step /,• back to micro step 0) and show th at, for each micro step j [0,/j], there exists a V.A. 
a* (a li = a') such th at gl+i<xj Q I all Vy € Var(g3 i+1), yaJ - yoV+ ^ct 200 Clearly, (22) holds when j = U. Assume (22) also holds at micro step j G [ 1 , 4 1 - Now consider micro step j — 1. By definition, gf+ 1 is derived from gj+l by unifying r\ (u3) and 7?.^ .head with mgu < r3 +1. T hat is, (r * ( « ? ) a £ r 1,) ' < i = ^ - h e a d • a ° i + 1 (23) Furtherm ore, by our induction hypothesis (22), we know 7£r> .body • a 3 C gj+ 1 ■ a 3 C Iaii. Therefore, from the semantics of nrecILOG- and (23), we know ri( “?)< 7i+?aJ = 7^rj.head • a 3 G Iali (24) We can define a3~l = cr 3+ 1 a3. Then gi+ia 3 - 1 = v-+ 1 )aJ = (9i+i - R rt -body • crj+l U {rj(u3) ■ <r|+f )}a 3 C gf+ 1 a 3 U { (r3 (u3) • o{°f )a 3 } < = hll And for each variable y G Var(gi+ij ya3~x □ L e m m a D .0 .2: VEC Let H be an W GG expansion history, and C C Tree,- be a set of atom occurrences at level i. Let V = Var(C) H Var(C). Let x G Var(C) and y G Var(C) be two variables. If x V and y V and x • MGU{+i = y ■ MGUi+i, then C °-S* C P ro o f: By the definition of VEC, we know x and y m ust be in the same VEC w.r.t. MGUi+i. By the Substitution Partition Lemma, this implies th at x and y are in — 1), we have ya 3i + 1 a 3 ycrW^oi 201 the same equivalent witness class. T hat is, there exist n > 2 atom s of interm ediate relations 5 •**]? * * * \p 7 l ) • • * * ^7 1 5 • * * ] } Q Treei such that x = x\, y = yn, and each Xk, k € [l,rz — 1], in Ik[ok, •••Xk, •••] corresponds to Xk+i in Ik+i [ofc+ii ••••Efc+i5 •••]• Furtherm ore, since x g C and y £ C, we know I\[o\, ...x, ...] € C and In\on,-■•y-, •••] £ C. And therefore, c°&c □ L e m m a 7.3.1: OID Component Condensation Let H be an expansion history of an ALD L. Then (a) For each r(«), r'(u) € Treei(H), if r'(v) then Vj > 0, T r e e j l f \ H ) ° $ ’ T r e e i ^ }(H) (b) The OID component condensation [Tree[,j](/f)] of Tree^j^H ), j > i > 0 is a tree/forest. 
Proof: To prove (a), it is sufficient to show that: if at level $i$ two atoms satisfy $r(u) \overset{OID*}{\not\longleftrightarrow} r'(v)$, then

$$Tree^{r(u)}_{i+1}(H) \overset{OID*}{\not\longleftrightarrow} Tree^{r'(v)}_{i+1}(H) \qquad (25)$$

From the Variable Inheritance Lemma, we know

$$Var(Tree^{r(u)}_{i+1}) \cap Var(Tree^{r'(v)}_{i+1}) = Var(u \cdot MGU_{i+1}) \cap Var(v \cdot MGU_{i+1})$$

Case 1) $t_i$ = "base"
By the Substitution Partition Lemma, we know $MGU_{i+1}$ does not act on $u$ and $v$. And by the Variable Inheritance Lemma, we know

$$Var(Tree^{r(u)}_{i+1}) \cap Var(Tree^{r'(v)}_{i+1}) = (u \cup V_{other}) \cap (v \cup V'_{other}) = u \cap v$$

which contains only base variables. Therefore, (25) holds.

Case 2) $t_i$ = "aux"
We prove this case by contradiction. Assume (25) is false and there exists an OID variable $x$ such that $rank(x) > 1$ and

$$x \in Var(Tree^{r(u)}_{i+1}) \cap Var(Tree^{r'(v)}_{i+1}) \qquad (26)$$

From (26) and the Variable Inheritance Lemma, we know

$$x \in Var(u \cdot MGU_{i+1}) \cap Var(v \cdot MGU_{i+1})$$

This implies that there must exist two OID variables $x_1 \in Var(r(u))$ and $x_2 \in Var(r'(v))$ such that $x_1 \neq x_2$ (since otherwise $r(u) \overset{OID*}{\longleftrightarrow} r'(v)$) and

$$x_1 \cdot MGU_{i+1} = x, \qquad x_2 \cdot MGU_{i+1} = x$$

This is only possible if both $x_1$ and $x_2$ are in the same witness class. Thus, there must be two atoms $p[o, .., x_1, ..],\ p[o, .., x, ..] \in Tree_i(H)$ and two atoms $q[o', .., x_2, ..],\ q[o', .., x, ..] \in Tree_i(H)$. But this implies

$$r(u) \overset{OID}{\longleftrightarrow} p[o, .., x_1, ..] \overset{OID}{\longleftrightarrow} p[o, .., x, ..] \overset{OID}{\longleftrightarrow} q[o', .., x, ..] \overset{OID}{\longleftrightarrow} q[o', .., x_2, ..] \overset{OID}{\longleftrightarrow} r'(v)$$

Contradiction!

To prove part (b), clearly for each $C \in [Tree_{[i,j]}(H)]$, $j > i > 1$, there exists $C_1 \in [Tree_{[i-1,j]}(H)]$ such that $C_1$ is a parent of $C$. Assume there is another OID component $C_2 \in [Tree_{[i-1,j]}(H)]$ which is also a parent of $C$. That is, there exist two distinct atoms $r(u), s(v) \in C$ and two atoms $r'(u') \in C_1$, $s'(v') \in C_2$ such that $r'(u')$ and $s'(v')$ are the parents of $r(u)$ and $s(v)$, respectively. Since $r(u) \overset{OID*}{\longleftrightarrow} s(v)$, by part (a) we have

$$r'(u') \overset{OID*}{\longleftrightarrow} s'(v')$$

and thus $C_1 = C_2$. Therefore, $[Tree_{[i,j]}(H)]$ is a tree/forest.
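As an informal illustration of the OID components that the condensation groups together, the sketch below partitions the atoms of one level into classes under the transitive closure of the OID link. It assumes that two atoms are OID-linked exactly when they share an OID variable (shared base variables do not count), which is how the proof above uses the relation; the atom labels and the function name `oid_components` are hypothetical.

```python
from collections import defaultdict


def oid_components(atoms):
    """Partition atoms into OID components.

    atoms maps an atom label to the set of OID variables occurring in it.
    Two atoms land in the same component when a chain of shared OID
    variables connects them (the closure written OID* in the text).
    """
    parent = {atom: atom for atom in atoms}

    def find(atom):
        # Union-find lookup with path halving.
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]
            atom = parent[atom]
        return atom

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_atom_with = {}
    for atom, oid_vars in atoms.items():
        for var in oid_vars:
            if var in first_atom_with:
                union(first_atom_with[var], atom)
            else:
                first_atom_with[var] = atom

    groups = defaultdict(set)
    for atom in atoms:
        groups[find(atom)].add(atom)
    return sorted(sorted(group) for group in groups.values())


# p and t share only the base variable x, so they stay in different components.
level = {
    "p[o1,x]": {"o1"},
    "q[o1,o2]": {"o1", "o2"},
    "s[o2,y]": {"o2"},
    "t[o3,x]": {"o3"},
}
components = oid_components(level)
```

Here `p`, `q`, and `s` form one component through the chain `o1`, `o2`, and `t` forms a singleton component.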
□ 203 L e m m a D .0 .3 : Variable Overlapping Let C C Treei(H) be a set of atom occurrences at level i and C Let Vj = V ar(Treef ) fl Var(Treef), for each j > i. Then Vj = Vi- MGU[i+1J] P ro o f: By the OID Component Condensation Lemma (a), we know Vj,j > i con tains only base variables. Also, from the Variable Inheritance Lemma (a), we know variables in Vj are always passed down to level j + 1 (subject to be changed by MGUj+%). This implies Vi • MGU[i+1J] C Vj It is sufficient to show Vi C Vi ■ MGU[i+ij] (27) We prove by induction. It is trivial that (27) holds at level i. Assume (27) also holds at level j > i. Consider Vj+1. By the Variable Inheritance Lemma, we know Vj+i = Var(Treef + 1 ) fl Var(Treef+1) = Var(Var(Treef) ■ MGUj+l U F other) n Var(Var(Treef) • MGU j + 1 U V 'other) = Var(Var(Treef) ■ MGUj+i) fl Var(Var(Treef) ■ MGUj+1) = {x\x £ V a r ( T r e e ■ MGUj+i and £ is a variable} fl {o:|a: G Var{Tree^) • MGUj+i and a: is a variable} (28) Let x be a variable in Vj+1. From (28), we know there exist two variables x ° G Var(Treef) and x^ G V ar(Treef) such that x ° • MGUj+i = x and e (29) x • MGUj+i = x Claim: x c G Vj or x^ G Vj Assume otherwise. From (29), we know xc and x ° m ust be in the same VEC. But, 204 by the VEC Lemma, this implies Treef °e3* T r e e which contradicts the OID Component Condensation Lemma. We proved our claim. < 1 W.l.o.g, assume x c € Vj. Then we have x = x ° • MGUj+1 € Vj • MGUj+1 a L e m m a 7 .3 .2 : OID Component Let L be a well-founded ALD with maxim um rank m, and H = WGG°L{G0, to) be an arbitrary singleton expansion history. Then at any expansion level i, we have: (a) Let C C Treei(H) be a set of atom occurrences at level i Vk = Var(Treej?) 
D Var{Tree^), k > i If Maxrank(Vi) = n > 1 then Maxrank(Vi+ 2 ) = n — 1 (b) The size of each OID component in Tree(H) is bounded by M a x { \\T r e e f^ 2+ 2 {H)\\ | r{u) € Tree(H)} (c) There is a number, denoted #OIDComp(H), which bounds the number, up to variable isomorphism, of all possible OID components occurring in Tree[0 iO O ](H). P r o o f: W.l.o.g. assume 5 t{ = “base” , i.e., T ree, contains atoms of base relations from database A. Then by the Substitution Partition Lemma, we know MGUi+i does not act on T reet - +i. Thus by the Variable Inheritance Lemma, we know ’ Var(Treef+1) = Var{C) ■ MGUi+1 U V£ther = Var{C) U V£ther Var(Treef+1) = Var(C) ■ MGUi+l U V£ther = Var(C) U Vg*™ 5the restriction is superfluous; even if the goal set at level i does contain some intermediate relations, then its one level expansion will contain only base relation atoms. The restriction is only to simplify the proof. 205 Once again, by applying the Variable Inheritance Lemma to level i -f 2, we have Var(Treef+2) = Var(Treef+1) ■ MGU i + 2 U V£ther' = Var(C) • MGUi + 2 U Vgther • MGU i + 2 U V£tW = {x\x is a variable in Var(C) • MGU{+2} U V£ther • MGUi+2 U V£ther' (30) Var(Treef+2) = Var(Treef+1) ■ MGU i + 2 U Vgth^ = Var(C) ■ MGU i + 2 U Vgthei • MGU i + 2 U V^ther' = {x\x is a variable in Var(C) • MGUi+2} U VgtheT • MGUi+2 U V < fther' (31) Since each the OID variable in C or C is from database A and will be replaced by a skolem term in M G£/,+2, therefore, from (30) and (31) we know, if x is an OID variable in Vi+2, then x e (VgtherMGU i + 2 D VgtheTMGU i+2) Let L = { o i,..,o n} contains all the OID variables in V *. T hat is, L C Var(r(u)) and L C Var(C) Then correspond to € [1, n], in L there are two atom s s[c>i,W i] € Treef + 1 and sfo,-, ii?/] € Treef+1. And all the B OID variable not in {w\ U • • • wn U w\ • Uwn') will not be in a witness class with any variable in Treef+1, and therefore does not appear in Vi+2. 
The only OID links between $Tree^{C}_{i+2}$ and $Tree^{\bar{C}}_{i+2}$ will come from

$$(w_1 MGU_{i+2} \cup \cdots \cup w_n MGU_{i+2}) \cap (w_1' MGU_{i+2} \cup \cdots \cup w_n' MGU_{i+2})$$

By the definition of rank, we know

$$MaxRank(w_i) = rank(o_i) - 1, \quad i \in [1, n]$$

This implies

$$MaxRank(V_{i+2}) = MaxRank(w_1 MGU_{i+2} \cup \cdots \cup w_n MGU_{i+2}) = n - 1$$

To prove (b), if $m = 0$, each OID component in $H$ is a singleton. Clearly (b) holds.

Now consider the case when $m \geq 1$. Note that since $H$ is a singleton expansion history, for each OID component $C$ in $H$ there always exists an atomic ancestor (the initial goal set can be viewed as the atomic ancestor of all OID components). It is sufficient to show that for each OID component $C \subseteq Tree_j$ there exists an atom occurrence $r(u) \in Tree_i$, $i \in [0, j]$, such that $r(u)$ is an atomic ancestor of $C$, and

$$j - i \leq 2 * m + 2$$

We prove this by contradiction. Assume otherwise, i.e., there exists an OID component $C \in Tree_j$, $j > 0$, such that its nearest atomic ancestor is $r(u) \in Tree_i$ and $j - i > 2 * m + 2$. Let

$$Tree^{r(u)}_{i+1}(H) = \{s_1(u_1), .., s_n(u_n)\}$$

Then by repeated applications of part (a), we know, for any two distinct atoms $s_p(u_p)$ and $s_q(u_q)$ in $Tree^{r(u)}_{i+1}(H)$,

$$Tree^{s_p(u_p)}_{i+1+2*m} \overset{OID*}{\not\longleftrightarrow} Tree^{s_q(u_q)}_{i+1+2*m}$$

That is, the descendants of $s_p(u_p)$ and $s_q(u_q)$ at level $i + 1 + 2 * m$ do not have an OID link. Since $C \in Tree_j$, $j - i > 2 * m + 2$, and from the OID Component Condensation Lemma (b), we know $C$ can be the descendant of only one of the $\{s_1(u_1), .., s_n(u_n)\}$. But this means that $r(u)$ is not the nearest singleton ancestor of $C$. Contradiction.

(The proof for Part (c) is shown in Chapter 7.) □

Lemma 7.3.3: Information Transfer between OID Components
Let $H$ be an expansion history for a well-founded ALD $L$, and $C \subseteq Tree_i(H)$ be a set of atom occurrences at level $i > 0$ such that $C \overset{OID*}{\not\longleftrightarrow} \bar{C}$. And let

$$V = Var(C) \cap Var(\bar{C})$$

This contains only non-OID variables. For each level $j$ ($j > i$), let $(x/y)$ be a substitution item in $MGU^{C}_{j+1}(H)$. If $(x/y)$ acts on $Tree^{\bar{C}}_{j}(H)$ then $(x/y)$ acts on $V \cdot MGU_{[i+1,j]}(H)$.
This also contains only non-OID variables. P ro o f: From the OID Component Condensation Lemma (a) we know Treef(H)°fi* Treef(H) 207 And from the Variable Overlapping Lemma, we know Var(Treef) n Var(Treef ) = V • MGU[i+hj](H) which contains only non-OID variables. By definition, MGUj+f' is the accumulated substitution resulting from the micro expansion of each atom in Treef. As a result, M G U j ^ only acts on those variables appearing in Var(Tree*?). Thus for any substitution item (x/y) G M G U j acts on T r e e f , a: m ust also appear in Treef. T hat is, x G Var(Tree?) n Var(Treef) = V • MGU[i+hj](H) □ L em m a 7.4.1: Homomorphism Let H = WGGL(Go,t) and H 1 = WGGl(G'0, i) be two expansion histories of a well-founded ALD L. If there exist a variable m apping p from Go to G' 0, and let the homomorphism from H to H' derived from p be {po, pi...}, then for each i > 0 Treei(H)p{ — Treei(H') MGU[o,i](H) ® p[0,i] = MGU[qA(H ') P ro o f: The expansion history can be viewed as a continuous sequence of micro steps, i.e. (except for switching schema tag) the j th micro step at level i can be viewed as the ((S^Z-o/fc) + i ) th “micro step” at level 0. We can define two functions, L and S, converting “micro step” j to its corresponding level number, L(j), and the real micro step number, S ( j ), within the level as follows. L(j) = x where (E JlJ/t) < j < (£jL 0/t ) s(j) = i - x S t ' h To comply with this new “micro step” sequence, homomorphism p- ,+1 (p° = po) at micro step j + 1 is constructed as follows. /?J+1 : 7Zrsu+i) R! £(j + l) L(>+1) pJ+l = {x h - » \(xi-^ y) e pj U p3+1, y,ymin are in the same VEC C defined by 2 /min is the variable with m inim um index in C} 208 Then, it is sufficient to prove this lem m a by doing induction on the “micro steps” and show that, for each micro step j > 0, gi pi = g' 3 (32) where (g3 ,cr3) and (g'3 ,cr,J) are the goal sets and substitutions generated in micro step j of the two expansion histories, respectively. 
Clearly, at the 0th micro step, (32) holds. Assume, (32) also holds at micro step j > 0. Suppose at micro step j , I = L (j ) and s — S(j). Furtherm ore, g3 and g3 + 1 are g3 = ( ^ ri.body,..,7eriS .b o d y ,rf+1(uf+1),..,rJi(uii))cr[°’ Jl gn = .body, body, rf+1(u'*+1), ..,r i ii(u/!,))a-'[0’ j] This is, atom rf+1(uf+1) • M0,3^ will be expanded next in micro step j + 1, and g3+1 = (7£ri. body,.., Hr». body, 7 fcr.+i. b o d y ,.., r{i(u{‘))cr[M< T J+i g , 3 + 1 = (K'ri .b o d y ,.., TIL .body, K' ,+ 1 .body,.., r j‘ (u'J*) )ad°’ 3]a'3+l (33) Case 1) ti — “base” Let the rule heads of the renamed rules for rf+1(uf+1) and rf+1(u'/ ’ ) be -*+■(?) 72?s+i.head = rf+1(iT) TZ„s+i .head = r~ , Tl 1 For the sake of brevity, let 6 = w and u'*+1cr^0’ ^ = w'. By the Substitution Partition Lemma, we know the substitutions generated at micro step j + 1 in both of the histories are trivial substitutions to g3 and g'3. In particular, a 3 + 1 = mgu(v,w) = (v/w) rd+1 _ = mgu{v',w') = (v'/w ') Thus, from (33) and the construction of pJ+1, we know g3+ 1p3 + 1 ,/j+i 6Since 2/ = “base”, by the Substitution Partition Lemma, we know Uj+Icd0'Jl and u ',+ cr do not contain skolem terms. 
209 Also, by the definition of C E D , we know 0 = (v/w) 0 = (vp3 + 1 /wp3+1) = (v’ jw') = cr , 3 + 1 (34) Furtherm ore, since (v/w) does not act on the RHS of any substitution item in a and (v'/w') does not act on the RHS of any substitution item in a^°’ 3\ we know [o,il a [o,j+i] = a [o,j) y a j +1 (35) By (34) and (35), we have Joj+i] (orf0’ ^ U a3+1) 0 > pl (cr^'fi 0 1 p[°’ i+1] U cr j + 1 0 pl°’ j+1l) = (a,[0’ j] U a,3+l) = a /j+i' j[0,3+l] Case 2) ti = “aux” Let 7£rs+i.head = rf+1 [f(w),w] 7£'s+i.head = r*+l[f(w'),w'] Consider the following two cases of rf+1[uf+1]crt°’ ^ in gK Subcase 1) r / s+1[uf+1]crtM — rf+1[o, u] T hat is, the OID variable o has not been skolemized by cd0,Jl Since and g ' 3 are homomorphic by let the corresponding atom in g ' 3 be r/+1 [o,v\p3 = rf+1[o', iT ] Also, because w and w’ in the renamed rules contain only new variables, we know c t > + 1 = mgu([f(w),w],[o,v\) = {o/f(v),(w /v)} a , 3 + 1 = mgu([f(w '), to'], [o', t/]) = {o'/f(v'), (w'/v1 )} (36) From (32), if (o/f(v)) acts on gJ — (rf+1[o, u]}, then (o'/ f(v')) acts on g ' 3 — {rf+1 [o', Also, since crJI+ 1 and a ' 3 + 1 do not act on any other variables in and g' 3 , respectively, from the construction of p3 + 1 , we have = ( d 3 ~ {rf+1[“f+1]}){°//(^)}P'7 + 1 u ('ftrH-i.body){ty/u}/ 0J+ 1 = K - {'■f+1[ « T , ] } ){ ° 7 /(5 ')} U W ^ .b o d y = g'H1 By (36), each occurrence of o in the RHS of a substitution item in cd0* - 7 ! is replaced by (T J+1 into f(v) and each occurrence of o' in the RHS of a substitution item in < r/[°’ J |] is replaced by a ' 3 + 1 into f{v'). Thus, by the definition of (g > , we have < y [°.i+!] (g j /3[°.i+1 ] _ f(v)) U {w/v}) ( 8 > pl0,- 7+1l = (<7[0’ J ] ( 8 ) p[0'3+n) • (o7/(t0) u {w '/t?} = a'[0’ j] • (o'/f(v’)) U {w'/v*} = <r'[0 ’ i+1] Subcase 2) rf+1(uf+1)<d0’ ^ = rf+1[/(u ),u ] T hat is, the OID variable has been skolemized by Since g* and g ' 3 are homo morphic by p \ let rl + 1 [/(« ), = ? 
r S t+1 [f{u'),v) For the sake of brevity, let e = u ■ mgu(u, v) = v • mgu(u, v) e* — u' • mgu{u', v') = v‘ • mgu(ur, v') where e and e * contain the minimum index variables in the VECs defined by cd0,J+1l and cr^0’ J+1^ , respectively. Then, from the expansion process, we have <rJ+1 = mgu([f(w),w],[u,v\) = {w /e,u/e,v/e} — {x/x} cr/J+1 = mgu([f(w'), to'], [«', o']) = {tU'/e*, $ /(? , v'/e'} — {x/x} Finally, follow the similar argument of Subcase 1), we have gj+1 pj + 1 = (gi _ {rf+1[uf+1]} U 7^ + 1.body )erJ+1 • p3 + 1 = (a3 ~ ( rf+11^?+1]}) ‘ {w/e, v/e} ' P 3 * 1 O (7?.rs+i .body) • {w/e} • p3 + 1 = (s U - W +1 [“ f ’ ]}) ' K M * M } U -body = g‘i+I And ^ ^,[o,i+i] _ (ort0'- ? ] . {u/e, o /e) U {to/e}) ( 8) pf°b+1 l = (<d0-^ < 8 > p[°^+1 l) ■ { { T /e W e '} U W e * } = < r ,[M • {£ Z //e r, v'/e'} U {rZT/e'} = < 7/t0J+1] 211 □ L em m a 7.4.2: MGU Look Ahead Let H = WGGL,(Go,to) be an expansion history of an ALD L. For some j > i > 0, let £ = var-pure(MGU[oj](H)) \Treei(H) be the variable-pure substitution items in MGU[i+ij](H) that act on Treei(H), and let H' = Alter(H,£,i) be the altered expansion history by applying £ in advance to the goal set of level i. Then we have (a) var-pure(MGU[i+itj](H')) does not act on Treei(H') = Treei(H)£ (b) V A ; > j,T ree k(H) = Treek(H') and MGUk+ 1 (H) = MGUk+l(H') T hat is, the expansions after level j are the same for both histories. (c) V fc > j, MGU[i+hk](H) = £ • MGU[i+ltk](H') (d) Assume Treei(H') contains n OID components C i ,. . . , C n. Then, M GU[i+ul(H') = |J M G U f: § ( H ') cx T hat is, each OID component expands independently. P ro o f: £ can be viewed as a variable mapping from Tree{(H) to Tree^H*). By the Homomorphism Lemma, there exists a homomorphism {pi = £, /9 ,+i ...} between the two expansion histories after level i. T hat is, for each k > i Treek(H') = Treek(H)pk . MGU[i+ltk]{H') = MGU[i+ltk](H) 0 pm We now prove (a) by contradiction. 
Assume there exists a variable-pure substitution item $(x/y) \in MGU_{[i+1,j]}(H')$ which acts on $Tree_i(H')$. Since $Tree_i(H') = Tree_i(H)\xi$ and $x$ can be replaced only by a variable with smaller index, we know

$$x, y \in Var(Tree_i(H)) \subseteq Var(Tree_i(H')) \qquad (38)$$

Furthermore, since $(x/y) \in MGU_{[i+1,j]}(H')$, we know variable $x$ is replaced by another variable and therefore disappears between level $i$ and level $j$ in the expansion of $H'$. Assume $x$ disappears at level $d \in [i, j]$. From (37), we know $Tree_d(H)$ is homomorphic to $Tree_d(H')$. This implies that $x$ also disappears somewhere between level $i$ and level $j$ in $H$. Assume this is because $x$ is replaced by another variable $z$ in the unification process. That is, $x/z \in MGU_{[i+1,j]}(H)$. But this implies that $x/z \in \xi$. Thus,

$$x \notin Tree_i(H') = Tree_i(H)\xi$$

which contradicts (38).

It is sufficient to prove (b) by showing that

$$Tree_j(H') = Tree_j(H) \qquad (39)$$

That is, the goal sets at level $j$ of $H$ and $H'$ are identical. Then the mgu's and goal sets generated after level $j$ in the two expansion histories will also be the same. From (37), we can prove (39) by showing that

$$Tree_j(H)\rho_j = Tree_j(H)$$

Let $x \in Var(Tree_j(H))$ be an arbitrary variable. Consider the following two cases.

Case 1) $x \in Var(Tree_i(H'))$
From part (a) of this lemma, we know $var\text{-}pure(MGU_{[i+1,j]}(H'))$ does not act on $Tree_i(H')$. This implies that $x$ is the minimum index variable in a VEC w.r.t. $MGU_{[i+1,j]}(H')$. By the definition of $\rho_j$, we have

$$x\rho_j = x$$

Case 2) $x \notin Var(Tree_i(H'))$
By the Variable Inheritance Lemma, we know $x$ must be a new variable in a renamed rule. Assume $x$ first appeared at level $p \in [i+1, j]$. Since by definition the expansions of $H$ and $H'$ use identical renamed rules, we know that, in the construction of $\rho_p$, $(x \mapsto x)$ is in $\beta_p$. That is, $x$ is also a new variable that first appeared at level $p$ in $H$, and

$$x\rho_p = x$$

Assume $x\rho_j = z \neq x$.
By the construction of pj, this implies x /z € M % +1i3 ](F ') By (37) we know there exists a substitution item x/z' 6 MGU\p+ij](H) such th at x/z' — x /z ® P [itk] 213 But this implies th at x is replaced by another variable during the expansion in H. Thus, x Var(Treej{H)). Contradiction! We now prove (c). Since the renamed rules used in the expansions of H and H' are identical and Tree{(Hf) = TYee;(//)£, we know {*!(*/<) € MGU l w ,,,(//)} = {x\(x/t) < = { ■ M O T|i+1J,( f l') ) T hat is, the right sides of the two substitutions are the same. Let x be an arbitrary variable in RHS(MGU[i+itj](H)). It is sufficient to show that x • MGU[i+1j](H) = x-t-M G U v+ ijiiH ') Assume x ■ MGU[i+ij](H) = t. By the Homomorphism Lemma, we know */* ® P[i,A G MGU[t+l7j](H') (40) Since Var(t) C Var(Treej(H )) and from the proof of part (b), we know tpj - t (41) We now show that x ■ £ • M GU[i+itj](H') is also t. Consider the following two cases. Case 1) x € Var(Treei(H)) Assume £ does not act on x. From (40) and (41), we know x /t ® p[{,j] = xpi/tpj = x(/tpj = x /t € MGU[i+X j\(H') T hat is, x • £ • MGU[i+1 J i m = t Now consider the case when £ does act on x . Assume x£ = y ^ t. Since y m ust has an index less that a;’s, we know y € V ar{T rea {H )). Since by definition £ C MGU[i+u ] ( H ), we know y / t 6 MGU[i+ij](H). Again, by (40), (41) and the idem potency of £, we have y/t ® P[ij] = ypi/tpj = yi/tpj = y/t € MGU[l+ltj](H') Then x ■ ^ • MGU[i+ 1< j](H‘) = y • = t 214 Case 2) x £ Var(Treei(H )) T hat is, x£ — x. By the Variable Inheritance lemma, x m ust be a new variable in a renam ed rule. Assume variable x first appeared at level p E [* + 1, j]. Since the renam ed rules used are identical, x m ust also be a new variable appeared at level p in H. By the definition of pp, we know xpp = x From (40) and (41), we know x/t (^ ) P[i,j] --- X pp/ tpj — x/t £ GU[i^.\ij](H ) Thus, we have x - t - MGU[i+hj](H') = x . 
M G U ii+ t^H 1 ) = t To prove (d), by the Independent of Micro Step Lemma, it is sufficient to show th at for each OID components C C Treei(H'), the MGUk+i (H*) does not act on T re e f (i?7), fc E [i,j — 1] For each k E [i,j — 1], let 14 = Var(Tree^(H')) fl Var(Tree^(H')). From the Variable Overlapping Lemma and part (a) of this lemma, we know, Vk = Vi- MGU[itk](Hr) = V which contains only base variables. From the Information Transfer between OID Component Lemma, we know, for each (x/y) E M G U ^ f (H'), if (x/y) acts on Tree^(H'), then x E Vk = V, and x is a base variable. This implies that (x/y) is a variable-pure substitution items that acts on Treef (H‘), which contradicts part (a) of this lemma. □ L e m m a 7 .4 .3 : Uselessness Let H = WGGi,({r[o, w]}, “aux”) be the expansion history of a well-founded ALD 2/, where r is an interm ediate relation appearing in L. At level I > 0 let £ = va,r-pure(MGU[0j](H)) 215 (a) Let C C Treei(H) and D C Treej(H) (I > j > i) be two OID components such th at C is an ancestor of D. If there is an isomorphism p between and D£, then MGU[+l®(H) does not act on Treef(H ) (b) Let E C Treek(H), C C Treei(H), and D C Treej(H), (i < j < k < I) be three OID components such th at C is an ancestor of D D is an ancestor of E k — j > 2 * rank(o) Let o • MGU[oj](H) • MGUj+\E(H) = s , and let p and g be two isomorphisms such that D£ A C£, and sp = s E£ A D£, and sg = s If there is no answer in MGU[o,i](H), then there is no answer in MGU[o,i]{H) • MGU\E\E{H) P ro o f: Let and be the variable-pure substitutions constructed by restricting £ on Treei(H), Treej(H) and Treek(H), respectively. Let H ° — Alter(H,£i,i) H d = Alter{H,tj,j) (42) H e = Alter(H, £k, k) We prove (a) by contradiction. Assume there is a substitution item (x/y) € MGU}+l^(H) which acts on Treep(H). 
From the construction of H c and H D and the MGU Look Ahead Lemma, we know T r e e ^ ( H c ) and TreeE^(HD) expand independent of Tree^^(Hc ) and T reefa(H D) in H ° and H D, respectively. This means we can “excise” T r e ^ H 0 ) and Treeyq(HD) out of Tree[ij](Hc ) and Treeyy q(HD), respectively. Also, since p is an isomorphism between C £ and D£, by applying the Homomorphism Lemma from C£ to D / and from D£ to C£, we know there is an isomorphism between each layer in Tree®l+l](HD) and its corresponding layer in Treep{ ,_j+l^(Hc ). T hat is, Tree?+p(H c ) A Treef+p(H D), p € [0, 1 - j + 1] (43) 216 From (43) and the MGU Look Ahead Lemma (b), we have T ree f(H D) = Treef(H ) £=4 T r e e f + ^ H 0 ) (44) Then by the Micro Step Independence Lemma and the fact that Treei(HD) = Treei(H ), we know Treef+ 1 (H D) = Tree®+l(H) and Treef^(HD) = Treej>+1(H) Tree?+,_i+t(H°) (45) Let V = Var(D) fl Var(D). By the Information Transfer Lemma, (x/y) also acts on V • MGU\j+\fi(H) which contains only base variables. Then x , y £ V - MGUy+u](H) = V • var-pure(MGUy+u](H)) = V • & C V ar(D ^) Since C£ and are isomorphic (43), let x ', y' E C£ be the two variables corresponds to x,y in D£j. Since (x/y) acts on Treef+ 1(H), by (45), we know x is replaced by y in Treef+l(H) and x' is replaced by y' in Treef+[_j+ 1 (H). This implies (x'/y') E M G U $ ° i+l_J+ 1](H°) acts on Treef(H °). But this violates the MGU Look Ahead Lemma (a). Contradiction. To prove (b) by contradiction, assume there exists an answer (o'/s) E MGU[ 0 ,i](H) • MGU\+\E(H) where o' ± o. Consider the following two cases concerning where the OID variable o' appears in Tree(H). 
Case 1) o' E Vai'(Treeq(H)),q E [j, /] T hat is, (o '/i) 6 MGUu+ 1J](H) ■ MGU}$D(H) (46) Since £ can be viewed as a variable mapping from Treej(H) to Treej(HD), therefore by the Homomorphism Lemma there exists a homomorphism ...,£/+i), such that Since j > 2 * rank(o), by the Variable Inheritance Lemma (b), we know Var(o • MGU[oj](H)) contains only base variables in Var(Treej(H)). Also, by part (a) of 217 this lemma, we know M G U y ^ ^ (H D) does not act on variables in Var(Treej(HD)). Thus, by the definition of £/+i, we know Hi+i = « Then from (46) we know (o'/s) ® = (o’tq/sti+i) = (o'US) € (47) For the sake of brevity, let o = o'£g ( E Var(Treeq(H D)) and o ^ o (since o has been skolemized and therefore disappears after level 1). By the MGU Look Ahead Lemma (d), we know D£j and B£j expand indepen dently in H d up to level I and MGUU+,A(HD) = M G U ^ n(HD)U M G U ^ °n(HD) (48) where this is a disjoint union. From (47) and (48), we know (6/s) 6 (M G U ^ fi{HD) u m g v W°a (h d )) ■ MGUl?lD(HD) From part (a) of this lemma, we know MGU\+}p (HD) does not act on Treep(HD). As a result, M G U ^l^ (HD) is independent of MGUy^®q(HD). Thus, ( 6 / i ) e m g u ^ a (h d ) ■ m gu ^ 1°(h d ) u m g u ^ a (h d ) Then the following two subcases arise. Subcase 1) (d/s) € M G U ^ f ^ H 0 ) ■ MGU\+\D(Hd ) Combining the two M G f/’s, we have ( 6 / i ) € M GU$fJ+n(HD) (4 9 ) Since p can be viewed as a variable m apping — > C£, by the Homomorphism Lemma, we know there exists a homomorphism {pj,pj+1, ...} such that MGU{$Z+1_j+il(Hc ) = M G {/$ ® +1](ffD) ® m + n (50) 218 Since o • MGU[ 0,i](H) • MGU}+\E(H) = s, we know each variable in Var(s) is the minimum index m ember in a VEC. From the construction of pi+i we know spl+1 = s (51) Let o = 6 pq. Then from (49), (50), and (51), we have (6 /i) ® pyMl] = ( 6 p,/$p,+t) = (o/3) € M G U $.lw _m (H°) (52) where 0 =^0 . 
Finally, by the MGU Look Ahead Lemma (c), we have

$$MGU_{[i+1,\,i+l-j+1]}(H) = \xi_i \cdot MGU_{[i+1,\,i+l-j+1]}(H^C) \qquad (53)$$

But (52) and (53) imply that there is an answer in $MGU_{[0,l]}(H)$. Contradiction.

Subcase 2) $(\hat{o}/s) \in MGU^{\bar{D}}_{[j+1,l]}(H^D)$
Once again, by the MGU Look Ahead Lemma (c), we have

$$MGU_{[j+1,l]}(H) = \xi_j \cdot MGU_{[j+1,l]}(H^D)$$

Following an argument similar to that of Subcase 1, this implies that there is an answer in $MGU_{[j+1,l]}(H)$. Contradiction.

Case 2) $o' \in Var(Tree_{[0,j-1]}(H))$
By the Variable Inheritance Lemma (b) and the fact that $k - j > 2 * rank(o')$, we know $Var(o' \cdot MGU_{[0,k]}(H))$ contains only base variables in $Var(Tree_k(H))$. Also, by the proof of part (a) of this lemma, we know $MGU^{E}_{l+1}(H)$ does not act on base variables in $Tree_k(H)$. This implies

$$o' \cdot MGU_{[0,l]}(H) = o' \cdot MGU_{[0,l]}(H) \cdot MGU^{E}_{l+1}(H) = s$$

But this means that the same answer has already appeared in $MGU_{[0,l]}(H)$. Contradiction. □

Appendix E
Detailed Solution of Example "Requisition"

Continuing from Section 8.4.3, we now present the details of our solution to the "Requisition" example. First, in Section E.1, the ALD specifying the semantic correspondence between Requisition Subquantities and Purchasing Subquantities is presented. Then, Section E.2 describes the local experts, $Expert_A$ and $Expert_B$, of the two databases. Finally, in Section E.3, some update scenarios are presented to illustrate the interactions between the REQ and P.O. systems under this framework.

E.1 The ALD for the equivalent subschemas

Observe that in Figure 8.12 the overlapping subschemas, $A_{eq}$ and $B_{eq}$, are almost identical; i.e., they carry the same information about the "subquantities" distributed among requisitions and purchase orders. The following ALD specifies the correspondence between the requisition subquantities and purchase subquantities stored in $A_{eq}$ and $B_{eq}$, respectively.
(abs-linkage Requisition-Purchase-Order
  A-schema
  ( RSQ(Entity);
    rsq-req-no(RSQ, String);
    rsq-po-no(RSQ, String);
    rsq-item-name(RSQ, String);
    rsq-qty(RSQ, Integer);
  )
  B-schema
  ( PSQ(Entity);
    psq-req-no(PSQ, String);
    psq-po-no(PSQ, String);
    psq-item-name(PSQ, String);
    psq-qty(PSQ, Integer);
  )
  Entity-Equivalence
  ( (E1) A.RSQ ≈ B.PSQ;
  )
  A-to-B
  ( (R1) B.psq-req-no(psq, n) ← A.rsq-req-no(rsq, n), A.RSQ=PSQ[rsq, psq];
    (R2) B.psq-po-no(psq, n) ← A.rsq-po-no(rsq, n), A.RSQ=PSQ[rsq, psq];
    (R3) B.psq-item-name(psq, n) ← A.rsq-item-name(rsq, n), A.RSQ=PSQ[rsq, psq];
    (R4) B.psq-qty(psq, q) ← A.rsq-qty(rsq, q), A.RSQ=PSQ[rsq, psq];
  )
  B-to-A
  ( (S1) A.rsq-req-no(rsq, n) ← B.psq-req-no(psq, n), B.RSQ=PSQ[rsq, psq];
    (S2) A.rsq-po-no(rsq, n) ← B.psq-po-no(psq, n), B.RSQ=PSQ[rsq, psq];
    (S3) A.rsq-item-name(rsq, n) ← B.psq-item-name(psq, n), B.RSQ=PSQ[rsq, psq];
    (S4) A.rsq-qty(rsq, q) ← B.psq-qty(psq, q), B.RSQ=PSQ[rsq, psq];
  )
)

In the above ALD, rules R1, R2, R3, and R4 describe how to populate the bookkeeping information of purchase subquantities in B.PSQ from their corresponding requisition subquantities in A.RSQ. This bookkeeping information includes the requisition number, purchase order number, item name, and quantity. Similarly, rules S1, S2, S3, and S4 populate the same bookkeeping information of requisition subquantities from their corresponding purchase subquantities.

Note that once a new requisition subquantity is created in database A, the semantics of the EES entry (Section 5.5) A.RSQ = B.PSQ in the ALD will create a corresponding purchase subquantity, and vice versa. Similarly, once a requisition subquantity is deleted, its corresponding purchase subquantity will also be deleted, so that there is always a one-one correspondence between requisition subquantities in A.RSQ and purchase subquantities in B.PSQ.

E.2 Local Experts

In our design, we assume that the user at database A makes requests by updating the REQ database.
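As a concrete picture of the correspondence specified in Section E.1, the following sketch mimics the effect of rules R1 through R4 on a toy instance. It is only an illustration under stated assumptions: relations are modeled as Python sets of tuples, the one-one RSQ/PSQ correspondence is modeled as a plain dictionary `rsq_to_psq`, and the function name and sample values are hypothetical rather than part of the framework.

```python
ATTRIBUTES = ("req-no", "po-no", "item-name", "qty")


def propagate_a_to_b(a_db, rsq_to_psq):
    """Mimic rules R1-R4: copy each rsq-* tuple to the matching psq-* relation.

    a_db maps relation names such as "rsq-qty" to sets of (rsq, value) tuples.
    rsq_to_psq is the assumed one-one map from requisition subquantities
    to their corresponding purchase subquantities.
    """
    b_db = {}
    for attr in ATTRIBUTES:
        source = a_db.get("rsq-" + attr, set())
        b_db["psq-" + attr] = {
            (rsq_to_psq[rsq], value)
            for (rsq, value) in source
            if rsq in rsq_to_psq
        }
    return b_db


# A requisition subquantity that has no purchase order number yet.
a_db = {
    "rsq-req-no": {("rsq1", "R-17")},
    "rsq-item-name": {("rsq1", "bolt")},
    "rsq-qty": {("rsq1", 40)},
}
rsq_to_psq = {"rsq1": "psq1"}
b_db = propagate_a_to_b(a_db, rsq_to_psq)
```

The symmetric rules S1 through S4 would be the same loop with the roles of the two databases exchanged and the dictionary inverted.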
These updates will trigger the local expert, Expert_A, which proposes updates to the P.O. system. Also, Expert_A monitors the updates to A_eq proposed by database B and adjusts the contents of database A accordingly (e.g., adjusting A.re-contain). In database B, Expert_B’s jobs include (1) monitoring new requisition creations in database A and (2) reorganizing the purchase subquantities after the human operator has made adjustments to the P.O. database.

In the following, the local experts are presented in pseudo-code. They are designed to handle only the basic requisition/purchasing operations. These experts can be extended to accommodate more elaborate operations (e.g., so that the user may change the quantity of an existing requisition). However, such extensions are outside the scope of this thesis.

Local Expert Expert_A:

A.1 When the user adds a new requisition

    Let the new requisition entry be re, and let its quantity and item name be q and name, respectively. Assume re is contained in requisition req with requisition number reno.

    A.1.1 Create a new requisition subquantity rsq.

    A.1.2 Update the attributes of rsq as follows. Add tuples (rsq, q), (rsq, reno), and (rsq, name) to A.rsq-qty, A.rsq-req-no, and A.rsq-item-name, respectively.
          Note that, at this point, the requisition has not been assigned to a purchase order yet. Thus, attribute A.rsq-po-no of rsq is not updated. Once the update is sent to the P.O. system and a specific purchase order is assigned, this information will be added by an update sent back from the P.O. system.

    A.1.3 Link this new requisition subquantity rsq to requisition entry re by adding a tuple (re, rsq) to A.re-contain.

    Notice that, due to the creation of requisition subquantity rsq, the Bird system will create a new purchase subquantity psq in the P.O. system. Furthermore, attributes B.psq-qty, B.psq-req-no, B.psq-po-no, and B.psq-item-name of psq will be properly updated through update propagations to the P.O.
system by Bird.

A.2 When a new subquantity is created by updates proposed by B

    Let the new requisition subquantity created be rsq, and let its attribute values of A.rsq-req-no and A.rsq-item-name be rno and name, respectively.

    A.2.1 Search for the requisition req with requisition number rno in A.req-no.

    A.2.2 Search for the requisition entry re which is contained in requisition req and has item name name, i.e., (req, re) is in A.req-include and (re, name) is in A.item-name.

    A.2.3 Link this new requisition subquantity rsq to requisition entry re by adding (re, rsq) to A.re-contain.

A.3 When the user changes the quantity of an existing requisition

Local Expert Expert_B:

B.1 When a new subquantity is created by an update proposed by A

    Let the new purchase subquantity be psq.

    B.1.1 Create a “fake” purchase order po and a “fake” purchase entry pe with unique purchase order number pon and entry number pen.

    B.1.2 Add tuples (po, pe) and (pe, psq) to B.po-include and B.pe-contain, respectively.

    B.1.3 Notify the purchasing officer (either human or automated) that a fake purchase order po has been created containing a new requisition.

    B.1 corresponds directly to A.1 of Expert_A. When a new purchase subquantity is created by an update from A, Expert_B creates a “fake” purchase order/entry/subquantity hierarchy. This temporarily holds the information needed to issue a real purchase order later. Then, the purchasing officer is informed and rearranges the “fake” purchase orders, which triggers B.2 below.

    In fact, the actions in B.2 can be incorporated into B.1. This enables Expert_B to avoid making “fake” purchase orders. That is, once a new subquantity is created, Expert_B can make arrangements for the new purchase subquantity directly.

B.2 When the purchasing officer rearranges items among purchase orders

    Assume the purchasing officer wants to move q' items (out of q items) from purchase entry pe to another purchase entry pe'.
    B.2.1 Divide the set of purchase subquantities contained by pe into two sets, S1 and S2, so that the quantities in S1 and S2 sum up to q - q' and q', respectively.
          If there is no way to have two sets S1 and S2 with quantities summing up to q - q' and q', then a subquantity is split as follows. W.l.o.g., assume S1 sums up to Q > q - q'. Let x ∈ S1 be the subquantity with the biggest quantity N. Then, create a new subquantity x'. Copy the attributes of x to x' and set attribute B.psq-qty of x and x' to N - (Q - (q - q')) and (Q - (q - q')), respectively. Add x' to S2. It can be easily shown that the two sets now have exactly q - q' and q' items.

    B.2.2 Move the subquantities in S2 from pe to pe' by adding tuples {(pe', x) | x ∈ S2} to B.pe-contain and deleting tuples {(pe, x) | x ∈ S2} from B.pe-contain.

B.3 When a change to the quantity of an item is proposed by A (this corresponds directly to A.3 of Expert_A)

    Notify the purchasing officer of this change. ...

E.3 Example Scenarios

In this section two scenarios are presented to illustrate the interactions between the two local experts.

Figure E.1: Scenario of “Don wants 8 more Macs”. Annotated steps: (1) Expert_A first creates rsq4; (2) Bird creates a corresponding psq4; (3) Expert_B creates a fake P.O.#38; (4) the purchasing officer decides to merge psq4 into P.O.#2; (5) Bird sends some bookkeeping information back to REQ.

Example E.1: Assume the REQ system and the P.O. system have the instances depicted in a scenario of Figure 8.11. Now user Don at ISI wants to buy 8 Macs and updates the REQ database by creating a new requisition, REQ#3. The resulting update scenario is depicted in Figure E.1.

(1) Expert_A creates a new requisition subquantity rsq and fills in its attributes accordingly (A.1.1 - A.1.3).
(2) The production rule maintaining E1 in the ALD is fired. It creates a corresponding purchase subquantity psq in the P.O. database. Next, the production rules maintaining R1, R3, and R4 of the ALD are fired and populate attributes B.psq-req-no, B.psq-item-name, and B.psq-qty of psq in the P.O. system.

(3) In database B, Expert_B creates a fake purchase order P.O.#38 with one entry pe containing subquantity psq (B.1.1 - B.1.2). It then notifies the purchasing officer (B.1.3).

(4) Suppose, after contacting Sherry’s Mac World, the purchasing officer decides to merge the 8 Macs of psq with P.O.#2 by moving entry pe from P.O.#38 to P.O.#2.

(5) Since B.psq-po-no is derived from B.po-no, attribute B.psq-po-no of subquantity psq is changed from P.O.#38 to P.O.#2. Consequently, rule S2 of the ALD is fired. As a result, the assigned purchase order number is propagated back to the REQ system.

Finally, the two databases reach the states shown in Figure E.1. □

Figure E.2: Scenario of “Move 1 Mac from Ben’s to Sherry’s”. Annotated steps: (1) Expert_B changes the quantity of psq2 from 2 to 1, creates a new psq5 with the same attribute values as psq2, and links psq5 to P.O.#2; (2) Bird creates a new requisition subquantity rsq5 and updates the relevant bookkeeping information of rsq2 and rsq5.

Example E.2: Continuing with Example E.1, assume that due to a supply shortage, Ben’s Computer Warehouse can provide at most 1 Mac. After contacting Sherry’s Mac World for emergency supply, Expert_B decides to move 1 Mac from P.O.#1 to P.O.#2 so that P.O.#2 now contains 10 Macs. Figure E.2 depicts the resulting update scenario.
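The split that Expert_B performs in this scenario follows step B.2.1. A minimal Python sketch of that step is given here as an illustration under stated assumptions; it is not the Bird implementation. The names `Subquantity` and `split_for_move` are hypothetical, the attribute values given to psq2 are illustrative placeholders, and a greedy largest-first pass stands in for the search for an exact partition, so a subquantity is split whenever the greedy pass overshoots q - q'.

```python
from dataclasses import dataclass, replace


@dataclass
class Subquantity:
    """A purchase subquantity with the four bookkeeping attributes of PSQ."""
    name: str
    req_no: str
    po_no: str
    item_name: str
    qty: int


def split_for_move(subqs, move_qty, new_name):
    """Partition subqs into (S1, S2) so that S2 sums to move_qty (step B.2.1).

    Assumption: a greedy pass replaces the search for an exact partition.
    At most one subquantity is split; the split-off copy is named new_name.
    """
    total = sum(s.qty for s in subqs)
    if not 0 < move_qty < total:
        raise ValueError("move_qty must be between 1 and total - 1")
    keep = total - move_qty            # q - q' in the text

    s1, s2, running = [], [], 0
    # Largest first, so the excess is always smaller than the largest member of S1.
    for s in sorted(subqs, key=lambda s: s.qty, reverse=True):
        if running < keep:
            s1.append(s)
            running += s.qty           # running equals Q once the loop ends
        else:
            s2.append(s)

    excess = running - keep            # Q - (q - q')
    if excess > 0:
        x = max(s1, key=lambda s: s.qty)                # subquantity with biggest quantity N
        x_new = replace(x, name=new_name, qty=excess)   # copy the attributes of x to x'
        x.qty -= excess                                 # N - (Q - (q - q'))
        s2.append(x_new)
    return s1, s2


# Example E.2: move 1 of the 2 Macs held by psq2 (attribute values are illustrative).
psq2 = Subquantity("psq2", "REQ#2", "P.O.#1", "Mac", 2)
kept, moved = split_for_move([psq2], move_qty=1, new_name="psq5")
print([(s.name, s.qty) for s in kept], [(s.name, s.qty) for s in moved])
```

For the instance of this example, the sketch leaves psq2 with quantity 1 and returns a new psq5 with quantity 1, which corresponds to S1 = {psq2} and S2 = {psq5}. Relinking the members of S2 to the target purchase entry (step B.2.2) is left out of the sketch.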
(1) Expert_B first splits the purchase subquantities into two sets (B.2.1):

    S1 = {psq2}
    S2 = {psq5}

Note that psq5 is a new purchase subquantity created by splitting the two Macs originally in psq2. Expert_B then copies the attributes of psq2 to psq5, and sets the B.psq-qty attributes of psq2 and psq5 to 1. Finally, Expert_B links psq5 to P.O.#2 (B.2.2).

(2) The Bird system propagates the bookkeeping information to the REQ system as follows.

    - The production rule in the P.O. database maintaining E1 is fired. It creates a corresponding requisition subquantity rsq5 in the REQ database.
    - The production rules maintaining S1, S2, S3, and S4 are fired to update the attributes of rsq2 and rsq5 to reflect the changes in the P.O. system.
    - Local expert Expert_A links rsq5 to REQ#2 (A.2).

Then, the two databases reach the new states depicted in Figure E.2. Note that, as shown in the figure, there are two purchase subquantities of 1 Mac, i.e., psq2 and psq5. The local expert can easily be extended to merge them into one subquantity with a quantity of two, so that there are no redundant subquantities in the system. □

Reference List

[AB91] Serge Abiteboul and Anthony Bonner. Objects and views. In Proc. ACM SIGMOD Symp. on the Management of Data, pages 238-247, 1991.
[AH87] S. Abiteboul and R. Hull. IFO: A formal semantic database model. ACM Trans. on Database Systems, 12(4):525-565, Dec. 1987.
[AK89] Serge Abiteboul and Paris C. Kanellakis. Object identity as a query language primitive. In Proc. ACM SIGMOD Symp. on the Management of Data, pages 159-173, 1989.
[AV88a] S. Abiteboul and V. Vianu. Datalog extensions for database queries and updates. Technical Report 900, INRIA, September 1988.
[AV88b] S. Abiteboul and V. Vianu. Procedural and declarative database update languages. In Proc. ACM Symp. on Principles of Database Systems, 1988.
[Bac86] Maurice J. Bach. The design of the UNIX operating system. Prentice-Hall, Englewood Cliffs, New Jersey, 1986.
[BG91] Thierry Barsalou and Dipayan Gangopadhyay. M(DM): An open framework for interoperation of multimodel multidatabase systems. Technical Report RC 16965 (No.75244), IBM Almaden Research Division, June 1991.
[BL84] C. Batini and M. Lenzerini. A methodology for data schema integration in the entity relationship model. IEEE Transaction on Software Engineering, SE-10(6), November 1984.
[BLN86] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323-364, December 1986.
[BS81] F. Bancilhon and N. Spyratos. Update semantics of relational views. ACM Trans. on Database Systems, 6(4), December 1981.
[CH94] Ti-Pin Chang and Richard Hull. On witnesses and witness generators for object-based databases. Technical report, Computer Science Department, University of Southern California, 1994. In preparation.
[Che76] P. P. Chen. The entity-relationship model - toward a unified view of data. ACM Trans. on Database Systems, 1(1):9-36, January 1976.
[Coh86] Don Cohen. Programming by specification and annotation. In Proc. of AAAI, 1986.
[Coh87] Don Cohen. AP5 reference manual. Technical report, USC/Information Sciences Institute, 1987.
[Coh89] Don Cohen. Compiling complex database transition triggers. In Proc. ACM SIGMOD Symp. on the Management of Data, pages 225-234, 1989.
[CW90] Stefano Ceri and Jennifer Widom. Deriving production rules for constraint maintenance. In Proc. of Intl. Conf. on Very Large Data Bases, pages 566-577, 1990.
[CW91] S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. Technical Report RJ 8027 (73675), IBM Almaden Research Center, March 1991.
[CW92a] S. Ceri and J. Widom. Deriving incremental production rules for deductive data. Technical Report RJ 9071 (80884), IBM Almaden Research Center, November 1992.
[CW92b] S. Ceri and J. Widom. Managing semantic heterogeneity with production rules and persistent queues. Technical Report RJ 9064 (80754), IBM Almaden Research Center, October 1992.
[Day89] Umeshwar Dayal. Queries and views in an object-oriented data model. In Proc. of Second Intl. Workshop on Database Programming Languages, pages 80-102. Morgan Kaufmann, Los Altos, CA, June 1989.
[DB82] U. Dayal and P. A. Bernstein. On the correct translation of update operations on relational views. ACM Trans. on Database Systems, 7(3):381-416, September 1982.
[DH84] U. Dayal and H.Y. Hwang. View definition and generalization for database integration in a multidatabase system. IEEE Trans. on Software Engineering, SE-10(6):628-644, 1984.
[EK91] Frank Eliassen and Randi Karlsen. Interoperability and object identity. SIGMOD Record, 20(4):25-29, 1991.
[End72] Herbert B. Enderton. A mathematical introduction to logic. Academic Press Inc., San Diego, New York, 1972.
[Fag82] Ronald Fagin. Horn clauses and database dependencies. J. ACM, 29(4), October 1982.
[For79] Charles L. Forgy. On the Efficient Implementation of Production Systems. PhD thesis, Carnegie-Mellon University, February 1979.
[GHJ93] Shahram Ghandeharizadeh, Richard Hull, and Dean Jacobs. On implementing a language for specifying active database execution models. In Proc. of Intl. Conf. on Very Large Data Bases, 1993.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A guide to the theory of NP-Completeness. W. H. Freeman, San Francisco, 1979.
[GPZ88] G. Gottlob, P. Paolini, and R. Zicari. Properties and update semantics of consistent views. ACM Trans. on Database Systems, 13(4), December 1988.
[Han89] Eric N. Hanson. An initial report on the design of Ariel: A DBMS with an integrated production rule system. ACM SIGMOD Record, pages 12-19, September 1989.
[HJ91] Richard Hull and Dean Jacobs. Language constructs for programming active databases. In Proc. of Intl. Conf. on Very Large Data Bases, pages 455-468, 1991.
[HM81] M. Hammer and D. McLeod. Database description with SDM: A Semantic Database Model. ACM Trans. on Database Systems, 6(3):351-386, 1981.
[HW92] Eric N. Hanson and Jennifer Widom. An overview of production rules in database systems. Computer Science, October 1992.
[HWW90] Richard Hull, Surjatini Widjojo, and Dave Wile. Specificational approach to database transformation. Technical report, USC/Information Sciences Institute, February 1990.
[HY90] R. Hull and M. Yoshikawa. ILOG: Declarative Creation and Manipulation of Object Identifiers (Extended Abstract). In Proc. of Intl. Conf. on Very Large Data Bases, pages 455-468, 1990.
[Kel85] Arthur M. Keller. Algorithms for translating view updates to database updates for views involving selections, projections, and joins. In Proc. ACM Symp. on Principles of Database Systems, Portland, OR, March 1985.
[Kel86] Arthur M. Keller. Choosing a view update translator by dialog at view definition time. In Proc. of Intl. Conf. on Very Large Data Bases, pages 467-474, Kyoto, August 1986.
[Ken89] William Kent. The many forms of a single fact. In Proceedings of the IEEE Compcon Conference, Barcelona, Spain, February 1989.
[Llo87] J. W. Lloyd. Foundations of Logic Programming (Second Edition). Springer-Verlag, Berlin, 1987.
[LR89] C. Lecluse and P. Richard. The O2 database programming language. In Proc. of Intl. Conf. on Very Large Data Bases, pages 411-422, 1989.
[LRV88] Christophe Lecluse, Philippe Richard, and Fernando Velez. O2, an object-oriented data model. In Proc. ACM SIGMOD Symp. on the Management of Data, 1988.
[Mai83] David Maier. The Theory of Relational Databases. Computer Science Press, Potomac, Maryland, 1983.
[MB90] John Mylopoulos and Philip A. Bernstein. A language facility for designing database-intensive applications. In Stanley B. Zdonik and David Maier, editors, Readings in Object-oriented Database Systems. Morgan Kaufmann, 1990.
[MD89] Dennis R. McCarthy and Umeshwar Dayal. The architecture of an active database management system. In Proc. ACM SIGMOD Symp. on the Management of Data, pages 215-224, 1989.
[Nut92] Gary J. Nutt. Open systems. Prentice-Hall International, Inc, Englewood Cliffs, New Jersey, 1992.
[Shi81] D. Shipman. The functional model and the data language DAPLEX. ACM Trans. on Database Systems, 6(1):140-173, 1981.
[SJGP90] M. Stonebraker, A. Jhingran, J. Goh, and S. Potamianos. On rules, procedures, caching and views in database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1990.
[SL90] Amit P. Sheth and James A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3):183-236, September 1990.
[SLR88] T. Sellis, C-C. Lin, and L. Raschid. Implementing large production systems in a DBMS environment: Concepts and algorithms. In Proc. ACM SIGMOD Symp. on the Management of Data, 1988.
[SLW88] Amit P. Sheth, James A. Larson, and Evan Watkins. TAILOR, a tool for updating views. In Proc. of Intl. Conf. on Extending Data Base Technology, pages 190-209, 1988.
[SM89] Michael Siegel and Stuart E. Madnick. Maintaining valid schema integration in evolving heterogeneous database systems. Office Knowledge Engineering, 3(2):9-16, August 1989.
[SM91a] Michael Siegel and Stuart E. Madnick. Context interchange: Sharing the meaning of data. SIGMOD RECORD, 20(4):77-78, December 1991.
[SM91b] Michael Siegel and Stuart E. Madnick. A metadata approach to resolving semantic conflicts. In Proc. of Intl. Conf. on Very Large Data Bases, pages 133-145, 1991.
[SRH90] Michael Stonebraker, Lawrence A. Rowe, and Michael Hirohama. The implementation of POSTGRES. IEEE Transaction on Software Engineering, 2(1):125-142, March 1990.
[SRK92] Amit P. Sheth, Marek Rusinkiewicz, and George Karabatis. Using polytransactions to manage interdependent data. In Stanley B. Zdonik and David Maier, editors, Transaction Models for Advanced Database Applications, chapter 14. Morgan Kaufmann, 1992.
[Ull88] Jeffrey D. Ullman. Principles of Database and Knowledgebase Systems. Computer Science Press, Potomac, Maryland, 1988.
[WHW89] Surjatini Widjojo, Richard Hull, and Dave Wile. Distributed Information Sharing using WorldBase. IEEE Office Knowledge Engineering, 3(2):17-26, August 1989.
[WHW90] S. Widjojo, R. Hull, and D. S. Wile. A specificational approach to merging persistent object bases. In Al Dearle, Gail Shaw, and Stanley Zdonik, editors, Implementing Persistent Object Bases. Morgan Kaufmann, December 1990.
[Wid90] Surjatini Widjojo. Sharing Persistent Object-Bases in a Workstation Environment. PhD thesis, University of Southern California, June 1990.
[Wil] David S. Wile. Integrating syntaxes and their associated semantics. In IFIP TC 2 Working Conference on Constructing Program from Specifications.
[Wil87] David S. Wile. POPART, Producer Of Parsers And Related Tools system builders’ manual. Technical report, USC/Information Sciences Institute, July 1987.
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255722
Unique identifier
UC11255722
Legacy Identifier
DP22878