ON INCREMENTAL UPDATE PROPAGATION BETWEEN OBJECT-BASED DATABASES

by

Ti-Pin Chang

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science)

May 1994

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Ti-Pin (Ben) Chang under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Acknowledgements

As an advisor, his enthusiasm toward his subject and his students, and his standards for an “acceptable result”, are more than any graduate student could dream of.

I also wish to express my gratitude to Professor Dave Wile, who accepted me into the Information Sciences Institute (ISI). As the co-advisor, his guidance on the thesis and inspiration on the implementation issues helped to make the BIRD prototype possible.

I would like to thank the Information Sciences Institute, which provided the financial support and unrestricted use of state-of-the-art facilities for the first two years of my doctoral study. Among the graduate students working at the institute, few are granted the extensive freedom I was allowed in pursuing my thesis research. My thanks also go to everyone in the Software Sciences Division. In particular, many thanks to Dr. Don Cohen for his proofreading of the thesis, extensive technical support of the implementation, and unreserved academic debates (ranging from the justification of OIDs to my preference of programming language).

A special thanks goes to Professor Seymour Ginsburg, who first brought me to the world of precise thinking and mathematical reasoning. His classes transformed me from a down-to-earth engineer with little theoretical training into a Ph.D.
whose thesis contains 40 pages of formal proofs and ugly symbols.

Last but not least, I am grateful to my wife for enduring my night-owl working schedule and standing by me during the darkest days of my thesis writing. Also, her delicious cooking and non-stop logistic support enabled me to have more time struggling with the lemmas and theorems.

Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Heterogeneous Database Updates
  1.1 Problem Statement
  1.2 Examples of Semantic Correspondence
    1.2.1 Value-Based Correspondence
    1.2.2 Introduction of Object Identifiers (OIDs)
    1.2.3 Ambiguous Update
  1.3 The Proposed Solution
    1.3.1 System Architecture
    1.3.2 Scenario for Establishing Semantic Correspondence
    1.3.3 Methodology to Resolve Ambiguity
  1.4 Dissertation Structure
2 Tools Used By Bird
  2.1 AP5, the platform
    2.1.1 Data Definition Language
    2.1.2 Data Manipulation Language
    2.1.3 Active Rules
    2.1.4 Semantics of Consistency Rules
  2.2 IFO-, the Data Base Model
  2.3 ILOG, the Linkage Language
    2.3.1 Evaluation of ILOG programs
    2.3.2 Physical vs. Logical Instances
    2.3.3 Restrictions for nrecILOG-
  2.4 POPART, the Compiler
    2.4.1 BNF', the grammar description language
    2.4.2 Syntax Directed Experts
3 Solution of the Value-Based Case
  3.1 Abstract Linkage Description (ALD)
  3.2 Compilation of an ALD
  3.3 Remote Transaction Control (RemTrans)
    3.3.1 Motivation of RemTrans
    3.3.2 Managing the Communications
  3.4 Socket and RPC, the Communication Mechanism
  3.5 Linkage Implicit Constraints (LICs)
4 Uni-directional Propagation of Incremental Updates Involving OID
  4.1 Entity Equivalence Specification (EES)
  4.2 OID Association Problem
  4.3 Augmentation of Source Schema
  4.4 Compilation of ILOG Invention Rules
  4.5 Machine Dependent Object Translation (MDOT) Mechanism
    4.5.1 Translation of Machine Dependent Object
    4.5.2 Synthesis of OID Invention and MDOT
  4.6 Static vs. Dynamic Semantics in ALD
5 Bi-directional Propagation of Incremental Updates
  5.1 Preprocessing of EES in Bi-Directional Cases
  5.2 Witness Update Problem
  5.3 Definition of Witness Generator
  5.4 Logical and Physical Views of a Semantic Correspondence
  5.5 Equivalent and Mutually Recursive Classes
  5.6 Recapitulation of the Bird System
6 How to Find Witness Generators
  6.1 Overview of Automatic Generation of Witness Generators
  6.2 Removal of EES
    6.2.1 Well-foundedness of an ALD
    6.2.2 Reduction and Expansion Operations
  6.3 WGG Expansion History
  6.4 The WGG Algorithm
  6.5 The Incompleteness of the WGG Algorithm
    6.5.1 Well-founded ALD
    6.5.2 Non-Well-founded ALD
7 Formal Discussion on WGG Expansion
  7.1 Structured WGG Expansion History
  7.2 The Soundness of WGG Algorithm
  7.3 OID Component
  7.4 Decidability of Halting for WGG Expansion
  7.5 WGG halt Algorithm
8 Handling Ambiguity
  8.1 Motivating Examples
  8.2 Overview on Ambiguity
    8.2.1 Ways the Information Can Overlap
    8.2.2 Overlapping Schemas and Update Authorities
  8.3 Ambiguity Resolving Methodology (ARM)
  8.4 Ambiguous Examples Revisited
    8.4.1 Partial Overlapping Characterized by Projection
    8.4.2 Partial Overlapping Characterized by Selection
    8.4.3 More Intricate Sharing of Information
9 Related Research
  9.1 Heterogeneous Databases
    9.1.1 Federated Database Architecture
    9.1.2 Database Schema Integration
    9.1.3 WorldBase
  9.2 View Update Problem
  9.3 Schema Restructuring/Transformation Languages
  9.4 Update Propagation for Interoperating Databases
    9.4.1 Update Propagation by Transaction Processing
    9.4.2 Update Propagation by Active Database Technology
10 Conclusions and Directions for Future Research
Appendix A Syntax of ALD
Appendix B Formal Semantics of nrecILOG- and ALDs
Appendix C Proofs for the EES Removal
Appendix D Detailed Proofs for Chapter 7
Appendix E Detailed Solution of Example “Requisition”
  E.1 The ALD for the equivalent subschemas
  E.2 Local Experts
  E.3 Example Scenarios

List Of Tables

1.1 Road map of the thesis
6.1 The WGG expansion history for the “Segment Flight” example
6.2 The WGG expansion history for the “More than Ga” example
6.3 Expansion history for the “Cross Product” example
6.4 Expansion history of the “Recursive OID Creation” example
7.1 Two homomorphic expansion histories
7.2 Example homomorphism for the two expansion histories

List Of Figures

1.1 Transition between instance pairs in a semantic correspondence
1.2 Schemas and instances of Example “Segment Flight”
1.3 Architecture of the Bird system
2.1 Example IFO Schema
2.2 Example instance of an IFO schema
2.3 Example skolemize pre-instances
2.4 Resulting instances of two different mappings
3.1 A naive scenario of using “remote-execute”
3.2 Example scenario for phase transitions under RemTrans
3.3 System diagram of the Communication Subsystem
3.4 Linkage Implicit Constraints (LICs)
3.5 Dynamic behaviors of LICs
4.1 Schemas and instances for “City”
4.2 Schemas and instances of Example “Segment Flight”
4.3 The semantics of OID invention rules
4.4 Scenario for incremental OID creation
4.5 Scenarios for incremental OID destruction
4.6 MDOT snapshots for the “Segment-Flight” example
4.7 Scenarios of OID invention under MDOT mechanism
4.8 Dynamic behavior differences for “City” example
5.1 Naive compilation for bi-direction ALD involving OIDs
5.2 Scenario using the “naive” concrete linkages
5.3 Witness pair for the “Segment Flight” example
5.4 Different levels of abstraction in the BIRD system
5.5 Different entity type relationships specified by EES and invention rules
5.6 Refined system architecture for the Bird system
6.1 SLD Expansion for “Segment Flight”
6.2 Schemas of Example “Itinerary”
6.3 Schemas for the “More than Gra” example
7.1 Schemas and instances for “Enrollment”
7.2 OID components for the “Enrollment” example
7.3 Lemma OID Component Condensation
7.4 Lemma OID Component
7.5 Information transfer between OID components
7.6 Homomorphism Lemma
7.7 MGU Look Ahead Lemma
7.8 Uselessness Lemma
8.1 Schemas and instances for “Employee”
8.2 Schemas for “Requisition/Purchase Order”
8.3 Possible ways of arranging the purchase orders
8.4 Schemas and instances for “Airport”
8.5 Ambiguous Semantic Correspondence
8.6 “Update authorities under Bird’s framework”
8.7 “Schema Surgery”
8.8 Transformation of An Ambiguous Semantic Correspondence
8.9 New Architecture Incorporating Expert Systems and Bird System
8.10 The “Employee” schemas after “schema surgery”
8.11 The three scenarios shown in “Requisition” with “subquantities”
8.12 The “Requisition” schemas after “schema surgery”
E.1 Scenario of “Don wants 8 more Macs”
E.2 Scenario of “Move 1 Mac from Ben’s to Sherry’s”

Abstract

With the advent of the “information superhighway”, database interoperation is emerging as one of the most important topics in database research in the 90’s. Most previous academic and commercial work in this area (e.g., schema integration, federated databases) has focused on providing read-only access to data in diverse databases. The research in this thesis addresses a fundamentally different issue, that of incrementally propagating updates between databases that hold overlapping information.

A primary focus of the research is on the impact of object identifiers (OIDs). The presence of OIDs complicates the situation because the meaning of an OID is local to its own database, and sometimes the object classes in one database do not correspond directly to the object classes in the second database. Results presented in this thesis show that (1) in the context of uni-directional incremental update propagation involving OIDs, auxiliary witness relations are needed, and (2) for the bi-directional case, maintaining the contents of the witness relations is quite subtle; a mechanism is described in this thesis to construct new rules, called witness generators, that can be used to properly update the witness relations.

This research also develops a prototype system called BIRD (Bi-directional Incremental Revising of Data) that offers one solution to the problem of providing incremental update propagation in the presence of OIDs. BIRD uses a high-level database query language for specifying the correspondence between two databases, and uses active database technology to perform incremental update propagation.
One of the major contributions of the BIRD system is the development of the Witness Generator Generator (WGG) algorithm that constructs witness generators from a user-specified ALD. This algorithm is based on a variation of SLD-resolution. A theoretical analysis of the algorithm is also presented in this thesis. The analysis demonstrates that (a) the algorithm is sound and (b) the termination of the algorithm on a given input is decidable.

Chapter 1

Heterogeneous Database Updates

With the advent of the “information highway”, database interoperation is emerging as one of the most important topics in database research in the 90’s. Most previous academic and commercial work in this area (e.g., schema integration, federated databases) has focused on providing read-only access to data in diverse databases. The research in this thesis addresses a fundamentally different issue, that of incrementally propagating updates between databases that hold overlapping information.

This research develops a prototype system called Bird (Bi-directional Incremental Revision of Data) that offers one solution to the problem of providing incremental update propagation. Bird uses a high-level database query language for specifying the correspondence between two databases, and uses active database technology to perform incremental update propagation. Both practical and theoretical tools were used in the development of the prototype.

Perhaps the most interesting technical issue addressed by Bird stems from object-orientation. Object-oriented databases have object identifiers (OIDs) which serve to identify objects uniquely within a database. Their presence complicates the situation because (1) the meaning of an OID is local to its own database, and (2) sometimes the object classes in one database do not correspond directly to the object classes in the second database.
Bird uses an adaptation of SLD-resolution when analyzing user-specified correspondences, in order to support incremental update propagation in this context.

1.1 Problem Statement

In particular, this research concerns the following problem.

Problem Statement: Find mechanisms to:
• describe the semantic correspondence between two databases, and
• maintain that semantic correspondence incrementally.

Desiderata:
object-orientation: The two databases may use entity types, i.e., classes whose members are represented by object identifiers (OIDs).
symmetry: Each database has the authority to make autonomous updates to its own data.

The first part of the problem addressed in this research is to find a way to describe the “semantic correspondence”. Intuitively, a semantic correspondence is a set of corresponding instances of two databases. It serves as a static constraint imposed on the two databases, so that only the “acceptable” instance pairs in the semantic correspondence can co-exist in the two databases at any given time.

Definition: Semantic Correspondence. A semantic correspondence $SC_{AB}$ between two databases $A$ and $B$ is a decidable subset of $Inst(A) \times Inst(B)$, where $Inst(X)$ denotes the set of instances of database $X$. □

The second part of the problem is how to maintain that semantic correspondence dynamically in the context of incremental updates. Figure 1.1 illustrates a scenario which highlights this issue. In particular, databases $A$ and $B$ are initially in the “acceptable” instance pair $(I_2, J_1)$ in the semantic correspondence (lines between instances indicate acceptable pairs). Suppose an incremental update $\Delta_A$ changes $A$ from $I_2$ to $I_5$. The question is: how can the system translate the update $\Delta_A$ of database $A$ into an update $\Delta_B$ of database $B$, so that, after applying $\Delta_B$ to $B$, the two databases reach another “acceptable” instance pair $(I_5, J_2)$ in the semantic correspondence?
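The two parts of the problem can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the mechanism developed in this thesis: the correspondence (B stores every fact of A with its two columns swapped), the helper names `acceptable`, `apply_delta`, and `translate`, and the sample facts are all hypothetical. It shows a decidable test for "acceptable" instance pairs and an update of A being translated into an update of B without recomputing B from scratch.

```python
from typing import FrozenSet, Tuple

Fact = Tuple[str, str]
Instance = FrozenSet[Fact]
Delta = Tuple[Instance, Instance]  # (inserted facts, deleted facts)


def swap(facts: Instance) -> Instance:
    """Swap the two columns of every fact."""
    return frozenset((y, x) for (x, y) in facts)


def acceptable(inst_a: Instance, inst_b: Instance) -> bool:
    """Toy semantic correspondence (an assumption): B holds A's facts, columns swapped."""
    return swap(inst_a) == inst_b


def apply_delta(inst: Instance, delta: Delta) -> Instance:
    """Apply an incremental update: remove the deleted facts, add the inserted ones."""
    inserted, deleted = delta
    return (inst - deleted) | inserted


def translate(delta_a: Delta) -> Delta:
    """Translate an update of A into an update of B, fact by fact."""
    inserted, deleted = delta_a
    return swap(inserted), swap(deleted)


# Start in an acceptable pair (playing the role of I2 and J1).
i2: Instance = frozenset({("Ben", "123")})
j1: Instance = frozenset({("123", "Ben")})
assert acceptable(i2, j1)

# An update of A is translated and applied to B (reaching I5 and J2).
delta_a: Delta = (frozenset({("Sherry", "456")}), frozenset())
i5 = apply_delta(i2, delta_a)
j2 = apply_delta(j1, translate(delta_a))
assert acceptable(i5, j2)
```

In this toy case the translation is trivial because the mapping is one-one and value-based; the harder cases are those where OIDs are involved or where the correspondence is not one-one.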
Figure 1.1: Transition between instance pairs in a semantic correspondence (panels: possible instances of A; possible instances of B)

The basis of the research is that the semantic correspondence is specified using a high-level language, and then an automatic mechanism translates that semantic correspondence into a set of active database rules that maintain the semantic correspondence by propagating updates incrementally.

For the remainder of the chapter, in Section 1.2, several examples are presented to illustrate different aspects of semantic correspondence. Then Section 1.3 presents an overview of the Bird system. Finally Section 1.4 draws a road map for the bulk of the research.

1.2 Examples of Semantic Correspondence

The solution provided in this thesis focuses only on the family of semantic correspondences that can be expressed by two nrecILOG- programs (Section 2.3) specifying the instance mappings from database A to database B and vice versa. That is, the semantic correspondence $SC_{AB}$ specified by two nrecILOG- programs $P_{A \to B}$ and $P_{B \to A}$ is

$SC_{AB} = \{(I, J) \mid P_{A \to B}(I) = J \wedge P_{B \to A}(J) = I\}$

The following subsections each present an example illustrating a different aspect of the problem. Within each example an “acceptable instance pair” is presented according to the application requirement. In the following discussion we use the terms “instance” and “state” of a database interchangeably.

1.2.1 Value-Based Correspondence

We first consider a very simple example dealing with two different views of the same information describing the telephone number, employee, and office assignments in databases A and B respectively. It illustrates “The many forms of a single fact”[Ken89] or the semantic relativism[HM81] which is commonly found in a heterogeneous database environment.

Example 1.1: “Office”
In database A the data is recorded in two relations.
person-no(p, no) records that a person p has a telephone number no, and no-office(no, off) records that a telephone with number no is in office off. In database B the data is stored in relation in-office(p, off), recording that a person p is in office off, and in relation office-no(off, no), recording that an office off has a telephone number no. For the information kept in both databases, it is assumed that there is a one-one correspondence between a telephone number and an office. The following is a possible pair of corresponding instances for A and B.

A.person-no:
    Name     Telephone
    Ben      123
    Sherry   456

A.no-office:
    Telephone   Office
    123         SAL250
    456         SAL130

B.in-office:
    Name     Office
    Ben      SAL250
    Sherry   SAL130

B.office-no:
    Office   Telephone
    SAL250   123
    SAL130   456

The application further requires that the two views be materialized, i.e., the data must be represented explicitly, because they are queried quite often at both sites, and communication between the two sites is not very reliable. □

As depicted in this example, the correspondence between these two schemas is fairly straightforward. In fact, the instances of B can be expressed by the following nrecILOG- program.

    in-office(person,office) :- person-no(person,no), no-office(no,office);
    office-no(office,no) :- no-office(no,office);

Similarly, the instances of A can be expressed in terms of B by the following program.

    person-no(person,no) :- in-office(person,office), office-no(office,no);
    no-office(no,office) :- office-no(office,no);

As shown above, in the value-based case, a nrecILOG- program looks very much like a Datalog program or a conjunctive query, and can be used to express the semantic correspondence between two databases.

1.2.2 Introduction of Object Identifiers (OIDs)

The semantic correspondence is often not as straightforward as in the previous example. In particular, “Object Identifiers” (OIDs) may be used within one or both databases.
These OIDs are normally inaccessible or meaningless outside the database in which they exist. This usually makes the specification and maintenance of the semantic correspondence more difficult. For instance, a graphic interface may be used to visually depict the conceptual objects stored in an underlying application database. Typically a graphical interface also maintains an internal database that holds information about the things it is representing. In this case there is a natural correspondence between the objects stored in the application database and their graphic representations.

To demonstrate how the correspondence descriptions are expressed in the presence of OIDs, consider the following example:

Example 1.2: “Segment Flight”
Figure 1.2 shows the schemas for databases A and B which hold equivalent but structurally different information about airline flights. These are depicted using the IFO- database model (Section 2.2), which can be viewed as the structural portion of an object-oriented data model (i.e., no methods).

Figure 1.2: Schemas and instances of Example “Segment Flight”
Database A: entity type Segment with attributes has-node (node) and segment-info (time, price).
Database B: entity type Flight with attributes flight-info (city, time) and price.

A.has-node:
    Segment   node
    s1        S.F.
    s2        N.Y.

A.segment-info:
    Segment   time    price
    s1        10:00   65
    s1        14:00   80
    s2        09:00   150

B.flight-info:
    Flight   city   time
    f1       S.F.   10:00
    f2       S.F.   14:00
    f3       N.Y.   09:00

B.price:
    Flight   price
    f1       65
    f2       80
    f3       150

In the B database, the entity type Flight holds OIDs for flights. It has two attributes: price, recording the prices of the flights, and flight-info, indicating the destination (see footnote 1) and time of the flights. On the other hand, the A database models the same information within the framework of a graph representation, where each OID in Segment corresponds to a possible (single-stop) route between L.A. and a destination node (city).
As shown in the above figure, even if there are two flights $f_1$ and $f_2$ to San Francisco, there is only one segment OID $s_1$ corresponding to these flights. The attribute segment-info of Segment records the corresponding times and prices of the flights. □

Footnote 1: To simplify the example, we assume that all flights start from L.A.

The main difference between nrecILOG- and Datalog is the ability to specify OID creation in the program. The following nrecILOG- program specifies the instance mapping from A to B for the “Segment Flight” example.

    i-flight[*,s,t] :- segment-info(s,t,p);                  // S1
    Flight(f) :- i-flight[f,s,t];                             // S2
    flight-info(f,c,t) :- i-flight[f,s,t], has-node(s,c);     // S3
    price(f,p) :- i-flight[f,s,t], segment-info(s,t,p);       // S4

In the above nrecILOG- program, the OID invention is specified by the “invention rule” S1 with an “invention predicate” i-flight appearing in the rule head. Intuitively, the rule specifies that for each distinct (s,t) pair in segment-info, an OID is created. The “*” sign in the rule head indicates that an OID will be invented with the witness of a distinct (s,t) pair. Notice that the particular p in the segment-info relation is irrelevant to the OID creation. With the OIDs invented and stored in i-flight, rule S2 populates the OIDs into entity type Flight.

Similarly, the following nrecILOG- program specifies the instance mapping from B to A.

    i-segment[*,c] :- flight-info(f,c,t);                                     // R1
    Segment(s) :- i-segment[s,c];                                             // R2
    has-node(s,c) :- i-segment[s,c];                                          // R3
    segment-info(s,t,p) :- i-segment[s,c], flight-info(f,c,t), price(f,p);    // R4

An interesting issue that arises with OIDs is that the value of an OID in a local database may change from one session to the other[EK91]. As a result, a second database will not know the value change of the OID. In order to provide a “stronger notion of identity” at a more global level, the application must provide a Machine Dependent Object Translation mechanism to translate a local OID value into an immutable OID so that this OID can be shared between the databases A and B.

1.2.3 Ambiguous Update

In the examples presented in the previous subsections, the two databases hold exactly the same amount of information. However, in reality, databases may hold information that only partially overlaps. With the presence of non-equivalent information, there is ambiguity in how an update from one database should be propagated to the other.

The next example highlights one form of partial overlapping, in which one database overlaps on the selection portion of the other database.

Example 1.3: “Football”
This example concerns the situation where the provost of USC has a database (A) which records student information university-wide, and the football team coach has a database (B) about the team members. The student information in the provost’s database is kept in the following two relations

    student:   Name   Team Member?
    grade:     Name   GPA

And the coach has a database keeping information about his team members in the following relations

    team:         Name
    assignment:   Name   position

Statically speaking, given an instance of the coach’s database, there can be infinitely many provost’s database instances associated with it, as long as the player parts of student information in the two database instances match up.

Let us now consider the dynamic behaviors of the two databases after encountering some incremental updates. Suppose students “John” and “Mary” are in the provost’s database with the following instance.

student:
    Name   Team Member?
    John   yes
    Mary   no

grade:
    Name   GPA
    John   2.45
    Mary   3.95

And the corresponding player information in the coach’s database is

team:
    Name
    John

assignment:
    Name   position
    John   Quarter Back

Now what happens if the coach decides to kick “John” off the team? What is the corresponding update to the USC student database? Following are some of the possibilities:

1) Delete “John” from the student relation, i.e., DELETE student("John", yes).
2) Modify the student relation, i.e., replace student("John", yes) by student("John", no).
3) Kick “John” out of USC (do 1) if his GPA is lower than 2.5, otherwise leave him in (do 2).

The reason we may have different policies is that the correspondence between the instances of A and B is not one-one, which is the essence of the ambiguity.

Another interesting policy issue involves the ability to refuse to allow a transaction to occur in the originating database. Consider what happens if the provost wants to drop student “John” because of his low GPA. The following is one of the possible corresponding updates to the coach’s database.

4) Depending on which position “John” plays, the coach can overrule the provost’s decision; in this case quarterback may be too important a position to ignore, so the almighty coach simply rejects the provost’s decision and aborts the transaction. □

From the possibilities above we can see that the corresponding updates may be ambiguous and may even depend on the policy at the remote database, which could change at any time.

1.3 The Proposed Solution

In this section we present an overview of the Bird system, an overall solution to support incremental updates between two object-based databases holding overlapping information. We begin by stating five key design issues that influenced the development of Bird.
(1) The system should provide a high level language to describe the semantic correspondence, so that the DBA can describe it abstractly without worrying about the low level details.

(2) A compiler is needed to translate the high level language description into an underlying concrete procedural linkage that incrementally maintains the semantic correspondence.

(3) As mentioned before, local OIDs may be dynamic and are only meaningful within the domain of the local database. Therefore the solution must provide a mechanism to encode each local OID into an immutable OID.

(4) The general solution needs to provide an architecture and a run time environment to orchestrate the incremental updates and interactions between the databases.

(5) In the case of ambiguous semantic correspondence, a methodology is needed to resolve the ambiguity in the semantic correspondence between the two databases, so that BIRD can be applied in the solution.

Figure 1.3: Architecture of the BIRD system. Each of the two sites consists of a RemTrans mechanism, an AP5 active DBS, an MDOT mechanism and a communication subsystem; the ALD is fed to the ALD compiler.

1.3.1 System Architecture

The architecture of BIRD is depicted in Figure 1.3. The architecture is symmetric and allows both databases to initiate incremental updates. Its design follows the principle of layered architecture [Nut92] to minimize complexity. The architecture of BIRD consists of the following components.

ALD Compiler: To describe the underlying semantic correspondence to the BIRD system, the DBA first writes a file called the Abstract Linkage Description (ALD). The ALD specifies the abstract linkage by using rules in nrecILOG⁻, a variant of non-recursive Datalog extended to support OID invention (see Section 2.3).
The ALD is then fed into the ALD compiler, which is implemented with the Syntax Directed Experts of POPART (see Section 2.4) using program transformation techniques. The compiler translates the abstract linkages described in the ALD into two sets of concrete linkages, $E_A$ and $E_B$. Each concrete linkage consists of some auxiliary schema definitions and a set of production rules, both expressed in the AP5 database language (see Section 2.1). These rules react to changes reported by one database in order to update the other incrementally.

RemTrans Mechanism: The RemTrans mechanism is responsible for remote transaction management between the two databases. It shields the user from the system-level heterogeneity of the underlying active databases. It also serves as the interface through which user-issued incremental updates reach the BIRD system. The functions of the RemTrans mechanism are discussed in Section 3.3.

AP5 Active Database System (ADBS): The actual database operations and the production rule firings are performed by the AP5 active database system. As shown in the architecture, the concrete linkages $E_A$ and $E_B$ interact with their host AP5 ADBSs and translate the user-issued updates to the other database incrementally.

Machine Dependent Object Translation Mechanism (MDOT): The Machine Dependent Object Translation (MDOT) mechanism serves as a filter between the AP5 database system and the communication subsystem. It encodes each outgoing OID into an immutable OID, and decodes each incoming immutable OID into its local value, so that local OIDs can be shared among databases. MDOT is discussed in Section 4.5.

Communication Subsystem: To simplify the communication between databases, the Communication Subsystem implements message passing through the socket mechanism of the UNIX system.
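The role of the MDOT filter can be pictured with a small sketch. It is written in Python purely for illustration (BIRD itself is built on AP5 and Common Lisp) and assumes, hypothetically, that an immutable OID is a pair of site name and serial number. The class `Mdot` and its methods are invented for this sketch; the actual encoding is the subject of Section 4.5.

```python
import itertools


class Mdot:
    """Hypothetical OID translation table for one site (illustration only)."""

    def __init__(self, site):
        self.site = site
        self._serial = itertools.count(1)
        self._to_global = {}  # local OID -> immutable OID
        self._to_local = {}   # immutable OID -> local OID

    def encode(self, local_oid):
        """Map an outgoing local OID to its immutable OID, allocating one if needed."""
        if local_oid not in self._to_global:
            gid = (self.site, next(self._serial))
            self._to_global[local_oid] = gid
            self._to_local[gid] = local_oid
        return self._to_global[local_oid]

    def decode(self, immutable_oid, make_local):
        """Map an incoming immutable OID to a local OID.

        make_local is a callback that creates a fresh local object the first
        time an immutable OID is seen at this site.
        """
        if immutable_oid not in self._to_local:
            local_oid = make_local()
            self._to_local[immutable_oid] = local_oid
            self._to_global[local_oid] = immutable_oid
        return self._to_local[immutable_oid]

    def rebind(self, old_local, new_local):
        """Record that a local OID changed; the immutable OID stays the same."""
        gid = self._to_global.pop(old_local)
        self._to_global[new_local] = gid
        self._to_local[gid] = new_local
```

In this sketch an object created at site A keeps one immutable OID even if its local OID changes, and site B sees the same immutable OID when it sends the object back.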
1.3.2 Scenario for Establishing Semantic Correspondence

The following scenario is used to establish a connection between two databases using the BIRD system:

Step 1) Construct ALD: The DBA first determines the semantic correspondence defined by the application, and then specifies the semantic correspondence in the form of an ALD, L.

Step 2) Compile ALD: The ALD, L, is fed into the BIRD compiler. The compiler generates two sets of concrete linkages, $E_A$ and $E_B$, for A and B, respectively.

Step 3) Load: The concrete linkage $E_A$ is then loaded into database A and $E_B$ is loaded into database B.

Step 4) Run: After loading, databases A and B use $E_A$ and $E_B$, respectively, to interact with the underlying BIRD architecture and maintain the semantic correspondence incrementally.

In the above, all but Step 1 are automatically carried out by BIRD.

1.3.3 Methodology to Resolve Ambiguity

Intuitively speaking, the approach is to separate the ambiguity from the inter-database communication, by isolating and/or creating subschemas of databases A and B such that the subschemas hold equivalent information. This permits BIRD to be applied to them in order to maintain OID creation/deletion between the two databases, and leaves the problem of updating the remaining parts of A and B to the local databases. As will be seen, creation of the subschemas may be the result of isolating portions of the original schemas and/or augmenting the original schemas with derived data.

1.4 Dissertation Structure

In the body of this dissertation, we consider progressively more difficult cases of incremental update propagation.
To understand the progression, we categorize the problem domain along three dimensions: (a) whether the linkage established is uni-directional or bi-directional, (b) the presence or absence of OIDs in the overlapping data, and (c) whether the two databases hold equivalent information or non-equivalent overlapping information (i.e., whether the semantic correspondence is one-one or ambiguous). Using “Football” as an example, dimension (a) concerns whether the linkage is only for a one-way translation of incremental updates from the provost's database to the coach's database, or a two-way translation of incremental updates from both databases. In the former case, only the provost has the right to modify the student data, and the (team membership portion of the) coach's database serves as a virtual view. In the latter case, both the provost and the coach can update their databases and the changes are automatically mapped to the other side.

                   Uni-directional    Bi-directional
    Value-Based    Chapter 3          Chapter 3
    OID-Based      Chapter 4          Chapters 5, 6, 7

    Table 1.1: Road map of the thesis

Stemming from dimensions (a) and (b), there are four possible combinations of solutions. As indicated in Table 1.1 above, they are discussed from Chapter 3 to Chapter 7 in the context of one-one semantic correspondences.

Stemming from dimension (c), Chapter 3 to Chapter 7 discuss the case when the two databases hold equivalent information, i.e., one database is expressible by the other database. The case when the two databases hold non-equivalent information is discussed in Chapter 8.

Chapter 3 covers both the uni-directional and bi-directional solutions for the value-based cases. The abstract linkage description (ALD) is introduced in Section 3.1. Then Section 3.2 presents the compilation of ILOG non-invention rules. Finally, the RemTrans subsystem is discussed in Section 3.3.
Chapter 4 presents the solutions for establishing uni-directional linkage involving OID creation. In particular, the Entity Equivalence Specification (EES) in an ALD, which specifies an equivalence relationship between two entity types in the two databases, is first introduced in Section 4.1. Then the chapter discusses the OID association problem (Section 4.2), which motivates the need to augment the schema of A by adding the witness relations (Section 4.3), the compilation of ILOG invention rules (Section 4.4), the MDOT mechanism (Section 4.5), and static and dynamic semantics in ALD (Section 4.6).

The bi-directional propagation of incremental updates is discussed in Chapter 5. Section 5.1 discusses the preprocessing of the EES in the bi-directional case. Section 5.2 highlights the witness update problem, a problem that arises in the bi-directional case when updates in the local database need to be propagated to the witness relations stored in the remote database. This motivates the need for the witness generators (Section 5.3), the nrecILOG⁻ rules derived from the user-specified ALD. In Section 5.4, the logical and physical perspectives of semantic correspondences are discussed. Section 5.5 presents the justification for using EES instead of nrecILOG⁻ invention rules to specify the equivalence relationship between two entity types. Finally, Section 5.6 summarizes the development and major components of the BIRD system.

The automatic generation of witness generators from an ALD is a non-trivial task. One of the main contributions of this thesis is the development of the Witness Generator Generator (WGG) algorithm, which automatically generates witness generators from a user-given ALD. The WGG algorithm is presented in Chapter 6. However, not every execution of WGG halts. Chapter 7 focuses on the formal discussion of the WGG algorithm and demonstrates that: (1) the WGG algorithm is sound, and (2) the termination of a WGG execution is decidable.
Chapter 2 gives a brief description of all the tools used in building BIRD. In Chapter 9 we examine research related to our problem. Finally, Chapter 10 discusses some directions for future research.

Chapter 2

Tools Used By BIRD

This chapter briefly introduces the tools used to implement the BIRD system. It serves as a reference for later discussions of the various components of the BIRD system.

2.1 AP5, the platform

The current implementation of BIRD is built on top of the AP5 active database system, an extension to Common Lisp developed at USC/Information Sciences Institute. AP5 provides the rule-based “engine” of BIRD, which propagates updates between two databases. This section describes only the features of AP5 that are relevant to the BIRD system. For more details of AP5, the reader should refer to [Coh86, Coh87].

2.1.1 Data Definition Language

AP5 supports a data model similar to the Entity-Relationship model [Che76]. Data in AP5 are stored in relations as in the relational model. A type in AP5 is a unary relation holding a set of objects. OIDs can be generated by calling the Make-DBObject macro.

AP5 distinguishes three kinds of relations: transition relations, derived relations and computed relations. The population of a transition relation is determined explicitly by insertion operations (e.g. ++ R(1,5)) and deletion operations (e.g. -- R(1,5)), and implicitly by rule firings. The contents of a derived relation are defined by a computation based on other existing relations. A computed relation is a relation defined by a Lisp computation that does not depend on the contents of the database. The following three relation declarations are examples of a transition relation, a derived relation, and a computed relation respectively.

    (DEFRELATION room-mate :arity 2 :types (person person))

    (DEFRELATION class-mate :arity 2 :types (person person)
      :DEFINITION ((x y) S.T.
                   (E (class) (AND (in-class x class) (in-class y class)))))

    (DEFRELATION even-number-list :arity 1 :types (integer)
      :COMPUTATION (CODED ((x) S.T. (AND (listp x) (every #'evenp x)))))

In the above, room-mate is a transition relation with arity two. class-mate is derived from an AP5 query defined in the :DEFINITION part of the relation definition. Neither the derived nor the computed relations in AP5 (class-mate and even-number-list) physically hold values. A computed relation in AP5 behaves more like a predicate; even-number-list(x) is true if and only if x is a list containing only even integers.

2.1.2 Data Manipulation Language

In AP5, the conditions of the database can be tested by well-formed formulas (wffs). Wffs are expressions built from primitive relation predicates, logic operators (NOT, AND, OR, IMPLIES, EQUIV, XOR), quantifiers (A and E represent $\forall$ and $\exists$), variables and Lisp expressions. For example, the AP5 syntax

    (A (x) (IMPLIES (good-student x) (high-GPA x)))

is equivalent to the following first order logic sentence [End72]

\[ \forall x \, (\text{good-student}(x) \rightarrow \text{high-GPA}(x)) \]

and the AP5 syntax

    (E (x y person) (AND (office person x)
                         (office person y)
                         (NOT (eql x y))))

is equivalent to

\[ \exists x, y, p \, (\text{office}(p, x) \wedge \text{office}(p, y) \wedge x \neq y) \]

A description is an expression of the form (vars S.T. wff), where vars is a list of free variables in wff. AP5 queries are formed by combining various query operators and descriptions. The following are some examples of AP5 queries.

    (?? (E (s) S.T. (AND (GPA s g) (> g 3.75))))
    ; return TRUE if there is a student with GPA higher than 3.75,
    ; return FALSE otherwise

    (LISTOF (s) S.T. (AND (GPA s g) (> g 3.75)))
    ; return a list of high GPA students

    (THEONLY (w) S.T. (has-wife 'Bill w) IFMANY (report-scandal 'Bill))
    ; return the only wife of 'Bill, if many found then call report-scandal

The data can be updated by the insertion operator ++ and the deletion operator --. Updates can be grouped into an atomic transaction, which delays the rule firings until all updates in the atomic transaction have been completed. The following atomic transaction inserts a tuple into office and deletes a student from student.

    (atomic (++ office 'John 'SAL200)
            (-- student 'Marry))

2.1.3 Active Rules

There are two kinds of rules in AP5: consistency rules and automation rules. To support efficient rule triggering, AP5 compiles active rules into a network similar to that of the RETE algorithm [For79] commonly used for implementing expert systems [Coh89]. Since only the consistency rules are used in the BIRD system, AP5 automation rules are not discussed here.

In AP5, the firing of consistency rules occurs as part of an atomic transaction. Consistency rules guarantee that certain conditions are satisfied by every state of the database. Each rule expresses one such condition, typically in the form of a wff. Every atomic transition must satisfy every such condition (or be altered to satisfy them all). Otherwise it aborts.

A rule may include, in addition to the condition, a consistency restoration program. The consistency restoration program can read both the original state and the invalid state which it must try to repair. That is, temporal references can be used to monitor the changes in the current database state. For example, (PREVIOUSLY wff) refers to the database state before the atomic transition started. (START wff) is the same as (AND wff (NOT (PREVIOUSLY wff))), i.e. it tests whether the wff has become true since the beginning of the transaction. And (CHANGE wff) is shorthand for (NOT (eql wff (PREVIOUSLY wff))).
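The three temporal references can be mimicked outside AP5 with a few lines of Python. This is only a sketch of their meaning: it assumes a database state is a set of tuples and a wff is a Python predicate over a state, and the function names are illustrative stand-ins rather than AP5 operators.

```python
def previously(wff, old, new):
    """Value of the wff in the state before the atomic transition started."""
    return wff(old)


def start(wff, old, new):
    """True when the wff holds now but did not hold before."""
    return wff(new) and not wff(old)


def change(wff, old, new):
    """True when the truth value of the wff differs between the two states."""
    return wff(new) != wff(old)


# Example states: John already had office SAL200 and is now also given SAL300.
old_state = {("has-office", "John", "SAL200")}
new_state = old_state | {("has-office", "John", "SAL300")}


def john_in_sal200(state):
    return ("has-office", "John", "SAL200") in state


def john_in_sal300(state):
    return ("has-office", "John", "SAL300") in state
```

Here only the SAL300 assignment has "started": the SAL200 assignment holds in both states, so neither `start` nor `change` reports it.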
The following is an example of an AP5 consistency checker, a special form of consistency rule used by the BIRD system. The consistency checker demands that for any person there can be at most one office assignment. The repair action is a Lisp function repair-for-multiple-office-assignment which simply reports the error and aborts the transaction.

    (DEFCONSISTENCYCHECKER only-one-office-per-person
      ((p o) S.T. (AND (START (has-office p o))
                       (E (o2) S.T. (AND (has-office p o2)
                                         (NOT (eql o o2))))))
      repair-for-multiple-office-assignment)

    (DEFUN repair-for-multiple-office-assignment (p o)
      (ERROR-MESSAGE "Person ~A has multiple office assignments ~A" p o)
      (ABORT-TRANSACTION))

2.1.4 Semantics of Consistency Rules

The semantics of consistency rule application is based on accumulation [HJ91]; the final delta committed as the result of an atomic transaction includes (1) the initial updates in the atomic transaction and (2) the accumulation of updates proposed by consistency rule firings.

Rule firings in AP5 are organized into consistency cycles. Initially, the updates in the atomic transaction are put into a delta list, then the rule firing mechanism starts the first consistency cycle. Conceptually, in each consistency cycle the consistency rules fire concurrently, based on the initial state of the database and the delta list accumulated from previous cycles. The updates of the triggered rules are then appended to the delta list. This completes one consistency cycle, and the database is ready to start the next one. Inconsistent deltas, i.e. deltas containing both the insertion and the deletion of the same tuple, are not allowed in AP5. Once an inconsistent delta is found in the delta list, the transaction is aborted. The database iterates through consistency cycles until no updates are added to the accumulated delta.
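The accumulation semantics can be sketched as a fixpoint loop. The Python below is a simplified model and not AP5's rule engine: a delta is assumed to be a set of ("+", tuple) and ("-", tuple) entries, each rule is assumed to be a function from the initial state and the accumulated delta to a set of proposed updates, and `max_cycles` is a guard added only for this sketch.

```python
class Abort(Exception):
    """Raised when the atomic transaction must be aborted."""


def inconsistent(delta):
    # A delta is inconsistent if it both inserts and deletes the same tuple.
    return any(("-", t) in delta for (op, t) in delta if op == "+")


def run_transaction(state, initial_updates, rules, max_cycles=100):
    """Return the accumulated delta to commit, or raise Abort."""
    delta = set(initial_updates)
    for _ in range(max_cycles):
        if inconsistent(delta):
            raise Abort("insertion and deletion of the same tuple")
        proposed = set()
        for rule in rules:  # conceptually, all rules fire concurrently
            proposed |= rule(state, delta)
        if proposed <= delta:  # nothing new was proposed in this cycle
            return delta
        delta |= proposed
    raise Abort("no quiescent consistency cycle reached")


def apply_delta(state, delta):
    """Commit a delta: remove deleted tuples, then add inserted ones."""
    inserts = {t for (op, t) in delta if op == "+"}
    deletes = {t for (op, t) in delta if op == "-"}
    return (state - deletes) | inserts


# Hypothetical rule: deleting a student also deletes that student's grades.
def cascade_grade(state, delta):
    gone = {t[1] for (op, t) in delta if op == "-" and t[0] == "student"}
    return {("-", t) for t in state if t[0] == "grade" and t[1] in gone}
```

With the rule `cascade_grade`, a transaction that deletes a student tuple settles after two cycles, and its accumulated delta also deletes the matching grade tuple.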
The accumulated delta is committed to the database only if no more rules fire in the last consistency cycle; otherwise the atomic transaction is aborted.

2.2 IFO⁻, the Data Base Model

The data model used in this thesis is based on IFO⁻ [AH87]. It is a simple subset of the IFO database model, a formally defined data model containing the fundamental structural components of many semantic data models. IFO⁻ is slightly richer than the Entity-Relationship model, and is easily represented by AP5. Every IFO⁻ schema can be described by a directed graph with various types of vertices and edges.

Figure 2.1: Example IFO schema. The abstract types are Enroll and Course; the ⊗-vertices are enroll-student (over Enroll and STRING, the student name) and enroll-course (over Enroll and Course); the attributes course-no and book-used lead from Course to STRING.

There are two kinds of atomic types in IFO⁻. The first is called printable and is represented by a square node with the name of the type in it. The second is called abstract and is represented by a diamond-shaped box; it corresponds to the non-printable abstract objects with no underlying structure (in other words, to OIDs).

Figure 2.2: Example instance of an IFO schema, shown as tables for the relations Enroll, enroll-student, enroll-course, Course, course-no and book-used.

For example, Enroll and Course in Figure 2.1 are abstract types, while the square boxes with STRING in them are printables. To construct a complex type out of existing ones in IFO⁻, the ⊗-vertex can be used to form a new type from the Cartesian product of other existing types. For example, the ⊗-vertices labeled enroll-student and enroll-course in Figure 2.1 are two new types: the Cartesian product of Enroll and STRING, and the Cartesian product of Enroll and Course respectively.
Also, the attribute edges can be used to represent functional relationships as in the Functional Data Model [Shi81]. Attributes can be single-valued, represented using arrows, or multi-valued, represented using double-headed arrows. In Figure 2.1, course-no is a single-valued attribute from Course to STRING and book-used is a multi-valued attribute from Course to STRING.

In the current implementation of BIRD, the IFO⁻ data model is used only for the conceptual schema design. Also, ISA relationships, while supported in IFO⁻, will not be permitted in the bulk of this thesis.

There exists a simple and natural mapping from an IFO⁻ schema to a set of AP5 relations plus some functional and inclusion dependencies [Mai83]. For example, the schema in Figure 2.1 can be implemented using the relations shown in Figure 2.2 and the following dependencies:

\[
\begin{aligned}
\text{course-no}[1] &\subseteq \text{Course}[1] \\
\text{enroll-student}[1] &\subseteq \text{Enroll}[1] \\
\text{enroll-course}[1] &\subseteq \text{Enroll}[1] \\
\text{enroll-course}[2] &\subseteq \text{Course}[1] \\
\text{course-no} &: [1] \rightarrow [2]
\end{aligned}
\]

An example instance is also shown in Figure 2.2, which describes the enrollment information of a student Ben registered in two courses: CS101 and EE101.

2.3 ILOG, the Linkage Language

The ILOG language [HWW90, HY90] is a declarative language in the style of Datalog, modified to be used for querying, schema translation, and schema augmentation in the context of object-based data models. Derived from the ILOG language, nrecILOG⁻ does not have unions (i.e., more than one rule defining a single relation) or negations, and does not allow recursion in the rules.

In the BIRD system, a semantic correspondence is described by two nrecILOG⁻ programs; each describes the schema translation from one database to the other. To see how the schema translation is specified using nrecILOG⁻, consider the schemas of Example 1.2 shown in Figure 1.2. The following nrecILOG⁻ program specifies the translation from A to B.
(Our syntax differs slightly from that of [HY90].)

    i-flight[*,s,t]      :- A.segment-info(s,t,p);                    // S1
    B.Flight(f)          :- i-flight[f,s,t];                          // S2
    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);         // S3
    B.price(f,p)         :- i-flight[f,s,t], A.segment-info(s,t,p);   // S4

Notice that, for the sake of clarity, we prefix the predicates in B and A with B. and A. respectively.

The above “program” looks very much like a Datalog program except for rule S1, called the OID invention rule. The rule head of S1, i-flight, is the intermediate relation that serves as “scratch paper” to record the association between the invented OIDs and their witness values in the source database. The intermediate relations in a program are not part of the source schema or the target schema. According to the original semantics presented in [HY90], they are not included in the output of the translation process. We will see in Section 4.3 that the BIRD system augments the source schema with these intermediate relations to support incremental update propagation involving OIDs.

    i-flight:       Flight        Segment   time
                    f(s1,10:00)   s1        10:00
                    f(s1,14:00)   s1        14:00
                    f(s2,09:00)   s2        09:00

    B.Flight:       Flight
                    f(s1,10:00)
                    f(s1,14:00)
                    f(s2,09:00)

    B.flight-info:  Flight        city      time
                    f(s1,10:00)   S.F.      10:00
                    f(s1,14:00)   S.F.      14:00
                    f(s2,09:00)   N.Y.      09:00

    B.price:        Flight        amount
                    f(s1,10:00)   65
                    f(s1,14:00)   80
                    f(s2,09:00)   150

    Figure 2.3: Example skolemized pre-instances

2.3.1 Evaluation of ILOG programs

The formal semantics of ILOG programs are presented in Appendix B. To see how an instance of the source schema is translated into an instance of the target schema, we use the instance of the A database depicted in Figure 1.2 as the source instance and follow the three-step process described in [HY90].
(1) Skolemize program $P_{A \mapsto B}$: The nrecILOG⁻ program is first “skolemized” into

    i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);                    // S1
    B.Flight(f)          :- i-flight[f,s,t];                          // S2
    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);         // S3
    B.price(f,p)         :- i-flight[f,s,t], A.segment-info(s,t,p);   // S4

That is, the * symbol in the intermediate relation i-flight of the invention rule S1 is replaced by a skolem function f(s,t). This corresponds to the intuition that a new OID should be created for each (s,t) pair satisfying the rule body of S1.

(2) Compute the skolemized pre-instance: Use the instance of A shown in Figure 1.2 as the source and run the skolemized program as if it were an ordinary Datalog program. The resulting “skolemized pre-instance” for database B and the intermediate relation i-flight are illustrated in Figure 2.3. Notice that the skolem terms of the instance in the figure are the “logical OIDs”. Each logical OID represents a distinct (physical) OID value. In the next step, the logical OIDs are replaced by distinct (physical) OIDs.

(3) Compute the instance from the pre-instance: Define a mapping $\psi$ that maps each different skolem term in the skolemized pre-instance to a unique OID. If the mapping $\psi$ is

\[ \psi = \{\, f(s_1, 10{:}00) \to f_1,\; f(s_1, 14{:}00) \to f_2,\; f(s_2, 09{:}00) \to f_3 \,\} \]

then the resulting instance for B is the one depicted in Figure 1.2.

Notice that the choice of $\psi$ in Step (3) above is non-deterministic. However, since OIDs in a database instance can be viewed as “place holders” or “pointers”, the specific choice of OID values is irrelevant. That is, a different $\psi$ yields a different but “equivalent” (up to permutation on OIDs) instance for B. This motivates the definitions of physical and logical instances.

2.3.2 Physical vs. Logical Instances

Because of the non-deterministic nature of substituting OIDs for the skolem terms in Step (3) above, given a specific source instance the output of the translation may differ depending on the specific $\psi$ mapping chosen. (Similar non-determinism occurs with all languages mapping to object-based instances; see also [AK89, AB91, Day89].) As shown in Figure 2.4, the two output instances resulting from two different choices of the $\psi$ mapping are “isomorphic” but different. To overcome this difficulty, ILOG separates the notions of Physical and Logical Instances.¹

¹ In the paper [HY90], a physical instance is called the pre-instance and a logical instance is called the instance.

Figure 2.4: Resulting instances of two different $\psi$ mappings. The first mapping is $\{ f(s_1, 10{:}00) \to f_4,\; f(s_1, 14{:}00) \to f_5,\; f(s_2, 09{:}00) \to f_6 \}$ and the second is $\{ f(s_1, 10{:}00) \to f_{38},\; f(s_1, 14{:}00) \to f_{49},\; f(s_2, 09{:}00) \to f_{250} \}$; in each case the relations i-flight, B.Flight, B.flight-info and B.price of Figure 2.3 are shown with the skolem terms replaced accordingly.

The instances discussed above are technically called “physical instances”, because the OIDs used are concrete, physical OIDs. In contrast, logical instances are defined based on the following notion of OID-equivalence.

Definition: OID-equivalent, Logical Instance
Let S be a schema. Two physical instances I, J of S are OID-equivalent, denoted $I \equiv J$, if there exists a permutation $\sigma$ on the domain of OIDs such that

\[ \sigma(I) = J \]

The logical instance represented by I is the OID-equivalence class containing I, denoted by $[I]$.
□

With this notion of logical instance based on OID-equivalence classes, the choice of specific OIDs during the skolem term substitution of Step (3) is irrelevant. That is, given a source instance I and an nrecILOG⁻ program P, the output defined in [HY90] is always $[P(I)]$, where the particular choice of OIDs for $P(I)$ is irrelevant.

However, as will be seen in Chapters 4 and 5, the specific choice of physical OIDs is important in the context of incremental update propagation between databases. To avoid confusion, the term “instance” used in the discussion, unless otherwise specified, refers to a physical instance.

2.3.3 Restrictions for nrecILOG⁻

The nrecILOG⁻ language is a restricted member of the general ILOG language family. The following is a list of restrictions on rules in an nrecILOG⁻ program.

(1) No negation: each rule has only positive conjunctive predicates in its rule body.

(2) No union: each rule in an nrecILOG⁻ program has a distinct predicate in the rule head. Thus, we can use $\mathcal{R}_r$ to denote the rule in the program with r appearing in the rule head.

(3) Only inventional intermediate relations: there is no need for intermediate predicates aside from the invention predicates.

(4) Rule body: only the predicates from the source database and/or the invention predicates can appear in a rule body. Also, an invention predicate can appear in a rule body only if the OID variable of the invention predicate appears in the rule head.

(5) Rule head: the rule head of a rule is either an invention predicate or a predicate from the destination database.

(6) No repeated variables in the rule head: if there are repeated variables in the rule head of $\mathcal{R}_r$ then, since there is no union in the program, the program can be transformed into an equivalent one in which each occurrence of the r atom in the program is isomorphic to the rule head of $\mathcal{R}_r$.
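The three-step evaluation of Section 2.3.1 and the notion of OID-equivalence can be illustrated with a short sketch. It is written in Python rather than ILOG, hard-codes rules S1 to S4 as set comprehensions, and uses source tuples that are an assumption reconstructed from the pre-instance of Figure 2.3. The equivalence test is a brute-force search that treats only the first column of each B relation as an OID, which is enough for this example but is not a general algorithm.

```python
from itertools import permutations

# Source instance of A (assumed; reconstructed from Figure 2.3).
segment_info = {("s1", "10:00", 65), ("s1", "14:00", 80), ("s2", "09:00", 150)}
has_node = {("s1", "S.F."), ("s2", "N.Y.")}


def skolemized_pre_instance():
    """Steps (1)-(2): run S1-S4 with the skolem term f(s,t) as a logical OID."""
    i_flight = {(("f", s, t), s, t) for (s, t, p) in segment_info}         # S1
    flight = {(f,) for (f, s, t) in i_flight}                              # S2
    flight_info = {(f, c, t) for (f, s, t) in i_flight
                   for (s2, c) in has_node if s2 == s}                     # S3
    price = {(f, p) for (f, s, t) in i_flight
             for (s2, t2, p) in segment_info if (s2, t2) == (s, t)}        # S4
    return {"B.Flight": flight, "B.flight-info": flight_info, "B.price": price}


def instantiate(pre, fresh_oids):
    """Step (3): replace each distinct skolem term by a distinct physical OID."""
    terms = sorted({row[0] for rows in pre.values() for row in rows})
    psi = dict(zip(terms, fresh_oids))
    return {rel: {(psi[row[0]],) + row[1:] for row in rows}
            for rel, rows in pre.items()}


def oids(instance):
    return sorted({row[0] for rows in instance.values() for row in rows})


def oid_equivalent(i, j):
    """Brute-force test: is there a renaming of OIDs that turns i into j?"""
    oi, oj = oids(i), oids(j)
    if len(oi) != len(oj):
        return False
    for perm in permutations(oj):
        sigma = dict(zip(oi, perm))
        renamed = {rel: {(sigma[r[0]],) + r[1:] for r in rows}
                   for rel, rows in i.items()}
        if renamed == j:
            return True
    return False
```

Instantiating the same pre-instance with the OIDs f1, f2, f3 and with f38, f49, f250 gives two physical instances that differ as sets of tuples but belong to the same logical instance.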
Also, restrictions (3), (4) and (5) ensure that there is no recursion in an nrecILOG⁻ program. Restriction (6) is mainly syntactic sugar to simplify the formal discussion.

Notice that usually the invention rules are first used to create untyped OIDs, and then the intermediate relations appearing in the rule heads are used to populate the OIDs into an entity type in the target database. In the above, this occurs in S1 and S2. For the sake of brevity, rules S1 and S2 can be abbreviated as

    i-flight[*Flight,s,t] :- segment-info(s,t,p);

This specifies implicitly that the OIDs invented are objects of the entity type Flight. This shorthand will be used when specifying ALDs in later discussion.

2.4 POPART, the Compiler

The ALD compiler is constructed using POPART (Producers Of Parsers And Related Tools) [Wil, Wil87], developed at USC/Information Sciences Institute. POPART can be used to generate a lexical analyzer, parser, unparser, pattern matcher, and program transformation system for the ALD compiler. Because POPART is a grammar-driven programming environment generator, the ALD compiler can be easily generated from a high-level grammar description.

2.4.1 BNF', the grammar description language

The grammar of the underlying language in POPART is specified in BNF'. It is a BNF-like grammar description language extended to allow regular expression constructs (optional presence, alternation, iterated patterns, operator precedence patterns, and pattern expressions) in the language. For example, the following is a piece of the grammar specification of the nrecILOG⁻ rules in BNF'.
    ilog-rule := head * body;
    head := intermediate | invention-intermediate | relation;
    body := intermediate-or-relation ~ ;
    intermediate-or-relation := intermediate | relation;
    invention-intermediate := rel#invention '[' * { rel#constraint > ( var-or-const#witness '] ;
    intermediate := rel#intermediate '[ ( var-or-const#obj-witness ~ ;
    relation := rel#relation '( var-or-const#type

In the grammar specification, terminals are quoted and nonterminals are unquoted symbols. <non-terminal> means “one or more occurrences of <non-terminal> separated by commas”. The # sign in a non-terminal is used to name and distinguish different pattern variables with the same syntactic type on the right hand side of a production. For example, the two pattern variables of the type rel appearing in the definition of invention-intermediate above can be distinguished by rel#invention and rel#constraint.

2.4.2 Syntax Directed Experts

The syntax directed experts in POPART allow the compilation of an ALD at the level of abstract syntax. Each syntax directed expert is specified as a sequence of production rules with patterns to be matched in the left hand side of the rules. Once a pattern in the left hand side of a production rule is matched against the parse tree of the source grammar, the activity in the right hand side of the production rule is activated.

Metavariables in a pattern are indicated using the ! sign, and option variables are indicated using the !? sign. The !! sign is used to match any subsequence of an iterated field of the same syntactic type. Therefore,

    "begin !!Statement end"

can be used to match the begin-end statement block in PASCAL, and

    IF !Predicate THEN !Statement#true ELSE !?Statement#false

can be used to match an IF statement with the optional ELSE part.
The activity of the expert's right hand sides distinguishes the three types of syntax directed experts.

Action routines: For these, the right hand sides consist of Common Lisp code that can refer to the pattern variables that were matched. The following is an action routine in the ALD compiler that generates a list of the unbound free variables in an nrecILOG⁻ rule.

    (defaction-routine find-freevars-in-rule
      :from-grammar ALD
      :nonterminals (ilog-rule)
      :rules (
        ("!head#lhs :- !body#rhs"
         (progn
           (remove-duplicates
             (set-difference (find-vars tr::body#rhs)
                             (find-vars tr::head#lhs)
                             :test #'equal)
             :test #'equal)))
      ))

Transformers: For these, the right hand sides are patterns of the same language as the left hand sides. They serve as the rewriting rules of the underlying grammar. The following is a rewriting rule to simplify wffs in an AP5-lisp program. In particular, it removes every empty existential quantifier and every and operator with a single operand from the program.

    (poe:deftransformer Simplify-wff
      :grammar AP5-lisp
      :nonterminals (lisp-list)
      :rules (
        ("(E nil !lisp-list)" :==> "!lisp-list")
        ("(and !lisp-list)"   :==> "!lisp-list")
      )
      :otherwise-action poe:ptree)

Translators: These are used to transform the matched parse tree into a parse tree of the destination grammar. Thus the right hand sides are patterns in the destination grammar. The following is a translator which translates the rule body of an nrecILOG⁻ rule into a conjunctive clause in AP5-lisp.
    (deftranslator body-to-lisp-program
      :from-grammar ALD
      :to-grammar AP5-lisp
      :from-nonterminals (body)
      :to-nonterminals (program)
      :rules (
        ("!!intermediate-or-relation#one-item" :==> "(AND !!program#one-item)")
      ))

For example, the rule body of the ILOG rule

    B.flight-info(f,c,t) :- i-flight[f,s,t], A.has-node(s,c);

is translated into the lisp list

    (AND i-flight[f,s,t] A.has-node(s,c))

Chapter 3
Solution of the Value-Based Case

This chapter presents the framework for maintaining the uni-directional and bi-directional semantic correspondence for the value-based case under the general architecture of Bird illustrated in Section 1.3. This chapter also serves as the starting point for the object-based context to be presented in Chapters 4 and 5.

Each component of the Bird system used for maintaining the value-based semantic correspondence is discussed in one section: Section 3.1 describes the abstract linkage description (ALD), Section 3.2 discusses the compilation of ILOG non-invention rules, Section 3.3 presents the RemTrans mechanism which orchestrates the interactions between the two databases, and Section 3.4 briefly describes the communication subsystem in this framework. Finally, Section 3.5 discusses the side effects on the behaviors of the two databases after the semantic correspondence has been established.

Throughout this chapter the “Office” example (Example 1.1) introduced in Chapter 1 will be used to illustrate various parts of the Bird system.

3.1 Abstract Linkage Description (ALD)

In Bird, the semantic correspondence is described by a text-based file called the abstract linkage description (ALD). The formal semantics of the ALD are summarized in Appendix B.
Formally speaking, an ALD L for the bi-directional value-based case is a four-tuple (A_L, B_L, P_{A↦B}, P_{B↦A}) where

• A_L specifies the schema of database A.
• B_L specifies the schema of database B.
• P_{A↦B} is a nrecILOG- program specifying the schema restructuring mapping that maps instances of A to instances of B.
• P_{B↦A} is a nrecILOG- program specifying the schema restructuring mapping that maps instances of B to instances of A.

L specifies the semantic correspondence

    {(I, J) | (P_{A↦B}(I) = J) ∧ (P_{B↦A}(J) = I)}

For the case of a value-based uni-directional ALD L, the only difference is that L does not have the P_{B↦A} part; i.e., the ALD provides only the one-way mapping from A_L to B_L. Other than that, the compilation and loading processes are the same as in the bi-directional case. Since only the value-based cases are discussed in this chapter, the two nrecILOG- programs P_{A↦B} and P_{B↦A} in an ALD contain only non-invention nrecILOG- rules.

With the ALD, the DBA can give a high level description of the semantic correspondence without worrying about the low level details of how incremental updates are propagated and synchronized between the two databases. To specify the semantic correspondence, the DBA defines the bi-directional linkage in a one-direction-at-a-time fashion; i.e., he first puts himself in the A database and describes the mapping from instances of A to instances of B, then he switches to the B side and specifies the mapping from instances of B to instances of A. This approach distinguishes Bird from most of the approaches taken by research into the “view update” problems [BS81, DB82, GPZ88]. The formal syntax of ALD is presented in Appendix A.
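The set definition above can be made concrete with a small sketch. The Python below is only an illustration, not part of Bird: relations are modeled as sets of tuples, and the two functions `a_to_b` and `b_to_a` are hypothetical stand-ins for P_{A↦B} and P_{B↦A}, written under the assumption that the “Office” rules read as shown in the comments.

```python
# A database instance is a dict mapping relation names to sets of tuples.
# The rule readings in the comments are assumptions about the "Office" example.

def a_to_b(inst_a):
    """Stand-in for P_{A->B}: derive an instance of B from an instance of A."""
    person_no = inst_a["person-no"]   # (person, phone)
    no_office = inst_a["no-office"]   # (phone, office)
    return {
        # B.in-office(p,o) :- A.person-no(p,n), A.no-office(n,o)
        "in-office": {(p, o) for (p, n) in person_no
                      for (m, o) in no_office if n == m},
        # B.office-no(o,n) :- A.no-office(n,o)
        "office-no": {(o, n) for (n, o) in no_office},
    }

def b_to_a(inst_b):
    """Stand-in for P_{B->A}: derive an instance of A from an instance of B."""
    in_office = inst_b["in-office"]   # (person, office)
    office_no = inst_b["office-no"]   # (office, phone)
    return {
        # A.person-no(p,n) :- B.in-office(p,o), B.office-no(o,n)
        "person-no": {(p, n) for (p, o) in in_office
                      for (q, n) in office_no if o == q},
        # A.no-office(n,o) :- B.office-no(o,n)
        "no-office": {(n, o) for (o, n) in office_no},
    }

def in_correspondence(inst_a, inst_b):
    """(I, J) is in the correspondence iff P_{A->B}(I) = J and P_{B->A}(J) = I."""
    return a_to_b(inst_a) == inst_b and b_to_a(inst_b) == inst_a

I = {"person-no": {("Ben", "123"), ("Sherry", "456")},
     "no-office": {("123", "SAL250"), ("456", "SAL130")}}
J = {"in-office": {("Ben", "SAL250"), ("Sherry", "SAL130")},
     "office-no": {("SAL250", "123"), ("SAL130", "456")}}
print(in_correspondence(I, J))
```

Note that both equalities are needed: an instance of A that contains a person whose telephone has no office still maps to J under `a_to_b`, but it is not recovered by `b_to_a`, so the pair falls outside the correspondence.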
Example 3.1: The ALD for the “Office” example is

    (abs-linkage Office
      A-Schema (
        person-no(String, String);   // This is a comment line
        no-office(String, String);
      )
      B-Schema (
        in-office(String, String);
        office-no(String, String);
      )
      A-to-B (
        B.in-office(p,o) :- A.person-no(p,n), A.no-office(n,o);   // R1
        B.office-no(o,n) :- A.no-office(n,o);                     // R2
      )
      B-to-A (
        A.person-no(p,n) :- B.in-office(p,o), B.office-no(o,n);   // R3
        A.no-office(n,o) :- B.office-no(o,n);                     // R4
      )
    )

where the four-tuple (A_L, B_L, P_{A↦B}, P_{B↦A}) is represented by the A-Schema, B-Schema, A-to-B, and B-to-A parts above. □

As shown in the above example, ALDs adopt the C++ style comment token “//”. Whenever this token appears (unless it is inside a string), everything up to the end of the current line is a comment.

3.2 Compilation of an ALD

With the DBA-defined ALD as input, the ALD compiler translates the “abstract linkage” defined in the ALD into two “concrete linkages”, Σ_A and Σ_B, which contain production rules¹ for the underlying active DBMSs of databases A and B, respectively. Since there are no invention rules in a value-based ALD, in this chapter we only describe the translation of nrecILOG- non-invention rules. The translation of invention rules will be presented in Section 4.4.

¹ In later chapters (4 and 5), for the case of an OID-based semantic correspondence, the compiler will incorporate some “auxiliary relations” besides those defined in the A-Schema and B-Schema sections. Then the resulting (Σ_A, Σ_B) contains not only sets of production rules but also the schema definitions for the “auxiliary relations”.

Notice that for a non-invention nrecILOG- rule r, the predicates in the rule body of r all come from one database, while the rule head predicate of r comes from the other database. For the sake of discussion, we call the former the “source database” and the latter the “destination database” of rule r. For example, each of the predicates in the body of

    B.in-office(p,o) :- A.person-no(p,n), A.no-office(n,o);

comes from A, while the rule head is a predicate from B. Thus the source and the destination of the above rule are A and B, respectively.

Each nrecILOG- rule r is translated by the ALD compiler into two production rules, r_p and r_d, of the underlying active database.² r_p is loaded into the source database of r to monitor conditions in the source database violating the constraint implied by r, and r_d is loaded into the destination database of r to check whether the constraint specified by r is satisfied after the rule head of r is modified. The details of r_p and r_d are now presented.

² In the current implementation, the Syntax Experts in POPART are used to perform the translation, and the result of the compilation is a set of AP5 production rules or, more precisely, a set of Consistency Checkers in AP5.

Population Rule r_p: This resides in the source database of r to monitor any incremental update violating the constraint implied by r. If the violation is due to some update to the rule body of r, then r_p is triggered and is responsible for propagating updates to the rule head in the destination database by inserting update requests into a data structure Remote-Agenda. Remote-Agenda serves as a buffer holding updates proposed by various population rules. The actual execution of the updates in Remote-Agenda is postponed until all rules in the source database stop firing. This is mainly to avoid deadlock and will be elaborated in Section 3.3 when the RemTrans mechanism is introduced.

Delay Checking Rule r_d: This resides in the destination database of r to ensure the constraint specified by r after the rule head of r has been modified. r_d is triggered whenever the rule head of r is modified. The action of r_d inserts a query into a data structure Delay-Checking-List asking whether the constraint of r is violated. The actual checking of this query is delayed until the whole interaction between the source and destination databases has settled down. The details of this “wait till everything has settled down, then check” process will be given in Section 3.3.

As described above, the behavioral aspect of a nrecILOG- rule is divided into two parts; each part is handled by one production rule.

Example 3.2: We now use the “Office” example to illustrate the translation of nrecILOG- non-invention rules. The production rules generated by the compiler are presented here in pseudo code for easier understanding. As indicated in the ALD, the two rules in B-to-A are named R1 and R2, and the two rules in A-to-B are named R3 and R4. The nrecILOG- rule

    A.person-no(p,n) :- B.in-office(p,o), B.office-no(o,n);   // R1

is translated into the following two production rules.

    Rule Name: populate-A.person-no
    Trigger:   any update changing the truth value of
               (∃o B.in-office(p,o) ∧ B.office-no(o,n))
               with the variable bindings of p, n; i.e., a change made to the
               rule body that is used to populate the rule head.
    Action:    IF (∃o B.in-office(p,o) ∧ B.office-no(o,n)) is becoming true
               THEN append "INSERT A.person-no(p,n)" to Remote-Agenda
               ELSE append "DELETE A.person-no(p,n)" to Remote-Agenda

    Rule Name: delay-checking-A.person-no
    Trigger:   any update changing the truth value of A.person-no(p,n)
               with the variable bindings of p, n; i.e., a change made to the
               rule head.
    Action:    IF A.person-no(p,n) is becoming true
               THEN append "Does there exist an office o such that
                    (B.in-office(p,o) ∧ B.office-no(o,n)) is TRUE?"
                    to Delay-Checking-List
               ELSE append "For each office o, is
                    (B.in-office(p,o) ∧ B.office-no(o,n)) FALSE?"
                    to Delay-Checking-List

Notice that the action part of rule populate-A.person-no, instead of executing the insertion/deletion on A.person-no, appends a request for insertion/deletion to the data structure Remote-Agenda. Also, the action part of rule delay-checking-A.person-no simply adds a query to another data structure, Delay-Checking-List.
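The buffering behavior of the two rule actions can be sketched as follows. This is a minimal Python illustration, not the AP5 Consistency Checkers that the compiler actually emits; the function names, the tuple encoding of requests, and the query tags `EXISTS-OFFICE` and `NO-OFFICE` are assumptions introduced here for illustration only.

```python
# Buffers named after the data structures in the text.
remote_agenda = []        # update requests proposed to the remote database
delay_checking_list = []  # queries to evaluate after propagation has settled

def populate_a_person_no(p, n, body_becomes_true):
    """Action of the population rule: propose an update, do not execute it."""
    op = "INSERT" if body_becomes_true else "DELETE"
    remote_agenda.append((op, "A.person-no", (p, n)))

def delay_checking_a_person_no(p, n, head_becomes_true):
    """Action of the delay checking rule: queue a query about the rule body."""
    if head_becomes_true:
        # "Does there exist an office o such that
        #  in-office(p,o) and office-no(o,n) holds?"
        delay_checking_list.append(("EXISTS-OFFICE", p, n))
    else:
        # "For each office o, is in-office(p,o) and office-no(o,n) false?"
        delay_checking_list.append(("NO-OFFICE", p, n))
```

Neither function touches the other database; both only record what should later be sent in a batch.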
The reason for buffering requests rather than executing them directly is that the production rule is located in one database, while its action involves two databases. If the action part of a production rule were allowed to update the remote database directly, a deadlock could result. Remote-Agenda and Delay-Checking-List are used by the RemTrans mechanism to buffer remote requests proposed by various production rules, so that remote update requests are sent to the remote database in a batch mode. The details of the RemTrans mechanism will be presented in Section 3.3.

R2, R3, R4 can be translated in a similar fashion. The resulting concrete linkages for this example are:

Σ_B: It contains the production rules populate-A.person-no, populate-A.no-office, delay-checking-B.in-office, and delay-checking-B.office-no.

Σ_A: It contains the production rules populate-B.in-office, populate-B.office-no, delay-checking-A.person-no, and delay-checking-A.no-office. □

3.3 Remote Transaction Control (RemTrans)

Once the resulting Σ_A and Σ_B are loaded into A and B, respectively, any incremental update to one database may trigger some production rules in the concrete linkage of the local database. The repair actions of those triggered rules may propose updates to the remote database, which may further trigger some production rules in the concrete linkage of the remote database. Therefore, it is conceivable that there is a back-and-forth process of update exchanges before both databases finally commit. In the Bird system, the RemTrans mechanism helps to orchestrate such a “negotiation” process between the two databases.

    Figure 3.1: A naive scenario of using “remote-execute”

    Initial instances:
      A.person-no (Name, Telephone):  (Ben, 123), (Sherry, 456)
      A.no-office (Telephone, Office): (123, SAL250), (456, SAL130)
      B.in-office (Name, Office):     (Ben, SAL250), (Sherry, SAL130)
      B.office-no (Office, Telephone): (SAL250, 123), (SAL130, 456)

    Steps:
      (1) User issues updates on A:
            insert A.person-no(John,007)
            insert A.no-office(007,SAL250)
      (2) Database A is locked, execute updates
      (3) Production rules populate-B.in-office, populate-B.office-no are fired
      (4) Send remote updates to B:
            insert B.in-office(John,SAL250)
            insert B.office-no(SAL250,007)
      (5) Database B is locked, execute updates
      (6) Production rules populate-A.person-no, populate-A.no-office are fired
      (7) Send remote updates to A:
            insert A.person-no(John,123)
            insert A.person-no(Ben,007)
          Deadlock!

3.3.1 Motivation of RemTrans

A naive way to allow the ACTION part of a production rule to manipulate data in a remote database is to simply provide a “remote-execute” procedure that can execute commands updating data in the remote database. With this “remote-execute” procedure, a production rule can manipulate the remote data directly. Unfortunately, this approach might result in a deadlock. In the following, an update scenario of the “Office” example using “remote-execute” is presented to illustrate this.

Consider the scenario depicted in Figure 3.1. Assume the user issues a transaction in A (1). For most database systems, a database is first locked before a transaction starts. Therefore, A is first locked before the transaction is executed (2). Suppose that during the execution of the user update some production rules in database A are triggered (3). As a result, the production rules use the “remote-execute” function call to execute updates in the remote database B (4). After locking database B (5), the remote updates are executed. Suppose that the updates in (5) further trigger some production rules in B (6). When the production rules in B try to call “remote-execute” to perform updates back in the A database (7), a deadlock happens. This is because the transaction on the A database has not completed and the A database is still locked.

Besides the need to prevent deadlock, there is also the need for a protocol between the two databases to synchronize and simulate the human negotiation process: one database proposes some update to the other database, then the other database may counter-propose some update back to the original database.
That is, there may be a back-and-forth proposal/counter-proposal exchange during the negotiation process. In the end, the accumulated results of the negotiation can be committed to the databases only if both parties are satisfied with what they have. If either party is unhappy with the result, the updates for both databases must be undone.

3.3.2 Managing the Communications

RemTrans in Bird is the mechanism preventing deadlock and synchronizing interactions between databases. When a user issues an update on one database, the RemTrans mechanism wraps the whole interaction between the two databases into one “global transaction”, so that a user-issued incremental update and its corresponding actions issued by the production rules in both databases are committed or aborted as one transaction.

When the user on the local database issues a transaction through RemTrans, the RemTrans mechanism first creates two processes, RemTrans_A and RemTrans_B, simultaneously on the two databases. Then the two databases are locked. After that, interactions between databases A and B are taken over by the RemTrans mechanism. The two databases will be unlocked when the transactions of the two databases are both committed or both aborted.

W.l.o.g. assume that the user issues a transaction T through function REMOTE-TRANS of RemTrans at database A. After the user issues REMOTE-TRANS(T), the RemTrans mechanism goes through the following phases to complete the “remote transaction”. Notice that there are two checking phases below. The local process enters the Self Checking Phase when there are no remote updates in the Remote-Agenda, i.e., all the production rules in the local process are satisfied. The local process enters the Requested Checking Phase when it receives the "I am satisfied. How about you?" message from the remote process.
Initialization Phase: RemTrans_A first creates a remote process RemTrans_B by the RPC mechanism as its negotiation counterpart in database B. Instead of starting from the Initialization Phase, the newly created RemTrans_B starts directly from the Listening Phase.

Execution Phase: RemTrans_A then performs the transaction T on database A. The execution of T may trigger some population and delay checking production rules in Σ_A. Recall from the previous section that the ACTION parts of production rules generated by the Bird compiler may contain the following actions:

• The ACTION part of a population rule appends remote update requests to the data structure Remote-Agenda.
• The ACTION part of a delay checking rule appends remote predicate query requests to the data structure Delay-Checking-List.

If there is no request in Remote-Agenda, then RemTrans_A jumps to the Self Checking Phase. Otherwise it moves to the Proposing Phase.

Proposing Phase: Pack the remote update requests stored in the Remote-Agenda into a message and send it to database B by issuing

    SEND-MESSAGE(B, "Please execute (Remote-Agenda)")

Listening Phase: RemTrans_A then waits for any counter-proposal or reply from database B. If the reply obtained by RECEIVE-MESSAGE(B) is

• "Please evaluate (Remote-Query)", then evaluate the Remote-Query against the current state of database A and return the TRUE/FALSE result to database B. Return to the Listening Phase.
• "Please execute (New-Proposal)", then set T = New-Proposal and go back to the Execution Phase.
• "Abort the global transaction!", then abort and undo the RemTrans_A transaction and exit.
• "I am satisfied. How about you?", then go to the Requested Checking Phase.
• "Global Transaction Committed", then commit the transaction and exit.

Self Checking Phase: Pack the remote queries in Delay-Checking-List into a message and issue

    SEND-MESSAGE(B, "Please evaluate (Delay-Checking-List)")

If the reply is

• TRUE, then issue SEND-MESSAGE(B, "I am satisfied. How about you?") and go to the Listening Phase.
• FALSE, then issue SEND-MESSAGE(B, "Abort Global Transaction!"), undo the transaction and exit.

Requested Checking Phase: Unlike the Self Checking Phase, this phase occurs only after the other side has performed its self checking and the local database has received the "I am satisfied. How about you?" message. First pack the queries in Delay-Checking-List into a message and issue

    SEND-MESSAGE(B, "Please evaluate (Delay-Checking-List)")

If the reply is

• TRUE, then issue SEND-MESSAGE(B, "Global Transaction Committed"), commit the transaction and exit.
• FALSE, then issue SEND-MESSAGE(B, "Abort Global Transaction!"), undo the transaction and exit.

A scenario depicting the phase transitions under RemTrans is shown in Figure 3.2. Under the RemTrans mechanism all proposed updates to the remote database are first buffered, then packed into a message (not as a transaction) and sent to the RemTrans server on the other side. In this fashion, a deadlock situation will not happen during the interaction between databases A and B.

    Figure 3.2: Example scenario for phase transitions under RemTrans

    Database A: Initial Phase (create process RemTrans_A); Execution Phase
      (execute the user's delta, rule firings); Proposing Phase (pack and send
      remote requests); Listening Phase (answers remote queries, receives
      "I am satisfied. How about you?"); Requested Checking Phase (check
      queries in Delay-Checking-List, inform B to commit); transaction
      commits on A.
    Database B: Initial Phase (create process RemTrans_B); Listening Phase;
      Execution Phase (execute the received delta, rule firings); Proposing
      Phase (pack and send remote requests); Listening Phase; Execution Phase
      (no more rule firings); Self Checking Phase (check queries in
      Delay-Checking-List, inform A that B is satisfied); Listening Phase;
      transaction commits on B.

3.4 Socket and RPC, the Communication Mechanism

The communication subsystem in Bird is built on top of the BSD socket [Bac86] but can be easily modified for other communication paradigms. Figure 3.3 illustrates the system diagram of the communication subsystem. In particular, each database has a running process, the post-office, connected with the other post-offices through a BSD socket. The unit of communication is a message of the form (Sender, Receiver, Content). The post-office maintains a message-pool containing the incoming messages from other remote post-offices.

The communication subsystem provides services through the following two virtual functions.

send-message(Receiver, Content): This function encodes the Receiver and Content into a message, then sends it to the post-office of the destination host.

receive-message(Receiver): This function returns a message previously stored in the message-pool whose receiver field is Receiver.

    Figure 3.3: System diagram of the Communication Subsystem
    (in each database, RemTrans calls send-message and receive-message on the
    local Post Office; the two Post Offices are connected by a BSD socket)

As shown in Figure 3.3, the RemTrans mechanism uses the two virtual functions to perform inter-database communications.

3.5 Linkage Implicit Constraints (LICs)

By definition, a semantic correspondence specifies a set of “acceptable instance pairs” that can co-exist in the two databases. It is conceivable that not all instances of one database have corresponding instances in the other database. That is, it is possible that the semantic correspondence may not correspond to a total mapping from Inst(A) to Inst(B).
As a result, maintaining a semantic correspondence may have a side effect on each local database, as if new constraints were introduced on the databases which eliminate the instances not appearing in the semantic correspondence. Figure 3.4 illustrates this situation. Here each bold dot represents an instance of database A or database B. Only some instances of A lie within the first column of SC_AB, and likewise for B and the second column of SC_AB. These constraints are called the linkage implicit constraints (LICs).

    Figure 3.4: Linkage Implicit Constraints (LICs)
    (after SC_AB is maintained by Bird, only the instances in π1(SC_AB) within
    Inst(A) and in π2(SC_AB) within Inst(B) remain admissible)

    Figure 3.5: Dynamic behaviors of LICs
    (within Inst(A) and Inst(B), a move caused by the user-issued Δ followed by
    moves caused by propagation rule repairs)

Example 3.3: Continuing with the “Office” example, assume that initially there are no constraints or production rules in A and B. Then the concrete linkages Σ_A and Σ_B presented in Example 3.2 are loaded into databases A and B, respectively. The current instances of the two databases are:

    A.person-no:  Person   No        A.no-office:  No    Office
                  Ben      123                     123   SAL250
                  Sherry   456                     456   SAL130

    B.in-office:  Person   Office    B.office-no:  Office   No
                  Ben      SAL250                  SAL250   123
                  Sherry   SAL130                  SAL130   456

Assume the user at database A issues the following transaction:

    Transaction{ INSERT person-no(Dave,789); }

According to the current database instance, no (propagation) production rule is triggered. Therefore, RemTrans_A goes to the Self Checking Phase and sends the remote query "Please evaluate (does there exist an office o such that in-office(Dave,o) and office-no(o,789))" to database B. The remote query returns false, which means that the static constraint implied by the nrecILOG- rule

    A.person-no(p,n) :- in-office(p,o), office-no(o,n);

has been violated by the user update. Therefore the whole transaction is aborted and undone.
As a matter of fact, once the linkage has been established, database A will be affected as if there were a new inclusion dependency

    A.person-no[2] ⊆ A.no-office[1]

added to the database. Since there is no office associated with telephone number 789, the user update INSERT person-no(Dave,789) is rejected as if it violated the above LIC. □

Now let us consider the dynamic aspect of LICs. The propagation of incremental updates in the Bird system is driven by the firings of the production rules in the concrete linkages. Importantly, because of the delay checking rules, at the end of each non-aborting RemTrans transaction the two databases necessarily reach a new instance pair in the semantic correspondence. However, as mentioned in Section 3.3, during the negotiation process of the RemTrans transaction both databases may go through several intermediate states. This is illustrated in Figure 3.5; even though the initial update Δ by the user may not put the database into a legal state in the semantic correspondence, the negotiation process between the two databases may eventually lead to an instance pair (I', J') in the semantic correspondence.

Example 3.4: Refer to the scenario shown in Figure 3.1. Suppose the user issues

    Transaction{
      insert A.person-no(John,007);
      insert A.no-office(007,SAL250);
    }

through RemTrans, so that there is no deadlock. Then, after (7), all the nrecILOG- rules are satisfied and all the updates are committed. However, besides the initial updates issued by the user, the interaction between A and B adds two more insertions,

    insert A.person-no(John,123);
    insert A.person-no(Ben,007);

to the total delta. This acts as if the user's transaction violated an implicational constraint [Fag82]

    ∀x,y,z,w (A.person-no(x,y) ∧ A.no-office(y,z) ∧ A.no-office(w,z)) → A.person-no(x,w)

and the additional deltas are repairs proposed by this LIC.
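The back-and-forth exchange of Example 3.4 can be replayed with a small simulation. The Python below is a simplified sketch, not the RemTrans implementation: it handles insertions only, omits locking and the delayed checks, and relies on helper names (`join`, `flip`, `missing`, `negotiate`) and on readings of the four “Office” rules that are assumptions made for illustration.

```python
def join(r, s):
    """Compose binary relations: {(x, z) | (x, y) in r and (y, z) in s}."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

def flip(r):
    """Swap the two columns of a binary relation."""
    return {(y, x) for (x, y) in r}

def a_to_b(a):  # assumed reading of the A-to-B rules
    return {"in-office": join(a["person-no"], a["no-office"]),
            "office-no": flip(a["no-office"])}

def b_to_a(b):  # assumed reading of the B-to-A rules
    return {"person-no": join(b["in-office"], b["office-no"]),
            "no-office": flip(b["office-no"])}

def missing(derived, current):
    """Insertions the other side still lacks (an insert-only agenda)."""
    return {rel: derived[rel] - current[rel]
            for rel in derived if derived[rel] - current[rel]}

def negotiate(a, b, user_inserts):
    """Apply the user's inserts at A, then exchange batched proposals
    until neither side has anything left to propose."""
    for rel, tuples in user_inserts.items():
        a[rel] |= tuples
    extra_on_a = set()
    while True:
        agenda = missing(a_to_b(a), b)       # A proposes a batch to B
        for rel, tuples in agenda.items():
            b[rel] |= tuples
        counter = missing(b_to_a(b), a)      # B counter-proposes to A
        for rel, tuples in counter.items():
            a[rel] |= tuples
            extra_on_a |= {(rel, t) for t in tuples}
        if not agenda and not counter:
            return extra_on_a

a = {"person-no": {("Ben", "123"), ("Sherry", "456")},
     "no-office": {("123", "SAL250"), ("456", "SAL130")}}
b = a_to_b(a)
extra = negotiate(a, b, {"person-no": {("John", "007")},
                         "no-office": {("007", "SAL250")}})
print(sorted(extra))
```

Under these assumptions the loop stops after two rounds, and the extra insertions on A are exactly the two person-no tuples listed in the example, (John,123) and (Ben,007).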
□

There are two important open problems concerning LICs: (1) Given a semantic correspondence SC, what is the set of LICs implied by SC? (2) Given a semantic correspondence SC and the associated LICs, what output will Bird produce if the user update initially yields an instance violating some LIC? The answers to these questions can help the DBA to understand the side effects and semantic changes incurred by the semantic correspondence. However, these problems are outside the scope of this thesis.

Chapter 4
Uni-directional Propagation of Incremental Updates Involving OID

In this chapter, the capabilities of Bird are upgraded by adding the ability to maintain uni-directional semantic correspondence involving OIDs. Although most of the framework introduced in Chapter 3 can be used in the presence of OIDs, there are major differences between a uni-directional value-based case and a uni-directional object-based case.

To begin, a new section, Entity-Equivalence, is added to ALDs, specifying the following information.

Entity Equivalent Specification (EES): In a semantic correspondence involving OIDs, the Bird system allows the DBA to specify pairs of entity types in the two databases that are “equivalent,” i.e., they hold sets of OIDs standing in a one-one correspondence with each other. This specification is called the Entity Equivalent Specification (EES) in the ALD. The semantics of EES and a rewriting system translating entries in an EES into nrecILOG- rules are presented in Section 4.1.

With the presence of OIDs, the nrecILOG- program in an ALD contains invention rules specifying relationships between entity sets. As a result the following fundamental problem arises.

OID association problem: To support incremental OID creation/destruction, the Bird system needs to carry information associating remote OIDs in the target database with their local “witness” values in the source database. This problem is described in Section 4.2.
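The kind of bookkeeping that the OID association problem calls for can be pictured with a small table from witness values to invented OIDs. The Python below is only a sketch under stated assumptions: the class name `OidAssociation`, its methods, and the way fresh OIDs are generated are all hypothetical and do not describe how Bird stores this information.

```python
import itertools

class OidAssociation:
    """Remembers which remote OID was invented for each local witness tuple."""

    def __init__(self, prefix):
        # Fresh OIDs are drawn from a counter; this is an assumption of the sketch.
        self._fresh = (f"{prefix}{i}" for i in itertools.count(1))
        self._oid_for = {}

    def invent(self, witness):
        """Return the OID associated with a witness, inventing one if needed."""
        if witness not in self._oid_for:
            self._oid_for[witness] = next(self._fresh)
        return self._oid_for[witness]

    def destroy(self, witness):
        """Forget the association; returns the destroyed OID or None."""
        return self._oid_for.pop(witness, None)

assoc = OidAssociation("o")
first = assoc.invent(("w1",))
again = assoc.invent(("w1",))   # same witness, same OID while it exists
assoc.destroy(("w1",))
recreated = assoc.invent(("w1",))
print(first, again, recreated)
```

In this sketch an object that is destroyed and later re-created from the same witness receives a different OID, because the counter never reuses a value; whether a real system behaves this way depends on how it assigns OIDs.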
This problem is described in Section 4.2. 45 The following extensions of the B ird system, introduced in Chapter 3, are developed to remedy this problem. A u g m e n ta tio n of so u rc e sc h e m a d u rin g th e A L D c o m p ila tio n : For each EES entry and ILOG invention rule in Pa^ b of the ALD, the ALD compiler will augment the A schema to include the interm ediate relation defined by the rule or the EES entry (Section 4.3). T ra n s la tio n o f IL O G in v e n tio n ru les: The production rules used for invention rules need to create OIDs in the rem ote database and rem ember the associ ation between the OIDs invented and their witness values. The translation (Section 4.4) is more involved because of: (1) rem ote OID invention/deletion and (2) communication of OIDs across the boarder of databases. M a c h in e D e p e n d e n t O b je c t T ra n sla tio n (M D O T ): W hen introducing OIDs in a semantic correspondence, the B ird system needs to communicate OIDs between databases. Since OIDs are only meaningful within the local database, a machine dependent object translation mechanism is needed to encode local OIDs into the global or “im m utable” OIDs (Section 4.5). W hen introducing the first two extensions above, we ignore the problems raised by the locality of OIDs. In particular, we assume that all OIDs are global or im m utable until Section 4.5, when the MDOT mechanism is introduced. Finally, in Section 4.6 at the end of this chapter, the subtle difference between the static and dynamic semantics of ALDs are discussed. This difference is mainly due to the non-determ inistic nature of assigning the OIDs at run time; if an object with OID o is destroyed and then created, the new OID is not necessarily o even though it is created with the same witness value. 
This phenomenon raises the profound question of whether the identification of an object can be truly represented by its surrogates.

Throughout this chapter, two examples will be used to illustrate various components of our approach. The first example illustrates the situation when the pairs of entity sets from the two databases correspond exactly; the second one illustrates cases where a pair of entity sets in the two databases correspond in a less direct fashion.

    Figure 4.1: Schemas and instances for “City”

    Database A: entity type City with attribute c-name
      A.c-name (City, name):    (c1, L.A.), (c2, S.F.), (c3, N.Y.)
    Database B: entity type Node with attribute node-label
      B.node-label (Node, label): (n1, L.A.), (n2, S.F.), (n3, N.Y.)

Example 4.1: “City”

In this example, database A contains the entity type A.City and its attribute A.c-name; database B contains the entity type B.Node and its attribute B.node-label. There is a one-one correspondence between the cities in A.City and the nodes in B.Node. The schemas and example instances are depicted in Figure 4.1. The following ALD specifies that correspondence.

    (abs-linkage city
      A-schema (
        City(entity);
        c-name(City,string);
      )
      B-schema (
        Node(entity);
        node-label(Node,string);
      )
      Entity-Equivalence (
        A.City = B.Node;   // E1
      )
      A-to-B (
        B.node-label(o,l) :- A.City=Node[c,o], A.c-name(c,l);   // R1
      )
    )

    Figure 4.2: Schemas and instances of Example “Segment Flight”

    Database A: entity type Segment with has-node (node) and segment-info (time, price)
      A.has-node (Segment, node):            (s1, S.F.), (s2, N.Y.)
      A.segment-info (Segment, time, price): (s1, 10:00, 65), (s1, 14:00, 80), (s2, 09:00, 150)
    Database B: entity type Flight with flight-info (city, time) and price
      B.flight-info (Flight, city, time):    (f1, S.F., 10:00), (f2, S.F., 14:00), (f3, N.Y., 09:00)
      B.price (Flight, price):               (f1, 65), (f2, 80), (f3, 150)

Notice that in the above ALD, Entity-Equivalence is a new section. It contains an entity equivalence specification (EES), with E1 specifying that A.City and B.Node are “equivalent” entity types.
Associated with E1 is one¹ system maintained intermediate relation A.City=Node, storing the matching relationship between the cities in A.City and their “equivalent” counterparts, the nodes in B.Node. The details of the EES will be discussed in the next section. □

¹ Later in Chapter 5, when bi-directional ALDs are discussed, there are two duplicate copies of the system maintained intermediate relation associated with an EES entry. The two intermediate relations are stored in databases A and B, respectively.

Example 4.2: “Segment Flight”

Continuing with Example 1.2, the following is the uni-directional ALD for the “Segment Flight” example. In Chapter 5, this example will be revisited for the bi-directional case.

    (abs-linkage segment-flight
      A-schema (
        Segment(entity);
        has-node(Segment,string);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        flight-info(Flight,string,string);
        price(Flight,string);
      )
      A-to-B (
        A.i-flight[*,s,t] :- A.segment-info(s,t,p);                    // S1
        B.Flight(f) :- A.i-flight[f,s,t];                              // S2
        B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);    // S3
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);      // S4
      )
    )
□

In the above ALD, rule S1 is a nrecILOG- invention rule. Notice that for a nrecILOG- invention rule, the source and destination databases are the same. That is, the intermediate relation A.i-flight (which is not included in the A schema) is assigned to be located in database A. With the intermediate relation materialized (Section 4.3) and treated as an ordinary relation, each predicate in the rule body of a nrecILOG- rule is located in the source database (A), and each rule head, except for the invention rules, is located in the destination database (B).

From the two examples above, it looks convenient and straightforward to use nrecILOG- invention rules to describe the semantic correspondence in the object-based case.
However, we will see in a later discussion that the invention of OIDs across the border of a database needs special treatment.

4.1 Entity Equivalence Specification (EES)

For the object-based semantic correspondence, the DBA can use the Entity Equivalence Specification (EES) in an ALD to specify the equivalence relationship between an entity type in A and an entity type in B. In the “City” example above, there is an EES entry

    A.City = B.Node;    // E1

in the Entity-Equivalence section of the ALD. Intuitively speaking, E1 specifies that there is a one-one correspondence between OIDs in A.City and OIDs in B.Node. Associated with E1 is the system-maintained EES relation A.City=Node, recording the matching relationship between OIDs of the two entity types.

The semantics of the EES is very similar, at least for the uni-directional case, to the semantics of nrecILOG¬ invention rules. As a matter of fact, in Bird, the EES is implemented by the following rewriting rules, which translate an ALD L containing EES entries into another ALD, denoted EES-to-ILOG(L), replacing each EES relation by a nrecILOG¬ invention rule.

Rewriting System 4.1
For each EES entry A.R = B.S in an ALD, the EES translator EES-to-ILOG rewrites the $P_{A \mapsto B}$ program in the ALD in the following ways.

(1) The following ILOG rule is generated and added to $P_{A \mapsto B}$:

        i-S[*S,r] :- A.R(r);

(2) Each occurrence of the predicate A.R=S[x,y] in the ALD is substituted by i-S[y,x]. □

In the above rewriting, (1) ensures that whenever there is an OID r in A.R, there will be a unique OID s in B.S corresponding to it. Since the intermediate relation i-S actually holds the matching correspondence between OIDs in A.R and OIDs in B.S, it can play the role of the EES relation A.R=S. Rule (2) is just a syntactic modification substituting i-S for A.R=S.
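The two rewriting steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bird's EES translator: rules are plain strings, the function name `ees_to_ilog` is hypothetical, and the intermediate relation is named by lower-casing the B entity type (so Node gives i-node), a naming choice assumed here for illustration.

```python
import re

def ees_to_ilog(ees_entries, rules):
    """Sketch of Rewriting System 4.1 (hypothetical helper, not Bird's code).

    ees_entries: list of (R, S) pairs, one per EES entry "A.R = B.S".
    rules:       list of rule strings from the A-to-B section.
    """
    rewritten = []
    for rule in rules:
        for r_type, s_type in ees_entries:
            # Step (2): A.R=S[x,y] becomes A.i-s[y,x]; note the swapped arguments.
            pattern = re.compile(
                r"A\." + re.escape(r_type) + "=" + re.escape(s_type)
                + r"\[\s*([^,\]\s]+)\s*,\s*([^,\]\s]+)\s*\]"
            )
            rule = pattern.sub(
                lambda m, s=s_type: f"A.i-{s.lower()}[{m.group(2)},{m.group(1)}]",
                rule,
            )
        rewritten.append(rule)
    for r_type, s_type in ees_entries:
        # Step (1): add one invention rule per EES entry.
        rewritten.append(f"A.i-{s_type.lower()}[*{s_type},r] :- A.{r_type}(r);")
    return rewritten

city_rules = ees_to_ilog(
    [("City", "Node")],
    ["B.node-label(o,l) :- A.City=Node[c,o], A.c-name(c,l);"],
)
```

Applied to the “City” ALD, the sketch rewrites R1 to use A.i-node[o,c] and appends the invention rule for A.i-node.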
It can be easily shown that EES-to-ILOG(L) emulates the semantics of the original L, and that the semantic correspondences specified by EES-to-ILOG(L) and L are the same.

Example 4.3:
For the “City” example, EES-to-ILOG(L) is derived from L by replacing the original $P_{A \mapsto B}$ with

    A-to-B (
      B.node-label(o,l) :- A.i-node[o,c], A.c-name(c,l);    // R1'
      A.i-node[*,c] :- A.City(c);                           // R2'
      B.Node(n) :- A.i-node[n,c];                           // R3'
    )

Notice that in the above, R2' and R3' are added because of rewriting rule (1), and R1 is rewritten into R1' by (2). □

From the rewriting process above, it seems that invention rules alone are sufficient to express the entity equivalence relationships between entity types. The justification for providing a second way, other than nrecILOG¬ invention rules, to specify the entity equivalence relationship is not clear for the uni-directional case. However, in Section 5.5 we will see that by writing an EES entry, semantically speaking, the DBA means more than just the OID creation/destruction behaviors between two entity types. As a rule of thumb, for the time being, whenever there is a one-one correspondence between OIDs of two entity types in the two databases, the DBA should use the EES to articulate the relationship in the ALD.

For the current implementation, entries in an EES are restricted in the following ways. These restrictions apply to both uni-directional and bi-directional cases.

• Each entry contains exactly one entity type in A and one entity type in B.

• Each equivalence class specified by the entries in an EES cannot contain more than two entity types.

For the remainder of this chapter, unless otherwise specified, we assume that the ALDs in discussion are first pre-processed by EES-to-ILOG, so that $P_{A \mapsto B}$ contains only nrecILOG¬ rules with no EES relations appearing in the rule bodies.
4.2 OID Association Problem

This section explores the OID association problem that arises when OIDs in the target database are destroyed incrementally by updates propagating from the remote database. This motivates the need for the schema augmentation process (Section 4.3) to materialize the intermediate relations appearing in the ALD.

First, let us recall the OID invention process of the ILOG language presented in Section 2.3, using the schemas in the “City” example.

(1) Skolemize program $P_{A \mapsto B}$: The ILOG program $P_{A \mapsto B}$ is first skolemized into

    A-to-B (
      A.i-node[f(c),c] :- A.City(c);                        // R1
      B.node-label(o,l) :- A.i-node[o,c], A.c-name(c,l);    // R2
    )

That is, the * sign in the intermediate relation i-node of the invention rule R1 is substituted by the skolem term f(c).

(2) Compute the skolemized pre-instance: Use the instance of database A shown in Figure 4.1 as the source and run the skolemized $P_{A \mapsto B}$ as if it were an ordinary Datalog program. The resulting skolemized pre-instance for database B and the intermediate relation A.i-node are

    A.i-node              B.node-label
    Node    City          Node    label
    f(c1)   c1            f(c1)   L.A.
    f(c2)   c2            f(c2)   S.F.
    f(c3)   c3            f(c3)   N.Y.

(3) Compute the instance from the pre-instance: Substitute each distinct skolem term in the skolemized pre-instance with a unique OID value. Assume that the mapping is

    f(c1) ↦ n1
    f(c2) ↦ n2
    f(c3) ↦ n3

Then the resulting instance for B is the one depicted in Figure 4.1.

Recall that, in the original ILOG semantics, the intermediate relations served only as “scratch paper”. They are not part of the output instance and are thrown away after the instance transformation process terminates. However, as we will see below, in the situation dealing with incremental OID deletions, the association between OIDs and their witness values stored in the intermediate relations is indispensable.
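The three steps recalled above can be traced with a small Python sketch for the “City” instance. It is a minimal illustration under stated assumptions: relations are modeled as sets of tuples, a skolem term f(c) is modeled as the tuple ("f", c), and the names `invent_oids` and `fresh_oid` are hypothetical.

```python
from itertools import count

def invent_oids(cities, c_name, fresh_oid):
    """Toy version of ILOG OID invention for the "City" rules R1 and R2."""
    # Steps (1) and (2): evaluate the skolemized rules as plain Datalog.
    # The skolem term f(c) is represented by the tuple ("f", c).
    i_node_pre = {(("f", c), c) for c in cities}
    label_pre = {(o, l) for (o, c) in i_node_pre for (c2, l) in c_name if c2 == c}
    # Step (3): replace each distinct skolem term by a fresh OID.
    terms = sorted({o for (o, _) in i_node_pre})
    mapping = {term: fresh_oid() for term in terms}
    i_node = {(mapping[o], c) for (o, c) in i_node_pre}
    node_label = {(mapping[o], l) for (o, l) in label_pre}
    return i_node, node_label

counter = count(1)
i_node, node_label = invent_oids(
    cities={"c1", "c2", "c3"},
    c_name={("c1", "L.A."), ("c2", "S.F."), ("c3", "N.Y.")},
    fresh_oid=lambda: f"n{next(counter)}",
)
```

In this sketch `node_label` is the target instance and `i_node` is the “scratch paper”. Discarding `i_node` leaves no record of which node was invented for which city, which is exactly the association that incremental deletions need.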
Consider the following deletion scenario for the “City” example. Suppose a user at database A issues the transaction

    { delete City(c1); delete c-name(c1,L.A.); }

According to R1 of $P_{A \mapsto B}$, the OID of B.Node associated with the witness c1, or more precisely the logical OID f(c1), should be deleted from the B database. However, as shown in [CH94], in the absence of other information, the problem of finding the correct OID is essentially equivalent to the graph isomorphism problem [GJ79] and, thus, probably intractable.

The solution is to store the intermediate relations (e.g., i-node in this example) in the source database A. By [CH94], with the intermediate relations carrying the OID association information, the problem of finding the correct OID is polynomial.

4.3 Augmentation of Source Schema

Continuing with the discussion in the previous section, to keep this OID/witness association information available in database A, for each nrecILOG¬ invention rule r in $P_{A \mapsto B}$, the ALD compiler generates a relation definition for the intermediate relation appearing in the rule head of r, and puts it into the concrete linkage $\Sigma_A$. These new relations generated by the ALD compiler are called the witness relations, and the family of these is denoted by auxA. After loading $\Sigma_A$, these witness relations are materialized in database A. As will be shown in the next section, the production rules generated from the invention rules in $P_{A \mapsto B}$ maintain the association information in the witness relations incrementally.

For the “City” example, since there is only one invention rule R1 with A.i-node as rule head, auxA contains one witness relation, and the following AP5 relation definition will be generated by the ALD compiler and put into $\Sigma_A$.

    (DEFRELATION i-node :TYPES (Entity City))

Similarly, for the “Segment-Flight” example, auxA contains relation i-flight, and the following AP5 relation definition will be generated and put into $\Sigma_A$.
    (DEFRELATION i-flight :TYPES (Entity Entity String))

[Figure 4.3: The semantics of OID invention rules. To compute $J_B$, $P_{A \mapsto B}$ first computes $I_{auxA}$. With $I_{\hat{A}} = I_{auxA} \cup I_A$, it then computes $J_B$ and throws $I_{auxA}$ away. This is why each execution of $P_{A \mapsto B}$ will produce an equivalent but not equal physical image.]

After loading into the A database, these witness relations are materialized and serve as ordinary relations in A, except that they are only accessible by the production rules generated from the invention rules in the ALD; e.g., E1 for the “City” example and S1 for the “Segment-Flight” example.

The details of the interaction between production rules generated from nrecILOG¬ invention rules and the witness relations will be given in the next section. Conceptually speaking, with the presence of witness relations, the semantics of $P_{A \mapsto B}$ can be viewed as the two-phase process² depicted in Figure 4.3; i.e., program $P_{A \mapsto B}$ can be divided into two parts: $P^{inv}_{A \mapsto B}$, containing all the invention rules in $P_{A \mapsto B}$, and $P^{non}_{A \mapsto B}$, containing all the non-invention rules in $P_{A \mapsto B}$. To compute the target instance $J_B$ from the source instance $I_A$, the Bird system first computes the witness $I_{auxA}$ (the physical instance of the witness relations) by

    $P^{inv}_{A \mapsto B}(I_A) = I_{auxA}$.

Then the target instance can be computed by

    $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) = J_B$.

² This relies heavily on the restriction that there is no recursion in a nrecILOG¬ program.

The notion of witness can be defined formally as follows.

Definition: Witness
Given an ILOG program $P_{A \mapsto B}$ and instances $I_A$, $J_B$ of databases A and B such that

    $[P_{A \mapsto B}(I_A)] = [J_B]$, or $P_{A \mapsto B}(I_A) \cong J_B$;

that is, the resulting (physical) instance of $P_{A \mapsto B}(I_A)$ is OID-equivalent to $J_B$. Then an instance $I_{auxA}$ over auxA is a potential witness from $I_A$ to $J_B$ if

    $P^{inv}_{A \mapsto B}(I_A) \cong I_{auxA}$, and $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) \cong J_B$,

where $P^{inv}_{A \mapsto B}$ and $P^{non}_{A \mapsto B}$ are the invention rules and the non-invention rules of $P_{A \mapsto B}$, respectively. A potential witness $I_{auxA}$ is a witness from $I_A$ to $J_B$ if

    $P^{non}_{A \mapsto B}(I_A \cup I_{auxA}) = J_B$.

(The notion of witness from $J_B$ to $I_A$ in the bi-directional case is defined analogously.) □

Using the source and target instances depicted in Figure 4.1, the corresponding witness $I_{auxA}$ for the “City” example is

    A.i-node
    Node   City
    n1     c1
    n2     c2
    n3     c3

For the “Segment-Flight” example, with the source and target instances in Figure 4.2, the witness $I_{auxA}$ is

    A.i-flight
    Flight   Segment   Time
    f1       s1        10:00
    f2       s1        14:00
    f3       s2        9:00

Due to the OID invention semantics of ILOG, instances of the witness relations satisfy certain functional dependencies.

Definition: Skolem Functional Dependency
Let r be a relation with arity n; then a skolem functional dependency over r is denoted as SFD(r). For each instance³ I,

    $I(r) \models \mathrm{SFD}(r)$ iff $I(r) \models 1 \rightarrow \{2, \ldots, n\}$ and $I(r) \models \{2, \ldots, n\} \rightarrow 1$. □

³ The notation used here is that of conventional functional dependencies, using coordinate positions to indicate attributes.

As stated in the next lemma, a special property of the nrecILOG¬ OID invention process implies that each witness relation r satisfies SFD(r).

Lemma 4.3.1:
Let $I_{auxA}$ be a witness. For each intermediate relation r in auxA,

    $I_{auxA} \models \mathrm{SFD}(r)$;

that is, $I_{auxA}$ satisfies the Skolem functional dependency of r.

Proof: It follows immediately from the semantics of OID invention of an invention rule in ILOG. □

4.4 Compilation of ILOG Invention Rules

Continuing with the discussion of witness relations in the previous section, we now consider the translation of an ILOG invention rule r into a population rule⁴ $r_p$.
The trigger part of $r_p$ is translated in a similar fashion to that of an ILOG non-invention rule; it monitors the conjunction predicate derived from the rule body of r against changes in the source database of r. It is the action part of $r_p$ that needs special treatment to handle OID creations in the remote database.

We now use the invention rule generated from E1 in the “City” example,

    A.i-node[*Node,c] :- A.City(c);    // R1

as an example. As mentioned in Section 2.3, R1 is actually the shorthand for the following two rules.

    A.i-node[*,c] :- A.City(c);        // R1-1
    B.Node(n) :- A.i-node[n,c];        // R1-2

In particular, R1-2 can be translated as a non-invention rule as presented in Section 3.2; R1-1 is translated into the following production rule, which is put into $\Sigma_A$.

    Rule Name: populate-A.i-node
    Trigger:   any update changing the truth value of A.City(c), with the variable binding of c.
    Action:    IF A.City(c) is becoming true
               THEN create a new OID in B and update A.i-node as follows:
                 1.1 Call remote procedure create-OID in B to create a new OID $O_B$ and have it sent back to A
                 1.2 insert A.i-node($O_B$,c)
               ELSE find the OID to be destroyed in A.i-node, update A.i-node, and send a deletion request to B as follows:
                 2.1 Search A.i-node for the OID $O_B$ associated with the witness value c
                 2.2 delete A.i-node($O_B$,c)

⁴ Since the rule head of r, a witness relation, is transparent to the rest of the database and can only be modified by $r_p$, there is no need for delay checking rules. This is true even in the bi-directional case.

Notice that, after performing the insertion (1.2) or deletion (2.2) in the action above, the production rule maintaining R1-2 will be triggered and will propose an insertion or deletion to B.Node.

Similarly, for the “Segment Flight” example, the invention rule

    A.i-flight[*,s,t] :- A.segment-info(s,t,p);    // S1

is translated into the following production rule, which is put into $\Sigma_A$.

    Rule Name: populate-A.i-flight
    Trigger:   any update changing the truth value of (∃p A.segment-info(s,t,p)), with the variable bindings of s and t.
    Action:    IF the trigger condition is becoming true
               THEN create a new OID in B and update A.i-flight as follows:
                 1.1 Call remote procedure create-OID in B to create a new OID $O_B$ and have it sent back to A
                 1.2 insert A.i-flight($O_B$,s,t)
               ELSE find the OID to be destroyed in A.i-flight, update A.i-flight, and send a deletion request to B as follows:
                 2.1 Search A.i-flight for the OID $O_B$ associated with the witness value (s,t)
                 2.2 delete A.i-flight($O_B$,s,t)

[Figure 4.4: Scenario for incremental OID creation. Database B side, RemTrans_B: Listening Phase, then Execution Phase, which executes the requests in Remote-Agenda: insert B.Flight(f49); insert B.flight-info(f49,D.C.,8:00); insert B.price(f49,250).]

The following example demonstrates how these production rules create and delete OIDs and properly maintain the witness relations.

Example 4.4:
Continuing with the “Segment-Flight” example, assume that the two databases are in the states depicted in Figure 4.2, and that the concrete linkage is generated and loaded as described. Suppose a user at database A wants to add one more segment s38 to database A and issues

    Transaction { insert A.has-node(s38,D.C.); insert A.segment-info(s38,8:00,250); }

The resulting firing sequence of the production rules is depicted in Figure 4.4. Then assume the user wants to undo the previous update by issuing

    Transaction { delete A.has-node(s38,D.C.); delete A.segment-info(s38,8:00,250); }

The resulting firing sequence of the production rules is depicted in Figure 4.5. After the RemTrans transaction commits, databases A and B return to the original states shown in Figure 4.2.
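The insert-then-undo behavior of this example can be mimicked with a toy Python model of the conceptual rule populate-A.i-flight. Everything here is a simplified sketch under stated assumptions: the class `PopulationRule` and the callback `create_remote_oid` (standing in for the remote procedure create-OID) are hypothetical names, and the request for B.Flight, which in Bird is proposed by the separately triggered non-invention rule, is folded into the same method for brevity.

```python
from itertools import count

class PopulationRule:
    """Toy model of the conceptual rule populate-A.i-flight."""

    def __init__(self, create_remote_oid):
        self.create_remote_oid = create_remote_oid  # stands in for create-OID in B
        self.i_flight = {}        # witness relation: (s, t) -> OID in B
        self.remote_agenda = []   # update requests to be sent to B

    def on_change(self, s, t, now_true):
        """React to a change in the truth value of (exists p) segment-info(s,t,p)."""
        if now_true and (s, t) not in self.i_flight:
            oid = self.create_remote_oid()        # step 1.1
            self.i_flight[(s, t)] = oid           # step 1.2
            self.remote_agenda.append(("insert", "B.Flight", oid))
        elif not now_true and (s, t) in self.i_flight:
            oid = self.i_flight.pop((s, t))       # steps 2.1 and 2.2
            self.remote_agenda.append(("delete", "B.Flight", oid))

ids = count(49)
rule = PopulationRule(lambda: f"f{next(ids)}")
rule.on_change("s38", "8:00", now_true=True)    # user inserts segment-info(s38,8:00,250)
rule.on_change("s38", "8:00", now_true=False)   # user undoes the insertion
```

Because the witness relation remembers that f49 was invented for (s38, 8:00), the deletion can name the exact OID to destroy in B, and the witness relation ends up empty again.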
□

[Figure 4.4, database A side, RemTrans_A.
  Execution Phase. The following rules are fired:
    A.i-flight[*Flight,s,t] :- A.segment-info(s,t,p);
      1.1 Call create-OID in B and get f49
      1.2 Insert A.i-flight[f49,s38,8:00]
      1.3 Append "insert B.Flight(f49)" to Remote-Agenda
    B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
      Append "insert B.flight-info(f49,D.C.,8:00)" to Remote-Agenda
    B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);
      Append "insert B.price(f49,250)" to Remote-Agenda
  Proposing Phase. Send Remote-Agenda to B.]

[Figure 4.5: Scenarios for incremental OID destruction. Database B side, RemTrans_B: Listening Phase, then Execution Phase, which executes the requests in Remote-Agenda: delete B.Flight(f49); delete B.flight-info(f49,D.C.,8:00); delete B.price(f49,250).]

Notice that the action of an OID invention rule requires newly created OIDs in B to be passed to and stored in A. If pure OIDs were used, a problem might arise: the OIDs may not be immutable. For example, B can change its OIDs due to an automatic background garbage collection procedure which is independent of A. If pure OIDs were sent to A, this could result in dangling OID references. For this reason, the translation of ILOG invention rules described above serves only as a conceptual model. The actual MDOT mechanism used in Bird to provide the sharing of OIDs, in the context of incremental OID creations/destructions, is presented in the next section.

4.5 Machine Dependent Object Translation (MDOT) Mechanism

As stated in [EK91], the values of OIDs, or machine dependent objects (MDOs), in a local database may change even within a session.
In order to provide a “stronger notion of identity” at a global level, Bird has the Machine Dependent Object Translation (MDOT) mechanism to translate local OIDs into immutable OIDs so that they can be shared among databases.

[Figure 4.5, database A side, RemTrans_A.
  Execution Phase. The following rules are fired:
    A.i-flight[*Flight,s,t] :- A.segment-info(s,t,p);
      2.1 Found f49 associated with (s38,8:00)
      2.2 Delete A.i-flight[f49,s38,8:00]
      2.3 Append "delete B.Flight(f49)" to Remote-Agenda
    B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);
      Append "delete B.flight-info(f49,D.C.,8:00)" to Remote-Agenda
    B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);
      Append "delete B.price(f49,250)" to Remote-Agenda
  Proposing Phase. Send Remote-Agenda to B.]

In the following, Section 4.5.1 introduces the functionalities of the MDOT mechanism, which include

• the translation of a local OID to an immutable OID before sending the local OID to a foreign database

• the translation of an immutable OID back to its local value when receiving it from a foreign database.

The translation of nrecILOG¬ invention rules introduced in Section 4.4 serves only as a conceptual model. The actual actions of the production rules implementing nrecILOG¬ invention rules incorporate the MDOT mechanism to achieve the sharing of OIDs. Section 4.5.2 elaborates on this collaboration between remote OID creations of the invention production rules and the MDOT mechanism.

4.5.1 Translation of Machine Dependent Object

Each OID traveling beyond the boundary of a database is encoded by the MDOT mechanism as [host, id], where host is the database where it originates and id is a unique number for this OID in the host database. To record the necessary information for the machine dependent objects present in a database, each database maintains a variable MDO-count and a relation MDOTT (for MDOT table).
MDO-count is used to assign a unique number to each machine dependent object originating in this database. For each machine dependent object that travels in or out of the database, MDOTT records its originating host (host), its identification number (id), and its local value substitution (value) in the database. This may include the OIDs created in the local database and the OIDs created in other databases.

Figure 4.6 illustrates the snapshot of the MDOT tables and the witness instances of the “Segment-Flight” example, assuming that the databases are in the states shown in Figure 4.2. The translation process can be divided into a two-way communication process, the encoding and decoding of MDOs, described below.

[Figure 4.6: MDOT snapshots for the “Segment-Flight” example.]

    Database A (MDO counter = 2)

    A.i-flight                          A.MDOTT
    Flight  witness 1  witness 2        host  id  local value
    f1      s1         10:00            A     1   s1
    f2      s1         14:00            A     2   s2
    f3      s2         09:00            B     1   [B,1]
                                        B     2   [B,2]
                                        B     3   [B,3]

    Database B (MDO counter = 3)

    B.i-segment                         B.MDOTT
    Segment  witness                    host  id  local value
    s1       S.F.                       A     1   [A,1]
    s2       N.Y.                       A     2   [A,2]
                                        B     1   f1
                                        B     2   f2
                                        B     3   f3

    The local value s1 in A is encoded through A.MDOTT as [A,1], which B.MDOTT decodes to the local value [A,1]; the local value f3 in B is encoded through B.MDOTT as [B,3], which A.MDOTT decodes to the local value [B,3].

Encoding MDOs: Whenever the local OID $O_{local}$ is to be sent to another database, do the following:

    IF there is a tuple MDOTT(local-host-name, id-no, $O_{local}$) in the MDOT table
    THEN substitute $O_{local}$ by [local-host-name, id-no]
    ELSE increment MDO-count, insert MDOTT(local-host-name, MDO-count, $O_{local}$) into MDOTT, and substitute $O_{local}$ by [local-host-name, MDO-count]

For example, with the A.MDOTT instance shown in Figure 4.6, if A wants to send s1 to B, then s1 is substituted by [A,1] before it is sent. Nonetheless, if A wants to send [B,3] to B, the encoding [B,3] itself is sent.
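The encoding procedure can be sketched in Python as follows. This is a minimal illustration, not Bird's implementation: the class name `MdotTable` is hypothetical, encodings [host, id] are modeled as tuples, and the lookup matches on the local value regardless of host. That last point is an assumption made here so that a foreign encoding such as [B,3], which is stored as its own local value, is sent unchanged, as in the example above.

```python
class MdotTable:
    """Toy MDOT table of one database (hypothetical sketch)."""

    def __init__(self, host):
        self.host = host
        self.mdo_count = 0
        self.rows = []  # tuples (host, id, local_value)

    def encode(self, local_value):
        """Return the (host, id) encoding to send in place of local_value."""
        # Assumption: look the value up regardless of its originating host,
        # so that foreign encodings stored locally are passed through as-is.
        for host, id_no, value in self.rows:
            if value == local_value:
                return (host, id_no)
        # Unknown local OID: assign the next number of this database.
        self.mdo_count += 1
        self.rows.append((self.host, self.mdo_count, local_value))
        return (self.host, self.mdo_count)

# A.MDOTT as in Figure 4.6
a = MdotTable("A")
a.rows = [("A", 1, "s1"), ("A", 2, "s2"),
          ("B", 1, ("B", 1)), ("B", 2, ("B", 2)), ("B", 3, ("B", 3))]
a.mdo_count = 2

sent_s1 = a.encode("s1")        # known local OID
sent_b3 = a.encode(("B", 3))    # foreign OID, passed through
sent_s38 = a.encode("s38")      # new local OID, gets the next count
```

With the table of Figure 4.6, the known segment s1 is sent as [A,1], the foreign flight [B,3] is sent as itself, and a segment not yet in the table receives the next count of database A.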
Decoding MDOs: Whenever an encoding [host-name, id-no] is received from another database, do the following:

    IF there is a tuple MDOTT(host-name, id-no, local-value) in the MDOT table
    THEN substitute [host-name, id-no] by local-value
    ELSE insert MDOTT(host-name, id-no, [host-name, id-no]) into the MDOT table

For example, if B receives [A,1] from A, since there is a tuple B.MDOTT(A,1,[A,1]) in the MDOT table, the encoding [A,1] itself is used in database B. If B receives [B,3], it is substituted by f3.

With the translation scheme and the MDOT table available, in case a local OID, say $o_{17}$, changes to $o'_{17}$, the entry in A.MDOTT recording $o_{17}$ will also change to $o'_{17}$. That is, the new A.MDOTT now becomes

    A.MDOTT
    host  id  value
    A     1   o'17

Whenever A communicates with a remote database, the same encoding [A,1] is always used, independent of the change of $o_{17}$.

Though the MDOT mechanism shown in this section does not contain the typing information for the OIDs, it can be easily extended.

4.5.2 Synthesis of OID Invention and MDOT

Recall the translation of an ILOG invention rule presented in Section 4.4. When the population rule of an ILOG invention rule is fired, conceptually speaking, its action performs the following:

(1) Invoke a remote procedure call to create an OID $O_{remote}$ in the remote database,

(2) Receive the remote OID $O_{remote}$ as a return value of the remote procedure call, and

(3) Store $O_{remote}$ and its witness values in the witness relation.
[Figure 4.7: Scenarios of OID invention under MDOT mechanism.
  Database A side, RemTrans_A.
    Execution Phase. Rule fired: A.i-flight[*Flight,s,t] :- A.segment-info(s,t,p)
      (1) Remote procedure call; get return value = 4
      (2) Append "create_oid=o; Insert MDOTT(B,4,o)" to Remote-Agenda
      (3) Insert A.MDOTT(B,4,[B,4])
    Proposing Phase. Send Remote-Agenda to B.
  Database B side, RemTrans_B.
    Listening Phase. get_next_count: increment MDO counter; return MDO counter.
    Execution Phase. Executing "create_oid=o; Insert MDOTT(B,n,o)":
      (1) Create new OID value = f49
      (2) Insert B.MDOTT(B,4,f49)]

Recall also the two scenarios shown in Figures 4.4 and 4.5: step (1) happens during the execution phase of a RemTrans transaction. For most active object-based database implementations, creating an object is regarded as a database event and may trigger some production rule in the remote database. As a result, a deadlock situation similar to the scenario depicted in Figure 3.1 may happen. Also, in the real implementation, (2) and (3) require the remote OID to travel across the border of a database. From the discussion in Section 4.5.1, we know that special treatment is needed to incorporate this remote OID invention process within the general MDOT framework.

Consider the scenario depicted in Figure 4.7, which portrays the process of OID invention under the framework of the MDOT mechanism. Shown in the figure is the actual creation process of the OID f49 in B.Flight first presented in Figure 4.4 (Section 4.4).

In the figure, assume the user at database A issues a RemTrans transaction. During the execution phase of RemTrans_A, the incremental update triggers the population rule $r_p$ generated from

    A.i-flight[*,s,t] :- A.segment-info(s,t,p);

The action of $r_p$ first invokes a remote procedure call get-next-count in database B, which increments the MDO-count in B and returns its value, 4, to A (1).
Then it appends an update request to Remote-Agenda requesting the B database to create a new OID and recognize this new OID value as the 4th machine dependent object created in B (2). Finally, the action of $r_p$ puts the encoding [B,4] of this remote OID into the MDOT table in database A (3). In the Proposing Phase, RemTrans_A sends the update requests stored in Remote-Agenda to B (4). After receiving and executing the update request from A, RemTrans_B creates a new OID, f49, and inserts B.MDOTT(B,4,f49) into its own MDOT table (5). After that, the resulting MDOT tables for databases A and B are

    A.MDOTT                   B.MDOTT
    host  id  value           host  id  value
    B     4   [B,4]           B     4   f49

4.6 Static vs. Dynamic Semantics in ALD

This section revisits the “City” example and investigates the subtlety in choosing different witnesses for the invention of remote OIDs in a nrecILOG¬ invention rule.

Assume that in the “City” example, A.c-name is the key for entity type A.City. We can rewrite $P_{A \mapsto B}$ into $P'_{A \mapsto B}$ by substituting R1 by

    A.i-node[*Node,n] :- A.c-name(c,n);    // X1

That is, instead of using the cities in A.City as the witnesses, the nodes in B.Node are invented with the city names in A.c-name as witnesses.

Statically speaking, given a source instance I, the two programs always give the same target instance on the logical instance level (Section 2.3.2). That is,

    $P'_{A \mapsto B}(I) \cong P_{A \mapsto B}(I)$, or $[P_{A \mapsto B}(I)] = [P'_{A \mapsto B}(I)]$.

However, the concrete linkages generated from the two nrecILOG¬ programs (with R1 and X1, respectively) may behave differently. Consider the two scenarios depicted in Figure 4.8.
[Figure 4.8: Dynamic behavior differences for “City” example. Top: the initial instances of A.c-name and B.node-label, as in Figure 4.1. Scenario (1) uses the rule A.i-node[*Node,c] :- A.City(c); scenario (2) uses the rule A.i-node[*Node,n] :- A.c-name(c,n). Both start from the user updates delete A.c-name(c1,L.A.) and insert A.c-name(c1,D.C.).]

Initially, the two databases are in the states depicted at the top of the figure. Assume a user at A issues a transaction

    Transaction { delete A.c-name(c1,L.A.); insert A.c-name(c1,D.C.); }

Then, in scenario (1), running the concrete linkage generated from $P_{A \mapsto B}$, the corresponding updates sent to B are

    delete B.node-label(n1,L.A.); insert B.node-label(n1,D.C.);

which in essence modify the attribute B.node-label of the node n1 from L.A. to D.C. On the other hand, in scenario (2), running the concrete linkage generated from $P'_{A \mapsto B}$, the updates propagated to the B database include

    delete B.Node(n1); delete B.node-label(n1,L.A.);
    insert B.Node(n250); insert B.node-label(n250,D.C.);

Now the OID n1 will first be deleted, and then a new OID n250 will be created with the new label D.C.

Notice that the two resulting target instances in the two scenarios are equivalent as logical instances. But suppose B.Node has another attribute B.color with the following instance initially.

    B.color
    Node   color
    n1     green
    n2     yellow
    n3     red

Then, in scenario (1) the color of the node (n1) that corresponds to city c1 will still be green; in scenario (2) the color information of the node (n250) that corresponds to city c1 is either lost or contains a dangling OID reference.

This phenomenon is due to different ways of identifying remote objects with nrecILOG¬; as in the two scenarios above, even though $P_{A \mapsto B}$ and $P'_{A \mapsto B}$ compute the same (logical) target instances, on the physical instance level the concrete linkages associated by Bird with these programs have different dynamic behavior. In $P_{A \mapsto B}$, the identity of a node n1 is “witnessed” by a city OID c1; therefore changing the attributes of c1 does not affect the existence of n1. In contrast, in $P'_{A \mapsto B}$ the focus is on the names of cities.
The existence of node n1 depends on the name of the city L.A. in A. As a result, changing the name of city c1 results first in the destruction of n1, and second in the creation of another node n250 that corresponds to the new name D.C. of city c1.

For the application in the “City” example, $P_{A \mapsto B}$ describes the semantic correspondence of the application more precisely, because no matter what the name of a city is, there will be a node on the GUI representing that city. Therefore, the correspondence is between the OIDs of A.City and the OIDs of B.Node.

A complete investigation of this phenomenon is beyond the scope of this thesis. Practically speaking, it is important for DBAs to be aware of this subtlety and to ensure that the nrecILOG¬ invention rules used are the ones that best express the semantics of the application.

Chapter 5

Bi-directional Propagation of Incremental Updates

With the framework introduced in Chapters 3 and 4, we are now ready to present the full capabilities of the Bird system to maintain bi-directional semantic correspondences involving OIDs.

First, in Section 5.1, the preprocessing of the EES in the bi-directional case is discussed. Then Section 5.2 discusses the witness update problem, which motivates the need for witness generators; these are special rules that populate the witness relations in the remote database. Section 5.3 gives a formal definition of the witness generators. An algorithm, called Witness Generator Generator (WGG), which generates the witness generators from a DBA-given ALD, will be presented in the next chapter. Section 5.5 presents the justification for using the EES to specify entity equivalence relationships, and the subtle semantics implied by the EES that cannot be expressed by nrecILOG¬ invention rules alone. Finally, Section 5.6 summarizes the development and the major components of the Bird system introduced so far.
The “City” and “Segment-Flight” examples introduced in Chapter 4 will again be used in this chapter. However, the difference is that the applications now require the two databases A and B to be treated symmetrically; i.e., both databases can issue incremental updates that propagate to the other database. For this reason, the ALDs include two nrecILOG¬ programs ($P_{A \mapsto B}$ and $P_{B \mapsto A}$) specifying the instance mappings in both directions.

Example 5.1: “City”, the bi-directional case
Continuing with the “City” example introduced in Chapter 4, the following ALD specifies the bi-directional semantic correspondence for the “City” example.

    (abs-linkage city
      A-schema (
        City(entity);
        c-name(City,string);
      )
      B-schema (
        Node(entity);
        node-label(Node,string);
      )
      Entity-Equivalence (
        A.City = B.Node;                                          // E1
      )
      A-to-B (
        B.node-label(n,l) :- A.City=Node[c,n], A.c-name(c,l);     // R1
      )
      B-to-A (
        A.c-name(c,l) :- B.City=Node[c,n], B.node-label(n,l);     // S1
      )
    )
□

Example 5.2: “Segment Flight”, the bi-directional case
Continuing with the “Segment-Flight” example introduced in Chapter 4, the following ALD specifies the bi-directional semantic correspondence for the “Segment-Flight” example.

    (abs-linkage segment-flight
      A-schema (
        Segment(entity);
        has-node(Segment,string);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        flight-info(Flight,string,string);
        price(Flight,string);
      )
      A-to-B (
        A.i-flight[*Flight,s,t] :- A.segment-info(s,t,p);                                  // R1
        B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);                        // R2
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);                          // R3
      )
      B-to-A (
        B.i-segment[*Segment,c] :- B.flight-info(f,c,t);                                   // S1
        A.has-node(s,c) :- B.i-segment[s,c];                                               // S2
        A.segment-info(s,t,p) :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);     // S3
      )
    )
□

For the discussion in this chapter, it is assumed that the MDOT mechanism introduced in Section 4.5 is being used, so that all OIDs are global, immutable, and can be passed between databases.

5.1 Preprocessing of EES in Bi-Directional Cases

As shown in Section 4.1, the EES can be used to express a one-one correspondence relationship between a pair of entity types in the two databases. For the bi-directional cases, consider the ALD in Example 5.1 above. Notice that, unlike the ALD in Example 4.1, the ALD above has a B-to-A section. Also, the EES relation B.City=Node is used in S1, recording the matching OID pairs of A.City and B.Node. As indicated by its prefix, B.City=Node is a relation located in database B. That is, for each EES entry in a bi-directional ALD, the Bird system maintains two duplicate copies of the same EES relation, one in each database.

Similar to the uni-directional cases, the EES in a bi-directional ALD L is first translated by the following rewriting system, then compiled by the ALD compiler. The result of the translation, denoted EES-to-ILOG(L), is an ALD where $P_{A \mapsto B}$ and $P_{B \mapsto A}$ do not contain any EES relation predicate.

Rewriting System 5.1
For each EES entry A.R = B.S in a bi-directional ALD, the EES translator EES-to-ILOG rewrites the $P_{A \mapsto B}$ and $P_{B \mapsto A}$ programs in the ALD as follows.

(1) The following ILOG rules are generated and added to $P_{A \mapsto B}$ and $P_{B \mapsto A}$, respectively.

        A.i-S[*S,r] :- A.R(r);
        B.i-R[*R,s] :- B.S(s);

(2) Each occurrence of the predicate A.R=S[x,y] in $P_{A \mapsto B}$ is substituted by i-S[y,x], and each occurrence of the predicate B.R=S[x,y] in $P_{B \mapsto A}$ is substituted by i-R[x,y].
□

In the above, (1) and (2) are simply the rules of Rewriting System 4.1 applied symmetrically to both $P_{A \mapsto B}$ and $P_{B \mapsto A}$.

Example 5.3: For the ALD L of the bi-directional "City" example, the resulting ALD of the EES rewriting process, EES-to-ILOG(L), is

    Entity-Equivalence (
      A.City = B.Node;                                             // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);           // R1
      A.i-node[* Node,c] :- A.City(c);                             // R2
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);           // S1
      B.i-city[* City,o] :- B.Node(o);                             // S2
    )

By rewriting rule (2), the EES relation predicates A.City=Node and B.City=Node in R1 and S1 are replaced by A.i-node and B.i-city, respectively. And, by rewriting rule (1), R2 and S2 are added. □

For the remainder of this chapter, unless otherwise specified, we assume that the ALDs under discussion have first been pre-processed by EES-to-ILOG, so that their $P_{A \mapsto B}$ and $P_{B \mapsto A}$ do not contain any EES relation predicate.

5.2 Witness Update Problem

As discussed in Section 4.3, for the uni-directional case, the schema of A is augmented with witness relations to store the association between the OIDs defined logically by $P_{A \mapsto B}$ and the physical OIDs of B that correspond to them. For the same reason, in the bi-directional case, witness relations, denoted auxA and auxB respectively, are added to both A and B. The resulting schemas are denoted $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$.

[Figure 5.1: Naive compilation for bi-directional ALD involving OIDs]

For a bi-directional ALD, a naive way to maintain the semantic correspondence is to simply compile both $P_{A \mapsto B}$ and $P_{B \mapsto A}$ as in the uni-directional case. The resulting concrete linkages can then translate incremental updates in both directions. Unfortunately, as indicated in Figure 5.1, there is a potential problem: although updates at A can be propagated to $J_B$, there are no rules in A to populate the witness relations in auxB, nor are there rules in B to populate auxA.
This is the witness update problem. To see how this leads to a problem under incremental updates, consider the following scenario for the "Segment Flight" example.

[Figure 5.2: Scenario using the "naive" concrete linkages. The figure traces the execution, proposing, and listening phases of RemTrans at A and B: the user transaction at A fires the three A-to-B rules and sends a Remote_Agenda to B; B executes the requests, fires the three B-to-A rules, and sends a Remote_Agenda back to A, which then inserts a second segment OID.]

Suppose the ALD in Example 5.2 is compiled by the "naive compiler" mentioned above. Assume that after loading the concrete linkages, the two databases are in the state of equilibrium illustrated in Figure 5.3 and all the production rules are satisfied. Then a user at A wants to add one more segment by issuing the following transaction.

    Transaction{
      Insert A.Segment(s38);
      Insert A.segment-info(s38,8:00,250);
      Insert A.has-dest(s38,D.C.);
    }

As we can see in Figure 5.2, after executing the user-issued transaction and assuming that f49 is the new OID created in B, three production rules in A are fired, and RemTrans at A sends the following update requests to B:

    insert B.Flight(f49);
    insert B.flight-info(f49,D.C.,8:00);
    insert B.price(f49,250);

After executing the update requests, three rules in B are triggered.
These request a new OID from A, say s13, and propose

    Transaction{
      Insert A.Segment(s13);
      Insert A.segment-info(s13,8:00,250);
      Insert A.has-dest(s13,D.C.);
    }

back to A. At this point, things have already gone wrong! A new OID s13 has been created for the same segment represented by s38. This abnormality is called the witness update problem. Analogous problems may arise when information associated with a given OID is modified.

The root of the problem is that there are no rules in A to propagate the association information (s38, D.C.) to B.i-segment and so prevent unnecessary rule firings in B that "re-invent" s38. The solution to this problem is to augment $P_{A \mapsto B}$ by adding the nrecILOG- rule

    B.i-segment[s,c] :- A.has-node(s,c);

With this rule added, after executing the user-issued update, this new rule is fired and adds one more update request

    Insert B.i-segment(s38,D.C.);

to be sent to B. Now, due to this newly-added rule, the association between s38 and its witness D.C. is established in B.i-segment. As a result, all rules in B are satisfied and the transaction is finally committed. Similarly, the rule

    A.i-flight[f,s,t] :- B.i-segment[s,d], B.flight-info(f,d,t), B.price(f,p);

must be added to $P_{B \mapsto A}$, propagating updates of database B to the witness relations in A.

The two rules just given are called the witness generators. In the next section, the witness generator is formally defined. In general, finding witness generators is a non-trivial task. Chapter 6 presents a procedure for finding them, and Chapter 7 presents a theoretical analysis of when the procedure halts.

5.3 Definition of Witness Generator

As pointed out in the previous section, in order to resolve the witness update problem, the ALD compiler generates a set of nrecILOG- rules, called the witness generators, and adds them to the ALD during compilation. In this section, the formal definition of the witness generator is presented.
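Before turning to the formal definition, the scenario of Section 5.2 can be replayed in a small sketch. The code below is a deliberately simplified model, not the Bird production-rule machinery: it looks only at the single invention rule for B.i-segment, starts from an empty witness relation rather than the equilibrium state of Figure 5.3, and uses a hypothetical function `fire_s1` and a hypothetical supply of fresh OIDs. It shows only why the witness rule B.i-segment[s,c] :- A.has-node(s,c) prevents the re-invention of a segment OID.

```python
# Simplified, hypothetical model of the witness update problem (see lead-in).

def fire_s1(flight_info, i_segment, fresh_oids):
    """Sketch of B's invention rule
         B.i-segment[* Segment,c] :- B.flight-info(f,c,t).
    A new segment OID is invented for every city c that has no witness entry yet."""
    i_segment = set(i_segment)
    for (_f, c, _t) in sorted(flight_info):
        if not any(w == c for (_s, w) in i_segment):
            i_segment.add((next(fresh_oids), c))
    return i_segment

# Update requests received by B after the user inserts segment s38 at A.
flight_info = {("f49", "D.C.", "8:00")}

# Naive linkage: B knows nothing about s38, so a new OID (s13) is invented.
naive = fire_s1(flight_info, set(), iter(["s13"]))

# With the witness rule, A also sends i-segment(s38, D.C.) along with the update.
has_node = {("s38", "D.C.")}
witness_from_a = {(s, c) for (s, c) in has_node}
fixed = fire_s1(flight_info, witness_from_a, iter(["s13"]))
```

In the naive run the witness relation ends up holding the spurious pair (s13, D.C.); with the witness rule it holds (s38, D.C.) and no OID is invented.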
Intuitively speaking, the witness relations serve as the glue combining the remote physical OIDs and their corresponding logical witness values, so that the local database can determine how to translate updates to the specific remote OIDs. As defined in Section 4.3, the information stored in the witness relations is called a witness. In the context of bi-directional semantic correspondence, the witnesses stored at databases A and B co-exist as a pair.

Definition: Witness Pair

Let L be an ALD specifying a semantic correspondence $SC_{AB}$, and let $(I_A, J_B) \in SC_{AB}$ be an instance pair in the semantic correspondence. The two instances $I_{auxA}$ over auxA and $J_{auxB}$ over auxB are a witness pair for $(I_A, J_B)$ if

(1) $I_{auxA}$ is a witness from $I_A$ to $J_B$ and $J_{auxB}$ is a witness from $J_B$ to $I_A$; that is,

$$P_{A \mapsto B}(I_A) = I_{auxA} \quad \text{and} \quad P_{A \mapsto B}(I_A \cup I_{auxA}) = J_B$$
$$P_{B \mapsto A}(J_B) = J_{auxB} \quad \text{and} \quad P_{B \mapsto A}(J_B \cup J_{auxB}) = I_A$$

(2) For each pair of intermediate relations B.iP and A.iQ resulting from the EES-to-ILOG translation of an EES entry A.P = B.Q, the two relations B.iP and A.iQ are inverse to each other. That is,

$$\forall p, q:\quad iP(p,q) \in J_{auxB} \iff iQ(q,p) \in I_{auxA}$$
□

A key aspect of the approach taken by Bird is to maintain the witness pair $(I_{auxA}, J_{auxB})$ along with $(I_A, J_B)$. In this way the remote (physical) instance $J_B$ can be computed locally at A by

$$J_B = P_{A \mapsto B}(I_A \cup I_{auxA})$$

That is, the A database holds enough information to reconstruct $J_B$. Similarly, the instance $I_A$ can be computed locally at B by

$$I_A = P_{B \mapsto A}(J_B \cup J_{auxB})$$

Based on the notion of witness pair, the witness generator can be formally defined as follows.
Definition: Witness Generator

Given an ALD L containing two nrecILOG- programs $P_{A \mapsto B}$ and $P_{B \mapsto A}$, a nrecILOG- program $P_{A \mapsto auxB}$ is said to be the witness generator from A to auxB if, for each $(I_A, J_B) \in SC_{(P_{A \mapsto B}, P_{B \mapsto A})}$ with witness pair $(I_{auxA}, J_{auxB})$,

$$J_{auxB} = P_{A \mapsto auxB}(I_A \cup I_{auxA})$$

Similarly, a nrecILOG- program $P_{B \mapsto auxA}$ is said to be the witness generator from B to auxA if

$$I_{auxA} = P_{B \mapsto auxA}(J_B \cup J_{auxB})$$

The rules in the witness generators are called the witness rules. □

Example 5.4: Continuing with the "Segment Flight" example (Example 5.2), from the ALD we know auxA = { i-flight } and auxB = { i-segment }. The witness pair $(I_{auxA}, J_{auxB})$ and the two instances $I_A$ and $J_B$ are depicted in Figure 5.3. As shown in the previous section, the witness rule $WG_{i\text{-}segment}$ is

    B.i-segment[s,d] :- A.has-dest(s,d), A.segment-info(s,t,p);

and the witness rule $WG_{i\text{-}flight}$ is

    A.i-flight[f,s,t] :- B.i-segment[s,d], B.flight-info(f,d,t), B.price(f,p);

Then $J_{auxB}.\text{i-segment} = WG_{i\text{-}segment}(I_A \cup I_{auxA})$, and $I_{auxA}.\text{i-flight} = WG_{i\text{-}flight}(J_B \cup J_{auxB})$. □

    Database A ($I_A$)

    A.has-node                  A.segment-info
    Segment   node              Segment   time    price
    s1        S.F.              s1        10:00   65
    s2        N.Y.              s1        14:00   80
                                s2        09:00   150

    Database B ($J_B$)

    B.flight-info               B.price
    Flight   city   time        Flight   price
    f1       S.F.   10:00       f1       65
    f2       S.F.   14:00       f2       80
    f3       N.Y.   09:00       f3       150

    Witness pair

    $I_{auxA}$: A.i-flight             $J_{auxB}$: B.i-segment
    Flight   Segment   time     Segment   node
    f1       s1        10:00    s1        S.F.
    f2       s1        14:00    s2        N.Y.
    f3       s2        09:00

Figure 5.3: Witness pair for the "Segment Flight" example

5.4 Logical and Physical Views of a Semantic Correspondence

As mentioned in Section 2.3, due to the non-deterministic nature of OID invention, the original semantics presented in [HY90] is defined on the notion of "logical instance", i.e. the OID-equivalence class of database instances.
Therefore, when the DBA writes an ALD L involving OID invention/destruction, what L actually describes is the semantic correspondence $SC_{AB}$ between logical instances of A and logical instances of B.

Definition: Semantic Correspondence Involving OIDs

Let L be an ALD defined between two databases A and B. Let the two nrecILOG- programs in L be $P_{A \mapsto B}$ and $P_{B \mapsto A}$, containing some invention rules. Then the semantic correspondence $SC_{AB}$ defined by L is

$$\{ ([I],[J]) \mid ([P_{A \mapsto B}(I)] = [J]) \wedge ([P_{B \mapsto A}(J)] = [I]) \}$$
□

[Figure 5.4: Different levels of abstraction in the Bird system. The ALD describes the SC at the "logical instance" level; the ALD compiler produces the augmented description of the SC at the "physical instance" level.]

However, in practice logical instances are more a concept to achieve "OID independence" than a reality. For most database implementations, only the physical instances are stored in the database. The witness relations (Section 4.3) and witness generators can be viewed as the extra information needed to bridge the gap between the logical and physical aspects of an ALD. That is, to maintain the semantic correspondence between two databases physically, the Bird system needs: (1) the witness relations keeping the OID association information, and (2) the witness generators to properly maintain the witness relations. This motivates the following definition.

Definition: Extended Semantic Correspondence

Let L be an ALD specifying a semantic correspondence $SC_{AB}$. Let auxA and auxB be the schemas of the witness relations appearing in $P_{A \mapsto B}$ and $P_{B \mapsto A}$ respectively. Let $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$. Then the extended semantic correspondence $SC_{\hat{A}\hat{B}}$ is

$$\{ (\hat{I}_A, \hat{J}_B) \mid (\hat{I}_A = I_A \cup P_{A \mapsto B}(I_A)) \wedge (\hat{J}_B = J_B \cup P_{B \mapsto A}(J_B)) \wedge (P_{A \mapsto B}(\hat{I}_A) = J_B) \wedge (P_{B \mapsto A}(\hat{J}_B) = I_A) \}$$
□

Figure 5.4 shows how the Bird system helps the DBA to specify and maintain a semantic correspondence by supporting both aspects of a semantic correspondence.
As indicated in the figure, the DBA specifies a semantic correspondence at the abstract level of "logical instances"; the ALD compiler then translates this "logical level" specification into the concrete linkages, so that the extended semantic correspondence $SC_{\hat{A}\hat{B}}$ can be maintained efficiently at the "physical level".

5.5 Equivalent and Mutually Recursive Classes

This section discusses the justification for using EES to specify the equivalence relationship between two entity types, and the subtle semantics implied by the EES that cannot be expressed by nrecILOG- invention rules alone.

Consider the bi-directional "City" example. An "ILOG-activist" may argue that the following nrecILOG- programs alone express the semantic correspondence more naturally. (These are virtually the same nrecILOG- programs as in the ALD generated by EES-to-ILOG.)

    A-to-B (
      A.i-node[* Node,c] :- A.City(c);                             // T1
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);           // T2
    )
    B-to-A (
      B.i-city[* City,n] :- B.Node(n);                             // U1
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);           // U2
    )

However, the DBA should use the EES instead of the conventional nrecILOG- invention rules for the following reasons.

[Figure 5.5: Different entity type relationships specified by EES and invention rules. (a) Four possible OID mappings of the "City" example by EES. (b) Example OID mapping recorded in i-Node and i-City. (c) Possible OID mappings of the "City" example by invention rules.]

To articulate the equivalent relationship between matching entity types: The EES gives a more direct indication that there is a one-one correspondence between the two entity types, which cannot be articulated by using nrecILOG- invention rules alone.
As we have seen in the "City" example, an EES entry

    A.City = B.Node;

in the ALD directly points out that there is a one-one correspondence between OIDs in A.City and OIDs in B.Node. Suppose there are two OIDs c1, c2 in A.City and two OIDs n1, n2 in B.Node; then Figure 5.5(a) shows schematically the two different possibilities for the correspondence. On the other hand, the correspondence specified by the nrecILOG- invention rules

    i-node[* Node,c] :- A.City(c);                                 // T1
    i-city[* City,n] :- B.Node(n);                                 // U1

is not necessarily one-one. Figure 5.5(b) illustrates example instances of the two witness relations. In it, i-node records the mapping from c1, c2 to n1, n2 respectively; in contrast, i-city records the mapping from n1, n2 to c2, c1 respectively. In fact, as indicated schematically in Figure 5.5(c), there are four possible ways two nodes n1, n2 and two cities c1, c2 can link up.

To avoid an endless loop in the WGG algorithm: The two invention rules T1 and U1 form a recursive OID invention loop between A.City and B.Node. That is, T1 says that the witness for the invention of an OID in B.Node is an OID in A.City; U1 says that the witness for the invention of an OID in A.City is an OID in B.Node. As we will see in Section 6.5, using nrecILOG- invention rules recursively to specify the equivalence of two entity types results in an endless loop in the WGG algorithm.

Later, in Chapter 6, we will see that the ALD compiler uses the entity equivalence information in the Entity-Equivalence section, so that the proper witness generators can be generated for the two nrecILOG- invention rules generated by rewriting rule (1) of Rewriting System 5.1.

5.6 Recapitulation of the Bird System

This section summarizes the major components of the Bird system presented so far. The refined system architecture for the Bird system is shown in Figure 5.6.
To specify a semantic correspondence, the DBA first writes an ALD specifying the correspondence between the logical instances of A and the logical instances of B. An ALD includes the following parts.

A-Schema: The schema description of A.

B-Schema: The schema description of B.

Entity-Equivalence: Entries specifying the one-one correspondence relationships between entity types in A and entity types in B.

A-to-B: A nrecILOG- program specifying the mapping from the logical instances of A to the logical instances of B.

B-to-A: A nrecILOG- program specifying the mapping from the logical instances of B to the logical instances of A. For a uni-directional semantic correspondence, the ALD does not contain this part.

[Figure 5.6: Refined system architecture for the Bird system. The ALD (A-schema, B-schema, Entity-Equivalence, A-to-B, B-to-A) is the input to the ALD compiler, which performs (1) preprocessing of EES, (2) witness relations augmentation, (3) adding witness generators to the ALD, and (4) generating concrete linkages. Each database receives relation definitions for its witness relations, population rules, and delay-checking rules, and runs RemTrans on top of the AP5 active DBS, the MDOT mechanism, and the communication subsystem.]

With the ALD as input, the ALD compiler performs the following tasks:

(1) Preprocessing of EES: Each EES entry in the ALD is translated according to Rewriting System 5.1 presented in Section 5.1.

(2) Witness relations augmentation: The ALD compiler generates relation definitions for the intermediate relations appearing in the rule heads of production rules in the ALD (Section 4.3). These relations, called the witness relations, are to be materialized once the concrete linkages are loaded in the databases.
(3) Adding Witness Generators to the ALD: The WGG algorithm (to be presented in Chapter 6) is used to generate a witness generator for each invention rule in the ALD. These witness generators (Section 5.3), which are non-invention nrecILOG- rules with an intermediate relation in the rule head, are added to the ALD.

(4) Generating Concrete Linkages: The compiler then generates two concrete linkages, one for each of the databases A and B. The concrete linkages include the relation definitions generated in (2), and the population and delay-checking rules (Section 3.2) for the two databases. The translations of nrecILOG- non-invention and invention rules are presented in Sections 3.2 and 4.4 respectively.

After loading the concrete linkages, the two databases A and B are ready to propagate incremental updates. An overview of the interactions between layers of the Bird system to translate and propagate a user-given incremental update $\Delta_A$ ($\Delta_B$) into $\Delta_{\hat{B}}$ ($\Delta_{\hat{A}}$) was presented in Section 1.3.1.

Chapter 6

How to Find Witness Generators

One of the main contributions of this research is the development of the Witness Generator Generator (WGG) algorithm, which generates correct witness generators from a user-specified ALD. This chapter focuses on the intuition behind the algorithm and discusses it informally. A formal discussion of the soundness and the decidability of halting of WGG is presented in Chapter 7.

In Section 6.1, an informal description of WGG is given to highlight the intuition behind the procedure. Section 6.2 first presents the notion of "well-foundedness" of an ALD; it then describes a technique for removing the inherent non-well-foundedness in ALDs containing EES definitions. Section 6.3 presents some formal definitions needed in Chapter 7 for the proof of WGG soundness. Section 6.4 describes the WGG procedure. Unfortunately, the WGG algorithm does not always terminate.
Finally, in Section 6.5 we present two families of ALDs that may lead to endless WGG execution. These demonstrate the incompleteness of the WGG algorithm and motivate the further study of its termination behavior in Chapter 7.

For the discussion in this chapter, the examples "City" (Example 5.1) and "Segment Flight" (Example 5.2) will again be used.

6.1 Overview of Automatic Generation of Witness Generators

In this section, we present the intuition behind the automatic generation of witness generators. In the following discussion, two examples are presented; the first demonstrates how the witness rule for an intermediate relation generated by the EES-to-ILOG translation can be generated, and the second shows the construction of the witness rule for an ordinary intermediate relation. For the sake of discussion, we define the Skolem operator on an ALD as follows.

Definition: Skolem(L)

Let L be an ALD; then Skolem(L) is the ALD such that each nrecILOG- invention rule

    i-R[* R, w] :- ...;

is first expanded into the non-shorthand form

    i-R[*, w] :- ...;
    R(r) :- i-R[r, w];

Then each non-shorthand invention rule is skolemized (Section 2.3.1) into

    i-R[f(w), w] :- ...;

with a distinct skolem function f for each invention rule. □

The next example walks through the "City" example and shows how we can use "common sense" to infer the witness rule for the intermediate relation B.i-city derived from the EES entry A.City = B.Node.

Example 6.1: The ALD L presented in Example 5.1 can be transformed into a new ALD, L' = EES-to-ILOG(L), as follows.

    Entity-Equivalence (
      A.City = B.Node;                                             // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);           // R1
      A.i-node[* Node,c] :- A.City(c);                             // R2
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);           // S1
      B.i-city[* City,o] :- B.Node(o);                             // S2
    )

L' can then be skolemized into L'' = Skolem(L') as follows.
    Entity-Equivalence (
      A.City = B.Node;                                             // E1
    )
    A-to-B (
      B.node-label(n,l) :- A.i-node[n,c], A.c-name(c,l);           // R1
      A.i-node[f(c),c] :- A.City(c);                               // R2'
      B.Node(n) :- A.i-node[n,c];                                  // R3'
    )
    B-to-A (
      A.c-name(c,l) :- B.i-city[c,n], B.node-label(n,l);           // S1
      B.i-city[g(o),o] :- B.Node(o);                               // S2'
      A.City(c) :- B.i-city[c,o];                                  // S3'
    )

By the definition of witness generator in Section 5.3, a witness generator $\mathcal{R}$ for B.i-city satisfies

$$\forall (\hat{I}_A, \hat{J}_B) \in SC_{\hat{A}\hat{B}}:\quad \hat{J}_B.\text{i-city} = \mathcal{R}(\hat{I}_A)$$

From the semantics of EES we know the following is true.

$$\forall c, n:\quad \text{B.i-city}[c,n] \iff \text{A.i-node}[n,c]$$

Thus, the witness generator for i-city is

    B.i-city[x,y] :- A.i-node[y,x]

In fact, we can generalize this result: for each EES entry A.P = B.Q, the witness generators for B.iP and A.iQ are

    B.iP[x,y] :- A.iQ[y,x];

and

    A.iQ[x,y] :- B.iP[y,x];

respectively. □

[Figure 6.1: SLD Expansion for "Segment Flight". The figure tabulates, for each expansion step, the most general unifier (MGU) and the resulting goal set, starting from the goal A.i-flight[f,s,t].]

The next example demonstrates how to construct a witness rule for an intermediate relation which is not associated with any EES entry.

Example 6.2: Let the ALD shown in Example 5.2 for the "Segment Flight" example be L. L can be skolemized into L' = Skolem(L) as follows.
    A-to-B (
      A.i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);             // R1
      B.flight-info(f,c,t) :- A.i-flight[f,s,t], A.has-node(s,c);  // R2
      B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);    // R3
    )
    B-to-A (
      B.i-segment[g(c),c] :- B.flight-info(f,c,t);                 // S1
      A.has-node(s,c) :- B.i-segment[s,c];                         // S2
      A.segment-info(s,t,p) :- B.i-segment[s,c],                   // S3
                               B.flight-info(f,c,t), B.price(f,p);
    )

Let A.i-flight[f,s,t] be an arbitrary tuple in $\hat{I}_A$. Let us "infer", from the above nrecILOG- programs, what is also true along with the fact

    A.i-flight[f,s,t] ∈ $\hat{I}_A$

According to rule R1

    A.i-flight[f(s,t),s,t] :- A.segment-info(s,t,p);

and the fact that there is no union in the nrecILOG- programs $P_{A \mapsto B}$ and $P_{B \mapsto A}$, there must be a tuple A.segment-info(s,t,p) in $\hat{I}_A$ for some price p, and f is equal to the skolem term f(s,t). Next, from rule S3

    A.segment-info(s,t,p) :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);

we know there must be three tuples B.i-segment[s,c], B.flight-info(f',c,t), and B.price(f',p) in $\hat{J}_B$ for some c, f'.

In essence, the "common sense" reasoning can be carried out as an SLD-resolution style expansion. This is shown pictorially in Figure 6.1. In particular, at step 1 the initial OID variable f is substituted by the skolem term f(s,t). As the expansion goes on, at step 3 f is further substituted, along with the substitution of s by g(c), by f(g(c),t). Along the expansion, there is another variable f' which is also substituted by the same skolem term f(g(c),t) at step 4. Intuitively speaking, from the above observation we know that variable f must equal variable f'.

Furthermore, by observing the expansion, we know the goal set at step 2 has the following properties: (1) it contains the variable f', (2) each predicate in it comes from database B, and (3) it contains both witness variables, t and s, or their equivalents, of A.i-flight[f,s,t]. Then the witness rule can be constructed by using the goal set at step 2 as the rule body and A.i-flight[f,s,t] as the rule head, with equivalent variables equated; i.e., the witness generator for i-flight is

    A.i-flight[f,s,t] :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p)
□

From the two examples above, we can now summarize the "common sense" reasoning as follows. To find a witness rule for an intermediate relation r, consider the following two cases.

Case 1) r is defined in an EES entry R = S.
Directly construct the witness rule for r as

    r[x,y] :- s[y,x]

Case 2) r is not defined in any EES entry.
Then do an SLD-resolution style expansion as follows.

(1) Initially construct a goal set containing an atom r[o,w], where o is the OID variable and w stands for the witness variables.

(2) Expand the goal set as if doing an SLD-expansion, until there exists a variable o' such that both o and o' are substituted by the same skolem term.

(3) Then construct the witness rule $\mathcal{R}$ using r[o,w] as the rule head. The body of $\mathcal{R}$ is the goal set which contains: (a) only atoms from the other database, and (b) variables equivalent to o and w.

The WGG algorithm is essentially based on the above "common sense" reasoning. (Footnote 1: As will be shown later, the way a witness generator is constructed by the WGG algorithm is more complicated than (3) above. For some ALDs, (3) does not always find a goal set satisfying conditions (a) and (b).)

6.2 Removal of EES

As mentioned in Section 5.5, the EES expresses the one-one correspondence relationship between entity types, and the EES-to-ILOG transformation (Section 5.1) defines the semantics of an EES entry in terms of nrecILOG- invention rules. In particular, an EES entry
    A.P = B.Q;

is translated into two nrecILOG- invention rules

    A.iQ[* Q,p] :- A.P(p);
    B.iP[* P,q] :- B.Q(q);

However, the two rules above involve a recursive invention loop; i.e., an OID p of A.P is the witness of the invention of an OID q of B.Q, and vice versa. This makes the ALD "non-well-founded" (Section 6.2.1). As will be shown in Section 6.5.2, non-well-founded ALDs result in endless executions of the WGG algorithm.

To circumvent the non-well-foundedness inherent in ALDs containing EES, two rewriting operators, reduce and expand, are proposed (Section 6.2.2). In essence, an ALD L is first "reduced" into a well-founded ALD reduce(L); then the witness generators for reduce(L) are "expanded" into witness generators for the original ALD L.

For the discussion in this section, the following example will be used.

Example 6.3: "Itinerary"

This is almost the same as the "Segment Flight" example (Example 5.2), except that A.Node and B.City are entity types. Furthermore, there is a one-one correspondence between nodes in A.Node and cities in B.City. The schemas are shown in Figure 6.2.

[Figure 6.2: Schemas of Example "Itinerary"]

The semantic correspondence can be specified by the following ALD L, which has already been translated by EES-to-ILOG.

    (abs-linkage itinerary
      A-schema (
        Segment(entity);
        Node(entity);
        has-node(Segment,Node);
        segment-info(Segment,string,string);
      )
      B-schema (
        Flight(entity);
        City(entity);
        flight-info(Flight,City,string);
        price(Flight,string);
      )
      Entity-Equivalence (
        A.Node = B.City;                                               // E1
      )
      A-to-B (
        A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);             // R1
        A.i-city[* City,o] :- A.Node(o);                               // R2
        B.flight-info(f,o,t) :- A.i-flight[f,s,t], A.has-node(s,o);    // R3
        B.price(f,p) :- A.i-flight[f,s,t], A.segment-info(s,t,p);      // R4
      )
      B-to-A (
        B.i-segment[* Segment,c] :- B.flight-info(f,c,t);              // S1
        B.i-node[* Node,c] :- B.City(c);                               // S2
        A.has-node(s,o) :- B.i-segment[s,c], B.i-node[o,c];            // S3
        A.segment-info(s,t,p) :- B.i-segment[s,c],                     // S4
                                 B.flight-info(f,c,t), B.price(f,p);
      )
    )

Notice that EES entry E1 specifies the equivalence relationship between A.Node and B.City. □

6.2.1 Well-foundedness of an ALD

Simply put, an ALD is "well-founded" if every entity type appearing in the ALD has a finite rank, defined as follows.

Definition: Rank

Let L be an ALD. The rank function on the types in schema(L) is defined as follows.

$$rank(\tau) = \begin{cases} 0 & \text{if } \tau \text{ is a base type} \\ \max\{ rank(\tau') \mid \tau' \in W \} + 1 & \text{if } \tau \text{ is invented by } i[*,..] \text{ and } W \text{ is the set of types of the witness columns in } i[*,..] \end{cases}$$

The rank number of L is then $rank(L) = \max\{ rank(\tau) \mid \tau \in schema(L) \}$. □

Definition: Well-foundedness

An ALD L is well-founded if for each type $\tau$ in schema(L), $rank(\tau)$ is defined. □

Example 6.4: The ALD shown in Example 5.2 for the "Segment Flight" example has a rank of 2. This is computed as follows. By the invention rule

    B.i-segment[* Segment,c] :- B.flight-info(f,c,t);

we know

    rank(A.Segment) = rank(String) + 1 = 1

Then from the invention rule

    A.i-flight[* Flight,s,t] :- A.segment-info(s,t,p);

we know

    rank(B.Flight) = rank(A.Segment) + 1 = 2

Thus, by definition, the ALD is well-founded.

However, for the ALD L presented in Example 6.3, rank(L) is not well-defined. This is due to the A.i-city / B.i-node OID invention loop. Suppose rank(A.Node) = i.
From

    A.i-city[* City,o] :- A.Node(o);                               // R2

we know rank(B.City) = i + 1. Then by

    B.i-node[* Node,c] :- B.City(c);                               // S2

we have rank(A.Node) = i + 1 + 1, a contradiction. □

6.2.2 Reduction and Expansion Operations

Let L be an ALD containing the EES entry A.P = B.Q. Recall that, in Section 6.1, the witness generators for A.iQ and B.iP can be simply constructed as

    A.iQ[q,p] :- B.iP[p,q];
    B.iP[p,q] :- A.iQ[q,p];

However, as will be shown later, the WGG algorithm fails to find the witness generators for those intermediate relations of L not appearing in the EES. This is mainly because of the inherent non-well-foundedness resulting from the transformation EES-to-ILOG.

In this subsection, we introduce a technique that removes this inherent non-well-foundedness, so that the WGG algorithm can be applied to find the remaining witness generators. The technique is based on two operators: reduce and expand. To find the witness generator $WG_r$ for an intermediate relation r not appearing in the EES, the ALD compiler performs the following:

(1) Apply the reduce operator to L. In this new ALD, reduce(L), the entity types A.P and B.Q are "reduced" into printables.

(2) With reduce(L) and reduce(r) as input, run the WGG algorithm to find the witness generator, $WG_{reduce(r)}$, for the "reduced" intermediate relation reduce(r) in reduce(L).

(3) Apply the expand operator to $WG_{reduce(r)}$, and return $expand(WG_{reduce(r)})$ as the output; i.e., $expand(WG_{reduce(r)})$ is the witness generator for r w.r.t. L.

Definition: reduce

Let L be an ALD which has been pre-processed by EES-to-ILOG. Let $\hat{A} = A \cup auxA$ and $\hat{B} = B \cup auxB$ be the extended schemas for databases A and B, respectively. For each EES entry A.P = B.Q in L, the result of reduce(L) is defined as follows.

Schemas: $reduce(\hat{A})$ contains all the relations in $\hat{A}$ except the intermediate relation A.iQ derived from an EES entry A.P = B.Q in L.
$reduce(\hat{B})$ contains all the relations in $\hat{B}$ except the intermediate relation B.iP derived from an EES entry A.P = B.Q in L.

Then define a new printable type PQ-value. The columns of $reduce(\hat{A})$ that in $\hat{A}$ had entity type A.P are given type PQ-value in $reduce(\hat{A})$, and the columns of $reduce(\hat{B})$ that in $\hat{B}$ had entity type B.Q are given type PQ-value in $reduce(\hat{B})$.

Programs: For each EES entry A.P = B.Q in L, replace the two rules

    A.iQ[* Q,p] :- A.P(p)    in $P_{A \mapsto B}$, and
    B.iP[* P,q] :- B.Q(q)    in $P_{B \mapsto A}$

by

    A.P(x) :- B.Q(x)         in $reduce(P_{A \mapsto B})$, and
    B.Q(x) :- A.P(x)         in $reduce(P_{B \mapsto A})$

respectively.

For a rule $\mathcal{R}_r$ in $P_{A \mapsto B}$, where r is not defined in the EES, construct $reduce(\mathcal{R}_r)$ in $reduce(P_{A \mapsto B})$ as follows. For each iQ[q,p], do the following: (1) replace each atom A.iQ[q,p] in the body of $\mathcal{R}_r$ by a new atom A.P($x_{pq}$), where $x_{pq}$ is a new variable, and (2) replace each p, q in $\mathcal{R}_r$ by the new variable $x_{pq}$. Each rule in $reduce(P_{B \mapsto A})$ can be constructed in the same fashion. □

The next definition describes how to construct the reduced instance pair, $(reduce(\hat{I}_A), reduce(\hat{J}_B)) \in Inst(reduce(\hat{A})) \times Inst(reduce(\hat{B}))$, from an instance pair $(\hat{I}_A, \hat{J}_B) \in Inst(\hat{A}) \times Inst(\hat{B})$.

Definition: Reduced Instance Pair

Let L be an ALD, and let $(\hat{I}_A, \hat{J}_B)$ be an instance pair in the extended semantic correspondence $SC_{\hat{A}\hat{B}}$. Let A.P = B.Q be an EES entry in L. From the semantics of EES, we know that for each p, q

$$iP[p,q] \in \hat{J}_B.iP \iff iQ[q,p] \in \hat{I}_A.iQ$$

The reduce operations on OIDs p and q are defined as follows: $reduce(p) = reduce(q) = x_{pq}$, where $x_{pq}$ is a new value of type PQ-value. For every other value $t \in adom(\hat{I}_A \cup \hat{J}_B)$ define $reduce(t) = t$. Then $reduce(\hat{I}_A)$ and $reduce(\hat{J}_B)$ can be constructed as follows.

(1) Remove A.iQ from $\hat{I}_A$ and B.iP from $\hat{J}_B$.

(2) Every other relation r in $\hat{A}$ is replaced by $reduce(\hat{I}_A.r)$ in $reduce(\hat{I}_A)$, and every other relation s in $\hat{B}$ is replaced by $reduce(\hat{J}_B.s)$ in $reduce(\hat{J}_B)$.
Example 6.5:
The instance pair shown in Figure 4.2 for the “Segment Flight” example is a valid instance pair for the “Itinerary” example after reduction. □

The next lemma states that for any instance pair $(I_{\hat{A}}, J_{\hat{B}})$ in an extended semantic correspondence $SC_{\hat{A}\hat{B}}$ defined by an ALD L, the reduced instance pair $(reduce(I_{\hat{A}}), reduce(J_{\hat{B}}))$ is in the reduced semantic correspondence $SC_{reduce(\hat{A}\hat{B})}$.

Lemma 6.2.1: Reduced Semantic Correspondence
Let L be an ALD defining an extended semantic correspondence $SC_{\hat{A}\hat{B}}$. Let the extended semantic correspondence defined by reduce(L) be $SC_{reduce(\hat{A}\hat{B})}$. Then,
    $(I_{\hat{A}}, J_{\hat{B}}) \in SC_{\hat{A}\hat{B}} \implies (reduce(I_{\hat{A}}), reduce(J_{\hat{B}})) \in SC_{reduce(\hat{A}\hat{B})}$ □

After finding the witness generator $WG_{reduce(r)}$ for the intermediate relation reduce(r) in the reduced ALD, reduce(L), the witness generator $WG_r$ for r in the original ALD is constructed by “expanding” $WG_{reduce(r)}$ as follows.

Definition: $expand(WG_{r'})$
Let $WG_{reduce(r)}$ be a witness generator of reduce(r) under reduce(L). W.l.o.g. assume $reduce(r) \in reduce(\hat{A})$; then $expand(WG_{reduce(r)})$ can be constructed as follows. For each variable $x_{pq}$ in $Var(WG_{reduce(r)})$ of type PQ-value, derived from an EES entry A.P = B.Q in L, do the following:
(1) add one atom B.iP[p,q] to the rule body, where p and q are new variables not used in $WG_{reduce(r)}$,
(2) replace each occurrence of $x_{pq}$ in the rule body by q,
(3) replace each occurrence of $x_{pq}$ in the rule head, if any, by p.
$expand(WG_{reduce(r)})$ for reduce(r) in $reduce(\hat{B})$ is defined analogously. □

The next lemma shows that the expanded witness generator is indeed a witness rule in the original semantic correspondence.
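Before the lemma, the three steps of expand can be made concrete with a small Python sketch. This is only an illustration under assumptions: rules are encoded as a head and a list of body atoms, each atom being a (relation name, argument list) pair; the names `expand` and `fresh_var` and the scheme for generating new variable names (`p1`, `q1`, ...) are hypothetical and not taken from the ALD compiler.

```python
from itertools import count


def fresh_var(used, stem):
    """Return a variable name starting with `stem` that is not yet in `used`."""
    for i in count(1):
        name = f"{stem}{i}"
        if name not in used:
            used.add(name)
            return name


def expand(head, body, pq_vars):
    """Expand a reduced witness generator whose head relation lies in reduce(A).

    head:    (relation name, list of variables)
    body:    list of (relation name, list of variables)
    pq_vars: maps each PQ-value variable x_pq to the name of the B.iP
             relation of the EES entry it was derived from (assumed input).
    """
    head_name, head_args = head
    used = set(head_args) | {v for _, args in body for v in args}
    new_head = list(head_args)
    new_body = [(name, list(args)) for name, args in body]
    for x_pq, ip_relation in pq_vars.items():
        p = fresh_var(used, "p")
        q = fresh_var(used, "q")
        # (1) add one atom B.iP[p, q] with new variables p and q
        new_body.append((ip_relation, [p, q]))
        # (2) replace each occurrence of x_pq in the rule body by q
        new_body = [(name, [q if v == x_pq else v for v in args])
                    for name, args in new_body]
        # (3) replace each occurrence of x_pq in the rule head, if any, by p
        new_head = [p if v == x_pq else v for v in new_head]
    return (head_name, new_head), new_body


# Witness generator for A.i-flight in the "Segment Flight" example,
# where variable c has type PQ-value (EES entry A.Node = B.City).
wg_head = ("A.i-flight", ["f", "s", "t"])
wg_body = [
    ("B.i-segment", ["s", "c"]),
    ("B.flight-info", ["f", "c", "t"]),
    ("B.price", ["f", "p"]),
]
expanded_head, expanded_body = expand(wg_head, wg_body, {"c": "B.i-node"})
```

Up to the choice of variable names, the expanded body contains the original atoms with c replaced by the new variable q1, plus the added atom B.i-node[p1,q1]; the head is unchanged because c does not occur in it.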
Lemma 6.2.2: Expanded Witness Rule
Let L be an ALD defining a semantic correspondence $SC_L$. Let the semantic correspondence defined by reduce(L) be $SC_{reduce(L)}$. If $WG_{r'}$ is a witness rule for r' in $SC_{reduce(L)}$, then $expand(WG_{r'})$ is a witness rule for r in $SC_L$. □

Example 6.6:
Recall that the witness generator $WG_{A.i\text{-}flight}$ for the “Segment Flight” example, constructed in the previous section, was
    A.i-flight[f,s,t] :- B.i-segment[s,c], B.flight-info(f,c,t), B.price(f,p);
Then the witness rule for A.i-flight for the “Itinerary” example is, in fact, $expand(WG_{A.i\text{-}flight})$. That is,
    A.i-flight[f,s,t] :- B.i-segment[s,y], B.flight-info(f,y,t), B.price(f,p), B.i-node[x,y]; □

6.3 WGG Expansion History

In this section, some basic definitions about the expansion process are presented. These definitions will be used in the description of the WGG algorithm. Later, in Chapter 7, they will be used to prove various results regarding the execution of the WGG algorithm.

Recall the “common sense” reasoning of Example 6.2 in the previous section. As shown in Figure 6.1, the expansion can choose the “most promising” atom in a goal set to expand first. This decision on the sequence of atom expansions, called the computation rule in [Llo87], is based on some intuition.

Remark 6.3.1: For the formal development, we insist that the WGG algorithm construct the expansion history in a systematic, layer-by-layer fashion to simplify the proof of soundness. In practice, however, heuristics might be developed to expand only the “most promising” portions of the tree. ◁

In contrast, the expansion in the WGG algorithm is performed in a more systematic way, in which atoms in a goal set are expanded according to a flag called the schema tag.
The value of the schema tag switches from “aux” to “base” after all the atoms of the intermediate relations are expanded; after all the atoms of the base relations are expanded, the value of the schema tag switches back from “base” to “aux”.

Definition: Schema Tag
t is a schema tag if t ∈ {“base”, “aux”}. Let G be a set of atom occurrences; then G is
    “base” congruous if G contains only base relation atoms from the same database,
    “aux” congruous if G may contain atoms of both base and intermediate relations from the same database. □

Next we define the WGG expansion history; this records the execution history of the WGG algorithm.

Remark 6.3.2: For the discussion in this section, we assume that the ALDs have first been translated by EES-to-ILOG and then rewritten by reduce, so that they are well-founded. ◁

For the sake of brevity, the following notations are used. Given substitutions $\sigma^1, \ldots, \sigma^n$, we use $\sigma^{[1,n]}$ as a shorthand for $\sigma^1 \cdot \sigma^2 \cdots \sigma^n$. Since each rule in an ALD has a distinct relation name in the rule head, $\mathcal{R}_r$ represents the rule in skolem(L) with relation r appearing in the rule head, and $Var(\mathcal{R})$ represents the vector of variables appearing in $\mathcal{R}$. Also, we use $\mathcal{R}$.head-term to represent the terms/variables appearing in the rule head and $\mathcal{R}$.body to represent the body of rule $\mathcal{R}$.

Definition: WGG Expansion History
Let L be an ALD, t be a schema tag, and G be a set of atom occurrences that is t congruous. The WGG expansion history under L with initial goal set G and initial schema tag t, denoted as $WGG_L(G, t)$, is an infinite sequence $\{(G_i, \theta_i, t_i),\ i \geq 0\}$ of three-tuples, where $G_i$, $\theta_i$, and $t_i$ are called the goal set, the substitution, and the schema tag at step i, respectively. Initially, let
    $t_0 = t$
    $G_0 = G$
    $\theta_0 = \emptyset$
The schema tag at step $i \geq 1$ is
    $t_i = \begin{cases} \text{“base”} & \text{if } t_{i-1} = \text{“aux”} \\ \text{“aux”} & \text{if } t_{i-1} = \text{“base”} \end{cases}$
Later, we will see that, in the expansion history, $G_i$ is always $t_i$ congruous.
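The alternation of the schema tag and the two congruity conditions can be sketched in Python as follows. This is a simplified illustration, not the WGG algorithm itself: atom occurrences are reduced to hypothetical (database, relation) pairs with their arguments omitted, and the function names `schema_tags` and `is_congruous` are assumptions.

```python
def schema_tags(initial, steps):
    """Return the tags t_0, ..., t_steps, flipping between "aux" and "base"."""
    flip = {"aux": "base", "base": "aux"}
    tags = [initial]
    for _ in range(steps):
        tags.append(flip[tags[-1]])
    return tags


def is_congruous(goal_set, tag, intermediate):
    """Check whether a goal set is `tag` congruous.

    goal_set:     set of (database, relation) pairs standing for atoms
    intermediate: set of (database, relation) pairs naming the
                  intermediate relations (assumed to be given)
    """
    databases = {db for db, _ in goal_set}
    if len(databases) > 1:
        # Both conditions require all atoms to come from the same database.
        return False
    if tag == "base":
        # "base" congruous: only base relation atoms are allowed.
        return all(atom not in intermediate for atom in goal_set)
    # "aux" congruous: base and intermediate atoms may be mixed.
    return True
```

For instance, `schema_tags("aux", 3)` yields the tags for steps 0 through 3 of an expansion history that starts with the tag “aux”.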