SEMANTIC HETEROGENEITY RESOLUTION IN FEDERATED DATABASES BY
META-DATA IMPLANTATION AND STEPWISE EVOLUTION
by
Goksel Aslan
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
May 1998
Copyright 1998 Goksel Aslan
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by
GOKSEL ASLAN
under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of
DOCTOR OF PHILOSOPHY

Dean of Graduate Studies
April 17, 1998
Date
DISSERTATION COMMITTEE
Dedication
To the people who make life bearable...
my mom,
my wife,
my brother,
and my sisters
Acknowledgements
First of all, it is my greatest pleasure to acknowledge my advisor Dennis McLeod for his never-ending support. I would like to offer my most sincere and deepest gratitude to him not only for his scientific guidance but also for his understanding, kindness, flexibility, patience, and friendliness during my studies here at USC. Beyond being my advisor, who directed me towards appropriate and precise research directions, he has been a good friend who shared my difficult and weak moments with me, giving me tremendous support during these hard times. I do not think I could have made it this far without him.

I also wish to thank the committee members, Craig Knoblock, for his invaluable comments and helpful feedback during the creation of this dissertation, and Daniel O'Leary, for his motivating encouragement. They have contributed a great deal to the quality of this dissertation as well as to its technical depth.

My colleagues at USC, Johnghyun Kahng, Latifur Khan, Wen Hsiang Kevin Liao, and Cha-Hwa Lin deserve no less acknowledgement for their positive suggestions and constructive discussions. I have enjoyed sharing a beautiful research atmosphere with them, one that came about through their sincere contributions.

I would like to further thank my family for their emotional support and their understanding. I owe millions of thanks to my mother Neziha for her countless sacrifices throughout her life. For his help in every aspect of life, I am grateful to my brother Gursel. I thank my sisters Aysel and Nursel for being there for me when I needed them. Last but not least, I would like to offer my appreciation to my wife Firdevs for her endurance through my long academic life.
Table of Contents
Dedication  ii
Acknowledgements  iii
List of Figures  vi
Abstract  viii

Chapter 1  Introduction and Motivation  1
1.1 Context of the Problem  2
1.2 Problem: The Folding Problem  3
1.3 Problem Scope  5
1.4 Schema Implantation and Semantic Evolution Approach  6

Chapter 2  Related Research  10
2.1 Semantic Heterogeneity (SH), SH Identification and Resolution  12
2.1.1 What Constitutes the Semantics  12
2.1.2 Causes and Kinds of Semantic Heterogeneity  13
2.1.3 A Taxonomy of Semantic Heterogeneity Causes  15
2.1.4 Semantic Heterogeneity Resolution  20
2.2 Schema Integration  28
2.3 Schema Evolution  34
2.4 Categorization of Research Applicable to the Folding Problem  37
2.5 Projects Directly Related to the Folding Problem  41

Chapter 3  A Perspective on the Folding Problem  45
3.1 Assumptions and Goals  45
3.2 Characteristics of Our Approach  53

Chapter 4  Schema Implantation and Semantic Evolution  56
4.1 HSDM (Heterogeneous Semantic Data Model) as the CDM  57
4.1.1 HSDM Classes and Class Hierarchy  58
4.1.2 HSDM Attributes  63
4.1.2.1 Complementary Attributes  64
4.1.2.2 Equivalence of Primitive Attributes  67
4.1.2.3 Attribute Constraints  68
4.1.3 Class Keys and Instance Keys  68
4.1.4 Null Values in HSDM  71
4.1.5 Schema Evolution Primitives in HSDM  71
4.2 An Example Sharing Scenario: Collaborating Scientists  75
4.3 Schema Implantation and Semantic Evolution Approach  77
4.3.1 Semantic Clarification Phase  81
4.3.2 Schema Implantation Phase  89
4.3.2.1 Relationships between Remote and Local Classes  91
4.3.2.2 Harmonizers  93
4.3.2.3 Example  97
4.3.3 Semantic Evolution Phase  99
4.3.3.1 Harmonizer States  101
4.3.3.2 Acquisition of Local and Remote Knowledge  102
4.3.3.3 Determining the Relationship between Local and Remote Classes  104
4.3.3.4 Example  108

Chapter 5  Experimental Prototype Implementations  110
5.1 PDM Implementation  111
5.2 Implementation of the HSDM Mediator  112
5.2.1 The HSDM Mediator  115
5.2.1.1 The Database Submenu  116
5.2.1.2 The Browse Schema Submenu  116
5.2.1.3 The Working Kind Submenu  117
5.2.1.4 The Manipulate DB Submenu  118
5.2.1.5 The Interoperate Submenu  119
5.3 Experimental Results  121

Chapter 6  Conclusions  138
6.1 Contributions  139
6.2 Future Work  143

Reference List  145
List of Figures

1. A Cooperative Federated Database System Environment  2
2. The Folding Problem  3
3. Folding remote conceptual schema elements into the local conceptual schema  5
4. A Taxonomy of Semantic Heterogeneity Causes  15
5. Global Schema Approach to the Folding Problem  38
6. Federated Databases with Global Structures (Semantic Dictionary/Ontologies)  39
7. Multidatabases with Multidatabase Languages  40
8. Projects related to the Folding Problem  42
9. Principles of Our Approach to the Folding Problem  46
10. A conceptual territory and its sub-territories  58
11. Corresponding HSDM Conceptual Schema  59
12. Predefined Primitive Classes in HSDM  61
13. Complementary attributes and possible values for numeric attributes  66
14. Complementary attributes and possible values for non-numeric attributes  66
15. Compatibility matrix on primitive classes  67
16. Schema Evolution Primitives in HSDM  72
17. Local and Remote Conceptual Schemas  75
18. Schema Implantation and Semantic Evolution Approach  78
19. Semantic Clarification Phase  82
20. Activities in Semantic Clarification  82
21. Local and Remote Conceptual Schemas after Semantic Clarification  86
22. Complementary Attribute Values of Primitive Attributes in the Local Schema  87
23. Complementary Attribute Values of Primitive Attributes in the Remote Schema  88
24. Schema Implantation Phase  89
25. Superimposing Sub-Phase  90
26. Hypothesis Specification Sub-Phase  91
27. Possible relationships between a local class l and a remote class r  92
28. The notion of instance equivalence  93
29. The Structure of Harmonizers  94
30. Local Schema Implanted with Remote Schema Portion  97
31. The Structure of Harmonizer1  98
32. Semantic Evolution Phase  99
33. State Transition Diagram of a Harmonizer  101
34. An Algorithm for Determining the Characteristic Subset of a Class A  102
35. Acquisition of Knowledge about the Remote Class (Instances)  103
36. Determining Class Relationships  105
37. Final Local Conceptual Schema after Semantic Evolution Phase  109
38. Implementation Architecture  114
39. Functionality of the HSDM Mediator  115
40. The HSDM Mediator Main Menu  115
41. The Database Submenu  116
42. The Browse Schema Submenu  116
43. The Working Kind Submenu  117
44. The Manipulate DB Submenu  118
45. The Interoperate Submenu  119
46. Local Conceptual Schema in HSDM  123
47. Remote Conceptual Schema Portion in HSDM  124
48. Implanting Remote Schema into the Local Environment  125
49. Local Schema Implanted with Remote Schema Portion  127
50. Harmonizer1  128
51. Acquiring Local Data for Harmonizer1  129
52. Acquiring Remote Data for Harmonizer1  132
53. Loading the Acquired Remote Data  133
54. Harmonizer Evaluation  135
55. Final Schema after Harmonizer1 is Evaluated  137
56. Parameters of an Implanted Conceptual Schema  139
57. Possible Solutions to the Folding Problem  142
Abstract
A Cooperative Federated Database System (CFDBS) is a collection of autonomous, heterogeneous component database systems which unite into a loosely-coupled form in order to interoperate. Interoperability between component database systems is achieved by means of the ability of individual components to actively and cooperatively share and exchange information units with other components in the federation. Information sharing and exchange necessitates data and meta-data to be mediated across the component databases in a federation. One way to achieve such mediation between a local database and a remote database is to fold remote meta-data (conceptual schema) into the local meta-data (conceptual schema), thereby creating a common platform through which information sharing and exchange becomes possible. This is termed the folding problem, and it is the focus of this thesis.

Schema Implantation and Semantic Evolution, our approach to the folding problem, is a partial database integration scheme in which remote and local (meta-)data are integrated in a stepwise manner over time. The knowledge required to interrelate remote and local (meta-)data is acquired from the corresponding domain experts, who are assumed to have the necessary expertise about their application domain semantics. In our approach, we introduce meta-data implantation and stepwise evolution techniques to interrelate database elements in different component databases, and to resolve conflicts on the structure and semantics of database elements (classes, attributes, and individual instances). The core ideas of our approach are the use of a semantically rich canonical data model that makes information unit semantics explicit, comparable, and interpretable, and an incremental integration and semantic heterogeneity resolution scheme in which relationships between local and remote information units are determined whenever enough knowledge about their semantics has been acquired. Folding remote meta-data into the local meta-data is accomplished by implanting remote database elements into the local database, a process that imports remote database elements into the local database environment and hypothesizes the relevance of local and remote classes, and by customizing the organization of the remote meta-data so that it semantically fits into the local class hierarchy. We also introduce the concept of harmonizers, which are hypothetical relevances between local and remote classes. Harmonizers are used to test the relevance of local and remote classes for the purpose of relating them, taking their semantics into account. We have implemented a prototype system that realizes our approach.

Our approach to the folding problem has several advantages: it requires minimal global knowledge and effort from federation users both before and during interoperation; the global structures that must be maintained in the federation are minimal; the interrelationships between schema elements in different components are highly dynamic; and it recognizes the fact that the knowledge required to relate inter-database elements may not be available, derivable from within the federation, or obtainable from users prior to interoperation.
Chapter 1
Introduction and Motivation
Information is of little value if it is not shared. In order for any information to carry a more global value, it must be shared and exchanged actively between information providers and information receivers. The introduction of computer networks allowed interconnectivity among the computer systems which maintain information in their databases. This kind of interaction between computer systems was an early and primitive form of information exchange rather than information sharing, in which an information provider and an information receiver could exchange messages only [55, 65]. Over time, interconnectivity provided the necessary foundations for interoperability, an advanced kind of interconnectivity. Interoperability is defined as understanding and interpreting potentially exchangeable messages [65]. The ultimate goal in this respect is to achieve intelligent interoperability, in which information sharing and exchange is achieved by means of cooperation between information receivers and providers.

The ability of a collection of interconnected database systems to share and exchange information is termed Database Interoperability. The reason for such a form of sharing may be to expand a local database with related information maintained in remote databases, as well as to query the remote information units, which may be complementary to, overlapping with, or even contradictory to the local information units. This thesis concentrates on achieving database interoperability between database systems in a pairwise fashion.
1.1 Context of the Problem
Figure 1: A Cooperative Federated Database System Environment.
The aim of this thesis is to ultimately achieve database interoperability within cooperative federated database systems (CFDBS). Such a federated database system is shown in Figure 1, where a collection of database systems (components) unite into a loosely coupled form in order to share and exchange information [29]. Components share and exchange information units while maintaining their autonomy requirements at the same time. Furthermore, they are heterogeneous in the sense that they may use different data models to model related real world domains. They may come up with different conceptual schemas even if they use the same data model while modeling related real world domains. In summary, they model, represent, and store information units in a variety of forms, which may introduce certain conflicts among components when information units are to be shared and exchanged. Special agents and/or tools such as a Sharing Advisor, a Semantic Dictionary, sharing heuristics, discovery tools, and import/export tools, although not explicitly shown in Figure 1, help individual components to function properly within the federation. Typically, a common data model (CDM) is employed federation-wide as the communication tool between the federation components.
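As an illustration of the role of the CDM, the following Python sketch shows two heterogeneous components that agree only on a shared common-data-model form for the schema portions they export to the federation. The class, attribute, and component names are purely illustrative, not part of any system described in this thesis.

```python
# A minimal sketch of CDM-mediated exchange: each component translates the
# sharable portion of its native schema into a federation-wide common form
# before exporting it.

from dataclasses import dataclass

@dataclass
class CDMClass:
    """A class description expressed in the federation-wide common data model."""
    name: str
    attributes: dict  # attribute name -> CDM type name

@dataclass
class Component:
    """An autonomous component database system in the federation."""
    name: str
    native_schema: dict  # schema in the component's own data model

    def export_to_cdm(self):
        """Translate the sharable portion of the native schema into the CDM."""
        return [CDMClass(cls, dict(attrs))
                for cls, attrs in self.native_schema.items()]

# Two heterogeneous components agreeing only on the CDM form:
lab = Component("lab_db", {"Rat": {"age_weeks": "Integer", "weight_g": "Real"}})
lit = Component("literature_db", {"Paper": {"title": "String", "year": "Integer"}})

for component in (lab, lit):
    for cdm_class in component.export_to_cdm():
        print(component.name, "exports", cdm_class)
```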
1.2 Problem: The Folding Problem
Figure 2: The Folding Problem. (A local database schema and a remote database schema portion, together with the knowledge required to interrelate their schema elements, are partially integrated into a local database schema with the remote database schema portion.)
One way to achieve database interoperability between a local database and a remote database in Cooperative Federated Database Systems (CFDBS) is to fold remote meta-data (conceptual schema) into the local meta-data (conceptual schema), thereby creating a common platform through which information exchange and sharing becomes possible. This is termed the folding problem (Figure 2). More formally, given a local and a remote database system in a federation of database systems, the problem can be stated as the process of folding (importing and customizing) remote conceptual schema elements1 into the local conceptual schema in the presence of semantic differences between the two environments, so that remote information units can be accessed/manipulated from/within the local database environment. Importing means bringing remote database elements into, and making them accessible within, the local database environment. Customization, on the other hand, refers to the process of reorganizing and tuning local and previously imported remote database elements by taking into account the real world concepts they model.

1. Schema elements refers to the structural components of a conceptual schema, such as object classes, attributes, and attribute constraints. Database elements refers to schema elements and instances.
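The two halves of this definition can be made concrete with a small sketch, shown below in Python: import_element makes a remote schema element accessible locally, and customize then repositions it within the local class hierarchy. All names are illustrative placeholders rather than constructs of our mechanism.

```python
# A sketch of folding as "import" followed by "customize".

class SchemaElement:
    def __init__(self, name, origin):
        self.name = name
        self.origin = origin          # "local" or the remote component's name
        self.superclasses = set()

class LocalDatabase:
    def __init__(self):
        self.elements = {}

    def import_element(self, remote_element):
        """Make a remote schema element accessible locally (a logical import)."""
        self.elements[remote_element.name] = remote_element

    def customize(self, name, superclass_name):
        """Reorganize an imported element to fit the local class hierarchy."""
        self.elements[name].superclasses.add(superclass_name)

local = LocalDatabase()
local.elements["Person"] = SchemaElement("Person", "local")
local.import_element(SchemaElement("Experimenter", "remote_lab_db"))
local.customize("Experimenter", "Person")  # once its relevance is established
print(local.elements["Experimenter"].superclasses)  # {'Person'}
```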
Primitive and unstructured forms of this folding process include keeping bookmarks to related documents in Netscape, and embedding document links into WWW documents. The Prospero File System [63, 70, 60], another example from the operating systems domain, emphasizes limited customization of remote information within the local environment by allowing users to organize remote files according to their personal preferences. We desire a more structured way to perform this folding process in the database domain. Given a local and a remote database, we frequently need to expand the local database content with the information contained in the remote database. For example, an experimenter may want to expand his/her database with additional experiments performed by remote experimenters, as well as with other information related to the experiments. Similarly, a researcher may want to form a literature database by combining several databases for his/her own use.

Previous approaches that are applicable to the folding problem assume that the knowledge required to relate inter-database elements, depicted in Figure 2, is available, derivable from within the federation, or obtainable from users. Consequently, they propose a one-shot solution, in which resolution of conflicts on the semantics of information units, and integration of remote database elements into the local database, are performed in a single pass. Assuming that the knowledge required for semantic heterogeneity resolution and schema integration is always derivable, if not available, results in either frequent user consultations on the semantics and inter-relationships of information units, or inconsistent database states. The folding process constitutes a real problem in database research, where information exchange, information sharing, and information customization are vital. What is needed are efficient ways and mechanisms to provide a solution to the folding problem that recognize that the knowledge required for resolution and integration purposes is often partial, insufficient, or simply unavailable. The aim of this thesis is to provide the necessary means, techniques, and mechanisms for this folding process to take place, which will ensure database interoperability between the components in a cooperative federated database system.
1.3 Problem Scope
Figure 3: Folding remote conceptual schema elements into the local conceptual schema. (Phase 1: Information Discovery (where to find related information). Phase 2: Schema Mapping (how to express it in terms of the common data model). Phase 3: Semantic Heterogeneity Resolution (how to find/resolve conflicts). Phase 4: Schema/Instance Importation (how to bring remote information units into the local environment). Phase 5: Schema/Instance Customization (how to adjust the organization of imported information units).)
We have identified five phases of the folding problem, which are shown in Figure 3. In Figure 3, Information Discovery refers to the process of locating the component database(s) and the related portion of its conceptual schema that is of interest to the local database. In the Schema Mapping phase, conceptual schemas are translated from their native data models into the Common Data Model (CDM), through which component databases can exchange their schema elements. Semantic Heterogeneity Resolution eliminates differences in the meanings of shared information. The Importation and Customization steps are traditionally referred to as schema integration. During Schema/Instance Importation, remote conceptual schema elements and remote object instances are imported either physically or logically into the local database. Imported schema elements are restructured to fit local application requirements in the Schema/Instance Customization phase. Although current methodologies impose a strict order on the overall process, the one shown in Figure 3, our mechanism follows a different order which overlaps the last three phases.
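The strict phase ordering imposed by current methodologies can be pictured as a simple pipeline of function stubs, as in the following Python sketch. The function names and data shapes are illustrative only; our mechanism, by contrast, overlaps phases 3 through 5 rather than running them once in sequence.

```python
def discover(federation):
    """Phase 1: locate the component database and the schema portion of interest."""
    return federation["remote"], federation["remote"]["portion"]

def map_to_cdm(portion):
    """Phase 2: translate the portion from its native data model into the CDM."""
    return {"cdm": portion}

def resolve(local_db, cdm_portion):
    """Phase 3: eliminate differences in the meanings of the shared information."""
    return cdm_portion

def import_elements(local_db, resolved):
    """Phase 4: bring remote schema elements and instances into the local database."""
    local_db["imported"] = resolved
    return resolved

def customize(local_db, imported):
    """Phase 5: restructure the imported elements to fit local requirements."""
    local_db["customized"] = True

local_db = {}
federation = {"remote": {"portion": ["Experiment", "Rat"]}}
remote_db, portion = discover(federation)
customize(local_db, import_elements(local_db, resolve(local_db, map_to_cdm(portion))))
print(local_db)  # {'imported': {'cdm': ['Experiment', 'Rat']}, 'customized': True}
```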
The Schema Implantation and Semantic Evolution approach, which we present in this thesis, covers the problems associated with phases 2 through 5, while Information Discovery is addressed extensively throughout the database literature [63, 70, 30, 58, 82, 35, 88].
1.4 Schema Implantation and Semantic Evolution Approach
The problem is complicated by nature, and it has a broad spectrum which necessitates that different database techniques be employed cooperatively. There is a large number of tools and models to record, organize, and maintain information units in databases. There is no common, agreed-upon standard for modeling information units, or for exchanging and sharing information units, except for a few standardization efforts such as CORBA [68, 79]. Neither is there a standard for interpreting the meanings of information units kept in databases. The frequent existence of incomplete information within databases, and the existence of assumed and implicit knowledge associated with information units, complicate the problem further.
Research on federated databases, database interoperability, schema integration, schema evolution, and semantic heterogeneity resolution, although developed independently or with minimal dependency between these areas, is all highly related to the problem, each area addressing individual parts of the overall problem, sometimes making strong assumptions about the other parts, and sometimes disregarding them completely. Nevertheless, there is not a single piece of research that provides a uniform approach to the overall problem, namely the folding problem. The Schema Implantation and Semantic Evolution approach has been developed to fill this gap in database research. It provides a uniform solution to the problem by relaxing some of the strong assumptions made by previous research, and by introducing new techniques to overcome the difficulties imposed by the complicated nature of the problem.

The Schema Implantation and Semantic Evolution approach described in this thesis is a uniform approach to the problems of semantic heterogeneity resolution, (partial) schema integration, and schema customization for database interoperability in federated database systems. It is a three-phased approach.
The first phase is termed Semantic Clarification. In this phase, local and remote conceptual schemas are enriched by mapping them into the Common Data Model1 (the Heterogeneous Semantic Data Model) employed in the federation. HSDM makes the meanings of conceptual schema elements explicit, easy to interpret, and easy to compare. The purpose of HSDM is to provide a semantic foundation for information exchange. Possible confusions about the meanings of information units are eliminated by obtaining further knowledge about the information units maintained in the component databases. Class identity and instance identity semantics are defined and formalized by individual constructs that HSDM provides. The aim of this phase is to obtain a conceptual schema that documents what information it contains. HSDM bears the necessary object-oriented constructs, as well as new constructs and primitives for explicit and clear information unit semantics.

1. Common Data Model (CDM) is a synonym for Canonical Data Model. HSDM is the instance of a CDM which we use in our methodology.
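As a rough illustration of what semantic clarification adds, the sketch below annotates a primitive attribute with complementary meta-information (here, a unit and a precision) so that its meaning becomes explicit and comparable. The field names are illustrative; HSDM's actual constructs are defined in Chapter 4.

```python
# A sketch of semantic clarification: primitive attributes carry complementary
# values that make their semantics explicit across component databases.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ClarifiedAttribute:
    name: str
    primitive_type: str              # e.g. "Integer", "Real", "String"
    unit: Optional[str] = None       # complementary: unit of measurement
    precision: Optional[str] = None  # complementary: granularity of measurement

local_age = ClarifiedAttribute("age", "Integer", unit="years", precision="1 year")
remote_age = ClarifiedAttribute("age", "Integer", unit="weeks", precision="1 week")

def semantics_explicit(attr):
    """An attribute is clarified once its complementary values are filled in."""
    return attr.unit is not None and attr.precision is not None

# Both attributes are now comparable: their units differ, but explicitly so.
print(semantics_explicit(local_age), semantics_explicit(remote_age))  # True True
```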
The second phase is termed Schema Implantation, during which remote conceptual schema elements and remote object instances are imported into the local database. Importation of instances may be physical or logical. In other words, instances may be copied from the remote database, or special pointers (e.g., surrogates) may be constructed in the local database to access them. Furthermore, a number of hypothetical relevances (harmonizers) are specified between local and remote conceptual schema elements during the implantation phase. The purpose of the implantation process is to loosely integrate the local conceptual schema with the portion of the remote conceptual schema identified previously, and to specify hypothetical relevances (harmonizers) between elements of the two.
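A harmonizer can be pictured, to a first approximation, as a record of one hypothesized relevance between a local class and an implanted remote class, as in the following illustrative sketch. The hypothesis vocabulary and names here are placeholders, not HSDM's actual constructs, which are detailed in Chapter 4.

```python
# A sketch of harmonizers as records of hypothesized class relevances.

from dataclasses import dataclass

@dataclass
class Harmonizer:
    """A hypothesized relevance between a local class and a remote class."""
    local_class: str
    remote_class: str
    hypothesis: str                # e.g. "equivalent", "local subsumes remote"
    state: str = "hypothesized"    # refined later, during semantic evolution

# Implantation loosely couples the two schemas and records the hypotheses:
harmonizers = [
    Harmonizer("Experimenter", "Scientist", hypothesis="equivalent"),
    Harmonizer("Publication", "Paper", hypothesis="local subsumes remote"),
]
for h in harmonizers:
    print(h)
```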
Semantic Evolution is the last phase. During this phase, previously hypothesized relevances are tested and, depending on the test results, new hypothetical relevances are formed, existing ones are propagated down the class hierarchy, or individual evolution primitives are activated. Tests on harmonizers may require additional knowledge about the remote and local conceptual schema elements. This knowledge is acquired from the domain experts of individual components, who are assumed to have the necessary expertise about their application domain semantics. The purpose of this phase is to determine the exact relationships between remote and local conceptual schema elements, which in turn enables customization of the implanted remote conceptual schema elements within the local conceptual schema. Over time, remote schema elements live within the local database together with the problem of relating them to the local schema. Loose and imprecise relationships between remote and local conceptual schema elements evolve into tighter and more precise relationships that fit into the context of the local database.
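One simplified way to picture the evolution step is as a test that compares whatever local and remote instance-level evidence has been acquired so far, resolving a hypothesis only when the evidence suffices and leaving it pending otherwise. The decision rule in the sketch below (set comparison over instance identifiers) is a deliberate simplification of the mechanism detailed in Chapter 4.

```python
def evaluate(local_instances, remote_instances):
    """Resolve a hypothesized class relationship from instance-level evidence,
    or report that not enough knowledge has been acquired yet."""
    local, remote = set(local_instances), set(remote_instances)
    if not local or not remote:
        return "pending"                 # wait until more knowledge is acquired
    if local == remote:
        return "equivalent"
    if remote < local:
        return "remote class is a subclass of the local class"
    if local < remote:
        return "local class is a subclass of the remote class"
    if local & remote:
        return "overlapping"
    return "disjoint"

print(evaluate({"e1", "e2"}, {"e1"}))  # remote class is a subclass of the local class
print(evaluate(set(), {"e1"}))         # pending
```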
Research on a framework that allows local component database systems to gradually adapt imported conceptual schema portions into their own conceptual schemas, in an incremental manner over time, is required in order to achieve database interoperability in a cooperative federated database system. We believe that the Schema Implantation and Semantic Evolution approach takes early steps in this direction.
The rest of this thesis is organized as follows. In Chapter 2, we review related research: Section 2.4 presents a categorization of the research applicable to the folding problem, while Section 2.5 reviews a representative set of projects that are related to the folding problem. Chapter 3 introduces our perspective on the folding problem: Section 3.1 lists the assumptions and goals of our approach to the folding problem, while Section 3.2 describes the desired characteristics of our solution. Chapter 4 starts with the specification of HSDM, which is the canonical data model we have employed in our mechanism; Section 4.2 presents a sharing scenario which will be used throughout this dissertation for illustration purposes, and in Section 4.3 we elaborate the individual phases of our approach. Chapter 5 provides detailed information about the prototype implementations. Finally, Chapter 6 discusses the contributions of our research, as well as future research directions.
Chapter 2
Related Research
Based on our intuition derived from the phases of the folding problem (Figure 3), the database research that has the potential to contribute towards a solution of the folding problem can be divided into three broad and overlapping categories: Semantic Heterogeneity Identification/Resolution, Schema Integration, and Schema Evolution.
First, Semantic Heterogeneity Identification/Resolution research is highly relevant, since the folding problem considers autonomous components, which have the freedom to model, represent, and use information units in whatever way and form they may desire. Semantic Heterogeneity occurs when there is a disagreement about the meaning, interpretation, or intended use of the same or related data [73]. Semantic Heterogeneity Resolution is the process of reaching an agreement on the semantics (meaning, interpretation, or intended use) of potentially sharable data. When considered within the context of the folding problem, Semantic Heterogeneity Resolution is the process of determining the potential relationship between a local conceptual schema element and a remote conceptual schema element, if one exists.
The second category concentrates on the schema integration problem, which can be defined as follows: given independently created and maintained conceptual schemas, it is the process of obtaining an integrated conceptual schema which subsumes the original schemas in terms of their structural aspects (information content and organization) and behavioral capabilities (the queries the conceptual schema responds to). Research related to schema integration shows a high overlap with research related to semantic heterogeneity. Some of the problems that appear during the schema integration process are directly related to the semantic heterogeneity problem. Nevertheless, the semantic heterogeneity problem is not limited to the schema integration context, since approaches assuming no global schema do not deal with the schema integration problem but still experience the semantic heterogeneity problem. Individual studies reviewed in this chapter may be related to both schema integration and semantic heterogeneity. The schema integration problem constitutes an important part of the folding problem, both when bringing remote information units into the local database and when customizing them to fit local application domain semantics.
Third, Schema Evolution research has vital contributions to our approach with regard to the customization of remote information units within the local database. Schema Evolution is the ability of a database system to tune its conceptual schema and its instances in response to changes (e.g., changes in the real world counterparts of object classes, changes in application requirements, or changes in the application environment). The need for such an ability stemmed from application areas such as Artificial Intelligence (AI), Office Information Systems (OIS), Computer Aided Design (CAD), and Computer Aided Manufacturing (CAM), where type changes occur as frequently as individual instance changes. Typically, for the sake of practicality, the technique used for schema evolution should have minimal effect on individual instances.
In the remainder of this chapter, we review the most important representatives of these three categories. While emphasizing the strong and inspiring points of these studies, we also specify their shortcomings with regard to the folding problem. At the end of the chapter, we present a new categorization of approaches that can be applied to the folding problem.
2.1 Semantic Heterogeneity (SH), SH Identification and Resolution
2.1.1 What Constitutes the Semantics
In order to gain a good understanding of what semantics and semantic heterogeneity are, it is important to distinguish the semantic and structural properties of databases. Neuhold [27] describes what should be considered structure and what should be considered semantics in a class specification in object-oriented databases. The aspects that can be captured formally are defined to be structural, while the aspects which refer to actual objects of the real world, or which cannot be fully captured in mathematical terms, are the semantic aspects of a class specification. As different class hierarchies should be maintained for the structural and semantic aspects of class specifications, different integration techniques should be employed when integrating the semantic and structural aspects of a class specification.
Context information plays a vital role in representing information unit semantics. The semantics of information units are tightly coupled with the environment where the information units reside. Therefore, information maintained in different database environments should be processed by respecting their corresponding environments, namely their contexts. Siegel and Madnick [74] suggest representing, moving, and processing the context along with the information it describes. According to the authors, the ability to represent, manipulate, and compare contexts is an important part of providing semantic interoperation in multidatabase systems. Hence, the context information associated with information units should be taken into account along with the information units themselves, in order for information unit semantics to be correctly represented and interpreted.

In conclusion, the semantics of an information unit refer to the meaning, interpretation, and use of this information unit. The organizational conventions and rules of the environment where information units reside have a direct effect on information unit semantics.
2.1.2 Causes and Kinds of Semantic Heterogeneity
Heterogeneity may be observed at a number of levels, including data models, data model constructs, query languages, system platforms, utilized naming conventions, and low-level data representations. Semantic heterogeneity in federated database systems, however, is primarily caused by the design autonomy of component database systems [73]. Component database systems may employ different design principles for modeling the same or related data, resulting in semantic heterogeneity as well as other kinds of heterogeneities (e.g., naming, representing, and conceptualizing the same or related data differently).
Batini et al. [11] present the causes of semantic heterogeneity as different perspectives (e.g., designers adopt their own points of view in modeling the same situation), equivalent constructs (e.g., the existence of several data modeling constructs for modeling the same concept), and incompatible design specifications (e.g., erroneous choices regarding names, types, and integrity constraints). In another study, McLeod and Hammer [29] identify a spectrum of heterogeneity: components may use different data models in modeling data; even if they use the same data model, they may come up with different conceptual schemas for the same or related data; object representations and low-level data formats differ from component to component even when they employ the same data model; and finally, components may use different tools to manage and provide an interface for the same or related data.
One of the reasons for semantic heterogeneity, among others, is domain evolution. Changes in the real-world counterparts of domain values cause old applications to become obsolete, since old applications assume the old semantics associated with domain values. Ventrone and Heiler [86] present examples of domain evolution in their work. Domain evolution cases such as changes in the semantics of domain values, cardinality and granularity changes, changes in the encoding scheme used to represent certain database values, and time and unit changes introduce semantic heterogeneity within a database as well as across multiple databases. Ventrone and Heiler suggest making semantic information explicit as part of the solution, so that it can be read and interpreted by application code.
Examples of heterogeneity can be observed in every application where information sharing and exchange is essential. Worboys and Deen [97] have identified two kinds of heterogeneity in distributed geographical databases: generic semantic heterogeneity and contextual heterogeneity. Generic semantic heterogeneity occurs when nodes use different conceptual models for modeling spatial information, while contextual heterogeneity is caused by local environment conditions at the nodes. They also provide solutions to generic semantic heterogeneity, such as constructing transforming processors between conceptual models, and incorporating a canonical model for spatial information representation.
Multidatabases introduce additional problems with respect to semantic heterogeneity. Most of the assumptions we take for granted in single databases are not applicable to multidatabases. Kent [40] studies the semantic problems introduced by multidatabases, with specific regard to identity, naming, semantic constraints, certitude, and stability issues. In a single database, there is a one-to-one correspondence between real world objects and their representational counterparts, proxy objects, in the database. Multidatabase systems may have several proxy objects for a single real world object. Similarly, despite the automatic enforcement of semantic integrity constraints within single databases, how to enforce them across autonomous component databases is still an open research issue. Although single databases seem certain about their information content, multidatabases often have conflicting values about the same real world object. Single databases are stable in the sense that a query returns the same answer every time it is imposed on the database. In multidatabases, however, there might be various answers to a query, depending on which databases are attached to the query at that particular moment. Being on the same track as multidatabases, cooperative federated database systems suffer from the same limitations, primarily due to the autonomy requirements of individual components.
Semantic heterogeneity arises for several reasons, and investigation of those reasons may lead to solutions on a case by case basis. Kim et al. [43] present the most comprehensive classification of schematic conflicts, and techniques to resolve these conflicts. Schematic conflicts can be interpreted as a special kind of semantic heterogeneity, which may arise when integrating multiple database schemas. In their work, they assume a global schema that represents an integration of the sharable portions of the component database schemas in a multidatabase system. Schematic conflicts are divided into three broad categories: generalization conflicts (which occur when an entity or an attribute in one database is included in another entity or attribute in another database), aggregation conflicts (which occur when domains differ for semantically equivalent attributes, or when equivalent data modeling concepts are represented in one data model using an aggregation abstraction but not in the other), and conflicts due to methods. The resolution techniques provided for these conflicts include renaming, homogenizing representations, homogenizing attributes, horizontal join, vertical join, mixed join, and homogenizing methods. The conflict resolution techniques described in their work have been implemented in a commercial multidatabase system called UniSQL/M [83].
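To give a flavor of such resolution techniques, the sketch below homogenizes the representation of a semantically equivalent attribute whose domains differ across two databases (a salary recorded monthly in one and yearly in the other). The mapping is illustrative only and is not taken from UniSQL/M.

```python
# A sketch of "homogenizing representations": semantically equivalent
# attributes with different domains are mapped into one global-schema domain.

def to_global_salary_yearly(value, source):
    """Homogenize 'salary' into the global schema's yearly-USD domain."""
    if source == "db1_monthly_usd":
        return value * 12
    if source == "db2_yearly_usd":
        return value
    raise ValueError(f"unknown source: {source}")

print(to_global_salary_yearly(5000, "db1_monthly_usd"))   # 60000
print(to_global_salary_yearly(60000, "db2_yearly_usd"))   # 60000
```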
2.1.3 A Taxonomy of Semantic Heterogeneity Causes
Figure 4: A Taxonomy of Semantic Heterogeneity Causes.

Application Oriented:
- different application requirements
- different organizational contexts
- argumentative real world concepts
- implicit assumptions associated with the application domain
- incomplete information in dynamic applications and evolution

Data Model Oriented:
- different data models
- data models specialized to certain applications
- lack of standards for data modeling
- equivalent constructs

Modeler Oriented:
- different/wrong perceptions
- implicit assumptions by modelers
In light of the related research, we have developed a taxonomy of semantic heterogeneity causes (Figure 4). Mainly three categories of reasons cause semantic conflicts across databases: differences among the component database system environments where information units are modeled, differences between the data models used to model real world concepts, and differences in the ways modelers behave while modeling these concepts.
Different application requirements lead to different conceptual schemas, resulting in the possibility of semantic conflicts, even if these schemas model the same real world domain. In this respect, the structural and behavioral needs of the application determine the structure of the conceptual schema, the methods on object classes, and even the meanings of conceptual schema elements. For example, the disagreement between applications may be about the level at which the real world is perceived. Consequently, the same concept may be modeled differently, or may be assigned different magnitudes of importance by different applications. In one application, the distinction between the concepts "students" and "research assistants" may be vital, while in another application such a distinction is unnecessary. There is a variety of standards to quantify measurable real world concepts. This causes fewer problems, since in most cases a direct transformation from one measure to another can be found. Different units and precisions may be used to model/measure these real world concepts, depending again on application environment characteristics. For instance, we measure the 'age' concept in 'years' for humans. In contrast, 'weeks' or even 'days' is the correct measure/precision for 'rats', whose life expectancy is less than that of humans. More importantly, the utilized precision might be vital for some applications. Measuring the 'height' concept with a precision of millimeters does not gain any additional benefit for some applications, despite the fact that such a precision is vital for an application built to correct the path of a space rocket. This is again related to application needs.
Conceptual schemas created to model application semantics are located on different system platforms, and they are obliged to conform to different organizational rules imposed by their environments. In other words, their organizational contexts are different. The context of a conceptual schema is highly related to the application requirements and to the perception of real world concepts in that application domain. As a result, related real world concepts may be modeled differently. For example, in a database environment where rats are used intensively to perform experiments in order to study the effects of certain drugs, the properties of rats are more important than the properties of people. Their age, weight, size, type, color, drug history, and physical condition should be maintained in the database. In such an application, it is enough to keep a record of an identifying property of experimenters (e.g., their names). In a literature database application, on the other hand, such detail about rats might not be necessary. Instead, detailed information about experimenters, such as their names, addresses, telephone numbers, and numbers of publications, and, for each experiment performed, the number of rats used in the experiment, may be modeled. In conclusion, different application environments put different weights on the importance of similar real world concepts. This causes conceptual schemas to have different structures, or even different semantics associated with their schema elements.
Beyond the different real world perceptions of applications, real world concepts themselves can be argumentative. The definitions, meanings, and interpretations of real world concepts change from time to time, place to place, and person to person. Considering the human concept as a sub-concept of animal might offend some religions, although it is a common and natural classification in scientific domains.
Databases often assume some of the facts related to the application environment without explicitly modeling them. This set of facts constitutes common knowledge organization-wide, but it might be the reason for certain conflicts outside the organization. An American company does not have to record the unit of the monetary amounts it models in its employee database; the unit is assumed to be "U.S. dollars", and this fact is known to everyone who works for the company. On the other hand, an international company which keeps track of foreign exchange rates has to record not only the units of the monetary amounts it models in its database, but also the transformation information between them on a day by day basis.
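The monetary-unit example can be sketched as follows; the records, the default-currency mechanism, and the exchange rates are all illustrative.

```python
# A sketch of implicit versus explicit context: the first record leaves its
# monetary unit implicit (an organization-wide assumption), the second records
# it explicitly; only the explicit form can be safely exchanged.

implicit_record = {"employee": "Smith", "salary": 60000}          # "USD" assumed
explicit_record = {"employee": "Meier", "salary": 55000, "currency": "EUR"}

rates_to_usd = {"USD": 1.0, "EUR": 1.1}   # would be maintained day by day

def salary_in_usd(record, default_currency=None):
    currency = record.get("currency", default_currency)
    if currency is None:
        raise ValueError("monetary unit is implicit and unknown outside the organization")
    return record["salary"] * rates_to_usd[currency]

print(salary_in_usd(explicit_record))                          # 60500.0
print(salary_in_usd(implicit_record, default_currency="USD"))  # 60000.0
```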
Some applications gather individual information units in an incremental manner, and almost all applications evolve. Consequently, attribute values may be missing, partial, or unknown at present, though they may become known in the near future. The graduation date of student entities is an example of this case. Not only individual facts, but also elements of conceptual schemas, may be missing. New attributes, new object classes, and new constraints can be added on demand.
The structural and behavioral requirements of applications are modeled using a variety of data models, each of which depends on a different philosophy of information modeling. The hierarchical data model represents real world concepts in the form of a hierarchy of record types. According to the relational data model, the table abstraction is a common way to represent real world concepts, while the object-oriented data model views everything as an object. Data models offer different constructs for modeling information units. The object-oriented data model provides a rich set of semantic primitives, such as classification, generalization, and aggregation. The relational data model is limited in this respect, providing neither the notion of generalization nor a direct correspondent of the object identifiers observed in object-oriented databases. Functional data models choose to represent attributes as functions. Semantic integrity constraints, behavioral and structural extensibility issues, and the individual constructs provided by data models differ significantly from one data model to another, causing certain semantic conflicts.
Some data models focus on special application domains. In general, data models aim to provide a one-to-one mapping between data model constructs and their corresponding real world concepts. Some application domains may contain real world concepts, and interrelationships among them, which cannot be captured directly by any construct in a general purpose data model. These application domains require special data model constructs. Data models proposed for version management and temporal data management are examples of such data models. In summary, the data model chosen to model application semantics has a direct effect on how precisely real world concepts can be represented by conceptual schemas. Data models with rich sets of semantic primitives produce conceptual schemas whose elements directly correspond to real world concepts, while data models with limited capabilities result in many concepts that overlap in conceptual schemas. Conceptual schemas whose elements correspond to more than one concept are a primary cause of semantic heterogeneity in databases.
There is no well-known standard for modeling information units in a semantically explicit manner. Conceptual schemas produced by means of today's data models cannot explicitly express the meanings of the real world counterparts they are intended to model. Different data modeling choices within a data model, and the absence of a standard guideline for information modeling, are the primary reasons for this. The same application domain can be modeled with the same data model in different ways, producing different schemas. As long as the produced schema elements can express what they correspond to, this does not introduce a problem. Unfortunately, this is not the case with today's data models. The notions of class semantics and instance semantics should be revisited in order to obtain conceptual schemas that can express what real world concepts they model. In object-oriented databases, one should be able to determine the real world concept an object class corresponds to by examining its attributes, the methods defined on its instances, its relationships with other object classes, and its location within the class hierarchy. Similarly, object instances should be identifiable based on their attribute values along with their identifiers.
Equivalent constructs in data models contribute to the semantic heterogeneity problem, although they help semantic relativism. The choice between modeling a concept as an attribute value (a string of characters) or as an object class is an illustrative example. This choice is application and user dependent, and causes some instances of semantic conflicts.

Often, real-world concepts are modeled incorrectly due to a wrong perception of the application environment by modelers, or a wrong specification of application requirements. Conceptual schemas produced as the result of such a process cannot express correct information unit semantics.
Missing or incorrect information stored in databases is another cause of problems attributable to users and modelers. Missing or incorrect information is especially important during the investigation of instance equivalences.
Finally, modelers tend to make assumptions that would not be explicit in conceptual schemas. Moreover, modelers perceive application environments differently while modeling information requirements.
2.1.4 Semantic Heterogeneity Resolution
The use of meta-attributes to describe the original attributes of a conceptual schema contributes to clear and easy-to-interpret information unit semantics. Siegel and Madnick [75] describe a rule-based approach to semantic specification that can be used to establish semantic agreement between a source and a receiver. The source (database) supplies data used by the receiver (application). In such a scenario, the use of meta-data is suggested in order to decide whether the database can provide data that is semantically meaningful to the application, and whether the application is affected by a change in the database semantics. The DMD (Database Metadata Dictionary) defines attribute semantics by associating meta-attributes with them. The ASV (Application Semantic View) contains the application's definition of its semantic requirements. Along with meta-attributes, both the DMD and the ASV contain a set of rules for defining attribute semantics. During interoperation, ASV and DMD rules are compared to decide if the source (database) can supply semantically meaningful information to the receiver (application). Although studied on a source-receiver model, it is stated that the technique can also be used for semantic reconciliation and schema integration in multidatabase systems.
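To make the comparison step concrete, the following minimal sketch (our own, with invented attribute and meta-attribute names, not code from [75]) checks whether a hypothetical DMD entry satisfies the requirements recorded in an ASV entry:

    # Hypothetical DMD and ASV entries; all names are illustrative.
    dmd_entry = {"attribute": "trade_price",
                 "currency": "USD", "scale_factor": 1, "status": "latest"}
    asv_entry = {"attribute": "trade_price",
                 "currency": "USD", "scale_factor": 1}

    def semantically_meaningful(dmd, asv):
        # The source can serve the receiver if every meta-attribute the
        # application requires is defined identically in the database.
        return all(dmd.get(k) == v for k, v in asv.items())

    print(semantically_meaningful(dmd_entry, asv_entry))  # prints True

The actual approach is rule-based rather than a flat dictionary comparison, but the decision being made is the same.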
Sciore et al. [71] examine the problem of how data from one environment can be transferred to, and manipulated in, another environment while respecting the context associated with the data. The relational model is extended with meta-attributes, which can be associated with attributes to describe them, in order to represent the context information. Meta-attributes can be defined explicitly, or they can be defined using rules that enable the derivation of meta-attribute values from the actual attribute values on which they are defined. The notion of conversion is defined, so that attribute values can be converted from one context to another. Conversion functions change the context but preserve the semantics of information units. In other words, they produce an alternative representation of an information unit in another environment. Conversion functions, meta-attributes, and the "incontext" clause, which permits a query or update to specify its context requirements, are added to SQL.
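As an illustration of how a conversion function changes context while preserving meaning, consider the following sketch (our own rendering in Python; the paper defines conversions inside extended SQL, and the rate table here is invented):

    RATES = {("FRF", "USD"): 0.17}  # invented conversion table

    def convert(value, source_ctx, target_ctx):
        # A conversion function changes the context but preserves the
        # semantics: the result represents the same information unit.
        if source_ctx["currency"] == target_ctx["currency"]:
            return value
        return value * RATES[(source_ctx["currency"], target_ctx["currency"])]

    price = 100.0  # meaningful only together with its context
    print(convert(price, {"currency": "FRF"}, {"currency": "USD"}))  # 17.0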
A successive study by Sciore et al. [72] takes one step further in context representation and interchange by treating semantic values as the unit of exchange that facilitates semantic interoperability between heterogeneous information systems. A semantic value is defined to be a piece of data together with its associated context. Exchanging semantic values rather than simple values enables mapping values from one environment into another while preserving their semantics. An extended version of SQL, called C-SQL (Context SQL), is introduced, which includes access, manipulation, and update capabilities for semantic values and their contexts. A system architecture is presented that allows autonomous components to share semantic values. The key component in this architecture is the Context Mediator, whose job is to identify and construct the semantic values being sent, to determine when the exchange is meaningful, and to convert the semantic values to the form required by the receiver.
Breitbart [13] reviews multidatabase interoperability methodologies with regard to issues related to schema integration, semantic heterogeneity, and transaction management. The semantic heterogeneity problem is viewed within the context of schema integration in this work. According to the author, there are two approaches to the schema integration problem: the global schema approach, in which database administrators create a global schema for a set of local databases, and the federated approach, in which administrators create import schemas to keep track of information about remote information units accessed by local databases. The author concludes that standardization efforts would contribute to the development of future multidatabase systems and to the solutions of the problems associated with them, including the semantic heterogeneity problem.
The Common (Canonical) Data Model (CDM) is important during interoperation among heterogeneous components, since it can be used to overcome structural as well as semantic differences between component databases (i.e., if it contains constructs and primitives which make the semantics of information units explicit, then the semantic heterogeneity resolution problem becomes easier). Hence, standardization makes good sense for the CDM employed federation-wide. Saltor et al. [67] investigate the suitability of data models as the CDM in federated database systems with respect to the expressiveness and semantic relativism they support. Expressiveness is defined as the ability of a data model to directly represent a conceptualization no matter how complex it may be. Semantic relativism of a database is the degree to which it can accommodate different conceptualizations of the same real world. The CDM incorporated in the federation should have an expressive power rich enough to cover all the data models in use in individual components. Specifically, it should contain semantic primitives such as generalization, classification, and aggregation. It should also be extensible in terms of operations and integrity constraints. Moreover, the CDM should have primitives which enable switching back and forth between alternative representations of the same or related information, ensuring semantic relativism in the federation. A number of data models ranging from hierarchical to object-oriented have been analyzed with respect to expressiveness and semantic relativism. The E-R and object-oriented models are found to be adequate candidates for the CDM as the result of this analysis.
One way to resolve the semantic heterogeneity problem is to explicitly describe the semantics of conceptual schema elements using a meta-system. Explicit semantics of schema elements make semantic conflicts easy to locate and easy to resolve. M(DM), an extensible meta level system in which the syntax and semantics of data models, schemas, and databases can be uniformly represented, is presented by Barsalou and Gangopadhyay [26, 9]. Higher order types (META, THING, LINK, BASIC THING, CONNECTION) organized in the form of an inheritance lattice, which comprehensively represents constructs in well-known data models, and second order formulae (observer functions and observer predicates) associated with higher order types are able to capture various data model constructs. Although they focus on representational heterogeneity, we believe that their work takes initial steps towards "representing information unit semantics using data model constructs". The authors illustrate how relational, DAPLEX, IRIS [89, 25], entity-relationship, structural, and generic object-oriented data model constructs and primitives can be modeled in terms of M(DM) types, observer functions, and observer predicates.
Identifying objects in different databases that are semantically related, and then
resolving schematic differences among semantically related objects have been the focus of
Kashyap and Sheth [36]. They define and formalize the notions of schema correspondence
and semantic proximity. Schema correspondence is used to represent structural similarities
between objects. Semantic proximity represents semantic similarity between objects.
Associating semantic correspondences with semantic proximity enables the representation
of semantic similarity (e.g., semantic equivalence, semantic relationship, semantic relevance, semantic resemblance, and semantic incompatibility) among semantically related
objects. Based on semantic proximity, schematic and data conflicts are enumerated and
classified. Identifying semantically related objects is also an important issue for schema
integration research. Therefore, it will be revisited in the next section when we discuss the
research related to schema integration.
Sophisticated Language Capabilities constitute another solution to the semantic heterogeneity problem in multidatabases. If a single fact, a simple connection between two things, can be implemented in a variety of field configurations [41], then an application domain or a conceptual territory may well have been implemented by numerous representations. Sophisticated language capabilities enable switching back and forth between these alternative representations of related domains. Litwin et al. [52] present methodologies for multidatabase system design. In addition, they evaluate language capabilities and language limitations for multidatabase manipulations in commercial systems. They provide some important research directions for multidatabase interoperability.
Once it is agreed that sophisticated language capabilities are the way to overcome the semantic heterogeneity problem, the next step is to decide what constructs these languages should include.
First, Kent [39] concentrates on the domain mismatch and schema mismatch problems in multidatabases. Domain mismatch arises when several databases treat some common conceptual territory in different ways (e.g., differences in units of measurement), while schema mismatch arises when similar concepts are expressed differently in conceptual schemas. The notions of conceptual territory, spheres, domain groups, domain mappings, localized and integrator functions, and type and function groups have been proposed as part of the solution. A domain group is a set of domains which cover some conceptual territory. Each domain is typically associated with a distinct sphere, which is a portion of a database schema along with its associated data. Switching back and forth between domains through domain mappings enables seeing domains in some integrated way. Integration becomes possible by means of localized and integrator functions. Localized functions appear in distinct spheres, and integrator functions choose and execute appropriate localized functions to achieve interoperation. A type whose instances are types is named a type group, while a type whose instances are functions is named a function group. The proposed solution suggests structuring the environment where domain mismatch and schema mismatch are observed in terms of domain groups corresponding to conceptual territories, with different domains occurring in different spheres. Integration then occurs using integrating domains in an integrating sphere. The proposed technique requires sophisticated language capabilities in order to permit the solutions to be expressed and maintained within the database, rather than in application code. Among these capabilities are arbitrary computational power (e.g., conditionals, iteration, recursion, aggregate types, and operations), type and function groups, uniform treatment of system and user objects, update entry points, overloading, derived types, and subtypes of literals.
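The division of labor between localized and integrator functions can be sketched as follows (a minimal illustration of our own; the sphere names, pay attributes, and dispatch mechanism are all invented):

    # Each sphere supplies its own localized "salary" function.
    def salary_sphere_a(emp):          # localized: monthly pay stored
        return emp["monthly_pay"] * 12

    def salary_sphere_b(emp):          # localized: annual pay stored
        return emp["annual_pay"]

    LOCALIZED = {"A": salary_sphere_a, "B": salary_sphere_b}

    def salary(emp, sphere):
        # Integrator function: choose and execute the appropriate
        # localized function, so callers see one integrated domain.
        return LOCALIZED[sphere](emp)

    print(salary({"monthly_pay": 5000}, "A"))  # 60000
    print(salary({"annual_pay": 60000}, "B"))  # 60000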
Second, Krishnamurthy et al. [46] claim that higher order expressions, higher order view definitions, and complete updatability of multidatabase views are the language features required for resolving schematic discrepancies. A higher order expression is an expression in which variables can range over data and meta-data in multiple databases. Higher order expressions permit the formulation of queries spanning several databases. Moreover, they are used to define a unified view over multiple databases. Higher order views provide database and integration transparencies by defining a varying number of relations depending on the state of the database. IDL (Interoperable Database Language), which provides higher order capabilities by extending Horn clauses to higher order logic, is proposed.
Last but not least, MRDSM (Multics Relational Data Store Multidatabase), a prototype multidatabase system [51], provides multidatabase users with many capabilities for manipulating data that may be in distinct, non-integrated schemas. MRDSM contains capabilities to join data in different databases, to broadcast user intentions over database schemas with the same or different naming rules for data with similar meanings, to flow data between databases, to transform actual attribute meanings into user defined types, and to aggregate data from different databases using various built-in functions. The MRDSM data manipulation language is called MDSL (Multidatabase Data Sublanguage), which is based on tuple calculus. To achieve multidatabase interoperability, MDSL should support the following notions: multiple identifiers (a name shared by several attributes, relations, or databases), semantic variables (several relations containing semantically relevant information under one name in a user query), dynamic attributes (non-persistent attributes defined on actual attributes), and new standard functions for multidatabases such as Name, Norm, and UpTo. The Name function returns the name of the container of data, and it is used when the same data is maintained as a value in one database and as meta-data in another. The Norm function merges all tuples corresponding to the same object into a single tuple. The UpTo function limits the multiplicity of information that may come from several databases.
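The following rough Python analogues of Norm and UpTo (our reading of the descriptions above; MDSL itself is a tuple-calculus-based sublanguage, and these signatures are invented) convey what the two functions compute:

    def norm(tuples, key):
        # Merge all tuples describing the same object into one tuple.
        merged = {}
        for t in tuples:
            merged.setdefault(t[key], {}).update(t)
        return list(merged.values())

    def upto(values, n):
        # Limit the multiplicity of information coming from several
        # databases to at most n items.
        return values[:n]

    rows = [{"isbn": "1", "title": "DB"}, {"isbn": "1", "price": 30}]
    print(norm(rows, "isbn"))
    # [{'isbn': '1', 'title': 'DB', 'price': 30}]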
Still another way to resolve semantic conflicts among databases in a federation, as an alternative to language features, is to describe the sharable part of component databases by explicitly maintaining descriptions and relationships of sharable information units. McLeod [57], McLeod and Hammer [29], and McLeod et al. [30] describe a mechanism for identifying and resolving semantic heterogeneity. The authors describe three components as part of the solution: meta-functions, the local lexicon, and the semantic dictionary. Meta-functions return structural information about objects in remote database components. The Local Lexicon in each component provides semantic information about the sharable objects in that component in the form of a common knowledge representation. In other words, the Local Lexicon specifies the component's perspective on the relationships between its local types and a global set of commonly understood concepts. The Semantic Dictionary, which is created and maintained by the Sharing Advisor, contains partial knowledge (relationships) about all the terms in the local lexica in the federation. The role of the Sharing Advisor in this architecture is to detect similarity and dissimilarity between concepts with the help of sharing heuristics. The ultimate goal is to derive the meanings of concepts unknown to a component by using meta-functions, the local lexicon, and the semantic dictionary.
After having agreed on a collection of concepts, mapping conceptual schema elements into this set of commonly understood concepts, and maintaining a hierarchy of such concepts, are related to both semantic heterogeneity resolution and information discovery. An approach that provides a uniform framework for organizing, indexing, searching, and browsing information units within a federated database system environment is given by McLeod and Si [58]. This work concentrates on information discovery, the capability for users to discover available information units (resources) that might be relevant to their interests, without even being aware of the existence and location of remote database sources. Type objects in individual components are organized into a concept hierarchy in which concepts may have subconcept/superconcept relationships between them. Once the exported information is properly organized within the concept hierarchies, components can request information that is similar, complementary, or overlapping with their local information.
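A concept hierarchy of this kind can be reduced to a very small sketch (our own; the concepts and the notion of "related" used here are invented for illustration):

    # Superconcept links; exported type objects would be attached
    # to these concepts.
    SUPER = {"sedan": "car", "car": "vehicle", "truck": "vehicle"}

    def ancestors(concept):
        # All superconcepts of a concept, nearest first.
        chain = []
        while concept in SUPER:
            concept = SUPER[concept]
            chain.append(concept)
        return chain

    def related(c1, c2):
        # Two concepts are related if one subsumes the other.
        return c1 == c2 or c1 in ancestors(c2) or c2 in ancestors(c1)

    print(related("sedan", "vehicle"))  # True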
In another study, Tsai and Chen [82] describe an approach to integrate database schemas into a relational global view which consists of concept hierarchies. The schemas to be integrated are organized into concept hierarchies which capture the relationships among relations in different relational databases. Concept hierarchies are created by multidatabase administrators, who collect the necessary information from local database administrators. Multidatabase administrators integrate relational schemas into concept hierarchies by means of a schema integration language designed for this purpose. Concept hierarchies provide users with valuable information for capturing relationships among different relations when specifying global queries. The transformation from global queries to local subqueries, and query optimization, are also studied in this work.
Advances in the organization of concept hierarchies led to the notion of ontologies: concepts, along with their relationships, which describe information units. Kahng and McLeod [35] present a Dynamic Classificational Ontology, a mediator that helps component database systems identify and resolve ontological similarities and differences in a cooperative federated database system for the purpose of information discovery. According to the authors, such an ontology consists of a static part (which requires agreement among components) and a dynamic part (which changes dynamically). Participating components contribute to the development of the dynamic part.
Wiederhold [88] emphasizes the importance of domain ontology, a vocabulary of
terms and a specification of their relationships, for interoperation at the semantic level. A
knowledge-based algebra including intersection, union, difference, and map operations
over ontologies is defined. This algebra is intended to support information integration.
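Treating ontologies simply as sets of terms (a deliberate simplification of ours; the algebra in [88] also carries term relationships), the operations can be pictured as follows:

    o1 = {"patient", "physician", "invoice"}
    o2 = {"customer", "invoice", "payment"}

    print(o1 & o2)   # intersection: the shared vocabulary
    print(o1 | o2)   # union: merged vocabulary of both sources
    print(o1 - o2)   # difference: terms private to the first ontology

    def map_terms(ontology, mapping):
        # map: rename terms so that two ontologies become comparable.
        return {mapping.get(t, t) for t in ontology}

    print(o1 & map_terms(o2, {"customer": "patient"}))
    # {'invoice', 'patient'}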
2.2 Schema Integration
Batini et al. [11] present an excellent survey of methodologies for database schema integration. They identify four necessary steps for schema integration: preintegration (grouping the schemas to be integrated), comparison of schemas (checking for representational conflicts in the original schemas), conforming the schemas (resolving conflicts with schema transformations), and merging/restructuring (superimposing schemas). This survey assumes a global (unified) schema (global schema, integrated schema, composite schema, and unified schema are used as synonyms of each other). The global schema obtained as a result of the integration process, and the temporary schemas obtained via schema transformations, should satisfy quality criteria such as completeness, minimality, and understandability. In this work, heterogeneity issues were addressed as part of schema integration.
Integration languages, which contain a set of operations for schema restructuring purposes, are one of the solutions proposed for schema integration. Schema restructuring is performed so that the component schemas to be integrated look like each other. In this respect, restructuring operators help resolve representational differences between related information units. Being able to specify the relationships between the global (integrated) schema and the individual component schemas that are fed to the schema integration process is an important and necessary capability for schema integration. Integration languages proposed for this purpose contain primitives to enable the derivation of a global schema from component schemas. Breitbart et al. [14] present such an integration language, obtained by extending relational data model primitives. ADDS (Amoco Distributed Database System) enables the development of a heterogeneous distributed database system to logically integrate pre-existing applications without redesign. Extended primitives such as join, outer join, natural join, natural outer join, difference, and select are supported by the uniform data definition language of ADDS. The composite database is obtained by query expressions which use these extended operations on component database schemas.
Motro [59] explains how to generate a virtual database (a superview) by imposing schema derivation operators on original component schemas. A collection of primitives (meet, join, fold, rename, combine, connect, aggregate, telescope) is described. Virtual integration of databases through superviews becomes possible with the Virtual Database Generator, which accepts integration statements along with a number of schemas to be integrated. The Virtual Database Generator then generates a superview, and superview-to-component database mappings. The generated mappings are used by the Virtual Query Processor to transform queries imposed on the superview into equivalent ones executable on component databases. Mappings are constructed by means of expressions on the initial classes of component schemas. This study has a special importance with regard to the folding problem because it emphasizes the customization aspect of that problem.
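Two of the derivation operators can be sketched over a toy schema representation (classes as attribute sets; this encoding, and the abridged operator semantics, are ours, not Motro's definitions):

    def rename(schema, old, new):
        # rename: give a class a new name in the superview.
        return {new if name == old else name: attrs
                for name, attrs in schema.items()}

    def meet(schema, c1, c2, common):
        # meet: add a generalization holding the attributes the
        # two classes share.
        return {**schema, common: schema[c1] & schema[c2]}

    s = {"Employee": {"name", "salary"}, "Student": {"name", "gpa"}}
    print(meet(s, "Employee", "Student", "Person")["Person"])  # {'name'}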
Schema integration methodologies assuming no global schema in federated databases depend on query language primitives which are capable of combining related information from autonomous component databases. These methodologies provide the illusion of a unified schema, although such an integration does not take place physically. Czejdo and Embley [18] study the schema integration and query formulation problems in the context of federated databases. As a solution, the relational model is extended with connectors, which impose predicate conditions over attributes of relations residing at the same site or at different sites, and whose purpose is to specify relationships between attributes within a user query. A query language which enables the representation and manipulation of database schemas, and the graphical formulation of queries, is proposed. Since they assume a federated database system with no global schema, there is no actual integration of component database schemas, and users are supposed to be database administrators who have enough knowledge about the federation to formulate queries over a number of components. Virtual integration of component schemas is achieved by means of operators related to connectors.
How to construct a global (integrated) schema out of independently maintained component schemas is another important issue in schema integration. Correspondences between the elements of component schemas should be known in advance in order to achieve such an integration in the form of an integrated schema. Chen et al. [17, 45] focus on the problem of constructing a global schema. Database administrators are responsible for specifying corresponding assertions: conditions among classes, attributes, or composition hierarchies of object schemas in different components. Based on the corresponding assertions, integration rules are constructed, which use a set of primitive integration operators to do the integration. The semantics of these operators (refine, hide, rename, aggregate, invert, build, demolish, ounion, generalize, specialize, inherit, upgrade) are defined. Integration operators restructure component schemas so that they become similar to each other. Rules governing the mapping from corresponding assertion cases to integration rules are also given. Mapping tables, which show the relationship between the global schema and component schemas, are constructed incrementally, stored in the Data Dictionary/Directory, and utilized for global query processing.
Afsarmanesh et al. [3] describe a partial schema integration technique for federated databases. The 3DIS data model [2] is adopted as the CDM throughout the federation. According to the proposed architecture, each component in the federation (named a peer) can have one integrated schema where information exported from other peers will be stored. The Community Directory, a special agent in this architecture, provides up-to-date information about agents in the federation. Primitives for types (rename, union, subtract, restrict) and maps (rename, union, threading) are provided for schema restructuring purposes.
Utilizing Views in object-oriented database systems is one of the traditional schema integration approaches. Abiteboul and Bonner [1] introduce such a view mechanism in the context of an object-oriented data model. The proposed view mechanism allows programmers to restructure class hierarchies, and to modify the behavior and structure of objects. Programmers are provided with capabilities to restructure the class hierarchies, such as hiding classes, introducing new ones, merging attributes, and importing classes from remote databases. Newly introduced classes are called virtual classes. Virtual classes are populated by specialization, generalization, or behavioral generalization. In addition, they can be populated by imaginary objects that exist in the view but not in any database. Each imaginary object is given an object identifier.
Kaul et al. [37] suggest integrating heterogeneous information bases (databases, information retrieval systems, and file systems) by object-oriented views. Viewsystem, an object-oriented programming environment developed in this study, employs the VODAK data model to overcome differences in data models and DBMSs. VODAK supports complex data types, a rich set of data abstractions (specialization, aggregation, and grouping), and inheritance. Furthermore, it is extensible in the sense that it is open to user-defined data types and user-defined semantic relationships. VODAK supports two kinds of classes: extensional (their objects are stored with them) and intensional (non-materialized) classes. Intensional classes are further divided into two groups: external (imported from an external information base) and derived (described in terms of underlying classes). The schema transformation mapping, during which all schemas of component information bases are described with VODAK, and the schema integration mapping, during which individual schemas are brought together, are the two levels of abstraction provided by Viewsystem to integrate heterogeneous information bases. Utilization of the VODAK features listed above supports the schema transformation mapping phase, while a collection of class integration operators offered by Viewsystem supports the schema integration mapping phase. A query evaluation technique is provided which takes the different kinds of classes into consideration.
Byeon and McLeod [15] combine the concepts of views and schema versions into a unified concept called the virtual database. A virtual database can contain imported, derived, and local classes. Accordingly, imported and local instances may coexist in a virtual database. A methodology for creating a virtual database is provided. The concept of the virtual database has implications for the integration of remote types and instances into local databases.
Like view integration, generalization-based integration has been a significant schema integration technique in the database literature. The generalization semantic primitive helps eliminate representational differences between the object classes to be integrated, and builds a semantic link between them. Dayal and Hwang [19] devote their work to a database integration methodology based on views and generalization, an abstraction that groups classes of objects with common properties into a generic class of objects. Two types of generalization are recognized: entity generalization (ISAe) and function generalization (ISAf). The functional data model is expanded with ISAe and ISAf. Defining the global schema as a view with superfunctions and supertypes, which are generalizations of functions and types in component schemas, constitutes the proposed solution. Prior to the definition of this view, schema differences are resolved. An algorithm for transforming a global query into equivalent queries on a collection of local schemas is given.
Knowledge of how conceptual schema elements at different sites are interrelated is vital to schema integration. The investigation of domain relationships, class relationships, and attribute relationships contributes to the acquisition of this knowledge. Domain relationships have been the focus of Elmasri and Navathe [21], who investigate the integration of entity classes in different views expressed in the E-C-R (Entity-Category-Relationship) model, an extended version of the E-R data model (Batini et al. [10] devote a whole textbook to conceptual database design using the E-R model). The domains of two entity classes may be identical, contained, intersecting, or disjoint depending on the real world concepts they model. The key attributes of these two classes, and the relationship between their domains, determine the approach to be taken during the integration process. For example, if they have identical domains (they model the same real world concept), they are integrated in the form of a new class. The union of the attributes of these two entity classes becomes the attributes of the new class. Integration rules for identical, contained, intersecting, and disjoint domains, as well as integration rules for relationships, are given in this work.
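The identical-domain rule quoted above reduces, in the simplest set-based view (our encoding; attribute conflicts and keys are ignored here), to a union of attribute sets:

    def integrate_identical(class_a, class_b):
        # Identical domains: the two classes model the same real world
        # concept, so the new class carries the union of their attributes.
        return class_a | class_b

    employee_view1 = {"ssn", "name", "salary"}
    employee_view2 = {"ssn", "name", "department"}
    print(integrate_identical(employee_view1, employee_view2))
    # {'ssn', 'name', 'salary', 'department'} (set order may vary)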
Attribute relationships were investigated by several studies [54, 47, 48, 98, 99].
Mannino and Effelsberg [54] illustrate two types of matching techniques for global
schema design: entity type matching and attribute matching. Entity type matching divides
the set of entity types into common areas in which entity types represent similar kinds of
information. Within each common area, attribute matching can be performed by requiring
designer input for specification of assertions among attributes.
Larson et al. [47] provide a formal study of the different forms of equivalence between attributes represented in the E-C-R (Entity-Category-Relationship) data model. According to the authors, any pair of entity classes whose identifying attributes can be integrated can themselves be integrated. There may be various forms of equivalence between attributes. Attributes having strong or weak equivalence between them can be integrated. If attributes are strongly equivalent, the integrated attributes can be updated. If attributes are neither strongly nor weakly equivalent but have the same roles, they are said to be disjoint equivalent, and can still be integrated. Object equivalences are defined accordingly by considering identifying attributes. The notion of relationship equivalence is also defined.
Yu et al. [98, 99] propose determining attribute relationships semi-automatically using common concepts, concept hierarchies, and aggregate concept hierarchies. It is assumed that each attribute can be characterized by a set of common concepts. For each attribute, the database administrator identifies the applicable concept(s) among a set of application independent concepts, which form a hierarchy. Joinable or semantically equivalent attributes are identified using similarity values of attributes, which are obtained by a mathematical derivation from the relevant concepts.
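A drastically simplified stand-in for such a similarity value (the actual derivation in [98, 99] is different; this only conveys that shared concepts drive the score) could be a set-overlap ratio:

    def similarity(concepts_a, concepts_b):
        # Fraction of characterizing concepts the two attributes share.
        return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)

    salary = {"money", "per-person", "recurring"}   # invented concepts
    wage   = {"money", "per-person", "hourly"}
    print(similarity(salary, wage))  # 0.5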
Some of the research related to schema integration takes the (semi-)automatic determination of relationships between schema elements into account, rather than assuming that such relationships are known prior to integration. The notions of equivalence of attributes, domains, and classes allow determining relationships between schema elements. This determination process depends on names, data values, and attribute specifications, as well as user input. In Li and Clifton [48], attributes are categorized according to their names, field specifications, and data values in order to recognize similarity between attributes. According to this methodology, a DBMS-specific parser extracts database information from the first database to be integrated. Its outputs, normalized characteristics and statistics, are used to generate training data which is fed to a network training process. The trained network is called a self-organizing map network. Finally, the similarity determination process takes the trained network for the first database, and the normalized characteristics and statistics from the second database, as inputs. It then produces equivalent attributes and the similarity between them, which are subject to a final user verification. Categorizing attributes is performed by a self-organizing classifier algorithm, while training the network to recognize input patterns and determining similarity between attributes are performed by a back-propagation learning algorithm.
Terminological Knowledge also helps identify relationships between conceptual schema elements in component databases for schema integration. Fankhauser and Neuhold [23, 24] propose integrating heterogeneous database schemas using fuzzy real world knowledge. Real world knowledge is represented by means of a fuzzy terminological network. In this network, nodes represent terms corresponding to real world concepts, while edges represent generalization or positive association between terms. The absence of an edge between two nodes signifies negative association. The strength of a path indicates the similarity between the terms corresponding to the nodes located at each end of the path, and is computed by means of three t-norms presented in this study. It is suggested that this strength can be used to decide if two names are similar.
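For instance, with the product t-norm (one possible choice; the paper presents three t-norms, and the terms and edge weights below are invented), the strength of a path is the product of its edge strengths:

    def path_strength(edge_weights):
        # Combine the strengths of the edges along a path
        # using the product t-norm.
        strength = 1.0
        for w in edge_weights:
            strength *= w
        return strength

    # car --0.9--> vehicle <--0.8-- automobile
    print(round(path_strength([0.9, 0.8]), 2))  # 0.72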
2.3 Schema Evolution
The need for schema evolution methodologies stemmed from engineering applications, where meta-data and its organization change as frequently as the data itself. Research on schema evolution is relatively old. The different kinds of schema evolution cases, and the problems associated with these cases, are well understood. Consequently, the approaches developed for schema evolution are mature enough to be relied upon. Banerjee and Kim [8] enumerate the kinds of schema changes in ORION [42, 44], an object-oriented database system. Two categories of schema changes are identified: changes to class definitions and changes to the structure of a class hierarchy. Twenty distinct kinds of schema changes are described along with their semantics. Invariants of these schema change kinds (properties of a class lattice that must be preserved before and after any schema change) are also specified, along with rules governing the maintenance of database schemas in response to changes. A formal demonstration of the completeness (e.g., does ORION capture every possible kind of schema change?) and correctness (e.g., do ORION schema change operations generate valid schemas?) of the schema evolution mechanism in ORION is presented. In a related study, Banerjee et al. [7] provide a taxonomy of schema evolution in the object-oriented context. The provided taxonomy divides schema changes into four categories: changes to an instance variable, changes to a method, changes to an edge in the class hierarchy, and changes to a node in the class hierarchy. Invariants and rules of schema evolution are studied.
Type evolution constitutes an instance of schema evolution. Skarra and Zdonik [76, 77] examine the problem of type evolution in an object-oriented database environment. Type changes have a direct effect on objects (instances) of types, as well as on programs that use instances of types. Consequently, new programs running on old instances of types, and old programs running on new instances of types, may fail due to changes in the type definitions. Applying versioning to types has been proposed as a solution in this study. The Version Set Interface for a type is defined as the behaviors supported by all versions of that type. The difference between the behaviors supported by the Version Set Interface and the behaviors supported by a specific type version creates the problem with respect to old and new programs. The authors propose writing handlers for this difference in order to cover the behavior(s) unsupported by this specific type version. As a result, all versions of a type appear to support a uniform set of behaviors.
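The handler idea can be reduced to a small sketch (our own rendering; the class, the missing behavior, and the fallback value are all invented):

    class PersonV1:
        # An old type version that never stored an email address.
        def __init__(self, name):
            self.name = name

    def email_handler(obj):
        # Handler covering the behavior missing from this version.
        return "unknown@example.org"

    def email(obj):
        # Every version appears to support the full version set
        # interface: versions lacking the behavior use the handler.
        return obj.email() if hasattr(obj, "email") else email_handler(obj)

    print(email(PersonV1("Ada")))  # unknown@example.org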
Some mechanisms apply a versioning approach to the schema evolution problem. Object classes, attributes, conceptual schemas, and portions of conceptual schemas (and even object instances) can be versioned. Versioning is beneficial when new and updated versions should be available to database applications at the same time. Andany et al. [4] also take a versioning approach to schema evolution. However, the versioned objects are not types (object classes) in their approach. Rather, they version contexts: partitions of a schema which regroup certain elements of the schema and mask others. Accordingly, a database schema consists of a set of contexts. The evolution and manipulation semantics of contexts are defined and illustrated in detail.
Changes to the conceptual schema may affect instances as well as methods. A methodology suggested for schema evolution should have minimal effect on object instances, thus alleviating the need for database restructuring. Object instances should be updated only if there is a specific need to do so. Nguyen and Rieu [61] compare database evolution methodologies in object-oriented systems. Prototype systems are evaluated with respect to the various schema change operators they support. According to the authors, change propagation, which is the process of propagating schema changes to the instances, is underemphasized. An approach which allows change propagation in object-oriented databases is provided.
Evolution is not limited to object types. Sometimes object instances evolve and behave like object types. Richardson and Schwarz [66] observe a different need: sometimes objects need to change their types. The notion of aspects is introduced for this purpose. An aspect extends an object with new behavior and new state while maintaining the same identity. It can be used to smooth over differences between two types so that they can interchange instances. Discovery, authorization, and deletion of aspects are user initiated. Aspects and conformity model roles that can be played by different types of entities having some common behavior.
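A minimal sketch of the aspect idea (our own; the class names and the delegation scheme are invented) shows new state and behavior being attached while the underlying identity is preserved:

    class Employee:
        def __init__(self, name):
            self.name = name

    class ManagerAspect:
        # An aspect wraps an existing object: it adds state and
        # behavior, but the underlying identity does not change.
        def __init__(self, base):
            self.base = base        # same underlying object
            self.reports = []       # new state added by the aspect

        def add_report(self, emp):  # new behavior added by the aspect
            self.reports.append(emp)

    e = Employee("Ada")
    m = ManagerAspect(e)
    m.add_report(Employee("Grace"))
    print(m.base is e, len(m.reports))  # True 1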
In a more recent study, Lee and McLeod [49, 50] examine changes to the fundamental semantics of objects in object-oriented database systems. Such changes are termed Object Flavor Evolution. A learning-based approach to conceptual database evolution is presented. The proposed data model, PKM (Personal Knowledge Manager), permits each object to have a flavor (characteristic). An object can evolve to a different flavor when there is a change in its characteristics. There are three levels of flavors supported by the PKM data model. Every object is initially created at the first level, and an object can be deleted if it is at the first level. Objects with the atomic flavor (non-decomposable information units) are in the first level. Object flavors at the second level are open atomic set (expandable set of atomic objects), closed atomic set (non-expandable set of atomic objects), social (concepts that are described by their relationships with other objects), mapping (binary relationships between objects), and procedural (executable operations/methods). Objects with the flavors open social set (expandable set of social objects), closed social set (non-expandable set of social objects), compound mapping (virtual composition of mapping objects), and compound procedural (virtual composition of procedural objects) constitute the third level. Objects can evolve from one flavor to another, and thirteen such evolution cases are illustrated. The object manipulation operations of PKM are extended with operations for processing objects with different kinds of flavors. In the architecture provided by the authors, the Intelligent Concept Evolver, an entity which supports object flavor evolution and learning, utilizes different learning techniques.
2.4 Categorization of Research Applicable to the Folding Problem
Categorization of the related research that can be applied to the folding problem is vital in discussing the shortcomings of database research with regard to the folding problem. There are mainly three categories of research applicable to the folding problem: (a) Global Schema Approach: components agree on a common, federation-wide global schema before any information sharing and exchange takes place. Information unit semantics and inter-database relationships are fixed on this schema. Information sharing and exchange occur through this shared global schema. (b) Federated Databases with Semantic Dictionary or Ontologies: components agree on a pool of real world concepts and relationships between concepts. Each component is responsible for expressing the sharable portion of its conceptual schema in terms of this common vocabulary. Information sharing and exchange occur by analyzing the actual concepts implied by individual database elements, by investigating inter-concept relationships, and by deriving the meanings of unknown concepts when necessary. (c) Multidatabases with Powerful Multidatabase Language Capabilities: the multidatabase system employs a powerful language which is furnished with explicit primitives that enable a user to mediate through the component databases. Users in such an environment are assumed to be knowledgeable enough to express their intentions using statements of this language to achieve information sharing and exchange.
[Figure: component schemas are fed, together with correspondences between component schema elements, to a global schema integration step performed by schema integrator(s), producing a global (unified) schema and global-to-local mappings.]

Figure 5: Global Schema Approach to the Folding Problem.
Approaches assuming a global schema [1, 4, 17, 18, 19, 21, 37, 45, 47, 59] (Figure 5) cover related issues such as generating a global schema physically or virtually (via a view mechanism), generating global schema-to-local schema mappings, the usage of the generalization primitive in schema integration, and the notions of equivalence of domains, classes, and attributes between databases. The most serious limitation of the global schema approach is that it requires very broad global knowledge to come up with the global schema. It also requires huge global structures to be maintained (e.g., the correspondences between component schema elements in Figure 5). Consequently, the integration effort is very high, prohibiting this approach from being taken within the context of the folding problem. Another limitation is the lack of flexibility of the relationships between component database elements, since the relationships become fixed in the form of a unified schema after schema integration.
[Figure: component schemas are unified through semantic heterogeneity resolution and unification, guided by a semantic dictionary or ontologies and by user consultation.]

Figure 6: Federated Databases with Global Structures (Semantic Dictionary/Ontologies).
The Federated Databases with Semantic Dictionaries/Ontologies approach [29, 30, 31, 35, 57, 58, 82, 88] (Figure 6) is based on the idea of agreeing on a collection of concepts and inter-concept relationships federation-wide, and describing the sharable portions of component databases in terms of this commonly understood set of concepts. When considered within the context of the folding problem, e.g., in Figure 6, this approach requires a more moderate amount of global knowledge, global structures, integration effort, and actual exchange effort than the previous approach. The main limitation of this category is the difficulty of agreeing on a set of concepts and inter-concept relationships in a federation environment that constantly evolves.
[Figure: a multidatabase user combines non-integrated component schemas into an integrated schema by means of multidatabase language statements.]

Figure 7: Multidatabases with Multidatabase Languages.
As another approach towards the solution of the folding problem (Figure 7), multidatabase systems offer multidatabase users powerful multidatabase languages through which they can manipulate data in different non-integrated schemas [46, 51, 52]. This approach, when applied to the folding problem, requires only a small amount of global structures to be maintained. The inter-relationships of different databases are highly dynamic. Nevertheless, its requirement for database users to have and maintain a very high degree of global knowledge about remote information unit semantics constitutes a very important limitation with respect to the folding problem, leading to a high user effort during information sharing and exchange.
Schema Implantation and Semantic Evolution, our approach to the folding problem, requires no global structures, although it does not rule out the usage of such structures. (A detailed comparison of the applicable research categories and our approach is given in Chapter 6, where we discuss our approach's strengths over the others.) In our approach, the global knowledge required from a component in order to interoperate with other components is minimal, leading to a very low integration effort. Inter-relationships between database elements in the federation are highly dynamic, and the effort that has to be spent during actual sharing and exchange is moderate.
2.5 Projects Directly Related to the Folding Problem
In this section, we will review some of the projects that can be related to the folding problem, and compare them with our approach.

Figure 8 characterizes a representative set of projects related to the folding problem. The SIMS, MCC Carnot, and MCC InfoSleuth projects show some similarities with our approach. They also have significant differences with respect to the target environments on which they focus, the CDMs they employ, the global entities they maintain, and the database issues they cover.
The importance of explicit structures that maintain information unit semantics and their interrelationships is emphasized in several projects, including the SIMS project [5, 6]. In SIMS, a domain model in the form of a hierarchical, terminological knowledge base is constructed in order to describe application domain semantics. Furthermore, SIMS is specialized to a tightly-coupled multidatabase environment, and it provides access to information sources within that environment. In contrast, our approach does not assume a global entity such as a knowledge base, and it is applicable to databases in a CFDBS that can be created and maintained by highly autonomous components. The issues covered by the SIMS project but not by our approach are query formulation and access planning, which is due to the fact that we assume a less restricted target environment.
                   SIMS              Carnot                InfoSleuth             Our Approach
   Target          tightly coupled   knowledge-based       information sources    databases in a
   Environment     multidatabases    systems, process      in the Internet        federation
                                     models, databases
   Common Model/   LOOM              GCL                   KQML                   HSDM
   Language
   Global          a domain model    global context;       ontology(s)            none
   Entities                          declarative
                                     constraint base;
                                     articulation axioms
   Issues          information       unifying info;        finding/integrating    knowledge
                   access; query     semantic expansion    information; agent     acquisition;
                   formulation;      of queries;           collaboration; info.   database
                   access planning   interresource         advertisement,         integration;
                                     consistency           search and fusion      semantic
                                                                                  relativism

Figure 8: Projects related to the Folding Problem.
Following the same track as the SIMS project, the Carnot project at MCC [32, 81, 90, 91, 93, 95] also assumes a global schema or context, in the form of a Cyc knowledge base, which is used as the federating mechanism. A relationship between a domain concept from a local component and one or more concepts in the global context is expressed as an articulation axiom, a statement of equivalence between these contexts [32]. One of the services Carnot provides, among many others, is the semantic service, whose purpose is to provide a global, enterprise-wide view of all the resources being integrated [91]. The main limitation of this work is the difficulty of obtaining the global entities, such as the global context, the articulation axioms, and the declarative resource constraint base, which contains inter-resource dependencies, consistency requirements, and consistency restoration strategies.
Like the SIMS project, the Carnot project addresses providing access to a number of components by reformulating queries imposed at the global level in terms of queries that are expressible on individual components. However, Carnot considers knowledge-based systems and process models, as well as databases, as components. Again, we do not assume a global entity such as the global context observed in the Carnot project. The articulation axioms correspond to the interschema element relationships in our work, and unlike Carnot, we do not assume their availability prior to interoperation. In other words, our approach aims to obtain these correspondences between schema elements in different components rather than assuming that they are available prior to interoperation.
By contrast with the InfoSleuth project, our approach assumes more restricted components, namely databases. The InfoSleuth project [12, 34, 92, 94, 95, 96] investigates the use of Carnot technology in a more dynamically changing environment such as the Internet [12]. In such an ever-growing and ever-changing environment, information advertisement, information discovery, and collaboration between components to fuse information from many information sources are necessary means that must be provided. InfoSleuth follows an agent-based approach to these problems, in which the responsibility for carrying out specific tasks is distributed over highly specialized agents. For example, InfoSleuth employs a special agent, called the Ontology Server, which is responsible for the creation, update, and querying of multiple ontologies. The User Agent, on the other hand, utilizes Java applets to match user requests with the appropriate servers. The KQML language in this architecture enables agents to communicate. Information advertisement, information access, and ontology maintenance are all performed by different agents. In this respect, InfoSleuth has a broader spectrum than that of our work.
In summary, the projects that can be applied to the folding problem assume different target environments. For example, SIMS assumes a single application domain, InfoSleuth focuses on information sources in the Internet, and our approach is applicable to databases in a federation. They employ different data models during integration or while maintaining global structures. Unlike our approach, all of the projects assume some kind of a global knowledge base that keeps the interrelations of database elements in different components. They also differ in terms of the issues on which they focus. For example, the InfoSleuth project focuses on finding, searching, and fusing information in a dynamically changing environment such as the Internet. Our research focuses on the acquisition of the knowledge required to interrelate database elements in different components. It aims for a partial database integration scheme.
Chapter 3
A Perspective on the Folding Problem
In this chapter, we will present the assumptions on which our solution is based, and our goals. In other words, this chapter will describe the principles to which we will remain loyal during the process of solving the folding problem in a CFDBS (Cooperative Federated Database System) for the purpose of information sharing and exchange. In such an environment, we will focus on a partial integration scenario in which two databases, one remote and one local, are of interest to us. The ultimate goal of our solution will be to fold remote conceptual schema elements into the local conceptual schema. We will also specify the characteristics of our solution, which will be used later in this dissertation to develop metrics to evaluate our solution, and to illustrate the advantages of our solution over previous approaches to the folding problem.
3.1 Assumptions and Goals
Figure 9 lists the assumptions on which our solution is based. This set of assumptions collectively signifies the desired characteristics of our solution to the folding problem.

First, we assume that the relevant part of the remote conceptual schema has already been identified ("No concern for Information Discovery" in Figure 9), thereby alleviating the need for information discovery techniques to be employed. Information discovery can be performed by using one of the information discovery techniques proposed in the literature [63, 70, 30, 58, 82, 35, 88]. In our approach, it is enough to know that remote conceptual schema elements relate to the local conceptual schema elements in some way; however, we do not require or expect to know the exact relationships between the two.
- No concern for Information Discovery
- No global anything other than the CDM
- CDM is important
- Loyalty to Object-based concepts
- Loyalty to local conceptual schema context
- Relative Semantics rather than Absolute Semantics
- Indeterministic World Assumption
- Eventual Integration rather than Immediate Integration
- SH Transparency is ensured in the long run
- Semantic Evolution rather than Schema Evolution
- Value-based Identity for determining equivalent objects
- Targeted users: Domain Experts

Figure 9: Principles of Our Approach to the Folding Problem.
The second principle in Figure 9 states that nothing centralized or global, other than the CDM, is assumed in the federation. Some of the research discussed in the previous chapter keeps global structures like semantic dictionaries, dependency schemas, or ontologies in order to define the correspondences between conceptual schema elements and their real world counterparts, a set of standardized real world concepts. These structures aim to describe information unit semantics, and are of great help in identifying related information across databases. Such structures are centralized (logically and/or physically); thus, they necessitate the existence of a centralized decision making authority to create and maintain them. Sharing Advisors and human administrators play the role of this intelligent authority in today's database technology. We claim and assume that although local knowledge about the semantic information of component databases is dependable, global knowledge of any kind cannot be depended upon for the purpose of solving the folding problem, because reaching such global knowledge requires negotiation between individual components, which is not guaranteed to lead to a solution. The following considerations justify this assumption: (1) It is very difficult to create and maintain such structures; it takes time and space to come up with such a knowledge base. (2) Reaching an agreement on the real world concepts is not easy. It is subjective, argumentative, and it requires an inter-disciplinary effort to populate a set of standard, agreed-upon concepts whose meanings directly correspond to real world entities. Furthermore, mapping conceptual schema elements into this set of already argumentative concepts introduces additional problems in a federation environment whose components are assumed to be autonomous, heterogeneous, and distributed. Changes in conceptual schemas require propagation of these changes to the mappings, which is not a trivial process in a dynamically changing federation environment. (3) Uncertainty in remote data, as well as in remote meta-data, makes it hard to define these conceptual schema elements with the correct set of concepts. In our framework, the CDM (the common data model in use federation-wide) is the only global entity assumed.
Third, the Common Data Model (CDM) employed in the federation for information exchange purposes is important in that the constructs provided by the CDM should help clarify the semantics of the information units represented in original component conceptual schemas. In other words, the CDM should have the ability to make semantic heterogeneity resolution a relatively easy task in the late phases of the folding process by explicitly defining the meaning of conceptual schema elements. In our methodology, the CDM functions as a tool that can be used to overcome structural and semantic differences between component conceptual schemas.
The fourth principle emphasizes the importance of object-oriented technology1 in today's database research. The object-oriented approach is widely used in today's database integration techniques [69]. It both complicates and eases the database integration problem: on one hand, object-oriented systems offer a rich set of constructs, creating the potential for increased heterogeneity of components; on the other hand, they provide a natural framework for use in integrating component databases. Since object-based (object-oriented and semantic) data models are the only ones which subsume other data models in terms of semantic expressiveness, the CDM should be based on object-based principles. An object-based database model, called HSDM (Heterogeneous Semantic Data Model), is described in the following chapters for this purpose. HSDM is furnished with constructs that make information unit semantics explicit and easy to interpret; in this regard, HSDM constructs ease the semantic heterogeneity resolution problem. It is also extended with constructs which enable testing hypothetical relevances between remote and local conceptual schema elements. HSDM also offers primitives for schema evolution, which are important for the customization of imported schema elements so that their structure and semantics fit naturally into local application environment requirements.
The fifth principle is Loyalty to Local Conceptual Schema Context in Figure 9. Partial integration of local and remote conceptual schema elements takes place within the local component. Consequently, the local component's context should be dominant to the remote component's context. The remote component's conceptual schema can be restructured, and the organization of its information units can be changed in any way, to suit the local conceptual schema's context. This is another principle which permits customization of remote conceptual schema elements. In short, the local component is the only authority to decide how it will use or customize imported information units in its context.
1. For more information on object-oriented (O-O) systems, a set of textbooks may be consulted: Cardenas and McLeod [16] explore the current state of research related to O-O databases; Martin and Odell [56], and Graham [28], focus on O-O analysis and design methodologies.
The Global Schema Approach, which can be observed in both distributed databases [64] and tightly coupled federated databases [73], offers static semantics for information units. Meanings of conceptual schema elements and their interrelationships are fixed once the global schema has been agreed upon, and cannot be changed by any component in an autonomous manner. The global/federated schema dictates the relationships between schema elements, since those relationships are specified once and for all in the global/federated schema. Semantic dictionary or ontology-like structures proceed one step further in this regard, and offer absolute semantics. Although approaches that assume these global structures do not dictate the relationships between schema elements across databases, they suggest them. Meanings and interrelationships of information units in conceptual schemas are mapped into a set of commonly understood concepts and relationships between concepts. Relationships between conceptual schema elements across component databases are symmetrical in the sense that, given a simple relationship between a local conceptual schema element and a remote conceptual schema element, the relationship between the two is the same from the local component's point of view and from the remote component's point of view. This is called absolute semantics. Absolute semantics may be promoted to relative semantics by providing sophisticated schema restructuring operators such as the ones described in [59]. Nevertheless, loosely-coupled federated databases emphasize the importance of relative semantics, and we want to preserve this useful feature in our methodology (the sixth principle in Figure 9). Semantic relationships between conceptual schema elements in different component databases should be determined and shaped by the component in which the information will be utilized. The local component's point of view and the remote component's point of view about a simple relationship between conceptual schema elements may be different. This is called relative semantics. In summary, current approaches do not remain loyal to relative semantics, because they propose global structures like semantic dictionaries, ontologies, and global schemas. Our approach, Schema Implantation and Semantic Evolution, remains loyal to relative semantics by maintaining no global structures that may dictate or imply interrelationships between conceptual schema elements. Relationships between schema elements are not dictated by any means, they are not fixed, and they may not be symmetrical. Interrelationships between component schema elements change from component to component. The component where information units will be implanted determines the final semantics of, and interrelationships between, schema elements. This is achieved by respecting the context of the component where implantation and semantic evolution take place.
The Indeterministic World Assumption is the seventh principle in Figure 9. Keller and Wilkins [38] present a taxonomy of assumptions concerning how complete and precise the mappings are between information units represented in databases and their real world counterparts. Under the Open World Assumption, the information contained in the knowledge base is correct, but not necessarily complete. The Closed World Assumption states that the information contained in the knowledge base is correct and complete. The Modified Closed World Assumption, proposed by the authors, states that the information contained in the knowledge base is correct and not necessarily complete; for the incomplete part, the knowledge base explicitly provides descriptions of where it is incomplete. This involves associating various null values with information units in order to represent the incomplete part. We introduce another assumption to be used for semantic heterogeneity resolution. The Indeterministic World Assumption states that the information contained in a knowledge base for the purpose of semantic heterogeneity resolution may be incomplete, argumentative, or even incorrect. Information necessary to resolve semantic conflicts may be missing at the time of analysis. It may be argumentative, especially if it is related to unscientific domains, or to newly discovered scientific domains in which many new concepts are introduced whose semantics are not yet clear. It may be incorrect in some domains if it favors one of the many possible views about a single concept.
Because current research related to schema integration perceives a deterministic view of the real world, and of the problem space associated with the real world domain being modeled, one-shot integration (immediate integration) is adopted, assuming all of the information required to integrate conceptual schemas is available, precise, complete, and consistent. Such approaches also assume that the semantic heterogeneity resolution process produces deterministic and precise information about conceptual schema elements and their interrelationships, which can be used during schema integration. Consequently, all of the information necessary for schema integration is assumed to be present prior to schema integration; semantic heterogeneity resolution is a pre-condition for schema integration in such research. In contrast, we argue that the semantic heterogeneity resolution process produces sometimes incomplete, and sometimes inconsistent, knowledge, which might not be enough to integrate conceptual schemas in an appropriate and healthy manner. Our perception of the real world is based on the Indeterministic World Assumption (the eighth principle in Figure 9), thus leading to Eventual Integration, in which remote conceptual schema elements are adapted into the local context over time.
The ninth principle is that semantic heterogeneity transparency is ensured in the long run. In the Schema Implantation and Semantic Evolution approach, users/DBAs are aware of the fact that heterogeneities exist within the implanted local conceptual schema because of the implanted portion of the remote conceptual schema. Applications involving only local conceptual schema elements, but not implanted remote conceptual schema elements, continue to operate as they used to. At the same time, users/DBAs browse implanted remote conceptual schema elements in order to gain additional knowledge about their semantics. In the meantime, a set of hypothetical relevances can be specified by users in order to interrelate local and implanted schema elements. Testing these hypotheses may lead to individual schema evolution cases, each of which is subject to user verification. Evolutions on conceptual schemas eliminate semantic conflicts in an incremental manner. Conflicts are not resolved immediately, highlighting the graduality aspect of our methodology.
Current schema evolution methodologies require explicit user involvement for individual evolution cases to be initiated: the need for an evolution case is detected by the user, and evolution primitives are invoked by the user. Our methodology suggests Semantic Evolution (the tenth principle in Figure 9) as an alternative to Schema Evolution. In Semantic Evolution, application dynamics determine whether an evolution is to take place; user/DBA verification of evolution cases is the only user involvement. In the Semantic Evolution approach, semantic hints, i.e., information about the meanings of and interrelationships between schema elements obtained over time, activate evolution primitives, which in turn evolve conceptual schemas into a different form.
We favor Value-Based Identity over Object-Based Identity for determining equivalent objects across databases (the eleventh principle in Figure 9). Object identifiers are beneficial for identifying objects within a single database. Nevertheless, they do not help identify the same object across databases, which is a vital task during the integration of information units in different databases. Although local-remote object identifier mappings can be maintained across component databases, a Value-Based identity paradigm should first be employed to determine object equivalences. Object identity has been a significant issue in both single and multidatabase systems. The third generation database system manifesto [79] provides early insights on object identity in databases: third generation database systems, successors of relational databases, should let the DBMS assign unique identifiers for records only if a user-defined primary key is not available. Wang and Madnick [87] explore object identification in multidatabase systems, while Elliassen and Karlsen [20] study the role of object identifiers during interoperation. The notion of object identity is important in our methodology too. Once two classes, one local and one remote, are found to be equivalent, identifying the same object instances in these classes ensures that an instance will occur in the integrated class only once. HSDM employs object identifiers to distinguish objects within a single database; however, we use a Value-Based paradigm to identify equivalent instances across databases.
Finally, the twelfth principle in Figure 9 describes the targeted users of our methodology. Our approach to the folding problem is intended for domain experts who have broad and detailed knowledge about the information content and information organization of their component databases. In addition, domain experts are assumed to have moderate knowledge about the remote databases that are of interest to their components. They gain more detailed knowledge about remote databases over time by using the methodology we propose.
3.2 Characteristics Of Our Approach
We can specify the characteristics of our approach to the folding problem using six criteria, namely Sharing Granularity, User's Role, Determinism, Graduality, Semantic Relativism, and Completeness.
(a) Sharing Granularity: Most schema integration methodologies propose sharing of conceptual schemas by means of an integrated (global) schema. According to this global schema approach, component schemas are mapped into the global schema either physically (creating a global schema and generating the mappings between the global schema and the individual component schemas) or virtually (using a view mechanism to define the global schema). In this respect, the global schema approach suggests total integration of component schemas. Total integration is impractical in terms of the space and time required to build such a big global schema. Moreover, building a global schema is an inter-disciplinary effort that requires all components to agree on the content, representation, and purpose of the information units to be contained in the global schema. The autonomy requirements of individual components, and the difficulty of reaching an agreement on the meanings of information units, especially when the scale of a federation is considered, prohibit taking such an approach to the folding problem in federated database systems. Instead, we desire partial integration without any global schema. Partial integration allows the system to scale up and components to maintain their autonomy by alleviating the need to reach an agreement on the semantics of all the information units in the federation, which is a mundane task even for database administrators.
(b) User's Role: Some approaches to the schema integration and semantic heterogeneity problems depend on an intelligent entity to specify correspondences between schema elements (classes, attributes, and object instances) in component schemas. Database administrators (DBAs) play the role of this intelligent entity due to the absence of a corresponding entity in database technology today. We cannot expect a human administrator to have such broad knowledge in a CFDBS environment, where both the contents of conceptual schemas and the meanings of their elements change constantly. Rather, we want DBAs to play a role where they confirm the analysis and results provided by the proposed mechanism with regard to the folding problem. Instead of forcing DBAs to gain and maintain a universal and dynamic form of knowledge federation-wide, our methodology enables them to focus on the components related to their own interests.
(c) Determinism: Schema integration methodologies perceive a deterministic view of the real world, and of the problem space associated with the real world. All of the information necessary for schema integration is assumed to be known, obtained, or available prior to integration time. The existence of partial, incomplete knowledge, of assumed but not modeled knowledge, and of conflicts that cannot be resolved prior to integration time due to insufficient information in local database environments contradicts this deterministic view of the problem space as well as of the real world. As an alternative, we assume an environment in which some of the information needed for schema integration may not be available prior to integration time, and the available information may be conflicting and argumentative.
(d) Graduality: As a consequence of the deterministic view of previous research on schema integration and semantic heterogeneity, related techniques propose one-shot integration or one-shot semantic heterogeneity resolution. Accordingly, the integration or heterogeneity problem is resolved by considering all the necessary inputs, processing them, and producing the desired output (an integrated schema or resolved conflicts) in one step. Due to the indeterministic view of the problem space adopted in this thesis, one-shot integration (resolution) is unrealistic. A mechanism which enables gradual integration (resolution), as explained in this dissertation, is needed.
(e) Semantic Relativism: Semantic relativism is one of the properties of loosely-coupled federated database systems. It allows the co-existence of multiple viewpoints in a federation. The global schema approach to schema integration, where relationships (correspondences) between conceptual schema elements are fixed, does not allow semantic relativism. Loosely-coupled federated database systems, on the other hand, keep global structures such as semantic dictionaries or ontologies. These structures suggest the relationships (correspondences) between the real world concepts which are modeled by conceptual schema elements. In this respect, these structures can be used for semantic heterogeneity identification but not for semantic heterogeneity resolution, since they do not give any clue about the final organization and semantics of information units in the target components where they will be customized. In our framework, in order to enable semantic relativism, we assume no global anything (integrated schema or any other structure) which may dictate (or even suggest) relationships between component conceptual schema elements.
(f) Completeness: Research on schema integration and semantic heterogeneity concentrates on individual parts of the problem, the folding problem. Some focus on determining correspondences between schema elements by analyzing conceptual schemas or the databases themselves, while others study different ways of integration, supposing all the semantic conflicts have already been resolved. These studies make assumptions about the other parts of the folding problem. None of the research reviewed in the previous chapter presents a unified mechanism for the folding problem. Our approach aims to provide an integrated solution to the folding problem.
Chapter 4
Schema Implantation and Semantic Evolution
Schema Implantation and Semantic Evolution, our approach to the folding problem, employs meta-data implantation and stepwise evolution techniques to interrelate database elements in different component databases, and to resolve conflicts on the semantics of database elements. It is a uniform approach to the problems of (partial) schema integration, semantic heterogeneity resolution, and schema customization for the purpose of ensuring database interoperability in CFDBSs, and it is based on the following key ideas: (a) the CDM employed in the federation, and its constructs, are important in making information unit semantics explicit, comparable, and interpretable, thereby creating a distributed ontology effect and consequently alleviating the need to maintain global structures; (b) an incremental integration and semantic heterogeneity resolution process in which relationships between local and imported remote information units are determined whenever enough knowledge about their semantics is acquired, thereby recognizing the possibility of incomplete, missing, and insufficient knowledge about information unit semantics in a federation; (c) a partial integration scheme in which the semantics and context of local information units are dominant to those of remote information units, thereby enabling multiple semantics to co-exist in a federation.
This chapter starts with a specification of the CDM used in our approach; later, we elaborate on the individual phases of our approach.
4.1 HSDM (Heterogeneous Semantic Data Model) as the CDM
We employ a semantically rich and expressive object-based data model, called HSDM (Heterogeneous Semantic Data Model), as the Canonical Data Model (CDM) in use federation-wide. HSDM is the successor of PDM (Personal Data Manager) [53]. As an object-oriented database model, PDM supports notions such as object classes1, attributes, and a set of semantic primitives (classification, aggregation, and generalization). A PDM conceptual schema forms a generalization hierarchy in which the user kind2 "Object" is the root. Inheritance is directly supported in PDM. PDM employs a basic naming paradigm that can be seen as a primitive form of the object identifiers observed in object-based databases.
Being the successor of PDM, HSDM has adopted many of its constructs from PDM. Additionally, HSDM is furnished with constructs which enable clarification, representation, and evolution of information units for interoperability purposes. There are three extensions in HSDM to the object-based philosophy: constructs for clear data and meta-data semantics, a mechanism that enables incomplete information to be recorded in an HSDM database, and schema evolution primitives. Constructs in the first category help represent information unit semantics in an easy to understand, easy to interpret manner (e.g., attributes describing the meanings of attributes and classes, and class identity and instance identity assertions). These constructs are introduced during the early phases and used throughout the interoperation process. Constructs for incomplete information are beneficial for representing partial database values; two null values are introduced in HSDM for this purpose. Schema evolution primitives provide capabilities to restructure conceptual schemas. Combining two object classes, making a class a superclass (subclass) of another, and filtering some values of the remote class instances and combining this portion with the local class can all be achieved by schema evolution primitives, which help place remote object classes (instances) into their designated places in the class hierarchy.
1. Object class and class are used as synonyms.
2. Although PDM uses the term user kind, we prefer the more standard term class (or object class).
An HSDM database is a collection of objects and relationships between objects. Objects correspond to the real world entities to be represented in the database. Each object has a unique, system-assigned object identifier which distinguishes it from the others in the database; additional identifiers for objects can be defined as instance keys. An HSDM class is a collection of similar objects which have common properties and behaviors. HSDM classes are organized in the form of a class hierarchy in which ISA (generalization/specialization) relationships between object classes are explicit. An HSDM conceptual schema consists of a set of object classes and the semantic relationships between them.
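To make these notions concrete, the following is a minimal sketch in Python (illustrative only, not the dissertation's implementation; all names are ours) of system-assigned object identifiers, classes as collections of objects, and an explicit ISA link:

    import itertools

    _oid_counter = itertools.count(1)  # source of unique, system-assigned identifiers

    class HsdmObject:
        def __init__(self, **attribute_values):
            self.oid = next(_oid_counter)           # unique object identifier
            self.attributes = dict(attribute_values)

    class HsdmClass:
        def __init__(self, name, superclass=None):
            self.name = name
            self.superclass = superclass            # explicit ISA link
            self.instances = []                     # a class is a collection of similar objects

        def add_instance(self, **attribute_values):
            obj = HsdmObject(**attribute_values)
            self.instances.append(obj)
            return obj

    # A two-level hierarchy rooted at "Object":
    OBJECT = HsdmClass("Object")
    person = HsdmClass("person", superclass=OBJECT)
    p = person.add_instance(name="Smith", title="Professor")
    print(p.oid, p.attributes)   # e.g., 1 {'name': 'Smith', 'title': 'Professor'}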
4.1.1 HSDM Classes and Class Hierarchy
An HSDM conceptual schema models real world concepts and their interrelationships. Each conceptual schema concentrates on a particular real world domain. The real world domains modeled by HSDM conceptual schemas are called conceptual territories. Conceptual territories consist of a collection of conceptual sub-territories in which closely related concepts are grouped together. Individual concepts within a conceptual sub-territory establish sub-concept/super-concept relationships with each other. Besides the sub-concept/super-concept relationship, concepts may be related to each other through various other relationships. Such a conceptual territory is symbolized in Figure 10.
[Figure omitted: a real world conceptual space divided into conceptual sub-territories, each containing individual concepts.]
Figure 10: A conceptual territory and its sub-territories.
[Figure omitted: an HSDM class hierarchy rooted at the "Object" class; its direct subclasses are immediate sub-roots heading immediate sub-trees that model the conceptual sub-territories, and attributes relate classes within and across sub-trees.]
Figure 11: Corresponding HSDM Conceptual Schema.
An HSDM conceptual schema aims to establish a one-to-one mapping between real world concepts and their counterparts in the conceptual schema. The HSDM conceptual schema corresponding to the conceptual territory shown in Figure 10 is given in Figure 11. As shown in Figure 11, an HSDM class hierarchy is a directed tree. Nodes represent object classes, and edges represent ISA (generalization/specialization) relationships between object classes. Inheritance of attributes is supported. The root of the class hierarchy is the "Object" class. Each direct subclass of the "Object" class is called an immediate sub-root in the class hierarchy, and the trees located at immediate sub-roots are named immediate sub-trees. Immediate sub-trees correspond to conceptual sub-territories in the real world. Object classes in different sub-trees, and within a sub-tree, can be related to each other through attributes defined between them. Object classes residing within the same immediate sub-tree signify conceptual similarity between the real world concepts they represent, while conceptual dissimilarity may be observed across object classes in different immediate sub-trees. Multiple inheritance is currently not supported, for simplicity.
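As an illustration of this structure, the sketch below (again illustrative Python with hypothetical names) computes the immediate sub-root of a class, i.e., the head of the immediate sub-tree to which the class belongs, by walking the ISA edges toward the root:

    class Node:
        """A class in the hierarchy; 'parent' is the ISA edge to the superclass."""
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent

    def immediate_subroot(cls):
        # Walk up until the parent is the root "Object"; that ancestor heads
        # the immediate sub-tree (conceptual sub-territory) of cls.
        if cls.parent is None:
            return None  # the root itself belongs to no sub-tree
        while cls.parent.parent is not None:
            cls = cls.parent
        return cls

    root = Node("Object")
    physiology = Node("Physiology", root)
    time_series = Node("Time_Series_Data", physiology)
    print(immediate_subroot(time_series).name)  # -> Physiology

Two classes sharing the same immediate sub-root would thus signal conceptual similarity; different sub-roots suggest conceptual dissimilarity.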
An object class is a set of similar objects with common properties and behaviors. Object classes in an HSDM conceptual schema may be of two kinds: local classes and remote (foreign) classes. Local classes are those that are local to the component; remote classes are imported from other components. Local and remote classes co-exist in an HSDM conceptual schema. Remote classes are loosely coupled with local classes at the beginning. As time passes, local application semantics and complementary information about local and remote classes help establish relationships between them, leading to a tighter coupling between the remote and local conceptual schemas.
Subclasses in HSDM represent the ISA relationship between a superclass and a subclass, and can be defined in three different ways: (1) Attribute defined subclasses: The subclass has additional attributes which conceptually distinguish it from its superclass and from its sibling classes in the class hierarchy. (2) Predicate defined subclasses: The subclass has the same attributes as its superclass (namely, the ones inherited from the superclass). The subclass has a predicate that differentiates its instances from the instances of the superclass and from the instances of the subclass's siblings. Predicate defined subclasses have a complementary1 class attribute which contains the predicate used to populate their instances. The predicates of sibling predicate defined subclasses cannot overlap. (3) User defined subclasses: Subclass instances are explicitly specified by the user. Attribute defined and predicate defined subclasses are beneficial for semantic clarification purposes, since they offer a way to conceptually distinguish one class from another based on its properties. As user defined abstract2 subclasses do not have any distinguishing property, they are converted to attribute defined abstract subclasses during the early phases of our approach in order to eliminate this problem. Attribute defined and predicate defined subclasses do not cause any problems in this regard, because they can be distinguished from their superclasses and siblings by means of an attribute or a predicate.
1. Complementary attributes will be defined in the next section.
2. Abstract classes will be discussed later in this section.
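As a small illustration of the second kind, the sketch below (illustrative Python with hypothetical names, not part of HSDM itself) populates a predicate defined subclass from its superclass extent and checks that two sibling predicates do not overlap:

    # A predicate defined subclass keeps its predicate as a class attribute
    # and populates its instances by filtering the superclass extent.
    def populate_predicate_subclass(superclass_extent, predicate):
        return [inst for inst in superclass_extent if predicate(inst)]

    def siblings_overlap(extent, pred_a, pred_b):
        # Sibling predicate defined subclasses must not overlap:
        # no instance may satisfy both predicates.
        return any(pred_a(i) and pred_b(i) for i in extent)

    people = [{"name": "A", "age": 15}, {"name": "B", "age": 40}]
    minors = populate_predicate_subclass(people, lambda p: p["age"] < 18)
    adults = populate_predicate_subclass(people, lambda p: p["age"] >= 18)
    assert not siblings_overlap(people, lambda p: p["age"] < 18,
                                lambda p: p["age"] >= 18)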
Abstract classes and primitive classes are the two kinds of classes which can be observed in HSDM schemas. Abstract classes represent abstract concepts in the real world which carry special importance in an application. They are defined by their relationships to other object classes, and they cannot be displayed directly. Primitive classes model displayable objects (data values). Primitive classes do not have application dependent attributes, but rather complementary attributes such as Content, Length, and Format. The attribute semantics of the HSDM data model will be given in the next section. There are a number of predefined primitive classes in HSDM (Figure 12). They are subclasses of the STRING and NUMBER primitive classes; STRING and NUMBER are themselves predefined as immediate sub-roots of the object class "Object".
[Figure omitted: the predefined primitive class hierarchy; NUMBER has subclasses INTEGER and REAL, and STRING has subclasses P_STRING, N_STRING, M_STRING, and CHARACTER.]
Figure 12: Predefined primitive classes in HSDM.
The subclasses of STRING shown in Figure 12 are P_STRING (pure string), N_STRING (numerable string), M_STRING (mixed string), and CHARACTER. Distinguishing these four subclasses helps resolve certain semantic confusions about the attributes defined on them; it also eases the attribute equivalence determination process. P_STRING (pure string) represents values which consist of a sequence of characters, none of which may be numeric. N_STRING (numerable string) represents values which consist of a sequence of numeric characters; in addition, values can be preceded by a sign ('+' or '-'), and they can contain a '.' character after the sign. M_STRING (mixed string) values also represent a sequence of characters, provided that there is at least one numeric and at least one non-numeric character in the sequence. CHARACTER values are defined as STRING values whose length is one. INTEGER and REAL are subclasses of the primitive object class NUMBER. An INTEGER value represents a sequence of numeric characters, optionally preceded by a sign character ('+' or '-'). The REAL primitive object class also represents a sequence of numeric characters; in addition to '+' and '-', a REAL value can contain a single '.' character in any position after the sign character, if one exists. The STRING and NUMBER primitive classes have a Content attribute, which is inherited by their subclasses and is used to store the primitive instances (data values) of these primitive classes. In addition to the Content attribute, the STRING primitive class has a Length attribute which represents the number of characters Content contains.
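The definitions above can be read as a decision procedure. The following sketch (a plausible rendering in Python; the dissertation defines these classes declaratively, not procedurally) classifies a data value into one of the predefined primitive classes:

    def classify_number(text):
        """Return 'INTEGER' or 'REAL' if text fits those definitions, else None."""
        body = text[1:] if text[:1] in "+-" else text
        if body.isdigit():
            return "INTEGER"                      # optional sign + numeric characters
        if body.count(".") == 1 and body.replace(".", "").isdigit():
            return "REAL"                         # a single '.' after the sign
        return None

    def classify_string(text):
        """Return the STRING subclass of text: CHARACTER, P_STRING, N_STRING, or M_STRING."""
        if len(text) == 1:
            return "CHARACTER"                    # a STRING value whose length is one
        if classify_number(text) is not None:
            return "N_STRING"                     # numerable: digits with optional sign, '.'
        if not any(c.isdigit() for c in text):
            return "P_STRING"                     # pure string: no numeric characters
        return "M_STRING"                         # mixed: at least one numeric, one not

    print(classify_string("-3.14"), classify_string("USC"), classify_string("Apt 5"))
    # -> N_STRING P_STRING M_STRING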
Primitive subclasses, i.e., subclasses of the predefined classes that may be defined by users, can only be predicate defined. They cannot be attribute defined, since the attributes of primitive classes are standardized and attributes themselves cannot be used to distinguish a primitive subclass from its superclass. They cannot be user defined either, because user defined primitive subclasses would not provide any means to distinguish a subclass from its superclass.
4.1.2 HSDM Attributes
Attributes are mappings from object classes to domain classes. They are responsible for three important tasks in HSDM: (1) They represent application semantics: they represent properties of, and interrelationships between, object classes. (2) They clarify application semantics: we associate special attributes with attributes and classes in order to clarify their meanings. (3) They are used for deciding whether two object classes, one local and one remote, are related, by accumulating complementary information about these object classes. Like classes, attributes may be a property of a local application, or they may refer to a remote conceptual schema.
We can classify HSDM attributes based on the quantity of instances with which their values are associated: instance attributes apply to individual instances of a class, while class attributes apply to a class as a whole. Another classification is based on the kind of attribute domain classes. Interclass attributes map abstract classes to abstract classes. In contrast, primitive attributes map abstract classes to primitive classes; in other words, their domain classes are primitive. The value of an interclass attribute refers to an object which cannot be displayed directly, while the value of a primitive attribute is a simple data value. Primitive attributes and their values constitute a base in our methodology for the determination of class and instance equivalences, since they are the only attributes whose values are visible to users. Clarifying their meanings is very important in this regard.
Each attribute and object class in HSDM has a descriptor attribute which describes the meaning of the attribute (class) in natural language. Descriptor attribute values are provided during Semantic Clarification1 with the help of a domain expert. Descriptor attributes help users (DBAs) associate a local class with a remote class during the Schema Implantation and Semantic Evolution phases; browsing their values gives a better understanding of remote attribute (class) meanings.
1. Semantic Clarification, Schema Implantation, and Semantic Evolution are the phases of our approach; they will be addressed in Section 4.3.
4.1.2.1 Complementary Attributes
Still another classification, based on the origins of attributes, is vital for interoperation purposes. Application attributes are those needed to satisfy local application requirements for a local database. They exist before and after the Semantic Clarification, Schema Implantation, and Semantic Evolution phases, and they are necessary for database applications to continue their local processing as they did before our mechanism was applied. Therefore, each application attribute corresponds to an attribute in a local conceptual schema. Complementary attributes are those which are introduced as a result of the Semantic Clarification, Schema Implantation, and Semantic Evolution phases. For example, the descriptor attributes defined previously are complementary in the sense that they provide additional knowledge about the meanings of attributes and object classes. Another distinction between complementary attributes and application attributes is that while application attributes are associated with classes only, complementary attributes (e.g., descriptor attributes) can be associated with both application attributes and object classes. Complementary attributes make sure that the semantics of conceptual schema elements (classes and attributes) are clear, explicit, and easy to interpret. Moreover, they provide additional information about classes (attributes), which may be missing, assumed, or inconsistent in local conceptual schemas. In general, they can be viewed as properties of attributes (classes) which describe those attributes (classes). Complementary attributes may not have any corresponding attributes in any local conceptual schema in the federation; they are introduced as a result of applying our methodology.
Complementary attributes may be used for class identity and instance identity semantics in a database. In some cases, it is not possible to distinguish one object class (instance) from another with the existing attributes in conceptual schemas. In such cases, complementary attributes are introduced so that every object class and every object instance can be identified uniquely based on its attributes and attribute values. Complementary attributes are also widely used for hypothetical equivalence testing: remote and local classes which are hypothesized to be related generate attributes that are complementary to each other, and then try to find meaningful values for these newly introduced complementary attributes.
In summary, complementary attributes have four specific purposes in our methodology: (1) they clarify the meanings of original conceptual schema elements, (2) they complement original conceptual schema semantics, (3) they are used for class identity and instance identity, and (4) they are used during schema implantation and semantic evolution in order to complement semantic and structural aspects of classes.
HSDM provides a number of complementary attributes, which are generally class attributes, for primitive attributes. Primitive attributes whose domain classes are NUMBER or N_STRING have Kind and Unitofmeasure complementary attributes, which describe the attribute's kind and unit of measure, respectively.
Predefined primitive classes constitute the standard part of the conceptual schemas in a federation. In other words, they are application independent, and common to each and every conceptual schema in the federation. Therefore, standardization of these classes and of the attributes defined on them helps clarify application semantics. Such a standardization effort should have strong roots in science; standardized concepts should be accepted in every domain, and should not be argumentative at all. For example, the Kind complementary attribute associated with primitive attributes describes the real world concept referred to by the primitive attribute. The Kind attribute, which can contain a value from a small but extensible set of agreed-upon concepts, ensures a distributed ontology effect on database schemas.
The permissible values of Kind for numeric or numerable primitive attributes are weight, height, length, age, time, order, and coding (Figure 13). In order to allow extensibility, another value, 'others', is also permitted. The value of the Unitofmeasure attribute and the value of the Kind attribute for a primitive attribute should be compatible: for example, we measure weight in kilograms, pounds, or tons, and we measure age in years, months, and days.
Kind                    Unitofmeasure          Domain Class
weight                  kg, lb, ton, ...       REAL, INTEGER, N_STRING
height                  cm, m, feet, ...       REAL, INTEGER, N_STRING
length                  cm, m, feet, ...       REAL, INTEGER, N_STRING
age                     year, month, day, ...  REAL, INTEGER, N_STRING
time                    sec, min, hour, ...    REAL, INTEGER, N_STRING
order                   number                 REAL, INTEGER, N_STRING
coding                  number                 REAL, INTEGER, N_STRING
others or user defined  user defined           REAL, INTEGER, N_STRING

Figure 13: Complementary attributes and possible values for numeric attributes.
Non-numerable or non-numeric primitive attributes have Kind, Format, and Length complementary attributes, which are generally class attributes. The Kind attribute has the same semantics as in the numeric case. The Format and Length attributes refer to the format and length of the character string represented by the attribute value. Figure 14 illustrates the permissible values of these attributes.
Kind                    Format           Length                       Domain Class
name                    free, formatted  variable, max 20, fixed ...  P_STRING
address                 free             variable, max 20, fixed ...  P_STRING, M_STRING
date                    nn/nn/nn, ...    8 ...                        M_STRING
language                free             max 15, ...                  P_STRING
coding                  ccnnn, ...       5, ...                       P_STRING, M_STRING
others or user defined  user defined     user specified               P_STRING, M_STRING

Figure 14: Complementary attributes and possible values for non-numeric attributes.
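The compatibility requirement between Kind and Unitofmeasure can be pictured as a simple table lookup. The sketch below (illustrative Python; the table abbreviates Figure 13) validates a Kind/Unitofmeasure pair:

    # Allowed units per Kind value, following Figure 13 (abbreviated here).
    UNITS_FOR_KIND = {
        "weight": {"kg", "lb", "ton"},
        "height": {"cm", "m", "feet"},
        "length": {"cm", "m", "feet"},
        "age":    {"year", "month", "day"},
        "time":   {"sec", "min", "hour"},
        "order":  {"number"},
        "coding": {"number"},
    }

    def compatible(kind, unit):
        # 'others' / user defined kinds carry user defined units, so any unit passes.
        allowed = UNITS_FOR_KIND.get(kind)
        return True if allowed is None else unit in allowed

    assert compatible("weight", "kg")
    assert not compatible("age", "kg")   # age is measured in years, months, or days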
4.1.2.2 Equivalence of Primitive Attributes
Some binary combinations of primitive attributes are compatible, since their domain classes are compatible. Hence, primitive attribute values whose domain classes are compatible can be converted to each other, provided that conversion functions are specified for them. This provides a base for attribute equivalence among primitive attributes in our methodology. Figure 15 shows the compatibilities between primitive classes. Some conversions may be information losing, in the sense that converted values lose some of the precision representable by the original value. For precision losing conversions, DBAs are consulted to confirm whether the local application is flexible enough to tolerate such a loss.
[Figure omitted: a compatibility matrix over the primitive classes INTEGER, REAL, CHARACTER (CHAR), P_STRING (P_STR), M_STRING (M_STR), and N_STRING (N_STR). Each entry marks the row class as compatible with the column class without conversion, compatible with an appropriate conversion function with no information (precision) lost, compatible with an appropriate conversion function where information (precision) may be lost, or not compatible at all.]
Figure 15: Compatibility matrix on primitive classes.
Primitive attributes whose domain classes are compatible are not necessarily equivalent; conversely, primitive attributes whose domain classes are not compatible may still be equivalent. Nevertheless, compatibility provides hints about attribute equivalence in most cases. Incompatible primitive attributes can be converted to each other by means of mapping tables (lookup tables), while conversion functions are sufficient for compatible ones.
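A conversion step consistent with this scheme might look like the following sketch (illustrative Python; the table entries are examples in the spirit of Figure 15, not a transcription of it), where possibly precision-losing conversions require DBA confirmation:

    # Conversion table keyed by (from_class, to_class); 'lossy' entries mirror
    # the possibly precision-losing cells of the compatibility matrix.
    CONVERSIONS = {
        ("INTEGER", "REAL"):     ("lossless", float),
        ("REAL", "INTEGER"):     ("lossy", lambda v: int(round(v))),
        ("INTEGER", "N_STRING"): ("lossless", str),
        ("N_STRING", "REAL"):    ("lossless", float),
    }

    def convert(value, from_class, to_class, dba_confirms_loss=False):
        if from_class == to_class:
            return value                          # compatible without conversion
        entry = CONVERSIONS.get((from_class, to_class))
        if entry is None:
            raise ValueError("not compatible; a mapping table would be needed")
        quality, fn = entry
        if quality == "lossy" and not dba_confirms_loss:
            raise ValueError("precision may be lost; DBA confirmation required")
        return fn(value)

    print(convert(42, "INTEGER", "REAL"))                            # -> 42.0
    print(convert(3.7, "REAL", "INTEGER", dba_confirms_loss=True))   # -> 4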
4.1.2.3 Attribute Constraints
Constraints are needed in a database in order to ensure that the database is always in a consistent state; in other words, they ensure application semantics. The HSDM counterpart of constraints is defined on attributes. Constraints in HSDM give information about the permissible values of attributes as well as about their cardinalities. The following attribute constraints are provided: (1) Uniqueness constraints: an attribute value, or a combination of attribute values, can be specified to be unique across all instances of an object class. (2) Cardinality constraints: an attribute can be single-valued or multi-valued; single-valued attributes map an object instance to zero or one object instance, while multi-valued attributes map an object instance to zero or more object instances. (3) Value constraints: attribute values can be constrained to a specific range in the domain class. (4) Null constraints: an attribute value may be optional or required for an object class, meaning it may or may not take null values.
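The four constraint kinds can be checked mechanically over a class extent. The following sketch (illustrative Python with hypothetical names) shows one way to do so:

    # Check the four HSDM attribute constraints over a class extent, modeled
    # here as a list of attribute-value dictionaries.
    def check_constraints(extent, attr, unique=False, multi_valued=False,
                          value_range=None, required=False):
        seen = set()
        for inst in extent:
            value = inst.get(attr)
            if value is None:
                if required:                      # null constraint
                    return False
                continue
            values = value if multi_valued else [value]   # cardinality constraint
            for v in values:
                if value_range and not (value_range[0] <= v <= value_range[1]):
                    return False                  # value constraint
            if unique:                            # uniqueness constraint
                key = tuple(values)
                if key in seen:
                    return False
                seen.add(key)
        return True

    people = [{"id": 1, "age": 30}, {"id": 2, "age": 45}]
    print(check_constraints(people, "id", unique=True, required=True))   # True
    print(check_constraints(people, "age", value_range=(0, 120)))        # True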
4.1.3 Class Keys and Instance Keys
A class key for a class is defined as a collection of attributes of that class whose existence conceptually distinguishes that class from the others within the database. There may be more than one class key for an object class. The individual attributes appearing in a class key are called classifying attributes, since they help find an appropriate place for the class within the class hierarchy. Class keys are like object identifiers in object-based data models and relation keys in relational data models: an object identifier uniquely distinguishes one object from another in a single database, while a relation key distinguishes one tuple from another in a relation. Unlike object identifiers and relation keys, however, class key values are not used to distinguish classes in a database; rather, it is the existence of the attributes in class keys which distinguishes one class from another within the conceptual schema. In this respect, class keys are conceptual identifiers which conceptually distinguish two object classes.
The class key of a class allows determination of the appropriate immediate sub-tree to which the class belongs within the generalization hierarchy. Moreover, it allows determination of a correct place (e.g., depth) within the immediate sub-tree. Whether an attribute is classifying is application dependent, and choosing the classifying attributes of each class constitutes the application's point of view about the real world. Classifying attributes can be complementary attributes or application attributes (they may pre-exist in original conceptual schemas or they may be introduced later). Non-classifying attributes are needed for application requirements, and they do not help distinguish one class from another. In summary, the notion of class keys signifies: (1) the class identification problem in a conceptual schema, (2) the process of finding the object class to which an instance belongs (given a particular instance and its attribute names), and (3) the distinguishing property of a concept (class).
Since primitive classes do not have application attributes, and since we cannot introduce any more complementary attributes for them, we cannot depend on the existence of attributes to distinguish one primitive class from another. However, distinguishing the predefined primitive classes is easy, as they constitute the standard part of the conceptual schemas in a federation; considering their names (INTEGER, REAL, ...) is sufficient in this regard, and further analysis is not required. User defined primitive classes, on the other hand, introduce some problems. Since we confined user defined primitive subclasses to be predicate defined only, they can be handled by considering their predicate values.
A class key for an abstract subclass should distinguish the subclass from the other classes in the class hierarchy. As user defined subclasses are transformed into attribute defined subclasses during the Semantic Clarification phase, we have only two possibilities for subclasses: (1) Class keys for attribute defined subclasses: the attribute used to define the subclass is used as the class key if it distinguishes this subclass from all other classes; if it is not unique, a complementary attribute is introduced for this purpose. (2) Class keys for predicate defined subclasses: the subclass has the same class key as its superclass; the class key and the value of the predicate together distinguish it from the others.
The use of classifying attributes, class keys, and complementary attributes enables conceptual clarification based on class properties. It is a simple solution to the class identity problem in object-oriented databases. Class keys make a unique contribution to our mechanism: it is the existence of class keys and complementary attributes which enables easy placement of remote classes into the local class hierarchy.
An instance key is defined as a combination of attributes whose values distinguish one instance from another within a single object class. Instance keys are like relational keys in relational databases, with one major difference: instance keys may contain complementary attributes which do not exist in original conceptual schemas. Introducing complementary attributes as part of instance keys can be observed in two cases: when there does not exist a set of attributes whose values are unique across the instances of a class, and when some attributes of an imported remote object class are filtered, resulting in an object class without any unique combination of attribute values.
Instance keys are used to identify equivalent instances of two related classes, one local and one remote. The value of any attribute in an instance key should not be null, since it is used for determining identical instances in our methodology; if such values are null nonetheless, they should be supplied prior to combining remote and local instances. Instance keys are defined early, during Semantic Clarification, and utilized to determine identical/related object instances throughout Semantic Evolution in our methodology. Once two object classes, one local and one remote, are proved to be related, migrating remote class instances into the local class necessitates determination of the equivalent instances of the local and remote classes by means of the instance key of the local class.
The purpose of instance keys in HSDM is not object identification; unique, system-generated object identifiers are used for that purpose. It is rather to determine identical/related instances in local and remote classes which are hypothesized and proven to be semantically related.
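For illustration, the following sketch (illustrative Python with hypothetical names; instance extents are modeled as lists of attribute-value dictionaries) pairs up equivalent instances of a local and a remote class via the local instance key, rejecting null key values as required above:

    # Determine equivalent instances of two related classes by comparing
    # the values of the local class's instance key.
    def equivalent_pairs(local_extent, remote_extent, instance_key):
        def key_of(inst):
            values = tuple(inst.get(a) for a in instance_key)
            if any(v is None for v in values):
                raise ValueError("instance key values must be supplied (non-null)")
            return values
        remote_by_key = {key_of(r): r for r in remote_extent}
        return [(l, remote_by_key[key_of(l)])
                for l in local_extent if key_of(l) in remote_by_key]

    local = [{"name": "Arbib", "phone": "x111"}]
    remote = [{"name": "Arbib", "lab": "Cornell"}, {"name": "Bower", "lab": "Caltech"}]
    print(equivalent_pairs(local, remote, instance_key=("name",)))
    # -> the two Arbib instances are identified as the same real world object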
4.1.4 Null Values in HSDM
Null values have a special importance in HSDM, since we use them in order to represent different kinds of uncertainties observed in the real world.
HSDM provides three kinds of null values corresponding to three kinds of uncertainties: initial null (null_i), don't know null (null_?), and inapplicable null (null_x). Initial null (null_i) corresponds to the initial unknown state of a complementary attribute; it is introduced during the Schema Implantation and Semantic Evolution phases. The value of the attribute, which is initialized to null_i, will be determined later, when we have enough information. Don't know null (null_?) corresponds to a missing attribute value in a database: the value of the attribute is not known at the time, but the attribute is a property of the class to which it belongs. Inapplicable null (null_x) signifies improper attachment of an attribute to a class: the value of the attribute is not known at the time and will not be known in the future, because the attribute is not a property of the class it is attached to; in other words, it should not have been associated with this class in the first place. Inapplicable nulls are introduced during the Semantic Evolution phase when a remote (local) object class tries to imitate a local (remote) class's structural properties. The inability of a local (remote) class to supply meaningful values for remote (local) class attributes results in inapplicable nulls, and is a direct indication of semantic non-relevance between the two object classes.
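One plausible way to realize the three null kinds is as distinct sentinel objects, as in the sketch below (illustrative Python only); the final function hints at how a run of inapplicable nulls could signal semantic non-relevance:

    # Sentinels for the three HSDM null values. Distinct objects (rather than
    # Python's None) keep the three uncertainties distinguishable.
    class Null:
        def __init__(self, label):
            self.label = label
        def __repr__(self):
            return f"<null:{self.label}>"

    NULL_I = Null("initial")       # complementary attribute not yet determined
    NULL_Q = Null("dont_know")     # value missing, but the attribute applies
    NULL_X = Null("inapplicable")  # attribute wrongly attached to the class

    # During semantic evolution, a high share of inapplicable nulls among the
    # imitated attribute values indicates semantic non-relevance.
    def non_relevance_ratio(values):
        return sum(1 for v in values if v is NULL_X) / len(values)

    print(non_relevance_ratio([NULL_X, NULL_X, "ok", NULL_Q]))  # -> 0.5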
4.1.5 Schema Evolution Primitives in HSDM
HSDM contains a number of schema evolution primitives that provide capabilities such as renaming an attribute (class), adding an attribute to a class, combining two classes, and making a class a subclass of another. Schema evolution primitives are used extensively to restructure the implanted remote schema during integration; after a relationship is detected between a local class and a remote class, schema evolution primitives are used to propagate this relationship into the local conceptual schema.
[Figure omitted: schematic depictions of the HSDM schema evolution primitives rename attribute/class, add/drop attribute, combine, make_sub/make_sup, and overlap, shown as before/after class diagrams.]
Figure 16: Schema Evolution Primitives in HSDM.
The HSDM schema evolution primitives are given in Figure 16. 'rename attribute/class' renames an object class or an attribute; neither the instances nor the semantics of the attribute/class are affected by this evolution case. One can drop the definition of an object class or an attribute from the database using the 'drop attribute/class' primitive, although it is not shown in Figure 16. In the case of drop attribute, attribute values become inaccessible for instances; similarly, drop class results in the deletion of instances from the database.
The more vital primitives with regard to schema customization are combine, make_sub, make_sup, and overlap. Given a local class and a remote class, 'combine' merges the local and remote class definitions in the form of the local class. The attributes of the local class become the union of the local and remote class attributes. Moreover, remote class instances are migrated into the local class; during the migration, the instance key of the local class is consulted. This primitive is used when the local class and the remote class refer to the same real world concept according to the local application's point of view, in other words, when the local and remote object classes are proven to be equivalent. (Direct) subclasses of the remote class are attached as new subclasses of the local class, while sub-instances of the remote class are updated to refer to the migrated remote class instances as their super-instances. After the migration, the remote class definition is dropped.
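The following sketch (illustrative Python; attribute merging of matched instances and subclass re-attachment are simplified away) captures the essence of 'combine': attribute union plus instance migration deduplicated by the local instance key:

    # 'combine' sketch: attributes become the union of the two definitions, and
    # remote instances migrate unless the local instance key already identifies
    # an equivalent local instance (here the local copy is simply kept).
    def combine(local_attrs, local_extent, remote_attrs, remote_extent, instance_key):
        merged_attrs = list(dict.fromkeys(local_attrs + remote_attrs))
        existing = {tuple(i.get(a) for a in instance_key) for i in local_extent}
        merged_extent = list(local_extent)
        for r in remote_extent:
            k = tuple(r.get(a) for a in instance_key)
            if k not in existing:                 # migrate only new instances
                merged_extent.append(r)
                existing.add(k)
        return merged_attrs, merged_extent        # the remote definition is then dropped

    attrs, extent = combine(
        ["name", "phone"], [{"name": "Arbib", "phone": "x111"}],
        ["name", "lab"],   [{"name": "Arbib", "lab": "Cornell"},
                            {"name": "Bower", "lab": "Caltech"}],
        instance_key=("name",))
    print(attrs)        # -> ['name', 'phone', 'lab']
    print(len(extent))  # -> 2 (the two Arbib instances occur only once)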
The 'make_sub' and 'make_sup' primitives establish a subclass/superclass relationship between the local and remote classes. When the remote class is made a subclass of the local class (make_sub), the remote class attributes that make sense for the local class are attached to the local class definition in addition to the original attribute definitions of the local class. Similarly, when the remote class is made a superclass of the local class (make_sup), the local class attributes that make sense for the remote class are added to the remote class definition in addition to the original attribute definitions of the remote class. Once again, overlapping and equivalent instances are detected with the help of the instance key of the local class, and sub-instance/super-instance relationships are maintained accordingly in both cases. These two primitives are needed when one class is more specific (general) than the other.
Finally, the 'overlap' evolution primitive is used for cases where the local and remote classes correspond to concepts that have a common superconcept. This primitive is applicable to local and remote classes which have a common direct superclass. A new superclass is introduced as the direct and common superclass of the local class and the remote class. The original common superclass of the remote and local classes is made the superclass of this newly introduced superclass. The new superclass has the attributes that make sense for both the local and remote classes. These common attributes are filtered out of the local class definition and the remote class definition, and the remaining attributes are left in the new definitions of the local and remote classes. Instances are organized accordingly: one superclass instance is created for each local class instance and for each remote class instance. Newly created superclass instances and the local and remote class instances are maintained with the help of instance keys.
Schema evolution primitives reorganize the local class hierarchy, which was previously implanted with the remote classes. This reorganization obviously changes the class hierarchy, and it may affect instance and class keys. Although renaming an attribute or a class, or adding an attribute to a class, has no effect on instance keys and class keys, the make_sub, make_sup, and overlap primitives introduce the problem of maintaining instance and class keys for the newly introduced or modified class definitions. In such cases, where a new instance key or class key is needed, it is defined automatically by the system if the system can figure out what the new class key/instance key will be. Otherwise, domain experts are consulted. For example, while making the remote class a subclass of the local class, the system can figure out that the instance key of the new local class definition will be the same as it was before; in this case, the new local class keeps its old key. If the remote class has not propagated its class/instance key attributes, it keeps them in its new definition. If the remote class has propagated its class key attribute(s) to the local class, however, a new class/instance key definition is requested from the domain expert who initiated the evolution.
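This key-maintenance policy amounts to a simple decision rule, sketched below; the function and its arguments are hypothetical, introduced only to restate the policy.

def key_after_evolution(old_key, key_attrs_propagated, ask_expert):
    """Return the instance/class key of a class after an evolution step.

    If the class did not propagate its key attributes to its peer, the old
    key is reused automatically; otherwise the domain expert who initiated
    the evolution must supply a new key definition."""
    if not key_attrs_propagated:
        return old_key          # system reuses the old key
    return ask_expert()         # expert supplies a new key

# e.g., a remote class that kept its key attributes keeps its old key:
assert key_after_evolution(("name",), False, lambda: ("id",)) == ("name",)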
4.2 An Example Sharing Scenario: Collaborating Scientists
In this section, we present an example sharing scenario between two federation components which maintain related information in their databases. Throughout the rest of the dissertation, we will illustrate the individual phases of our mechanism, and the new concepts we introduce, by means of this example from the science domain.
[Figure omitted: the local schema (USCBP) with relations researchData (uniqueID, subject, experiment), experiment (uniqueID, name, startDate, contactPerson, description), and person (uniqueID, title, name, phone, address); and the remote schema portion (Cornell Neuron Net Database) with class Physiology_Data (author(s), Neuron(s), annotation(s)) and its subclasses Time_Series_Data (number of traces, trace identifiers, data_set(s)) and Histogram_Data (data_points, scaling).]
Figure 17: Local and Remote Conceptual Schemas.
Figure 17 shows an example of a local component which desires to expand its database with related remote information units. The local schema is a simplified version of the USC Brain Project (USCBP) [84] schema, which is managed by the object-relational Illustra DBMS [33]. In the USCBP database, information about experiments and the people performing them is recorded, as well as information about research data. Each experiment has a name, start date, description, and a contact person. Each person has a name, title, phone number, and an address. The researchData relation keeps related information on what experiment is performed on what subject. In addition, each relation has a unique identifier attribute (uniqueID).
The remote component, on the other hand, is interested in detailed analysis of the result data associated with experiments. As a result, the Physiology_Data relation keeps a record of different experimental data sets published in the literature. For each data set, its authors, related neurons, and annotations are recorded. Moreover, two kinds of physiology data are identified. In addition to the attributes of Physiology_Data, the Time_Series_Data relation records the number of traces, a set of trace identifiers, and data sets, while data points and scaling information are kept for the Histogram_Data relation.
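For concreteness, the two schemas of Figure 17 can be written down as plain attribute listings; the sketch below simply restates the figure in Python dictionaries, with attribute names taken from the figure.

# Local (USCBP) schema: relation name -> attributes.
local_schema = {
    "researchData": ["uniqueID", "subject", "experiment"],
    "experiment":   ["uniqueID", "name", "startDate",
                     "contactPerson", "description"],
    "person":       ["uniqueID", "title", "name", "phone", "address"],
}

# Remote schema portion (Cornell Neuron Net Database); Time_Series_Data
# and Histogram_Data are specializations of Physiology_Data.
remote_schema = {
    "Physiology_Data":  ["authors", "neurons", "annotations"],
    "Time_Series_Data": ["numoftraces", "trace_ids", "datasets"],
    "Histogram_Data":   ["datapoints", "scaling"],
}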
At some point in time, the local database may become more involved with experiment results, e.g., the different kinds of data sets obtained from experiments. In this case, the goal of the local database administrator would be to extend the information content of the local database with the information content of the remote database. The goal, then, is to access or manipulate remote information units within/from the local database. In order to do so, we need to fold the remote conceptual schema into the local schema. This folding process will take place within the context of the local database.
4.3 Schema Implantation and Semantic Evolution Approach
In this section, we describe our solution to the folding problem and elaborate on the sub-phases of our methodology. We describe each sub-phase, show the information flow within and across sub-phases, and illustrate the individual sub-phases.
Schema Implantation and Semantic Evolution, our approach to the folding problem, employs meta-data implantation and stepwise evolution techniques to interrelate database elements in different component databases, and to resolve conflicts on the semantics of database elements. It is a uniform approach to the problems of semantic heterogeneity resolution, (partial) schema integration, and schema customization in cooperative federated database systems. Three ideas constitute the key philosophy of our approach: (1) explicit representation of information unit semantics through the use of a semantically rich data model; (2) meta-data implantation, where remote (meta) data is implanted into the local (meta) data and expected to establish concrete relationships with local (meta) data; and (3) stepwise evolution, where relationships between local and remote (meta) data are established and reflected to the local context through schema evolution primitives whenever enough knowledge about the remote and local meta-data semantics is acquired.
Figure 18 presents the overall data architecture and information flow across the individual phases of our approach. Schema Implantation and Semantic Evolution consists of three sub-phases: Semantic Clarification, Schema Implantation, and Semantic Evolution.
The first phase is called Semantic Clarification. During this phase, possible confusions about the meanings of information units are eliminated by obtaining additional knowledge about the conceptual schema elements that represent these information units. This is done with the help of a domain expert. Class identity and instance identity semantics are defined during this phase. The purpose of this phase is to obtain a conceptual schema that documents what it contains and what real-world concepts its elements correspond to. HSDM plays a vital role in this respect by providing the necessary object-oriented constructs as well as new constructs for explicit, clear information unit semantics.
[Figure omitted: data flow of the approach. The remote and local conceptual schemas, each in its native data model, pass through Semantic Clarification (guided by domain experts) to yield conceptual schemas in HSDM; Schema Implantation produces the local schema implanted with the remote schema, plus acquired knowledge on information unit semantics; Semantic Evolution then yields the evolved, and finally the final, local schema implanted with the remote schema.]
Figure 18: Schema Implantation and Semantic Evolution Approach.
The Semantic Clarification phase is responsible for two tasks: (1) It transforms conceptual schemas from their native data models into equivalent conceptual schemas in HSDM, which is the common data model employed federation-wide. In other words, it transforms information units represented in various data models into a common representational form through which information exchange becomes possible. It functions as a schema translator in this respect. (2) It enriches and clarifies the semantics of information units by obtaining additional knowledge on the meanings of schema elements. A domain expert, who is assumed to have extensive knowledge about his/her domain, provides this additional knowledge. In this respect, the phase transforms information units into a common formalism through which their semantics can be understood and interpreted better. Although the Semantic Clarification phase operates on conceptual schemas (it transforms conceptual schemas into HSDM), it may have implications for both object classes and object instances: new attributes may be introduced, and new attribute values may be required. Consequently, in an implementation, components should reserve storage for this purpose.
Each component in the federation is responsible for describing the sharable portion of its conceptual schema (e.g., its export schema) in terms of HSDM constructs in the Semantic Clarification phase. Accordingly, each component performs the activities associated with the Semantic Clarification phase either when it enters the federation, or when it decides to share and exchange information units with other components.
The second phase is termed Schema Implantation, during which the remote conceptual schema elements are imported into the local conceptual schema. Furthermore, a number of hypothetical relevances are specified between the local and remote conceptual schema elements. The purpose of the implantation process is to loosely integrate the remote and local conceptual schemas. Prior to this phase, the relevant part of the remote conceptual schema should be identified. It is enough to know that the local and remote conceptual schema elements are related in some way; we do not expect to know the exact relationship between the two in advance. This phase involves three activities: (1) The relevant part of
the remote conceptual schema is imported into the local conceptual schema. At this stage, the remote and local conceptual schema elements are loosely integrated, since they do not have explicit relationships (e.g., generalization, specialization, attributes) between them. The remote and local class hierarchies are superimposed such that they share the "Object" object class as the common superclass. (2) User-defined primitive classes are integrated with the help of a domain expert. Predefined primitive classes and the "Object" class constitute the shared (agreed-upon) part of the local and remote conceptual schemas after this phase. (3) A domain expert specifies hypothetical relevances between the local and remote conceptual schema elements. The domain expert can use element names, their descriptor attributes, and their locations in the class hierarchy to specify hypothetical relevances. During the Schema Implantation phase, hypothesis specifications are limited to direct subclasses of the object class "Object". Later, during Semantic Evolution, these hypothetical relevances may be propagated down into the class hierarchy. The purpose of the Schema Implantation phase is to loosely integrate the two schemas in the form of an implanted conceptual schema, and to prepare the environment in which the implanted conceptual schema will evolve into more tightly integrated forms.
The last phase is Semantic Evolution. In the Semantic Evolution phase, previously hypothesized relevances are tested and, depending on the test results, new hypothetical relevances are formed, existing ones are propagated down the class hierarchy, or individual evolution primitives are activated. Tests on individual relevances require additional knowledge about the remote conceptual schema elements as well as about the local conceptual schema elements. The purpose of this phase is to determine, over time, the ultimate relationships between the remote and local conceptual schema elements, which in turn enables customization of the implanted remote conceptual schema elements in the local conceptual schema. Over time, the remote conceptual schema elements live with the problem in the local database environment. Whenever a semantic hint or piece of information is obtained that could resolve certain heterogeneities, the remote conceptual schema elements start to
(semantically) migrate into the local conceptual schema. Loose and imprecise relationships between remote and local conceptual schema elements evolve into tighter and more precise relationships which fit into the context of the local database.
Hypothetical relevances are specified on classes, one local and one remote. The local and remote classes are called "semantic peers" of each other if they are hypothesized to be related. Semantic peers form harmonizers (instances of hypothesized relevances). Semantic peers attempt to complement each other's structural and semantic aspects; in other words, they try to establish a relationship between them over time. This period of time is called the testing period of the harmonizer. When the system obtains enough information to make a sound decision about the relationship between semantic peers, the harmonizer is said to reach an equilibrium state. When the harmonizer reaches this state, the system is able to suggest a relationship between the semantic peers under investigation. The system reacts by suggesting one of the possible relationships (e.g., equivalence, subclass, superclass, overlapping, non-relevant, or further analysis required). The suggested relationship is verified by the domain expert. This may lead to the formation of different harmonizers, or to the activation of evolution primitives. Evolution may have an effect on object instances, although we desire to keep such effects to a minimum. In the remainder of this section, we discuss and illustrate the internals of each sub-phase.
4.3.1 Semantic Clarification Phase
The inability of current data models to represent information unit semantics in an explicit, comparable, and easily interpretable manner necessitates explicit structures to maintain descriptions of, and inter-relationships between, the semantics of information units that may reside in different databases. In order to alleviate the need for such global structures, HSDM is furnished with special constructs that make information unit semantics explicit, comparable, and interpretable (e.g., class keys, instance keys, descriptor attributes, and the Kind attribute for primitive attributes).
[Figure omitted: a component schema in data model X passes through Semantic Clarification (schema transformation and semantic enrichment, guided by a domain expert) to yield the component schema in HSDM.]
Figure 19: Semantic Clarification Phase.
The Semantic Clarification phase (Figure 19) translates the conceptual schemas of components from their native data models into HSDM. It is imperative to note that the semantic clarification phase does not solve semantic conflicts between conceptual schema elements. It only clarifies the semantics of conceptual schema elements, which in turn eases the determination of conceptual schema element relationships in the following phases.
1. Component data model to HSDM transformation.
2. Class key determination for each class.
3. Instance key determination for each class.
4. Descriptor attribute specification for each attribute and for each class.
5. Complementary attribute value specification for each primitive attribute.
6. User-defined subclass elimination.
Figure 20: Activities in Semantic Clarification.
Figure 20 describes the activities performed during semantic clarification. These activities are performed for each component in the federation, once the component enters the federation or decides to share and exchange information with other components.
First, the semantic clarification phase and the activities it involves necessitate the existence of a number of translators. Translators convert conceptual schemas from their native data models into HSDM. This is not an ordinary transformation, since conceptual schema element semantics are also enriched during the transformation. Therefore, we call such a translator a semantic translator, as it considers the meanings of conceptual schema elements as well as their structures. A semantic translator maps component data model constructs to HSDM constructs. This mapping is relatively easy for object-based to HSDM transformations, since HSDM is object-based and supports almost all of the key object-based concepts in use today. Conceptual schemas in other data models need to be enriched semantically and structurally to find a corresponding set of constructs in HSDM.
Second, class key determination for an object class involves several sub-activities. For each immediate sub-root of the object class "Object", we determine a class key which distinguishes this immediate sub-root from all other immediate sub-roots. As we mentioned before, classifying attributes constitute class keys. For any object class, if there does not exist a classifying attribute that distinguishes it uniquely across immediate sub-roots, then a complementary attribute is introduced for conceptual classification purposes only. Complementary attributes in this sense are generally class attributes, since they are introduced for conceptual classification, and their values for each instance are not of interest to the application. Newly introduced complementary attributes are initialized with the initial null (null_i) or any value from the attribute's domain. After class keys are determined for all immediate sub-roots in the class hierarchy, we determine class keys for the subclasses of immediate sub-roots. Subclasses can depend on the class keys of their superclasses in order to distinguish themselves from classes in other immediate sub-trees. In order to distinguish a
subclass from its superclass and from its sibling classes, we depend on properties of the subclass. Predicate-defined and attribute-defined subclasses have properties (predicate values or attributes) which conceptually distinguish them from their superclasses. Therefore, class keys for subclasses consist of the combination of their superclasses' class keys and their own classifying attributes (or predicate values). If a conflict arises between the subclass and its siblings such that we cannot determine a class key for the subclass, complementary attributes are again introduced for classification purposes.
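As a rough illustration of this procedure, the sketch below derives a class key for a subclass from its superclass's class key and its own classifying attributes, falling back to a complementary attribute on conflict. The representation, helper names, and example classes are assumptions made for this sketch.

def class_key(cls_name, super_key, classifying_attrs, sibling_keys):
    """Build a class key as superclass key + classifying attributes;
    if no distinguishing key results, introduce a complementary attribute."""
    candidate = tuple(super_key) + tuple(classifying_attrs)
    if classifying_attrs and candidate not in sibling_keys:
        return candidate
    # no distinguishing classifying attribute: add a complementary
    # (class) attribute used for conceptual classification only
    return tuple(super_key) + ("complementary_" + cls_name,)

# hypothetical subclass 'student' of 'person' (keyed by the phone attribute):
assert class_key("student", ("phone",), ("gpa",), set()) == ("phone", "gpa")
assert class_key("staff", ("phone",), (), set()) == ("phone", "complementary_staff")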
Third, the instance key determination process captures the point of view the local conceptual schema adopts for identifying instances of a class. By specifying a combination of attributes as an instance key for a class, the local database administrator declares that instances with the same instance key values will be considered equal during information exchange. Again, complementary attributes are introduced in cases where the object class does not have a set of attributes that makes individual instances unique. Unlike the complementary attributes introduced for class keys, complementary attributes for instance keys are instance attributes rather than class attributes, because we need their values in order to determine whether two instances, one local and one remote, are equal. Nevertheless, the values of complementary attributes are not needed at this stage. They can be updated later on, before harmonizers involving this object class suggest a relationship between the local and remote classes.
Fourth, descriptor attributes are defined on classes and on attributes of classes. They describe the meanings and real-world counterparts of attributes and classes in natural language. They are beneficial for three reasons: (1) By indicating the real-world counterparts of attributes and classes, they clarify the meanings of conceptual schemas. (2) They can be browsed by database administrators in order to understand the conceptual territory a conceptual schema models. (3) During the schema implantation and semantic evolution phases, they help build realistic harmonizers that associate related classes.
Fifth, we introduce complementary attributes whenever a semantic clarification is necessary. Moreover, the introduced complementary attributes help conversion functions to be defined easily. We associate complementary attributes with primitive attributes at this stage. For primitive attributes whose domain classes are P_STRING, M_STRING, or CHARACTER, we create Kind, Format, and Length complementary attributes to further clarify the attributes' meanings. For primitive attributes whose domain classes are N_STRING, INTEGER, or REAL, we introduce Kind and Unitofmeasure complementary attributes.
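This rule can be stated compactly in code; the function below merely restates the two cases just described.

def complementary_attrs(domain_class):
    """Complementary attributes attached to a primitive attribute,
    selected by its domain class."""
    if domain_class in ("P_STRING", "M_STRING", "CHARACTER"):
        return ["Kind", "Format", "Length"]
    if domain_class in ("N_STRING", "INTEGER", "REAL"):
        return ["Kind", "Unitofmeasure"]
    return []  # other domain classes receive none at this stage

assert complementary_attrs("P_STRING") == ["Kind", "Format", "Length"]
assert complementary_attrs("INTEGER") == ["Kind", "Unitofmeasure"]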
Sixth, user-defined subclasses do not have any aspects that conceptually distinguish them from their superclasses. The memberships of their instances are determined and specified by users. Conceptually, they do not provide any hint about the semantics of a subclass apart from its superclass. As a result, we do not desire to have user-defined subclasses in our schemas, as they make the formalization of class and instance identity semantics difficult. Consequently, they are translated into predicate-defined subclasses if such a predicate can be found. Otherwise, they are converted into attribute-defined subclasses. If an appropriate class key cannot be found, a complementary attribute is introduced.
It is important to note that the problems of synonym attributes and synonym classes (the same attribute/class in different databases with different names), and the problems of homonym attributes and homonym classes (different attributes/classes in different databases with the same name), do not exist in local conceptual schemas in this phase. These problems are consequences of implanting remote object classes, which do not exist in the local conceptual schema prior to schema implantation.
When applied to the local and remote databases of Figure 17 in Section 4.2, the semantic clarification phase produces two conceptual schemas in HSDM, as shown in Figure 21. In this figure, ovals represent classes, while dark arrows represent the ISA relationship. Attributes enclosed within boxes are class keys. Instance keys are underlined. For clarity, the complementary attributes and their values for the local and remote classes are shown in separate figures, Figure 22 and Figure 23 respectively.
[Figure omitted: the local and remote conceptual schemas after Semantic Clarification, both in HSDM and rooted at the object class "Object"; in the remote schema, Time_Series_Data and Histogram_Data are subclasses of Physiology_Data.]
Figure 21: Local and Remote Conceptual Schemas after Semantic Clarification.
Mapping relational conceptual schemas into HSDM poses some challenges (Figure 21). For example, the semantic translator should be intelligent enough to detect attributes that are introduced for unique identification purposes only, and should eliminate such attributes in the corresponding HSDM schema. Class keys are shown as attributes enclosed by boxes in Figure 21. For example, in the local schema, the existence of the phone attribute conceptually distinguishes the person class from all other classes in the hierarchy. Figure 21 also shows the instance keys of classes in the local and remote databases. The specification of instance keys indicates the application's point of view about the real world. For example, the local database distinguishes experimenters by their name attribute values. Instance keys can contain complementary attributes, although such a need arises neither in the local case nor in the remote case for our example.
Class Name: researchData
Descriptor: data obtained as the result of experiments performed on subjects.
Attributes:
Name           Domain Class  Complementary  Complementary Value
subject        P_STRING      Kind           animal
                             Format         free
                             Length         max 20
experiment     experiment    none           none

Class Name: experiment
Descriptor: keeps information related to experiments.
Attributes:
Name           Domain Class  Complementary  Complementary Value
name           M_STRING      Kind           coding
                             Format         AAA999
                             Length         fixed, 6
startDate      P_STRING      Kind           date
                             Format         DDMMYY
                             Length         fixed, 6
contactPerson  person        none           none
description    P_STRING      Kind           definition
                             Format         free
                             Length         max 200

Class Name: person
Descriptor: persons in this research organization.
Attributes:
Name           Domain Class  Complementary  Complementary Value
title          P_STRING      Kind           other
                             Format         free
                             Length         max 30
name           P_STRING      Kind           name
                             Format         free
                             Length         max 40
phone          N_STRING      Kind           phone number
                             Unitofmeasure  none
address        P_STRING      Kind           address
                             Format         free
                             Length         max 60
Figure 22: Complementary Attribute Values of Primitive Attributes in the Local Schema.
Figure 22 lists classes, their descriptor attributes, and their attributes along with the
domain class, complementary attributes and complementary attribute values for each
primitive attribute in the local schema.
Class Name: Physiology_Data
Descriptor: data from literature.
Attributes:
Name         Domain Class     Complementary  Complementary Value
authors      set of P_STRING  Kind           name
                              Format         free
                              Length         max 40
neurons      set of P_STRING  Kind           other
                              Format         free
                              Length         max 20
annotations  set of P_STRING  Kind           definition
                              Format         free
                              Length         max 100

Class Name: Time_Series_Data
Descriptor: physiology data kept as time series data.
Attributes:
Name         Domain Class     Complementary  Complementary Value
numoftraces  INTEGER          Kind           number
                              Unitofmeasure  sequence
trace_ids    set of INTEGER   Kind           coding
                              Unitofmeasure  sequence
datasets     set of P_STRING  Kind           other
                              Format         free
                              Length         max 5000

Class Name: Histogram_Data
Descriptor: physiology data kept as histograms.
Attributes:
Name         Domain Class     Complementary  Complementary Value
datapoints   set of REAL      Kind           coordinate_y
                              Unitofmeasure  other
scaling      INTEGER          Kind           other
                              Unitofmeasure  other
Figure 23: Complementary Attribute Values of Primitive Attributes in the Remote Schema.
Similarly, Figure 23 shows the remote classes, their descriptor attributes, and, for each attribute, its complementary attributes and values. We have categorized primitive attributes by their Kinds. The possible Kind values constitute an extendible set of concepts which are agreed upon in advance and which are not contentious. This set of Kind values corresponds to application-independent concepts. Although we have tried to keep this set of concepts small, the set is extendible. The 'other' value for the Kind complementary attribute corresponds to concepts that are not agreed upon yet, and that are application dependent. We also define descriptor attributes for each attribute in the original conceptual schemas, although they are not shown in the figure for the sake of clarity.
4.3.2 Schema Implantation Phase
After the Semantic Clarification phase, the local and remote conceptual schema elements have been enriched with new information that clarifies their semantics. Now they are ready for implantation. The Schema Implantation phase (Figure 24) loosely integrates the local conceptual schema elements with the remote conceptual schema elements.
[Figure omitted: the remote and local schemas in HSDM pass through Schema Implantation (superimposing, then hypothesis specification guided by the local domain expert) to yield the local schema implanted with remote schema elements, together with hypothetical relevances (harmonizers) between local and remote schema elements.]
Figure 24: Schema Implantation Phase.
Schema Implantation consists of two sub-phases: (1) superimposing, and (2) hypothesis specification. During the superimposing sub-phase (Figure 25), the remote conceptual schema elements (classes and attributes) are imported and superimposed into the local conceptual schema. The immediate subtrees of the remote conceptual schema are added as new immediate subtrees of the object class "Object" in the local conceptual schema. We assume that the necessary means to access the remote object instances and their attribute values are provided. A related study of instance-level sharing is described in [22]. We can follow this approach and create local surrogate objects in order to access remote object instances. Predefined primitive classes do not cause any problems with regard to the superimposing process, since they constitute the directly sharable (agreed-upon) portion of conceptual schemas in the federation. On the other hand, user-defined primitive subclasses are analyzed based on their complementary attributes and on the predicates used to form these subclasses. Since user-defined primitive subclasses can only be predicate defined, this is a trivial process. For this reason, our methodology concentrates on the investigation of abstract object class equivalence.
[Figure omitted: the remote class hierarchy and the local class hierarchy, each rooted at "Object", are superimposed into a single local class hierarchy loosely integrated with the remote classes.]
Figure 25: Superimposing Sub-Phase.
The second sub-phase is hypothesis specification (Figure 26), during which a number of hypothetical relevances are specified between the local and imported remote classes by means of harmonizers. In this phase, individual harmonizers are constructed between the immediate sub-roots of the local and remote conceptual schemas, as we want to locate the proper immediate sub-tree to which a remote class should belong. Once this immediate sub-tree is determined correctly, we can construct additional harmonizers (additional hypotheses) deeper in the sub-tree in order to find a proper depth for the remote class. Initial hypotheses about class correspondences may be specified arbitrarily; Semantic Evolution will respond properly even in this case. However, using classifying attributes and descriptor attributes is suggested in order to obtain reasonable initial hypotheses that lead to more accurate estimates.
[Figure omitted: a local class and an imported remote class connected by a hypothetical relevance (harmonizer), whose possible outcomes are equivalent, subclass, superclass, overlapping, or irrelevant.]
Figure 26: Hypothesis Specification Sub-Phase.
After Schema Implantation, the remote conceptual schema elements are loosely integrated into the local conceptual schema. The implanted local conceptual schema is now ready for relevance testing and customization, which constitute the activities performed during the Semantic Evolution phase of our methodology.
4.3.2.1 Relationships between Remote and Local Classes
The most important problem in schema integration is to determine the relationships between the remote and local classes. If we knew all of the class keys of a remote class
(not only the modeled ones), if we knew the attribute correspondences between the remote class and the local class, and if we knew the corresponding instances in the remote and local classes, then schema integration would be a trivial, one-step process. Because of the problems of attribute/class homonyms, attribute/class synonyms, equivalent constructs in data models, incompatibilities between schema elements, missing factual and conceptual information about classes, and semantic confusions about schema elements, it is unreasonable to expect original conceptual schemas to contain the information necessary to resolve incompatibilities between classes. Consequently, based on the available information alone, it is impossible to integrate them. We propose hypothetical processing for this reason. Under hypothetical processing, the absence or existence of a hypothetical correspondence between remote and local object classes is tested over time.
There are five possible relationships between a remote class r and a local class l (Figure 27): equivalence, generalization, specialization, overlapping, and irrelevance.
- equivalence, equal(l, r): r and l are equivalent, which means that they model exactly the same concept in the real world according to the local application's point of view.
- generalization, gen(l, r): local class l is more general than remote class r, which corresponds to a real-world state where l models a superconcept of r. Again, this holds according to the local database's point of view about the real world.
- specialization, spec(l, r): local class l is less general than remote class r, which means that l models a concept that is a subconcept of r in the real world.
- overlapping, overlap(l, r): local class l and remote class r are related via a third common superconcept, of which their corresponding concepts are subconcepts.
- irrelevance, distinct(l, r): local class l and remote class r are distinct, not related at all. They model distinct concepts in the real world, and there is no way for the local component to consider them as equivalent or related.
Figure 27: Possible relationships between a local class l and a remote class r.
If a local class l and a remote class r are related through the equal, gen, spec, or overlap relationships, they are said to be semantically related. Supposing that r and l are semantically related, they may have overlapping attributes or distinct attributes. In order to determine the correct relationship between r and l (equal, spec, gen, or overlap), we need attribute correspondence assertions for r and l. Furthermore, r and l may contain overlapping instances or distinct instances. Object correspondence assertions help determine the instance relationships between the instances of r and l (equal or not). The instance key of the local class is utilized during the specification of object equivalence assertions. Figure 28 defines the notion of instance equivalence employed in our methodology.
i_l: an instance of local class l.
i_r: an instance of remote class r.
Supposing semantically_related(l, r):
- equal(i_l, i_r) iff Instance_Key_Local(i_l) = Instance_Key_Local(i_r)
- distinct(i_l, i_r) iff Instance_Key_Local(i_l) ≠ Instance_Key_Local(i_r)
Figure 28: The notion of instance equivalence.
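In code, this notion of instance equivalence reduces to comparing projections onto the local class's instance-key attributes; the sketch below assumes instances are dictionaries.

def instance_key_local(instance, local_instance_key):
    """Project an instance onto the local class's instance-key attributes."""
    return tuple(instance[attr] for attr in local_instance_key)

def equal_instances(i_l, i_r, local_instance_key):
    """equal(i_l, i_r) iff their local instance-key projections coincide
    (assuming the two classes are semantically related)."""
    return (instance_key_local(i_l, local_instance_key)
            == instance_key_local(i_r, local_instance_key))

# e.g., with the instance key ('experiment',):
assert equal_instances({"experiment": "E1", "subject": "rat"},
                       {"experiment": "E1", "authors": ["x"]},
                       ("experiment",))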
4.3.2.2 Harmonizers
A harmonizer is a persistent structure that associates an abstract local class with an abstract remote class for the purpose of investigating whether or not they are semantically related (e.g., equivalence, generalization, specialization, overlap, or irrelevance). It incorporates heterogeneity into databases (it involves instances from different classes). A harmonizer aims to find an appropriate location for a remote class within the local class hierarchy. When associated with a local class through a harmonizer, a remote class lives with the local class within the local database for a while (during the testing period). The purpose of the harmonizer is to prove or disprove the equivalence or relevance of the local and remote classes. The rationale behind the idea of harmonizers is to live with the problem, within the problem environment, until the problem environment is known better.
Name
Associated Local Class.
Associated Remote Class.
Attribute Equivalence Assertions.
Object Equivalence Assertions.
Figure 29: The Structure of Harmonizers.
Figure 29 shows the structural components of a harmonizer. Naming harmonizers is necessary, as there might be more than one harmonizer in a conceptual schema. The associated local class and associated remote class of a harmonizer are called semantic peers, whose equivalence or relevance is going to be investigated during the testing period. Attribute equivalence assertions include the known equivalences between the attributes of the local and remote classes. Object equivalence assertions, on the other hand, specify what attributes of the local and remote classes will be used in order to identify related local and remote instances; they are obtained either directly or via a derivation expression from the instance key of the associated local class. Semantic peers should have a common superclass (initially the "Object" class) to form a harmonizer. There may be more than one harmonizer during the schema implantation phase; however, their associated local classes should be located in different immediate sub-trees of the local conceptual schema, and similarly, their associated remote classes should be located in different immediate sub-trees.
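A harmonizer can be rendered as a small record type; the sketch below mirrors Figure 29, with field types chosen only for illustration.

from dataclasses import dataclass, field

@dataclass
class Harmonizer:
    """Persistent structure associating a local class with a remote class."""
    name: str
    local_class: str
    remote_class: str
    # known equivalences between local and remote attribute names
    attribute_equivalences: dict = field(default_factory=dict)
    # attributes used to identify related local and remote instances,
    # derived from the instance key of the associated local class
    object_equivalence: tuple = ()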
Semantic peers are bound to each other in the form of a harmonizer for a common goal: to establish a relationship between them. Moreover, if this relationship turns out to be an equivalence, the semantic peers aim to resolve the structural and semantic differences between them. In order to realize this goal, they have to prove that they can complement each other's structural, behavioral, and semantic aspects. Consequently, each semantic peer introduces complementary attributes on the other. With the construction of a harmonizer, the local class definition is expanded with the attributes that exist in the remote class definition but not in the local class definition. Similarly, the remote class definition is expanded with the attributes that exist in the local class definition but not in the remote class definition. The purpose of these newly added attributes, unlike the complementary attributes introduced during the semantic clarification phase, is to make sure that the semantic peers have the ability to complement each other's structural and semantic aspects. The job of the harmonizer is to fill in these attributes with meaningful values, so that the integration of the remote and local object classes becomes possible. Complementary attribute values are initialized with the initial null (null_i) during harmonizer construction. When a harmonizer is constructed, there is a contract implicitly agreed upon by the semantic peers. According to this contract, (1) the local class will supply meaningful values for the complementary attributes introduced by the remote class, and (2) the remote class, in return, will supply meaningful values for the complementary attributes introduced by the local class. It is because of this contract that the local and remote object classes can complete each other's structural and semantic aspects. At the end of the testing period, and depending on the harmonizer output, the semantic peers may decide to form a more tightly coupled relationship (they integrate), or they may choose to form a more loosely coupled alternative (they disintegrate).
The don't-know null (null_?) or any permissible value from the domain class of a complementary attribute counts as a meaningful attribute value, while the inapplicable null (null_x) and values violating attribute constraints count as non-meaningful attribute values. Null_x signifies the inability of the object class to accept the attribute as one of its properties.
In summary, in order to decide whether a local class and a remote class are equivalent or relevant, a harmonizer passes through different stages: (1) It hypothesizes the relevance of the local and remote conceptual schema elements (classes, attributes, and objects). (2) It tests the viability of these hypotheses over time, during the testing period. It receives complementary information about the semantic peers: the local (remote) class supplies meaningful values for attributes which exist in its semantic peer's class definition and which do not exist in its own class definition. In other words, the semantic peers try to complement each other's structural and semantic aspects. (3) At the end of the testing period, whenever enough information has been obtained to determine the relevance of the local and remote object classes, the harmonizer is said to reach an equilibrium state. The harmonizer is then able to tell whether or not these classes are related. (4) The harmonizer then suggests an appropriate action/relationship for these classes. Different relationships between classes require different actions to be taken. Equivalent classes need to be combined. Irrelevant classes require the construction of new harmonizers between the remote class and other local classes. Classes having a generalization/specialization relationship may require additional harmonizers between (subclasses of) the remote class and (subclasses of) the local class in order to determine the correct location the remote class should occupy in the class hierarchy. (5) User verification is necessary to ensure the validity of the relationships/actions that have been suggested by the harmonizer.
Harmonizers are constructed with the help of domain experts at the end of the Schema Implantation phase and during the Semantic Evolution phase. We will revisit harmonizers, and focus on the acquisition of complementary attribute values during the Semantic Evolution phase, in the next section.
4.3.2.3 Example
With regard to our example, Figure 30 shows the local conceptual schema implanted with the remote conceptual schema elements. Attribute domains and the predefined part of the conceptual schema are not shown for simplicity. A harmonizer (Harmonizer1) is constructed between the local class researchData and the remote class Physiology_Data. The structure of this harmonizer is shown in Figure 31, which also shows how complementary attributes are introduced by the semantic peers on each other to complement each other's structural and semantic aspects.
[Figure omitted: the local class hierarchy rooted at "Object", now containing the implanted remote classes Physiology_Data, Time_Series_Data, and Histogram_Data alongside the local classes.]
Figure 30: Local Schema Implanted with Remote Schema Portion.
In Figure 31, Harmonizer1 (RSHDATA-PHYSDATA) is built as part of the schema implantation phase in order to investigate the relevance of the associated local class researchData and the associated remote class Physiology_Data. While no attribute equivalences are specified for this harmonizer, its object equivalence assertion depends on the instance key of
the associated local class. According to this assertion, an instance of the researchData class will be considered equivalent to an instance of the Physiology_Data class if and only if the value of the experiment attribute of the local instance is equal to the value of the experiment attribute of the remote instance.
After Harmonizer1 is constructed, the local class researchData is extended with the complementary attributes authors, Neurons, and annotations, which originally belong to the remote class Physiology_Data. Accordingly, the remote class Physiology_Data is complemented with the complementary attributes subject and experiment, which originally belong to the local class researchData. According to the implicit contract agreed upon by the researchData and Physiology_Data classes as a consequence of building Harmonizer1, (1) the researchData class will supply meaningful values for the complementary attributes borrowed from the remote class Physiology_Data (authors, Neurons, and annotations), and (2) Physiology_Data will supply meaningful values for the complementary attributes borrowed from the local class researchData (subject and experiment). These are to be done during the semantic evolution phase.
After Schema Implantation (superimposing and hypothesis specification), the semantic evolution phase starts.
[Figure omitted: Harmonizer1 associates the local class researchData (complemented with authors, Neurons, and annotations) with the remote class Physiology_Data (complemented with subject and experiment); its object equivalence assertion is based on researchData.experiment.]
Figure 31: The Structure of Harmonizer1.
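Written out as data, Harmonizer1 would look roughly as follows; this is a sketch using plain dictionaries, with None standing in for the initial null (null_i).

# Harmonizer1 of Figure 31, as a plain dictionary for brevity.
harmonizer1 = {
    "name": "RSHDATA-PHYSDATA",
    "local_class": "researchData",
    "remote_class": "Physiology_Data",
    "attribute_equivalences": {},           # none specified initially
    "object_equivalence": ("experiment",),  # researchData.experiment
}

# Complementary attributes introduced by each semantic peer on the other,
# initialized to the initial null (represented here as None):
local_complementary = {"authors": None, "Neurons": None, "annotations": None}
remote_complementary = {"subject": None, "experiment": None}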
4.3.3 Semantic Evolution Phase
The Semantic Evolution phase (Figure 32) investigates whether the previously hypothesized equivalences/relevances hold, configures harmonizers for different local-class/remote-class combinations, or activates individual schema evolution primitives, depending on the outcome of these investigations. In the case of relevance, the implanted local conceptual schema is restructured so that the remote object classes establish semantic relationships with their local correspondences.
[Figure omitted: the local schema implanted with remote schema elements, together with hypothetical relevances, enters Semantic Evolution (testing hypothetical relevances, schema evolution, and further hypothesis specification); local and remote domain experts supply local and remote knowledge for each harmonizer; the output is the evolved local schema implanted with remote schema elements and, once concrete relationships between local and remote classes are established, the local schema integrated with the remote schema elements.]
Figure 32: Semantic Evolution Phase.
Figure 32 depicts the Semantic Evolution phase, where the local and remote domain experts try to supply meaningful values for the complementary attributes imposed by the semantic peers on each other. Complementary information about semantic peers refers to the attribute values introduced by one semantic peer on the other. During the testing period of a harmonizer, the local domain expert tries to supply values for the complementary attributes imposed by the remote class on its semantic peer. Values can be meaningful (null_? or any permissible value from the domain class of the complementary attribute) or null_x (does not make sense for the local class). Moreover, the remote domain expert tries to supply values for the complementary attributes imposed by the local class on its semantic peer. When the harmonizer reaches the equilibrium state (reached when both the local and remote information have been obtained for that harmonizer), a semantic relationship between the local class and the remote class (equivalence, generalization, specialization, overlapping, or irrelevance) is suggested by the harmonizer. The decision regarding the relevance of the semantic peers is based on the local and remote knowledge accumulated during the testing period. After this relationship is validated by means of user confirmation, individual evolution primitives are used to restructure the local conceptual schema and to migrate remote class instances into the local class.
Associating semantic peers by means of harmonizers starts from the immediate sub-roots of the implanted local conceptual schema in the schema implantation phase. However, harmonizers are not limited to the immediate sub-roots during the Semantic Evolution phase, provided that the local and remote classes have a common superclass. In Semantic Evolution, if a semantic relevance is found between semantic peers during the testing period, new harmonizers are created for the local class and the subclasses of the remote class, or for the remote class and the subclasses of the local class.
The Semantic Evolution phase continues until a stable database and a stable conceptual schema are obtained, in which all the implanted remote schema elements have established relationships with the local schema elements.
4.3.3.1 Harmonizer States
Figure 33 shows the possible states a harmonizer may be in during the testing period. A harmonizer enters the initial state when it is first created. The initial state corresponds to a state where the harmonizer sends local and remote requests to the domain experts, requesting additional knowledge about the associated local and remote classes of the harmonizer (the complementary attribute values which were introduced by the semantic peers on each other). In the initial state, the harmonizer waits for this additional knowledge to arrive. Either the local knowledge or the remote knowledge may arrive first. If the local knowledge arrives first, the harmonizer jumps to the WR state (in Figure 33), where it waits for the remote knowledge to arrive next. Similarly, the WL state is reached when the harmonizer is in the initial state and receives the remote knowledge first. In both cases, the harmonizer state then changes to RE. In the RE state, the harmonizer is ready to be evaluated, since it has acquired both the local and the remote knowledge needed to determine the relevance of the semantic peers. Upon receiving the evaluate directive, it determines the relationship between the semantic peers and, after a final user verification, activates schema evolution primitives to establish the relationship on the conceptual schema. The state F is the final state, where the harmonizer is no longer needed and can be deleted.
[Figure omitted: state transition diagram. States: I (initial state), WR (waiting for remote data), WL (waiting for local data), RE (ready to be evaluated), F (final state). Creating a harmonizer enters I; local knowledge moves I to WR, remote knowledge moves I to WL; the remaining knowledge moves WR or WL to RE; the evaluate directive moves RE to F.]
Figure 33: State Transition Diagram of a Harmonizer.
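These transitions can be encoded directly as a table; the sketch below is an illustration, with the event names invented for the example.

# Transition table for the harmonizer life cycle of Figure 33.
TRANSITIONS = {
    ("I", "local_knowledge"): "WR",   # local arrived first; wait for remote
    ("I", "remote_knowledge"): "WL",  # remote arrived first; wait for local
    ("WR", "remote_knowledge"): "RE",
    ("WL", "local_knowledge"): "RE",
    ("RE", "evaluate"): "F",
}

def step(state, event):
    """Advance the harmonizer state; irrelevant events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "I"  # entered upon harmonizer creation
for event in ("remote_knowledge", "local_knowledge", "evaluate"):
    state = step(state, event)
assert state == "F"  # harmonizer evaluated and ready to be deleted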
4.3.3.2 Acquisition of Local and Remote Knowledge
A subset of the instances of a class that can be treated as representative of all instances is called a characteristic subset of that class. More formally, a characteristic subset of a class is defined as the minimal set of instances of that class which is enough to approximate the structural and semantic aspects of all instances of that class. The characteristic subset of a class contains at least one instance from each of the subclasses of that class; this ensures that the characteristic subset is a uniform sample of all instances. The characteristic subset cannot contain instances which have attributes with null values. Therefore, instances without subinstances and without null values are good candidates for the characteristic subset of a class. Figure 34 provides a simple algorithm for generating a characteristic subset of a class. The algorithm selects up to two appropriate instances from each subclass and from the class itself. We mentioned that during the testing period, complementary attribute values are supplied for the local and remote classes by domain experts. However, values are not supplied for all instances of the local class or the remote class; providing complementary information about the characteristic subset suffices.
characteristic_subset = empty set
classes = {A} union all subclasses (direct or indirect) of A
for each class B in classes:
    for each instance I of B:
        if the instance I has no subinstances,
        and does not contain any nulls in its attributes:
            if the characteristic_subset contains fewer than 2 instances of B:
                add I to the characteristic_subset
Figure 34: An Algorithm for Determining the Characteristic Subset of a Class A.
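A runnable version of the algorithm in Figure 34 follows; it assumes instances are dictionaries and that the class structure is supplied through callback functions, which is purely an assumption of this sketch.

def characteristic_subset(cls, subclasses_of, instances_of, subinstances_of):
    """Figure 34: pick up to two 'clean' instances (no subinstances,
    no null attribute values) from cls and each of its subclasses."""
    result, counts = [], {}
    classes = [cls] + subclasses_of(cls)   # direct or indirect subclasses
    for b in classes:
        for inst in instances_of(b):
            clean = (not subinstances_of(inst)
                     and all(v is not None for v in inst.values()))
            if clean and counts.get(b, 0) < 2:
                result.append(inst)
                counts[b] = counts.get(b, 0) + 1
    return result

# toy usage with one class and no subclasses:
subset = characteristic_subset(
    "person",
    subclasses_of=lambda c: [],
    instances_of=lambda c: [{"name": "a"}, {"name": None}, {"name": "b"}],
    subinstances_of=lambda i: [],
)
assert [i["name"] for i in subset] == ["a", "b"]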
Providing remote attribute values for local instances is trivial under the assumption that such values are supplied by the local domain expert, who is supposed to have detailed knowledge about his/her own domain and moderate knowledge about the remote database domain. Providing local attribute values for remote instances, however, requires
special attention. Such values can only be provided by the remote domain expert. A mechanism for educating the remote domain expert about the local attribute semantics is required in this case. In Figure 35, we present a protocol for just this purpose.
[Figure omitted: the local database ships to the remote database the attribute definitions of attr_spec(L), the descriptor attribute for the local class, the descriptor attributes for attr_spec(L), and the location of the local class in the class hierarchy; the remote domain expert returns the values of attr_spec(L) for the characteristic subset of the remote class. Here attr_spec(C) denotes the attributes that exist in C's definition but not in its superclass definition and not in its semantic peer's definition.]
Figure 35: Acquisition of Knowledge about the Remote Class (Instances).
After a harmonizer is constructed, the local domain expert tries to provide meaningful values for the attributes introduced by the remote class. Similarly, when a harmonizer is constructed, the attributes that exist in the local class definition but not in the remote class definition are packed along with information that makes their semantics explicit (e.g., attribute definitions, descriptor attributes, and the location of the attribute's class within the class hierarchy), and are shipped to the remote component. Based on the semantic information about these attributes, the remote domain expert investigates whether it would be meaningful to expand the remote class definition with these new attributes by trying to provide meaningful attribute values for the characteristic subset of the remote class, and ships these new values back to the local component. New attribute values are chosen among
permissible values in the value classes of attributes, or among different null values. For example, it is easy to choose values for a primitive attribute (an attribute whose value class is primitive, such as REAL, INTEGER, etc.), since primitive classes constitute the common part of each and every component database schema in the federation. However, an inter-class attribute (an attribute between two user-defined classes) cannot always be given meaningful values from the value class of the attribute. This is because the value class of the attribute is defined in one context (the local database) and may not be well known in the other context (the remote database). The value of such an attribute can be chosen among the different kinds of null values, depending on whether or not the attribute makes sense for the specific instances.
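The shipment step of Figure 35 amounts to packaging the attribute metadata with the request; a sketch follows, with all field and function names invented for illustration.

def build_remote_request(local_class, attr_spec, definitions, descriptors,
                         class_location):
    """Package attr_spec(L) with the metadata that makes its semantics
    explicit, for shipment to the remote component (cf. Figure 35)."""
    return {
        "local_class": local_class,
        "attributes": {a: {"definition": definitions[a],
                           "descriptor": descriptors[a]} for a in attr_spec},
        "class_location": class_location,   # position in the class hierarchy
    }

req = build_remote_request(
    "researchData", ["subject"],
    definitions={"subject": "P_STRING"},
    descriptors={"subject": "subject the experiment is performed on"},
    class_location=["Object", "researchData"],
)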
4.3.3.3 Determining the Relationship between Local and Remote Classes
When the harmonizer reaches the equilibrium state, the information acquired during the testing period plays a vital role in the determination of the relationship sought between the local and remote classes. The gathered information may indicate different knowledge states. In this section, we present an algorithm that can be used to determine the relationship between the local class and the remote class based on the information acquired about these classes during the testing period. Instead of presenting the algorithm in pseudocode, we specify the possible states of a harmonizer, the decisions about the relationships between the semantic peers, and the consequent actions. Figure 36 outlines this algorithm. The corresponding algorithm takes two sets of instances as input: the characteristic subset of the local class and the characteristic subset of the remote class. Therefore, when we mention "instances", we will be referring to the characteristic subset of a class, not to all of the class instances. The algorithm examines the characteristic subset of the local class and the characteristic subset of the remote class. It investigates whether these sets have meaningful values for the complementary attributes and for the class keys. It then suggests the relationship sought between the local class and the remote class, along with recommended actions.
L: local class
R: remote class

                                        Case:  1    2    3    4    5    6    7    8
State:
  meaningful local class key
  values for remote instances:                Yes  Yes  Yes  Yes  No   No   No   No
  meaningful local non-class key
  values for remote instances:                Yes  Yes  No   No   Yes  Yes  No   No
  meaningful remote attribute
  values for local instances:                 Yes  No   Yes  No   Yes  No   Yes  No

Decision:
  Case 1: EQUIVALENT
  Case 2: SUB-CLASS(L,R)    (the remote class becomes a subclass of the local class)
  Case 3: SUPER-CLASS(L,R)  (the remote class becomes a superclass of the local class)
  Case 4: OVERLAP
  Case 5: SUPER-CLASS(L,R)
  Case 6: IRRELEVANT
  Case 7: SUPER-CLASS(L,R)
  Case 8: IRRELEVANT

Action: for each case, the schema evolution steps described in the text below.

Figure 36: Determining Class Relationships.
It should be noticed that the above algorithm puts different emphasis on the local application semantics and the remote application semantics. For example, it requires the remote class to supply meaningful values for the class key of the local class, but it does not investigate whether the local class supplies meaningful values for the class key of the remote class. That is because the algorithm favors the local application context over the remote application context, since the information units are going to become a part of the local database.
Since class keys signify class identity semantics, if a remote class can supply meaningful values for the class key attributes of a local class, as in Cases 1, 2, 3, and 4, there
should be a high semantic similarity between these two classes. Case 1 corresponds to a state where the local class and the remote class mutually complete each other's structural and semantic aspects. The remote class instances satisfy the condition to be considered instances of the local class, and the local class instances satisfy the condition to be considered remote class instances. Therefore, the local and remote classes are equivalent. The suggested action in this case is to combine the local and remote class definitions in the form of the local class. Consequently, the local class is expanded with the attributes that exist in the remote class's definition but not in its own. Subclasses of the remote class are added as new subclasses of the local class. The instance and class keys of the local class remain unchanged. As the next step, the algorithm suggests the construction of new harmonizers between the old subclasses of the local class and the new subclasses of the local class.
In Case 2, although the remote class instances satisfy the condition to be considered instances of the local class, the local class instances cannot satisfy the condition to be considered remote class instances, since they cannot provide meaningful values for all of the complementary attributes imposed by the remote class. Therefore, the remote class is a subclass of the local class. Attributes of the remote class which make sense in the local class are detached from the remote class definition and attached to the local class definition. After the remote class is made a subclass of the local class, the new local class keeps its original instance and class key attributes. If class/instance key attribute(s) are moved from the remote class to the local class, a new class/instance key is determined for the new remote class. Since the local and remote classes are no longer at the same level in the class hierarchy, there is no need to construct new harmonizers for the subclasses of the remote class.
Case 3 represents a state where the local class instances satisfy the condition to be considered instances of the remote class, since they can provide meaningful values for all of the complementary attributes imposed by the remote class. Moreover, the remote class instances have meaningful local class key values. However, some of the attribute
values other than the local class key do not make sense for the remote class instances. In other words, the remote class must be a superclass of the local class. Similar to Case 2, the attributes of the local class which make sense in the remote class are detached from the local class definition and attached to the remote class definition. The remote class is made a superclass of the local class. Since the local class key makes sense for the remote class, it is also moved to the remote class, leaving the new local class without a class key. A new class key is created for the local class with the help of the database administrator. The old class key of the local class becomes the new class key of the remote class. The same process applies to the instance key if it is moved from the local class to the remote class. The next step is to build harmonizers between the subclasses of the local and remote classes.
In Case 4, the remote class instances can supply meaningful values for the class key attributes imposed by the local class. Nevertheless, they cannot provide meaningful values for some of the non-class key attributes. Being unable to provide meaningful values for some of the attributes imposed by the remote class, the local class instances cannot satisfy the condition to be regarded as remote class instances either. The relationship being sought is overlap between these classes. After user confirmation, a common superclass is created with the local class key and the attributes common to both the local and remote classes. Again, the class/instance keys of the local and remote classes are maintained as described previously. New harmonizers can be created between the subclasses of the new local class and the subclasses of the new remote class.
In Cases 5, 6, 7, and 8, the remote class instances cannot be regarded as new instances of the local class, since they do not contain meaningful values in their local class key attributes. However, if they contain meaningful values for all of the non-class key attributes imposed by the local class, and if the local class instances also contain meaningful values for all of the attributes imposed by the remote class (Case 5), then it can be inferred that the remote class is a superclass of the local class. All the attributes other than the local class key attributes are moved to the remote class definition.
Since the local class keeps its old class key attributes, there is no need to create a new class key for the local class. The remote class can also maintain its class key provided that it does not violate the class key integrity of the overall conceptual schema. Otherwise, a new class key is introduced for the remote class. New harmonizers between the subclasses of the old remote class and the subclasses of the local class can be constructed after Case 5 is handled.
Case 7 is almost the same as Case 5, again resulting in the remote class becoming a superclass of the local class. The difference is in the handling of instance keys. In Case 5, the instance key attributes of the local class may be moved to the remote class, requiring the creation of a new instance key for the evolved local class. In Case 7, however, no attribute is moved from the local class, since none of its attributes make sense for the remote class.
Case 6 and Case 8 result in irrelevance between the semantic peers, since they are unable to complement each other's structural and semantic aspects. In these cases, building new harmonizers between different local class-remote class combinations is encouraged.
Cases other than 6 and 8 require equivalent instances to be detected. Instance keys are consulted in this situation in order to represent the same instance only once in the database. In addition, the instances are adapted, implementation-wise, to the new definitions of the local and remote classes.
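The decision portion of the algorithm reduces to a lookup over the three observations gathered during the testing period. The following C++ sketch is our own rendering of Figure 36 (the enumerator names are hypothetical); Cases 5 and 7 yield the same relationship and differ only in the instance-key handling described above:

enum class Relationship {
    Equivalent,         // Case 1
    RemoteIsSubclass,   // Case 2
    RemoteIsSuperclass, // Cases 3, 5, 7
    Overlap,            // Case 4
    Irrelevant          // Cases 6, 8
};

// keyOk:    remote instances supply meaningful local class key values.
// nonKeyOk: remote instances supply meaningful local non-class-key values.
// remoteOk: local instances supply meaningful remote attribute values.
Relationship decide(bool keyOk, bool nonKeyOk, bool remoteOk) {
    if (keyOk) {
        if (nonKeyOk)
            return remoteOk ? Relationship::Equivalent         // Case 1
                            : Relationship::RemoteIsSubclass;  // Case 2
        return remoteOk ? Relationship::RemoteIsSuperclass     // Case 3
                        : Relationship::Overlap;               // Case 4
    }
    // Without meaningful local class key values (Cases 5-8), only a
    // superclass relationship can still be established.
    return remoteOk ? Relationship::RemoteIsSuperclass         // Cases 5, 7
                    : Relationship::Irrelevant;                // Cases 6, 8
}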
4.3.3.4 Example
In order to test the relevance of the local and remote classes researchData and Physiology_Data, respectively, we built a harmonizer between these two classes in the previous section. During the testing period of this harmonizer, the remote domain expert tries to supply values for the subject and experiment attributes of the local class for the characteristic subset of the remote class. Conversely, the local domain expert provides values for the authors, neurons, and annotations attributes of the remote class for the characteristic subset of the local class.
While the remote domain expert succeeds in providing meaningful values for the attributes subject and experiment, the local domain expert, being unable to find meaningful remote class attribute values for some instances in the characteristic subset of the local class, provides inapplicable nulls for these attribute values. Therefore, evaluation of the harmonizer results in a superclass relationship between the local class and the remote class. The resulting conceptual schema is shown in Figure 37. For the intermediate steps, the reader is encouraged to refer to Section 5.3.
Figure 37: Final Local Conceptual Schema after Semantic Evolution Phase.
Chapter 5
Experimental Prototype Implementations
Implementation of a full-fledged mechanism which realizes our approach requires two components. First, translators that transform database schemas from various data models to HSDM should be developed. Second, a database tool which manages the interactions between HSDM-managed databases should be constructed. Since there have already been many efforts in the first area, we have focused on developing a tool which realizes the Schema Implantation and Semantic Evolution approach in a CFDBS where all the component database systems employ HSDM as their data model. This database tool, which was built for prototyping purposes, is named the HSDM Mediator.

The HSDM Mediator was obtained in two consecutive phases. As HSDM is an extension of PDM (Personal Data Manager) [53], we first implemented a database tool which allows modeling, representation, and management of information units using the PDM data model. Customizing the software which was used to implement this database tool, we then implemented the HSDM Mediator as the next phase. Currently, both programs are operational. The PDM software has been tested by a small group of students. The HSDM Mediator software, on the other hand, has been tested, and it requires minor code tuning to maximize its memory and run-time performance. In this chapter, we will present these two phases, focusing on the functionality of the HSDM Mediator with regard to information sharing and exchange between component database systems in a CFDBS.
5.1 PDM Implementation
From February 1995 to August 1995, we implemented a simple, easy-to-use, object-based DBMS, called PDM, on top of ObjectStore [62] within the context of the Human Brain Project [84]. The resulting software was tested by a group formed within the CS586 class during the Spring 1996 semester at USC. PDM (Personal Data Manager) is based on a simple semantic data model proposed by McLeod and Lyngbaek [53]. It has a friendly user interface in which users can browse the conceptual schema, kind1 definitions, and individual instances. It also supports individual data manipulation operations such as insert, modify, and delete.
PDM supports a basic naming scheme according to which every object in the database has a character-string representation for identification purposes. This naming scheme can be seen as a primitive form of the object identifiers in object-based data models. Users can browse/create/manipulate PDM conceptual schemas and PDM databases by means of an important construct, namely the Working Kind. The Working Kind is like a cursor in relational DBMSs. It can be bound to a number of instances of an object kind. Most of the operations, including the data manipulation operations, work relative to the working kind.
The PDM software has been written in C++ with embedded ObjectStore [62] statements for database-related functionality and embedded "curses" library function calls [80] for user-interface-related functionality. It consists of 8093 lines of code. We used ObjectStore as the storage subsystem in implementing PDM. The choice of ObjectStore as the storage subsystem was very beneficial during the implementation of PDM.
As the HSDM data model subsumes PDM in terms of functionality, we will not specify the details of the database tool that realizes the PDM data model.

1. A class in object-oriented data models is called an object kind in PDM.
5.2 Implementation of the HSDM Mediator
The HSDM Mediator has two functions within a CFDBS. First, it functions as a database modeling tool by enabling information units to be modeled and maintained in an HSDM database. Second, it allows HSDM databases to interoperate with each other under the principles of our approach. Like the PDM software, the HSDM Mediator software has been written in C++ with embedded ObjectStore statements for database-related functionality and embedded "curses" library function calls for user-interface-related functionality. It consists of 16650 lines of code. We used ObjectStore as the storage subsystem in implementing the HSDM Mediator as well.
In order to implement a prototype tool for HSDM which also realizes our methodology, we customized the PDM software. This was achieved by implementing the capabilities HSDM supports but PDM lacks. In particular, we implemented the following capabilities of HSDM and integrated them with the already existing PDM software. PDM does not offer any means to clarify schema semantics. HSDM, in contrast, allows the specification of complementary attributes, class keys, and instance keys. Therefore, we wrote code that enables classes to have instance and class keys, and that allows the attachment of complementary attributes to classes and attributes. For explicit information unit semantics, we implemented attribute-defined and predicate-defined subclass mechanisms as well. Code required for interoperation purposes, such as code for importing remote classes and instances, and code for harmonizer construction and maintenance, was produced and added to the software. Finally, the schema evolution primitives we need to restructure conceptual schemas were added to the HSDM Mediator software.
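To give an idea of the granularity of these evolution primitives, the following is a minimal C++ sketch of one of them: detaching an attribute from one kind and attaching it to another, as required when harmonizer evaluation moves shared attributes between semantic peers. All names here are hypothetical; the prototype performs the corresponding steps against ObjectStore-managed structures:

#include <algorithm>
#include <string>
#include <vector>

struct AttributeDef { std::string name; std::string domainKind; };

struct Kind {
    std::string name;
    std::vector<AttributeDef> attributes;
};

// Evolution primitive: detach attribute `attr` from kind `from` and
// attach it to kind `to`. The real tool migrates the corresponding
// instance values in the same step.
bool moveAttribute(Kind& from, Kind& to, const std::string& attr) {
    auto it = std::find_if(from.attributes.begin(), from.attributes.end(),
                           [&](const AttributeDef& a) { return a.name == attr; });
    if (it == from.attributes.end()) return false;  // nothing to move
    to.attributes.push_back(*it);
    from.attributes.erase(it);
    return true;
}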
We used ObjectStore in order to implement the HSDM Mediator. Mapping PDM constructs into ObjectStore constructs was easy since ObjectStore is an object-oriented DBMS, and it supports many of the key object-oriented notions. For the same reason, we envisioned that the implementation of HSDM would be easy as well, which proved true while implementing the HSDM Mediator.
The choice of a real DBMS as a storage subsystem plays an important role in implementing the desired database management system in terms of the time and implementation effort required. Relational DBMSs, which were widely used in implementing object-oriented DBMSs, do not support most of the notions offered by object-based data models. Consequently, implementing an object-based DBMS on top of a relational DBMS as a storage subsystem necessitates building intermediate data structures in order to map the individual capabilities offered by the object-based DBMS but not supported directly by the relational DBMS. The choice of ObjectStore as the storage subsystem in implementing both PDM and the HSDM Mediator was very beneficial since ObjectStore can directly express many of the notions seen in PDM and HSDM. For example, creating and dropping class definitions as well as instances, maintenance of the class hierarchy, and object identifiers are all directly supported by ObjectStore. For this reason, transforming PDM features into ObjectStore features became a trivial task during the implementation. For the same reason, we implemented the HSDM Mediator by using ObjectStore facilities. In most cases, implementing a primitive or a structure of HSDM was simply a matter of finding a corresponding way to express it in ObjectStore.
The specific reasons why ObjectStore was the right choice for our prototypes are the following. ObjectStore provides many features which can be used directly to implement individual HSDM constructs. Creating/dropping databases, classes, or instances; accessing/updating the attribute definitions of a class; multivalued and inverse data members (attributes); an automatically maintained inheritance hierarchy; accessing/updating attribute values; null values; collection, set, list, bag, and array data structures and associated operators; cursors; transactions; queries; and indexes are all directly supported by ObjectStore. For example, maintaining a class hierarchy for an HSDM schema became unnecessary since ObjectStore already does that. We did not have to worry about the atomicity of data manipulation operations since ObjectStore provides a transaction mechanism for this purpose. More importantly, these features could be activated dynamically during
the program execution by means of the Meta Object Protocol (MOP) of ObjectStore. MOP is a set of meta-class definitions and associated functions which are callable at run time. All of the capabilities of ObjectStore mentioned earlier can be used dynamically by activating the corresponding MOP function calls at run time. Another advantage of using ObjectStore as the storage subsystem in implementing our prototype is the Schema Evolution Facility of ObjectStore, which was indispensable for our prototype. Classes as well as object instances must be restructured by evolution primitives during Semantic Evolution. ObjectStore provides a schema evolution facility, which was utilized by the evolution primitives in HSDM. The only disadvantage of using ObjectStore was its initial learning period. A lack of programmer resources in implementing the software forced us to write almost all of the code ourselves.
Figure 38 shows the implementation architecture of our approach to the folding
problem, where the local component communicates with the remote component via the
“Implant DB” operation and via requests for complementary information.
• " L o ca l C o m p o n e n t R e m o te C o m p o n e n t
I
The HSDM
Mediator
Implant DB
The HSDM
Mediator
remote meta-
ObjectStore ObjectStore
aj>i
1 1 — V
local
I 1
remote
database
1 1
database
1 1
Figure 38: Implementation Architecture.
1 i
MBRonnJ
5.2.1 The HSDM Mediator
We will present the functionality of this database tool by describing the individual operators it provides. The Mediator's functionality is divided into five categories, which correspond to five submenus in the user interface: the Database, Browse Schema, Working Kind, Manipulate DB, and Interoperate submenus (Figure 39 and Figure 40).
[Figure 39 depicts the Main Menu branching into the five submenus: database operations lead to the Database submenu, meta-data browsing operations to the Browse Schema submenu, working kind operations to the Working Kind submenu, database manipulation operations to the Manipulate DB submenu, and interoperability operations to the Interoperate submenu.]

Figure 39: Functionality of the HSDM Mediator.
Figure 40 shows the HSDM Mediator's main menu. While individual submenus are chosen using the numbers 1 through 5, or the left and right arrows, operations within submenus are selected using the up and down arrows.
1) Database  2) Browse Schema  3) Working Kind  4) Manipulate DB  5) Interoperate
Enter Selection:

Figure 40: The HSDM Mediator Main Menu.
5.2.1.1 The Database Submenu
Create
Open
Close
Status
Destroy
Quit

Figure 41: The Database Submenu.
The Database submenu (Figure 41) contains menu items which operate on databases. Create creates an HSDM database with no instances. Open opens an existing database and initializes the class hierarchy. Users can open a database for reading, or for reading and writing. Similarly, Close closes the database being processed. There can be only one open database at a time. Status displays the current database along with some information about it, such as its path name and open mode (read only, read and write). Destroy deletes an HSDM database. Finally, Quit terminates the HSDM Mediator.
5.2.1.2 The Browse Schema Submenu
Current Kind Tree
First Kind Tree
Next Kind Tree
Prev. Kind Tree
Kind Location
Display Kind Def

Figure 42: The Browse Schema Submenu.
The Browse Schema submenu (Figure 42) enables users to browse the kind hierarchy as well as kind definitions. Indentation, highlighting, and underlining techniques are used to display kind hierarchies on the screen. As mentioned before, an HSDM class hierarchy consists of a set of kind trees, which are the direct subtrees of the "Object" kind. Displaying the kind hierarchy is achieved by displaying each kind tree one by one. Current Kind Tree displays the kinds, their attribute definitions, and the ISA relationships
within the current kind tree. First Kind Tree makes the first kind tree in the kind hierarchy the current one. Similarly, the Next Kind Tree and Prev. Kind Tree operations update the current kind tree as their names suggest, which is necessary for traversing the kind hierarchy. The Kind Location operation displays the kind tree where a specified kind is located. The last operation in this category, Display Kind Def, presents detailed information about a kind, including its class and instance keys, its attribute definitions (both kind-specific and inherited ones), and whether this kind is attribute defined or predicate defined.
5.2.1.3 The Working Kind Submenu
Initialize
Expand
Restrict
Remove
Map
Name
Display
Print
Save
Display Context

Figure 43: The Working Kind Submenu.
The third submenu in the user interface of the HSDM Mediator is the Working Kind submenu, whose operations are shown in Figure 43. This submenu enables the manipulation of individual instances. All of the instances in an HSDM database are manipulated by means of the working kind, which is a temporary structure that holds a set of instances of a particular kind. The working kind has an associated permanent kind.

Initialize binds the working kind to a kind. The working kind can be initialized with an empty set of instances, all instances, instances satisfying a predicate, or explicitly enumerated instances of the specified kind. The Expand operation expands the instances of the working kind, again with either of the following: an empty set of instances, all instances of a kind, instances of a kind satisfying a predicate, or explicitly enumerated instances of a kind. The expanding kind may be permanent or temporary (a temporary kind is basically a named instance of the working kind). The associated permanent kind of the working kind is updated to be
the closest common superkind of the working kind's associated permanent kind and the expanding kind. The Restrict operation is used to select some instances of the working kind and exclude the others. Once again, the working kind can be restricted with an empty set of instances, a set of instances of a kind satisfying a predicate, all instances of a kind, or instances of a kind explicitly stated by the user. The associated permanent kind of the working kind is not affected by this operation. Remove is almost the same as Restrict; the only difference is that instead of restricting, Remove excludes the qualified instances from the working kind. Map projects the instances in the working kind onto an attribute of the associated permanent kind of the working kind. The domain kind of the specified attribute becomes the new associated permanent kind, and the new working kind includes the attribute values of the old instances. The Name operation obtains a named state of the working kind at a particular time; consequently, a temporary kind is created to hold all instances of the working kind. Display, Print, and Save operate in an identical fashion. They all produce a formatted or free-format listing of all the instances contained in the working kind: Display displays them on the screen, Print prints them out, and Save saves them in a text file. If the user specifies a desired set of attributes, only those attribute values are output. The last operation in this submenu is Display Context, which provides detailed information about a working kind: the associated permanent kind and its definition are displayed along with the temporary subkinds of the working kind.
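The working kind operations can be pictured as ordinary set manipulations over instance references. Below is a minimal C++ sketch of Initialize, Expand, Restrict, and Remove; the names are ours, and the real tool operates on ObjectStore collections with an associated permanent kind rather than on plain strings:

#include <functional>
#include <iterator>
#include <set>
#include <string>

// Instances are identified by character-string names, following the
// PDM/HSDM naming scheme.
using Instance = std::string;
using Predicate = std::function<bool(const Instance&)>;

class WorkingKind {
    std::set<Instance> members_;
public:
    // Initialize: bind the working kind to a given set of instances.
    void initialize(const std::set<Instance>& instances) { members_ = instances; }

    // Expand: add the instances of another kind that satisfy a predicate.
    void expand(const std::set<Instance>& kind, const Predicate& p) {
        for (const auto& i : kind)
            if (p(i)) members_.insert(i);
    }

    // Restrict: keep only the instances satisfying the predicate.
    void restrict(const Predicate& p) {
        for (auto it = members_.begin(); it != members_.end(); )
            it = p(*it) ? std::next(it) : members_.erase(it);
    }

    // Remove: the complement of Restrict -- exclude the qualified instances.
    void remove(const Predicate& p) {
        restrict([&](const Instance& i) { return !p(i); });
    }

    std::size_t size() const { return members_.size(); }
};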
5.2.1.4 The Manipulate DB Submenu
Create Kind
Create Object
Restore
Delete
Modify
Add

Figure 44: The Manipulate DB Submenu.
This submenu includes operations that change the database state in a permanent manner (Figure 44). The first operation, Create Kind, creates a new user kind and adds it to the class hierarchy. Create Instance creates an instance of the associated permanent kind of the working kind. Restore loads instances from a text file and creates them in the database as instances of the associated permanent kind of the working kind. Delete permanently removes the qualified instances from both the working kind and the database. Modify updates attribute values of the instances in the working kind; these updates are also propagated to the database. Lastly, Add adds the instances in the working kind as new instances of a specified kind. If the specified kind does not exist, a new kind is created and initialized with these instances.
5.2.1.5 The Interoperate Submenu
Implant DB
All Remote Kinds
Create Harmonizer
All Harmonizers
Harmonizer Status
Delete Harmonizer
Harm. Load Data
Harm. Enter Data
RHarm. Enter Data
Eval. Harmonizer

Figure 45: The Interoperate Submenu.
This submenu (Figure 45) enables HSDM-managed components to interoperate. Implant DB is the first operation. Users use this operation to implant remote databases into their local databases. The operation implants a remote database into the local database by importing the class hierarchy, kinds, and instances of the specified remote database. In our prototype, the remote database is specified as a path name, simulating the remote situation.
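A sketch of what Implant DB does, in C++ with entirely hypothetical helper names (the prototype performs these steps against ObjectStore databases, importing the kinds before their instances):

#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the prototype's remote-access machinery.
struct RemoteKind { std::string name; std::vector<std::string> instances; };

std::vector<RemoteKind> readRemoteSchema(const std::string& /*dbPath*/) {
    // Stub: in the prototype, the remote database is given as a path name.
    return {{"Physiology_Data", {"p1", "p2"}},
            {"Time_Series_Data", {"t1"}},
            {"Histogram_Data", {"h1"}}};
}

// Implant DB: import the remote kinds first (not yet tied into the
// local hierarchy), then the individual instances of each kind.
void implantDB(const std::string& remoteDbPath) {
    for (const RemoteKind& k : readRemoteSchema(remoteDbPath)) {
        std::cout << "importing kind " << k.name << "\n";
        // ... create the kind in the local database ...
        std::cout << "importing " << k.instances.size()
                  << " instances of " << k.name << "\n";
        // ... create local copies of the instances ...
    }
}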
The All Remote Kinds operation displays the imported remote kinds which have not yet been tied into the local class hierarchy. The databases these remote kinds come from, and any harmonizers in which they are involved, are also displayed.
The remaining operations are for harmonizer processing. Using the Create Harmonizer operation, users specify hypothetical relevances (harmonizers) between local and imported remote kinds. The specified relevances are subject to investigation. With the construction of a harmonizer, the characteristic subset of the remote kind, along with the semantics of the complementary attributes introduced by the local kind, is enclosed in a text file and sent to the remote component. Since we have not implemented communication primitives, the remote site is assumed to be the same as the local site.
All Harmonizers displays all the harmonizers under investigation. Harmonizer Status, on the other hand, displays information about a specific harmonizer, such as its name, its associated local kind, its associated remote kind, and its status (waiting for local and remote data, waiting for local data, waiting for remote data, ready to be evaluated, and evaluated).
Harmonizers are persistent structures in our prototype. They should be disposed of once they have performed their functions (e.g., after they are evaluated). Delete Harmonizer is used for this purpose.
The Harm. Enter Data operation enables the local domain expert to enter complementary attribute values for the characteristic subset of the local kind. After this operation, the status of the harmonizer changes either from "waiting for local and remote data" to "waiting for remote data", or from "waiting for local data" to "ready to be evaluated".
Using the RHarm. Enter Data operation, the remote domain expert provides complementary attribute values for the characteristic subset of the remote kind. The supplied values are sent back to the local component and accumulated in a spool directory. The Harm. Load Data operation extracts these values from the spool and moves this information to the database where the information about the corresponding harmonizer is kept. After this operation, the harmonizer status changes either from "waiting for local and remote data" to "waiting for local data", or from "waiting for remote data" to "ready to be evaluated".
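The status values above form a small state machine driven by two events: local data entered (Harm. Enter Data) and remote data loaded (Harm. Load Data). A C++ sketch of the transitions just described, with names of our own choosing:

enum class HarmonizerStatus {
    WaitingLocalAndRemote,  // just constructed
    WaitingLocal,           // remote data arrived first
    WaitingRemote,          // local data entered first
    ReadyToEvaluate,
    Evaluated
};

// Transition taken when Harm. Enter Data supplies the local data.
HarmonizerStatus onLocalData(HarmonizerStatus s) {
    switch (s) {
        case HarmonizerStatus::WaitingLocalAndRemote:
            return HarmonizerStatus::WaitingRemote;
        case HarmonizerStatus::WaitingLocal:
            return HarmonizerStatus::ReadyToEvaluate;
        default:
            return s;  // no change in the other states
    }
}

// Transition taken when Harm. Load Data loads the remote data.
HarmonizerStatus onRemoteData(HarmonizerStatus s) {
    switch (s) {
        case HarmonizerStatus::WaitingLocalAndRemote:
            return HarmonizerStatus::WaitingLocal;
        case HarmonizerStatus::WaitingRemote:
            return HarmonizerStatus::ReadyToEvaluate;
        default:
            return s;
    }
}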
When a harmonizer is in the "ready to be evaluated" state, the Eval. Harmonizer operation can be activated. As a result, the accumulated information is analyzed, and a suggestion
is made regarding the possible relationship between the associated local kind and the associated remote kind of this harmonizer. After the user is prompted with the consequences of the suggested action, he/she is expected to confirm this suggestion. When the user confirms the suggested action, individual schema evolution primitives are activated to establish the suggested relationship between the associated local and remote kinds of the harmonizer. The harmonizer can be deleted after this operation.
In our prototype system, the schema evolution primitives are internal to the Eval. Harmonizer operation; they are not directly accessible to users.
5.3 Experimental Results
In this section, we will illustrate the usage of the HSDM Mediator on a sample sharing scenario that was used throughout this dissertation. The local and remote database schemas were previously shown in Figure 21 in Section 4.3.1. We would like to enable information sharing and exchange between these local and remote database components by making information units accessible and available from/within the local database. In particular, we will show how the remote database is implanted into the local database and how the local and remote knowledge about a harmonizer is acquired.
Figure 46 and Figure 47 on the following pages correspond to the local and remote database schemas of Figure 21 in Section 4.3.1. They are created in the HSDM data model, and the corresponding local and remote databases are populated with a handful of instances. In order to make remote information units available within the local database, we first implant the remote data and meta-data into the local database. Then, we construct a harmonizer to test the relevance of the local kind researchData and the remote kind Physiology_Data. The HSDM Mediator manages the process of accumulating local and remote knowledge about this harmonizer. After the HSDM Mediator suggests the degree of relevance between the local and remote kinds, it evolves the local database so that the suggested relationship between these two kinds is realized.
researchData
  subject : P_STRING
  experiment : experiment

(a) The First Kind Tree.

experiment
  name : M_STRING
  startDate : P_STRING
  contactPerson : person
  description : P_STRING

(b) The Second Kind Tree.
person
  title : P_STRING
  name : P_STRING
  phone : N_STRING
  address : P_STRING

(c) The Third Kind Tree.
Figure 46: Local Conceptual Schema in HSDM.
Figure 46 shows the local schema in HSDM. Name and domain kind information are displayed for the attributes. The kind trees shown in Figure 46 have been displayed by means of the First Kind Tree and Next Kind Tree operations of the HSDM Mediator. As can be seen from the figure, the local schema consists of three user-defined kind trees. Note that the predefined kind trees are not shown. ISA relationships between kinds are indicated by indentation. For detailed information about a specific kind and its attributes, we need to use the Display Kind Def operation of the HSDM Mediator.
Figure 46 (a) shows the first kind tree of the local kind hierarchy. The researchData kind, along with its attributes, is displayed as the first kind tree. Figure 46 (b) shows the second kind tree, which includes the experiment kind as the only kind in this tree. Finally, Figure 46 (c) presents the third kind tree. The person kind is the only kind in the third kind tree.
Physiology_Data
  authors : { P_STRING }
  neurons : { P_STRING }
  annotations : { P_STRING }
  (its subkinds Time_Series_Data and Histogram_Data, shown indented below it,
  define the attributes number_of_traces : INTEGER, trace_ids : { INTEGER },
  datasets : { P_STRING }, data_points : { FLOAT }, and scaling : INTEGER)

Figure 47: Remote Conceptual Schema Portion in HSDM.
Figure 47 shows the remote schema in HSDM. A domain kind enclosed in curly brackets means that the attribute is multi-valued. For example, the authors attribute of Physiology_Data is multivalued; it can hold more than one reference to author objects as its value. Since the remote schema has only one user-defined kind tree, it sufficed to use the First Kind Tree operation in this case.
The Time_Series_Data and Histogram_Data kinds are subkinds of Physiology_Data. As can be observed, indentation is used to represent the ISA relationships on the screen; subkinds are displayed at different indentation levels, and sibling kinds are shown at the same indentation level.
Figure 48 on the following page illustrates the implantation process. When the Implant DB operation is activated, the mediator asks for the name of the remote database to be implanted (Figure 48 (a)). The remote kinds (Physiology_Data, Time_Series_Data, and Histogram_Data) are imported first. As the next step, the individual instances of the imported remote kinds are imported as part of the implantation process (Figure 48 (b)).
Remote Database Name: remote
importing kind Physiology_Data ... imported
importing kind Time_Series_Data ... imported
importing kind Histogram_Data ... imported

(a) Implanting Remote Kinds.

importing instances of Physiology_Data ... 18/18 instances are imported
importing instances of Time_Series_Data ... 6/6 instances are imported
importing instances of Histogram_Data ... 6/6 instances are imported

(b) Implanting Remote Instances.

Figure 48: Implanting Remote Schema into the Local Environment.
Figure 49 shows the resulting local schema implanted with the remote kinds. Note that the implanted remote schema is superimposed on the local schema, but no relationship exists between the two at this stage.
Physiology_Data
  authors : { P_STRING }
  neurons : { P_STRING }
  annotations : { P_STRING }
  (with its subkinds and their attributes: data_points : { FLOAT }, scaling : INTEGER,
  number_of_traces : INTEGER, trace_ids : { INTEGER }, datasets : { P_STRING })
(a) The First Kind Tree.
researchData
  subject : P_STRING
  experiment : experiment
(b) The Second Kind Tree.
experiment
  name : M_STRING
  startDate : P_STRING
  contactPerson : person
  description : P_STRING
(c) The Third Kind Tree.
person
  title : P_STRING
  name : P_STRING
  phone : N_STRING
  address : P_STRING
(d) The Fourth Kind Tree.
Figure 49: Local Schema Implanted with Remote Schema Portion.
Harmonizer Name : harmonizer1
Associated Local Kind : researchData
Associated Remote Kind : Physiology_Data
Status : waiting for local and remote data
Figure 50: Harmonizer1.
Next, we built a harmonizer called harmonizer1 (Figure 50). harmonizer1 will test the relevance of the local kind researchData and the imported remote kind Physiology_Data. In this figure, the status of harmonizer1 is shown just after it is constructed, which is "waiting for local and remote data". With the construction of this harmonizer, requests for local and remote data are sent to both the local and remote components.
Figure 51 on the following pages illustrates how the local data about this harmonizer is acquired from a local domain expert. In Figure 51 (a), an explanation screen is displayed to help the domain expert while providing complementary values. Figure 51 (b) shows how the local domain expert provides attribute values for the characteristic subset of the local kind. In this case, the local kind has only one instance in its characteristic subset (dat1), and the local domain expert could not provide any meaningful values for any of the attributes imposed by the remote kind Physiology_Data. Therefore, the local domain expert enters "nullx" for these attribute values. The provided values are kept in a data structure in the local database for further processing.
Additional information is required for this harmonizer. Please provide meaningful attribute values for the local instances specified below. A meaningful value refers to a value which is in the domain kind of the attribute. You can also provide null? if the attribute makes sense for the instance, but you do NOT know the value for that instance. Or you can provide nullx if the attribute does not make sense for the instance at all. If you know some attribute equivalences, you can specify the local attribute name preceded by a designated sign as the value of the remote attribute. You can only specify attribute equivalences for the first instance.
(a) Explanations.
Object Name : dat1
Attribute Name : authors
Value : nullx
Attribute Name : neurons
Value : nullx
Attribute Name : annotations
Value : nullx
(b) Entering Data.
Figure 51: Acquiring Local Data for Harmonizer1.
Acquiring remote data about harmonizer1 is illustrated in Figure 52. First, Figure 52 (a) shows the explanations displayed for the remote domain expert. In Figure 52 (b), semantic information about the complementary local attributes is displayed on the screen in order to help the remote domain expert gain a good understanding of the local kind and the complementary attribute semantics. For example, attribute descriptors, kinds of attributes, domain kinds, and domain descriptors are displayed. In Figure 52 (c) through Figure 52 (e), the remote domain expert tries to provide values for these new attributes imposed by the local kind researchData. After these values are entered, they are packed in a text file and sent back to the local database. The local database receives these values and accumulates them in a directory.
(The same explanation screen shown in Figure 51 (a) is displayed for the remote domain expert.)
(a) Explanations.
SEMANTICS OF ATTRIBUTES FOR WHICH VALUES ARE REQUESTED
KIND NAME : Physiology_Data

Attribute Name : subject
Attribute Description : animal on which the experiment was performed
Attribute Kind : animal
Domain Kind : P_STRING
Domain Description : Simple Kind (values consist of conventional strings)

Attribute Name : experiment
Attribute Description : experiment code from which this data is obtained
Domain Kind : experiment
Domain Description : experiments performed on subjects at USC NeuroScience
(b) Explanations (continued).
Object Name : (first remote instance)
Attribute Name : subject
Value : rat
Attribute Name : experiment
Value : null?
(c) The First Instance in the Characteristic Subset of the Remote Kind.
Object Name : (second remote instance)
Attribute Name : subject
Value : monkey
Attribute Name : experiment
Value : null?
(d) The Second Instance in the Characteristic Subset of the Remote Kind.
Object Name : (third remote instance)
Attribute Name : subject
Value : rabbit
Attribute Name : experiment
Value : null?
(e) The Third Instance in the Characteristic Subset of the Remote Kind.
Figure 52: Acquiring Remote Data for Harmonizer1.
The accumulated remote data is loaded into the local database by the Harm. Load Data operation (Figure 53). As a result, the remote data is extracted from the directory and stored in the data structure assigned for harmonizer1. Both Harm. Load Data and Harm. Enter Data alter the status of harmonizer1. In our example, Harm. Enter Data changes harmonizer1's status from "waiting for local and remote data" to "waiting for remote data". After Harm. Load Data is activated, it changes the state to "ready to be evaluated," and harmonizer1 becomes ready to be evaluated by the Eval. Harmonizer operation.
Figure 53: Loading the Acquired Remote Data.
Figure 54 on the following pages illustrates the evaluation of harmonizer1. The mediator first makes an analysis based on the accumulated knowledge about harmonizer1. It then suggests a relationship between the semantic peers, which is a SUBKIND relationship in Figure 54 (a). The actions suggested by the mediator are also displayed.
After user confirmation, the consequences of the suggested actions are displayed (Figure 54 (b) and Figure 54 (c)). Once again, the user is expected to confirm. If the user agrees, the local schema is evolved so that it reflects the suggested relationship and suggested
actions. In our example in Figure 54 (d), the remote kind Physiology_Data is made a subkind of the local kind researchData. Finally, the corresponding instances are adjusted.
Associated Local Kind : researchData
Associated Remote Kind : Physiology_Data
Suggested Relationship : SUBKIND
Suggested Actions : 1-) Make remote kind another subkind of the local kind.
                    2-) Delete this harmonizer.
                    3-) Create a new harmonizer between the subkinds of local kind.
Would you like to perform action #1 ? yes
(a) Evaluation Results.
Local Kind Definition will include the following Attributes
  subject
  experiment
(b) Suggestions.
Remote Kind Definition will include the following Attributes
  authors
  neurons
  annotations
Are you sure you want to do this ? yes
(c) Suggestions (continued).
Creating Local Instances for the Remote Kind
Evolving the Remote Kind
Tuning Up the Local Instances
(d) Evolution.
Figure 54: Harmonizer Evaluation.
researchData
  subject : P_STRING
  experiment : experiment
  Physiology_Data
    authors : { P_STRING }
    neurons : { P_STRING }
    annotations : { P_STRING }
    (subkind attributes: data_points : { FLOAT }, scaling : INTEGER,
    number_of_traces : INTEGER, trace_ids : { INTEGER }, datasets : { P_STRING })
(a) The First Kind Tree.
experiment
  name : M_STRING
  startDate : P_STRING
  contactPerson : person
  description : P_STRING
(b) The Second Kind Tree.
person
  title : P_STRING
  name : P_STRING
  phone : N_STRING
  address : P_STRING
(c) The Third Kind Tree.
Figure 55: Final Schema after Harmonizer1 is Evaluated.
Finally, Figure 55 shows the final local schema after harmonizer1 is evaluated. It conforms to the final conceptual schema that we foresaw in Figure 37 in Section 4.3.3.4.
Chapter 6
Conclusions
The objective of this research has been to develop and implement an information sharing and exchange mechanism that allows database systems to interoperate in a cooperative federated database system (CFDBS). We have achieved our goal by providing an efficient and realistic solution to the folding problem, which is a uniform solution to the problems of (partial) schema integration, semantic heterogeneity resolution, and schema customization in CFDBS environments. Clarification of information unit semantics, meta-data implantation, and stepwise evolution techniques constituted the parts of the solution. In particular, we have emphasized hypothetical processing (hypothesizing that two classes are related via equivalence, specialization, generalization, or overlapping relationships, and trying to prove that such a hypothesis holds for a small but uniform subset of instances), incremental acquisition of the knowledge required to fold a remote conceptual schema into a local conceptual schema, and incremental placement of remote schema elements into the local hierarchy. We have implemented a prototype database tool called the HSDM Mediator, which realizes our Schema Implantation and Semantic Evolution solution to the folding problem.
In this chapter, we will present the anticipated contributions of our solution to database research by comparing it with previous approaches that are applicable to the folding problem. We will also outline future research directions.
6.1 Contributions
One of the observations we have made throughout our research is the difficulty of maintaining and/or agreeing on global structures in a CFDBS environment. Approaches assuming such global structures suffer from a heavy initial exchange effort, which is the time required to come up with these global structures. Approaches requiring no global structures, on the other hand, bear a high exchange effort, which is the time spent during the integration of information units. The aim of our research has been to balance these two conflicting parameters.
While we compare individual approaches using qualitative criteria for the initial exchange effort, we use a quantitative measure to compare their exchange effort. In order to compare individual approaches on their exchange effort, we need to define quantitative measures on conceptual schemas which indicate how much effort is necessary to achieve total integration with another schema. An implanted conceptual schema has a number of parameters on which we can build our quantitative measures. Figure 56 describes these parameters.
m : number of (abstract) remote classes for which appropriate places in the local class hierarchy are found.
n0 : number of (abstract) local classes before the implantation process.
m0 : number of (abstract) remote classes before the implantation process.
ilj : number of instances in the local class lj, j=1,..,n0.
iri : number of instances in the remote class ri, i=1,..,m0.
clj : number of instances in the characteristic subset of the local class lj, j=1,..,n0.
cri : number of instances in the characteristic subset of the remote class ri, i=1,..,m0.
pi : number of local classes whose relevance is tested with the remote class ri, i=1,..,m0.

Figure 56: Parameters of an Implanted Conceptual Schema.
In Figure 56, n0 and m0 are the total number of abstract local classes and the total number of abstract remote classes, respectively. They are constant throughout the exchange
process. m is the number of integrated abstract remote classes, and it changes during the semantic evolution phase. Whenever we find an appropriate place for a remote class, we increment m by one, or by the number of subclasses of the remote class plus one. When m becomes m0, the remote conceptual schema is totally integrated with the local conceptual schema. Given a remote class ri, the number of instances of ri is represented by iri. Similarly, ilj represents the number of instances of a local class lj. cri and clj are the number of instances in the characteristic subset of a remote class ri and the number of instances in the characteristic subset of a local class lj, respectively. When a class is involved in a harmonizer, complementary information must be supplied for the instances in the characteristic subset of the class.
The first quantitative measure we define is the Integrity Factor (IF), which is defined on an implanted conceptual schema. It measures how much the local and remote classes are integrated:

IF = (m + n0) / (m0 + n0)
The Integrity Factor (IF) gives an idea about how much of the implanted conceptual schema is completely integrated. The Integrity Factor at the beginning of the Semantic Evolution phase is relatively low, since no remote class is integrated with the local classes yet (m = 0). Whenever harmonizers suggest locations for remote classes, in which case m is incremented, IF becomes greater. The Integrity Factor for an implanted conceptual schema ranges between n0/(m0+n0) and 1.
The Integrity Factor (IF) gives only a rough idea about the effort required during the integration. Another quantitative measure, the Integration Effort (IE), provides more precise information in this sense. Integration Effort is defined on remote classes and on implanted databases.
The Integration Effort for a remote class ri is represented by ei, and it is calculated by the following formula:

ei = (pi x cri + SUM(j=1..pi) clj) / iri

where ei is the ratio of the total number of instances (local and remote) for which complementary attribute values are supplied to the total number of instances of the remote class ri.

The Integration Effort (IE) for an implanted database is the sum of the integration efforts of the implanted remote classes.
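For concreteness, both measures can be computed directly from the parameters of Figure 56. The following C++ sketch assumes a hypothetical per-remote-class record holding cri, iri, and the characteristic-subset sizes of the pi local classes tested against that remote class:

#include <numeric>
#include <vector>

// Per-remote-class parameters from Figure 56.
struct RemoteClassStats {
    int cr;               // size of the remote class's characteristic subset
    int ir;               // total number of instances of the remote class
    std::vector<int> cl;  // characteristic-subset sizes of the pi local
                          // classes tested against this remote class
};

// Integrity Factor: IF = (m + n0) / (m0 + n0).
double integrityFactor(int m, int n0, int m0) {
    return static_cast<double>(m + n0) / (m0 + n0);
}

// Integration Effort for one remote class:
// ei = (pi x cri + sum over j of clj) / iri.
double integrationEffort(const RemoteClassStats& r) {
    const int p = static_cast<int>(r.cl.size());
    const int supplied = p * r.cr + std::accumulate(r.cl.begin(), r.cl.end(), 0);
    return static_cast<double>(supplied) / r.ir;
}

// IE for the implanted database: the sum over the implanted remote classes.
double totalIntegrationEffort(const std::vector<RemoteClassStats>& rs) {
    double ie = 0.0;
    for (const auto& r : rs) ie += integrationEffort(r);
    return ie;
}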
We want the integration effort (IE) to be minimal for the purposes of practicality and efficiency. One way to achieve this is to supply complementary values for only the characteristic subset of an object class rather than for all of its instances. This minimizes the individual integration efforts of the remote classes (the ei's), leading to a smaller overall integration effort (IE) for the database. Another way is to keep the size of the characteristic subset as small as possible. The choice of a local class as the associated local class in a harmonizer also affects the Integration Effort (IE): imprecise guesses make the pi's greater, leading to a greater integration effort for a remote class.
The IE for our approach is proportional to the sum of the number of remote classes to be implanted and the number of local classes in the local class hierarchy, since in our approach the characteristic subsets of classes are of constant size (in our prototype, for example, one instance is selected from the subclasses of a class to form its characteristic subset).
Approaches assuming powerful multidatabase languages reach an IE of 0, provided that users know the relationships between local and remote classes. Because they expect too much from multidatabase users, however, those users spend a tremendous amount of time interrelating remote and local schemas. Approaches assuming global structures, such as semantic dictionary/ontology-like structures, require a high initial effort to come up with the global structure and a relatively reasonable IE due to the time required to consult these global structures. The global schema approach entails a very high initial integration effort and an almost zero exchange effort.
[Figure 57 rates four approaches -- the Global Schema Approach, Federated Databases with Semantic Dictionaries/Ontologies, Multidatabases with Multidatabase Languages, and Schema Implantation and Semantic Evolution -- on the five dimensions below, on a scale from very low to very high.]

GKR: Global Knowledge Required: amount of global knowledge that each component is supposed to have/maintain about other components.
GSM: Global Structures Maintained: amount of memory required globally in order to guide information sharing and exchange between components.
IIE: Initial Integration Effort: amount of effort each component has to put in before it can actually share and exchange information units with other components.
SRC: Staticity of Relationships between Concepts: degree of staticity of relationships between database elements in different components.
EE: Exchange Effort: amount of effort a component has to put in, in terms of both computing time and user involvement, during actual sharing and exchange.

Figure 57: Possible Solutions to the Folding Problem.
When considered within the context of the folding problem, each approach has its
pros and cons, which are summarized in Figure 57. For example, the Global Schema
Approach requires too much global knowledge from federation users during integration. It
requires a huge amount of space to store the global schema, and the initial integration
effort is prohibitively costly. Furthermore, the semantics and interrelationships of schema
elements are not dynamic because of the fixed nature of the global schema. Nevertheless,
the exchange effort is very low, since every possible sharing pattern is fixed and obvious in
the form of the global schema. The Schema Implantation and Semantic Evolution
approach, on the other hand, requires minimal global knowledge, minimal global structures,
and minimal initial integration effort. Interrelationships of schema elements in different
components are highly dynamic, since they are not tightly bound. However, it
necessitates a moderate exchange effort because of the need to acquire additional
attribute values for classes that are hypothesized to be related.
6.2 Future Work
The success of our approach in particular, and of any approach claiming to provide a
solution to the folding problem in general, depends on the amount of user interaction and
consultation required. Acquiring additional information for only the characteristic subset
of a class, rather than for all of its instances, is a direct result of this concern in our
approach. One cause of excessive user interaction in our scheme is choosing harmonizers
that relate irrelevant classes. Currently, our approach depends on user intuition in building
the initial harmonizers. It would be worthwhile to build a mechanism that analyzes the
data and meta-data of the remote and local databases and suggests potential harmonizers
to users. This would tremendously reduce the required user input.
Another shortcoming of our approach is that we depend only on the structural
properties of a class definition while investigating its relevance to a remote class. Methods
contribute to the semantics of a class as much as the structural components of its
definition. In this regard, an extension that also considers the behavioral properties
(methods) of class definitions would contribute to our work a great deal.
Multimedia data has become more and more important in recent years [85]. The class
identity and instance identity semantics that we have introduced should be revisited for
multimedia data (e.g., the problems of how two images are related and of whether two
video streams are equivalent should be addressed).
Still another interesting extension would be to study the implications of multiple
inheritance within this framework.
Finally, on the implementation side, a graphical user interface would be a worthwhile
improvement.
Reference List
[1] S. Abiteboul and A. Bonner, Objects and views, In Proceedings of ACM SIGMOD (1991) pages 238-247.
[2] H. Afsarmanesh and D. McLeod, The 3DIS: An extensible object-oriented information management environment, ACM Transactions on Office Information Systems 7,4 (1989) pages 339-377.
[3] H. Afsarmanesh, F. Tuijnman, M. Wiedijk and L. O. Hertzberger, Distributed schema management in a cooperating network of autonomous agents, In Proceedings of the 4th IEEE International Conference on Database Expert System Applications (1993).
[4] J. Andany, M. Leonard and C. Palisser, Management of schema evolution in databases, In Proceedings of the 17th International Conference on Very Large Data Bases (1991) pages 161-170.
[5] Y. Arens, C. Y. Chee, C. Hsu, H. In, C. A. Knoblock, Query processing in an information mediator, In Proceedings of the ARPA/Rome Laboratory Knowledge-Based Planning and Scheduling Initiative Workshop (1994).
[6] Y. Arens, C. Y. Chee, C. Hsu, C. A. Knoblock, Retrieving and integrating data from multiple information sources, In International Journal of Intelligent and Cooperative Information Systems 2,2 (1993) pages 127-158.
[7] J. Banerjee, H. Chou, J. F. Garza, W. Kim, D. Woelk and N. Ballou, Data model issues for object-oriented applications, ACM Transactions on Office Information Systems 5,1 (1987) pages 3-26.
[8] J. Banerjee, W. Kim, H. J. Kim and H. F. Korth, Semantics and implementation of schema evolution in object-oriented databases, In Proceedings of the ACM SIGMOD International Conference on Management of Data (1987) pages 311-322.
[9] T. Barsalou and D. Gangopadhyay, M(DM): An open framework for interoperation of multimodel multidatabase systems, In Proceedings of IEEE International Conference on Data Engineering 8 (1992) pages 218-227.
[10] C. Batini, S. Ceri and S. B. Navathe, Conceptual database design: An entity-relationship approach, Benjamin/Cummings Pub. Co., Redwood City, Calif. (1992).
[11] C. Batini, M. Lenzerini and S. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18,4 (1986) pages 323-364.
[12] R. Bayardo, W. Bohrer, R. Brice, A. Cichocki, G. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, and D. Woelk, InfoSleuth: Semantic integration of information in open and dynamic environments, MCC Technical Report MCC-INSL-088-96.
[13] Y. Breitbart, Multidatabase interoperability, ACM SIGMOD Record 19,3 (1990) pages 53-60.
[14] Y. Breitbart, P. L. Olson and G. R. Thompson, Database integration in a distributed heterogeneous database system, In Proceedings of the 2nd International Conference on Data Engineering (1986) pages 301-310.
[15] K. J. Byeon and D. McLeod, Towards the unification of views and versions for object-oriented databases, International Symposium on Object Technology and Advanced Software (ISOTAS), Kanazawa, Japan (1993) pages 220-236.
[16] A. F. Cardenas and D. McLeod, Research foundations in object-oriented and semantic database systems, Prentice Hall, Englewood Cliffs, NJ (1990).
[17] A. L. P. Chen, J. L. Koh, T. C. T. Kuo and C. C. Liu, Schema integration and query processing for multiple object databases, Integrated Computer-Aided Engineering: Special Issue on Multidatabase and Interoperable Systems 2,1 (1995) pages 21-34.
[18] B. Czejdo, M. Rusinkiewicz and D. Embley, An approach to schema integration and query formulation in federated database systems, In Proceedings of the 3rd IEEE Conference on Data Engineering (1987) pages 477-484.
[19] U. Dayal and H. Hwang, View definition and generalization for database integration in a multidatabase system, IEEE Transactions on Software Engineering 10,6 (1984) pages 628-644.
[20] F. Eliassen and R. Karlsen, Interoperability and object identity, ACM SIGMOD Record 20,4 (1991) pages 25-29.
[21] R. Elmasri and S. Navathe, Object integration in logical database design, In Proceedings of IEEE Computer Society 1st International Conference on Data Engineering (1984) pages 426-433.
[22] D. Fang, S. Ghandeharizadeh, D. McLeod and A. Si, The design, implementation, and evaluation of an object-based sharing mechanism for federated database systems, In Proceedings of International Conference on Data Engineering (1993).
[23] P. Fankhauser, M. Kracker and E. J. Neuhold, Semantic vs. structural resemblance of classes, ACM SIGMOD Record 20,4 (1991) pages 59-63.
[24] P. Fankhauser and E. J. Neuhold, Knowledge based integration of heterogeneous databases, Technical Report, Technische Hochschule Darmstadt (1992).
[25] D. Fishman, D. Beech, H. P. Cate, E. C. Chow, T. Connors, J. W. Davis, N. Derrett, C. G. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M. A. Neimat, T. A. Ryan and M. C. Shan, IRIS: An object-oriented database management system, ACM Transactions on Office Information Systems 5,1 (1987) pages 48-69.
[26] D. Gangopadhyay and T. Barsalou, On the semantic equivalence of heterogeneous representations in multimodel multidatabase systems, ACM SIGMOD Record 20,4 (1991) pages 35-39.
[27] J. Geller, Y. Perl and E. J. Neuhold, Structure and semantics in object-oriented database class specifications, ACM SIGMOD Record 20,4 (1991) pages 40-43.
[28] I. Graham, Object-oriented methods, Addison-Wesley, Wokingham, England (1991).
[29] J. Hammer and D. McLeod, An approach to resolving semantic heterogeneity in a federation of autonomous, heterogeneous database systems, International Journal of Intelligent and Cooperative Information Systems 2,1 (1993) pages 51-83.
[30] J. Hammer, D. McLeod and A. Si, Object discovery and unification in a federated database system, Technical Report USC-CS, Computer Science Department, University of Southern California, Los Angeles (1994).
[31] D. Heimbigner and D. McLeod, A federated architecture for information management, ACM Transactions on Office Information Systems 3,3 (1985) pages 253-278.
[32] M. Huhns, N. Jacobs, T. Ksiezyk, W. Shen, M. Singh and P. Cannata, Enterprise information modelling and model integration in Carnot, in Charles J. Petrie Jr., ed., Enterprise Integration Modeling: Proceedings of the First International Conference, MIT Press, Cambridge, MA (1992).
[33] Illustra User's Guide, Release 3.2, Illustra Information Technologies Inc. (1995).
[34] N. Jacobs and R. Shea, The role of Java in InfoSleuth: Agent-based exploitation of heterogeneous information resources, IntraNet96 Java Developers Conference, April 1996.
[35] J. Kahng and D. McLeod, Dynamic classificational ontologies for discovery in cooperative federated databases, In Proceedings of the 1st International Conference on Cooperative Information Systems, Brussels, Belgium, June 1996, pages 26-36.
[36] V. Kashyap and A. Sheth, Schema correspondences between objects with semantic proximity, Technical Report DCS-TR-301, Rutgers University, October 1993.
[37] M. Kaul, K. Drosten and E. J. Neuhold, Viewsystem: Integrating heterogeneous information bases by object-oriented views, In Proceedings of International Conference on Data Engineering 6 (1990) pages 2-10.
[38] A. M. Keller and M. W. Wilkins, Approaches for updating databases with incomplete information and nulls, In Proceedings of the International Conference on Data Engineering (1984) pages 332-340.
[39] W. Kent, Solving domain mismatch and schema mismatch problems with an object-oriented database programming language, In Proceedings of International Conference on Very Large Data Bases (1991) pages 147-160.
[40] W. Kent, The breakdown of the information model in multidatabase systems, ACM SIGMOD Record 20,4 (1991) pages 10-15.
[41] W. Kent, The many forms of a single fact, In Proceedings of IEEE Spring Compcon (Feb. 1989) pages 438-443.
[42] W. Kim, N. Ballou, H. T. Chou, J. F. Garza and D. Woelk, Integrating an object-oriented programming system with a database system, In Proceedings of 3rd International Conference on Object-Oriented Programming Systems, Languages, and Applications (1988) pages 142-152.
[43] W. Kim, I. Choi, S. Gala and M. Scheevel, On resolving schematic heterogeneity in multidatabase systems, Distributed and Parallel Databases, An International Journal 1,3 (1993) pages 251-279.
[44] W. Kim, J. Garza, N. Ballou and D. Woelk, Architecture of the ORION next generation database system, IEEE Transactions on Knowledge and Data Engineering 2,1 (1990) pages 109-117.
[45] J. L. Koh and A. L. P. Chen, Integration of heterogeneous object schemas, In Proceedings of the 12th International Conference on Entity-Relationship Approach (1993) pages 297-314.
[46] R. Krishnamurthy, W. Litwin and W. Kent, Language features for interoperability of databases with schematic discrepancies, In Proceedings of ACM SIGMOD International Conference on Management of Data (1991) pages 40-49.
[47] J. Larson, S. B. Navathe and R. Elmasri, A theory of attribute equivalence in databases with application to schema integration, IEEE Transactions on Software Engineering 15,4 (1989) pages 449-463.
[48] W. Li and C. Clifton, Semantic integration in heterogeneous databases using neural networks, In Proceedings of the 20th Conference on Very Large Data Bases (1994) pages 1-12.
[49] Q. Li and D. McLeod, Conceptual database evolution through learning in object databases, IEEE Transactions on Knowledge and Data Engineering 6,2 (1994) pages 205-224.
[50] Q. Li and D. McLeod, Object flavor evolution in an object-oriented database system, In Proceedings of ACM Conference on Office Information Systems (1988).
[51] W. Litwin and A. Abdellatif, Multidatabase interoperability, IEEE Computer 19,12 (1986) pages 10-18.
[52] W. Litwin, L. Mark and N. Roussopoulos, Interoperability of multiple autonomous databases, ACM Computing Surveys (1990) pages 267-293.
[53] P. Lyngbaek and D. McLeod, A personal data manager, In Proceedings of the International Conference on Very Large Data Bases (1984) pages 14-25.
[54] M. Mannino and W. Effelsberg, Matching techniques in global schema design, In Proceedings of the 1st IEEE Conference on Data Engineering (1984) pages 418-425.
[55] F. Manola, S. Heiler, D. Georgakopoulos, M. Hornick and M. Brodie, Distributed object management, International Journal of Intelligent and Cooperative Information Systems 1,1 (1992) pages 5-42.
[56] J. Martin and J. Odell, Object-oriented analysis and design, Prentice Hall, Englewood Cliffs, NJ (1992).
[57] D. McLeod, The identification and resolution of semantic heterogeneity in multidatabase systems, International Workshop on Interoperability in Multidatabase Systems, Kyoto (1991).
[58] D. McLeod and A. Si, The design and experimental evaluation of an information discovery mechanism for networks of autonomous database systems, In Proceedings of IEEE International Conference on Data Engineering (1995) pages 15-24.
[59] A. Motro, Superviews: Virtual integration of multiple databases, IEEE Transactions on Software Engineering 13,7 (1987).
[60] B. C. Neuman, The Prospero File System: A global file system based on the virtual system model, Computing Systems 5,4 (1992) pages 407-432.
[61] G. T. Nguyen and D. Rieu, Schema evolution in object-oriented database systems, Data and Knowledge Engineering 4 (1989) pages 43-67.
[62] ObjectStore User Guide: Library Interface, Release 3.0, Object Design Inc. (1993).
[63] K. Obraczka, P. B. Danzig and S. Li, Internet resource discovery services, In IEEE Computer (1993) pages 8-22.
[64] M. T. Ozsu and P. Valduriez, Principles of distributed database systems, Prentice Hall, Englewood Cliffs, NJ (1991).
[65] M. Papazoglu, S. Laufmann and T. Sellis, An organizational framework for cooperating intelligent information systems, International Journal of Intelligent and Cooperative Information Systems 1,1 (1992) pages 169-202.
[66] J. Richardson and P. Schwartz, Aspects: Extending objects to support multiple independent roles, In Proceedings of ACM SIGMOD International Conference on Management of Data (1991) pages 298-307.
[67] F. Saltor, M. Castellanos and M. Garcia, Suitability of data models as canonical models for federated databases, ACM SIGMOD Record 20,4 (1991) pages 44-48.
[68] C. Schaffert, CORBA: OMG's object request broker, Extended Abstract; for more information about OMG and its products, refer to the URL http://www.omg.org.
[69] P. Scheuermann, C. Yu, A. Elmagarmid, H. Garcia-Molina, F. Manola, D. McLeod, A. Rosenthal and M. Templeton, Report on workshop on heterogeneous database systems held at Northwestern University, ACM SIGMOD Record 19,4 (1990) pages 23-31.
[70] M. Schwartz, A. Emtage, B. Kahle and B. C. Neuman, A comparison of Internet resource discovery approaches, Computing Systems 5,4 (1992) pages 461-493.
[71] E. Sciore, M. Siegel and A. Rosenthal, Context interchange using meta-attributes, In 1st International Conference on Information and Knowledge Management (1992) pages 377-386.
[72] E. Sciore, M. Siegel and A. Rosenthal, Using semantic values to facilitate interoperability among heterogeneous information systems, ACM Transactions on Database Systems 19,2 (1994) pages 254-290.
[73] A. Sheth and J. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys 22,3 (1990) pages 183-236.
[74] M. Siegel and S. Madnick, Context interchange: Sharing the meaning of data, ACM SIGMOD Record 20,4 (1991) pages 77-78.
[75] M. Siegel and S. E. Madnick, A metadata approach to resolving semantic conflicts, In Proceedings of the International Conference on Very Large Data Bases (1991) pages 133-145.
[76] A. H. Skarra and S. B. Zdonik, Type evolution in an object-oriented database, In Research Directions in Object-Oriented Programming, MIT Press, Cambridge, MA, pages 393-415.
[77] A. H. Skarra and S. B. Zdonik, The management of changing types in an object-oriented database, In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications (1986) pages 483-495.
[78] R. M. Soley and W. Kent, The OMG object model; for more information about OMG and its products, refer to the URL http://www.omg.org.
[79] M. Stonebraker, A. Rowe, B. Lindsay, J. Gray, M. Carey, M. Brodie, P. Bernstein and D. Beech, Third generation database system manifesto, ACM SIGMOD Record 19,3 (1990) pages 31-44.
[80] J. Strang, Programming with curses, O'Reilly & Associates, Inc. (1986).
[81] C. Tomlinson, G. Lavender, G. Meredith, D. Woelk and P. Cannata, The Carnot Extensible Services Switch (ESS): Support for service execution, in Charles J. Petrie Jr., ed., Enterprise Integration Modeling: Proceedings of the First International Conference, MIT Press, Cambridge, MA (1992).
[82] P. S. M. Tsai and A. L. P. Chen, Concept hierarchies for database integration in a multidatabase system, In 6th International Conference on Management of Data (1994).
[83] UniSQL/X User Guide, Revision 2.2, UniSQL Inc. (1994).
[84] USC Human Brain Project, http://www-hbp.usc.edu:8376/HBP/Home.html.
[85] USC IMSC Project, http://www.usc.edu/dept/imsc/.
[86] V. Ventrone and S. Heiler, Semantic heterogeneity as a result of domain evolution, ACM SIGMOD Record 20,4 (1991) pages 16-20.
[87] Y. R. Wang and S. Madnick, The inter-database instance identification problem in integrating autonomous systems, In Proceedings of the 5th International Conference on Data Engineering (1989) pages 46-55.
[88] G. Wiederhold, Interoperation, mediation and ontologies, Workshop on Heterogeneous Knowledge-Bases (1994) pages 33-48.
[89] K. Wilkinson, P. Lyngbaek and W. Hasan, The IRIS architecture and implementation: Object and function model, IEEE Transactions on Knowledge and Data Engineering 2,1 (1990) pages 63-75.
[90] D. Woelk, Carnot intelligent agents and digital libraries, Proceedings of the First Annual Conference on the Theory and Practice of Digital Libraries, June 1994.
[91] D. Woelk, P. Cannata, M. Huhns, W. Shen and C. Tomlinson, Using Carnot for enterprise information integration, Second International Conference on Parallel and Distributed Information Systems, January 1993, pages 133-136.
[92] D. Woelk, M. Huhns and C. Tomlinson, InfoSleuth agents: The next generation of active objects, MCC Technical Report INSL-054-95, June 1995.
[93] D. Woelk, W. Shen, M. Huhns and P. Cannata, Model driven enterprise information management in Carnot, in Charles J. Petrie Jr., ed., Enterprise Integration Modeling: Proceedings of the First International Conference, MIT Press, Cambridge, MA (1992).
[94] D. Woelk and C. Tomlinson, InfoSleuth: Networked exploitation of information using semantic agents, COMPCON Conference, March 1995.
[95] D. Woelk and C. Tomlinson, Carnot and InfoSleuth: Database technology and the World Wide Web, ACM SIGMOD International Conference on the Management of Data, May 1995.
[96] D. Woelk and C. Tomlinson, The InfoSleuth project: Intelligent search management via semantic agents, Second International World Wide Web Conference, October 1994.
[97] M. F. Worboys and S. M. Deen, Semantic heterogeneity in distributed geographic databases, ACM SIGMOD Record 20,4 (1991) pages 30-34.
[98] C. Yu, B. Jia, W. Sun and S. Dao, Determining relationships among names in heterogeneous databases, ACM SIGMOD Record 20,4 (1991) pages 79-80.
[99] C. Yu, W. Sun, S. Dao and D. Keirsey, Determining relationships among attributes for interoperability of multidatabase systems, In Proceedings of the 1st International Workshop on Interoperability in Multidatabase Systems (1991) pages 251-257.