USC Computer Science Technical Reports, no. 575 (1994)
An Intelligent System For Identifying and Integrating Non-Local Objects In Federated Database Systems

Joachim Hammer, Dennis McLeod, and Antonio Si
Computer Science Department
University of Southern California
Los Angeles, CA, USA

Abstract

Support for interoperability among autonomous, heterogeneous database systems is emerging as a key information management problem for the 1990s. A key challenge for achieving interoperability among multiple database systems is to provide capabilities to allow information units and resources to be flexibly and dynamically combined and interconnected, while at the same time preserving the investment in and autonomy of each existing system. This research specifically focuses on two key aspects of this: (1) how to discover the location and content of relevant, non-local information units, and (2) how to identify and resolve the semantic heterogeneity that exists between related information in different database components. We demonstrate and evaluate our approach using the Remote-Exchange experimental prototype system, which supports information sharing and exchange from the above perspective.

Introduction

The past several decades have witnessed the emergence of powerful, general-purpose database management facilities, as well as the proliferation of databases constructed with those facilities. A central current problem in database research is to support the effective sharing and exchange of information among various database systems, while at the same time respecting the autonomy of those systems. Further, recent advances and wide acceptance of communication networks have provided foundational mechanisms to support the interconnection of related, existing database systems. Such a collection of cooperating but heterogeneous, autonomous database systems may be termed a federated database system, or federation for short; each individual database system in a federation is termed a component database system, or component.

While the interoperability problem is largely addressed in the communication network environment, that environment provides only limited support for the interoperation of heterogeneous database systems. A main reason for this limitation stems from the difficulties in overcoming the problem of unifying heterogeneous information. This is commonly known as the semantic heterogeneity problem.

In light of this observation, various architectures have been proposed to address the database interoperability problem. Such architectures range from tightly-coupled composite approaches, in which component databases are integrated into a centralized global database, to loosely-coupled federated environments, wherein information is shared among component database systems while retaining their autonomy. A key challenge for all these environments is to provide capabilities to allow information units and resources to be flexibly and dynamically combined and interconnected, while at the same time preserving the investment in and autonomy of each individual database component.

The Remote-Exchange federated architecture and experimental system extends the traditional federated architecture in supporting two key aspects of this: (1) how to discover the location and content of related, non-local information units (simple objects, types of object, units of behavior, etc.), and (2) how to identify and resolve the semantic heterogeneity that exists between related information in different database components. The goal of this paper is to demonstrate the architecture of this system and explain key implementation issues faced. Our approach to the two key problems cited above can be utilized by a variety of intelligent and cooperative information systems (ICISs).

The remainder of this paper is organized as follows. We first review related work. We then introduce a typical sharing scenario in a federated database environment, wherein individual components can share and exchange information units (objects). Next, we describe the architecture of the Remote-Exchange experimental system in detail and illustrate the ideas of our approach to information sharing. We then examine key issues in the design and implementation of the Remote-Exchange prototype using a collection of Omega-based database systems. Finally, we offer concluding observations with a critical evaluation of our results and their potential impact on future work.

Footnote: This research was supported in part by NSF grant IRI.

Related Work

The term heterogeneous databases was originally used to distinguish work that included database model and conceptual schema heterogeneity from work on distributed databases, which addressed issues solely related to distribution. Recently, there has been a resurgence in research in the area of heterogeneous database systems (HDBSs). Work in this area can be characterized by the different levels of integration of the component DBSs and by different levels of global (federation) services. In Multibase, for example, which is considered a tightly-coupled HDBS, component database schemas are integrated into one centralized global schema, with the option of defining different user views on the unified schema. While this approach supports preexisting component databases, it falls short in terms of flexible sharing patterns. Furthermore, the integration process is expensive and difficult, and its result tends to be hard to change.

Footnote: The term distributed database is used here as it has been mainly used in the literature, denoting a relatively tightly-coupled, homogeneous system of logically centralized but physically distributed component databases.

The federated architecture proposed in earlier work, which is similar to the multidatabase architecture, involves a loosely-coupled collection of database systems, stressing autonomy and flexible sharing patterns through inter-component negotiation. Rather than using a single, static global schema, the loosely-coupled architecture allows multiple import schemas, enabling data retrieval directly from the exporter and not indirectly through some central node as in the tightly-coupled approach.

One common approach to schema integration is to reason about the meaning and resemblance of heterogeneous objects in terms of their structural representation. In Larson et al., the meaning of an attribute is approximated in terms of its value type, set of possible values, cardinality constraints, integrity constraints, and allowable operations. However, one can argue that any such set of characteristics does not sufficiently describe the real-world meaning of an object, and thus their comparison can lead to unintended correspondences or fail to detect important ones. Other promising methodologies that have been developed include heuristics to determine the similarity of objects based on the percentage of occurrences of common attributes. More accurate techniques use classification for choosing a possible relationship between classes.

Whereas most previous methods primarily utilize schema knowledge, techniques utilizing semantic knowledge based on real-world experience have also been investigated. These approaches usually assume the existence of a real-world knowledge base, which serves as a global schema to which every local schema is mapped. However, we believe that a useful approach is for the centralized knowledge base to contain only information actively used to support sharing within the federation; thus, as illustrated in our mechanism, it should be incrementally built and tailored to the federation. Furthermore, these approaches lack the ability to tailor the identification process to the context of a user request.

A different approach, proposed by Kent, uses an object-oriented database programming language to express mappings among different, similar concepts that allow a user to view them in some integrated way. It of course remains to be seen whether a language that is sophisticated enough to meet all of the requirements given by Kent in his solution can be developed in the near future.

A very recent approach to interoperability by Mehta et al. uses so-called path-methods to explicitly create inter-component and inter-object mappings between source and target object classes in order to retrieve and update related data objects. The obvious drawback of this approach is the large overhead in calculating and maintaining the mappings, which may be impractical for large federations with extensive and/or dynamic sharing patterns.

The Federated Database System Context

Consider the following scenario involving a group of molecular biologists whose databases form a collaborative Federation Of Macromolecular Databanks (FOMD), as depicted in Figure 1. The goal of FOMD is to share and exchange macromolecular information between individual institutes and researchers. For instance, a researcher could efficiently reuse protein data that had already been sequenced (decoded) by other components for his/her experimental work.

Figure 1: A conceptual overview of FOMD and its components

Figure 1 also shows a snapshot of the information stored in FOMD at a given point in its lifetime. Since each macromolecular database is managed by a different organization (institute), the contents of their databases reflect the different foci and interests. We can see, for example, that institute B is the only component with information on both genetic and protein sequences. All other components maintain either genetic or protein information with various levels of detail.

A difficulty in finding a solution to the problem of achieving interoperability in FOMD (and similar federations) stems from the conflicting nature of sharing and autonomy. On the one hand, an institute would like to share information with other components of FOMD. On the other hand, the same component would also like to exercise autonomy over its own database with respect to organization, administration, and sharing (e.g., control over the information it is willing to export to the other components). In consequence, we assume that when a component agrees to join the federation, information to be made available to other institutes is stored in a specially marked schema called an export schema.

Footnote: This capability of exporting only a specific portion of a component's database is especially important in the FOMD environment, where a research institute would probably not be willing to release any information that has not been published or fully validated.

The level of interoperability that can be achieved within such a federation depends largely upon two key capabilities:

(1) the ability of a component to identify and locate potentially appropriate non-local information with respect to its needs (the discovery problem), and

(2) as requested remote information is identified and located, the ability to fold it into the local system framework (the unification problem).

The focus of this work is to address the above two problems; several other important issues, such as security, access control, and automatic update of shared data, are not directly addressed here.

Achieving Interoperability in Remote-Exchange

In order for any collaboration to take place among the heterogeneous components of a federation, some common model for describing the sharable data must be utilized. One may of course argue as to the nature of this "lingua franca". We believe that this model should be semantically expressive enough to capture the intended meanings of conceptual schemas, which may reflect essential kinds of heterogeneity (as will be discussed below). Further, this model must be simple enough so that it can be readily understood and implemented. To this end, we have chosen to use a Minimal Object Data Model (MODM) as the common database model for describing the structure, constraints, and operations for sharable data.

MODM

MODM is a generic functional object database model which supports the usual object-based constructs. In particular, it draws upon the essentials of functional database models, such as those proposed in Daplex, Iris, and Omega. MODM contains the basic features common to most semantic and object-oriented models. The model supports complex objects (aggregation), type membership (classification), subtype to supertype relationships (generalization), inheritance of properties (attributes) from supertype to subtypes, and user-definable functions (methods).

In the context of MODM, interoperation among components is generally possible at many different levels of abstraction and granularity, ranging from factual information units (data objects), to meta-data (conceptual schema), to behavior. For this paper, we limit our investigation to the sharing of type objects, a process we term type-level sharing. Sharing of individual instances (instance-level sharing) and sharing of behavior or functions (function-level sharing) are preliminarily examined elsewhere.

Remote-Exchange Experimental System

Figure 2 illustrates the architecture of the Remote-Exchange experimental system. In this figure, the sharing advisor is a special component that manages knowledge about existing type objects that components export; this knowledge resides in the semantic dictionary. The sharing advisor provides four intelligent services to the components of the federation: Registration, Discovery, Semantic Heterogeneity Resolution, and Unification. These services address different aspects of the problem of type-level sharing, ranging from detecting similarity during the registration process to determining the precise relationship between objects during the unification process. While it is nearly impossible to completely automate these services, our approach provides substantially useful functionality.

Figure 2: The Remote-Exchange architecture

Registration

Registration allows a new component to inform the sharing advisor about any information it is willing to share with other components in the federation. In a sense, it establishes an initial sharing context within the federation by logically connecting the exported information to the semantic dictionary via the sharing advisor. Incremental registration allows a component to augment its export schema with new information.

Figure 3 shows two partial conceptual schemas, of component A and component D respectively. Let us assume that component A has already registered the type objects Protein Instances and Amino Acid Sequences with the sharing advisor. This is reflected in the semantic dictionary shown in Figure 4(a). When component D registers type object Protein Structures, the sharing advisor can use its existing knowledge, in this case the information obtained from component A, to determine if Protein Structures has any similarities with previously registered types, i.e., Protein Instances and Amino Acid Sequences. User interaction may be necessary to instruct the sharing advisor in case the information in the semantic dictionary is insufficient to detect similarities and dissimilarities among type objects automatically. The newly acquired knowledge and the newly registered information are stored in the semantic dictionary for future consultation (see, for example, the contents of the semantic dictionary in Figure 4(b) after the registration of the type object Protein Structures). In a sense, the semantic dictionary represents a dynamic federated knowledge base about sharable information in the federation.

In the remainder of this section, we illustrate how the sharing advisor can effectively utilize the semantic dictionary as a means of organizing various registered type objects in order to accommodate information sharing. We also introduce a set of sharing heuristics that guide the sharing advisor through registration.

The Semantic Dictionary

In the semantic dictionary, types determined to be similar by the sharing advisor are classified into a collection called a concept, within which sub-collections called sub-concepts can be further classified. This generates a concept hierarchy. Naturally, the relationships expressed in this concept hierarchy can only be approximations of the true real-world relationships that exist between the exported types of different components; additional mechanisms are needed to establish more exact relationships (see the discussion of semantic heterogeneity resolution below).

Figure 4 shows two snapshots of the concept hierarchy in the semantic dictionary, taken at different times during the lifetime of our example federation. Figure 4(a) indicates the concept hierarchy after types Protein Instances and Amino Acid Sequences of component A have been registered. Figure 4(b) shows the corresponding hierarchy after component D has registered type Protein Structures. The hierarchy in Figure 4(b) also indicates that Protein Instances and Protein Structures represent similar information, since they belong to the same classification, called Protein Information. In addition, types Protein Instances and Protein Structures have properties that distinguish one type from the other. Hence, they are also created as sub-concepts of Protein Information in order to express their dissimilarities. By contrast, Amino Acid Sequences is similar to neither Protein Instances nor Protein Structures, and thus appears as a separate concept in the hierarchy. Similarity and dissimilarity of this kind are detected by the sharing advisor based upon sharing heuristics (described below), with user input as required.

The advantage of organizing type objects in a concept hierarchy is that a "hill climbing" technique can be used to place newly registered type objects. For example, consider a type object being registered which is determined to be unrelated to members of concept Protein Information in Figure 4(b); in this case, no further comparisons with members of sub-concepts of Protein Information are necessary.

Figure 3: Partial conceptual schemas of two macromolecular databases

Generally, a type object represents a specific view of a corresponding real-world concept and is tailored to the focus and interest of the database component; therefore, the set of properties associated with the type object can be viewed as a subset of those associated with the real-world concept. In Figure 3, type Protein Instances of component A and Protein Structures of component D indicate two different views on the real-world concept "protein". In order to properly merge various views on a similar concept, the semantic dictionary is established in a bottom-up fashion, with the set of properties belonging to a concept at a particular level represented as the union of the properties of all its sub-concepts. This is illustrated by concept Protein Information in Figure 4(b). Using a concept hierarchy to organize exported types will incrementally establish a federated view of all export schemas in the federation.
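
As a rough illustration of the two ideas above (bottom-up union of properties, and hill-climbing placement that skips the sub-concepts of an unrelated concept), the following Python sketch may help. It is not the Remote-Exchange implementation: the class `Concept`, the functions `overlap` and `place`, the Jaccard-style measure, the threshold of 0.3, and every property name other than `code` are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a hypothetical concept hierarchy."""
    name: str
    own_properties: set = field(default_factory=set)  # filled in for type objects
    children: list = field(default_factory=list)      # sub-concepts

    def properties(self):
        # Bottom-up rule: a concept's properties are the union of the
        # properties of all its sub-concepts (plus its own, if any).
        props = set(self.own_properties)
        for child in self.children:
            props |= child.properties()
        return props


def overlap(a, b):
    """Jaccard overlap of two property sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def place(roots, new_props, threshold=0.3):
    """Descend only into the best-matching related concept at each level.

    Returns (path, compared): the concept names under which the new type
    would be placed, and the names of all concepts that were compared.
    """
    path, compared = [], []
    level = roots
    while level:
        compared.extend(c.name for c in level)
        best = max(level, key=lambda c: overlap(c.properties(), new_props))
        if overlap(best.properties(), new_props) < threshold:
            break  # unrelated at this level: sub-concepts are never examined
        path.append(best.name)
        level = best.children
    return path, compared


# Hierarchy loosely modelled on the Protein Information example.
protein_info = Concept("Protein Information", children=[
    Concept("Protein Instances", {"code", "name", "sequence"}),
    Concept("Protein Structures", {"code", "name", "structure"}),
])
amino = Concept("Amino Acid Sequences", {"residues", "length"})
roots = [protein_info, amino]

unrelated_path, unrelated_compared = place(roots, {"author", "title"})
related_path, related_compared = place(
    roots, {"code", "name", "structure", "resolution"})
```

In this sketch the unrelated type is compared only against the two top-level concepts, whereas the related type is followed down into the sub-concepts of Protein Information.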

Note that the concept hierarchy introduced is not static. It is dynamic and evolves incrementally, depending on the knowledge and information received by the sharing advisor. Consider the following scenario, where a total of N type objects are registered with the sharing advisor. On one extreme, where all N types are similar enough to belong to one single concept, there will be N + 1 objects in the semantic dictionary (N type members + 1 concept); on the other extreme, where all N types are completely unrelated, there will be 2N objects in the semantic dictionary (N type members + N concepts).

Sharing Heuristics

The sharing heuristics employed draw upon the incremental clustering paradigm of machine learning. The idea behind these heuristics is to assess the extent of the distinguishing capability of a property with respect to a concept; this allows the sharing advisor to determine whether the meaning of a type object being registered can be determined based upon its properties, or whether further assistance from users is necessary. The distinguishing capability with respect to a concept is based upon the inter-concept dissimilarity between concepts and the intra-concept similarity within a concept. As an example, consider property code of concept Protein Information in Figure 4(b). This property has a high inter-concept dissimilarity, as no other concept at the same scope (level) possesses such a property. On the other hand, code also has a high intra-concept similarity with respect to Protein Information, since this property is associated with all concept members of Protein Information (for example, Protein Instances). However, when looking at the sub-concepts of Protein Information, property code has a low inter-concept dissimilarity with respect to each sub-concept of Protein Information, since this property is possessed by all concepts within this same scope of a common super-concept, Protein Information. Note that a property does not necessarily possess the same extent of distinguishing capability across different levels.

Inter-concept dissimilarity and intra-concept similarity values are estimated via statistical analysis based on previously registered type objects. A statistical, heuristic-based approach offers a degree of error resilience, allowing the accuracy of the distinguishing capabilities to be gradually improved over a period of time. A limitation of this approach is its potentially oscillating nature during the early stages of a federation; however, this limitation becomes less significant as the number of components and type objects increases, damping the oscillation to a steady state.
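
The text does not give the formulas used for these two measures, so the following Python sketch simply assumes the most direct frequency-based estimates: intra-concept similarity as the fraction of a concept's members that possess a property, and inter-concept dissimilarity as the fraction of the other concepts at the same level that lack it. The function names, the data layout, and all property names other than `code` are hypothetical.

```python
def intra_concept_similarity(prop, members):
    """Fraction of a concept's member types that possess `prop`.

    `members` maps a type name to its set of property names.
    """
    return sum(prop in props for props in members.values()) / len(members)


def inter_concept_dissimilarity(prop, concept, level):
    """Fraction of the other concepts at the same level that lack `prop`.

    `level` maps each concept name at one scope to its members.
    """
    others = [name for name in level if name != concept]
    if not others:
        return 1.0  # nothing to be confused with at this level
    lacking = sum(
        not any(prop in props for props in level[name].values())
        for name in others
    )
    return lacking / len(others)


# Top level of the example hierarchy (property sets are illustrative).
top_level = {
    "Protein Information": {
        "Protein Instances": {"code", "name", "sequence"},
        "Protein Structures": {"code", "name", "structure"},
    },
    "Amino Acid Sequences": {
        "Amino Acid Sequences": {"residues", "length"},
    },
}

# Sub-concepts of Protein Information, viewed as their own level.
sub_level = {
    "Protein Instances": {"Protein Instances": {"code", "name", "sequence"}},
    "Protein Structures": {"Protein Structures": {"code", "name", "structure"}},
}

code_intra = intra_concept_similarity("code", top_level["Protein Information"])
code_inter_top = inter_concept_dissimilarity(
    "code", "Protein Information", top_level)
code_inter_sub = inter_concept_dissimilarity(
    "code", "Protein Instances", sub_level)
```

Under these assumed estimates, `code` scores high on both measures at the top level but has no distinguishing capability among the sub-concepts of Protein Information, matching the qualitative behaviour described above.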

Information that is utilized by the sharing advisor to interrelate similar properties comes from three different sources:

(1) The property itself: in the situation where a property is defined as an atomic data type such as INTEGER or STRING, only the property name is utilized by the sharing advisor; this is because the value type of the property does not provide additional useful semantics. On the other hand, if a property is defined as a user-defined type, both the property name and the value type of the property will be considered by the sharing advisor, as both of these pieces of information contribute real-world semantics to the meaning of the property.

(2) Previously acquired knowledge: a list of property (term) correspondences is incrementally maintained in the semantic dictionary as a result of user instruction. This list is consulted by the sharing advisor whenever property correspondence information is needed.

(3) User consultation: in the situation where the semantic dictionary does not provide adequate knowledge for the sharing advisor to properly interrelate similar properties, the user is consulted.

Figure 4: Evolution of concept hierarchies in the semantic dictionary
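
The three sources can be read as a fallback chain, which the following Python sketch tries to make concrete. The functions `describe` and `correspond`, the case-insensitive name comparison, and the representation of a property as a `(name, value_type)` pair are assumptions for illustration only; the source does not specify how the sharing advisor compares names.

```python
ATOMIC_TYPES = {"INTEGER", "STRING"}


def describe(name, value_type):
    """Source 1: what the property itself contributes.

    For atomic value types only the name is used; for user-defined
    types both the name and the value type are used.
    """
    if value_type in ATOMIC_TYPES:
        return (name.lower(),)
    return (name.lower(), value_type.lower())


def correspond(p, q, known, ask_user):
    """Decide whether properties p and q correspond.

    `known` is the incrementally maintained correspondence list
    (source 2); `ask_user` is a callback used as a last resort
    (source 3), whose answer is remembered for later requests.
    """
    if describe(*p) == describe(*q):
        return True
    key = frozenset((p[0].lower(), q[0].lower()))
    if key in known:
        return known[key]
    answer = ask_user(p, q)
    known[key] = answer
    return answer


# Example run with a stand-in for the user.
questions = []


def fake_user(p, q):
    questions.append((p, q))
    return True


known = {}
same_name = correspond(("code", "STRING"), ("Code", "INTEGER"), known, fake_user)
first = correspond(("seq", "Sequence"), ("sequence", "Sequence"), known, fake_user)
second = correspond(("seq", "Sequence"), ("sequence", "Sequence"), known, fake_user)
```

The design point illustrated is that user input is requested at most once per pair of property names; afterwards the stored correspondence answers the question.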

Discovery

The purpose of discovery is to identify appropriate information relevant to the request of a component initiating a sharing procedure. Although we have proposed a methodology to detect similarities among different types, it is not adequate in a federated environment where the goal is to integrate specific non-local information into the environment of a local component. This is because even though a remote object is similar to a particular local object, it might not be relevant within the intended context of a local component. Therefore, it is imperative that (user) characteristics of the local components be taken into consideration when locating relevant information. For this purpose, we have categorized three basic kinds of discovery requests which, when combined, allow a component to discover a wide variety of non-local information:

Type 1: Similar Concepts
In this kind of discovery request, a component (user) is interested in locating type objects in remote components that are conceptually similar or related to a particular type object in the local component. All type objects belonging to the portion of the concept hierarchy in which the local type object resides in the semantic dictionary are appropriate to this request. For example, in the concept hierarchy of Figure 4(b), Protein Structures of component D is a proper candidate for the request by component A for related information on Protein Instances.

Type 2: Complementary Information
In this kind of discovery request, a component (user) is interested in discovering additional information about a local type object. This may occur when component A is interested in additional information on Protein Instances, for example. All type members with different sets of properties belonging to the same portion of the hierarchy in which the local type object resides are proper candidates for this request. For example, Protein Structures of component D would also satisfy a request for additional information on Protein Instances issued by component A.

Type 3: Overlapping Information
This kind of discovery request arises when a component is interested in locating non-local type objects that overlap in their information content with one of the component's local types. For example, component A would like to display all protein-like information using its own three-dimensional viewing program, which works on members of type Protein Instances. All types with properties similar to those of Protein Instances that belong to the sub-hierarchy rooted at Protein Instances are proper candidates for this request. According to the concept hierarchy of Figure 4(b), there is no candidate type object that would satisfy such a request at this time.
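
The three kinds of request can be sketched as three small queries over the concept hierarchy. The Python below is only one possible reading of the descriptions above: the dictionaries `CONCEPTS` and `SUB_HIERARCHY`, the function names, and all property names other than `code` are hypothetical, and "similar properties" is approximated here as "shares at least one property".

```python
# concept name -> {type name: (owning component, set of properties)}
CONCEPTS = {
    "Protein Information": {
        "Protein Instances": ("A", {"code", "name", "sequence"}),
        "Protein Structures": ("D", {"code", "name", "structure"}),
    },
    "Amino Acid Sequences": {
        "Amino Acid Sequences": ("A", {"residues", "length"}),
    },
}

# type name -> type names in the sub-hierarchy rooted at that type
SUB_HIERARCHY = {
    "Protein Instances": [],
    "Protein Structures": [],
    "Amino Acid Sequences": [],
}


def concept_of(type_name):
    """Return the concept in which a registered type resides."""
    for concept, members in CONCEPTS.items():
        if type_name in members:
            return concept
    raise KeyError(type_name)


def similar_concepts(local_type, component):
    """Remote types residing in the same portion of the hierarchy."""
    members = CONCEPTS[concept_of(local_type)]
    return sorted(t for t, (owner, _) in members.items() if owner != component)


def complementary_information(local_type, component):
    """Remote types in the same portion with a different property set."""
    members = CONCEPTS[concept_of(local_type)]
    local_props = members[local_type][1]
    return sorted(
        t for t, (owner, props) in members.items()
        if owner != component and props != local_props
    )


def overlapping_information(local_type, component):
    """Remote types below the local type that share some of its properties."""
    members = CONCEPTS[concept_of(local_type)]
    local_props = members[local_type][1]
    return sorted(
        t for t in SUB_HIERARCHY.get(local_type, [])
        if members[t][0] != component and members[t][1] & local_props
    )


similar = similar_concepts("Protein Instances", "A")
complementary = complementary_information("Protein Instances", "A")
overlapping = overlapping_information("Protein Instances", "A")
```

With the example data, the first two requests issued by component A return Protein Structures of component D, and the third returns no candidates, as in the text.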

Semantic Heterogeneity Resolution

After identifying relevant non-local information (objects), a component may wish to fold them into its local information framework. However, the problem is to determine how this information can be unified with its own local data, given the semantic discrepancies that may exist between related concepts in different components. Such semantic heterogeneity is a natural consequence of the independent creation and evolution of autonomous databases which are tailored to the requirements of the applications they serve.

The purpose of our approach to semantic heterogeneity resolution is to resolve these discrepancies between the relevant non-local type objects and the local meta-data context (conceptual schema). This is a prelude to unifying the remote information into the local context. Our mechanism for resolving semantic heterogeneity employs a local lexicon for each component, which specifies its perspective on the precise relationship between its local types and a global set of commonly understood concepts. Specifically, knowledge is represented as a static collection of facts of the simple form:

<term> <relationship descriptor> <term>

A term on the left-hand side of a relationship descriptor represents a local type object (i.e., the unknown), which is described by the term on the right-hand side. We have an initial set of descriptors, which is extensible; we also anticipate collections of descriptors that are tailored to given application domains. The following is a list of basic conceptual relationship descriptors initially supported in Remote-Exchange:

Descriptor     Meaning                              Example
Identical      Two types are the same               Protein Identical Protein
Equal          Two types are equivalent             DNA Strand Equal Genetic Base Seq.
Compatible     Two types are transformable          RNA Compatible Protein
KindOf         Specialization of a type             Protein KindOf Molecule
Assoc          Positive association between types   Gene Assoc Protein
CollectionOf   Collection of related types          Protein CollectionOf Amino Acids
Common         Characteristic of a collection       DNA Common Structure=Double Helix
Feature        Descriptive feature of a type        Protein Feature Molecular Structure
Has            Property of all instances            Chromosome Has Genetic Map

For example, we may have the following in the local lexicon of component A:

Term                Relationship   Key concept           Textual description
                    descriptor                           (not part of lexicon)
Protein Instances   KindOf         Protein Information   A protein instance is a certain kind of protein information.
Protein Instances   CollectionOf   Proteins              It is a collection of all proteins.
Protein Instances   Has            Components            Each instance of Protein Instances consists of (amino acid) building blocks.
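
Because each lexicon entry is a simple three-part fact, a local lexicon lends itself to a very small data structure. The Python sketch below shows one possible encoding; the names `Fact`, `DESCRIPTORS`, and `make_lexicon` are assumptions for illustration, while the descriptor names and the three example facts are taken from the tables above.

```python
from typing import NamedTuple

# The basic relationship descriptors listed in the text.
DESCRIPTORS = {
    "Identical", "Equal", "Compatible", "KindOf", "Assoc",
    "CollectionOf", "Common", "Feature", "Has",
}


class Fact(NamedTuple):
    """One lexicon entry: <term> <relationship descriptor> <term>."""
    term: str        # local type object (the unknown)
    descriptor: str  # how it relates to the describing concept
    concept: str     # commonly understood concept describing the term


def make_lexicon(triples):
    """Build a lexicon indexed by local term, rejecting unknown descriptors."""
    lexicon = {}
    for term, descriptor, concept in triples:
        if descriptor not in DESCRIPTORS:
            raise ValueError(f"unknown descriptor: {descriptor}")
        lexicon.setdefault(term, []).append(Fact(term, descriptor, concept))
    return lexicon


# The example lexicon of component A.
lexicon_a = make_lexicon([
    ("Protein Instances", "KindOf", "Protein Information"),
    ("Protein Instances", "CollectionOf", "Proteins"),
    ("Protein Instances", "Has", "Components"),
])

described_by = [fact.concept for fact in lexicon_a["Protein Instances"]]
```

Keeping the descriptor set as plain data reflects the statement that the initial set is extensible: a domain-specific collection could be added without changing the structure of the facts.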
The terms that are used to describ e unkno wn con
cepts are tak en from a dynamic list of concepts dra wn
from the seman tic dictionaryc haracterizing the com
monalities in a federation Since in terop erabilit y only
mak es sense among concepts that mo del similar or
related information it is reasonable to exp ect a com
mon understanding of a minima l set of concepts tak en
from the application domain The seman tic dictio
nary con tains partial kno wledge ab out all the terms
in the lo cal lexica in the federation suggesting p os
sible relationships b et w een dieren t terms Utilizing
this kno wledge eac h lo cal lexicon describ es the pre
cise meaning of all t yp e ob jects that the comp onen t
exp orts In our example see Figure b the set of
commonly understo o d concepts at a particular mo
men t could b e C fA mino A cid Se quenc es A uthors Pr oteins Pr otein
Information Pr otein Instanc es Pr otein Structur esg
The essen tial idea of using a lo cal lexicon is to rep
resen t the seman tics of the shared terms in a more
expressiv e and complete manner than in the concep
tual sc hema The additional seman tic information is
imp ortan t for the follo wing reason as noted earlier
the results of the disco v ery pro cess are not alw a ys
conclusiv e enough to determine the exact meaning of
a term or more preciselyho wt w o similar terms in
dieren t comp onen ts are related Similarl yit is not
p ossible to deriv e the meaningusage of terms b y
lo oking at their structural represen tation in the con
ceptual sc hema of a lo cal comp onen t In order to
determine the relationships b et w een ob jects in a fed
eration w e realize that not one single metho d but
acom bination of sev eral dieren t but complimen
tary approac hes ie Registration Disco v ery Seman
tic Heterogeneit y Resolution tak en together is highly
promising F or example there is an imp ortan t con
nection b et w een lo cal lexica and seman tic dictionary Lo cal lexica con tain only seman tic information and no
kno wledge ab out an y relationships among its ob jects
information that is necessary to solv e the seman tic
heterogeneit y problem This kind of information is
pro vided b y the seman tic dictionary through regis
tration whic h con tains partial kno wledge ab out the
relationships among all the terms in the lo cal lexica
in the federation Note of course that the lexica and
seman tic dictionary are b oth dynamic ie they gro w
and shrink as the amoun t of shared data in the fed
eration increases decreases
The basic problem addressed by the semantic heterogeneity resolution mechanism may now be expressed (without loss of generality) as follows: given two objects, a local and a foreign one, return the relationship that exists between the two. Specifically, our strategy is based on structural knowledge (conceptual schema information) and the known relationships that exist between common global concepts in set C and the two objects in question (local lexicon, semantic dictionary). One characteristic of our approach is that the majority of user input occurs before the resolution step is performed (i.e., when selecting the set C and creating the local lexicon) rather than during it.
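The problem statement above (two objects in, one relationship out) can be sketched as follows. This is only an illustration under stated assumptions: the lexicon entry format (a set of (descriptor, concept) pairs per term), the descriptor names, and the coarse outcomes "equivalent", "related", "unrelated", and "unknown" are hypothetical and are not taken from RemoteExchange. The sketch also leaves out the semantic dictionary and the structural knowledge from the conceptual schema.

```python
# Hypothetical sketch: each lexicon maps a term to a set of
# (descriptor, concept) pairs, with concepts drawn from the common set C.
COMMON_CONCEPTS = {
    "Amino Acid Sequences", "Authors", "Proteins",
    "Protein Information", "Protein Instances", "Protein Structures",
}


def resolve(local_term, local_lexicon, foreign_term, foreign_lexicon,
            common=COMMON_CONCEPTS):
    """Return a coarse relationship between a local and a foreign term."""
    local_desc = local_lexicon.get(local_term, set())
    foreign_desc = foreign_lexicon.get(foreign_term, set())
    if not local_desc or not foreign_desc:
        return "unknown"        # one of the terms has no lexicon entry
    if local_desc == foreign_desc:
        return "equivalent"     # described identically in terms of C
    # Only concepts that belong to the common set C are comparable.
    local_concepts = {c for _, c in local_desc if c in common}
    foreign_concepts = {c for _, c in foreign_desc if c in common}
    if local_concepts & foreign_concepts:
        return "related"        # the descriptions share a common concept
    return "unrelated"


# Made-up lexicon entries for components A and D.
lexicon_a = {"Protein Instances": {("kind-of", "Proteins")}}
lexicon_d = {"Protein Structures": {("kind-of", "Proteins"),
                                    ("has", "Amino Acid Sequences")}}
relationship = resolve("Protein Instances", lexicon_a,
                       "Protein Structures", lexicon_d)
```

Because the lexica are filled in ahead of time, a function of this shape needs no user interaction while it runs, which mirrors the point that user input occurs before the resolution step.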
Unification
At this point, the nonlocal objects can be unified with the corresponding local objects. In some cases, the local metadata framework (conceptual schema) must be restructured to achieve a result that is complete, minimal, and understandable. Complete, since the new integrated schema must contain all concepts that were present before the unification process took place. Minimal, since concepts should only be represented once. And understandable, since the integrated schema should be easy to understand for the end user.
Figure: A's schema after resolution and unification of protein-like information
Upon importing the metadata, structural conflicts with existing types in the component's local type hierarchy may arise. Elsewhere, we enumerated in detail the various conflicting possibilities that can arise while importing nonlocal metadata into a local schema. Here we illustrate our mechanism with the simple example of importing type Protein Structures of component D into A's schema. The following two possibilities exist:
1. Protein Structures is semantically equivalent to Protein Instances in A's schema.
In this case, we make Protein Structures a subtype of Protein Instances. Properties that previously belonged to Protein Instances remain there. Properties that belong to Protein Structures but not to Protein Instances are added to the new subtype. In those cases where Protein Instances has additional properties that do not exist for Protein Structures, special null values must be assigned to all its instances (see Figure a). Note that subsetting is considered to be the basis for accommodating multiple user perspectives on comparable types used by most methodologies. The case in which Protein Structures is identical to Protein Instances is a special case and only requires the importation of the type instances in which A is interested.
2. Protein Structures is related to Protein Instances in A's schema.
In this situation, a new supertype called Proteins is created that contains only the properties common to both types, Protein Instances and Protein Structures. The properties in which Protein Instances and Protein Structures differ are associated with two subtypes, which inherit the properties from their common supertype (see Figure b). Together, the new supertype and its two subtypes contain the same information as Protein Instances and Protein Structures in separate components before the unification.
Note that whether Protein Structures is related to Protein Instances or is equivalent to Protein Instances depends upon the perspective of the importing component, as specified in the local lexicon.
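The two unification cases can be made concrete with a small sketch that treats a type as a named set of properties. It is a simplified illustration, not the RemoteExchange implementation: the `TypeDef` class, the function names, and the property names ("name", "sequence", "source", "fold") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TypeDef:
    name: str
    properties: set                  # properties defined directly on the type
    supertype: Optional[str] = None


def unify_equivalent(local, foreign):
    """Case 1: the foreign type becomes a subtype of the local type."""
    # Only properties the local type lacks are added to the new subtype.
    subtype = TypeDef(foreign.name, foreign.properties - local.properties,
                      supertype=local.name)
    # Local properties unknown to the foreign type need null values
    # on the imported instances.
    null_valued = local.properties - foreign.properties
    return subtype, null_valued


def unify_related(local, foreign, super_name):
    """Case 2: a new supertype holds the properties common to both types."""
    common = local.properties & foreign.properties
    supertype = TypeDef(super_name, common)
    sub_local = TypeDef(local.name, local.properties - common, super_name)
    sub_foreign = TypeDef(foreign.name, foreign.properties - common, super_name)
    return supertype, sub_local, sub_foreign


# Made-up property sets for the two types in the example.
instances = TypeDef("Protein Instances", {"name", "sequence", "source"})
structures = TypeDef("Protein Structures", {"name", "sequence", "fold"})

subtype, null_valued = unify_equivalent(instances, structures)
proteins, sub_a, sub_d = unify_related(instances, structures, "Proteins")
```

In the second case, the union of the supertype's properties with each subtype's own properties gives back the original property set of that type, which is one way to check the completeness requirement stated earlier.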
Experimental Implementation
The current RemoteExchange prototype consists of a collection of Omega database components. (The current RemoteExchange environment also supports Iris database components to some degree, and we plan to incorporate other database systems in the future.) Each component consists of an Omega database for maintaining local information and two communication interfaces, called an importer and an exporter, which run on top of Omega. The purpose of these interfaces is to handle communication requests to and from remote components using the RPC message passing paradigm. This section briefly describes the key issues we faced when implementing our prototype.
As noted, our mechanism requires the ability to extract metadata information from a database component. Since the original Omega database does not support this functionality, we introduced so-called metafunctions in the exporter, which return metadata information about objects in its local Omega database. (We have an alternative of implementing such functionality inside the Omega database system; however, it would require modification of the kernel of the database system, which is not acceptable; additional modifications would also be necessary whenever new metafunctions are introduced.) An initial set of metafunctions is the following:
Meta-Function                      Description
ShowAllTypes()                     Returns a list of all types that can be shared
HasStoredFunctions(t:Type)         Returns a list of all stored functions defined on t
HasComputedFunctions(t:Type)       Returns a list of all computed functions defined on t
HasInstances(t:Type) (a)           Returns a list of all instances defined on t
HasValue(i:Instance, f:Function)   Returns the value of the stored function f on i
HasValueType(f:Function)           Returns the value type for a stored function f
HasDirectSubTypes(t:Type)          Returns a list of all direct subtypes of type t
HasDirectSuperType(t:Type)         Returns the direct supertype of type t
(a) Some primitive types like STRING or INTEGER do not have this functionality.
In effect, metafunctions serve as an intercomponent communication protocol.
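To illustrate how the metafunctions listed above could be exposed by an exporter, here is a toy in-memory stand-in. The method names mirror the table in snake_case, but the `Exporter` class, its constructor, the dictionary layout, and the sample data are assumptions made for this sketch. It does not reflect Omega's actual interface, and the RPC transport is omitted.

```python
class Exporter:
    """Toy in-memory exporter; the storage layout is a hypothetical stand-in."""

    def __init__(self, types, instances, shared):
        # types: name -> {"super": name or None,
        #                 "stored": {function: value_type},
        #                 "computed": [function, ...]}
        # instances: instance id -> {"type": name, "values": {function: value}}
        self._types = types
        self._instances = instances
        self._shared = set(shared)

    def show_all_types(self):
        return sorted(self._shared)

    def has_stored_functions(self, t):
        return sorted(self._types[t]["stored"])

    def has_computed_functions(self, t):
        return sorted(self._types[t]["computed"])

    def has_instances(self, t):
        return sorted(i for i, rec in self._instances.items()
                      if rec["type"] == t)

    def has_value(self, i, f):
        return self._instances[i]["values"][f]

    def has_value_type(self, f):
        # Assumes stored function names are unique across types.
        for spec in self._types.values():
            if f in spec["stored"]:
                return spec["stored"][f]
        raise KeyError(f)

    def has_direct_subtypes(self, t):
        return sorted(n for n, spec in self._types.items()
                      if spec["super"] == t)

    def has_direct_supertype(self, t):
        return self._types[t]["super"]


# Made-up sample schema and data.
exporter = Exporter(
    types={
        "Proteins": {"super": None, "stored": {"name": "STRING"},
                     "computed": []},
        "Protein Structures": {"super": "Proteins",
                               "stored": {"fold": "STRING"},
                               "computed": ["residue_count"]},
    },
    instances={"p1": {"type": "Protein Structures",
                      "values": {"name": "hemoglobin", "fold": "globin"}}},
    shared=["Protein Structures"],
)
```

A remote importer would only ever call these eight entry points, which is the sense in which the metafunctions act as a protocol between components.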
In the design and implementation of the sharing advisor, we model it as another Omega component. To avoid modifications of the Omega system, each service provided by the sharing advisor (registration, discovery, resolution, and unification) is supported, again, through its exporter. The semantic dictionary and the concept hierarchy are accordingly modeled by the corresponding Omega database as follows: each concept in the concept hierarchy is modeled as a type object in the semantic dictionary, while each type member of the concept hierarchy is modeled as an instance object in the semantic dictionary.
Conclusions and Future Directions
We introduced a framework and mechanism for identifying and integrating type objects from diverse information sources. The mechanism is built on an object-based model containing features commonly found in most existing object-based database systems, and hence our mechanism can be implemented in these systems without modification to existing DBMS software. We also demonstrated the feasibility of our mechanism in the FOMD environment, representing an area of tremendous and growing interest and importance.
An important characteristic of our mechanism stems from the fact that our approach separates the discovery issue from the integration issue; prior works largely mix the two, with the discovery issue implicitly a part of integration. We observed that different kinds of knowledge are needed in the discovery and integration processes; by identifying the knowledge needed in the different processes, a component is more likely to be able to access the most appropriate nonlocal information in the most natural way.
As noted, the ability of the sharing advisor to determine the similarity among type objects is critical to the success of our mechanism. As such, we are in the process of experimenting with the sharing advisor by measuring its behavior on different sets of testing data of diverse nature. The measurements are based upon the following two metrics: completeness, which measures if all the similar objects are being identified by the sharing advisor, and precision, which measures if all the identified objects are similar.
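One possible way to quantify the two metrics is as ratios over sets of objects, in the style of recall and precision from information retrieval. The text only says what each metric checks, so the ratio formulation, the function names, and the convention for empty sets below are assumptions.

```python
def completeness(identified, truly_similar):
    """Fraction of the truly similar objects that the advisor identified."""
    if not truly_similar:
        return 1.0   # nothing to find, so nothing was missed
    return len(identified & truly_similar) / len(truly_similar)


def precision(identified, truly_similar):
    """Fraction of the identified objects that are in fact similar."""
    if not identified:
        return 1.0   # nothing reported, so nothing was reported wrongly
    return len(identified & truly_similar) / len(identified)


# Made-up outcome of one experiment.
identified = {"a", "b", "c"}
truly_similar = {"a", "b", "d", "e"}
c = completeness(identified, truly_similar)   # 2 of 4 similar objects found
p = precision(identified, truly_similar)      # 2 of 3 reported objects correct
```

A value of 1.0 for both corresponds to the ideal case in which all similar objects are identified and all identified objects are similar.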
In this paper, we have focused on the ability to access and utilize appropriate nonlocal information in a natural way; the performance/efficiency issue was not explicitly considered. We are currently investigating the impact of the so-called lazy evaluation paradigm on reducing the amount of overhead involved. Briefly, the lazy evaluation paradigm shifts the burden of determining the similarity among type objects from the registration process to the discovery process. In this way, the similarity among type objects is never determined unless the type object is queried by a component, thereby eliminating the unnecessary overhead of analyzing type objects that are never queried by components.
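The lazy evaluation idea can be sketched as an advisor that stores type objects at registration time and computes similarities only when a discovery query arrives. Everything in the sketch is hypothetical: the `LazySharingAdvisor` class, the Jaccard overlap used as a stand-in similarity measure, the threshold, and the property sets are assumptions, not the sharing advisor's actual method.

```python
def jaccard(a, b):
    """Stand-in similarity measure: overlap of two property sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class LazySharingAdvisor:
    def __init__(self, similarity=jaccard):
        self._similarity = similarity
        self._registered = {}   # type name -> frozenset of properties
        self._cache = {}        # frozenset({name, name}) -> similarity
        self.comparisons = 0    # counts similarity computations performed

    def register(self, name, properties):
        # Registration only stores the type object; no analysis happens here.
        self._registered[name] = frozenset(properties)
        # Drop cached results that involve a re-registered name.
        self._cache = {k: v for k, v in self._cache.items() if name not in k}

    def discover(self, query_name, threshold=0.5):
        query = self._registered[query_name]
        results = []
        for other, props in self._registered.items():
            if other == query_name:
                continue
            key = frozenset((query_name, other))
            if key not in self._cache:      # computed on first query only
                self._cache[key] = self._similarity(query, props)
                self.comparisons += 1
            if self._cache[key] >= threshold:
                results.append(other)
        return sorted(results)


advisor = LazySharingAdvisor()
advisor.register("Protein Instances", {"name", "sequence", "source"})
advisor.register("Protein Structures", {"name", "sequence", "fold"})
advisor.register("Authors", {"name", "affiliation"})
before = advisor.comparisons                 # no work done at registration
found = advisor.discover("Protein Instances")
after_first = advisor.comparisons
advisor.discover("Protein Instances")        # answered from the cache
after_second = advisor.comparisons
```

Pairs that are never part of a query, such as Authors and Protein Structures here, are never compared, which is where the savings come from.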
Acknowledgements
The authors would like to acknowledge the valuable comments of Shahram Ghandeharizadeh and K.J. Byeon of USC, as well as those of the referees. Special thanks to Kathryn Stuart, who helped us understand the molecular biology that is used in our example scenario.
References
S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill.
U. Dayal and H. Hwang. View Definition and Generalization for Database Integration in Multibase: a System for Heterogeneous Distributed Databases. IEEE Transactions on Software Engineering.
D. Fang, S. Ghandeharizadeh, D. McLeod, and A. Si. The Design, Implementation, and Evaluation of an Object-Based Sharing Mechanism for Federated Database Systems. In International Conference of IEEE Data Engineering.
D. Fang, J. Hammer, and D. McLeod. An Approach to Behavior Sharing in Federated Database Systems. In M.T. Ozsu, U. Dayal, and P. Valduriez, editors, International Workshop on Distributed Object Management. Morgan Kaufmann.
P. Fankhauser and E. Neuhold. Knowledge Based Integration of Heterogeneous Databases. Technical report, Technische Hochschule Darmstadt.
D. Fisher. Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning.
K. Frenkel. The Human Genome Project and Informatics. Communications of the ACM.
S. Ghandeharizadeh et al. Omega: A Parallel Object-Based System. Technical Report USC-CS, Computer Science Department, University of Southern California, Los Angeles, CA, May.
J. Hammer and D. McLeod. An Approach to Resolving Semantic Heterogeneity in a Federation of Autonomous, Heterogeneous Database Systems. International Journal of Intelligent and Cooperative Information Systems, March.
S. Hayne and S. Ram. Multi-User View Integration System (MUVIS): An Expert System for View Integration. In Proceedings of the International Conference on Data Engineering. IEEE, February.
D. Heimbigner and D. McLeod. A Federated Architecture for Information Systems. ACM Transactions on Office Information Systems, July.
M. Huhns, N. Jacobs, T. Ksiezyk, W. Shen, M. Singh, and P. Cannata. Enterprise Information Modeling and Model Integration in Carnot. Technical Report Carnot, MCC.
R. Hull and R. King. Semantic Database Modeling: Survey, Applications, and Research Issues. ACM Computing Surveys, September.
W. Kent. Solving Domain Mismatch Problems with an Object-Oriented Database Programming Language. In Proceedings of the International Conference on Very Large Databases. IEEE, September.
J. Larson, S.B. Navathe, and R. Elmasri. A Theory of Attribute Equivalence and its Applications to Schema Integration. IEEE Transactions on Software Engineering, April.
W. Litwin and A. Abdellatif. Multidatabase Interoperability. IEEE Computer, December.
F. Manola, S. Heiler, D. Georgakopoulos, M. Hornick, and M. Brodie. Distributed Object Management. International Journal of Intelligent and Cooperative Information Systems.
A. Mehta, J. Geller, Y. Perl, and P. Fankhauser. Computing Access Relevance to Support Path-Method Generation in Interoperable Multi-OODB. In Proceedings of the International Conference on Very Large Databases. IEEE, August.
M. Papazoglou, S. Laufmann, and T. Sellis. An Organizational Framework For Cooperating Intelligent Information Systems. International Journal of Intelligent and Cooperative Information Systems.
J. Saldanha and J. Eccles. The Application of SSADM to Modelling the Logical Structure of Proteins. CABIOS.
A. Savasere, A. Sheth, S. Gala, S. Navathe, and H. Marcus. On Applying Classification to Schema Integration. In Proceedings of IEEE 1st International Workshop on Interoperability in Multidatabase Systems, Kyoto, Japan, April.
M. Schwartz, A. Emtage, B. Kahle, and B. Neuman. A Comparison of Internet Resource Discovery Approaches. Computing Systems, August.
A. Sheth, J. Larson, A. Cornelio, and S.B. Navathe. A Tool for Integrating Conceptual Schemata and User Views. In Proceedings of the International Conference on Data Engineering. IEEE, February.
Description
J. Hammer, D. McLeod, and A. Si. "An intelligent system for identifying and integrating non-local objects in federated database systems." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 575 (1994).
Asset Metadata
Creator
Hammer, J
(author),
McLeod, D.
(author),
Si, A.
(author)
Core Title
USC Computer Science Technical Reports, no. 575 (1994)
Alternative Title
An intelligent system for identifying and integrating non-local objects in federated database systems (
title
)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
10 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16269690
Identifier
94-575 An Intelligent System for Identifying and Integrating Non-Local Objects in Federated Database Systems (filename)
Legacy Identifier
usc-cstr-94-575
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017