USC Computer Science Technical Reports, no. 574 (1994)
Object Discovery and Unification in
Federated Database Systems
Joachim Hammer, Dennis McLeod, and Antonio Si
Computer Science Department
University of Southern California
Los Angeles, CA, USA
{joachim, mcleod, asi}@cs.usc.edu
Abstract
A key challenge in sharing-oriented information management environments, such as networks of heterogeneous autonomous database systems, is to provide capabilities to allow information units and resources to be flexibly and dynamically combined and interconnected while at the same time preserving the investment in, and the autonomy of, each individual component. The research described here specifically focuses on two key aspects of this: how to discover the location and content of relevant non-local information units, and how to identify and resolve the semantic heterogeneity that exists between related information in different database components. Our approach serves as a basis for the sharing of related concepts through partial metadata (conceptual schema) unification without the need for a global view of data. We demonstrate and evaluate our approach using the Remote-Exchange experimental prototype system, which supports information sharing and exchange from the above perspective.
Introduction
Support for interoperability among autonomous, heterogeneous data/knowledge base systems is emerging as a key information management problem for the 1990s. Cooperative work, computer-based manufacturing, scientific databases, and traditional data processing are only a few of the environments where collaboration among the individual systems is desired. Proposed architectures to address the interoperability problem range from tightly coupled composite approaches, in which individual databases are integrated into a centralized global database, to loosely coupled federated environments, wherein information is shared among individual database systems while retaining their autonomy.

Footnote: This research was supported in part by NSF grant IRI.
In this paper, we adopt the general context of a collection of cooperating, heterogeneous, autonomous database systems (DBSs); this is termed a federated database system (FDBS), or federation for short. Each individual database system constituting a federation is henceforth termed a component database system, or component for short. Within this framework, we will demonstrate a new modular facility for discovering the location and content of related non-local (remote) information, and for identifying and resolving the semantic heterogeneity that exists between related information in the FDBS.
The remainder of this paper is organized as follows. We first review related work. We then introduce an example of a real-world sharing scenario using a federation of collaborating scientists conducting research on macromolecules. Next, we describe how we achieve interoperability among individual components using the collaborating scientists example; we introduce important terms and concepts used in this paper and provide the object context in which we have couched our research. The sections on Registration, Discovery, Semantic Heterogeneity Resolution, and Unification then describe in greater detail how our approach to information sharing functions, focusing on the individual modules of the mechanism. Finally, the last section contains concluding observations with a critical evaluation of our results and their potential impact.
Related Research
The term heterogeneous databases was originally used to distinguish work that included database model and conceptual schema heterogeneity from work on distributed databases, which addressed issues solely related to distribution. Recently, there has been a resurgence in research in the area of heterogeneous database systems (HDBSs). Work in this area can be characterized by the different levels of integration of the component DBSs and by different levels of global (federation) services. In Mermaid, for example, which is considered a tightly coupled HDBS, component database schemas are integrated into one centralized global schema, with the option of defining different user views on the unified schema. While this approach supports preexisting component databases, it falls short in terms of flexible sharing patterns. Furthermore, the integration process is expensive and complicated, and tends to be difficult to change.
The federated architecture, which is similar to the multidatabase architecture, involves a loosely coupled collection of database systems, stressing autonomy and flexible sharing patterns through intercomponent negotiation. Rather than using a single, static global schema, the loosely coupled architecture allows multiple import schemas, enabling data retrieval directly from the exporter and not indirectly through some central node as in the tightly coupled approach.
One common approach to interoperability is to reason about the meaning and resemblance of heterogeneous objects in terms of their structural representation. In Larson et al., the meaning of an attribute is approximated in terms of its value type, set of possible values, cardinality constraints, integrity constraints, and allowable operations. However, one can argue that any such set of characteristics does not sufficiently describe the real-world meaning of an object, and thus their comparison can lead to unintended correspondences or fail to detect important ones. Other promising methodologies that have been developed include heuristics to determine the similarity of objects based on the percentage of occurrences of common attributes. More precise techniques use classification for choosing a possible relationship between classes.

Footnote: The term distributed database is used here as it has been mainly used in the literature, denoting a relatively tightly coupled, homogeneous system of logically centralized, but physically distributed component databases.

In addition to methods primarily utilizing schema knowledge, techniques based upon the use of semantic knowledge based on real-world experience have also been investigated. These approaches typically assume the existence of a real-world knowledge base, which serves as a global schema to which every local schema is mapped. The similarity between different objects is determined by the use of techniques such as spanning trees in the knowledge base. One such approach associates a fuzzy value with each relationship in the knowledge base, indicating the degree of fuzziness of that relation with respect to the concept/type to which the relation belongs; the set of fuzzy values also acts as a basis for quantitative similarity measurement between two objects. These approaches, however, lack the ability to tailor the identification process to the context of the user request. Further, they assume the existence of a centralized knowledge base. We believe that a useful approach is for the centralized knowledge base to contain only information actively used to support sharing within the federation; thus, as illustrated in our mechanism, it should be dynamically built and tailored to the federation.
A different approach, proposed by Kent, uses an object-oriented database programming language to express mappings among different similar concepts that allow a user to view them in some integrated way. It of course remains to be seen if a language that is sophisticated enough to meet all of the requirements given can be developed in the near future.
A very recent approach to interoperability by Mehta et al. uses so-called path-methods to explicitly create intercomponent and interobject mappings between source and target object classes in order to retrieve and update related data objects. The obvious drawback of this approach is the large overhead in calculating and maintaining the mappings, which may be impractical for large federations with extensive and/or dynamic sharing patterns. This approach of course also requires the determination of the relationship between objects belonging to different components.
A Federation of Collaborating Scientists
Consider the following scenario involving a group (federation) of collaborating scientists, each maintaining a database that contains information about macromolecules. Figure 1 shows a snapshot of the information stored in each database. Since each database is managed by a different scientist (or group thereof), the contents of these databases reflect the different foci and interests of their creators/owners. We can see, for example, that component D is the only one with information on both genetic and protein sequences. All other components maintain either genetic or protein information with various levels of detail.

Figure 1: A Federation of Collaborating Scientists. (Diagram: four database components, Scientists A through D, connected by a network and containing information about macromolecular protein structures; the conceptual schema of each scientist is shown.)
It would clearly be beneficial to a scientist if information in his/her database could be dynamically shared and exchanged; for instance, a scientist could efficiently reuse protein data that had already been sequenced (decoded) by other components for his/her experimental work. A difficulty in finding a solution to information sharing in a federation of this kind stems from the conflicting nature of sharing and autonomy. On one hand, each scientist would like to share information with others; on the other hand, the same scientist would still like to retain autonomy over his/her own database with respect to organization and information release (e.g., control the information it is willing to export to other components). This capability of exporting only a specific portion of a component's database is particularly important in our federation of collaborating scientists, where each component would probably not be willing to release any information that has not yet been published or fully validated. For simplicity, here we assume that all the information stored in a specially marked section of a component's database (the export schema) is available to every other component in the federation.
The level of interoperability that can be achieved within such a federation depends largely upon two key capabilities:

(1) the ability of a component to identify and locate potentially appropriate non-local information with respect to its needs (the discovery problem), and

(2) as requested remote information is identified and located, the ability to fold it into the local system framework (the unification problem).

Figure 2: The Remote-Exchange sharing architecture. (Diagram: components DBS 1 through DBS n are connected through an interconnection layer for sharing and transmission (MODM); the sharing advisor uses the semantic dictionary and the sharing heuristics and offers Registration, Discovery, Semantic Heterogeneity Resolution, and Unification.)

The focus of this work is to address the above two problems. Several other important issues, such as security, access control, and the update of shared data, are not directly addressed in this paper.
Achieving Interoperability in Remote-Exchange
In order to provide a context for our approach to discovery and local unification, Figure 2 illustrates the architecture of the Remote-Exchange experimental system. For sharing and exchange of information to take place among the components, some common model for describing the sharable data must be utilized. To this end, we employ a Minimal Object Database Model (MODM), which contains the essential basic features found in existing object-based and semantic database systems. Briefly, MODM is a functional object-based model supporting complex objects, type membership, subtype-to-supertype relationships, inheritance of functions (methods) from supertype to subtype, and user-definable functions. An advantage of using an object-based common database model is the possibility of sharing information at different levels of abstraction and granularity, including specific facts, metadata, and units of behavior.

In the context of MODM, interoperation among components is possible at many different levels of abstraction and granularity, ranging from factual information units (data objects) to metadata (conceptual schema) to behavior. For this paper, we limit our investigation to the sharing of type objects, which we term type-level sharing. The sharing of individual instances (instance-level sharing) and the sharing of behavior or functions (function-level sharing) are examined elsewhere.
Figure 3: Partial conceptual schemas of two macromolecular databases. (Diagram: Component A holds the types Protein Instances, with properties name, code, function, molwt, and source, and Amino Acid Sequences, with properties name, start, end, and molwt. Component B holds the type Protein Structures, with properties name, code, function, molecular_weight, resolution, and authors; further types shown under Component B include Secondary Structure, Tertiary Structure, Loop, Helix, Strand, Threeten, Alpha, and Pi. Legend: type, property, supertype-subtype.)
As shown in Figure 2, a special component termed the sharing advisor manages knowledge about existing type objects that components export; this knowledge resides in the semantic dictionary. The sharing advisor provides four intelligent services to the components of the federation: Registration, Discovery, Semantic Heterogeneity Resolution, and Unification. These services implement the discovery of type objects and the folding of those objects into the environment of a local component. It is important to note that, as it is nearly impossible to automate these processes, we provide substantial functionality and utilize guided user input as necessary.

Registration
Registration allows a new component to inform the sharing advisor about any information it is willing to share with other components in the federation. In a sense, it establishes an initial sharing context within the federation by logically connecting the exported information to the semantic dictionary via the sharing advisor. Incremental registration allows a component to augment its export schema with new information.
Figure 3 shows two partial conceptual schemas of component (scientist) A and component (scientist) B, respectively. Let us assume that component A has already registered the type objects Protein Instances and Amino Acid Sequences with the sharing advisor. This is reflected in the semantic dictionary shown in Figure 4(a). When component B registers type object Protein Structures, the sharing advisor can use its existing knowledge (in this case, the information obtained from component A) to determine if Protein Structures has any similarities with previously registered types (i.e., Protein Instances and Amino Acid Sequences). User interaction may be necessary to instruct the sharing advisor in case the information in the semantic dictionary is insufficient to detect dissimilarities among type objects automatically. The newly acquired knowledge and the newly registered information are stored in the semantic dictionary for future consultation (see, for example, the contents of the semantic dictionary in Figure 4(b) after the registration of the type object Protein Structures). In a sense, the semantic dictionary represents a dynamic federated knowledge base about sharable information in the federation.

Figure 4: Evolution of a concept hierarchy in the semantic dictionary. (Diagram: (a) the hierarchy after component A has registered Protein Instances, with properties name, code, function, molwt, and source, and Amino Acid Sequences, with properties name, start, end, and molwt; (b) the hierarchy after component B has registered Protein Structures, showing the concept Protein Information with subconcepts Protein Instances and Protein Structures, next to the separate concept Amino Acid Sequences. Legend: concept, property, superconcept-subconcept, member-of (instance-of).)
In the remainder of this section, we illustrate how the sharing advisor can effectively utilize the semantic dictionary as a means of organizing various registered type objects in order to accommodate information sharing. We also introduce a set of sharing heuristics that guide the sharing advisor through registration.
The Semantic Dictionary
In the semantic dictionary, types determined to be similar by the sharing advisor are classified into a collection called a concept, within which subcollections called subconcepts can be further identified. This generates a concept hierarchy. Naturally, the relationships expressed in this concept hierarchy can only be approximations of the true real-world relationships that exist between the exported types of different components. Additional mechanisms are needed to establish more exact relationships (see the section on Semantic Heterogeneity Resolution).

Figure 4 shows two snapshots of the concept hierarchy in the semantic dictionary, taken at different times during the lifetime of our example federation. Figure 4(a) indicates the concept hierarchy after types Protein Instances and Amino Acid Sequences of component A have been registered. Figure 4(b) shows the corresponding hierarchy after component B has registered the type Protein Structures. The hierarchy in Figure 4(b) also indicates that Protein Instances and Protein Structures represent similar information, since they belong to the same classification, called Protein Information. In addition, the types Protein Instances and Protein Structures have properties that distinguish one type from the other. Hence, they are also created as subconcepts of Protein Information in order to express their dissimilarities. By contrast, Amino Acid Sequences is similar to neither Protein Instances nor Protein Structures and thus appears as a separate concept in the hierarchy. Similarity and dissimilarity of this kind are detected by the sharing advisor based upon sharing heuristics (described below), with user input as required.
The advantage of organizing type objects in a concept hierarchy is that a hill-climbing technique can be used to place newly registered type objects. For example, consider a type object being registered which is determined to be dissimilar to members of concept Protein Information in Figure 4(b). In this case, no further comparisons with members of subconcepts of Protein Information are necessary.

Generally, a type object represents a specific view of a corresponding real-world concept and is tailored to the focus and interest of the database component; therefore, the set of properties associated with the type object can be viewed as a subset of those associated with the real-world concept. In the example schemas, type Protein Instances of component A and Protein Structures of component B indicate two different views on the real-world concept protein. In order to properly merge various views on a similar concept, the semantic dictionary is established in a bottom-up fashion, with the set of properties belonging to a concept at a particular level represented as the union of the properties of all its subconcepts. This is illustrated by concept Protein Information in Figure 4(b). Using a concept hierarchy to organize exported types will incrementally establish a federated view of all the export schemas in the federation.
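The bottom-up property union and the pruning of dissimilar subtrees can be illustrated with a small sketch. It is not the Remote-Exchange implementation: the `Concept` class, the `similar` test (a plain Jaccard overlap with an arbitrary threshold), and `find_similar` are hypothetical names introduced here, and the property sets are read off the example schemas for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in a concept hierarchy (hypothetical structure)."""
    name: str
    own_properties: frozenset = frozenset()  # used only by leaves (registered types)
    subconcepts: list = field(default_factory=list)

    def properties(self) -> set:
        """Bottom-up view: the union of the properties of all subconcepts."""
        if not self.subconcepts:
            return set(self.own_properties)
        merged = set()
        for sub in self.subconcepts:
            merged |= sub.properties()
        return merged


def similar(a: set, b: set, threshold: float = 0.5) -> bool:
    """Assumed stand-in for the sharing heuristics: Jaccard overlap of property sets."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold


def find_similar(concepts, new_props, compared):
    """Walk the hierarchy top-down; subtrees of dissimilar concepts are never visited."""
    matches = []
    for concept in concepts:
        compared.append(concept.name)  # record every comparison actually made
        if similar(concept.properties(), new_props):
            matches.append(concept.name)
            matches.extend(find_similar(concept.subconcepts, new_props, compared))
    return matches


protein_instances = Concept(
    "Protein Instances", frozenset({"name", "code", "function", "molwt", "source"}))
protein_structures = Concept(
    "Protein Structures",
    frozenset({"name", "code", "function", "molecular_weight", "resolution", "authors"}))
protein_information = Concept(
    "Protein Information", subconcepts=[protein_instances, protein_structures])
amino_acid_sequences = Concept(
    "Amino Acid Sequences", frozenset({"name", "start", "end", "molwt"}))
hierarchy = [protein_information, amino_acid_sequences]

# Register a hypothetical new type whose properties resemble a sequence type.
compared = []
matches = find_similar(hierarchy, {"name", "start", "end"}, compared)
```

In this run the new type is compared with Protein Information and Amino Acid Sequences only; because it is judged dissimilar to Protein Information, the two subconcepts below it are skipped, which is the saving the text describes.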
The Sharing Heuristics
The sharing heuristics employed draw upon the incremental clustering paradigm of machine learning. The idea behind these heuristics is to assess the extent of the distinguishing capability of a property with respect to a concept; this allows the sharing advisor to determine if the meaning of a type object being registered can be determined based upon its properties, or whether further assistance from users is necessary. The distinguishing capability of a property with respect to a concept is based upon the interconcept dissimilarity between concepts and the intraconcept similarity within a concept. As an example, consider property code of concept Protein Information in Figure 4(b). This property has a high interconcept dissimilarity, as no other concept at the same scope (level) possesses such a property. On the other hand, code also has a high intraconcept similarity with respect to Protein Information, since this property is associated with all concept members of Protein Information (for example, Protein Instances). However, when looking at the subconcepts of Protein Information, property code has a low interconcept dissimilarity with respect to each subconcept of Protein Information, since this property is possessed by all concepts within this same scope of a common superconcept, Protein Information. Note that a property does not necessarily possess the same extent of distinguishing capability across different levels.

Footnote: The name of the concept Protein Information is automatically determined by a simple algorithm, which may be overridden by the user.

Interconcept dissimilarity and intraconcept similarity values are estimated via statistical analysis based on previously registered type objects. A statistical, heuristic-based approach offers a degree of error resilience, allowing the accuracy of the distinguishing capabilities to be gradually improved over a period of time. A limitation of this approach is its potential oscillating nature during the early stages of a federation; however, this limitation becomes less significant as the number of components and type objects increases, damping the oscillation to a steady state.
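The text does not give formulas for the two measures, so the following sketch uses simple frequency estimates as an assumption: intraconcept similarity is taken as the fraction of a concept's members that possess the property, and interconcept dissimilarity as the fraction of the other concepts in the same scope that lack it. The function names and property sets are illustrative, not taken from Remote-Exchange.

```python
def intraconcept_similarity(prop, members):
    """Assumed estimate: fraction of a concept's member types that possess `prop`.

    `members` is a non-empty list of property sets.
    """
    return sum(prop in m for m in members) / len(members)


def interconcept_dissimilarity(prop, other_concepts):
    """Assumed estimate: fraction of the other concepts in the same scope lacking `prop`.

    `other_concepts` is a non-empty list of property sets.
    """
    return sum(prop not in c for c in other_concepts) / len(other_concepts)


# Property sets read off the example schemas (illustrative).
PROTEIN_INSTANCES = {"name", "code", "function", "molwt", "source"}
PROTEIN_STRUCTURES = {"name", "code", "function", "molecular_weight", "resolution", "authors"}
AMINO_ACID_SEQUENCES = {"name", "start", "end", "molwt"}

# Top level: Protein Information versus its only peer, Amino Acid Sequences.
top_inter = interconcept_dissimilarity("code", [AMINO_ACID_SEQUENCES])
top_intra = intraconcept_similarity("code", [PROTEIN_INSTANCES, PROTEIN_STRUCTURES])

# Subconcept level: Protein Instances versus its sibling Protein Structures.
sub_inter = interconcept_dissimilarity("code", [PROTEIN_STRUCTURES])
```

Under these assumed estimates, `code` scores 1.0 on both measures at the top level and 0.0 for interconcept dissimilarity among the subconcepts of Protein Information, matching the qualitative high/high/low pattern described above.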
Discovery
The purpose of our discovery mechanism is to identify appropriate information relevant to the request of a component initiating a sharing procedure. Although we have proposed a methodology to detect similarities among different types, it is not adequate in a federated environment, where the goal is to integrate specific non-local information into the environment of a local component. This is because even though a remote object is similar to a particular local object, it might not be relevant within the intended context of a local component. Therefore, it is imperative that user characteristics of the local components be taken into consideration when locating relevant information. For this purpose, we have categorized three basic kinds of discovery requests which, when combined, allow a component to discover a wide variety of non-local information:

Discovery Request Type 1: Similar Concepts
In this kind of discovery request, a component/user is interested in locating type objects in remote components that are conceptually similar/related to a particular type object in the local component. All type objects belonging to the portion of the concept hierarchy in which the local type object resides in the semantic dictionary are appropriate to this request. For example, in the concept hierarchy of Figure 4(b), Protein Structures of component B is a proper candidate for the request by component A for related information on Protein Instances.

Discovery Request Type 2: Complementary Information
In this kind of discovery request, a component/user is interested in discovering additional information about a local type object. This may occur when component A is interested in additional information on Protein Instances, for example. All type members with different sets of properties belonging to the same portion of the hierarchy in which the local type object resides are proper candidates for this request. For example, Protein Structures of component B would also satisfy a request for additional information on Protein Instances issued by component A.

Discovery Request Type 3: Overlapping Information
This kind of discovery request arises when a component is interested in locating non-local type objects that overlap in their information content with a component's local type. For example, component A would like to display all protein-like information using its own three-dimensional viewing program, which works on members of type Protein Instances. All types with similar properties as Protein Instances that belong to the subhierarchy rooted at Protein Instances are proper candidates for this request. According to the concept hierarchy of Figure 4(b), there is no candidate type object that would satisfy such a request at this time.
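The three request types can be sketched as queries over the example concept hierarchy. This is a sketch under stated assumptions, not the system's discovery interface: `HIERARCHY`, `TYPES`, and all function names are hypothetical, "the portion of the hierarchy in which the local type resides" is interpreted as the subtree under the top-level concept containing that type, and "similar properties" is approximated by a non-empty property intersection.

```python
# Concept hierarchy as an adjacency mapping: concept -> list of subconcepts.
HIERARCHY = {
    "Protein Information": ["Protein Instances", "Protein Structures"],
    "Protein Instances": [],
    "Protein Structures": [],
    "Amino Acid Sequences": [],
}

# Registered type objects: name -> (exporting component, property set).
TYPES = {
    "Protein Instances": ("A", {"name", "code", "function", "molwt", "source"}),
    "Protein Structures": (
        "B", {"name", "code", "function", "molecular_weight", "resolution", "authors"}),
    "Amino Acid Sequences": ("A", {"name", "start", "end", "molwt"}),
}


def descendants(concept):
    """All concepts strictly below `concept` in the hierarchy."""
    result = []
    for child in HIERARCHY[concept]:
        result.append(child)
        result.extend(descendants(child))
    return result


def root_of(concept):
    """Top-level concept of the subtree that contains `concept`."""
    for parent, children in HIERARCHY.items():
        if concept in children:
            return root_of(parent)
    return concept


def remote_types(names, local_component):
    """Keep only registered types exported by some other component."""
    return [n for n in names if n in TYPES and TYPES[n][0] != local_component]


def similar_concepts(local_type):
    """Type 1: remote types in the same portion of the hierarchy."""
    component = TYPES[local_type][0]
    return remote_types(descendants(root_of(local_type)), component)


def complementary_information(local_type):
    """Type 2: as Type 1, restricted to types with a different property set."""
    local_props = TYPES[local_type][1]
    return [n for n in similar_concepts(local_type) if TYPES[n][1] != local_props]


def overlapping_information(local_type):
    """Type 3: remote types below the local type that share some properties."""
    component, local_props = TYPES[local_type]
    below = remote_types(descendants(local_type), component)
    return [n for n in below if TYPES[n][1] & local_props]
```

For component A's Protein Instances, the first two queries return Protein Structures and the third returns an empty list, as in the worked examples of the text.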
Semantic Heterogeneity Resolution
After identifying relevant non-local information (objects), a component may wish to fold them into its local information framework. However, the problem is to determine how this information can be unified with its own local data, due to semantic discrepancies that may exist between related concepts in different components. Such semantic heterogeneity is a natural consequence of the independent creation and evolution of autonomous databases, which are tailored to the requirements of the applications they serve. The purpose of our approach to semantic heterogeneity resolution is to resolve these discrepancies between the relevant non-local type objects and the local metadata context (conceptual schema). This is a prelude to unifying the remote information into the local context. Our mechanism for resolving semantic heterogeneity employs a local lexicon for each component, which specifies its perspective on the precise relationship between its local types and a global set of commonly understood concepts. Specifically, knowledge is represented as a static collection of facts of the simple form:

<term> <relationship descriptor> <term>

A term on the left-hand side of a relationship descriptor represents a local type object (i.e., the unknown), which is described by the term on the right side. We have an initial set of descriptors, which is extensible; we also anticipate collections of descriptors that are tailored to given application domains. The following is a list of basic conceptual relationship descriptors initially supported in Remote-Exchange:
R-Descriptor    Meaning
Identical       Two types are the same
Equal           Two types are equivalent
Compatible      Two types are transformable
KindOf          Specialization of a type
Assoc           Positive association between two types
CollectionOf    Collection of related types
InstanceOf      Instance of a type
Common          Common characteristic of a collection
Feature         Descriptive feature of a type
Has             Property belonging to all instances of a type
For example, we may have the following in the local lexicon of component A:

Protein Instances   KindOf         Protein Structures   A protein instance is a specialization of protein structure
                    CollectionOf   Proteins             It is a collection of all proteins
                    Feature        Authors              One of its characteristics is the existence of an author who discovered it
The terms that are used to describe unknown concepts are taken from a dynamic list of concepts drawn from the semantic dictionary, characterizing the commonalities in a federation. Since interoperability only makes sense among concepts that model similar or related information, it is reasonable to expect a common understanding of a minimal set of concepts taken from the application domain. The semantic dictionary contains partial knowledge about all the terms in the local lexica in the federation, suggesting possible relationships between different terms. Utilizing this knowledge, each local lexicon describes the precise meaning of all type objects that the component exports. In our example (see Figure 4(b)), the set of commonly understood concepts at a particular moment could be C = {Amino Acid Sequences, Authors, Proteins, Protein Information, Protein Instances, Protein Structures}.
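A local lexicon of this kind can be sketched as a list of (term, descriptor, term) facts whose descriptor must come from the supported set and whose right-hand term must come from the common concept set C. The `Fact` and `LocalLexicon` names and their methods are hypothetical and only illustrate the data structure; they are not the Remote-Exchange interface.

```python
from typing import NamedTuple

# Relationship descriptors listed in the text.
DESCRIPTORS = {
    "Identical", "Equal", "Compatible", "KindOf", "Assoc",
    "CollectionOf", "InstanceOf", "Common", "Feature", "Has",
}

# The example set C of commonly understood concepts.
COMMON_CONCEPTS = {
    "Amino Acid Sequences", "Authors", "Proteins",
    "Protein Information", "Protein Instances", "Protein Structures",
}


class Fact(NamedTuple):
    term: str          # local type object being described (the unknown)
    descriptor: str    # relationship descriptor
    described_by: str  # commonly understood concept taken from C


class LocalLexicon:
    """Hypothetical container for one component's lexicon facts."""

    def __init__(self, descriptors, common_concepts):
        self.descriptors = set(descriptors)  # copied so the set can be extended locally
        self.common_concepts = set(common_concepts)
        self.facts = []

    def add(self, term, descriptor, described_by):
        if descriptor not in self.descriptors:
            raise ValueError(f"unknown relationship descriptor: {descriptor}")
        if described_by not in self.common_concepts:
            raise ValueError(f"not a commonly understood concept: {described_by}")
        self.facts.append(Fact(term, descriptor, described_by))

    def describe(self, term):
        """All (descriptor, concept) pairs recorded for a local type."""
        return [(f.descriptor, f.described_by) for f in self.facts if f.term == term]


# The three example facts from component A's lexicon.
lexicon_a = LocalLexicon(DESCRIPTORS, COMMON_CONCEPTS)
lexicon_a.add("Protein Instances", "KindOf", "Protein Structures")
lexicon_a.add("Protein Instances", "CollectionOf", "Proteins")
lexicon_a.add("Protein Instances", "Feature", "Authors")
```

Validating the right-hand term against C reflects the requirement that unknown local terms are described only by concepts the federation already understands.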
The essential idea of using a local lexicon is to represent the semantics of the shared terms in a more expressive and complete manner than in the conceptual schema. The additional semantic information is important for the following reason: as noted earlier, the results of the discovery process are not always conclusive enough to determine the exact meaning of a term or, more precisely, how two similar terms in different components are related. Similarly, it is not possible to derive the meaning/usage of terms by looking at their structural representation in the conceptual schema of a local component. In order to determine the relationships between objects in a federation, we realize that not one single method, but a combination of several different but complementary approaches (i.e., Registration, Discovery, Semantic Heterogeneity Resolution), taken together, is highly promising. For example, there is an important connection between the local lexica and the semantic dictionary. Local lexica contain only semantic information and no knowledge about any relationships among their objects, information that is necessary to solve the semantic heterogeneity problem. This kind of information is provided by the semantic dictionary (through registration), which contains partial knowledge about the relationships among all the terms in the local lexica in the federation. Note, of course, that the lexica and the semantic dictionary are both dynamic, i.e., they grow and shrink as the amount of shared data in the federation increases/decreases.
The basic problem addressed by the semantic heterogeneity resolution mechanism may now be expressed, without loss of generality, as follows: given two objects, a local and a foreign one, return the relationship that exists between the two. Specifically, our strategy is based on structural knowledge (conceptual schema information) and the known relationships that exist between common global concepts in set C and the two objects in question (local lexicon, semantic dictionary). One characteristic of our approach is that the majority of user input occurs before the resolution step is performed (i.e., when selecting the set C and creating the local lexicon) rather than during.
Unification
At this point, the non-local objects can be unified with the corresponding local objects. In some cases, the local metadata framework (conceptual schema) must be restructured to achieve a result that is complete, minimal, and understandable. Complete, since the new integrated schema must contain all concepts that were present before the unification process took place. Minimal, since concepts should only be represented once. And understandable, since the integrated schema should be easy to understand for the end user.
Up on imp orting the metadata structural conicts with existing t yp es in the com
p onen ts lo cal t yp e hierarc hyma y arise In w een umerated in detail v arious conicting
p ossibilities that can arise while imp orting nonlo cal metadata in to a lo cal sc hema Here
w e illustrate our mec hanism with the simple example of imp orting t yp e Protein Struc
tures of comp onen t B in to As sc hema The follo wing t w o p ossibilities exist
Protein Structures is semantically equivalent to Protein Instances in A's schema.
In this case, we make Protein Structures a subtype of Protein Instances. Properties
that previously belonged to Protein Instances remain there. Properties that
belong to Protein Structures but not to Protein Instances are added to the
new subtype. In those cases where Protein Instances has additional properties
that do not exist for Protein Structures, special null values must be assigned to
all its instances (see part (a) of the figure). Note that subsetting is considered to be the basis
for accommodating multiple user perspectives on comparable types, used by most
methodologies. The case in which Protein Structures is identical to Protein
Instances is a special case and only requires the importation of the type instances
in which A is interested.
[Figure: A's schema after resolution and unification of protein-like information. Panel (a): unification of (semantically) equivalent types. Panel (b): unification of (semantically) related types. Legend: type, property, supertype-subtype link.]
Protein Structures is related to Protein Instances in A's schema.
In this situation, a new supertype called Proteins is created that contains only the
properties common to both types, Protein Instances and Protein Structures.
The properties in which Protein Instances and Protein Structures differ are
associated with two subtypes, which inherit the properties from their common
supertype (see part (b) of the figure). Together, the new supertype and its two subtypes contain
the same information as Protein Instances and Protein Structures in separate
components before the unification.
Note that whether Protein Structures is related to Protein Instances or is
equivalent to Protein Instances depends upon the perspective of the importing component,
as specified in the local lexicon.
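The two unification cases can be sketched as operations on a simple type table. This is a minimal sketch, not the Remote-Exchange implementation: the `TypeObject` class, the function names, and the property sets are assumptions chosen for illustration (the property names are loosely based on the labels in the figure), and instance data is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class TypeObject:
    name: str
    properties: Set[str]             # properties defined directly on this type
    supertype: Optional[str] = None  # name of the direct supertype, if any


Schema = Dict[str, TypeObject]


def unify_equivalent(schema: Schema, local: str, foreign: TypeObject) -> Set[str]:
    """Equivalent case: import `foreign` as a subtype of the local type.

    Returns the local-only properties, which imported instances must fill
    with null values.
    """
    local_props = schema[local].properties
    # The subtype only carries the properties that the local type lacks.
    schema[foreign.name] = TypeObject(
        foreign.name, foreign.properties - local_props, supertype=local)
    return local_props - foreign.properties


def unify_related(schema: Schema, local: str, foreign: TypeObject,
                  supertype_name: str) -> None:
    """Related case: factor the shared properties into a new supertype."""
    local_type = schema[local]
    common = local_type.properties & foreign.properties
    # Assumption: the new supertype takes over the local type's old parent.
    schema[supertype_name] = TypeObject(supertype_name, common, local_type.supertype)
    schema[local] = TypeObject(local, local_type.properties - common, supertype_name)
    schema[foreign.name] = TypeObject(
        foreign.name, foreign.properties - common, supertype_name)


# Hypothetical property sets for the running example.
LOCAL_PROPS = {"code", "function", "molwt", "name", "source"}
FOREIGN = TypeObject("Protein Structures", {"name", "source", "resolution", "authors"})

schema_eq: Schema = {"Protein Instances": TypeObject("Protein Instances", set(LOCAL_PROPS))}
null_props = unify_equivalent(schema_eq, "Protein Instances", FOREIGN)

schema_rel: Schema = {"Protein Instances": TypeObject("Protein Instances", set(LOCAL_PROPS))}
unify_related(schema_rel, "Protein Instances", FOREIGN, "Proteins")
```

In the related case, the union of the supertype's properties with each subtype's own properties gives back the original property set of that type, which is one way to check the completeness requirement stated earlier.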
Conclusions and Directions
We have introduced a framework and mechanism for identifying and integrating type
objects from diverse information sources. The mechanism is built on an object-based model
containing features commonly found in most existing object-based database systems, and
hence can be implemented in these systems with little or no modification to existing DBMS
software. We have also demonstrated the feasibility of our mechanism in a macromolecular
environment, representing an area of tremendous and growing interest and importance.
An important characteristic of our mechanism stems from the fact that our approach
separates the discovery issue from the integration issue; prior work largely mixes the two,
with the discovery issue implicitly a part of integration. We observed that different, albeit
related, kinds of knowledge are needed in the discovery and integration processes; by
identifying the knowledge needed in the different processes, a component is more likely to
be able to access the most appropriate nonlocal information in the most natural way.
As noted, the ability of the sharing advisor to determine the similarity among type
objects is critical to the success of our mechanism. As such, we are in the process of
experimenting with the sharing advisor by measuring its behavior on diverse sets of test
data. The measurements are based upon the following two metrics: completeness, which
measures if all the similar objects are being identified by the sharing advisor, and
precision, which measures if all the identified objects are similar.
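The two metrics can be made concrete as set ratios. The text describes them only qualitatively, so the fractional definitions below are an assumption (they correspond to what is usually called recall and precision), and the object names in the example are hypothetical test data.

```python
from typing import Set


def completeness(identified: Set[str], similar: Set[str]) -> float:
    """Fraction of the truly similar objects that the advisor identified."""
    if not similar:
        return 1.0  # nothing to find, so nothing was missed
    return len(identified & similar) / len(similar)


def precision(identified: Set[str], similar: Set[str]) -> float:
    """Fraction of the identified objects that are truly similar."""
    if not identified:
        return 1.0  # nothing reported, so nothing was reported wrongly
    return len(identified & similar) / len(identified)


# Hypothetical run: the advisor reports three type objects, two of them correct,
# while four type objects are actually similar to the query.
identified = {"Protein Structures", "Amino Acid Sequences", "Authors"}
similar = {"Protein Structures", "Amino Acid Sequences",
           "Protein Instances", "Proteins"}

run_completeness = completeness(identified, similar)  # 2 of 4 similar objects found
run_precision = precision(identified, similar)        # 2 of 3 reported objects correct
```

A value of 1.0 for both functions corresponds to the ideal stated above: all similar objects are identified, and all identified objects are similar.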
We have focused here on functionality rather than performance efficiency. We are
currently investigating the impact of the lazy evaluation paradigm on reducing the amount
of overhead involved in our system. Briefly, the lazy evaluation paradigm shifts the
burden of determining the similarity among type objects from the registration process to
the discovery process. In this way, the similarity among type objects is not determined
unless the type object is actually used by a component.
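The shift from registration time to discovery time can be illustrated with a small cache-on-demand sketch. Everything here is an assumption made for illustration: the class name `LazySharingAdvisor`, its methods, and the Jaccard overlap used as a stand-in similarity measure are not taken from the system described in this report.

```python
from typing import Callable, Dict, FrozenSet, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Placeholder similarity: overlap of two property sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class LazySharingAdvisor:
    def __init__(self, measure: Callable[[Set[str], Set[str]], float]) -> None:
        self._measure = measure
        self._registered: Dict[str, Set[str]] = {}
        self._cache: Dict[FrozenSet[str], float] = {}
        self.computations = 0  # counts how often the measure actually ran

    def register(self, name: str, properties: Set[str]) -> None:
        # Registration only records the type object; no similarity is computed.
        self._registered[name] = set(properties)

    def similarity(self, a: str, b: str) -> float:
        # Discovery computes a similarity on first use and caches the result.
        key = frozenset((a, b))
        if key not in self._cache:
            self.computations += 1
            self._cache[key] = self._measure(self._registered[a], self._registered[b])
        return self._cache[key]


advisor = LazySharingAdvisor(jaccard)
advisor.register("Protein Instances", {"code", "function", "molwt", "name", "source"})
advisor.register("Protein Structures", {"name", "source", "resolution", "authors"})
advisor.register("Amino Acid Sequences", {"name", "sequence"})
after_registration = advisor.computations

first = advisor.similarity("Protein Instances", "Protein Structures")
second = advisor.similarity("Protein Structures", "Protein Instances")
after_discovery = advisor.computations
```

With three registered types, an eager scheme would compare all three pairs at registration time; in this sketch only the one pair that a component actually asks about is ever compared, and the repeated query is served from the cache.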
Acknowledgements
The authors would like to acknowledge the very useful insights of the researchers involved in
Remote-Exchange, including K.J. Byeon and Jonghyun Kahng. Special thanks to Kathryn
Stuart, who helped us understand the molecular biology that is used in our example
scenario.
References
C. Batini, M. Lenzerini, and S. Navathe. A Comparative Analysis of Methodologies of Database Schema Integration. ACM Computing Surveys.
S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill.
U. Dayal and H. Hwang. View Definition and Generalization for Database Integration in Multibase: A System for Heterogeneous Distributed Databases. IEEE Transactions on Software Engineering.
D. Fang, J. Hammer, and D. McLeod. An Approach to Behavior Sharing in Federated Database Systems. In M.T. Özsu, U. Dayal, and P. Valduriez, editors, Distributed Object Management. Morgan Kaufmann.
D. Fang, J. Hammer, D. McLeod, and A. Si. Remote-Exchange: An Approach to Controlled Sharing among Autonomous, Heterogenous Database Systems. In Proceedings of the IEEE Spring Compcon. IEEE, San Francisco, February.
P. Fankhauser and E. Neuhold. Knowledge Based Integration of Heterogeneous Databases. Technical report, Technische Hochschule Darmstadt.
D. Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning.
K. Frenkel. The Human Genome Project and Informatics. Communications of the ACM.
J. Hammer and D. McLeod. An Approach to Resolving Semantic Heterogeneity in a Federation of Autonomous, Heterogeneous Database Systems. International Journal of Intelligent and Cooperative Information Systems, March.
S. Hayne and S. Ram. Multi-User View Integration System (MUVIS): An Expert System for View Integration. In Proceedings of the International Conference on Data Engineering. IEEE, February.
D. Heimbigner and D. McLeod. A Federated Architecture for Information Systems. ACM Transactions on Office Information Systems, July.
M. Huhns, N. Jacobs, T. Ksiezyk, W. Shen, M. Singh, and P. Cannata. Enterprise Information Modeling and Model Integration in Carnot. Technical Report Carnot, MCC.
W. Kent. The Many Forms of a Single Fact. In Proceedings of the IEEE Spring Compcon. IEEE, February.
W. Kent. Solving Domain Mismatch Problems with an Object-Oriented Database Programming Language. In Proceedings of the International Conference on Very Large Databases. IEEE, September.
J. Larson, S.B. Navathe, and R. Elmasri. A Theory of Attribute Equivalence and its Applications to Schema Integration. IEEE Transactions on Software Engineering, April.
W. Litwin and A. Abdellatif. Multidatabase Interoperability. IEEE Computer, December.
F. Manola, S. Heiler, D. Georgakopoulos, M. Hornick, and M. Brodie. Distributed Object Management. International Journal of Intelligent and Cooperative Information Systems.
A. Mehta, J. Geller, Y. Perl, and P. Fankhauser. Computing Access Relevance to Support Path-Method Generation in Interoperable Multi-OODB. In Proceedings of the International Conference on Very Large Databases. IEEE, August.
M. Papazoglou, S. Laufmann, and T. Sellis. An Organizational Framework for Cooperating Intelligent Information Systems. International Journal of Intelligent and Cooperative Information Systems.
J. Saldanha and J. Eccles. The Application of SSADM to Modelling the Logical Structure of Proteins. CABIOS.
A. Savasere, A. Sheth, S. Gala, S. Navathe, and H. Marcus. On Applying Classification to Schema Integration. In Proceedings of the IEEE First International Workshop on Interoperability in Multidatabase Systems. Kyoto, Japan, April.
A. Sheth, J. Larson, A. Cornelio, and S.B. Navathe. A Tool for Integrating Conceptual Schemata and User Views. In Proceedings of the International Conference on Data Engineering. IEEE, February.
T. Templeton et al. Mermaid: A Front-End to Distributed Heterogenous Databases. In Proceedings Intl. Conf. on Data Engineering. IEEE.
G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, March.
Description
J. Hammer, D. McLeod, and A. Si. "Object discovery and unification in a federated database system." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 574 (1994).
Asset Metadata
Creator: Hammer, J (author); McLeod, D (author); Si, A. (author)
Core Title: USC Computer Science Technical Reports, no. 574 (1994)
Alternative Title: Object discovery and unification in a federated database system (title)
Publisher: Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag: OAI-PMH Harvest
Format: 16 pages (extent), technical reports (aat)
Language: English
Unique identifier: UC16270746
Identifier: 94-574 Object Discovery and Unification in a Federated Database System (filename)
Legacy Identifier: usc-cstr-94-574
Rights: Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type: application/pdf
Copyright: In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source: 20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions: The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: USC Viterbi School of Engineering Department of Computer Science
Repository Location: Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email: csdept@usc.edu