Resource Discovery in Federated Database Systems

by

Antonio Si

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

August 1994

Copyright 1994 Antonio Si

UMI Number: DP22892

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Antonio Tze Lun Si under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY

Dean of Graduate Studies
Date
DISSERTATION COMMITTEE
Chairperson

Dedication

If at first you don’t succeed, try, try again.
WILLIAM EDWARD HICKSON.

This dissertation is dedicated to my loving family... Sik-Chow Sy, my father, Chor-Fong Wong, my mother, Canty Sy, my older sister, Doris Sy, my younger sister, and Arien Sy, my younger brother.

Acknowledgments

Forsake not an old friend; for the new is not comparable to him: a new friend is as new wine; when it is old, thou shalt drink it with pleasure.
APOCRYPHA. ECCLESIASTICUS 9:10.
I always wondered why people tend to acknowledge just about everyone in the universe in their dissertations; now I am starting to understand.

First and foremost, I would like to express my deepest gratitude to my dissertation advisor, Professor Dennis McLeod, for his guidance, criticism, support, patience, encouragement, as well as friendship throughout the years of my research. I would like to sincerely thank Professor Shankar Rajamoney for his invaluable discussions and opinions. I would also like to thank my other committee members, Professor Shahram Gandeharizadeh, Professor Ellis Horowitz, and Professor Michel Dubois, for their countless technical discussions. Special thanks to Professor Peter Danzig for providing equipment in the infrastructure lab in which the prototype and experiments of this research were built.

I thank my office mates and good friends for many, many interesting and stimulating discussions, as well as for putting up with me and providing such a wonderful and comfortable working forum: Doug Fang, Joachim Hammer, K. J. Byeon, Jonghyun Kahng, and Katia Obraczka.

I am truly indebted to my best friend, Maricus Tai, for his support, encouragement, and friendship through all these years. I also owe Tri Tran an acknowledgment for helping me understand the molecular biology concepts used as examples in this thesis.

I leave to last thanking my friends whose support gave me the strength to succeed: Sharon Cheong, Sharon Fung, Paul Chu, Irene Yee, Yukuang Lim, Vincent Lung, Robert Lai, Waikei Tsang, Emily Liu, Simon Wong, Alex So, Sosana Chan, and Shirley Tai.

Table of Contents

Dedication
Acknowledgments
List Of Tables
List Of Figures
Abstract
1 Introduction
  1.1 Motivation and Research Context
  1.2 Research Strategy
    1.2.1 Results Overview
  1.3 Guide to Rest of Thesis
2 A Sharing Scenario
  2.1 The Macromolecular Information Bases
  2.2 The Ultimate Sharing Goal
  2.3 Research Problems
3 Related Research
  3.1 Networking Context
  3.2 Multi-databases Context
4 A Federated Sharing Context
  4.1 The Federated Environment
  4.2 The Core Object Data Model (CODM)
  4.3 A Spectrum of Resources within CODM
5 Resource Discovery Mechanism
  5.1 A Two-phase Framework to Discovery
  5.2 The Semantic Dictionary: Organization
  5.3 The Heuristics: Indexing
    5.3.1 Distinguishing Power Estimation
    5.3.2 Probability Estimation
      5.3.2.1 Prior Values Estimation
    5.3.3 Heuristics Indexing Estimation
    5.3.4 Discussion
  5.4 Discovery Requests: Searching
    5.4.1 Discussion
  5.5 The Discovery Tool: Browsing
6 Prototype Implementation
  6.1 The Overall Architecture
  6.2 The Design of the Sharing Advisor
  6.3 The Importer and Exporter
7 Performance Evaluation
  7.1 Evaluation Testbed
  7.2 Metrics for Measurement
  7.3 Design of Experiments
    7.3.1 Experimental Configurations
  7.4 Experimental Results
    7.4.1 Experiment #1
    7.4.2 Experiment #2
    7.4.3 Experiment #3
    7.4.4 Experiment #4
    7.4.5 Experiment #5
    7.4.6 Experiment #6
  7.5 Discussion
8 Overhead Optimization
  8.1 Optimization Method
  8.2 Overhead Measurements
  8.3 Optimization Tradeoffs
9 Conclusion and Summary
  9.1 Summary of Results
  9.2 Research Contributions
  9.3 Limitations of this Research
  9.4 Directions for Future Research
    9.4.1 Addressing Properties Inter-dependencies
    9.4.2 Instance-level Discovery
    9.4.3 Schema Integration
    9.4.4 An Internetworked Federation
Appendix A Export Schema Specification
Appendix B Additional Experimental Results
  B.1 Experiment #1
  B.2 Experiment #2
Reference List

List Of Tables

4.1 A spectrum of resources within CODM
6.1 The set of Meta-functions

List Of Figures

2.1 A federation of protein/genetics databases
2.2 Two possible approaches to access non-local data
2.3 An idealized approach to access non-local data
4.1 A federated sharing environment
4.2 Partial conceptual schemas of two protein/genetics components
5.1 Another view of a federated sharing environment
5.2 Evolution of concept hierarchies in the semantic dictionary
5.3 Fundamental base of the sharing heuristics
5.4 Pseudo code for estimating sharing heuristics
5.5 The discovery tool
6.1 The four-layer architecture of the federation
6.2 Interactions among components within the federation
6.3 Conceptual schema of the semantic dictionary
6.4 Flow of information for browsing the semantic dictionary
6.5 The discovery tool for browsing the semantic dictionary
7.1 Design of the simulation model
7.2 Similarity ratios: heuristic-based vs. properties-intersection
7.3 Instruction ratios: heuristic-based vs. properties-intersection
7.4 Similarity ratios for 100, 200, 400, 800, and 1000 objects
7.5 Dissimilarity ratios for 100, 200, 400, 800, and 1000 objects
7.6 Instruction ratios for 100, 200, 400, 800, and 1000 objects
7.7 Similarity ratios for varying N_similar
7.8 Dissimilarity ratios for varying N_similar
7.9 Instruction ratios for varying N_similar
7.10 Similarity ratios for varying N_dissimilar
7.11 Dissimilarity ratios for varying N_dissimilar
7.12 Instruction ratios for varying N_dissimilar
7.13 Similarity ratios for varying R
7.14 Dissimilarity ratios for varying R
7.15 Instruction ratios for varying R
7.16 Similarity ratios for 100, 200, 400, 800, and 1000 objects
7.17 Dissimilarity ratios for 100, 200, 400, 800, and 1000 objects
7.18 Instruction ratios for 100, 200, 400, 800, and 1000 objects
8.1 Evolution of the semantic dictionary with lazy indexing
8.2 Overhead saved with lazy indexing
9.1 An internetworked federation
A.1 Specification of the export schema
B.1 Similarity ratios: properties-intersection mechanism under different threshold
B.2 Instruction ratios: properties-intersection mechanism under different threshold
B.3 Similarity ratios: properties-intersection mechanism under different threshold
B.4 Dissimilarity ratios: properties-intersection mechanism under different threshold
B.5 Instruction ratios: properties-intersection mechanism under different threshold

Abstract

Recent years have witnessed great advances in database management systems (DBMSs) for organizing, storing, and providing access to data and knowledge. The information world is overwhelmed with data and knowledge scattered among database systems residing locally, nationally, and internationally.
We are now in a situation somewhat analogous to that faced a few decades ago, when diverse sets of data were managed by their corresponding application programs; currently, we have a proliferation of databases constructed and managed by the corresponding database management software, available via networking facilities. Techniques and mechanisms to support effective and efficient access to and sharing of related information among the (potentially) inter-connected existing database sources are now required.

A key challenge in supporting interoperation among a collection of database systems is to provide facilities for users of individual systems to learn about the location and content of the remote information space available. Despite the existence of state-of-the-art network information discovery systems, the problem of discovering relevant database information units among a network of database systems remains largely unsolved.

This thesis describes a resource discovery system for a collection of autonomous, isolated databases that allows various information units to be dynamically (1) located, (2) interrelated, and (3) queried. The metrics for measuring the performance of the resource discovery system are presented, and the performance of the system is evaluated against these metrics. The mechanism is based upon a core set of database modeling constructs that characterizes object-based database systems and a set of self-adaptive heuristics employing techniques from machine learning. The approach provides a uniform framework for organizing, indexing, searching, and browsing database resources within the environment of multiple databases. The feasibility of the approach and mechanism is demonstrated by a prototype developed for the USC Omega DBMS. Performance tradeoffs are examined and analyzed with the help of experiments performed on a carefully designed simulation model.
Chapter 1 Introduction

“Begin at the beginning”, the king said gravely, “and go until you come to the end, then stop.”
LEWIS CARROLL.

The capability to support the dynamic sharing of information among a collection of autonomous, heterogeneous database systems has attracted tremendous attention and interest within the past few years. A central and key open problem in this context is to provide mechanisms for users of a database system to discover or query non-local information units that are “relevant” to some requirements. In this chapter, we investigate the motivations behind the demand for the sharing, and thereby the discovery, of database information. Our strategy for pursuing this discovery problem is also discussed in this chapter.

1.1 Motivation and Research Context

Since the introduction of the first generation of database management systems (DBMSs) several decades ago as a means to consistently manipulate and share diverse data files, we have witnessed great advances in database technology with the emergence of powerful general purpose database management facilities. We can observe, however, that developing a completely integrated database is often difficult; in consequence, we see many databases and database systems within and across organizations, maintaining possibly overlapping or related information. For instance, a recent report by US West, a large telecommunication company, indicated that 5 terabytes of data are managed by 1,000 separate systems, with customer information alone spread across 200 different databases [9, 47]. Another report from GTE telephone indicated having 27,000 elements from just 40 of its applications; it was estimated that an average of 4 hours per data element is required to extract and document its semantics and information [47].
We are now in a situation similar to the one we were in a few decades ago, when diverse sets of data were managed by their corresponding application programs; currently, we have a proliferation of databases constructed and managed by the corresponding database management software. Despite major steps forward in relatively high speed interconnection networks, which offer the potential for more information sharing and exchange among computer systems than was previously possible, state-of-the-art database systems remain, logically, largely isolated and uncooperative.

By analogy with the original motivation for general purpose DBMSs, namely dynamically sharing and accessing related data from diverse data files, a current motivation is to provide mechanisms to support dynamic access to and sharing of related information (resources) among the already existing isolated database sources; this is sometimes referred to as the database interoperability problem. Traditional distributed and heterogeneous database systems have concentrated on making decentralized systems appear centralized, i.e., individual database systems are combined into a unified (virtual) database. This violates the autonomy of each individual database system. Typically, strict requirements are placed upon individual systems with regard to the format of data, the contents and format of the data dictionary, data manipulation, and the data definition language. Further, the rigid consistency requirement of a distributed database system has resulted in a tremendous amount of overhead and investment. Recently, though, the need for loosely-coupled interaction among individual database systems that respects their own autonomy has been recognized [44, 45].
Such a collection of cooperating but heterogeneous, autonomous database systems may be termed a federated database system, or federation for short; each individual database system in a federation is termed a component database system, or component [20]. We observe that the central and key open problems of sharing in this federated environment can be broken down into three areas:

1. The discovery and identification of relevant non-local (remote) information units (resources) with respect to a specific user request.
2. The resolution of the similarities and differences (semantic heterogeneity [17]) in the modeling of information units among database systems.
3. The (partial) integration of relevant non-local information units into a specific database system.

In this thesis, we specifically address the problem of information/resource discovery in federated database systems,¹ allowing users of one component system of a database network to find remote resources that are “relevant” to a particular requirement. Previous work in database interoperation largely ignores this discovery issue and mainly focuses on the resolution and integration aspects. This requires database users to analyze a potentially non-manageable resource space.

¹ This work is part of the Remote-Exchange project here at USC, which attempts to address all three of these issues [11].

1.2 Research Strategy

In this thesis, an approach and mechanism to support the dynamic discovery of information units within a federation of autonomous and heterogeneous database systems is described. One of the main concerns in supporting a sharing, and thereby discovery, mechanism for a federated system of autonomous components is the desire of each database system to participate in sharing while at the same time preserving its investment in and autonomy over its own database with respect to administration and information release, such as control over the information it is willing to “export” to the other components. Any solution involving the modification or rewriting of any of a component’s DBMS software is largely undesirable.

Another critical issue in supporting the sharing and discovery of database information units among a collection of autonomous database systems is the capability of interrelating various database information units. Traditional “properties-intersection” approaches simply use a “pattern-matching” paradigm in which the relationships among various information units are determined according to their corresponding database representations [19, 30]. We observe that, in the real world, different database systems may be interested in complementary or even disjoint views of related information; by contrast, various database systems may model unrelated information similarly. Traditional “properties-intersection” approaches are therefore not adequate for relating information units maintained by different database systems. A more intelligent technique is necessary to deduce the relationships among various information units according to the different foci and interests of different database systems. A key related issue here is to identify information units that are not only related, but also appropriate to a database component, depending on the needs of that component.

To support the dynamic discovery of relevant information units effectively, a global repository of knowledge and information units that are sharable among different database systems has to be maintained. The complexity of the knowledge maintained in this global repository will have a major impact on what kind of discovery services are supported.
At one extreme, the information could be a simple list of component databases and a short description of their contents. The overhead involved in maintaining such a knowledge repository will be minimal; the discovery services supported, however, will also be very limited. At the other extreme, complete information on all component databases could be maintained. More sophisticated discovery services can be supported in this case; the knowledge repository, however, will become non-manageable. A related key issue is to organize the knowledge in the repository meaningfully so as to facilitate the efficient identification of relevant information. A simple organizational structure will minimize maintenance overhead at the expense of a costly identification process. A sophisticated organizational structure, by contrast, requires expensive maintenance with the payoff of a simple searching technique.

A final issue that needs to be considered in designing a discovery system for a multiple-database environment is the capability for users to explore or navigate the content of the resource space. This capability of browsing through the resource space is closely related to the organizational structure of resources: the better the resources are organized, the more meaningful the organization will be, and the easier and more effective it is to browse.

In light of these requirements, the discovery mechanism described in this thesis is based upon a core set of database modeling constructs that characterizes object database systems; this allows the mechanism to be readily implemented using existing object-oriented database systems.
Furthermore, the mechanism utilizes a set of self-adaptive heuristics employing techniques from machine learning to deduce the relationships among various database information units; this offers a degree of error resilience, allowing the accuracy of the deduction process to be gradually improved over a period of time. We also introduce the notion of a concept hierarchy, which inter-relates various database information units. A special characteristic of our concept hierarchy is that maintenance overhead is minimal, while at the same time a considerable range of discovery services can be supported. The concept hierarchy is placed in a global knowledge repository, called the semantic dictionary, which can be accessed by all database components. With the help of this notion of concept hierarchy, several discovery services are supported that allow component users to issue a discovery request to search for relevant information units. As an alternative to discovering remote information units through a discovery request, our discovery system also provides a browsing capability, viz., a discovery tool which allows users to navigate and explore the resource space. In this respect, our mechanism provides a uniform framework for organizing, indexing, searching, and browsing database information units within an environment of multiple, autonomous, interconnected databases.

1.2.1 Results Overview

In the rest of this thesis, we demonstrate that, by using a core set of object-oriented database modeling constructs, our resource discovery mechanism for federated database systems can be readily implemented on an existing object-oriented database management system. The architecture of our prototype also shows the portability of our mechanism across various database system platforms.
Using a carefully designed simulation model, we show that our discovery mechanism, which is based on a set of self-adaptive heuristics, not only outperforms the traditional properties-intersection algorithm but also scales well with the size of the federation. In other words, the performance of our discovery mechanism improves as a function of the size of the federation.

1.3 Guide to Rest of Thesis

The remainder of this thesis is organized as follows. In Chapter 2, a running example involving a collection of protein/genetics scientific databases is described.

In Chapter 3, we survey previous work in the area of information sharing, with specific regard to the discovery problem. In particular, we compare and contrast the desired goals, problems involved, and mechanisms employed in the information sharing systems of the networking area with those of the database area.

In Chapter 4, we introduce a generic Core Object Data Model (CODM) which contains all the necessary modeling constructs that our discovery mechanism is built upon. A taxonomy of resources that can be shared and discovered in the context of this CODM within the federated environment is also provided.

In Chapter 5, we describe our discovery mechanism in detail, emphasizing its ability to “organize”, “index”, “search”, and “browse” a database resource space. It is important to note that our mechanism is not implementation specific, as it is based on commonly accepted object-based constructs [27]. Hence, any component supporting the object-oriented paradigm will support these constructs and can take advantage of our discovery mechanism.

In Chapter 6, we demonstrate the feasibility of our mechanism by describing the implementation of our experimental prototype using the existing USC Omega [16] object-oriented DBMS.
In Chapter 7, we present the experimental testbed, metrics for performance measurement, and experimental results. Our experiments are conducted to investigate the performance differences between our mechanism and the traditional properties-intersection mechanisms. Furthermore, the performance of our mechanism under diverse conditions is also analyzed in detail.

In Chapter 8, we look at a mechanism for reducing the overhead involved in our discovery mechanism. In particular, we contrast the so-called “lazy analysis” paradigm with our “eager analysis” paradigm with respect to the amount of overhead involved.

Finally, in Chapter 9, we summarize the results, contributions, and limitations of this work. We also use this chapter to discuss ongoing and future research opportunities that can build on the work presented in this thesis.

Chapter 2 A Sharing Scenario

Example is a bright looking-glass, universal and for all shapes to look into.
MICHEL DE MONTAIGNE.

The past few years have witnessed an explosive growth of sequenced protein/genetics data, or macromolecular structure data in general. Support for mechanisms for the effective and efficient reuse of previously sequenced data has attracted tremendous and growing interest both in the database community and in the microbiology community. In this chapter, we demonstrate the benefits of employing a federated database system to support cooperation among a collection of protein/genetics databases with respect to the ability to reuse previously sequenced data. The scenario presented in this chapter will also be used as a working example for illustrating the ideas and concepts presented in this thesis.

2.1 The Macromolecular Information Bases

Consider the situation in which a neuroscience clinical researcher (CR) is investigating the potential brain pathology of certain patients.
The CR might want to investigate the brain CAT-scan images of the patients. Suspecting that the brain pathology might be related to abnormal activation of genetics sequences within a certain brain area (e.g., Huntington’s disease), the CR might want to further investigate the genetics information of all patients having such brain pathology. Upon obtaining the genetics information of these patients, the CR might also wish to display the three-dimensional molecular structure of the genes for further comparison and manipulation. In practice, the units of information described above might be scattered among different databases residing locally, nationally, or even internationally. For instance, the brain images of patients might be maintained in a laboratory database, while the genetics information of the patients might be scattered among various databases maintained by different hospitals in different states or even in different countries. Critical here are facilities for locating and interrelating these already existing information units, so that the CR can efficiently and effectively manipulate the proper information without the need to regenerate it from scratch.

Focusing on the problem of interrelating genetics information maintained in the databases of various hospitals, Figure 2.1 shows a snapshot of the information stored in different macromolecular databases. Since each macromolecular database is maintained by a different hospital, the contents of the databases reflect their different foci and interests. We can see, for example, that databases A and B are the only two components with information on both protein and genetics sequences. All other components maintain either protein or genetics information, with various levels of detail.
2.2 The Ultimate Sharing Goal

Figure 2.2 shows two possible ways that the CR can identify related protein/genetics information from such a network of databases using current technology. In Figure 2.2a, the CR utilizes a network information discovery tool such as the Internet Gopher [39] to locate databases possibly containing information on genetics sequences. The CR then has to search for appropriate documents describing the corresponding information of the patients and transfer the documents via FTP [39] to his/her local system. The limitation of this approach lies in the primitive document retrieval techniques utilized by current information discovery tools, which require all databases to be properly documented. Information sharing and exchange in this approach is also limited to relatively unstructured strings of bits/bytes. Furthermore, the human has full responsibility for manually interrelating the various pieces of information.

[Figure 2.1: A federation of protein/genetics databases. Four database components (Hospitals A to D), each with its own conceptual schema, containing information about macromolecular structures.]

In Figure 2.2b, the CR, working at his/her windowing workstation, logs in to multiple database systems and issues a query for information at each system. For instance, the CR might manipulate the Genbank database [6] in one window and query the Brookhaven databank [23] in another window. The limitation of this approach is that the CR must be aware of, and familiar with, the appropriate database systems.
Again, the human has full responsibility for manually interrelating the various pieces of information.

[Figure 2.2: Two possible approaches to access non-local data, panels (a) and (b).]

What is desired and needed is, therefore, a higher-level capability which allows structured units of information to be identified and dynamically interconnected among database systems residing at various sites; this idea is illustrated in Figure 2.3. Here, the CR issues a query at his/her local database system. A typical query that the CR might issue at the local database system might look like: “select genetics information for all patients”. The local query is evaluated in the local system and is also transmitted to an external agent, which attempts to locate appropriate information from multiple database sources and returns the information to the initiating system. The whole process, although executed both locally and remotely, is totally transparent to the CR. Ideally, the CR is not even aware of the existence of remote database systems; the whole process is executed as if it were a normal local query request.

2.3 Research Problems

The level of interoperability that can be achieved within such a federation of protein/genetics database components depends largely upon the capability of a component to identify and locate potentially relevant remote information with respect to its needs. This discovery problem can be further decomposed into two main subproblems.

1. First, the similarity problem is concerned with the ability to determine if two database components are modeling common concepts. For instance, consider the four database components of Figure 2.1.
Protein Instances of component A’s schema and Protein Structure of component B’s schema indicate a common concept modeled in these two components. A principal difficulty here is that two database components may be modeling complementary or even disjoint views of a real-world concept due to the different interests and foci of the components. Another difficulty is that unrelated information may be modeled similarly in different databases.

2. Second, the relevance problem is concerned with the ability to identify relevant information in the context of a specific user request. The main difficulty in this case is understanding the constraints imposed on the discovery request.

We shall see how our discovery mechanism addresses these two problems and other related key issues.

[Figure 2.3: An idealized approach to access non-local data. The panel shows the query “select genetics for all patients;” issued at an OSQL> prompt.]

Chapter 3 Related Research

Not many appreciate the ultimate power and potential usefulness of basic knowledge accumulated by obscure, unseen investigators who, in a lifetime of intensive study, may never see any practical use for their findings but who go on seeking answers to the unknown without thought of financial or practical gain.
EUGENIE CLARK.

The problem of information sharing among multiple information sources has been addressed mainly in two different areas. In the networking environment, the current Internet resource discovery systems mark the state of the art in information sharing among multiple communication networks. In the database area, a variety of approaches and architectures have been proposed to address the sharing of database resources in the context of pre-existing database components. In this chapter, we survey information sharing systems in these two areas with specific regard to the discovery problem.
3.1 Networking Context

In the networking environment, the units of information that can be shared among individual, possibly heterogeneous, information systems (i.e., networks) are basically flat files, or more specifically, relatively unstructured strings of bits or bytes. These describe resources available within a local system, such as the electronic mailing addresses of local users, the software libraries existing in the local system, and so forth. With specific regard to the discovery problem, current (Internet) resource discovery systems attempt to identify useful information for users of a specific network. For example, the Netfind system [8, 39] attempts to discover electronic mailing addresses and other information about Internet users. These resource discovery systems mainly provide four functional capabilities to users [39]:

1. Organization concerns how various units of information are interrelated;
2. Indexing concerns how new units of information are placed within the organizational structure;
3. Searching concerns how relevant information can be identified from the resource repositories with respect to a user request; and
4. Browsing concerns the capability for users to explore the resource space as an alternative to automatic searching for information.

Various systems support these four functional capabilities to different degrees. For instance, document retrieval techniques [36] are employed in the WAIS system [39], in which documents are indexed according to the keywords extracted. Browsing facilities are also supported to different extents in various Internet resource discovery systems; for example, the WWW [2] system employs the hypertext paradigm, in which various pieces of information can be explored through hypertext links.

3.2 Multi-databases Context

From the database perspective, by contrast, the focus is on the sharing of structured information among a collection of heterogeneous databases.
The term “heterogeneous databases” was originally used to distinguish work that included database model and conceptual schema heterogeneity from work on “distributed databases” (see footnote 1), which addressed issues solely related to distribution [5]. Recently, there has been a resurgence of research in the area of heterogeneous database systems, paying specific attention to pre-existing database components. Work in this area can be characterized by the different levels of integration of the component database systems and by different levels of global services [41]. In Mermaid [46], for example, which is considered a tightly-coupled environment, component database schemas are integrated into one centralized global schema, with the option of defining different user views on the unified schema. While this approach supports pre-existing component databases, it falls short in that it fails to accommodate flexible sharing patterns, lacks a dynamic partial integration process, and imposes rigid consistency requirements. Furthermore, the integration process is expensive and difficult, and the result tends to be hard to change.

The federated architecture proposed in [20], which is similar to the multidatabase architecture of [32], involves a loosely-coupled collection of database systems, stressing autonomy and flexible sharing patterns through inter-component negotiation. Rather than using a single, static global schema, the loosely-coupled architecture allows multiple import schemas, enabling data retrieval directly from the exporter and not indirectly through some central node as in the tightly-coupled approach. A major problem with this loosely-coupled federated environment is its lack of support for interrelating related information from diverse database sources, leaving users of individual systems to assume full responsibility for determining the relationships among them.
Recently, several semi-automatic tools for determining the relationships among information units in multiple databases have begun to appear [48]. One common approach is to reason about the meaning and resemblance of heterogeneous objects in terms of their structural representation. In Larson et al. [30], the meaning of an attribute is approximated in terms of its value type (set of possible values), cardinality constraints, integrity constraints, and allowable operations. However, one can argue that any such set of characteristics does not sufficiently describe the real-world meaning of an object, and thus their comparison can lead to unintended correspondences or fail to detect important ones. Other, more promising methodologies that have been developed include heuristics to determine the similarity of objects based on the percentage of occurrences of common attributes [19, 42]. More accurate techniques use classification for choosing a possible relationship between classes [38].

Footnote 1: The term distributed database is used here as it has been mainly used in the literature, denoting a relatively tightly-coupled, homogeneous system of logically centralized, but physically distributed, component databases.

Whereas most previous methods primarily utilize schema knowledge, techniques utilizing semantic knowledge (based on real-world experience) have also been investigated [12, 21]. These approaches usually assume the existence of a real-world knowledge base which serves as a global schema to which every local schema is mapped. The similarity between different objects is determined by the use of spanning trees over the knowledge base.
The approach adopted in [12] is more ambitious in that it associates a fuzzy value with each relationship in the knowledge base, indicating the degree of fuzziness of that relation with respect to the concept to which the relation belongs; the set of fuzzy values also acts as a basis for quantitative similarity measurement between two objects. However, one could argue that coming up with such a set of fuzzy values in the first place will be difficult. These approaches also lack the ability to tailor the identification process to the context of a user request. Further, these approaches assume the existence of an initial, comprehensive, centralized knowledge base. We believe that a useful approach is for the centralized knowledge base to contain only information actively used to support sharing within the federation; thus, as illustrated in our mechanism, it should be built dynamically and tailored to the federation.

A different approach proposed by Kent [26] uses an object-oriented database programming language to express mappings among different similar concepts that allow a user to view them in some integrated way. It of course remains to be seen whether a language that is sophisticated enough to meet all of the requirements given by Kent in his solution can be developed in the near future.

A very recent approach to interoperability by Mehta et al. [34] uses so-called path-methods to explicitly create inter-component and inter-object mappings between source and target object classes in order to retrieve and update related data objects. The obvious drawback of this approach is the large overhead in calculating and maintaining the mappings, which may be impractical for large federations with extensive and/or dynamic sharing patterns. This approach of course also requires the determination of the relationships between objects belonging to different components.
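To make the structural heuristics surveyed in this section more concrete, the following minimal sketch scores two types by the overlap of their property names, in the spirit of the common-attribute heuristics of [19, 42]. It is not taken from those works: the Jaccard formula, the function name attribute_overlap, and all property names are assumptions chosen for illustration.

```python
from typing import Set


def attribute_overlap(props_a: Set[str], props_b: Set[str]) -> float:
    """Jaccard overlap of two property-name sets (0.0 when both are empty)."""
    union = props_a | props_b
    if not union:
        return 0.0
    return len(props_a & props_b) / len(union)


# Hypothetical property names, invented for illustration only.
protein_instances = {"name", "amino_acid", "function"}
protein_structure = {"name", "amino_acid", "resolution"}
patients = {"name", "age"}

related = attribute_overlap(protein_instances, protein_structure)
unrelated = attribute_overlap(protein_instances, patients)
```

Under these made-up properties, the unrelated pair still receives a nonzero score because both types happen to have a property called name. This is the kind of unintended correspondence that purely structural comparison can produce.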
A major capability that is still largely missing from current and proposed systems is the capability for users to discover available remote information units (resources) that might be relevant to their interests without even being aware of the existence and location of remote database sources. As illustrated by the networking domain, a resource discovery system should provide additional facilities for users in addition to simply detecting possible relationships among information units. In light of this observation, our resource discovery system for a federation of databases is designed mainly around providing the four functional capabilities of “organizing”, “indexing”, “searching”, and “browsing” a database resource space.

Chapter 4 A Federated Sharing Context

Information networks straddle the world. Nothing remains concealed. But the sheer volume of information dissolves the information. We are unable to take it all in.
GUNTHER GRASS.

In order for any collaboration to take place among the autonomous components of a federation, some common model for describing the sharable data must be utilized. In this chapter, we present a generic Core Object Data Model (CODM) which acts as a common language for the communication among individual database components. A spectrum of resources that can be discovered in the context of this CODM will also be investigated.

4.1 The Federated Environment

Revisiting our example in Chapter 2, the external agent has to support two capabilities in order to collect appropriate genetics information from multiple database sources. First, the external agent has to understand the concept of genetics and should be able to interrelate (partial) databases that manage genetics information. Second, the external agent has to understand the inquiry submitted by the initiating database component and identify appropriate information from its knowledge repository.
Our resource discovery system focuses on the above two aspects and specifically considers the interrelation of information managed by a federation of database components. A key characteristic of such a database federation is the capability for a component to share information with other components while, at the same time, preserving its investment in and autonomy over its own database with respect to administration and information release, such as control over the information it is willing to “export” to the other components. For instance, a hospital database might want to keep the genetics information of certain patients confidential. Consequently, when a component agrees to join the federation, the information to be made available to other components is specially placed in what we term an export schema (see Appendix A for more details). Figure 4.1 illustrates a top-level functional architecture of a federated sharing environment.

In order for any collaboration to take place among the heterogeneous components of a federation, some common model for describing the sharable data must be utilized. Returning to our example in Chapter 2, the external agent must be able to understand the data model used to describe the genetics information by each database system. This can be achieved by having each database component utilize a common data model. One may of course argue as to the nature of this “lingua franca”. We believe that this model should be semantically expressive enough to capture the intended meanings of the conceptual schemas of each component. Further, this model must be simple enough that it can be readily understood and implemented using a variety of already existing database management systems.
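The export schema idea described above can be sketched as a filter over a component's conceptual schema. This is only an illustration under stated assumptions: the Component class, its methods, and the type and property names are hypothetical, and the actual export schema design is the one given in Appendix A, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Component:
    """A database component with a private schema and an exported subset."""
    name: str
    conceptual_schema: Dict[str, Set[str]]  # type name -> property names
    exported: Set[str] = field(default_factory=set)

    def export(self, type_name: str) -> None:
        """Mark one locally defined type as visible to the federation."""
        if type_name not in self.conceptual_schema:
            raise KeyError(f"{type_name!r} is not defined in {self.name}")
        self.exported.add(type_name)

    def export_schema(self) -> Dict[str, Set[str]]:
        """Return only the exported types; everything else stays private."""
        return {
            type_name: set(props)
            for type_name, props in self.conceptual_schema.items()
            if type_name in self.exported
        }


hospital = Component(
    name="Hospital A",
    conceptual_schema={
        "Genes": {"locus", "product"},               # hypothetical properties
        "Patient Genetics": {"patient_id", "gene"},  # kept confidential
    },
)
hospital.export("Genes")
```

In this sketch the component keeps full control over release: a type that has not been explicitly exported never appears in export_schema(), which mirrors the autonomy requirement stated in the text.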
To this end, we have chosen to use a Core Object Data Model (CODM) as the common database model for describing the structures, constraints, and operations for sharable data. This is illustrated in Figure 4.1, in which the database components are each encapsulated by the system-level CODM reference data model.

[Figure 4.1: A federated sharing environment. Four database components, each encapsulated by CODM, are connected to a resource discovery service and a resource integration service.]

Figure 4.1 also shows a special federation component called the sharing advisor. The role of the sharing advisor is to act as a service provider to individual components of the federation. In particular, the sharing advisor provides a resource discovery service for each component of the federation to request relevant information available in other components. It also provides a resource integration service for each component to (partially) integrate the newly discovered remote information into the component’s local database. The work presented in this thesis, however, focuses only on the resource discovery mechanism; the resource integration mechanism is the focus of a related research project and is described in detail in [17, 18].

4.2 The Core Object Data Model (CODM)

CODM is a generic functional object database model which supports the usual object-based constructs. In particular, it draws upon the essentials of functional database models, such as those proposed in Daplex [43], Iris [14], and Omega [16]. CODM contains the basic features common to most semantic [1, 22] and object-oriented models [28, 31, 33].
The model supports complex objects (aggregation), type membership (classification), subtype to supertype relationships (generalization), inheritance of properties (attributes) from supertype to subtypes, and user-definable functions (methods). Not supported at this point are rich constraints (semantic integrity rules) on types and properties. Among the advantages of using an object-based common data model are the ability to encapsulate the functionality of shared objects, its extensible nature, and object uniformity.

[Figure 4.2: Partial conceptual schemas of two protein/genetics components: (a) Component A, (b) Component B.]

In theory, everything in the CODM, including meta-data, is modeled as objects. There are primarily three kinds of objects: Types, Instances, and Functions (user-defined methods). Functions operate on instances, types, and possibly other functions. A type represents a specific view of a real-world concept. It has one or more properties or, more generally, inter-object relationships. Types constrain the instances on which functions can operate. Types also serve to encapsulate instances by the set of functions which operate on members of the type (as in an abstract data type). Instances sharing the same set of properties and functions are classified into types and model/represent real-world entities or values. The types in a system form a hierarchy or a rooted directed acyclic graph, modeling supertype-subtype relationships. All properties and functions defined for a type are inherited by all its subtypes recursively. All objects have a unique system-generated object identifier [14, 27].
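The three kinds of CODM objects described above can be sketched as follows. This is a minimal illustration and not the Omega-based prototype: the class names, the counter used for object identifiers, and the properties sequence and helix_count are assumptions. Only the supertype-subtype link between Amino Acid Sequences and Secondary Structure comes from the text.

```python
import itertools
from typing import Any, Callable, Dict, List, Optional

_next_oid = itertools.count(1)  # source of system-generated object identifiers


class CodmObject:
    """Base class: every object, including meta-data, gets a unique identifier."""

    def __init__(self) -> None:
        self.oid = next(_next_oid)


class Type(CodmObject):
    """A type with its own properties and zero or more supertypes."""

    def __init__(self, name: str, properties: Optional[Dict[str, str]] = None,
                 supertypes: Optional[List["Type"]] = None) -> None:
        super().__init__()
        self.name = name
        self.own_properties = dict(properties or {})  # property -> value type
        self.supertypes = list(supertypes or [])

    def all_properties(self) -> Dict[str, str]:
        """Own properties plus those inherited recursively from supertypes."""
        merged: Dict[str, str] = {}
        for supertype in self.supertypes:
            merged.update(supertype.all_properties())
        merged.update(self.own_properties)
        return merged


class Instance(CodmObject):
    """An instance classified under a type; values must match its properties."""

    def __init__(self, of_type: Type, **values: Any) -> None:
        super().__init__()
        unknown = set(values) - set(of_type.all_properties())
        if unknown:
            raise ValueError(f"undefined properties: {sorted(unknown)}")
        self.type = of_type
        self.values = values


class Function(CodmObject):
    """A user-defined function (method) attached to a type."""

    def __init__(self, name: str, on_type: Type,
                 body: Callable[[Instance], Any]) -> None:
        super().__init__()
        self.name, self.on_type, self.body = name, on_type, body

    def __call__(self, instance: Instance) -> Any:
        return self.body(instance)


# Hypothetical properties; the subtype link follows the text.
amino = Type("Amino Acid Sequences", {"sequence": "STRING"})
secondary = Type("Secondary Structure", {"helix_count": "INTEGER"}, [amino])
sample = Instance(secondary, sequence="MKV", helix_count=2)
length = Function("length", amino, lambda inst: len(inst.values["sequence"]))
```

The sketch leaves out one point from the text: it does not check that a function is applied only to instances of its type or of a subtype. A fuller version would walk the supertype graph before invoking the function body.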
Figure 4.2 shows the partial conceptual schemas, represented in CODM, of two protein/genetics database components which were originally introduced in the collaborating scientific databases example in Chapter 2. The diagram serves to illustrate the simple diagrammatic notation we will be using for depicting conceptual schemas in CODM. In our diagrammatic notation, type objects are depicted as bubbles. For instance, there are seven type objects depicted in component A of Figure 4.2a: Protein Instances, Amino Acid Sequences, Secondary Structure, Tertiary Structure, Choromosomes, Linkage Map, and Genes. Each type has a set of properties defined on it, which is placed immediately above the type object.

Two kinds of inter-object relationships are explicitly modeled in CODM and have corresponding diagrammatic notations: the supertype to subtype(s) relationship and the inter-type relationship. The supertype to subtype(s) relationship is depicted with thick dark arrows from supertype to subtype(s), such as the relationship between type Amino Acid Sequences and type Secondary Structure, and the relationship between type Amino Acid Sequences and type Tertiary Structure in Figure 4.2a. An inter-type relationship is depicted with thin dark lines between two type objects, such as the relationship between types Protein Instances and Amino Acid Sequences in Figure 4.2a. This diagram will also serve as an example for illustrating the ideas and concepts presented in the rest of this thesis.

4.3 A Spectrum of Resources within CODM

In the context of CODM, a discovery request is generally possible at many different levels of abstraction and granularity, ranging from discovering factual information units (instance objects), to meta-data (type objects), to units of behavior (functions), with respect to the specific needs of a component.
The granularity of information that can be discovered depends on the granularity of information made available to the federation; this is illustrated in the following table:

Goal                      | Information Available              | Information Utilized
--------------------------|------------------------------------|------------------------------------
Discover Type Objects     | Properties with Atomic Datatype    | Property name
Discover Type Objects     | Properties with User Defined type  | Property name + Value Type
Discover Instance Objects | Property Values                    | Property name + Value Type + Value

Table 4.1: A spectrum of resources within CODM

1. The first two cases refer to the situation in which only the definitions of type objects are exported to the federation. In these cases, only type objects can possibly be discovered; we term this process type-level discovery. As an example of type-level discovery, consider again the (partial) conceptual schemas of the two database components shown in Figure 4.2. Here, a typical type-level discovery request by component A might be: “What additional information is present remotely to which type Choromosomes is directly related?” Of course, the problem of determining the meaning of “directly related” in this example is nontrivial. Depending on how the type objects are defined, the information that can be utilized in achieving this kind of type-level discovery will be limited:

(a) In the first situation, a property is defined as an atomic data value such as INTEGER or STRING. In the (partial) conceptual schema of database component A in Figure 4.2, property position of type Linkage Map belongs to this category. In this case, the value type of a property does not supply any additional useful semantics to the property name; only the property name is utilized by the sharing advisor in this situation.
(b) In the second situation, a property is defined as a user-defined type. For example, the property amino_acid of type Protein Instances in component A of Figure 4.2 is defined to return an Amino Acid Sequences object. In this case, the value type of a property provides further real-world semantics to the property name and thus should be considered in addition to the property name.

2. In the final case, in which the definitions and data values of instance objects are both exported, a finer level of granularity, that of instance objects, can possibly be discovered; we term this process instance-level discovery. Using the partial conceptual schema of Figure 4.2 as an example again, a typical instance-level discovery request by component A would be: “What is the DNA sequence of chromosome #4?”

The work described in this thesis focuses only on the support for discovery of remote type objects with respect to a component request, i.e., we focus on the first two cases in Table 4.1.

Chapter 5 Resource Discovery Mechanism

Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it.
SAMUEL JOHNSON.

In this chapter, we describe the details of our resource discovery mechanism, by which a database component can query or identify remote information that is relevant to its requirements. Our mechanism is based on a two-phase participation of each component. During registration, the relationships among the exported type objects of each component are established in a knowledge base. During discovery, the request submitted by an initiating component is analyzed and appropriate information is identified by consulting the knowledge base established during registration.

5.1 A Two-phase Framework to Discovery

Figure 5.1 illustrates a detailed picture of the sharing advisor component.
In order for the sharing advisor to identify appropriate information in response to a component discovery request, it must of course be aware of the exported information available. In principle, this could be achieved by having each component broadcast its exported type objects to the federation. However, since multicasting the exported information to every component within the federation would be expensive, especially for large federations, each component is instead required to register its exported type objects with the sharing advisor when it first joins the federation. We term this the resource registration process. This registration process establishes for the registering component an initial sharing context within the federation, by logically connecting its (initially) exported information to a knowledge base managed by the sharing advisor called the semantic dictionary (see Figure 5.1). Incremental registration is also necessary whenever a component augments its exported information with new information.

Type objects that have been registered with the sharing advisor will be indexed into the organizational structure of the semantic dictionary with the help of the sharing heuristics, with user interactions as necessary. The main purpose of this action is to determine if the type objects being exported are similar to type objects previously registered and to establish a common understanding among the different existing types available in the federation. For example, consider the two conceptual schemas of components A and B in Figure 4.2. Assume that component A has already registered types Protein Instances, Amino Acid Sequences, and Genes with the sharing advisor.
When component B registers type Protein Structure, the sharing advisor will use the knowledge previously established during the registration of component A to deduce whether Protein Structure has any similarity with previously registered types, i.e., Protein Instances, Amino Acid Sequences, and Genes in this example.

To properly index the newly exported type objects in the organizational structure of the semantic dictionary, the (dis)similarity among the various exported type objects has to be determined. Interactions with component users may be necessary if the sharing heuristics fail to determine the relationships among the exported type objects properly. For instance, user interactions are necessary to instruct the sharing advisor about the similarity between Protein Instances and Protein Structure if the sharing heuristics fail to help the sharing advisor detect such similarity when component B registers.

[Figure 5.1: Another view of a federated sharing environment. Four database components, each encapsulated by CODM, are connected to the sharing advisor, which provides the resource discovery service and the resource integration service.]

After a new component has registered with the sharing advisor, newly acquired knowledge and newly registered information are stored in the semantic dictionary. For instance, the similarity information between the newly registered type Protein Structure and the previously registered type Protein Instances will be maintained in the semantic dictionary.

A federated knowledge base is thus established in the semantic dictionary. Discovery of relevant remote information can then be achieved either by sending a discovery request to the sharing advisor or by simply browsing through the resource space in the semantic dictionary.
We term this the resource discovery process. In the former case, the sharing advisor analyzes the constraints of the discovery request and, by consulting the knowledge established in the semantic dictionary during registration, searches for relevant information available in remote components that satisfies those constraints. In the latter case, a component user employs a semantic dictionary browsing tool provided by the sharing advisor to browse through the resource space.

We now look at the resource discovery mechanism from a different perspective, emphasizing its ability to organize, index, search, and browse the resource space.

5.2 The Semantic Dictionary: Organization

The sharing advisor attempts to properly organize exported type objects in the semantic dictionary during registration, with guided user input as necessary. As will be described in subsequent sections, the organization of type objects has a substantial effect on the amount of effort involved in identifying appropriate information for a discovery request.

In the semantic dictionary, types determined by the sharing advisor to be similar, i.e., to represent similar information, are classified into a collection called a concept. A concept can be further classified into sub-collections called subconcepts. A concept hierarchy is thus generated, modeling superconcept-subconcept relationships. The classification of subconcepts is determined according to whether the subconcepts represent complementary or overlapping views of the superconcept. The semantic dictionary therefore consists of a collection of concept hierarchies, one for each main concept observed.

Figure 5.2 shows two snapshots of the concept hierarchies in the semantic dictionary, taken at different times during the lifetime of our sample federation shown in Figure 4.2.
Figure 5.2a shows the concept hierarchies after types Protein Instances, Amino Acid Sequences, and Genes of component A have been registered with the sharing advisor. The semantic dictionary is initially empty, before any type object has been exported; therefore, when the first component (component A in our example) enters the federation and registers its exported information, the sharing advisor simply imports that information into the semantic dictionary. In other words, the initial concepts created in the semantic dictionary are the type objects exported by the database component that first joins the federation.

[Figure 5.2: Evolution of concept hierarchies in the semantic dictionary. Panels (a) and (b); legend: superconcept-subconcept, member-of (instance-of), concept]

Figure 5.2b shows the corresponding hierarchies after component B has registered types Protein Structure and Genetics with the sharing advisor. The hierarchies in Figure 5.2b indicate that Protein Instances and Protein Structure represent similar information, as they belong to the same hierarchy/classification, called Protein Information. Further, because Protein Instances and Protein Structure have properties that distinguish one type from the other, they are also created as subconcepts of Protein Information; this expresses their dissimilarities within the context of their common superconcept, Protein Information.
In the situation in which the properties of one type are a proper subset of those of another, the latter will be created as a superconcept of the former; this idea is illustrated by the concept hierarchy of Genetics. In the situation in which the properties of the two types are identical, only one concept will be created, which holds the two type objects as members. By contrast, since type Amino Acid Sequences is similar to none of the other concepts at this moment, it appears as a separate concept in the semantic dictionary. Again, similarity and dissimilarity of this kind are detected by the sharing advisor based upon the sharing heuristics, with user input as required. At this point in time, the semantic dictionary contains three main concepts: Genetics, Protein Information, and Amino Acid Sequences. Protein Information is further classified into two subconcepts, Protein Instances and Protein Structure.

This example also illustrates some potential problems in relating properties of different type objects. First, there may exist a naming conflict among similar/related properties, such as property product of type Genes in component A and property produce of type Genetics in component B. This correspondence among synonymous terms is obtained via user interaction; once determined, this information is maintained in the semantic dictionary for future reference. In this way, a terminological thesaurus is incrementally established in the semantic dictionary. A second problem stems from the modeling freedom available when representing similar information; this is sometimes referred to as semantic heterogeneity. For instance, the ternary relationship among types Genes, Linkage Map, and Choromosomes in component A is modeled as a single type Genetics in component B.
Notice, however, that the task of the sharing advisor is not to determine the precise relationships among the exported type objects; rather, its task is to detect possible relationships among them. In our discovery mechanism, all three types, Genes, Linkage Map, and Choromosomes, will be classified into the same hierarchy of Genetics.¹

¹ Details of detecting precise relationships among type objects are described in [17].

An advantage of organizing type objects as concept hierarchies of this sort is that a hill climbing technique can be used to place newly exported type objects. For instance, consider a type object being exported that is determined to be unrelated to the members of concept Protein Information in Figure 5.2b; in this case, no further comparison with members of its subconcepts is necessary. By contrast, if a newly exported type object is determined to be similar to concept Protein Information, it will be recursively compared with the subconcepts of Protein Information until the most specific concept to which the newly exported object should belong is determined.

Generally, a type object represents a specific view of a corresponding real world concept, and is tailored to the focus and interest of the database component; therefore, the set of properties associated with a type object can be viewed as a subset of those associated with the real world concept. For instance, types Protein Instances and Protein Structure of components A and B represent two different views of the real world concept of protein. Types Genes and Genetics of components A and B represent two different views of the real world concept of genetics. In order to properly interlink various views that carry complementary information on a real world concept, the set of properties belonging to a concept at a particular level of a hierarchy is defined as the union of the properties of all its subconcepts. This is illustrated by concept Protein Information in Figure 5.2b. This incrementally builds up a collection of federated views of the real world concepts exported to the federation.

The semantic dictionary is therefore, logically, a centralized federated knowledge base established incrementally in a bottom-up fashion, and containing only information actively used to support sharing within the federation. This contrasts with other approaches, which assume the existence of an initial comprehensive global schema or knowledge base [7, 12]. Note that the concept hierarchies introduced are not static. They evolve dynamically and scale incrementally, depending on the knowledge and information exported to the federation.

To understand the size range of the semantic dictionary, let us assume that a total of N type objects are registered with the sharing advisor. At one extreme is the case in which all N types are similar enough to belong to one single concept; there will be N + 1 objects in the semantic dictionary (N type members + 1 concept). At the other extreme is the case in which all N types are completely unrelated; here, there will be 2N objects in the semantic dictionary (N type members + N concepts).

5.3 The Heuristics: Indexing

In order to index the exported type objects correctly, placing them into the proper location in the concept hierarchies, we employ a collection of sharing heuristics that draw upon machine learning techniques. The idea is to improve the indexing capability over time [13].
These heuristics assess the distinguishing capability of a property with respect to a concept; this allows the sharing advisor to determine whether the meaning of a type object being exported can be determined based upon its properties, or whether further assistance from users is necessary.

The distinguishing capability with respect to a concept is based upon the observation that related concepts may possess complementary or even disjoint properties (an intra-concept similarity indicator), whereas unrelated concepts may in fact possess overlapping properties (an inter-concept dissimilarity indicator). Traditional techniques that determine the meaning of a type object by a simple properties-intersection mechanism are therefore inadequate [7, 19, 30].

[Figure 5.3: Fundamental base of the sharing heuristics (labels: related concepts, unrelated concepts, common property, uncommon property, intra-concept similarity, inter-concept dissimilarity)]

Figure 5.3 illustrates the relationship between the distinguishing power of a property and the association of the property with related or unrelated concepts. As illustrated in Figure 5.3, a property that is common (uncommon) among related concepts will have a high (low) intra-concept similarity measurement. On the other hand, a property that is common (uncommon) among unrelated concepts will have a low (high) inter-concept dissimilarity measurement. Intuitively, a property possesses a high distinguishing capability in discriminating a concept from others if it has a high intra-concept similarity as well as a high inter-concept dissimilarity measurement; this is the shaded region in Figure 5.3.

As an example, consider property code of concept Protein Information in Figure 5.2b. This property has a high intra-concept similarity with respect to Protein Information, since it is associated with all types belonging to Protein Information.
Similarly, property code has a high inter-concept dissimilarity, as no other concept at the same scope/level possesses such a property. Consequently, property code possesses both a high intra-concept similarity and a high inter-concept dissimilarity with respect to concept Protein Information, and therefore has a high distinguishing power with respect to Protein Information. This suggests that property code is probably useful in discriminating type objects belonging to concept Protein Information from those belonging to other concepts such as Genetics.

When looking at the subconcepts of Protein Information, however, property code has a low inter-concept dissimilarity with respect to each subconcept. This is because property code is possessed by all the concepts within the common superconcept Protein Information, giving rise to a low inter-concept dissimilarity with respect to concepts Protein Instances and Protein Structure. Notice, then, that a property does not necessarily possess the same degree of distinguishing capability across different levels of a hierarchy.

Based on the intra-concept similarity and inter-concept dissimilarity values of each property with respect to a concept, the sharing advisor estimates a probability indicator [35] indicating the likelihood that a newly exported type object should belong to the concept. In the rest of this section, we present a formal mathematical model for the estimation of the distinguishing power and the probability indicator, followed by the heuristics indexing algorithm.

5.3.1 Distinguishing Power Estimation

Intra-concept similarity and inter-concept dissimilarity values are estimated via statistical analysis based on previously registered type objects. Their values are updated whenever new types are registered into the federation.
Formally, the intra-concept similarity value of a property $p$ with respect to a concept $A$ is estimated using the conditional probability that a type object $T$ will possess $p$ given that $T$ belongs to $A$, i.e., $P(T \text{ possesses } p \mid T \in A)$. Similarly, the inter-concept dissimilarity value of a property $p$ with respect to a concept $A$ is estimated from the conditional probability that a type object $T$ will possess $p$ given that $T$ does not belong to $A$, i.e., $P(T \text{ possesses } p \mid T \notin A)$. To illustrate how the two conditional probability values are computed, consider the following parameters:

Parameter     Meaning
$N$           Total number of types registered with the sharing advisor
$R_A$         Total number of types belonging to concept $A$
$r_{p,A}$     Total number of types belonging to concept $A$ that possess property $p$
$n_p$         Total number of types that possess property $p$

The intra-concept similarity value of property $p$ with respect to concept $A$ is therefore estimated by:

$$P(T \text{ possesses } p \mid T \in A) = \frac{r_{p,A}}{R_A} \tag{5.1}$$

Similarly, the inter-concept dissimilarity value of property $p$ with respect to concept $A$ is estimated by:

$$1 - P(T \text{ possesses } p \mid T \notin A) = 1 - \frac{n_p - r_{p,A}}{N - R_A} \tag{5.2}$$

Consider property code of concept Protein Information depicted in Figure 5.2b. Here, we have the following parameter values:

Parameter                              Value
$N$                                    5
$R_{Protein\ Information}$             2
$r_{code,\,Protein\ Information}$      2
$n_{code}$                             2

According to Equation 5.1, the intra-concept similarity value of property code with respect to concept Protein Information is $\frac{2}{2} = 1$. Similarly, according to Equation 5.2, the inter-concept dissimilarity value of property code with respect to concept Protein Information is $1 - \frac{2-2}{5-2} = 1$. Therefore, property code has a high intra-concept similarity value as well as a high inter-concept dissimilarity value with respect to concept Protein Information.
As mentioned previously, this suggests that property code has a high distinguishing power in discriminating type objects belonging to concept Protein Information from those belonging to other concepts.

When estimating the intra-concept similarity and inter-concept dissimilarity values of property code with respect to concept Protein Instances within the scope of its superconcept Protein Information, we have the following parameter values:

Parameter                            Value
$N$                                  2
$R_{Protein\ Instances}$             1
$r_{code,\,Protein\ Instances}$      1
$n_{code}$                           2

According to Equation 5.1, the intra-concept similarity value of property code with respect to concept Protein Instances is $\frac{1}{1} = 1$. On the other hand, according to Equation 5.2, the inter-concept dissimilarity value of property code with respect to concept Protein Instances is $1 - \frac{2-1}{2-1} = 0$. Therefore, property code has a high intra-concept similarity value but a low inter-concept dissimilarity value with respect to concept Protein Instances within the scope of its superconcept Protein Information. This suggests that property code is not a good candidate for distinguishing type objects belonging to Protein Instances from those of other concepts, such as Protein Structure, within the scope of the common superconcept Protein Information.

Once the intra-concept similarity value and the inter-concept dissimilarity value of all properties with respect to each concept in the semantic dictionary are determined, the sharing advisor adopts a probability model [35] that uses these values, for all properties belonging to a concept, to estimate the probability that a newly exported type object should belong to the concept. We describe the probability model in the following section.
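The two estimates of Equations 5.1 and 5.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the dissertation's implementation: the names PropertyCounts, intra_concept_similarity, and inter_concept_dissimilarity are hypothetical, and the handling of the case N = R_A (no types outside the concept) is an assumption that the text does not address. The two sets of counts are the ones given above for property code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyCounts:
    """Counts for one property p and one concept A (hypothetical container)."""
    N: int     # types registered within the scope being considered
    R_A: int   # types belonging to concept A (assumed > 0)
    r_pA: int  # types belonging to A that possess p
    n_p: int   # types that possess p


def intra_concept_similarity(c: PropertyCounts) -> float:
    """Equation 5.1: r_{p,A} / R_A."""
    return c.r_pA / c.R_A


def inter_concept_dissimilarity(c: PropertyCounts) -> float:
    """Equation 5.2: 1 - (n_p - r_{p,A}) / (N - R_A)."""
    outside = c.N - c.R_A
    if outside == 0:
        # Assumption (not in the text): with no types outside A there is
        # no evidence of overlap, so the property is treated as fully
        # dissimilar to other concepts.
        return 1.0
    return 1.0 - (c.n_p - c.r_pA) / outside


# Property "code" with respect to Protein Information (whole dictionary).
code_in_protein_information = PropertyCounts(N=5, R_A=2, r_pA=2, n_p=2)
# Property "code" with respect to Protein Instances, within the scope of
# its superconcept Protein Information.
code_in_protein_instances = PropertyCounts(N=2, R_A=1, r_pA=1, n_p=2)

for counts in (code_in_protein_information, code_in_protein_instances):
    print(intra_concept_similarity(counts), inter_concept_dissimilarity(counts))
```

Running the sketch reproduces the values derived above: similarity 1 and dissimilarity 1 at the top level, and similarity 1 but dissimilarity 0 within the scope of Protein Information.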
5.3.2 Probability Estimation

Assume that a type object $T$ is being registered with the sharing advisor. As mentioned above, the sharing advisor estimates, for each concept $A$ currently existing in the semantic dictionary, a value indicating the probability that type object $T$ should belong to concept $A$. Further assume that $A$ contains $m$ properties, $p_1$ to $p_m$, denoted by the vector $V_A$:

$$V_A = \{p_1, \ldots, p_m\}$$

where $p_i$ denotes the $i$th property of concept $A$. The existence or non-existence of each property $p_i$ of concept $A$ in type object $T$ can be represented by a binary vector $X_{T,A}$:

$$X_{T,A} = \{t_1, \ldots, t_m\}$$

where $t_i = 1$ if $T$ possesses property $p_i$, and $t_i = 0$ if $T$ does not possess property $p_i$. Through the vector $X_{T,A}$, the relationship between concept $A$ and type $T$ is extracted and expressed explicitly.

Consider a binary variable $w_{T,A}$, where:

$w_{T,A} = 1$ if $T$ belongs to concept $A$,
$w_{T,A} = 0$ if $T$ does not belong to concept $A$.

In terms of $w_{T,A}$, we would like to compute

$P(w_{T,A} = 1 \mid X_{T,A})$: the probability that $T$ should belong to $A$ given the information about the relationship between $T$ and $A$.

Once $P(w_{T,A} = 1 \mid X_{T,A})$ is determined, the sharing advisor deduces that $T$ should belong to concept $A$ if

$$P(w_{T,A} = 1 \mid X_{T,A}) > \epsilon_A$$

where $\epsilon_A$ is a similarity threshold of concept $A$. That is, each concept $A$ in the semantic dictionary keeps track of its own threshold; if the probability value of type $T$ with respect to concept $A$ estimated by the sharing advisor exceeds the threshold, the sharing advisor deduces that type $T$ should belong to concept $A$.

In order to determine the value of $P(w_{T,A} = 1 \mid X_{T,A})$, we apply Bayes' Theorem [24, 25] and obtain:

$$P(w_{T,A} = 1 \mid X_{T,A}) = \frac{P(X_{T,A} \mid w_{T,A} = 1)\, P(w_{T,A} = 1)}{P(X_{T,A} \mid w_{T,A} = 1)\, P(w_{T,A} = 1) + P(X_{T,A} \mid w_{T,A} = 0)\, P(w_{T,A} = 0)} \tag{5.3}$$
Therefore, in order to estimate $P(w_{T,A} = 1 \mid X_{T,A})$, we have to estimate the following four probability values:

• $P(X_{T,A} \mid w_{T,A} = 1)$: Given that the type object being compared belongs to concept $A$, the probability/likelihood that the object is $T$.
• $P(X_{T,A} \mid w_{T,A} = 0)$: Given that the type object being compared does not belong to concept $A$, the probability/likelihood that the object is $T$.
• $P(w_{T,A} = 1)$: Based on previous experience, the probability that type object $T$ should belong to $A$.
• $P(w_{T,A} = 0)$: Based on previous experience, the probability that type object $T$ should not belong to $A$.

The rationale behind Bayes' theorem captures human reasoning behavior, in the sense that the human inferencing process is based on previously acquired knowledge together with some additional information available at the current moment. The two quantities $P(w_{T,A} = 1)$ and $P(w_{T,A} = 0)$ capture our reasoning behavior based on past experience in the absence of new knowledge, and are known as prior probability values. $P(X_{T,A} \mid w_{T,A} = 1)$ and $P(X_{T,A} \mid w_{T,A} = 0)$, on the other hand, capture the reasoning process once new information is supplied, and are referred to here as posterior probability values (in standard Bayesian terminology, these are the class-conditional likelihoods). In the following, we first describe how to estimate the two posterior probability values, $P(X_{T,A} \mid w_{T,A} = 1)$ and $P(X_{T,A} \mid w_{T,A} = 0)$. We then proceed to discuss how the two prior probabilities, $P(w_{T,A} = 1)$ and $P(w_{T,A} = 0)$, can be estimated in the following section.

Let us assume that the properties of a concept are mutually independent. Based on this assumption, the values of $P(X_{T,A} \mid w_{T,A} = 1)$ and $P(X_{T,A} \mid w_{T,A} = 0)$ can be computed according to the following equations [24, 25]:

$$P(X_{T,A} \mid w_{T,A} = 1) = P(t_1 \mid w_{T,A} = 1) \cdots P(t_m \mid w_{T,A} = 1) \tag{5.4}$$

$$P(X_{T,A} \mid w_{T,A} = 0) = P(t_1 \mid w_{T,A} = 0) \cdots P(t_m \mid w_{T,A} = 0) \tag{5.5}$$

Now, define

$$s_{i,A} = P(t_i = 1 \mid w_{T,A} = 1), \qquad d_{i,A} = P(t_i = 1 \mid w_{T,A} = 0),$$
where $s_{i,A}$ represents the probability that, if object $T$ belongs to concept $A$, property $p_i$ is present in $T$; this is precisely our definition of the intra-concept similarity value of property $p_i$ with respect to concept $A$ (Equation 5.1). Similarly, $d_{i,A}$ represents the probability that, if object $T$ does not belong to concept $A$, property $p_i$ is present in type $T$; this is the conditional probability from which the inter-concept dissimilarity value of property $p_i$ with respect to concept $A$ is obtained (Equation 5.2 takes its complement).

From $s_{i,A}$ and $d_{i,A}$, we can calculate the corresponding probabilities for the absence of $p_i$ in $T$:

$$1 - s_{i,A} = P(t_i = 0 \mid w_{T,A} = 1), \qquad 1 - d_{i,A} = P(t_i = 0 \mid w_{T,A} = 0),$$

i.e., $1 - s_{i,A}$ represents the probability that, if object $T$ belongs to concept $A$, $p_i$ is absent from $T$; therefore, $1 - s_{i,A}$ is the complement of the intra-concept similarity value of property $p_i$. $1 - d_{i,A}$ is interpreted similarly: it is the probability that $p_i$ is absent from $T$ given that $T$ does not belong to $A$, which coincides with the inter-concept dissimilarity value of Equation 5.2.

Consider $P(t_i \mid w_{T,A} = 1)$ in Equation 5.4. The value of $t_i$ is either 1 or 0; therefore, the value of $P(t_i \mid w_{T,A} = 1)$ is $s_{i,A}$ if $t_i = 1$, and $1 - s_{i,A}$ if $t_i = 0$. Thus, for $i = 1, \ldots, m$,

$$P(t_i \mid w_{T,A} = 1) = s_{i,A}^{\,t_i} (1 - s_{i,A})^{1 - t_i} \tag{5.6}$$

Substituting Equation 5.6 into each of the corresponding terms in Equation 5.4, we have

$$P(X_{T,A} \mid w_{T,A} = 1) = \prod_{i=1}^{m} s_{i,A}^{\,t_i} (1 - s_{i,A})^{1 - t_i} \tag{5.7}$$

Similarly, consider $P(t_i \mid w_{T,A} = 0)$ in Equation 5.5. The value of $P(t_i \mid w_{T,A} = 0)$ is $d_{i,A}$ if $t_i = 1$, and $1 - d_{i,A}$ if $t_i = 0$. So, for $i = 1, \ldots, m$,

$$P(t_i \mid w_{T,A} = 0) = d_{i,A}^{\,t_i} (1 - d_{i,A})^{1 - t_i} \tag{5.8}$$

Substituting Equation 5.8 into each of the corresponding terms in Equation 5.5, we have

$$P(X_{T,A} \mid w_{T,A} = 0) = \prod_{i=1}^{m} d_{i,A}^{\,t_i} (1 - d_{i,A})^{1 - t_i} \tag{5.9}$$

Now, each property of a concept $A$ in the semantic dictionary is associated with two parameters, $s_{i,A}$ and $d_{i,A}$, which capture its intra-concept similarity and (through $1 - d_{i,A}$) its inter-concept dissimilarity in the context of concept $A$, respectively.
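Equations 5.3 through 5.9 combine into a short computation, sketched below in Python under stated assumptions. The function names (likelihood, membership_probability, belongs_to) and the numeric inputs are hypothetical illustrations, not values from the dissertation; the prior is passed in as a parameter; and returning 0 when the denominator of Equation 5.3 vanishes is an assumption. Here d holds the values $d_{i,A} = P(t_i = 1 \mid w_{T,A} = 0)$, i.e., the probability itself rather than its complement.

```python
from typing import Sequence


def likelihood(t: Sequence[int], q: Sequence[float]) -> float:
    """Equations 5.7 and 5.9: product over i of q_i^t_i * (1 - q_i)^(1 - t_i)."""
    result = 1.0
    for t_i, q_i in zip(t, q):
        result *= q_i if t_i == 1 else (1.0 - q_i)
    return result


def membership_probability(
    t: Sequence[int],    # X_{T,A}: t_i = 1 if T possesses property p_i of A
    s: Sequence[float],  # s_{i,A} = P(t_i = 1 | w_{T,A} = 1)
    d: Sequence[float],  # d_{i,A} = P(t_i = 1 | w_{T,A} = 0)
    prior: float,        # P(w_{T,A} = 1); P(w_{T,A} = 0) is taken as 1 - prior
) -> float:
    """Equation 5.3: P(w_{T,A} = 1 | X_{T,A}) under the independence assumption."""
    if not (len(t) == len(s) == len(d)):
        raise ValueError("t, s and d must have one entry per property of A")
    numerator = likelihood(t, s) * prior
    denominator = numerator + likelihood(t, d) * (1.0 - prior)
    if denominator == 0.0:
        # Assumption (not in the text): the observed vector is impossible
        # under both hypotheses, so report no support for membership.
        return 0.0
    return numerator / denominator


def belongs_to(probability: float, threshold: float) -> bool:
    """Decision rule: T is placed in A if the probability exceeds A's threshold."""
    return probability > threshold


# Hypothetical concept with two properties; T possesses only the first one.
p = membership_probability(t=[1, 0], s=[0.9, 0.2], d=[0.1, 0.5], prior=0.5)
print(p, belongs_to(p, threshold=0.8))
```

With these illustrative numbers the two likelihoods are 0.9 × 0.8 = 0.72 and 0.1 × 0.5 = 0.05, so the estimated probability is 0.36 / 0.385, roughly 0.935, which exceeds the assumed threshold of 0.8.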
The estimation of the two parameters $s_{i,A}$ and $d_{i,A}$ has been described in Section 5.3.1. We now look at the estimation of the two prior probability values.

5.3.2.1 Prior Values Estimation

There are several different ways of estimating the prior probability values, $P(w_{T,A} = 1)$ and $P(w_{T,A} = 0)$, of the type object being newly exported, $T$, with respect to a concept $A$ in the semantic dictionary. Again, these two probability values capture our reasoning behavior based on past experience. Therefore, we estimate $P(w_{T,A} = 1)$ by $\sum P_i$ over all properties of the concept $A$, where $P_i$ represents the average of previous probability values having the same value. In addition, $P(w_{T,A} = 0)$ is estimated by $1 - P(w_{T,A} = 1)$.

The reason behind this estimation method is that users tend to reason about the meaning of a type by simply looking at its properties. In other words, if the type being exported has most of the properties that a particular concept possesses, it will usually be regarded as representing that concept. On the other hand, it will be regarded as not representing that concept if the type possesses only a few of the properties that the concept possesses. Our estimation process captures this kind of human reasoning behavior as the prior probability values.

(a)
    Distinguishing_Power_Estimation(concept A,
                                    float intra-concept-similarity[],
                                    float inter-concept-dissimilarity[])
    {
        for each property p of A {
            Let n(p)   be the number of type objects possessing property p
                r(p,A) be the number of type objects belonging to concept A possessing property p
                N      be the total number of type objects exported to the federation
                R(A)   be the number of objects belonging to concept A
            intra-concept-similarity[p] = r(p,A) / R(A)
            inter-concept-dissimilarity[p] = (n(p) - r(p,A)) / (N - R(A))
        }
    }

(b)
    Heuristics(exported type T)
    {
        float intra-concept-similarity[];
        float inter-concept-dissimilarity[];
        float probability;

        get the first concept A from the semantic dictionary
        while (the position of T has not been determined) {
            Distinguishing_Power_Estimation(A, intra-concept-similarity,
                                            inter-concept-dissimilarity);
            probability = Compute_Probability(T, intra-concept-similarity,
                                              inter-concept-dissimilarity);
            if (probability > threshold of A) {
                type T should belong to A;
                get subconcept of A;
            }
            else
                get the next concept A from the semantic dictionary;
        }
    }

Figure 5.4: Pseudo code for estimating sharing heuristics

5.3.3 Heuristics Indexing Estimation

Figure 5.4 illustrates the pseudo code of our heuristics indexing algorithm. Figure 5.4a presents the pseudo code for estimating the intra-concept similarity value and inter-concept dissimilarity value of each property for a concept in the semantic dictionary, as described in Section 5.3.1.
Figure 5.4b presents the pseudo code for determining the concept to which a type object T being exported should belong. Our heuristic for indexing a type object into the concept hierarchies in the semantic dictionary is very simple. Precisely, each concept A of the hierarchies keeps track of its own similarity threshold; if the probability indicator of a type object being exported, T, with respect to a concept A estimated by the sharing advisor exceeds the threshold of A, the sharing advisor deduces that type T should belong to concept A. The similarity threshold of each concept is not static; if a type object is indexed into a particular concept hierarchy, the thresholds of all concepts belonging to the hierarchy will be adjusted accordingly.

5.3.4 Discussion

This kind of statistical heuristic-based approach offers a degree of error resilience, allowing the accuracy of the distinguishing capabilities to be gradually improved over a period of time. A possible limitation of this approach is its potentially oscillating nature during the early stages of a federation; however, our experiments have shown that this limitation becomes less significant as the number of registering components and type objects increases, damping the oscillation to a steady state (see Chapter 7).

Information that is utilized by the sharing advisor to determine the correspondence between properties comes from three different sources:

1. The property itself: As mentioned previously, in the situation in which a property is defined as being of an atomic data type such as INTEGER or STRING, only the property name is utilized by the sharing advisor. This is because the value type of the property does not provide additional useful semantic information; see, for example, the property position# of type Linkage Map of component A in Figure 4.2a.
On the other hand, if a property is defined as a user-defined value type, both the property name and the value type of the property are considered by the sharing advisor; see, for example, property amino_acid of type Protein Instances of component A in Figure 4.2a.

2. Previously acquired knowledge: As described above, a list of property term correspondences is incrementally maintained in the semantic dictionary as a result of user interaction. This list is consulted by the sharing advisor to determine the correspondence between properties, e.g., for property product of type Genes in component A and property produce of type Genetics in component B of Figure 4.2.

3. User consultation: In the situation in which the semantic dictionary does not provide adequate knowledge for the sharing advisor to properly interrelate similar properties, the user is consulted.

5.4 Discovery Requests: Searching

Once the exported information is properly indexed within the concept hierarchies, components can request information by issuing a discovery request. We specifically differentiate three basic kinds of discovery request primitives. These request primitives, when combined, allow a component to discover a wide variety of remote information.

• Type 1-Similar Information: In this kind of discovery request, a component user is interested in locating type objects in remote components that are conceptually similar/related to a particular type object in the local component. All type objects belonging to the concept hierarchy in which the local type object resides (in the semantic dictionary) are appropriate to this request. For example, in the concept hierarchies of Figure 5.2b, Protein Structure of component B is a proper candidate for the request by component A of Figure 4.2a for related information on Protein Instances.

• Type 2-Overlapping Information:
This kind of discovery request arises when a component is interested in locating remote type objects that overlap in their information content with a local type. In other words, a component user is interested in remote type objects that represent the same or an equivalent local view of information. For example, component A would like to display all "protein"-like information using its own three-dimensional viewing program, which works on members of type Protein Instances. All types with properties similar to those of Protein Instances that belong to the sub-hierarchy rooted at Protein Instances are proper candidates for this request. In the particular example in Figure 5.2b, there is no candidate type object that would satisfy such a request at this time.

• Type 3-Complementary Information: Here, a component user is interested in discovering additional information about a local type object. In other words, the user is interested in a complementary view of a similar/related concept. This may occur, for example, when component A of Figure 4.2a is interested in additional information on Protein Instances. All type members belonging to the concept hierarchy in which the local type object resides (in the semantic dictionary), except those type members belonging to the sub-hierarchy rooted at the most specific concept in which the local type object resides, are proper candidates for this request. In our example, all type objects belonging to the hierarchy of Protein Information, except those belonging to the sub-hierarchy rooted at Protein Instances, are appropriate candidates for a request for complementary information on Protein Instances. Protein Structure of component B would then be a proper candidate.
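The three request primitives can be read as set operations over a concept hierarchy, as the following Python sketch suggests. It is a simplified illustration only: the classes TypeObject and Concept and the functions similar, overlapping, and complementary are hypothetical names, the sketch assumes that only types of other components are returned as candidates, and it omits the property-similarity check that a Type 2 request additionally applies. The hierarchy built at the end mirrors the Protein Information hierarchy of Figure 5.2b.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TypeObject:
    name: str
    component: str  # database component that exported the type


@dataclass
class Concept:
    name: str
    members: List[TypeObject] = field(default_factory=list)
    subconcepts: List["Concept"] = field(default_factory=list)

    def all_members(self) -> List[TypeObject]:
        """Members of this concept and, recursively, of its subconcepts."""
        found = list(self.members)
        for sub in self.subconcepts:
            found.extend(sub.all_members())
        return found

    def most_specific(self, t: TypeObject) -> Optional["Concept"]:
        """Deepest concept in this hierarchy holding t, or None if absent."""
        for sub in self.subconcepts:
            hit = sub.most_specific(t)
            if hit is not None:
                return hit
        return self if t in self.members else None


def _remote(candidates: List[TypeObject], local: TypeObject) -> List[TypeObject]:
    # Assumption: only types exported by other components are of interest.
    return [c for c in candidates if c.component != local.component]


def similar(root: Concept, local: TypeObject) -> List[TypeObject]:
    """Type 1: every type in the hierarchy in which the local type resides."""
    return _remote(root.all_members(), local)


def overlapping(root: Concept, local: TypeObject) -> List[TypeObject]:
    """Type 2: types in the sub-hierarchy rooted at the local type's concept."""
    home = root.most_specific(local)
    return _remote(home.all_members(), local) if home is not None else []


def complementary(root: Concept, local: TypeObject) -> List[TypeObject]:
    """Type 3: the hierarchy minus the local type's own sub-hierarchy."""
    home = root.most_specific(local)
    if home is None:
        return []
    excluded = home.all_members()
    return _remote([m for m in root.all_members() if m not in excluded], local)


protein_instances = TypeObject("Protein Instances", "A")
protein_structure = TypeObject("Protein Structure", "B")
protein_information = Concept(
    "Protein Information",
    subconcepts=[
        Concept("Protein Instances", members=[protein_instances]),
        Concept("Protein Structure", members=[protein_structure]),
    ],
)

print(similar(protein_information, protein_instances))
print(overlapping(protein_information, protein_instances))
print(complementary(protein_information, protein_instances))
```

For a request issued by component A on Protein Instances, the sketch returns Protein Structure of component B for the similar and complementary requests and no candidate for the overlapping request, in line with the examples given above.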
5.4.1 Discussion

It is important to note that although, intuitively, discovery of information occurs after registration, these two processes in fact occur in an arbitrarily interleaved manner, depending on when the discovery request is initiated and when information units are exported and registered into the federation. Notice also that there is a close relationship between discovery of information and registration of information. During registration, a type object is presented to the sharing advisor; the sharing advisor attempts to find the proper location in the concept hierarchies where the newly exported type object should reside. The knowledge state of the semantic dictionary is updated; the heuristics are re-estimated and the concept hierarchies are re-organized to include the newly registered information for upcoming registration and/or discovery request(s). During discovery, a type object is presented to the sharing advisor. If the object does not currently exist in the semantic dictionary, i.e., the object was not previously registered, the sharing advisor attempts to find the most similar concept existing in the current state of the semantic dictionary using the same heuristic-based algorithm. On the other hand, if the presented type object currently exists in the semantic dictionary, the proper information will simply be identified according to the discovery request.

5.5 The Discovery Tool: Browsing

As it is not always possible for a component user to specify an appropriate pattern type for a discovery request, it is beneficial if the component user can browse through the resource space maintained in the semantic dictionary for any interesting information. For our clinical researcher example in Chapter 2, the CR can browse through the resource space for any genetics information.
Further, browsing capability is also crucial to allow a component user to further explore any discovered information. Notice that the capability of browsing through the resource space in the semantic dictionary is closely related to the organization of resources in the semantic dictionary, since the better the resources are organized, the more meaningful the organization will be and the easier and more effective it is to browse.

To this end, we have implemented a browser for a database component to explore the content of the semantic dictionary, viz., a Discovery Tool. Figure 5.5 illustrates our current prototype of this discovery tool. In order for the discovery tool to communicate with a remote database, the name of the host on which the remote database resides and the name of the remote database have to be provided. Together, the host name and the database name serve as the address of the remote database. Once connected to a remote database, the set of type objects available at the remote database is displayed in one window by clicking the “Update Types” button. The hierarchy of the type objects will also be displayed in another window by clicking the “Show Schema” button. A type object of interest to a user can be selected by clicking on the type name, and the set of properties defined for the selected type object will be displayed in another window. For example, as shown in Figure 5.5, the type hierarchy of a remote database named ScientistA, which resides on host catarina, is connected to the
discovery tool. The database consists of five types: Amino_Acid_Sequences, Protein Instances, Genes, Linkage_Map, and Choromosomes. The type object Amino_Acid_Sequences, in turn, has two subtypes, Secondary_Structure and Tertiary_Structure. Type object Protein Instances, in turn, has five properties as depicted in the figure. Although not explicitly shown in Figure 5.5, data values of remote instances can also be displayed by selecting the property (properties) that is (are) of particular interest and clicking the “Show Value” button.

Figure 5.5: The discovery tool

Chapter 6
Prototype Implementation

Architecture... the adaptation of form to resist force.
JOHN RUSKIN.

We have implemented an experimental prototype involving a collection of Omega components. In our experimental prototype, the sharing advisor is implemented as a component of the federation; each component consists of an Omega database, an Importer, and an Exporter, and is associated with a machine where the database physically resides. In this chapter, we discuss the implementation details of each individual component as well as the interactions among them.

6.1 The Overall Architecture

Since the focus of this work is on interoperating pre-existing databases which were originally developed without the intent to cooperate, communication facilities are usually not supported in these databases. The main issue in implementing our discovery mechanism is, therefore, to provide communication capability for each database component to communicate with one another and with the sharing advisor. It is, however, unacceptable to rewrite the kernel of the database management software to incorporate such a facility; additional capabilities need to be interfaced with the original database system.
In light of these limitations, the design of each database component of our experimental prototype consists of four different layers, as shown in Figure 6.1. At the network level, communications among individual database systems are accomplished using the RPC message-passing paradigm. On top of the communication network runs a database component. In our prototype, each database component is implemented by an Omega database system [10], mainly because of its availability and its support of the database modeling constructs required by our CODM. As noted, our mechanism requires the ability to extract meta-data information from a database component. Since the original Omega database does not support this functionality, we introduced so-called meta-functions interfacing with the underlying Omega database,1 which return meta-data information about objects in its local Omega database. Another advantage of having a set of meta-functions interfacing with the underlying database is the flexibility of porting database systems other than Omega to the discovery mechanism; all a database component needs to do is to implement its own set of meta-functions. For instance, a component employing a relational database system can still take advantage of our discovery mechanism by implementing its own semantics of mapping the underlying relational database constructs to the corresponding CODM modeling constructs in its own set of meta-functions.

1 We had the alternative of implementing such functionality inside the kernel of the Omega database system; however, it would require modification of the kernel of the database system, which is not acceptable; additional modifications would also be necessary whenever new meta-functions are determined to be needed.
Figure 6.1: The four-layer architecture of the federation

The set of meta-functions is as follows:

Meta-Function                          Description
ShowAllTypes()                         Returns a list of all types that can be shared
HasProperties(t:Type)                  Returns a list of all properties defined on type t
HasInstances(t:Type)                   Returns a list of all instances defined on user defined type t
HasValue(i:Instance, f:Property)       Returns the value of the property f on instance i
HasValueType(f:Property)               Returns the value type for a property f
HasDirectSubtypes(t:Type)              Returns a list of all direct subtypes of type t
HasDirectSuperType(t:Type)             Returns the direct supertype of type t

Table 6.1: The set of Meta-functions

In effect, meta-functions serve as an inter-component communication protocol. The meta-functions also interface with two extra modules called the Importer and the Exporter, which handle communication requests to and from remote components as shown in Figure 6.1. We discuss the functional capabilities of the Importer and Exporter in the subsequent section.

Figure 6.2: Interactions among components within the federation

6.2 The Design of the Sharing Advisor

Figure 6.2 shows another perspective of our federated architecture. Here, the sharing advisor is designed and implemented as another Omega component of the federation (see Figure 6.2).
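Because every component, the sharing advisor included, answers the same meta-functions of Table 6.1, the interface can be sketched once and implemented per backend. The following is an illustrative sketch only: the Python method names are renderings of the table entries, `InMemoryComponent` is a hypothetical toy backend standing in for an Omega or relational database, a property is identified by an assumed (type, name) pair, and the instance data are made up.

```python
from abc import ABC, abstractmethod


class MetaFunctions(ABC):
    """Python rendering of Table 6.1 (method names mirror the meta-functions)."""

    @abstractmethod
    def show_all_types(self): ...

    @abstractmethod
    def has_properties(self, t): ...

    @abstractmethod
    def has_instances(self, t): ...

    @abstractmethod
    def has_value(self, i, f): ...

    @abstractmethod
    def has_value_type(self, f): ...

    @abstractmethod
    def has_direct_subtypes(self, t): ...

    @abstractmethod
    def has_direct_supertype(self, t): ...


class InMemoryComponent(MetaFunctions):
    """Toy backend; a real component would map its own constructs here."""

    def __init__(self, schema, supertype, instances):
        self.schema = schema        # {type: {property name: value type}}
        self.supertype = supertype  # {type: direct supertype}
        self.instances = instances  # {type: {instance id: {property name: value}}}

    def show_all_types(self):
        return sorted(self.schema)

    def has_properties(self, t):
        return [(t, name) for name in self.schema[t]]

    def has_instances(self, t):
        return sorted(self.instances.get(t, {}))

    def has_value(self, i, f):
        t, name = f
        return self.instances[t][i][name]

    def has_value_type(self, f):
        t, name = f
        return self.schema[t][name]

    def has_direct_subtypes(self, t):
        return sorted(s for s, parent in self.supertype.items() if parent == t)

    def has_direct_supertype(self, t):
        return self.supertype.get(t)


# Made-up data loosely following the types named in Section 5.5.
scientist_a = InMemoryComponent(
    schema={
        "Amino_Acid_Sequences": {},
        "Secondary_Structure": {},
        "Tertiary_Structure": {},
        "Protein Instances": {"molwt": "REAL", "amino_acid": "Amino_Acid_Sequences"},
    },
    supertype={
        "Secondary_Structure": "Amino_Acid_Sequences",
        "Tertiary_Structure": "Amino_Acid_Sequences",
    },
    instances={"Protein Instances": {"p1": {"molwt": 64.5, "amino_acid": "seq1"}}},
)
print(scientist_a.has_direct_subtypes("Amino_Acid_Sequences"))
```

A component built on a different database system would subclass `MetaFunctions` in the same way, which is the porting flexibility argued for in Section 6.1.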
When the Importer of a component receives a request from the user or from the discovery tool, the Importer constructs the appropriate RPC calls and sends them to the sharing advisor through the network. The Exporter of the sharing advisor receives the RPC calls, performs the requested service, and sends the results back to the initiating component through the Importer of the sharing advisor. The Exporter of the initiating component receives the information from the sharing advisor and presents the information back to the user.

Figure 6.3: Conceptual schema of the semantic dictionary

In order to keep track of the registered information, the semantic dictionary and the concept hierarchy are accordingly modeled by a corresponding Omega database. Figure 6.3 indicates the conceptual schema of the semantic dictionary, including the meta-data information. As shown in the shaded region of Figure 6.3, each concept in the concept hierarchy is modeled as a type object in the semantic dictionary, while each type member of the concept hierarchy is modeled as an instance object in the semantic dictionary. Supplementary information about registered type objects is maintained by separate meta-data type objects in the semantic dictionary. For example, the location of each registered type object is maintained by the two properties, host and dbname, of the type object Supplementary Information, which keep track of the host name and the database name of the original database to which the registered type object belongs. Here, the multiple membership modeling construct of Omega is used to relate the original location of a registered type object to the concept to which the type object belongs.
The intra-concept similarity and inter-concept dissimilarity values of each property with respect to each concept are maintained by the ternary relationships among meta-types Types, Properties, and Parameters. Finally, the list of terminological correspondences is maintained by the property list_of_synonym of the meta-type System-Types, which is a supertype of the two meta-types Types and Properties; each of these therefore has its own list_of_synonym through inheritance.

6.3 The Importer and Exporter

In general, the Importer of a component serves two purposes:

1. It receives a request from the component user or the discovery tool, constructs the appropriate RPC calls, and sends the RPC calls to the target component that services the request.

2. It receives the information from the Exporter of the same component that it serves and sends it to the recipient component or the discovery tool.

By contrast, the Exporter of a component serves only one purpose:

1. It receives RPC requests sent from the Importer of a remote component, performs the appropriate services, and passes the results to the Importer of its local component which, in turn, sends the information back to the initiating component.

Figure 6.4: Flow of information for browsing the semantic dictionary. (a) Information flow at the initiated component. (b) Information flow at the Sharing Advisor.
To further understand how the Importer and Exporter function, consider the scenario in which component A of Figure 4.2a would like to explore the content of the semantic dictionary using the discovery tool. Figure 6.4 illustrates the flow of information between the discovery tool and the semantic dictionary for this scenario. Component A informs the discovery tool of its wish to explore the semantic dictionary by specifying the address (i.e., the host and database name) of the sharing advisor in the discovery tool (see Figure 5.5), as the sharing advisor is the component responsible for the maintenance of the semantic dictionary. Once it has received the request from component A, the discovery tool sends a request to the Importer of component A for retrieving the meta-data information of the semantic dictionary. The Importer of component A, in turn, services the request from the discovery tool by constructing an RPC call and sending the call to the Exporter of the sharing advisor.

Consider the content of the semantic dictionary as depicted in Figure 5.2b. The Exporter of the sharing advisor, upon receiving the request from component A, will issue a sequence of meta-function calls to retrieve the meta-data information of the semantic dictionary, such as ShowAllTypes() to retrieve the set of currently existing concepts and HasProperties(Protein Information) to retrieve the set of properties associated with the concept Protein Information. This information will be passed to the Importer of the sharing advisor, in which the corresponding RPC calls will be constructed to send the information back to component A. The Exporter of component A, upon receiving the information from the sharing advisor, will send the information to the Importer of component A which, in turn, will send the information back to the discovery tool.
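The round trip just described can be traced with a small in-process simulation. This is a sketch under stated assumptions rather than the prototype's code: RPC is replaced by direct method calls, the `Component` class and its method names are hypothetical, and the property list returned for Protein Information is illustrative only.

```python
class Component:
    """One federation member with an Importer side and an Exporter side."""

    def __init__(self, name, trace, meta_functions=None):
        self.name = name
        self.trace = trace                          # shared list recording each hop
        self.meta_functions = meta_functions or {}  # meta-function name -> callable
        self.received = None

    # Importer side -------------------------------------------------------
    def importer_send_request(self, target, function, *args):
        """Forward a request from the user or discovery tool to `target`."""
        self.trace.append(f"{self.name}.Importer -> {target.name}.Exporter")
        target.exporter_serve(self, function, args)
        return self.received                        # handed back to the discovery tool

    def importer_send_reply(self, initiator, result):
        """Send results produced locally back to the initiating component."""
        self.trace.append(f"{self.name}.Importer -> {initiator.name}.Exporter")
        initiator.exporter_accept(result)

    # Exporter side -------------------------------------------------------
    def exporter_serve(self, initiator, function, args):
        """Perform the requested meta-function and pass the result on."""
        result = self.meta_functions[function](*args)
        self.trace.append(f"{self.name}.Exporter -> {self.name}.Importer")
        self.importer_send_reply(initiator, result)

    def exporter_accept(self, result):
        """Receive returned information and hand it to the local Importer."""
        self.trace.append(f"{self.name}.Exporter -> {self.name}.Importer")
        self.received = result


# Illustrative semantic dictionary content (assumed, not taken from Figure 5.2b).
concepts = {"Protein Information": ["code", "functions", "source"]}
trace = []
advisor = Component("SharingAdvisor", trace, {
    "ShowAllTypes": lambda: sorted(concepts),
    "HasProperties": lambda t: concepts[t],
})
component_a = Component("A", trace)

types = component_a.importer_send_request(advisor, "ShowAllTypes")
props = component_a.importer_send_request(advisor, "HasProperties", "Protein Information")
print(types, props)
print("\n".join(trace[:4]))
```

Each request produces four hops in the trace, matching the order in the text: Importer of A, Exporter of the sharing advisor, Importer of the sharing advisor, Exporter of A, and back to the Importer of A.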
The display of the discovery tool with the content of the semantic dictionary shown in Figure 5.2b is depicted in Figure 6.5.

Figure 6.5: The discovery tool for browsing the semantic dictionary

Chapter 7
Performance Evaluation

When you can measure what you are speaking about, and express it in numbers, you know something about it.
WILLIAM KELVIN.

There are ultimately two central claims in this research. First, we have claimed that traditional approaches to determining the similarity of type objects by a properties-intersection mechanism are not feasible, due to the fact that similar (type) objects may have complementary information while non-similar objects may have overlapping information. Second, we claimed that our heuristics can properly index similar (type) objects within a classification in the semantic dictionary. In this chapter, we characterize the performance of our discovery mechanism with respect to these two criteria.

7.1 Evaluation Testbed

Our prototype shows the feasibility of implementing our discovery mechanism using existing database technologies. We are, in fact, more interested in the performance of the mechanism under various situations. We have, therefore, chosen to study the behavior of our mechanism with the help of a carefully designed simulation model. We have chosen the “csim” [40] simulation language to implement the simulation model. Figure 7.1 presents an overview of our simulation model. Here, a specific reference (pattern) type object is defined.
Based on the reference object, a fixed number of testing (type) objects are generated. The queue of testing objects is generated in such a way that a certain percentage of objects are defined to be similar to the reference object, while the remainder of the objects in the queue are defined to be non-similar to the reference object. For each object defined to be similar to the reference object, its set of properties is defined to be a random subset of that of the reference object; two objects in the testing queue might then contain complementary information even though they are both defined to be similar to the reference object. For each object defined to be non-similar to the reference object, its set of properties is defined such that it may still contain some random subset of that of the reference object; therefore, two objects in the testing queue might contain overlapping information even though they are defined to be dissimilar.1 In effect, the reference object represents a real world concept and the testing objects defined to be similar to the reference object represent different views on this real world concept. This information of similarity and dissimilarity is stored in the “Human Agent” knowledge base, which acts as a component user interacting with the sharing advisor during the experimental process (see Figure 7.1).

1 One object is defined to be similar to the reference object while the other is defined as not.

The goal of our experiments is to study the effectiveness of the sharing advisor in indexing the type objects into the semantic dictionary under various conditions without the knowledge of how the testing objects are generated from the reference object. In other words, the knowledge of generating the testing objects is only available to the human agent. Another goal of our experiments is to compare the
indexing performance of our mechanism with the traditional properties-intersection approach(es). Ideally, all type objects that are defined to be similar to the reference object should be properly indexed such that they all belong to one classification (hereafter referred to as classification C). Similarly, all type objects that are defined to be non-similar to the reference object should be indexed such that none of them belongs to C.

Figure 7.1: Design of the simulation model

7.2 Metrics for Measurement

Three metrics are adopted to measure the indexing capability of the sharing advisor:

1. Similarity Ratio: This measures whether all objects similar to the reference object are indexed in the same classification C without any human instruction, and is estimated by:

\[ \text{Similarity Ratio} = \frac{\text{\# of similar objects properly indexed}}{\text{total \# of similar objects}} \tag{7.1} \]

2. Dissimilarity Ratio: This measures whether all objects dissimilar to the reference object are properly indexed outside of C without any human instruction, and is estimated by:

\[ \text{Dissimilarity Ratio} = \frac{\text{\# of non-similar objects properly indexed}}{\text{total \# of non-similar objects}} \tag{7.2} \]

3. Instruction Ratio: This measures the number of human instructions needed to have all objects properly indexed, and is estimated by:

\[ \text{Instruction Ratio} = \frac{\text{\# of instructions}}{\text{total \# of objects}} \tag{7.3} \]

Ultimately, we would like to achieve high similarity and dissimilarity ratios, i.e., close to 1, as well as a low instruction ratio, i.e., close to 0. This set of performance measurement metrics is “sound”, since we would like all objects to be properly indexed with as few human instructions as possible.
Further, this set of metrics is also “complete” because using any subset of the metrics cannot fairly justify the performance of the indexing mechanism. For instance, a trivial indexing mechanism can achieve a similarity and dissimilarity ratio of 1 by categorizing all objects into 1 classification; however, this will result in a very high instruction ratio as well.

7.3 Design of Experiments

For each experiment described below, two different trials are conducted. In the first trial, the set of properties of the reference object is restricted to be of primitive value types only, so that the set of properties of each testing object is also confined to primitive value types. In the second trial, the reference object also contains properties of user defined value types, so that the set of properties of each testing object might also relate to another user defined type. In order to generate properties of user defined value types, the process of generating testing objects is applied recursively. In other words, additional reference object(s) are further defined and additional testing objects are generated based on the new reference object(s).

During each experiment, the testing objects are registered with the sharing advisor one at a time. The experiments are conducted to study the ability of the sharing advisor to index the type objects into the semantic dictionary over a period of time. Therefore, unlike experiments conducted to study machine learning behavior, in which a training set of objects is usually supplied to pre-train the learning mechanism before any performance measurement is studied, we start with an empty semantic dictionary and measure the metrics cumulatively at 10% intervals of the total number of testing objects in the queue.
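The three ratios of Section 7.2, measured cumulatively at 10% checkpoints as just described, can be sketched as follows. This is an illustration under assumptions that the text does not spell out: the denominators count only the objects observed so far, exactly one human instruction is charged for each object the advisor fails to place on its own, and the function names and sample outcomes are hypothetical.

```python
def ratio(numerator, denominator):
    """Plain quotient, or None while no object of that kind has been observed."""
    return numerator / denominator if denominator else None


def cumulative_ratios(outcomes, checkpoints=10):
    """Compute Eqs. (7.1)-(7.3) over growing prefixes of the testing queue.

    `outcomes` lists, in registration order, one pair per testing object:
    (is_similar, indexed_correctly_without_instruction).
    """
    total = len(outcomes)
    rows = []
    for k in range(1, checkpoints + 1):
        seen = outcomes[: k * total // checkpoints]
        similar = [ok for is_sim, ok in seen if is_sim]
        dissimilar = [ok for is_sim, ok in seen if not is_sim]
        rows.append({
            "observed_pct": 100 * k // checkpoints,
            "similarity": ratio(sum(similar), len(similar)),
            "dissimilarity": ratio(sum(dissimilar), len(dissimilar)),
            # Assumption: one instruction per object that was not placed correctly.
            "instruction": ratio(sum(not ok for _, ok in seen), len(seen)),
        })
    return rows


# Made-up outcomes for a queue of ten objects, alternating similar / non-similar.
outcomes = [
    (True, False), (False, False), (True, True), (False, True), (True, True),
    (False, True), (True, True), (False, True), (True, True), (False, True),
]
rows = cumulative_ratios(outcomes)
for row in rows:
    print(row)
```

In this made-up run the first two objects need instruction and every later one is placed unaided, so the similarity and dissimilarity ratios rise toward 1 and the instruction ratio falls toward 0, which is the behavior the metrics are meant to reward.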
In order to make sure that the performance measurements are not biased towards any specific set of testing objects, the content of the semantic dictionary is cleared and a new queue of testing objects is re-generated randomly between successive trials and experiments. It is important to note that since the testing objects are generated randomly every time an experiment is conducted, it is very difficult to perform a detailed analysis on each specific measurement; rather, we are only interested in the overall behavior of the results.

One should note that the experiments described here are conducted to study the indexing capability of the resource discovery mechanism. Certain issues, such as the resolution of semantic heterogeneity, are not considered in their entirety here. Further notice that since there is a close relationship between discovery and registration, as described in Section 5.4.1, the performance of indexing type objects in the concept hierarchies has a direct impact on the performance of discovering information as well.

7.3.1 Experimental Configurations

Before any experiment can be conducted, certain parameters of the simulation program need to be configured. These are explained as follows:

• Number of properties of the reference object (R):
We arbitrarily choose an integer value for this parameter. The experiments described below are all conducted with this parameter set to 16, with the exception of experiment #5, which is conducted to study the effect of varying this parameter on the effectiveness of the indexing mechanism. We arbitrarily choose 16 properties for the reference object as a middle ground between a few properties and a large number of properties.

• Maximum number of properties belonging to a testing object that is defined to be similar to the reference object (N_similar):
We arbitrarily choose an integer value < R. This decision is made due to the observation that a database (type) object usually represents a specific view of the corresponding real world concept; therefore, the set of properties associated with a database object usually is a subset of those associated with the real world concept. The experiments described below are all conducted with this parameter set to 8, with the exception of experiment #3, which is conducted to study the effect of varying this parameter on the effectiveness of the indexing mechanism.

• Maximum number of properties belonging to a testing object that is defined to be dissimilar to the reference object (N_dissimilar):
Again, we arbitrarily choose an integer value < R. The experiments described below are all conducted with this parameter also set to 8, with the exception of experiment #4, which is conducted to study the effect of varying this parameter on the effectiveness of the indexing mechanism.

• Size of the testing queue (T_objects):
The experiments are conducted with 100, 200, 400, 800, and 1000 testing objects, depending on the need of the specific experiment(s) conducted.

7.4 Experimental Results

7.4.1 Experiment #1

Our first experiment is conducted to study the differences between the traditional properties-intersection approach and our heuristic-based approach to indexing type objects in the semantic dictionary. In measuring the performance of the properties-intersection approach, a registered type object is deduced to be similar to a concept of the semantic dictionary if the number of overlapping properties between the type object and the concept exceeds a certain threshold. In this experiment, the threshold is set to 0.5 as a middle ground between no overlap, i.e., 0, and total overlap, i.e., 1. Appendix B shows the results of additional experiments conducted on the properties-intersection mechanism with different threshold values.
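The testing-object generation of Section 7.1 and the properties-intersection rule can be sketched together to show why a pure overlap test misfires. This is a hedged illustration, not the csim simulation: the text does not say how the overlap is normalized, so the sketch assumes it is the fraction of the object's own properties that also appear in the concept; object sizes are assumed to be drawn uniformly up to the configured maximum; every non-similar object is assumed to carry at least one foreign property; and all names are hypothetical.

```python
import random

R = 16             # number of properties of the reference object
N_SIMILAR = 8      # maximum properties of an object defined as similar
N_DISSIMILAR = 8   # maximum properties of an object defined as dissimilar

REFERENCE = frozenset(f"p{i}" for i in range(R))
FOREIGN = frozenset(f"q{i}" for i in range(R))   # properties outside the concept


def make_similar(rng):
    """A random subset of the reference object's properties."""
    size = rng.randint(1, N_SIMILAR)
    return frozenset(rng.sample(sorted(REFERENCE), size))


def make_dissimilar(rng):
    """Mostly foreign properties, possibly sharing some with the reference."""
    size = rng.randint(1, N_DISSIMILAR)
    shared = rng.randint(0, size - 1)            # keeps at least one foreign property
    return frozenset(
        rng.sample(sorted(REFERENCE), shared)
        + rng.sample(sorted(FOREIGN), size - shared)
    )


def overlap_fraction(obj, concept):
    """Assumed normalization: share of obj's properties found in concept."""
    return len(obj & concept) / len(obj)


def is_similar_by_intersection(obj, concept, threshold=0.5):
    """Properties-intersection rule with the threshold used in experiment #1."""
    return overlap_fraction(obj, concept) > threshold


# Two views of the same concept with complementary properties: no overlap.
view_a = frozenset(f"p{i}" for i in range(8))
view_b = frozenset(f"p{i}" for i in range(8, 16))
# A non-similar object that happens to overlap heavily with view_a.
lookalike = frozenset({"p0", "p1", "p2", "q0"})

print(is_similar_by_intersection(view_b, view_a))     # complementary views rejected
print(is_similar_by_intersection(lookalike, view_a))  # overlapping stranger accepted

rng = random.Random(0)
queue = [make_similar(rng) for _ in range(3)] + [make_dissimilar(rng) for _ in range(3)]
```

Under the assumed normalization, the two complementary views share nothing and are judged dissimilar, while the look-alike shares three of its four properties and is judged similar. These are the two failure modes the first claim of this chapter attributes to properties-intersection.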
In this experiment, we generate a queue of 200 objects, all defined to be similar to the reference object. Ideally, all 200 objects should be indexed within one single classification. The similarity ratios and instruction ratios of our heuristic-based approach and the traditional properties-intersection approach at each 10% interval are shown in Figure 7.2 and Figure 7.3 respectively.2 Figure 7.2a illustrates the cumulative similarity ratios at each 10% interval in which properties are restricted to be of primitive value types, while Figure 7.2b illustrates the corresponding ratios when user defined value types are allowed. As illustrated in Figure 7.2, the similarity ratios are very low and continue dropping in both experiments when a properties-intersection approach is adopted to index the testing objects. This suggests that a properties-intersection approach fails to adequately index and determine the similarity among various type objects. This is due to the simple syntactic matching of the properties-intersection mechanism; as more objects are observed by the sharing advisor, the system becomes increasingly uncertain about which concept an object should belong to. Our heuristic-based approach, by contrast, provides quite promising results, as shown by the similarity ratios, which continue to increase as a function of observed objects in both experiments. The superior performance of our mechanism is due to the capability of our heuristics in deducing the degree of dependency between properties of a type object and the concept to which the type object belongs.

Figure 7.3 illustrates the instruction ratios at each 10% interval for the same experiments. The instruction ratios can be considered as the complement of the corresponding similarity ratios, because human instruction is needed whenever the sharing advisor fails to index the registered type objects properly.
Predictably, the instruction ratios are very high and continue increasing as a function of objects observed when a properties-intersection approach is employed. By contrast, the corresponding ratios continue dropping when our heuristic-based approach is used.

In order to understand our heuristic-based approach in more detail, we further investigate our mechanism in a 50-50 situation, i.e., 50% of the objects in the testing queue are similar to the reference object while the other 50% are not.

2 Notice that we did not measure the dissimilarity ratios in this experiment because all objects are defined to be similar to the reference object.

Figure 7.2: Similarity Ratios: heuristic-based vs. properties-intersection. Panels (a) and (b); horizontal axis: % of objects observed.

Figure 7.3: Instruction Ratios: heuristic-based vs. properties-intersection. Panels (a) and (b); horizontal axis: % of objects observed.

7.4.2 Experiment #2

The purpose of our second experiment is threefold: 1. to study the effect of negative examples, i.e., objects that are dissimilar to the reference object, on the indexing capability of the sharing advisor, 2.
to study whether there is any oscillating behavior in our heuristic-based indexing mechanism, and whether the oscillating behavior will converge to an equilibrium stage, and 3. to study the effect of the number of testing objects on the indexing capability of the sharing advisor.

In this experiment, 50% of the objects in the testing queue are defined to be similar to the reference object. Ideally, all objects that are similar to the reference object should be indexed into the same classification, while none of the other objects should be so indexed. The experiment is repeated with 100, 200, 400, 800, and 1000 testing objects. The similarity ratio, dissimilarity ratio, and instruction ratio measurements are illustrated in Figure 7.4, Figure 7.5, and Figure 7.6 respectively.

Figure 7.4a shows the change of similarity ratios over time when the type objects contain properties with primitive value types only, while Figure 7.4b shows the ratios when the type objects also contain properties with user defined value types. These results suggest that the similarity ratios increase over time in both situations, i.e., as a function of objects observed, despite the existence of negative examples. We observe the oscillating behavior at the early stages of the experiment, i.e., at < 30%; however, an equilibrium stage is established when enough objects have been observed. We can also deduce that our mechanism stabilizes more rapidly when we restrict objects to properties with primitive value types only. This can be explained by the fact that objects containing properties with user defined value types impose additional diversity, requiring more objects to be observed by the sharing advisor to properly refine the heuristics.
Finally, the results also suggest that the more objects observed by the sharing advisor, the better its performance will be; this is illustrated by the higher similarity ratios in the situation of 1000 testing objects than in the 100 counterpart. This can be explained by the fact that as more objects are observed by the sharing advisor, more examples and information can be utilized by the sharing advisor to further refine the heuristics. However, once enough objects have been observed by the sharing advisor, additional observed objects will not provide much useful information; this is illustrated by the fairly equivalent values for the 200 to 1000 objects test cases.

[Figure 7.4: Similarity ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.5: Dissimilarity ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.6: Instruction ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]

Figure 7.5 illustrates the corresponding dissimilarity ratios for each experiment conducted.
The results suggest that the more objects observed by the sharing advisor, the steadier its performance will be; this is illustrated by the fairly oscillating nature of the ratios when there are only 100 testing objects in the queue, as opposed to the fairly constant nature in the situation of 1000 testing objects.

Finally, Figure 7.6 shows the instruction ratios for each corresponding experiment. Again, the experiment with fewer testing objects shows a more oscillating nature than the experiment with more testing objects. In addition, we observe that the more objects observed by the sharing advisor, the lower the instruction ratios; this can be explained by the same argument as described previously.

7.4.3 Experiment #3

The next two experiments are conducted to study whether the number of properties of each testing object has any effect on the indexing capability of the sharing advisor. This experiment is conducted with 400 testing objects, with 50% defined to be similar to the reference object and 50% defined to be dissimilar to the reference object. The experiment is repeated with N_similar set to 4, 8, 12, and 16. N_dissimilar is kept constant at 8. The similarity ratios, dissimilarity ratios, and instruction ratios for each experiment are shown in Figure 7.7, Figure 7.8, and Figure 7.9 respectively.

As depicted in the figures, the indexing capability of the sharing advisor does not seem to be affected by N_similar. This can be explained by the fact that the value of N_similar has no effect on the degree of overlap between objects of the two classifications, i.e., objects that are similar to the reference object and objects that are dissimilar; therefore, once the heuristic values are properly estimated, the performance of the mechanism will remain constant.
[Figure 7.7: Similarity ratios for varying N_similar (N_similar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.8: Dissimilarity ratios for varying N_similar (N_similar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.9: Instruction ratios for varying N_similar (N_similar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

7.4.4 Experiment #4

This experiment is conducted to continue the study of the effect of the number of properties of testing objects on the indexing capability. In this experiment, N_dissimilar is set to 4, 8, 12, and 16 respectively while N_similar is kept constant at 8. As in experiment #3, the number of testing objects in the queue is still kept at 400, with 50% defined to be similar to the reference object. The similarity ratios, dissimilarity ratios, and instruction ratios for each experiment are shown in Figure 7.10, Figure 7.11, and Figure 7.12 respectively.

The figures suggest that the sharing advisor suffers a minor performance degradation in indexing objects when the value of N_dissimilar increases: both the similarity and dissimilarity ratios drop as N_dissimilar increases, while the instruction ratios increase accordingly.
This is because as N_dissimilar increases, the chance of an overlap between objects that are similar to the reference object and those that are dissimilar to the reference object increases. It therefore takes more time and requires more objects to be analyzed in order to properly determine the distinguishing power of each property. Notice that although the performance of the indexing mechanism suffers a minor degradation as N_dissimilar increases, the effectiveness of the indexing mechanism still improves over time: the similarity ratios and dissimilarity ratios still increase as a function of objects observed, while the instruction ratios decrease as a function of objects observed. The results of this experiment suggest that as the number of common properties shared among unrelated concepts increases, the sharing advisor requires more objects to be registered to properly deduce the distinguishing power of each property.

[Figure 7.10: Similarity ratios for varying N_dissimilar (N_dissimilar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.11: Dissimilarity ratios for varying N_dissimilar (N_dissimilar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.12: Instruction ratios for varying N_dissimilar (N_dissimilar = 4, 8, 12, 16). Panels (a) and (b); x-axis: % of Objects Observed.]

7.4.5 Experiment #5

The purpose of this experiment is to study the effect of the number of properties of the reference object on the effectiveness of the indexing capability of the sharing advisor. This information is interesting because the set of properties of the reference object represents all possible properties that a type object modeling the reference object can possess. In this experiment, we generate a queue of 400 objects with 50% of the testing objects defined to be similar to the reference object. The experiment is conducted with R set to 16, 20, 24, 28, and 32 while N_similar and N_dissimilar are kept constant at 8. The similarity ratios, dissimilarity ratios, and instruction ratios for each experiment are shown in Figure 7.13, Figure 7.14, and Figure 7.15 respectively.

From the figures, we observe that the similarity and dissimilarity ratios increase as R increases while the instruction ratios decrease accordingly. This implies that as the number of possible properties a type object can possess increases, the effectiveness of the heuristic-based indexing mechanism increases accordingly. This is due to the fact that as R increases, the number of properties an object can possibly possess increases; therefore, the chance of overlap between objects from the two classifications decreases accordingly. Although an increase in the number of possible properties also decreases the chance of overlap among objects within the same classification, our heuristic-based mechanism is able to adapt to this behavior and properly determine the distinguishing power of each property with respect to each concept. That is another reason why the similarity ratios and dissimilarity ratios increase as a function of R.
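The overlap argument used in experiments #4 and #5 can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that two objects each draw their properties uniformly at random, without replacement, from one common pool of properties, which this section does not state, and the function names are hypothetical.

```python
from math import comb


def expected_shared_properties(pool_size: int, n_a: int, n_b: int) -> float:
    """Expected number of properties shared by two objects.

    Assumption (not from the thesis): each object picks its n_a or n_b
    properties uniformly at random, without replacement, from one pool.
    """
    return n_a * n_b / pool_size


def chance_of_any_overlap(pool_size: int, n_a: int, n_b: int) -> float:
    """Probability that the two objects share at least one property."""
    if n_a + n_b > pool_size:
        return 1.0  # pigeonhole: the two property sets must intersect
    # Fraction of choices for the second object that avoid all n_a properties.
    return 1.0 - comb(pool_size - n_a, n_b) / comb(pool_size, n_b)


if __name__ == "__main__":
    # Growing the pool (as with R in experiment #5) lowers the overlap.
    for pool in (16, 20, 24, 28, 32):
        print(pool,
              expected_shared_properties(pool, 8, 8),
              chance_of_any_overlap(pool, 8, 8))
```

Under this assumption the expected overlap is n_a * n_b / pool_size, so it grows with the number of properties per object and shrinks as the pool grows, which matches the direction of the explanations given for experiments #4 and #5.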
[Figure 7.13: Similarity ratios for varying R (R = 16, 20, 24, 28, 32). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.14: Dissimilarity ratios for varying R (R = 16, 20, 24, 28, 32). Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.15: Instruction ratios for varying R (R = 16, 20, 24, 28, 32). Panels (a) and (b); x-axis: % of Objects Observed.]

7.4.6 Experiment #6

Our last experiment is conducted to study the capability of the indexing mechanism to identify the distinguishing properties of a concept, i.e., properties having a high power of discriminating a concept from others. In this experiment, 50% of the testing objects are defined to be similar to the reference object. All objects that are defined to be similar to the reference object possess a specific property. In other words, there is a special property having an intra-concept similarity value of 1 with respect to the collection of objects that are similar to the reference object. The experiment is conducted on 100, 200, 400, 800, and 1000 testing objects. The similarity ratio, dissimilarity ratio, and instruction ratio measurements are illustrated in Figure 7.16, Figure 7.17, and Figure 7.18 respectively.

From Figure 7.16, we observe much higher similarity ratios in this experiment than the corresponding similarity ratios shown in Figure 7.4. This is due to the existence of a property having a high intra-concept similarity value; once the sharing advisor observes such a characteristic, it is able to determine easily the concept to which objects that possess such a property should belong. This also shows that the indexing mechanism of the sharing advisor is able to identify the property that possesses a high distinguishing power.

[Figure 7.16: Similarity ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.17: Dissimilarity ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]

[Figure 7.18: Instruction ratios for 100, 200, 400, 800, and 1000 objects. Panels (a) and (b); x-axis: % of Objects Observed.]
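The notion of a property with an intra-concept similarity value of 1 can be illustrated with a short sketch. It assumes that intra-concept similarity is the fraction of a concept's member objects that possess the property; this reading is consistent with the value of 1 described above but is not defined in this section, and the property names in the example are made up.

```python
from typing import Iterable, Set


def intra_concept_similarity(prop: str, members: Iterable[Set[str]]) -> float:
    """Fraction of a concept's member objects that possess `prop`.

    Assumed definition for illustration; each member object is modeled
    simply as the set of its property names.
    """
    members = list(members)
    if not members:
        raise ValueError("a concept needs at least one member object")
    return sum(prop in member for member in members) / len(members)


# Hypothetical objects similar to the reference object: every one of them
# carries the special property, as in the setup of experiment #6.
similar_objects = [
    {"special", "name", "code"},
    {"special", "function"},
    {"special", "name", "source"},
]

if __name__ == "__main__":
    print(intra_concept_similarity("special", similar_objects))
    print(intra_concept_similarity("name", similar_objects))
```

A property shared by every member scores 1, while a property held by only some members scores lower, which is why the former is the easier signal for the sharing advisor to pick up.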
Figure 7.17 also indicates that the indexing mechanism performs better in this experiment than in experiment #2 in classifying objects that are dissimilar to the reference object, as the dissimilarity ratios obtained in this experiment are higher than those obtained in experiment #2. This can be explained by the fact that once the sharing advisor observes the existence of a common property among the objects similar to the reference object, i.e., a property having a high intra-concept similarity value with respect to the objects similar to the reference object, the sharing advisor is able to determine easily the concept to which objects that do not possess such a property should belong. Finally, Figure 7.18 shows similar results.

7.5 Discussion

From our experiments, we can make several general observations about the behavior of the indexing mechanism of our resource discovery system. First, our experiments have shown that our heuristic-based indexing mechanism performs much better than traditional properties-intersection approach(es). Appendix B presents additional comparisons between our heuristic-based mechanism and the traditional properties-intersection mechanism under various conditions. Second, our heuristic-based mechanism is able to reuse previously acquired knowledge to deduce the meaning of newly exported objects. This follows from the observation in experiments #1 to #6 that the similarity and dissimilarity ratios increase over time while the instruction ratios decrease accordingly. We also observe from the experiments that the effectiveness of our indexing mechanism increases as a function of objects, i.e., the more objects registered/indexed into the semantic dictionary, the better the performance of the mechanism.
This behavior is shown in experiments #2 and #6. Finally, our experiments have shown that the more objects registered/indexed into the semantic dictionary, the more stable the performance of the mechanism. This is also indicated in experiments #2 and #6.

Chapter 8 Overhead Optimization

To speed today, to be put back tomorrow. SOCRATES.

The overhead of the indexing mechanism contributes a major portion of the total overhead of the discovery mechanism. In this chapter, we investigate the so-called "lazy indexing" paradigm versus our current "eager indexing" approach, with respect to reducing the overall amount of overhead involved in our discovery mechanism. We also investigate the tradeoffs between the two indexing approaches.

8.1 Optimization Method

Our current indexing mechanism indexes a newly exported type object on an eager basis, i.e., an object is indexed into the proper concept hierarchy of the semantic dictionary as soon as the object is registered. One of the shortcomings of this "eager indexing" paradigm is the potential of indexing an object that is of no interest to any database component of the federation, thus wasting the overhead spent on indexing such objects. For instance, consider the content of the semantic dictionary shown in Figure 5.2b. If all database components of the federation are only interested in information related to Genetics, the knowledge of the relationships among type objects belonging to other concepts, such as the relationship between Protein Instances and Protein Structure, will not be needed. Therefore, the overhead involved in determining the relationships among type objects not related to Genetics will be wasted.
In order to eliminate the unnecessary overhead, we experiment with the so-called "lazy indexing" paradigm in our discovery mechanism, which indexes an object on an "as needed" basis. Consider the two partial conceptual schemas shown in Figure 4.2 again. Figure 8.1 shows the evolution of the content of the semantic dictionary when the lazy indexing paradigm is adopted. With lazy indexing, the sharing advisor simply places the exported type object in the semantic dictionary during registration; the relationships between the exported type object and the other type objects are not determined at the time of registration. This idea is shown in Figure 8.1a, which indicates the content of the semantic dictionary after types Protein Instances, Amino Acid Sequences, and Genes of component A and types Protein Structure and Genetics of component B have been registered with the sharing advisor. The indexing of a type object is done only when the relationships between the type object of interest and others are needed, i.e., when a discovery request is initiated. For instance, Figure 8.1b illustrates the content of the semantic dictionary after a discovery request is initiated to identify information related to type Genetics.
As shown in Figure 8.1b, the only relationships that have been determined at the moment are the similarity relationship between Genetics and Genes, the dissimilarity relationship between Genetics and Protein Instances, the dissimilarity relationship between Genetics and Protein Structure, as well as the dissimilarity relationship between Genetics and Amino Acid Sequences. The similarity relationship between Protein Instances and Protein Structure and the dissimilarity relationship between Protein Instances and Amino Acid Sequences, for example, will never be determined unless requested by a database component, thus saving the overhead of determining such a relationship if it is never needed.

[Figure 8.1: Evolution of the semantic dictionary with lazy indexing. Panels (a) and (b). Legend: Superconcept-Subconcept; member-of (instance-of); Concept; Type.]

8.2 Overhead Measurements

To understand the relative overhead involved in the eager indexing versus the lazy indexing paradigm, we have done a preliminary study on our simulation model, measuring the time spent on indexing the testing objects under the two indexing paradigms. In this experiment, we first measure the time spent on indexing all the testing objects; this corresponds to the time spent under our original eager indexing paradigm.
Next, we measure the time spent on indexing only those objects that are defined to be similar to the reference object; this corresponds to the time spent under the lazy indexing paradigm with specific interest in objects that are related to the reference object. The time difference between these two paradigms is measured at each 10% increment of the number of objects similar to the reference object. The experiment is conducted on 100, 200, 400, 800, and 1000 testing objects and the results are displayed in Figure 8.2.

[Figure 8.2: Overhead saved with lazy indexing. Panels (a) and (b); curves for 100, 200, 400, 800, and 1000 objects; x-axis: % of Similar Objects.]

Figure 8.2a shows the time saved when the set of properties of the reference object is restricted to primitive value types, while Figure 8.2b shows the corresponding time saved when the set of properties of the reference object can be of user defined value types as well. As depicted in the figure, the time saved using the lazy indexing paradigm increases with the number of objects registered into the semantic dictionary, but drops linearly with the percentage of objects similar to the reference object. This behavior is due to the fact that, for a given percentage of objects that are similar to the reference object, the number of objects that are of no interest increases with the number of objects registered with the sharing advisor; therefore, the time saved by not indexing these objects increases accordingly. On the other hand, the number of objects that are of no interest decreases with the percentage of objects similar to the reference object; therefore, the time saved by not indexing them decreases accordingly.
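The comparison above can be mimicked with a deliberately simple cost model. The sketch below is not the simulation model used in this study: it assumes a constant, hypothetical unit cost per indexed object, and all names in it are invented for illustration. Eager indexing pays for every registered object, while lazy indexing pays only for the objects similar to the reference object.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TestingObject:
    name: str
    similar_to_reference: bool


def indexing_cost(objects: List[TestingObject], paradigm: str,
                  unit_cost: float = 1.0) -> float:
    """Overhead proxy: number of indexed objects times an assumed unit cost."""
    if paradigm == "eager":
        indexed = objects  # every registered object is indexed
    elif paradigm == "lazy":
        # only objects of interest (similar to the reference) are indexed
        indexed = [o for o in objects if o.similar_to_reference]
    else:
        raise ValueError(f"unknown paradigm: {paradigm!r}")
    return unit_cost * len(indexed)


def overhead_saved(total: int, percent_similar: int) -> float:
    """Eager cost minus lazy cost for a queue of `total` testing objects."""
    n_similar = total * percent_similar // 100
    queue = [TestingObject(f"t{i}", i < n_similar) for i in range(total)]
    return indexing_cost(queue, "eager") - indexing_cost(queue, "lazy")


if __name__ == "__main__":
    for total in (100, 200, 400, 800, 1000):
        print(total, [overhead_saved(total, p) for p in range(10, 101, 10)])
```

With a constant unit cost, the saving is simply proportional to the number of objects that are of no interest, so it grows with the queue size and falls linearly with the percentage of similar objects. A real indexing step need not have constant cost, so the sketch only reproduces the direction of the trend, not the measured values.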
8.3 Optimization Tradeoffs

An advantage of the eager indexing paradigm is the transparent behavior of the system during discovery, when component users actually issue a discovery request. Because most of the overhead is spent during registration, the response time during discovery should be minimal, and users will not notice any delay in their local database system.

The lazy indexing paradigm, on the other hand, shifts the burden of indexing from registration to discovery. In this way, unnecessary overhead for analyzing type objects that are never queried is eliminated. Component users, however, will not experience the same transparency during discovery, as the overhead in this case is partitioned between registration and discovery. Furthermore, our experiment has shown that the overhead saved with this paradigm decreases as the number of objects of interest to component users increases; therefore, if all objects in the semantic dictionary will eventually be queried, the lazy indexing paradigm will not offer any global benefit over the eager indexing paradigm.

Another critical issue that needs to be considered in deciding whether the lazy indexing paradigm is to be adopted is the extent of evolution of the semantic dictionary. In a dynamic environment where information units are exported to the federation frequently, the lazy indexing paradigm might require additional overhead. This is because between subsequent discovery requests, additional information units might have been exported to the semantic dictionary; therefore, with lazy indexing, the state of the semantic dictionary has to be analyzed every time a discovery request is initiated in order to detect additional relevant information possibly registered between subsequent requests. For instance, consider the content of the semantic dictionary shown in Figure 8.1b.
Now assume that a new type object similar to concept Genetics, named Genetic Information, is registered into the semantic dictionary at some point in time. A subsequent discovery request for information related to Genetics will require the sharing advisor to analyze the state of the semantic dictionary to detect the relationship between Genetics and Genetic Information. With eager indexing, by contrast, this additional overhead is eliminated as all information units are properly indexed during registration; therefore, the sharing advisor does not need to re-evaluate the state of the semantic dictionary when a discovery request is initiated. In this respect, it seems that eager indexing is more suitable for a very dynamic environment.

For a static environment, the sharing advisor does not need to re-evaluate the state of the semantic dictionary if the same discovery request is initiated more than once, even with lazy indexing. This is because the sharing advisor can simply assume that the information relevant to the request has been previously organized. In this respect, it seems that lazy indexing is more suitable for a relatively static environment.

Chapter 9 Conclusion and Summary

Science becomes dangerous only when it imagines that it has reached its goal. GEORGE BERNARD SHAW.

Although our experiments have shown that our mechanism performs well and is superior to traditional approaches, this does not account for all the contributions of this work. We feel that the main success of this work stems from showing the feasibility of employing machine learning techniques in addressing the problem of database interoperation. Further, the research has laid the foundation for future work in this area of database interoperability. In this chapter, we conclude this thesis by summarizing its results and contributions and by discussing the future research opportunities available.
9.1 Summary of Results

The emergence of networks of databases has motivated the need for facilities to support the discovery of related information among database systems. In this thesis, we have developed, experimentally implemented, and tested a registration and discovery mechanism to enhance the sharing and reuse of previously discovered information. Our discovery framework provides four functional capabilities: organization, indexing, searching, and browsing of non-local database information units. Our mechanism is built upon a core set of object-based modeling constructs commonly found in most existing object-oriented database systems [28, 31, 33]; hence, our mechanism can be readily implemented in existing object-oriented database systems with little or no modification to existing DBMS software. This is demonstrated through our experimental prototype implemented using the USC Omega [16] system. Our prototype also shows the ease of porting our discovery mechanism to various database system platforms. We have also demonstrated the feasibility of our mechanism with the protein/genetics application environment, representing an area of tremendous and growing interest and importance [15, 37]. Finally, our experimental evaluation indicates that our mechanism performs well, and is superior to traditional properties-intersection approaches.

9.2 Research Contributions

This work has an impact on the area of information sharing among a network of databases in several ways:

1. Previous works on information sharing among a collection of database systems only focus on some variant of integration of database schemas [19, 29]. The issue of identifying information relevant to a user request has not been addressed in its entirety in the literature.
This issue is especially important in the context of information sharing, as a database user is most likely interested only in a subset of the database information units of a remote database system. Further, different kinds of knowledge are needed for the discovery and integration processes. During discovery, only knowledge about the possible relationships among type objects is required; knowledge of the precise relationships does not necessarily benefit the discovery process. By contrast, during integration, knowledge of the precise relationships among type objects is needed. By identifying the different depths of knowledge needed in the different processes, a component is more likely to be able to access the most appropriate non-local information in the most natural way.

2. The idea of cooperating individual database systems is made possible through the establishment of the semantic dictionary as a result of incremental evolution through dynamic interaction with existing database systems. This task gradually becomes easier over time as the semantic dictionary gradually evolves into a standard federated knowledge base which new components can consult when designing their schemas. This benefits the whole federation because the degree of schema diversity among different components will be reduced and the inter-relationships among components' schemas can be observed in a comparatively easier fashion.

3. Looking from a different perspective, the results of this research have shown the feasibility of employing machine learning techniques in addressing the database interoperation problem. Further, we have laid the foundation for several research problems in the general area of database interoperation, which will be discussed in the subsequent sections.
9.3 Limitations of this Research

Although our experiments have shown that our mechanism performs well and is superior to traditional approaches, the problem of database interoperability is not completely solved yet. Furthermore, our current resource discovery mechanism still suffers from the following limitations:

1. In order to ensure manageability, we limit the amount of information placed in the semantic dictionary to that which is specifically required by our discovery mechanism. Obviously, more detailed and elaborate information available in the semantic dictionary, such as a comprehensive global schema or knowledge base [7, 12], would be beneficial to the discovery task; such a global data/knowledge base is, however, very difficult to establish in practice, and may be of quite unmanageable size. Even a centralized semantic dictionary for maintaining the relationships among the exported type objects in our discovery mechanism may be problematic, particularly for large federations.

2. Our current indexing mechanism limits an exported type object to belong to one single concept. A more general mechanism would allow type objects to be indexed into multiple concepts and to migrate from one concept to others as the degree of similarity of the type objects with respect to a concept changes over time.

3. As our heuristics are estimated based on the type objects exported to the federation, the speed of stabilization depends upon the order of objects observed by the sharing advisor. A future extension to this research is to modify the heuristics such that they are independent of the arrival order of the exported objects.

4. An assumption that we have made in computing the probability value is that the properties of a concept are mutually independent. However, real world concepts hardly follow this assumption.
In other words, realistically, the existence of a property within a concept may depend on the existence of other properties within the same concept. A future extension to this work is to incorporate this kind of property inter-dependency information in our probability model.

9.4 Directions for Future Research

9.4.1 Addressing Properties Inter-dependencies

As mentioned earlier, properties of real world concepts might be inter-dependent upon one another. For instance, consider the concept Protein Information shown in Figure 5.2. Although property code possesses a high distinguishing power with respect to concept Protein Information, its existence within Protein Information might actually depend on the existence of another property of Protein Information, such as property function. In other words, whenever a concept possesses property function, it will also possess property code. If this kind of property inter-dependency information can be deduced based on the content of the semantic dictionary, the probability model that we introduced in Section 5.3.2 can be easily extended to incorporate this information [35] to obtain a more accurate estimate of the probability indicator value of an exported type object with respect to a concept in the semantic dictionary. Hence, when a new type object possessing neither property code nor property function registers with the sharing advisor, its probability indicator value with respect to Protein Information will not be penalized because of the absence of property code, since both properties function and code are missing, conforming to the inter-dependency information.
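The intended effect of inter-dependency information can be sketched with a toy score. The code below is not the probability model of the thesis, which is not reproduced in this chapter; it assumes a hypothetical indicator defined as the weighted share of a concept's properties that an object possesses, with made-up weights. The only behavior taken from the text is that the absence of code is not penalized when function, on which code depends, is absent as well.

```python
from typing import Dict, Optional, Set


def indicator(obj_props: Set[str],
              concept_weights: Dict[str, float],
              depends_on: Optional[Dict[str, str]] = None) -> float:
    """Hypothetical indicator in [0, 1] of how well an object fits a concept.

    concept_weights maps each concept property to an assumed weight.
    depends_on maps a property to the property its existence depends on.
    """
    depends_on = depends_on or {}
    gained = 0.0    # weight of concept properties the object possesses
    possible = 0.0  # weight of concept properties that are counted at all
    for prop, weight in concept_weights.items():
        if prop in obj_props:
            gained += weight
            possible += weight
            continue
        antecedent = depends_on.get(prop)
        if antecedent is not None and antecedent not in obj_props:
            # The absence is explained by the missing antecedent,
            # so this property is left out instead of penalized.
            continue
        possible += weight
    return gained / possible if possible else 0.0


# Made-up weights for three properties of concept Protein Information.
protein_information = {"name": 0.9, "function": 0.8, "code": 0.9}
new_type_object = {"name"}  # possesses neither function nor code

if __name__ == "__main__":
    print(indicator(new_type_object, protein_information))
    print(indicator(new_type_object, protein_information, {"code": "function"}))
```

In this toy score the missing function still lowers the indicator, but the missing code no longer does once the dependency of code on function is supplied.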
9.4.2 Instance-level Discovery
As the focus of this research is on the identification of related and relevant type objects among a network of autonomous database components, the issue of identifying related and relevant instance objects is not considered in its entirety here. This does not imply that it is not an important or worthwhile problem to investigate; rather, we feel that instance-level discovery will have a substantial impact on information sharing, as it allows the sharing of information at different levels of abstraction and granularity. As described in Section 4.3, the sharing of instance objects requires the definition and data values of instance objects to be exported to the federation. This problem is further complicated by the fact that different components may represent their data values differently and update their databases at different points in time, resulting in inconsistency between related instance objects. Although some work has been proposed to address the representational incompatibility issue in identifying related instance objects [26], the issue of inconsistently updated data values in different database components has never been considered. In this context, we plan to explore the appropriateness of using historical data to suggest the possible set of values an instance object can possess.
9.4.3 Schema Integration
An important capability for a component to flexibly manipulate relevant remote information is to be able to "fold in" remote information into the local database system once the relevant remote information is discovered. This is usually referred to as the schema integration problem.
Current approaches to this schema integration problem have been performed in a rather ad hoc manner, requiring users of a database component to specify precisely the detailed relationships between the remote database schema and the local schema. Our heuristics for the database interoperation problem allow us to look at this schema integration problem from a totally different perspective.
As a simple example, assume that component A of Figure 4.2 would like to integrate the remote type Protein Structure into its local database. The integration problem is ultimately refined to determining whether Protein Structure should be a subtype or a supertype of the local type Protein Instances of Component A, as types Protein Instances and Protein Structure are determined to represent similar information according to the content of the semantic dictionary shown in Figure 5.2b. One possible heuristic to deduce such a relationship is to look at the intra-concept similarity values of the properties that exist in Protein Instances but not in Protein Structure, such as property resolution. A possible inferencing heuristic might be:
If the intra-concept similarity of property resolution with respect to concept Protein Instances in the semantic dictionary is high, it means all type members of Protein Instances should possess this property. In this case, type Protein Structure should be a supertype of Protein Instances since Protein Structure does not possess such a property.
This kind of statistical heuristic only holds when a large number of objects join the federation, which is actually what our resource discovery mechanism is based on.
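The inferencing heuristic above can be sketched as follows. This is a minimal sketch under stated assumptions: intra-concept similarity is taken to be the fraction of the concept's member types that possess a property, the cut-off `high=0.8` is arbitrary, and the member type names and the properties other than resolution are hypothetical.

```python
def intra_concept_similarity(prop, member_types):
    """Assumed definition: fraction of member types possessing `prop`."""
    if not member_types:
        return 0.0
    holders = sum(1 for props in member_types.values() if prop in props)
    return holders / len(member_types)


def suggest_relationship(local_props, remote_props, member_types, high=0.8):
    """Suggest how a remote type relates to a similar local type."""
    # Properties the local type has but the remote type lacks.
    for prop in sorted(local_props - remote_props):
        if intra_concept_similarity(prop, member_types) >= high:
            # Nearly every member has `prop` and the remote type lacks it,
            # so the remote type is taken to be the more general one.
            return "supertype"
    return "undetermined"


# Hypothetical member types of the concept and their properties.
members = {
    "Protein Instances": {"name", "resolution", "code"},
    "Type X": {"name", "resolution"},
    "Type Y": {"resolution", "code"},
}
local = members["Protein Instances"]
remote = {"name", "code"}  # stands in for Protein Structure
print(suggest_relationship(local, remote, members))
```

Only the case stated in the text is encoded; the subtype direction and the choice of cut-off are left open here.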
Nonetheless, the intra-concept similarity value allows us to view the integration problem from a totally new perspective and provides some useful guidelines and insights for the component performing the integration process. We are planning to further investigate the possibility of using this kind of statistical heuristic in resolving the schema integration problem.
Figure 9.1: An internetworked federation (diagram labels: Internetworked Sharing Advisor; Sharing Advisor; database component)
9.4.4 An Internetworked Federation
As mentioned in Chapter 9.3, one problem with a centralized semantic dictionary is scalability, especially for large federations. Note, however, that the semantic dictionary is only centralized logically; it may be physically distributed, creating a family of dictionaries. In fact, we could extend our environment to a collection of database federations by interrelating the sharing advisor of each federation, thus creating an internetworked federation. This idea is shown in Figure 9.1.
As shown in Figure 9.1, each database federation is treated as a component of the internetworked federation. This idea can be applied recursively to generate a hierarchy of federations. An ultimate goal in establishing a hierarchy of federations is to understand the content of individual database systems and relocate each database component to the most related federation. What is even more challenging is to be able to analyze the traffic (registration and discovery requests) involved in each dictionary and determine whether a dictionary should be split into multiple ones, or multiple dictionaries should be combined into a single dictionary, in order to achieve the best global performance.
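The recursive structure described above can be sketched as a composite. This is only an illustrative sketch: the class names, the `discover` method, and exact matching on type names are assumptions made for brevity (the discovery mechanism in this work is similarity-based, not name-based), and the federation and component names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A database component and the names of the types it exports."""
    name: str
    exported_types: set


@dataclass
class Federation:
    """A federation whose members are components or nested federations."""
    name: str
    members: list = field(default_factory=list)

    def discover(self, type_name):
        """Return paths to every component exporting `type_name`."""
        hits = []
        for member in self.members:
            if isinstance(member, Federation):
                # A nested federation is treated like a single component
                # and is searched recursively.
                hits += [f"{self.name}/{p}" for p in member.discover(type_name)]
            elif type_name in member.exported_types:
                hits.append(f"{self.name}/{member.name}")
        return hits


genome = Federation("genome", [Component("A", {"Protein Instances"})])
protein = Federation("protein", [Component("B", {"Protein Structure"}),
                                 Component("C", {"Protein Instances"})])
inter = Federation("inter", [genome, protein])
print(inter.discover("Protein Instances"))
```

Relocating a component to the most related federation, or splitting and merging dictionaries based on observed traffic, would operate on the `members` lists of such a hierarchy; neither is attempted in this sketch.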
Appendix A
Export Schema Specification
A key characteristic of a database federation is the capability for a component to share information with other components while, at the same time, preserving its autonomy over its own database, such as control over the information it is willing to export to the other components. This is achieved by placing the information units that are made available to other components in an export schema. As of today, not much work has been done on organizing and maintaining the exported information. The only related work on this matter is reported in [3]. In that work, a separate co-database is created to maintain all exported information. On the other hand, little has been said about the relationship, coordination, and coupling between the original database and the co-database. In this appendix, we simply provide some insights into how an export schema can be maintained in the context of CODM.
Figure A.1 illustrates the idea of maintaining the exported information using our CODM object-based constructs. An extra sub-hierarchy rooted at Exported-Types is created under the root, i.e., OBJECTS. Each exported type is maintained by a corresponding E_ subtype of Exported-Types, with the corresponding exported properties preceded by E_ as well; this allows a component to specify exported information at the granularity of a single property. Each exported instance, on the other hand, is handled by the multiple membership modeling construct of CODM, relating the original type object to which the instance belongs to its E_ type counterpart. In Figure A.1, only type Protein Instances has a corresponding E_Protein Instances type; this notifies the sharing advisor that only type Protein Instances is exported.
However, all properties of Protein Instances are exported to the federation, since all properties of Protein Instances have corresponding E_ properties defined on E_Protein Instances. Furthermore, only 2 instances of Protein Instances belong to E_Protein Instances through the multiple membership construct; this informs the sharing advisor that only the 2 specified instances are exported to the federation. In effect, the export schema of a component represents an object view of its conceptual schema [4].
Figure A.1: Specification of the export schema (diagram legend: Supertype-Subtype, Membership; nodes include OBJECTS, Exported Types, Protein Instances, and Amino Acid Sequences)
As far as implementation is concerned, maintaining the exported information in this way will have very little effect on the current prototype. The only modification required is to modify the meta-functions of each component to look at information under the Exported-Types subtree, i.e., the shaded region of Figure A.1, whenever meta-data information of the underlying database is needed.
Appendix B
Additional Experimental Results
In this appendix, we conduct a comprehensive study of the performance differences between the properties-intersection indexing mechanism and our heuristic-based mechanism.
B.1 Experiment #1
In conducting experiment #1 described in Chapter 7.4.1, the threshold value used when measuring the performance of the properties-intersection indexing mechanism was set to 0.5 as a middle ground between no overlap and total overlap. The purpose of this experiment is to study the performance of the properties-intersection mechanism under different threshold values. In this experiment, we generate a queue of 200 objects, all defined to be similar to the reference object.
The experiment is conducted with the threshold value set to 0.1, 0.3, 0.5, and 0.7. Again, ideally, all 200 objects should be indexed within one single classification. Special attention is paid to identifying the threshold value that can produce acceptable performance under the properties-intersection indexing mechanism. The similarity ratios and instruction ratios of the properties-intersection indexing mechanism at each 10% interval are measured cumulatively and are shown in Figure B.1 and Figure B.2, respectively. Again, type objects with properties restricted to primitive value types and type objects with properties of user-defined value types allowed are tested in two different trials.
Figure B.1: Similarity ratios: properties-intersection mechanism under different thresholds (panels (a) and (b); x-axis: % of Objects Observed; one curve per threshold value 0.1, 0.3, 0.5, 0.7)
Figure B.2: Instruction ratios: properties-intersection mechanism under different thresholds (panels (a) and (b); x-axis: % of Objects Observed; one curve per threshold value 0.1, 0.3, 0.5, 0.7)
As depicted in the figures, the only threshold value that can produce reasonably good results is 0.1. Further, we observe that as the threshold value increases, the performance of the indexing mechanism decreases. The reason for this behavior is very obvious.
At a small threshold value, two objects need only a small degree of overlap in their corresponding sets of properties for their similarity to be detected; the required degree of overlap increases as the threshold value increases. The probability that two objects can be properly related at a small threshold value is, therefore, much higher than that at a higher threshold value.
Nonetheless, even at the 0.1 threshold value, the performance of the properties-intersection indexing mechanism does not increase with the number of objects being observed. This suggests to us that the performance of the properties-intersection indexing mechanism is not very stable even at a low threshold value. We verify this hypothesis in the next experiment by testing the properties-intersection indexing mechanism on the effect of negative examples, i.e., objects that are defined to be dissimilar to the reference object.
B.2 Experiment #2
The purpose of this experiment is to study the effect of negative examples on the effectiveness of the properties-intersection indexing mechanism under different threshold values and to compare its performance with that of our heuristic-based mechanism obtained in Chapter 7. In this experiment, we generate a queue of 200 objects with 50% of the testing objects defined to be similar to the reference object. The experiment is conducted with the threshold value set to 0.1, 0.3, 0.5, and 0.7. The similarity ratios, dissimilarity ratios, and instruction ratios are shown in Figure B.3, Figure B.4, and Figure B.5, respectively.
From the figures, although the properties-intersection indexing mechanism at the 0.1 threshold value results in reasonably good similarity ratios as shown in Figure B.3, it fails to properly index objects that are dissimilar to the reference object as shown in Figure B.4.
This is due to the fact that objects from the two classifications might overlap in their corresponding sets of properties; therefore, at a low threshold value, the indexing mechanism becomes confused about objects from the two classifications. Ultimately, the indexing mechanism deduces that all the objects are within one classification. That is why the similarity ratios increase while the dissimilarity ratios decrease over time, i.e., as a function of objects observed.
Figure B.3: Similarity ratios: properties-intersection mechanism under different thresholds (panels (a) and (b); x-axis: % of Objects Observed; one curve per threshold value)
Figure B.4: Dissimilarity ratios: properties-intersection mechanism under different thresholds (panels (a) and (b); x-axis: % of Objects Observed; one curve per threshold value)
Figure B.5: Instruction ratios: properties-intersection mechanism under different thresholds (panels (a) and (b); x-axis: % of Objects Observed; one curve per threshold value)
At a higher threshold value, however, the indexing mechanism becomes confused about objects even within the same classification. That is why both the similarity ratios and dissimilarity ratios drop accordingly and the instruction ratios increase accordingly as the threshold value increases. This further verifies our hypothesis that our heuristic-based indexing mechanism performs much better than the traditional properties-intersection mechanism.
Reference List
[1] H. Afsarmanesh and D. McLeod. The 3DIS: An Extensible, Object-Oriented Information Management Environment.
ACM Transactions on Office Information Systems, 7:339-377, October 1989.
[2] T. Berners-Lee, R. Cailliau, and B. Pollermann. World-Wide Web: The Information Universe. Electronic Networking: Research, Applications, and Policy, 1(2), 1992.
[3] A. Bouguettaya and R. King. Large Multidatabases: Issues and Directions. In IFIP Technical Committee Second International Working Conference, DS-5 Semantics of Interoperable Database Systems, 1992.
[4] K. J. Byeon and D. McLeod. Towards the Unification of Views and Versions for Object Databases. In Proceedings of the International Symposium on Object Technologies for Advanced Software, November 1993 (LNCS 742, Springer-Verlag).
[5] S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw Hill, 1984.
[6] M. Cinkosky, J. Fickett, P. Gilna, and C. Burks. Electronic Data Publishing and GenBank. Science, pages 1273-1277, 1991.
[7] C. Collet, M. Huhns, and W. Shen. Resource Integration Using a Large Knowledge Base in Carnot. IEEE Computer Magazine, pages 55-63, 1991.
[8] P. Danzig, S. Li, and K. Obraczka. Internet Resources Discovery Services. IEEE Computer Magazine, pages 8-22, 1992.
[9] P. Drew, R. King, D. McLeod, M. Rusinkiewitz, and A. Silberschatz. Report of the Workshop on Semantic Heterogeneity and Interoperability in Multidatabase Systems. ACM SIGMOD Record, 1992.
[10] D. Fang, S. Ghandeharizadeh, D. McLeod, and A. Si. The Design, Implementation and Evaluation of an Object-Based Sharing Mechanism for Federated Database Systems. In International Conference of IEEE Data Engineering, pages 467-475, 1993.
[11] D. Fang, J. Hammer, D. McLeod, and A. Si. Remote-Exchange: An Approach to Controlled Sharing among Autonomous, Heterogenous Database Systems. In Proceedings of the IEEE Spring Compcon, San Francisco. IEEE, February 1991.
[12] P. Fankhauser and E. Neuhold. Knowledge Based Integration of Heterogeneous Databases.
Technical report, Technische Hochschule Darmstadt, 1992.
[13] D. Fisher. Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning, pages 139-172, 1987.
[14] D. Fishman, D. Beech, H. Cate, E. Chow, T. Connors, T. Davis, N. Derrett, C. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M. Neimat, T. Ryan, and M. Shan. Iris: An Object-Oriented Database Management System. ACM Transactions on Office Information Systems, 5(1):48-69, January 1987.
[15] K. Frenkel. The Human Genome Project and Informatics. Communications of the ACM, 34(11):41-51, 1991.
[16] S. Ghandeharizadeh, et al. Design and Implementation of OMEGA Object-based System. Technical Report USC-CS, Computer Science Department, University of Southern California, Los Angeles CA 90089-0781, September 1991.
[17] J. Hammer and D. McLeod. An Approach to Resolving Semantic Heterogeneity in a Federation of Autonomous, Heterogeneous Database Systems. International Journal of Intelligent and Cooperative Information Systems, 2(1):51-83, March 1993.
[18] J. Hammer, D. McLeod, and A. Si. Object Discovery and Unification in Federated Database Systems. In Proceedings of International Workshop on Interoperability of Database Systems, pages 3-18, 1993.
[19] S. Hayne and S. Ram. Multi-User View Integration System (MUVIS): An Expert System for View Integration. In Proceedings of the 6th International Conference on Data Engineering. IEEE, February 1990.
[20] D. Heimbigner and D. McLeod. A Federated Architecture for Information Systems. ACM Transactions on Office Information Systems, 3(3):253-278, July 1985.
[21] M. Huhns, N. Jacobs, T. Ksiezyk, W. Shen, M. Singh, and P. Cannata. Enterprise Information Modeling and Model Integration in Carnot. Technical Report Carnot-128-92, MCC, 1992.
[22] R. Hull and R. King. Semantic Database Modeling: Survey, Applications, and Research Issues. ACM Computing Surveys, 19(3):201-260, September 1987.
[23] M. Huysmans, J. Richelle, and S. Wodak. SESAM: A Relational Database for Structure and Sequence of Macromolecules. Proteins: Structure, Function, and Genetics, pages 59-76, 1991.
[24] S. James. Bayesian Statistics: Principles, Models and Applications. Wiley Interscience, 1988.
[25] J. Kalbfleisch. Probability and Statistical Inference, Volume 1 and Volume 2. Springer-Verlag, 1985.
[26] W. Kent. Solving Domain Mismatch Problems with an Object-Oriented Database Programming Language. In Proceedings of the International Conference on Very Large Databases, pages 147-160. IEEE, September 1991.
[27] W. Kim. Introduction to Object-Oriented Databases. The MIT Press, 1990.
[28] W. Kim, J. Banerjee, H. T. Chou, J. F. Garza, and D. Woelk. Composite Object Support in an Object-Oriented Database System. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 118-125, 1987.
[29] W. Kim, I. Choi, S. Gala, and M. Scheevel. On Resolving Schematic Heterogeneity in Multidatabase Systems. Distributed and Parallel Databases, pages 251-279, 1993.
[30] J. Larson, S. B. Navathe, and R. Elmasri. A Theory of Attribute Equivalence and its Applications to Schema Integration. IEEE Transactions on Software Engineering, 15(4):449-463, April 1989.
[31] C. Lecluse, P. Richard, and F. Velez. O2, an Object-Oriented Data Model. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM SIGMOD, June 1988.
[32] W. Litwin and A. Abdellatif. Multidatabase Interoperability. IEEE Computer, 19(12), December 1986.
[33] D. Maier, J. Stein, A. Otis, and A. Purdy. Development of an Object-Oriented DBMS. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 472-482. ACM, 1986.
[34] A. Mehta, J. Geller, Y. Perl, and P. Fankhauser.
Computing Access Relevance to Support Path-Method Generation in Interoperable Multi-OODB. In Proceedings of the International Conference on Very Large Databases, pages 119-139. IEEE, August 1992.
[35] C. Rijsbergen. Information Retrieval. Butterworths, 1979.
[36] S. Robertson, C. Rijsbergen, and M. Porter. Probabilistic models of indexing and searching, pages 35-56. Butterworths, 1981.
[37] J. Saldanha and J. Eccles. The Application of SSADM to Modelling the Logical Structure of Proteins. CABIOS, pages 515-524, 1991.
[38] A. Savasere, A. Sheth, S. Gala, S. Navathe, and H. Marcus. On Applying Classification to Schema Integration. In Proceedings of IEEE 1st International Workshop on Interoperability in Multidatabase Systems, pages 258-261. Kyoto, Japan, April 1991.
[39] M. Schwartz, A. Emtage, Brewster K., and C. Neuman. A Comparison of Internet Resource Discovery Approaches. Computing Systems, August 1992.
[40] H. Schwetman. CSIM Reference Manual (Revision 15), 1991. Microelectronics and Computer Technology Corporation.
[41] A. Sheth and J. Larson. Federated Database Systems for Managing Distributed, Heterogenous, and Autonomous Databases. In ACM Computing Surveys, pages 183-236. 1990.
[42] A. Sheth, J. Larson, A. Cornelio, and S. B. Navathe. A Tool for Integrating Conceptual Schemata and User Views. In Proceedings of the 4th International Conference on Data Engineering, pages 176-183. IEEE, February 1988.
[43] D. Shipman. The Functional Data Model and the Data Language DAPLEX. ACM Transactions on Database Systems, 2(3):140-173, March 1981.
[44] M. Silberschatz, M. Stonebraker, and J. Ullman. Database Systems: Achievements and Opportunities. ACM SIGMOD Record, 19(4):6-23, December 1990.
[45] M. Stonebraker, L. Rowe, B. Lindsay, J. Gray, M. Carey, M. Brodie, P. Bernstein, and D. Beech. Third-Generation Database System Manifesto. ACM SIGMOD Record, 19(3):31-44, September 1990.
[46] T.
Templeton, et al. Mermaid: A Front-End to Distributed Heterogenous Databases. In Proceedings of IEEE, pages 695-708, 1987.
[47] V. Ventrone and S. Heiler. Some Practical Advice for Dealing with Semantic Heterogeneity in Federated Database Systems. Submitted to the Journal of Usenix Association, 1993.
[48] Gio Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer Magazine, pages 38-49, 1992.
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11257749
Unique identifier
UC11257749
Legacy Identifier
DP22892