ONTOLOGY-BASED SEMANTIC INTEGRATION OF HETEROGENEOUS INFORMATION SOURCES by Sangsoo Sung A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) May 2008 Copyright 2008 Sangsoo Sung Dedication To my wife and my son. ii Acknowledgments I am grateful to my parents. When I turned ten, they presented me a 8-bit Apple com- puter which lead me to the world of computer science. Their relentless support has enabled me to pursuit higher degrees in the United States. I am also extremely grateful to Dennis McLeod. His classes and advice lead my research direction. Thank you for your guidance on this research. I thank Seokkyung Chung and Hyunwoong Shin. It was with Seokkyung that I first learned how to do research and how to write. Thank you, Seokkyung, I am very grateful. I would like to thank my wife for her enduring patience. Although she has been pregnant for last 8 months, she made enormous sacrifices in supporting me to complete this dissertation. I dedicate this dissertation to my wife and my son whom I will meet in a month. iii Table of Contents Dedication ii Acknowledgments iii List of Tables vii List of Figures viii Abstract xi Chapter 1: Introduction 1 1.1 Motivation of the Research . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of the Solutions . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Ontology-Based Federated Architecture for Information Man- agement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.2 Ontology-Driven Schema Matching Framework . . . . . . . . . 8 1.2.3 Efficient Concept Clustering to Enrich Existing Ontologies . . . 10 1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 11 1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 2: Ontology-Based Federation Architecture 15 2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Ontronic: Ontology-Based Federation Approach . . . . . . . . . . . . . 17 2.2.1 Ontology Extraction . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.2 The Canonical Data Model . . . . . . . . . . . . . . . . . . . . 20 2.2.2.1 Attributes Connection . . . . . . . . . . . . . . . . . 21 2.2.2.2 Instance Connection . . . . . . . . . . . . . . . . . . 22 2.2.2.3 Group Connection . . . . . . . . . . . . . . . . . . . 22 2.2.2.4 Subclasses Connection . . . . . . . . . . . . . . . . 22 2.2.2.5 CIOM Example . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Conceptual Architecture . . . . . . . . . . . . . . . . . . . . . 26 2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 iv Chapter 3: Ontology-Driven Schema Matching Framework 31 3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Matching Framework Architecture . . . . . . . . . . . . . . . . . . . . 34 3.3 Data-Driven Mapping Framework . . . . . . . . . . . . . . . . . . . . 36 3.3.1 Pattern-based Matcher . . . . . . . . . . . . . . . . . . . . . . 37 3.3.2 Attribute-based Matcher . . . . . . . . . . . . . . . . . . . . . 38 3.4 Semantics-Driven Mapping Framework . . . . . . . . . . . . . . . . . 38 3.4.1 Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4.2 Compound Word Processing . . . . . . . . . . . . . . . . . . . 42 3.5 Similarities Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.6 Experiments . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 44 3.6.1 Test Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.6.2 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . 45 3.6.3 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . 47 3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Chapter 4: Efficient Concept Clustering for Ontology Evolution 49 4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Rough Cluster Identification . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 Discovery of Hierarchical Topic Relationships . . . . . . . . . . . . . . 55 4.3.1 Cluster Refinement using an Expensive Similarity Metric . . . . 56 4.3.2 Ontology Enrichment using Refined Clusters . . . . . . . . . . 57 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.1 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.4.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . 61 4.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 61 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Chapter 5: Related Work 68 5.1 Information Systems Interoperability . . . . . . . . . . . . . . . . . . . 68 5.1.1 Classification of Architecture . . . . . . . . . . . . . . . . . . 68 5.1.2 Process of Integration . . . . . . . . . . . . . . . . . . . . . . 70 5.2 Semantic Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.1 Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.2 Ontology Matching . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3 Ontology Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.3.1 Natural Language Processing for Ontology Learning . . . . . . 77 5.3.2 Data Mining for Ontology Learning . . . . . . . . . . . . . . . 77 5.3.3 Hybrid Method . . . . . . . . . . . . . . . . . . . . . . . . . . 78 v Chapter 6: Conclusion 81 6.1 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 References 85 vi List of Tables 3.1 Domains and data sources for the experiment [DDH01] . . . . . . . . . 45 4.1 The precision, recall, F-measure, and time costs of different distance thresholdsα andβ for structuringO a . Note that the time cost of struc- turing O b was 394.67 mins. Based on our observations, news articles have a near-weekly cycle so that the best performing α and β were 10 and 5 respectively, which is indicated in bold. . . . . . . . . . . . . . . 62 4.2 Sample concept clusters with centroids . . . . . . . . . . . . . . . . . . 63 vii List of Figures 1.1 Traditional federated architecture for information management [HM85] 2 1.2 Schema matching example: two attributes “phone number” and “tele- phone” can be matched by comparing various aspects in schemaS and T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Ontology-based federated architecture for information management . . 7 1.4 Ontology-driven schema matching example: concept C is the most spe- cific common parent between attribute E in schema T and attribute D in schema S. Similarly, concept B is the most specific common parent between attribute E in schema T and attribute D in schema H. Hence, E can be matched to D since concept C is more specific than concept B. . 
9 1.5 Ontology-driven semantic matching example in heterogeneous informa- tion sources: heterogeneous information sources can be expressed in canonical models by extracting ontologies. Thus, “.com” in a relational database and “corporation” in XML are expressed in concepts of ontolo- gies. Lastly, they can be matched since they are semantically clustered in the general ontology. . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 The Southern California Earthquake Data Center (SCEDC) XML schema and The Southern California Seismic Network (SCSN) relational database schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Conceptual procedure for ontology extraction . . . . . . . . . . . . . . 20 viii 2.3 We extracted the seismology domain ontology expressed in CIOM from QuakeTables [GGD + 05]. There are two roots: event and seismology. Seismology can be specialized as either geo-phenomenon, seismology research or geological feature. In particular, geo-phenomenon is a sub- class of the intersection of event and seismology (operationally-defined subclass). Seismology paper is published seismology research. Geo- logical features are further classified into fault and segment. In addi- tion, an event can be sub-categorized as disaster, conference or geo- phenomenon. Earthquake and volcano are types of both natural (pred- icate) disaster and geo-phenomenon while a nuclear test is both man- made (predicate) disaster and geo-phenomenon. These relationships are subclass connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Conceptual procedure for ontology construction . . . . . . . . . . . . . 27 2.5 Overall architecture of Ontronic . . . . . . . . . . . . . . . . . . . . . 28 2.6 A snapshot of CIOM Modeler in Ontronic: an ontology engineer can manipulate ontologies using the CIOM Modeler. . . . . . . . . . . . . . 29 3.1 Matching ambiguity can be resolved with the semantics-driven map- ping framework. The values of the instances of the attributes “Fax” and “Telephone” in the table “Firm Information” of schema S and the attribute “Call” in the table “Company” of schema T share common patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2 The semantics-driven framework provides candidate mappings. The values of the instances of the attribute “Dot Com” of the table “Com- pany information” in schemaS are dissimilar to all the attributes of the table “Firm list” in schemaT . The mapping between “Corporation” and “Dot Com” can be identified by the semantics-driven mapping frame- work because their semantic similarity is higher than the other mappings. 33 3.3 Overall architecture of our matching framework . . . . . . . . . . . . . 34 3.4 An example of the semantic similarity computation . . . . . . . . . . . 41 3.5 Average matching accuracy compared with complete LSD [DDH01]: we ran the experiments with 300 instances and it had higher average accuracy than the complete LSD on the 4 domains. . . . . . . . . . . . 46 3.6 We ran the experiments with 100, 200, 300, and 500 instances available from Real Estate I. Our mapping framework had in higher accuracy with small number of data instances than the complete LSD. . . . . . . . . . 46 ix 3.7 We recorded the elapsed time of our mapping frameworks. The elapsed time of data-driven mapping framework increased when the number of instances was increased. Our mapping framework had higher speed. . . 
47 4.1 We have collected term frequencies from online news articles for 100 days between May 11, 2007 and August 19, 2007 for both “David Beck- ham”, who is a famous football (soccer) player, and the “Los Angeles Galaxy”, which he joined in July 2007 and had his first official soccer practice session on July 16, 2007 . . . . . . . . . . . . . . . . . . . . . 50 4.2 An example of ontology enrichment process using a refined cluster . . . 58 4.3 We quantified tf t for each term by executing Eq. (4.1) and (4.2) as shown in (A). (B) illustrates the change points oftf t for each term that we found by running Eq. (4.3) through Eq. (4.6). As depicted in (C), the terms were assigned to overlapping smaller groups by creating Ω k with two tunable tight (α) and loose (β) thresholds for an inexpensive distance measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.4 We compared the distributions of similarities for structuring both O a and O b using Eq. (4.10). (A) shows the sorted distribution of pairwise similarities of terms in overlapping smaller groups. (B) presents the sorted distribution of complete pairwise similarities of 173 terms. . . . . 62 5.1 Heterogeneity resolution in federated database system . . . . . . . . . . 69 5.2 Conceptual procedure of information integration . . . . . . . . . . . . . 70 6.1 Inefficient schema matching . . . . . . . . . . . . . . . . . . . . . . . 83 6.2 Schema clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 x Abstract The main goal of this research is to improve interoperability between different infor- mation sources. Since ontologies, collections of concepts and their interrelationships, have become a synonym for the solution to many problems resulting from computers’ inability to understand natural language, they can capture the semantics of diverse rep- resentations in heterogeneous information sources. Thus, ontologies can facilitate the identification of semantic matching between the different representations. Therefore, this dissertation studies the role of ontologies in semantic matching structured data and a method of building ontologies for the semantic matching. One of the critical problems in the federation of information sources is that similar domains have been expressed in different manners. To address this problem, this dis- sertation presents an ontology-based federation of heterogeneous information sources. We define a simple yet powerful representation model for structuring ontology which can extract canonical representation from a broad range of meta data models, including relational databases, XML, RDF, OWL, and DAML+OIL. The second major problem in the federation is that similar domains also have been expressed in diverse terminologies by domain experts who typically have their own interpretations of the domain. To tackle this problem, we incorporate ontologies to identify matches among different terminologies. Since ontologies play a key role in xi knowledge management by providing solutions to many problems resulting from com- puters’ inability to understand natural language, they can facilitate the identification of semantic matching between the different representations. The basic idea in computing the semantic similarity is that similar concepts share a more specific common parent. As many thousands of articles are published daily on the Web, neologisms or domain- specific terms appear as time passes. 
Thus, the third major problem in the federation is that the employment of out-of-date ontologies may decrease the accuracy of our matching framework. Also, the ontology learning process, in which traditional clustering algorithms are involved, tends to be slow and computationally expensive when the dataset is as large as the Web. Therefore, it is essential to maintain ontologies to reflect up-to-date knowledge. To address this problem, we present an efficient concept clustering technique for ontology learning that reduces the number of required pairwise term similarity computations without a loss of quality.

This study makes three major contributions. The first contribution is a solution architecture that resolves conflicts in the semantics of existing information sources. The second major contribution of the research is a solution to automatically create semantic mappings. The third important contribution of this dissertation is a solution architecture that provides a well-founded, rapid ontology learning framework based on reducing the use of the expensive measure by pre-clustering a large dataset. Our approach can be coupled with any type of federating, matching, or clustering method and can be utilized for making algorithms scalable to millions of information sources and documents. Therefore, this dissertation has contributed both to understanding the problem of integrating diverse information sources and to developing a matching framework using ontologies.

Chapter 1: Introduction

This dissertation studies semantic integration: the reconciliation of semantic heterogeneity of diverse expressions and different representations. The main goal of this research is to improve interoperability between different information sources. Since ontologies, collections of concepts and their interrelationships, have become a synonym for the solution to many problems resulting from computers' inability to understand natural language, they can capture the semantics of diverse representations in heterogeneous information sources. Thus, ontologies can facilitate the identification of semantic matching between the different representations. Therefore, this dissertation is composed of two parts: the role of ontologies in the semantic matching of structured data and a method of building ontologies for that matching.

Section 1.1 shows that a federation of heterogeneous information sources is a fundamental aspect of the knowledge management system and defines key problems of the federation. Section 1.2 frames our solution architecture. In Section 1.3, we outline our specific contributions in the context of this architecture. Lastly, we provide a road map to the remainder of this dissertation in Section 1.4.

1.1 Motivation of the Research

With the rapid progress in data acquisition, sensing, and storage technologies, the popularity of structured data sources available online is increasing at an enormous rate. Thus, a large amount of data from many domains of science is currently available in a number of autonomous heterogeneous information sources.

Figure 1.1: Traditional federated architecture for information management [HM85]

As a result, similar domains have been described in different manners and in diverse terminologies by domain experts who typically have their own interpretations and analyses of the domain. This environment yields many information management challenges. In particular, the federation of the existing information sources is inherently quite difficult to automate because of the following key challenges.

One of the major problems in the federation of information sources is that an individual information source is constructed in various representation models. Traditionally, much of the work in the context of federation focused on integrating multiple databases by defining a global schema. As shown in Figure 1.1, these systems typically defined mappings from local schemas to the composite one in database systems. Information sources, however, are built on diverse metadata models such as the relational model, Extensible Markup Language (XML), Resource Description Framework (RDF) [CS04], Web Ontology Language (OWL) [AvH04], and DAML+OIL [MFHS02]. However, conventional methods of database integration break down in the context of integrating information sources on the Web, since these models are mismatched in expressiveness and semantics at the language level. If a global schema approach is used, one conceptual representation among them must be selected, and applications using another representation may need transformation and modification. Hence, some information in each conceptual structure might be lost or added during transformations.

Another important problem is that the same or similar facts may be contained in heterogeneous information sources yet expressed in different terminologies. To obtain meaningful inter-operation, applications must establish semantic mappings between the schemas. The first step in integrating schemas is to identify and characterize these inter-schema relationships; this process is schema matching. Figure 1.2 shows an example of schema matching.

Figure 1.2: Schema matching example: two attributes "phone number" and "telephone" can be matched by comparing various aspects in schemas S and T.

In general, the semantics of information units are tightly coupled with the environment where the information units reside. Information maintained in different database environments should be processed by respecting their corresponding environments [AM99, BGF+97, SSR94]. Hence, the schema matching problem is very difficult to automate. However, the manual approach to schema matching is a tedious and error-prone task. Furthermore, with the rapid growth of online information, we are now experiencing overwhelming quantities of schemas in the existing information sources. Therefore, the manual approach does not scale.

Although there has been a great deal of research into these major challenges in the federation of the existing information repositories, no satisfactory solution exists. The composite structure is still popularly used despite being suboptimal for many users' needs, and the vast majority of semantic mappings are still manually created. These have become critical problems in building information management and data-sharing applications.

1.2 Overview of the Solutions

Semantic technology was recently introduced to support the exchange of knowledge by providing well-defined meaning [BL05]. Semantics is considered the best framework for dealing with heterogeneous and massive-scale information sources. In addition, relationships are fundamental to semantics by associating meanings to attributes, classes, and objects [KS94].
Ontology, which provides an explicit model for structuring concepts together with their definitions and interrelationships, has been playing a key role in building semantic information. Moreover, recent research [DMD+03, DLD+04, ES04] has focused on semantic integration produced by various projects in the ontology and Semantic Web communities. Therefore, ontologies can be used in an integration task to describe the semantics of the information sources and to make their content explicit. With respect to the integration of data sources, they can be used for the identification and association of semantically corresponding information concepts.

We discuss our approach to ontology-based federation of heterogeneous information sources: Section 1.2.1 presents the ontology-based federated architecture, and Section 1.2.2 describes the ontology-driven schema matching framework. Since ontologies play a key role in knowledge management by providing solutions to many problems resulting from computers' inability to understand natural language, they can facilitate the identification of semantic matching between the different representations. As many thousands of articles are published daily on the Web, neologisms and new concepts appear as time passes. Thus, the employment of out-of-date ontologies may decrease the accuracy of matching. Therefore, it is essential to maintain ontologies to reflect up-to-date knowledge. To address this problem, Section 1.2.3 presents an efficient concept clustering technique to enrich existing ontologies.

1.2.1 Ontology-Based Federated Architecture for Information Management

There is a considerable body of research that focuses on the federation of existing information sources, and the composite schema is one of its main categories. As illustrated in Figure 1.1, the composite schema is an agreed, common, global schema through which applications exchange information. A key issue of this approach has been generating a composite schema in which information units, semantics, and inter-database relationships are fixed [AB91, CRE87, GSSC95, VA96, AM99, KDN90].

This approach is limited in that the composite schema is a bottleneck for large-scale information sharing, since it requires a large global structure to be maintained. On the one hand, as the number of information sources grows, the effort of integration also increases. On the other hand, many information providers have begun to employ various ontology languages such as XML, RDF, OWL, and DAML+OIL, as these languages have gained popularity in the database and Semantic Web communities. As a result, the languages can differ in their syntax, and this diversity creates mismatches at the representation level [KF01]. That is, conceptual structures built with one set of representation primitives cannot be expressed directly with another set of primitives. Hence, previous research [KS05, She05] has attempted to transform information sources into the same representation prior to semantic matching. Consequently, this approach suffers from information loss during transformation.

To address this problem, our study first builds a canonical expression for each individual information source by semi-automatically extracting an ontology from the source. Through the canonical representation, our mapping framework discovers the semantic correspondences among the representations of the information sources, as illustrated in Figure 1.3.
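To make the idea of a canonical representation concrete, the sketch below is our own minimal illustration (the source names, attribute names, and dictionary structure are invented for this example; the actual system extracts much richer CIOM ontologies with interrelationships). It normalizes a relational table definition and an XML-style element into one uniform concept/attribute form, so that later matching operates on a single representation regardless of the original data model.

```python
# Minimal sketch: normalize heterogeneous source descriptions into one
# canonical "concept with attributes" form. Illustrative only; the real
# framework extracts richer ontologies with typed interrelationships.

def from_relational(table_name, columns):
    """Treat a relational table as a concept and its columns as attributes."""
    return {"concept": table_name.lower(),
            "attributes": sorted(c.lower() for c in columns)}

def from_xml(element_name, child_elements):
    """Treat an XML element as a concept and its child elements as attributes."""
    return {"concept": element_name.lower(),
            "attributes": sorted(c.lower() for c in child_elements)}

# Two hypothetical sources describing a similar earthquake-event domain.
relational_source = from_relational("Earthquake", ["LAT_DEGREE", "LON_DEGREE", "MAGNITUDE"])
xml_source = from_xml("event", ["hasLatitude", "hasLongitude", "hasMagnitude"])

# Once both sources are in the canonical form, a matcher only ever sees
# concept/attribute structures, never the source-specific data models.
print(relational_source)
print(xml_source)
```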
Moreover, the matching information can be used to extend the results of users' queries, since it links to the other related information sources. For example, users can retrieve complete magnitude information from both the Southern California earthquake database and the Northern California earthquake database by querying only the Southern California earthquake database, provided we already have matching information between these two databases.

The canonical representation must be constructed based on a common data model that provides the explicit representation of a conceptualization, which is preferably shared and agreed upon. In addition, it should be possible to write models in the conventional ontology languages. Furthermore, the model should capture as much of the designer's knowledge as possible; in other words, it should employ a sufficient mechanism to express the meaning of interrelationships among concepts.

Figure 1.3: Ontology-based federated architecture for information management

Towards this end, we propose a novel model for developing ontologies, called the Classified Interrelated Object Model (CIOM) [SM05]. CIOM is the design of a higher-level ontology model that enables ontology developers to naturally and directly incorporate semantics into their ontologies. In order to apply CIOM in applications that utilize domain-specific knowledge, we introduce Ontronic [SM05] (http://sir-lab.usc.edu:7777/ontronic/index.html); it provides general functionality for the engineering, discovery, management, and presentation of ontology-based metadata incorporated with CIOM. In addition, Ontronic establishes a platform that is necessary to create the ontological canonical expression of the diverse information sources and to discover mappings between them. The technique of the latter is briefly described in the following section.

1.2.2 Ontology-Driven Schema Matching Framework

Schema matching has historically been difficult to automate. Schema matching is the process whereby two schemas are semantically related at the conceptual level; the source schema instances are transformed into the target schema entities according to those semantic relations. In order for two parties to understand each other, they should use the same formal representation for the shared conceptualization. Unfortunately, it is difficult to have everybody agree on the same schema for a domain. When different schemas exist for a similar domain, problems appear because parties with different schemas do not understand each other.

Most previous studies [DMD+03, DLD+04, DDH01, MZ98, PTU03, MBR01, CAdV01] have tried to find matches by exploiting information on schemas and data instances. However, schemas and data instances cannot fully capture the semantic information of the databases. Therefore, some attributes can be matched to improper attributes. To address this problem, we propose a schema matching framework that supports the identification of the correct matches by extracting the semantics from ontologies [SM06].

In ontologies, two concepts share similar semantics through their common parent. In addition, the parent can be further used to quantify a semantic similarity between them. Figure 1.4 shows an example of ontology-driven schema matching.

Figure 1.4: Ontology-driven schema matching example: concept C is the most specific common parent between attribute E in schema T and attribute D in schema S. Similarly, concept B is the most specific common parent between attribute E in schema T and attribute D in schema H. Hence, E can be matched to D since concept C is more specific than concept B.

The basic idea in calculating the semantic similarity is that similar concepts share a more specific common parent. Furthermore, by combining this idea with effective contemporary mapping algorithms, we perform ontology-driven semantic matching across heterogeneous data sources. Figure 1.5 illustrates an example of semantic matching using ontologies.

With the rapid growth of the Web, overwhelming quantities of new online information become available. Thus, the employment of out-of-date ontologies, which do not contain neologisms or domain-specific information, may decrease the accuracy of the matching. To maintain up-to-date ontologies, the next section presents an efficient concept clustering technique to enrich existing ontologies.

Figure 1.5: Ontology-driven semantic matching example in heterogeneous information sources: heterogeneous information sources can be expressed in canonical models by extracting ontologies. Thus, ".com" in a relational database and "corporation" in XML are expressed in concepts of ontologies. Lastly, they can be matched since they are semantically clustered in the general ontology.

1.2.3 Efficient Concept Clustering to Enrich Existing Ontologies

Ontology learning integrates many complementary techniques, including machine learning, natural language processing, and data mining. Specifically, clustering techniques facilitate the building of interrelationships between terms by exploiting similarities of concepts. With the rapid growth of the Web, online information has become one of the major information sources, and the ontology learning process in which traditional clustering algorithms are involved tends to be slow and computationally expensive when the dataset is as large as the Web.

To address this problem, we present an efficient concept clustering technique for ontology learning that reduces the number of required pairwise term similarity computations without a loss of quality [SCM08]. Our approach is to identify relevant terms using a computationally inexpensive similarity metric based on an event life cycle in online news articles, and then to perform the more sophisticated similarity computations. Hence, we can build clusters with high precision/recall and high speed. Without a loss of clustering quality, our framework reduces the number of required computations from O(N²) to O(N + L²) (L ≪ N), where N is the number of candidate concepts.

1.3 Contributions of the Research

The main goal of this research is to improve interoperability between different information sources by identifying the role of ontologies in the semantic matching of structured data and a method of building ontologies for that matching. Toward this end, we draw on ideas from information retrieval, natural language processing, and data mining. Hence, our approach is to employ ontologies in order to build a novel paradigm for the generation of canonical representations. Also, we utilize ontologies to discover mappings between representations. Lastly, we provide an efficient concept clustering technique to maintain up-to-date ontologies. Upon completion of this dissertation, this research will contribute towards the following aspects:

1.
The most significant contribution of this research is a solution architecture that resolves conflicts in the semantics of existing information repositories. By pro- viding a compact and unifying representation model to structure ontology, it can handle a broad range of extracting canonical representation from diverse reposi- tory, including relational databases, XML, RDF, OWL, and DAML+OIL. 11 It is simple and easy to extend the integration by identifying mapping from canon- ical representation of the existing sources to one of the newly added source. Con- sequently, this incremental integration approach can remove the management of extremely broad global schema, that is missing from most previous integration works [AB91, CRE87, GSSC95, V A96, AM99, KDN90]. 2. Another major contribution of the dissertation is a mapping framework which finds semantic correspondences between information repositories. This mapping framework can improve matching accuracy, since it utilizes fluent semantics in data, which is captured by ontologies, while the traditional methods [DMD + 03, DLD + 04, DDH01, MZ98, PTU03, MBR01, CAdV01] cannot. 3. This dissertation also provides a well-founded, rapid ontology learning framework based on the reduction of the use of the expensive measure by identifying rough cluster. Such a framework lets the ontology engineers focus on validating the interrelationships, rather than structuring from numerous information resources. Eventually, this drastically reduces the time frame. The ontologies generated by our framework can be utilized for a broad range of ontology-driven applications that require up-to-date ontologies. The generated ontologies continuously maintain up-to-date interrelationships among concepts by detecting an event life cycle on the Web. For example, we envision these ontologies will be an important resource for query expansion and refinement in search engines. We applied this federation framework to the earthquake science domain to build Quaketables [GGD + 05], that provide global real-time accessibility to a diverse set of earthquake and fault data. Since the various natures of data in earthquake science and the interpretations of data differ from resource to resource, and from scientist to scientist, 12 a semantic metadata management system must provide integration portability to manage the interoperability for heterogeneous data [SM05]. Besides our efforts on earthquake domains, our framework can be extended to other applications as well. It is expected to impact Web news service. Rapid growth in the technology of online news service has made it possible for users to obtain news articles through the Web Service. Especially, with Really Simple Syndi- cation (RSS) [LB03], which is an XML-based format for content distribution, users can take news and television headlines, incorporate them into their preferred newsreaders and web logs. In this circumstance, online news service providers usually have their own analysis of classification of new articles. 
For example, the New York Times and CNN.com categorized content with their own interpretations as the following: CNN.com : Top Stories, World, U.S., Politics, Law, Technology, Science & Space, Health, Entertainment, Travel, Education, Offbeat, Most Popular, Most Recent The New York Times : Arts, Automobiles, Books, Business, Circuits, Pogue’s Posts, Dining & Wine, Editorials/Op-Ed, Education, Fashion & Style, Health, Home & Garden, International, Magazine, Media & Advertising, Most E-mailed Articles, Movie News, Movie Reviews, Multimedia, National, New York / Region, Obitu- aries, Real Estate, Science, Sports, Technology, Television News, Theater, Travel, Washington, Week in Review As presented above, individual RSS sources are distributed and organized in differ- ent manners and diverse terminologies from online news service providers result in the misunderstanding of semantics across people and computers. Therefore, the process of managing news data using RSS service faces an emerging problem that can be charac- terized as semantic conflicts among various information sources. As we already have successfully completed work on earthquake domain [GGD + 05, SM05], it will clearly be 13 possible to integrate these distributed RSS sources by employing our federation frame- work. 1.4 Outline The rest of the dissertation is organized as follows. Chapter 2 provides a study on the federated architecture for integrating multiple information sources by extracting canon- ical representation. Chapter 3 presents the method and algorithm of schema matching framework where ontology is utilized to capture the semantics of information units in information repositories. Chapter 4 describes an efficient concept clustering technique for ontology evolution in order to maintain existing up-to-date ontologies. Chapter 5 provides an overview of an extensive amount of literature about federation of informa- tion systems, semantic matching, and ontology learning. Finally, concluding remarks and future research directions are presented in Chapter 6. 14 Chapter 2 Ontology-Based Federation Architecture As stated in Section 1.1, the main goal of information federation is to resolve hetero- geneity of the diverse representation models and to increase interoperability of existent information sources. Traditionally, integrating schema is difficult to generate and to maintain the global schema [NDH05], and the conventional database federation method is not applicable to the current information management environment. Hence, we begin this chapter by describing major challenges in information integration. Particularly, we consider the problem of generating canonical representation for individual information source. The chapter is organized as follows: Section 2.1 defines a motivating example. Sec- tion 2.2 provides an overview of our solution embodied by the Ontronic system includ- ing a canonical data model for extracting ontologies from existing information sources. Lastly, Section 2.3 summarizes the chapter. 2.1 Problem Definition In this section, we provide the basic motivation for the proposed federation method for multiple information sources. In particular, we explain why an ontology-extraction is useful in an environment where information sources employ heterogeneous data model. 15 To illustrate a simple example, Figure 2.1 depicts two schemas in the seismology- domain information sources. 
Figure 2.1: The Southern California Earthquake Data Center (SCEDC) XML schema and the Southern California Seismic Network (SCSN) relational database schema

The two schemas in Figure 2.1 describe the same domain, and both contain very similar information about earthquake data in Southern California. Most elements in the SCEDC (http://www.data.scec.org/xml/) schema and attributes in the SCSN (http://www.scsn.org/data.html) table can be intuitively matched. For example, the "hasLongitude" property in SCEDC corresponds to the "LON MINUTE" and "LON DEGREE" attributes in SCSN. Using the conventional database integration approach, we can integrate these schemas. First, one of the schemas needs to be converted into the other's data model. During the conversion, information loss can occur when the source data model is more expressive than the target. For example, if we convert the SCEDC schema into the relational database model, it is difficult to represent the complex interrelationships. Moreover, the composite schema includes the common attributes and the exclusive attributes from both schemas, so the composite schema is typically larger than the original schemas. If we consider an information environment in the real world, numerous and various organizations provide earthquake information with their scientists' own interpretations and diverse data models. Consequently, in this case, a broad global schema will be generated, with more information loss during the transformation.

2.2 Ontronic: Ontology-Based Federation Approach

Ontologies have been proposed as a means of achieving consistent communication among multiple information systems. Toward this end, one ontology, or possibly a collection of overlapping ontologies, must be constructed which integrates the combined world views or local ontologies of the participating agents. In distributed information systems, however, many of these participants would be existing legacy information systems that have very limited logical information in their schemas. This metadata can represent a significant contribution towards the description of the information system's local ontology, but it is too impoverished to play this role by itself. The schema information must be enriched with ontological information before it can be used in this way.

The purpose of this section is to describe a pragmatic system that extends the Semantic Database Model (SDM) [HM78] so that it can be used as a canonical data model to structure ontologies from existing information sources. It has been validated by the implementation of Ontronic, our ontology-based metadata management system that supports CIOM in structuring semantic information to construct, refine, and expand ontologies. In addition, Ontronic performs ontology-based integration of multiple heterogeneous information sources by employing ontology extraction and multi-strategy schema matching techniques to interrelate elements in different component information sources. It is a uniform approach to the problems of schema integration and semantic heterogeneity resolution for the purpose of ensuring information system interoperability. It is based on the following key ideas:

• Canonical representations are generated by extracting ontologies from multiple information sources.

• Ontronic finds the semantic correspondences among the representations of the information sources, without semantic conflicts, through the canonical representations.
• A canonical data model employed in Ontronic is significant in generating simple and sharable information unit semantics, which are preferably shared and agreed upon. • The canonical data model is capable of writing models in the various ontology languages, like OWL, DAML+OIL, and RDF. To present our approach, the following sections include a definition of the canonical data model, a method of ontology extraction, and a description of overall architecture of 18 Ontronic. The schema matching techniques, which resolve conflicts on the semantics of database elements, are further illustrated in Chapter 3. 2.2.1 Ontology Extraction In this section, we introduce an approach to extract ontologies from existing information sources. If we want to avoid misinterpretation and to allow relevant advertisements, we need to extract the meaning of the information which will be exported from the resource to Ontronic. This step is a part of the ontological commitment that involves the conceptualization vocabulary and objects which are used by our resource. Since we have already developed ontologies to share information, we can naturally follow the same scheme to describe this conceptualization of the local ontology of the resource. In fact, several kinds of resources already embed an ad-hoc and explicit if not complete conceptualization of their domain. Relational and Object Oriented Database resources belong to this category since their schema is in essence an abstraction which classifies data. From the perspective of someone who wishes to create a local ontology for such a resource, the aim is to share the commonly accepted concepts already present in the shared ontology and/or the conceptual isolation already embedded in the join- ing resource. We propose to support this task by reusing the information present in the schema thanks to the help of some soft extension to an object-oriented data definition language. This extraction process has two steps as depicted in Figure 2.2. The first step is a syntactic translation from the native schema of the information source into the inter- mediate schema. The intermediate schema achieves the syntactic translation from the source’s native data model to the intermediate schema language. The second step is the ontological upgrade, a semi-automatic translation plus knowledge-based enhancement, where the canonical data model adds knowledge and establishes further relationships 19 Figure 2.2: Conceptual procedure for ontology extraction between the entities in the translated schema. In order to fulfill the commitment of the information source to its ontology and to be compliant with our translation scheme, we need to describe a way to translate expressions in this canonical data model to equivalent expressions in the original information source. We describe CIOM which inherits SDM in order to create our canonical data model to support this task in the next section 2.2.2. 2.2.2 The Canonical Data Model This section describes fundamental semantic primitives in Classified Interrelated Object Model (CIOM) as the canonical data model for Ontronic. Classical database data mod- els are not expressive enough to be able to capture the required richness of ontological knowledge, even when it is known and available within the organization. Toward this end, our approach incorporates an object-based classified database model to structure 20 a domain-dependent ontology for representing semantic information, that is, informa- tion about a statement’s or fact’s meaning. 
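Before the primitives are defined formally in the following subsections, the sketch below gives a simplified, hypothetical rendering of the kind of structure such a canonical model must capture: classes, attributes with value classes and inverses, and subclass connections. The class and attribute names come from the seismology example discussed later in this chapter; the dataclasses themselves are our illustration, not the actual CIOM implementation.

```python
# Simplified illustration of the structure a semantic canonical data model
# must capture: classes, attributes with value classes, cardinality and
# inverses, and subclass connections. Not the actual CIOM implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Attribute:
    name: str
    value_class: str            # class of allowed values
    cardinality: str = "1:m"    # "1:1", "1:m", or "n:m"
    inverse: Optional[str] = None

@dataclass
class ConceptClass:
    name: str
    attributes: list = field(default_factory=list)
    subclass_of: list = field(default_factory=list)  # subclass connections

# A fragment of the seismology example: a fault is a geological feature
# divided into segments, and "segment_of" is the inverse of "has_segment".
fault = ConceptClass("fault",
                     [Attribute("has_segment", "segment", "1:m", inverse="segment_of")],
                     subclass_of=["geological_feature"])
segment = ConceptClass("segment",
                       [Attribute("depth", "number"), Attribute("friction", "number")])
print(fault)
print(segment)
```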
CIOM inherits the core primitives of SDM. SDM is a design of a higher-level database model that enables the designer to naturally and directly utilize more of the semantics of schema into its meanings. Therefore, by extending the basic structures and constraints of SDM, CIOM wants to create a model the real world and to capture the semantics of applications. CIOM provides enriched semantics primitives (types of interrelationships the sys- tem understands), such as subclass (special kind of), attribute (property of), and inverse (inverse of a property), grouping classes that are second-order collections, namely classes of classes and instances (specific fact occurrences). To represent and understand the meaning of interrelationships, the semantic primi- tives in CIOM can be categorized into concepts and inter-connections. Ontology defines a set of representational terms that we call concepts. Concepts are composed of classes in CIOM. A class is defined as a collection of objects, which is a logically homogeneous set of one type of objects. The objects of a class are said to be the instances of the class. The name of a class is used to identify the class from others and is written inside of an oval to represent the class. There are several built-in classes predefined in CIOM. They are often used as the value classes for basic attributes, such as strings and numbers. In addition, concepts are interconnected by means of interrelationships among concepts. The types of these relationship connections are detailed in the following section. 2.2.2.1 Attributes Connection Attributes are the common aspects of instances of a class. An attribute describes each instance of the class. Each attribute has a name that uniquely identifies itself within a class. Similar to a class name, an attribute name can be any string of symbols. The value 21 class of an attribute is the set of all possible values that applies to the attribute. This set can be any built-in or user-defined class. Furthermore, attributes always have inverses and are usually shown in pairs. In each pair, each attribute is said to be the inverse of the other. This relation is specified symmetrically. The inverse attribute of attribute A is denoted asA −1 . In addition, three kinds of cardinality are used to describe the type of relationship between attributes, such as one-to-one (1 : 1), one-to-many (1 : m) and many-to-many (n : m). In CIOM, an attribute can also be specified as non-null (nn) which requires a value. Moreover, CIOM supports aggregate functions, such as maximum, minimum, average, and sum by means of class attributes. 2.2.2.2 Instance Connection Instance connection is used to show membership. An instance is a member of class. The inter-relationship between instances and classes corresponds to an instance connection. 2.2.2.3 Group Connection A grouping class defines a class of classes. In other words, the instances of a group- ing class can be a Part-Of grouping class [KMH05]. It is a second order class in that its instances are of higher-order object type than those of the underlying classes. The instances of a grouping class are viewed as classes themselves. 2.2.2.4 Subclasses Connection Subclasses connection is used to represent concept inclusion. A concept represented by subclass is said to be a specialization of the concept represented by super class, like a child is a kind of the parent. A subclass automatically inherits all attributes from 22 its parent class. 
These attributes need not to be shown in the subclass, but they exist implicitly. This connection can be further categorized into four types: mutually exclusive group, collectively exhaustive group, predicate-defined subclasses and operationally-defined subclasses. For two or more subclasses, mutual exclusiveness means that there is no instance of belonging to more than one of the subclasses, while collective exhaustiveness means that it must belong to at least one of the subclasses for any instance of the parent class. For a predicate-defined subclass, its membership is determined by specific condi- tions (predicates) by one or more attributes. If an instance’s attribute values satisfy all the conditions, then the instance is automatically added to the subclass. Their member- ship is decided by three operations; intersection, union, and difference for operationally- defined subclasses. An intersection subclass contains the instances in both of the parent classes involved in the intersection. For type compatibility, both classes involved in this operation must be subclasses of a common parent, directly or indirectly. A union subclass contains the instances in either of the parent classes involved in the union operation. A difference subclass contains the instances of one parent class that are not in the other class. In addition, a subclass by parent’s intersection inherits all the attributes of both parents. A subclass by parents union inherits only those attributes common to both parents. A subclass generated by parent’s difference inherits all the attributes of the parent on the left side of the operator. 23 QuakeTables QuakeTables stores a various types of earthquake science data in Southern California. Figure 2.3: We extracted the seismology domain ontology expressed in CIOM from QuakeTables [GGD + 05]. There are two roots: event and seismology. Seismology can be specialized as either geo-phenomenon, seismology research or geological feature. In particular, geo-phenomenon is a subclass of the intersection of event and seismology (operationally-defined subclass). Seismology paper is published seismology research. Geological features are further classified into fault and segment. In addition, an event can be sub-categorized as disaster, conference or geo-phenomenon. Earthquake and volcano are types of both natural (predicate) disaster and geo-phenomenon while a nuclear test is both man-made (predicate) disaster and geo-phenomenon. These rela- tionships are subclass connections. 2.2.2.5 CIOM Example This section exemplifies a seismology-domain ontology, which is extracted from QuakeTables [GGD + 05] database system 3 . The QuakeTables system, which is a subpro- ject of QuakeSim 4 , integrates generic seismology information of Southern California. 3 http://infogroup.usc.edu:8080/public.html 24 Figure 2.3 shows an example ontology of the seismology domain. It has been designed to conceptualize several types of fault data and data sets, as well as simulated or hypothetical data. It shows an initial domain ontology (a description of key concepts and interrelationships in the domain) developed by computer scientists and earthquake science experts [GGD + 05]. Earthquakes have one or many faults that have a name and a strand name. Faults are divided into characteristic segments that are expected to rupture as a unit. Segment has name, depth, friction, strike. As well, seismology papers are published at confer- ences and they are referred by each segment. 
Seismology papers include author, title, published year. These relationships are attribute connections. The given ontology also presents the characterization of dynamically-defined earth- quake faults which also include Material Rectilinear Layer parameters for 3-dimensional tectonic deformation modeling. In addition, the instances of the fault in the ontology contain data from California faults, but there is no geographic restriction on future data entries. Furthermore, the instances of the seismology paper in the ontology are extracted from refereed journal articles, professional papers, professional reports, and conference abstracts. In CIOM, ontologies are represented by a directed acyclic graph (DAG). Here, each node in the DAG indicates a concept. A thin arrow is used to denote attributes while a subclass is drawn in a thick arrow. An inverse attribute is drawn in a pair of inter-linked arrows. We denote grouping classes with triple arrows, whereas a narrow line indicates Instance-Of. 4 http://quakesim.jpl.nasa.gov 25 2.2.3 Conceptual Architecture A fundamental method of constructing ontologies in Ontronic consists mainly of three layers as follows: • Domain layer: A conceptualization that captures the shared knowledge of the given domain. • Semantic layer: An explicit ontology model based on CIOM is generated from the conceptualization of the given domain. Such a model has a collection of concepts, interrelationships and the constraints. • Metadata layer: A formal description for the above model is produced to be machine understandable. As shown in Figure 2.4, in the domain layer, ontology developers usually conduct an information requirements analysis and express the results of their analysis in terms of the semantic model. The gap between the semantic level of the domain and ontolo- gies can be bridged by CIOM in the semantic layer. In other words, in comparison with the model theory of the contemporary ontology languages, CIOM can be used as a higher-level semantic model in which the ontology developers design ontologies. Con- sequently, a collection of metadata, which is represented as various kinds of ontology languages, such as DAML+OIL and OWL, can be generated from ontologies that are already produced in the previous layer. Given the process of ontology design, we can identify key functionalities that must be provided by Ontronic: 1. We present an interface to communicate graphical representation between users (ontology developers) and Ontronic. 26 Figure 2.4: Conceptual procedure for ontology construction 2. Ontronic contains functions to facilitate ontology creation with the semantic prim- itives in CIOM. 3. Ontronic provides a metadata generation mechanism from the ontologies created with CIOM. The produced metadata can be exported to or imported from to per- manent storages. As illustrated in Figure 2.5, the architecture of Ontronic is clearly separated into a user-interface, a model, and a storage component to meet the above key requirements. Ontronic provides a visual ontology manager for a graphical representation. The primary role of the visual ontology manager is to provide a graphical user interface (GUI) to view, analyze, and compose ontologies. It contains an ontology DAG, ontology tree, ontology visualization API, query processor, and metadata viewer. Ontology DAG illustrates ontologies with DAGs based on the denotations of the semantic primitives in CIOM. The ontology tree also presents a hierarchical structure of concepts. 
The 27 Ontronic Client Java Applet Ontology Visualizer CIOM Manager import/export Visualize ontology Ontology Mapping Framework Jena Updating metadata WordNet Wrapper Mapping Info. mgr Ontronic Server Mapping Info. Database Ontronic Database WordNet Database CIOM Modeler Ontology Extractor Ontologies [OWL/RDF] Relational X M L X M L X M L XML Conceptual/Logical/Physical Data Models Existing Information Sources Figure 2.5: Overall architecture of Ontronic query processor parses and executes a query. In addition, generated metadata written in the ontology languages, such as DAML+OIL and OWL is displayed in the metadata viewer. Figure 2.6 presents a snapshot of the visual ontology manager in Ontronic. Ontronic provides APIs to retrieve and manipulate ontologies. Ontology visual- ization API facilitates creating, updating, and deleting graphical units for an ontology presentation by using CIOM API. CIOM API can construct meta-models for ontologies consisting of the semantic primitives in CIOM. CIOM API also translates queries from CIOM model to RDQL [MSR02]. Additionally, CIOM API provides a comprehensive transformation from CIOM meta-model into RDF by utilizing Jena 5 API which pro- vides RDF-based metadata management infrastructure designed for semantics-driven applications. 5 http://jena.sourceforge.net/ 28 Figure 2.6: A snapshot of CIOM Modeler in Ontronic: an ontology engineer can manip- ulate ontologies using the CIOM Modeler. Finally, Ontronic supports the storage of ontologies into both a database and a file system. This framework supports the definition, storage, access, and control of collec- tions of structured data. The relational database management system (DBMS) physi- cally stores the metadata in the Ontronic server. Furthermore, Ontronic provides a web- based cooperative workspace in which multiple authors can build and share the same ontologies simultaneously. In order to endure heavy requests to the DBMS, Ontronic deploys a connection pool. Additionally, the clients can export and import the RDF, OWL, and DAML+OIL files. 29 2.3 Summary In order to resolve semantic conflicts among various information sources in seismol- ogy and geoscience, we proposed the ontology-based semantic information manage- ment methodology. In this chapter, we define CIOM as a canonical data model for diverse information sources and described Ontronic, a system that provides the general functionalities to manage ontology-based metadata. The major contributions of this work are: • Ontronic resolves conflicts on the structure and the semantics of existing informa- tion repositories. • By providing a compact and unifying representation model to structure ontology, Ontronic can handle a broad range of extracting canonical representation from diverse information sources. • Ontronic currently provides the capacity to access and manage ontologies popu- lated with paleoseismic data from the major faults and with three structured data sets containing summary fault attributes. These ontologies provide geographic coordinates, geometry, and summary attributes for many active faults and fault segments in California. Chapter 3 focus on incorporating ontologies into schema matching in Ontronic in the federated architecture. 30 Chapter 3 Ontology-Driven Schema Matching Framework The previous chapter presents our federated architecture for information management. This chapter presents schema matching frameworks, which are a core component of our federated architecture. 
Our matching frameworks support interconnecting similar domain schemas. Our approach is simple yet powerful: we incorporate ontologies to gain semantic information about the data. We divide the mapping algorithms into two categories: the semantics-driven mapping framework utilizes the semantics of data, which is captured by ontologies, while the data-driven mapping framework incorporates diverse features of the schemas and instances of information sources. The remainder of this chapter is organized as follows: Section 3.1 defines the matching problem. Sections 3.2-3.5 illustrate our mapping frameworks. Section 3.6 shows the experimental results. Finally, Section 3.7 summarizes this chapter.

3.1 Problem Definition

Schema matching is inherently difficult to automate and has been regarded as a tedious and error-prone task, since schemas typically contain limited information and no explicit semantics of attributes. In most previous studies, schema matching has generally been performed by gathering mapping evidence from various facets of an attribute, including its name, type, and the patterns and statistics of its data instances [DLD+04, DMD+03, KN03, MBDH05, RB01, LC00, MZ98]. For example, by comparing names, types, and sample instances between the attributes "phone number" and "telephone" in compatible tables, these two attributes can be matched. However, schema and data instances alone cannot fully capture the meanings. If we only consider the patterns of instances, the domain, and the names of the attributes "phone number" and "fax number", these two attributes can also be matched. Therefore, excluding semantic information about the attributes limits the discovery of appropriate matches between database schemas. These difficulties posed by the lack of semantics show that there is a need for an alternative method that obtains the semantics of data from external sources.

Figure 3.1: Matching ambiguity can be resolved with the semantics-driven mapping framework. The values of the instances of the attributes "Fax" and "Telephone" in the table "Firm Information" of schema S and the attribute "Call" in the table "Company" of schema T share common patterns.

Figure 3.2: The semantics-driven framework provides candidate mappings. The values of the instances of the attribute "Dot Com" of the table "Company information" in schema S are dissimilar to all the attributes of the table "Firm list" in schema T. The mapping between "Corporation" and "Dot Com" can be identified by the semantics-driven mapping framework because their semantic similarity is higher than that of the other mappings.

Figure 3.1 shows that the estimated similarities resulting from the data-driven mapping framework are too close to determine which correspondence is more suitable for this mapping. However, the semantics-driven mapping framework provides increased evidence for the mapping between the attributes "Telephone" and "Call", since the two words are semantically more related than the attribute pair "Fax" and "Call". It is therefore necessary to prune candidate mappings.
Figure 3.3: Overall architecture of our matching framework

Figure 3.2 illustrates that the data-driven framework fails to find the candidate mapping, while the semantics-driven mapping framework can identify it.

3.2 Matching Framework Architecture

This section describes the overall architecture of our matching framework. As stated in the previous section, this matching framework is divided into two mapping procedures: the semantics-driven mapping framework and the data-driven mapping framework. In Figure 3.3, the semantics-driven mapping framework attempts to determine the similarity of two attributes using shallow natural language processing and the ontological distance between the attributes, while the data-driven mapping framework quantifies the similarity of two attributes based on the various contexts of schema and instances.

To discover the correspondences between schemas S and T, we compute the similarity matrix $M_{ST}$ for S and T. S has attributes $s_1, s_2, \cdots, s_n$, and T has attributes $t_1, t_2, \cdots, t_m$. $M_{ST}$ is defined as:

$$M_{ST} = \begin{pmatrix} SIM(s_1,t_1) & \cdots & SIM(s_1,t_m) \\ \vdots & SIM(s_i,t_j) & \vdots \\ SIM(s_n,t_1) & \cdots & SIM(s_n,t_m) \end{pmatrix}$$

where $SIM(s_i,t_j)$ is an estimated similarity between the attributes $s_i$ and $t_j$ ($1 \leq i \leq n$, $1 \leq j \leq m$).

To find the most similar attribute in the other schema, we propose an ontology-driven mapping algorithm with an ensemble of multiple matching methods. The mapping algorithms are divided into a semantics-driven mapping framework and a data-driven mapping framework. The former generates matches based on information content [Res99], while the latter generates matches based on the premise that the data instances of similar attributes are typically congruent. The two frameworks thus increase the accuracy of the similarity estimate by mutual complementation. Each framework produces a mapping matrix, respectively $M^{sem}_{ST}$ and $M^{dat}_{ST}$. Thus, the similarity matrix $M_{ST}$ is defined as:

$$M_{ST} = \alpha \cdot M^{sem}_{ST} + \beta \cdot M^{dat}_{ST} \qquad (3.1)$$

where $\alpha + \beta = 1$.

Leveraging a mapping based on the meaning of the attributes achieves a level of matching performance that is significantly better than that of a conventional schema matcher. Two techniques contribute to the matching accuracy:

• Matching ambiguity resolution: the framework can identify actual mappings even when they are ambiguous.

• Providing candidates that refer to a similar or the same object: the framework provides matching candidates even when the data-driven framework fails to select them.

The following sections illustrate each framework in detail.

3.3 Data-Driven Mapping Framework

The attribute names in a schema can be very difficult to understand or interpret. In this section, we propose a framework that functions correctly even in the presence of opaque attribute names by using the data values.
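Before turning to the individual matchers, the matrix combination in Eq. (3.1) from the previous section can be illustrated with a small sketch. This is an illustration only, assuming NumPy; the matrices, weights, and variable names are placeholders rather than the actual Ontronic implementation.

```python
import numpy as np

# Placeholder similarity matrices for schemas S (3 attributes) and T (3 attributes);
# in the real framework these come from the two mapping frameworks.
M_sem = np.array([[0.82, 0.10, 0.05],
                  [0.15, 0.70, 0.20],
                  [0.05, 0.25, 0.60]])
M_dat = np.array([[0.60, 0.55, 0.10],
                  [0.20, 0.65, 0.30],
                  [0.10, 0.20, 0.75]])

alpha, beta = 0.5, 0.5            # assumed weights with alpha + beta = 1
M = alpha * M_sem + beta * M_dat  # Eq. (3.1)

# For each attribute s_i, the candidate t_j with the highest combined similarity.
best = M.argmax(axis=1)
for i, j in enumerate(best):
    print(f"s_{i+1} -> t_{j+1} (SIM = {M[i, j]:.2f})")
```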
Previous research has shown that an effective matching technique utilizes an ensemble of searching overlaps in the selection of the data types and representation of the data values, of comparing patterns of the data instances between the schema attributes, of linguistic matching of names of schema attribute, and of using learning techniques [DLD + 04, DMD + 03, KN03, MBDH05, RB01, LC00, MZ98]. In the data-driven mapping framework, we mainly make use of the fact that the schemas, which we are matching, are associated with the data instances we have. By comparing the attribute instances, the mapping can be found since the similar attributes share similar patterns or representations of the data values of their instances. Thus, there are two types of base matchers, such as the pattern-based matcher and the attribute-based matcher. 36 3.3.1 Pattern-based Matcher The pattern-based matcher tries to find a common pattern of the instance values, such as fax/phone numbers, or monetary units. It determines a sequence of letters, symbols and numbers that are most characteristic in the instances of an attribute. Given any value of the instances, we transform each letter to “A”, symbol to “S”, and number to “N”. To compute the similarity, it compares the patterns by calculating the values of the edit distance [Lev75] of a pair of patterns. An edit distance between two strings is given by the minimum number of the operations needed to transform one string into the other where an operation can be insertion, deletion, or substitution. For example, “(213)321- 4321” is transformed into “SNNNSNNNSNNNN” and “213-321-4321” is transformed into “NNNSNNNSNNNN”. In this case, the edit distance between two numbers is 1. Let a i and b j be instances of the attribute s and t 1≦ i≦ N a ,1≦ j ≦ N b ). LetEditDist(a i ,b j ) denote an edit distance value between the patterns of the attribute instancesa i and b j . In addition, it also contributes to a performance to use top a i most frequent instances because pairwise comparison is typically a time consuming task. Let g i denote the number of the instancea i in the attribute s, and the number of the instance h j in the attribute t. We assume that a i and b j are sorted in a descending order with respect to g i and h j . The similarity between the instance patterns of the attributes can be quantified as follows: SIM dat (s i ,t j ) = k X i=j=1 { 1 2 ( g i N a + g i N b )× 1 EditDist(a i ,b j )+1 } (3.2) By detecting the most k frequent pattern, we can use the pattern to find a match with the pattern of the corresponding attributes. 37 3.3.2 Attribute-based Matcher The attribute-based matcher tries to find common properties of the attributes. Compar- ing various phases of the attributes, such as name and domain information, also provides the correspondence between the attributes [DLD + 04, DMD + 03, KN03, MBDH05, RB01, LC00, MZ98]. Thus, the attribute-based matcher maps attributes by compar- ing their names and types. Comparison of the names among the attributes is performed only when the domain information of two attributes is similar. Due to a number of diverse ways to represent the names of the attributes like compound words, we compute a prediction based on the frequency of the co-occurred N-gram of their names. Tri-gram was the best performer in our empirical evaluation. Let SIM pat (s i ,t j ) be a prediction, which is produced by this attribute-based matcher. 
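To make the two base matchers concrete, here is a minimal sketch of the pattern transformation with edit distance and of a character tri-gram name comparison. The helper names, the top-k pairing, and the Dice-style tri-gram overlap are our reading of Eq. (3.2) and of the tri-gram comparison described above, not the exact Ontronic code.

```python
from collections import Counter

def to_pattern(value: str) -> str:
    # Letters -> 'A', digits -> 'N', anything else (symbols) -> 'S'.
    return "".join("A" if c.isalpha() else "N" if c.isdigit() else "S" for c in value)

def edit_dist(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pattern_similarity(values_s, values_t, k=3):
    # Eq. (3.2): compare the k most frequent instance patterns of the two attributes.
    fs, ft = Counter(map(to_pattern, values_s)), Counter(map(to_pattern, values_t))
    sim = 0.0
    for (pa, ga), (pb, hb) in zip(fs.most_common(k), ft.most_common(k)):
        weight = 0.5 * (ga / len(values_s) + hb / len(values_t))
        sim += weight / (edit_dist(pa, pb) + 1)
    return sim

def trigram_similarity(name_a: str, name_b: str, n=3):
    # Attribute-based matcher: Dice overlap of character n-grams (one plausible reading).
    grams = lambda s: Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ga, gb = grams(name_a.lower()), grams(name_b.lower())
    shared = sum((ga & gb).values())
    return 2 * shared / (sum(ga.values()) + sum(gb.values()) or 1)

print(to_pattern("(213)321-4321"))                                # SNNNSNNNSNNNN
print(pattern_similarity(["(213)321-4321"], ["213-321-4321"]))    # edit distance 1
print(trigram_similarity("phone number", "telephone"))
```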
Thus, the similarity from the data-driven mapping framework can be defined as: SIM pat (s i ,t j ) =α·SIM pat dat (s i ,t j )+β·SIM atr dat (s i ,t j ) (3.3) whereα+β = 1. Unfortunately, this mapping framework is not always successful as indicated in Fig- ure 3.1 and Figure 3.2. When it fails to find mappings, it is often because of its inability to incorporate the real semantics of the attributes to identify the correspondences. In the following sections, we propose a technique to resolve this problem. 3.4 Semantics-Driven Mapping Framework The semantics-driven mapping framework tries to identify the most similar semantics of attributes in the other schema when the attributes names are not opaque. The name of an attribute typically consists of a word or compound words that contains the semantics of the attribute. Thus, the semantic similarity between s i and t j can be measured by 38 finding how many words in the two attributes are semantically alike. We describe how we measure semantic similarity. 3.4.1 Semantic Similarity Previous research has measured semantic similarity, which is based on the statisti- cal/topological information of the words and their interrelationships [JC98]. An alter- native approach has recently been proposed to evaluate the semantic similarity in a tax- onomy based on information content [Res99, Lin98]. Information content is a corpus- based measure of the specificity of a concept. This approach relies on the incorporation of the empirical probability which estimates a taxonomic structure. Previous research has shown that this type of approach may be significantly less sensitive to link density variability [JC98, Lin98]. Measures of the semantic similarity in this approach quantify the relatedness of two words, based on the information contained in an ontological hierarchy. Ontology is a collection of the concepts and interrelationships. There are two types of interrelation- ships: a child concept may be an instance of its parent concept (is-a relationship), or a component of its parent concept (part-of relationship). In addition, the child concept can have multiple parents, thus there may exist multiple paths between the child con- cept and the parent concept. WordNet, which is a lexical database, is particularly well suited for similarity measures since it organizes nouns and verbs into hierarchies of is- a or part-of relations. Thus, we have employed WordNet Similarity [PPM04], which has implemented the semantic relatedness measures that compute information content using WordNet ontology from untagged corpora, such as the Brown Corpus, the Penn Treebank, and the British National Corpus [PPM04]. 39 Let c denote a word of an attribute. The information content of a word w can be quantified as follows: IC(w) =−log(p(w)) (3.4) wherep(w) is the probability of how much wordw occurs. Frequencies of words can be estimated by counting the number of occurrences in the corpus. Each word that occurs in the corpus is counted as an occurrence of each concept containing it. freq(w) = X w i ∈Cc count(w i ) (3.5) whereC c is the set of concepts subsumed by a wordw. Then, concept probability forw can be defined as follows: p(w) = freq(w) N (3.6) whereN is the total number of words observed in corpus. This equation states that informativeness decreases as concept probability increases. Thus, the more abstract a concept, the lower its information content. This quantization of information provides a new approach to measure the semantic similarity. 
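The information-content computation in Eqs. (3.4)-(3.6) can be sketched as follows. The tiny taxonomy and counts below are fabricated for illustration only; a production system would instead derive the counts from WordNet and a large untagged corpus as described above.

```python
import math

# Toy is-a taxonomy: concept -> list of child concepts (illustrative only).
children = {
    "entity":        ["communication", "device"],
    "communication": ["telephone_call", "fax"],
    "device":        ["telephone", "fax_machine"],
    "telephone_call": [], "fax": [], "telephone": [], "fax_machine": [],
}
# Toy corpus counts of each word; Eq. (3.5) sums these over subsumed concepts.
counts = {"entity": 2, "communication": 5, "device": 4,
          "telephone_call": 30, "fax": 12, "telephone": 25, "fax_machine": 8}

def freq(w):
    # freq(w): occurrences of w plus those of every concept it subsumes (Eq. 3.5).
    return counts[w] + sum(freq(c) for c in children[w])

N = freq("entity")                      # total observations under the root

def information_content(w):
    p = freq(w) / N                     # Eq. (3.6)
    return -math.log(p)                 # Eq. (3.4)

for w in ("entity", "communication", "telephone_call"):
    print(w, round(information_content(w), 3))
# The root has IC 0; the more specific the concept, the higher its information content.
```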
The more information that these two words share, the more similar they are. Resnik [Res99] defines the information that is shared by two words as the maximum information content of the common parents of the words in the ontology. The similarity of two words is defined as follows: sim resnik (c i ,c j ) = max t∈CP(C i ,C j ) (−logp(c)) (3.7) whereCP(C i ,C j ) represents the set of parents words shared byC i andC j . 40 Figure 3.4: An example of the semantic similarity computation The value of this metric can vary between 0 and infinity. Thus, it is not suitable to use it as a probability. Lin [Lin98] suggested a normalized similarity measure as follows: sim(c i ,c j ) lin = 2×sim resnik (c i ,c j ) −{log(p(c i )+logp(c j ))} (3.8) Figure 3.4 depicts an instance of computing a similarity between the nodes “E” and “H”. The node “B” has the maximum information content of the common parents of the nodes “E” and “H”, since the node ”B” is the most specific common parent of the nodes “E” and “H”. Concept frequency of the node “B” is 12 since it is the sum of its word frequency (6) in the corpus and the sum of the word frequencies (6) of its descendants “C” and “D”. Therefore, the similarity between the nodes “E” and “H” is 0.03. 41 3.4.2 Compound Word Processing The name of the attributes sometimes consists of a compound word such as “agent name.” In English, the meaning of the compound word is generally a specialization of the meaning of its head word. In English, the head word is typically placed on the right- most position of the word [Col96, Col97]. The modifier limits the meaning of the head word and is located at the left of the head word [Col96, Col97]. This is the most obvious in descriptive compounds, in which the modifier makes it more specific by restricting its scope. A blackboard is a particular kind of board which is black, for instance. Based on this computational linguistic knowledge, our approach is to give weight to the mapping with the head word. We disassemble the compound word into atomic words and try to compute predicted similarities between each word to the attributes in the other schema. There are two issues of disassembly of the name of the attribute. • Tokenization: “agent name” appears in various formats, such as “agent name” or “AgentName.” In order to correctly identify these variants, tokenization is applied to names of attributes. Tokenization is a process that identifies the boundaries of words. As a result, non-content bearing tokens (e.g., parentheses, slash, comma, blank, dash, and upper case) can be skipped in the matching phase. • Stopwords removal: Stopwords are the words that occur frequently in the attribute but do not carry useful information (e.g., of). Such stopwords are elimi- nated from the vocabulary list considered in the Smart project [SM83]. Removing the stopwords provides us with flexible matching. We then integrate each similarity with more weight on the right words. 42 Let a 1 ,a 2 ,··· ,a k be a set of tokenized words sorted by the rightmost order in the compound word, which is the name of the attribute s in the schema S. The estimated similarity for thes is defined as: k X r=1 1 r 2 ·N ×sim(a r ,t) (3.9) whereN = P k r=1 1 r 2 r 2 is a heuristic weight value, which is verified in the empirical evaluation. If both attributes are compound words, then let b 1 ,b 2 ,··· ,b k denote a set of atomic words sorted by the rightmost order in the compound word, which is the name of the attribute t in the schema T . 
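A minimal sketch of the compound-word handling just described follows: tokenization, stopword removal, and the right-to-left $1/r^2$ head weighting of Eq. (3.9). The word-level similarity is passed in as a function (standing in for the Lin measure), and all names and values here are illustrative assumptions rather than the original implementation.

```python
import re

STOPWORDS = {"of", "the", "a", "an"}   # tiny stand-in for the Smart stopword list

def tokenize(attribute_name: str):
    # Split "AgentName", "agent_name", or "agent name" into lowercase tokens.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", attribute_name)
    tokens = re.split(r"[^A-Za-z]+", spaced)
    return [t.lower() for t in tokens if t and t.lower() not in STOPWORDS]

def head_weighted_similarity(compound_name, target_word, word_sim):
    # Eq. (3.9): words are taken right to left; the head word (rightmost) gets weight 1,
    # the next modifier 1/4, then 1/9, ...; N normalizes the weights to sum to 1.
    words = list(reversed(tokenize(compound_name)))     # a_1 = head, a_2 = modifier, ...
    weights = [1.0 / (r * r) for r in range(1, len(words) + 1)]
    N = sum(weights)
    return sum(w / N * word_sim(a, target_word) for w, a in zip(weights, words))

def toy_sim(a, b):
    # Stand-in for the Lin measure of Eq. (3.8); the values are fabricated.
    return {("name", "title"): 0.8, ("agent", "title"): 0.2}.get((a, b), 0.0)

# The head word "name" dominates the prediction for the compound attribute "AgentName".
print(head_weighted_similarity("AgentName", "title", toy_sim))
```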
Once we define the similarity between two concepts, the semantic similarity between two attributes (s i andt j ) can be defined as follows: SIM sem (s i ,t j ) = l X q=1 1 q 2 ·M ×{ k X r=1 1 r 2 ·N ×sim(c r ,c q ) lin } (3.10) whereM = P l q=1 1 q 2 . Thus, the semantic similarity matrixM sem ST is SIM sem (s 1 ,t 1 ) ··· SIM sem (s 1 ,t m ) . . . SIM sem (s i ,t j ) . . . SIM sem (s n ,t 1 ) ··· SIM sem (s n ,t m ) Together with the data-driven mapping frameworks, this framework is optimally combined as described in the next section. 43 3.5 Similarities Regression Using the machine learning technique, we combine the predicted similarities: SIM dat (s i ,t j ) andSIM sem (s i ,t j ). Since each similarity can have different significance with contribution to the combined prediction, a weight is assigned to each similarity. To improve the predictions of the different single mapping framework, parameter optimiza- tion [BS93] is performed where possible by cross-validating on the training data with logistic regression. The final estimated similarity between attribute s i and t j can be defined as follows: SIM(s i ,t j ) = sigmoid(α·SIM dat (s i ,t j )+β·SIM sem (s i ,t j )) (3.11) whereα+β = 1 and sigmoid(x) is defined as follows: sigmoid(x) = 1 1+e −x Since the sigmoid transfers function can divide the whole input space smoothly into a few regions, the desirable prediction can be obtained. 3.6 Experiments To demonstrate the accuracy and effectiveness of our mapping framework, we per- formed experiments on real world data. We applied our mapping framework to real estate domain datasets that were previously used in LSD [DDH01]. We compare our experiment results with that of complete LSD results in terms of accuracy. The com- plete LSD matches the schema with the schema and data information. 44 3.6.1 Test Datasets We used two real estate domain datasets. Both datasets contain house-for-sale listing information. The mediated schema of Real Estate II is larger than that of Real Estate I. Table 3.1 illustrates the data information of the two datasets. Properties Real Estate I Real Estate II Attribute number in the mediated schema 20 66 Sources 5 5 Downloaded listing 502-3002 502-3002 Attribute number in the source schemas 19-21 33-48 Matchable attribute in the source schemas 84-100% 100% Table 3.1: Domains and data sources for the experiment [DDH01] In Real Estate I domain, the source schema has a total of 6 elements. The left column of this figure shows the elements, and the right column shows the mappings for the elements. There are 2 one-to-one mappings and 4 complex mappings. All complex mappings here are concatenation mappings. In Real Estate II domain, the source schema has a total of 19 elements. The left column of this figure shows the elements, and the right column shows the mappings for the elements. There are 6 one-to-one mappings and 13 complex mappings. 3.6.2 Experimental Procedure In order to empirically evaluate our technique, we train the system on the Real Estate I domain (the training domain). With this data, we perform cross-validation ten times to attain more reliable weights for the combination of the predictions from the semantics- driven mapping framework and the data-driven map-ping framework. These values are denoted asα andβ in Section 3.2. 
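The cross-validated weight estimation described above could be sketched with scikit-learn as below. This is an assumption about tooling (the dissertation's own implementation is not shown), and the labeled attribute pairs are fabricated for illustration; the learned coefficients play the role of the weights in Eq. (3.11), and the model's sigmoid output plays the role of the final SIM value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one candidate attribute pair: [SIM_dat, SIM_sem]; y = 1 if it is a
# true correspondence in the training domain, else 0. The values are made up.
X = np.array([[0.90, 0.85], [0.80, 0.20], [0.30, 0.75],
              [0.20, 0.15], [0.70, 0.90], [0.10, 0.40]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

print("weights:", model.coef_[0], "bias:", model.intercept_[0])
print("SIM for a new pair:", model.predict_proba([[0.65, 0.80]])[0, 1])
```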
Figure 3.5: Average matching accuracy compared with the complete LSD [DDH01] on the four domains (Real Estate I, Time Schedule, Faculty Listings, and Real Estate II), for the baseline (complete LSD), the data-driven matching prediction, the semantic matching prediction, and the regressed prediction. We ran the experiments with 300 instances, and the regressed prediction had higher average accuracy than the complete LSD on all four domains.

Figure 3.6: Matching accuracy for the Real Estate I domain with 100, 200, 300, and 500 instances. Our mapping framework achieved higher accuracy than the complete LSD even with a small number of data instances.

Figure 3.7: Elapsed matching time for 100, 200, 300, and 500 data listings. The elapsed time of the data-driven mapping framework grew as the number of instances increased, while the elapsed time of the semantics-driven mapping framework remained constant.

3.6.3 Experiment Results

Our experiment aimed to ascertain the relative contribution of utilizing ontologies to identify the semantics of the attributes in the process of schema reconciliation, while LSD exploits learned schema and data information. As shown in Figures 3.5, 3.6, and 3.7, we achieved 7.5% and 19.7% higher average accuracy than the complete LSD on the two domains.

3.7 Summary

We considered the computation of semantic similarity from ontologies to identify the correspondences between database schemas. An experimental prototype system has been developed, implemented, and tested to demonstrate the accuracy of the proposed model in comparison with the previous mapping model.

As an extremely large amount of new online information is published daily on the Web, neologisms and new concepts appear as time passes. Thus, the employment of out-of-date ontologies may decrease the accuracy of matching. Therefore, it is essential to maintain ontologies so that they reflect up-to-date knowledge. The next chapter presents an efficient concept clustering technique that enriches existing ontologies in order to resolve this problem.

Chapter 4
Efficient Concept Clustering for Ontology Evolution

Numerous hand-crafted lexical-semantic databases, such as WordNet [Mil94], have shown that ontologies can facilitate machine understanding of text and the automatic processing of documents. However, this manual approach to extracting semantic meanings cannot scale with the growth of the Web. Therefore, ontology learning, the process of mining taxonomic relations from information sources, has become a significant subfield of ontology engineering.
We focus on the efficiency of identifying relevant concept candidates for ontology learning. Typically, identifying semantic structure, where clustering methods have been very involved, is computationally expensive for a large number of data sets since it requires a significant number of computations to measure a similarity among each pair of candidates [MNU00]. To address the computational bottleneck, the reduction of the number of required similarity computations has emerged as a problem awaiting a solu- tion in ontology learning from a large collection of documents. To address the computational issue, we present a novel similarity computation approach based on the analysis of an event life cycle on the Web. By using an inex- pensive similarity metric, the proposed approach identifies rough concept clusters that have high potential to be clustered together, thus avoiding unnecessary similarity com- putations among concepts that are far from each other. Based on the obtained rough concept clusters, we perform expensive similarity computations to refine the cluster 49 0 20 40 53 60 80 100 0 0.5 1 1.5 2 2.5 3 t (time) between May 11, 2007 and August 19, 2007 tf t (term frequency) David Beckham Los Angeles Galaxy July 16, 2007 Figure 4.1: We have collected term frequencies from online news articles for 100 days between May 11, 2007 and August 19, 2007 for both “David Beckham”, who is a famous football (soccer) player, and the “Los Angeles Galaxy”, which he joined in July 2007 and had his first official soccer practice session on July 16, 2007 structure. We hypothesize that if an event life cycle based candidate selection in the first phase results in a small number of similarity computations, then it would improve the clustering speed without a loss of quality. Our approach is based on the analysis of event life cycle on the Web. Certain events generate postings to the Web, such as news articles. The volume of postings typically starts small, grows, and then gradually disappears as time passes [Kle03]. Hence, a new relationship between two concepts can be created when a certain event has happened. Figure 4.1 depicts an example of term frequency transition over time. The results show that the frequencies of both terms sharply surged up and then dropped around the same time frame (July 16, 2007). Observations of other terms show the same phenomenon. 50 One of the reasons this observation is important is that we can identify the relevant con- cept candidates if we cluster the terms whose frequencies have a rapid fluctuation at a similar point of time. In other words, it is unnecessary to compute a similarity between the terms before the event happens because the similarity between them probably would be low prior to the event. Our approach can be coupled with any type of clustering algo- rithms and can be utilized for making algorithms scalable with respect to the millions of documents. Once concept clusters are identified, we utilize the obtained clusters to enrich exist- ing ontologies. As many thousands of articles are published daily on the Web, neolo- gisms or concepts appear as time passes. Thus, it is essential to maintain ontology to reflect up-to-date knowledge. To address this problem, we present how to utilize refined clusters to enrich existing ontologies. The remainder of this chapter is organized as follows. Section 4.1 presents the ontol- ogy learning problems that we consider in this paper. 
Section 4.2 discusses our similarity framework based on event life cycle, and Section 4.3 describes how to enrich ontologies using term similarity. Section 4.4 presents our experimental results. Finally, Section 4.5 summarizes this paper, and discusses future work. 4.1 Problem Definition Ontologies provide an explicit model for structuring concepts, together with their inter- relationships. Most existing ontologies, such as WordNet [Mil94], were manually cre- ated and are being maintained by ontology engineers. Instead of employing manual construction, many recent studies have investigated ontology learning that automati- cally or semi-automatically extracts information from texts and builds a structured orga- nization using extracted information. Inherently, it integrates multiple complementary 51 techniques, such as machine learning, natural language processing, and data mining. Particularly, clustering techniques facilitate building interrelationships between terms by exploiting similarities of concepts to propose a hierarchy of concept categories. With the rapid growth of the Web in the past decade, online information has become one of the major information sources. Traditional clustering algorithms have a scala- bility limitation in that similarity computation needs to be performed on all term pairs. Thereby, it would be inefficient when we have numerous terms which are extracted from the overwhelming number of documents on the Web. To address this problem, the fol- lowing two sections present a method for ontology learning that reduces the number of required computations without sacrificing the precision/recall. 4.2 Rough Cluster Identification We now illustrate the key ideas of our approach that one can greatly reduce the number of similarity computations required for ontology learning. We first identify rough cluster structure from documents using inexpensive similarity metric, and then only measure the similarity among pairs of concepts in rough clusters to identify their internal semantic structure. Hence, our first step in ontology learning from Web documents is to develop a solid solution for rough cluster identification. Most research in concept extraction has incorporated both natural language pro- cessing and information retrieval techniques for term indexing [SB88, CJM06]. These methods include parsing HTML, tokenizing, stemming words 1 , eliminating stopwords 2 and detecting phrases. A set of terms (phrases) is defined as X = {x i : 1 ≤ i ≤ N} 1 We combine the Porter stemmer [RvRP80] with the lexical database [Mel95] since this combination deals with irregular plural/tense [CJM06]. 2 We employ the stopword list used in the Smart project [SM83]. 52 wherex denotes a concept candidate. A term is a concept candidate and can consist of multiple words. Given concept candidates extracted by these methods, we consider an event life cycle on the Web to cluster relevant concept candidates. Key terms of certain events on the Web, like online news articles become popular and then diminish as time passes. In other words, the term frequency of those key terms greatly increases and decreases as an event appears, grows, and disappears. A term which is highly relevant to the key term of the events also appears more frequently in the documents and peaks during the same time frame. We are interested in the term frequency valued time series denoted by tf t (x),t = 0,1,2,...,T , wherex is a term andt is a time variable. 
A function $count(x)$ computes the number of occurrences of $x$ in a document $d_i$. To account for different document lengths, we normalize the term frequency ($freq$) of term $x$ in a document $d_i$ as follows:

$$freq_{d_i}(x) = \frac{count(x)}{l_i} \qquad (4.1)$$

where $l_i$ is the length of $d_i$. Let $D_t$ be the set of documents that are published in time slot $t$; the time slot that we use is a day. We formulate $tf_t(x)$ as follows:

$$tf_t(x) = \sum_{d_i \in D_t} freq_{d_i}(x) \qquad (4.2)$$

We now present an algorithm that captures the important fluctuations in $tf_t$ over time. The algorithm is referred to as the Gallistel change point finding algorithm [GMKL01]. We would like to find a change point at which an attribute of the distribution from which the $tf_t$ values are drawn fluctuates significantly. We divide this algorithm into the following steps.

We first define a function that associates a slope with each point in $tf_t$:

$$f(t_i) = \frac{tf_{t_i}(x) - tf_{t_0}(x)}{t_i - t_0} \quad \text{where } i > 1 \qquad (4.3)$$

We then identify a putative change point ($pcp$) by finding the earliest point $t$ that deviates maximally from $f(t)$:

$$pcp_t = \max_{j=0}^{i} \left( |f(t_j) - tf_{t_j}(x)| \right) \quad \text{where } 0 \leq i \leq T \qquad (4.4)$$

To verify that $pcp_t$ satisfies a decision criterion, we divide the set of $tf_t$ values up to that point into two subsets: $S_1$ for the $tf_t$ values before $pcp_t$, and $S_2$ for the $tf_t$ values after it. We then perform an unequal variance t-test [O'M86]. The t statistic ($z$) to test whether the means are different can be calculated as follows:

$$z = \frac{\bar{S}_1 - \bar{S}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \qquad (4.5)$$

where $\bar{S}_1$ and $\bar{S}_2$ are the sample means of the $tf_t$ values ($g$) in $S_1$ and $S_2$, and $s_1^2$ and $s_2^2$ are the sample variances:

$$s_1^2 = \frac{\sum_i^{n_1} (g_i - \bar{g}_1)^2}{n_1 - 1} \quad \text{and} \quad s_2^2 = \frac{\sum_i^{n_2} (g_i - \bar{g}_2)^2}{n_2 - 1}$$

We identify a change point ($cp$) when $z$ is significantly far from zero, rejecting the null hypothesis [O'M86] that the $tf_t$ values in $S_1$ and $S_2$ have been generated from the same mean. The user can specify the threshold for rejecting the null hypothesis. Thereby, a change point can be expressed as:

$$cp_k = t \text{ of } pcp_t \qquad (4.6)$$

We recursively execute Eq. (4.3) through Eq. (4.6), starting immediately after the most recently identified change point $cp_k$.

Given the above change point finding algorithm, we define a set of terms ($\omega_t$) whose elements share the same $cp_k$ as follows:

$$\omega_t = \{x_i \in X : cp(x_i) = t, \; 0 \leq i \leq N\} \qquad (4.7)$$

where $t = cp$ and $1 \leq i \leq N$. We also define the set of all $\omega_t$ as follows:

$$\Omega = \{\omega_t : 0 \leq t \leq T\} \qquad (4.8)$$

To reduce the number of expensive similarity computations in the next phase, we cluster $\Omega$ into overlapping subsets like canopies [MNU00] using a computationally inexpensive distance metric. Let $\alpha$ and $\beta$ ($\alpha > \beta$) be distance thresholds (differences in days), which are specified by the user. We start from the time $t$ of an $\omega_t$ with the two distance thresholds. We then assign all term sets whose time is within the distance threshold $\alpha$ of $t$ to a subset $\Omega_p$. Next, we remove from the list all points that are within the distance threshold $\beta$ in $\Omega$. A collection of subsets $\Omega_p$ can be obtained by repeating these clustering steps until $\Omega$ becomes empty. Consequently, each term is assigned to smaller groups within which the expensive similarity measures are performed in the next phase.

4.3 Discovery of Hierarchical Topic Relationships

This section presents how we refine rough cluster structures using an expensive similarity metric and explains how refined clusters are utilized to expand existing ontologies.
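Returning briefly to the rough-clustering step, the change-point procedure of Eqs. (4.3)-(4.6) can be sketched as follows. This is an illustrative reading of the algorithm: a Welch t statistic is compared against a user-supplied threshold, and the function names, threshold value, and sample series are ours rather than the original C++/Matlab implementation.

```python
import math

def welch_t(s1, s2):
    # Eq. (4.5): unequal-variance t statistic between two samples.
    m1, m2 = sum(s1) / len(s1), sum(s2) / len(s2)
    v1 = sum((g - m1) ** 2 for g in s1) / (len(s1) - 1)
    v2 = sum((g - m2) ** 2 for g in s2) / (len(s2) - 1)
    return (m1 - m2) / math.sqrt(v1 / len(s1) + v2 / len(s2))

def find_change_points(tf, z_threshold=3.0):
    """Scan a daily term-frequency series and return detected change points."""
    change_points, start = [], 0
    while start < len(tf) - 3:
        window = tf[start:]
        # Eq. (4.3): slope from the first point of the current segment.
        f = [(window[i] - window[0]) / i for i in range(1, len(window))]
        # Eq. (4.4): putative change point = largest deviation from the slope values.
        pcp = max(range(1, len(window)), key=lambda i: abs(f[i - 1] - window[i]))
        s1, s2 = window[:pcp], window[pcp:]
        if len(s1) > 1 and len(s2) > 1 and abs(welch_t(s1, s2)) > z_threshold:
            change_points.append(start + pcp)      # Eq. (4.6)
            start = start + pcp                    # restart after the accepted cp
        else:
            break
    return change_points

# A toy daily series where the frequency rises and stays high (cf. Figure 4.1).
series = [1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 9, 10, 11, 9, 10, 12, 9, 11, 10, 9]
print(find_change_points(series))
```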
55 4.3.1 Cluster Refinement using an Expensive Similarity Metric This section presents a robust similarity model in order to identify the closeness of the candidate terms in Ω p . Since concepts in ontologies commonly are expressed in few words, it is difficult to quantify the similarity of candidate terms using traditional similarity measures[SH06]. Our approach to this problem is to use documents in order to capture a richer semantic context of candidate terms instead of simply quantifying their string-wise similarity. Specifically, we are only interested in the documents which are posted on the Web during the time frame ofΩ p . Letcp i andcp j be the time range of Ω p . Λ p denotes the set of documents which are published betweencp i andcp j . We now provide precise definitions of the similarity measures that we use. The core of the algorithm is inspired by Chung et al.[CJM06] and Sahami and Heilman[SH06] using a tf-idf vector weighting scheme[SM83]. We incorporate Λ p to generate a tf-idf vector for each candidate term x ∈ Ω p . The vector includes terms which co-occurred in the documents withx. Letv i be a term vector for each documentd i ∈ Λ p where the weightw x,d i associated with termx in documentd is defined to be: w x,d i =freq d i (x)×log( |Λ p | |{d i :x∈ Λ p }| ) (4.9) wherefreq d i (x) is Eq. (4.1),|Λ p | is the total number of documents inΛ p , and|{d i :x∈ Λ p }| is the number of documents where the termx appears. We eliminate all the elements in v i except m highest w x,d i terms using the opti- mal trade-off curve between sophistication and efficiency. We then employ a cosine metric[SM83], which has been widely used in a vector space model. It measures the similarity of two vectors according to the angle between them. Letγ(x) be the centroid vector of all v i ’s. The next step is to measure the relatedness between them. Given 56 two termsx i andx j , the similarity between x i andx j is defined as the inner product as follows: Sim(x i ,x j ) = sigmoid( γ(x i ) kγ(x i )k 2 · γ(x j ) kγ(x j )k 2 ) (4.10) where sigmoid(x) is defined as follows: sigmoid(x) = 1 1+e −x (4.11) To obtain a desirable similarity prediction, we use the sigmoid transfer function because it can divide the whole input space smoothly into a few regions. Using Eq. (4.10), we generate a vector R i for each term x i ∈ Ω p by quantifying the close- ness of candidate terms. R i containsx j ∈ Ω p (i6= j). We then truncate each vectorR i to include itsδ highest related terms based onSim(x i ,x j ). 4.3.2 Ontology Enrichment using Refined Clusters Exploiting a manually developed ontology with a controlled vocabulary is helpful in diverse applications such as query expansion. However, although ontology-authoring tools have been developed in the past decades, manually constructing ontologies when- ever new domains are encountered is a time-consuming process. Moreover, the con- structed ontology should evolve over time given that neologisms or concepts appear as time passes. Thus, we present how to utilize refined clusters to enrich existing ontology. Figure 4.2(A) shows a sample example for a refined cluster. The refined clusters con- tain all the concepts that are highly similar to each other, but hierarchical relationships have not been discovered yet. To extract a relationship within the cluster, we utilizek- nearest density estimation [CJM04]. The idea is simple yet effective. A general concept has lower information content than a specific concept [Res95]. 
Consequently, a general concept has lowerk-nearest density within a cluster than a specific concept has. Based 57 soccer MLS soccer player Los Angeles Galaxy DC United David Beckham Steve McClaren soccer MLS soccer player Los Angeles Galaxy DC United David Beckham Steve McClaren soccer MLS soccer player Los Angeles Galaxy DC United David Beckham Steve McClaren event game injury foot ball basket ball player league non professional transaction professional NFL NBA college league football player team (A) Refined cluster (B) Interrelationship discovery (C) Ontology enrichment Figure 4.2: An example of ontology enrichment process using a refined cluster on these observations, we first perform k-nearest density estimation for each concept within a cluster, and then extract topically hierarchical relations by examining density distribution. For example, a low density concept is marked as a general concept. The transition from Figure 4.2(A) to Figure 4.2(B) indicates this process. Once we identify topical structure within a cluster, the next step is to merge the cluster with the existing ontology. Towards this end, we use the Sports domain ontology in Khan et al [KMH04]. Note that our approach can be coupled with any other type of ontology for ontology enrichment. By identifying the location of a general concept in 58 Figure 4.2(B) at Figure 4.2(C), we can enrich the existing ontology with new concepts. The transition from Figure 4.2(B) to Figure 4.2(C) indicates this process. 4.4 Evaluation The main purpose of this paper is to reduce the number of necessary similarity computa- tions for concept clustering. That is, we want to produce the same quality of clusters by using a significantly small amount of information, instead of using all pairwise term sim- ilarity. Thus, assuming that clusters resulting from using all possible pairwise similarity computation is ground truth data (O b ), we want to measure how many clusters produced by our approach (O a ) are close toO b . Toward this end, we use standard precision/recall metrics as follows: precision = 1 K K X i=1 |O a i ∩O b j | |O a i | (4.12) recall = 1 K K X i=1 |O a i ∩O b j | |O b j | (4.13) F = 2·(precision· recall) (precision+ recall) (4.14) where K is the number of clusters in O a , and a cluster O a i is referred to as a concept O b j cluster if and only if the majority of concepts forO a i belong toO b j . 59 0 20 40 60 80 100 0 5 10 15 20 25 30 35 t (1 ≤ t ≤100) ft t (term frequency) (A) 173 sample term frequencies transition over 100 days 0 20 40 60 80 100 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 cp t (1 ≤ t ≤100) ft cp t (term frequency) (B) 173 terms’ sample change points over 100 days 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 cp t (1 ≤ t ≤100) ft cp t (term frequency) (C) Ω k (1 ≤ k ≤ 19, α = 10, β = 5) Ω k Figure 4.3: We quantified tf t for each term by executing Eq. (4.1) and (4.2) as shown in (A). (B) illustrates the change points of tf t for each term that we found by running Eq. (4.3) through Eq. (4.6). As depicted in (C), the terms were assigned to overlapping smaller groups by creatingΩ k with two tunable tight (α) and loose (β) thresholds for an inexpensive distance measure. 4.4.1 Experiment We conducted our experiment using the Google API 3 to crawl online US news arti- cles. We extracted 173 sample terms from 1,269,940 online news articles for 100 days between May 3, 2007 and August 11, 2007 by the methods that we described in Sec- tion 4.2. 
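The cluster-quality scoring of Eqs. (4.12)-(4.14) used in this evaluation can be sketched as follows; each of our clusters is scored against the ground-truth cluster holding the majority of its concepts. The sample clusters below are fabricated from terms in Table 4.2 purely for illustration.

```python
def cluster_scores(ours, truth):
    """Eqs. (4.12)-(4.14): precision, recall, and F of clusters `ours` vs. ground truth."""
    precision = recall = 0.0
    for a in ours:
        # Ground-truth cluster with the largest overlap (majority assignment).
        b = max(truth, key=lambda t: len(a & t))
        overlap = len(a & b)
        precision += overlap / len(a)
        recall += overlap / len(b)
    precision /= len(ours)
    recall /= len(ours)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

ours = [{"David Beckham", "Los Angeles Galaxy", "MLS"},
        {"Nextel Cup", "Motor Speedway"}]
truth = [{"David Beckham", "Los Angeles Galaxy", "MLS", "soccer"},
         {"Nextel Cup", "Motor Speedway", "Indianapolis 500"}]
print(cluster_scores(ours, truth))
```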
We produced two types of output. On the one hand, we constructed O a by computing similarities that are described in Eq. (4.9) through Eq. (4.10) after identify- ing all terms inX using inexpensive similarity metric. As shown in Figure 4.3, average 34 terms were evenly assigned to eachΩ k (1≤ k≤ 19) when we setα to 10 days andβ to 5 days. On the other hand, we performed the complete similarity computations of 173 terms for structuringO b by skipping the creation ofΩ k . All methods were implemented in C++ and Matlab, and experiments were performed on a Pentium4 CoreDuo 3.40GHz with 3GB memory, enough such that there was no paging activity. 3 http://code.google.com/apis/ 60 4.4.2 Complexity Analysis We determined the complexity of the methods to build bothO a andO b as a function of the number of extracted terms (N). Since the different computational phases contributes to complexity, time complexity (C) ofO a andO b can be expressed as: C(O a ) =O(Eq.4.1)+O(Eq.{4.2 : 4.6})+O(Eq.4.10) C(O b ) =O(Eq.4.1)+O(Eq.4.10) The complexityC(O a ) is different fromC(O b ) in thatC(O a ) has rough cluster identifi- cation step (Eq. (4.2) through Eq. (4.6)), which runs inO(N). Thereby,C(O a ) involves Eq. (4.10) computation time, which isO(L 2 )(L≪N) sinceT is assigned toΩ k , while C(O b ) corresponds to O(N 2 ) where L is the number of candidate terms after rough cluster identification. Therefore, structuring O a is more efficient than constructing O b becauseO(N +L 2 )≪O(N 2 ). 4.4.3 Experimental Results Figure 4.4 describes the distributions of the similarity for identifying related terms in different term-sets: Ω k for structuring O a and all terms in X for building O b . The results clearly show that the former Figure 4.4(A) has the log-normal distribution while the latter Figure 4.4(B) has a distribution with long-tailed behavior. Hence, almost 99% of similarity computation in Figure 4.4(B) was wasteful. Computational time increases since more pairwise similarity computations are needed, as we increaseα and decrease β. Table 4.1 presents a summary of the experimental results that computational time increases since more pairwise similarity computations are needed, as we increaseα and decreaseβ. 61 0 100 200 300 400 500 0 0.2 0.4 0.6 0.8 1 pair of (x i , x j ) (0 ≤ i ≤ N, 0 ≤ j ≤ N, N = (|Ω k | 2 − |Ω k |)/2, and i ≠ j) sim(x i , x j ) (A) Sorted distribution of sim(x i , x j ) ∈ Ω k (1 ≤ k ≤ 19) 0 2000 4000 6000 8000 10000 12000 14000 0 0.2 0.4 0.6 0.8 1 pair of (x i , x j ) (0 ≤ i ≤ 14878, 0 ≤ j ≤ 14878, and i ≠ j) sim(x i , x j ) (B) Sorted distribution of sim(x i , x j ) ∈ complete X Figure 4.4: We compared the distributions of similarities for structuring both O a and O b using Eq. (4.10). (A) shows the sorted distribution of pairwise similarities of terms in overlapping smaller groups. (B) presents the sorted distribution of complete pairwise similarities of 173 terms. Table 4.1: The precision, recall, F-measure, and time costs of different distance thresh- olds α and β for structuring O a . Note that the time cost of structuring O b was 394.67 mins. Based on our observations, news articles have a near-weekly cycle so that the best performingα andβ were 10 and 5 respectively, which is indicated in bold. 
α β Precision Recall F-Measure Minutes 3 1 0.761 0.387 0.513 15.59 5 1 0.709 0.497 0.584 17.47 5 3 0.899 0.572 0.699 23.18 7 1 0.880 0.775 0.824 19.42 7 3 0.864 0.809 0.836 21.18 7 5 0.869 0.838 0.853 23.15 10 1 0.841 0.838 0.840 22.01 10 3 0.956 0.919 0.937 24.52 10 5 1.00 0.965 0.982 25.61 15 1 0.920 0.867 0.892 37.26 15 3 0.962 0.908 0.934 27.05 15 5 0.962 0.908 0.934 30.17 15 10 1.00 0.965 0.982 32.05 62 Construction ofO a was 1541.07% faster thanO b where the distance thresholdsα and β were respectively 10 and 5. Therefore, it demonstrates that our framework remark- ably reduces the number of required computations for clustering without loss of the precision/recall. Table 4.2 shows that the clustered sample terms using our concept clustering method. Table 4.2: Sample concept clusters with centroids Centroids Clusters Alan Greenspan Commerce Department American Idol Paula Abdul, Jordin Sparks American Legion Memorial Park, National Cemetery, Memorial Day Parade, Marine Corps, Civil War, Korean War, Veterans Memorial Business OMX AB, Commerce Department, Toll Brothers, Alan Greenspan, Commerce Department Barack Obama Iraq and Afghanistan, Hillary Clinton, House and Senate, John Edwards Baseball Tournament Conference Baseball Bertie Ahern Fianna Fail, Fine Gael Blue Jays Roy Halladay, Zack Greer Brian Cashman George Steinbrenner Busch Series Nextel Cup, Motor Speedway Carlos Boozer Lake City, Deron Williams Casey Mears Nextel Cup, Indianapolis 500, Motor Speedway Civil War National Cemetery Commerce Department Toll Brothers Continued on next page 63 Table 4.2 – continued from previous page Centroids Clusters Corning Classic Young Kim Czech Republic French Open Danica Patrick Helio Castroneves, Indy 500, Indianapolis 500, Tony Kanaan, Dario Franchitti, Motor Speedway, Marco Andretti Dario Franchitti Helio Castroneves, Indy 500, Indianapolis 500, Tony Kanaan, Dario Franchitti, Marco Andretti David Beckham Steve McClaren, Los Angeles Galaxy Deron Williams Lake City, LAKE CITY , Deron Williams Detroit Pistons Rasheed Wallace, LeBron James Entertainment Paula Abdul, Orlando Bloom, Star Wars, Keira Knightley, Jordin Sparks, Walt Disney, Johnny Depp, American Idol, Paula Abdul Eduardo Romero PGA Championship Fernando Alonso Lewis Hamilton, Monaco Grand Prix Fianna Fail Fine Gael French Open Justine Henin, Roland Garros, John McEnroe, Nikolay Davydenko, Grand Slam, French Open, Marat Safin George Bush John Edwards Grand Slam Justine Henin, Roland Garros, John McEnroe, Nikolay Davydenko Helio Castroneves Helio Castroneves, Indy 500, Indianapolis 500, Tony Kanaan, Motor Speedway, Marco Andretti Continued on next page 64 Table 4.2 – continued from previous page Centroids Clusters Indianapolis 500 Nextel Cup, Indy 500, Tony Kanaan, Motor Speedway, Marco Andretti Indy 500 Tony Kanaan, Motor Speedway, Marco Andretti Interior Ministry Viktor Yanukovych, Iraq and Afghanistan National Cemetery, Marine Corps Jack Sparrow Orlando Bloom, Keira Knightley, Walt Disney, Johnny Depp Johnny Depp Orlando Bloom, Keira Knightley, Walt Disney Justin Rose PGA Championship Justine Henin Roland Garros Keira Knightley Orlando Bloom, Walt Disney LeBron James Rasheed Wallace Lewis Hamilton Monaco Grand Prix Marco Andretti Tony Kanaan, Motor Speedway Matt Kenseth Motor Speedway Memorial Park Veterans Memorial Military Academy West Point Motor Speedway Nextel Cup, Tony Kanaan Nasdaq Stock Market OMX AB Nikolay Davydenko Roland Garros Orlando Bloom Walt Disney Politics Fine Gael, West Point, John Edwards, Bertie Ahern, 
Fianna Fail, Fine Gael Continued on next page 65 Table 4.2 – continued from previous page Centroids Clusters Sport Helio Castroneves, Young Kim, Justine Henin, Indy 500, Indianapolis 500, Tony Kanaan, Zack Greer, Rasheed Wal- lace, Steve McClaren, Marat Safin, Marco Andretti, Deron Williams, George Steinbrenner, French Open, Nextel Cup, Monaco Grand Prix, Roland Garros, National Cemetery, Roy Halladay, John McEnroe, Memorial Day Parade, Niko- lay Davydenko, Lake City, Lewis Hamilton, Grand Slam, Dario Franchitti, Motor Speedway, LAKE CITY , LeBron James, PGA Championship, Conference Baseball World National Cemetery, Marine Corps, Viktor Yanukovych, Fianna Fail, Civil War, Iraq and Afghanistan, Hillary Clin- ton, Veterans Memorial, House and Senate, John Edwards, Barack Obama, Iraq and Afghanistan, Hillary Clinton 4.5 Summary Given large quantities of online information with many billions of terms, quantifying all pairwise similarities of terms is very computationally expensive. Hence, the develop- ment of efficient techniques for clustering is crucial. In this paper, we have focused on efficiency in concept clustering. Our approach is to produce a rough cluster structure by reducing the number of required computations based on an event life cycle on the Web 66 and then to refine the cluster structure. We also presented how refined clusters can be used to enrich existing ontology. 67 Chapter 5 Related Work This chapter reviews an extensive amount of literature related to our ontology-based information federation solution and discusses in detail how our solution advances the state of the art. The relevant research area is information systems interoperability, which in this dissertation focuses on two categories: architecture for interoperable information (database) systems and methods of discovering semantic mappings, like schema match- ing. Section 5.1 provides a survey on previous methods of integrated access to infor- mation systems. In Section 5.2, a brief survey on (semi-)automatic schema matching work is presented. Finally, Section 5.3 gives an overview of different methods that learn ontologies or ontology-like structures from unstructured text. 5.1 Information Systems Interoperability In this section, we present a short survey of the diverse techniques for integrating multi- ple information sources. Section 5.1.1 presents a classification that enables us to group related works based on the type of systems being integrated and the integration archi- tecture. We also analyze and show the information integration step in Section 5.1.2. 5.1.1 Classification of Architecture In part of summarizing earlier surveys, Fileto et al. [FM03] classified different types of integrated architecture into multiple information sources (databases). The first class is schema integration [AB91, CRE87, GSSC95, V A96, AM99, KDN90, EP90, JN86] that 68 Figure 5.1: Heterogeneity resolution in federated database system provides a composite schema through which information sharing and exchange occurs. As illustrated in Figure 1.1, the schema of each distributed database is a global view of composite schema. As stated in Section 1.1, the composite schema approach is limited in that it is difficult to handle the addition of information sources since it suffers from a noticeable loss of information or an increase in schema-generating complexity. 
The second class is the federated database [LA86, LMR90, SL90, KM96] that each information repository, as shown in Figure 5.1, can access and exchange data by import- ing and exporting the sharable portion of its conceptual schema [FM03, AM99]. This approach allows heterogeneous components with different data model or schema. Infor- mation sharing and exchange occurs by analyzing the actual concepts implied by indi- vidual database elements, by investigating inter-concept relationships, and by deriving the meanings of unknown concepts when necessary. The main limitation of this cate- gory is the difficulty of agreeing on a set of concepts and inter-concept relationships in a federation environment that is dynamic on which evolves. 69 Figure 5.2: Conceptual procedure of information integration 5.1.2 Process of Integration Parent et at. [PS98] divided a collection of integration processes into three major steps: 1. Heterogeneous data are first converted to a homogeneous format, using transfor- mation rules that explain how to transform data from the source data model to the target data model. 2. The correspondence between elements from heterogeneous sources are investi- gated, by employing semantic descriptions and similarity rules. 3. The correspondence assertions and integration rules are used to produce an inte- gration specification, which describes how data elements from heterogeneous sources must be transformed and mixed to produce a unified view. Figure 5.2 depicts the overall flow of integrating multiple information sources. Inte- grating multiple information sources is inherently regarded as a tedious and error-prone 70 task so that it ultimately requires human labor. In particular, the techniques of the seman- tic matching in step 2 are inherently difficult to automate since schemas typically con- tain limited information. Several previous attempts at automation are presented in the following Section 5.2. In addition, as we discussed in Section 1.2, the above approaches have the following limitations: • Schema integration: it is difficult to manage the composite schema in a highly scalable environment. • Federated database: information loss may occur during the transformation since one data model needs to be converted into another. To address these problems, we build a federated architecture that extracts ontolo- gies using a sharable canonical representation model from diverse information sources. Section 2 presents this resolution in detail. 5.2 Semantic Matching As shown in Section 5.1, schema matching is a pivotal part of the information integra- tion process. In this section, we present previous schema matching techniques together with ontology mapping. Since ontologies can be viewed as schemas for knowledge bases, both techniques developed for both problems are mutually beneficial. In most conventional systems, schema matching is performed manually. Many previous studies have tried to partially automate identifying the semantic matches between two schemas. 71 5.2.1 Schema Matching Rahm et al. [RB01] classified schema matching techniques into schema-level match- ers, instance-level matchers, element-level matchers, structure-level matchers, linguistic matchers, and constraint-based matchers. • Schema-level matchers only consider schema information, which includes the usual properties of schema elements, such as name, description, data type, rela- tionship type, constraints, and schema structure. 
• Instance-level matchers utilize instance data to obtain important insight into the contents and meaning of schema elements. It is useful where schema is not provided, e.g. semi-structured data, but can be partially constructed. It can also be used as a check against schema-level approach. • Element-level matchers match schema elements like attributes, fields, and columns, in isolation without considering relative parent or substructure. • Structure-level matchers match schema structures like sub-trees or tables. For instance, combinations of elements form structures. It can have full or partial structural matches. • Linguistic matchers consider similarities in element names, descriptions, instance values, and so on. It examines equality, canonical extraction, synonyms, hypernyms, common forms, or user-provided matches. • Constraint-based matchers consider similarities in constraint information such as cardinalities, relationships, data types, and value constraints. Some proposed mapping techniques at schema level or instance level, while others introduced schema matching with a mixture of both schema and instance level. The 72 combining matcher integrates several matching approaches into one system to deter- mine ranked candidates or to ensemble results of several independent matchers. In the following list, we review some previous schema matching implementations in databases based on [RB01]. ARTEMIS and MOMIS : ARTEMIS [CAdV01, CA99] clusters schema attributes based on affinities and predicts mappings using schema matching techniques. MOMIS [BCV99] is a database mediator that integrates independently developed schemas into virtual global schema. COMA [DR02] is a composite matcher based on comprehensive matcher library and various combination strategies. Cupid [MBR01] is a generic schema matcher which applied to XML and relational schemas. It is composed of 3 phase algorithm; linguistic processing, structural processing and, evaluating weighted mean of similarity coefficients and deter- mining mapping. DIKE [PTU03] is a system to automatically determine synonym, homonym, “is-a”, and hypernym relationships between objects in entity-relational schemas. It uses schema matching techniques to determine similarities between objects LSD and GLUE [DDH01, DMD + 03] is multi-strategy machine-learning approach. The matching process is divided into training phase and matching phase. Finally, it produces automatic trained composition of match results by incorporating user- supplied, domain-specific constraints SemInt [LC00] match prototype supports up to 15 constraint-based and 5 content- based matching criteria. It determines match signature and considers Euclidean distance between signatures using neural networks 73 SKAT [MWJ99] is rule-based and semi-automatic schema matching framework. First- order logic rules express match and mismatch relationships. It intended for ontol- ogy matching, matching based on “is-a” relationships. TransScm [MZ98] is a rule-based matchers. It automatically translates data between schema instances. Schemas internally represented as labeled graphs. 5.2.2 Ontology Matching Many of the existing information integration systems, such as [MKSI96, PyHG + 00], use multiple ontologies to describe the information. We discuss general approaches that are used in information integration systems. In the previous survey, Wache et al. [WVV + 01] categorized ontology mapping approaches as defined mappings, lexical relations, top- Level grounding, and semantic correspondences. 
5.2.2 Ontology Matching

Many existing information integration systems, such as [MKSI96, PyHG+00], use multiple ontologies to describe the information. We discuss general approaches that are used in information integration systems. In a previous survey, Wache et al. [WVV+01] categorized ontology mapping approaches as defined mappings, lexical relations, top-level grounding, and semantic correspondences.

• Defined mappings are a common approach to the ontology mapping problem that provides the possibility to define mappings explicitly. Different kinds of mappings are distinguished in this approach, from simple one-to-one mappings between classes and values up to mappings between compound expressions. This approach allows great flexibility, but it fails to ensure a preservation of semantics: the user is free to define arbitrary mappings, even mappings that make no sense or produce conflicts.

• Lexical relations are an attempt to provide at least intuitive semantics for mappings between concepts in different ontologies. These approaches extend a common description logic model with quantified inter-ontology relationships borrowed from linguistics.

• Top-level grounding: in order to avoid a loss of semantics, one has to stay inside the formal representation language when defining mappings between different ontologies. A straightforward way to stay inside the formalism is to relate all ontologies used to a single top-level ontology. This can be done by inheriting concepts from a common top-level ontology, and it can be used to resolve conflicts and ambiguities.

• Semantic correspondences try to overcome the ambiguity that arises from an indirect mapping of concepts via a top-level grounding and attempt to identify well-founded semantic correspondences between concepts from different ontologies. To avoid arbitrary mappings between concepts, these approaches have to rely on a common vocabulary for defining concepts across different ontologies.

The above approaches were taken in various previous studies. The following list briefly describes the mapping techniques in these works.

KRAFT [PyHG+00] employs the defined mapping approach, where translations between different ontologies are performed by special mediator agents that can be customized to translate between different ontologies and even different languages.

OBSERVER [MKSI96] takes the lexical relations approach. In OBSERVER, the relationships used are synonym, hypernym, hyponym, overlap, covering, and disjoint. While these relations are similar to constructs used in description logics, they do not have formal semantics. Consequently, the subsumption algorithm is more heuristic than formally grounded.

DWQ [CGL01] uses the top-level grounding approach. While it allows establishing connections between concepts from different ontologies in terms of common superclasses, it does not establish a direct correspondence. This can lead to problems when exact matches are required.

Wache et al. [WSSKR99] incorporate the semantic correspondence approach. They use semantic labels in order to compute correspondences between database fields.

As we have discussed, most previous research in schema matching exploits the information contained in the schema itself, such as elements, instances, structures, and constraints, while the ontology community has attempted to enhance schema matching methods to identify mappings. However, this sort of information provides only very limited semantic information about the databases [SM06], since the matching accuracy of these systems depends entirely on the quantity of similar aspects of schema information. To tackle this problem, we exploit more general information to measure the similarity between two schemas by incorporating an ontology together with natural language processing techniques. In our approach, the role of ontologies is to provide a measure of similarity over a hierarchical structure of concepts. Chapter 3 presents this approach in detail.
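As a rough illustration of how a concept hierarchy can supply such a similarity measure, the sketch below scores pairs of schema terms with the Wu-Palmer measure over WordNet's noun hierarchy, using the NLTK interface. The term pairs are illustrative assumptions, and this is only a sketch of the general idea, not the matching framework of Chapter 3 itself.

# Sketch: hierarchy-based semantic similarity between schema terms using WordNet.
# Assumes NLTK and its WordNet corpus are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def concept_similarity(term1: str, term2: str) -> float:
    """Best Wu-Palmer similarity over all noun-sense pairs of the two terms."""
    synsets1 = wn.synsets(term1, pos=wn.NOUN)
    synsets2 = wn.synsets(term2, pos=wn.NOUN)
    best = 0.0
    for s1 in synsets1:
        for s2 in synsets2:
            # Wu-Palmer: based on the depth of the least common subsumer
            score = s1.wup_similarity(s2)
            if score is not None and score > best:
                best = score
    return best

if __name__ == "__main__":
    # Illustrative schema-element names; a high score suggests a candidate mapping.
    for a, b in [("instructor", "professor"), ("salary", "wage"), ("salary", "address")]:
        print(a, b, round(concept_similarity(a, b), 3))

Hierarchy-based scores of this kind complement name-based and instance-based evidence because they can relate terms that share no surface form at all.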
5.3 Ontology Learning

Ontologies provide an explicit model for structuring concepts, along with their interrelationships. By collectively presenting an abstract view of a certain domain, ontologies can facilitate text understanding and automatic processing of documents. Therefore, ontology construction has become one of the key issues in information management. Most existing ontologies, such as WordNet [Mil94], CYC [LPS86], and their non-English counterparts, e.g. EuroWordNet [VDOP97] and CoreNet [CB03], were manually created and are maintained by ontology engineers. Hand-crafted ontologies are generally constructed by extracting important concepts from information sources such as the Web. Although many ontology authoring tools, such as Protégé [NFM00] and KAON [BEH+02], have been developed to help ontology engineers construct ontologies, these manual constructions are still laborious and error-prone. Therefore, hand-crafted work has become the main bottleneck in ontology creation.

Instead of employing manual construction, many recent studies have introduced ontology learning, which can be divided into constructing ontologies from scratch and extending existing ontologies. Inherently, it integrates multiple complementary techniques, such as natural language processing and data mining. An extensive amount of literature has addressed ontology learning in the context of natural language processing and data mining. Section 5.3.1 summarizes some ontology learning methods based on natural language processing techniques. Section 5.3.2 presents other techniques that incorporate data mining techniques. Finally, Section 5.3.3 introduces a hybrid method that combines natural language processing and data mining methods.

5.3.1 Natural Language Processing for Ontology Learning

Some studies have employed rules for acquiring lexical entries, and these rules have taken inspiration from natural language processing techniques [Bie05]. Hearst [Hea92] used patterns that capture interrelationships among terms. Caraballo constructs ontologies based on conjunction and appositive data of noun candidates from news articles; using the tf-idf vector scheme, a similarity is computed for hierarchical bottom-up clustering.

5.3.2 Data Mining for Ontology Learning

Numerous research efforts for building interrelationships among concepts have profited from clustering methods, exploiting similarities of concepts to propose a hierarchy of concept categories [KMS04]. Many works have utilized clustering methods in ontology learning, where terms are associated by specified similarity measures. Most methods use a single similarity metric, which is usually either slow and accurate or fast and less accurate. To obtain both high speed and high precision, McCallum et al. [MNU00] used two different similarity measures: quick, cruder metrics first generate smaller clusters, and refined metrics are then applied within those smaller clusters. These inspiring works provided us with a concrete foundation for further enhancing the contemporary ontology learning framework, and we build on this foundation in the remainder of this dissertation.
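The following is a minimal sketch of that two-stage idea, under assumed data structures: a cheap token-overlap metric first groups terms into rough canopies, and the more expensive vector-based metric is evaluated only within each canopy. The context sets, weighted vectors, and thresholds are illustrative assumptions rather than the measures used in [MNU00].

# Sketch of two-stage (canopy-style) clustering in the spirit of [MNU00]:
# a cheap metric forms rough groups; the expensive metric runs only inside them.
# The data structures and thresholds are illustrative assumptions.
import math

def cheap_overlap(t1: set, t2: set) -> float:
    """Inexpensive metric: Jaccard overlap of term-context token sets."""
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

def expensive_cosine(v1: dict, v2: dict) -> float:
    """More costly metric: cosine similarity over weighted context vectors."""
    dot = sum(w * v2.get(k, 0.0) for k, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def make_canopies(terms, contexts, loose=0.2):
    """Greedy canopy formation using only the cheap metric."""
    remaining, canopies = set(terms), []
    while remaining:
        center = remaining.pop()
        canopy = {center} | {t for t in remaining
                             if cheap_overlap(contexts[center], contexts[t]) >= loose}
        remaining -= canopy
        canopies.append(canopy)
    return canopies

def related_pairs(canopies, vectors, tight=0.6):
    """Expensive comparisons restricted to term pairs sharing a canopy."""
    pairs = []
    for canopy in canopies:
        members = sorted(canopy)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if expensive_cosine(vectors[a], vectors[b]) >= tight:
                    pairs.append((a, b))
    return pairs

Because the expensive metric is applied only within small canopies rather than across all term pairs, its quadratic cost is confined to small groups, which is precisely the source of the efficiency gain.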
5.3.3 Hybrid Method

This section presents ontology learning techniques that use the specificity of a term. Ryu and Choi [RC06] quantify term specificity with a multi-strategy measure that combines the above two approaches, natural language processing and data mining. Term specificity measures fall into two different approaches: a term occurrence-based statistical approach and a grammatical structure-based statistical approach.

The former assumes that a more specific term is more informative and is composed of component words whose frequencies in documents are lower. Since more specific terms are placed at a deeper level in a term hierarchy, term specificity can be used to determine the depth of a term in the hierarchy when structuring IS-A relationships between terms. To implement this idea, Ryu and Choi incorporated the mutual information between a term and its component words. Let $y_k \in Y_{x_l}$ ($1 \le k \le |x_l|$) be a component word of $x_l \in R_i$ ($1 \le l \le |R_i|$). Since mutual information quantifies the relatedness of two words, the mutual information between $x_l$ and $y_k$ is computed from the probability of their co-occurrence and the probabilities of observing $x_l$ and $y_k$ independently:

$$MI(x_l, y_k) = \log \frac{P(x_l, y_k)}{P(x_l) \times P(y_k)} \qquad (5.1)$$

Let $K$ be a randomly selected word from $Y_{x_l}$. Since $MI(x_l, K)$ represents the reduction of uncertainty about $x_l$ when $Y_{x_l}$ is unknown, the specificity of $x_l$ is defined as

$$\mathrm{Spec}_1(x_l) \approx MI(x_l, K) \approx \frac{tf_i(x_l)}{|\{y_k \in R_i\}|} \sum_{y_k \in Y_{x_l}} \rho \cdot \log \frac{|R_i|}{tf_i(y_k)} \qquad (5.2)$$

where $tf_i(x)$ is defined in Eq. (4.2).

The latter approach is based on the assumption that specific terms are rarely modified, while general terms are frequently modified. For example, Caraballo and Charniak [CC99] quantify term specificity using the distribution of modifiers. Let $P(mod_k, x_i)$ be the probability that $mod_k$ modifies $x_i$. The entropy of the probabilistic distribution of modifiers for a term is then defined as

$$H_{mod}(x_i) = - \sum_{1 \le k \le L} P(mod_k, x_i) \log P(mod_k, x_i) \qquad (5.3)$$

where $L$ is the number of modifiers of $x_i \in R_i$. Since more specific terms have lower modifier entropy, they carry a larger quantity of information:

$$\mathrm{Spec}_2(x_l) \approx \max_{1 \le j \le |R_i|} H_{mod}(x_j) - H_{mod}(x_l) \qquad (5.4)$$

By combining these two measures, the term specificity is finally defined as the weighted harmonic mean

$$\mathrm{Spec}(x_l) = \frac{1}{\nu \left(\frac{1}{\mathrm{Spec}_1(x_l)}\right) + (1 - \nu) \left(\frac{1}{\mathrm{Spec}_2(x_l)}\right)} \qquad (5.5)$$

where $\nu$ is the weight balancing Eq. (5.2) and Eq. (5.4).

Our research presented in this dissertation focuses on improving the efficiency of the ontology learning process. Our approach is to identify rough clusters of relevant terms using a computationally inexpensive distance metric [MNU00] based on an event life cycle [Kle03] in online news articles, before performing more sophisticated similarity computations. Hence, our approach is distinguished from the above studies in that it can greatly decrease the computations required for building ontologies by identifying related terms within smaller groups.
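As a hedged illustration of the modifier-entropy component and the weighted harmonic combination in Eqs. (5.3)-(5.5), the sketch below computes specificity from toy modifier counts. The counts, the stand-in Spec_1 values, and the weight nu = 0.5 are illustrative assumptions, not values from [RC06].

# Sketch of the combined term-specificity measure of Eqs. (5.3)-(5.5).
# Modifier counts and the stand-in Spec_1 values are illustrative assumptions.
import math

def modifier_entropy(modifier_counts: dict) -> float:
    """H_mod(x): entropy of a term's modifier distribution (Eq. 5.3)."""
    total = sum(modifier_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in modifier_counts.values() if c > 0)

def spec2(term, modifier_counts_by_term):
    """Spec_2(x): maximum entropy over the corpus minus the term's entropy (Eq. 5.4)."""
    entropies = {t: modifier_entropy(m) for t, m in modifier_counts_by_term.items()}
    return max(entropies.values()) - entropies[term]

def combined_spec(s1: float, s2: float, nu: float = 0.5) -> float:
    """Weighted harmonic combination of the two specificity scores (Eq. 5.5)."""
    return 1.0 / (nu * (1.0 / s1) + (1.0 - nu) * (1.0 / s2))

if __name__ == "__main__":
    # General terms ("animal") attract many distinct modifiers; specific
    # terms ("labrador retriever") attract few, so their entropy is lower.
    modifiers = {
        "animal": {"wild": 5, "small": 4, "large": 4, "domestic": 3, "rare": 2},
        "dog": {"small": 4, "large": 3, "stray": 2},
        "labrador retriever": {"yellow": 2},
    }
    spec1 = {"animal": 0.8, "dog": 1.6, "labrador retriever": 2.9}  # assumed stand-ins
    for term in modifiers:
        s2 = max(spec2(term, modifiers), 1e-6)  # guard against a zero score
        print(term, round(combined_spec(spec1[term], s2), 3))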
Chapter 6

Conclusion

Federation is a critical step in numerous information management applications. This research presented an ontology-based semantic integration of heterogeneous information sources in three areas: the ontology-based information federation platform, the ontology-driven schema matching framework, and the efficient concept clustering technique for ontology learning. Manual integration of diverse information sources is very expensive. Hence, it is important to develop techniques that automate the integration process with a single canonical data model. Given the rapid proliferation and the growing diversity of applications today, automatic techniques for information integration have become even more important. This dissertation has contributed both to understanding the problem of integrating diverse information sources and to developing a matching framework using ontologies. This chapter summarizes the key contributions of the dissertation and proposes future research.

6.1 Key Contributions

This study makes three major contributions. The first contribution is a solution architecture that resolves conflicts in the structure and in the semantics of existing information sources. We define a unifying representation model for structuring ontologies, which can extract a canonical representation from a broad range of metadata models, including relational databases, XML, RDF, OWL, and DAML+OIL. Our representation model is simple yet powerful enough to extend the semantic integration by identifying mappings from the existing information sources to a newly added source. Consequently, this incremental integration approach avoids maintaining a broad global schema, a capability that is missing from most previous integration works [AB91, CRE87, GSSC95, VA96, AM99, KDN90].

The second major contribution of the research is a solution for automatically creating semantic mappings. The mapping framework finds semantic correspondences between information sources. This mapping framework can improve matching accuracy, since it utilizes semantic information about the data that is captured by ontologies, which the traditional methods [DMD+03, DLD+04, DDH01, MZ98, PTU03, MBR01, CAdV01] cannot exploit.

Another important contribution of this dissertation is a solution architecture that provides a well-founded, rapid ontology learning framework based on reducing the use of expensive similarity measures by first identifying rough concept clusters in a large dataset. Such a framework lets ontology engineers focus on validating the interrelationships, rather than structuring them from numerous information resources. Eventually, this drastically reduces the effort and time required of the engineers. It can also be utilized for a broad range of ontology-driven applications that require up-to-date ontologies. The generated ontologies continuously maintain up-to-date interrelationships among concepts by detecting event life cycles on the Web. We envision that these ontologies will be an important resource for query refinement in search engines and for ontology-driven matching solutions [SM06].

6.2 Future Directions

We have made significant inroads toward understanding and developing solutions for the federation of diverse information sources, but substantial work remains to achieve the goal of a comprehensive integration solution. In what follows, we discuss the directions of our future work.

Figure 6.1: Inefficient schema matching

Our schema matching algorithm would become a bottleneck when we try to find mappings of a given schema against the numerous schemas of the existing information sources. As illustrated in Figure 6.1, it is not efficient to find matches as n grows when performing 1-to-n schema comparisons. In part, this is because the algorithm must exploit evidence present in all schemas of the existing information sources in order to discover the mappings of a schema requested by users.

We intend to extend this research by further enhancing the schema mapping framework to improve the robustness of the schema matching algorithms. Our approach will leverage schema clustering to reduce the number of schema comparisons. A cluster offers a collection of similar schemas and, therefore, can be leveraged for multiple purposes.
Figure 6.2: Schema clustering

Figure 6.2 illustrates how the number of candidate schemas can be decreased by clustering schemas. In our example in Figure 6.2, the given schema only needs to be compared with the m schemas in schema cluster 1, out of the n existing schemas (n ≥ m). If the existing schemas are evenly scattered into k clusters (n ≥ k), the number of comparisons is dramatically reduced. A minimal sketch of this idea appears below.
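The following is a minimal sketch of how such schema clustering could be realized. The signature function (a bag of lowercased attribute-name tokens), the single-pass threshold clustering, and the toy schemas are illustrative assumptions about one possible pre-filtering step, not a finished design.

# Sketch of clustering existing schemas so a new schema is compared only
# against members of its nearest cluster. Signatures, threshold, and the
# toy schemas are illustrative assumptions.

def signature(schema):
    """Cheap schema signature: the set of lowercased attribute-name tokens."""
    tokens = set()
    for attr in schema:
        tokens.update(attr.lower().replace("_", " ").split())
    return tokens

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def cluster_schemas(schemas, threshold=0.3):
    """Single-pass clustering: each schema joins the first cluster it resembles."""
    clusters = []  # each cluster is a list of (name, signature) pairs
    for name, schema in schemas.items():
        sig = signature(schema)
        for cluster in clusters:
            if jaccard(sig, cluster[0][1]) >= threshold:  # compare against the cluster seed
                cluster.append((name, sig))
                break
        else:
            clusters.append([(name, sig)])
    return clusters

def candidate_schemas(new_schema, clusters):
    """Return only the members of the closest cluster as matching candidates."""
    sig = signature(new_schema)
    best = max(clusters, key=lambda c: jaccard(sig, c[0][1]))
    return [name for name, _ in best]

if __name__ == "__main__":
    existing = {
        "faculty": ["emp_name", "office_phone", "dept"],
        "professor": ["name", "phone", "department"],
        "course": ["course_id", "title", "credits"],
    }
    clusters = cluster_schemas(existing)
    print(candidate_schemas(["instructor_name", "phone"], clusters))

Only the members of the nearest cluster are then handed to the full matching framework, so the expensive 1-to-n comparison shrinks to roughly 1-to-m.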
References

[AB91] Serge Abiteboul and Anthony J. Bonner. Objects and views. In SIGMOD Conference, pages 238–247, 1991.

[AM99] Goksel Aslan and Dennis McLeod. Semantic heterogeneity resolution in federated databases by metadata implantation and stepwise evolution. VLDB J., 8(2):120–132, 1999.

[AvH04] Grigoris Antoniou and Frank van Harmelen. Web ontology language: OWL. In Handbook on Ontologies, pages 67–92. 2004.

[BCV99] Sonia Bergamaschi, Silvana Castano, and Maurizio Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Record, 28(1):54–59, 1999.

[BEH+02] Erol Bozsak, Marc Ehrig, Siegfried Handschuh, Andreas Hotho, Alexander Maedche, Boris Motik, Daniel Oberle, Christoph Schmitz, Steffen Staab, Ljiljana Stojanovic, Nenad Stojanovic, Rudi Studer, Gerd Stumme, York Sure, Julien Tane, Raphael Volz, and Valentin Zacharias. KAON - towards a large scale semantic web. In Kurt Bauknecht, A. Min Tjoa, and Gerald Quirchmayr, editors, E-Commerce and Web Technologies, Third International Conference, EC-Web 2002, Aix-en-Provence, France, September 2-6, 2002, Proceedings, volume 2455 of LNCS, pages 304–313. Springer, 2002.

[BGF+97] Stéphane Bressan, Cheng Hian Goh, Kofi Fynn, Marta Jessica Jakobisiak, Karim Hussein, Henry B. Kon, Thomas Lee, Stuart E. Madnick, Tito Pena, Jessica Qu, Annie W. Shum, and Michael Siegel. The context interchange mediator prototype. In SIGMOD Conference, pages 525–527, 1997.

[Bie05] Chris Biemann. Ontology learning from text: A survey of methods. LDV Forum, 20(2):75–93, 2005.

[BL05] Tim Berners-Lee. WWW at 15 years: looking forward. In WWW, page 1, 2005.

[BS93] T. Back and H. P. Schwefel. An overview of evolutionary algorithms for parameter optimization, volume 1. MIT Press, 1993.

[CA99] Silvana Castano and Valeria De Antonellis. ARTEMIS: Analysis and reconciliation tool environment for multiple information sources. In SEBD, pages 341–356, 1999.

[CAdV01] Silvana Castano, Valeria De Antonellis, and Sabrina De Capitani di Vimercati. Global viewing of heterogeneous data sources. IEEE Trans. Knowl. Data Eng., 13(2):277–297, 2001.

[CB03] Key-Sun Choi and Hee-Sook Bae. A Korean-Japanese-Chinese aligned wordnet with shared semantic hierarchy. In ICADL, page 690, 2003.

[CC99] Sharon A. Caraballo and Eugene Charniak. Determining the specificity of nouns from text. In the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70, 1999.

[CGL01] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. Ontology of integration and integration of ontologies. In Description Logics, 2001.

[CJM04] Seokkyung Chung, Jongeun Jun, and Dennis McLeod. Mining gene expression datasets using density-based clustering. In CIKM, pages 150–151, 2004.

[CJM06] Seokkyung Chung, Jongeun Jun, and Dennis McLeod. A web-based novel term similarity framework for ontology learning. In OTM Conferences (1), pages 1092–1109, 2006.

[Col96] Michael Collins. A new statistical parser based on bigram lexical dependencies. In ACL, pages 184–191, 1996.

[Col97] Michael Collins. Three generative, lexicalised models for statistical parsing. In ACL, pages 16–23, 1997.

[CRE87] Bogdan D. Czejdo, Marek Rusinkiewicz, and David W. Embley. An approach to schema integration and query formulation in federated database systems. In ICDE, pages 477–484, 1987.

[CS04] Jeremy J. Carroll and Patrick Stickler. RDF triples in XML. In WWW (Alternate Track Papers & Posters), pages 412–413, 2004.

[DDH01] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In SIGMOD Conference, 2001.

[DLD+04] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Y. Halevy, and Pedro Domingos. iMAP: Discovering complex mappings between database schemas. In SIGMOD Conference, pages 383–394, 2004.

[DMD+03] AnHai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos, and Alon Y. Halevy. Learning to match ontologies on the semantic web. VLDB J., 12(4):303–319, 2003.

[DR02] Hong Hai Do and Erhard Rahm. COMA - a system for flexible combination of schema matching approaches. In VLDB, pages 610–621, 2002.

[EP90] Ahmed K. Elmagarmid and Calton Pu. Guest editors' introduction to the special issue on heterogeneous databases. ACM Comput. Surv., 22(3):175–178, 1990.

[ES04] Marc Ehrig and Steffen Staab. QOM - quick ontology mapping. In GI Jahrestagung (1), pages 356–361, 2004.

[FM03] Renato Fileto and Claudia Bauzer Medeiros. A survey on information systems interoperability, 2003.

[GGD+05] Lisa B. Grant, Miryha M. Gould, Andrea Donnellan, Dennis McLeod, Anne Yun-An Chen, Sangsoo Sung, Marlon Pierce, Geoffrey C. Fox, and Paul Rundle. A web services-based universal approach to heterogeneous fault databases. Computing in Science and Engg., 7(4):51–57, 2005.

[GMKL01] C. R. Gallistel, T. A. Mark, A. P. King, and P. E. Latham. The rat approximates an ideal detector of changes in rates of reward: Implications for the law of effect. Journal of Experimental Psychology: Animal Behavior Processes, (27):354–372, 2001.

[GSSC95] Manuel García-Solaco, Fèlix Saltor, and Malú Castellanos. A structure based schema integration methodology. In ICDE, pages 505–512, 1995.

[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING, pages 539–545, 1992.

[HM78] Michael Hammer and Dennis McLeod. The semantic data model: A modelling mechanism for data base applications. In SIGMOD Conference, pages 26–36, 1978.

[HM85] Dennis Heimbigner and Dennis McLeod. A federated architecture for information management. ACM Trans. Inf. Syst., 3(3):253–278, 1985.

[JC98] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In International Conference on Research in Computational Linguistics, 1998.

[JN86] John L. Carswell Jr. and Shamkant B. Navathe. SA-ER: A methodology that links structured analysis and entity-relationship modeling for database design. In Stefano Spaccapietra, editor, Entity-Relationship Approach: Ten Years of Experience in Information Modeling, Proceedings of the Fifth International Conference on Entity-Relationship Approach, Dijon, France, November 17-19, 1986, pages 381–397. North-Holland, 1986.

[KDN90] Manfred Kaul, Klaus Drosten, and Erich J. Neuhold. ViewSystem: Integrating heterogeneous information bases by object-oriented views. In ICDE, pages 2–10, 1990.

[KF01] Michel C. A. Klein and Dieter Fensel. Ontology versioning on the semantic web. In SWWS, pages 75–91, 2001.

[Kle03] Jon M. Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7(4), 2003.
[KM96] Jonghyun Kahng and Dennis McLeod. Dynamic classification ontologies for discovery in cooperative federated databases. In CoopIS, pages 26–35, 1996.

[KMH04] Latifur Khan, Dennis McLeod, and Eduard H. Hovy. Retrieval effectiveness of an ontology-based model for information selection. VLDB J., 13(1):71–85, 2004.

[KMH05] Latifur Khan, Dennis McLeod, and Eduard H. Hovy. A framework for effective annotation of information from closed captions using ontologies. J. Intell. Inf. Syst., 25(2):181–205, 2005.

[KMS04] Martin Kavalec, Alexander Maedche, and Vojtech Svátek. Discovery of lexical entries for non-taxonomic relations in ontology learning. In SOFSEM, pages 249–256, 2004.

[KN03] Jaewoo Kang and Jeffrey F. Naughton. On schema matching with opaque column names and data values. In SIGMOD Conference, pages 205–216, 2003.

[KS94] Vipul Kashyap and Amit P. Sheth. Semantics-based information brokering. In CIKM, pages 363–370, 1994.

[KS05] Yannis Kalfoglou and W. Marco Schorlemmer. Ontology mapping: The state of the art. In Semantic Interoperability and Integration, 2005.

[LA86] Witold Litwin and Abdelaziz Abdellatif. Multidatabase interoperability. IEEE Computer, 19(12):10–18, 1986.

[LB03] Charlie Lindahl and Elise Blount. Weblogs: Simplifying web publishing. Computer, 36(11):114–116, 2003.

[LC00] Wen-Syan Li and Chris Clifton. SemInt: A tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data Knowl. Eng., 33(1):49–84, 2000.

[Lev75] Vladimir I. Levenshtein. On the minimal redundancy of binary error-correcting codes. Information and Control, 28(4):286–291, 1975.

[Lin98] Dekang Lin. An information-theoretic definition of similarity. In ICML, pages 296–304, 1998.

[LMR90] Witold Litwin, Leo Mark, and Nick Roussopoulos. Interoperability of multiple autonomous databases. ACM Comput. Surv., 22(3):267–293, 1990.

[LPS86] Douglas B. Lenat, Mayank Prakash, and Mary Shepherd. CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine, 6(4):65–85, 1986.

[MBDH05] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Y. Halevy. Corpus-based schema matching. In ICDE, pages 57–68, 2005.

[MBR01] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In VLDB, pages 49–58, 2001.

[Mel95] I. Dan Melamed. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. CoRR, cmp-lg/9505044, 1995.

[MFHS02] Deborah L. McGuinness, Richard Fikes, James A. Hendler, and Lynn Andrea Stein. DAML+OIL: An ontology language for the semantic web. IEEE Intelligent Systems, 17(5):72–80, 2002.

[Mil94] George A. Miller. WordNet: A lexical database for English. In HLT, 1994.

[MKSI96] Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Illarramendi. OBSERVER: An approach for query processing in global information systems based on interoperation across pre-existing ontologies. In CoopIS, pages 14–25, 1996.

[MNU00] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD, pages 169–178, 2000.

[MSR02] Libby Miller, Andy Seaborne, and Alberto Reggiori. Three implementations of SquishQL, a simple RDF query language. In International Semantic Web Conference, pages 423–435, 2002.

[MWJ99] P. Mitra, G. Wiederhold, and J. Jannink. Semi-automatic integration of knowledge sources. In Proc. of the 2nd Int. Conf. on Information FUSION'99, 1999.
[MZ98] Tova Milo and Sagit Zohar. Using schema matching to simplify heterogeneous data translation. In VLDB, pages 122–133, 1998.

[NDH05] Natalya Fridman Noy, AnHai Doan, and Alon Y. Halevy. Semantic integration. AI Magazine, 26(1):7–10, 2005.

[NFM00] Natalya Fridman Noy, Ray W. Fergerson, and Mark A. Musen. The knowledge model of Protégé-2000: Combining interoperability and flexibility. In EKAW, pages 17–32, 2000.

[O'M86] Michael O'Mahony. Sensory Evaluation of Food: Statistical Methods and Procedures. 1986.

[PPM04] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In AAAI, pages 1024–1025, 2004.

[PS98] Christine Parent and Stefano Spaccapietra. Issues and approaches of database integration. Commun. ACM, 41(5):166–178, 1998.

[PTU03] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. DIKE: a system supporting the semi-automatic construction of cooperative information systems from heterogeneous databases. Softw., Pract. Exper., 33(9):847–884, 2003.

[PyHG+00] Alun D. Preece, Kit Ying Hui, W. A. Gray, Philippe Marti, Trevor J. M. Bench-Capon, Dean M. Jones, and Zhan Cui. The KRAFT architecture for knowledge fusion and transformation. Knowl.-Based Syst., 13(2-3):113–120, 2000.

[RB01] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.

[RC06] Pum-Mo Ryu and Key-Sun Choi. Determining the specificity of terms using inside-outside information: a necessary condition of term hierarchy mining. Inf. Process. Lett., 100(2):76–82, 2006.

[Res95] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.

[Res99] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. (JAIR), 11:95–130, 1999.

[RvRP80] Stephen E. Robertson, C. J. van Rijsbergen, and Martin F. Porter. Probabilistic models of indexing and searching. pages 35–56, 1980.

[SB88] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, 1988.

[SCM08] Sangsoo Sung, Seokkyung Chung, and Dennis McLeod, editors. To appear in the 2008 ACM Symposium on Applied Computing (SAC), Fortaleza, Ceará, Brazil, March 16-20, 2008. ACM, 2008.

[SH06] Mehran Sahami and Timothy D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377–386, 2006.

[She05] Amit P. Sheth. From semantic search & integration to analytics. In Semantic Interoperability and Integration, 2005.

[SL90] Amit P. Sheth and James A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv., 22(3):183–236, 1990.

[SM83] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.

[SM05] Sangsoo Sung and Dennis McLeod. Semantic information management for seismology and geoscience. Technical Report IMSC-05-002, Integrated Media System Center at University of Southern California, 2005.

[SM06] Sangsoo Sung and Dennis McLeod. Ontology-driven semantic matches between database schemas. In InterDB, 2006.

[SSR94] Edward Sciore, Michael Siegel, and Arnon Rosenthal. Using semantic values to facilitate interoperability among heterogeneous information systems. ACM Trans. Database Syst., 19(2):254–290, 1994.

[VA96] Mark W. W. Vermeer and Peter M. G. Apers. The role of integrity constraints in database interoperation. In T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors, VLDB'96, Proceedings of the 22nd International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 425–435. Morgan Kaufmann, 1996.
[VDOP97] Piek Vossen, Pedro Diez-Orzas, and Wim Peters. Multilingual design of EuroWordNet. In Piek Vossen, Geert Adriaens, Nicoletta Calzolari, Antonio Sanfilippo, and Yorick Wilks, editors, Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 1–8. Association for Computational Linguistics, New Brunswick, New Jersey, 1997.

[WSSKR99] Holger Wache, Thorsten Scholz, Helge Stieghahn, and Birgitta König-Ries. An integration method for the specification of rule-oriented mediators. In DANTE, pages 109–112, 1999.

[WVV+01] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information - a survey of existing approaches. In H. Stuckenschmidt, editor, IJCAI-01 Workshop: Ontologies and Information Sharing, pages 108–117, 2001.
Abstract

The main goal of this research is to improve interoperability between different information sources. Ontologies, collections of concepts and their interrelationships, have become a synonym for the solution to many problems resulting from computers' inability to understand natural language, and they can capture the semantics of the diverse representations in heterogeneous information sources. Thus, ontologies can facilitate the identification of semantic matches between the different representations. Therefore, this dissertation studies the role of ontologies in the semantic matching of structured data and a method of building ontologies for that matching.