ENABLING LAYMEN TO CONTRIBUTE CONTENT TO THE SEMANTIC WEB: A BOTTOM-UP APPROACH TO CREATING AND ALIGNING DIVERSELY STRUCTURED DATA

by

Baoshi Yan

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2006

Copyright 2006 Baoshi Yan

Dedication

To my mother, my father and my wife

Acknowledgements

I would like to express my gratitude to my advisor, Dr. Robert Neches, for his guidance over the years. Through numerous discussions with him I improved my research skills, management skills, and presentation skills. I came to realize that pursuing a Ph.D. is not just about getting a degree; it is a process of sharpening yourself in different respects.

I would also like to express my appreciation to Dr. Robert MacGregor. I really enjoyed working with him and our daily meetings. He is not only a great theorist, but also a great implementer. From him I learned many skills for making complicated functionality accessible via simple user interfaces.

My dissertation committee members, Professor Dennis McLeod and Professor Il-Horn Hann, were always ready to provide help and guidance, and their criticism helped me strengthen my dissertation from different angles.

I would also like to thank Dr. Martin Frank, Dr. Pedro Szekely, and Dr. Ke-Thia Yao for their advice on my research. They have great expertise in the area of the Semantic Web, which benefited me greatly. My thanks also go to many current or former members of the Distributed Scalable Systems Division at the USC Information Sciences Institute, including Dr. Peter Will, Dr. Michael Orosz, Dr. Alejandro Bugacov, Dr. Jinbo Chen, Dr. In-Young Ko, Dr. Juan Lopez, Min Cai, Min Qin, Nader Noori, Jing Jin, Craig Rogers, and Carolina Quinteros. I thank Dr. Tatiana Kichkaylo for valuable advice on my defense presentations.

I want to thank Dr. Craig Knoblock and Dr. Jose-Luis Ambite for their support during the final years of my research. I also had the honor of working with Professor Nenad Medvidovic and Professor Barry Boehm when I first came to USC; I thank them for helping me get started with my Ph.D. study.

Much of my work has been sponsored by DARPA DAML program funding for WebScripter under contract number F30602-00-2-0576, and in part by the National Science Foundation under Award No. IIS-0324955.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1  Introduction
  1.1 Laymen Contribution is Essential to the Success of the Semantic Web
    1.1.1 Building the Semantic Web
      1.1.1.1 Creation of Semantic Data
      1.1.1.2 Alignment of Semantic Data: Ontology Alignment
    1.1.2 Laymen Contribution is Essential
    1.1.3 The State of the Semantic Web: Lack of User Contribution
  1.2 Problem: High Barrier for Laymen to Contribute to the Semantic Web
    1.2.1 Current Practices
      1.2.1.1 Ontology-based, Top Down Data Creation
      1.2.1.2 Expert-based Ontology Alignment
  1.3 Our Approach
    1.3.1 Bottom-up Data Creation
    1.3.2 Grass-roots Ontology Alignment
Chapter 2  Background
  2.1 Semantic Web: from HTML to Semantic Data
  2.2 Resource Description Framework (RDF)
  2.3 Ontology
  2.4 Web Ontology Language (OWL)
  2.5 Summary

Chapter 3  Bottom-Up Data Creation
  3.1 The Importance of Data for the Semantic Web
  3.2 Current Status: The Lack of Data for the Semantic Web
  3.3 Related Work: Survey of Data Creation Tools and Their Problems
    3.3.1 Classification of Semantic Data Creation Tools
      3.3.1.1 Ontology Editors
      3.3.1.2 Semantic Annotation Tools
      3.3.1.3 Semantic Web Applications
      3.3.1.4 Data Translation Tools
    3.3.2 Problems with Traditional Semantic Data Creation Tools
      3.3.2.1 An Illustrative Example with Traditional Tools
      3.3.2.2 Ontology-based Approach Forces Users to Take on Difficult Tasks First
      3.3.2.3 Early Binding into Ontology Causes Evolvability Problem
  3.4 Our Approach to Semantic Data Creation: Bottom-up Data Creation
    3.4.1 Brief Introduction of MetaDesk
    3.4.2 A MetaDesk Example: Basic MetaDesk GUI Operations
    3.4.3 The MetaDesk Data Hierarchy
      3.4.3.1 Mapping MetaDesk Data Hierarchies to RDF
      3.4.3.2 Generality of the MetaDesk Data Hierarchy
    3.4.4 Batch Creation of MetaDesk Hierarchies
    3.4.5 Data Refinement
      3.4.5.1 Finer-grained Data Structures
      3.4.5.2 Marking up Data Hierarchies with Ontology Information
      3.4.5.3 Augmenting Data Hierarchy by Batch Markup
    3.4.6 Inferring Ontology
    3.4.7 Advantages of the Bottom-Up Approach
      3.4.7.1 Lowering Entrance Barrier to Semantic Data Creation
      3.4.7.2 Instant Gratification
      3.4.7.3 Overall Easiness of Semantic Data Creation
    3.4.8 Evaluation
      3.4.8.1 Cognitive Easiness of the Bottom-up Approach
      3.4.8.2 Instant Gratification
      3.4.8.3 Efficiency
  3.5 Summary

Chapter 4  Grass-Roots Ontology Alignment
  4.1 Introduction: The Importance of Ontology Alignment
  4.2 Survey of Alignment Tools and Techniques
  4.3 Problems with Traditional Alignment Approaches
    4.3.1 Alignment as an Isolated Task from Other Data Manipulation Tasks
    4.3.2 Expert-Based Alignment
  4.4 Our Alignment Approach
    4.4.1 Introduction of Grass-Roots Alignment
    4.4.2 Alignment as Side-Effects
    4.4.3 End-User Alignment
  4.5 The WebScripter Tool
    4.5.1 WebScripter Overview
    4.5.2 System Descriptions
      4.5.2.1 Constructing a WebScripter Report
  4.6 Advantages of Grass-Roots Alignment
    4.6.1 Instant Gratification
    4.6.2 Ease of Use
  4.7 Reusing Grass-roots Alignments for Alignment Purposes
    4.7.1 Approximations and Inconsistencies in Grass-Roots Alignments
    4.7.2 Observations and Heuristics
    4.7.3 Algorithm for Reusing Grass-roots Class Alignment
    4.7.4 Algorithm for Reusing Grass-roots Property Alignment
  4.8 Evaluation
    4.8.1 Theoretical Analysis
    4.8.2 Experiment Results
  4.9 Summary

Chapter 5  Summary

Chapter 6  Future Work
  6.1 Semantic-enabled Mind Mapping Tools
  6.2 Community-based Semantic Data Creation and Alignment Environment
  6.3 Using MetaDesk's Data Refinement Mechanism for XML-to-RDF Conversion

References

List of Tables

2.1 The XML Serialization of the Example RDF Graph
3.1 The Underlying RDF for the MetaDesk Data Hierarchy: RDF Triples for the Trip Information
3.2 Importing XML into MetaDesk: An XML Segment
4.1 Resultant Alignment Axioms
4.2 Facts Knowledge Base

List of Figures

2.1 RDF data model
2.2 RDF graph example
3.1 Data creation tools for the Semantic Web
3.2 A sample resume
3.3 Creating classes in Protégé
3.4 Creating properties in Protégé
3.5 Create instance in Protégé
3.6 The MetaDesk environment
3.7 Importing XML into MetaDesk
3.8 Import folder/file hierarchies into MetaDesk
3.9 Batch input editor
3.10 The initial hierarchy from batch input
3.11 Coarse-grained node
3.12 Breaking a node into several in Batch Editor
3.13 Finer-grained nodes
3.14 Editing node type directly
3.15 Editing links between nodes directly
3.16 Links (properties) between nodes
3.17 Specifying a node as a collection node
3.18 A node of type "Collection"
3.19 A node with type obtained from its parent
3.20 A large list of project information
3.21 A large hierarchy of initial project nodes
3.22 Table view of selected node
3.23 Table view of children nodes after zoom-in
3.24 Newly created columns in table view
3.25 Auto-classifying cell values into appropriate columns
3.26 Auto-classification results
3.27 Refined table view
3.28 Refined data hierarchy
3.29 Inferred ontology
3.30 Artificial constructs created: MetaDesk vs Protégé
3.31 Task completion percentage: MetaDesk vs Protégé
3.32 Instant gratification: MetaDesk vs Protégé
3.33 Instant gratification: hierarchy creation time comparison
3.34 Data entry efficiency: MetaDesk vs Protégé
4.1 Comparison of ontology alignment/schema matching techniques
4.2 WebScripter GUI
4.3 WebScripter: ontological path inference
4.4 Aligning data in WebScripter
4.5 Constructed class alignment
4.6 Constructed property alignment
4.7 Alignment is not transitive
4.8 Observations on grass-roots alignments
4.9 Implications of alignment: Case 1
4.10 Implications of alignment: Case 2
4.11 Implications of alignment: Case 3
4.12 Alignment example
4.13 Precision and recall of obtained facts when all alignments are valid
4.14 Performance of algorithm with some invalid alignments
6.1 Integrating bottom-up semantic data creation into mind mapping tools

Abstract

This dissertation aims at lowering the entrance threshold for laymen to contribute to the Semantic Web. More specifically, it aims at lowering the difficulty for laymen of performing two basic and tightly related tasks on the Semantic Web: semantic data creation and ontology alignment.
The content of the current Web, plain text mixed with HTML tags, is difficult for machines to interpret. Researchers from universities and industry have been working on the development of the next-generation Web: the Semantic Web. The Semantic Web aims at creating and connecting a web of machine-understandable semantic data, which allows for intelligent processing and could bring the Web to a higher level. In order for the Semantic Web to succeed, it has to be easy for laymen to make contributions: easy to contribute semantic data, and easy to operate on heterogeneous data.

Semantic data is structured data marked up with ontology terms. A piece of semantic data specifying one's telephone number could be represented as (JohnSmith, O1:phoneNumber, "123-456-7890"), where "phoneNumber" is an ontological term in ontology "O1". Ontology alignment is the process of matching terms from different ontologies; for example, "phoneNumber" in one ontology corresponds to "telephoneNumber" in a second ontology.

The realization of the Semantic Web depends on the creation of a massive amount of semantic data. Semantic data, being encoded in a structured form and marked up with ontologies, is much more machine-friendly and makes more intelligent processing feasible.

Aligning terms from different ontologies is the key to integrating heterogeneous semantic data expressed in different ontologies. Given the openness of the Web, it is unrealistic to expect all semantic data to be created with the same ontology. Ontology alignment can also help semantic data creation, by supplying existing ontologies and semantic data in the intended domain.

Many tools have been proposed to create semantic data and align ontologies. However, the lack of semantic data is still the biggest problem in current Semantic Web development. In addition, no alignment tool has gained widespread use among ordinary users, and there is little ontology alignment data available on the current Semantic Web.

In this dissertation, we argue that the conventional tools and techniques for semantic data creation and ontology alignment pose a high barrier of entry to laymen. These tools and techniques share a common characteristic: they are all top-down, ontology-based tools. As a result, they are difficult for laymen to use, because of the inherent difficulty of dealing with ontologies: ontologies are abstract, generalized, high-level entities.

Instead, we propose a bottom-up, data-centric approach to semantic data creation and ontology alignment. In bottom-up data creation, users create weakly structured data first and gradually refine it; an ontology is derived as a summary of the data created so far. In bottom-up ontology alignment, laymen rather than ontology experts align their semantic data (for their own purposes) within end applications; the implicit and sometimes imprecise ontology alignments inferred from their actions are then integrated and mined to produce higher-accuracy ontology alignments. In both tasks, the difficulty of carrying out the task is significantly reduced, making it easier for laymen to contribute to the development of the Semantic Web.

Chapter 1  Introduction

This dissertation studies how to lower the entrance threshold to the Semantic Web for laymen. More specifically, it studies how to lower the difficulty for laymen of creating and aligning semantic data on the Semantic Web. An informal example of semantic data is (JohnSmith, O1:phoneNumber, "123-456-7890"), where O1 is a shorthand for an ontology.
An example of ontology alignment is that "O1:phoneNumber" corresponds to "O2:telephoneNumber".

We begin this chapter by showing that laymen's contributions to semantic data creation and ontology alignment are vital to the success of the Semantic Web (Section 1.1). Next, we show that traditional tools and techniques pose a high barrier to laymen's participation in semantic data creation and ontology alignment (Section 1.2). We then outline our approach to lowering the difficulty for laymen of contributing to the Semantic Web (Section 1.3).

The remainder of the thesis expands upon and evaluates the ideas outlined in this chapter. The next chapter (Chapter 2) provides background for the research; it describes in more detail the Semantic Web, ontologies, and the related standards (RDF and OWL). Chapter 3 then elaborates on the semantic data creation problem and our approach, and describes an evaluation of the results. Chapter 4 elaborates on the ontology alignment problem, our approach, and an evaluation of that approach. We then conclude with a summary of the dissertation and future work in Chapters 5 and 6.

1.1 Laymen Contribution is Essential to the Success of the Semantic Web

Despite its vast success, the current Web, as a mixture of plain text and HTML tags, is hard for machines to interpret. Most of the Web's content is designed for human consumption rather than machine processing. As a result, it is difficult for people to filter and synergize the massive amounts of data that exist on the Web.

Instead of trying to understand the Web in its current form, which is unrealistic with state-of-the-art techniques, the alternative is to encode its content in a machine-friendly form. The Semantic Web [16], proposed as the next generation of the Web, aims at creating and connecting a web of machine-understandable semantic data, which allows for intelligent processing and could bring the Web to a higher level. More background on the Semantic Web and its underlying technologies, RDF(S), OWL and ontologies, can be found in Chapter 2.

In order for the Semantic Web to take off, there has to be a massive amount of semantic data (ontologically marked-up structured data). In addition, there has to be alignment of heterogeneous semantic data.

1.1.1 Building the Semantic Web

There are two essential tasks in building the Semantic Web: the creation of semantic web content (i.e., ontologically marked-up structured data), and the alignment of heterogeneous semantic data (mostly, ontology alignment).

1.1.1.1 Creation of Semantic Data

The Semantic Web is all about a web of semantic data. People have come to realize that a large amount of semantic data is the key to the emergence of the Semantic Web. Without the availability of a large amount of real semantic data, it is difficult to test and compare different tools and techniques, such as different ontology alignment algorithms or different query techniques. Without data, there is no basis for the emergence of novel semantic web applications. Therefore, a top-priority task in building the Semantic Web is to create a large amount of semantic data.

1.1.1.2 Alignment of Semantic Data: Ontology Alignment

Given the decentralized nature of the Semantic Web, it is inevitable that people will create different ontologies even for the same domain. Therefore, ontology alignment is a critical problem. It arises when two users share data. It can also arise in the single-user case, when the user expresses similar concepts differently; as an analogy, think of how many file folders get created for the same kind of files.

Ontology alignment has been studied by many researchers in the ontology and semantic web communities (see Chapter 4 for more information on research in this area). It has also been studied extensively in the database community under the name of schema matching (see [103] for a survey), and it appears under various other names such as schema mediation, schema reconciliation, schema mapping, semantic coordination, semantic mapping, and ontology mapping.

In most research work, ontology alignment is defined as the following problem: given two separately conceived ontologies, find likely matches between the terms of the two ontologies. Note that alignment is not equivalence. Alignment is the establishment of correspondence between concepts in two ontologies, and it can depend heavily on the users and circumstances involved.
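To make the payoff of alignment concrete, here is a minimal sketch in Python. The triples and the O1/O2 term names echo the phone-number example above and are purely illustrative, not drawn from any real dataset; it shows how a known term correspondence lets an application merge data created under two different ontologies:

# A minimal sketch; all names and values here are illustrative.
triples_o1 = [("JohnSmith", "O1:phoneNumber", "123-456-7890")]
triples_o2 = [("JaneDoe", "O2:telephoneNumber", "987-654-3210")]

# An alignment: each O1 term mapped to its corresponding O2 term.
alignment = {"O1:phoneNumber": "O2:telephoneNumber"}

# Rewrite the O1 data into the O2 vocabulary so both sources
# can be queried uniformly.
merged = [(s, alignment.get(p, p), o) for (s, p, o) in triples_o1] + triples_o2
for subject, predicate, obj in merged:
    print(subject, predicate, obj)

Real alignments are rarely this clean; Chapter 4 deals with the approximations and inconsistencies that arise when laymen produce them.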
1.1.2 Laymen Contribution is Essential

In order for the Semantic Web to take off, contributions by laymen are essential [54][85]. Laymen must be able to contribute semantic data and ontology alignments to the Semantic Web. As an analogy, the success of the World Wide Web depends on individual end users contributing and browsing web pages. With user involvement in the Web reaching another level, as demonstrated by blogs and social networking, a new generation of the Web (coined Web 2.0) is emerging. Similarly, the Semantic Web cannot succeed without end users' contributions to data creation and ontology alignment.

However, enabling laymen's participation in the Semantic Web is much more difficult than in the current Web, due to the complexity and heterogeneity of the structured data that users need to deal with. To lower the entry barrier to the Semantic Web, easy-to-use tools must be developed for end users.

1.1.3 The State of the Semantic Web: Lack of User Contribution

In the current state of the Semantic Web, there is not much participation [85], and even less contribution from laymen, as is evident in the lack of semantic data, let alone ontology alignments. Data is the key to the success of the Semantic Web; this is the consensus of the Semantic Web community [28]. The lack of data is still a problem as of now. For example, RDFData.org, a comprehensive portal of RDF data, lists only 174 data stores as of April 2006. The Swoogle [32] semantic search engine, by far the largest semantic data index, has indexed 1,320,812 semantic web documents. Further investigation of the index reveals that the vast majority of those documents are RSS data, which is more of an XML syntax and does not carry much semantics (for example, a simple search for the RSS property http://purl.org/dc/elements/1.1/title returns 695,047 semantic web documents in Swoogle).

Few users contribute data; even fewer users contribute ontology alignments. As of April 2006, to our knowledge, there is no online library indexing ontology alignments.

In the next section, we argue that the current entry barrier to the Semantic Web is too high for laymen, which has led to the current lack of user contribution to the Semantic Web.

1.2 Problem: High Barrier for Laymen to Contribute to the Semantic Web

1.2.1 Current Practices

Although many tools (see Chapter 3 and Chapter 4 for more details) have been proposed to create and align data for the Semantic Web, most of them still place very high demands on users.
These tools adopt a top-down, "ontology-centric" approach. For example, they ask users to start by creating ontologies, a difficult task, before creating any data. They also expect end users to produce accurate ontology alignments as a separate task. However, an ontology is a high-level, abstract construct. Consequently, these tools are designed more for experts than for laymen.

1.2.1.1 Ontology-based, Top Down Data Creation

The traditional tools for semantic data creation follow the same paradigm: users must define an ontology first, and then fill in the ontology template to create instance data. There are inherent difficulties associated with such a paradigm.

Ontology-based approach forces users to take on difficult tasks first: We argue that the major reason traditional semantic data creation tools are difficult to use is that they employ an ontology-based approach. That is, users need to define an ontology first, and then create actual data conforming to the created ontology. Such an ontology-based, top-down approach has inherent difficulties:

1. Ontology creation is difficult. "Ontology is a conceptualization of a domain" [46]. Compared to creating instance data, creating an ontology is a more abstract task that requires a higher-level understanding of the domain. Ordinary people are good at thinking in terms of concrete objects, but they are generally not trained to think at an ontological level. Furthermore, users might not know a domain well enough to create an ontology for it. Therefore, ontology creation is more difficult.

2. Effort is needed to locate appropriate ontology concepts for the instance data to be created. In an ontology-based approach, any instance data must conform to a concept in the ontology. Before users are able to create data, they need to examine whether appropriate ontology concepts exist for the data to be created. Progress becomes its own barrier: as the number of ontologies and ontology concepts grows, this becomes an increasingly difficult task.

3. By requiring users to define their ontology first, which normally takes time and effort, the barrier to data creation is made so high that users are discouraged from using the tools.

Early binding into ontology causes evolvability problems: Being able to modify previously created ontologies and data is important. When users create an ontology for the first time, they might not envision new requirements and new data arising in the future. The old ontology might not be sufficient to support new data, thus requiring modification.

It is a common understanding in software engineering that it is better to avoid making design decisions prematurely. In an ontology-based approach, design decisions about the ontology have to be made before any instance data can be created, and the structure of the instance data is completely determined by the defined ontology. Thus, whenever the ontology needs to be modified, care must be taken to synchronize previously created instance data with the new version of the ontology, which is not necessarily an easy job even for ontologists. In addition, any modification to the structure of any data instance is impossible without modifying the ontology, which in turn would require modification of other data instances. As a result, ontology and data evolvability is a tricky issue in an ontology-based tool. It is error prone and can result in loss of data.

1.2.1.2 Expert-based Ontology Alignment

The characteristics of traditional alignment techniques make them not entirely suitable for ordinary users.
In the traditional approach, ontology alignment is an isolated process in which an ontology expert tries to obtain correct alignments between ontologies.

Alignment is isolated from other data manipulation tasks: Alignment, or schema matching, is treated as a separate task, so users do not get an immediate reward for their alignment effort. Furthermore, because alignment is a separate task, it is often difficult for users to make alignment decisions due to the lack of context.

Expert-based alignment: The outputs are precise mappings between terms in different schemas. However, precise mappings between ontologies are difficult to obtain and require a thorough understanding of the ontologies. Therefore, the alignment task is generally carried out by ontology experts. Ordinary users, though, may want to align some concepts for their own purposes even if those concepts are not precisely aligned with each other. For example, is "usc:Professor" the same concept as "stanford:Professor"? For a secretary in the United States Department of Education, the answer is probably yes. For students at either university, the answer might be no. Either way, both the secretary and the students would know whether to align these two concepts, based on their own purposes. On the contrary, if only precise alignments are wanted, it is difficult even for an ontology expert to figure out the nuances of the ontologies and make decisions, let alone ordinary end users.

1.3 Our Approach

In this thesis, we propose a bottom-up, grass-roots approach to data creation and alignment, and we argue that this approach lowers the entry threshold to the Semantic Web. More specifically, our approach can be described as follows when applied to data creation and alignment.

1.3.1 Bottom-up Data Creation

Most tools require users to define an ontology first, and then create instance data according to the previously defined ontology. Defining an ontology is itself not an easy task. In our approach, users can create structured data immediately, without defining any ontology. The data is structured into hierarchies with general "parentChild" links and the "Thing" type. Users can then refine the data with the help of our tool, and an ontology can be inferred from the data. The ontology is an emergent phenomenon of data creation. By doing so, we are challenging the traditional "ontology-centric" view of the Semantic Web with a "data-centric" view.

1.3.2 Grass-roots Ontology Alignment

Most tools regard ontology alignment as a separate task and often give it greater importance than other user tasks. Thus the generated alignments must be of high precision, and the alignment task must be carried out by ontology experts. Instead, we argue that ontology alignment by itself is not rewarding to users. In our approach, data alignment is achieved implicitly, as a side effect of other user tasks, and is performed only when users feel it necessary. Users align data rather than aligning ontologies, which provides an immediate check that the alignment is plausibly valid. Being a side effect of other data manipulation tasks, the alignment need not be exact, so it can easily be carried out by laymen, not necessarily ontology experts. The resultant approximate alignments are then integrated and mined to produce higher-precision alignments. Ontology alignment becomes an emergent phenomenon of data manipulation.
We will discuss the advantages of our approach, the difficulties encountered and our solutions to them. We will present the evaluation of the bottom-up approach on data creation and alignment tasks. 10 Chapter 2 Background This chapter provides background information of the dissertation. 2.1 Semantic Web: from HTML to Semantic Data Since its inception, the World Wide Web has greatly improved information sharing and communicationbetweenpeople. Anunprecedentedlargeamountofinformationisnowat thefingertipsofordinaryperson. Accordingtoarecent2005study,thesizeofthecurrent indexable Web is more than 11.5 billion pages [48]. Thus one of the biggest problems people nowadays face is information overload. Without automatic machine processing, it is impossible for people to filter and synergize the massive amounts of data that exist in the Web. However, despite its tremendous success, the Web today has its limitations. Most of the Web’s content is designed for human consumption rather than machine processing. Web pages are normally a mixture of plain texts, and HTML tags that layout the presen- tation of the texts, which is easy for human to understand but is difficult for machines to interpret. Machines can parse the web pages and understand its syntaxes: headers, tables, links, etc. But it is difficult for computers to figure out the semantics of a web page: this web page is about a person named John Smith, whose specialty is librarian, etc. As a result, although massive amount of data is available to people on the Web, it 11 is difficult for people to extract information they look for. For example, an intelligence analystwantingtofindoutallchemistswithpossibleAlQaedaconnectionwillhavediffi- culty in getting the answer from the current Web because the relevant information (such as the participants of chemistry conferences, people’s travel records and citizenship) is most likely to be embedded in web documents whose semantics could not be understood by computers. Lots of work (Information Extraction, Question Answering on the Web, etc) has been carried out in understanding plain texts, or web pages. But they all have limited success. These techniques are either too domain-specific, or lacking in performance. Instead of trying to understand the Web in its current form (a mixture of plain texts and HTML tags), which is unrealistic with state-of-the-art techniques, the other way around is to encode its content in a machine-friendly form. The Semantic Web [16] – aimed to be the next generation of Web – will bring structures to the Web. TheSemanticWebwillbeawebofsemanticdata–ontologicallymarkedupstructured data whose meaning can be interpreted by machines. The availability of such a large amount of machine-friendly data will enable new applications with unprecedented power and sophistication. TheSemanticWebinitiative,startedbytheWorldWideWeb’sinventorTimBerners- Lee, is a joint effort by World Wide Web Consortium (W3C), US Defense Advanced Research Project Agency (DARPA), and EU Information Society Technologies (IST) Programme. Sinceitsinception,theSemanticWebresearchanddevelopmenthassparked lots of interests and activities in both academic and industry. It is one of the hot topics 12 in the World Wide Web Conference. The International Semantic Web Conference series, solely dedicated to the Semantic Web research, has been held for 4 years. ThebasisoftheSemanticWebisthatitsdataisencodedwithamachine-understandable semantic representation. 
Such a representation is based on ontologies and various W3C standards such as RDF(S) and OWL. We introduce these concepts and standards next.

2.2 Resource Description Framework (RDF)

The Semantic Web is built upon the W3C recommendation Resource Description Framework (RDF) [73][68]. RDF, following the earlier developments of Dublin Core 1 and the Platform for Internet Content Selectivity (PICS) 2, provides a standard way of describing resources on the Semantic Web.

There are three core elements of RDF: Uniform Resource Identifiers (URIs) 3, the triple data model, and the XML serialization of RDF descriptions [73].

1. All RDF resources are identified with a unique URI. The adoption of URIs ensures that the resources being described are not simply strings in a document, but something global and uniquely identifiable that everyone can refer to and talk about on the Web. URIs make it possible for anyone to talk about anything, just as current Web users have the freedom to link to any web page they want. URIs also make it possible to integrate resource descriptions from distributed locations, as long as the descriptions are about resources with the same URI.

1 Dublin Core Metadata Element Set, Version 1.1: Reference Description. http://dublincore.org/documents/dces/
2 Platform for Internet Content Selectivity, http://www.w3.org/PICS/
3 URIs, URLs, and URNs: Clarifications and Recommendations 1.0, http://www.w3.org/TR/uri-clarification/

2. RDF defines a data model, i.e., how to describe resources. An RDF description is an RDF statement (or RDF triple) of the form

[subject predicate object.]

which can also be depicted as in Figure 2.1.

Figure 2.1: RDF data model

Each of the subject, predicate and object is identified with a URI; the object can also be a string literal.

Figure 2.2: RDF graph example

Figure 2.2 shows an example RDF data segment. It basically says that the movie "Artificial Intelligence" was released in the year 2001 and directed by Steven Spielberg. As can be seen from the example, the RDF data model is very simple and intuitive. It naturally matches how people talk about things.

3. To represent RDF statements in a machine-processable way, RDF defines a specific eXtensible Markup Language (XML) syntax, referred to as RDF/XML [73]. Table 2.1 shows how the RDF segment above can be encoded in XML format.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:imdb="http://www.imdb.com/ont#">
  <rdf:Description rdf:about="#AI">
    <imdb:title>Artificial Intelligence: AI</imdb:title>
    <imdb:year>2001</imdb:year>
    <imdb:directedBy rdf:resource="http://www.imdb.com/ont#StevenSpielberg"/>
  </rdf:Description>
</rdf:RDF>

Table 2.1: The XML Serialization of the Example RDF Graph

In short, RDF provides the basic means (the triple data model) for people to make assertions on the Semantic Web, and the necessary foundations (the adoption of URIs) for those assertions to be integrated.
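The same triples can also be produced programmatically. The following sketch uses Python's rdflib library (not a tool discussed in this dissertation) to assert the three triples of Figure 2.2 and serialize them into the RDF/XML of Table 2.1; the full URI for the movie resource is an assumption, since the table abbreviates it as "#AI":

from rdflib import Graph, Literal, Namespace, URIRef

IMDB = Namespace("http://www.imdb.com/ont#")

g = Graph()
g.bind("imdb", IMDB)

# The movie resource; the full URI is assumed for the sake of
# a runnable example, as Table 2.1 abbreviates it to "#AI".
ai = URIRef("http://example.org/movies#AI")

# One triple per assertion, mirroring Figure 2.2.
g.add((ai, IMDB.title, Literal("Artificial Intelligence: AI")))
g.add((ai, IMDB.year, Literal("2001")))
g.add((ai, IMDB.directedBy, IMDB.StevenSpielberg))

# Serialize to the RDF/XML form shown in Table 2.1.
print(g.serialize(format="xml"))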
2.3 Ontology

RDF provides a way of describing resources in a structured way. However, RDF by itself is not enough to fully specify the semantics of its assertions. Another core element of the Semantic Web is the ontology, which assigns semantics to RDF structures. For example, in the RDF graph above, the semantics of "imdb:year", "imdb:title" and "imdb:directedBy" are not defined; ontologies serve that purpose. RDF and ontologies together provide the means necessary to specify semantic data, i.e., ontologically marked-up structured data.

Historically, ontology is a philosophical term referring to the study of the existence of entities in the universe and how they are related. In knowledge management and the Semantic Web, an ontology is a "representation of a shared conceptualization of a specific domain" [46][112]. An ontology defines the concepts (classes), the relationships between concepts (properties), and additional constraints (axioms) of a particular domain. It provides a common vocabulary for the domain, with which people can share and communicate assertions about it. For example, in a movie domain we may define an ontology consisting of the following classes and properties:

Classes: Movie, Actor, Director
(Data) Properties: title, year (of the Movie class)
(Object) Properties: directedBy (the relationship between Movie and Director)

With its mapping to the domain, an ontology provides semantics for assertions made with its terms.
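Expressed in RDF Schema, this small movie ontology amounts to a few triples itself. The sketch below writes them out with rdflib, reusing the imdb namespace from the Section 2.2 example; it is an illustration of the concepts, not a definition taken from any published ontology:

from rdflib import Graph, Namespace, RDF, RDFS

# The imdb namespace reused from the Section 2.2 example.
ONT = Namespace("http://www.imdb.com/ont#")

g = Graph()

# The classes of the movie domain.
for cls in (ONT.Movie, ONT.Actor, ONT.Director):
    g.add((cls, RDF.type, RDFS.Class))

# Datatype properties of Movie: title and year.
for prop in (ONT.title, ONT.year):
    g.add((prop, RDF.type, RDF.Property))
    g.add((prop, RDFS.domain, ONT.Movie))

# Object property relating Movie to Director.
g.add((ONT.directedBy, RDF.type, RDF.Property))
g.add((ONT.directedBy, RDFS.domain, ONT.Movie))
g.add((ONT.directedBy, RDFS.range, ONT.Director))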
Ontologies are a popular research topic in various areas. As domain models, they are applied in knowledge management [11][12][38], information retrieval [67][107][27] and information extraction [3][47]. As shared domain models, they are used in information integration [10][90][83]. Ontologies have also found applications in e-commerce [74][39], medical science [104][42][100][64][45][71] and bioinformatics [106].

2.4 Web Ontology Language (OWL)

"The OWL Web Ontology Language is a language for defining and instantiating Web ontologies" [105]. OWL is currently a W3C recommendation, built upon the earlier ontology languages RDFS [22], DAML [58], OIL [59] and DAML+OIL. OWL consists of three layers, OWL Lite, OWL DL and OWL Full, with increasing expressiveness and complexity, and it is based on description logic [6][78][79]. OWL defines ontology constructs such as class, subClassOf, domain and range, which can be used to define ontologies. Together with RDF, OWL provides the foundations for describing a domain, and information within that domain, on the Semantic Web.

2.5 Summary

In this chapter, we provided background for the thesis: the development of the Semantic Web and its underlying technologies and standards (ontologies, RDF, OWL). Our approach and tools, as well as many of the other tools we study, are based on these standards.

Chapter 3  Bottom-Up Data Creation

3.1 The Importance of Data for the Semantic Web

Despite several years of hard work by many researchers, the Semantic Web is still advancing very slowly. People have come to realize that a large amount of semantic data is the key to the emergence of the Semantic Web. Without it, it is difficult to test and compare different tools and techniques (such as different ontology alignment algorithms or different query techniques). Furthermore, there is no basis for the emergence of novel semantic web applications.

3.2 Current Status: The Lack of Data for the Semantic Web

Data is key to the success of the Semantic Web; this is the consensus of the Semantic Web community [28][85]. The lack of data is still a problem as of now. RDFData.org, a comprehensive portal of RDF data, lists only 174 data stores as of April 2006. The Swoogle [32] semantic search engine, by far the largest semantic data index, has indexed 1,320,812 semantic web documents. Further investigation of the index reveals that the vast majority of those documents are RSS data, which is more of an XML syntax and does not carry much semantics (for example, a simple search for the RSS property http://purl.org/dc/elements/1.1/title returns 695,047 semantic web documents in Swoogle).

The success of the WWW depended on a large number of web pages: there are many tools with which users can easily create web pages, and those web pages are of utility to their creators. Similarly, we believe the success of the Semantic Web depends on a large amount of semantic data; there must be easy-to-use tools for creating such data, and there must be incentives for users to do so. That is, the tools and the data must be of some utility to the data creators. A good tool should make data creation as easy and as rewarding as possible.

3.3 Related Work: Survey of Data Creation Tools and Their Problems

3.3.1 Classification of Semantic Data Creation Tools

People have realized the importance of semantic data, and many tools have been developed to create it. These tools can be roughly classified into several categories.

3.3.1.1 Ontology Editors

Ontologies are the focus of semantic web research. Unsurprisingly, ontology editors (OilEd [9], Protégé [37], Ontolingua [87], OntoEdit [110], Apollo [84], KAON [97]) were among the first semantic web tools developed. Ontology editors support users in creating, importing and modifying ontologies; they normally also support creating instance data according to an ontology.

3.3.1.2 Semantic Annotation Tools

Beyond ontology editors, people have realized the importance of attaching semantic descriptions to web pages or other resources, also known as annotation. Traditional annotation tools (Annotea [70], SHOE Annotator [57], Melita [26], SMORE [66], OntoMat Annotizer [53], COHSE [91], MnM [113], KIM [101]) are mostly ontology-driven: previously defined ontologies provide templates for users (or machines) to fill in with the necessary semantic data. Most tools (except Annotea [70]) allow users to create data according to custom ontologies. However, although ontology-driven, most annotation tools (except SMORE [66] from the list) do not provide an ontology editing interface; they all assume that users will import an ontology created elsewhere with some ontology editor. This not only lowers the tools' usability, but also brings about interoperability issues: the ontology output by an ontology editor might not be properly recognized by the annotation tool. Since they aim at annotating web pages, annotation tools normally have a built-in web browser or are integrated with existing web browsers (Annotea [70], COHSE [91]).

Manual vs. (semi-)automatic annotation: As a first step, most annotation tools allow users to manually annotate web pages with semantic data. Some tools (Melita [26], OntoMat Annotizer [53], MnM [113]) also try to automate the annotation process by learning from previous annotations. These tools normally use information extraction techniques, with the previous annotations as the training set. However, the performance of such tools is not entirely satisfactory: information extraction techniques often require web pages to have regular HTML syntax structures, and as a result many web pages are not amenable to them. Some tools (AeroDAML [69], KIM [101]) try to fully automate the annotation process without requiring an annotation training set.
KIM [101] tries to recognize semantic entities in a web page by matching the text against entities in its knowledge base. AeroDAML [69] tries to extract people, organizations, locations and some other kinds of semantic entities from web pages with a combination of natural language processing and other techniques. These tools are not directly applicable to annotation with custom ontologies; KIM [101] and AeroDAML [69] are better called tagging tools than annotation tools.

3.3.1.3 Semantic Web Applications

Another way of creating semantic data is by using semantic web applications. Semantic desktop systems such as Haystack [102] allow users to create data through an application interface, and the created data is ultimately turned into semantic web formats. Such tools are mostly bound to particular ontologies, and creating data according to custom ontologies with them is difficult.

3.3.1.4 Data Translation Tools

Converting legacy data (relational databases, XML, etc.) into a semantic web format is an important source of semantic data. However, the focus of this thesis is on user creation of semantic data, so we do not discuss data translation in more detail here.

A list of different tools with their respective capabilities is given in Figure 3.1.

Figure 3.1: Data creation tools for the Semantic Web

3.3.2 Problems with Traditional Semantic Data Creation Tools

3.3.2.1 An Illustrative Example with Traditional Tools

Although many tools have been developed to help users create semantic web content, most of them remain too difficult to use. To illustrate the difficulty, we will go through an example of creating an ontology and instance data with one representative of these tools.

The tool we experiment with is Protégé [37], regarded by many as the most popular ontology and instance data editor ([23][114]), developed by Stanford Medical Informatics at the Stanford University School of Medicine. "Protégé is a tool which allows users to construct domain ontologies, customize data entry forms, and enter data."

Protégé is the most popular tool of its kind. As of October 16, 2005, it had 35,412 registered users. There are conferences and workshops dedicated specifically to the tool: the First International Protégé Workshop was held in June 1995, and the Eighth International Protégé Conference was held in Madrid, Spain, in July 2005.

Despite the tool's long history and its relatively large numbers of downloads and registered users, the number of ontologies created with it remains less impressive. As of June 3, 2006, only 62 ontologies are listed in the Protégé ontology library 1. The majority of these ontologies are in scientific research domains such as the biological, clinical, or medical areas, and a large portion of the rest are translations of standards such as UML (Unified Modeling Language). Few of the ontologies have gained widespread usage, and few were created by ordinary persons (i.e., non-researchers and non-experts).

1 http://protege.stanford.edu/download/ontologies.html

We will now go through an example to illustrate the process of using Protégé-like tools to create an ontology and instance data. For many companies, managing their employees' capabilities is important to their competitiveness: it would help a company form the best group of employees, with relevant expertise and capabilities, for a task requiring specific capabilities.
To facilitate such capability management, it would be a great help if employees' capabilities were described with structured data, so that they could be easily searched and integrated. One good source of employee capability information is their resumes, which normally contain their education, specialization, training, past experience and skills. Thus it would be desirable to have a tool, in this example Protégé, help convert such information into a structured description. One sample resume 2 is given in Figure 3.2.

2 http://www.uwsp.edu/career/Handouts-Word/resume.doc

Figure 3.2: A sample resume

Creating a structured description of a resume with Protégé involves two steps. The first step is to create an ontology for the resume domain. The second step is to create instance data for a given resume, conforming to the created ontology.

The most difficult step, which is also the first, is to create the domain ontology. An ontology consists of a set of classes, properties, and relationships between classes. In order to use Protégé to convert the given resume into a structured form, we must first determine what classes and properties should be created. However, determining the set of classes and properties for a particular domain is not an easy task. To name a few difficulties:

1) It is sometimes difficult to decide whether a particular concept should be a class or a property. Should "Education" be a property, probably of the "Resume" class? Or should there be an "Education" class, because education information itself is complicated enough (degree, school, major, etc.) to require a class definition? Or should we have both an "Education" class and a "hasEducation" property for the "Resume" class? For another example, "Activities" are divided into two groups, "College" and "Community". Are "College" and "Community" properties, classes, or subclasses of "Activities"?

2) It is also difficult to decide the level of granularity for the ontology. For example, shall we simply create an "address" property and regard the whole address as the literal value of that property, or shall we break an address into more detailed structures such as street, city, state, etc.?

3) It is not obvious how to decide the structure of the ontology. For example, shall we regard person name, address and phone number as properties of the "Resume" class, or shall we create another class, "Contact", to hold such information, and another property to link the "Contact" class to "Resume"?

Making all these decisions is not easy even for ontology specialists, and for ordinary users they are even more daunting. After we decide on the name and properties of a class, we can use Protégé to create it.

Figure 3.3: Creating classes in Protégé

Figure 3.3 shows the Protégé GUI for creating classes. On the left side of the GUI, a list of created classes is displayed. On the right side, the properties (called slots in Protégé) of the selected class are displayed. The figure shows six properties for the "Education" class: "degree", "hasGPA", "major", "minor", "time", and "university". Users can create more classes with the toolbar on the left side, and add more properties with the toolbar on the right side, just above the slot list.

Figure 3.4: Creating properties in Protégé

Figure 3.4 shows the Protégé GUI for creating and editing properties (or slots). The figure shows a property named "hasEducation" being created, whose domain is "Resume" and whose range is "Education". That is, a "Resume" may have a "hasEducation" property whose value is an "Education" object.
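Behind these GUI gestures, the editor is accumulating ordinary schema statements. The sketch below shows roughly the OWL content produced so far, written with rdflib under a hypothetical resume namespace; it illustrates the concepts, not Protégé's actual project format:

from rdflib import Graph, Namespace, OWL, RDF, RDFS

# Hypothetical namespace for the resume ontology of this example.
EX = Namespace("http://example.org/resume#")

g = Graph()
g.bind("ex", EX)

# The classes created in Figure 3.3.
g.add((EX.Resume, RDF.type, OWL.Class))
g.add((EX.Education, RDF.type, OWL.Class))

# The "hasEducation" slot from Figure 3.4, with its domain and range.
g.add((EX.hasEducation, RDF.type, OWL.ObjectProperty))
g.add((EX.hasEducation, RDFS.domain, EX.Resume))
g.add((EX.hasEducation, RDFS.range, EX.Education))

Only once such definitions exist can instance data be entered, which is exactly the top-down ordering criticized in Section 3.3.2.2.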
Figure 3.5: Create an instance in Protégé

Finally, after a class and its properties are defined, users can create instances of the class. Figure 3.5 shows the Protégé GUI for creating and editing class instances. When a class is selected and the instance creation button is clicked, a template with empty property fields for the selected class is displayed, allowing users to enter instance information. In the figure, an instance of the "Education" class is created with information such as degree, major, and university.

The set of properties of an instance is solely determined by its type. Therefore, users must take care if they want to change the type of an instance, as the instance would lose all properties that are not in the new type. Similarly, if users want to add, remove or rename properties of an instance, they must do so at the ontological level, that is, change the class or property definition, which in turn affects all instances of the same type.

To convert the whole resume into a structured form, a large set of classes and properties needs to be defined. These classes could include "Resume", "Contact", "Address", "Employment", "Experience", "Activity", "Interest", "Skill", "Reference", "Company", etc. Each class may have multiple properties. For example, the "Employment" class may have properties such as "position", "duration", "companyname", "city", and "state", and the "Resume" class might have properties such as "hasContact", "hasReference", "hasEducation", "hasActivity", etc. After all these classes and properties are defined, users can create instances of the classes by filling the class templates with information from the actual resume. Users also need to link instances together; for example, a "Resume" instance must be linked to an "Education" instance via the "hasEducation" property of the "Resume".

As we have shown, converting a resume into a structured form with traditional tools not only requires tremendous effort, but is also mind-challenging. There is even a research article written on defining a resume ontology [18]. It is a difficult task even for ontology experts, and more so for ordinary users. In the following sections we will further elaborate on the root of these difficulties. We will then propose our approach, which makes creating structured data significantly easier for ordinary users than Protégé-like tools do.

3.3.2.2 Ontology-based Approach Forces Users to Take on Difficult Tasks First

As the resume example shows, creating structured data with top-down tools such as Protégé is both mind-challenging and effort-consuming. We argue that the major reason these tools are difficult to use is that they employ an ontology-based approach.
Ordinary people are good at thinking in termsofconcreteobjects, buttheyaregenerallynotasgoodatthinkingatanontological level. Furthermore, Users might not know about a domain enough to create an ontology. Therefore ontology creation is more difficult. 2. Efforts are needed to locate appropriate ontology concepts for the instance data to be created. In a ontology-based approach, any instance data must conform to some concepts in the ontology. Before users are able to create data, they need to examine whetherappropriateontologyconceptsexistforthedatatobecreated. Withtheamount of ontology and ontology concepts growing, this is an increasingly difficult task. As an analogy, many people attempt to create a nice folder hierarchy for their files. With the time going on, they usually lose track of the old hierarchy and end up with creating new folders for the same kind of files. Similar phenomena happen to emails. When the ontology corpus becomes large, people may get tired of looking up appropriate ontologies and rather create a new one for the data at the hand. 30 3. Due to the inherent complexity in ontology creation, the user interfaces (or even tools) for creating ontologies are normally difficult to use and require significant learning time. In addition, the user interfaces for creating ontology are different from those for creating instance data, thus further increasing learning time of the tool. Take Prot´ eg´ e for example, users need to deal with concepts such as “slot”, “cardinality”, “domain” and “range” in the GUI which ordinary users are unfamiliar with. 4. By requiring users to define ontology first, which normally takes time and effort, the barrier to data creation is made so high that users are often discouraged from using the tools. In the ontology-based, top-down approach, users do not get to data creation phase immediately. On the contrary, they might need to spend a large amount of time and patience defining the ontology first, which in many cases would turn people away. For example, if a person is planning a trip and would like to store trip information such as hotel confirmation number and flight number, she would rather use notepad to store such information if she is required to define a trip ontology first with the ontology-based tool. 5. Another minor problem with ontology-based tools is that these tools typically organize created data in a class hierarchy, which is not what users want in many cases. For example, an information analyst might want to organize intelligence information by different terrorists instead of by intelligence categories. 3.3.2.3 Early Binding into Ontology Causes Evolvability Problem As we discussed earlier, one problem with ontology-based tools is that they force users to take on difficult ontology creation task first, making it difficult for users to get started. 31 However, after the ontology and data are created, another problem with ontology-based approach, we argue, is the difficulty in changing previous created ontology and data. Being able to modify previously created ontology and data is important. When users create the ontology at first time, she might not envision new requirements and new data in the future. Also, the old ontology might not be sufficient to support new data, thus requiring modifications. It is a common practice in software engineering that it is better to avoid making de- sign decisions prematurely. In an ontology-based approach, design decisions on ontology have to be made before any instance data can be created. 
The structure of an instance is completely determined by the defined ontology. Unknown types and unknown relationships are not allowed. Thus, whenever modifications need to be made to the ontology, care must be taken to synchronize previously created instance data with the new version of the ontology, which is not necessarily an easy job even for ontologists. In addition, any modification to the structure of a data instance is impossible without modifying the ontology, which in turn requires modifications to other data instances. As a result, ontology and data evolvability is a tricky issue in an ontology-based tool: it is error-prone and can result in loss of data.
3.4 Our Approach to Semantic Data Creation: Bottom-up Data Creation
We have argued that the difficulty of using traditional data creation tools stems from their ontology-based approach. For ordinary users, however, the primary concern is the instance data, not the ontology. Therefore, to resolve the difficulty of creating structured data, we adopt a bottom-up, data-centric approach. We will argue that this approach provides instant gratification to users and significantly lowers the difficulty of data creation.
3.4.1 Brief Introduction of MetaDesk
Our bottom-up, data-centric approach to data creation is embodied in our tool MetaDesk. MetaDesk is an RDF authoring tool that emphasizes entry of facts rather than construction of ontologies. MetaDesk's approach to semantic data creation differs from that of other tools. First, MetaDesk allows users to create semantic data structures immediately, without needing to construct ontologies first. Ontologies are inferred later from the created data. That is, the ontology is a summary of the data rather than a prerequisite of data creation. Second, the semantic data users create need not be fully specified. When unspecified, by default all created data instances are of the universal type "Thing" and the relationships between them are the universal "parentChild" link. The universal "Thing" type and "parentChild" link allow users to quickly put up data structures expressing related concepts with minimal cognitive requirements. MetaDesk then provides a set of intuitive, easy-to-use operations for users to modify or refine the created data, resulting in data and ontologies of comparable quality to those produced by ontology-based tools.
3.4.2 A MetaDesk Example: Basic MetaDesk GUI Operations
In this section we will give a short example of data creation in MetaDesk. (The material in Sections 3.4.2 and 3.4.3 has also been published in [77].)
Suppose you are planning a trip to the forthcoming International Semantic Web Conference (ISWC) and you need to record information about the trip in an organized fashion. Details could include the flight carrier, confirmation number, hotel preferences, hotel availability details, prices, etc. In addition, you would like the information to be represented in such a way that amendments to the data are easy to make.
Storing such information in traditional RDF authoring tools is a tedious process. Instead of directly writing the information into the tool, you first have to create a myriad of classes and properties: a Trip class, a Flight class, a Hotel class, and a Conference class. The domain and range constraints of the properties also have to be specified. Furthermore, the ontological information is not very obvious in certain cases; for example, it is difficult to name the relationship between the Trip class and the Flight class, or between the Trip class and the Hotel class.
As a result, a naive user, or one in a hurry, would prefer to record such information as plain text rather than in such an RDF authoring tool. We would like our tool to excel in simplicity, providing an efficient data entry paradigm.
Recording the information in this example is easy and fast with MetaDesk. MetaDesk provides two metaphors for entering information: (1) users can create "nodes" (represented internally as RDF resources) that are arranged in a hierarchy, and (2) they can attach attribute-value pairs to nodes. A new node is created by highlighting an existing node and explicitly typing the name of a child node, or by dragging something (a Web page, PDF file, Word document, etc., or another node) onto the highlighted node. MetaDesk consciously imitates the gestures, look, and feel used to construct hierarchies in Windows Explorer.
To record the trip information, one can simply create a Trip node and add some child nodes to it. The child nodes could be a Flight node, a Hotel node, and a Conference node. One can attach other information to individual nodes; for example, an attribute-value pair holding the confirmation number can be added to the Flight node. The resultant hierarchy is shown in Figure 3.6 (the dashed rectangle).
As can be seen from the example, with only a few steps the seemingly ontologically complex trip information is recorded in MetaDesk. Furthermore, the created data is structured, which makes processing such as information retrieval easier and provides the necessary foundation for future refinement.
Figure 3.6: The MetaDesk environment
3.4.3 The MetaDesk Data Hierarchy
3.4.3.1 Mapping MetaDesk Data Hierarchies to RDF
MetaDesk is RDF-based. Although users enter the data rather quickly without knowing anything about RDF, underneath, the created data is converted to RDF triples. Below we list the underlying RDF triples for the trip information entered above (Table 3.1). The "parentChild" links state that under the "ISWC 2004 Trip" node are the "Flight" node, the "Hiroshima Prince Hotel" node, and the "Conference Details" node. Under the "Flight" node are four other nodes representing individual connecting flights: "JAL1604", "JAL5016", "JAL5015", and "JAL1601". For each individual flight, there are triples representing the departure time. There are also RDF triples defining the reservation number and phone number for the hotel, etc.
Node relationships: The default relationship (RDF property) between a parent node and a child node is called 'parentChild', defined in the MetaDesk system namespace. The 'parentChild' relationship can be regarded as a general, directed structural relationship. It subsumes more specific relations such as part/whole, class/subclass, set/set member, or folder/subfolder. The generality of the 'parentChild' relationship enables easy integration of these relations into the MetaDesk data hierarchy, which we discuss in more detail in Section 3.4.3.2.
Given a parent node 'P' and one of its child nodes 'C', the link between them can be represented by an RDF triple of the form <P, R, C>, where 'R' is either 'parentChild' or one of its subproperties. In other words, the hierarchy is entirely defined by 'parentChild' triples.
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#Trips">
  <rdfs:label>Trips</rdfs:label>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#ISWC_2004_Trip"/>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#ISWC_2004_Trip">
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#Hiroshima_Prince_Hotel"/>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#Flight"/>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#Conference_Details"/>
  <rdfs:label>ISWC 2004 Trip</rdfs:label>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#Places_to_Visit"/>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#Hiroshima_Prince_Hotel">
  <rdf:type rdf:resource="http://www.isi.edu/~user001/metadesk#Hotel"/>
  <myNS:Phone_Number>81-82-256-1111</myNS:Phone_Number>
  <myNS:Reservation_Number>3345788</myNS:Reservation_Number>
  <rdfs:label>Hiroshima Prince Hotel</rdfs:label>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#Flight">
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#JAL1601"/>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#JAL1604"/>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#JAL5015"/>
  <rdfs:label>Flight</rdfs:label>
  <sew:parentChild rdf:resource="http://www.isi.edu/~user001/metadesk#JAL5016"/>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#JAL1604">
  <rdfs:label>JAL1604</rdfs:label>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#Phone_Number">
  <rdfs:label>Phone Number</rdfs:label>
</rdf:Description>
<rdf:Description rdf:about="http://www.isi.edu/~user001/metadesk#Places_to_Visit">
  <fileNS:fullpath>C:\Documents and Settings\user001\My Documents\Places to Visit</fileNS:fullpath>
  <rdfs:label>Places to Visit</rdfs:label>
</rdf:Description>
Table 3.1: The Underlying RDF for the MetaDesk Data Hierarchy: RDF Triples for the Trip Information
Node operations, 'Move' and 'Link': A node can have multiple parents (i.e., it can be in the object position of multiple 'parentChild' triples). A 'move' operation on a node changes the parent of the node; that is, it deletes the original 'parentChild' triple and asserts another one. A 'link' operation adds another parent to the node, i.e., it asserts an additional 'parentChild' triple whose subject position is the new parent node. The 'move' and 'link' operations can be conveniently carried out via drag-and-drop in MetaDesk.
Node attributes: Each node N has zero or more properties (attributes), each of which is represented by a triple of the form <N, P, O>, where 'P' is neither 'parentChild', nor a subproperty of 'parentChild', nor its inverse. There are no restrictions on what attributes a node can have; that is, domain constraints are not enforced. The attribute value 'O' can be either a literal or another node, called an attribute node throughout this document.
Node types: By default, each node has an 'rdf:type' of "Thing", which is defined to be the most general type. Users are encouraged, but not required, to fill in a more specific type.
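To make the semantics of these operations concrete, the following sketch (our own Python illustration, not MetaDesk's actual implementation) models the hierarchy as a set of triples and implements 'move' and 'link' as triple updates:

PARENT_CHILD = "sew:parentChild"

# A tiny MetaDesk-style hierarchy as a set of RDF-like triples.
triples = {
    ("myns:Trips", PARENT_CHILD, "myns:ISWC_2004_Trip"),
    ("myns:ISWC_2004_Trip", PARENT_CHILD, "myns:Flight"),
}

def move(triples, node, old_parent, new_parent):
    # 'move' retracts the old parentChild triple and asserts a new one.
    triples.discard((old_parent, PARENT_CHILD, node))
    triples.add((new_parent, PARENT_CHILD, node))

def link(triples, node, extra_parent):
    # 'link' only asserts an additional parentChild triple,
    # so the node ends up with multiple parents.
    triples.add((extra_parent, PARENT_CHILD, node))

move(triples, "myns:Flight", "myns:ISWC_2004_Trip", "myns:Trips")
link(triples, "myns:Flight", "myns:ISWC_2004_Trip")   # two parents now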
Labels and URIs: The MetaDesk hierarchy is entirely RDF-based: every link and every node attribute in MetaDesk maps to an RDF triple. However, RDF structures in raw form are not readable, so we want to hide the details of RDF, such as URIs and namespaces, from users. Hence, all non-literal names that a user sees (names attached to nodes in the hierarchy, the names of attributes, and the names of nodes in attribute-value position) correspond to RDF 'labels'. Underneath, each label 'N' maps to a URI 'U', and MetaDesk asserts the triple <U, rdfs:label, N>. Some labels have semantics built in; e.g., "type" maps to 'rdf:type' and "parent class" maps to 'rdfs:subClassOf'.
By default, a new node with a label 'xxx' is assigned the URI 'myns#xxx', where 'myns' is the URI of the user's personal namespace. If the label 'xxx' contains characters other than alphanumerics or underscores, it cannot be used as-is in a URI; in this case, an artificial unique local name is concatenated with 'myns#' in place of 'xxx'. If another node with the same URI already exists, a unique URI for the new node is obtained by concatenating a number (usually a variation of the timestamp) to 'myns#xxx'. That is, by default, nodes with the same label are assumed to be different (and are assigned different URIs). Users can later assert that two nodes are equivalent, which consolidates their two URIs into one. In an earlier design, when creating a node, users might be prompted to confirm whether the new node was the same as an existing node with the same label, which frequently disrupted the flow of data creation. In the current design, node consolidation is carried out after users are done with node creation. The system can help by detecting nodes with the same labels and displaying their full paths to the root, so that users can easily find possible consolidation candidates and make decisions. Users also have the option to change the URI of a node directly.
An attribute with a label 'P' is assigned the URI 'myns#P'; for example, an attribute labeled 'fullname' is assigned the URI 'myns#fullname'. An attribute value 'V' is stored as a literal (a string) if the range of the attribute is a literal class (a subclass of 'xsd:Literal'), or as a resource if the range indicates a non-literal. If there is no range information, 'V' defaults to a literal. Users can convert a literal value to a resource with that label at any time.
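The label-to-URI rules above can be condensed into a short sketch. This is our own reconstruction in Python: the namespace value is a placeholder, spaces are mapped to underscores (as example data such as 'myns#Places_to_Visit' suggests), and a counter stands in for the timestamp-based suffix to keep the sketch deterministic.

import re
from itertools import count

MYNS = "http://www.isi.edu/~user001/metadesk#"  # placeholder personal namespace
_seq = count(1)
existing_uris = set()

def uri_for_label(label):
    # Spaces appear to map to underscores in the example data
    # ("Places to Visit" -> myns#Places_to_Visit); any other illegal
    # character forces an artificial unique local name.
    local = label.replace(" ", "_")
    if not re.fullmatch(r"[A-Za-z0-9_]+", local):
        local = "n%d" % next(_seq)
    uri = MYNS + local
    # Nodes with the same label default to *different* URIs.
    while uri in existing_uris:
        uri = MYNS + local + "_" + str(next(_seq))
    existing_uris.add(uri)
    return uri

print(uri_for_label("Places to Visit"))  # ...#Places_to_Visit
print(uri_for_label("Places to Visit"))  # ...#Places_to_Visit_1 (a new node)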
3.4.3.2 Generality of the MetaDesk Data Hierarchy
MetaDesk uses hierarchies as its data entry and organization paradigm, which has several advantages. First, a data hierarchy is a natural and arguably the most intuitive way for users to organize information. Hierarchies can be seen in folders and files, email folders, web directories, company organizations, etc.
Second, the MetaDesk data hierarchy is designed to be a general hierarchy. The general 'parentChild' link and 'Thing' node type enable other kinds of hierarchies to be integrated into MetaDesk. Below we give two examples of how an XML hierarchy and a folder/file hierarchy can be imported into MetaDesk, and what the resultant underlying RDF structures are.
Import XML into MetaDesk: Arbitrary XML files can be dropped into a MetaDesk hierarchy. These are automatically converted into RDF, with the top-most tag forming the root resource. The 'parentChild' property is used to represent the relationship between tags and subtags (except when the subtag represents a literal). For example, consider the following XML:
<trip>
  <hotel confirmation="39880A78B"/>
  <flight fltnum="884" confirmation="S38BN04">
    <carrier>America West</carrier>
  </flight>
</trip>
Table 3.2: Importing XML into MetaDesk: An XML Segment
Our translator would create resources of RDF types 'myns:Trip', 'myns:Hotel', and 'myns:Flight', with 'parentChild' links from the Trip resource to the Hotel and Flight resources. Each of the three attributes is converted into the obvious RDF triple. The Flight resource is linked to the string "America West" via a property named 'myns:carrier'. The resultant hierarchy is shown in Figure 3.7.
Figure 3.7: Importing XML into MetaDesk
Import File Hierarchy into MetaDesk: Arbitrary folder/file hierarchies can also be dropped into a MetaDesk hierarchy; these, too, are automatically converted into RDF. Each folder is mapped to a node of type 'metadesk:DesktopFolder', and each file is mapped to a node of type 'metadesk:File'. Each folder node and file node also has a 'metadesk:fullpath' property whose value is its full path in the file system. The 'parentChild' property is used to represent the relationships between folders, sub-folders, and files. Figure 3.8 shows what a file hierarchy looks like in MetaDesk. As another advantage, importing a file hierarchy into MetaDesk allows users to add annotations to it. MetaDesk also provides convenient features such as launching files and folders from within MetaDesk, and synchronization between the actual file hierarchy and its imported version in MetaDesk.
Figure 3.8: Import folder/file hierarchies into MetaDesk
In contrast to the generality of the MetaDesk hierarchy, the only kind of hierarchy allowed in top-down ontology editors such as Protégé is the class hierarchy. In these tools, data can only attach to the class hierarchy and cannot form its own organizational structure, making the tools inconvenient for organizing information. Finally, hierarchies are easily put up by users while still having basic structure, thus providing the necessary scaffold for future refinement.
3.4.4 Batch Creation of MetaDesk Hierarchies
One distinguishing characteristic of MetaDesk is that it allows weakly specified data hierarchies, thanks to the adoption of "parentChild" links between nodes. This characteristic not only makes it easy for users to input data, it also makes it possible to enter a large volume of data into MetaDesk in a batch. This section introduces the batch data creation functionality of MetaDesk.
Going back to the resume example mentioned earlier in this chapter: converting the resume into a structured representation with top-down tools requires a lot of effort, because users need to create ontological structures to hold the various pieces of information. With the universal "Thing" class and "parentChild" relationship in MetaDesk, the resume can be transformed into an initial structured representation with much less effort.
MetaDesk provides a batch-input paradigm for entering a large data hierarchy efficiently. To input the resume information into MetaDesk, users first copy and paste the resume into the MetaDesk batch editor. Users then indent the text in the editor into a desirable hierarchy with the Tab key, as shown in Figure 3.9.
Figure 3.9: Batch input editor
After the indentation process is completed, users can click on the "Set" button, and an initial hierarchy with all the resume information is created, as shown in Figure 3.10.
Figure 3.10: The initial hierarchy from batch input
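Under the hood, the conversion from indented text to an RDF hierarchy can be pictured as follows. This is our own sketch, not MetaDesk's code; it turns Tab-indented lines into 'parentChild' triples, with every node defaulting to the type "Thing" (the label-to-URI sanitization shown earlier is omitted for brevity):

PARENT_CHILD = "sew:parentChild"
THING = "metadesk:Thing"

def batch_to_triples(text, root="myns:Root"):
    """Convert Tab-indented lines into parentChild triples."""
    triples = []
    stack = [(-1, root)]  # (indent depth, node) path from the root
    for line in text.splitlines():
        label = line.strip()
        if not label:
            continue
        depth = len(line) - len(line.lstrip("\t"))  # number of leading tabs
        node = "myns:" + label.replace(" ", "_")
        while stack[-1][0] >= depth:
            stack.pop()  # climb back up to this line's parent
        triples.append((stack[-1][1], PARENT_CHILD, node))
        triples.append((node, "rdf:type", THING))
        stack.append((depth, node))
    return triples

resume = "EDUCATION\n\tUniversity of Michigan\nSKILLS\n\tJava"
for t in batch_to_triples(resume):
    print(t)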
As can be seen, creating a data hierarchy representing the resume is easier with bottom-up tools such as MetaDesk than with top-down tools such as Protégé. First, with MetaDesk, users need not deal with an ontology in the first place; they work with instance data directly. In this resume example, the instance data is readily available, and users are not required to create anything. With traditional top-down tools, on the contrary, users need to create the necessary ontological structures in the resume domain to hold the instance data. In terms of cognitive ease, therefore, MetaDesk has a clear advantage over top-down tools. Second, compared to the numerous copy-and-paste operations between the resume document and the corresponding ontological structures required by traditional tools, the indentation operation in MetaDesk is much simpler and quicker.
The initial data hierarchy is in a rudimentary shape. It does not contain any ontology and is not marked up with any ontology information; the semantics of the data nodes are not specified either. Nevertheless, the data hierarchy is already a big step forward from plain text: it is more finely structured, which allows for more sophisticated processing. For example, search over the data hierarchy can find individual data pieces instead of the whole document. A search over the resume hierarchy with the keyword "Reference" can return the three references under the "Reference" node as individual matches, while a search over a plain-text resume would only return the whole resume document.
Furthermore, the data hierarchy, while still rudimentary, provides a useful scaffold for later refinement. As will be shown in later sections, the initial data hierarchy can be broken down into finer data structures and marked up with ontology information. More importantly, the approach provides a way for users to record data quickly and easily, thus encouraging users to take the first step towards semantic data creation.
3.4.5 Data Refinement
The initial data hierarchy, although more structured than plain text, is still rather coarse-grained and lacks ontological information. Our approach to semantic data creation is to let users enter data easily, then refine the data gradually as they wish. In this section, we discuss how to refine the data hierarchy, more specifically, how to augment it with ontology information.
Why Are Data Semantics Needed?
In addition to lowering the data creation threshold, we also want to ensure the quality of the created data. The created data should be augmented with semantics, i.e., marked up with ontology information, for two main reasons.
First, data with ontology information is easier to process. Having nodes and node links is not enough; without ontology information such as node types and link names, it is hard to process data. For example, a search for "Movie" might not find the node labeled "Lord of the Rings" if the node does not have the type "Movie". When nodes have types, we can also manipulate nodes of the same type as a set, rather than one node at a time.
Second, ontologies make data sharing much easier. Without node types, it is difficult, if not impossible, to integrate two data sources. With node types, the problem can be reduced to ontology alignment.
3.4.5.1 Finer-grained Data Structures
The originally entered data can be coarse-grained.
In the resume example, when users first enter the permanent address, they might not break the address into individual pieces such as city, state, and zip code, as shown in Fig 3.11. Users can choose to break the address into finer structure whenever they desire. To do that, they simply select the address node and invoke the "Split This Node into Several" option from the right-click menu; the batch input editor then appears with the node as its content. Just as when batch-inputting a data hierarchy, users can break the address node into different lines, as shown in Fig 3.12. A finer-grained address hierarchy is then created, as shown in Fig 3.13.
Figure 3.11: Coarse-grained node
Figure 3.12: Breaking a node into several in the Batch Editor
Figure 3.13: Finer-grained nodes
3.4.5.2 Marking up Data Hierarchies with Ontology Information
Breaking coarse-grained data into finer pieces is one aspect of data refinement. The other, and more important, aspect is to mark up the original data hierarchy with ontology information. In the original data hierarchy, all nodes are of type "Thing" and the relationships between nodes are the universal "parentChild", which does not say much about the semantics of the hierarchy. MetaDesk provides several ways of augmenting the data hierarchy with ontological information.
Supplying Ontology Information by Direct Editing: One way of refining the data is to directly give a data node a type. For example, in Fig 3.14, users can directly specify that the selected data node "Address until June 1, 1999" is of type "Address".
Figure 3.14: Editing node type directly
Users can also directly specialize the link between a parent node and a child node. For example, in Fig 3.15, users specialize the link between the "Salesperson" node and the "Summers 1995 and 1999" node to the "duration" property. As shown in Fig 3.16, the relationships between "Salesperson" and its three child nodes are specialized from "parentChild" to the more meaningful "duration", "organization", and "job responsibility".
Figure 3.15: Editing links between nodes directly
Figure 3.16: Links (properties) between nodes
Getting Ontology Information from Node Hints: Frequently, part of the ontology information is already embedded in the data hierarchy. In particular, a common pattern in a data hierarchy is that users use a parent node to indicate the type of its children. Taking the resume data hierarchy as an example, nodes such as "EDUCATION", "RELATED EXPERIENCE", "EMPLOYMENT", "SKILLS", and "REFERENCES" all indicate the types of their children. MetaDesk provides a quick way of ontology markup for these kinds of nodes: with a single click (on the 'C' button in the MetaDesk toolbar) users can tell MetaDesk that the selected node is of this kind, as shown in Fig 3.17. So informed, MetaDesk determines that the selected node is of type "Collection" (Fig 3.18). MetaDesk also determines the types of its children (Fig 3.19) by stemming the label of the parent node. That is, if the parent node is "REFERENCES", the type of its three children is "Reference".
Figure 3.17: Specifying a node as a collection node
Figure 3.18: A node of type "Collection"
Figure 3.19: A node with type obtained from its parent
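A plausible reading of the 'C' operation is sketched below. This is our own illustration; MetaDesk's actual stemming rules are richer than the naive singularization used here (which, for instance, would mishandle "ACTIVITIES"):

def mark_as_collection(triples, node, label, children):
    # The selected node becomes a Collection...
    triples.add((node, "rdf:type", "metadesk:Collection"))
    # ...and each child gets a type stemmed from the parent's label,
    # e.g. "REFERENCES" -> "Reference", "SKILLS" -> "Skill".
    type_name = label.capitalize()
    if type_name.endswith("s"):
        type_name = type_name[:-1]
    for child in children:
        triples.add((child, "rdf:type", "myns:" + type_name))

triples = set()
mark_as_collection(triples, "myns:REFERENCES", "REFERENCES",
                   ["myns:Ref1", "myns:Ref2", "myns:Ref3"])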
In addition to type information, property information can be embedded in the data hierarchy as well. For example, the 'Permanent Address' node in the resume hierarchy can be seen as a property of its parent node. These kinds of property nodes occur frequently, especially when the hierarchy was created with the batch input editor. Similarly, users can inform MetaDesk of such property nodes (with a click on the 'P' button on the toolbar).
3.4.5.3 Augmenting a Data Hierarchy by Batch Markup
In addition to operating directly on individual nodes to augment them with ontology information, users can also operate on a set of nodes in a batch through a table view, which is especially useful when users need to mark up a large set of (ontologically) similar nodes.
Suppose users need to convert a list of project information, as shown in Fig 3.20, into RDF. Converting such information into RDF requires a lot of effort with traditional top-down tools; we will show that converting it with MetaDesk is much faster and easier. As usual, users can first use the batch input editor of MetaDesk to get an initial data hierarchy of the project information, as shown in Fig 3.21.
Figure 3.20: A large list of project information
Figure 3.21: A large hierarchy of initial project nodes
Users thus get an initial hierarchy without any ontological information attached; for many users this already provides a good-enough first version. To augment the list of project nodes with ontology information, users select the "Projects" node and click the "Table View" button. A table view of the currently selected node "Projects" appears (Fig 3.22). To zoom in to the individual project nodes, users select the column with the list of project names and click the zoom button in the upper-left corner. A table view of the individual project nodes appears (Fig 3.23).
Figure 3.22: Table view of selected node
Figure 3.23: Table view of children nodes after zoom-in
Users can then select a cell value, create a new column, and move the selected cell value to the new column. Fig 3.24 shows some newly created columns, with some sample values moved into them. Users can then move the other cell values into appropriate columns. To expedite the moving process, MetaDesk has a built-in classifier (Fig 3.25) that automatically classifies the values in a selected column into the other appropriate columns, as shown in Fig 3.26. Users can undo the automatic classification with the "Undo" button in the toolbar. More details on the classifier are given later in this section.
Figure 3.24: Newly created columns in table view
Figure 3.25: Auto-classifying cell values into appropriate columns
Figure 3.26: Auto-classification results
After users move some sample cell values to appropriate columns and run (and possibly undo) auto-classification a few times, they obtain a table view of the individual nodes with cell values nicely aligned with appropriate columns. Users can also specify the type of the nodes (rows) in this table, as shown in Fig 3.27. If users now switch to the hierarchical view, the project data hierarchy is nicely (and, in this case, fully) marked up with ontological information (classes and properties), as shown in Fig 3.28.
Figure 3.27: Refined table view
Figure 3.28: Refined data hierarchy
The MetaDesk Text Classifier: To quickly mark up a list of ontologically similar nodes, users can list these nodes together in a table. Users can create and change table column names. The markup process then amounts to moving table cells into appropriate columns.
To expedite the moving process, MetaDesk has a built-in text classifier that learns from the existing values in the different columns and automatically classifies the values in a selected column into the other appropriate columns.
Given a string V and a list of columns {P1, P2, ..., Pn}, the task of the classifier is to find the column Pi whose content is most similar to V. To do that, each column, as well as V, is represented by a feature vector. The dissimilarity between a column Pi and V is computed as the distance between their feature vectors; the column whose distance to V is smallest is selected as the column where V belongs. Each feature is quantified to a value in [0, 1]. The feature vector consists of the following features:
Length Difference: the length difference between V and a column Pi, quantified to a value in [0, 1]. The quantification takes into account how many times larger one length is than the other, as well as the length values themselves.
Digits: For V, the percentage of digits among all characters. For Pi, the average percentage of digits among all characters over its sample values.
Person Names: For V, the percentage of person names among all words. For Pi, the average percentage of person names among all words over its sample values. A person-name recognizer was implemented in MetaDesk based on the lists of last names and first names from the US Census data.
Acronyms: For V, the percentage of acronyms among all words. For Pi, the average percentage of acronyms among all words over its sample values.
Word Distribution: the percentage of words of V that appear in the sample values of Pi.
Given two feature vectors, we can compute their distance (smaller is better) or their cosine (larger is better). Our experiments showed that the min-distance approach performs better than the max-cosine approach.
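A minimal sketch of the min-distance classification step follows. This is our own illustration, implementing only two of the features above (digit fraction and a crude length feature); the actual MetaDesk feature set is richer.

import math

def digit_fraction(s):
    chars = [c for c in s if not c.isspace()]
    return sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0

def features(value, column_samples=None):
    """Feature vector for a single string, or the average over a column."""
    if column_samples is not None:
        vecs = [features(v) for v in column_samples]
        return [sum(f) / len(vecs) for f in zip(*vecs)]
    # Two illustrative features, both quantified to [0, 1]:
    return [digit_fraction(value), min(len(value), 50) / 50.0]

def classify(value, columns):
    """Pick the column whose averaged feature vector is closest to the value's."""
    v = features(value)
    def dist(col):
        return math.dist(v, features(value, columns[col]))
    return min(columns, key=dist)

columns = {"Budget": ["$1,200,000", "$950,000"],
           "PI": ["Jane Smith", "Robert Neches"]}
print(classify("$2,400,000", columns))   # -> "Budget"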
3.4.6 Inferring Ontology
In this section we describe how ontologies can be inferred from a marked-up data hierarchy. The kinds of ontology information MetaDesk currently deals with include: classes and subclasses; properties (data properties, whose values are literals, and object properties, whose values are resources); domains and ranges; and property cardinalities. The inferred ontology is displayed in a separate tab, as shown in Figure 3.29. The information in this tab is dynamic, because the inferred ontology might change as the data hierarchy changes.
Figure 3.29: Inferred ontology
Classes: Getting the list of classes from the data hierarchy is straightforward: every node type that is not a built-in RDF/OWL class (e.g., 'rdf:Property') or MetaDesk class (e.g., 'metadesk:Collection', 'metadesk:Folder') is a user-defined class in the ontology.
Properties: Getting the list of properties from the data hierarchy is also straightforward: every node attribute, as well as any node of type 'rdf:Property', is a property in the ontology.
Property Domain, Range, and Cardinality: The domain, range, and cardinality information of a property is generalized from the data hierarchy. In the simplest case, for a node with type D and a property p, the domain of p is D. For a node with a property p, if the value of p is of type R, the range of p is R. For a property p, if there is some node with multiple p attributes, the cardinality of p is multiple; otherwise the cardinality of p is one. In some cases a property might have multiple domains and ranges, which is allowed in MetaDesk; the actual domain and range are then the unions of the multiple domains and ranges respectively.
Due to the way the URI of a property is constructed (the concatenation of 'myns#' and the property label), two conceptually different properties with the same label can end up with the same URI and thus be mistakenly regarded as the same property. In the ontology inference phase, such properties can be disambiguated based on their domains. More specifically, if there is no meaningful subsumption relationship between their domains, the two properties are regarded as different and the URI of one of them is changed. For example, both 'myns:Project' and 'myns:Person' might have a 'myns:name' property. Because there is no subsumption relationship between 'myns:Project' and 'myns:Person', the 'name' property of 'myns:Project' is changed to 'myns:projectName' (the concatenation of the domain name and the label), with the label 'name' unchanged. Conversely, if two different properties (i.e., with different URIs) have the same label but their domains are related via a subsumption relationship, the two properties are assigned the same URI, i.e., they are regarded as the same.
Subclasses: User-defined Subclass: In the ontology panel shown in Figure 3.29, users can directly drag a class node and drop it under another class node, which by default results in a subclass relationship between them. Predicate-defined Subclass: MetaDesk can also infer subclass relationships from the property sets of two classes. The property set of a class is the set of properties whose domain is that class; in other words, the property set defines what attributes an instance of the class can have. The idea is that a subclass inherits the property set of its superclass. Thus, if the property set of class Csub is a superset of the property set of class Csup, it can be inferred that Csub is a subclass of Csup.
Ontology Statistics: In addition to the ontology information, some statistics on the ontology are computed, as shown in Figure 3.29. The statistics include the number of instances of a class and the number of occurrences of a particular property in the class. For example, Figure 3.29 shows that the property 'projectName' appeared 14 times in the class 'Project'. The ontology statistics are useful in that they help users identify important properties (which occur very often) and sometimes wrong properties (which occur rarely in the class). One possible, but unimplemented, use of the ontology statistics is to identify the key property of a class. A key property of a class is analogous to the key attribute of a database table: it appears exactly as many times as there are records. The key property of a class could help the system identify possibly equivalent nodes in the hierarchy (recall that all nodes are different by default, even when they have the same label).
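The core generalization rules of this section can be condensed into a sketch. This is our own reconstruction: it infers property domains, cardinalities, and predicate-defined subclasses from instance data; range inference, URI disambiguation, and statistics are omitted for brevity.

from collections import defaultdict

def infer_ontology(instances):
    """instances: list of (type, {property: [values]}) pairs."""
    domains = defaultdict(set)
    cardinality = {}
    for node_type, props in instances.copy() if False else instances:
        for p, values in props.items():
            domains[p].add(node_type)
            if len(values) > 1:
                cardinality[p] = "multiple"
            else:
                cardinality.setdefault(p, "one")  # never downgrades "multiple"
    # Predicate-defined subclasses: if Csub's property set is a strict
    # superset of Csup's, infer Csub rdfs:subClassOf Csup.
    prop_sets = defaultdict(set)
    for p, ds in domains.items():
        for d in ds:
            prop_sets[d].add(p)
    subclasses = [(sub, sup)
                  for sub in prop_sets for sup in prop_sets
                  if sub != sup and prop_sets[sub] > prop_sets[sup]]
    return domains, cardinality, subclasses

instances = [
    ("Project", {"projectName": ["WebScripter"], "member": ["A", "B"]}),
    ("FundedProject", {"projectName": ["MetaDesk"], "member": ["C"],
                       "sponsor": ["DARPA"]}),
]
print(infer_ontology(instances))  # FundedProject inferred as subclass of Project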
3.4.7 Advantages of the Bottom-Up Approach
We argue that the bottom-up approach to semantic data creation has several advantages over the top-down approach.
3.4.7.1 Lowering the Entrance Barrier to Semantic Data Creation
One advantage of the bottom-up approach is that it significantly lowers the entrance barrier to data creation, for several reasons. First, with a bottom-up approach, users get to the data creation phase immediately, without any prerequisite; in other words, there is no cost for users to start. This is obvious from how MetaDesk works.
Second, the bottom-up approach allows the creation of data hierarchies without specifying types and links, the more difficult task, which can be postponed to the data refinement phase. It is therefore very easy for users to put up the data hierarchy first. This is impossible with the top-down approach, which is strongly typed and does not allow unknown types and links.
In short, the bottom-up approach requires no initial cost for users to start data creation, and during the creation process it requires little effort from users to put up the data hierarchy.
3.4.7.2 Instant Gratification
The bottom-up approach allows users to create a data hierarchy quickly. We argue that the initial data hierarchy, although still lacking ontological information, is already structured and of great utility to users. This brings about another advantage of the bottom-up approach: instant gratification. That is, users quickly get something useful.
One good thing about MetaDesk is that the created data hierarchy is nicely structured. Being able to record data quickly does not say much about a tool; a notepad lets users record data easily and quickly too, but the recorded data is in such a primitive form that post-processing is extremely difficult.
We argue that a little bit of structure goes a long way. The structure in the MetaDesk data hierarchy not only provides a foundation for later refinement, but is also of great utility to users in its current form. First, the data hierarchy reflects the way users organize their data; it thus provides a good view of the data and is easy for users to browse. Second, the data hierarchy makes more sophisticated processing, such as advanced search, possible. For example, a keyword search for 'Skills' over plain text might only return the whole resume document, whereas the same search over the data hierarchy can return the actual list of skills. This is because, first, the data in the hierarchy is of finer granularity than documents or paragraphs, and second, the finer-grained pieces of data are related via containment relationships in the hierarchy, making it easier to locate the desired information. For more on processing keyword search over structured data, there is substantial work on keyword-based search over relational databases [63] [1] [17] [62] and keyword-based search over XML [40] [49] [119].
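To make the search argument concrete, here is a toy sketch (ours, not MetaDesk's search implementation) that returns the finest-grained matching nodes, together with the children of a matching container, rather than a whole document:

def search(triples, keyword, labels):
    """Return nodes whose label matches, plus the children of any
    matching node (so a hit on 'SKILLS' surfaces the individual skills)."""
    kw = keyword.lower()
    hits = {n for n, label in labels.items() if kw in label.lower()}
    children = {o for s, p, o in triples
                if p == "sew:parentChild" and s in hits}
    return hits | children

labels = {"n1": "SKILLS", "n2": "Java", "n3": "XML"}
triples = {("n1", "sew:parentChild", "n2"),
           ("n1", "sew:parentChild", "n3")}
print(search(triples, "skills", labels))   # -> {'n1', 'n2', 'n3'}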
3.4.7.3 Overall Easiness of Semantic Data Creation
We argue that, over the whole semantic data creation process, the bottom-up approach is also easier overall. As discussed previously, it is easy to put up a data hierarchy with our bottom-up approach. After the data hierarchy is created, it is also easy for users to attach ontology information to it, because they have the data at hand and thus a good context for making decisions. In addition, the several markup mechanisms provided by MetaDesk (such as the batch markup mechanism) make it easy and efficient to mark up the data hierarchy.
3.4.8 Evaluation
In order to compare the performance of MetaDesk with that of traditional top-down tools, we conducted a series of user experiments.
Subjects: The human subjects for this experiment were students and staff from different groups of the Enterprise Scalable Systems Division (hereafter referred to by its internal nickname, Div2) at the University of Southern California's Information Sciences Institute. To avoid bias towards one tool or the other, we chose subjects who had never used either tool before. Furthermore, all subjects were new to the area of ontology modeling, which was confirmed after the experiment. The subjects consisted of three students and two staff members.
Procedures: First, a tutorial on the two tools (MetaDesk as an exemplar of the bottom-up approach and Protégé of the top-down approach) was given, using the Div2 project page and a resume sample as examples. Then all subjects were given two documents: the Div2 personnel page (http://www.isi.edu/divisions/div2/) and another resume example. The subjects were required to mark up these two documents into semantic form with the two tools; that is, they were required to perform four markup tasks. The time spent on each task was recorded, and the produced data was saved for later analysis. There was no limit on the time spent on each task, but the total experiment lasted an hour.
3.4.8.1 Cognitive Easiness of the Bottom-up Approach
Conceptual analysis and anecdotal impressions of the two approaches (MetaDesk as an example of the bottom-up approach and Protégé as an example of the top-down approach) suggest that bottom-up tools should be conceptually easier to use. For evaluation purposes, however, we need a more objective criterion. One such objective criterion, we argue, is the number of artificial constructs users have to manually create during the ontology and data creation process.
Artificial constructs are the opposite of literals. Among the constituents of semantic data, literals are the only things meaningful to end users. For example, literals like "Jane Smith" and "University of Michigan" are easy for end users to understand, while object IDs like "http://bar/foo#people135" are not intended for human interpretation. Artificial constructs act as a scaffold to which literals attach to form a semantic graph: literals are the terminal nodes in this semantic graph, while the non-terminal nodes are artificial constructs. Artificial constructs can be categorized as follows:
1. Ontology: class and property definitions, and constraints such as the domain and range definitions of properties.
2. Instances: instances of various classes, e.g., "O1:Person1", "O2:Project2", or "O3:OrganizationXYZ123". This does not include their property values.
3. Linking Statements: RDF statements that link two instances together. Statements whose object is a literal describe the instance in the subject position; in contrast, statements whose object is an instance specify the relationship between two instances. In the semantic network, linking statements are the edges between non-terminal nodes.
Let us look at artificial constructs and literals from the perspective of a markup process. Assume users are given the task of marking up a text document into a semantic form. Literals are the strings that can be seen in the text document, i.e., the content of the document. Artificial constructs, in contrast, are the structures and objects users need to create in order to connect those literals into a meaningful semantic structure. They are "artificial" because they are created by users and are not directly present in the text document. We argue that the number of artificial constructs users need to create is indicative of the difficulty (or ease) of the data creation task. Literals are readily available; in many cases users only need to copy and paste them into the right position, while artificial constructs have to be "created".
We argue that MetaDesk either reduces the number of artificial constructs users need to manually create or lowers the difficulty of creating them.
For category 1 artificial constructs (ontology), MetaDesk lowers the difficulty of creating them because users can generalize from already-created instance data rather than creating an ontology from scratch, as discussed at length in previous sections.
For category 2 artificial constructs (instances), MetaDesk is slightly better because users do not need to explicitly create an instance node in the batch input editor (they do if they operate directly on the tree hierarchy), and users never need to deal with instance IDs or URIs.
Category 3 artificial constructs are where MetaDesk has its biggest advantage. People are good at manipulating literals but not so good at manipulating artificial constructs. In top-down tools such as Protégé, users have to manually create every single link between two instances. Several experiment subjects identified this as a difficult task, because the things users deal with (instances and the links between them) are all artificial constructs and in many cases are not intuitive to handle. Things become much more difficult when users have to create multiple levels of links before reaching the literals. In MetaDesk, on the contrary, users mostly manipulate literals: they indent and organize literals into a suitable data hierarchy, which is much easier because literals are intuitive to manipulate. The necessary instances and links are created implicitly.
In theory, therefore, we can calculate the various categories of artificial constructs manually created by users with MetaDesk and with Protégé. Figure 3.30 shows the number of artificial constructs manually created with MetaDesk versus Protégé in a theoretical experiment in which both tools convert the resume example into an identical semantic form.
Figure 3.30: Artificial constructs created: MetaDesk vs Protégé
In the real-world experiment, however, since the difficulty of creating the data differs between the two tools, users did not produce the same semantic data with both tools under time constraints; they could produce more data with one tool than with the other. We therefore use another criterion to measure the difficulty (or ease) of the two tools: the completion percentage of the markup task, that is, the extent to which a user completed the markup task with a particular tool. Since the data users marked up might differ in nature from the rest of the data (for example, users might finish marking up the "Education" information but not the "Experience" information), we use the percentage of marked-up literals out of the total number of possible literals as the completion percentage.
Using this definition of completion percentage, the experiment results are shown in Figure 3.31. As can be seen from the figure, MetaDesk outperformed Protégé for all subjects on both tasks. In some cases (J1, P, J2, N), the task completion percentage with MetaDesk is roughly four times that with Protégé.
Figure 3.31: Task completion percentage: MetaDesk vs Protégé
3.4.8.2 Instant Gratification
We argue that with MetaDesk users quickly get a useful data hierarchy to start with, which provides instant gratification, rather than having to spend a lot of time and effort before getting something more elaborate.
Gradual Usage vs. All or Nothing
There are three main reasons why users can get a data hierarchy much faster with grass-roots tools like MetaDesk than with top-down tools like Protégé. First, with MetaDesk users do not have to create an ontology before creating the data hierarchy, which provides a large head start in ontology-heavy situations. Second, in MetaDesk, with a single copy-and-paste operation users can enter the text into the batch hierarchy editor and do the indentation from there. This is not possible with top-down tools like Protégé, where unknown types and links are not allowed; as a result, users have to do numerous copy-and-paste operations between Protégé and the textual document, which is much slower than a single copy-and-paste plus indentation. Finally, in the initial data hierarchy construction phase in MetaDesk, users mostly manipulate literals inside MetaDesk's batch hierarchy editor, which is more intuitive and easier than working in Protégé.
We conducted an experiment on ourselves (results shown in Figure 3.32) on marking up the Div2 project page and the resume example. The Div2 project page is ontology-light, so this experiment basically showed that a single copy-and-paste operation plus the necessary indentation is roughly five times faster than numerous copy-and-paste operations between Protégé and the textual document. For ontology-heavy texts such as a resume, the time needed to construct the initial data hierarchy in MetaDesk did not increase; it actually decreased, because there were fewer literals to indent. The time needed to mark up the resume in Protégé, however, increased greatly. As a result, the time needed to create a hierarchy in Protégé is roughly ten times that of creating an initial data hierarchy in MetaDesk.
Figure 3.32: Instant gratification: MetaDesk vs Protégé
Our experiment on the other human subjects confirmed this result. The criterion we use is the time needed to create a particular data hierarchy. The results (Figure 3.33) showed that the time for creating an initial data hierarchy in MetaDesk is much less than the time for creating a hierarchy in Protégé; on average the former is six times faster.
Figure 3.33: Instant gratification: hierarchy creation time comparison
At first glance, it might seem unfair to compare the time of creating an initial hierarchy in MetaDesk with the time of creating a complete hierarchy in Protégé. However, note that the initial hierarchy created in MetaDesk, being nicely structured, is already of great use to users. In Protégé, by contrast, users will not get anything useful until they create the complete hierarchy. We are therefore comparing the time required until the first usable data hierarchy is created.
Some readers might ask why users cannot just create an initial rudimentary hierarchy in Protégé as well. There are several answers to this question. First, Protégé does not provide the means for users to create untyped nodes and unspecialized links, so users cannot create a data hierarchy without defining domain-specific classes and properties. Second, even if users defined some generic classes and properties in Protégé and used them as placeholders for the data hierarchy, they would basically be mimicking our bottom-up approach in Protégé, which MetaDesk supports natively. Third, Protégé only supports a class hierarchy.
Even if users used some generic classes and properties to create the data in Protégé, the result would not be a data hierarchy. Finally, even if users could create the data in Protégé this way, the effort required to refine the data later would be almost equivalent to creating a complete hierarchy from scratch.
3.4.8.3 Efficiency
We argue that bottom-up tools like MetaDesk have better data entry efficiency than top-down tools like Protégé; that is, given the same amount of time, users create more semantic data with MetaDesk than with Protégé.
In the worst case, we could emulate the top-down approach in MetaDesk by committing to an ontology early. Thus MetaDesk should require no more overall effort than Protégé. Furthermore, because MetaDesk has an auto-classification algorithm, it should require fewer steps than Protégé when dealing with large amounts of data. Finally, we could argue in principle that MetaDesk incurs a lower retraction cost than Protégé (which is hard to verify in experiments, because we only obtained the final data set from users).
When selecting a criterion for data entry efficiency, we started with the number of literals marked up. Several issues have to be considered for this criterion. First, in the experiment, the literals users input into MetaDesk were marked up to various degrees: some of the data was fully marked up, while some was not marked up at all. However, we observed a consistency for each user: if a user did not mark up the data in MetaDesk, the user also created very few classes/slots and very little instance data in Protégé. That is, the user had difficulty with ontology modeling. Even in this case, the user had no problem creating a big initial data hierarchy in MetaDesk. Therefore, the degree of markup is not a problem. Second, users spent different amounts of time on the various tasks with MetaDesk and Protégé; the criterion therefore has to be normalized by the total time spent.
Figure 3.34 shows the efficiency of MetaDesk versus Protégé. For people good at ontology modeling (i.e., users N and M), the efficiency of MetaDesk is roughly twice that of Protégé. For people who had difficulty with ontology modeling (e.g., J1, P, J2), the efficiency of MetaDesk is much higher than that of Protégé: they could barely enter data into Protégé, but had no problem entering data into MetaDesk, although the data was not semantically marked up.
Figure 3.34: Data entry efficiency: MetaDesk vs Protégé
3.5 Summary
We surveyed tools for creating Semantic Web data and argued that the ontology-driven data creation paradigm is an important factor affecting the ease of use of these tools. We therefore proposed a bottom-up, data-centric paradigm for semantic data creation, in which users create structured data first and the ontology is inferred later from the data corpus. Our evaluations showed that our bottom-up approach has advantages over the top-down approach in terms of ease of use, instant gratification, and overall efficiency.
Chapter 4
Grass-Roots Ontology Alignment
4.1 Introduction: The Importance of Ontology Alignment
Given the openness of the Semantic Web, it is inevitable that people will use heterogeneous ontologies. There are several approaches to dealing with heterogeneous ontologies.
One approach is typically used in information integration, where a mediator architecture [116] [89] [115] mediates information in different ontologies or schemas. Such systems include SIMS [2], OBSERVER [90], Information Manifold [75], TSIMMIS [43], DISCO [111], COIN [21] [44], and Ontobroker [30]. A similar approach is ontology translation, e.g., OntoMorph [25] and OntoMerge [36]; the difference is that it deals with a pair of ontologies rather than a list of ontologies as in a mediator architecture. Another approach is ontology merging, where two or more ontologies are combined into a shared, bigger ontology [60] [108] [109].
In all these approaches, the prerequisite is to establish the correspondence between concepts in different ontologies, that is, ontology alignment. Ontology alignment is therefore a critical and necessary problem when dealing with heterogeneous data sources. In addition, because we allow users to create their semantic data in a bottom-up fashion without defining an ontology first, it is inevitable that even the same user will create different structures for the same concepts. Thus alignment is a necessary complement to our data creation approach.
4.2 Survey of Alignment Tools and Techniques
Ontology alignment has been studied by many researchers in the knowledge representation [93] [94] [95] and Semantic Web communities [35] [19]. It has also been studied extensively in the database community [14] [34] [88] [76] [5] [8] [13] [50] [92] [98]. It is often studied under different names, such as schema mediation [52] [43], schema reconciliation [24], schema matching [88] [14], schema mapping [81] [51] [120], semantic coordination [19] [20], ontology matching [121] [41], and ontology mapping [96] [72] [117]. A relatively comprehensive survey on schema matching was given in [103].
In most research work, ontology alignment is defined as the following problem: given two separately conceived ontologies, find likely matches between terms of the two ontologies. Early ontology matching techniques were mostly based on heuristics. For example, ONION [93] [94], CUPID [82], PROMPT [95], and [88] use name similarity and structural similarity between ontologies or schemas as hints for guessing matches between terms. Later on, people realized that data instances are an important source of information as well, and proposed alignment techniques that take data instances into account (LSD [34], GLUE [35], Automatch [14], SemInt [76]). One technique uses information flow theory [65].
Different techniques are good at dealing with different kinds of information. To further improve matching performance, systems that integrate different matching techniques have also been proposed (LSD [34], COMA [33]).
The idea of reusing previous alignments was stated in [103] and further developed in COMA [33]. In order to match schema S1 and schema S2, COMA [33] requires the existence of a schema S that has already been matched with S1 and S2, which makes it unusable for schemas never seen before. Alon Halevy et al. [51] [80] recently proposed to use a corpus of schemas and schema mappings to help schema mapping; however, they did not address the problem of constructing the corpus. Little work in this direction has been reported so far.
Unlike other ontology alignment techniques that take two ontologies as input, [55] [56] take a set of ontologies as input and try to align them with holistic approaches. These approaches are based on the mutual exclusiveness of equivalent terms within the same ontology.
There is also work on discovering complex semantic matches between database schemas (iMAP [31]), ontology comparison among different versions [96], agent mediation [117] [7], evolutionary transplantation [4], and others [72] [99].
Ontology alignment is unlikely to be fully automatic; user interaction is inevitable. Aslan et al. [4] proposed to resolve semantic heterogeneity in federated databases by schema implantation and stepwise evolution. In their approach, a remote schema is first loosely implanted into the local database schema. A hypothesis about the relevance (equivalence, superclass, subclass, overlap, or irrelevance) of a remote class to a local class is formed. The hypothesized relevance, along with a characteristic subset of instances of the remote class and the local class, is then sent to local and remote domain experts respectively, who try to supply values for the newly added attributes of those instances. Depending on the values the domain experts supply, the actual relevance between the remote class and the local class can be determined, a new set of hypotheses can be formed, and the process repeats. Yan et al. [120] proposed an interactive tool that uses instance data to guide users in data navigation and schema mapping.
Instead of relying on one or a few experts to align ontologies, Doan et al. [86] proposed the MOBS approach, which tries to replace an expert with a multitude of users. In their approach, users are forced to give feedback on system-proposed alignments before being allowed to use the system. This feedback is then mined to filter out errors and inconsistencies.
Most ontology alignment techniques take a semi-automated approach to ontology interoperation: the system guesses likely matches between terms of two separately conceived ontologies, and a human expert knowledgeable about the semantics of both ontologies then verifies the inferences, possibly using a graphical user interface.
We also believe that approximate alignment axioms in the spirit of Hovy's "generally associated with" links [61] will be a common phenomenon on the Semantic Web, as it is unlikely that concepts casually developed by end users will meet requirements as strong as those implied by OWL equivalency statements.
A comparison of different alignment techniques is given in Figure 4.1.
Figure 4.1: Comparison of ontology alignment/schema matching techniques
4.3 Problems with Traditional Alignment Approaches
4.3.1 Alignment as a Task Isolated from Other Data Manipulation Tasks
Several characteristics of traditional alignment techniques make them unsuitable for ordinary users. As mentioned, there has been a lot of research on alignment, and most of the techniques share the following characteristics:
1) Alignment or schema matching is a separate task. Thus users do not get an immediate reward for their alignment effort.
2) The input to the algorithms is schemas or ontologies, possibly with some instance data. Ontologies are abstract, which makes aligning them more difficult. Also, in many cases the ontology might not be available.
4.3.2 Expert-Based Alignment
The output of these techniques is precise mappings between terms in different schemas, which in turn requires a thorough understanding of the ontologies. Therefore the alignment task can generally be carried out only by ontology experts. However, ordinary users may want to align some concepts for their own purposes, even though those concepts might not align to each other precisely.
4.4 Our Alignment Approach
4.4.1 Introduction to Grass-Roots Alignment
Ontology alignment is a key problem when dealing with heterogeneous data.
It is even more important for the Semantic Web, which is all about structured data in different ontologies. The performance of an ontology alignment technique largely depends on the amount of information that can be leveraged for the alignment task.

In the Semantic Web, end users may explicitly or implicitly generate ontology alignments during their use of the semantic data. This kind of end-user-generated ontology alignment, which we call grass-roots ontology alignment, is an important source of information that is not taken into account by previous ontology alignment techniques.

4.4.2 Alignment as Side-Effects

Suppose a user's PDA and mailer are Semantic-Web-empowered, but use different ontologies. When the user copies an address from her PDA's address book to the mailer's address book, she implicitly claims alignments between related classes and properties of the PDA ontology and those of the mailer ontology. These alignments are valuable information and should be leveraged for future alignment purposes.

For another example, an auto buyer may check the price of a car with KBB.com and look up its registration information with CarFax.com. The implied alignment between the "Auto" class of the KBB ontology and the "Car" class of the CarFax ontology (supposing these two sources are Semantic-Web-empowered) could then be used by intelligent information agents to integrate information from both sources.

Tim Berners-Lee, in his vision for the Semantic Web [15], also discussed the power of this kind of end-user-generated alignment. He talked about how one could compose a business card out of an e-mailed invoice and how the implicitly generated alignment could then help other users: "I might be the first to establish that mapping ... but now anyone who learns of those links can derive a business card from an e-mailed invoice." and "If I publish the relationships ... as a bit of RDF, then the Semantic Web as a whole knows the equivalence". He also talked about the emergence of concepts from these alignments: "When, eventually, thousands of forms are linked together through the field for 'family name' or 'last name' or 'surname' then anyone analyzing the Web would realize that there is an important common concept here."

An interesting application demonstrating the idea of grass-roots alignment is WebScripter [118] (DAML WebScripter Project, http://www.isi.edu/webscripter). Users build a report by extracting content from heterogeneous sources and pasting that content into what looks like an ordinary spreadsheet. What users implicitly do in WebScripter (without expending extra effort) is to generate some ontology equivalency statements. The resultant equivalency statements are then reused by WebScripter to help (other) users find and align related ontologies and data.

4.4.3 End-User Alignment

Grass-roots alignments, being implicitly generated as side effects, do not require precision. Therefore, they can be obtained from the data manipulations of end users. This is unlike other alignment techniques, which assume a separate alignment process and try to guarantee the correctness of the alignments obtained during that process. To ensure correctness, they have to have an ontology expert check and verify the alignments proposed by these techniques.

Grass-roots alignments produced by end users might be approximate or inconsistent. Thus a second part of our approach is to infer high-quality alignments out of low-quality grass-roots alignments.
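To make the side-effect idea concrete, the following is a minimal sketch, in Python, of how an application could record such implicit alignments with provenance. It is not taken from any of the systems discussed here, and all names in it, such as record_copy and the pda:/mailer: terms, are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class Alignment:
    # An implicit equivalency claim between terms of two ontologies.
    source_term: str   # e.g. "pda:AddressBookEntry"
    target_term: str   # e.g. "mailer:Contact"
    kind: str          # "class" or "property"
    evidence: str      # who or what produced the claim

class AlignmentStore:
    def __init__(self):
        self.alignments = []

    def record_copy(self, src_class, dst_class, field_pairs, user):
        # Called when a user copies a record from one application to
        # another. The copy is the user's goal; the alignments are a free
        # by-product, kept with provenance so later consumers can weigh them.
        self.alignments.append(Alignment(src_class, dst_class, "class", user))
        for src_prop, dst_prop in field_pairs:
            self.alignments.append(
                Alignment(src_prop, dst_prop, "property", user))

store = AlignmentStore()
store.record_copy("pda:AddressBookEntry", "mailer:Contact",
                  [("pda:fullName", "mailer:name"),
                   ("pda:eMail", "mailer:email")],
                  user="alice")

The essential design point is that nothing here interrupts the user: the alignment claims accumulate silently, to be mined later as described in Section 4.7.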
In Section 4.5 we will introduce one of our tools, WebScripter [118], which allows end users to produce grass-roots alignments. We then introduce, in Section 4.7, our algorithm that combines grass-roots alignments to produce high-quality ontology alignments.

4.5 The WebScripter Tool

4.5.1 WebScripter Overview

WebScripter [118] is a tool we developed "that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous Semantic Web sources" in RDF Schema (RDFS), DAML, or OWL format. Different Semantic Web sources may use different ontologies. WebScripter addresses this problem by (a) making it easy for individual users to graphically align the attributes of two separate, externally defined concepts, and (b) making it easy to reuse others' alignment work. At a high level, the WebScripter concept is that users extract content from heterogeneous sources and paste that content into what looks like an ordinary spreadsheet. What users implicitly do in WebScripter (without expending extra effort) is to build up an articulation ontology containing equivalency statements. We believe that in the long run, this articulation ontology will be more valuable than the data the users obtained when they constructed the original report. The equivalency information reduces the amount of work future WebScripter users have to perform. Thus, in some sense, you do not just use the Semantic Web when you use WebScripter; you help build it as you go along.

4.5.2 System Descriptions

This section describes the current implementation of WebScripter by walking through a step-by-step example. In order to use WebScripter, users do not need any knowledge of ontological languages. In this section we will describe how WebScripter helps ordinary users locate RDFS sources, build a report, and customize the presentation of a report. We then show how the resultant ontology alignment data benefits other users in constructing similar reports by identifying related sources and aligning data.

4.5.2.1 Constructing a WebScripter Report

Step 1: Load RDFS Data

In this example our job is to maintain a list of researchers working on the Semantic Web. The first task is to find the URLs where the researchers put their data (which we presume to be in some RDF-based format for this example). Although locating RDFS sources is not WebScripter's focus, WebScripter provides some support for it by wrapping Teknowledge's Semantic Search Engine (OWL Semantic Search Services, http://reliant.teknowledge.com/DAML). This search engine accepts queries in the form of triple patterns, and returns matches from BBN's crawled ontology library [29]. Our wrapper helps users by transforming their keyword-based queries into triple patterns, submitting them to Teknowledge's Semantic Search Engine, and extracting source URLs from the results. Later on we will discuss how WebScripter can help identify related RDFS sources in a collaborative-filtering fashion. In this example, we will use two RDFS data sources: ISWC'2002 annotated author data (http://annotation.semanticweb.org/iswc/documents.html) and ISI's Distributed Scalable Systems Division personnel data (http://www.isi.edu/divisions/div2/).

Step 2: Create a Report

Figure 4.2: WebScripter GUI

Figure 4.2 shows WebScripter just after loading the ISWC'2002 data. On the left side is a class hierarchy pane. Users can select a class to view its content in the lower right pane. The upper right pane is the report-authoring area. WebScripter offers three options for users to add a column to a report. (1) In the simplest case, users can select a column from a class and add it to the report, as shown in Figure 4.2.
(2) Users can also type example data in the report-authoring area; WebScripter will then try to guess which column in which class the user is referring to. This is useful when users are lost in the class hierarchy. (3) In the most complicated case, users want to include information from different classes in a single report. We do not want to require users to understand the domain ontology in order to do that. For example, suppose users have already specified "name" and "email" for the instances of class "Person" in a report, and now they want to add information about the project a person works on, which is in the "Project" class. Instead of requiring users to specify how to go from the "Person" class to the "Project" class step by step, WebScripter will try to infer the ontological paths between these two classes, rank the paths first by path length (shortest first) and then by number of instance matches (more first), and let users select (Figure 4.3). In our experience, the first entry listed (the one with the shortest ontological path and which fills the most blanks in the report) is virtually always the desired choice.

Step 3: Align Data from Multiple Sources

In our running example, the user is now done adding ISWC'2002 author information to the report. Assume they happen to find ISI's researcher information via Teknowledge's Semantic Search Engine and want to include that in the report as well. They basically repeat the previous steps of adding columns, but this time they add the columns from ISI's "Div2Member" class to the corresponding columns of the ISWC data (rather than adding them as new columns). Figure 4.4 shows the combined data from the two groups.

Figure 4.3: WebScripter: ontological path inference

Figure 4.4: Aligning data in WebScripter

When users compose a report by putting together information from heterogeneous sources, there is some implicit and valuable information that can be inferred. First, by composing a report, users imply a (weak) association between sources, i.e., "a user who used this source also used that one", somewhat analogous to Amazon's book recommendations ("customers who bought this book also bought that one"). This association can help future users locate relevant RDFS sources. Second and more interestingly, by putting heterogeneous information together, users also imply a (similarly weak) equivalency between concepts from different ontologies. For example, from the report in Figure 4.4 WebScripter could infer that ISI's "Div2Member" class is equivalent to ISWC's "Person" class, that ISI's "fullname" property is equivalent to ISWC's "name" property, and so on. Table 4.1 shows the equivalency information inferred from the report, in DAML format.
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
  xmlns:wseq="http://www.isi.edu/WebScripter/2002/06/equivalencies#">
  <rdfs:Class rdf:about="http://annotation.semanticweb.org/iswc/iswc.daml#Person">
    <daml:sameClassAs rdf:resource=
      "http://www.isi.edu/webscripter/div2-org.o.daml#Div2Member"/>
  </rdfs:Class>
  <rdfs:Property rdf:about="http://annotation.semanticweb.org/iswc/iswc.daml#name">
    <daml:samePropertyAs rdf:resource=
      "http://www.isi.edu/webscripter/div2-org.o.daml#fullname"/>
  </rdfs:Property>
  <rdfs:Property rdf:about="http://annotation.semanticweb.org/iswc/iswc.daml#email">
    <daml:samePropertyAs rdf:resource=
      "http://www.isi.edu/webscripter/div2-org.o.daml#emailprefix"/>
  </rdfs:Property>
  <rdfs:Property rdf:about="http://annotation.semanticweb.org/iswc/iswc.daml#homepage">
    <daml:samePropertyAs rdf:resource=
      "http://www.isi.edu/webscripter/div2-org.o.daml#homepage"/>
  </rdfs:Property>
  <rdfs:Property rdf:about=
    "http://annotation.semanticweb.org/iswc/iswc.daml#involved_in_project">
    <daml:samePropertyAs rdf:resource=
      "http://www.isi.edu/webscripter/div2-org.o.daml#workson"/>
  </rdfs:Property>
  <rdfs:Property rdf:about="http://annotation.semanticweb.org/iswc/iswc.daml#project_title">
    <daml:samePropertyAs rdf:resource=
      "http://www.w3.org/2000/01/rdf-schema#label"/>
  </rdfs:Property>
</rdf:RDF>

Table 4.1: Resultant Alignment Axioms

The alignment axioms shown above are the simplest ones: a direct alignment between two named classes or properties. Since WebScripter also supports joins (between two classes) and filtering (of instances), the alignment axioms can also be more complex. For example, if users want to build a report of just the ISI students, they need to add "Div2Member" instances to the report, do a join to their roles ("Div2Role"), and filter the roles by "Student". The resultant equivalency is visualized in Figure 4.5.

Figure 4.5: Constructed class alignment

Figure 4.6 shows an axiom that defines an equivalency between two property sequences. This type of axiom can be captured with WebScripter (but we do not yet make use of it for our own alignment suggestions). To obtain the project name for a person, in the first case users simply follow the link "foo:projectName"; in the second case users need to follow the link "ISWC:involved_in_project" and then the link "ISWC:project_title".

Figure 4.6: Constructed property alignment

Traditional semi-automatic ontology mapping tools are good at one-to-one element mapping and tend to deal less with alignment axioms as complex as those shown in Figures 4.5 and 4.6, which WebScripter in some sense captures "for free" by providing an easy way for users to perform joins and filtering during report authoring.

In this section we have described our tool WebScripter, which demonstrates how grass-roots alignments can be produced by end users. In the next section (4.6) we will discuss the advantages of obtaining ontology alignments this way, followed by a discussion of the challenges in grass-roots alignment and of our algorithm for dealing with those challenges to produce high-quality ontology alignments in Section 4.7.
4.6 Advantages of Grass-Roots Alignment

4.6.1 Instant Gratification

Other approaches use a separate alignment process; thus the benefits of the alignment only come later, in other data manipulation tasks. Furthermore, the one who carries out the alignment effort (the ontology expert) might not be the one who benefits from it. In contrast, in grass-roots alignment users get immediate benefits for their implicit alignment, namely the completion of their other data manipulation tasks, and users benefit from their own alignment effort rather than doing it for others.

4.6.2 Ease of Use

Ensuring the correctness of an ontology alignment requires a lot of effort and a thorough understanding of the domain. However, the creation of grass-roots alignments does not require correctness. Therefore, end users can feel safe generating them without worrying that their data manipulations will cause undesirable consequences down the road. The correctness of grass-roots alignment, instead, is to be achieved by combining different users' alignment efforts.

4.7 Reusing Grass-roots Alignments for Alignment Purposes

4.7.1 Approximations and Inconsistencies in Grass-Roots Alignments

Grass-roots class alignment is a useful source of information that can help us with future alignment tasks. When one user (implicitly) aligns "O1:PhDStudent" with "O2:DoctoralStudent" (the notation O:C denotes a class with term C in ontology O, where C is a meaningful string representing the class, such as a class label), it can help other users align these classes when they see ontologies O1 and O2. Furthermore, it can help us with future alignment tasks when these terms appear in other ontologies: that is, it can help us align "O3:PhDStudent" with "O4:DoctoralStudent".

However, not all grass-roots alignments are that universal. Grass-roots ontology alignment, often generated as a side effect of other data manipulations, can be user-specific, task-specific, approximate, or even contradictory. A university secretary, when counting the number of university personnel, may put together all "Professors", "Staffs" and "Students" from different sources, thus implying that "Professor", "Staff" and "Student" are aligned. These classes, however, cannot be aligned in many other cases.

Since alignment is often generated implicitly, it can also be erroneous when users perform mistaken data manipulations.

Approximation of alignment often comes from the lack of corresponding concepts. For example, in one ontology there might be a "GraduateStudent" class, while in another there might be only a "MasterStudent" class. Thus a person building a report of all graduate students will simply put all "GraduateStudent" and "MasterStudent" instances together, which will implicitly produce the alignment that "GraduateStudent" is aligned with "MasterStudent".

Alignment is not transitive. If we regarded alignment as transitive, in extreme cases everything would be aligned with everything else. In Figure 4.7 (different color blocks represent different ontologies; arrows represent subclass relationships), every alignment (represented as a dashed double line) is the best possible alignment between the two ontologies; yet all classes in Figure 4.7 would be aligned with each other if we regarded alignment as transitive.

Figure 4.7: Alignment is not transitive

Therefore, to reuse grass-roots alignments for ontology alignment purposes, we must deal with the approximation and even erroneousness of those alignments.

4.7.2 Observations and Heuristics

Notation: Before we proceed, let's first define some notation.
Given two terms A and B, there are four kinds of relationships between them:

1. A is more general than B, i.e., B is a subclass of A, which can be represented as A > B. We also say that A is an ancestor of B and B is a descendant of A.

2. B is more general than A, represented as B > A.

3. A and B are parallel (A ∥ B); that is, neither of A and B is more general than the other, but they are related via class subsumptions. For example, A and B are siblings.

4. A and B are not related via class subsumptions, e.g., a "Movie" and a "Person".

For the sake of brevity, we use A • B to represent the situation where there is an ancestor/descendant relationship between A and B but we are not sure which one is more general. We use A ∗ B to represent that A and B are related (either via >, • or ∥).

Although grass-roots alignment seems rather arbitrary, because users might align two classes for their own purposes, we can still make some observations regarding grass-roots alignments.

Observation 1: Our first observation is that when users align two classes, the aligned classes tend to describe a similar "kind" of thing. For example, users might align "MasterStudent" with "Student", or even "MasterStudent" with "Person", but they rarely align "MasterStudent" with "Project". In other words, aligned classes are highly likely to be related via a series of class subsumption relationships.

Observation 2: Our second observation is that when facing several approximate alignment candidates, users tend to select the one that is semantically closer. Take Figure 4.8(a) for example; we all know that Student > GraduateStudent > MasterStudent. If Student > GraduateStudent appears in one ontology O1 while MasterStudent appears in another ontology O2, then out of Student and GraduateStudent users tend to pick GraduateStudent to align MasterStudent with. A similar observation can be made about the case depicted in Figure 4.8(b).

Figure 4.8(c) suggests another situation. For example, we know that MasterStudent and PhDStudent are both subclasses of GraduateStudent. When GraduateStudent > MasterStudent appears in ontology O1 and PhDStudent appears in ontology O2, users tend to align O2:PhDStudent with O1:GraduateStudent. This is reasonable because B and C are not directly related. From the perspective of set theory, A ∩ C contains B ∩ C, and B ∩ C is ∅ when B and C are mutually exclusive.

Figure 4.8(d) is actually a rare case compared with cases (a), (b) and (c). It basically says that given O1:Student > O1:FemaleStudent and O2:Female, users tend to align O2:Female with O1:FemaleStudent.

Figure 4.8: Observations on grass-roots alignments

The assumption our approach is based on is the validity of the above observations: we assume that most users tend to align according to them. Note that in Figure 4.8 class hierarchies are on the left side and alignments are on the right side. Given the validity of the observations, the right side (the alignments) gives hints on what the left side should look like. Such hints are then combined to infer the class hierarchies on the left side.

4.7.3 Algorithm for Reusing Grass-roots Class Alignment

The first step of our class alignment algorithm is to get an initial set of facts about relationships between different classes. Such facts come from two sources: the subclass relationships explicitly specified in the original ontologies, and other facts implied by alignments.
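As a concrete data structure, the relationship vocabulary above might be encoded as follows. This is an illustrative Python sketch of our own, not code from the actual system; the type names are hypothetical.

from dataclasses import dataclass
from enum import Enum

class Rel(Enum):
    MORE_GENERAL = ">"    # A > B: A is an ancestor of B
    PARALLEL = "||"       # A || B: related via subsumption, neither subsumes
    UNRELATED = "none"    # no subsumption path between A and B at all
    ANC_OR_DESC = "dot"   # A . B: ancestor/descendant, direction unknown
    RELATED = "*"         # A * B: related via >, the dot relation, or ||

@dataclass(frozen=True)
class Fact:
    # A (possibly negated) claim about two terms, e.g. NOT(A > B).
    rel: Rel
    a: str
    b: str
    negated: bool = False

# (Student > Graduate) as asserted by, say, ontology O2:
f = Fact(Rel.MORE_GENERAL, "Student", "Graduate")

Facts of this shape, each paired with its supporting evidence, form the knowledge base built up in the steps below.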
Step 1: Subclass Relationships Specified in the Ontology: Given an ontology O1, if it is specified in O1 that O1:B is a subclass of O1:A, we represent such a fact in the form (A > B, e = O1), which means that term A is more general than term B and that the evidence for this claim is ontology O1. Similarly, if in ontology O2 it is also specified that O2:B is a subclass of O2:A, then we get (A > B, e = O1 + O2).

Step 2: Relationships Implied by Grass-roots Alignments: For simplicity, let's assume here that there is no multiple inheritance in the class hierarchies we are working on. Our experience with many class hierarchies suggests that multiple inheritance occurs relatively infrequently. The algorithm presented below can be extended to multiple-inheritance cases as well with slight modifications.

Case 1: Suppose an alignment exists as on the left side of Figure 4.9 (the double dashed line stands for alignment); that is, A is a superclass of B and B is aligned with C. Such an alignment implies that C cannot be a superclass of A; otherwise C should be aligned with A, not B. Thus the possible relationship between A and C is A > C or A ∥ C. Similarly, the alignment implies that B and C are not parallel; otherwise C would be better aligned with A. Thus the possible relationship between B and C is C > B or B > C. Taking A, B and C together, there are four kinds of combinations:

1. A ∥ C and B > C: this combination is invalid because B > C combined with A > B leads to A > C, conflicting with A ∥ C.

2. A ∥ C and C > B: in this case B inherits from both A and C, so it is pruned because of our single-inheritance assumption.

3. A > C and C > B, or

4. A > C and B > C.

We can combine 3 and 4 with A > B to get (A > B AND A > C AND B • C). For brevity we can also rewrite this in the shorter form A > B • C. Adding in the evidence, we get (A > B • C, e = align1), where align1 is the alignment that links the O1:B class in the first ontology to the O2:C class in the second ontology.

Figure 4.9: Implications of alignment: Case 1

Case 2: Suppose an alignment exists as on the left side of Figure 4.10; that is, A is a superclass of B and A is aligned with C. Such an alignment implies that B cannot be a superclass of C; otherwise C should be aligned with B, not A. Further analysis leads to (NOT (B > C) AND A ∗ C). The relationship between A and C can be arbitrary. Similarly to Case 1, adding in the alignment as evidence, we get (NOT (B > C) AND A ∗ C, e = align2), where align2 is the alignment that links the O1:A class in the first ontology to the O2:C class in the second ontology.

Figure 4.10: Implications of alignment: Case 2

Case 3: This is actually not a separate case (Figure 4.11); it is an aggregation of two Case 1's. Therefore, we have:

(A > B AND B ∼ C) => A > B • C
(D > C AND B ∼ C) => D > B • C

where ∼ stands for alignment. Since we are dealing with the single-inheritance case, the last two can be combined into A • D > B • C. Adding in the alignment evidence align3, we have (A • D > B • C, e = align3). Similarly, for the case (A > B AND D > C AND A ∼ D) we get (NOT (B > D) AND NOT (C > A) AND A ∗ D).

Figure 4.11: Implications of alignment: Case 3

Step 3: Forward-chaining Inference:

After we get all the facts from the subclass relationships in the ontologies and from the alignments, we apply forward-chaining inference to the facts knowledge base to obtain more facts. The inference rules used here are propositional rules. Some sample rules include:

(A > B OR A < B) and NOT(A > B) lead to B > A (unit resolution).
A > B and B > C lead to A > C (transitivity of class subsumption).
NOT(A > B) and NOT(A ∥ B) lead to B > A.
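The following Python sketch shows the shape of this forward-chaining step for the transitivity rule only; the other propositional rules would be added in the same style. The representation is our own simplification: a fact is a (relation, A, B) tuple, and its evidence is a set of alternative derivations, each a frozenset of primitive evidence labels, so that the evidence bookkeeping described in the next paragraphs (products for conjunctions, sums for alternatives, duplicates collapsing) falls out of the set semantics.

from collections import defaultdict

def add_fact(kb, fact, derivations):
    # Merge alternative derivations into the KB: e = e1 + e2
    # ('+' is idempotent because the derivations live in a set).
    kb[fact] |= derivations

def forward_chain(kb):
    # Apply rules until no new fact or new evidence appears.
    changed = True
    while changed:
        changed = False
        snapshot = list(kb.items())
        for (k1, a, b), ev1 in snapshot:
            for (k2, b2, c), ev2 in snapshot:
                if k1 == ">" and k2 == ">" and b == b2:
                    # A > B and B > C lead to A > C; the evidence of the
                    # derived fact is the conjunction e1 * e2, i.e. the
                    # union of the primitive labels of both premises.
                    derived = {d1 | d2 for d1 in ev1 for d2 in ev2}
                    before = set(kb[(">", a, c)])
                    kb[(">", a, c)] |= derived
                    if kb[(">", a, c)] != before:
                        changed = True

kb = defaultdict(set)
add_fact(kb, (">", "Student", "Graduate"), {frozenset({"O2"})})
add_fact(kb, (">", "Graduate", "MasterStudent"), {frozenset({"O1"})})
forward_chain(kb)
# kb now also holds (">", "Student", "MasterStudent") with evidence
# {frozenset({"O1", "O2"})}, i.e. e = O1 * O2.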
The computation of evidence is as follows:

When a new fact f is obtained from several other facts, (f1, e1) AND (f2, e2) ... AND (fi, ei) => (f, e), its evidence is e = e1 ∗ e2 ∗ ... ∗ ei. When a fact can be obtained several times with different evidences e1, e2, ..., ei, its evidence is updated as e = e1 + e2 + ... + ei. Also note that the same evidence does not count twice, that is, e1 + e1 = e1 and e1 ∗ e1 = e1.

We do not use backward-chaining inference. The facts knowledge base is very likely to be inconsistent because users may assert contradictory alignments, so anything one asks the KB to prove would be proven. With forward-chaining inference, we instead try to infer as many facts as possible, along with the evidence for each fact. The evidence is then used to pick out better-supported facts in case of contradictions.

Quantifying Evidences: We want to quantify evidences for comparison purposes. The value of an evidence is a numerical value in (0, 1). Let V(e) be the numerical value of evidence e. Evidence values are computed as follows:

V(e1 + e2) = 1 − (1 − V(e1)) ∗ (1 − V(e2))
V(e1 ∗ e2) = V(e1) ∗ V(e2)

These two formulas have the following properties: 1) the computed numerical value is always in (0, 1); 2) the more evidences support a fact, the higher its numerical evidence value; 3) the more steps required to reach a fact, the smaller its numerical evidence value.

The values of primitive evidences can be determined separately; for example, they can be based on user authority or ontology quality. In our experiment (to be discussed in Section 4.8), we assigned each ontology evidence a value of 0.5 and each alignment evidence a value of 0.25. This value assignment may seem arbitrary, but our experience showed that different value assignments did not change the results of our algorithm much: what matters is the comparison of values rather than the absolute values. In our experiment, this value assignment gave a relatively uniform value distribution over (0, 1). In the future, we are looking at determining the value assignment based on statistics on the lengths of the evidences of all facts.

Note that evidence values are not probabilities. A fact with evidence value 0.8 does not have a probability of 0.8 of being true; the value is rather a measure of confidence that is meaningful only for comparison purposes.

Step 4: Class Alignment Using the Facts KB:

The facts obtained from the inference step above are used for the next class alignment task. Given a class A from ontology O1, we try to find a class B from the second ontology O2 such that B is the best alignment candidate for A.

Note that one desirable side effect of our algorithm is that it takes only a one-step query to get all superclasses or subclasses of a class, thanks to the application of the subsumption transitivity rule in Step 3.

Let's define three class sets as follows:

Sup(A) is the set of superclasses of A according to the facts KB.
Sub(A) is the set of subclasses of A according to the facts KB.
Ind(A) is the set of all classes A′ such that (A > A′ OR A′ > A) but there is no fact in the KB specifying either A > A′ or A′ > A. That is, A′ and A are indistinguishable according to the facts KB.

To deal with possible inconsistencies, for each A′ in Sup(A), if there is a better-supported fact A > A′, NOT(A′ > A) or A′ ∥ A, remove A′ from Sup(A). Do the same for Sub(A).

We then determine the alignment candidate for class A in the following order (a sketch of this selection procedure appears after the list):

If there are one or more classes from O2 that belong to Ind(A), choose the best-supported one as the alignment candidate for A.
If there are one or more classes from O2 that belong to Sup(A), choose the one closest to A; that is, out of B and C choose B if C > B. If the order between B and C cannot be determined, pick the better-supported one.

If there are one or more classes from O2 that belong to Sub(A), choose the one closest to A; that is, out of B and C choose B if B > C. If the order between B and C cannot be determined, pick the better-supported one.

Otherwise there is no alignment candidate for A in O2.
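A sketch of how the evidence values and this selection order might be realized, continuing the simplified KB representation from the earlier sketch (facts as (relation, A, B) tuples, evidence as a set of alternative derivations). It assumes the KB also records "∗" (related) facts from Step 3, omits the contradiction pruning described above for brevity, and uses the 0.5/0.25 primitive values of the Section 4.8 experiment; the helper names are ours.

import math

def value(evidence, pv):
    # V(e1 + e2) = 1 - (1 - V(e1)) * (1 - V(e2)) over alternative
    # derivations; V(e1 * e2) = V(e1) * V(e2) within one derivation.
    miss = 1.0
    for derivation in evidence:
        v = math.prod(pv(label) for label in derivation)
        miss *= 1.0 - v
    return 1.0 - miss

def support(kb, fact, pv):
    return value(kb.get(fact, set()), pv)

def best_candidate(a, o2_classes, kb, pv):
    # Step 4's ordering: prefer a class indistinguishable from A, then the
    # closest superclass, then the closest subclass; ties are broken by
    # evidence value.
    sup = [c for c in o2_classes if support(kb, (">", c, a), pv) > 0]
    sub = [c for c in o2_classes if support(kb, (">", a, c), pv) > 0]
    ind = [c for c in o2_classes if c not in sup and c not in sub
           and support(kb, ("*", a, c), pv) > 0]  # related but unordered
    if ind:
        return max(ind, key=lambda c: support(kb, ("*", a, c), pv))
    if sup:
        # Closest superclass: one with no other candidate below it.
        low = [c for c in sup if not any(
            support(kb, (">", c, d), pv) > 0 for d in sup if d != c)]
        return max(low, key=lambda c: support(kb, (">", c, a), pv))
    if sub:
        # Closest subclass: one with no other candidate above it.
        high = [c for c in sub if not any(
            support(kb, (">", d, c), pv) > 0 for d in sub if d != c)]
        return max(high, key=lambda c: support(kb, (">", a, c), pv))
    return None

# Primitive values as in Section 4.8: 0.5 per ontology, 0.25 per alignment.
pv = lambda label: 0.5 if label.startswith("O") else 0.25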
Example: Let's use the ontologies and alignments in Figure 4.12 as an example to illustrate how to obtain facts about class relationships. (Using the obtained facts to align classes is straightforward and thus not elaborated here.) All the obtained facts are listed in Table 4.2. In order to (implicitly) integrate the three small class hierarchies into a bigger one, it is useful to determine the relationship between the "Graduate" class and the "UnivStudent" class. Note that since alignment does not mean equivalence, (Student > Graduate, O2) and the alignment between "UnivStudent" and "Student" do not immediately imply UnivStudent > Graduate. However, we will show that by combining facts obtained from different alignments and ontologies, we are still able to infer UnivStudent > Graduate.

Fact#  Fact                                              Evidence
0      Graduate > MasterStudent                          O1
1      Graduate > DoctoralStudent                        O1
2      Student > Graduate                                O2
3      UnivStudent > Undergraduate                       O2
4      UnivStudent > MSStudent                           O3
5      UnivStudent > PhDStudent                          O3
6      UnivStudent > MasterStudent                       align1
7      Graduate > MSStudent                              align1
8      Graduate > UnivStudent OR                         align1
       UnivStudent > Graduate
9      NOT (Graduate > UnivStudent)                      align2
10     Student ∗ UnivStudent                             align2
11     NOT (MSStudent > Student)                         align2
12     NOT (PhDStudent > Student)                        align2
13     NOT (Undergraduate > Student)                     align2
14     UnivStudent > Graduate                            align1 ∗ align2
       (from 8 and 9)
15     Student > UnivStudent OR                          O2 ∗ align1 ∗ align2
       UnivStudent > Student
       (from 2 and 14, single inheritance)
...    ...                                               ...

Table 4.2: Facts Knowledge Base

As listed in Table 4.2, Facts 0 to 5 are directly obtained from the respective ontologies, Facts 6 to 13 are obtained from the two alignments, and the rest of the facts are obtained with forward-chaining inference. From align2 we get Fact 9: NOT(Graduate > UnivStudent), which is Case 2 as depicted in Figure 4.10. From align1 we get Fact 8: (Graduate > UnivStudent OR UnivStudent > Graduate), which is Case 3 as depicted in Figure 4.11. Applying unit resolution to Facts 8 and 9 yields UnivStudent > Graduate.

Figure 4.12: Alignment example

Dealing With Multiple Inheritance:

We only need to make small adjustments to our algorithm when dealing with multiple-inheritance class hierarchies. For the Case 1 scenario as depicted in Figure 4.9, the following facts are implied from the alignment instead: NOT(C > A), NOT(B ∥ C), ((C > B AND A ∥ C) OR (A > C AND (B > C OR C > B))). The Case 2 scenario remains the same, while Case 3 is not used. In the multiple-inheritance case we make less bold assumptions and introduce more uncertainty than in the single-inheritance case.

4.7.4 Algorithm for Reusing Grass-roots Property Alignment

The focus of our research is on class alignment. Here we also briefly describe our initial work on property alignment (e.g., "O1:lastname" is aligned to "O2:surname"), which is greatly facilitated by class alignment. The basic idea of our property alignment algorithm is to take a property's domain class as the context in which the property alignment holds. For example, when we talk about the "Movie" class, "title" is equivalent to "name"; this does not hold for the "Person" class. Furthermore, we take the class hierarchy into account when applying a property alignment to other contexts: for example, the alignment of "title" and "name" can be applied to "Comedy Movie" as well, which is a subclass of "Movie".

4.8 Evaluation

4.8.1 Theoretical Analysis

First, we want to give a theoretical analysis of why our algorithm works.

Obviously, if all possible subclass relationships between all classes are known, it is trivial to find the best alignment candidate for each class. Therefore, the quality of the alignment produced by the algorithm is decided by the quality of the fact KB, which contains all subclass relationships inferred by the algorithm.

In the ideal case, when the alignments provided by users are valid, that is, they conform to our assumptions as described in Section 4.7.2, our algorithm guarantees that the resultant facts in the fact KB are correct. On that basis, as more and more alignments are provided, more and more facts are inferred in the fact KB; Figure 4.13 also demonstrates this. Given the correctness of the fact KB, as the coverage of the fact KB increases, so does the accuracy of the produced alignments. When the fact KB covers all possible subclass relationships, the produced alignments are all correct.

In the less ideal case, when some alignments provided by users are invalid, it is likely that the invalid facts inferred from them will be overridden by correct facts, given the small percentage of invalid alignments.

4.8.2 Experiment Results

Figure 4.13: Precision and recall of obtained facts when all alignments are valid

Figure 4.14: Performance of algorithm with some invalid alignments

To verify our theoretical analysis, we performed an experiment in the university student domain. We retrieved and downloaded 26 ontologies about university students with the help of the Swoogle [32] ontology search engine. The part of the class hierarchy related to university students in each ontology consists of 5 classes on average. We found a high redundancy in class names, with about 3 different names for each class.

As discussed in the previous section, the quality of the produced alignments is determined by the quality of the inferred subclass relationships in the fact KB. Therefore, our first evaluation is to measure the precision and recall of the fact KB. To do so, we need a reference KB. We manually constructed a complete class hierarchy covering all related classes in the 26 ontologies. The reference KB then consists of all subclass relationships in this complete class hierarchy. We then compare the fact KB to the reference KB. The recall of the fact KB is defined as the percentage of facts in the reference KB covered by the fact KB. The precision of the fact KB is defined as the percentage of facts in the fact KB covered by the reference KB.

Figure 4.13 shows the precision and recall of the fact KB when all user-provided alignments are valid. The experiment shows that the correctness of the fact KB is guaranteed, while its recall increases as the number of user-provided alignments increases.
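These fact-KB measurements reduce to plain set precision and recall; a one-function sketch, with facts represented as hashable tuples as in the earlier sketches:

def precision_recall(fact_kb, reference_kb):
    # Precision: fraction of inferred facts confirmed by the reference KB.
    # Recall: fraction of reference facts that the algorithm inferred.
    fact_kb, reference_kb = set(fact_kb), set(reference_kb)
    hits = fact_kb & reference_kb
    precision = len(hits) / len(fact_kb) if fact_kb else 1.0
    recall = len(hits) / len(reference_kb) if reference_kb else 1.0
    return precision, recall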
Figure 4.14 shows the quality of the fact KB and of the produced alignments when some user-provided alignments are invalid. At first, the precision of the fact KB is 100%. We then put in some invalid alignments; as a result, the precision of the fact KB dropped. We then put in the rest of the valid alignments. The precision and recall of the fact KB increased as more and more valid user-provided alignments were used, which shows that the effects of invalid alignments are gradually overridden by those of valid alignments.

We also measured the accuracy of the alignments produced by the algorithm. We started with two empty ontologies. For each set of equivalent classes (such as {Undergraduate, BachelorStudent}), a member is randomly selected and randomly assigned to one of the ontologies. We then measure the precision of the alignments produced by our algorithm between the two ontologies. The result is also shown in Figure 4.14. As can be seen, high-quality alignments can be inferred from the grass-roots alignments provided by end users.

4.9 Summary

End-user-generated ontology alignment, which we call grass-roots ontology alignment, is an important source of information that is yet to be taken into account by traditional ontology alignment techniques. Grass-roots ontology alignments are easy for end users to produce, but they might be approximate or erroneous. We discussed our work on dealing with the approximations and inconsistencies of grass-roots class alignments in order to reuse them for ontology alignment purposes. Our results showed that alignments with high precision can be obtained from grass-roots alignments with our algorithm.

Chapter 5

Summary

In this dissertation we presented our approach to enabling laymen to contribute content to the Semantic Web. The realization of the Semantic Web depends on the creation and alignment of a massive amount of semantic data. However, the lack of semantic content (both data and alignments) is the biggest problem with current Semantic Web development.

We surveyed conventional tools for creating Semantic Web data. We argued that these tools adopted an ontology-based, top-down data creation paradigm, which is an important factor affecting their ease of use. We thus proposed a bottom-up, data-centric paradigm for semantic data creation, embodied in our tool MetaDesk. In this bottom-up paradigm users create structured data without needing to define an ontology first; users refine the data later on, and an ontology is inferred from the created data. We evaluated our approach as well as the top-down approach, with MetaDesk and Protégé as the respective representatives. Our evaluations showed that our bottom-up approach lowers the difficulty of creating semantic data for laymen. In addition, the bottom-up approach provides instant gratification for users' data creation effort and has a higher overall efficiency in semantic data creation.

Enabling laymen to create semantic data is only one step towards the Semantic Web. The created data will not be useful if there are no alignments among them, because of the heterogeneity of the data. Therefore, in order for the Semantic Web to succeed, it is important that laymen be able to contribute ontology alignments as well, which are another important kind of semantic content on the Semantic Web.

We surveyed conventional tools for ontology alignment. We argued that these tools are designed for experts rather than for laymen. These tools treat ontology alignment as a process separate from other tasks, use heuristics to produce alignment suggestions, and rely on experts to produce precise alignments.
Instead, we propose a grass-roots ontology alignment paradigm, in which laymen, rather than ontology experts, align their semantic data (for their own purposes) within end applications. The inferred implicit and sometimes imprecise ontology alignments are then integrated and mined to produce higher-accuracy ontology alignments. Because users produce alignments as an implicit side effect of other data manipulation tasks, and precise alignment is not required of users' effort, the difficulty of carrying out the alignment task is significantly lowered, making it easier for laymen to contribute to the development of the Semantic Web.

Grass-roots ontology alignments produced by laymen might be approximate or erroneous. We discussed our work on dealing with the approximations and inconsistencies of grass-roots class alignments in order to reuse them to produce high-precision ontology alignments. Our experiment results showed that alignments with high precision can be obtained from grass-roots alignments with our algorithm.

In summary, with bottom-up data creation and grass-roots ontology alignment, we significantly lowered the barrier for laymen to contribute content to the Semantic Web.

Chapter 6

Future Work

In this chapter we present future extensions and applications of our work.

6.1 Semantic-enabled Mind Mapping Tools

Novel applications are also important for the development of the Semantic Web. One part of our future work is to integrate our bottom-up semantic data creation mechanism into popular mind mapping tools. We found a lot of resemblance between mind mapping tools and our bottom-up data creation tools such as MetaDesk. Figure 6.1 shows a snapshot of one mind mapping tool, FreeMind. Like MetaDesk, mind mapping tools also use a hierarchical data paradigm. Furthermore, mind mapping tools allow users to mark up the hierarchy with icons. If, in addition to icons, mind mapping tools also allowed users to mark up the hierarchy with ontological information, we would have a semantic-enabled mind mapping tool. A semantic-enabled mind mapping tool has advantages over old mind mapping tools in that it allows more sophisticated processing, sharing, and integration of data. And since it still keeps the functionality of old mind mapping tools, it could possibly build up a Semantic Web user base among the users of mind mapping tools.

Figure 6.1: Integrating bottom-up semantic data creation into mind mapping tools

6.2 Community-based Semantic Data Creation and Alignment Environment

Another direction of our future work is to integrate our bottom-up data creation mechanism and grass-roots ontology alignment mechanism into a single tool, thus creating a community-based environment in which ordinary users can easily create, align, share, and integrate semantic data.

6.3 Using MetaDesk's Data Refinement Mechanism for XML-to-RDF Conversion

An important source of RDF is legacy data in XML format. In order to convert XML into RDF, information on whether a particular XML tag "<X>" should be converted into an RDF property "X" or an instance of type "X" must be provided. As of now, there are no tools available that allow users to easily provide such information. One of the data refinement mechanisms in MetaDesk is to provide two buttons via which users can specify whether a node represents a property or a class (a collection, to be more exact). Combined with the fact that it is easy to import an XML hierarchy into MetaDesk, we can extend MetaDesk's data refinement mechanism to convert XML into RDF.
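To illustrate the two readings of an XML tag, here is a small sketch using the rdflib Python library; the namespaces and the as_property switch are hypothetical, with the switch standing in for exactly the information MetaDesk's two refinement buttons would supply.

from rdflib import RDF, Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab#")

def convert_tag(graph, subject, tag, text, as_property):
    if as_property:
        # <author>Smith</author> read as a property of the subject:
        #   ex:book1 ex:author "Smith" .
        graph.add((subject, EX[tag], Literal(text)))
    else:
        # <author>Smith</author> read as an instance of a class:
        #   ex:Smith rdf:type ex:Author . ex:book1 ex:hasAuthor ex:Smith .
        node = URIRef("http://example.org/data/" + text)
        graph.add((node, RDF.type, EX[tag.capitalize()]))
        graph.add((subject, EX["has" + tag.capitalize()], node))

g = Graph()
book = URIRef("http://example.org/data/book1")
convert_tag(g, book, "author", "Smith", as_property=True)
convert_tag(g, book, "author", "Smith", as_property=False)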
References

[1] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), 2002.

[2] Yigal Arens, Craig Knoblock, and Wei-Min Shen. Query reformulation for dynamic information integration. Intelligent Information Systems, 6(2-3):99–130, 1996.

[3] Naveen Ashish and Craig A. Knoblock. Semi-automatic wrapper generation for internet information sources. In Conference on Cooperative Information Systems, pages 160–169, 1997.

[4] Goksel Aslan and Dennis McLeod. Semantic Heterogeneity Resolution in Federated Databases by Metadata Implantation and Stepwise Evolution. VLDB Journal, 8(2):120–132, 1999.

[5] Paolo Atzeni, Paolo Cappellari, and Philip A. Bernstein. Model-independent schema and data translation. In EDBT, pages 368–385, 2006.

[6] Franz Baader and Werner Nutt. Basic description logics. In Description Logic Handbook, pages 43–95, 2003.

[7] Sidney C. Bailin and Walt Truszkowski. Ontology negotiation as a basis for opportunistic cooperation between intelligent information agents. In CIA, pages 223–228, 2001.

[8] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv., 18(4):323–364, 1986.

[9] Sean Bechhofer, Ian Horrocks, Carole Goble, and Robert Stevens. OilEd: a Reason-able Ontology Editor for the Semantic Web. In Proceedings of KI2001, Joint German/Austrian Conference on Artificial Intelligence, number 2174 in Lecture Notes in Computer Science, pages 396–408, Vienna, September 2001. Springer-Verlag.

[10] Domenico Beneventano, Sonia Bergamaschi, Silvana Castano, Alberto Corni, R. Guidetti, G. Malvezzi, Michele Melchiori, and Maurizio Vincini. Information integration: The MOMIS project demonstration. In Amr El Abbadi, Michael L. Brodie, Sharma Chakravarthy, Umeshwar Dayal, Nabil Kamel, Gunter Schlageter, and Kyu-Young Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 611–614. Morgan Kaufmann, 2000.

[11] V. Richard Benjamins, Dieter Fensel, Stefan Decker, and Asunción Gómez-Pérez. (KA)2: building ontologies for the internet: a mid-term report. Int. J. Hum.-Comput. Stud., 51(3):687–712, 1999.

[12] V. Richard Benjamins, Dieter Fensel, and Asunción Gómez-Pérez. Knowledge management through ontologies. In PAKM, 1998.

[13] Sonia Bergamaschi, Silvana Castano, and Maurizio Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Record, 28(1):54–59, 1999.

[14] Jacob Berlin and Amihai Motro. Database schema matching using machine learning with feature selection. In Proceedings of the 14th International Conference on Advanced Information Systems Engineering, pages 452–466. Springer-Verlag, 2002.

[15] Tim Berners-Lee. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Harper San Francisco, 1999.

[16] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, May 2001.

[17] Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, Soumen Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.

[18] Uldis Bojars. Modelling of Resume Data in the Semantic Web Using RDF. Master's thesis, Latvijas Universitate, 2002.

[19] Paolo Bouquet, Luciano Serafini, and Stefano Zanobini. Semantic coordination: A new approach and an application.
In International Semantic Web Conference, pages 130–145, 2003.

[20] Paolo Bouquet, Luciano Serafini, and Stefano Zanobini. Peer-to-peer semantic coordination. J. Web Sem., 2(1):81–97, 2004.

[21] Stéphane Bressan, Cheng Hian Goh, Kofi Fynn, Marta Jessica Jakobisiak, Karim Hussein, Henry B. Kon, Thomas Lee, Stuart E. Madnick, Tito Pena, Jessica Qu, Annie W. Shum, and Michael Siegel. The context interchange mediator prototype. In SIGMOD Conference, pages 525–527, 1997.

[22] Dan Brickley and R. V. Guha. Resource Description Framework (RDF) Schema Specification 1.0. W3C Candidate Recommendation, March 2000. http://www.w3.org/TR/2000/CR-rdf-schema-20000327/.

[23] Travis Brown. Protégé and other ontology tools. http://www.ischool.utexas.edu/i385t-sw/archive/protege/index.html.

[24] Silvana Castano and Valeria de Antonellis. A schema analysis and reconciliation tool environment for heterogeneous databases. In IDEAS '99: Proceedings of the 1999 International Symposium on Database Engineering & Applications, page 53, Washington, DC, USA, 1999. IEEE Computer Society.

[25] Hans Chalupsky. OntoMorph: A translation system for symbolic knowledge. In KR, pages 471–482, 2000.

[26] Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks. Timely and Non-Intrusive Active Document Annotation via Adaptive Information Extraction. In Semantic Authoring, Annotation and Knowledge Markup (SAAKM 2002), ECAI 2002 Workshop, July 2002.

[27] Alain Couchot. Improving web searching using descriptive graphs. In NLDB, pages 276–287, 2004.

[28] Mike Dean. Panel: State of the Semantic Web, 2002 DAML PI Meeting. http://www.daml.org/2002/10/pi-panel-mdean/slide1-0.html.

[29] Mike Dean and Kelly Barber. DAML Crawler. http://www.daml.org/crawler.

[30] Stefan Decker, Michael Erdmann, Dieter Fensel, and Rudi Studer. Ontobroker: Ontology based access to distributed and semi-structured information. In DS-8, pages 351–369, 1999.

[31] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, and Pedro Domingos. iMAP: discovering complex semantic matches between database schemas. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 383–394. ACM Press, 2004.

[32] Li Ding. The Swoogle Semantic Web Search Engine. http://swoogle.umbc.edu/.

[33] H. Do and E. Rahm. COMA - a system for flexible combination of schema matching approaches. In VLDB, 2002.

[34] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In SIGMOD Conference, 2001.

[35] AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy. Learning to map ontologies on the semantic web. In The Eleventh International World Wide Web Conference, 2002.

[36] Dejing Dou, Drew V. McDermott, and Peishen Qi. Ontology translation on the semantic web. J. Data Semantics, 2:35–57, 2005.

[37] Henrik Eriksson, Raymond W. Fergerson, Yuval Shahar, and Mark A. Musen. Automatic Generation of Ontology Editors. In the 12th Banff Knowledge Acquisition Workshop, 1999.

[38] Dieter Fensel. Ontology-based knowledge management. IEEE Computer, 35(11):56–59, 2002.

[39] Dieter Fensel, Deborah L. McGuinness, Ellen Schulten, Wee Keong Ng, Ee-Peng Lim, and Guanghao Yan. Ontologies and electronic commerce. IEEE Intelligent Systems, 16(1):8–14, 2001.

[40] Daniela Florescu, Donald Kossmann, and Ioana Manolescu. Integrating keyword search into XML query processing. Computer Networks (Amsterdam, Netherlands: 1999), 33(1–6):119–135, 2000.

[41] Frédéric Fürst and Francky Trichet.
Axiom-based ontology matching. In K-CAP, pages 195–196, 2005.

[42] Aldo Gangemi, Domenico M. Pisanelli, and Geri Steve. An overview of the ONIONS project: Applying ontologies to the integration of medical terminologies. Data Knowl. Eng., 31(2):183–220, 1999.

[43] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: data models and languages. Intelligent Information Systems, 8(2):117–132, 1997.

[44] Manuel García-Solaco, Fèlix Saltor, and Malú Castellanos. A structure based schema integration methodology. In ICDE, pages 505–512, 1995.

[45] Christine Golbreich, Olivier Dameron, Bernard Gibaud, and Anita Burgun. Web ontology language requirements w.r.t expressiveness of taxonomy and axioms in medicine. In International Semantic Web Conference, pages 180–194, 2003.

[46] Thomas R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993.

[47] Nicola Guarino. Semantic matching: Formal ontological distinctions for information organization, extraction, and integration. In SCIE, pages 139–170, 1997.

[48] Antonio Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In WWW (Special interest tracks and posters), pages 902–903, 2005.

[49] Lin Guo, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK: ranked keyword search over XML documents. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 16–27. ACM Press, 2003.

[50] Farshad Hakimpour and Andreas Geppert. Resolving semantic heterogeneity in schema integration. In FOIS, pages 297–308, 2001.

[51] Alon Halevy, Oren Etzioni, AnHai Doan, Zachary Ives, and Jayant Madhavan. Crossing the structure chasm. In the First Biennial Conference on Innovative Data Systems Research (CIDR), 2003.

[52] Alon Y. Halevy, Zachary G. Ives, Dan Suciu, and Igor Tatarinov. Schema mediation in peer data management systems. In Proc. of 2003 International Conference on Data Engineering, 2003.

[53] Siegfried Handschuh and Steffen Staab. Authoring and Annotation of Web Pages in CREAM. In WWW 2002, May 2002.

[54] Stefan Haustein and Jörg Pleumann. Easing participation in the semantic web. In WWW-2002 Semantic Web Workshop, Honolulu, Hawaii, May 7, 2002.

[55] Bin He and Kevin Chen-Chuan Chang. A Holistic Paradigm for Schema Matching. SIGMOD Record, 33(3), September 2004.

[56] Bin He and Kevin Chen-Chuan Chang. Statistical schema matching across web query interfaces. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 217–228. ACM Press, 2003.

[57] Jeff Heflin, James Hendler, and Sean Luke. SHOE: A knowledge representation language for internet applications. Technical Report CS-TR-4078, University of Maryland, College Park, 1999.

[58] James Hendler and Deborah L. McGuinness. The DARPA agent markup language. IEEE Intelligent Systems, 15(6):67–73, 2000.

[59] Ian Horrocks, Dieter Fensel, Jeen Broekstra, Stefan Decker, Michael Erdmann, Carole Goble, Frank van Harmelen, Michel Klein, Steffen Staab, Rudi Studer, and Enrico Motta. OIL: The Ontology Inference Layer. Technical Report IR-479, Vrije Universiteit Amsterdam, Faculty of Sciences, September 2000. See http://www.ontoknowledge.org/oil/.

[60] E. H. Hovy. Combining and standardizing large-scale, practical ontologies for machine translation and other uses.
In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC), Granada, Spain, May 28–30, 1998.

[61] Eduard Hovy. Using an ontology to simplify data access. Communications of the ACM, 46(1):47–49, 2003.

[62] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases, 2002.

[63] Arvind Hulgeri, Gaurav Bhalotia, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword search in databases. IEEE Data Engineering Bulletin, 24(3):22–31, 2001.

[64] W. Dean Bidgood Jr., Louis Y. Korman, Alan M. Golichowski, P. Lloyd Hildebrand, Angelo Rossi Mori, Bruce Bray, Nicholas J. G. Brown, Kent A. Spackman, S. Brent Dove, and Katherine Schoeffler. Controlled terminology for clinically-relevant indexing and selective retrieval of biomedical images. Int. J. on Digital Libraries, 1(3):278–287, 1997.

[65] Yannis Kalfoglou and W. Marco Schorlemmer. Using formal concept analysis and information flow for modelling and sharing common semantics: Lessons learnt and emergent issues. In ICCS, pages 107–118, 2005.

[66] Aditya Kalyanpur, Bijan Parsia, James Hendler, and Jennifer Golbeck. SMORE - Semantic Markup, Ontology, and RDF Editor.

[67] Atanas Kiryakov, Borislav Popov, Ivan Terziev, Dimitar Manov, and Damyan Ognyanoff. Semantic annotation, indexing, and retrieval. J. Web Sem., 2(1):49–79, 2004.

[68] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and abstract syntax. http://www.w3.org/TR/rdf-concepts/, 2004.

[69] Paul Kogut and William Holmes. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. In First International Conference on Knowledge Capture (K-CAP 2001), Workshop on Knowledge Markup and Semantic Annotation, October 2001.

[70] Marja-Riitta Koivunen, Eric Prud'Hommeaux, and Ralph R. Swick. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In the Tenth International World Wide Web Conference, May 2001.

[71] Anand Kumar and Barry Smith. The universal medical language system and the gene ontology: Some critical reflections. In KI, pages 135–148, 2003.

[72] Martin S. Lacher and Georg Groh. Facilitating the exchange of explicit knowledge through ontology mappings. In FLAIRS Conference, pages 305–309, 2001.

[73] Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) model and syntax specification - W3C recommendation. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/, 1999.

[74] Fritz Lehmann. Machine-negotiated, ontology-based EDI (electronic data interchange). In Electronic Commerce, pages 27–45, 1994.

[75] A. Y. Levy, D. Srivastava, and T. Kirk. Data model and query evaluation in global information systems. Intelligent Information Systems, 5(2):121–143, 1995.

[76] Wen-Syan Li and Chris Clifton. SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data Knowl. Eng., 33(1):49–84, 2000.

[77] Robert MacGregor, Sameer Maggon, and Baoshi Yan. MetaDesk: A Semantic Web Desktop Manager. In International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2004), 2004.

[78] Robert M. MacGregor. Inside the Loom Description Classifier. SIGART Bulletin, 2(3):88–92, June 1991.

[79] Robert M. MacGregor. Beyond Description Logics. In 1994 International Workshop on Description Logics, May 1994.

[80] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Y. Halevy. Corpus-based schema matching. In ICDE, 2005.

[81] Jayant Madhavan, Philip A. Bernstein, Pedro Domingos, and Alon Y. Halevy.
Representing and reasoning about mappings between domain models. In Eighteenth National Conference on Artificial Intelligence, pages 80–86. American Association for Artificial Intelligence, 2002.

[82] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In The VLDB Journal, pages 49–58, 2001.

[83] Andreas Maier, Hans-Peter Schnurr, and York Sure. Ontology-based information integration in the automotive industry. In International Semantic Web Conference, pages 897–912, 2003.

[84] Kamil Matousek, Lubos Kral, and Martin Falc. Apollo CH Manual. http://apollo.open.ac.uk/.

[85] Brian Matthews. JISC Technology and Standards Watch: Semantic Web Technologies. http://dip.semanticweb.org/documents/Techwatch2005-SemanticWebTechnologies.pdf.

[86] Robert McCann, AnHai Doan, Vanitha Varadaran, Alexander Kramnik, and ChengXiang Zhai. Building data integration systems: A mass collaboration approach. In WebDB, pages 25–30, 2003.

[87] Deborah L. McGuinness, Richard Fikes, James Rice, and Steve Wilder. The Chimaera Ontology Environment. In the Seventeenth National Conference on Artificial Intelligence (AAAI 2000), July 30 - August 3, 2000.

[88] S. Melnik, H. Molina-Garcia, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In the International Conference on Data Engineering (ICDE), 2002.

[89] Sergey Melnik. Declarative mediation in distributed systems. In ER, pages 66–79, 2000.

[90] E. Mena, A. Illarramendi, V. Kashyap, and A. P. Sheth. OBSERVER: an approach for query processing in global information systems based on interoperation across pre-existing ontologies. Distributed and Parallel Databases, 8(2):223–271, 2000.

[91] Timothy Miles-Board. COHSE Annotator. University of Southampton. http://www.ecs.soton.ac.uk/~tmb/cohse/annotator/.

[92] Tova Milo and Sagit Zohar. Using schema matching to simplify heterogeneous data translation. In VLDB, pages 122–133, 1998.

[93] Prasenjit Mitra and Gio Wiederhold. An algebra for semantic interoperability of information sources. In 2nd Annual IEEE International Symposium on Bioinformatics and Bioengineering, pages 174–182, Bethesda, MD, USA, November 4-6, 2001.

[94] Prasenjit Mitra, Gio Wiederhold, and Martin Kersten. A graph-oriented model for articulation of ontology interdependencies. In EDBT 2000, Lecture Notes in Computer Science, pages 86–100, Konstanz, Germany, March 27-31, 2000.

[95] Natalya F. Noy and Mark A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In 17th National Conference on AI, 2000.

[96] Natalya Fridman Noy and Mark A. Musen. PromptDiff: A fixed-point algorithm for comparing ontology versions. In AAAI/IAAI, pages 744–750, 2002.

[97] Daniel Oberle, Raphael Volz, Boris Motik, and Steffen Staab. An extensible ontology software environment. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, chapter III, pages 311–333. Springer, 2004.

[98] Luigi Palopoli, Giorgio Terracina, and Domenico Ursino. DIKE: a system supporting the semi-automatic construction of cooperative information systems from heterogeneous databases. Softw., Pract. Exper., 33(9):847–884, 2003.

[99] Yun Peng, Youyong Zou, Xiaocheng Luan, Nenad Ivezic, Michael Grüninger, and Albert Jones. Semantic resolution for e-commerce. In WRAC, pages 355–366, 2002.

[100] Domenico M. Pisanelli, Aldo Gangemi, and Geri Steve. A medical ontology library that integrates the UMLS Metathesaurus. In AIMDM, pages 239–248, 1999.
[101] Borislav Popov, Atanas Kiryakov, Angel Kirilov, Dimitar Manov, Damyan Ognyanoff, and Miroslav Goranov. KIM - Semantic Annotation Platform. In 2nd International Semantic Web Conference (ISWC2003), 20-23 October 2003. [102] Dennis Quan, David Huynh, and David R. Karger. Haystack: A Platform for Authoring End User Semantic Web Applications. In International Semantic Web Conference 2003, October 2003. [103] E. Rahm and P. Bernstein. On matching schemas automatically. Technical report, Microsoft Research, Redmon, WA, 2001. MSR-TR-2001-17. [104] Alan L. Rector, Sean Bechhofer, Carole A. Goble, Ian Horrocks, W. A. Nowlan, andW.D.Solomon. Thegrailconceptmodellinglanguageformedicalterminology. Artificial Intelligence in Medicine, 9(2):139–171, 1997. [105] James Hendler Ian Horrocks Deborah L. McGuinness Peter F. Patel-Schneider Sean Bechhofer, Frank van Harmelen and Lynn Andrea Stein eds. Owl web on- tology language reference. http://www.w3.org/TR/owl-ref/, 2004. [106] Robert Stevens, Carole A. Goble, Ian Horrocks, and Sean Bechhofer. Building a bioinformatics ontology using oil. IEEE Transactions on Information Technology in Biomedicine, 6(2):135–141, 2002. [107] Nenad Stojanovic. On the query refinement in the ontology-based searching for information. Inf. Syst., 30(7):543–563, 2005. [108] Gerd Stumme. Ontology merging with formal concept analysis. In Semantic Inter- operability and Integration, 2005. 118 [109] Gerd Stumme and Alexander Maedche. Fca-merge: Bottom-up merging of ontolo- gies. In IJCAI, pages 225–234, 2001. [110] York Sure, Jrgen Angele, and Steffen Staab. OntoEdit: Guiding Ontology Devel- opment by Methodology and Inferencing. In DOA/CoopIS/ODBASE 2002 Con- federated International Conferences DOA, CoopIS and ODBASE 2002, 2002. [111] AnthonyTomasic, LouiqaRaschid, andPatrickValduriez. Scalingaccesstohetero- geneous data sources with disco. IEEE Trans. Knowl. Data Eng., 10(5):808–823, 1998. [112] M. Uschold and M. Gruninger. Ontologies: Principles, Methods and Applications. The Knowledge Engineering Review, 1996. [113] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt, andFabioCiravegna. MnM:OntologyDrivenSemi-AutomaticandAutomaticSup- port for Semantic Markup. In The 13th International Conference on Knowledge Engineering and Management (EKAW 2002), 2002. [114] YiminWang. KnowledgeElicitationPlug-inforProtg-CardSortingandLaddering . Master’s thesis, School of Computer Science, University of Manchester, 2005. [115] GioWiederhold. Mediatorsinthearchitectureoffutureinformationsystems. IEEE Computer, 25(3):38–49, 1992. [116] Gio Wiederhold. Interoperation, mediation, and ontologies. In International Sym- posium on Fifth Generation Computer Systems, Workshop on Heterogeneous Coop- erativeKnowledge-Bases,volumeW3,pages33–48.ICOT,Tokyo,Japan,December 1994. [117] FlorisWiesman, NicoRoos, andPaulVogt. Automaticontologymappingforagent communication. In AAMAS, pages 563–564, 2002. [118] Baoshi Yan, Martin Frank, Pedro Szekely, Robert Neches, and Juan Lopez. Web- scripter: Grass-roots ontology alignment via end-user report authoring. In the Second International Semantic Web Conference, Octor 2003. [119] Baoshi Yan and Robert MacGregor. Augmented keyword-based search over triple stores. May 2004. [120] Ling Ling Yan, Ren´ ee J. Miller, Laura M. Haas, and Ronald Fagin. Data-driven understandingandrefinementofschemamappings. SIGMOD Record (ACM Special Interest Group on Management of Data), 30(2):485–496, 2001. [121] Anna V. Zhdanova and Pavel Shvaiko. 
Community-driven ontology matching. In ESWC, pages 34–49, 2006. 119
Abstract
This dissertation aims to lower the entry barrier for laymen contributing to the Semantic Web. Specifically, it seeks to make two basic and tightly related Semantic Web tasks easier for laymen to perform: semantic data creation and ontology alignment.