FROM MATCHING TO QUERYING: A UNIFIED FRAMEWORK FOR ONTOLOGY INTEGRATION

by Yinuo Zhang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2018

Copyright 2018 Yinuo Zhang

Acknowledgments

I would like to express my deep and sincere gratitude to my advisor, Dr. Viktor K. Prasanna. He continuously supported and encouraged me to achieve my research goals, and guided and mentored me patiently during my Ph.D. study. I would like to thank Prof. Dennis McLeod and Prof. Cauligi (Raghu) Raghavendra for serving on my thesis committee. Their valuable feedback has guided me to improve this thesis.

I would like to acknowledge all my colleagues with whom I have collaborated during this work. In particular, I would like to thank Dr. Anand Panangadan, Dr. Vikram Sorathia, Dr. Harris Charalampos Chelmis, Om Patri, Jacky Chung Ming Cheung, and Hao Wu. The discussions and collaborations with them contributed to most parts of this thesis.

I would like to extend my thanks to Kathryn Kassar, Juli Legat, Lizsl De Leon, and Janice Thompson for taking care of logistics and administrative work for me.

Finally, I would like to give my deepest gratitude to my family for their love and support. They have been continuously inspiring me during my Ph.D. life.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Information Integration on the Web
  1.2 Querying over Heterogeneous Data Sources
  1.3 Contributions
  1.4 Organization

Chapter 2: Background and Related Work
  2.1 Ontology Matching and Alignment
    2.1.1 Related Work
    2.1.2 Problem Definition
  2.2 Querying over Heterogeneous Ontologies
    2.2.1 Related Work
    2.2.2 Problem Definition
  2.3 Integration and Query Optimization
    2.3.1 Related Work
    2.3.2 Problem Definition

Chapter 3: The Unified Ontology Integration Framework
  3.1 The Unified Fuzzy Ontology Matching - UFOM
  3.2 The Query Algorithm - UFOMQ
  3.3 The Integration Optimization - UFOM-ML
  3.4 The Query Optimization - LTO

Chapter 4: Unified Fuzzy Ontology Matching
  4.1 UFOM Framework
    4.1.1 Preprocessing Unit
    4.1.2 Confidence Generator
    4.1.3 Similarity Generator
    4.1.4 Alignment Generator
    4.1.5 Query Execution
  4.2 Experimental Evaluation
    4.2.1 Equivalence
    4.2.2 Relevance
Chapter 5: Querying for Individuals in Heterogeneous Ontologies
  5.1 UFOMQ: Querying using Alignments
    5.1.1 The Algorithm
    5.1.2 Experimental Evaluation
  5.2 Computational Cost of Querying for Related Entities
    5.2.1 Direct Matching
    5.2.2 Indirect Matching
    5.2.3 Complexity Analysis
    5.2.4 Experimental Evaluation

Chapter 6: Optimizing Ontology Integration and Query Execution on the Semantic Web
  6.1 Optimized UFOM with Machine Learning Components
    6.1.1 Optimization of Ontology Alignment
    6.1.2 Logistic Regression (UFOM+)
    6.1.3 Neural Network (UFOMNN)
    6.1.4 Experimental Evaluation
  6.2 LTO: Optimizing SPARQL Queries with Conditional Triple Patterns
    6.2.1 Selectivity-based SPARQL Optimization
    6.2.2 Selectivity Estimation using Histograms
    6.2.3 Selectivity Estimation using Machine Learning
    6.2.4 The Optimizer
    6.2.5 Experimental Evaluation

Chapter 7: Applications
  7.1 Enabling Efficient Search in Oil and Gas Domain
    7.1.1 Introduction
    7.1.2 Data Model
    7.1.3 Information Integration
    7.1.4 Harmonization with ISO 15926
  7.2 Integration of Heterogeneous Web Services for Social Networks
    7.2.1 Introduction and Background
    7.2.2 System and Methodology
    7.2.3 Application

Chapter 8: Conclusion
  8.1 Summary of Contributions
  8.2 Future Work
    8.2.1 Exploration on Matching and Querying over Ontologies
    8.2.2 Exploration on SPARQL Query Optimization

Bibliography

List of Tables

4.1 Computing Relevance relation between entities in the enterprise application
5.1 Query Execution Time (UFOM vs Baseline)
5.2 Parameters Used in the Experiments
5.3 Number of triples in ontologies
5.4 Input pairs and number of non-pruned input entries and size of output
5.5 Running time of various query implementation strategies - client and server on same host (NL: Nested-loop, GP: Graph pattern)
5.6 Running time of various query implementation strategies - client and server on different hosts (NL: Nested-loop, GP: Graph pattern)
6.1 Computing Relevance relation between entities in the enterprise application using UFOM
6.2 Histogram for "ex:page"
6.3 Histogram for predicates "ufom:score" and "ufom:conf"
6.4 Statistics of the Synthetic Dataset

List of Figures

1.1 Information exchange among subsystems
1.2 Part of the Linked Data Cloud
1.3 Ontology Alignment
2.1 Initial graph
2.2 Intermediate graph
3.1 The Unified Framework for Ontology Integration
4.1 Components of the UFOM system for computing fuzzy ontology alignment
4.2 Components of the UFOM Similarity Generator
4.3 Precision, recall, and f-measure on applying UFOM to the Conference Ontology matching problem
4.4 Precision, recall, and f-measure on applying UFOM to the Instance Matching Ontology
5.1 Illustration of direct matching and indirect matching
5.2 Precision and execution time on applying UFOM query execution to the Instance Matching Ontology
5.3 Example of Direct Matching
5.4 Example of indirect matching
5.5 Computational cost plotted against m
5.6 Graph pattern implementation of Direct Matching
6.1 Neural Network for Estimating Relation Score
6.2 Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Conference Ontology matching
6.3 Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Instances Ontology matching
6.4 Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Museum Ontology matching
6.5 The Optimization Architecture
6.6 Static Optimization Strategy 1
6.7 Static Optimization Strategy 2
6.8 Dynamic Optimization
6.9 The Neural Network for Estimating Selectivity
6.10 Selectivity Estimation Accuracy Comparison
6.11 Network Bandwidth Comparison
6.12 Query Response Time Comparison
6.13 Evaluation on Enterprise-scale Dataset: (a) Mean Squared Error; (b) Network Traffic; (c) Query Response Time
7.1 Schema interoperability problem
7.2 PPDM Data Model 3.8 Architecture
7.3 A scenario of how information flows
7.4 Work flow for generating mapping ontology
7.5 Ontology Development from Data and Schema
7.6 Schema Integration PPDM Ontology Driven
7.7 Framework of our schema and mapping ontologies
7.8 An example table description in PPDM document
7.9 Attribute, PK and FK in XML file
7.10 WORK ORDER Class in OWL format
7.11 Subclasses of WORK ORDER Class
7.12 Property examples
7.13 Subject Area Mapping
7.14 An example mapping
7.15 Individuals of ClassOfActivity
7.16 Subclass of WELL ACTIVITY in PPDM
7.17 Query result for mapping ontology
7.18 Query result for schema ontology
7.19 System Architecture
7.20 Using Karma to generate the semantic representation
7.21 Attending an Event
7.22 The front-end of EasyGo: (a) Interface for event search and recommendations. (b) A list of searched events. (c) A map of searched events. (d) Group and deal information for an event.
7.23 Registration in EasyGo: (a) Traditional Registration. (b) Facebook Registration.
8.1 A Correlated Sample Synopsis Method

Abstract

As the volume and variety of data on the web increase rapidly, it becomes necessary to develop technologies that enable applications to automatically discover and link relevant data sources. The Semantic Web is an extension of the traditional web that goes beyond hyperlinked documents into a network of data elements linked together in formally defined semantic relationships. This thesis deals with two primary challenges in the domain of data interoperability and the Semantic Web: (i) how to discover semantic links of different types between entities from various data sources, and (ii) how to enable efficient querying over heterogeneous ontologies.

We develop an integrated framework to address these issues, where the links between entities are automatically discovered and then used to optimize querying. Unified Fuzzy Ontology Matching (UFOM) is an ontology matching system designed to discover semantic links between large real-world ontologies populated with entities from heterogeneous sources. In such ontologies, several entities are expected to be related to each other but not necessarily with one of the typical well-defined correspondence relationships (equivalent-to, subsumed-by). This motivates the need for a unified ontology matching system that can discover arbitrary types of relationships, including those that are more loosely defined in the context of specific applications. UFOM uses fuzzy set theory as a general framework for representing different types of alignments across ontologies.

The main challenge in identifying similar instances across multiple ontologies is the high computational cost of evaluating similarity between every pair of entities. We develop the UFOMQ (Unified Fuzzy Ontology Matching Query) algorithm to query for similar instances across multiple ontologies; it makes use of the correspondences discovered during ontology alignment in order to reduce this cost.
The query algorithm uses a fuzzy logic formulation and extends fuzzy ontology alignment. The algorithm identifies entities that are related to a given entity directly from a single alignment link or by following multiple alignment links. We also show how the computational cost of the underlying SPARQL queries can be optimized. We develop two variants of UFOM, UFOM+ and UFOMNN, in which multiple optimization strategies and machine learning techniques are incorporated to decrease the computational cost of ontology alignment, while still maintaining high accuracy in the discovered alignments.

We extend selectivity-based query optimization methods to queries with conditional patterns, using histograms to maintain an estimate of the distribution of the ontology instances. We then investigate a machine learning-based approach that uses the statistics of results from past queries to learn a model that relates query patterns to selectivity, which in turn can be used for query optimization. These two approaches (maintaining histograms and predicting from prior query results) are complementary: as query history is accumulated and used for training, histogram-based estimation is gradually traded for machine learning-based prediction.

The performance of UFOM and its variants is extensively evaluated using synthetic and real-world enterprise-scale datasets. The proposed framework is applied to two domains. First, we integrate heterogeneous standards and provide a unified approach to information retrieval among different data sources in the oil and gas industry. Second, we apply the framework to integrate heterogeneous web services for online social networks. The experiments show that the proposed approach can decrease both network bandwidth requirements and query response time compared with state-of-the-art methods.

Chapter 1: Introduction

By some estimates, there are over a billion webpages on the web, ten times more than existed ten years ago (http://www.internetlivestats.com/total-number-of-websites/). Many of these webpages describe the same topic. For example, someone's job title can be found on both his/her Facebook and LinkedIn pages. As this amount of information increases on the Web, the efficiency of sharing and reusing data among different web sources becomes a crucial factor in the process of information retrieval and knowledge discovery.

Web 1.0, also known as the web of documents, established the links among documents and abstracted away the physical storage and networking layers in information exchange. It eliminated document silos by introducing hyperlinks. The great advantage of Web 1.0 is that even though a large number of web servers exist in the machine layer, it abstracts them as the internet cloud. Unfortunately, users are passive on Web 1.0 and are only able to read what web page creators have written.

Web 2.0, the web of applications, allows web pages to be dynamic and gives users the ability to collaborate and shape a web page according to their own needs. However, it suffers from the phenomenon of application silos. The big drawback is that the applications do not interact and interoperate with each other. For example, in the social media domain, if your job title is updated on LinkedIn, it will not automatically update the corresponding information on your Facebook profile. The reason is that the applications do not share information and their data are not connected to each other.

Companies have the same problem.
Within an enterprise, different databases and systems exist, each of which focuses on a specific domain (e.g., financial system, HR system, etc.). These systems are usually provided by different vendors and maintained by different groups of people. Traditional technologies for managing data, including relational databases and XML, are capable of handling structured organizational data and are able to process data at speeds that meet domain requirements. These technologies are known to scale and support querying on the underlying data store. However, the scale and complexity of current enterprise-level solutions are beyond simple storage and retrieval. Such solutions require mining, discovering new relations, and generating flexible views from all available data; i.e., they need a framework capable of bringing together large amounts of heterogeneous sources. For example, in a financial institution, customer relationship management, fraud detection, and many mining and analysis applications require this additional capability to reveal new insights from existing data.

Fig. 1.1 shows an example in a petroleum company where the systems currently only allow point-to-point communication and information is exchanged only between specific pairs of systems. It will become impossible to find integrated solutions as the number of systems increases. Thus, it is necessary to develop a better approach to reduce the effort in compiling and integrating the data.

The main idea of the Semantic Web is to represent, categorize, and store data in such a way that a computer can understand it as well as a human can. The Semantic Web maintains the links between facts and keeps the information on the web more organized. For example, two entities that refer to the same thing in different data sources can be linked by using identical URIs (Uniform Resource Identifiers). As a result, it can provide a common framework that allows information to be shared and reused across different applications, enterprises and domains.

Figure 1.1: Information exchange among subsystems

Linked Data is a method of publishing Semantic Web data so that it can both support queries to make data more accessible to applications and also be interlinked. Fig. 1.2 indicates the range and scale of the Web of Data (Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak, http://lod-cloud.net/). Each node in this cloud diagram represents a distinct data set published as Linked Data, as of August 2014. With the development of Linked Data, enterprise-scale industries are increasingly storing data as Semantic Web ontologies, either as the data are created or by using automated tools such as D2R (Database to RDF) to convert data from traditional relational databases. An ontology is a formal representation of information in a domain as a set of concepts and their relationships.

Figure 1.2: Part of the Linked Data Cloud

Storing information as ontologies opens up the possibility of linking different but related data sources and providing a unified view to the end-user. We next introduce the role of ontology in the process of information integration, and describe how query execution over heterogeneous sources can be facilitated.

1.1 Information Integration on the Web

The increasing amount of data that is captured in different applications within an enterprise has led to the need for automatically discovering correspondences among these data sources.
For instance, with appropriate shared vocabularies, it is possible to integrate information from multiple workflows, including employees, customers, products, and operations, in order to have better analysis capabilities and business decisions [85]. The Semantic Web maintains the links between data sources and keeps the information meaningful and organized.

Figure 1.3: Ontology Alignment

On the Semantic Web, an ontology represents information as a set of formally defined concepts within a domain and their inter-relationships. For example, ontology 1 in Fig. 1.3 consists of five concepts in the automobile domain, namely Thing, Car, Locomotive, Engine and Horsepower, and five relations among them (denoted by arrows). The Semantic Web can relate and integrate data elements from heterogeneous data sources and domains into a single model using bridge relations between multiple ontologies. It is able to provide a common framework allowing information to be shared across diverse systems and applications. Within an enterprise, applying Semantic Web technology to business intelligence and information search solutions in its information management can bring huge value [45]. Ontologies support better communication, explanation, interoperability, and mediation between data representations. Ontologies are semantically linked by discovering alignments between the entities contained in different ontologies. Fig. 1.3 also shows an example of ontology alignment between two ontologies modeling the basic concepts and relations in the automobile domain. The example alignment consists of 4 correspondences, corresponding to the 4 dashed curved arrows in the figure. Those correspondences imply equivalence relationships between Thing and Object, Car and Automobile, Locomotive and Train, and Horsepower and Horsepower. The process of determining such correspondences between concepts in ontologies is called ontology matching. In this context, automatically linking multiple Semantic Web ontologies has become an area of active research [83, 16].

In this thesis, we address issues that arise when semantically linking large, real-world ontologies that are populated with entities from heterogeneous sources. A desirable feature of such ontology matching systems is the ability to represent and detect different types of relations that can exist between entities. In real-world datasets, entities in an ontology do not always correspond to single physical entities (even in cases where the schema is well-defined, the instance population process may not always follow the schema). In such cases, several entities in different ontologies are expected to be related to each other but not necessarily with one of the typical well-defined relationships (equivalent-to, is-a, part-of, subsumed-by). Relations such as equivalence and disjointedness [83] do not represent all the possible relations that exist between entities in ontologies. Intuitively, there are entities that are related since they share a certain amount of mutual information. However, most of the existing systems for ontology matching focus on computing only the specific relations of equivalence [46, 40, 61, 53, 42, 34, 20] and subsumption [61].

Another challenge in identifying such relations is that the degree of relatedness varies, i.e., it is not accurate to simply declare whether two entities are related or not. Instead, such relationships have a degree of uncertainty fundamentally associated with them.
Recent research in Semantic Web and ontology matching has developed methods of handling uncertainty. In general, these methods are based on two alternate means of representing uncertainty: Probability Theory [2, 63, 58] or Fuzzy Set Theory [31, 19, 89].

When computing the degree to which a particular relationship holds between two entities in different ontologies, it is necessary to evaluate the supporting evidence. For example, if there are a large number of instances related to the two entities of interest and these instances show a high degree of similarity, then our degree of confidence in the relationship between the entities is high. In a real-world application, this confidence has to be reported along with the ontology alignment. In particular, this confidence should be used while querying linked ontologies in order to rank results by the certainty of correctness. However, the computational complexity of computing the multiple similarity metrics is high, and this makes the integration framework less suitable for integrating enterprise-scale ontologies. In this thesis, we present two strategies in the integration process that incorporate several optimization methods to speed up the computation of the similarity metrics.

Solely focusing on discovering ontology alignments has limited usefulness in enterprise-scale applications. The resulting alignments must be used to enable end users to query for related entities in a scalable manner. In this context, scalability refers to the ability to compare a large number of potentially related entity pairs, which is expected in enterprise datasets. We also describe how we can speed up querying for related entities by using the confidence measure computed during ontology alignment in addition to the actual alignment.

1.2 Querying over Heterogeneous Data Sources

The Semantic Web also defines an ontology-based query language that enables users to ask questions based on the concepts and relations in the ontology. The query focuses on the domain entities and relations without any reference to how the data is actually stored and organized at the physical level. A query constructed using terms in the user's vocabulary is answered by using the ontology to translate it to other concepts and relations that have actual data or instances associated with them.

With the increasing use of ontologies for storing large amounts of heterogeneous data in enterprise-scale applications, there has been a corresponding increase in interest in automatically discovering links at the instance level between such ontologies. While a variety of approaches have been proposed in recent years to discover alignments at the schema level [16, 83], much less work has been carried out in identifying similar individuals in different ontologies. The main challenge in identifying similar individuals is the scale of the search space. Potentially, every individual represented in the ontologies has to be evaluated for its similarity to the query individual, along with all of its properties. Such exhaustive evaluation of the search space is not feasible in enterprise-scale datasets. Another drawback of current search and query technology is its low precision: there is no understanding of the meaning of the searched words. Semantic search is based on a formal knowledge base consisting of both schema data and individual data in the form of ontologies. The knowledge base is created by converting structured relational data and unstructured text into triples.
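As a small illustration of this conversion step (a hypothetical sketch of our own; the table, URIs, and property names below are invented for illustration and are not the D2R tool or data from the thesis), a single row of a relational employee table can be expressed as a set of subject-predicate-object triples:

    # One row of a hypothetical relational table "employee".
    row = {"id": "e42", "name": "Alice Smith", "birthstate": "CA", "age": 19}

    def row_to_triples(row, base="http://example.org/employee/"):
        # The row's primary key becomes the subject URI; each remaining
        # column becomes a predicate-object pair.
        subject = base + row["id"]
        return [(subject, "ex:" + col, value)
                for col, value in row.items() if col != "id"]

    for triple in row_to_triples(row):
        print(triple)
    # ('http://example.org/employee/e42', 'ex:name', 'Alice Smith'), ...

Real converters map each table to a class and each column to a property in a similar spirit, though the details of the mapping are tool-specific.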
Once the knowledge base is built, a semantic query engine exploits the ontology relations to find relevant information. In this thesis, we make use of the alignments discovered during the ontology linking process to efficiently query for all individuals in an ontology that are similar to a given individual. Specifically, if the alignments represent the degree to which two entities in different ontologies share a particular type of relationship, then an algorithm that returns individuals in decreasing order of their similarity to the target individual only needs to follow alignments starting from all of that target individual's properties. Such a query mechanism can be designed using the framework of fuzzy logic.

The efficiency of querying for similar individuals can also be improved at the query level. SPARQL is the query language of the Semantic Web, which is able to pull values from ontologies and explore data by querying unknown relationships. For instance, consider an employee ontology used in a company. We have the following SPARQL query, which has 3 conditions as triple patterns labeled t1 to t3.

SELECT ?x WHERE {
  ?x ex:type "person" .        # (t1)
  ?x ex:birthstate "CA" .      # (t2)
  ?x ex:age ?age .             # (t3)
  FILTER (?age < 20)           # (t3)
}

This query returns all California-born employees who are younger than 20. In general, if t1 is evaluated first, a large number of subjects will be returned, since this ontology is mainly about employees and an employee must be a person. However, if either t2 or t3 is evaluated first, the intermediate results will be a subset of the results obtained when executing t1 first, because the subjects with age less than 20 or with birth state California must have type person in this employee ontology. Furthermore, if we know that evaluating t3 returns a smaller result set than evaluating t2, we can safely state that a query engine should evaluate the query with the triple pattern order t3 -> t2 -> t1. As a result, the optimizer will reverse the triple patterns and this optimization will lead to better query performance. In this thesis, we develop multiple query optimization strategies which can generate better query execution plans. With the development of distributed systems, the RDF data can also be stored in a distributed manner instead of in a single centralized triple store. In such an environment, when we have a large amount of data, the computational burden and storage load can be reduced. We have developed methods for optimizing SPARQL queries over RDF data stored in a distributed environment.

1.3 Contributions

To integrate heterogeneous data sources on the Semantic Web and enable efficient query execution over the integrated framework, we propose novel approaches and algorithms, and study their performance. Specifically, this thesis makes the following contributions.

Ontology Integration Framework. We develop our unified ontology integration framework, which is the basis for developing the matching component and query engine. The framework consists of a matching component, a query engine and two optimization enhancements. We briefly describe the functionality and methodology related to each component, and how the components interact.

Information Matching on the Web. We introduce the concept of fuzzy ontology alignment. We also introduce a new type of ontology correspondence relation, Relevance, and demonstrate its use in a real-world application.
Based on these concepts, we develop a unified framework, UFOM (Unified Fuzzy Ontology Matching), that integrates multiple measures of computing similarity using the principle of fuzzy set theory. We evaluate the system using publicly available ontologies provided by Ontology Alignment Evaluation Initiative (OAEI) campaigns. The accuracy of the matchings computed by UFOM is comparable with those of the best-performing systems in the OAEI campaigns. We also show experimental results on a dataset from an enterprise application. To the best of our knowledge, UFOM is the first system to formally define a fuzzy representation of ontology matching and the new correspondence relation of Relevance. UFOM is also the first system based on a unified framework to generate ontology alignments for computing different types of correspondences.

Querying over Heterogeneous Ontologies. We present a novel algorithm to efficiently identify similar entities in ontologies using links discovered during ontology alignment. We also derive the computational time complexity as a measure of efficiency. Then, we provide a specification of the direct matching algorithm as a single SPARQL query that can be executed by an ontology server. We conduct a quantitative evaluation of the running times of the proposed query algorithms as implemented using a procedural language and as a graph pattern executed by an ontology server over a real dataset.

Integration and Query Optimization. We develop a framework denoted UFOM-ML by introducing machine learning components to UFOM. Specifically, we improve the UFOM framework in terms of computational cost and accuracy of matching by using various optimization strategies and machine learning techniques in the similarity generation and alignment generation phases. We again evaluate the system using publicly available ontologies provided by Ontology Alignment Evaluation Initiative (OAEI) campaigns. We also show experimental results on a dataset from an enterprise application and on ontologies representing museum collections. We also develop an LTO (Learning to Optimize) framework for SPARQL query optimization. Specifically, we define a histogram-based algorithm (HBO) and a machine learning-based approach (MLO). We evaluate the proposed approaches against state-of-the-art work over both synthetic and enterprise-scale datasets. As shown in the results, we improve query performance in terms of accuracy, response time and network traffic.

Applications. We present several applications of the proposed integrated framework in two domains. First, we enable efficient search capability in the oil and gas industry. We integrate heterogeneous standards and provide a unified framework for information retrieval among different data sources in the petroleum extraction domain. Second, we propose a framework which integrates heterogeneous web services for online social networks. The framework facilitates the process of joining events, and provides a component for event recommendation.

1.4 Organization

This thesis is organized as follows:

In Chapter 2, we present background concepts in fuzzy set theory, ontology alignment and SPARQL query optimization. We also include a comprehensive literature review in ontology integration, semantic query processing and its optimization.

In Chapter 3, we present an overview of the proposed integration framework and briefly discuss the functionality of each component and how they interact.
In Chapter 4, we investigate the problem of ontology matching by developing the UFOM (Unified Fuzzy Ontology Matching) framework, which integrates multiple similarity measures using the principle of fuzzy set theory.

In Chapter 5, we introduce the UFOMQ algorithm, which uses the fuzzy alignments derived by UFOM to enable querying for similar entities in ontologies.

In Chapter 6, we improve the performance of both the matching and query processes by introducing various optimization strategies and machine learning techniques.

In Chapter 7, we present several applications that benefit from our proposed approaches.

In Chapter 8, we draw conclusions of this thesis and provide several directions for future work.

Chapter 2: Background and Related Work

In this chapter, we formally define the problems discussed in this thesis, and present the literature and background in the related areas.

2.1 Ontology Matching and Alignment

2.1.1 Related Work

An ontology matching system aims to discover correspondences between entities in two ontologies. Most existing systems adopt multiple strategies to fully utilize the information contained in the ontologies. In general, these strategies can be classified as terminological, structural and instance-based.

[46] proposes an ontology matching system, SAMBO, for biomedical applications. They use n-grams and edit distances to compute the similarity between the names of two ontology entities. They also consider structural information of the ontology, such as is-a and part-of, to calculate the hierarchical similarity between two entities. In [40], the authors propose a divide-and-conquer approach to ontology matching. Specifically, the system partitions the ontologies into small clusters based on structural proximities between entities. Based on these pre-matched entities, similar blocks are discovered within the clusters. Finally, a linguistic matcher and an iterative structural matcher are combined to find correspondences between the matched blocks. This system performs well on large ontologies. Another matching system handling large-scale ontologies is proposed in [61]. Specifically, they build a query graph for each ontology. Multiple terminological similarity approaches (e.g., Monge-Elkan and Jaccard distances) combined with Dempster-Shafer theory [82] are adopted in order to find the relevant graph fragments of two ontologies. Rule-based fuzzy inference is used in the final alignment generation phase. RiMOM [53] is a multi-strategy ontology matching system which considers both entity labels and instances. In order to capture structural properties, it adopts three similarity propagation strategies: class-to-class, property-to-property, and class-to-property. The choice of the strategy and the weights for similarity combination are dynamically determined based on the features of different ontologies. Semantic verification is introduced in ASMOV [42]. ASMOV differs from the above-described systems in that it examines five types of patterns (e.g., disjoint-subsumption contradiction, subsumption incompleteness) on the correspondences derived in the similarity calculation process. The verification process halts when no more correspondences are discovered. [34] introduces a concept called anchor, which is a correspondence with exactly matched concepts. Given a set of anchors, the system gradually analyzes its neighbors to enrich the correspondence set. [20] considers user feedback in its system.
It has a well-developed user interface which enables users to exercise more control over the ontology matching process.

Most of the above systems discover 1-to-1 correspondences. Only [42] and [20, 21] consider n-to-m alignments. These systems also focus only on equivalence relation extraction, except [61], which also computes the subsumption relation. Moreover, the resulting correspondences are all exact. This contrasts with our approach, where we provide a single framework to compute multiple alignments based on different relations simultaneously.

Mitra et al. [58] introduce a probabilistic ontology matching framework which uses Bayesian Networks to represent the influences between potential correspondences in different ontologies. They present a probabilistic representation of ontology matching rules and inferences in order to refine the quality of an existing ontology alignment. Albagli et al. [2] propose a probabilistic framework for ontology matching based on Markov networks. In their framework, approximate reasoning is adopted to reduce the high computational complexity associated with their uncertainty model. Existing correspondences are used as training data to generate new ones, and user feedback is incorporated to provide an interactive semi-automatic matching process. Besides probability theory, fuzzy set theory has recently been used for ontology matching. Todorov et al. [89] propose an ontology matching framework for multiple domain ontologies. They represent each domain concept as a fuzzy set of reference concepts. Then, the matches between domain concepts are derived based on their corresponding fuzzy sets. In [31], a rule base provided by domain experts is used in the matching process. In that system, both the Jaccard coefficient and linguistic similarity are first calculated for each pair of entities. Then, the system uses the rule base to generate the final similarity measure of each correspondence. The fuzzy set is used as a link between the preliminary similarities and the final output. Though both of the above-described works adopt fuzzy set theory for the ontology matching task, they do not provide a formal definition of a fuzzy representation of correspondence. Moreover, these systems are specific to equivalence-type relations. There is no generic framework identifying correspondences for different types of relations. In this thesis, we address these two issues.

2.1.2 Problem Definition

We first present the basic concepts of fuzzy set theory [1], which forms the basis for our framework. We then define the problem of fuzzy ontology matching.

Definition 1. A fuzzy set is a set of ordered pairs, given by A = {(x, mu_A(x)) : x in X}, where X is a universal set and mu_A(x) is the grade of membership of x in A, which lies in [0,1].

For example, a fuzzy set BABY (representing the set of all babies) can be defined as BABY = {(x, mu_BABY(x))} where mu_BABY(x) = exp(-x). Here, x is the age of a person, and as x -> 0, mu_BABY(x) -> 1.

Definition 2. A fuzzy relation between sets X and Y is given by R(x, y) = {mu_R(x, y)/(x, y) | (x, y) in X x Y}, where the real number mu_R(x, y) denotes the fuzzy membership of relation R(x, y).

For example, the membership function may be mu_R(x, y) = exp[-(x - y)^2], where x and y are real numbers and the fuzzy relation R(x, y) describes how close x is to y. In our general ontology matching framework, every type of relation that is to be discovered between entities in two ontologies has to be described using an appropriate fuzzy relation.
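To make Definitions 1 and 2 concrete, the following small Python sketch (an illustration of our own, not part of the UFOM implementation; the function names are assumptions) evaluates the BABY membership function and the closeness relation from the examples above:

    import math

    def mu_baby(age):
        # Grade of membership in the fuzzy set BABY; approaches 1 as age approaches 0.
        return math.exp(-age)

    def mu_close(x, y):
        # Membership of (x, y) in the fuzzy relation "x is close to y".
        return math.exp(-(x - y) ** 2)

    print(mu_baby(0.5))        # ~0.61: a six-month-old is largely a baby
    print(mu_close(3.0, 3.1))  # ~0.99: 3.0 and 3.1 are very close

Both functions return values in [0, 1], so they are valid membership grades in the sense of the definitions above.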
For example, the equivalence relation, equ, between two numeric entities, x and y, may be defined as

    mu_equ = exp[ -( sum_i min_j |x_i - y_j| + sum_j min_i |x_i - y_j| ) ]

where x_i and y_j refer to the instances in x and y respectively. This membership function is maximized when every instance in x is also present in y and vice-versa. (Note that our UFOM system, described in detail in Chapter 4, uses a more complex membership function for the Equivalence relation.)

The ontology matching problem aims to find an alignment A for a pair of ontologies O1 and O2. An alignment is defined as follows [83].

Definition 3. An ontology alignment A is a set of correspondences between entities of the matched ontologies O1 and O2.

Definition 4. An ontology correspondence is a 4-tuple <id, E1, E2, r>, where id is the identifier for the given correspondence, E1 and E2 are the entities of the correspondence (e.g., properties in the ontology), and r is the relation between E1 and E2 (e.g., equivalence and disjointedness).

Note that in the above definition, the correspondence is exact; i.e., the relation r strictly holds between the ontology entities E1 and E2. However, in systems that have to automatically determine the set membership from real-world data, it is natural that a degree should be associated with the relation between the entities. The higher the degree, the higher the likelihood that the relation holds between them. In order to represent the uncertainty in the correspondences, we present a fuzzy variant of ontology alignment.

Definition 5. A fuzzy ontology alignment is a set of fuzzy correspondences in the form of a 6-tuple <id, E1, E2, r, s, c>, where s = mu_r(E1, E2) is the relation score denoting the membership of (E1, E2) in relation r, and c is the confidence score computed while generating the correspondence.

With this definition of fuzzy correspondence, we can extend the relation type set with some other useful types. Next, we formally define a new type of relation called Relevance. We first motivate the new relation type with an example. Let O1 be an ontology built for all products in a book store and O2 be an ontology based on ISBN for all books. Let the Description property of O1 contain the title and author information of a book. This information can also be found separately in the Title and Author properties in O2. The relation between Description and Title, or Description and Author, is neither subsumption nor part-of, because Description is input by human beings and may not contain the exact Title or Author of a book. Also, Description may have many null values, and we aim to model the degree to which two entities are associated.

In general, the same set of physical entities may be represented with multiple associations across different individuals in an ontology. We therefore define a new relation that we call Relevance as a general representation of the different possible associations between entities. In contrast with the typical relations used in existing ontology matching systems, we define Relevance as a fuzzy relation instead of a Boolean one. We define Relevance as follows.

Definition 6. Rel(E1, E2) = Pr(I(e1) is a subset of I2 | e1 in E1)

Here, I(e) represents the set of physical entities referred to by an instance e, and I_t represents the set of physical entities referred to by instances of entity E_t. The relevance score between properties E1 and E2 represents the probability that physical entities referred to by instances of E1 are also referred to by instances in E2.
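As an illustration of Definitions 5 and 6, the sketch below (our own simplified example; the data structure and the naive instance-overlap estimate are assumptions, not the UFOM algorithm of Chapter 4) stores a fuzzy correspondence and estimates a Relevance score as the fraction of E1 instances whose referenced entities are covered by those referenced by E2:

    from dataclasses import dataclass

    @dataclass
    class FuzzyCorrespondence:
        # The 6-tuple of Definition 5: <id, E1, E2, r, s, c>
        id: str
        entity1: str
        entity2: str
        relation: str      # e.g., "Equivalence" or "Relevance"
        score: float       # relation score s, membership of (E1, E2) in r
        confidence: float  # confidence c in the generated correspondence

    def relevance_score(instances_e1, referenced_by_e2):
        # Naive estimate of Definition 6: fraction of E1 instances whose
        # referenced physical entities all appear among those referenced by E2.
        if not instances_e1:
            return 0.0
        covered = sum(1 for refs in instances_e1 if refs <= referenced_by_e2)
        return covered / len(instances_e1)

    # Hypothetical book-store example: one Description is fully covered by
    # O2's Title/Author values, the other is not.
    e1_instances = [{"Dune", "Frank Herbert"}, {"Unknown Book"}]
    e2_entities = {"Dune", "Frank Herbert", "Neuromancer", "William Gibson"}
    s = relevance_score(e1_instances, e2_entities)  # 0.5
    corr = FuzzyCorrespondence("c1", "onto1:Description", "onto2:Title",
                               "Relevance", s, 0.8)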
Note that the actual implementation of a Relevance relation between two entities will depend on the specific application in which they appear. In this thesis, we present one implementation of the Relevance relation (Section 6.2.5) and evaluate it on an enterprise application.

In Chapter 4, we present the UFOM system to automatically generate a fuzzy ontology alignment. UFOM implements a unified framework for fuzzy ontology matching with different types of relations in the correspondences.

2.2 Querying over Heterogeneous Ontologies

2.2.1 Related Work

Ontology matching is one of the key research topics in the Semantic Web community and has been widely investigated in recent years [83, 71]. An ontology matching system discovers correspondences between entities in two ontologies. Most existing systems adopt multiple strategies, such as terminological, structural, and instance-based approaches, to fully utilize information contained in the ontologies. However, exact matches do not always exist due to the heterogeneity of ontologies. Fuzzy set theory has recently been used for ontology matching in order to address this issue.

[87] proposes an approach for automatically aligning instances, relations and classes. They measure the degree of matching based on a probability estimate. They focus on discovering different relations between classes and relations; for instances, only the equivalence relation is considered. [51] introduces a robust ontology alignment framework which can largely improve the efficiency of retrieving equivalent instances compared with [87]. In Chapter 4 ([97]), we propose a unified fuzzy ontology matching (UFOM) framework to address these two issues. Specifically, UFOM uses fuzzy set theory as a general framework for discovering different types of alignment across ontologies. It enables representation of multiple types of correspondence relations and characterization of the uncertainty in the correspondence discovery process.

Federated ontologies facilitate the process of querying and searching for relevant information. Nowadays, hundreds of public SPARQL endpoints have been deployed on the web. However, most of these are works in progress in terms of discoverability, interoperability, efficiency, and availability [3]. Mora and Corcho [59] investigate the evaluation of ontology-based query rewriting systems. They provide a unified set of metrics as a starting point towards a systematic benchmarking process for such systems. [22] formalizes the completeness of semantic data sources. They describe different data sources using completeness statements expressed in RDF and then use them for efficient query answering. [76] presents a duplicate-aware method for querying over the Semantic Web. Specifically, they generate data summaries to reduce the number of queries sent to each endpoint. [88] proposes a query-specific matching method which generates path correspondences to facilitate query reformulation. All of these works aim to design a unified framework for query execution and information integration. However, none of them consider the uncertainty and ambiguity inherent in ontology correspondences. In our work, we present a query execution algorithm in which query execution is facilitated by using ontology correspondences with fuzzy relation types.

Record Linkage is a field that has been studied extensively. It is the task of finding equivalent records between databases. Some of the results of this research can be applied to schema matching and ontology matching.
For example, we apply some of the techniques developed in that field to pre-process datasets for the UFOM matching process. [47] uses Bayesian inference to formulate probability equations to model the Record Linkage problem. Then, Gibbs sampling is used to estimate the joint probability of records. The records that are most likely to belong to a pair based on this modeling are the output from the probability simulation step.

The field of constraint discovery in databases is related to our work. A schema may not always be available with a full specification; even the creator of the schema may not be aware of all foreign key relationships. Constraint discovery methods have been applied to automatically detect primary keys and foreign keys. Foreign key discovery is defined as the task of discovering inclusion dependencies. Methods that have been applied to constraint discovery include sorting and filtering two lists of instances [5], machine learning [74], and distribution-based approaches [95]. Such constraint relationships on schemas help with both the matching process and the query execution process in our work.

2.2.2 Problem Definition

A single domain may be modeled differently, leading to different ontologies. Finding correspondences of classes and their properties between such ontologies, the problem of ontology matching, is useful in many applications. Common approaches for matching include name-based, instance-based and structure-based methods.

A query execution is a process of retrieving a set of individuals I = {i1, i2, ..., in} which are relevant to a given individual t. I and t can belong to either one ontology or multiple ontologies. In this thesis, we study the case in which I and t come from two heterogeneous ontologies O1 and O2. The one-ontology case is a specialization of the multi-ontology one.

A brute force method is to compare t with all property values in O2. The time complexity of this approach is O(|O2| * |E2|), where |O2| is the number of individuals in O2 and |E2| is the number of properties in O2. The brute force approach is inefficient in terms of search time. In Chapter 5, we describe how we can improve the query performance using fuzzy alignment information. We also analyze the computational cost of executing inter-ontology queries using the UFOM framework. Inter-ontology queries are related to the joining of tables, i.e., finding and combining matching pairs of instances in the Cartesian product of instances of the two tables. Note that the queries only return related entities; we do not attempt to solve the harder problem of harmonizing two ontologies.

2.3 Integration and Query Optimization

2.3.1 Related Work

2.3.1.1 Integration Optimization

An ontology matching system aims to discover correspondences between entities in two ontologies. Most existing systems adopt multiple strategies to fully utilize the information contained in the ontologies. In general, these strategies can be classified as terminological, structural, and instance-based.

Parundekar et al. [67] hypothesize new composite concepts using conjunction and disjunction of RDF types and value restrictions to discover new alignments in ontologies of linked data. The composition allows more alignments to be discovered in cases where one-to-one concept equivalences might not exist in the ontologies. Pesquita et al. [68] explore different types of synonyms in the ontology matching process. Their approach, which combines internal synonym derivation and external sources, improves matching performance on a biomedical ontology.
Recently, deep-learning neural networks have shown promise in various applications, especially image processing [50]. Research has begun on the application of this approach to the problem of ontology alignment. Hariri et al. [35] propose a method for compound metric creation in ontology alignment that is based on a supervised learning approach. Their method uses a training set to create a neural network model, performs sensitivity analysis on it to select appropriate metrics among a set of existing ones, and finally constructs a neural network model to combine the resulting metrics into a compound one. Huang et al. [41] take an artificial neural network approach to learning and adjusting the weights used in ontology matching. They present a new ontology matching algorithm that is designed to avoid some of the disadvantages of both rule-based and learning-based ontology matching approaches. In our work, we propose a more general neural network-based framework, UFOMNN, which can deal with arbitrary types of relationships existing among ontology entities.

One important advantage of creating a federated ontology is that it speeds up the process of querying and searching for relevant information in the knowledge base. Currently, hundreds of public SPARQL endpoints have been deployed on the web. However, most of them are incomplete in terms of discoverability, interoperability, efficiency and availability [3]. Hartig et al. [36] discuss the problem of executing SPARQL queries over the Web of Linked Data. Their proposed approach can efficiently discover relevant data for answering a query. However, their definition of "Relevance" is based on the number of RDF links, which is different from ours, where we consider the semantic relation between objects in the ontology. Query rewriting is also one of the fundamental processes in ontology-based information access. Mora and Corcho [59] investigate different ways of evaluating ontology-based query rewriting systems. They provide a unified set of metrics as a starting point towards a systematic benchmarking process for such systems. Darari et al. [22] formalize the completeness of semantic data sources. Their system arranges different data sources using completeness statements expressed in RDF, and then uses these for efficient query answering.

In the federated ontology environment, different sources may contain duplicated and related information. Saleem et al. [76] present a duplicate-aware method for querying over the Semantic Web. Specifically, they generate data summaries to reduce the number of queries sent to each endpoint. Schneider [80] focuses on geospatial data sources and investigates a combination of design, reasoning and integration. Tian et al. [88] propose a query-specific matcher which generates path correspondences to facilitate query reformulation. Both these works aim to design a unified framework for query execution and information integration. However, these systems do not consider the uncertainty and ambiguity that can exist in ontology correspondences. In earlier work [98], we showed how the discovered fuzzy alignments can be used to enable scalable querying for related entities in different ontologies.

As the volume and velocity of web-accessible data increase, the computational cost of integrating this data with existing ontologies becomes a limiting factor. Rosati [73] presents some recent results on the definition of logic-based systems integrating ontologies and rules.
The work considers ontologies expressed in Description Logics and rules expressed in Datalog. Duong et al. [26] propose an approach to reducing computational complexity in ontology integration. Specifically, an identity-based similarity is computed while matching between concepts, to avoid comparisons of all properties related to each concept. Coessens et al. [17] present a system that is optimized for a particular application. They present a framework for collection and presentation of biomedical information through ontology-based mediation. The framework is built on top of a methodology for computational prioritization of candidate disease genes. It allows the information sources to be integrated on a conceptual level, provides transparency to the user, eliminates ambiguity, and increases the efficiency of displaying information. Cheung et al. [14] derive the computational complexity of querying across ontologies using the ontology alignment links discovered with the UFOM framework that is the basis of the current work. The study also considers the impact of specific implementation approaches on query time.

2.3.1.2 Query Optimization

In general, there are two ways to optimize query execution on the Semantic Web. One direction is to improve data usability. For example, a data summary can be generated for each data source. When a query arrives, the server knows the data distribution and duplicates on each endpoint. Based on this information, it can wisely avoid sending queries to irrelevant data sources. [76] formalizes this idea and presents a duplicate-aware method for querying over the Semantic Web. In order to provide a unified way to use the data summary, [22] further expresses the summary as a completeness statement in the RDF data. In this way, the summary can be presented in a machine-readable manner, and the query engine can stop early if the complete information has been retrieved, without the need to consult other sources. Another way to improve data usability is to find the relations between different data sources. In [88], Tian et al. propose a query-oriented ontology matching method which generates path correspondences to facilitate query execution. The query is able to provide extra information which is not available in traditional ontology matching. In the query execution phase, the system can utilize the corresponding mappings for answering different queries. All of the above works aim to improve data usability and design a unified framework for query execution. However, none of them consider the uncertainty and ambiguity inherent in ontology correspondences. In [98], we present a query execution algorithm in which query execution is facilitated by using ontology correspondences with fuzzy relation types.

The other way to optimize query execution is to focus on the SPARQL query itself. The fundamental aspects related to the efficient processing of SPARQL queries have been studied in [79]. The authors investigate equivalences over the SPARQL algebra, a process which includes both rewriting rules such as filter and projection rewriting, as well as SPARQL-specific rewriting schemes. They also develop a new approach to the semantic optimization of SPARQL queries, built on top of the classical chase algorithm. The key problem in SPARQL query optimization is the join ordering problem. It is a fundamental challenge that has to be solved by any query optimizer. [33] proposes a join ordering algorithm aimed at large SPARQL queries.
Specifically, they simplify the query by decomposing it into star-shaped subqueries and chain-shaped subqueries. Then they utilize the underlying data correlations to construct an efficient execution 26 plan. Sometimes, a batch of queries can executed together in some large-scale systems. Le et al. [49] consider optimization over multiple SPARQL queries. They present an algorithm which finds common subqueries in a batch of SPARQL queries and propose an approach calledMQO which replies on query rewriting based on that algorithm. In recent years, some researchers have introduced machine learning techniques into the context of SPARQL query optimization. As one of the first works, [37] uses SVM to predict SPARQL query execution time. Since this approach does not require any statistics of the RDF data, it is ideal for the Linked Data scenario. A SPARQL query consists of one or multiple triple patterns. Each triple pattern has its own selectivity given the RDF data. In order to have an optimal execution plan, the accuracy of selectivity estimation plays an important role. One of the early works [86] defines and analyzes different heuristics for selectivity-based basic triple pattern opti- mization. A summary of the statistics for RDF data is generated, which enables the selectivity estimation of joined triple patterns and the development of efficient heuris- tics. With the development of distributed systems, storing and processing RDF data can be done in a distributed manner [11]. Some recent works concentrate on the opti- mization of SPARQL queries over RDF stored on top of such a distributed environment. Erietta et al. develop two algorithms, the query chain algorihm (QC) and the spread by value algorithm (SBV), which can be applied to large-scale RDF data and distribute the query processing load evenly, which incurs little network traffic [55]. The idea of the QC algorithm is that the query is evaluated by a chain of nodes and each node corre- sponds to a triple pattern in the query. Intermediate results flow through the nodes of the chain and the final node returns the result back to the initial node. We will compare our proposed algorithms with the QC algorithm in the experimental evaluation. Another recent work by Kaoudi et al. proposes a dynamic optimization algorithm, which seeks to construct query plans that minimize the number of intermediate results during query 27 evaluation. They also propose several methods for estimating the selectivity of single triple patterns, as well as the selectivity of a conjunction of triple patterns [44]. How- ever, all of the above works only consider equality operators in the filter condition of the SPARQL queries. In Chapter 6, we propose two generic optimization algorithms which can handle arbitrary SPARQL queries. 2.3.2 Problem Definition Data and Query Models. We consider RDF triples with no blank nodes (a blank node is a node that does not have an associated URI or literal). For SPARQL queries, we con- sider basic graph patterns (BGP) with filter expressions that can include both equality and inequality operators. Queries with equality operators in the filter condition can be re-written to BGP queries without a FILTER condition. Since query plans that involve Cartesian products are very inefficient to evaluate in a distributed environment, [44] introduced the concept of a query graph to facilitate the query execution on distributed peers. The QC* algorithm of Kaoudi et al. 
[44] represents a SPARQL query as a graph to avoid Cartesian products in the evaluation phase. A graph representation also allows different query optimization methods to be expressed in one framework, since the optimizations are based on an ordering of the nodes in the query graph. In order to handle queries with both equality and inequality operators, we extend the query representation used in [62, 86, 44] in the following definition.

Definition 7. A query graph g is a 4-tuple (N, C, H, E), where N and C are two disjoint sets of nodes, H is a set of hypernodes, and E is a set of undirected edges. Each node in N denotes a single triple pattern or a single conditional triple pattern whose filter expression uses an equality operator, each node in C denotes a single conditional triple pattern whose filter expression uses an inequality operator, and each node in H denotes a conjunction of triple patterns. Two nodes from N ∪ C ∪ H are connected by an edge in E if and only if the corresponding triple patterns, or conjunctions of triple patterns, share at least one variable.

Consider the following example SPARQL query.

SELECT ?prop WHERE {
?prop ufom:type "onto2 prop" . (t1)
?corr ufom:hasProp1 onto1:id . (t2)
?corr ufom:hasProp2 ?prop . (t3)
?corr ufom:relation "Relevance" . (t4)
?corr ufom:score ?s . FILTER (?s > 0.5) . (t5)
?corr ufom:conf ?c . FILTER (?c > 0.7) . (t6)
}

The initial query graph for this example is shown in Figure 2.1 and consists of only simple nodes (N and C), while Figure 2.2 shows an intermediate query graph which also contains a hypernode (H), indicating that triple patterns t1 and t3 have already been evaluated and merged into the hypernode t1 ∧ t3.

Figure 2.1: Initial graph
Figure 2.2: Intermediate graph

Query Execution. The first P2P system is RDFPeers [11], which supports only a restricted query class. Later, Atlas (http://atlas.di.uoa.gr) was developed as a full-blown open source P2P system for the distributed processing of RDF(S) data stored on top of distributed hash tables (DHTs). In this thesis, we focus on the optimization of SPARQL queries over RDF data stored in DHTs. In a distributed environment, the RDF triples are indexed in DHTs and distributed to different peers. A peer is a remote triple store with a certain computation capability. A DHT provides a lookup service that lets any participating peer efficiently retrieve the triples associated with a given key. Specifically, each triple is distributed to a peer based on the hash value of either its subject, predicate, or object. Each peer maintains a table of all local triples, together with information indicating which component was used to compute the hash value [55]. When a query is registered on a peer, it is translated into a query graph. In the QC* algorithm [44], the peers holding the triples required to evaluate the first triple pattern are notified by the initial peer, and this starts the query evaluation based on a specific query plan. The query optimizer is responsible for generating a query plan consisting of the execution order of all triple patterns. This order affects both network traffic and query processing time; for example, minimizing traffic and processing time requires evaluating the triple patterns with low selectivity first.

Conditional Triple Patterns. Conditional triple patterns arise naturally in many application domains. Below we show an example query with conditional triple patterns from the book publication domain.
SELECT ?title WHERE {
?author ex:type "person" .
?book ex:type "publication" .
?book ex:author ?author .
?book ex:title ?title .
?book ex:page ?page . FILTER (?page < 500) .
?author ex:age ?age . FILTER (?age < 40) .
}

This query retrieves all books with fewer than 500 pages whose author is younger than 40 years of age. It has two conditional triple patterns with inequality filter conditions, on the "page" of the book and the "age" of the author. In the next section, we discuss how to estimate the selectivity of such triple patterns with inequality filter conditions.

In prior work [98], we described the UFOMQ algorithm, which can execute queries over two heterogeneous ontologies using pre-computed fuzzy ontology alignments. Specifically, the query process consists of two phases: generating a fuzzy SPARQL query and converting it to a crisp SPARQL query for execution. In the first phase, the relevant properties are retrieved using SPARQL queries with conditional triple patterns. The SPARQL query shown at the beginning of this section is an example used in the first phase. It contains two conditional triple patterns, t5 and t6, and it is crucial to decide which should be evaluated first. If t5 has low selectivity, it can be evaluated before t6; as a result, the data transferred for t5 and the total processing time are reduced. Similarly, below we show another query example from the first phase, which also contains two conditional triple patterns. UFOMQ thus requires a query optimizer that can handle queries with multiple conditional triple patterns.

SELECT ?prop1 ?prop2 ?prop3 WHERE {
?prop1 ufom:type "onto2 prop" .
?prop1 rdfs:domain ?class .
?prop2 ufom:idof ?class .
?prop3 ufom:type "onto2 prop" .
?prop3 rdfs:range ?class .
?corr ufom:hasProp1 onto1:id .
?corr ufom:hasProp2 ?prop1 .
?corr ufom:relation "Equivalence" .
?corr ufom:score ?s . FILTER (?s > 0.7) .
?corr ufom:conf ?c . FILTER (?c > 0.6) .
}

Chapter 3
The Unified Ontology Integration Framework

In this chapter, we give an overview of the unified ontology integration framework. Specifically, we discuss the functionality and methodology of each component and how these components interact with each other. The framework consists of a matching unit and a query unit. In Figure 3.1, the matching unit is represented by the box labeled UFOM/UFOM-ML and the query unit corresponds to the box labeled UFOMQ/LTO. In general, UFOM (Unified Fuzzy Ontology Matching) is an ontology-matching component which takes heterogeneous ontologies as inputs and generates a set of fuzzy ontology alignments with arbitrary relation types. UFOM-ML improves the efficiency of UFOM by incorporating a machine learning-based parameter optimization process. On the query side, UFOMQ utilizes the fuzzy correspondences discovered by UFOM/UFOM-ML to efficiently query for all individuals in an ontology that are similar to a given individual, while LTO (Learn To Optimize) is a SPARQL query optimization component which improves query performance by computing a good query plan. The information modeling, integration, and query features of the framework can be leveraged to improve IT-business alignment and enterprise application integration by integrating models and artifacts from different systems into a single knowledge repository. Next, we briefly describe how each component works.
33 Heterogonous Ontologies UFOM UFOM-ML The Unified Ontology Integration Framework Fuzzy Alignments UFOMQ LTO Query Results Triple Store Figure 3.1: The Unified Framework for Ontology Integration 3.1 The Unified Fuzzy Ontology Matching - UFOM UFOM is explicitly designed to discover entities that are relevant to each other across multiple ontologies (in addition to the previously defined relationships of equivalence and subsumption). We present a new type of relation called Relevance that is designed to represent relationships between entities that are not covered by strict definitions of equivalence and subsumption. The system uses fuzzy set theory to represent the inher- ent uncertainty in the discovered correspondences and generates a fuzzy ontology align- ment. Ontology matching systems capable of working on real-world datasets have to use multiple similarity metrics to identify the variety of entities and relations between them. These include syntactic, semantic, and structural features [83]. UFOM computes mul- tiple measures of similarity among ontology entities. This is because different types of 34 semantic relationships require different means of detecting the relation. As the num- ber of such similarity metrics increase, it becomes increasingly complex to aggregate the results of the individual metrics. We describe how UFOM composes these multiple similarity measures in a principled manner using fuzzy set theory for ontology align- ment. 3.2 The Query Algorithm - UFOMQ In the query unit, we present an algorithm called UFOMQ that utilizes the fuzzy corre- spondences discovered by UFOM to efficiently query for all entities in an ontology that are similar to a given entity. The algorithm can identify entities that are related to the given entity directly from a single alignment link (direct matching) or by following multiple alignment links (indi- rect matching). The algorithms are specified using a fuzzy extension of SPARQL (f- SPARQL [13]). The fuzzy SPARQL queries are then converted to crisp queries for execution. We evaluate this approach using both publicly available ontologies provided by the Ontology Alignment Evaluation Initiative (OAEI) campaigns and also ontolo- gies of an enterprise-scale dataset. Our experiments show that it is possible to trade-off precision of the similarity of the identified entities for the running time of the proposed query algorithms. Compared with a baseline approach (traversing all properties in an ontology), our proposed approach reduces execution time by 99%. 3.3 The Integration Optimization - UFOM-ML As we know, UFOM is a fuzzy logic-based system for discovering alignments between entities in multiple ontologies. It is designed to identify alignments corresponding to dif- ferent degrees of relatedness. UFOM combines multiple metrics of similarity between 35 two entities in order to compute the degree of membership in the different types of semantic relationships possible between the entities in an ontology. In addition to the degree of fuzzy membership, UFOM also computes the degree of confidence in the membership calculation that is based on the supporting data in the two ontologies. The degree of confidence is a quantification of the evidence in the available data to support the inference that two entities are related with a particular level of membership. However, the computational complexity of computing the multiple similarity metrics is high and hence this makes UFOM less suitable for integrating enterprise-scale ontolo- gies. 
In order to tackle this problem, we present an upgraded version of UFOM, called UFOM-ML. It consists of two variants UFOM+ and UFOMNN, that incorporate several optimization methods to speed up the computation of the similarity metrics. One of the optimizations is a novel approach to computing semantic similarity with the help of a taxonomy. Specifically, we use the hypernym relationships in WordNet [57] to speed up the computation of semantic similarity. Computation of similarity metrics involv- ing mutual information is sped up by pre-computing a dictionary. UFOM is a highly parametrized system. In order to make similarity computation more efficient without sacrificing accuracy, we developed a machine learning-based parameter optimization process. Two different machine learning approaches are incorporated into UFOM. The parameter optimization in UFOM+ is based on cross-validation using logistic regres- sion. A 4-layer neural network trained with a hold-out subset of the data is used in UFOMNN. Extensive experimental evaluation of the optimized variants, UFOM+ and UFOMNN, indicate that these optimizations decrease the computation time of align- ments by nearly 40% with similar precision and recall as in UFOM. 36 3.4 The Query Optimization - LTO The efficiency of querying for similar individuals can also be improved at the SPARQL query level. In UFOMQ, multiple intermediate SPARQL queries are generated in order to retrieve relevant properties along with the final SPARQL query for retrieving simi- lar individuals. Most of these are complex queries with conditional triple patterns in which the FILTER expressions consist of inequality operators. When the RANGE of a property in an ontology is from numerical values, it is common to have a FILTER expression in the query condition. For example, given a property related to temperature, a user typically queries for a temperature range instead of a single temperature value. The problem of SPARQL query optimization has received a lot of attention in recent years [86, 44], but most of the heuristics and approaches only support simple triple pat- terns in the query. Although a SPARQL query with filter expressions involving equality operators can be easily rewritten to a simple query, the cases with inequality operators have not been studied. In this thesis, we consider selectivity-based Basic Graph Pat- tern (BGP) optimization for SPARQL queries with conditional patterns. In a BGP, each query condition in the SPARQL query corresponds to one triple pattern consisting of three components: the subject, the predicate, and the object. Each component is either bound (e.g.,: x) or unbound (e.g., ?x). The key task of selectivity-based optimization is to compute a join order which can be used to evaluate the query efficiently. A good join order in query evaluation is expected to return the results rapidly without actually executing the query. Intuitively, some simple heuristics can be used to decide the selectivity of triple patterns. For example, the selectivity of a triple pattern can be computed according to the type and number of unbound components (e.g., subjects are more selective than objects and objects more selective than predicates). If we have some knowledge about the data, we can further generate summary statistics for the ontology. For example, 37 we can pre-compute the number of employees with theirbirthplace in each state from the above employee ontology. 
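As an illustration of the kind of query LTO has to optimize, the following minimal Python sketch issues a SPARQL query whose conditional triple pattern carries an inequality FILTER over a numeric range. This is a sketch only: the rdflib library, the data file, and the ex: vocabulary are illustrative assumptions and are not part of UFOMQ or LTO.

from rdflib import Graph

g = Graph()
g.parse("sensors.ttl", format="turtle")  # hypothetical RDF data with numeric property values

query = """
PREFIX ex: <http://example.org/>
SELECT ?sensor ?temp WHERE {
    ?sensor ex:hasTemperature ?temp .
    FILTER (?temp > 20.0 && ?temp < 35.0)
}
"""
# Unlike an equality FILTER, this range condition cannot be folded into the triple
# pattern itself, so the optimizer needs a selectivity estimate for the condition.
for row in g.query(query):
    print(row.sensor, row.temp)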
However, if we have numerical properties likeage, it is infeasible to precompute summary statistics for each age. Histograms have been used to represent the distribution of numerical data. The granularity of a histogram has to be sufficiently fine to yield an accurate esti- mate of selectivity. The computational cost of building and maintaining a histogram is directly proportional to this granularity, a problem further exacerbated with conditional patterns. However, the statistics of results from past queries can be extracted from the query log in many applications. It is possible to use this information to learn a model that relates query patterns to selectivity, which in turn can be used for query optimiza- tion. These two approaches (maintaining histograms and predicting from prior query results) are complementary. Histogram-based estimation can be used in the beginning when no history is available. As query history is accumulated and used for training, histogram-based estimation is gradually traded for machine learning-based prediction. In this thesis, we use histograms to store the statistics of properties with continuous numerical values in the query optimization phase. We then describe how a neural net- work can be used to estimate the selectivity for query optimization. 38 Chapter 4 Unified Fuzzy Ontology Matching 4.1 UFOM Framework Our framework for computing fuzzy ontology alignments is based on computing both the similarity score and confidence score for every possible correspondence in the ontologies. In order to provide an extensible framework, every relation score (i.e., for every type of relation of interest) is computed from a set of pre-defined similarity func- tions. The framework is illustrated in the design of the UFOM system (Figure 4.1). Figure 4.1: Components of the UFOM system for computing fuzzy ontology alignment UFOM takes as input two ontologies and outputs a fuzzy alignment between them. UFOM consists of four components: Preprocessing Unit (PU), Confidence Generator (CG), Similarity Generator (SG), and Alignment Generator (AG). PU identifies the type of each entity in the ontology and classifies the entities based on their types. Different computation strategies are adopted for matching the entities with the most appropri- ate type. CG quantifies the quality of the resources used to generate a potential match 39 between two entities. V olume and variety are the two major factors considered in this step. SG computes multiple types of similarities for every pair of entities and gener- ates a vector of similarity scores. These similarity scores are the component functions for computing any particular type of relation. Finally, AG calculates the relation score using the fuzzy membership functions for each relation type and constructs the corre- spondence based on both this relation score and confidence score. We next describe these components in greater detail. We use uppercase letters to denote an entity (T for a Class andE for a Property). We use lowercase letters to denote instances of an entity (t for class instances ande for property values). 4.1.1 Preprocessing Unit The Preprocessing Unit is a constraint-based classifier. It classifies the enti- ties based on their types. We assume that the input ontologies are RDF or OWL files. We define three subtypes of the DatatypeProperty: String, Datetime, and Numerical. 
Specifically, an entity is classified as one of the following types: ObjectProperty, String DatatypeProperty, Datetime DatatypeProperty, and Numerical DatatypeProperty. For a DatatypeProperty, rdfs:range is used to decide which specific DatatypeProperty it belongs to. If the range is xsd:decimal, xsd:float, xsd:double, or xsd:integer, it is a Numerical DatatypeProperty. If the range is xsd:dateTime, xsd:time, or xsd:date, it is a Datetime DatatypeProperty. If the range is none of the above, it is a String DatatypeProperty. Based on this entity classification, different matching algorithms are applied during similarity generation (Section 4.1.3).

4.1.2 Confidence Generator

The Confidence Generator computes a confidence score for each correspondence which reflects the sufficiency of the underlying data resources used to generate the correspondence. For a correspondence between properties, their instances are the main resources. Intuitively, the more instances that are used for computing similarity, the more confident we can be in the matching process. In order to quantify the sufficiency of the properties, we identify two metrics – Volume and Variety. The volume of resources related to property E of class T is defined as

VO(E) = |{t ∈ T | t.E ≠ ∅}| / |{t ∈ T}|   (4.1)

where |{t ∈ T}| is the number of instances of class T and |{t ∈ T | t.E ≠ ∅}| is the number of instances of class T with a non-null value for property E. The variety of resources related to properties E1 and E2, VA(E1, E2), quantifies the variety of the property instances using the concept of entropy. It is defined as

VA(E1, E2) = (H(E1) + H(E2)) / log(|v(E1)| + |v(E2)|)   (4.2)

where v(E) is the set of non-null values of entity E (i.e., v(Ei) = {e | e ≠ ∅, e ∈ Ei}) and H(Ei) is the entropy of the values of property Ei:

H(Ei) = − Σ_{x ∈ v(Ei)} p_i(x) log(p_i(x))   (4.3)

Here p_i(x) is the probability of property value x appearing as an instance of Ei:

p_i(x) = |{t | t.Ei = x}| / |{t | t.Ei ≠ ∅}|   (4.4)

Intuitively, if the properties E1 and E2 have a large number of unique values relative to the number of class instances having those properties, then they have a large variety score. Based on the volume and variety of the two properties, the confidence score of their correspondence is calculated as

c(E1, E2) = (VO(E1) + VO(E2) + VA(E1, E2)) / 4   (4.5)

4.1.3 Similarity Generator

Figure 4.2: Components of the UFOM Similarity Generator

The Similarity Generator generates a vector of similarities between two entities. The similarities form the basis for computing the different types of relation correspondences (using their respective fuzzy membership functions). Each component of the vector represents a specific type of similarity. In UFOM, the vector consists of four values: Name-based Similarity, Mutual Information Similarity, Containment Similarity, and Structural Similarity (Figure 4.2). Next, we describe these four types of similarity.

4.1.3.1 Name-based Similarity

The name-based similarity is calculated from both the semantic similarity s_se and the syntactic similarity s_sy between the names of the two properties. Intuitively, the name denoting a property typically captures the most distinctive characteristic of its instances. For semantic similarity, the following steps are used to generate s_se:

1. Tokenize the names of both properties. Denote by Ei.TOKj the j-th token in the name of property Ei.

2. Retrieve the synset of each token using Open Linked Data (WordNet, http://wordnet.princeton.edu/). Denote by syn(w) the WordNet synset of a word w.
3. Calculate the Jaccard similarity (http://en.wikipedia.org/wiki/Jaccard_index) on the synsets of each pair of tokens.

4. Return the average-max Jaccard similarity as the semantic similarity:

s_se(E1, E2) = (1/n) Σ_i max_j Jac(syn(E1.TOKi), syn(E2.TOKj))

Here, Jac() denotes the Jaccard similarity between two sets and n is the number of tokens in E1's name. For example, consider two properties named complete date and EndDate. In the first step, the two names are tokenized into the sets {"complete", "date"} and {"End", "Date"}. In the second step, the synsets of "complete", "date", and "end" are retrieved from WordNet. In the third step, we calculate the Jaccard similarity for each pair of tokens ("complete"-"end", "complete"-"date", "date"-"end", and "date"-"date"). Finally, we find the maximum Jaccard similarity for each token in the first property name and average these maxima over all tokens in the first property name. In this example, s_se = (1/2)(Jac(syn("complete"), syn("end")) + Jac(syn("date"), syn("date"))).

For syntactic similarity, the Levenshtein distance [52] is adopted as the distance metric between the names of the two properties. Formally, the syntactic similarity s_sy(E1, E2) between two properties E1 and E2 is

s_sy(E1, E2) = 1 − Lev(E1.Name, E2.Name) / max(|E1.Name|, |E2.Name|)   (4.6)

where Ei.Name denotes the name of property Ei, |Ei.Name| is the length of the name string, and Lev(w1, w2) is the Levenshtein distance between two words w1 and w2. In the above example, the syntactic similarity between complete date and EndDate is 1 − 8/13 = 0.3846.

The name-based similarity is a weighted sum of the above two similarities:

s_na(E1, E2) = ω_se · s_se(E1, E2) + ω_sy · s_sy(E1, E2)   (4.7)

The weights are pre-defined as system parameters in UFOM. In future work, we intend to use machine learning techniques to learn the weights from labeled datasets.

4.1.3.2 Mutual Information Similarity

The Mutual Information Similarity aims to model the mutual information that exists between the individuals of one entity and the domain represented by the second entity. Intuitively, if two properties share a high proportion of instances, this indicates that the properties are highly related. Specifically, the Mutual Information Similarity increases when all the components describing an individual of the first entity are also found in the universe of components defined by all the instances of the second entity. In UFOM, we model this by computing the probability that all words from an individual of the first entity appear in the set of words formed by the union of the individuals of the second entity.

Let W2 denote the set of words in the instances of E2, with |W2| the total number of such words. Let n(w, E2) denote the number of appearances of the word w in the instances of E2. Then, the conditional probability of a word w appearing in some instance of E2 can be estimated by

p(w | E2) = n(w, E2) / |W2|   (4.8)

The joint probability of all words in a set S also appearing in the word set of E2, assuming independence of these word occurrences in E1, is given by

p(S ⊆ W2) = Π_{w ∈ S} p(w | E2)   (4.9)

Let w ∈ e1 denote the words comprising the instance e1.
Then, the Mutual Information Similarity s_mi(E1, E2) between two properties E1 and E2 is defined as

s_mi(E1, E2) = (1/|v(E1)|) Σ_{e1 ∈ E1} p(e1 ⊆ W2)   (4.10)
            = (1/|v(E1)|) Σ_{e1 ∈ E1} Π_{w ∈ e1} n(w, E2) / |W2|   (4.11)

4.1.3.3 Containment Similarity

The Containment Similarity models the average level of alignment between an instance of property E1 and the most similar instance of property E2. Containment similarity is designed to detect pairs of properties that share a large number of common instances even if the instances are misaligned. Specifically, it is calculated as

s_co(E1, E2) = (1/|v(E1)|) Σ_{e1 ∈ E1} max_{e2 ∈ E2} SWS(e1, e2)   (4.12)

SWS(e1, e2) is the Smith-Waterman Similarity [84] between two instances e1 and e2 treated as sequences of characters. The Smith-Waterman Similarity identifies local sequence alignments and finds the most similar instance of E2 given an instance of E1. For example, if e1 is "Stephen W. Hawking" and e2 is "A Brief History of Time by Stephen Hawking", then SWS(e1, e2) = 15/18 = 0.8333. Note that Containment Similarity is asymmetric. Intuitively, the greater the number of instances in E2 compared to the number of instances in E1, the more likely it is that E1 is contained within E2.

4.1.3.4 Structural Similarity

The fourth value in the vector of similarities is designed to capture the structural similarity between two properties as they are represented within their ontologies. We represent an ontology as a graph with properties as edges and classes as nodes. Intuitively, if two properties have similar domains and ranges (classes), then they are assigned a high similarity. In turn, two classes that have similar properties should have a higher similarity. The following iterative equations are used to refine the edge (property) similarity. At every iteration t, the node similarity s_N^t(·) is updated based on the edge similarity of iteration t, s_E^t(·):

s_N^t(n_i, n_j) = (1/(2N)) Σ_{s(k)=i, s(l)=j} s_E^t(e_k, e_l) + (1/(2M)) Σ_{t(k)=i, t(l)=j} s_E^t(e_k, e_l)   (4.13)

Then, the edge similarity of the next iteration t+1, s_E^{t+1}(·), is calculated from the node similarity of iteration t, s_N^t(·):

s_E^{t+1}(e_i, e_j) = (1/2) (s_N^t(n_{s(i)}, n_{s(j)}) + s_N^t(n_{t(i)}, n_{t(j)}))   (4.14)

In these equations, s(k) = i and t(k) = j denote that edge e_k is directed from node n_i to node n_j, i.e., property E_k has domain class T_i and range class T_j. N is the number of properties with domain T_i multiplied by the number of properties with domain T_j. Similarly, M is the number of properties with range T_i multiplied by the number of properties with range T_j. The initial edge similarities are set from a weighted sum of the previously defined three similarities:

s_E^0(e_i, e_j) = ω_na σ(s_na(E_i, E_j)) + ω_mi σ(s_mi(E_i, E_j)) + ω_co σ(s_co(E_i, E_j))   (4.15)

where ω_na, ω_mi, and ω_co are pre-defined weights in UFOM and

σ(x) = 1 / (1 + e^{−k(x)})

A non-linear addition (the σ function) is used to better distinguish between similarity values that are close to the median than a linear weighted sum would. In this thesis, we set k = 5. As in the case of the name-based similarity, we will consider using machine learning techniques to learn these weights automatically from labeled ontology matching data. The algorithm halts when |s_E^{t+1}(e_i, e_j) − s_E^t(e_i, e_j)| < ε, where ε is a pre-defined parameter.
The structural similarity of two properties is then set as

s_st(E_i, E_j) = s_E^{t+1}(e_i, e_j)   (4.16)

Finally, the Similarity Generator outputs a vector of similarities for each pair of properties. The vector for the property pair (E_i, E_j) is represented as s⃗(E_i, E_j) = (s_na(E_i, E_j), s_mi(E_i, E_j), s_co(E_i, E_j), s_st(E_i, E_j)).

4.1.4 Alignment Generator

The output of the Alignment Generator is a set of fuzzy correspondences in the form of 6-tuples <id, E_1, E_2, r, s, c>. The confidence score c has already been calculated by CG. In order to calculate the relation score s, a set of membership functions is pre-defined in UFOM. Each membership function corresponds to one type of relation. For example, the membership function for equivalence can be defined as a linear combination of all similarity values (μ_equ(s⃗) = (1/4)(s_na + s_mi + s_co + s_st)), the membership function for subsumption can be approximated by a combination of two similarity values (μ_sub(s⃗) = (2/3) s_mi + (1/3) s_st), and the membership function for relevance can be written as a non-linear S-function (μ_rel(s⃗) = S(s_co; 0.2, 0.5, 0.7)). The relation score s is then computed by the corresponding membership function. Once both s and c are derived, AG prunes the correspondences whose relation score s or confidence score c falls below its pre-defined cutoff threshold. Different applications will use different thresholds; for example, a recommendation system may use relatively low thresholds since false positives are tolerated, while a scientific application may require high thresholds. Below are some examples of fuzzy correspondences:

<1, isbn:author, bkst:desc, relevance, 0.73, 0.92>
<2, isbn:author, bkst:desc, equivalence, 0.58, 0.92>
<3, isbn:author, bkst:pub, disjoint, 0.83, 0.75>

4.1.5 Query Execution

In this section, we briefly discuss how the fuzzy alignment helps during query execution over the matched ontologies. Given two ontologies O_1 and O_2, the query execution problem is as follows: given an individual t in O_1, return all individuals of O_2 which are relevant to t. If there is no alignment, the only way to solve this is to compare each property value of t with all property instances in O_2. The time complexity of this approach is O(|O_2| · |t| · |E_2|), where |O_2| is the number of individuals in O_2, |t| is the number of properties that t has, and |E_2| is the number of properties in O_2.

With the help of the fuzzy alignment, the computational burden can be reduced. One strategy is to look up the fuzzy correspondences whose first entity is one of t's properties and whose relation r is relevance. Each property value of t is then compared only with the instances of the properties appearing as E_2 in those correspondences. Specifically, the Smith-Waterman approximate substring matching algorithm can be used for the comparison. The time complexity is O(|O_2| · |t| · |E_2(t)|), where |E_2(t)| is the average number of properties in O_2 which have correspondences with a property of t. Another strategy is to use multi-layer references. The idea is to identify intermediate classes with properties having correspondences with both t's properties and the target class's properties. The previous strategy is then applied twice to find the answer instances of the target class.

4.2 Experimental Evaluation

In this section, we present the evaluation of UFOM on various datasets. We mainly consider two types of relations – Equivalence (Section 4.2.1) and Relevance (Section 4.2.2).
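Before turning to the two relation types, the following minimal Python sketch illustrates the Alignment Generator step described in Section 4.1.4: a similarity vector and a confidence score are mapped to relation scores via membership functions and then pruned by cutoff thresholds. The membership functions mirror the examples above; the threshold values, the generalized S-function form, and all helper names are assumptions made for illustration only.

def s_function(x, a, b, c):
    # Generalized fuzzy S-function, continuous for any a < b < c, rising from 0 to 1.
    if x <= a:
        return 0.0
    if x >= c:
        return 1.0
    if x <= b:
        return (x - a) ** 2 / ((b - a) * (c - a))
    return 1.0 - (x - c) ** 2 / ((c - b) * (c - a))

def relation_scores(sim):
    # sim = (s_na, s_mi, s_co, s_st), the vector produced by the Similarity Generator.
    s_na, s_mi, s_co, s_st = sim
    return {
        "equivalence": (s_na + s_mi + s_co + s_st) / 4.0,
        "subsumption": (2.0 / 3.0) * s_mi + (1.0 / 3.0) * s_st,
        "relevance":   s_function(s_co, 0.2, 0.5, 0.7),
    }

def generate_correspondences(pairs, score_cutoff=0.5, conf_cutoff=0.7):
    # pairs: iterable of (id, E1, E2, similarity_vector, confidence) tuples.
    correspondences = []
    for cid, e1, e2, sim, conf in pairs:
        for rel, score in relation_scores(sim).items():
            # Prune correspondences whose relation score or confidence is below its cutoff.
            if score >= score_cutoff and conf >= conf_cutoff:
                correspondences.append((cid, e1, e2, rel, round(score, 2), conf))
    return correspondences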
4.2.1 Equivalence

4.2.1.1 Dataset

We performed a set of experiments using two sets of ontologies provided by the Ontology Alignment Evaluation Initiative (OAEI) 2013 Campaign. The first dataset is the Conference Ontology (http://oaei.ontologymatching.org/2013/conference/index.html), which contains 16 ontologies. In our experiments, we aim to find the alignment between one specific ontology, Cmt, and multiple ontologies including confOf, sigkdd, conference, edas, ekaw, and iasted. In total, Cmt has 59 properties and the target ontology set has 188 properties. The second set is the Instance Matching (IM) Ontology (http://www.instancematching.org/oaei/). It has 5 ontologies and 1744 instances. The weights for the name-based similarity and the structural similarity are set to 1/2 and 1/3. The membership function is constructed as

μ_equ(s⃗) = (s_na + s_mi + s_co + s_st) / 4   (4.17)

For the Conference Ontology, no instances are provided. As a result, s_mi = s_co = 0 and c = 0. For the Instance Matching Ontology, the confidence threshold is set to 0.6.

4.2.1.2 Results

For the Conference Ontology, the relation score threshold is set to 0.25, 0.3, 0.35, and 0.4 (the upper bound of the score is 0.5 since s_mi = s_co = 0). We compute precision, recall, and F-measure to evaluate the performance of UFOM. As the relation score threshold increases, precision increases while recall and F-measure decrease (Figure 4.3). Compared with the baseline matcher (StringEquiv) in OAEI 2013, at the same recall (0.43), precision for UFOM is 0.92 while the baseline achieves 0.8.

Figure 4.3: Precision, recall, and F-measure on applying UFOM to the Conference Ontology matching problem

Similarly, we evaluate UFOM on the Instance Matching Ontology. Since this ontology has instances, the upper bound of the score is 1.0. Thus, we tune the relation score threshold from 0.5 to 0.8 in steps of 0.1. A similar trend is shown in Figure 4.4. The difference is that precision, recall, and F-measure are slightly higher than the results on the Conference Ontology at most data points. The reason is that having instances available when generating the correspondences improves the quality of matching.

Figure 4.4: Precision, recall, and F-measure on applying UFOM to the Instance Matching Ontology

4.2.2 Relevance

4.2.2.1 Experimental Setting

For the evaluation of computing relevance alignments, we used two ontologies O1 and O2 from an enterprise-scale information repository. Each ontology focuses on a different application area within the enterprise, but they are related in terms of the entities that they reference. Ontology O1 has 125,865 triples. Ontology O2 has 651,860 triples. Due to privacy concerns, we do not expose the real names of the properties and ontologies. The task is to find all relevant properties in O2 given a property E in O1. The weights for the name-based similarity and the structural similarity are set to 1/2 and 1/3. The relevance membership function is constructed as
We manually labeled each discovered correspondence as being “Highly relevant”, “Relevant”, or “Irrelevant” based on our understanding of the underlying entities. As can be seen from Table 4.1, the relevance scores align well with the manually assigned labels. When the relevance score threshold is set as 0.2, precision and recall reach 1.0. We also evaluated query execution using the same experimental setup. We selected 10 representative individuals fromO 1 and retrieve their relevant instances fromO 2 using the fuzzy alignment. Both precision and recall achieve 1.0 after we verified the results with the ground truth obtained by manually examining the ontologies for each of the automatically retrieved entities. 53 Chapter 5 Querying for Individuals in Heterogeneous Ontologies 5.1 UFOMQ: Querying using Alignments We now describe the UFOMQ algorithm to efficiently execute queries over two hetero- geneous ontologies using pre-computed fuzzy ontology alignments. In order to take advantage of the fuzzy representation of the alignments, the query process consists of two phases: generating a fuzzy SPARQL query and then converting it to a crisp SPARQL query for execution. 5.1.1 The Algorithm We adopt a specific fuzzy extension of SPARQL called f-SPARQL [13]. An example of f-SPARQL query is given below. #top-k FQ# with 20 SELECT ?X ?Age ?Height WHEREf ?X rdf:type Student ?X ex:hasAge ?Age with 0.3. FILTER (?Age=not very young && ?Age=not very old) with 0.9. ?X ex:hasHeight ?Height with 0.7. 54 FILTER (?Height close to 175cm) with 0.8. g In this example, “not very young” and “not very old” are fuzzy terms, and “close to” is a fuzzy operator. Each condition is associated with a user-defined weight (e.g., 0.3 for age and 0.7 for height) and a threshold (e.g., 0.9 for age and 0.8 for height). The top 20 results are returned based on the score function [13]. Algorithm 1 UFOMQ - A query algorithm for UFOM 1: Identify a set of properties E direct = fe 1 ;e 2 ;:::;e m g in O 2 where e j is in a fuzzy correspondence < id;e t ;e j ;r;s;c > with s S, s C and r 2 fequivalence;relevanceg,S andC are user-defined thresholds ande t ist’s identi- fier property 2: Identify a set of property triplesE indirect =ffe 1 1 ;e 2 1 ;e 3 1 g;:::;fe 1 n ;e 2 n ;e 3 n gg fromO 2 wheree 1 j is in a fuzzy correspondence<id;e t ;e 1 j ;r;s;c> withsS,sC and r2fequivalence;relevanceg, ande 1 j ande 2 j are the properties of the same class (intermediate class) wheree 2 j is its identifier, ande 3 j is the target property equivalent toe 2 j 3: for eache j inE direct do 4: Generate a fuzzy SPARQL using the direct matching generation rule 5: Calculate a seed vector ! s =fs syn ;s sem ;s con g for each pair (t,v x ) wheret is the given individual andv x is a value ine j 6: Generate individuals with grades calculated by a relevance function of ! s in the instance ontologyonto ins 7: end for 8: for eachfe 1 j ;e 2 j ;e 3 j g inE indirect do 9: Generate a fuzzy SPARQL using the indirect matching generation rule Calcu- late a seed vector ! s =fs syn ;s sem ;s con g for each pair (t,v x ) wheret is the given individual andv x is a value ine 1 j 10: Generate individuals with grades calculated by a relevance function of ! 
11: end for
12: Generate a crisp SPARQL query by computing the α-cut of the fuzzy terms based on the membership function, where each graph pattern corresponds to a value in E_direct or E_indirect
13: return the individual set I = {i_1, i_2, ..., i_n} obtained by executing the crisp SPARQL query over onto_ins

The UFOMQ algorithm is shown in Algorithm 1. The inputs to the algorithm are two ontologies O_1 and O_2, a set of fuzzy correspondences pre-computed using UFOM, and the target individual t ∈ O_1. The algorithm returns a set of individuals from O_2 that are similar to t. In our description, we assume that the correspondences are of type equivalence or relevance; however, the approach can be extended to other types of relations provided the corresponding alignments are discovered by UFOM.

Steps 1 and 2 in Algorithm 1 identify related properties using fuzzy correspondences generated by UFOM. These properties are computed using two methods: direct matching and indirect matching (Figure 5.1).

Figure 5.1: Illustration of direct matching and indirect matching

For direct matching, we retrieve properties in O_2 which have fuzzy relations (equivalence and relevance) with t's identifier property (e.g., ID) using the fuzzy alignment derived by UFOM. For example, the following SPARQL query retrieves only the relevant properties of id based on direct matching. Thresholds for the relation score and confidence score (e.g., 0.5 and 0.7) are also specified in the query.

SELECT ?prop WHERE {
?prop ufom:type "onto2 prop" .
?corr ufom:hasProp1 onto1:id .
?corr ufom:hasProp2 ?prop .
?corr ufom:relation "Relevance" .
?corr ufom:score ?s . FILTER (?s > 0.5) .
?corr ufom:conf ?c . FILTER (?c > 0.7) .
}

Indirect matching is used to identify entities that do not share a single correspondence with t but are related via intermediate properties, i.e., via more than one correspondence. We first identify all intermediate classes in O_2: the properties of such classes have a fuzzy relation with t's identifier property (e.g., id). From these intermediate classes, we discover the properties which are equivalent to the identifier of the intermediate class. This equivalence relation is found by checking ObjectProperties in O_2. In contrast to direct matching, which outputs a set of properties, indirect matching produces a collection of triples of the form (e_1, e_2, e_3), where e_1 is the intermediate class's property with a fuzzy relation to t's identifier property, e_2 is the intermediate class's identifier property, and e_3 is the target property equivalent to e_2.
FILTER (?p relevant to t) with 0.75. g In the above rules,t is the given individual (the identifier used to represent the indi- vidual) and “relevant-to” is the fuzzy operator. Since, eventually the fuzzy queries will have to be converted to crisp ones, we calculate a seed vector ! s =fs syn ;s sem ;s con g for each value pair (t, v x ) where t is the given value (e.g., identifier of the given individual) andv x is the value in the matched properties (e.g., “onto2:prop”) (Steps 5 and 9). ! s represents multiple similarity metrics including syntactic, semantic, and containment similarities as described in [97]. The results are used to calculate the relevance scores which are stored as individuals in the instance ontology onto ins (Steps 6 and 10). In Step 11, we compute the -cut of the fuzzy terms based on the membership function in order to convert to a crisp query. The resulting crisp SPARQL consists of multiple graph patterns and each of these corresponds to a matched property. The individual set I is derived by executing this crisp SPARQL query. An example of such a crisp SPARQL query returning individuals 59 ranked based on their membership grades is shown below. SELECT ?d WHERE f ?x onto2:id ?d. ?x onto2:prop ?p. ?ins onto ins:value1 ?p. ?ins onto ins:value2 t. ?ins onto ins:type "relevance". ?ins onto ins:grade ?g. FILTER (?g 0.75). g ORDER BY DESC(?g) Using UFOMQ, the computation cost for retrieving relevant instances can be reduced. The time complexity of the UFOMQ algorithm isO(jO 2 jjE 2 (t)j) wherejE 2 (t)j is the number of properties inO 2 which have fuzzy correspondences witht’s identifier property. We evaluate the computation cost of UFOMQ on datasets in Section 5.1.2. 5.1.2 Experimental Evaluation In this section, we present the results of applying the UFOMQ approach to two datasets. The first dataset includes the publicly available ontologies from the Ontology Align- ment Evaluation Initiative (OAEI) campaigns [28]. The second dataset comprises of ontologies of an enterprise-scale dataset. 60 5.1.2.1 OAEI Datasets We first performed a set of experiments to evaluate the query execution process. The dataset is the Instance Matching (IM) ontology 1 from OAEI 2013. The dataset has 5 ontologies and 1744 instances. The fuzzy alignment is generated first using UFOM [97]. Then, we initialize 10 individuals from one of the ontologies and retrieve related indi- viduals from the other ontologies. WordNet 2 and DBPedia 3 are used to retrieve the synset and similar entities of a given individual. The membership grade threshold is set to 0.75. Figure 5.2 shows the performance of our query execution component on the IM ontology. Each data point is generated by averaging the results of 10 individuals. 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 0 200 400 600 800 1000 1200 1400 1600 1800 0.5 0.6 0.7 0.8 Time (ms) Relation Score Threshold Execution Time Precision Figure 5.2: Precision and execution time on applying UFOM query execution to the Instance Matching Ontology 1 http://islab.di.unimi.it/im oaei 2014/index.html 2 http://wordnet.princeton.edu/ 3 http://dbpedia.org/ 61 As the relation score threshold increases, both precision and running time for query execution decrease. This is because the number of correspondences decreases when we raise the relation score threshold. As a result, we have fewer correspondences to con- sider when we generate crisp queries and therefore the computational time is reduced. 
The reason precision also decreases is that we lose some true correspondences when we increase the relation score threshold. Those correspondences can help in finding more related individuals. However, as shown in Figure 5.2, an increase in the threshold from 0.5 to 0.8 causes precision to decrease by only 9.3% while execution time is reduced by 81.1%. This indicates that the elimination of correspondences caused by increasing the threshold does not significantly affect retrieving correct related individuals. This is because the remaining correspondences are still sufficient to find connections between classes. In terms of querying for similar individuals, the correspondences between the same pair of classes have functional overlap. 5.1.2.2 Enterprise-scale Dataset For the evaluation of querying for related individuals, we considered two ontologies from an enterprise-scale information repository. Each ontology focuses on a different application area, but they are related in terms of the entities that they reference. Ontol- ogyO 1 has 125,865 triples. OntologyO 2 has 651,860 triples. Due to privacy concerns, we do not expose the real names of the properties and ontologies. We considered two classes,C 1 andC 2 , inO 1 and two classes,C 3 anC 4 , inO 2 . Table 5.1: Query Execution Time (UFOM vs Baseline) Scenario UFOM(ms) Baseline(ms) C 1 toC 3 259 35974 C 2 toC 3 173 25706 C 1 toC 4 487 53937 C 2 toC 4 401 45752 62 We identified29 fuzzy correspondences between these two ontologies using UFOM. To evaluate query performance, we selected 10 representative individuals fromC 1 orC 2 and retrieved their relevant instances fromC 3 orC 4 using the fuzzy alignment. Both pre- cision and recall achieved 1.0 after we verified the results with the ground truth obtained by manually examining the ontologies for each of the automatically retrieved entities. We also generated the average execution time and the results are shown in Table 5.1. Compared with the baseline approach which traverses the values of all properties inO 2 , our proposed approach reduces the execution by 99% on average. 5.2 Computational Cost of Querying for Related Enti- ties We now describe how the relationship links in integrated ontologies can be used for executing inter-ontology queries. We consider queries involving a source and target table. The user specifies a value in the source table. The output is all instances in the target table that are related to instances with this value in the source table. The matching process consists of two methods — direct matching and indirect matching. 5.2.1 Direct Matching Direct matching finds instances among the source and target tables that are relevant based on a single relationship link discovered during the previously described relation- ship discovery procedure. The steps are described in Algorithm 2. First, fields with low information value (smaller than threshold) are pruned. All instances in the source table with the user-specified value are retrieved. We then consider the mapping of fields from source table to target table. We keep only those fields with relevance score and confidence score above a pre-defined threshold r and c , respectively. For each source 63 Figure 5.3: Example of Direct Matching table! target table field mapping pair, all values of the obtained instances are compared to that of the instances in the target table, and the matched instances form the output. There can be many definitions of what is considered a pair of matching instances. 
In our case, two instances match if one contains the other. Other possible definitions are exact matches, or similarity measures such as edit distance. For example, in Figure 5.3, suppose the user wants to retrieve records related to instances with FieldA = "A1". Then the third record, with "A2", is pruned. The arrows indicate that FieldA is relevant to Field3 and FieldB is relevant to Field2. Therefore, the values in these two pairs of fields are compared. The resulting output is then the records "19452" and "23775".

Algorithm 2 directMatch(target field f, target value v, source table S, target table T, information-value threshold, relevance threshold, confidence threshold)
1: Prune all fields in S with information value smaller than the information-value threshold;
2: P ← query for all (S → T) field pairs with relevance and confidence above their respective thresholds;
3: E ← query for all entries in S with value v in field f;
4: for each (S → T) pair (field1, field2) in P do
5:   for each entry e ∈ E do
6:     M ← query for all values in field2 that match the value of e.field1;
7:     results ← results ∪ M;
8:   end for
9: end for
10: return results

5.2.2 Indirect Matching

Direct matching only finds instances in the target table that have matching values in one of the fields it shares with the source table. However, not all information regarding a certain entity is stored in one table; it is possible that the fields describing an entity are categorized and stored in separate tables. Thus, some matching instances may be missed by direct matching if the critical matching field is located in a separate table. The indirect matching algorithm finds instances in other tables that are within the same ontology as the target table and that match with instances in the source table. We denote these tables, which exist in the same dataset as the target table, as indirect tables. The algorithm then searches for equivalent instances of the matched entities in the target table. The steps are described in Algorithm 3. The initial steps are similar to direct matching. For each indirect table, we perform direct matching to match instances from the source table to the indirect table. Then, we consider the mapping of fields from the indirect tables to the target table. This step considers fields whose equivalence score is higher than the equivalence threshold. Every equivalent property pair that is identified, and every instance in an indirect table that is relevant to the specified value, is then compared to instances in the target table to identify equivalent ones. All matched equivalent instances are then output.

Figure 5.4: Example of indirect matching

In the example shown in Figure 5.4, the source table and target table have no relevant field pairs. However, a relevant record can still be found which is related to the source table via information in an indirect table. Here, FieldC is relevant to FieldY and record "19452" is identified as relevant to a specified instance in the source table. This primary key is stored. FieldX and Field1 are equivalent. Since "19452" is in the target table, it is output as the final result.

Algorithm 3 indirectMatch(target field f, target value v, source table S, target table T, information-value threshold, relevance threshold, confidence threshold, equivalence threshold)
1: Prune all fields in S with information value smaller than the information-value threshold;
2: E ← query for all entries in S with value v in field f;
3: for each indirect table T_i do
4:   P_i ← query for all (S → T_i) field pairs with relevance and confidence above their respective thresholds;
5:   for each (S → T_i) pair (field1, field2) in P_i do
6:     for each entry e ∈ E do
7:       K' ← query for the primary keys of all entries e_i in T_i such that e_i.field2 = e.field1;
8:       K ← K ∪ K';
9:     end for
10:  end for
11:  Q_i ← query for all (T_i → T) field pairs with equivalence above the equivalence threshold;
12:  for each value k ∈ K do
13:    for each (T_i → T) pair (field1, field2) in Q_i do
14:      M ← query for all entries e_i in T_i such that (e_i.primaryKey = k) ∧ (e_i.field2 = e.field1);
15:      results ← results ∪ M;
16:    end for
17:  end for
18: end for
19: return results

5.2.3 Complexity Analysis

We first consider the computational complexity of direct matching. In this analysis, we assume that the cost of comparing a single pair of entities is constant. Let the target table have |T| records, let |P| be the size of the set of relevant field pairs P, and let n be the number of records in the source table that have the specified value in the given field. Then, the computational complexity is given by O(|P| · n · |T|).

We next consider the computational cost of indirect matching. Let I be the set of indirect tables and m the number of indirect tables. Each indirect table I_i has |I_i| records. Let |P_i| be the size of the set of relevant field pairs P_i between the source table and I_i. Suppose K_i is the final set of primary keys of relevant records identified in T_i, with size |K_i|. Let |Q_i| be the size of the set of equivalent field pairs Q_i between the target table and I_i. The computational complexity of indirect matching is then given by

O( Σ_{i=1}^{m} ( |I_i| · |P_i| · n + |Q_i| · |K_i| · |T| ) )

The organization of data in the indirect tables affects the computational cost of indirect matching. We study the effect of the number and size of the indirect tables on the computational cost in one special case: if there is only one indirect table and this table is normalized and split into m tables, how does the computational cost of indirect matching change? To simplify the analysis, we assume that after the split the field pairs are evenly distributed over the split tables. We also assume that |K_i| is a fraction f of |I_i|, as the more records there are, the more likely a match is found. We ignore the impact of duplicated fields in the normalization process. The computational cost then becomes

m · ( |I_i| · (|P_i|/m) · n + (|Q_i|/m) · f · (|I_i|/m) · |T| ) = |I_i| · |P_i| · n + |Q_i| · f · (|I_i|/m) · |T|

for any i, since the table sizes are the same.

Table 5.2: Parameters Used in the Experiments
Parameter | Value
n | 50
|T| | 50
Σ_i |P_i| | 100
Σ_i |Q_i| | 100
|I_i| | 1000

Figure 5.5 shows the change in cost as m (the number of indirect tables) increases, and the impact of f, the fraction of matched pairs. The other parameters are set as shown in Table 5.2. We can see that the cost decreases as m increases. As the table is split up, fewer comparisons are made for each record in K_i since there are fewer fields. As a result, we may miss some records that should have matched with records in K_i. There is thus a trade-off between the speed and the accuracy of the query.

Figure 5.5: Computational cost plotted against m

5.2.4 Experimental Evaluation

5.2.4.1 Implementation strategies

In this section, we evaluate the impact of specific implementation strategies for the matching algorithms on query execution time. We consider two different approaches to implementing the direct matching algorithm (Algorithm 2). The first approach is to implement the direct matching algorithm of Section 5.2 in a procedural language; a condensed sketch of this style is given below.
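The thesis prototype of this nested-loop style was written in C#; the Python sketch below is used here only to condense its control flow, and the data layout (tables as lists of dictionaries) and helper names are assumptions for illustration.

def direct_match_nested_loop(source_value, source_field, source_rows, target_rows,
                             field_pairs, rel_threshold, conf_threshold):
    # field_pairs: [(source_field, target_field, relevance, confidence), ...]
    relevant = [(f1, f2) for f1, f2, r, c in field_pairs
                if r >= rel_threshold and c >= conf_threshold]
    # Source entries carrying the user-specified value.
    entries = [row for row in source_rows if row.get(source_field) == source_value]
    results = []
    for f1, f2 in relevant:              # one pass per relevant field pair...
        for entry in entries:            # ...and per matching source entry
            needle = str(entry.get(f1, ""))
            if not needle:
                continue
            for row in target_rows:
                hay = str(row.get(f2, ""))
                # "Match" here means containment, as in Section 5.2.1.
                if needle in hay or hay in needle:
                    results.append(row)
    return results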
We denote this the nested-loop implementation due to the multiple for-loops required to perform queries for every pair of fields, and every pair of instances. 69 SPARQL SELECT ?p ?e ?c ?RMISprop where f graph<Relationships> f ?analysis prop:#confidence ?SourceConf; prop:#name ?SourceProp. FILTER(?SourceConf > 0.5)g graph<OntologyMatchingResults> f ?m mapping:#property2 ?SourceProp; mapping#property1 ?TargetProp; mapping:#hasProbability ?prob; mapping:#hasConfidence ?conf . FILTER (?prob > 0.5 && ?conf > 0.8)g graph <sourceTable>f ?a <field> "value"; ?SourceProp ?c. g graph<targetTable> f ?d ?TargetProp ?e; <PrimaryKeys"> ?p. FILTER regex(?e, ?c, "i")g g; Figure 5.6: Graph pattern implementation of Direct Matching The second approach is to implement the algorithm with a single SPARQL query that can be executed by an ontology server, which then returns the results as a set of triples. The query pattern would operate on the graphs that represent the ontology and pattern matches comprise the result. We denote this implementation approach as the graph pattern implementation. The corresponding graph pattern query for direct matching is shown in Figure 5.6. The prop: prefix is used for terms describing a property. Themapping: prefix is used for terms describing a matching between two properties across tables. 5.2.4.2 Performance Evaluation To evaluate these two implementation strategies, the direct matching algorithm was implemented in C# (nested-loop implementation) and with the SPARQL code shown in Figure 5.6 (the graph pattern implementation). We executed each implementation on 70 a triple store server. The ontology server was executed on an AMD Processor (4228 HE) at 2.8 GHz with 32 GB main memory. Two ontologies containing instances were used for these experiments. We used an enterprise-scale dataset to provide the source and target tables. OntologyO 1 was derived from two tablesT 1;1 andT 1;2 . OntologyO 2 is derived from four tablesT 2;1 , T 2; 2, T 2;3 , andT 2;4 . The number of triples in each of these tables is shown in Table 5.3. Table 5.3: Number of triples in ontologies Ontology Number of triples T 1;1 125,865 T 1;2 98,325 T 2;1 398,352 T 2;2 3,933,711 T 2;3 132,928 T 2;4 2,289,258 When executing similarity queries, we used different tables as the source and target table. For each pair of source and target tables, different input values were used in our evaluation. These values result in output sets of different sizes as shown in Table 5.4. Table 5.4: Input pairs and number of non-pruned input entries and size of output (Source, Target) pair Non-pruned inputs Outputs T 1;1 ;T 2;1 15 15 T 1;1 ;T 2;1 12 13 T 1;1 ;T 2;3 16 16 T 1;1 ;T 2;4 18 31 The algorithm is executed 11 times for each input and the first run is ignored. The mean and standard deviation of the remaining 10 running times for the nested-loop and graph pattern implementations are compared for the four different query inputs. The experiment was carried out with the two client implementations running on the same server hosting the ontology server. The resulting running times and deviations for the two implementations are shown in Table 5.5. 
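For concreteness, a minimal sketch of the nested-loop client is given below, written in Python with the SPARQLWrapper library rather than the C# used in the experiments. The endpoint URL, graph names, property IRIs, and thresholds are illustrative placeholders; the queries only mirror the structure of Algorithm 2 and are not the exact statements issued in the evaluation.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8890/sparql"      # assumed triple-store endpoint
MAP = "http://example.org/mapping#"            # placeholder mapping vocabulary

def run(query):
    sw = SPARQLWrapper(ENDPOINT)
    sw.setQuery(query)
    sw.setReturnFormat(JSON)
    return sw.query().convert()["results"]["bindings"]

def direct_match(source_field, value, rel_thresh=0.5, conf_thresh=0.8):
    results = set()
    # Step 2 of Algorithm 2: relevant (source -> target) field pairs above the thresholds.
    pairs = run(f"""
        SELECT ?sp ?tp WHERE {{
          GRAPH <http://example.org/OntologyMatchingResults> {{
            ?m <{MAP}property2> ?sp ; <{MAP}property1> ?tp ;
               <{MAP}hasProbability> ?score ; <{MAP}hasConfidence> ?conf .
            FILTER (?score > {rel_thresh} && ?conf > {conf_thresh}) }} }}""")
    # Step 3: source entries holding the requested value in the given field.
    entries = run(f"""
        SELECT ?e WHERE {{
          GRAPH <http://example.org/sourceTable> {{ ?e <{source_field}> "{value}" . }} }}""")
    # Steps 4-8: one query per (field pair, source entry) -- the nested for-loops.
    for p in pairs:
        for e in entries:
            matches = run(f"""
                SELECT ?key WHERE {{
                  GRAPH <http://example.org/sourceTable> {{
                    <{e['e']['value']}> <{p['sp']['value']}> ?v . }}
                  GRAPH <http://example.org/targetTable> {{
                    ?t <{p['tp']['value']}> ?w ;
                       <http://example.org/primaryKey> ?key .
                    FILTER regex(str(?w), str(?v), "i") }} }}""")
            results.update(m["key"]["value"] for m in matches)
    return results

Each containment test is pushed into a small SPARQL query, so the client issues one request per field pair and source entry, which is exactly what makes the nested-loop strategy chatty compared with the single graph pattern query of Figure 5.6.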
71 Table 5.5: Running time of various query implementation strategies - client and server on same host (NL: Nested-loop, GP: Graph pattern) Source, Target Mean - NL (ms) Mean - GP (ms) SD - NL (ms) SD - GP (ms) T 1;1 ;T 2;1 5271.6 10260.3 54.0395 108.8169 T 1;1 ;T 2;1 4209.7 8202.3 20.6454 69.3078 T 1;1 ;T 2;3 8311.7 14325.4 71.0634 131.1031 T 1;1 ;T 2;4 17694.5 17489.9 65.5138 174.4429 Table 5.6: Running time of various query implementation strategies - client and server on different hosts (NL: Nested-loop, GP: Graph pattern) Source, Target Mean - NL (ms) Mean - GP (ms) SD - NL (ms) SD - GP (ms) T 1;1 ;T 2;1 5290.8 10122.7 38.0520 66.9777 T 1;1 ;T 2;1 4247.7 8124 19.8385 39.4518 T 1;1 ;T 2;3 8399.8 14109.7 97.0519 76.9532 T 1;1 ;T 2;4 17945.1 18652.6 242.5322 1547.8781 In order to account for the effect of network access latency, the experiment is repeated with the client executing on a host that is different from that of the ontology server. This host is equipped with an Intel Core i7-4770 CPU at 3.40GHz with 8 GB main memory. The resulting running times and deviations for the two implementations are shown in Table 5.6. From these results, it can be seen that the running time of the nested-loop implemen- tation is approximately half of the graph pattern implementation for three of the four inputs. However, as the output size increases (inputT 1;1 ;T 2;4 ), the difference between the two methods decreases. The standard deviation of the data from the experiment ran on the ontology server is smaller than that of the remote host in general. The maximum standard deviation value for ontology server is under 200ms, while the one found from executing on a remote host is over 200ms in the case of the T 1;1 ;T 2;4 input. This is explained by the lower network latency when the direct matching code is run directly on the ontology server. 72 Note that the graph pattern implementation based on a single SPARQL query can be optimized by the ontology server before it is executed. The observation that the nested- for-loop implementation gives better performance than the SPARQL query indicates that significant further query optimizations are possible within the triple store server. 73 Chapter 6 Optimizing Ontology Integration and Query Execution on the Semantic Web 6.1 Optimized UFOM with Machine Learning Compo- nents UFOM was designed to represent degrees of different types of relationships between entities using a vector of four similarity metrics. The computation of these similarity metrics is expensive in terms of time, since they scale with the number of entity pairs across the ontologies. In other words, UFOM is designed for representational ability and not computation efficiency. However, as both volume and velocity of web-accessible data increases, the high computational cost becomes the bottleneck for generating a fuzzy alignment. Moreover, the membership function to represent alignment between two entities is not always available in an application. In most applications, only a limited amount of ground truth data (i.e., a set of entity pairs from different ontologies and the relationship degree between them) is available. Here, we describe ontology alignment optimization strategies that have been applied to the UFOM framework in order to make it applicable to real-world enterprise-scale datasets. 74 6.1.1 Optimization of Ontology Alignment 6.1.1.1 Name-based Similarity In UFOM, the name-based similarity considers both semantic and syntactic scores. 
For semantic similarity, UFOM calculates the Jaccard similarity between the synsets of tokens of two names. However, the calculation of Jaccard similarity is not a compu- tationally efficient operation when the synset consists of a large number of words. In order to reduce the total computational cost, we apply the Wu-Palmer Similarity [93] to calculate the semantic similarity between property names. WuP(w 1 ;w 2 ) = 2 depth(lcs(w 1 ;w 2 )) depth(w 1 )+depth(w 2 ) (6.1) Wu-Parlmer Similarity returns a score denoting how similar two words are based on depth(w 1 ) – the depth of the two words in the taxonomy (e.g., Wordnet) and depth(lcs(w 1 ;w 2 )) – the depth of their Least Common Subsumer (most specific ances- tor node). In Wordnet, the subsumer can be found using hypernyms. The score is between (0,1]. It cannot be zero because the depth of the LCS is never zero. The score is one if the two input concepts are the same. The following algorithm describes the steps in computings se . Note that this algorithm is computationally less expensive than the calculation of the Jaccard similarity. Algorithm 4 Calculate Semantic Similarity getSemanticSimilarity (property names E 1 ;E 2 ) 1: Tokenize the names of both properties. Denote byE i :TOK j as thej-th token in the name of propertyE i 2: Calculate the Wu-Parlmer similarity on each pair of tokens 3: return s se (E 1 ;E 2 ) = 1 n X i max j WuP(E 1 :TOK i ;E 2 :TOK j ): 75 6.1.1.2 Mutual Information Similarity The computation of the Mutual Information Similarity in UFOM can be made more effi- cient with the use of a dictionary data structure. To generate mutual information sim- ilarity, a dictionary is generated containing all words appearing in the instances of the second propertyE 2 . The frequency of each word appearing inE 2 can then be retrieved using this dictionary. The Mutual Information Similarity computation in real-world datasets can be further sped up. In real-world datasets, most of the words have very low frequency. However, they are all considered in the calculation of s mi . Removing such infrequent words from the dictionary W 2 does not significantly affect the condi- tional probability of a wordw fromE 1 appearing in some instances ofE 2 . However, removing the words will result in a significant reduction in the computational burden for calculatingp(wjE 2 ). A pre-defined frequency threshold2 [0;1] is used to control the lower bound of frequency of words in the dictionary. Specifically, only words with frequency> are included in the dictionary. Given this optimization, the calculation of the mutual information similarity is updated as: s mi (E 1 ;E 2 ) = 1 jv(E 1 )j X e 1 2E 1 Y w2e 1 n(w;(E 2 )) jW 2 j (6.2) (E) is the set of instances excluding the words with frequency less than . n(w;(E)) denotes the number of appearances of the wordw in the set(E). w2 e 1 denote the words appearing in the instancee 1 . 6.1.1.3 Containment Similarity The calculation of Containment Similarity can also be sped up. In the Containment Similarity calculation process in UFOM, note that even when the instance inE 2 which has the maximum Smith-Waterman Similarity [84] when e 1 is found, the algorithm 76 continues to check the rest of instances in E 2 . In large datasets, this redundant com- putation comes with high time cost. In order to avoid such computation, we intro- duce a SWS threshold . 
Specifically, when an instance in E 2 satisfying the condi- tion SWS(e 1 ;e 2 ) is found, the Containment Similarity algorithm exits without checking the rest of the instances inE 2 . The following algorithm describes the steps in computings co : Algorithm 5 Calculate Containment Similarity getContainmentSimilarity (Property E 1 ;E 2 ) 1: total = 0 2: for each instancee 1 2E 1 do 3: max = 0 4: for each instancee 2 2E 2 do 5: ifSWS(e 1 ;e 2 ) then 6: total+ =SWS(e 1 ;e 2 ) 7: break 8: else 9: max = (SWS(e 1 ;e 2 )>max)?SWS(e 1 ;e 2 ) :max 10: end if 11: end for 12: total+ =max; 13: end for 14: return s co (E 1 ;E 2 ) = total jE 1 j 6.1.1.4 Learning the Membership Function UFOM requires that each relation be defined using an explicit membership function to transform the four similarity values to a single relation score. However, each similarity value can have a different relative contribution to the final score when the underlying data are different. When some amount of data relating entities from different ontologies and their known relation, machine learning methods can be used to learn a model of the membership function from this data. We have evaluated two different machine learning methods in this thesis to develop an ontology matching framework that is applicable 77 when such training data is available but an explicit representation of the membership function is not. These two methods are described next. We name the optimized frame- work with logistic regression as the machine learning approach as UFOM+ and the one with neural network as UFOMNN. Note that both these frameworks include the opti- mizations described earlier. 6.1.2 Logistic Regression (UFOM+) UFOM+ extends UFOM by performing parameter optimization [4] using cross- validation on the training data with logistic regression. In UFOM+, the final relation score is defined as, r ( ! s ) =(! na s na +! mi s mi +! co s co +! st s st ) (6.3) where(x) = 1 1+e x . The sigmoid () function divides the 4-dimensional parameter space smoothly. 6.1.3 Neural Network (UFOMNN) Neural networks are appropriate machine learning techniques that are used in estimation and prediction when the structure of the underlying function is unknown and complex, as in the case of most real-world ontology alignment datasets. We used a 2-hidden layer neural network to compute the relation score. As shown in Figure 6.1, four neurons are used in the input layer which correspond to the four similarities (s na ,s mi ,s co ands st ). The output layer has one neuron which gives the estimated relation score. In the two hidden layers, 10 and 5 neurons are used separately. The sigmoid function is chosen as the activation function in the neural network. 78 … … Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Figure 6.1: Neural Network for Estimating Relation Score In order to train the neural network, we use the backpropagation algorithm [75] with the default error metric as MSE (Mean Squared Error), MSE = 1 n X i=1:::n (e i a i ) 2 (6.4) where e is a vector of n estimated values generated by the neural network, and a is the vector of observed values corresponding to the inputs to the function which generated the predictions. In the experiments, we divide the complete dataset into two sets based on the ratio 7:3 to generate both training and test sets. 6.1.4 Experimental Evaluation In this section, we present the evaluation of UFOM and its variants, UFOM+ and UFOMNN, on various datasets. 
The datasets include publicly available ontologies used for evaluating ontology matching systems and (private) enterprise-scale instance data from a large corporation. For each dataset, we compare the performance of the UFOM approach with the two optimized variants with respect to matching accuracy and the computational cost of the matching algorithms.

6.1.4.1 OAEI Dataset

We performed a set of experiments using two sets of ontologies provided by the Ontology Alignment Evaluation Initiative (OAEI) 2013 Campaign. The first dataset is the Conference Ontology 1, which contains 16 ontologies. In our experiments, we aim to find the alignment between one specific ontology, Cmt, and multiple ontologies including confOf, sigkdd, conference, edas, ekaw, and iasted. In total, Cmt has 59 properties and the target ontology set has 188 properties. The second set is the Instance Matching (IM) Ontology 2. It has 5 ontologies and 1,744 instances. For UFOM, the membership function for the equivalence relation is

\mu_{equ}(\vec{s}) = \frac{s_{na} + s_{mi} + s_{co} + s_{st}}{4}   (6.5)

The weights for the name-based similarity, the mutual information similarity, the containment similarity, and the structural similarity are the same. For the Conference Ontology, no instances are provided. We therefore set s_{mi} = s_{co} = 0 and the confidence threshold c = 0. The relation score threshold is set to 0.35 (the upper bound of the score is 0.5 since s_{mi} = s_{co} = 0). For UFOM+ and UFOMNN, we use 65% of the data as the training set. We compute precision, recall, F-measure, and time (s) on the test set to evaluate the performance of UFOM, UFOM+, and UFOMNN.

1 http://oaei.ontologymatching.org/2013/conference/index.html
2 http://www.instancematching.org/oaei/

Figure 6.2: Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Conference Ontology matching

Since we use Wu-Palmer Similarity in the name-based similarity calculation, the computation time for generating the matches is reduced by 44% and 41% for UFOM+ and UFOMNN, respectively (Figure 6.2). At the same time, precision, recall, and f-measure remain almost the same. Similarly, we evaluate UFOM, UFOM+, and UFOMNN on the Instance Matching Ontology. For the Instance Matching Ontology, the confidence threshold c is set to 0.6 and the relation score threshold is set to 0.7. Since this ontology has instances, the upper bound of the score is 1.0. For the mutual information and containment similarities in UFOM+ and UFOMNN, we set the frequency threshold to 0.05 and the SWS threshold to 0.7.

Figure 6.3: Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Instance Matching Ontology

As shown in Figure 6.3, because of the optimization of the mutual information similarity and containment similarity calculations, the computational time for UFOM+ is reduced by 39%, with only a slight decrease in precision, recall, and f-measure. The reason for this improvement is that the reduced frequency dictionary and the early-stop strategy speed up the similarity computation process without losing many true matches. For UFOMNN, since a multi-layer neural network is used, the model better captures the non-linear relationship between the different similarities and the relation score. As a result, precision, recall, and f-measure are improved.

6.1.4.2 Museum Dataset

We performed experiments using ontologies from two museums.
The first ontology has 14 properties and 7,770 triples, while the second ontology has 9 properties and 7,650 triples. We consider the relevance relation in this evaluation. The weights for the name-based similarity, the containment similarity, and the structural similarity are set to 3/5, 1/5, and 1/5, and the confidence threshold c is set to 0.8. In UFOM, the relevance membership function is defined as

\mu_{rel}(\vec{s}) = \frac{3}{5} s_{na} + \frac{1}{5} s_{co} + \frac{1}{5} s_{st}   (6.6)

Figure 6.4: Precision, recall, f-measure and time on applying UFOM, UFOM+ and UFOMNN to the Museum Ontology matching

For this dataset, instances have few overlaps since the corresponding artworks appear in different museums. However, our method still shows the ability to generate correct correspondences by using meta-information, including the schema and structure of the ontology. Specifically, the relation score threshold is set to 0.6, the frequency threshold to 0.05, and the SWS threshold to 0.7. As shown in Figure 6.4, the computational time is reduced by 36% and 34% for UFOM+ and UFOMNN, respectively. For matching accuracy, precision, recall, and f-measure remain the same for UFOM+, and precision and f-measure even improve for UFOMNN. This indicates that the reduction in similarity caused by the optimization steps is not sufficient to make the relation score of a true match drop below the threshold.

6.1.4.3 Enterprise-scale Data

We evaluate UFOM and its variants on a dataset from an enterprise-scale information repository. The goal is to find relevant properties across two ontologies O_s and O_t. Each ontology focuses on a different domain within the enterprise, but they are related in terms of the entities that they reference. Ontology O_s has 275,018 triples and ontology O_t has 786,194 triples. The data is from a large industrial corporation; due to privacy concerns, we do not expose the real names of the properties and ontologies. To find all relevant properties in O_t given a property E in O_s, the weights for the name-based similarity, the containment similarity, and the structural similarity are set to 1/4, 1/2, and 1/4. The relevance membership function is constructed as

\mu_{rel}(\vec{s}) = \frac{s_{na}}{4} + \frac{s_{co}}{2} + \frac{s_{st}}{4}   (6.7)

Table 6.1 shows the relevance score and confidence score between E and properties in O_t using UFOM.

Table 6.1: Computing the Relevance relation between entities in the enterprise application using UFOM
O_t Property   Relevance   Confidence   Domain Expert
E_1            78.01%      91.89%       Highly Relevant
E_2            73.47%      89.03%       Highly Relevant
E_3            34.15%      90.57%       Relevant
E_4            1.59%       73.41%       Irrelevant
E_5            1.59%       71.64%       Irrelevant
E_6            1.59%       70.50%       Irrelevant

The domain expert manually labeled each discovered correspondence as "Highly relevant", "Relevant", or "Irrelevant" based on their understanding of the underlying entities. Table 6.1 shows that the relevance scores align well with the manually assigned labels. When the relevance score threshold is set to 0.3, both precision and recall reach 1.0. Similar relevance scores are generated using UFOM+ and UFOMNN (frequency threshold 0.05, SWS threshold 0.7); if the score threshold remains 0.3, both precision and recall are again 1.0. However, UFOM+ and UFOMNN reduce the processing time by 50% (from 273.09s to 136.58s) and 48% (from 273.09s to 141.95s), respectively.
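As a concrete illustration of the two learned membership functions evaluated in this section, the sketch below shows how the UFOM+ sigmoid combination (Equation 6.3) and a small feed-forward network of the kind shown in Figure 6.1 might be set up. NumPy and scikit-learn are used here only as stand-ins for the thesis implementation, and the similarity vectors, relation scores, and weights are fabricated placeholders for illustration.

import numpy as np
from sklearn.neural_network import MLPRegressor

def ufom_plus_score(s, w):
    # UFOM+ (Eq. 6.3): sigmoid of a learned weighted sum of the four
    # similarities s = (s_na, s_mi, s_co, s_st).
    return 1.0 / (1.0 + np.exp(-np.dot(w, s)))

# Hypothetical training data: similarity vectors with known relation scores.
X = np.array([[0.9, 0.7, 0.8, 0.6],
              [0.2, 0.1, 0.3, 0.4],
              [0.8, 0.6, 0.9, 0.7]])
y = np.array([0.95, 0.10, 0.90])

# UFOMNN (Fig. 6.1): a 4-10-5-1 feed-forward network with sigmoid units,
# trained by backpropagation to minimise the mean squared error.
nn = MLPRegressor(hidden_layer_sizes=(10, 5), activation="logistic",
                  max_iter=5000, random_state=0)
nn.fit(X, y)

print(ufom_plus_score(np.array([0.85, 0.65, 0.8, 0.7]),
                      w=np.array([1.2, 0.8, 1.0, 0.5])))   # placeholder weights
print(nn.predict([[0.85, 0.65, 0.8, 0.7]]))

In practice the logistic weights and the network parameters would be fitted on the available ground-truth correspondences, so the two models differ only in how much non-linearity they can express over the same four-dimensional similarity input.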
6.2 LTO: Optimizing SPARQL Queries with Condi- tional Triple Patterns Query optimizers aim to compute a query plan which optimizes the performance of the query execution. The performance can be measured by different metrics. In this thesis, we focus on reducing the network traffic and the query execution time. Figure 6.5: The Optimization Architecture Typically, the selectivity is estimated by maintaining a histogram of instance counts. The selectivity of a triple pattern is used to decide the evaluation order of triple patterns 85 in the query. The accuracy of selectivity estimation is dependent on the granularity of the histogram. Building and maintaining a fine-grained histogram is a computationally expensive operation. In real-world applications, some ranges appear more frequently (“hot” ranges) than others in the query filter. Instance distribution information for “cold” ranges in a histogram has therefore little value but increases the computation time to query with “hot” ranges since the entire histogram is traversed. We observe that query history can be accumulated at the query engine. Such query history not only gives us information about what types of queries are usually registered (e.g., frequently asked predicates and ranges), but also enables us to acquire the statistics of the query results (e.g., the number of triples in the results). This information on the prior query results can be used to estimate selectivity of as yet-unseen queries with machine learning-based regression. These two approaches, maintaining histograms and predicting from prior query results, are complementary. Histogram-based estimation can be used in the beginning when no history is available. As query history is accumulated and used for training, histogram-based estimation is gradually traded for machine learning-based prediction. Fig. 6.5 shows the architecture of this proposed SPARQL optimization scheme. In Sec- tion 6.2.2 and 6.2.3, we describe these two selectivity estimation approaches for con- ditional queries: first, using histograms to approximate the distribution of triples, and second, learning to compute this histogram from past query results. Once the ontologies are populated in the triple store, the histograms can be built based on the RDF data. As time goes by, more and more queries are evaluated by the system. The optimizer then can use the query history as the training queries to learn a model to estimate the selectivity. In this thesis, we adopt a widely used machine learn- ing technique, neural networks, to learn the model. As shown in Fig. 6.5, when a query is registered, it is decomposed into triple patterns in the representation of a query graph. 86 Then, the optimizer generates the query plan based on the triple pattern selectivity esti- mated using either histograms or neural networks. The algorithm in Section 6.2.4 shows how the query plan is computed. 6.2.1 Selectivity-based SPARQL Optimization Query optimization algorithms are classified into two categories based on when the query plan is generated. If the plan is generated before the query evaluation, the opti- mization algorithms involving such processes are categorized as static algorithms. If the algorithms compute query plans that minimize the number of intermediate results during query evaluation, they are called dynamic algorithms. Static Optimization. In this approach, the triple patterns are ordered based on the selectivities before the query is evaluated. There are two approaches to use the selec- tivity. 
In the first approach, only the selectivity of single triple pattern is considered. In Figure 6.6, we show an example illustrating this. The figure shows a query graph con- sisting of 4 triple patterns. The triple patterns with smaller selectivities will be evaluated first. As a result, the optimizer generates the query plan ast1!t3!t4!t2. t1 t4 t3 t2 0.02 0.03 0.05 0.04 Figure 6.6: Static Optimization Strategy 1 87 In the second approach [86], the selectivity of pairs of triple patterns is also con- sidered. Specifically, each edge of the query graph is also assigned with the selectiv- ity of the conjunction of the triple patterns in its two nodes. As shown in Figure 6.7, the optimizer first selects the edge with minimum selectivity (e.g., the edge linkingt2 andt3). Then the triple patterns connected by this edge are added into the query plan ordered based on their selectivity (e.g., t3! t2). The optimizer iteratively finds the edge with minimum selectivity in the remaining graph which is also connected with an existing triple pattern in the current query plan. This process halts when all triple patterns have been added into the query plan. In our example, the final query plan is t3!t2!t4!t1. t1 t4 t3 t2 0.02 0.03 0.05 0.04 0.5 0.2 0.4 0.3 Figure 6.7: Static Optimization Strategy 2 Dynamic Optimization. If the query plan is computed during query evaluation, then the algorithms involving such process are called dynamic optimization algorithms [44]. We use an example to illustrate the process. We consider the original query graph in Fig- ure 6.7. Similar to static algorithms, the optimizer first selects the edge with minimum selectivity (e.g., the edge connectingt2 andt3) and chooses one of the two connected triple patterns which has smaller selectivity to evaluate (e.g.,t3). Then, the optimization process is carried out at the peerP1 containing nodet3. A new query graph is generated 88 with a hypernodeH1 as shown in Figure 6.8(a). P1 decides which triple pattern con- nected to it should be evaluated next based on the edge selectivity. Aftert2 is evaluated, a new hypernodeH2 is generated(Figure 6.8(b)). The peer containing the triples needed for evaluatingt2 is calledP2. P2 updates the selectivity of edges connected withH2 based on the candidate triple pattern selectivity (e.g., t1 and t4) and the intermediate results for H2. In this example, P2 added t1 in the query plan since t1 has smaller updated edge selectivity thant4, even though the original edge selectivity is larger (Fig- ure 6.8(c)). Then the following selected peers iteratively add new triple pattern in the query plan until there is only one hypernode left in the query graph (Figure 6.8(d)). t4 t2 t3 0.05 0.04 0.5 0.2 0.4 0.3 H1 t1 0.02 t4 t3^t2 0.04 0.2 0.3 H2 t1 0.02 t4 t3^t2^t1 0.04 0.2 H3 t3^t2^t1^t4 H4 (a) (b) (c) (d) Figure 6.8: Dynamic Optimization Selectivity Estimation. No matter whether it is a static approach or a dynamic method, the key task is the same for all the optimizations: selectivity estimation. There 89 are multiple ways to estimate the selectivity for a single triple pattern. One heuristic [86] states that triple patterns with more bound components usually have smaller selectivity. This heuristic also assumes that subjects are more selective than objects which are in turn more selective than predicates. Another approach is based on using statistical properties of the underlying triples. 
Each peer maintains a “stats table” in which the frequencies of each value appearing as a subject, predicate, or object are stored. The selectivity of each component of a triple pattern with a specific value is computed using the information in this table. The selectivity of the whole triple pattern is then generated by multiplying all three triple pattern component selectivities. In order to compute the selectivity of the conjunction of two triple patterns, we can simply precompute the join cardinality and store it in the local stats table [86]. The com- putation becomes expensive when we have a large number of triples and methods have been proposed to more efficiently estimate statistics. For instance, [44] proposed an efficient way to estimate the selectivity of the conjunction of triple patterns. However, these methods cannot estimate the selectivity of conditional triple patterns since they are defined only for triple patterns either with no filter conditions or with only equal- ity operators in the filter conditions. In Section 6.2.2 and 6.2.3, we propose two new algorithms which are capable of estimating the selectivity of conditional triple patterns. These methods can also be combined in the query optimization process. 6.2.2 Selectivity Estimation using Histograms The key idea is to approximate the distribution of triples over the values of their predi- cates using a histogram. Specifically, a histogram is a look-up table with the key as the interval of a predicate or multiple predicates and the value as the number triples with the predicate value falling into the corresponding interval. Table 6.2 shows an example of a 90 histogram with only one predicate “ex:page”. From this histogram, we can directly find that the number of books (triples) with the number of pages falling between 0 to 200 is 1865. It is also possible to estimate the number of triples with any predicate range if a triple distribution is assumed for each histogram range (e.g., uniform distribution). The estimated triple count is converted to selectivity by dividing it by the total number of triples. Table 6.2: Histogram for “ex:page” Range Count 0, 200 1865 200, 400 2657 400, 600 2549 600, 800 2290 800, INF 1756 The distribution representation can be extended to multiple triple patterns. Table 6.3 shows a partial histogram example for two predicates. We next describe the Histogram- Based Optimization (HBO) algorithm that uses such a histogram to estimate the selec- tivity of the conjunction of two conditional triple patterns. Table 6.3: Histogram for predicates “ufom:score” and “ufom:conf” Range1 Range2 Count 0, 0.2 0, 0.2 59 0, 0.2 0.2, 0.4 42 ... ... ... 0.8, 1 0.6, 0.8 24 0.8, 1 0.8, 1 18 Constructing Histogram. As in Kaoudi et al. [44], we index each triple in the distributed hash table (DHT) three times based on its subject, predicate and object. The hash values of the subject, predicate and object are used to compute which peers are responsible for storing the triple. This mechanism guarantees that the triples with same subject, predicate or object will be distributed to the same peer. A triple can be stored 91 on either one, two or three peers depending on whether its subject, predicate and object have same hash value. The HBO algorithm maintains a histogram at each peer. We first discuss the single conditional triple pattern case. We assume all triples with predicate “ex:page” are stored on peerP . Algorithm 1 shows how the histogram is constructed from the triples onP . 
The inputs arepn,T ,g,pv max andpv min wherepn is the target predicate (e.g., “ex:page”), T is the triple set on the peer, g is the granularity of the histogram, andpv max andpv min are the maximum and minimum values of the object of triples which have predicate aspn. The output is a histogramHS in which the number of ranges isg, and the count for ranger i (i2 [0;g1]) is the number of triples with predicate aspn and object value falling into that range. Algorithm 6 Histogram Construction for Single Triple Pattern (Input: T;pn;g;pv max ;pv min ) 1: InitializeHS withg ranges 2: for eachi2[0;g1] do 3: HS:r i = [pv min +i pvmaxpv min g ;pv min +(i+1) pvmaxpv min g ] 4: HS:count(r i ) = 0 5: end for 6: for eacht2T do 7: ift:predicate =pn andt:object2r i then 8: HS:count(r i )++ 9: end if 10: end for 11: return HS Algorithm 6 uniformly divides the predicate domain intog intervals and initializes the counts to 0 (lines 3-5). The count is updated (lines 6-8) by traversing all the triples and updating the count in one of the ranges of the histogram based on the triple’s predi- cate value. If two predicates are hashed to the same peer (e.g., the predicates have same hash value), a two dimensional histogram can be constructed to estimate the selectivity of the conjunction of two conditional triple patterns. In Algorithm 7, a superscript (a and 92 b) distinguishes the two triple patterns. Algorithm 7 is similar to the single pattern case except that it checks whether both predicate values (e.g., pn a and pn b ) fall into corresponding intervals during the update of range counts. Algorithm 7 Histogram Construction for Conjunction of Two Triple Patterns(Input: T;pn a ;g a ;pv a max ;pv a min ;pn b ;g b ;pv b max ;pv b min ) 1: InitializeHS withg a g b ranges 2: for eachi2[0;g a 1] do 3: for eachj2[0;g b 1] do 4: HS:r a i = [pv a min +i pv a max pv min g a ;pv a min +(i+1) pv a max pv a min g a ] 5: HS:r b j = [pv b min +j pv b max pv b min g b ;pv b min +(i+j) pv b max pv b min g b ] 6: HS:count(r a i ;r b j ) = 0 7: end for 8: end for 9: for eacht2T do 10: if 9t 0 2 T where t:subject = t 0 :subject and t:predicate = pn a and t 0 :predicate =pn b andt:object2r a i andt 0 :object2r b j then 11: HS:count(r a i ;r b j )++ 12: end if 13: end for 14: return HS The HBO Algorithm. A multiple triple pattern histogram can be used to estimate the selectivity of conjunctions of conditional triple patterns using the HBO algorithm (Algorithm 8). The inputs are two conditional triple patterns, a andb, with inequality operators in the filter condition and the histogram HS. The filter condition for each triple pattern can be represented as rangesa:r andb:r separately, andcon is a boolean variable indicating whether to estimate the selectivity of a single triple pattern or the selectivity of the conjunction of two triple patterns. Since the HBO algorithm is executed on each peer locally, the peer always has all the triples with predicate asa’s andb’s when the estimation is on the conjunction of those two triple patterns. If the estimation is on a single triple pattern, the algorithm ignoresb. The output is the selectivitys of either the single triple patterna or the conjunction of the triple patternsa andb. 
93 Algorithm 8 The HBO Algorithm - A Histogram-Based Optimization Approach (Input: HS;a;b;T;con) 1: cnt = 0 2: if!con then 3: for eachi2[0;HS:g a 1] do 4: cnt+ =HS:count(r a i ) jr a i j\ja:rj jr a i j 5: end for 6: else 7: for eachi2[0;HS:g a 1] do 8: for eachj2[0;HS:g b 1] do 9: cnt+ =HS:count(r a i ;r b j ) jr a i ;r b j j\jr a q ;r b q j jr a i ;r b j j 10: end for 11: end for 12: end if 13: s = cnt jTj 14: return s In Algorithm 8, jr a i j\ja:rj jr a i j or jr a i ;r b j j\ja:r;b:rj jr a i ;r b j j is the percentage of the overlapping part between the range in the query filter and the histogram unit range over the histogram unit range (lines 4 and 9). If the filter range fully covers the histogram unit range, the value is 1. If the filter range and the histogram unit range are disjoint (e.g.,con =false case) or the filter range and the histogram unit range in either of the two dimension are disjoint (e.g.,con = true case), the value is 0. Once all the histogram unit ranges are checked, the selectivity is the final count over the total number of triples (line 13). For example, consider a conditional query with a conjunction with its filters as ufom:score 0:2 and ufom:conf 0:3. Then, the histogram in Table 6.3 can be used to compute the final count as59+42 1 2 = 80 and the final selectivity as 80 jTj wherejTj is the total number of triples. 94 6.2.3 Selectivity Estimation using Machine Learning In the HBO algorithm, the larger g is, the more accurate the selectivity estimation is. However, as we increase g, the computational cost increases for both histogram con- struction and selectivity estimation. In real-world situations, ranges are not queried uniformly. In the previous example from ontology matching using UFOM [97], most queries filter for triples with high score and confidence. We propose a machine learning-based optimization (MLO) approach to estimate triple pattern selectivity without constructing the histogram. Intuitively, if we consider a conditional triple pattern, the number of triples returned is closely related to the size of the interval in its filter. In general, the larger the interval is, the more triples are returned. A machine learning-based model is used to represent such relations and the model is learned using the results of past queries. In this thesis, we have used a neural networks as the model of representing selectivity. A neural network was selected since the underlying structure of the relationship between instances and their distribution is unknown in most real-world ontology alignment datasets. After the neural network is trained, it is distributed to each peer. The peer which receives a query evaluation request will parse the upper and lower boundaries in the filter condition and use them as the inputs to the neural network model. Finally, the neural network outputs the estimated selectivity that is used for query plan optimization. 6.2.4 The Optimizer When the optimizer receives a query request, it will request corresponding peers to gen- erate the selectivity of each triple pattern or the conjunctive selectivity of each possible triple pattern pair (e.g., those pairs with triple pattern predicates hashed to the same peer) based on the DHT. Then, the query plan is computed using Algorithm 9. 95 Algorithm 9 The Optimization Algorithm (Input:C;N;E;con) 1: P q =null 2: if!con then 3: P q :add(argmin tp2C[N sel(tp)) 4: whilejP q j! 
=jC[Nj do 5: P q :add(argmin tp2C[NPq^(tp;tpx2Pq)2E sel(tp)) 6: end while 7: else 8: P q :add(argmin (tp i ;tp j )2E sel(tp i ;tp j )) 9: whilejP q j! =jC[Nj do 10: P q :add(argmin tp2C[NPq^(tp;tpx2Pq)2E sel(tp;tp x )) 11: end while 12: end if 13: return P q In Algorithm 9, a query planP q is generated based on the selectivity returned from each peer. On each peer, the selectivity is estimated using either HBO or MLO approach (corresponding to the sel() function). The input includes the basic components of a query graph, the set of nodes N[ C and the set of edges E, and a boolean variable con specifying if the selectivity is computed on single triple pattern or conjunctive triple patterns. If only single triple pattern selectivity is considered, the algorithm first finds the triple pattern with the smallest selectivity and adds it as the first node in the query plan (line 3). Then, from the remaining nodes, the algorithm iteratively finds the triple pattern with the smallest selectivity which is linked with at least one node in the current query plan P q , and append it to P q (line 4-5). If the selectivity of the conjunction of two triple patterns is considered, the algorithm identifies the pair of conjunctive triple patterns (tp i ;tp j ) which has the smallest estimated selectivity (line 8). Then, from the remaining nodes, the algorithm iteratively finds the triple pattern which has the smallest conjunctive selectivity with one node in the current query plan P q , and append it to P q (line 9-10). Finally, after all nodes are added intoP q , the algorithm returns the final 96 query planP q . In this algorithm,sel(tp) orsel(tp i ;tp j ) can be any selectivity estimation approaches not restricted to HBO or MLO. 6.2.5 Experimental Evaluation In this section, we present the results of evaluating the HBO and MLO approaches on two datasets. The first is generated synthetically. We consider different data character- istics in the data generation phase. The second dataset comprises of ontologies of an enterprise-scale dataset. 6.2.5.1 Synthetic Datasets We first performed a set of experiments on the synthetic dataset. The synthetic dataset consists of four subsets, and each 10 6 contains triples. The statistics are shown in Table 6.4. Table 6.4: Statistics of the Synthetic Dataset Subset Predicate and Range (UV) Distribution Dependency 1 a2[0, 9999](10K), b2[0, 9999](10K) Uniform No 2 a2[0, 999](841), b2[0, 999](841) Gaussian No 3 a2[0, 9999](10K), b2[0, 99999](100K) Uniform b!a 4 a2[0, 9999](7226), b2[0, 9999](7226) Gaussian b!a In the first subset, we generated10 6 triples with predicate asa andb separately. The values of object are integers within 0 to 9,999 following a uniform distribution. In the second subset, we generated10 6 triples with predicate asa andb as well. The values of object are integers within 0 to 999 following a Gaussian distribution. For the third and fourth subsets, we generated the values of predicatesa andb together with a function dependencyb!a. For evaluation, we generated two query sets for each data subset. Each query set consists of 13,000 queries. We divide each set into 3,000 queries for testing and 10,000 97 for training. The first set consists of queries with random ranges in the filter condition of the triple patterns, while the second set has queries with the ranges following Gaus- sian distribution, which simulates the hot region phenomenon [39, 90]. The hot region phenomenon says that concurrently executed queries are often similar to each other. 
In the previous employee ontology, the possible values of property “age” can range from 0 to 100, but the frequently queried age might be only from 20 to 60. It indicates that a small range of values are likely to be enquired instead of the whole range. Hot regions can also be found in sensor monitoring systems. For example, in a habitat monitoring application, high temperature ranges may be queried more than low ones during sum- mertime. … c … Input Layer (4 neurons: r q a .u, r q a .l, r q b .u, r q b .l) Hidden Layer 1 (10 neurons) Hidden Layer 2 (5 neurons) Output Layer (1 neuron: selectivity) c Figure 6.9: The Neural Network for Estimating Selectivity We use a multi-layer feed-forward neural network in our experiments. Such a neural network is typically defined by three parameters: 1) the interconnection pattern between different layers of neurons; 2) the weights of the interconnections; and 3) the activation 98 function that converts a neuron’s weighted input to its output activation. In this the- sis, the interconnection pattern and activation function are given, while the weights are trained in the learning process. There are multiple ways to train the neural networks such as supervised learning, unsupervised learning and reinforcement learning. Figure 6.9 shows a 2-hidden layer neural network which is used to estimate the selectivity. Four neurons are used in the input layer which correspond to the lower and upper boundaries of the intervalr a q and r b q in the two triple patterns of the query. If only single triple pattern is considered,r b q can be simply set as the whole range of the property. The output layer has one neuron which gives the estimated selectivity. There are two hidden layers consisting of 10 and 5 neurons in this example as well. The sigmoid functionS(t) = 1 1+e t is used as the activation function in the neural network. We use supervised learning to train the neural network. In supervised learning, the goal is to find a function in the allowed class of functions that matches the examples. The cost function we use is the MSE (Mean Squared Error), which tries to minimize the average squared error between the neural network’s output and the ground truth over all examples in the training set. Here is the definition of MSE, MSE = 1 n X i=1:::n (e i a i ) 2 (6.8) wheree is a vector ofn estimated values generated by the neural network, anda is the vector of observed values corresponding to the inputs to the function which gener- ated the predictions. In order to train the neural network, we use the back-propagation algorithm [75] with the default error metric as the MSE. We compare our HBO and MLO approaches with two other approaches. The first one is the baseline approach (BL) with no optimization strategy used, and the second 99 one is the QC algorithm proposed in [44]. We evaluate the performance from three aspects: selectivity estimation accuracy, network bandwidth and query response time. Selectivity Estimation Accuracy. Since BL and QC cannot estimate the selec- tivity for conditional triple patterns, we only compare our proposed HBO and MLO approaches for selectivity estimation accuracy. In Figure 6.10, each subfigure corre- sponds to one subset in the synthetic dataset. Each value in the figure is generated by averaging the MSE of 3,000 queries. We consider two query distributions for each subset: random queries and hot region queries. 
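For reference, the core of the histogram-based estimation compared in these experiments can be sketched as follows: a minimal Python version of the histogram construction (Algorithm 6) and the HBO overlap estimate (Algorithm 8) for a single conditional triple pattern, assuming a uniform distribution inside each bucket. The example triples, predicate name, and granularity are illustrative placeholders.

def build_histogram(triples, pred, g, pv_min, pv_max):
    # Count, per bucket, the triples whose object value for `pred`
    # falls into each of g equal-width ranges over [pv_min, pv_max].
    width = (pv_max - pv_min) / g
    counts = [0] * g
    for s, p, o in triples:
        if p == pred and pv_min <= o <= pv_max:
            i = min(int((o - pv_min) / width), g - 1)
            counts[i] += 1
    return counts, width, pv_min

def hbo_selectivity(hist, filter_lo, filter_hi, total_triples):
    # Estimate the fraction of triples with filter_lo <= object <= filter_hi
    # by weighting each bucket count with its overlap fraction.
    counts, width, pv_min = hist
    est = 0.0
    for i, c in enumerate(counts):
        lo, hi = pv_min + i * width, pv_min + (i + 1) * width
        overlap = max(0.0, min(hi, filter_hi) - max(lo, filter_lo))
        est += c * overlap / width
    return est / total_triples

# Illustrative data in the style of Table 6.3's "ufom:score" predicate.
triples = [("c1", "ufom:score", 0.91), ("c2", "ufom:score", 0.15),
           ("c3", "ufom:score", 0.72), ("c4", "ufom:score", 0.88)]
hist = build_histogram(triples, "ufom:score", g=5, pv_min=0.0, pv_max=1.0)
print(hbo_selectivity(hist, 0.8, 1.0, total_triples=len(triples)))   # 0.5

The same bucket-overlap idea extends to the two-dimensional histograms of Algorithm 7: the count of each two-dimensional bucket is weighted by the product of the overlap fractions of the two filter ranges, as in line 9 of Algorithm 8.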
Figure 6.10: Selectivity Estimation Accuracy: (a) MSE on subset 1, where data are uniformly distributed and have no dependency; (b) MSE on subset 2, where data follow a Gaussian distribution and have no dependency; (c) MSE on subset 3, where data are uniformly distributed and a is dependent on b; (d) MSE on subset 4, where data follow a Gaussian distribution and a is dependent on b.

As shown in Figure 6.10, when data are uniformly distributed, both HBO and MLO perform well in terms of generating small MSE when estimating the selectivity. Moreover, MLO performs better than HBO, especially when the queries have hot regions. The reason is that the neural network can be learned better when the training process focuses on a specific region of the range, compared with cases involving randomly distributed training queries. For the Gaussian distribution, the MSE is larger than in the uniform distribution cases. This is because, within each range of the histogram, the triples are no longer uniformly distributed, so the percentage of overlapping ranges does not model the real triple count within that range well. However, MLO still performs better than HBO, because the neural network can learn the data distribution well regardless of the type of distribution, given sufficient training queries. If a data dependency exists (Figure 6.10(c)(d)), MLO can further improve the accuracy by learning the dependency between the two predicates. This shows that neural networks can learn the hidden structure of the underlying relation quite well.

Network Bandwidth. Next, we investigated how HBO and MLO can improve the query performance in terms of network traffic. In this set of experiments, selectivity estimation for single triple patterns is enabled in the optimizer. As shown in Figure 6.11, HBO and MLO reduce the network traffic by more than 50% compared with BL and QC for all datasets. One reason is that BL and QC cannot estimate the selectivity of conditional triple patterns; in other words, their query plans are generated essentially at random. Although some heuristics are used in QC, the estimated selectivity can still differ largely from the real one. Another reason is that both the histogram and the neural network capture the data distribution, which facilitates estimating the number of triples satisfying the query condition. One interesting observation is that the performance over the different datasets is almost the same, even though the accuracy of selectivity estimation differs. The reason is that the evaluation order of triple patterns is not very sensitive to the error incurred in the selectivity estimation: in most cases, the relative order (larger or smaller) of the selectivities of different triple patterns remains the same.

Figure 6.11: Network Bandwidth: (a) network traffic on subset 1, where data are uniformly distributed and have no dependency; (b) network traffic on subset 2, where data follow a Gaussian distribution and have no dependency; (c) network traffic on subset 3, where data are uniformly distributed and a is dependent on b; (d) network traffic on subset 4, where data follow a Gaussian distribution and a is dependent on b.

Query Response Time. Finally, we evaluated all four approaches based on the query response time (QRT). As before, selectivity estimation for single triple patterns is enabled. The total query response time consists of both query optimization time and query execution time. As shown in Figure 6.12, QC's QRT is slightly higher than BL's because of the extra optimization time. For the newly proposed approaches, HBO reduces the QRT by around 60% compared with BL and QC for all four datasets. The reason is that HBO generates better query plans based on the histogram, which saves execution time. However, HBO has a larger QRT than MLO, because the optimization time for HBO is larger than MLO's even though they generate similar query plans. For MLO, the neural network model is trained in advance; when a query arrives, the optimizer only needs to extract the input parameters and compute the output using the neural network. This is more efficient than traversing the histogram to estimate the selectivity.

Figure 6.12: Query Response Time: (a) QRT on subset 1, where data are uniformly distributed and have no dependency; (b) QRT on subset 2, where data follow a Gaussian distribution and have no dependency; (c) QRT on subset 3, where data are uniformly distributed and a is dependent on b; (d) QRT on subset 4, where data follow a Gaussian distribution and a is dependent on b.

6.2.5.2 Enterprise-scale Dataset

For the evaluation of query optimization strategies on a real dataset, we considered two ontologies from an enterprise-scale information repository. Each ontology focuses on a different application area, but they are related in terms of the entities that they reference. Ontology O_1 has 125,865 triples and ontology O_2 has 651,860 triples. Due to privacy concerns, we do not expose the real names of the properties and ontologies. We generated 137 fuzzy correspondences between these two ontologies using UFOM [97]. Each correspondence has a "ufom:score" and a "ufom:conf" with object values within [0,1]. We generated two sets of test queries: random queries and hot region queries. Each query has two conditional triple patterns with the predicates "ufom:score" and "ufom:conf". For MLO, we use 500 randomly generated queries to train the neural network (4-10-5-1). Figure 6.13 shows the evaluation results of all four approaches on the enterprise-scale dataset.
The average MSE is almost the same for HBO and MLO because both histogram and neural network capture the data distribution in the enterprise dataset well. As a result, the traffic is reduced by around 40% due to a good query plan, compared with BL and QC. Furthermore, all the approaches perform pretty consistently on both random queries and hot region queries. However, the QRT for HBO is larger than that for BL and QC (Figure 6.13(c)). This is because the number of triples is small in this dataset. As a result, the optimization time dominates the QRT instead of the execution time. Even in this case, MLO still has the smallest QRT. The reason is that the neural network is trained offline and the computational cost can be minimized during the optimization phase. As a result, using neural network can save more time in the optimization task than HBO, regardless of the data size. 104 (a) (c) (b) 0 0.0005 0.001 0.0015 0.002 0.0025 0.003 Random Queries Hot Region Queries MSE Query Distribution HBO MLO 0 5 10 15 20 25 30 35 40 45 50 Random Queries Hot Region Queries Avg. # of Triples Transferred in the Network Query Distribution BL QC HBO MLO (c) 0 0.1 0.2 0.3 0.4 0.5 0.6 Random Queries Hot Region Queries QRT (s) Query Distribution BL QC HBO MLO Figure 6.13: Evaluation on Enterprise-scale Dataset: (a)Mean Squared Error; (b) Net- work Traffic; (C) Query Response Time. 105 Chapter 7 Applications 7.1 Enabling Efficient Search in Oil and Gas Domain The petroleum industry is experiencing explosive growth in data [29]. In a single enter- prise, such data can be distributed across several different systems. These systems are used by different groups of domain experts who follow their own conventions to main- tain the data. However, the data stored in different systems may not be completely independent and may have potential correlations. For example, a petroleum company usually maintains different software systems for real-time monitoring and maintenance. In this case, the information regarding a specific equipment, say a pump, may appear in both of these systems, providing different context. When a pump fails, it is natural for a user to query its information from different systems at the same time. If these systems are not designed following a consistent set of rules, it becomes difficult to retrieve the information of the same pump from two different systems. In this thesis, we focus on integrating such systems in order to make them com- patible for queries spanning multiple systems. We utilize petroleum industry standards set by the Professional Petroleum Data Management Association (PPDM) 1 to guide us in the integration. The PPDM data model contains data schemas covering almost all aspects of petroleum industry. Based on these schemas, we build ontologies for inter- nal representations of all interrelationships and dependencies across various systems by 1 https://ppdm.org/ppdm/ 106 employing semantic-based data integration techniques. Having this integration frame- work in place, when a query is issued, all relevant information stored across the systems can be retrieved using such ontologies. In this thesis, we demonstrate how this integra- tion framework can guide data architects in developing more compatible data schemas. Finally, we evaluate our approach using the schemas from software systems widely used in petroleum industry. We compare our method with several baseline approaches. The results show that our methods can return complete query answers. 
7.1.1 Introduction In the petroleum industry, data and information have been increasing dramatically over last several years [29]. Such data not only grows in depth but also in breadth. A petroleum enterprise may only need to maintain a few systems to monitor the oil pro- duction process in the early ages. However, with the development of oil and gas related technologies, more and more factors come to play into the larger picture. For example, a real-time system is needed to collect the data from sensors deployed in the environ- ment. It monitors and controls industrial, infrastructural, or facility-based processes. A maintenance system is also useful to keep the plant, equipment, and facilities available, reliable, and safe. After all those basic components are set up, companies may want to improve their production. A well information system can improve the bottom line for any oil and gas production operation. It can help enterprises better achieve their financial goals and objectives. One important aspect raised in recent years is the communication among employees. A social networking system is in need to increase the information exchange and improve the communication quality among all employees. The above systems are just the tip of the iceberg of the whole oil and gas production framework. Companies usually receive their products from different vendors. Another example is the development of the standards set by the Professional Petroleum Data Management 107 Association (PPDM). In its data model version 1, it only covers 5 subject areas for the petroleum production. But now, in its latest version (PPDM 3.8 Data Model), 53 subject areas are included. Looking at the changes up to now, we can get a clearer picture of how greatly the information is increasing. However, those systems are not completely independent and they have the potential for correlation. For example, a pump’s information may appear in different systems providing different context. When a pump fails, real time monitoring system first cap- tures this event. After receiving the signal, the maintenance team needs to find out the specifics of this pump in the maintenance system. Then the detailed information, such as the employee responsible for this pump, can be retrieved by equipment team. Fur- ther, from the social networking system, some experts can be found to fix this problem. Since those systems come from different vendors, their data schemas follow different name spaces and conventions. In order to make information exchangeable, a translator needs to be built for each pair of systems. As we can see in Fig. 1.1 in Chapter 1, the sys- tems currently only allow point-to-point communication and information is exchanged among each pair of systems. It will become quite inefficient when the number of sys- tems increases. Thus, it is very necessary to integrate those systems, in order to reduce the efforts in compiling and translating the data. In order to solve above problem, a simple idea is to integrate those subsystems. A large amount of research has been done in information integration and its related topics. Some of the main problems are listed as follows: 1. Naming conflicts: two attributes which have different names are semantically equivalent (i.e., pump code and pump id), or two attributes are semantically unrelated but have the same names (i.e., different dates). In this case, the mapping is usually established by considering the context and semantics. 2. 
2. Data scaling conflicts: two attributes with similar semantics are represented using different units (e.g., meters and feet). In this case, a lookup table is usually used to allow operations among those attributes. 3. Data representation conflicts: two semantically similar attributes have different data types (e.g., char and int). A conversion function is needed to map between such attributes. For example (Fig. 7.1), the attribute CAI in an active directory is actually equivalent to UPDATEDBY in a maintenance system, but it is hard to map them using only syntactic information. The same situation arises for the Related Work Order and EVENT attribute pair; this pair may also have different data types (e.g., integer and varchar). In order to solve this problem, semantic information is required.

Figure 7.1: Schema interoperability problem

We observe that many standards are becoming increasingly important for information integration tasks [81, 65, 60]. Such standards can be considered guidance for subsystem integration. One example is the PPDM data model, a robust relational data model ideal for Master Data Management strategies and for business-focused application development. Its latest version (3.8) contains data schemas covering 53 subject areas in the petroleum industry. Another example is ISO 15926 (https://www.iso.org/standard/29557.html), a standard for data integration, sharing, exchange, and hand-over between computer systems in the oil and gas production process; it is also a standard for data modeling and interoperability using the semantic web. A third example is MIMOSA, the Operations and Maintenance Information Open System Alliance, which provides a series of interrelated information standards. The Common Conceptual Object Model (CCOM) provides a foundation for all MIMOSA standards, while the Common Relational Information Schema (CRIS) provides a means to store enterprise operations and maintenance information. MIMOSA also provides metadata reference libraries and a series of information exchange standards using XML and SQL.

Some research has been done on using standards to guide information integration. For example, PPDM standards can enhance integration of business processes and data [81], and ISO 15926 can achieve better data interoperability between different systems in oil and gas production [65]. More generally, a gold standard is always useful for data integration [60]. In this thesis, we demonstrate how the PPDM data model can guide data integration among distributed subsystems.

7.1.2 Data Model

We use the latest version of the PPDM data model, PPDM 3.8, in this thesis. It is a rich data model covering 53 subject areas in the petroleum industry and is ideal for a master data management solution. One of its subject areas is Work Order, which contains the schemas for work order information. In the maintenance information system, there is also a subject area named Work which is highly related to the Work Order area in PPDM. Another example is the WELLS subject area in PPDM, which has potential relations with the well information system; it also has some real-time capability for processing well data. Fig. 7.2 shows a high-level architecture of the PPDM 3.8 data model. A table in PPDM which models work orders is used as an example. From its schema we can find an attribute WORK_ORDER_ID, which is the identifier of a work order, and an attribute ROW_CHANGED_BY, which gives the id of the employee who last updated the record.
Thus, the ROW_CHANGED_BY attribute can be mapped to attributes in at least two different subsystems (i.e., maintenance and active directory) in order to build the linkage between those systems.

Figure 7.2: PPDM Data Model 3.8 Architecture

Next, let us investigate a scenario that illustrates how information flows among different systems (Fig. 7.3). Suppose a pump fails; the query is to find an expert who can fix the problem. First, a pump failure signal is detected in the well monitoring system. The pump id (e.g., PID234) is sent to the centralized equipment repository, where detailed information related to that pump can be found (e.g., status and diagnosed by). Furthermore, the maintenance record for that pump is stored in the maintenance system. From that record, we can find the employee who has repaired that specific pump before. We can then retrieve the detailed information of that employee, such as email, from the active directory using his/her id (e.g., CAI). Another possible query interpretation is that, instead of retrieving an employee's information, the pump model can be discovered in the maintenance records; based on the model, an expert recommendation system in a social network can be used to find an expert on this specific model.

Figure 7.3: A scenario of how information flows

As mentioned before, the current setup only allows point-to-point communication, with information exchanged between each pair of systems. Thus, it is not efficient to integrate different systems when the number of systems becomes very large. In this thesis, we propose a solution to this problem with the help of PPDM.

ISO 15926, titled "Industrial automation systems and integration of life-cycle data for process plants including oil and gas production facilities", is a standard for data integration, sharing, exchange, and hand-over between computer systems. It is a semantic model containing an interconnected set of ontologies. ISO 15926 is largely used by the petroleum industry as an upper ontology and data integration standard, and aims to reduce the time, error, and uncertainty caused by human information exchange. ISO 15926 has eleven parts; the contents of the parts that concern our use are as follows. We are mainly concerned with Part 2, which contains the actual data model (schema), and Part 4, which contains the reference data (instances).

Part 1 - Introduction: information concerning engineering, construction and operation of production facilities is created, used and modified by many different organizations throughout a facility's lifetime.

Part 2 - Data Model: a generic 4D model that can support all disciplines, supply chain company types and life cycle stages, regarding information about functional requirements, physical solutions, types of objects and individual objects as well as activities.

Part 3 - Reference data for geometry and topology.

Parts 4, 5, 6 - Reference Data: the terms used within facilities for the process industry.

Several technology vendors and standardization organizations, as well as oil and gas enterprises, are using ISO 15926 for transformation, compliance and integration within their systems. However, there is no current method of achieving information integration between PPDM 3.8 and ISO 15926 on a generic basis without compromising on features of either standard.
7.1.3 Information Integration

Semantic heterogeneity may arise when integrating different systems. At the schema level, it can be a naming conflict. For example, the pump identifier is recorded in the attribute Pump in a well monitoring system but in the attribute Pump_ID in a maintenance system. Conflicts may also exist in data types; for example, the equipment code is of varchar type in one system but may be an integer in another. Another important aspect of schema-level heterogeneity is the variety of units of measure. For example, a temperature value can be recorded in Celsius or Fahrenheit. Finally, constraints may vary across systems for the same attribute: an attribute which allows null may have a non-null corresponding attribute in another system. Besides schema-level heterogeneity, the data content level should also be considered. For example, assume that an expert needs to be found to fix a specific pump. First, a pump failure signal is triggered in a real-time monitoring system, but no contact information is stored there. Then, the information in a maintenance system needs to be retrieved; specifically, a contact person related to that pump is recorded in the maintenance system. In order to further extract the detailed information of that person, the active directory should be used. Using that person's code, we can retrieve all contact information related to him/her. Finally, the person can be reached via this contact information.

A considerable amount of research has been done on the schema matching problem. The approaches can be classified into two main categories: schema-level approaches and instance-level approaches. In schema-level approaches, the matching can be established based on the similarity between attribute names [6, 9] and attribute descriptions. Constraints can also be used to determine the similarity of schema elements [48]. In instance-level approaches, the data can provide more insight into the meaning of schema elements. [32] proposes a DataGuide approach which uses patterns in the data to identify similar attributes. [91] presents an approximate schema graph which is constructed automatically from XML documents. Other approaches perform instance-level matching using neural networks and machine learning techniques [7, 24, 54]. A hybrid approach which combines instance-level and schema-level matchers is proposed in [27]. However, in our scenario different systems can follow different naming conventions and schema definition rules, and it is hard to derive a good similarity measure using only name and data type information. Moreover, two completely different attributes may have high similarity when it is calculated based only on their names and data types. In order to solve this problem, semantic information needs to be considered. In this thesis, we adopt a semantics-based data integration approach to build an ontology repository which contains a set of mapping ontologies among the schemas of different systems. Those mapping ontologies can be considered the linkage among subsystems.

There has been some work on semantics-based data integration. In [94], a domain ontology is used to guide schema matching: two schemas are mapped to a domain ontology, and the match between them is constructed based on the relationships inherent in that ontology.
A combination of search and learning techniques is used in [23, 25], and mining techniques are employed in [38]. In [23], the authors model the schema matching problem as a search in a very large or infinite match space; their approach employs a set of searchers which discover specific types of complex matches in order to search effectively.

The World Wide Web Consortium (W3C) provides a variety of standards that enable people to create data stores on the web, build vocabularies, and write rules for handling data. Such standards include RDF, OWL and SPARQL. RDF is a standard model for data interchange on the web, while OWL is an extension of RDF designed to represent rich and complex knowledge about things, groups of things, and the relations between things. SPARQL is a standard query language for semantic web data such as RDF and OWL. There are also a number of active groups in W3C which focus on different aspects of the semantic web. Another application of the semantic web is ISO 15926, a standard for data integration, sharing, exchange, and hand-over between systems in the oil and gas industry. This ISO standard is written as an OWL file, providing a well-structured ontology for oil and gas production operations. [12] introduces ISO 15926 into an integration framework; however, ISO 15926 is only a component within that larger system. [65] presents a clearer way to use ISO 15926 to guide information exchange and reduce interoperability complexity.

In this thesis, we use PPDM as a standard to guide data integration. Recently, some research has been done on PPDM. [18] observes that a comprehensive production optimization work process could require different types of information (e.g., maintenance schedules and real-time field measurements), and that such information is covered by standards-related organizations such as PPDM; but no specific method is proposed for using PPDM in information integration among different systems. [66] examines a solution approach using PPDM standards as the enabling foundation for operational solutions across different domains such as drilling, planning and well enhancement. However, their method requires centralized storage, which is not always available. For example, the local systems may be distributed in different places and monitored by different people, and it is hard to re-integrate data which has already been generated. Also, some processing can benefit from distributed computing and storage. [10, 72] also state that some enterprises have already been using PPDM as the preferred mode for data management: it can be used in the business intelligence data layer, and it is standards-based, open and scalable. They recommend leveraging open standards organizations such as PPDM to encourage collaborative development of industry workflows, business rules and data models. However, none of the above works investigates how to use PPDM in schema-level integration and how the semantic web can be involved. To the best of our knowledge, we are the first to build mapping ontologies which link the schemas of distributed subsystems and the PPDM standards. In the future, we will also consider ISO 15926 and PRODML to enhance our integration system.

In order to generate the mapping ontology, four steps are needed, as shown below (Fig. 7.4).
Step 1: Collecting relational model information. Step 2: Generating a local ontology based on the relational model. Step 3: Generating mappings between areas in PPDM and the subsystems. Step 4: Generating mapping ontologies.

Figure 7.4: Workflow for generating the mapping ontology

1. Based on the relational model specification for both the standard (i.e., PPDM) and the subsystems (i.e., maintenance and well monitoring systems), an XML file is generated for each model which contains the detailed information for each attribute. Such information includes attribute name, table name, related areas, constraints (i.e., primary key, foreign key, nullable), data type, unit of measure and description. The relational model specification can usually be found in the corresponding system manual.

2. Based on the relational model for both the standard (i.e., PPDM) and the subsystems, the schema ontology for each model is constructed automatically (Fig. 7.5). Specifically, we consider each table as a class. We model a foreign key reference as an object type property and a non-foreign key attribute as a data type property. We also introduce subclass concepts in our ontologies. For example, if the primary key of a table is a superset of the primary key of another table, then a subclass relation is built between the classes corresponding to those two tables. After the schema ontologies are generated, the information derived in step 1 is merged into the ontology (e.g., descriptions).

Figure 7.5: Ontology Development from Data and Schema
Figure 7.6: PPDM-Ontology-Driven Schema Integration

3. Given all areas in both PPDM and the subsystems, area mappings between the standard and each subsystem are established. For example, the Work Order area in the PPDM standard is mapped to the Work area in the maintenance system. This area mapping process can be done automatically. In this step, two lexicons are used to assist the mapping: synonyms and aliases. They are constructed initially using open linked data, the knowledge in WordNet, as well as existing ontologies in the current system.

4. The mappings between classes in PPDM and the subsystems are derived using schema matching algorithms. If the attributes come from the same system (i.e., either the standard or a subsystem), foreign key constraints are simply applied to discover such mappings. If the attributes exist in different systems which have common areas, the similarity between the following fields is calculated: attribute name, table name, primary key, nullable, description, data type and UOM. If the overall similarity between two attributes is larger than a predefined threshold, the mapping is established (Fig. 7.6). The final mapping ontology has both attribute matching information and type/UOM conversion information. An example entry in the mapping ontology is shown below.

<owl:Class rdf:ID="EMPLOYEE_BA_ID">
  <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
    EQUIP_BA_ID
  </rdfs:label>
  <owl:sameAs rdf:resource="http://joaquin.usc.edu/d7i#UPDATEDBY"/>
  <owl:sameAs rdf:resource="http://joaquin.usc.edu/actdir#CAI"/>
</owl:Class>

In this example, mappings from the attribute EMPLOYEE_BA_ID in PPDM to the attribute UPDATEDBY in the maintenance system and the attribute CAI in the active directory are established.
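To make step 4 concrete, the following Python sketch shows one way the weighted field similarity and threshold test could be implemented and turned into owl:sameAs entries. It is illustrative only: the helper names, the weights, the use of difflib in place of a Levenshtein- and synonym-aware matcher, and the threshold value are assumptions, not the exact code used in this work.

# Minimal sketch (not the thesis implementation) of step 4: scoring candidate
# attribute pairs and emitting owl:sameAs mapping entries above a threshold.
# Helper names, weights, and the example threshold are illustrative assumptions.
from difflib import SequenceMatcher

def name_similarity(a, b):
    # Approximate string similarity; the thesis uses Levenshtein-based similarity
    # plus a domain-specific synonym lexicon.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def attribute_similarity(attr1, attr2, weights=(0.5, 0.2, 0.3)):
    """Weighted combination of name, data-type, and description similarity."""
    w_name, w_type, w_desc = weights
    s_name = name_similarity(attr1["name"], attr2["name"])
    s_type = 1.0 if attr1["type"] == attr2["type"] else 0.0
    s_desc = name_similarity(attr1.get("description", ""), attr2.get("description", ""))
    return w_name * s_name + w_type * s_type + w_desc * s_desc

def emit_mappings(ppdm_attrs, subsystem_attrs, threshold=0.8):
    """Yield owl:sameAs statements for attribute pairs whose score clears the threshold."""
    for a in ppdm_attrs:
        for b in subsystem_attrs:
            if attribute_similarity(a, b) >= threshold:
                yield (f'<owl:Class rdf:ID="{a["name"]}">'
                       f'<owl:sameAs rdf:resource="{b["uri"]}"/></owl:Class>')

ppdm = [{"name": "EMPLOYEE_BA_ID", "type": "varchar", "description": "business associate id"}]
local = [{"name": "UPDATEDBY", "type": "varchar", "uri": "http://joaquin.usc.edu/d7i#UPDATEDBY",
          "description": "employee who updated the record"}]
# A low threshold is used here only because this toy matcher lacks the synonym
# lexicon that would boost pairs such as EMPLOYEE_BA_ID / UPDATEDBY.
for triple in emit_mappings(ppdm, local, threshold=0.2):
    print(triple)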
Based on the mapping ontology, a SPARQL query modeling the previous scenario can be written as follows:

PREFIX d7i: <http://joaquin.usc.edu/d7i#>
PREFIX actdir: <http://joaquin.usc.edu/actdir#>
SELECT DISTINCT ?att_name
WHERE {
  ?ppdm_class owl:sameAs ?att_name .
  ?ppdm_class owl:sameAs d7i:UPDATEDBY .
}
LIMIT 100

Executing this query, we retrieve the attribute in the active directory which is mapped to the UPDATEDBY attribute in the maintenance system. Once we have the attribute name, we can further retrieve the detailed information related to that employee from the relational database.

7.1.4 Harmonization with ISO 15926

Our contributions consist of three parts (Figure 7.7). First, we generate a schema ontology for both PPDM and subsystems such as the maintenance system based on their relational models. Second, we build a mapping ontology between the PPDM and subsystem ontologies. Third, we harmonize the PPDM schema ontology with the ISO 15926 ontology. In this section, we elaborate on these three contributions in detail.

Generating semantic data models for PPDM and local subsystems. In this section, we use PPDM for illustration. Specifically, we take four steps to generate the schema ontology for the PPDM relational data model.

1. Generating XML files. We start from the PPDM relational data model documentation, written in PDF format. It provides detailed information on all tables and attributes in PPDM. Figure 7.8 shows the description of the table WORK_ORDER; information such as attribute name, attribute type and referenced table can be extracted. Based on the manual, we create an XML file which contains the information for all attributes, primary keys and foreign keys. Figure 7.9 shows an example of each of these three types.

Figure 7.7: Framework of our schema and mapping ontologies
Figure 7.8: An example table description in the PPDM document

2. Identifying classes. We consider each table in the relational model as a Class in the schema ontology. Figure 7.10 shows an example Class WORK_ORDER in OWL form. This class corresponds to the table WORK_ORDER in the PPDM relational model.

3. Identifying subClassOf relationships. We use the following rule to generate subClassOf relations: if the primary key of table A is a superset of the primary key of table B, then Class A is a subclass of Class B. For example, WORK_ORDER_ALIAS contains all the PK attributes of WORK_ORDER, so we model WORK_ORDER_ALIAS as a subclass of WORK_ORDER (Figure 7.11).

4. Identifying properties. We also model both data type properties and object properties in the schema ontology. Each attribute corresponds to one property. If the attribute refers to another table, it is converted to an ObjectProperty; if not, it is converted to a DatatypeProperty. Figure 7.12 shows some sample properties, including the definition of the datatype property for WORK_ORDER's FINAL_BILLING_DATE attribute and the WORK_ORDER_ALIAS_WOAL_WO_FK object property.

Figure 7.9: Attribute, PK and FK in the XML file
Figure 7.10: WORK_ORDER Class in OWL format
Figure 7.11: Subclasses of the WORK_ORDER Class
Figure 7.12: Property examples

Creating a mapping ontology for data integration. In this part, we create a mapping ontology between the PPDM ontology and the maintenance system ontology. In detail, PPDM has 37,730 attributes and the maintenance system has 18,360 attributes, so it is impossible to find the mappings manually. We therefore propose an automatic method involving three steps.

1. Generating the subject area mapping. All attributes in PPDM and the maintenance system can be classified into several subject areas.
The PPDM 3.8 data model contains 53 modules, while the maintenance system has 29 areas. We first map these areas manually. For example, EQUIPMENT in PPDM is mapped to equipment in the maintenance system (Figure 7.13).

2. Calculating the matching score. Given an attribute a_i in a subsystem schema, we calculate the matching score S_ij between it and each attribute a_j under the same subject area in PPDM. The purpose of introducing subject areas is to reduce the space of comparison. S_ij is defined as follows:

S_{ij} = \omega_1 Sn_{ij} + \omega_2 St_{ij} + \omega_3 Sc_{ij}    (7.1)

Here Sn_{ij} is the name similarity, calculated by computing the Levenshtein similarity between the two names; a domain-specific synonym lexicon is used. St_{ij} is the type similarity, which is equal to 1 when the types are the same. Sc_{ij} is the comment similarity, which is computed based on sentence similarity.

Figure 7.13: Subject Area Mapping

3. Generating property mappings. After we compute the matching score, we compare it with a pre-defined threshold p. If the score is larger than p, we create the corresponding entry in the mapping ontology. Figure 7.14 shows an example entry in the mapping ontology. We use owl:sameAs to build the linkage between the two properties.

Figure 7.14: An example mapping

Harmonization between ISO 15926 and PPDM. Since PPDM focuses more on the completeness of attributes, it lacks hierarchy in its ontology. We also observe that ISO 15926 has a well-defined ontology which contains several hierarchical structures and instances. So we use ISO 15926 Part 2 and Part 4 to harmonize and improve the PPDM ontology. First, we create a mapping of classes between ISO 15926 Part 2 and PPDM. For example, ClassOfInanimatePhysicalObject in ISO is mapped to EQUIPMENT in PPDM, while ClassOfActivity in ISO is mapped to WELL_ACTIVITY in PPDM. Second, a large number of individuals exist in ISO 15926 Part 4. For example, different types of activities are defined as individuals of the class ClassOfActivity (Figure 7.15). We model those individuals as classes in the PPDM schema ontology in order to provide an improved schema ontology with hierarchical structures; those classes are subclasses of WELL_ACTIVITY (Figure 7.16). In the query processing phase, more specific queries can be registered on such ontologies.

Results and Discussion. We design three sample SPARQL queries on our ontologies. The first two queries are registered on the mapping ontologies; they return all classes which are mapped to COMPLETE_DATA and BUSINESS_ASSOCIATE_ID in PPDM. The results are shown in Figure 7.17. The third query is on the PPDM schema ontology; it returns all the subclasses of WELL_ACTIVITY in PPDM (Figure 7.18).

Figure 7.15: Individuals of ClassOfActivity

In this part of the thesis, we provided three major contributions. First, we provided an automatic technique for building a semantic model of the PPDM 3.8 Data Model in the form of an ontology extracted from the relational databases, which existing tools were not found to achieve effectively. Second, we created a mapping ontology for mapping between PPDM and local maintenance subsystems. Finally, we improved this mapping by integrating information from ISO 15926 which, through ontology harmonization, retains features of both standards and still facilitates efficient interoperability between them.
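As an illustration of the schema-ontology generation rules described in Section 7.1.4 (tables become classes, a PK superset yields subClassOf, foreign keys become object properties, and other attributes become datatype properties), the following Python sketch builds a small RDF graph with rdflib. The table structures, namespace URI, and property naming are assumptions for illustration, not the thesis's actual tooling.

# Illustrative sketch of the schema-ontology generation rules in Section 7.1.4.
# Input table descriptions and the namespace URI are hypothetical.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

PPDM = Namespace("http://example.org/ppdm#")  # hypothetical namespace

def build_schema_ontology(tables):
    """tables: list of dicts with 'name', 'pk' (set), and 'columns' (dict col -> referenced table or None)."""
    g = Graph()
    g.bind("ppdm", PPDM)
    # Rule: each table becomes an OWL class.
    for t in tables:
        g.add((PPDM[t["name"]], RDF.type, OWL.Class))
    # Rule: if PK(A) is a proper superset of PK(B), then A is a subclass of B.
    for a in tables:
        for b in tables:
            if a is not b and a["pk"] > b["pk"]:
                g.add((PPDM[a["name"]], RDFS.subClassOf, PPDM[b["name"]]))
    # Rule: FK columns become object properties; other columns become datatype properties.
    for t in tables:
        for col, ref_table in t["columns"].items():
            prop = PPDM[f'{t["name"]}.{col}']
            if ref_table is not None:
                g.add((prop, RDF.type, OWL.ObjectProperty))
                g.add((prop, RDFS.range, PPDM[ref_table]))
            else:
                g.add((prop, RDF.type, OWL.DatatypeProperty))
            g.add((prop, RDFS.domain, PPDM[t["name"]]))
    return g

tables = [
    {"name": "WORK_ORDER", "pk": {"WORK_ORDER_ID"},
     "columns": {"WORK_ORDER_ID": None, "ROW_CHANGED_BY": None}},
    {"name": "WORK_ORDER_ALIAS", "pk": {"WORK_ORDER_ID", "ALIAS_ID"},
     "columns": {"WORK_ORDER_ID": "WORK_ORDER", "ALIAS_ID": None}},
]
print(build_schema_ontology(tables).serialize(format="turtle"))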
Figure 7.16: Subclasses of WELL_ACTIVITY in PPDM
Figure 7.17: Query result for the mapping ontology
Figure 7.18: Query result for the schema ontology

7.2 Integration of Heterogeneous Web Services for Social Networks

Event-based online social networks are Internet-based services that enable users to participate in real-world experiences together. Event-based social networks can be created by a community of end-users based on their own interests in specific types of events and sources of event information. We propose a method to create such event-based social networks through integration of existing online information sources of events using a Semantic Web framework. In order to match people with common interests in such activities so that they can self-organize into a social network, we integrate information related to event schedules, ticket purchases, and group attendance from multiple heterogeneous online sources. The Semantic Web framework is used to represent these heterogeneous datasets, and unstructured online data is converted into ontologies. Links between event information in different sources are discovered using both the syntactic similarity and the semantic similarity between ontology classes. For event recommendation, we use an approach based on Latent Dirichlet Allocation (LDA) [8] over the space of topics related to each event and to user profiles. This enables the event-based social network to recommend friends based on shared interest in an event: online friendship is established after mutual attendance of the same event. We demonstrate this approach with EasyGo, a web-based mashup application which integrates information about events such as concerts, sports, and theater, as well as tickets and group purchases, from multiple online sources.

7.2.1 Introduction and Background

Event-based online social networks are Internet-based services that enable users to participate in real-world experiences together. With the increasing availability of location-based information and services, there has been a corresponding increase in the variety of event-based online social services. For instance, Facebook has realized the importance of social events, and "Events" is now listed as one of four basic application types on its front page (the others being News Feed, Messages, and Photos). However, such event-based networks are typically created in a "top-down" approach, in that the complete online system is designed with a specific framework for representing events and for subscribing end-users to these events. In contrast, we envision an approach where event-based social networks can be created by a community of end-users based on their own interests in specific types of events and sources of event information.

We propose a method to create such event-based social networks through integration of existing online information sources of events using a Semantic Web framework. We focus on activities which enable participants to share a real-world experience, such as attending a concert, theater, or sports event. In order to match people with common interests in such activities so that they can self-organize into a social network, we integrate information from heterogeneous sources related to event schedules, ticket purchases, and group attendance. Specifically, instead of retrieving the event and ticket information from one source (such as Plancast (http://www.plancast.com/) and Yahoo!
Upcoming (http://upcoming.yahoo.com/)), our approach integrates information from multiple sources (for instance, ticket prices for the same event from different sources such as StubHub (http://www.stubhub.com/), Barry's Tickets (http://www.barrystickets.com/), and Ticketmaster (http://www.ticketmaster.com/)).

Challenges: Building such a system requires addressing three main technical challenges. First, information from heterogeneous sources has to be represented in a uniform manner that enables the system to reason over the available choices and select the subset of the information that is most relevant to a user. We propose to use the Semantic Web framework to provide such a representation. Since few existing sources present their information in Semantic Web standards, we present a method for converting unstructured data collected from the Web into structured information, specifically ontologies. An ontology represents information as a set of concepts within a domain and a set of relations defined on them. Second, links between event information in different sources have to be discovered, i.e., semantic entities representing the same physical entities have to be identified. In our approach, we use both the syntactic similarity and the semantic similarity between ontology classes to discover such links. The linkages are represented in the ontology, which enables the system to use Semantic Web standards to make use of the discovered relationships between the data sources. Third, an event recommendation solution has to be included in the system. Event recommendation in our context refers to suggesting to a user both other users who may be interested in the same event and other events which may be of interest. Most online social networks (including Facebook) recommend friends based on the number of mutual friends of a user's friends. An event-based social network opens the possibility of friend recommendation based on shared interest in an event; online friendship is established after mutual attendance of the same event. We use a similarity-based approach built on Latent Dirichlet Allocation (LDA) over the space of topics related to each event.

We illustrate our approach of building an event-based social network from existing online sources with EasyGo, a web application (mashup) which integrates event content from multiple websites and recommends events and fellow attendees to a user. A preliminary demonstration of EasyGo was presented in [99]; here we give a more detailed description and analysis of the system. In particular, we describe the ontology that forms the framework for integrating heterogeneous information sources and the knowledge base for the system, the approach based on topic modeling for event recommendation, and the user interface.

Event-based social networks (EBSNs) enable both online and offline social interactions among users. An EBSN is formally defined as a heterogeneous network $G = \{U, A_{on}, A_{off}\}$, where $U$ represents the set of users (nodes), $A_{on}$ corresponds to the set of online social interactions (edges), and $A_{off}$ represents the set of offline social interactions (edges) [56]. Meetup (http://www.meetup.com/) is a popular event-based social networking service which has already been attracting research interest in large-scale online and offline social data analysis.
[77] investigated the social behaviors of users participating in events on Meetup and concluded that they share similar social structures. However, Meetup focuses more on community-based social ties than on event-based ones; most events on Meetup are self-organized activities. In real life, live entertainment events such as concerts and sports games are also pervasive. Based on this observation, we target such events so that users can establish social ties by joining the same events. Offline social activities are also associated with check-in actions, and such check-ins can indicate social interactions to some degree [15, 64]. At the same time, location information can be used to infer social ties [78]. In social networking services, a recommendation system automatically identifies information relevant to a given user. This capability is used for influence analysis and targeted online marketing. [69] presented a model which combines geographical information of users and the content of items to achieve better rating prediction. [92] considered users' co-tagging behaviors and added the similarity relationship to the graph to improve recommendation performance. [43] proposed a social data integration framework which can facilitate the prediction of health conditions. Event recommendation plays an important role in establishing social connections. In our previous work [100], we proposed a recommendation system based on the similarity of an event and a user in terms of topics; we also considered social ties and attendance history to increase recommendation accuracy. [96] proposed a new group recommendation method based on event-based social networks. [70] presented a Bayesian probability model which considers heterogeneous social impact and implicit feedback characteristics in event recommendation.

7.2.2 System and Methodology

The system architecture is shown in Fig. 7.19. A Triple Store (TS) provides the framework for event information integration. All information related to events and users is stored in TS in the form of an ontology. In order to construct TS, we build three components in the system: the Information Extractor (IE), the Ontology Generator (OG), and the Ontology Integrator (OI). IE converts unstructured raw data gathered on the web into structured information. This information covers basic aspects of events such as name, time, venue, and performers. OG takes this information as input and generates the semantic representation for the events in each source. In order to accommodate the heterogeneity existing in the event information, OI discovers the alignments between events in different sources. Such alignments are created in the form of an ontology and stored in TS. TS not only provides the knowledge base for the EBSN but also carries the semantic representation of the EBSN. We describe these three components in greater detail below.

Figure 7.19: System Architecture

7.2.2.1 Information Extractor

Data Sources: In order to acquire information about various upcoming events, we consider several online ticket trading marketplaces as our event sources. Such marketplaces include the following. StubHub is an online marketplace owned by eBay that provides services for buyers and sellers of tickets for sports, concerts, theater, and other live entertainment events. Barry's Tickets is an online ticket provider for all sports, concert, theater, and exclusive event tickets; it also provides tickets to special events that no other website offers.
Such marketplaces include: StubHub, an online marketplace owned by eBay, provides services for buyers and sellers of tickets for sports, concerts, theater, and other live entertainment events. Barry’s Tickets, an online ticket provider for all sports, concert, theater, and exclusive event tickets. They also provide tickets to special events that no other website offers. 135 Ticketmaster, an online ticket selling website for various sports, concert, and theater events. In contrast to the previous two websites, tickets on Ticketmaster are usually listed officially by organizers of the events. Heterogeneity exists in these marketplaces. For example, StubHub has the largest number of tickets and, usually, the cheapest ones [30]. Barry’s Tickets is the largest in terms of the number of events since it has some special events. Ticketmaster has the most tickets for those events which are just published. Moreover, the ticket price is fixed at the official price set by event organizers. Such heterogeneity is one important reason why we integrate information from different marketplaces. Our system is capable of supporting more than three platforms. Information Extraction: We use Scrapy 9 for the web crawling task. We recur- sively crawl the basic information of events such as name, location, time, and purchase link. We output this information in a .json file. For example, an event named “Carrie Underwood” discovered on StubHub is represented as, f"City": "Abbotsford", "Name": "Carrie Underwood", "Venue": "Abbotsford Entertainment and Sports Centre", "State": "BC", "API Link": "www.stubhub.com/.../4188319/", "Link": "www.stubhub.com/...4188319/", "Time": "Thu, 05/23/2013 7:30 p.m. PDT"g API Link has the link to all deals of that event (for the system to retrieve deal infor- mation) and Link has the link to the webpage of that event (for users to purchase the tickets). For efficiency, we only maintain basic event information. The information of deals is retrieved real-time in the query execution phase instead of being stored in the 9 http://scrapy.org/ 136 triple store. Once IE generates this structured information, OG takes it as input for the ontology generation process. 7.2.2.2 Ontology Generator In order to generate the semantic representation of events, we use Karma 10 Fig.7.20 for our ontology generation task. Karma is a semantic web tool that enables users to quickly and easily convert data from a variety of data sources including databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs to ontology in the form of RDF. In RDF, each piece of knowledge is represented as a triple. For example, the StubHub event ontology has 85,392 triples. After we generate the ontologies for all event sources, we store them in TS. We use OpenLink Virtuoso 11 as our triple store server. OpenLink Virtuoso is a middleware and database engine hybrid that combines the functionality of a traditional RDBMS, ORDBMS, virtual database, RDF, XML, free-text, web application server and file server functionality in a single system. 7.2.2.3 Ontology Integrator The information in TS generated from different sources is still isolated at this stage. The goal of OI is to match the information from heterogeneous event sources belonging to the same event. Besides the coverage of events and tickets, different marketplaces also have different naming conventions and information representations. For example, “location” and “time” (StubHub) are named as “venue” and “date” in Barry’s Tickets. 
To solve this problem, we consider both syntactic similarity and semantic similarity.

Figure 7.20: Using Karma to generate the semantic representation

For syntactic similarity, the Levenshtein distance [52] is used as the distance metric between the names of the two properties. Formally, the syntactic similarity $s_{sy}(E_1, E_2)$ between two properties $E_1$ and $E_2$ is

$s_{sy}(E_1, E_2) = 1 - \frac{Lev(E_1, E_2)}{\max(|E_1|, |E_2|)}$    (7.2)

where $E_i$ denotes the name of the $i$th property, $|E_i|$ is the length of the name string, and $Lev(w_1, w_2)$ is the Levenshtein distance between two words $w_1$ and $w_2$. For semantic similarity, we first tokenize the names of both properties. Denote by $E_i.TOK_j$ the $j$-th token in the name of property $E_i$. Then we retrieve the synset of each token using open linked data (WordNet, http://wordnet.princeton.edu/). Denote by $syn(w)$ the WordNet synset of a word $w$, and calculate the Jaccard similarity (http://en.wikipedia.org/wiki/Jaccard_index) on the synsets of each pair of tokens. Finally, we return the average-max Jaccard similarity as the semantic similarity:

$s_{se}(E_1, E_2) = \frac{1}{n} \sum_i \max_j Jac(syn(E_1.TOK_i), syn(E_2.TOK_j))$

Here, $Jac(\cdot)$ represents the Jaccard similarity between two sets and $n$ is the number of tokens in $E_1$'s name. The final similarity is a weighted sum of the syntactic and semantic similarities:

$s(E_1, E_2) = \omega_{se}\, s_{se}(E_1, E_2) + \omega_{sy}\, s_{sy}(E_1, E_2)$    (7.3)

The weights are pre-defined as system parameters; in this thesis, $\omega_{se} = \omega_{sy} = 0.5$. We will investigate how the weights affect performance on different datasets in future work. For example, consider two properties named venue and location. The syntactic similarity is $s_{sy} = 1 - 8/8 = 0$ and the semantic similarity is $s_{se} = 0.83$, so $s(venue, location) = 0.42$. If the similarity is larger than a threshold, say 0.3, then we consider the properties matched.

Next, we consider instance-level matching, which retrieves identical events. The following steps are used to discover the matches:

1. Consider a pair of events as a potential match only if they have identical location and time information.

2. For each pair of potentially matched events $(e_1, e_2)$, calculate the event similarity $s(e_1, e_2)$ as

$\omega_{Jac}\, Jac(TOK(e_1), TOK(e_2)) + \omega_{SWS}\, SWS(e_1, e_2)$

where $Jac(TOK(e_1), TOK(e_2))$ is the Jaccard similarity between the token sets of the names of $e_1$ and $e_2$, and $SWS(e_1, e_2)$ is the Smith-Waterman similarity [84] between the names of $e_1$ and $e_2$. The Jaccard similarity eliminates the negative effect of word ordering. For example, consider an event named "Taylor Swift and Ed Sheeran" and another event named "Ed Sheeran and Taylor Swift"; the Jaccard similarity is 1 in this case. The Smith-Waterman similarity identifies local sequence alignments. For example, "Taylor Swift" and "Taylor Swift featuring Ed Sheeran" have a higher Smith-Waterman similarity than Levenshtein similarity. As with the property similarity, the weights are pre-defined as system parameters; in this thesis, both are 0.5. We also use open linked data to enrich the event ontology: DBpedia (http://dbpedia.org/) and LinkedGeoData (http://linkedgeodata.org/) are used to expand the spatial information of events. Once we discover all matches, we store them in the form of triples in TS.
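The property similarity above can be prototyped in a few lines of Python. The sketch below is a minimal illustration of Equations 7.2 and 7.3, assuming NLTK's WordNet corpus as a stand-in for the open-linked-data synset lookup; the tokenizer, weights, and threshold are illustrative assumptions rather than the system's actual configuration.

# Minimal sketch of the property similarity in Equations 7.2-7.3.
# Requires NLTK with the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def syntactic_similarity(e1, e2):                      # Eq. 7.2
    return 1 - levenshtein(e1, e2) / max(len(e1), len(e2))

def synset_names(token):
    return {lemma.name() for syn in wn.synsets(token) for lemma in syn.lemmas()}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def semantic_similarity(e1, e2):                       # average-max Jaccard over synsets
    toks1, toks2 = e1.lower().split("_"), e2.lower().split("_")
    return sum(max(jaccard(synset_names(t1), synset_names(t2)) for t2 in toks2)
               for t1 in toks1) / len(toks1)

def property_similarity(e1, e2, w_se=0.5, w_sy=0.5):   # Eq. 7.3
    return w_se * semantic_similarity(e1, e2) + w_sy * syntactic_similarity(e1, e2)

# Two properties are considered matched if the score exceeds the threshold (e.g., 0.3).
print(property_similarity("venue", "location"))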
Suppose we have $n$ event sources (e.g., StubHub, Barry's Tickets, etc.). In order to find all identical events, a brute-force way is to calculate the similarity between two events from all source pairs; the time complexity is $O(n^2)$. However, since we know Barry's Tickets has the largest number of events, we only consider event pairs in which one event comes from the Barry's Tickets source. As a result, the complexity is reduced to $O(n)$.

Figure 7.21: Attending an Event

7.2.2.4 Event Recommendation for Event-based Social Networks

Figure 7.21 illustrates the process of building a social network from the information stored in TS, as shown in the following example. Kevin is an existing user in the system. He searches for events which are interesting to him by typing keywords such as "Lakers vs Heat", or selects one of the events in his personalized recommendation list. Once an event is identified, the system automatically retrieves ticket deals from different data sources such as StubHub and Barry's Tickets. For example, StubHub provides an API link in the format http://www.stubhub.com/ticketAPI/restSvc/event/event_id, which has detailed information on each deal for a specific event. Kevin can either initialize a new group on a deal or join a group with a vacancy. The information on existing groups for an event is retrieved from TS. If Kevin creates a new group, the information related to this group, such as creator, members, number of tickets and number of vacancies, is generated in TS. If Kevin chooses to join an existing group (e.g., Group 1 in Fig. 7.21), friendship between Kevin and the other group members is established and stored as triples in TS; at the same time, the triples corresponding to that group are updated. Each member is then directed to a payment page on a specific online ticket website once the group is full. After all payments are settled, the members proceed to the event and potentially form a social circle in real life.

Social network formation requires an event recommendation capability. We adopt the machine learning approach proposed in our previous work [100], specifically the Similarity Based Approach (SBA). We use Latent Dirichlet Allocation (LDA) to extract the topic distribution for each user and each event, and calculate the similarity between their distributions. The recommended events are selected from the events with high similarity to a user. Each user's information for generating the topic distribution is extracted either from the explicit profile the user provides in the registration phase or from an implicit profile drawn from an existing service such as Facebook. Event profiles are obtained from the event ontology. We adopt cosine similarity (Equation 7.4), although other approaches can be used. $S_1(u_i, e_j)$ is the recommendation score of event $e_j$ for user $u_i$:

$S_1(u_i, e_j) = \cos(\vec{u}_i, \vec{e}_j) = \frac{\vec{u}_i \cdot \vec{e}_j}{\|\vec{u}_i\|\,\|\vec{e}_j\|}$    (7.4)

However, due to privacy concerns, some users may not want to expose too much personal information to the public. Also, for some event sources the event descriptions are not complete. In these cases, SBA may not perform well in terms of precision. As future work, we will consider other methods such as the Relationship Based Approach (RBA), History Based Approach (HBA), and Hybrid Approach (SRH), which take friendship and history into consideration, once the population in our system is large enough. More details about these approaches are provided in [100].
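To make the recommendation step concrete, the following sketch pairs an off-the-shelf LDA implementation (gensim) with the cosine score of Equation 7.4. The toy corpus, profile text, and topic count are hypothetical stand-ins; the thesis's actual topic models and data differ.

# Minimal sketch of the Similarity Based Approach (Eq. 7.4) using gensim's LDA.
# The toy documents and parameters below are illustrative assumptions only.
import numpy as np
from gensim import corpora, models

def topic_vector(lda, dictionary, text):
    # Infer the full topic distribution for a piece of profile or event text.
    bow = dictionary.doc2bow(text.lower().split())
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    return np.array([p for _, p in sorted(dist)])

def recommendation_score(u_vec, e_vec):
    # Eq. 7.4: cosine similarity between user and event topic vectors.
    return float(u_vec @ e_vec / (np.linalg.norm(u_vec) * np.linalg.norm(e_vec)))

# Toy corpus built from user profiles and event descriptions (hypothetical text).
docs = ["basketball lakers nba fan", "concert pop music taylor swift",
        "lakers vs heat basketball game", "ed sheeran pop concert tour"]
texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

user_profile = "basketball nba lakers"
for event in ["lakers vs heat basketball game", "ed sheeran pop concert tour"]:
    score = recommendation_score(topic_vector(lda, dictionary, user_profile),
                                 topic_vector(lda, dictionary, event))
    print(event, round(score, 3))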
7.2.3 Application

Figure 7.22: The front-end of EasyGo: (a) Interface for event search and recommendations. (b) A list of searched events. (c) A map of searched events. (d) Group and deal information for an event.
Figure 7.23: Registration in EasyGo: (a) Traditional registration. (b) Facebook registration.

EasyGo is a web-based application developed on top of the proposed Semantic Web framework. The front-end of EasyGo is shown in Figure 7.22(a). A user has two options to find interesting events (e.g., sports and concerts): he/she can either search for events by keywords or select an event in the recommendation list. For example, a user retrieves future events by providing keywords in the search box. After the keywords are received by the system, the relevant events stored in TS are extracted. Specifically, the search engine explores the name, description, and location properties to find the matching events. Then, a list of events is displayed in chronological order (Figure 7.22(b)). The matching events can also be shown on a map by choosing the "Map" option (Figure 7.22(c)); we enabled the Bing Maps API in EasyGo to give users a geospatial filter on the events.

By selecting a specific event, say an NBA game, the user is directed to the event page (Figure 7.22(d)), where two tables are displayed. The first one shows the group information for this event, which is generated based on the triples in TS. The information includes the total number of tickets for the deal, the price per ticket excluding service charge and delivery fee, the ticket source, the creator of the group, the number of vacancies, and the amount of money that can be saved by joining this group. Once a user joins a group by clicking the "Join" button, both the group ontology and the user ontology in TS are updated: the user information is linked to the group in the group ontology, and friendship is established among the group members in the user ontology. At the same time, an email is sent to the creator, who is responsible for distributing the tickets to all group members. The second table shows all deals for which no group has been formed. The deal information is retrieved by accessing the event API link stored in the event ontology; EasyGo crawls all valid deals for that event through the link. All information, including the total number of tickets, the price per ticket, and the ticket source, is shown in the table. A user can initialize a group on such a deal by clicking the "Create" button. Similarly, the group ontology in TS is updated based on this action.

In order to maintain user and friendship records, registration is required for this application. A user has two ways to sign up. One way is to input personal information such as user name, email and password as basic registration information (Figure 7.23(a)). An alternative is to connect a user's Facebook account for fast registration (Figure 7.23(b)). The second choice is preferred, since the user's Facebook profile can be used as a source for event recommendation. All user information is stored in the user ontology in TS.

Chapter 8 Conclusion

In this thesis, we studied the problem of information integration over heterogeneous ontologies. First, we introduced a unified fuzzy ontology matching framework which can discover correspondences from multiple ontologies accurately.
Then, we investigated the problem of query execution on the semantic web and proposed a novel algorithm which can efficiently retrieve similar instances from different ontology sources. Finally, we developed several optimization strategies for both the ontology matching and query execution tasks.

8.1 Summary of Contributions

From the perspective of ontology alignment, our research has the following contributions:

- We formalized a fuzzy set representation of an ontology alignment system, and we presented a novel correspondence relation type called Relevance.
- We developed a unified framework for fuzzy ontology matching (UFOM) which derives fuzzy correspondences with any relation type from heterogeneous sources.
- We conducted comprehensive experiments on both publicly available datasets and synthetic data which show that our proposed framework achieves high precision, recall and F-measure for generating the correspondences of both the equivalence relation and the newly-defined relevance relation.

From the perspective of query execution, our research has the following contributions:

- We proposed an algorithm (UFOMQ) which enables scalable and efficient querying for related entities over heterogeneous ontologies. The query algorithm exploits the redundancy in the correspondences between classes: similar entities can be identified by following the strongest correspondences, not necessarily all of them.
- We proved the efficiency of UFOMQ by formally deriving its computational complexity.
- We performed comprehensive experiments on publicly available datasets and large enterprise-scale datasets which show that the query algorithm achieves a trade-off between the precision of the returned query result and the computational cost of the query execution process.

From the perspective of optimization in matching and query execution, our research has the following contributions:

- We presented two variants of the UFOM approach that include multiple optimization and machine learning strategies to improve the efficiency of computing similarities and relation scores.
- We conducted comprehensive experiments on publicly available datasets and large enterprise-scale datasets showing that our proposed enhancements achieve higher precision, recall and F-measure for generating the correspondences compared with the original UFOM.
- We extended the query graph model to support SPARQL queries with conditional triple patterns. We developed two optimization approaches, HBO and MLO, which enable rapid evaluation of queries involving conditional triple patterns. The HBO algorithm utilizes a histogram to estimate selectivity in order to compute a good query plan, while the MLO approach learns a neural network which gives better selectivity estimation when training data are available.
- We performed comprehensive experiments on a large-scale synthetic dataset and large enterprise-scale datasets showing that HBO and MLO can save both network bandwidth and query response time compared with the state of the art.

8.2 Future Work

A possible future direction of research is to formalize the query execution process and improve the usability of the unified query component on any ontology integration framework. At the same time, various query optimization strategies to facilitate efficient query execution are worth studying.
8.2.1 Exploration on Matching and Querying over Ontologies

As future work, different entity identification techniques can be introduced to improve the usability of the ontology alignment and query framework. Text-based search capability can be enabled in the existing query engine. With such a capability, a user will be able to retrieve all relevant entities based on a given text without specifying which class the text belongs to. There are two potential directions to tackle this problem: 1) with some knowledge of each class, the probability of a given text belonging to a class can be estimated; for example, a Naive Bayes model can be used to estimate such a probability; 2) a correlated sample synopsis method can be applied if some samples from each class are given. This subset of sample instances, along with their classes, is called a synopsis. Multiple class models can be built based on a synopsis. When a query text comes in, its corresponding class is estimated based on the class models. Once the class is decided, the UFOM/UFOMQ framework can be used to retrieve all relevant individuals (Figure 8.1).

Figure 8.1: A Correlated Sample Synopsis Method

8.2.2 Exploration on SPARQL Query Optimization

In the real world, data values quite often follow a non-uniform distribution within their domain. For example, in a habitat monitoring application, the air temperature may range from -130 °F to +130 °F, but most of the sensor readings fall within [-50 °F, +110 °F]. Based on this observation, data skewness and data dependency may play an important role in determining query optimization performance. For the HBO approach, one strategy is to use a larger granularity for the skewed region when building the histogram. When correlations exist among the values of different properties, it is better to use multi-dimensional histograms instead of one-dimensional histograms. Similarly, for the MLO approach, a separate model can be trained for the skewed region alongside a general one.

In order to deal with the dynamism in big data, various transfer-learning techniques can be incorporated into the MLO approach. For example, a stock index may vary from year to year. It is not sufficient to use a model trained two years ago to help optimize the execution of a query requested today; at the same time, it is neither efficient nor effective to re-train the model from scratch using only the data gathered this year. In this case, transfer learning is able to re-use the previous model and incorporate the new data to update it. By adding a transfer learning component to our query framework, the optimizer can automatically adjust the cardinality estimation model when new data arrive.

Bibliography

[1] Fuzzy sets and relations. In Computational Intelligence, pages 37–66. Springer Berlin Heidelberg, 2005.
[2] S. Albagli, R. Ben-Eliyahu-Zohary, and S. E. Shimony. Markov network based ontology matching. In IJCAI, pages 1884–1889, 2009.
[3] C. B. Aranda, A. Hogan, J. Umbrich, and P. Vandenbussche. SPARQL web-querying infrastructure: Ready for action? In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II, pages 277–293, 2013.
[4] T. Bäck and H. Schwefel. An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1(1):1–23, 1993.
[5] J. Bauckmann, U. Leser, F. Naumann, and V. Tietz. Efficiently detecting inclusion dependencies. In Data Engineering, 2007. ICDE 2007.
Abstract
As the volume and variety of data on the web increase rapidly, it becomes necessary to develop technologies that enable applications to automatically discover and link relevant data sources. The Semantic Web is an extension of the traditional web that goes beyond hyperlinked documents into a network of data elements linked together through formally defined semantic relationships. This thesis deals with two primary challenges in the domain of data interoperability and the Semantic Web: (i) how to discover semantic links of different types between entities from various data sources, and (ii) how to enable efficient querying over heterogeneous ontologies.

We develop an integrated framework to address these issues, in which the links between entities are automatically discovered and then used to optimize querying. Unified Fuzzy Ontology Matching (UFOM) is an ontology matching system designed to discover semantic links between large real-world ontologies populated with entities from heterogeneous sources. In such ontologies, many entities are expected to be related to each other, but not necessarily through one of the typical well-defined correspondence relationships (equivalent-to, subsumed-by). This motivates the need for a unified ontology matching system that can discover arbitrary types of relationships, including those that are more loosely defined in the context of specific applications. UFOM uses fuzzy set theory as a general framework for representing different types of alignments across ontologies.

The main challenge in identifying similar instances across multiple ontologies is the high computational cost of evaluating similarity between every pair of entities. We develop the UFOMQ (Unified Fuzzy Ontology Matching Query) algorithm, which queries for similar instances across multiple ontologies and makes use of the correspondences discovered during ontology alignment to reduce this cost. The query algorithm uses a fuzzy logic formulation and extends fuzzy ontology alignment. It identifies entities that are related to a given entity either directly through a single alignment link or by following multiple alignment links. We also show how the computational cost of the underlying SPARQL queries can be optimized. We develop two variants of UFOM, UFOM+ and UFOMNN, in which multiple optimization strategies and machine learning techniques are incorporated to decrease the computational cost of ontology alignment while still maintaining high accuracy in the discovered alignments. We extend selectivity-based query optimization methods to queries with conditional patterns, using histograms to maintain an estimate of the distribution of the ontology instances. We then investigate a machine learning-based approach that uses the statistics of results from past queries to learn a model relating query patterns to selectivity, which in turn can be used for query optimization. These two approaches (maintaining histograms and predicting from prior query results) are complementary: as query history is accumulated and used for training, histogram-based estimation is gradually traded for machine learning-based prediction.

The performance of UFOM and its variants is extensively evaluated using synthetic and real-world enterprise-scale datasets. The proposed framework is applied to two domains. First, we integrate heterogeneous standards and provide a unified approach to information retrieval among different data sources in the oil and gas industry. Second, we apply the framework to integrate heterogeneous web services for online social networks. The experiments show that the proposed approach can decrease both network bandwidth requirements and query response time compared with state-of-the-art methods.