Discovering and Querying Implicit Relationships in Semantic Data

by

Muhammad Rizwan Saeed

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

December 2019

Copyright 2019 Muhammad Rizwan Saeed

Dedication

To my family for their support, patience, and prayers.

Acknowledgments

First and foremost, I would like to express my gratitude to my advisor, Dr. Viktor Prasanna, for his continuous support and guidance. Without his mentorship, this thesis would not have been possible. I was lucky to have another mentor in Dr. Charalampos Chelmis, who was always available to discuss my research problems and progress, often weekly and sometimes on short notice, even after he left the University of Southern California. I am grateful to Dr. Iraj Ershaghi and Dr. Cauligi Raghavendra for taking an interest in my work and providing useful feedback that helped further enhance this work. I want to thank Dr. Bill Cheng and the late Dr. Dennis McLeod for giving me the opportunity to be the TA of their classes for several semesters, which was one of the most enjoyable parts of my academic experience. There are many others whose feedback and support have helped me finish my doctoral studies.

My work was supported by the Center for Interactive Smart Oilfield Technologies (CiSoft). I am eternally grateful to the institute and its partner Chevron, its head Dr. Iraj Ershaghi, and Dr. Viktor Prasanna for making this support possible and available. While affiliated with CiSoft, I was very fortunate to collaborate with and get feedback from many people from both Chevron and USC, such as Lisa Bernskelle, Nicola Killen, Brian Thigpen, Vega Sankur, Dr. Cauligi Raghavendra, Dr. Ulrich Neuman, Dr. Don Paul, and past and present students and research associates of Dr. Prasanna's p-group. Throughout this journey, I have been helped by many members of the USC administrative staff who made my concerns their concerns, such as Kathy Kassar, Diane Demetras, Juli Legat, Jennifer Gerson, and Michelle Wilkinson.

At Hogwarts School of Witchcraft and Wizardry, it was said that "help will always be given at Hogwarts to those who ask for it." My family was very fortunate to have friends who helped or offered to help us, sometimes even without asking. These friends were, are, and will be our family away from home. We are grateful to many of our friends and neighbors over the years: Muqarrab, Ahsan, Vanessa, Sh. Jamaal, Talha, Hassan, Taqi, Junaid, Ekraam, Muazzam, and many others.

Finally, my deepest gratitude to my family, who have always been my mast through the stormy seas - my parents, Nasim and Shahida, my siblings, Asma and Usman, my wife Ayfa, and my son Abdullah. I hope that I can reciprocate even 1% of what they have done for me in my lifetime.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
    1.1.1 Intuitive Framework for Querying Semantic Data
    1.1.2 Enabling Semantics-preserving Analytics for Semantic Data
  1.2 Thesis Statement
  1.3 Research Contributions
  1.4 Organization
Chapter 2: Background and Related Work
  2.1 Semantic Web
    2.1.1 Ontologies: Conceptualizing the Domain
    2.1.2 RDF Graphs
    2.1.3 Ontologies and Semantic Relationships
    2.1.4 Implicit Relationships
    2.1.5 Extracting Representative Entity Subgraphs using Random Walks
  2.2 Related Work
    2.2.1 Semantic Web-based Data Modeling
    2.2.2 Querying Abstractions for the Semantic Web
    2.2.3 Data Mining for the Semantic Web
Chapter 3: Strategies for Creating Features from Linked Open Data
  3.1 Motivation
  3.2 Problem Statement
  3.3 Proposed Solution
  3.4 Specificity: An Intuitive Relevance Metric
    3.4.1 Specificity
    3.4.2 Incorporating Hierarchy of Classes into Specificity Computations
  3.5 Bidirectional Random Walks for Computing Specificity
  3.6 Evaluation
    3.6.1 Datasets
    3.6.2 Experimental Setup and Methodology
    3.6.3 Specificity as a Metric for Measuring Relevance
    3.6.4 Comparison of Sizes of Representative Subgraphs
    3.6.5 Generating Embeddings from Extracted Subgraphs
    3.6.6 Parameter Sensitivity
    3.6.7 Analysis of Running Time of Specificity Computations
  3.7 Summary
Chapter 4: Semantic Query Formulation for the Non-Expert
  4.1 Motivation
  4.2 Problem Formulation
  4.3 Proposed Solution
  4.4 ASQFor: Automatic SPARQL Query Formulation
    4.4.1 Algorithm
    4.4.2 Comprehensive Example
    4.4.3 Complexity Analysis
    4.4.4 Limitations
  4.5 Evaluation
    4.5.1 Dataset
    4.5.2 Quality and Efficiency of ASQFor Generated Queries
    4.5.3 Effect of Automation on Query Formulation Time
    4.5.4 Effect of Automation on Query Execution Time
    4.5.5 Semantic Search of 1990 US Census Data
  4.6 Summary
Chapter 5: Use Case: Smart Oilfield Safety Net (SOSNet)
  5.1 Motivation
  5.2 Background
  5.3 Asset Integrity Management Workflow
    5.3.1 Challenges
  5.4 Semantic Information Modeling
    5.4.1 Data Sources
    5.4.2 Semantic Data Modeling
    5.4.3 Data Integration
  5.5 Accessing Semantic Data for the Non-expert
    5.5.1 Automatic SPARQL Query Formulation (ASQFor)
    5.5.2 Similarity-based Queries
  5.6 Representative Applications of SOSNet System
    5.6.1 Querying of Integrated Asset Integrity Data Made Easy
    5.6.2 Support for Predictive Maintenance Analytics
    5.6.3 Driving the Development of Effective Visualizations
  5.7 Summary
Chapter 6: Use Case: Integrated Movie Database
  6.1 Motivation
  6.2 Problem Formulation
  6.3 Data Acquisition
  6.4 Data Modeling and Integration
  6.5 Querying Integrated Data
  6.6 Summary
Chapter 7: Thesis Conclusion
  7.1 Discovering Implicit Relationships in Semantic Data
  7.2 Semantic Query Formulation for the Non-Expert
  7.3 Application: Smart Oilfield Safety Net
  7.4 Limitations and Future Extensions
Reference List

List of Tables

2.1 Comparison of Query Abstractions for Semantic Web
3.1 Top semantic relationships based on frequency with corresponding specificity scores for dbo:Film
3.2 Comparison of relevance metrics for example in Figure 3.1
3.3 Comparison of Specificity and Specificity_H scores (β = 0.25)
3.4 Results of Regression and Classification Tasks - Specificity_H (β = 0.25)
3.5 Distribution of classes at different heights in DBpedia hierarchy
4.1 Examples of Query Formulation
4.2 Evaluation Queries for Census Data
4.3 Query Formulation for Representative Queries
5.1 Results for search key AAQ-0108
6.1 Information extracted from movie websites
6.2 Number of records in IMDB data
6.3 Evaluation queries for IMDB data

List of Figures

2.1 Linked Open Data [Source: http://lod-cloud.net/]
2.2 DBpedia - Wikipedia as part of the LOD cloud
2.3 Knowledge Graph Embeddings - Source: DeepWalk [76]
3.1 Random walks from node Batman (1989)
3.2 Illustration of specificity
3.3 A subset of DBpedia class hierarchy
3.4 Extracted subgraphs are represented as a sequence of edge and node labels, resulting in a document-like representation. The specificity scores shown are with respect to the entity type Film. The relationship starring has a specificity score of 79.1%, whereas starring→spouse has a score of 34.16%. Only those nodes connected to semantic relationships of specificity higher than 50% are considered for the extracted representation.
3.5 Comparison of frequency- and specificity-based metrics for top-15 semantic relationships
3.6 Average number of walks per entity for subgraph extraction
3.7 Average subgraph extraction time per entity (msec)
3.8 Comparison of precision and recall for entity recommendation tasks (β = 0.25 for Specificity_H)
3.9 Effect of β on recommendation task
3.10 Projection of countries and capitals in 2D space using embeddings generated from RDF subgraphs of depth 2
3.11 Average number of walks per entity of types dbo:City and dbo:Country for subgraph extraction for different values of β
3.12 Effect of N_walks on computation of specificity
3.13 Effect of depth on computation of specificity for dbo:Film
3.14 Effect of |S| on computation of specificity
3.15 Empirical analysis of computation time of specificity
4.1 Schema Ontology for University Data
4.2 Sample SPARQL Query based on University Ontology
4.3 SPARQL query to get domain of a data property
4.4 Step-wise generated SPARQL statements
4.5 Generated full SPARQL query
4.6 Schema Ontology for Census Data
4.7 Comparison between query formulation and total query execution time using ASQFor
4.8 Mean and standard deviation of query formulation time as a percentage of total execution time
4.9 Comparison between execution times of ASQFor generated and manual queries
4.10 US Census 1990 - Database search application
5.1 Typical Asset Integrity Management workflow
5.2 SOSNet system architecture
5.3 Relationships between select concepts from SOSNet ontology
5.4 Visualization of asset integrity data in RDF related to a single equipment, T-1000
5.5 An example SPARQL query to get work orders for severely corroded assets
5.6 From simple keywords to a formal ASQFor-generated SPARQL query
5.7 Attribute-to-attribute based computation of similarity between two assets
5.8 Bipartite graph between values of attributes between two assets
5.9 From RDF graph-based to feature vector-based representation
5.10 Enabled intuitive search capability over integrated asset integrity data
5.11 Prediction of behavior of TML over time
5.12 Facility overview screen (anonymized data)
5.13 Equipment overview screen (anonymized data)
6.1 SPARQL queries to get movies based on books of J.R.R. Tolkien
6.2 List of movies with selected attributes based on J.R.R. Tolkien's books
6.3 Comparison between query formulation and total query execution time for ASQFor generated queries
6.4 Comparison between execution times of manual and ASQFor generated queries

Abstract

The combination of data, semantics, and the Web has led to an ever-growing and increasingly complex body of semantic data. As the volume of semantic data increases, the question of how users can access and utilize this data becomes crucially important. Accessing semantic data requires familiarity with Semantic Web standards, such as RDF and SPARQL, whereas analyzing this data requires an understanding of Machine Learning (ML) concepts. The ideal system would allow non-expert users [1] to benefit from the expressive power of the Semantic Web, while at the same time hiding the complexity of its various standards and technologies behind an intuitive and easy-to-use mechanism. To date, many frameworks for querying semantic data have been developed. However, such frameworks usually rely on predefined templates or require expensive customization. Furthermore, to avoid overwhelming end users with the task of evaluating query results, such results should be ranked by relevance or importance.
In turn, the ability to rank requires mining implicit relationships from semantic data. Implicit relationships include, but are not limited to, similarity between entities, correlation, and classification of entities in the database. ML techniques applied to discover implicit relationships require the transformation of semantic data into a low dimensional vector space. Projecting highly expressive semantic data into a low dimensional vector space comes at the price of lost semantics.

The primary goal of the work presented in this thesis is to enable querying and analytics over semantic data for the non-expert. We present the ASQFor (Automatic SPARQL Query Formulation) framework, which takes user input in a simplified manner and provides the results of complex querying and analysis as output. The user-provided keywords are matched to ontological concepts in the repository to formulate formal queries automatically. The approach avoids offline computation and predefined query templates while enabling online queries. To enable analytics on semantic data, we propose Specificity as a new relevance metric which enables the semantics-preserving transformation of semantic data. We show that specificity-based biased random walks can extract more relevant representations of entities in the semantic data, resulting in improved preservation of semantics, even when the data is transformed into a lower dimensional vector space. To demonstrate the capabilities enabled by the discovery of implicit relationships in semantic data, we use the generated feature vector representations to provide the functionality of similarity-based queries, where users can retrieve similar entities given a search key.

We evaluate our approaches for query formulation and analytics on real-world datasets, including DBpedia, movie datasets, and asset integrity data from the oilfield. For performance evaluation of the querying framework, we consider both query formulation time and execution time over large scale data. For evaluation of the specificity-based approach, we compare the precision and recall of our approach with state-of-the-art methods. We highlight the applicability of our approach by presenting Smart Oil Field Safety Net (SOSNet), a data-driven system for Asset Integrity Management.

[1] We define a non-expert as someone who is a domain expert in his area of interest but is not familiar with concepts in database systems, Machine Learning, and the Semantic Web.

Chapter 1: Introduction

1.1 Motivation

The "Semantic Web", proposed by Tim Berners-Lee [8] as the next version of the Web, aims at creating a web of data that is understandable by machines. The suite of technologies under the Semantic Web has been developed to facilitate data annotation, integration, organization, and reuse [23, 116]. The resulting annotated and integrated data (or semantic data), which can be used for reasoning or simply querying, is one of the main strengths of the Semantic Web. Most applications employing Semantic Web technologies are based on the accessibility and integration of semantic data at various levels of complexity. With the emergence of Linked Open Data [11], DBpedia [2], and the Google Knowledge Graph (https://www.blog.google/products/search/introducing-knowledge-graph-things-not/), semantic data in the form of large-scale KGs has drawn much attention and has become an important data source for many data mining and knowledge discovery tasks [25, 44]. In order to allow non-expert users to access this semantic data and its implications, there are two key requirements.
1.1.1 Intuitive Framework for Querying Semantic Data

The principles of the Semantic Web have been applied to diverse fields such as the Internet of Things (IoT), where information from various devices, objects, and sensors is being linked as Linked Data (LD) and made accessible for querying and analysis. In order to effectively utilize this semantic data, users must possess an understanding of the structure and schema of the underlying information and of query languages such as SPARQL (http://www.w3.org/TR/rdf-sparql-query/). Despite their strong expressive power, such formal languages impose an initial barrier to technology adoption due to their hard requirement for an understanding of their formal syntax and of the way knowledge is encoded in semantic repositories. Approaches that rely on IT experts to translate the requirements of users into predefined analytical reporting tools are inflexible [49]. In order to support real-time data exploration, users need to be granted direct data access. This is a challenging task, primarily due to the conceptual mismatch between the users' understanding of the data and the way the data is modeled and stored in the repository. Only Semantic Web experts can fully understand the way the data is stored and can execute semantic queries in order to retrieve data. There is a need for mechanisms that abstract away the complexities associated with query languages and knowledge representation and enable easy exploration of data for the non-expert.

1.1.2 Enabling Semantics-preserving Analytics for Semantic Data

Once a user executes a query, he gets all the results that match the patterns specified in the query. This extraction of a subset of data from the repository is just the first step. With the development of data analytics techniques, users are interested in getting insights from the data rather than a simple listing of it. For instance, a user may be interested in a specific ordering or grouping of the results based on metrics such as importance or similarity, which goes beyond the scope of the majority of existing query languages. Machine learning techniques need to be applied to the relevant data to augment the query results based on the user's needs. In multiple approaches [82, 101], semantic data is extracted from Knowledge Graphs (KGs) and converted to propositional representations for use with Machine Learning approaches. This requires projecting highly expressive semantic data into a low dimensional vector space, which comes at the price of lost semantics. Since machine learning models can be made to work with noisy, uncertain, or incomplete data, there is scope for customizing such models specifically for Linked Data. Therefore, it is imperative to ensure that the models can either work with its standard knowledge representation or, if the data is transformed into a different representation (e.g., feature vectors), take into account the semantics of the data as much as possible.

1.2 Thesis Statement

As stated, accessing semantic data requires familiarity with existing formal knowledge representation and query languages such as RDF and SPARQL. Moreover, merely accessing semantic data is insufficient in applications where it is crucial to gain actionable insights (implicit information) from the data. In this thesis, we propose a framework that addresses the challenges of querying and analyzing semantic data for non-expert users.
The thesis statement is: "A low latency, online querying abstraction that exploits implicit relationships automatically discovered offline can ease access to semantic data and enable analytics for non-expert users."

The descriptions of the key terms in the statement are:

• Online querying abstraction: Insulates users from the complexities of schema and query languages, without requiring pre-computed query templates or rules, and formulates SPARQL queries automatically on the fly.
• Low latency: Measured as query formulation overhead over the total query processing time.
• Offline: A process for discovering implicit relationships in the RDF (https://www.w3.org/TR/rdf-concepts/) KG (e.g., similarity).

To summarize, to abstract non-expert users from the complexities of formal query languages and analytics, a framework is required which must (i) be easy to use, (ii) provide access to most of the functionality provided by standard query languages such as SPARQL, and (iii) allow users to pose analytics-based queries.

1.3 Research Contributions

In this thesis, we present a set of algorithms to achieve this objective. The proposed algorithms are schema-agnostic and work with a variety of semantic knowledge bases. We evaluate these algorithms using various real-world datasets to demonstrate their effectiveness. The main contributions of this thesis are summarized below:

Entity-Specific Relevance Metric for Ranking Semantic Relationships: We propose specificity as an accurate measure for identifying the most relevant, entity-specific subgraphs in Linked Data Knowledge Graphs (KGs). Specificity and its variant, hierarchical specificity (Specificity_H), are used to rank the most relevant semantic relationships that define the representative neighborhoods of target entities in a KG.

Computation of Specificity for Large Scale Knowledge Graphs: We propose a scalable method based on bidirectional random walks for computing specificity. We provide an experimental analysis of running time as a measure of efficiency. We also show that specificity can be computed by examining only a fraction of semantic relationships, making it a suitable approach for large scale KGs.

Feature Generation from RDF KGs: We show that specificity-based biased random walks extract more meaningful (in terms of size and relevance) substructures compared to the state-of-the-art, and that the graph embeddings learned from the extracted substructures are well-suited for common data mining tasks. We also show that these graph embeddings preserve the semantic context of the represented entities even when projected to a low dimensional vector space.

Querying over Heterogeneous Ontologies: We present the ASQFor (Automatic SPARQL Query Formulation) algorithm, which allows non-expert users to explore semantic data without needing an understanding of formal languages for querying and knowledge representation. ASQFor, a schema-agnostic framework, provides a simple but powerful way of specifying complex queries and automatically translates them into formal queries on the fly, i.e., it does not rely on predefined rules and can instantly adapt to changes in the ontology. We quantitatively evaluate ASQFor using real-world data to indicate possible performance overheads, and we highlight its applicability in information searching activities due to its ability to reduce the amount of time spent developing and adapting queries manually.
Applications: For evaluating our approaches, we use publicly available datasets such as US Census data, DBpedia, and movie data from multiple sources. In addition, we present a comprehensive application: a data-driven system with integration, querying, and analytics capabilities for the process of asset integrity management in the oil and gas industry. This system is named SOSNet (Smart Oil Field Safety Net) and leverages our proposed Semantic Web-based approaches. We show that with ASQFor, data can be retrieved through a simple interface. Secondly, using the specificity-based approach, we can extend our querying interface to enable similarity-based queries, which allow users to get the relative ranking of assets against a given search key.

1.4 Organization

This dissertation is organized as follows:

• In Chapter 2, we present background concepts in the Semantic Web. We also include a comprehensive literature review covering querying abstractions and data mining tasks for Linked Data.
• In Chapter 3, we discuss the limitations of the existing metrics used to extract relevant subgraphs from KGs and then define specificity as a metric better suited to Linked Data KGs. We present the algorithm based on bidirectional random walks to compute specificity. We then use specificity-based biased random walks for subgraph extraction and feature generation for Linked Data.
• In Chapter 4, we introduce the ASQFor algorithm for generating SPARQL query subgraphs from user-provided keywords to enable the capability of querying different ontologies.
• In Chapters 5 and 6, we discuss the Smart Oilfield Safety Net system and the Integrated Movie Database, which have been built using our proposed approaches.
• In Chapter 7, we conclude this thesis and provide several directions for future work.

Chapter 2: Background and Related Work

2.1 Semantic Web

Consider the example of the Internet, which is a scalable network of networks that successfully connects heterogeneous devices. On top of the Internet, at the application layer, the Web provides an information exchange (using open protocols, e.g., HTTP) where uniquely identifiable documents and web resources can be accessed by users worldwide using URLs, irrespective of the underlying networking infrastructure. There is an inherent drawback to the Web itself. Currently, the Web can more appropriately be called a Web of documents [7, 85]. As with real-life files, users skim through web documents to extract relevant information, which is cumbersome to do manually given the massive number of documents on the Web. Without giving a meaningful structure to a web document (or resource), it is difficult for machines to understand the contents of the file. An improvement over the Web has been proposed as the Semantic Web [8], which aims to build a large-scale web of machine-readable data on top of the existing Web of documents. The inherent properties of the Semantic Web make it easier to give meaning to data and provide the flexibility of linking multiple data sources together. This enables machines to comprehend information and facilitates the rapid integration of multiple data sources. The integrated data can be used to discover hidden facts due to inferencing capabilities, which is one of the main strengths of the Semantic Web. The Semantic Web provides a common framework for data integration, sharing, and reuse across application, enterprise, and community boundaries [85].
2.1.1 Ontologies: Conceptualizing the Domain

Ontologies play a crucial role in enabling automatic knowledge processing, sharing, and reuse between applications. An ontology denotes a formal conceptualization of a particular domain. An ontology typically contains a hierarchy of concepts (or classes) pertinent to a domain, along with their attributes (or properties). Ontologies are formally described using RDF (Resource Description Framework, https://www.w3.org/TR/rdf-concepts/) syntax. RDF is the language used for representing information about resources on the Semantic Web. It is a framework for expressing information about resources and their attributes and relationships with other resources. A resource can be anything, including a document, person, object, or abstract concept. RDF allows us to make statements about these resources. The Semantic Web standards have led to the popularization of a graph-based representation of information known as Linked Data [11], which is encoded using RDF syntax (see Section 2.2.3.1).

2.1.2 RDF Graphs

An RDF graph is represented by a knowledge base of triples [110]. A triple consists of three parts: <subject (s), predicate (p), object (o)>. Subjects, predicates, and objects are represented using URIs (Uniform Resource Identifiers). URIs are used to identify resources irrespective of their location on the web [62].

Definition 1 (RDF Graph): Assuming that there is a set U of Uniform Resource Identifiers (URIs), a set B of blank nodes, a set L of literals, a set O of object properties, and a set D of datatype properties, an RDF graph G can be represented as a set of triples such that:

G = {<s, p, o> | s ∈ (U ∪ B), p ∈ (D ∪ O), (o ∈ (U ∪ B) if p ∈ O) ∧ (o ∈ L if p ∈ D), (D ∪ O) ⊆ U}   (2.1)

We can also represent an RDF graph G as G = {V, E} such that V ∈ (U ∪ B ∪ L) and E ∈ (O ∪ D), where E is a set of directed labeled edges. In an RDF graph, URIs are used as location-independent addresses of entities (both nodes and properties), whereas blank nodes are assigned internal IDs. Literals can have any values conforming to the XML Schema definitions. Thus, (2.1) signifies that any entity with a URI can be a subject, predicate, or object. In practice, only entities with URIs defined as object or datatype properties are used as predicates. However, these properties can also be subjects or objects in other triples; it is an interesting feature of an RDF graph that an edge can exist between two edges, which makes it different from the traditional definition of a graph. Blank nodes can be either subjects or objects, whereas literals can only be objects. An RDF graph is the set of all such RDF triples [70, 110].
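As a concrete illustration of the triple model in Definition 1, the following is a minimal sketch that builds a two-triple RDF graph in Python with the rdflib library (assumed installed via pip install rdflib); the example namespace, entities, and property names are hypothetical and not part of the thesis.

    # A minimal sketch of Definition 1 using the Python rdflib library;
    # the example namespace, entities, and property names are hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/")
    g = Graph()

    # An object property: subject and object are both URIs (p in O).
    g.add((EX.Batman_1989, EX.director, EX.Tim_Burton))
    # A datatype property: the object is a literal (p in D).
    g.add((EX.Batman_1989, EX.releaseYear, Literal(1989, datatype=XSD.integer)))

    # Every statement in the graph is a <subject, predicate, object> triple.
    for s, p, o in g:
        print(s, p, o)

Here EX.director plays the role of an object property (its object is a URI) while EX.releaseYear plays the role of a datatype property (its object is a literal), matching the split between the sets O and D in (2.1).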
2.1.3 Ontologies and Semantic Relationships

As discussed, ontologies contain a hierarchy of concepts (or classes), their relationships (object properties), and attributes (data properties). Such semantic relationships are a fundamental aspect of knowledge representation in the Semantic Web, as they provide information about how entities are linked together. Thus, ontologies are used to enforce a schema over RDF instance data [85]. Discovering and utilizing these semantic relationships has become an essential aspect of data mining performed on semantic data. Significant research has focused on relationships that are explicitly represented using ontology modeling languages such as RDF and OWL [95]. However, we observe that the relationships that reside in an ontology are not limited to these explicit relationships, but also include implicit relationships derived from (or implied by) explicit relationships.

Explicit relationships describe the relationships between two entities that are explicitly modeled using ontology languages such as RDF, RDFS, and OWL. A semantic relationship can be either direct or indirect. A direct semantic relationship links the subject and the object of a single triple. Such relationships are called object properties. Complex semantic relationships are usually described as paths consisting of object properties that link two entities through other entities in the RDF graph, also known as semantic associations [38]. Examples of a direct semantic relationship and a semantic association, respectively, are as follows:

Marlon Brando -actedIn-> Godfather

Marlon Brando -actedIn-> Godfather (movie) -basedOn-> Godfather (book) -hasAuthor-> Mario Puzo

Another type of explicit relationship occurs between an entity and a literal. Such relationships are known as data properties.

Godfather (book) -yearPublished-> 1969

Marlon Brando -yearOfDeath-> 2004

Through such triples, an entity (Godfather or Marlon Brando) is defined in a knowledge base along with its relationships with other entities and attributes. For this thesis, we define a generic semantic relationship, which can be direct or indirect, as follows:

Definition 2 (Semantic Relationship): A semantic relationship in an RDF graph can be defined as <s, p_d, o>, where p_d represents a path comprising d successive predicates and d−1 intermediate nodes between s and o. When d = 1, <s, p_1, o> becomes equivalent to a single triple <s, p, o>, where p_1 (or p) represents a single predicate or edge between two nodes s and o.

We define a semantic relationship p_d of depth (or length) d as a template for a path (or a walk) in G that comprises d successive predicates p_1, p_2, ..., p_d. Thus, <s, p_d, o> represents all paths (or walks) between any two entities s and o that traverse through d−1 intermediate nodes using the same d successive predicates that constitute p_d. To understand the difference between a triple, a path, and a template, assume that an RDF KG consists of only the following four triples: <s1, a, x>, <x, b, o1>, <s2, a, y>, and <y, b, o2>. <s1, a, x> and <x, b, o1> together constitute a complex relationship between s1 and o1. These two triples are an example of a semantic relationship of the form p_2, which consists of two successive predicates a and b and one intermediate node x. If we treat (a, b) as a template of a path, then in this KG there are two instances of this template: the semantic relationship between s1 and o1 (first and second triples) and the semantic relationship between s2 and o2 (third and fourth triples). This also means that for arbitrary s and o, |<s, p_2, o>| = 2, where p_2 = (a, b). We will formulate the expression for specificity using this notation in Section 3.4.
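To make the template counting in Definition 2 concrete, here is a minimal sketch over the four-triple toy KG from the text; the count_template_instances helper and the example namespace are illustrative assumptions, not the thesis implementation.

    # A minimal sketch of counting instances of a path template (Definition 2)
    # on the four-triple toy KG from the text.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    for s, p, o in [("s1", "a", "x"), ("x", "b", "o1"),
                    ("s2", "a", "y"), ("y", "b", "o2")]:
        g.add((EX[s], EX[p], EX[o]))

    def count_template_instances(graph, predicates):
        # Walks that start with the first predicate of the template ...
        frontier = [(s, o) for s, p, o in graph if p == predicates[0]]
        # ... are extended one predicate at a time along the template.
        for pred in predicates[1:]:
            frontier = [(s, o2) for s, o1 in frontier
                        for o2 in graph.objects(o1, pred)]
        return len(frontier)

    # |<s, p_2, o>| = 2 for the template p_2 = (a, b), as stated in the text.
    print(count_template_instances(g, [EX.a, EX.b]))  # prints 2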
2.1.4 Implicit Relationships

Implicit relationships are defined as the relationships implied by explicit relationships in ontologies [95]. Such relationships become visible if the repository that holds the semantic data possesses logical or inductive inferencing capabilities.

2.1.4.1 Rule-based Reasoning for Discovery of Implicit Relationships

Logical inferencing is performed by reasoners, which typically work with explicit relationships and rules to discover these implicit relationships. These rules can be defined using SWRL (Semantic Web Rule Language) and take the form of an implication between an antecedent and a consequent. The consequent consists of relationships between entities that are not originally defined in the ontology but can be revealed through reasoning. Whenever the conditions specified in the antecedent are met, the implicit relationships specified in the consequent must exist [43]. Assume that e denotes an entity in the ontology and a(e_i, e_j) denotes a semantic relationship between two entities e_i and e_j. A rule for discovering an implicit relationship then has the form:

a_1(e_11, e_12) ∧ a_2(e_21, e_22) ∧ ... ∧ a_n(e_n1, e_n2) ⇒ a_implicit(e_l, e_m)   (2.2)

An example of such a rule that defines an implicit relationship based on explicit relationships is:

hasParent(a, b) ∧ hasBrother(b, c) ⇒ hasUncle(a, c)   (2.3)

Given three entities a, b, and c with explicit relationships matching the left-hand side of Rule (2.3), a rule-based reasoner can discover an implied relationship according to the right-hand side. Such extracted implicit relationships can be considered part of the knowledge base, enhancing the expressiveness of the semantic content.
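As an illustration of Rule (2.3), the following minimal sketch forward-chains the implication once over a three-entity rdflib graph; a production system would instead rely on a SWRL-capable reasoner, and the entity names here are hypothetical.

    # A minimal forward-chaining sketch of Rule (2.3); a real deployment
    # would use a SWRL-capable reasoner, and the entities are hypothetical.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Alice, EX.hasParent, EX.Bob))
    g.add((EX.Bob, EX.hasBrother, EX.Carl))

    # hasParent(a, b) AND hasBrother(b, c) => hasUncle(a, c)
    inferred = [(a, EX.hasUncle, c)
                for a, b in g.subject_objects(EX.hasParent)
                for c in g.objects(b, EX.hasBrother)]
    for triple in inferred:
        g.add(triple)

    print((EX.Alice, EX.hasUncle, EX.Carl) in g)  # True

The inferred hasUncle triple is added back to the graph, so subsequent queries can retrieve it as if it had been asserted explicitly.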
2.1.4.2 Drawbacks of Rule-based Reasoning

Based on ontological knowledge and an initial set of statements, reasoning can derive implicit statements by deductive inference. However, logical reasoning has its limitations. First, it does not efficiently scale up to the size of the Web as required by many applications, as logical reasoning may be too computationally demanding; the complexity of logical reasoning increases with the expressiveness of the representation. Second, logical reasoning can be error-prone because of inconsistencies or incompleteness in the semantic knowledge bases [80].

2.1.4.3 Machine Learning for Discovery of Implicit Relationships

Some implicit relationships cannot be defined using ontologies or rules and thus cannot be discovered by rule-based reasoning. These implicit relationships can only be revealed by analyzing the characteristics and connections of entities. These types of relationships can be defined based on specific metrics such as proximity, ordering, grouping, or similarity. Finding such implicit relationships is the focus of Chapter 3.

Machine learning approaches have been applied to different aspects of the Semantic Web. So far, in this context, machine learning has mostly been considered a tool to enrich or extend ontologies at the schema level [99, 124, 125]. More precisely, ML may serve the Semantic Web by supporting ontology construction, management, and refinement, as well as the mapping, merging, and alignment of ontologies. For example, in an ontology alignment problem, ML methods can be used to determine classes with different names modeling the same concept, or one class modeling a subset of the concept modeled by another class. In such a scenario, by using ML techniques, deterministic knowledge containing properties such as owl:sameAs or rdfs:subClassOf between target concepts can be added to the ontological knowledge. Some research has focused on validating (or estimating the probability) that statements at the instance level (triples) are true, where these statements are neither explicitly asserted in the database nor can be proven (or refuted) based on logical reasoning [6]. Examples of problems in this category include class membership prediction and entity similarity.

2.1.5 Extracting Representative Entity Subgraphs using Random Walks

Random graph walks provide a scalable method for extracting entity representations from large scale KGs [37]. A single graph walk is defined as follows:

Definition 3 (Graph Walk): Given a graph G = {V, E}, a single graph walk of depth d starting from a node v_0 ∈ V comprises a sequence of d edges (predicates) and d+1 nodes: v_0 -e_1-> v_1 -e_2-> v_2 -e_3-> ... -e_d-> v_d.

Random graph walks start from a node v_0 ∈ V. In the first iteration, a set of randomly selected outgoing edges E_1 is explored to get a set of nodes V_1 at depth 1. In the second iteration, from every v ∈ V_1, outgoing edges are randomly selected to explore the next set of nodes at depth 2. This is repeated until a set of nodes at depth d is explored. The generated random walks are the union of the triples explored during each of the d iterations. This simple scheme of random walks resembles a randomized breadth-first search. In the literature, breadth-first and depth-first search strategies, as well as interpolations between the two, have been proposed for extracting entity representations from large-scale KGs [37, 76, 82].

Since the vast majority of Machine Learning algorithms work with a propositional representation of data (i.e., feature vectors) [82], several approaches [36, 65, 75] have been proposed for generating graph embeddings for entities in a KG. As the first step of such approaches, a representative subgraph for each target entity in the KG must be acquired. Each entity (represented by a node) in a heterogeneous KG is surrounded by other entities (or nodes) connected by directed labeled edges. For a node representing a film, there can be a labeled edge director connecting it to a node representing the director of the film. Another labeled edge releaseYear may exist that connects the film to an integer literal, also represented by a node in the KG. If we want to extract a subgraph representing the film, ideally such relevant information (labeled edges and nodes) must be part of the extracted subgraph. On the other hand, relationships linked to, say, the director, such as birthYear or deathYear, which are at two hops from a film (e.g., Batman (1989) -director-> Tim Burton -birthYear-> 1958), may not be useful or relevant for representing a film. Therefore, in order to extract a useful representation (as a subgraph) of a given entity, we first need to automatically determine the most relevant edges and nodes in its neighborhood. An extracted representation of any target entity in a KG can only be considered representative if it includes only the most relevant nodes and edges w.r.t. the target entity. The representative subgraph (neighborhood) of an entity or node v is the set of other nodes S in the graph that represent the most relevant information related to v. The edges or paths that link v to every s ∈ S are also part of the representative subgraph. From the example of films represented as Linked Data, the representative subgraph of a film may contain depth-1 edges such as director and producer that link it to nodes representing its director(s) and producer(s). Some examples of depth-2 relationships are Film -basedOn-> Book -writtenBy-> Author and Film -director-> Director -knownFor-> Style. On the other hand, our intuition suggests that semantic relationships such as Film -director-> Director -birthYear-> Year are not as relevant for representing a film and hence must not be part of the representative subgraph.
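The uniform walk scheme of Definition 3 can be sketched in a few lines; the following Python function is an illustrative sketch over an rdflib graph, not the thesis implementation.

    # A minimal sketch of a single uniform random walk (Definition 3)
    # over an rdflib graph.
    import random
    from rdflib import Graph, URIRef

    def random_walk(graph: Graph, start: URIRef, depth: int) -> list:
        """Return one walk v0 -e1-> v1 ... as alternating nodes and edges."""
        walk, node = [start], start
        for _ in range(depth):
            out_edges = list(graph.predicate_objects(node))  # outgoing (p, o)
            if not out_edges:
                break  # dead end: the walk ends early
            p, o = random.choice(out_edges)  # uniform choice over edges
            walk.extend([p, o])
            node = o
        return walk

Taking the union of the triples visited by many such walks per entity yields the randomized breadth-first exploration described above; a biased variant would replace the uniform random.choice with a draw weighted by edge relevance, which is the role specificity plays in Chapter 3.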
2.2 Related Work

2.2.1 Semantic Web-based Data Modeling

Semantic Web technologies have been used to model data in various domains, including oceanographic measurements [15], ecological surveys [66], external corrosion monitoring [90], and management of maintenance records [28].

The inherent properties of the Semantic Web make it easier to give meaning to data and provide the flexibility of linking multiple data sources together. In the domain of the Semantic Web, ontologies (or vocabularies) define the concepts and relationships used to describe and represent an area of concern. The role of ontologies in the Semantic Web is to facilitate data organization and integration. This integrated data (known as Linked Data), which can be used for reasoning or simply querying, is the main strength of the Semantic Web. A considerable amount of work has been done in the Semantic Web domain on modeling and integrating sensor data from different fields. The Semantic Sensor Network Ontology (SSNO) [19] focuses on domain-independent sensing applications by integrating sensor data (measurements) and sensor-specific data (sensing principles, or quality). The Sensor Cloud Ontology (SCO) [66] extends SSNO by adding provisions for the parameters being sensed as separate entities instead of just metadata; it explicitly introduces the concept of time series and creates a Linked Open Service (LOS) on top of a REST API that provides a SPARQL endpoint for querying integrated sensor data. A framework for encoding domain knowledge in the field of ecological and environmental monitoring is presented by Wang et al. [114]. The authors use OWL 2 (https://www.w3.org/TR/owl2-overview/) to encode rules and regulations from different environment protection agencies to flag critical sensors and sites. Another approach is to build a semantic model on top of a relational database by using RDF annotations [71], with a URI naming convention based on the table, row, and column names. However, using two technologies in parallel increases complexity. Keeping everything in a relational database does not allow the expressiveness and dynamics provided by RDF triple stores. Moreover, any application developed on such an architecture that provides a SPARQL endpoint and supports writeback can introduce the added complexity of managing both the relational database and the RDF annotations. There are examples where researchers have built complete systems leveraging only Semantic Web technologies for data acquisition, management, and analytics. Zhou et al. [127] propose an extensible model that caters to the information diversity in Smart Grids, with provisions to integrate new information sources and concepts based on the Semantic Web. Such a model can facilitate dynamic Demand Response (DR) planning for utilities, as it is capable of presenting electric consumption data, weather data, and building occupancy data.

2.2.2 Querying Abstractions for the Semantic Web

In recent years, there has been an increase in the availability of semantic data. Search engines such as Google [103] embed Linked Data in Web pages using Extensible Hypertext Markup Language (XHTML) with RDFa (http://www.w3.org/TR/rdfa-syntax/) markups. The American Art Collaborative (AAC), a consortium of 14 museums in the United States, allows public access to Linked Open Data on the subject of American Art through a SPARQL endpoint, which serves as a research tool for scholars and curators and as a public interface for students, teachers, and museum visitors [106].
Another example is from the domain of sensor networks, where data from various sensors and associated metadata are being integrated using the principles of the Semantic Web. In this context, Semantic Web technologies have been used in the areas of oceanographic measurements [15], ecological surveys [66], smart grids [127], and external corrosion monitoring [90]. These applications have benefited from Semantic Web-based approaches in terms of discovery, contextualization, and integration of data. Such a representation not only explicitly reveals relationships between facts but can also be used to drive methods that infer hidden or implicit relationships between entities. For instance, the authors in [100] show that certain queries are only possible due to reasoning capabilities over semantic data and cannot be issued over relational data using SQL. Once the semantic data has been made available, the end user must have the prerequisite skills to be able to utilize the data effectively.

Developing Semantic Web applications requires handling RDF concepts and data in a programming language [79]. Currently, the majority of software is developed using object-oriented programming languages. However, programming in RDF is triples-oriented. Attempts to integrate the Semantic Web and object-oriented programming have thus far resulted in solutions in which there is always a trade-off between cost, performance, and simplicity of use [72, 73]. Such solutions that propose the modification of programming syntax require the introduction of new compilers and interpreters.

In order to understand users' information needs accurately enough to allow for retrieving a precise answer, interfaces that translate users' Natural Language (NL) based queries into formal queries have been explored [92, 21, 54, 112, 113, 32]. Compared to keyword-based search, systems based on natural language can imply semantic relationships between keywords using a whole sentence [34]. However, simple search-box and concept-based search interfaces have been shown to achieve results comparable to NL query approaches [27]. Additionally, some existing natural language based approaches limit input to a subset of natural language rules by introducing a pre-specified vocabulary, grammar, or sentence structures that must be followed while constructing a query [33].

Approaches that avoid the challenges of natural language processing rely on controlled environments, guiding the user step by step with suggestions of terms that are connected in the ontology [9, 48, 20] and formulating queries interactively. Querix [48] translates natural language queries into SPARQL. In case an NL query translates into multiple semantic queries, Querix relies on clarification from the users via dialog boxes. Ginseng (Guided Input Natural language Search Engine) [9] allows users to query OWL knowledge bases using a controlled input language. The system provides suggestions through pop-up lists for each word in the user entry. These pop-up menus offer suggestions on how to complete the current user-entered word and show the options for the next word. The possible choices get reduced as the user continues to type. The system does not accept entries that are not part of these suggested lists. Once a query is generated, Ginseng translates the entry into SPARQL statements, executes it against the ontology model, and displays the SPARQL query and answer to the user.
The approach by Sander et al. [92] requires predefined SPIN (SPARQL Inferencing Notation) rules to be stored in the semantic repository. SPIN rules are SPARQL statements stored as part of the RDF graph. Similarly, form-based query construction methods [33] require users to fill out a variety of information in web forms, which may be both cumbersome and time-consuming. Approaches that rely on predefined rules limit the ability of users to formulate new queries on demand and rely on the involvement of IT experts or database admins to add new queries to the library [49]. Finally, the approach presented in [108] adds a step for graph summarization, since it works on the data graph to construct queries. Another step is added for pre-processing user inputs using lexical databases (i.e., WordNet, https://wordnet.princeton.edu/). This results in the construction of multiple queries, so the approach presents the top-k queries to the user, who selects one for retrieving all answers. Instead, our proposed approach reduces the computational requirements of search while at the same time enabling queries that contain semantic relationships represented by a path in the RDF graph. Instead of examining all paths between all pairs of vertices or imposing path length constraints, we construct SPARQL queries using a compact subgraph that covers the relevant vertices.

2.2.2.1 Comparison with ASQFor

Table 2.1 summarizes the comparison of different approaches that aim to translate users' query intentions into formal SPARQL queries. Usually, the approaches with NL or constrained NL inputs require pre-processing to disambiguate the user input. During the query formulation stage, some approaches rely on pre-computed dictionaries and rules such as templates, SPIN rules, and grammar rules [32]. Some approaches [58] limit input to NL queries that map to only a few triples. ASQFor does not require pre-processing and does not rely on pre-computed rules for formulating queries. The schema ontology is traversed on the fly to formulate SPARQL queries, resulting in a minimal formulation overhead. Since a user query does not need to be linguistically correct but must contain a minimum set of "relevant" concepts, we believe that our proposed keyword-based approach is more suitable for querying triple stores. This eliminates the need for pre-processing natural language phrases into discernible tokens before matching such keywords to concepts and attributes in the ontology. If a user requires results based on multiple attributes from the database, it would be tedious to formulate such a query through an NL interface. ASQFor can also handle subclass relationships among different classes.

Table 2.1: Comparison of Query Abstractions for the Semantic Web. Each approach is compared on five criteria: (a) no pre-processing required, (b) no user-guided disambiguation, (c) handles subclass relationships, (d) no restrictions on the structure of the ontology, and (e) no limitations on query size.

  Keyword-based input:   ASQFor (satisfies 4 of 5 criteria)
  Constrained NL input:  Ginseng [9] (1 of 5), Querix [48] (0 of 5), SQUALL [32] (3 of 5)
  NL input:              Aqualog [58] (0 of 5), AutoSPARQL [54] (1 of 5), NLP-Reduce [46] (2 of 5), PANTO [113] (1 of 5), Pradel et al. [78] (0 of 5), Sander et al. [92] (3 of 5), Sparklis [33] (1 of 5), Unger et al. [111] (2 of 5), USI Answers [112] (1 of 5), Zheng et al. [126] (2 of 5)

2.2.3 Data Mining for the Semantic Web

Knowledge discovery is a multistep process consisting of data selection, pre-processing, transformation, mining, and evaluation. This process starts by developing an understanding of the target domain and identifying insights that can be beneficial for the end users.
Based on this understanding, an appropriate selection of the data set(s) needs to be made, comprising the features or variables that can be used for discovering those insights. Data cleaning and pre-processing is always a critical step of this pipeline. The transformation process deals with feature reduction and an appropriate representation depending on the chosen data mining task. The data mining task takes the transformed representation as input and finds the patterns of interest. Finally, the results are evaluated and presented back to the user [30]. In this thesis, we focus on the transformation processes for Linked Open Data to make it compatible with different knowledge discovery tasks.

2.2.3.1 Linked Open Data Cloud

Linked Open Data (LOD) [121] is a massive KG, shown in Figure 2.1, in which each node is itself a Knowledge Graph. Each of these individual KGs comes from semantically annotating and integrating unstructured data and publishing it as structured data on the web using the principles of Linked Data and the Semantic Web [42, 74, 88, 90]. The LOD cloud currently comprises 1205 KGs (source: http://lod-cloud.net/). A few of the KGs that are part of the LOD cloud are Wikidata (https://www.wikidata.org/wiki/Wikidata:Main_Page), Freebase (https://developers.google.com/freebase/), and DBpedia (https://wiki.dbpedia.org/) [29, 45]. Due to its open availability, heterogeneity, and cross-domain nature, LOD is increasingly becoming a valuable source of information for many data mining tasks. However, most data mining algorithms work with a propositional feature vector representation of the data [82]. Recently, graph embedding techniques have become a popular area of interest in the research community. An embedding maps an entire graph or individual nodes into a low dimensional vector space, preserving as much information as possible about its neighborhood [36, 81, 97].

Figure 2.1: Linked Open Data [Source: http://lod-cloud.net/]

DBpedia. We use DBpedia in this thesis for our experiments. DBpedia is a knowledge graph-based representation of Wikipedia (Figure 2.2), and it is available online through a publicly accessible SPARQL query endpoint (https://dbpedia.org/sparql) and as downloadable offline copies of the entire knowledge graph from different dates (https://wiki.dbpedia.org/develop/datasets). DBpedia is represented by the node in the exact middle of the LOD cloud shown in Figure 2.1. The main sources of information for populating DBpedia are the Wikipedia infoboxes, which provide key information about the entities on each Wikipedia page [55].

Figure 2.2: DBpedia - Wikipedia as part of the LOD cloud
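Because the DBpedia SPARQL endpoint noted above is publicly accessible, entity data can be retrieved programmatically; the following minimal sketch uses the SPARQLWrapper package (assumed installed via pip install sparqlwrapper), and the specific query is an illustrative assumption.

    # A minimal sketch of querying the public DBpedia endpoint
    # (https://dbpedia.org/sparql) with SPARQLWrapper.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?director WHERE {
            <http://dbpedia.org/resource/Batman_(1989_film)>
                <http://dbpedia.org/ontology/director> ?director .
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["director"]["value"])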
Graph kernel-based approaches simultaneously traverse the neighborhoods of a pair of entities in the graph to compute kernel functions based on metrics such as the number of common substructures (e.g., paths or trees) [22, 61] or graphlets [94, 120]. Recently, graph embedding methods have become a popular area of interest in the research community. Embeddings are low-dimensional representations that map an entire knowledge graph, its entities (nodes), or its relationships (edges) into a vector space [14, 36], as shown in Figure 2.3.

Figure 2.3: Knowledge Graph Embeddings - Source: DeepWalk [76]

Neural language models such as word2vec [65] and GloVe [75], initially proposed for generating word embeddings, have been adapted for KGs [18, 37, 82]. Deep Graph Kernel [120] identifies graph substructures (graphlets) and uses neural language models to compute a similarity matrix between the identified substructures. For large-scale KGs, embedding techniques based on random walks have been proposed in the literature. DeepWalk [76] learns graph embeddings for nodes in the graph using neural language models while generating truncated uniform random walks. node2vec [37] is a more general approach than DeepWalk and uses second-order biased random walks for generating graph embeddings, preserving the roles and community memberships of nodes. RDF2Vec [82], an extension of DeepWalk and Deep Graph Kernel, uses BFS-based random walks for extracting subgraphs from RDF graphs, which are converted into feature vectors using word2vec [65]. Random walk-based approaches such as RDF2Vec have been shown to outperform graph kernel-based approaches in terms of scalability and their suitability for ML tasks on large-scale KGs [36, 82].

The main limitation of approaches using uniform (or unbiased) random walks is the lack of control over the explored neighborhood, which can lead to the inclusion of less relevant nodes in the identified subgraphs of target entities. Biased random walk-based approaches have recently been proposed to address this challenge [2, 17, 91]. Such approaches use different weighting schemes for nodes and edges. The weights create the bias by making specific nodes or edges more likely to be visited during random walks. The work closest to ours is the biased RDF2Vec approach [17], which uses frequency-, degree-, and PageRank-based metrics for its weighting schemes. Our proposed approach also uses biased random walks to extract entity representations. However, unlike [17], we use our proposed metric of specificity as an edge- and path-weighting scheme for biased random walks to identify the most relevant subgraphs for extracting entity representations in KGs.

Translation-based embedding models like TransE [13], TransR [57], and TransH [115] have been shown to successfully model entities and their relations using embeddings. These models are trained for specific tasks, e.g., link prediction. Approaches like RDF2Vec based on word2vec can be used to create general-purpose embeddings, which have been shown to perform well for various tasks such as recommendation, regression, and classification [82]. CrossE [122] combines general embeddings and triple-based (interaction) embeddings to achieve better results on complex and more challenging data sets for KG completion tasks.
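To make the walk-then-embed pipeline behind DeepWalk and RDF2Vec concrete, the following is a minimal sketch under simplifying assumptions: the KG is held as an in-memory adjacency list of (predicate, object) pairs, walks are uniform (unbiased), and the toy triples are hypothetical stand-ins for DBpedia-style data.

```python
import random
from gensim.models import Word2Vec

# Toy RDF graph as an adjacency list: subject -> [(predicate, object), ...].
graph = {
    "Batman_(1989)": [("director", "Tim_Burton"), ("starring", "Michael_Keaton")],
    "Tim_Burton": [("knownFor", "Gothic_Films"), ("birthYear", "1958")],
    "Michael_Keaton": [("birthPlace", "Coraopolis")],
}

def random_walk(start, depth):
    """One uniform random walk of at most `depth` hops, recording edge and
    node labels in visitation order (RDF2Vec-style document representation)."""
    walk = [start]
    node = start
    for _ in range(depth):
        edges = graph.get(node)
        if not edges:
            break
        predicate, node = random.choice(edges)
        walk.extend([predicate, node])
    return walk

# A small corpus of walks serves as the "documents" for the language model.
corpus = [random_walk("Batman_(1989)", depth=2) for _ in range(200)]

# Skip-gram (sg=1) over the walk corpus yields a vector for every label.
model = Word2Vec(corpus, vector_size=64, window=5, sg=1, min_count=1, epochs=5)
print(model.wv["Batman_(1989)"][:5])
```

In the biased variants discussed above, random.choice would be replaced by sampling proportional to edge weights (frequency, PageRank, or, in our approach, specificity).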
Our work, using a specificity-based approach, aims at reducing the size of the knowledge graph required for generating embeddings, and is geared towards approaches (such as [82, 76, 37]) that are suitable for general-purpose embeddings (i.e., not trained for a specific task).

2.2.3.3 Semantic Similarity and Relatedness

There can be various types of implicit relationships between entities (concepts and relationships) in semantic data. However, we will focus on similarity (or semantic relatedness), which is a relational measure between semantic entities. For example, determining equipment with similar behavior on industrial facilities can help in optimizing maintenance routines or facilitate knowledge sharing. Determining this type of similarity requires representing an entity and all of its attributes in a vector space for comparison.

Semantic similarity and relatedness between two entities have been relatively well explored [1, 35, 52, 74]. Searching for similar or related entities given a search query is a common task in the field of Information Retrieval [24, 41, 60]. To facilitate the search for similar entities, the notion of similarity and the set of attributes used for its computation must first be defined. Semantic similarity and relatedness are often used interchangeably in the literature [1, 74, 88, 105], where the similarity between two entities is sometimes computed based on common paths between them. This definition allows the computation of similarity between any two given entities, including entities of different types. For example, Kobe Bryant and Kareem Abdul-Jabbar (athletes) are each related to the LA Lakers (a team) based on path-based similarity. The other kind of similarity is between Kobe Bryant and Kareem Abdul-Jabbar, who are entities of the same type, i.e., athletes. Both are athletes, basketball players, and have played for the same team. These attributes in a KG, e.g., DBpedia, are represented through the semantic relationships rdf:type and dct:subject. PathSim [105] is one of the approaches proposed for searching for similar entities in heterogeneous information networks. This approach is based on user-defined meta paths (i.e., sequences of relationships between entities) connecting entities of the same type. In contrast, our objective is to identify the most relevant paths automatically using specificity. In this thesis, our objective is to automatically identify the semantic relationships that constitute the representative neighborhoods of entities of the same type. Therefore, we limit the computation of similarity to be between two entities of the same type.

Chapter 3
Strategies for Creating Features from Linked Open Data

3.1 Motivation

Knowledge Graphs (KGs) have become useful sources of structured data for information retrieval and data analytics tasks. Enabling complex analytics, however, requires entities in KGs to be represented in a way that is suitable for Machine Learning tasks. Several approaches have recently been proposed for obtaining vector representations of KGs based on identifying and extracting relevant graph substructures using both uniform and biased random walks. However, such approaches lead to representations comprising mostly popular, instead of relevant, entities in the KG. In KGs in which different types of entities often exist (such as in Linked Open Data), a given target entity may have its own distinct set of most relevant nodes and edges.
We propose specificity as an accurate measure for identifying the most relevant, entity-specific nodes and edges. We develop a scalable method based on bidirectional random walks to compute specificity. Our experimental evaluation shows that specificity-based biased random walks extract more meaningful (in terms of size and relevance) substructures compared to the state-of-the-art, and the graph embeddings learned from the extracted substructures perform well against existing methods in common data mining tasks.

3.2 Problem Statement

Knowledge Graphs (KGs), i.e., graph-structured knowledge bases, store information as entities and the relationships between them, often following some schema or ontology [8]. With the emergence of Linked Open Data [11], DBpedia [2], and the Google Knowledge Graph (https://www.blog.google/products/search/introducing-knowledge-graph-things-not/), large-scale KGs have drawn much attention and have become important data sources for many data mining and other knowledge discovery tasks [25, 44, 69, 77, 80, 98, 128]. As such algorithms work with the propositional representation of data (i.e., feature vectors) [82], several adaptations of language modeling approaches such as word2vec [65] and GloVe [75] have been proposed for generating graph embeddings for entities in a KG. As a first step in such approaches, a representative subgraph for each target entity in the KG must be acquired.

Each entity (represented by a node) in a heterogeneous KG is surrounded by other entities (or nodes) connected by directed labeled edges. For a node representing a film, there can be a labeled edge director connecting it to a node representing the director of the film. Another labeled edge releaseYear may exist that connects the film to an integer literal, also represented by a node in the KG. If we want to extract a subgraph representing the film, ideally such relevant information (labeled edges and nodes) must be part of the extracted subgraph. On the other hand, relationships linked to, say, the director, such as birthYear or deathYear, which are at two hops from a film, e.g., Batman (1989) −director→ Tim Burton −birthYear→ 1958, may not be useful or relevant for representing a film. Therefore, in order to extract a useful representation (as a subgraph) of a given entity, we first need to automatically determine the most relevant edges and nodes in its neighborhood. An extracted representation of any target entity in a KG (since entities are represented as nodes in a KG, we use entity and node interchangeably throughout the text) can only be considered representative if it includes only the most relevant nodes and edges w.r.t. the target entity. To accomplish this task, approaches based on biased random walks [17, 37] have been proposed. Such approaches use weighting schemes to make a particular set of edges and nodes more likely to be included in the extracted subgraphs than others. However, weighting schemes based on metrics such as frequency or PageRank [17, 107, 26] tend to favor popular (or densely connected) nodes in the representative subgraphs of target entities at the expense of semantically more relevant nodes and edges.

3.3 Proposed Solution

For our approach, we focus on RDF KGs, which are encoded using the Resource Description Framework (RDF) syntax and constitute Linked Open Data [24, 51]. We assert that the representative subgraphs of different types of entities (e.g., book, film, drug, athlete) in RDF KGs may comprise distinct sets of relationships.
Our objective is to automatically identify such relationships and use them to extract entity-specific representations. This is in contrast to the scenario where extracted representations are KG-specific because of the inclusion of popular nodes and edges, irrespective of their semantic relevance to the target entities. The key aspects of our solution are:

• We propose Specificity as an accurate measure for assigning weights to those semantic relationships which constitute the most intuitively relevant representations for a given set or type of entities.

• We provide a scalable method of computing specificity for semantic relationships of any depth in large-scale KGs.

• We show that specificity-based biased random walks enable more compact extraction of relevant subgraphs for target entities in a KG as compared to the state-of-the-art.

To demonstrate the usefulness of our specificity-based approach for real-world applications, we train a neural language model (Skip-gram [64]) for generating graph embeddings from the extracted subgraphs and use the generated embeddings for the tasks of entity recommendation, regression, and classification for select entities in DBpedia.

3.4 Specificity: An Intuitive Relevance Metric

In this section, we introduce and motivate the use of specificity as a novel metric for quantifying relevance.

Figure 3.1: Random walks from node Batman (1989):
  Batman (1989) −dir.→ Tim Burton −knownFor→ Gothic Films
  Batman (1989) −dir.→ Tim Burton −subject→ 1958 births
  Batman (1989) −dir.→ Tim Burton −birthPlace→ Burbank, CA

3.4.1 Specificity

Consider the example shown in Figure 3.1. Starting from the entity Batman (1989) in DBpedia, a random walk explores the three shown semantic relationships (descriptive names are used for brevity instead of actual DBpedia URIs). Our intuition suggests that the style of a director (represented by Gothic Films) is more relevant to a film than his year and place of birth. Frequency-, degree-, or PageRank-based metrics of assigning relevance may assign higher scores to nodes representing broader categories or locations. For example, the (non-normalized) PageRank scores computed for the DBpedia entities Gothic Films, 1958 births, and Burbank, CA are 0.586402, 161.258, and 57.1176 respectively (calculated based on [107]). PageRank-based biased random walks may include these popular nodes and exclude intuitively more relevant information related to the target entity. Our objective is to develop a metric that assigns a higher score to more relevant nodes and edges in such a way that the node Gothic Films becomes more likely to be captured for Batman (1989) than 1958 births and Burbank, CA. This way, the proposed metric captures our intuition behind identifying more relevant information in terms of its specificity to the target entity. To quantify this relevance, we determine whether Gothic Films represents information that is specific to Batman (1989). We trace all paths of depth d reaching Gothic Films and compute the ratio of the number of those paths that originate from Batman (1989) to the number of all traced paths. This gives the specificity of Gothic Films to Batman (1989) as a score between 0.0 and 1.0.
A specificity score of 1.0 means that all paths of depth d reaching Gothic Films have originated from Batman (1989). For $G = \{V, E\}$, this node-to-node specificity of a node $n_1$ to $n_2$, such that $n_1 \in V$, $n_2 \in V$, and $p_d$ being any arbitrary path, can be defined as:

$$Specificity(n_1, n_2) = \frac{|\langle n_2, p_d, n_1 \rangle \in G|}{|\langle v, p_d, n_1 \rangle \in G : v \in V|} \qquad (3.1)$$

Computing specificity between every pair of nodes in a large-scale KG is impractical. Instead of defining specificity as a metric of relevance between each pair of entities (or nodes), we make two simplifying assumptions. First, we assert that each class or type of entities (e.g., films, books, athletes, politicians) has a distinct set of characteristic semantic relationships. This enables us to compute specificity as a metric of relevance of a node (Gothic Films) to a class or type of entities (Film), instead of every instance of that class (Batman (1989)). Second, we measure the specificity of a semantic relationship (director,knownFor), instead of an entity (Gothic Films), to the class of target entities. Here, we assume that if the majority of the entities (nodes) reachable via a given semantic relationship represent entity-specific information, then that semantic relationship is highly specific to the given class of target entities. From our example, this means that instead of measuring the specificity of Gothic Films to Batman (1989), we measure the specificity of the semantic relationship director,knownFor to the class or entity type Film. Based on these assumptions, we redefine specificity as:

Definition 4 (Specificity): Given an RDF graph $G = \{V, E\}$, a semantic relationship $p_d$ of depth $d$, and a set $S \subseteq V$ of all entities $s$ of type $t$, let $V_{S,p_d} \subseteq V$ be the set of all nodes reachable from $S$ via $p_d$. We define the specificity of $p_d$ to $S$ as

$$Specificity(p_d, S) = \frac{1}{|V_{S,p_d}|} \sum_{k \in V_{S,p_d}} \frac{|\langle s, q_d, k \rangle \in G : s \in S|}{|\langle v, q_d, k \rangle \in G : v \in V|} \qquad (3.2)$$

where $q_d$ represents any arbitrary semantic relationship of length $d$. Figure 3.2 provides a visual representation of Equation 3.2. $S$ is the set of all nodes with a given type $t$ and $S' = V - S$. $S$ and $V_{S,p_d}$ are shown as disjoint sets only for illustrative purposes; $S \cap V_{S,p_d} \neq \emptyset$ when $p_d$ creates a loop or, for $d = 1$, when $p_d$ is a self-property.

Figure 3.2: Illustration of specificity. (a) $Specificity(p_d, S) = 50\%$ when an equal number of paths terminate on $V_{S,p_d}$ from the set $S$ and its complement $S'$. (b) $Specificity(p_d, S) = 100\%$ when $p_d$ is relevant to multiple different subsets of $V$ but the information $V_{S,p_d}$ it links to $S$ is exclusively connected to $S$.

3.4.2 Incorporating Hierarchy of Classes into Specificity Computations

We also present an extension of specificity which takes into account the class hierarchy of the schema ontology of the KG for its computation. In Equation 3.2, the numerator only counts those semantic relationships that originate from $s \in S$. This definition is rigid because a given semantic relationship can be specific to multiple classes of entities. In other words, certain semantic relationships can apply to a broader class and, by extension, to multiple of its subclasses. When computing the specificity of a given semantic relationship to instances of one of these subclasses, e.g., Film, the specificity score may get penalized due to the relevance of the semantic relationship to the other subclasses, such as TelevisionShow. To rectify this and make the specificity computations more flexible, we leverage the hierarchical relationships represented by the schema ontology.
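Before detailing that hierarchical extension, a small sketch grounding Equation 3.2 on a handful of hypothetical triples (all names invented) may help; note that the denominator counts incoming edges via any predicate, not just the path being scored.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object) standing in for G = {V, E}.
triples = [
    ("Film_A", "director", "Dir_X"),
    ("Film_B", "director", "Dir_Y"),
    ("Show_C", "director", "Dir_Y"),     # a TV show shares director Dir_Y
    ("Fan_Club_Z", "admires", "Dir_Y"),  # an unrelated incoming edge to Dir_Y
]
films = {"Film_A", "Film_B"}             # S: all entities of type Film

def specificity(p, S):
    """Equation 3.2 for depth d = 1: for each node k reachable from S via p,
    take the fraction of ALL incoming edges of k (any predicate, i.e. q_d)
    that originate in S, then average those fractions over the nodes k."""
    reachable = {o for s, pred, o in triples if pred == p and s in S}
    incoming = defaultdict(list)
    for s, pred, o in triples:
        if o in reachable:
            incoming[o].append(s)
    return sum(
        sum(src in S for src in srcs) / len(srcs) for srcs in incoming.values()
    ) / len(reachable)

# Dir_X: 1/1 edges from S; Dir_Y: 1/3 -> (1.0 + 0.333...) / 2 = 0.666...
print(specificity("director", films))
```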
One of the most common relationships in an ontology is hyponymy, also called the is-a relationship [16, 83]. Hyponymy represents the relationship between a hypernym (the broader category) and its hyponyms (more specific categories). Figure 3.3 shows a subset of the hierarchical structure of the DBpedia ontology. Film, MusicalWork, and WrittenWork are hyponyms of the hypernym Work and co-hyponyms of each other.

Figure 3.3: A subset of DBpedia class hierarchy

Assume that the class hierarchy in the schema ontology of the KG is structured as an n-ary tree, i.e., a class can have more than one subclass but not more than one superclass. Moreover, there is only one class that has no superclass, which is the root of the entire class hierarchy. Let the given entity type $t$ be at height (or depth) $h$ from the root, such that $t$ and the root can be labeled as $C_0$ and $C_h$ respectively, which results in a hierarchy of classes $C_0 \subseteq C_1 \subseteq \dots \subseteq C_h$. Subclasses of $C_1$ are co-hyponyms of $C_0$. Our objective is to include the instances of hypernyms of entity type $t$ ($C_0$) in the computation of specificity. In Equation 3.2, we only consider $s \in S$, all of which are of type $t$ (or instances of class $t$). We add a term to the numerator to include instances of hypernyms of $t$ in the computation of specificity as follows:

$$Specificity(p_d, S) = \frac{1}{|V_{S,p_d}|} \sum_{k \in V_{S,p_d}} \frac{1}{|\langle v, q_d, k \rangle \in G : \forall v \in V|} \left( \sum_{j=0}^{h} \beta^j \, |\langle v_j, q_d, k \rangle \in G : \forall v_j \in V \wedge type(v_j) = C_j| \right) \qquad (3.3)$$

Additional constraints for Equation 3.3 are $v_0 \in S$ and $v_1 \cap v_2 \cap \dots \cap v_h \cap S = \emptyset$. This is to ensure that each instance of type $t$ or any of its superclasses is only considered once in the computation. The factor $\beta$ ensures a diminishing influence of broader concepts on the computation of specificity for more specific concepts. For example, from Figure 3.3, when computing the specificity of a given semantic relationship for the class Film, instances of the class Work that are not films have a higher influence than instances of the broader class Thing.

3.5 Bidirectional Random Walks for Computing Specificity

Computing Equation 3.3 requires accessing large parts of the knowledge graph. In this section, we present an approach that uses bidirectional random walks to compute specificity. To understand it, consider an entity type $t$ and a semantic relationship $p_d$, for which we want to compute $Specificity(p_d, t)$. We start with a set $S$ containing a small number of randomly selected nodes of type $t$. From the nodes in $S$, forward random walks via $p_d$ are performed to collect a set of nodes $V_{S,p_d}$ (ignoring intermediate nodes for $d > 1$). From the nodes in $V_{S,p_d}$, reverse random walks in $G$ (or forward random walks in $G_r = reverse(G)$) are performed using arbitrary paths of length $d$ to determine the probability of reaching any node of type $t$. Specificity is computed as the number of times a reverse walk lands on a node of type $t$ divided by the total number of walks. This idea is the basis for the algorithm presented next, which builds a list of the most relevant semantic relationships up to depth $d$, sorted by their specificity to a given entity type $t$.

Algorithm 1 rankBySpecificity(G, d, t)
Input: RDF graph G = {V, E}; d, the maximum depth of semantic relationships to be considered, originating from entities of type t.
Output: Returns ranked list Q_spec[] of semantic relationships for depths ≤ d, with scores of 0.0−1.0
1: initialize Q_paths, Q_spec[] to null/empty
2: initialize N_paths, N_walks, β
3: S ← Generate random nodes of type t
4: for i ← 1, d do
5:   Q_paths ← selectPaths(G, S, i, N_paths)
6:   Q_spec[i] ← computeSpecificity(G, Q_paths, S, t, i, N_walks, β)
7: end for
8: return Q_spec[]

Specifically, Q_paths and Q_spec[] hold the set of semantic relationships, unsorted and sorted by specificity respectively. Q_spec[] is initialized as an array of size d to hold sorted semantic relationships for every depth up to d. N_paths specifies the size of Q_paths. N_walks is the number of bidirectional walks performed when computing specificity for each semantic relationship in Q_paths. A set S of randomly selected nodes of type t is generated in line 3. For each i-th iteration (i ≤ d), a set of semantic relationships Q_paths is selected in line 5. The function computeSpecificity, in line 6, computes specificity for each semantic relationship in Q_paths and returns the results in Q_spec[i]. Q_spec is an array of dictionaries; each dictionary contains key−value pairs sorted by value, where the key is the semantic relationship and the value is its specificity. For each i-th iteration of the for loop in Algorithm 1, Q_paths can be populated from scratch with semantic relationships of depth i by random sampling of outgoing paths from S. Alternatively, for iterations i ≥ 2, Q_paths can be populated by expanding from the most specific semantic relationships in Q_spec[i−1].

Algorithm 2 computeSpecificity(G, Q, S, t, d, N, β)
Input: RDF graph G = {V, E}; S, a set of random entities with a common type t; Q, a set of semantic relationships of length d to be processed; N, the number of bidirectional walks to be performed for each semantic relationship in Q.
Output: Returns list L of semantic relationships sorted by specificity (0.0−1.0)
1: G_r = reverseEdges(G)
2: H = hierarchy(t)
3: initialize dictionary L
4: for all q ∈ Q do
5:   count ← 0.0
6:   repeat
7:     s ← randomly pick a node from S
8:     v ← randomly explore a node from s using path q in G
9:     v′ ← randomly explore a node from v using any path q′ in G_r
10:    for i ← 1, size(H) do
11:      if ∃ <v′, rdf:type, H[i]> ∈ G then
12:        count ← count + β^(i−1)
13:        break
14:      end if
15:    end for
16:  until N times
17:  insert (q, count/N) in L
18: end for
19: return L

Algorithm 2 shows the function computeSpecificity, which computes specificity for a given set of semantic relationships Q (Q_paths from Algorithm 1). In lines 7 and 8, for each semantic relationship q ∈ Q, a node s ∈ S is randomly selected to get a node v reachable from s via the semantic relationship q in G (forward walk). In line 9, starting from v and selecting an arbitrary path via any semantic relationship q′ of depth d, a node v′ is reached (reverse walk in G, or forward walk in G_r). If implementing specificity based on Equation 3.2, the algorithm only needs to check whether t is one of the types associated with v′ and simply increment an integer variable count [87, 91, 89]. To implement specificity based on Equation 3.3, we first need to acquire the class hierarchy of entity type t in H in line 2. The first index of H holds t, and every subsequent entry H[i] for i > 1 holds the hypernym or superclass of entry H[i−1]. For example, for entity type Film (Figure 3.3), H = [Film, Work, Thing]. Starting from the first element in H at index i = 1, the existence of the triple <v′, rdf:type, H[i]> is checked in G. If such a triple exists, count is incremented by β^(i−1). This process of bidirectional walks is repeated N times for each q. At line 17, specificity is computed as count/N. Lines 4−18 are repeated until the specificity of each q ∈ Q has been computed. The return variable L contains semantic relationships and their specificity scores as key−value pairs.
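A compact Python sketch of Algorithm 2, assuming the graph is available as in-memory forward and reverse adjacency maps rather than behind a SPARQL endpoint as in our implementation (helper names and data structures are illustrative):

```python
import random

def compute_specificity(fwd, rev, types, Q, S, hierarchy, N=5000, beta=0.25):
    """Monte Carlo estimate of Specificity (Algorithm 2) for each path q in Q.
    fwd/rev: node -> {predicate: [neighbor, ...]} maps for G and reverse(G);
    types: node -> set of classes; hierarchy: [t, superclass(t), ...]."""
    L = {}
    for q in Q:                          # q is a tuple of predicates (path p_d)
        count, done = 0.0, 0
        while done < N:
            node = random.choice(S)      # line 7: random seed node of type t
            try:
                for pred in q:           # line 8: forward walk along q in G
                    node = random.choice(fwd[node][pred])
                for _ in q:              # line 9: reverse walk along any q'
                    pred = random.choice(list(rev[node]))
                    node = random.choice(rev[node][pred])
            except (KeyError, IndexError):
                continue                 # dead end: resample (simplification)
            done += 1
            for i, cls in enumerate(hierarchy):   # lines 10-15
                if cls in types.get(node, ()):
                    count += beta ** i   # beta^(i-1) with 1-based index i
                    break
        L[q] = count / N                 # line 17
    return dict(sorted(L.items(), key=lambda kv: kv[1], reverse=True))
```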
3.6 Evaluation

We evaluate our approach in multiple ways. We analyze the behavior of specificity computed for the most relevant semantic relationships up to depth 3. We evaluate the compactness of the subgraphs extracted by the specificity-based biased random walk scheme against other metrics used as baselines. We generate embeddings from the extracted subgraphs to perform the tasks of entity recommendation, regression, and classification. We analyze the ability of the generated embeddings to preserve the semantics associated with the entities extracted from the KG. We study the sensitivity of specificity to the parameters N_walks (number of bidirectional walks) and |S| (seed set size). Finally, we provide an empirical analysis of the running time of Algorithm 2.

3.6.1 Datasets

We use DBpedia for evaluation, which is one of the largest publicly available RDF repositories [55]. We have used the English version of the DBpedia dataset from 2016-04 (http://wiki.dbpedia.org/dbpedia-version-2016-04). We create graph embeddings for 5000 entities each of the types Film, Book, and Album. We also generate embeddings for 500 and 3000 entities of types Country and City respectively. We use three different datasets for the tasks of classification and regression, which provide classification and regression targets for DBpedia cities, films, and music albums:

• The Mercer Cities dataset (https://mobilityexchange.mercer.com/Insights/quality-of-living-rankings) contains a list of cities and their quality of living as numeric scores and discrete labels ("high", "medium", "low").

• The Metacritic Movies (http://www.metacritic.com/browse/movies/score/metascore/all) and Metacritic Music Albums (http://www.metacritic.com/browse/albums/score/metascore/all) datasets contain the Metacritic scores (0-100), which were used as regression targets. The classification targets are provided as either "good" (score ≥ 50) or "bad" (score < 50) [82].

These datasets are accessible at http://data.dws.informatik.uni-mannheim.de/rmlod/LOD_ML_Datasets/data/datasets/.

Figure 3.4: Extracted subgraphs are represented as a sequence of edge and node labels, resulting in a document-like representation. The specificity scores shown are with respect to the entity type Film. The relationship starring has a specificity score of 79.1%, whereas starring→spouse has a score of 34.16%. Only those nodes connected by semantic relationships of specificity higher than 50% are considered for the extracted representation.

3.6.2 Experimental Setup and Methodology

For our experiments, we hosted the DBpedia dataset using OpenLink Virtuoso (https://virtuoso.openlinksw.com/) on a server. All data transactions between the implemented modules and the repository occurred as SPARQL queries. The evaluation methodology is as follows:

• Implementation of Baselines and Proposed Approach: In the first step, we computed Specificity and Specificity_H to find the set of most relevant semantic relationships for entities of the selected types. Unless otherwise specified, we used the following values for the parameters of Algorithms 1 and 2: N_walks = 5000, |S| = 500, β = 0.25.
Starting from a random seed set S for each type of entities, we randomly sampled semantic relationships originating from the entities in S and selected the top-25d semantic relationships for each depth d based on their frequency of occurrence. This became the frequency-based baseline, in which the relevance of a semantic relationship is based on how frequently it occurs in the KG. After computing specificity for each of the most frequent semantic relationships, we only considered those as relevant that had specificity scores ≥ 50%. For creating the PageRank-based baseline, we used the PageRank DBpedia dataset provided by Thalhammer et al. [107] (available at http://people.aifb.kit.edu/ath/#DBpedia_PageRank).

• Biased Random Walks for Subgraph Extraction: Using the lists of most relevant semantic relationships based on the frequency-, PageRank-, and specificity-based metrics, we performed subgraph extraction using biased random walks. We used weighted randomized DFS for subgraph extraction for each target entity. The DFS algorithm traversed the paths starting from each target node using the list of most relevant semantic relationships. The nodes linked by more relevant semantic relationships had a greater likelihood of becoming part of the extracted subgraphs. The subgraphs are extracted as a set of unique graph walks (Definition 3) and can be represented as a sequence of edge and node labels, in the order in which they were visited [82, 17]. An example of the document-based representation of an extracted subgraph is shown in Figure 3.4. Semantic relationships such as starring and director have specificity higher than the threshold of 50% and thus are included in the extracted representation. The two depth-2 relationships starring→spouse and director→birthplace, which have low specificity scores, are excluded. For our experiments, we include up to a maximum of 1000 distinct paths in the extracted subgraphs for each entity for the proposed and baseline approaches.

• Embedding Generation using word2vec: With the document-based representation, we used the Python library gensim (https://radimrehurek.com/gensim/index.html), which provides an implementation of word2vec [65], to map each label in the generated documents into a vector space, creating embeddings for the RDF entities [82]; a sketch of this step is shown below.
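As a sketch of this last step (the corpus file name is hypothetical; the hyperparameters mirror those listed in Section 3.6.5):

```python
from gensim.models import Word2Vec

# Each line of the (hypothetical) corpus file is one extracted graph walk:
# a whitespace-separated sequence of entity and predicate labels (Figure 3.4).
with open("film_walks.txt") as f:
    walks = [line.split() for line in f]

# Skip-gram model over the walk corpus; every entity URI becomes a vector.
model = Word2Vec(
    walks, sg=1, vector_size=500, window=10, negative=25, epochs=5, min_count=1
)
model.wv.save("film_embeddings.kv")  # KeyedVectors for downstream tasks
```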
3.6.3 Specificity as a Metric for Measuring Relevance

Figure 3.5 shows the top-15 semantic relationships for entities of three types, sorted by their Specificity, up to depth 3. The frequency represents the percentage of times a path representing a particular semantic relationship is traversed when the neighborhoods of randomly selected nodes are explored. For depth 1, the topmost plot of each subplot in Figure 3.5 shows that there are only a few relatively high-frequency semantic relationships (represented by the peaks), whereas the rest show a uniform trend of frequency. As depth d increases, frequency exhibits a flattened trend due to a rapid increase in the number of possible semantic relationships at each depth. This trend makes it difficult to define a frequency-based cut-off value for choosing a certain set of semantic relationships as most relevant. This may require manual examination of the set of semantic relationships to choose an appropriate frequency threshold. Alternatively, we can choose a value k such that the top-k high-frequency semantic relationships are selected as the most relevant. For example, in Table 3.1, assuming that we wish to include the intuitively relevant semantic relationship dbo:director among the selected semantic relationships, we can either choose a frequency threshold ≤ 1.08 or choose k ≥ 7. However, this ad hoc method of selecting thresholds is impractical, since it has to be done for every class of entities, every depth, and every RDF KG.

Figure 3.5: Comparison of frequency- and specificity-based metrics for top-15 semantic relationships. Panels: (a) Books, (b) Music Albums, (c) Films.

Semantic Relationship   Freq. (%)   Spec. (%)   Spec._H (%)
rdf:type                38.46       74.69       79.15
dct:subject             16.94       87.9        89.09
owl:sameAs              11.88       98.4        98.4
dbo:starring             5.82       79.06       84.43
dbo:writer               1.68       76.48       81.86
dbo:producer             1.34       83.9        88.22
dbo:director             1.08       81.76       87.42
dbo:musicComposer        0.92       70.77       80.28
dbo:distributor          0.84       74.08       83.22
dbo:language             0.72       38.65       53.42
dbo:editing              0.66       91.65       92.82
dbo:cinematography       0.64       92.99       94.49

Table 3.1: Top semantic relationships based on frequency, with corresponding specificity scores, for dbo:Film

The threshold of Specificity is drawn at 50% in all plots in Figure 3.5. The specificity of a semantic relationship is the probability of reaching any node of a given type from a set of nodes ($V_{S,p_d}$ in Definition 4) by reverse walks in G (or forward walks in G_r) using any arbitrary path of length d. We can define a universal cut-off for specificity at 50%. A specificity score above this threshold means that the selected semantic relationship links the instances of the given class (or entity type) t to the set $V_{S,p_d}$ such that more than half the incoming edges to this set originate from instances of class t. This means that the information represented by nodes in $V_{S,p_d}$ is, on average, more specific to t.

In the literature, other approaches such as [74] have employed a decaying factor $\alpha^d$ (where $\alpha \in [0.0, 1.0]$) as a function of depth d, which is used to equally penalize the relevance score of all nodes at depth d from the target nodes. This is done to implement the idea that the relevance of nodes decreases as we move farther away from the target nodes. Specificity, on the other hand, has a more fine-grained mechanism of assigning relevance scores across different d's. Figure 3.5 shows that there are multiple instances of semantic relationships at depth d that have higher specificity than semantic relationships at depth < d. This way, specificity exhibits a more fine-grained behavior of relevance across depths, meaning that all semantic relationships at depth d do not simultaneously become less relevant than those at depth d−1 as d increases. There are variations in the specificity-based relevance scores of semantic relationships at the same depth as well as across depths. This allows both shallow (breadth-first) and deep (depth-first) exploration of the relevant neighborhoods around target entities by specificity-based biased random walks.

Semantic Relationship   Spec.   PR        Freq.
director,knownFor       59.14     6.2       345
director,subject         1.05   823.53    73752
director,birthPlace      0.03   200.33     7087

Table 3.2: Comparison of relevance metrics for the example in Figure 3.1

Table 3.2 shows the computed relevance of the three semantic relationships from the example in Figure 3.1 (Section 3.4.1) based on their specificity, PageRank, and frequency. The given PageRank values in column 3 are the average of the non-normalized PageRank scores [107] of the top-25 nodes linked to DBpedia entities of type Film by the corresponding semantic relationship.
The values of frequency in the last column represent the number of occurrences of the corresponding semantic relationship in the DBpedia dataset. We argued that the semantic relationship director,knownFor is more relevant to a film than the other two. Table 3.2 shows that the proposed specificity-based relevance metric is closer to our intuition than the other metrics.

Semantic Relationship   Entity Type   Spec. (%)   Spec._H (%)
dbo:language            dbo:Film      38.65       53.42
dbo:genre               dbo:Album     43.29       50.8
dbo:recordLabel         dbo:Album     43.41       56.34
dbp:genre               dbo:Book      46.85       52.29
dbo:previousWork        dbo:Book      40.0        55.0
dbo:literaryGenre       dbo:Book      46.22       51.52

Table 3.3: Comparison of Specificity and Specificity_H scores (β = 0.25)

3.6.3.1 Hierarchical Specificity

As discussed in Section 3.4.1, Specificity only includes those nodes in its computation that are dominantly exclusive, or specific, to nodes of a given type. However, many types of entities require both specific and generic nodes and relationships for complete characterization. For example, assume that a node representing a language is associated with all instances of the class Work and its subclasses (Figure 3.3). Since nodes representing languages can be linked to multiple different entity types, the specificity of the semantic relationship that links language-related nodes will be low. In other words, if the algorithm performs reverse walks from dbo:English, it can potentially land on multiple different types of nodes (e.g., films, books, songs, plays, games). Table 3.3 shows a few examples of intuitively relevant semantic relationships with specificity scores below the threshold of 50%, resulting in their exclusion from the representative subgraphs of instances of the corresponding entity types. Specificity_H, computed using Algorithm 2, takes into account the applicability of such relationships to co-hyponyms and hypernyms of the given entity types, resulting in an increase in the specificity score. This is evident in all plots in Figure 3.5, where the curve of Specificity_H lies on or above the corresponding curve of Specificity.

Figure 3.6: Average number of walks per entity for subgraph extraction. Panels: (a) Books, (b) Music Albums, (c) Films.

3.6.4 Comparison of Sizes of Representative Subgraphs

Figure 3.6 shows that specificity-based random walk schemes enable the extraction of relevant subgraphs with a smaller number of walks. Specificity-based schemes use an extraction template based on semantic relationships with ≥ 50% specificity, which enables the collection of comparatively fewer but more relevant nodes and edges than the baselines for all three of the chosen entity types. Specificity_H, as seen in Figure 3.5, assigns higher specificity scores, resulting in a few more semantic relationships meeting the 50% threshold. This results in an increase in the collected number of walks, which, nevertheless, is still below the baseline approaches. Figure 3.6a shows that the average size of the subgraph of each entity of type dbo:Book for Specificity_H is smaller by factors of 48, 16, and 5 w.r.t. the uniform, PageRank, and frequency-based approaches respectively. Similarly, Figure 3.6b shows that the average size of the subgraph of each entity of type dbo:Album for Specificity_H is smaller by factors of 62, 30, and 2 w.r.t. the uniform, PageRank, and frequency-based approaches respectively. We will revisit this discussion in the context of its impact on the recommendation task in Section 3.6.5.1.
Figure 3.6a shows that the average subgraph size is larger for depth 3 than for depth 2. One of the reasons is that the most specific depth-2 property for dbo:Book entities adds on average one walk to each of the extracted subgraphs, whereas the most specific depth-3 property adds 15 walks on average. On the other hand, Figure 3.6c shows the opposite effect, where the depth-2 subgraph size is larger than the depth-3 size for Specificity (β = 0.0). The top specific semantic relationships at depth 2 and depth 3 add 100.8 and 9.4 walks on average to each extracted subgraph. The main underlying reason is that a higher number of specific semantic relationships does not necessarily indicate a larger extracted subgraph, or vice versa. The size of the subgraph depends on the number of nodes that a particular semantic relationship connects to the target entities for inclusion in the extracted subgraph. For example, for film nodes, on average dbo:director and dbo:cinematography add 1.06 and 1.07 walks to the extracted subgraphs, whereas rdf:type and dct:subject add 28 and 8 walks respectively.

Figure 3.7: Average subgraph extraction time per entity (msec). Panels: (a) Books, (b) Music Albums, (c) Films.

Figure 3.7 shows that the smaller subgraph sizes lead to proportionally reduced subgraph extraction times, enabling faster extraction of the most relevant entity-specific information from the Knowledge Graph.

3.6.5 Generating Embeddings from Extracted Subgraphs

We have shown that specificity-based biased random walks extract more compact subgraphs representing entities as compared to other schemes. However, to show that the compactness of the extracted subgraphs is not a disadvantage, we use graph embeddings as an application to evaluate their effectiveness. Using the subgraphs extracted as documents (Section 3.6.2), we trained Skip-gram models using the following parameters: dimensions of generated vectors = 500, window size = 10, negative samples = 25, and iterations = 5, for each scheme and depth. All models for depth d > 1 are trained using sequences generated for both depth 1 and depth d. The parameters for this experiment are based on RDF2Vec [82].

3.6.5.1 Suitability for the Entity Recommendation Task

To show that the compactness of the extracted subgraphs is not a disadvantage, we use the generated graph embeddings for the task of entity recommendation. Given a vectorized entity as the search key, we list its top-k most similar results. We use the metric of precision@k to quantify the performance of the recommendation tasks. Evaluating retrieved results requires ground truth. For music albums and books, the ground truth straightforwardly consists of other works by the same artists and authors respectively. For films, we selected franchises or series as ground truth. Films in a franchise or a series usually share common attributes (e.g., director, actors, genre, characters) and are more likely to be similar to each other. For example, for any of the three The Lord of the Rings (LOTR) films, the other two films in the trilogy are its most likely top-2 similar results because of the same director, cast members, genre, and characters. In our experiments, King Kong (2005) also frequently appeared among the results similar to LOTR, since it has the same director. Other films may also occur among the top-k results based on any number of other factors, e.g., cinematography, editing, or distributor, all of which have high specificity to dbo:Film (Table 3.1). Creating exhaustive lists of films for the ground truth to encompass all such scenarios is laborious. That is why we included only those films in the ground truth that are either in a franchise (e.g., all Batman films) or part of a series (e.g., prequels or sequels). Similarity among such films is relatively stronger and easier to interpret. Assuming that there are n entities in the ground truth for a given franchise, series, author, or artist (e.g., n = 3 for LOTR or n = 71 for Agatha Christie in our DBpedia dataset), we chose one entity at a time to retrieve its top-k similar entities (for k = 1 to n−1), resulting in a total of n(n−1) recommendation tasks per franchise, author, or music artist. We performed 59616, 221280, and 124032 recommendation tasks for films, books, and music albums for each random walk scheme respectively.
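A sketch of one such recommendation task over the trained vectors (the entity URIs and file name are hypothetical; gensim's most_similar performs the cosine-similarity ranking):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("film_embeddings.kv")

def precision_at_k(query, ground_truth, k):
    """Fraction of the top-k most similar entities that are in the ground truth."""
    hits = [name for name, _ in wv.most_similar(query, topn=k)]
    return sum(h in ground_truth for h in hits) / k

# Hypothetical example: one LOTR film queried against the rest of the trilogy.
trilogy = {"The_Two_Towers", "The_Return_of_the_King"}
print(precision_at_k("The_Fellowship_of_the_Ring", trilogy, k=2))
```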
Figure 3.8: Comparison of precision and recall for entity recommendation tasks (β = 0.25 for Specificity_H). Panels: (a) Books - Precision@k, (b) Books - Recall, (c) Music Albums - Precision@k, (d) Music Albums - Recall, (e) Films - Precision@k, (f) Films - Recall.

Results. Figure 3.8 shows the results of the recommendation tasks for each scheme. Here, we have chosen β = 0.25 for computing Specificity_H. The baselines are shown as colored bars, whereas Specificity_H is drawn as a line to make the comparison clearer. Specificity_H is generally slightly better than the other baseline schemes, except in Figure 3.8e, where PageRank-based extracted subgraphs perform better.

Here, it is important to note that the compactness of the extracted subgraphs is the main contribution of our approach. We use recommendation as an application to show that, despite extracting less information from the KG, there is no significant deterioration in performance when using the specificity-based approach with such applications. The results of the recommendation task must be interpreted in conjunction with our primary metric of the size of the extracted subgraph. For dbo:Book, Figure 3.9a shows that the average size of the subgraph for each entity is 48, 16, and 5 times smaller for Specificity_H w.r.t. the uniform, PageRank, and frequency-based approaches respectively, whereas the precision values for Specificity_H (β = 0.25) and the three baselines lie within 75.1 ± 0.78%. Similarly, for music albums (Figure 3.9b), the subgraph size for Specificity_H (β = 0.25) is reduced by factors of 62, 30, and 2 respectively, with precision lying within 68.7 ± 1.6%. Figure 3.9c shows that the average size of the subgraph for the PageRank-based approach for each entity is 4.2 times larger than for Specificity_H with β = 0.25, with an advantage of 5% in precision. The size of the subgraph for each entity for the unbiased (uniform) approach is ten times larger than for the Specificity_H-based approach for films. The precision values for Specificity_H (β = 0.25) and the baselines lie in the range 71.1 ± 3%. This shows that a substantial reduction in extracted information still allows us to achieve comparable performance on the entity recommendation task with the specificity-based approach.

Effects of β on Specificity. Increasing the value of β means that the Specificity_H score (for β > 0.0) of any semantic relationship will be equal to or higher than its Specificity score. This is also evident from Figure 3.5, where Specificity_H (β = 0.25) ≥ Specificity (β = 0.0). This will result in more semantic relationships having specificity above the cut-off set at 50%.
With a higher number of relevant semantic relationships used for subgraph extraction, the size of the extracted subgraphs can increase.

Figure 3.9: Effect of β on the recommendation task. Panels: (a) Books, (b) Music Albums, (c) Films.

Figure 3.9 also shows the effect of changing β on the size of the extracted subgraphs and on the recommendation task. Here, we first refer back to the discussion around Algorithm 1 in Section 3.5. In line 5, Q_paths holds the set of selected semantic relationships, which are then re-ordered and filtered in line 6 using the metric of specificity. In our implementation, the semantic relationships for populating Q_paths are selected based on their frequency of occurrence. For β = 1.0, every semantic relationship in Q_paths gets a high specificity score in line 6, i.e., the count variable in line 12 gets incremented by 1 irrespective of the type of v′ in line 9. This means that all frequent semantic relationships are also deemed specific for β = 1.0. Figure 3.9 shows that the average sizes of the extracted subgraphs for Frequency and Specificity_H (β = 1.0) are the same in each subplot.

3.6.5.2 Semantics of Specificity-based Vector Representations

To analyze the semantics of the vector representations, we employ Principal Component Analysis (PCA) to project the generated embeddings into a two-dimensional feature space. We selected seven countries (similar to the evaluation done for RDF2Vec [82]) and their capital cities and visualized the vectors as shown in Figure 3.10. Figures 3.10d and 3.10e show that specificity-based embeddings are capable of organizing entities of different types and preserving some semantic context among them. First, there is a separation between the two different types of entities, dbo:Country and dbo:City, preserving the rdf:type relationship. Second, compared to Specificity, Specificity_H has a comparatively better, although not perfect, organization of capitals in correspondence with their respective countries. Third, in Figure 3.10e, cities are grouped together based on continents. The information regarding continents is represented via the semantic relationship dct:subject, which links each city to different DBpedia categories, including one that represents information about continents, e.g., dbc:Capitals_in_Asia. Figures 3.10b and 3.10c show that the rdf:type relationship is not preserved by the frequency- and PageRank-based projections respectively. There exists some, but inconsistent, organization of countries and capitals with respect to each other and a grouping based on continents. There is also no clear segregation based on entity types. Figure 3.10a shows that semantic associations are also preserved using uniform random walks. However, it can be seen in Figure 3.11 that the average size of the extracted subgraphs is an order of magnitude larger than for the specificity-based approach. Based on this discussion, we can see that the generated vector representations have preserved multiple types of semantic associations, i.e., entity-type, country-capital, country-continent, and country-organization. Specificity-based random walks have been able to capture these intuitively relevant semantic associations as part of the compact extracted subgraphs, proving the usefulness of specificity as a metric of relevance.
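The projection step itself is standard; a sketch using scikit-learn, assuming the embeddings were saved as gensim KeyedVectors (the file name and entity labels are abbreviated placeholders):

```python
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

wv = KeyedVectors.load("geo_embeddings.kv")  # hypothetical file
entities = ["Germany", "Berlin", "Japan", "Tokyo", "France", "Paris"]

# Project the 500-dimensional vectors onto their first two principal components.
coords = PCA(n_components=2).fit_transform([wv[e] for e in entities])

for (x, y), label in zip(coords, entities):
    plt.scatter(x, y)
    plt.annotate(label, (x, y))
plt.show()
```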
Figure 3.10: Projection of countries and capitals in 2D space using embeddings generated from RDF subgraphs of depth 2. Panels: (a) Uniform, (b) PageRank, (c) Frequency, (d) Specificity (β = 0.0), (e) Specificity_H (β = 0.25), (f) Specificity_H (β = 0.5).

Figure 3.11: Average number of walks per entity of types dbo:City and dbo:Country for subgraph extraction, for different values of β.

3.6.5.3 Suitability for Regression and Classification Tasks

We performed the tasks of classification and regression on the Mercer Cities, Metacritic Movies, and Metacritic Music Albums datasets using the embeddings generated for the DBpedia entities. We used SVM for the classification tasks and Logistic Regression for the regression tasks. We measured accuracy for the classification tasks and root mean squared error (RMSE) for the regression tasks. The numeric values shown in Table 3.4 can be interpreted as the factor of improvement (lift) when Specificity_H (β = 0.25)-based embeddings are used compared to the baselines and Specificity. The results show that all schemes are comparable, with no scheme consistently outperforming the others. However, these results are achieved using embeddings generated from smaller extracted subgraphs, proving that specificity-based embeddings can be suitable for data mining tasks.

3.6.6 Parameter Sensitivity

The algorithms for computing specificity use bidirectional random walks governed by two parameters: the number of bidirectional walks (N_walks) and the size of the seed set S. In this section, we evaluate the sensitivity of specificity to both of these parameters.

Model        Depth   Classification              Regression
                     Cities   Films    Albums    Cities   Films    Albums
Uniform      1       0.935    1.0585   0.983     0.9011   1.0079   1.0002
Uniform      2       0.9359   1.0273   1.0374    0.9131   0.9976   1.0056
PageRank     1       1.0985   1.029    0.9663    0.9546   0.9955   1.0033
PageRank     2       1.0135   0.9849   1.0484    0.9347   0.9935   1.0019
Frequency    1       1.0709   1.0118   0.9421    0.9506   0.998    0.9926
Frequency    2       0.9558   1.0294   0.9886    1.0328   1.0002   0.9996
Specificity  1       1.1098   1.0341   0.9894    1.0601   0.9967   1.0017
Specificity  2       0.9161   1.023    1.0401    0.8882   0.9985   1.0027

Table 3.4: Results of Regression and Classification Tasks - Specificity_H (β = 0.25)

3.6.6.1 Sensitivity of Specificity to the Number of Bi-directional Random Walks

The algorithm for computing specificity uses the parameter N_walks for the number of bidirectional walks. To understand the effect of this parameter on specificity, we first computed specificity for N_walks ∈ [5000, 50000]. Figures 3.12a and 3.12c show the comparison between specificity scores computed for different values of N_walks for the top-10 semantic relationships for the entity types dbo:Film and dbo:Book. The semantic relationships are sorted on the x-axis (from left to right) in order of decreasing specificity computed for N_walks = 5000, for comparison. Figure 3.12e shows the total time taken to compute specificity for the top-10 semantic relationships for each value of N_walks. The computation time increases with an increasing number of bidirectional walks. Figures 3.12b and 3.12d show the average specificity and corresponding computation time for each semantic relationship over all values of N_walks for the chosen entity types. We use the standard deviation to show the variations in the specificity scores and computation times observed for different values of N_walks. It can be observed that, generally, large variations in computation time do not lead to significant changes in the specificity scores.
Choosing a larger value of N_walks will simply increase the total computation time without having a significant effect on the computed specificity scores. If we select N_walks = 5000 as a suitable value for computing specificity, it can be seen from Figure 3.12e that the time for computing the specificity scores w.r.t. either entity type is ≤ 150s. Moreover, this computation only needs to be performed once for each type of target entities in a KG.

Figure 3.12: Effect of N_walks on the computation of specificity. Panels: (a) Specificity variations - dbo:Film, (b) Specificity vs. computation time variations - dbo:Film, (c) Specificity variations - dbo:Book, (d) Specificity vs. computation time variations - dbo:Book, (e) Total computation time of top-10 semantic relationships.

3.6.6.2 Sensitivity of Specificity to the Depth of Bi-directional Random Walks

Figure 3.13 shows the specificity computed for different depths for dbo:Film entities. Figures 3.13a-3.13c reaffirm the results presented in the previous section (3.6.6.1): choosing a larger value of N_walks does not have a significant effect on the computed specificity scores. Also, as expected, Figure 3.13d shows that the computation time of specificity increases as depth increases, since the length of the walks increases.

Figure 3.13: Effect of depth on the computation of specificity for dbo:Film. Panels: (a) Specificity variations for depth-1 semantic relationships, (b) Specificity variations for depth-2 semantic relationships, (c) Specificity variations for depth-3 semantic relationships, (d) Total computation time of top-15 semantic relationships.

3.6.6.3 Sensitivity of Specificity to the Size of the Seed Set S

Similar to the previous discussion for N_walks, Figures 3.14b and 3.14d also show that large variations in computation time do not lead to major changes in specificity. Choosing a larger value of |S| increases the total computation time without having a significant effect on the computed specificity scores. If we choose |S| ∈ [500, 1000] as a suitable value for computing specificity, it results in a computation time of ≈ 150s for computing the specificity of the top-10 most specific semantic relationships.

Figure 3.14: Effect of |S| on the computation of specificity. Panels: (a) Specificity variations - dbo:Film, (b) Specificity vs. computation time variations - dbo:Film, (c) Specificity variations - dbo:Book, (d) Specificity vs. computation time variations - dbo:Book, (e) Total computation time of top-10 semantic relationships.

3.6.7 Analysis of Running Time of Specificity Computations

The specificity computations are performed by Algorithm 2; Algorithm 1 is used for setting up the parameters and inputs for Algorithm 2. Since the algorithm for computing specificity is based on random sampling governed by specific parameters, instead of a purely theoretical formulation, we here provide a limited complexity analysis supplemented by an empirical analysis of the running time of the algorithm.

Algorithm 2 computes specificity for a list of semantic relationships provided as the input parameter Q. The complexity of Algorithm 2 as presented is the complexity of computing the specificity of a single semantic relationship multiplied by the size of Q. Therefore, we analyze the complexity of computing the specificity of a single semantic relationship, which is accomplished in lines 5-17. Lines 6-16 repeat N (or N_walks) times, which is the number of bidirectional walks. In each iteration, Algorithm 2 performs a forward walk in G and then in G_r (reverse(G)), i.e., a forward and then a reverse walk in G (lines 8 and 9).
After completing the bidirectional walk, a node v′ is identified in line 9. Lines 10-15 determine whether v′ has the type t or some other type in the class hierarchy, and the count is updated accordingly. There are two distinct tasks being performed in each iteration of the loop (lines 6-16): (i) a bidirectional walk starts from a random node s of type t and lands at a node v′, and (ii) the position of the type of v′ in the class hierarchy is determined. Assuming a tree-like class hierarchy where the height of the tree is H, the running time of lines 6-16 can be represented as O(N·n_1 + N·H·n_2), where n_1 and n_2 correspond to the two tasks described above. Tasks n_1 and n_2 require information that is gathered by issuing SPARQL queries in our implementation. Therefore, the complexity of these tasks depends on how the query engine of Virtuoso optimizes the query execution. Instead of further expanding the complexity relation mathematically, we treat the query engine as a black box and measure the query execution times of the issued SPARQL queries. In our implementation, we compute specificity using three sets of SPARQL queries, each of which performs one of the tasks below:

1. Starting from the seed set S, get a set V of N nodes using the semantic relationship of depth d, i.e., N iterations of lines 7-8 of Algorithm 2.

2. Starting from the acquired set V, perform N reverse walks to get a set V′ using any arbitrary path of depth d, i.e., N iterations of line 9 of Algorithm 2.

3. Determine the type of each v′ ∈ V′ and update count in line 12 accordingly.
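As an illustration, the first of these tasks might be phrased as a single SPARQL query; the sketch below is for a depth-1 relationship, and the endpoint URL, prefix handling, and sampling strategy are simplifying assumptions rather than the exact queries used.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # local Virtuoso

# Task 1 for d = 1: from seed films (set S), follow dbo:director to collect V.
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?v WHERE {
        ?s rdf:type dbo:Film .
        ?s dbo:director ?v .
    } LIMIT 5000
""")
endpoint.setReturnFormat(JSON)
V = [b["v"]["value"] for b in endpoint.query().convert()["results"]["bindings"]]
```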
3.7 Summary

Graph embedding is an effective method of preparing KGs for AI and ML techniques. However, to generate appropriate representations, it is imperative to identify the most relevant nodes and edges in the neighborhood of each target entity. We discussed specificity as a useful metric for finding the most relevant semantic relationships for target entities of a given type. Our bidirectional random walks-based approach for computing specificity is suitable for large-scale KGs of any structure and size. We have shown through experimental evaluation that the metric of specificity incorporates a fine-grained decaying behavior for semantic relationships and has the inherent ability to interpolate between the extreme exploration strategies of BFS and DFS. We used specificity-based biased random walks to extract compact representations of target entities for generating graph embeddings. These representations perform on par with baseline approaches on our selected tasks of entity recommendation, regression, and classification.

Chapter 4

Semantic Query Formulation for the Non-Expert

4.1 Motivation

The suite of technologies developed for realizing the Semantic Web, such as Ontologies, Semantic Annotations, and Linked Data, can be used for modeling, integration, querying, and sharing of information on the Web [4]. In recent years, the Semantic Web standards have evolved, and the improvements and innovations in this field have allowed the delivery of more complex, more sophisticated, and more far-reaching semantic applications in information brokering, knowledge management, and decision support in diverse fields such as telecommunications, logistics, manufacturing, energy, health, tourism, publishing, and culture [116]. As more and more semantic data becomes available, the question of how end users can access this body of knowledge becomes of crucial importance. Accessing semantic data requires intimate familiarity with existing formal query languages such as SPARQL (http://www.w3.org/TR/rdf-sparql-query/). Despite their strong expressive power, such formal languages impose an initial barrier to adoption due to their hard requirement for the understanding of their formal syntax and of how knowledge is encoded in semantic repositories.

The Resource Description Framework (RDF, http://www.w3.org/RDF/) Semantic Web standard and its semantic query language, SPARQL, have been recognized as key technologies of the Semantic Web. An RDF repository is a collection of triples (denoted as <subject, predicate, object>) which can be represented as a graph, the vertices of which denote subjects and objects and the edges of which denote predicates. SPARQL allows users to write queries against data repositories that follow the RDF specification of the World Wide Web Consortium (W3C) by creating queries that consist of triple patterns, conjunctions, disjunctions, and optional patterns. Although SPARQL is the standard way to access RDF data, it remains tedious and difficult for end users because of the complexity of its syntax and the RDF schema [92].

Figure 4.1: Schema Ontology for University Data

4.2 Problem Formulation

Consider our running example based on the ontology shown in Figure 4.1. The example query corresponds to the natural language question: "What are the names of the students taking the course CS570 and the name of the professor teaching it?". Figure 4.2 illustrates a hand-crafted semantic query that returns the correct result.

SELECT DISTINCT ?gradstudent ?professor ?professorname ?gradstudentname
WHERE {
  ?gradstudent rdf:type university:GradStudent .
  ?professor rdf:type university:Professor .
  ?professor university:name ?professorname .
  ?gradstudent university:name ?gradstudentname .
  ?gradstudent university:takesCourse ?course .
  ?course university:isTaughtBy ?professor .
  ?course university:courseName "CS570" .
}

Figure 4.2: Sample SPARQL Query based on University Ontology
To automatically generate such a SPARQL query, a system would have to (i) separate the input into syntactic markers and "meaningful" tokens, (ii) map tokens to concepts in the Ontology, (iii) link the identified concepts based on relationships in the Ontology, and (iv) issue the query to collect the results. An ideal system would allow end users to benefit from the expressive power of Semantic Web standards while at the same time hiding their complexity behind an intuitive and easy-to-use interface [47, 59]. Therefore, significant attention to interfaces for querying semantic repositories has resulted in a wide range of systems across disciplines, including Natural Language Processing (NLP) systems [21, 54, 112, 113], Semantic Web technologies [47, 67, 119], and visualization environments [92, 49, 104].

Modern query languages for the Semantic Web do not support the handling of natural language text. They usually require specialized solutions, ranging from predefined templates which provide the skeleton for SPARQL queries [92, 111] to quasi-natural language querying systems [47, 9, 32] which rely on controlled vocabularies to guide the user step by step through the set of possible queries with suggestions of terms that are connected in the Ontology. While such approaches make Ontology queries more straightforward, (i) they require expensive customization for each new domain or Ontology, and (ii) adding new templates requires the involvement of domain experts and language engineers. Furthermore, Natural Language interfaces can be limited by ambiguity, and even with controlled vocabularies, they require adherence to specific syntactic or grammatical rules. Conversely, keyword-based search over hypertext documents is an established technology that is used by search engines to capture users' complex information needs. Search engines have become popular because of their simple conceptual model, i.e., the results include those documents that match the specified keywords. Such concept-based queries can be used to capture the information need of a query (e.g., "Graduate Students CS570") while at the same time offering a Google-like search box interface to the end user. To summarize, a system that aims to abstract the complexities of Semantic Web standards to allow users to interact with semantic data needs to achieve the following targets:

1. Minimum reliance on predefined rules: Systems that rely on static dictionaries or predefined rules may lack portability and require customization before they can adjust to changes in the schema Ontology of the semantic repository or can be used with a different one altogether.

2. Minimum query formulation overhead: The query formulation overhead can come from processing user inputs or from creating and updating the dictionaries, rules, or templates needed for query formulation.

3. Easy-to-use interface: Since the purpose of the querying abstraction (or interface) is to provide an alternative to formulating SPARQL queries directly, it should be simpler and easier to use than the actual SPARQL standard while at the same time providing as much functionality as possible.

4. Scalable approach: The query formulation approach should be able to scale with the size of the schema Ontology or the number of triples in the semantic repository.

4.3 Proposed Solution

We take a <key, value> approach to the problem of querying a semantic data repository.
For example, the equivalent keyword-based query for Figure 4.2 is <Name, *>, <Professor, *>, <GradStudent, *>, <courseName, CS570>, which is similar to the way arguments are passed to functions in programming languages such as Java. With the Automatic SPARQL Query Formulation (ASQFor) algorithm, we aim to create a reusable, extendable, and domain-independent approach that can be used to query RDF repositories with virtually no training and no prior understanding of the Semantic Web. ASQFor's simple and intuitive tuple-based interface accepts <key, value> inputs and translates them into a formal-language query (currently SPARQL). Generated queries are then executed against the semantic repository, and the results are returned to the user. The users thus only need to be aware of the information hosted by the triple store and formulate their search criteria in terms of key-value pairs consisting of relevant terms and filtering values. The key aspects of our solution are:

1. We develop a domain-independent framework that provides a simple but powerful way of specifying complex queries and automatically translates them into formal queries on the fly (i.e., it does not rely on predefined rules and can instantly adapt to changes in the Ontology).

2. Using real-world data, we evaluate ASQFor both (i) quantitatively, to indicate possible performance overheads, and (ii) qualitatively, to identify the possible ease of use and increased productivity in information-searching activities as a direct result of reducing the amount of time spent developing and adapting queries manually.

4.4 ASQFor: Automatic SPARQL Query Formulation

The main goal of ASQFor is to enable end users to formulate semantic queries in terms of classes and data properties while being oblivious to the actual structure of the semantic data.

4.4.1 Algorithm

Let G be the graph of the schema Ontology with root r, and let Q be the query subgraph with root r̂. The workflow of the algorithm is as follows:

1. The user-provided list of keywords (such as <Name, *>, <Professor, *>, <GradStudent, *>, <courseName, CS570>) is modified so that all attributes (data properties) are replaced by their respective domain classes. This is computed by the function domain(k), which is invoked in Lines 5, 14, and 33 of Algorithm 3. The function takes as input the URI of a data property k and outputs the URI of its domain class by issuing the SPARQL query shown in Figure 4.3 (a minimal code sketch of this lookup is given after the workflow steps below). The SPARQL statements for these data properties are generated in the final step. For each identified domain class, the algorithm keeps track of all of its direct and indirect subclasses, which can inherit its data attributes. In our running example (Figure 4.1), the attribute name is associated with the class Person, and all direct and indirect subclasses of Person inherit this attribute, e.g., Student, PhDStudent, and Professor. If name is part of the user-provided list of keywords, then all of those (direct or indirect) subclasses of Person that also appear in the input list will have separate SPARQL statements in the final query linking them with the attribute name (see statements 3 and 4 in Figure 4.4).
Algorithm 3 ASQFor(L)
Input: list L of key-value pairs <K, V>
Output: SPARQL query Q that encapsulates the keywords provided by the user and their semantic relationships as inferred from the Ontology. In case values are provided, filtering statements are also included to ensure that the information need of the end user is met.
1: Q, varDictionary ← ∅
2: for each key-value pair <k, v> ∈ L do
3:   add variable for k in varDictionary
4:   if k is a data property then
5:     add variable for domain(k) in varDictionary
6:   end if
7:   if v = ∅ then
8:     insert k in the query header
9:   end if
10: end for
11: r ← findLCA(L)
12: for each key-value pair <k, v> ∈ L do
13:   if k is a data property then
14:     currentNode ← domain(k)
15:   else
16:     currentNode ← k
17:   end if
18:   while (currentNode.visited = 0 and currentNode ≠ r) do
19:     currentNode.visited = 1
20:     if ∃ triple <class, prop, currentNode> in G then
21:       classVar ← varDictionary.get(class)
22:       cNodeVar ← varDictionary.get(currentNode)
23:       Q ← insert triple <classVar, prop, cNodeVar>
24:     else
25:       if ∃ triple <currentNode, rdfs:subClassOf, class> in G then
26:         childVar ← varDictionary.get(currentNode)
27:         insert (or replace) pair (class, childVar) in varDictionary
28:       end if
29:     end if
30:     currentNode ← class
31:   end while
32:   if k is a data property then
33:     Q ← insert triple <domain(k), prop, k>
34:     if v ≠ ∅ then
35:       Q ← insert filter statement for k using v
36:     end if
37:   end if
38: end for
39: return Q

2. ASQFor makes a rooted-tree assumption, meaning that the root r of the schema Ontology G is unique and can be determined using a single SPARQL query. The path from each class in the modified input list to the root r is traced in order to compute the LCA (Lowest Common Ancestor) of all the domain classes in the modified input list. The computed LCA is the root node r̂ of the final query subgraph Q.

3. The algorithm iterates through the list of classes (from step 1). During each iteration, a node is selected from the list and is marked visited. In order to generate the correct SPARQL statement, the algorithm determines whether the currently selected node is the range of a user-defined object property, a subclass of another class, or both. In the first and third cases, the algorithm traces the path towards r̂ using the domain of the identified user-defined object property and generates the corresponding SPARQL statement using the current node, the object property of which it is the range, and the domain of that object property. In the second case, where the current node is a subclass of another class, the query variable assigned to the current node is assigned to its superclass. This process is repeated until the root r̂ of the query subgraph Q or an already visited node is reached, after which the next keyword is selected from the modified input list and the process is repeated.

4. The final step is to link the data properties to their respective domain classes through SPARQL statements and to create the filtering statements.
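The domain(k) lookup of step 1 reduces to a single query over the schema. The following is a minimal Python sketch using rdflib; the ontology file and namespace are illustrative placeholders, and our implementation issues the SPARQL query of Figure 4.3 against the repository instead.

from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def domain(g, k):
    # Return the URI of the domain class of data property k,
    # mirroring the rdfs:domain lookup of Figure 4.3.
    return g.value(subject=k, predicate=RDFS.domain)

g = Graph()
g.parse("university.ttl")  # hypothetical schema Ontology file
univ = "http://example.org/university#"
print(domain(g, URIRef(univ + "name")))  # expected: the Person class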
Algorithm 4 findLCA(T)
Input: list T of classes; root of the schema Ontology
Output: the root of the query subgraph, which is the Lowest Common Ancestor of all the classes in the input list.
1: if length(T) = 1 then
2:   return t_1 ∈ T
3: end if
4: LCA ← t_1
5: for t_j ∈ T, j ← 2 to length(T) do
6:   u ← LCA
7:   v ← t_j
8:   if u = root or v = root then
9:     return root
10:  else
11:    LCA ← ancestorOf(u, v)
12:  end if
13: end for
14: return LCA

Algorithm 5 ancestorOf(u, v)
Input: two classes u and v in the Ontology; root of the schema Ontology
Output: the Lowest Common Ancestor of the two given nodes u and v.
1: if u = v then
2:   return u
3: end if
4: pathToU ← list of nodes from root to u
5: pathToV ← list of nodes from root to v
6: i ← 1, j ← 1
7: while pathToU_i = pathToV_j and i ≤ length(pathToU) and j ≤ length(pathToV) do
8:   LCA ← pathToU_i
9:   i ← i + 1
10:  j ← j + 1
11: end while
12: return LCA

SELECT ?domain WHERE {
  <http://dataprop1> rdfs:domain ?domain .
  FILTER NOT EXISTS { <http://dataprop1> rdfs:range ?any }
}

Figure 4.3: SPARQL query to get the domain of a data property

4.4.1.1 Lowest Common Ancestor

To determine the LCA (which becomes the root r̂ of Q), we traverse G, starting from the classes identified in the user input, towards the root r. The information about the predecessors of the selected nodes at each step is acquired using SPARQL queries. We start by determining the common ancestor of two given nodes u and v: the paths from u to r and from v to r are computed and compared to find the LCA of these two nodes. We then select the next node from the user input and find the LCA of the selected node and the LCA computed in the previous step. Algorithm 4 determines the root r̂ of the query subgraph by repeatedly using Algorithm 5 to compute the pairwise LCA of two given nodes.
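The two procedures translate directly into code once the predecessor links are available. The sketch below is a compact Python rendering of Algorithms 4 and 5, assuming the predecessor of each class (via rdfs:subClassOf or object-property edges) has already been collected into a child-to-parent map; ASQFor instead discovers these links on the fly with SPARQL queries. The class names follow the running example.

def path_to_root(parent, node):
    # List of nodes from the Ontology root down to `node`, following
    # the single predecessor links of the rooted-tree schema.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return list(reversed(path))

def ancestor_of(parent, u, v):
    # Lowest common ancestor of u and v (Algorithm 5).
    pu, pv = path_to_root(parent, u), path_to_root(parent, v)
    lca = pu[0]  # the root, shared by construction
    for a, b in zip(pu, pv):
        if a != b:
            break
        lca = a
    return lca

def find_lca(parent, classes):
    # Fold pairwise LCA over the class list (Algorithm 4).
    lca = classes[0]
    for c in classes[1:]:
        lca = ancestor_of(parent, lca, c)
    return lca

# Toy predecessor map for the running example (child -> predecessor):
parent = {"GradStudent": "Student", "PhDStudent": "Student",
          "Student": "Person", "Professor": "Person", "Course": "Student"}
print(find_lca(parent, ["Person", "GradStudent", "Professor", "Course"]))
# -> Person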
4.4.2 Comprehensive Example

Consider the keyword-based query:

<Name, *>, <GradStudent, *>, <Professor, *>, <courseName, "CS570">

The targets of the query are different concepts and attributes that lie on different branches of G, e.g., the attributes name and courseName and the classes GradStudent and Professor. To formulate the SPARQL query, it is important to know how these concepts and attributes are related to each other. In our example, name is an attribute of Person, which is the superclass of both GradStudent and Professor. courseName is an attribute of the class Course, which is related to the concepts Professor and GradStudent through the properties isTaughtBy and takesCourse, respectively.

1. ?gradstudent university:takesCourse ?course .
2. ?course university:isTaughtBy ?professor .
3. ?professor university:name ?professorname .
4. ?gradstudent university:name ?gradstudentname .
5. ?course university:courseName ?coursename .

Figure 4.4: Step-wise generated SPARQL statements

In the first step of the algorithm, the data properties (name and courseName) are resolved to their respective domains (Person and Course, respectively). The modified input list of keywords then contains Person, GradStudent, Professor, and Course, where each entry is of type Class. In order to establish the relationships between these nodes, the algorithm finds the smallest subgraph Q that connects them. To do so, we find their lowest common ancestor, which becomes the root r̂ of the query subgraph Q. In this example, r and r̂ are the same, i.e., Person.

In the first iteration of step 3 of the algorithm, Person is selected; however, no further processing is done, as it is the root r̂ of the query subgraph Q. In the next iteration, GradStudent is selected. The algorithm determines that it is a subclass of another class, Student, so it assigns the variable for the subclass GradStudent, say ?gradstudent, to the class Student. Any SPARQL statement relevant to this class generated in the current iteration must use the variable ?gradstudent. Using Student, the algorithm selects Course as the next node on the path to the root r̂. Using Student, Course, and their linking property takesCourse, the algorithm generates SPARQL statement 1 shown in Figure 4.4; note that the variable ?gradstudent is used for Student. With Course as the currently selected node, the next node on the path to the root r̂ is Professor, through the object property isTaughtBy; hence, statement 2 (Figure 4.4) is generated. With Professor as the next selected node, the algorithm determines that it is a subclass of Person. The variable dictionary is updated as before; however, no new statement is generated for Person in this iteration, as it is the root r̂ of the query subgraph Q. For the next class nodes in the input list, Professor and Course, no statements are generated, as these classes have already been visited. This completes the process of generating statements for all classes relevant to the query.

Finally, the algorithm iterates through the data properties in the unmodified input list (name and courseName in this example). Since name can be associated with multiple classes, i.e., GradStudent and Professor, the algorithm assigns different query variables to name, resulting in statements 3 and 4 in Figure 4.4. The domain of the data property courseName is Course, which leads to the generation of statement 5. In the final step, after applying filters based on the non-empty values in the list of key-value pairs, the final query is formulated as shown in Figure 4.5, which is similar to the manually crafted query shown in Figure 4.2.

SELECT DISTINCT ?gradstudent ?professor ?professorname ?gradstudentname ?coursename
WHERE {
  ?gradstudent rdf:type university:GradStudent .
  ?professor rdf:type university:Professor .
  ?gradstudent university:takesCourse ?course .
  ?course university:isTaughtBy ?professor .
  ?professor university:name ?professorname .
  ?gradstudent university:name ?gradstudentname .
  ?course university:courseName ?coursename .
  FILTER ( ?coursename = "CS570" )
}

Figure 4.5: Generated full SPARQL query

Table 4.1 shows some other scenarios handled by ASQFor. Consider queries Q_A and Q_B that request the courses taught by the Professor with attribute value "Prof1" and with URI http://Prof1, respectively. For Q_A, a string literal is provided as a filter value for a class (classes usually have URIs); the generated query statement links the class Professor to the literal "Prof1" using a variable ?prop1 and leaves the task of matching it to a data property to the query engine. Since the user has not specified which attribute they are interested in, the query engine will match this triple pattern to any attribute of class Professor in the database that has the value "Prof1". Another way of entering this query is as Q_C, so that the query engine filters the instances of Professor based on the value of their attribute name.
For Q_B, a URI is provided as the filter value for the class Professor. This URI is inserted in the variable dictionary in order to generate the query statements corresponding to the class Professor. The formulated queries for all three cases are shown in Table 4.1.

Table 4.1: Examples of Query Formulation

Q_A: <"Professor", "Prof1">, <"courseName", "">
SELECT DISTINCT ?professor ?coursename WHERE {
  ?professor rdf:type university:Professor .
  ?student university:takesCourse ?course .
  ?course university:isTaughtBy ?professor .
  ?professor ?prop1 "Prof1" .
  ?course university:courseName ?coursename .
}

Q_B: <"Professor", "http://Prof1">, <"courseName", "">
SELECT DISTINCT ?coursename WHERE {
  ?student university:takesCourse ?course .
  ?course university:isTaughtBy <http://Prof1> .
  ?course university:courseName ?coursename .
}

Q_C: <"Professor", "">, <"Name", "Prof1">, <"courseName", "">
SELECT DISTINCT ?professor ?name ?coursename WHERE {
  ?student university:takesCourse ?course .
  ?course university:isTaughtBy ?professor .
  ?course university:courseName ?coursename .
  ?professor university:name ?name .
  FILTER ( ?name = "Prof1" )
}

4.4.3 Complexity Analysis

The first step in ASQFor is to find the smallest subgraph that connects all nodes relevant to the user-provided concepts. The complexity of this step depends on the structure of the Ontology; we therefore analyze the complexity of query generation in the particular case of a tree Ontology. Let k be the number of keywords in the user query and n be the total number of nodes in the Ontology. For each keyword, ASQFor traverses the path from the node corresponding to that keyword towards the root r of the Ontology G, in order to compute the lowest common ancestor of all keyword mappings in the Ontology. This step requires O(k log n) operations in the worst case, i.e., when each node corresponding to a user-provided keyword lies on a separate branch of the tree. Once the subgraph Q is constructed, ASQFor traverses Q to generate the SPARQL statements that constitute the query. It is easy to show that this step also requires O(k log n) operations in the worst case. Therefore, the overall complexity of ASQFor is O(k log n).

Figure 4.6: Schema Ontology for Census Data

4.4.4 Limitations

To provide non-expert users a subset of the functionality of SPARQL while keeping our approach completely dynamic and domain independent, we have made certain design decisions in ASQFor, which lead to the following limitations:

1. ASQFor currently supports hierarchical Ontologies; it does not support cycles or self-properties. Hierarchical Ontologies are commonly used in multiple domains for cataloging or organizing information [49, 10, 39]. The reason behind this design choice is the complexity associated with finding a query subgraph (tree) that spans the selected query nodes in an arbitrary graph; this problem is known to be NP-hard [118].

2. For ASQFor to be able to issue queries over a semantic repository, a well-defined schema ontology must be available, i.e., the schema ontology must contain complete information about the domains and ranges of all object properties, the domains of all data properties, and the complete hierarchy of subclasses. Since ASQFor traverses the schema ontology during the query formulation phase, missing information can result in the formulation of an incorrect query.

3. We assume that the user query input is in the form of key-value pairs whose keys exactly match the corresponding ontological terms in the repository. Publicly available tools such as the Stanford Parser [5] can be used to tokenize natural language sentences, followed by string matching techniques (e.g., [46]) to match tokens to ontological terms; however, this is beyond the scope of this work.

4. Aggregation queries (e.g., COUNT, SUM) are not supported in the current version of ASQFor. First, the simple key-value pair syntax we chose to accommodate user query needs cannot be mapped to the entire SPARQL syntax. Secondly, functions such as COUNT, SUM, and AVERAGE only affect the SELECT clause of the SPARQL query, with the possible addition of statements at the end of the query, such as a GROUP BY clause. The triple patterns in the body of the SPARQL query still rely on computing the query subgraph that links all the ontological concepts relevant to the user query. Allowing users to specify aggregating functions would require modifying how the user provides the input; the algorithm that traverses the schema ontology would remain unchanged.
4.5 Evaluation

We evaluate ASQFor in two ways: (i) qualitatively, to identify the possible ease of use and increased productivity in information-searching activities as a direct result of reducing the amount of time spent developing and adapting queries manually, and (ii) quantitatively, to indicate possible performance overheads.

For practical reasons, we have not measured the productivity increase directly. Such a direct evaluation would require measuring (i) the query construction speed of users with and without ASQFor, and (ii) the number of query tasks completed over a period of time by users with and without ASQFor. Instead, we measured the syntactic difference between queries generated by ASQFor and optimized queries hand-crafted by qualified programmers for the same information need, as well as the difference in their execution times. To the best of our knowledge, the source code of other automatic query formulation tools, such as Ginseng [9], Querix [48], and Squall [33], is not available for a fair comparison; such systems rely on a web interface, which makes it impossible to measure the exact query formulation and execution times and compare them against ASQFor.

Table 4.2: Evaluation Queries for Census Data
Q1: Name, birthplace, gender, and marital status of all people on active military duty.
Q2: Occupations in different industries.
Q3: Names of people who attended private school.
Q4: All attributes of people born in California.

For the quantitative evaluation, we compared query execution with and without ASQFor on a triple store created using Apache Jena (https://jena.apache.org/). We evaluated four queries, ranging from simple to complex (see Table 4.2), using datasets of varying size (ranging from 20 to 200,000 entries). We use query formulation time and query execution time to quantify the efficiency of ASQFor in the task of automatic SPARQL query generation; we also measured the overall time, i.e., the sum of the query formulation and execution times. We compare the performance of ASQFor against manually defined, hand-crafted, optimized queries. We implemented ASQFor as a Java function that takes a list of key-value pairs as input and returns a valid SPARQL query as a string. Each SPARQL query was then evaluated against the semantic repository using the Jena API, and the results were returned in CSV format.

4.5.1 Dataset

For evaluation purposes, we used the 1990 US Census data (Lichman, M. 2013. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/), which is provided in tabular format. The dataset contains 68 attributes for 2,458,285 individuals in total. For the evaluation, we randomly sampled this dataset to create smaller datasets of different sizes (20; 200; 2,000; 20,000; and 200,000 records). Each query from Table 4.2 was issued five times on each dataset. Based on the Census Ontology shown in Figure 4.6, we converted the tabular data into RDF for the experiments.
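Before turning to the results, the measurement setup just described can be sketched compactly. The snippet below uses Python with rdflib rather than the Java/Jena stack of our experiments; the asqfor stub, the file name, and the census namespace are placeholders standing in for the real implementation and data. It times formulation separately from execution, mirroring the two quantities we report.

import time
from rdflib import Graph

def asqfor(pairs):
    # Stub standing in for the ASQFor implementation; it returns the
    # generated query for Q2 so the harness runs end to end.
    return """
        PREFIX census: <http://example.org/census#>
        SELECT DISTINCT ?industry ?occupation WHERE {
            ?workinfo census:Industry ?industry .
            ?workinfo census:Occupation ?occupation .
        }"""

g = Graph()
g.parse("census_20000.nt")  # hypothetical sampled RDF dataset

t0 = time.perf_counter()
query = asqfor([("Industry", ""), ("Occupation", "")])  # formulation
t1 = time.perf_counter()
rows = list(g.query(query))                             # execution
t2 = time.perf_counter()
print(f"formulation {t1 - t0:.4f}s, execution {t2 - t1:.4f}s")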
Table 4.3: Query Formulation for Representative Queries

Q2, manual:
SELECT DISTINCT ?industry ?occupation WHERE {
  ?workinfo census:Industry ?industry .
  ?workinfo census:Occupation ?occupation .
}
Q2, functional input: <"Industry", "">, <"Occupation", "">
Q2, ASQFor-generated:
SELECT DISTINCT ?industry ?occupation WHERE {
  ?workinfo census:Industry ?industry .
  ?workinfo census:Occupation ?occupation .
}

Q3, manual:
SELECT DISTINCT ?name ?school WHERE {
  ?person census:hasEducation ?eduinfo .
  ?person census:Name ?name .
  ?eduinfo census:School "3" .
}
Q3, functional input: <"Name", "">, <"School", "3">
Q3, ASQFor-generated:
SELECT DISTINCT ?name ?school WHERE {
  ?person census:hasEducation ?eduinfo .
  ?person census:Name ?name .
  ?eduinfo census:School ?school .
  FILTER ( ?school = "3" )
}

4.5.2 Quality and Efficiency of ASQFor-Generated Queries

The efficiency, or execution time, of automatically generated queries depends on the identified concepts and their relationships, as well as on the number of intermediate triples retrieved by each statement in the query. To demonstrate that ASQFor generates good-quality queries, we show the difference between the manually written and automatically generated queries Q2 and Q3 in Table 4.3. Comparing the two formulated queries for Q2, it can be verified that the manually created and the automatically generated queries are identical. For Q3, on the contrary, a difference is observed. While the manual query refers directly to the "private" school filter (the options for the attribute School are encoded in the dataset as integers: 0 = N/A, 1 = Not Attending, 2 = Public, 3 = Private), the automatic query uses the FILTER function provided by SPARQL to reduce the result set to the requested information. The same difference in the use of the FILTER keyword can be seen between the automatically formulated query in Figure 4.5 and the equivalent manually crafted query in Figure 4.2. As a result, the manual query requests only triples that contain data referring to private schools, whereas the automatic query initially retrieves all triples matching the keywords; this larger result set is then filtered afterward. Using the FILTER function this way can have an impact on query response time due to the larger intermediate result set. The manual query is well optimized for the specific task at hand and is therefore expected to perform better than the automatically generated query in this scenario. However, there is no difference in the results returned by the two queries.

4.5.3 Effect of Automation on Query Formulation Time

To evaluate the performance of ASQFor in formulating queries, we measured the time required to generate the representative queries shown in Table 4.2. These queries differ in the number of nodes and attributes they query and in the depth of the query subgraph. We measured the overhead introduced by query formulation over query execution time using our datasets of different sizes (20; 200; 2,000; 20,000; and 200,000 records). Figure 4.7 shows, on a logarithmic scale, the average formulation time compared to the query execution time. The results suggest that the query formulation overhead of ASQFor is constant, whereas the execution time varies as a function of the size of the result set and the size of the repository. Query formulation time is significant compared to query execution time only when the repository is substantially small (i.e., fewer than 2,000 entries). As expected, with increasing repository size, query execution time surpasses query formulation time.
Figure 4.7: Comparison between query formulation time and total query execution time using ASQFor, for queries (a) Q1, (b) Q2, (c) Q3, and (d) Q4

The mean and standard deviation of the ratio of formulation time to total time (formulation + execution) are shown in Figure 4.8. It can be seen that formulation time on average takes ∼90% of the total time for a dataset of size 20, whereas it accounts for less than 25% of the total time for a dataset of 200,000 entries. Therefore, query formulation time becomes insignificant for large-scale semantic repositories.

Figure 4.8: Mean and standard deviation of the percentage of query formulation time over total execution time

4.5.4 Effect of Automation on Query Execution Time

Figure 4.9 shows the response times calculated over the five dataset sizes for each of the four queries in Table 4.2. For most queries, the difference between the execution times of the manual and automatic queries is insignificant for practical purposes. ASQFor adds only a small overhead compared to the manually optimized queries, particularly as the size of the dataset increases; its run time differs from that of the manual query by a small margin of milliseconds.

Figure 4.9: Comparison between the execution times of ASQFor-generated and manual queries, for (a) Q1, (b) Q2, (c) Q3, and (d) Q4

4.5.5 Semantic Search of 1990 US Census Data

In this section, we show how ASQFor can be incorporated into a semantic search system, using the 1990 US Census data. For this purpose, we have implemented a user interface (Figure 4.10) that supports a combination of semantic search and exploration queries using ASQFor. From the user's perspective, they only need to know what kind of information is available in the database, irrespective of how it is organized using Ontological concepts and their interrelationships specified through object and data properties. This has led to our minimalist design, which allows users to pick and choose the concepts that are relevant to the query, specify filtering values, and get the desired result. After selecting the required concepts, users can click Filter Options to specify filtering values for individual concepts or leave them blank. The filtering values can be entered concatenated with comparison operators, e.g., <500 for range queries. After clicking Submit Query, the results are returned in CSV format. The selection and filter values in Figure 4.10 correspond to a query asking for "People with more than 16 years of education who are employed and making more than $100,000/yr". Unlike other visual query interfaces [49], our primary focus is to abstract the details of SPARQL and the schema Ontology from the end user, exposing only the data attributes to choose from. Furthermore, this interface can be dynamically generated from a schema Ontology, resulting in a portable application that only requires access to the semantic repository and builds a functional-to-SPARQL query translator and a GUI on the fly.

Figure 4.10: US Census 1990 - Database search application

4.6 Summary

While the Semantic Web promises data sharing and re-usability without boundaries, harnessing the rich semantic data provided by knowledge bases on the Web has proven difficult for ordinary end users who are not necessarily familiar with Ontologies or semantic query languages.
To ensure that precise answers can be delivered to user queries while retaining the simplicity of keyword-based search, we have developed a framework that can answer complex semantic queries over structured repositories through a simple interface. We showed that our approach enables end users to easily issue semantic queries that match their information needs without intimate knowledge of the underlying data representation or Semantic Web technologies. The run time of ASQFor depends only on the number of classes and attributes in the semantic database and not on the number of records; as a result, the query formulation time remains constant regardless of the size of the semantic repository. The evaluation showed that ASQFor is efficient, even for large datasets. Our contribution is not yet another library, but rather a demonstration of how a simple abstraction can be used to leverage the advantages of Semantic Web technologies while offering end users a convenient way to access semantically rich data from a knowledge base through a simple API. In this context, our ASQFor framework:

• adds a layer of abstraction between the user and the SPARQL endpoint,
• does not rely on pre-defined templates, rules, or dictionaries,
• has low query formulation overhead,
• queries the semantic repository to extract the relevant schema information (in the background),
• automatically generates a SPARQL query based on user-provided keywords.

Chapter 5

Use Case: Smart Oilfield Safety Net (SOSNet)

5.1 Motivation

In smart oilfields, large volumes of data are generated daily about assets, personnel, the environment, and other production and business processes. Storing vast amounts of data is only justifiable if it leads to the discovery of actionable insights, which can then be translated into improvements in operational efficiency and Health, Environment, and Safety (HES) conditions. Smart oilfield data is of high volume, variety, and velocity and can be located in multiple data silos. This presents an urgent need to develop scalable and extensible techniques that enable domain experts to access data and perform analytics that yield better decisions and results. In this chapter, we focus on the process of Asset Integrity Management and the role of Semantic Web technologies in significantly improving decision-making in this domain. In our view, the biggest challenges are to manage the high volumes of data, create a holistic view of asset integrity data, allow intuitive access to the data, and generate insights through an agile system that can be used by domain experts without extensive assistance from IT experts. We present Smart Oil Field Safety Net (SOSNet), a Semantic Web driven platform that integrates asset integrity data, provides a simplified querying mechanism for accessing the integrated data, and facilitates analytics on top of it to improve the efficiency and robustness of the Asset Integrity Management process.

5.2 Background

Various processes in the oil and gas industry generate vast amounts of data [53]. This data usually comes from extensive instrumentation as well as manually generated documents, e.g., notes, work orders, photographs, and drawings. By consolidating, processing, and analyzing such data, experts can find solutions to existing issues, optimize plant operations, and predict and prepare for possible incidents or failures in the future.
However, manually finding, integrating, and analyzing data for decision making is labor intensive. Finding and integrating relevant data across multiple data silos is even more challenging: studies have shown that domain experts spend 60-80% of their time just looking for and collecting relevant data for analysis [49, 56]. Two main reasons given in these studies for this bottleneck are (i) the distribution of relevant data across multiple files and databases and (ii) the reliance of domain experts on Information Technology (IT) experts for the retrieval of relevant data. Direct access to information would require domain experts to understand database systems and formal query languages. However, enabling domain experts to have faster access to relevant data can lead to faster decision making and quicker resolution of current issues, and valuable insights can be gained to prevent potentially hazardous incidents in the future. This necessitates a solution that not only acts as an effective data management tool linking relevant data sources but also enables quick and easy retrieval of the integrated data, thus reducing the effort required of domain experts.

We present the Smart Oil Field Safety Net (SOSNet) system, a scalable and extensible system that significantly improves Asset Integrity Management decision-making in smart oilfields. SOSNet leverages Semantic Web technologies to facilitate intuitive access to integrated asset integrity data and to empower domain experts to focus on analytics that generate insights and yield better decisions and results. We demonstrate that this Semantic Web based framework can serve as the foundation for automated Asset Integrity Management systems in smart oilfields, supporting the acquisition, management, access, and processing of asset integrity data. Our contributions in this chapter are as follows:

1. Semantic Model for Asset Integrity Management: We present a hierarchical semantic model that extends existing ontologies with new concepts. Our semantic model is extensible both within the domain of Asset Integrity Management and to other processes in the oil and gas industry. This semantic model is the key to the automatic integration of multiple asset integrity data streams and to formulating queries over the resulting integrated data.

2. Easy and Intuitive Querying Framework: We present an effective framework that enables domain experts with no knowledge of formal query languages to issue user-defined SPARQL queries over integrated asset integrity data without IT assistance.

3. SOSNet System: We demonstrate the capabilities of our integration and querying framework through the applications of the SOSNet system, which provides ease of use and flexibility and facilitates decision making for domain experts in the context of Asset Integrity Management.

Figure 5.1: Typical Asset Integrity Management workflow

5.3 Asset Integrity Management Workflow

Asset Integrity Management (AIM) is a safety-critical process on oil and gas facilities. The objective is to keep track of the structural integrity of the various assets (e.g., vessels, pipelines) on a facility. Oil companies have workflows and safeguards in place to prevent Loss of Containment (LoC) incidents, in which fluid contained within an asset is accidentally released into the environment, creating a potentially hazardous situation for the environment, personnel, and other assets [90]. Figure 5.1 shows a typical Asset Integrity Management workflow.
Through metering and surveys conducted on a facility, data in the form of numeric time series, structured and unstructured text, and multimedia content is collected and stored in various databases and files. This spread of information across multiple sources often requires asset integrity managers and experts to manually identify the appropriate sources, query them, and integrate the relevant data to understand the state of the assets. This process needs to be repeated for different assets and for multiple facilities at a time. The collected data is then analyzed to identify potential risks, and necessary actions are recommended or scheduled. All of these steps of the workflow generate data that is crucial for decision making.

5.3.1 Challenges

Due to the inherent heterogeneity of the data sources involved in the process of Asset Integrity Management, providing on-demand access to an integrated view can be challenging [88]. Another challenge is knowledge management, which refers to a systematic way of capturing the results of various engineering models and analyses [102]. We summarize the challenges in creating an effective automated Asset Integrity Management system below:

1. Integrated view of asset integrity data: For effective decision making, there needs to be a system that presents a comprehensive and continuous view of the assets and processes. Useful information may reside in multiple databases or files, which makes it difficult to gather the relevant information needed to produce actionable insights.

2. Extensibility to new data sources: Oil facilities are well instrumented to monitor various processes, not just the state of assets. Applications based on advanced analytics may lead to new factors being incorporated into complex models for characterizing the behavior of assets. Thus, any solution that facilitates the integration and mining of asset integrity data needs to be extensible to new data sources.

3. Efficient access to information: It is critical for asset integrity decision makers to have direct access to relevant pieces of information for making informed decisions. However, domain experts may not have adequate skills to search for relevant data in databases and have to rely on IT experts for their data needs, causing delays in the workflow.

To address these challenges, we have developed SOSNet, a data-driven Asset Integrity Management system. In the work presented here, we mainly focus on the modeling, integration, and querying aspects of the SOSNet system.

Figure 5.2: SOSNet system architecture

5.4 Semantic Information Modeling

Figure 5.2 shows the architecture of our SOSNet system. We use Semantic Web based ontologies to identify AIM-related entities in the available data sources (e.g., equipment, facility, work order, and sensor) and to model their attributes and relationships. The semantic model is then used for annotating the raw data sources and facilitating their automatic integration. The resulting web of knowledge (stored in the RDF Triple Store) provides an integrated view of the domain and is used to drive various applications.

5.4.1 Data Sources

The data sources for this project include multiple spreadsheets, database files, images, CAD drawings, and PDFs.
This data consists of the following recorded information:

• Time-stamped thickness measurements (in inches) collected every couple of years (usually after a gap of 5+ years) for multiple Thickness Measurement Locations (TMLs) on all assets on multiple facilities
• Manually assigned corrosion categories (Severe, Moderate) for assets
• Descriptions of observed problems related to assets and their recommended solutions
• Images of assets, focusing on areas of concern
• Work orders providing descriptions of requested work and remedial actions performed
• Specification documents, including equipment drawings and Piping and Instrumentation Diagrams (P&IDs)

5.4.2 Semantic Data Modeling

The purpose of ontologies is to capture domain knowledge in terms of concepts (classes), relationships (object properties), and attributes (data properties) so that data can be annotated for machines to understand. For SOSNet, we extend the Smart Oilfield Ontology (SOFOS) [90] to create the SOSNet Ontology, a Semantic Web based model for AIM. The SOSNet Ontology is defined using the Resource Description Framework (RDF, https://www.w3.org/RDF/) and the Web Ontology Language (OWL, https://www.w3.org/OWL/), the Semantic Web standards for knowledge representation. As with SOFOS, our focus in designing this ontology is data-centric. We do not focus on creating a hierarchy of all possible assets (vessels, tanks, motors, compressors, valves, and so on) that exist on a typical oil and gas facility; this has already been well defined by other ontologies such as ISO 15926 (the standard for data modeling and interoperability using the Semantic Web, http://15926.org/index.php) and PPDM (PPDM 3.8 Data Model, http://www.ppdm.org/ppdm-standards/ppdm-3-8-data-model). Our primary focus with the SOSNet Ontology is to model data streams. However, we also need classes that model the relevant physical assets in order to identify and integrate the data streams associated with them. For example, different sensors installed on a vessel can measure multiple parameters (e.g., pressure, level, temperature), while another set of records associated with the same vessel can be manually generated data such as notes, work orders, and multimedia content. Instances of physical assets serve as focal points for grouping such relevant data streams together. This way, all data recorded for a particular asset becomes integrated to present a complete view of its status.

Figure 5.3: Relationships between select concepts from the SOSNet ontology

A subset of the concepts and attributes of the SOSNet Ontology is shown in Figure 5.3. The class Facility extends the class System of SOFOS (which is itself an extension of the class ClassOfOrganization of ISO 15926) to define a collection of assets; a facility instance can be a rig, platform, refinery, etc. The classes Equipment and TML represent the physical assets which are the focus of the asset integrity monitoring and prevention procedures. An equipment item in our ontology refers to an entity which has been given a unique ID within a facility. The class TML models a specific location on the Equipment with which a data stream can be associated; this class is specific to the use case of Asset Integrity Management and models TMLs on equipment. Each TML has a time series of thickness values associated with it. We have created a class Survey to group together the data collected for an asset on a particular inspection date. A single instance of Survey can present a complete picture of the state of the associated asset based on all the data collected about it at a specific point in time, and multiple such instances constitute a chronological record of the state of the asset. In the context of a TML, instances of Survey provide historical data on thickness values; for Equipment, instances of Survey provide other relevant information such as photos, assessments, and unstructured text (problems, recommendations). Images associated with the Equipment are modeled through the class Photo and its attributes, and all other survey-related information is modeled through the attributes of the class Finding. One last crucial piece of information comes from work orders, which are modeled as instances of the class WorkOrder; work orders keep track of all the maintenance and repair work performed on an asset. In addition to thickness measurements, ambient conditions (temperature, humidity) can also be associated with all assets, depending on the spatial granularity of the available data. This can be done for an equipment item, a facility, or an entire production area; such a hierarchy can easily be created by extending the class System from SOFOS, as we did for Facility.
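To give a flavor of what this modeling looks like in practice, the snippet below declares a few of these classes and linking properties with rdflib. The namespace and the property names (hasTML, hasSurvey) are illustrative placeholders; the full SOSNet Ontology defines many more classes, attributes, and axioms in OWL.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

SOS = Namespace("http://www.usc.edu/cisoft/sosnet#")  # illustrative namespace
g = Graph()
g.bind("sosnet", SOS)

# A few classes from Figure 5.3; Facility specializes the SOFOS System class.
for c in ("System", "Facility", "Equipment", "TML",
          "Survey", "Photo", "Finding", "WorkOrder"):
    g.add((SOS[c], RDF.type, OWL.Class))
g.add((SOS.Facility, RDFS.subClassOf, SOS.System))

# Object properties (names assumed) tying data streams to physical assets.
for prop, dom, rng in (("hasTML", "Equipment", "TML"),
                       ("hasSurvey", "Equipment", "Survey")):
    g.add((SOS[prop], RDF.type, OWL.ObjectProperty))
    g.add((SOS[prop], RDFS.domain, SOS[dom]))
    g.add((SOS[prop], RDFS.range, SOS[rng]))

print(g.serialize(format="turtle"))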
5.4.3 Data Integration

Once the semantic model of the data streams has been created, the next step is to generate instances based on the SOSNet Ontology, i.e., to import the structured (tabular) and unstructured data into an RDF repository. An RDF graph is represented by a knowledge base of triples, each consisting of three parts: <subject, predicate, object>. In an RDF graph, Uniform Resource Identifiers (URIs) are used as location-independent addresses of entities, both classes (nodes) and properties (edges); any entity with a URI can be a subject, predicate, or object. An RDF graph is the set of all such RDF triples [70]. The nodes of the graph represent entities, which are instances of the different classes defined in the ontology, whereas the edges represent relationships between entities within or across data sources. Data integration is facilitated by making instances uniquely addressable through URIs. We define URIs that represent the organizational hierarchy of assets on a facility. For example, http://www.usc.edu/cisoft/sosnet/TK21A/T1000/PT1000A represents the URI of a pressure sensing device PT1000A mounted on equipment T1000, which is located on facility TK21A. The benefit of such a scheme is that it allows differentiation between assets with identical IDs located on different facilities.

The RDF Generator in Figure 5.2 takes raw data files and imports their content into an RDF graph. The files are processed row by row, and all values in each row are linked together based on the relationships defined by the ontology. Each column name corresponds either to a class or to a data property. In the former case, the appropriate information from the row is used to create the hierarchical URIs of the entities, which are instances of different ontological concepts. In the latter case, a data property is created in the RDF graph with the value taken from the corresponding cell in the row, which can be a numeric, string, or date value. The example in Figure 5.4 shows the RDF representation of a subset of anonymized data associated with equipment T1000 (URIs are not shown in the figure for readability). It is important to note that URIs (e.g., http://www.usc.edu/cisoft/sosnet/TK21A/T1000/PT1000A) are hidden from the domain experts, who instead work with the IDs from the design documents (e.g., T1000, TK21A, PT1000A).

Figure 5.4: Visualization of asset integrity data in RDF related to a single equipment, T-1000
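A minimal version of this row-to-triples conversion can be sketched with rdflib as follows. The input file name, column layout, and property names are assumptions for illustration; the actual RDF Generator is driven by the SOSNet Ontology and handles many more record types.

import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SOS = Namespace("http://www.usc.edu/cisoft/sosnet#")  # illustrative namespace
BASE = "http://www.usc.edu/cisoft/sosnet/"            # hierarchical URI prefix

g = Graph()
with open("thickness_surveys.csv") as f:  # assumed columns: facility,
    for row in csv.DictReader(f):         # equipment, tml, date, thickness
        # facility/equipment/TML hierarchy encoded directly in the URI
        tml = URIRef(f"{BASE}{row['facility']}/{row['equipment']}/{row['tml']}")
        g.add((tml, RDF.type, SOS.TML))
        survey = URIRef(f"{tml}/{row['date']}")
        g.add((survey, RDF.type, SOS.Survey))
        g.add((tml, SOS.hasSurvey, survey))
        g.add((survey, SOS.surveyDate, Literal(row["date"], datatype=XSD.date)))
        g.add((survey, SOS.thickness,
               Literal(float(row["thickness"]), datatype=XSD.double)))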
After integration, the data, initially spread across multiple files, becomes available as hierarchically organized linked data. The integrated data is maintained in an RDF triple store and can be accessed via a SPARQL (https://www.w3.org/TR/rdf-sparql-query/) endpoint. As a result, asset integrity experts can issue complex and meaningful queries, such as retrieving the list of all work orders for assets that have been labeled severely corroded. The SPARQL syntax for this query is shown in Figure 5.5. This query acquires integrated data coming from the work-order and inspection-findings data sources, which were previously independent. By issuing SPARQL queries, domain experts can access relevant data according to their needs. The limitation, however, is that domain experts are not proficient in Semantic Web technologies. To address this problem, we have developed a technique that allows such non-expert users to interact directly with semantic data without the help of IT experts; we describe this approach in Section 5.5.

Figure 5.5: An example SPARQL query to get work orders for severely corroded assets

5.5 Accessing Semantic Data for the Non-expert

As discussed at the beginning of this chapter, a desirable feature of an ideal Asset Integrity Management system is that it be intuitive enough for domain experts to use without excessive assistance from IT experts. We assume that domain experts do not necessarily possess an understanding of query languages, RDF, the Semantic Web, or database systems. Moreover, people at different levels (e.g., technician, supervisor, manager) may have different information requirements. Programming the system with pre-defined queries for all possible scenarios and personnel is impractical, if not impossible, since data needs vary from person to person and from application to application, and maintaining and updating such a library of queries would again require IT assistance. There is a need for a simplified way of querying semantic data, in which domain experts state queries based on their data needs and the formal queries are automatically formulated by the system.

5.5.1 Automatic SPARQL Query Formulation (ASQFor)

Several systems have been proposed in the literature to allow non-expert users to formulate SPARQL queries indirectly using abstractions (see Chapter 2). Here, we present the Automatic SPARQL Query Formulation (ASQFor) algorithm, which allows domain experts to issue SPARQL queries indirectly. ASQFor relies on an ontology (the SOSNet Ontology in the case of AIM) and a keyword-based interface to formulate SPARQL queries which syntactically resemble queries written manually by IT experts. The complete pseudocode of the algorithm, with a detailed description of each step, complexity analysis, and evaluation on the 1990 US Census data, is provided in [84, 86]. Here, we show an example of how this technique can make data exploration simpler for a non-expert user.

Because the ASQFor algorithm does not use any pre-defined information and extracts the relevant information dynamically from the ontology, it can work with any semantic database that has a well-defined schema ontology and is not tailor-made for any specific process within the oil and gas domain.
This makes the approach domain independent, insusceptible to schema changes, and portable to other use cases without requiring any modifications to the algorithm. Figure 5.6 shows an example set of keywords provided to ASQFor and the translation that occurs through its different stages. In this example, certain attributes with corresponding filtering values have been provided. The invocation of ASQFor is done through a single query function that takes key-value pairs: keys correspond to ontological terms, while values (e.g., "OPEN" and "SEVERE" in the example) are used for filtering. ASQFor automatically traverses the schema ontology, generates and issues the corresponding SPARQL query, and returns the results. A single flexible function for querying the repository, based solely on the information in the database and not on how it is organized and stored, can make developing applications easier for programmers. We show how this function can be leveraged for the development of intuitive applications in Section 5.6.

Figure 5.6: From simple keywords to a formal ASQFor-generated SPARQL query

5.5.2 Similarity-based Queries

Depending on the number of records in a semantic repository, simple user queries can return dozens or even hundreds of results, which can lead to information overload. For domain experts, a list of the top-k most important assets may be more useful. One measure of importance can be based on the critical status of the assets, under the assumption that the most critical assets require the most urgent preventive measures. This requires importance (or criticality) to be measured in a quantifiable way, so that assets can be ranked accordingly. One approach is to devise methods for quantifying criticality based on user-defined rules or features of assets. Instead of adopting this methodology for the SOSNet system, we leverage the idea behind content-based recommendation [68]: given that a user has watched a movie, such a system recommends other movies that, e.g., belong to the same genre or have similar plotlines. The key idea is that similarity is computed with respect to a given entity or set of entities, i.e., the already watched movie(s). Analogously, we use a notion of pairwise similarity between assets based on their features. The user provides an asset as a search key, and the system generates a list of the assets most similar to it. This allows the user to define the criteria of importance or criticality indirectly. For example, by providing an asset that is of type tank and has severe corrosion, the assets most likely to appear as top results will be other tanks with severe corrosion. We use two approaches for computing pairwise similarity, presented below.

Figure 5.7: Attribute-to-attribute based computation of similarity between two assets

5.5.2.1 Similarity Scores based on Attribute-to-Attribute Comparison

This method computes the pairwise similarity of assets using all attributes, one at a time. The final similarity score between two assets is a weighted sum of the similarity scores computed from the individual attributes. Figure 5.7 shows an RDF graph-based representation of two assets. We assert that the similarity of these two assets can be computed by comparing their corresponding direct and indirect attributes. Direct attributes are linked to the asset by a single edge (or predicate); in the figure, these are fluid, type, and status. Indirect attributes are those which are direct attributes of other entities (e.g., Surveys, Work Orders, Photographs) linked to the assets. In Figure 5.7, problem and recommendation from the survey data and description and solution from the work orders represent the indirect attributes of the shown assets. To compute pairwise similarity, we select each direct or indirect attribute in turn and compare its sets of values, corresponding to the two target assets, against each other. This can be visualized as a bipartite graph between the corresponding values of the selected attribute. Figure 5.8 shows the example of computing pairwise similarity based on work orders. The weights of the edges represent the similarity computed between the different values of the selected attribute, based on string or document matching techniques. The similarity score between entity1 and entity2 based on a single attribute is the average of all the edge weights in the bipartite graph, and the final similarity score between the two entities based on all attributes is the weighted sum of the single-attribute similarity scores.

Figure 5.8: Bipartite graph between values of an attribute for two assets

Given a set of N different direct and indirect attributes, the similarity score between two entities e1 and e2 can be computed as

sim(e1, e2) = Σ_{i=1}^{N} w_i · sim_i(e1, e2), where Σ_{i=1}^{N} w_i = 1,

and sim_i and w_i are the similarity score based on the i-th attribute and its corresponding weight, respectively. This methodology can be viewed as a top-down search for relevant attributes followed by a bottom-up computation of similarity. For each given asset, we go deeper (farther) into the surrounding RDF graph to collect attributes and their values. We then compute the similarity between the values of a single attribute (represented by the edges in the bipartite graph), which is averaged to obtain the single-attribute similarity scores. Moving up (closer to the asset), we compute the weighted sum of the single-attribute similarity scores to obtain the pairwise similarity of the assets based on all attributes. This approach requires a set of attributes and their weights to be determined in advance.
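The following is a minimal Python sketch of this computation. The trivial exact-match scorer and the toy attribute values are placeholders; in SOSNet, the edge weights come from string or document matching techniques, and the attributes and weights are chosen in advance.

from itertools import product

def attribute_similarity(vals1, vals2, match):
    # Average edge weight of the bipartite graph between the two
    # value sets of a single attribute (Figure 5.8).
    if not vals1 or not vals2:
        return 0.0
    edges = [match(a, b) for a, b in product(vals1, vals2)]
    return sum(edges) / len(edges)

def pairwise_similarity(e1, e2, weights, match):
    # Weighted sum over the selected direct and indirect attributes;
    # the weights are assumed to sum to 1, as in the formula above.
    return sum(w * attribute_similarity(e1.get(a, []), e2.get(a, []), match)
               for a, w in weights.items())

match = lambda a, b: 1.0 if a == b else 0.0  # stand-in matcher
asset1 = {"status": ["Severe"], "description": ["replace external coating"]}
asset2 = {"status": ["Severe"], "description": ["repair weld on nozzle"]}
print(pairwise_similarity(asset1, asset2,
                          {"status": 0.5, "description": 0.5}, match))  # 0.5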
5.5.2 Similarity-based Queries

Depending on the number of records in a semantic repository, simple user queries can return dozens or even hundreds of results, which can lead to information overload. For domain experts, a list of the top-k most important assets may be more useful. One measure of importance can be based on the critical status of the assets, under the assumption that the most critical assets require the most urgent preventive measures. This requires importance (or criticality) to be measured in a quantifiable way, so that assets can be ranked accordingly. One way is to devise methods for quantifying criticality based on user-defined rules or features of assets. Instead of adopting this methodology for the SOSNet system, we leverage the idea behind content-based recommendation [68]. For example, given that a user has watched a movie, the system recommends other movies that may, e.g., belong to the same genre or have similar plotlines. The key idea is that similarity is computed with respect to a given entity or set of entities, i.e., the already watched movie(s). Similarly, we use this notion of pairwise similarity between assets based on their features. The user provides an asset as a search key, and the system generates a list of the assets most similar to it. This allows the user to define the criteria of importance or criticality indirectly. For example, by providing an asset that has severe corrosion and is of type tank, the assets most likely to appear as top results will be other tanks with severe corrosion. We use two approaches for computing pairwise similarity, presented here.

Figure 5.7: Attribute-to-attribute based computation of similarity between two assets

5.5.2.1 Similarity Scores based on Attribute-to-Attribute Comparison

This method computes the pairwise similarity of assets using all of their attributes, one at a time. The final similarity score between two assets is a weighted sum of the similarity scores computed from the individual attributes. Figure 5.7 shows an RDF graph-based representation of two assets. We assert that the similarity of these two assets can be computed by comparing their corresponding direct and indirect attributes. Direct attributes are linked to the asset by a single edge (or predicate); in the figure, these are fluid, type, and status. Indirect attributes are those which are direct attributes of other entities (e.g. Surveys, Work Orders, Photographs) linked to the assets. In Figure 5.7, problem and recommendation from the survey data, and description and solution from the work orders, represent the indirect attributes of the shown assets. To compute pairwise similarity, we select each direct or indirect attribute one by one and compare its sets of values, corresponding to the two target assets, against each other. This comparison can be visualized as a bipartite graph between the corresponding values of the selected attribute.

Figure 5.8: Bipartite graph between values of attributes of two assets

Figure 5.8 shows an example of computing pairwise similarity based on work orders. The weights of the edges represent the similarity computed between different values of the selected attribute, based on string or document matching techniques. The similarity score between entity1 and entity2 based on a single attribute is the average of all the edge weights in the bipartite graph, and the final similarity score between two entities based on all attributes is the weighted sum of the single-attribute similarity scores. Given a set of $N$ different direct and indirect attributes, the similarity score between two entities $e_1$ and $e_2$ can be computed as $sim(e_1, e_2) = \sum_{i=1}^{N} w_i \, sim_i(e_1, e_2)$, where $\sum_{i=1}^{N} w_i = 1$, and $sim_i$ and $w_i$ are the similarity score based on the $i$-th attribute and its corresponding weight, respectively. This methodology can be viewed as a top-down search for relevant attributes followed by a bottom-up computation of similarity. For each given asset, we go deeper (or farther) into the surrounding RDF graph to collect attributes and their values. We then compute the similarity between the values of a single attribute (represented by the edges in the bipartite graph), which is averaged into a single-attribute similarity score. Moving up (closer to the asset), we compute the weighted sum of the single-attribute similarity scores to obtain the pairwise similarity of the assets based on all attributes. This approach requires pre-determination of a set of attributes and their weights for computing similarity.
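The computation can be summarized in a few lines of code. In the minimal sketch below, a Jaccard token-overlap score stands in for the string/document matching techniques mentioned above, and the attribute names, values, and weights are hypothetical toy examples.

# Minimal sketch of attribute-to-attribute similarity. Jaccard token
# overlap stands in for string/document matching; the attributes and
# weights below are hypothetical examples.

def value_similarity(a: str, b: str) -> float:
    """Edge weight in the bipartite graph between two attribute values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def attribute_similarity(values1: list[str], values2: list[str]) -> float:
    """Average of all edge weights in the bipartite value graph."""
    if not values1 or not values2:
        return 0.0
    edges = [value_similarity(a, b) for a in values1 for b in values2]
    return sum(edges) / len(edges)

def asset_similarity(asset1: dict, asset2: dict, weights: dict) -> float:
    """Weighted sum over attributes: sim = sum_i w_i * sim_i."""
    return sum(
        w * attribute_similarity(asset1.get(attr, []), asset2.get(attr, []))
        for attr, w in weights.items()
    )

# Hypothetical direct and indirect attributes of two assets.
t1000 = {"type": ["tank"], "status": ["severe corrosion"],
         "workorder.description": ["replace corroded shell plate"]}
t2000 = {"type": ["tank"], "status": ["moderate corrosion"],
         "workorder.description": ["repaint corroded shell plate"]}

weights = {"type": 0.3, "status": 0.3, "workorder.description": 0.4}  # sums to 1
print(asset_similarity(t1000, t2000, weights))

Computing this score for the search key against every other asset and sorting gives the ranked top-k list described above.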
In contrast to this methodology, we next present another approach, which relies on random walks for the automatic collection of attribute values and uses them to learn vector representations of assets with neural language models in an unsupervised manner.

5.5.2.2 Asset2Vec: Generating Propositional Representations of Assets

In this approach, instead of extracting attributes and their values for computing similarity, we extract representative neighborhoods of each asset using biased random walks [17, 91]. These extracted subgraphs are then converted into feature vectors using the neural language model Skip-gram [65].

Figure 5.9: From RDF graph-based to feature vector-based representation

Figure 5.9 shows the workflow for obtaining vector representations of RDF subgraphs. We use pruned and biased random walks [87] to traverse the neighborhood around the target entities. The extracted subgraphs are represented as sequences of tokens, resembling a document, as shown in the figure. We traverse the neighborhoods of target entities up to depth 4 for subgraph extraction. Using these extracted subgraphs, we train Skip-gram models with the following parameters: dimensionality of generated vectors = 500, window size = 10, negative samples = 25, and iterations = 5 for each depth. All models for depth d > 1 are trained using sequences generated for both depths 1 and d [87].
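A rough sketch of this pipeline, using the Skip-gram implementation in gensim, is shown below. The toy graph, the uniform walk policy (a simplification of the pruned and biased walks of [87]), and the asset IDs are illustrative only; the Skip-gram parameters mirror those listed above.

# Minimal Asset2Vec sketch: random walks over a toy RDF graph, then
# gensim Skip-gram training. Uniform walks replace the pruned/biased
# walks of [87]; graph contents and IDs are fabricated examples.
import random
from gensim.models import Word2Vec

# RDF graph as adjacency lists of (predicate, object) pairs (toy data).
graph = {
    "AAQ-0108": [("type", "piping"), ("status", "Moderate"), ("onPID", "P-01")],
    "AAQ-0103": [("type", "piping"), ("status", "Moderate"), ("onPID", "P-01")],
    "MTD-0204": [("type", "tank"), ("status", "Severe"), ("onPID", "P-07")],
    "piping": [], "tank": [], "Moderate": [], "Severe": [], "P-01": [], "P-07": [],
}

def random_walk(entity: str, depth: int) -> list[str]:
    """entity -> predicate -> object ... token sequence up to `depth` hops."""
    walk, node = [entity], entity
    for _ in range(depth):
        if not graph.get(node):
            break
        pred, obj = random.choice(graph[node])
        walk += [pred, obj]
        node = obj
    return walk

assets = ["AAQ-0108", "AAQ-0103", "MTD-0204"]
# Per the text, the depth-4 model also sees the depth-1 sequences.
walks = [random_walk(a, d) for a in assets for d in (1, 4) for _ in range(50)]

model = Word2Vec(walks, sg=1, vector_size=500, window=10, negative=25,
                 epochs=5, min_count=1)

# Cosine-similarity lookup of the tokens nearest to a search key
# (a real application would filter the results down to asset IDs):
print(model.wv.most_similar("AAQ-0108", topn=2))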
Once the feature vectors are generated for all assets, we use a simple cosine similarity measure to find the vectors (and hence the assets) most similar to the given search key.

Search Key | Attribute-to-Attribute Comparison | Asset2Vec
AAQ-0108   | AAQ-0103                          | AAQ-0103
           | AAQ-0105                          | AAQ-0103 PCT
           | AAQ-0102                          | AAQ-0102
           | AAQ-0105                          | AAQ-0102 PCT
           | AAQ-0103 PCT                      | MTD-0204

Table 5.1: Results for search key AAQ-0108

Table 5.1 shows the top-5 results for equipment similar to the given search key AAQ-0108 (a substitute name for the actual instrument ID). The lists of most similar assets are generated using both techniques discussed in this section. There are entries common to both lists; however, the orders of the two lists are not identical. The attribute-to-attribute comparison is based on a pre-selected set of attributes and weights for computing pairwise similarity scores, whereas Asset2Vec selects attributes and labels from the neighborhoods of assets in the RDF graph in an unsupervised way. A difference in the ordering is therefore expected, given the difference in how the two approaches select attributes. In terms of interpretability, the first approach fares better, since the criteria for computing similarity can be concretely defined through the selection of attributes and corresponding weights. The unsupervised approach is more of a black box: upon examining the results, it is not immediately clear why certain assets have been declared most similar. A manual inspection of the data, however, revealed that the search key and the results share the values of one or more attributes, such as P&ID, equipment type (piping), or corrosion status (Moderate). The benefit of Asset2Vec is that it does not require pre-selection of relevant attributes and corresponding weight assignment, meaning that no domain knowledge is required.

5.6 Representative Applications of SOSNet System

In this section, we discuss representative applications that we have built on top of SOSNet to demonstrate the capabilities enabled by our semantic data integration and querying framework.

Figure 5.10: Enabled intuitive search capability over integrated asset integrity data

5.6.1 Querying of Integrated Asset Integrity Data Made Easy

ASQFor allows us to develop an intuitive mechanism for data exploration. The end user (domain expert) is exposed only to the attributes residing in the database. These attributes can be selected from the interface and are then passed to the ASQFor algorithm, which creates the SPARQL query automatically and returns the relevant data. This interface can be generated programmatically based on the concepts and attributes in the given ontology. The selection of check boxes shown in Figure 5.10 corresponds to the sample query of Figure 5.5, which can now be generated automatically this way. As discussed in Section 5.5.1, the domain-independent aspect of ASQFor allows this simple search application to query any semantic repository with hierarchically organized data and a well-defined schema ontology, functioning as a portable database search application without requiring any modifications to the algorithm.

Figure 5.11: Prediction of behavior of a TML over time

5.6.2 Support for Predictive Maintenance Analytics

In order to keep track of assets' status, asset integrity experts maintain a database of thickness measurements of the external coatings of assets. These measurements must remain above minimum thickness thresholds for safe operation. Surveyors take measurements at regular periods or at calculated intervals based on the rate of decrease in thickness between the two most recent measurements. A simple application can be developed that accesses the thickness measurement time series of tens of thousands of TMLs and, based on their trends, extrapolates the future time at which the thickness will reach critical levels. This way, critical assets can be identified and ranked based on the lengths of these time gaps for scheduling maintenance-related activities. An example plot is shown in Figure 5.11. Two different forecast curves are shown, based on the long-term and short-term corrosion rates for a particular TML on tank T1000. Depending on the size of the asset, it can have dozens to a hundred TMLs, and hence that many independent time series. Automatic ranking of assets based on thickness levels can alert asset integrity managers in time to schedule repair and maintenance activities accordingly. To access the integrated data for the plot shown in Figure 5.11, the query function discussed in Section 5.5.1 can be used. An invocation of the query function to get all measurements for TML #1.00 on T-1000 is:

query(<Facility, "TK21">, <Equipment, "T-1000">, <TML, "1.00">, <MinThick, "">, <SurveyDate, "">, <ThicknessValue, "">)

By running multiple iterations of the above query function with different parameters, or by removing the filter on the keyword Equipment, data for all TMLs in a facility can be gathered. Since ASQFor is insusceptible to changes in the schema ontology, using the query function instead of hardcoded SPARQL queries in the application eliminates the need to update the application every time the schema ontology changes.
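Combining the two steps, the sketch below extrapolates a retrieved thickness series to estimate when it will cross the minimum-thickness threshold. The survey dates and thickness values are fabricated toy data; a real application would obtain them through the query function above.

# Sketch of extrapolating a TML thickness series to estimate when it
# will reach the minimum-thickness threshold. The measurements are
# fabricated toy values.
import numpy as np

years = np.array([2010.0, 2012.0, 2014.0, 2016.0, 2018.0])   # survey dates
thickness = np.array([0.50, 0.47, 0.45, 0.42, 0.40])          # inches
min_thickness = 0.30                                          # safety threshold

# Long-term rate: least-squares fit over the full history.
slope_lt, _ = np.polyfit(years, thickness, 1)
# Short-term rate: slope between the two most recent surveys.
slope_st = (thickness[-1] - thickness[-2]) / (years[-1] - years[-2])

def crossing_year(slope: float, t0: float, y0: float) -> float:
    """Year at which the linear forecast reaches min_thickness."""
    return t0 + (min_thickness - y0) / slope

print("long-term forecast :", crossing_year(slope_lt, years[-1], thickness[-1]))
print("short-term forecast:", crossing_year(slope_st, years[-1], thickness[-1]))

The smaller of the two forecast horizons can serve as the ranking criterion for scheduling maintenance.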
Figure 5.12: Facility overview screen (anonymized data)

Figure 5.13: Equipment overview screen (anonymized data)

5.6.3 Driving the Development of Effective Visualizations

Visualization is helpful for collaboration and decision making by presenting data in a comprehensive manner [50]. One such example is shown in Figure 5.12, where an overview of the facility is provided for the user. Various charts showing alarms and events have been created. Active warnings and critical alerts are also shown; clicking on them can take users to detailed pages. A detailed page for a particular piece of equipment is shown in Figure 5.13, where information about the inspection schedule, an overview of the TML trend, and data from the latest survey of the equipment are displayed. Clicking on marked points on the equipment diagram can display trends like the one shown in Figure 5.11. Due to the integration process, the data that previously resided in various individual files and databases has been organized and stored in an RDF semantic repository and can be easily queried through the query function of ASQFor. By using a single visualization screen template (e.g. for the equipment overview screen in Figure 5.13) and the query function with different filtering values, the development of similar visualization screens can be simplified.

5.7 Summary

We presented a framework for Asset Integrity Management that leverages Semantic Web technologies for data modeling, integration, and access. The presented extensible semantic model can be used for integrating scattered information into a coherent and comprehensive view of the AIM environment for the domain expert, facilitating robust decision making. With ASQFor, data can be retrieved through a simple interface and then analyzed using existing or emerging data analytics tools. We use the semantic model, the integrated information bus, and ASQFor as the basis for developing a novel Asset Integrity Management system, SOSNet. We also discussed the Asset2Vec approach, which can be used to generate feature vector-based representations of assets, making them compatible with various data mining tasks. We showed the example of similarity-based queries, where users can get a ranked list of entities similar to a given search key.

Chapter 6

Use Case: Integrated Movie Database

6.1 Motivation

Data about movies, just like data about assets in a smart oilfield, is located in multiple places and consists of multiple concepts (e.g. Actors, Directors, Books, Awards), relationships between them, and associated attributes. Various websites provide nuggets of movie data, such as IMDB (http://www.imdb.com), the Internet Movie Database, the most comprehensive online source of information about movies. IMDB provides various attributes related to movies, such as title, genre, run time, casting details, and awards. BoxOfficeMojo (http://www.boxofficemojo.com), another movie-related website, mainly focuses on the financial aspects of movies. It keeps track of the domestic and worldwide earnings of movies on a daily, weekly, and monthly basis. Other websites, such as RottenTomatoes (http://www.rottentomatoes.com), assign scores to movies based on the percentage of positive movie reviews published in notable publications. Due to the distribution of information about movies across multiple websites, users cannot get a unified view of all the relevant information pertaining to a movie. For example, performing an analysis of the box office business of Oscar-winning movies requires accessing multiple web pages (i.e., from IMDB and BoxOfficeMojo). This can be a time-consuming process if done manually.

6.2 Problem Formulation

The problem of consolidating scattered pieces of relevant information in this domain can be solved by a Semantic Web based system like SOSNet. We acquire information about movies from multiple web sources and use SOSNet, tailored to the movie domain, to organize, integrate, and query movie-related data. It is important to note that the objective of this work is not to create mash-up applications. Mash-up applications usually focus on presenting different streams of information in different panels or frames within the same application window. However, those streams cannot be used to run queries that combine data from both streams. For example, with a mash-up application that shows information from RottenTomatoes and BoxOfficeMojo in separate panels within the same application window, we can filter movies using the two frames independently, based on critic score and worldwide gross respectively. However, issuing a query that filters results based on both criteria simultaneously requires the two data streams to be integrated.

6.3 Data Acquisition

We gathered data from four websites. The attributes crawled from each website are listed in Table 6.1.
The total number of records crawled from each website is given in Table 6.2.

Website        | Extracted Information
IMDB           | Title, Release Date, Genre, MPAA Rating, IMDB User Rating, List of Cast Members (Actors/Actresses), Metacritic Score, List of Academy Awards (won), Director
RottenTomatoes | Title, Year, Critic Score (Tomatometer)
BoxOfficeMojo  | Title, Release Date, Genre, Run Time, Domestic Gross, Worldwide Gross, Budget
GoodReads      | Book Title, Year, Author Name, Rating

Table 6.1: Information extracted from movie websites

Website        | No. of Records Generated
IMDB           | Records generated for 36,549 movies; total casting records generated: 856,407
RottenTomatoes | Records generated for 10,000 movies
BoxOfficeMojo  | Records generated for 16,945 movies
GoodReads      | Records generated for over 3,000 books

Table 6.2: Number of records in IMDB data

6.4 Data Modeling and Integration

To model the collected data, we re-use concepts and attributes from ontologies such as DBpedia (http://dbpedia.org/ontology/) and the IMDB ontology [3]. The central concept of the model is Movie. An instance of the class Movie is linked to an instance of the class Book if the movie is an adaptation. Each book has an associated instance of the class Author. Each movie has associated instances of the classes Actor and Director. Each movie instance is associated with multiple instances of the class Award (if it has won any), and each such instance of Award has an associated Winner. The system takes the ontology and the raw data as input and produces an RDF graph. Each website has its own convention for assigning an ID (or URL) to movies, which we use as the URI in the RDF graph. This results in the RDF graph containing instances of the same movie with different URIs. To establish links between different instances of the same movie in the RDF graph, we perform record linkage [31, 117] based on movie attributes. Specifically, we perform string matching based on movie titles and release years. The IMDB movies that are adaptations are linked to the corresponding books from GoodReads by matching titles and ensuring that the name of the book's author appears in the writing credits for the movie. We also ensure that the year of release of the movie does not fall before the year of publication of the matched book. The basedOn property from the ontology is used to create links between the relevant instances of Movie and Book. We have briefly described the record linkage techniques used in the overall workflow of our system; however, the purpose of our work is not to optimize the record linkage process, and there is extensive work in the literature addressing this field [93, 40].
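The sketch below illustrates the title/year matching rules just described; the record fields and the example records are hypothetical simplifications of the crawled data.

# Illustrative sketch of the record-linkage rules; the record fields
# and toy data below are hypothetical simplifications.

def normalize(title: str) -> str:
    """Lowercase and strip punctuation for string matching on titles."""
    return "".join(ch for ch in title.lower() if ch.isalnum() or ch == " ").strip()

def same_movie(rec1: dict, rec2: dict) -> bool:
    """Link two movie records on normalized title and release year."""
    return (normalize(rec1["title"]) == normalize(rec2["title"])
            and rec1["year"] == rec2["year"])

def based_on(movie: dict, book: dict) -> bool:
    """Link a movie to a book: matching titles, the book's author among
    the movie's writing credits, and publication not after release."""
    return (normalize(movie["title"]) == normalize(book["title"])
            and book["author"] in movie["writers"]
            and book["year"] <= movie["year"])

imdb = {"title": "The Hobbit: An Unexpected Journey", "year": 2012,
        "writers": ["J.R.R. Tolkien", "Fran Walsh", "Peter Jackson"]}
goodreads = {"title": "The Hobbit: An Unexpected Journey", "year": 1937,
             "author": "J.R.R. Tolkien"}
print(based_on(imdb, goodreads))  # True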
6.5 Querying Integrated Data

We formulate queries over the integrated repository of movie data based on attributes belonging to previously independent data sources. For example, the query "List all movies based on books of J.R.R. Tolkien along with their ratings and worldwide gross" would require users to explore IMDB (movie rating, release year), RottenTomatoes (critic score), BoxOfficeMojo (worldwide gross), and GoodReads (book title, year of publication, book rating) to get the answer. With the integrated data, this can be done with a single query, as shown in Figure 6.1a. The results are shown in Figure 6.2. An expert can interact with the integrated movie database using the full functionality of SPARQL. For a non-expert to gain access to the integrated repository, we developed a simple interface, similar to the one shown in Figure 5.10.

Figure 6.1: SPARQL queries to get movies based on books of J.R.R. Tolkien ((a) manual query; (b) ASQFor-generated query)

Figure 6.2: List of movies with selected attributes based on J.R.R. Tolkien's books

#  | Description                                       | # of Results
Q1 | Movies with a RottenTomatoes score of > 50%.      | 5,681
Q2 | Action movies with the list of Academy Awards.    | 111
Q3 | Info about movies based on Tolkien's books.       | 6
Q4 | List of actors who have starred in a drama movie. | 334,362

Table 6.3: Evaluation queries for IMDB data

Figure 6.3: Comparison between query formulation and total query execution time for ASQFor-generated queries ((a) Q1; (b) Q2; (c) Q3; (d) Q4)

In order to measure the query formulation overhead introduced by ASQFor relative to the total query execution time, we divided our data set into subsets D1, D2, D3, D4, and D5 containing 697,782, 997,775, 1,295,509, 1,602,075, and 1,910,741 triples respectively, such that D1 ⊂ D2 ⊂ D3 ⊂ D4 ⊂ D5. We evaluated the four queries listed in Table 6.3. We measured the overhead introduced by the ASQFor algorithm for query formulation over the total query execution time for each query on each data set. Figure 6.3 shows the comparison of the average formulation time with the average query execution time (each query was issued five times on each data set). Figure 6.3 suggests that the overhead of ASQFor for query formulation is consistent across queries and is less than the query execution time for each query except Q3. For Q3, the execution time is less than the formulation time, since the query returns at most six results, i.e., the three Lord of the Rings movies and the three The Hobbit movies. Since query execution time can vary as a function of the size of the result set, for large repositories where queries can generate thousands of results, the query execution time will most likely surpass the query formulation time. Hence, for larger repositories, the formulation time is insignificant compared to the query execution time. This becomes clear in the case of Q4, for which the comparison is shown in Figure 6.3d. Since Q4 looks for the relevant instances of the class Actor, issuing the query on incrementally increasing data sets results in a proportional increase in the size of the result set and in the query execution time.

Figure 6.4: Comparison between execution times of manual and ASQFor-generated queries ((a) Q3; (b) Q4)

We also show the syntactic differences between the manually crafted and ASQFor-generated queries for the sample query Q3 in Figure 6.1. The minor distinction is in the use of the FILTER statement. The manual query refers directly to "J.R.R. Tolkien" in one of the triple patterns, whereas the ASQFor-generated query uses a separate FILTER statement at the end. Depending on how the query engine performs query optimization, this may impact the query execution time, but there is no difference in the returned results (Figure 6.2) between the two queries. To quantify the difference in execution time, a comparison of the execution times of ASQFor-generated and manually crafted queries is shown in Figure 6.4 for the two queries Q3 (small result set) and Q4 (large result set). The difference in execution times between manual and ASQFor-generated queries is on the order of tens of milliseconds for the majority of the experiments (in Figure 6.4b, the values below the curve are for the ASQFor-generated queries). This demonstrates that the ASQFor algorithm is an effective technique for allowing non-experts to query semantic repositories without compromising the retrieved data as compared to manual queries.
6.6 Summary

Semantic Web-based techniques provide a comprehensive set of standards to organize, integrate, and query semantic data and are employed in multiple domains. We have shown the value of ontology-based integration, in conjunction with ASQFor, for the use case of movies and books by answering queries that would otherwise have required exploring multiple web pages.

Chapter 7

Thesis Conclusion

In this thesis, we proposed approaches to allow non-expert users to gather insights from semantic or Linked Data. These approaches were shown to be schema-agnostic and to work with a variety of semantic knowledge bases.

7.1 Discovering Implicit Relationships in Semantic Data

Knowledge Graphs (KGs) have become useful sources of structured data for information retrieval and data analytics tasks. To enable analytics on semantic data, we proposed Specificity as an accurate measure for identifying the most relevant information in Linked Data Knowledge Graphs. Specificity allows the automatic identification of the most relevant semantic relationships for extracting entity-specific representations of the various entities in a KG. This is in contrast to frequency- or PageRank-based extracted representations, which are KG-specific because of the inclusion of popular nodes and edges irrespective of their semantic relevance to the extracted entities. We computed Specificity using a scalable method based on bidirectional random walks, making it suitable for large-scale repositories that are part of the Linked Open Data cloud (e.g., DBpedia, Wikidata). Specificity-based biased random walks extract more meaningful (in terms of size and relevance) substructures compared to the state-of-the-art. The graph embeddings learned from the extracted substructures, using neural language models for unsupervised feature extraction, are well-suited for multiple data mining tasks. These graph embeddings preserve the semantic context of the selected entities when projected into a low-dimensional vector space. We showed the use case of a content-based recommender system that can be efficiently built using this approach, both for smart oilfields and for other domains such as films, books, and music albums. We also proposed an extension of Specificity, termed Specificity_H, with a hyperparameter β that controls the trade-off between better precision and smaller extracted subgraph sizes.

7.2 Semantic Query Formulation for the Non-Expert

We also addressed the challenges of data access for non-expert users. Usually, such users require the assistance of IT experts to extract data from databases for their day-to-day use. Accessing structured data requires learning formal query languages, such as SPARQL, which poses significant difficulties for non-expert users. To avoid the pitfalls of existing approaches that rely on predefined templates or offline computations, while at the same time retaining the ability to capture users' complex information needs, we presented a simple keyword-based search interface to the Semantic Web. Specifically, we proposed Automatic SPARQL Query Formulation (ASQFor), a systematic framework to issue semantic queries over RDF repositories using simple concept-based search primitives. ASQFor has a straightforward interface, requires no user training, and can be easily embedded in any system or used with any semantic repository without prior customization.
The ASQFor algorithm allows non-expert users to explore semantic data without needing an understanding of formal languages for querying and knowledge representation. ASQFor, a schema-agnostic framework, provides a simple but powerful way of specifying complex queries and automatically translates them into formal queries on the fly, allowing it to adapt to changes in the ontology instantly. The approach is generalizable and easily adaptable to other domains by exchanging the data and the ontology describing it, as long as that ontology follows a particular structure.

7.3 Application: Smart Oilfield Safety Net

Using the proposed approaches, we built a data-to-insights pipeline for the smart oilfield in the SOSNet (Smart Oilfield Safety Net) project. We presented a framework for Asset Integrity Management (AIM) that leveraged Semantic Web technologies for data modeling, integration, and
Estimating the value of β from the structure of the given KG is the logical next step. The performance of the recommendation task can be improved by incorporating collaborative information in conjunction with the specificity-based extracted semantic information into the recommender algorithm. In this thesis, we only focused on a limited set of data mining tasks (regression, classification, and recommendation). This work can be extended by using specificity-based embeddings with other tasks, such as KG completion [63], entity-type prediction, error detection in Knowledge Graphs, and entity set expansion [123]. Furthermore, valuable information about the same entities might be available in multiple Knowledge Graphs. The information can be combined either by (i) first aligning/integrating the Knowledge Graphs and then generating the embeddings or (ii) by generating embeddings separately and then aligning the resulting vector spaces. The second approach requires the alignment of embeddings to obtain a space where the same entities from the different KGs are closer while preserving the semantics of their original embeddings [5]. To summarize, we believe that contributions like this thesis will lead to advancements in managing, accessing, and analyzing Semantic Web-based representation of information and also make these capabilities accessible to non-expert users, removing the barriers to the rapid adoption of such technologies. 113 Reference List [1] Nitish Aggarwal, Kartik Asooja, Housam Ziad, and Paul Buitelaar. Who are the amer- ican vegans related to brad pitt?: Exploring related entities. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, pages 151–154, 2015. [2] Maurizio Atzori and Andrea Dessi. Ranking dbpedia properties. In 2014 IEEE 23rd Inter- national WETICE Conference, WETICE 2014, pages 441–446, 2014. [3] Sasikanth Avancha, Srikanth Kallurkar, and Tapan Kamdar. Design of ontology for the internet movie database (imdb). 2001. [4] Payam M. Barnaghi, Wei Wang, Cory A. Henson, and Kerry Taylor. Semantics for the internet of things: Early progress and back to the future. Int. J. Semantic Web Inf. Syst., 8(1):1–21, 2012. [5] Matthias Baumgartner, Wen Zhang, Bibek Paudel, Daniele Dell’Aglio, Huajun Chen, and Abraham Bernstein. Aligning knowledge base and document embedding models using regu- larized multi-task learning. In The Semantic Web - ISWC 2018 - 17th International Seman- tic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I, pages 21–37, 2018. [6] Bettina Berendt, Andreas Hotho, and Gerd Stumme. Towards semantic web mining. In The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, Proceedings, pages 264–278, 2002. [7] Tim Berners-Lee. The web of things. special theme on the future web. ercim news–the eu- ropean research consortium for informatics and mathematics (2008). http://ercim-news. ercim.eu/en72/keynote/the-web-of-things. Accessed: 2016-05-15. [8] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001. [9] Abraham Bernstein, Esther Kaufmann, and Christian Kaiser. Querying the semantic web with ginseng: A guided input natural language search engine. In In: 15th Workshop on Information Technologies and Systems, Las Vegas, NV, pages 112–126, 2005. [10] Ghassan Beydoun, Antonio A. Lopez-Lorca, Francisco Garc´ ıa S´ anchez, and Rodrigo Mart´ ınez-B´ ejar. 
How do we measure and improve the quality of a hierarchical ontology? Journal of Systems and Software, 84(12):2363–2373, 2011.

[11] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.

[12] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. Measuring the similarity between implicit semantic relations from the web. In Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009, pages 651–660, 2009.

[13] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2787–2795, 2013.

[14] HongYun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng., 30(9):1616–1637, 2018.

[15] Mark A. Cameron, Jemma Wu, Kerry Taylor, David Ratcliffe, Geoffrey Squire, and John Colton. Semantic solutions for integration of federated ocean observations. In Proceedings of the 2nd International Workshop on Semantic Sensor Networks (SSN09), collocated with the 8th International Semantic Web Conference (ISWC 2009), Washington DC, USA, October 26, 2009, pages 64–79, 2009.

[16] Scott Cederberg and Dominic Widdows. Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 111–118. Association for Computational Linguistics, 2003.

[17] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Biased graph walks for RDF graph embeddings. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, Amantea, Italy, June 19-22, 2017, pages 21:1–21:12, 2017.

[18] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Global RDF vector space embeddings. In The Semantic Web - ISWC 2017 - 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part I, pages 190–207, 2017.

[19] Michael Compton, Payam Barnaghi, Luis Bermudez, Raúl García-Castro, Oscar Corcho, Simon Cox, John Graybeal, Manfred Hauswirth, Cory Henson, Arthur Herzog, Vincent Huang, Krzysztof Janowicz, W. David Kelsey, Danh Le Phuoc, Laurent Lefort, Myriam Leggieri, Holger Neuhaus, Andriy Nikolov, Kevin Page, Alexandre Passant, Amit Sheth, and Kerry Taylor. The SSN ontology of the W3C semantic sensor network incubator group. Web Semantics: Science, Services and Agents on the World Wide Web, 17:25–32, 2012.

[20] Danica Damljanovic, Milan Agatonovic, and Hamish Cunningham. Natural language interfaces to ontologies: Combining syntactic analysis and ontology-based lookup through the user interaction. In Extended Semantic Web Conference, pages 106–120. Springer, 2010.

[21] Danica Damljanovic, Milan Agatonovic, and Hamish Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[22] Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure counting graph kernels for machine learning from RDF data. J. Web Sem., 35:71–84, 2015.
[23] Stefan Decker, Sergey Melnik, Frank van Harmelen, Dieter Fensel, Michel C. A. Klein, Jeen Broekstra, Michael Erdmann, and Ian Horrocks. The semantic web: The roles of XML and RDF. IEEE Internet Computing, 4(5):63–74, 2000.

[24] Tommaso Di Noia, Roberto Mirizzi, Vito Claudio Ostuni, Davide Romito, and Markus Zanker. Linked open data to support content-based recommender systems. In Proceedings of the 8th International Conference on Semantic Systems, pages 1–8. ACM, 2012.

[25] Dennis Diefenbach and Andreas Thalhammer. PageRank and generic entity summarization for RDF knowledge bases. In European Semantic Web Conference, pages 145–160. Springer, 2018.

[26] Dennis Diefenbach and Andreas Thalhammer. PageRank and generic entity summarization for RDF knowledge bases. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pages 145–160, 2018.

[27] Alistair Duke, Tim Glover, and John Davies. Squirrel: An advanced semantic search and browse facility. In The Semantic Web: Research and Applications, 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, June 3-7, 2007, Proceedings, pages 341–355, 2007.

[28] Vahid Ebrahimipour and Soumaya Yacout. Ontology-based schema to support maintenance knowledge representation with a case study of a pneumatic valve. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(4):702–712, 2015.

[29] Michael Färber, Frederic Bartscherer, Carsten Menne, and Achim Rettinger. Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, 9(1):77–129, 2018.

[30] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.

[31] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[32] Sébastien Ferré. SQUALL: A controlled natural language as expressive as SPARQL 1.1. In Natural Language Processing and Information Systems - 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, Salford, UK, June 19-21, 2013. Proceedings, pages 114–125, 2013.

[33] Sébastien Ferré. Expressive and scalable query-based faceted search over SPARQL endpoints. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II, pages 438–453, 2014.

[34] André Freitas, Edward Curry, João Gabriel Oliveira, and Seán O'Riain. Querying heterogeneous datasets on the linked data web: Challenges, approaches, and trends. IEEE Internet Computing, 16(1):24–33, 2012.

[35] Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pages 1606–1611, 2007.

[36] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

[37] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, pages 855–864, 2016.

[38] Christian Halaschek-Wiener, Boanerges Aleman-Meza, Ismailcem Budak Arpinar, and Amit P. Sheth. Discovering and ranking semantic associations over a large RDF metabase.
In (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3, 2004, pages 1317–1320, 2004.

[39] Mostafa M. Hassan, Fakhri Karray, and Mohamed S. Kamel. Automatic document topic identification using Wikipedia hierarchical ontology. In 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, Montreal, QC, Canada, July 2-5, 2012, pages 237–242, 2012.

[40] Oktie Hassanzadeh and Mariano P. Consens. Linked movie data base. In Linked Data on the Web, 2009.

[41] Benjamin Heitmann and Conor Hayes. Using linked data to build open, collaborative recommender systems. In Linked Data Meets Artificial Intelligence, Papers from the 2010 AAAI Spring Symposium, Technical Report SS-10-07, Stanford, California, USA, March 22-24, 2010, 2010.

[42] Pascal Hitzler and Frank van Harmelen. A reasonable semantic web. Semantic Web, 1(1-2):39–44, 2010.

[43] Ian Horrocks, Peter F. Patel-Schneider, Harold Boley, Said Tabet, Benjamin Grosof, Mike Dean, et al. SWRL: A semantic web rule language combining OWL and RuleML. W3C Member Submission, 21:79, 2004.

[44] Yi Huang, Volker Tresp, Maximilian Nickel, Achim Rettinger, and Hans-Peter Kriegel. A scalable approach for statistical learning in semantic graphs. Semantic Web, 5(1):5–22, 2014.

[45] Ali Ismayilov, Dimitris Kontokostas, Sören Auer, Jens Lehmann, and Sebastian Hellmann. Wikidata through the eyes of DBpedia. Semantic Web, 9(4):493–503, 2018.

[46] Esther Kaufmann and Abraham Bernstein. How useful are natural language interfaces to the semantic web for casual end-users? In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 281–294, Berlin, Heidelberg, 2007. Springer-Verlag.

[47] Esther Kaufmann and Abraham Bernstein. Evaluating the usability of natural language query languages and interfaces to semantic web knowledge bases. Web Semantics: Science, Services and Agents on the World Wide Web, 8(4):377–393, 2010.

[48] Esther Kaufmann, Abraham Bernstein, and Renato Zumstein. Querix: A natural language interface to query ontologies based on clarification dialogs. In The Semantic Web - ISWC 2006 - 5th International Semantic Web Conference, Athens, Georgia, USA, pages 980–981. Springer, 2006.

[49] Evgeny Kharlamov, Nina Solomakhina, Özgür Lütfü Özçep, Dmitriy Zheleznyakov, Thomas Hubauer, Steffen Lamparter, Mikhail Roshchin, Ahmet Soylu, and Stuart Watson. How semantic technologies can enhance data access at Siemens Energy. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 601–619, 2014.

[50] N. Killen, R. House, V. Sankur, B. Thigpen, C. Chelmis, V. K. Prasanna, U. Neumann, R. Raghavendra, K. Yao, S. You, et al. Smart oilfield safety net: An integrated system for prediction of asset integrity opportunities. In SPE Intelligent Energy International Conference and Exhibition. Society of Petroleum Engineers, 2016.

[51] Angela Lausch, Andreas Schmidt, and Lutz Tischendorf. Data mining and linked open data - new perspectives for data analysis in environmental research. Ecological Modelling, 295:5–17, 2015.

[52] José Paulo Leal, Vânia Rodrigues, and Ricardo Queirós. Computing semantic relatedness using DBpedia. In 1st Symposium on Languages, Applications and Technologies, SLATE 2012, pages 133–147, 2012.

[53] Jessica Leber.
Big oil goes mining for Big Data. May 2012.

[54] Jens Lehmann and Lorenz Bühmann. AutoSPARQL: Let users query your knowledge base. In The Semantic Web: Research and Applications, pages 63–79. Springer, 2011.

[55] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.

[56] P. C. Lesslar, F. G. Van den Berg, et al. Managing data assets to improve business performance. Journal of Petroleum Technology, 50(05):54–58, 1998.

[57] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 2181–2187, 2015.

[58] Vanessa Lopez, Michele Pasin, and Enrico Motta. AquaLog: An ontology-portable question answering system for the semantic web. In European Semantic Web Conference, pages 546–562. Springer, 2005.

[59] Vanessa Lopez, Christina Unger, Philipp Cimiano, and Enrico Motta. Evaluating question answering over linked data. J. Web Sem., 21:3–13, 2013.

[60] Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, pages 73–105. 2011.

[61] Uta Lösch, Stephan Bloehdorn, and Achim Rettinger. Graph kernels for RDF data. In The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012, pages 134–148, 2012.

[62] Frank Manola, Eric Miller, Brian McBride, et al. RDF Primer. W3C Recommendation, 10(1-107):6, 2004.

[63] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, and Heiner Stuckenschmidt. Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion. In The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Monterey, CA, USA, October 8-12, 2018, Proceedings, Part I, pages 3–20, 2018.

[64] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[65] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In 27th Annual Conference on Neural Information Processing Systems - NIPS 2013, pages 3111–3119, 2013.

[66] Heiko Müller, Liliana Cabral, Ahsan Morshed, and Yanfeng Shu. From RESTful to SPARQL: A case study on generating semantic sensor data. In Proceedings of the 6th International Workshop on Semantic Sensor Networks co-located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 22, 2013, pages 51–66, 2013.

[67] Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Christina Unger, Jens Lehmann, and Daniel Gerber. Sorry, I don't speak SPARQL: Translating SPARQL queries into natural language. In Proceedings of the 22nd International Conference on World Wide Web, pages 977–988. International World Wide Web Conferences Steering Committee, 2013.

[68] Phuong T. Nguyen, Paolo Tomeo, Tommaso Di Noia, and Eugenio Di Sciascio. Content-based recommendations via DBpedia and Freebase: A case study in the music domain.
In The Semantic Web - ISWC 2015, 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 605–621, 2015.

[69] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.

[70] Xiaomin Ning, Hai Jin, Weijia Jia, and Pingpeng Yuan. Practical and effective IR-style keyword search over semantic web. Inf. Process. Manage., 45(2):263–271, 2009.

[71] Ulf Noyer, Dirk Beckmann, and Frank Köster. Semantic technologies for describing measurement data in databases. In The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29 - June 2, 2011, Proceedings, Part II, pages 198–211, 2011.

[72] Eyal Oren, Renaud Delbru, Sebastian Gerke, Armin Haller, and Stefan Decker. ActiveRDF: Object-oriented semantic web programming. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 817–824, 2007.

[73] Eyal Oren, Benjamin Heitmann, and Stefan Decker. ActiveRDF: Embedding semantic web data into object-oriented languages. J. Web Sem., 6(3):191–202, 2008.

[74] Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, and Pedro A. Szekely. Efficient graph-based document similarity. In The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings, pages 334–349, 2016.

[75] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pages 1532–1543, 2014.

[76] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pages 701–710, 2014.

[77] Guangyuan Piao and John G. Breslin. Transfer learning for item recommendations and knowledge graph completion in item related domains via a co-factorization model. In Aldo Gangemi, Roberto Navigli, Maria-Esther Vidal, Pascal Hitzler, Raphaël Troncy, Laura Hollink, Anna Tordai, and Mehwish Alam, editors, The Semantic Web, pages 496–511. Springer International Publishing, 2018.

[78] Camille Pradel, Ollivier Haemmerlé, and Nathalie Hernandez. Natural language query interpretation into SPARQL using patterns. In Proceedings of the Fourth International Conference on Consuming Linked Data - Volume 1034, pages 13–24. CEUR-WS.org, 2013.

[79] Matthias Quasthoff and Christoph Meinel. Supporting object-oriented programming of semantic-web software. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(1):15–24, 2012.

[80] Achim Rettinger, Uta Lösch, Volker Tresp, Claudia d'Amato, and Nicola Fanizzi. Mining the Semantic Web. Data Mining and Knowledge Discovery, 24(3):613–662, 2012.

[81] Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, and Heiko Paulheim. Large-scale taxonomy induction using entity and word embeddings. In Proceedings of the International Conference on Web Intelligence, pages 81–87. ACM, 2017.

[82] Petar Ristoski and Heiko Paulheim. RDF2Vec: RDF graph embeddings for data mining. In The Semantic Web - ISWC 2016, pages 498–514, 2016.

[83] M. Andrea Rodríguez and Max J. Egenhofer. Determining semantic similarity among entity classes from different ontologies. IEEE Trans.
Knowl. Data Eng., 15(2):442–456, 2003.

[84] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. Thou shalt ASQFor and shalt receive the semantic answer. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 4264–4265, 2016.

[85] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. Automatic integration and querying of semantic rich heterogeneous data: Laying the foundations for semantic web of things. In Managing the Web of Things: Linking the Real World to the Web, pages 251–273. 2017.

[86] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. ASQFor: Automatic SPARQL Query Formulation for the non-expert. AI Commun., 31(1):19–32, 2018.

[87] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. Not all embeddings are created equal: Extracting entity-specific substructures for RDF graph embedding. arXiv preprint arXiv:1804.05184, 2018.

[88] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. Smart Oilfield SafetyNet - An Intelligent System for Integrated Asset Integrity Management. In SPE Annual Technical Conference and Exhibition (ATCE), September 2018.

[89] Muhammad Rizwan Saeed, Charalampos Chelmis, and Viktor K. Prasanna. Extracting entity-specific substructures for RDF graph embedding. Semantic Web Journal, Special Issue on Knowledge Graphs: Construction, Management and Querying, 2019.

[90] Muhammad Rizwan Saeed, Charalampos Chelmis, Viktor K. Prasanna, Robert House, Jacques Blouin, and Brian Thigpen. Semantic web technologies for external corrosion detection in smart oil fields. In SPE Western Regional Meeting, April 27-30, 2015, Garden Grove, California, USA, 2015.

[91] Muhammad Rizwan Saeed and Viktor K. Prasanna. Extracting entity-specific substructures for RDF graph embedding. In 9th IEEE International Conference on Information Reuse and Integration (IRI), Salt Lake City, Utah, USA. IEEE, 2018.

[92] Malte Sander, Ulli Waltinger, Mikhail Roshchin, and Thomas Runkler. Ontology-based translation of natural language queries to SPARQL. In AAAI Fall Symposium Series 2014, Arlington, Virginia, USA, November 13-15, 2014, 2014.

[93] François Scharffe, Yanbin Liu, and Chuguang Zhou. RDF-AI: An architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US), 2009.

[94] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, pages 488–495, 2009.

[95] Amit Sheth, I. Budak Arpinar, and Vipul Kashyap. Relationships at the heart of semantic web: Modeling, discovering, and exploiting complex semantic relationships. In Enhancing the Power of the Internet, pages 63–94. Springer, 2004.

[96] Amit P. Sheth, Cartic Ramakrishnan, and Christopher Thomas. Semantics for the semantic web: The implicit, the formal and the powerful. Int. J. Semantic Web Inf. Syst., 1(1):1–18, 2005.

[97] Baoxu Shi and Tim Weninger. ProjE: Embedding projection for knowledge graph completion. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1236–1242, 2017.
[98] Prashant Shiralkar, Alessandro Flammini, Filippo Menczer, and Giovanni Luca Ciampaglia. Finding streams in knowledge graphs to support fact checking. In IEEE International Conference on Data Mining (ICDM), pages 859–864. IEEE, 2017.

[99] Pavel Shvaiko and Jérôme Euzenat. Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng., 25(1):158–176, 2013.

[100] Martin G. Skjæveland, Espen H. Lian, and Ian Horrocks. Publishing the Norwegian Petroleum Directorate's FactPages as semantic web data. In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II, pages 162–177, 2013.

[101] Jennifer Sleeman, Tim Finin, and Anupam Joshi. Topic modeling for RDF graphs. In Proceedings of the Third International Workshop on Linked Data for Information Extraction (LD4IE 2015), pages 48–62, 2015.

[102] Ramakrishna Soma, Amol Bakshi, Viktor K. Prasanna, William J. DaSie, Birlie Colbert Bourgeois, et al. Semantic web technologies for smart oil field applications. In Intelligent Energy Conference and Exhibition. Society of Petroleum Engineers, 2008.

[103] Thomas Steiner, Raphael Troncy, and Michael Hausenblas. How Google is using linked data today and vision for tomorrow. In Proceedings of Linked Data in the Future Internet at the Future Internet Assembly (FIA 2010), Ghent, December 2010, 2010.

[104] Rob Stewart. A demonstration of a natural language query interface to an event-based semantic web triplestore. The Semantic Web: ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, 8798:343, 2014.

[105] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB, 4(11):992–1003, 2011.

[106] Pedro A. Szekely, Craig A. Knoblock, Fengyu Yang, Xuming Zhu, Eleanor E. Fink, Rachel Allen, and Georgina Goodlander. Connecting the Smithsonian American Art Museum to the linked data cloud. In The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, pages 593–607, 2013.

[107] Andreas Thalhammer and Achim Rettinger. PageRank on Wikipedia: Towards general importance scores for entities. In The Semantic Web: ESWC 2016 Satellite Events, pages 227–240. Springer International Publishing, Cham, October 2016.

[108] Thanh Tran, Haofen Wang, Sebastian Rudolph, and Philipp Cimiano. Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 - April 2, 2009, Shanghai, China, pages 405–416, 2009.

[109] Peter D. Turney. Expressing implicit semantic relations without supervision. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, 2006.

[110] Yannis Tzitzikas, Christina Lantzaki, and Dimitris Zeginis. Blank node matching and RDF/S comparison functions. In The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, pages 591–607, 2012.

[111] Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano.
Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648. ACM, 2012.

[112] Ulli Waltinger, Dan Tecuci, Mihaela Olteanu, Vlad Mocanu, and Sean Sullivan. USI Answers: Natural language question answering over (semi-)structured industry data. In Proceedings of the Twenty-Fifth Innovative Applications of Artificial Intelligence Conference, IAAI 2013, July 14-18, 2013, Bellevue, Washington, USA, 2013.

[113] Chong Wang, Miao Xiong, Qi Zhou, and Yong Yu. PANTO: A portable natural language interface to ontologies. In The Semantic Web: Research and Applications, pages 473–487. Springer, 2007.

[114] Ping Wang, Jinguang Zheng, Linyun Fu, Evan W. Patton, Timothy Lebo, Li Ding, Qing Liu, Joanne S. Luciano, and Deborah L. McGuinness. A semantic portal for next generation monitoring systems. In The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part II, pages 253–268, 2011.

[115] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 1112–1119, 2014.

[116] Charles Wankel and Agata Stachowicz-Stanusch. Emerging Web 3.0/Semantic Web Applications in Higher Education: Growing Personalization and Wider Interconnections in Learning. Information Age Publishing, Incorporated, 2015.

[117] William E. Winkler. The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau. Citeseer, 1999.

[118] Pawel Winter. Steiner problem in networks: A survey. Networks, 17(2):129–167, 1987.

[119] Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, and Gerhard Weikum. Natural language questions for the web of data. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 379–390. Association for Computational Linguistics, 2012.

[120] Pinar Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 10-13, 2015, pages 1365–1374, 2015.

[121] Liyang Yu. Linked Open Data, pages 409–466. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.

[122] Wen Zhang, Bibek Paudel, Wei Zhang, Abraham Bernstein, and Huajun Chen. Interaction embeddings for prediction and explanation in knowledge graphs. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, pages 96–104, 2019.

[123] Xiangling Zhang, Yueguo Chen, Jun Chen, Xiaoyong Du, Ke Wang, and Ji-Rong Wen. Entity set expansion via knowledge graphs. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pages 1101–1104, 2017.

[124] Yinuo Zhang, Anand V. Panangadan, and Viktor K. Prasanna. UFOM: Unified fuzzy ontology matching. In Proceedings of the 15th IEEE International Conference on Information Reuse and Integration, IRI 2014, Redwood City, CA, USA, August 13-15, 2014, pages 787–794, 2014.

[125] Yinuo Zhang, Anand V. Panangadan, and Viktor K. Prasanna. UFOMQ: An algorithm for querying for similar individuals in heterogeneous ontologies.
In Big Data Analytics and Knowledge Discovery - 17th International Conference, DaWaK 2015, Valencia, Spain, September 1-4, 2015, Proceedings, pages 178–189, 2015. [126] Weiguo Zheng, Lei Zou, Xiang Lian, Jeffrey Xu Yu, Shaoxu Song, and Dongyan Zhao. How to build templates for RDF question/answering: An uncertain graph similarity join approach. In Proceedings of the 2015 ACM SIGMOD International Conference on Man- agement of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1809–1824, 2015. [127] Qunzhi Zhou, Sreedhar Natarajan, Yogesh Simmhan, and Viktor K. Prasanna. Seman- tic information modeling for emerging applications in smart grid. In Ninth International Conference on Information Technology: New Generations, ITNG 2012, Las Vegas, Nevada, USA, 16-18 April, 2012, pages 775–782, 2012. [128] Linhong Zhu, Majid Ghasemi-Gol, Pedro Szekely, Aram Galstyan, and Craig A Knoblock. Unsupervised entity resolution on multi-type graphs. In International Semantic Web Con- ference, pages 649–667. Springer, 2016. 123
Abstract
The combination of data, semantics, and the Web has led to an ever-growing and increasingly complex body of semantic data. As the volume of semantic data increases, the question of how users can access and utilize this data becomes of crucial importance. Accessing semantic data requires familiarity with Semantic Web standards, such as RDF and SPARQL, whereas analyzing this data requires an understanding of Machine Learning (ML) concepts. The ideal system would allow non-expert users (we define a non-expert as someone who is a domain expert in their area of interest but is not familiar with concepts in database systems, Machine Learning, or the Semantic Web) to benefit from the expressive power of the Semantic Web, while at the same time hiding the complexity of its various standards and technologies behind an intuitive and easy-to-use mechanism.

To date, many frameworks for querying semantic data have been developed. However, such frameworks usually rely on predefined templates or require expensive customization. Furthermore, to avoid overwhelming end-users with the task of evaluating query results, such results should be ranked by relevance or importance. In turn, the ability to rank requires mining implicit relationships from semantic data. Implicit relationships include, but are not limited to, similarity between entities, correlations among them, and entity classifications. ML techniques for discovering implicit relationships require transforming semantic data into a low-dimensional vector space, and projecting highly expressive semantic data into such a space comes at the price of lost semantics.

The primary goal of the work presented in this thesis is to enable querying and analytics over semantic data for the non-expert. We present ASQFor (Automatic SPARQL Query Formulation), a framework that accepts simplified user input and returns the results of complex querying and analysis. User-provided keywords are matched to ontological concepts in the repository to formulate formal queries automatically. The approach enables online querying without offline computation or predefined query templates. To enable analytics on semantic data, we propose Specificity, a new relevance metric that enables the semantics-preserving transformation of semantic data. We show that specificity-based biased random walks extract more relevant representations of entities in the semantic data, resulting in improved preservation of semantics even after transformation into a lower-dimensional vector space. To demonstrate the capabilities enabled by the discovery of implicit relationships in semantic data, we use the generated feature vector representations to provide similarity-based queries, in which users retrieve entities similar to a given search key.

We evaluate our approaches for query formulation and analytics on real-world datasets, including DBpedia, movie datasets, and asset integrity data from the oilfield. For performance evaluation of the querying framework, we consider both query formulation time and execution time over large-scale data. For evaluation of the specificity-based approach, we compare the precision and recall of our approach with state-of-the-art methods. We highlight the applicability of our approach by presenting the Smart Oil Field Safety Net (SOSNet), a data-driven system for Asset Integrity Management.
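To make the two core ideas above concrete, here is a minimal sketch of keyword-driven SPARQL formulation. It is not the ASQFor implementation itself: the keyword-to-IRI lexicon, the DBpedia-style terms, and the single-subject query shape are all simplifying assumptions made for illustration.

```python
# A minimal sketch of keyword-driven SPARQL formulation -- NOT the ASQFor
# implementation. The keyword-to-IRI lexicon below is hypothetical and uses
# DBpedia-style terms purely for illustration.

LEXICON = {
    "film": ("class", "<http://dbpedia.org/ontology/Film>"),
    "director": ("property", "<http://dbpedia.org/ontology/director>"),
}

def formulate_query(keywords):
    """Match keywords to ontology terms and assemble a SPARQL SELECT query."""
    variables, patterns = ["?s"], []
    for kw in keywords:
        kind, iri = LEXICON[kw.lower()]
        if kind == "class":
            patterns.append(f"?s a {iri} .")      # rdf:type constraint
        else:
            var = f"?{kw.lower()}"
            variables.append(var)
            patterns.append(f"?s {iri} {var} .")  # property constraint
    return f"SELECT {' '.join(variables)} WHERE {{ {' '.join(patterns)} }}"

print(formulate_query(["film", "director"]))
# SELECT ?s ?director WHERE { ?s a <http://dbpedia.org/ontology/Film> .
#   ?s <http://dbpedia.org/ontology/director> ?director . }
```

Likewise, specificity-biased random walks can be read as ordinary graph walks whose edge choices are weighted by a per-predicate relevance score. The sketch below assumes a toy graph and invented specificity scores; it is not the metric proposed in the thesis, only an illustration of how such a score would bias a walk.

```python
# A toy sketch of a biased random walk over an RDF-style graph, where the
# next edge is chosen in proportion to a per-predicate "specificity" score.
# The graph and scores below are invented for illustration only.
import random

GRAPH = {  # entity -> list of (predicate, neighbor) edges
    "Film_A": [("director", "Person_X"), ("country", "USA")],
    "Person_X": [("birthPlace", "City_Y")],
}

SPECIFICITY = {"director": 0.9, "birthPlace": 0.7, "country": 0.2}

def biased_walk(start, steps, rng=random.Random(42)):
    """Record a walk as an alternating entity/predicate sequence."""
    walk, node = [start], start
    for _ in range(steps):
        edges = GRAPH.get(node, [])
        if not edges:
            break  # dead end: no outgoing edges
        weights = [SPECIFICITY[pred] for pred, _ in edges]
        pred, node = rng.choices(edges, weights=weights, k=1)[0]
        walk += [pred, node]
    return walk

print(biased_walk("Film_A", 3))
# e.g. ['Film_A', 'director', 'Person_X', 'birthPlace', 'City_Y']
```

In a full pipeline, many such walk "sentences" per entity would typically be fed to a skip-gram-style embedding model to obtain the feature vectors used for similarity-based queries.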
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
From matching to querying: A unified framework for ontology integration
Learning the semantics of structured data sources
Applying semantic web technologies for information management in domains with semi-structured data
Exploiting web tables and knowledge graphs for creating semantic descriptions of data sources
Data-driven methods for increasing real-time observability in smart distribution grids
Scalable data integration under constraints
Transforming unstructured historical and geographic data into spatio-temporal knowledge graphs
Modeling and recognition of events from temporal sensor data for energy applications
Prediction models for dynamic decision making in smart grid
Adaptive and resilient stream processing on cloud infrastructure
Model-driven situational awareness in large-scale, complex systems
Dynamic graph analytics for cyber systems security applications
Data and computation redundancy in stream processing applications for improved fault resiliency and real-time performance
Asset Metadata
Creator
Saeed, Muhammad Rizwan (author)
Core Title
Discovering and querying implicit relationships in semantic data
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/28/2019
Defense Date
05/28/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
automatic query formulation, data mining, DBpedia, graph embedding, linked open data, OAI-PMH Harvest, ontologies, RDF, Semantic Web, SPARQL
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Prasanna, Viktor K. (committee chair), Ershaghi, Iraj (committee member), Raghavendra, Cauligi (committee member)
Creator Email
mrizwansaeed@gmail.com, saeedm@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-228926
Unique identifier
UC11675614
Identifier
etd-SaeedMuham-7886.pdf (filename), usctheses-c89-228926 (legacy record id)
Legacy Identifier
etd-SaeedMuham-7886.pdf
Dmrecord
228926
Document Type
Dissertation
Rights
Saeed, Muhammad Rizwan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA