APPLYING SEMANTIC WEB TECHNOLOGIES FOR INFORMATION
MANAGEMENT IN DOMAINS WITH SEMI-STRUCTURED DATA
by
Ramakrishna Soma
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2008
Copyright 2008 Ramakrishna Soma
Dedication
This dissertation is dedicated to my family for their unconditional love and support.
Acknowledgements
First of all I would like to thank my advisor Prof. Prasanna for all the support, advice and feedback
over the years. I have learnt a lot, both technically and non-technically, during my three years
with his research group. Many thanks to Amol Bakshi for his work on the IAM project. I have
benefited greatly from working closely with him, especially during the first half of my thesis. On
top of this, he has also been a great office mate and friend. I am sure that without him, life would
have been really hard for all of us in the project.
I would like to thank the other members of the IAM project for sharing a wealth of ideas
directly or indirectly related to my thesis. Leading this list are our collaborators from Chevron,
especially Will Da Sie. His input, guidance and energy in the project have been vital to our achievements
in the last few years. The excellent team from Avanade (Kanwal, Angela, Steve and others)
have put in a great deal of thought and work on the IAM project, which has greatly benefited this
thesis.
I would like to thank all my fellow IAM project members at USC (Cong Zhang, Tao Zhu,
Fan Sun, Jing Zhao and QZ) for their feedback and discussions. My heartfelt gratitude goes
to the wonderful administrative staff, Aimee, Janice and Estella, for their assistance and patience
throughout my stay here. Thanks also to the other p-group members and alumni for sharing
interesting ideas and perspectives. It has also been a lot of fun to have them around and I will
miss them dearly.
Special thanks to my other mentors and collaborators at USC, Kihwan Choi and Nikunj Mehta,
for their guidance and collaboration during the early part of my stint at USC. Thanks also to
Prof. Massoud Pedram, Prof. Neno Medvedovic and Prof. Gaurav Sukhatme for the opportunity
to work for them during my masters. The work I did with them motivated and prepared me
immensely for the PhD.
On the personal front, I am greatly indebted to my family for their love and support. I have
stretched them emotionally and financially on more than one occasion, and each time they have
been ever more patient and compassionate. Thank you.
Finally! My friends have made it seem like it has not really been very long since I started this
program. Firstly, thanks to all my friends at USC- Karthik, Animesh, Shyam, MV , Guddu, Salim,
Nikhil, King, Appoo, Rishi, Ramya(s), KK, Preeti and so many others along the way. Karthik is
the first among equals, for everything he has done throughout my stay at USC- from getting me
my first on-campus job to driving around the country with me to introducing most of the above
friends to me. Fight on y’all! My best friends Kaki and Saamu have always been there for me and
shared so many of their experiences with me over the last few years. This has helped me sometimes
feel like I have a life, even if a vicarious one! Last but not least, thanks to Prasad and my
meditation buddies for the silence.
Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
1.1 Motivation
1.1.1 Data Overload
1.1.2 What is Semi-structured Data?
1.1.3 Characteristics of Target Domains
1.2 Information Management Framework
1.2.1 Requirements
1.2.2 Applying Semantic Web Technologies
1.2.3 Research Challenges
1.3 Contributions and Organization of the Thesis

Chapter 2: Background
2.1 Ontologies and Knowledge Bases
2.2 Semantic Web Technologies
2.2.1 RDF
2.2.2 OWL
2.2.3 SPARQL
2.3 Metadata
2.4 State-of-the-art in semantic web applications

Chapter 3: Development Methodology for Semantic Web Applications
3.1 Methodology
3.2 Change Management
3.3 Preliminaries
3.3.1 OWL
3.3.2 SPARQL
3.4 Overview
3.5 Extension of SPARQL Queries
3.5.1 Triple Patterns
3.5.2 Compound Graph Patterns
3.6 Semantics of Change
3.7 Matching
3.8 Implementation
3.9 Evaluation
3.10 Related Work
3.11 Discussion and Conclusions
3.12 List of Changes made to OWL Ontology

Chapter 4: Case Study: Information Management Applications for Oil Field Operations
4.1 Integrated Asset Management
4.2 Ontology Design
4.3 Metadata Catalog
4.3.1 Workflow and Components
4.3.2 Implementation
4.3.2.1 Metadata Extractors
4.3.2.2 OWL Knowledge base
4.3.2.3 Tool 1: Metadata Catalog Browser
4.3.2.4 Tool 2: OOIP Comparison utility
4.3.3 Performance
4.3.3.1 Performance of Inferencing
4.3.3.2 Performance of SPARQL queries
4.3.4 From research prototype to deployment: Challenges, lessons learnt
4.4 Virtual data integration using semantic lookup and services
4.4.1 Motivating Example
4.4.2 Approach
4.4.3 Modeling
4.4.4 Automatic service discovery
4.4.5 Implementation
4.4.6 Service Deployment
4.5 Related Work
4.5.1 Metadata Catalog
4.5.2 VDIF

Chapter 5: Parallel Inferencing for OWL Knowledge Bases
5.1 Rule Classes and OWL-Horst
5.2 Performance Model for Reasoning
5.3 Partitioning
5.3.1 Data Partitioning
5.3.1.1 Graph partitioning
5.3.1.2 Hash based partitioning
5.3.1.3 Domain specific partitioning
5.3.2 Rule-Base Partitioning Approach
5.3.3 Metrics
5.4 Parallel Algorithm
5.4.1 Correctness
5.4.2 Optimization 1: Incremental addition algorithm
5.4.3 Optimization 2: T-Box exclusion
5.5 Implementation
5.6 Experimental Results
5.6.1 Data Partitioning
5.6.1.1 Networked Cluster
5.6.1.2 Multi-core
5.6.2 I/O overhead and ideal speed-up
5.6.3 Comparison of data partitioning algorithms
5.6.4 Rule-base Partitioning
5.6.5 Incremental addition
5.6.6 Serial processing
5.7 Related Work
5.8 Conclusions

Chapter 6: Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
6.2.1 Semantic Middleware: A vision
6.2.2 Methodology and Change Management
6.2.3 Parallel Inferencing for OWL

Reference List
List Of Tables

2.1 A comparison of the features supported by OWL and other competitors
2.2 Related work
3.1 NEXT
3.2 Changes to OWL ontologies and their semantics
3.3 Dirty query detection results for three algorithms
5.1 Average partition size for LUBM and UOBM
5.2 Partitioning metrics for the LUBM data-set
List Of Figures

1.1 Activities of a typical engineer in the petroleum engineering domain
1.2 Structured, Semi-structured and Unstructured data
2.1 The Semantic-web layer cake
2.2 Data and Metadata types [128]
3.1 The change-management problem
3.2 An ontology change scenario
3.3 Support for incremental development in the Protégé environment
4.1 Ontology design in IAM
4.2 The domain ontology
4.3 The metadata ontology
4.4 Metadata Catalog workflow
4.5 Screenshot of the search utility in IAM
4.6 Comparing OOIP estimates across models
4.7 OWL-DL reasoning
4.8 Graph showing the execution time for queries from the metadata catalog implementation
4.9 Graph showing the trend of query execution time for datasets of different sizes
4.10 Graph showing the breakup of development time spent on various components of the system
4.11 Flow-chart of an example aggregation workflow
4.12 Code snippet for the example aggregation program
4.13 Major elements of our framework and their interrelationships
5.1 Regressing a performance model from observed reasoning times for the LUBM and UOBM data-sets
5.2 Illustration of the partitioning algorithm
5.3 Speedup for the LUBM-10 and UOBM benchmarks on different numbers of processors
5.4 Speedup for the LUBM-10 and UOBM benchmarks on different numbers of processors
5.5 Speedup for the LUBM-5 benchmark on a multi-core processor
5.6 Overhead of various sub-tasks of parallel processing for LUBM-10
5.7 Speedup for the LUBM-10 benchmark on different numbers of processors
5.8 Speedups for the LUBM dataset of different sizes
5.9 Comparison of performance of the two data-partitioning algorithms for LUBM-10
5.10 Speedup for the different benchmarks for rule-base partitioning
5.11 Rule dependency graph for MDC
5.12 Time spent per iteration on each partition
5.13 Time vs. Memory trade-off for the LUBM benchmark
6.1 Semantic middleware
Abstract
The two main goals of this thesis are to demonstrate applications of semantic web technologies
for domains with semi-structured data and to propose scalable techniques for building such applications.
To this end we first describe an adaptation of the agile methodology for building large-scale
semantic web applications. When applying this methodology the ontology is constantly
modified. As a result, other artifacts that depend on it, including queries, messages and application
code, also need to be modified to keep them consistent with the new ontology. We propose
a novel technique that detects the SPARQL queries that need to be modified due to changes to
an OWL ontology. We present an implementation of our technique as an extension to a popular
ontology development tool, which makes it a convenient environment for the ontology engineer
in our methodology.
We then present a case-study that demonstrates applications of semantic web technology in
the oil-field application domain. We highlight two components that enable various information
management applications. The metadata catalog stores provenance information, access informa-
tion and key pieces of data that are extracted from (semi-structured) documents. We present the
lessons learnt, best practices and some empirical analysis obtained from a significant effort in im-
plementing this component. The second component we present is a semantic lookup component.
This component is used to simplify the development of services that aggregate and transform
information from various services.
Finally we propose a parallelization approach to improve the performance of the OWL inferencing
process, a key bottleneck in semantic web applications. We propose two approaches to
efficiently partition the computational workload of the inferencing. A generic parallel algorithm
is used for inferencing on the partitions created by these approaches. Experimental results obtained
from an implementation of our algorithm show significant speedups for two popular benchmarks
and our own data set, which makes this a promising approach for scaling OWL reasoning.
Chapter 1
Introduction
Water water everywhere and not a single drop to drink.
A little semantics goes a long way. - Jim Hendler, 2003
1.1 Motivation
1.1.1 Data Overload
Enterprises today are faced with the data overload problem: the management and effective
use of the large amounts of digitized data already in their data stores and of the new data being produced.
The unprecedented data volumes and growth are illustrated by the following statistics [29]:
By 2007 the size of databases at many organizations reached up to hundreds and in
some cases thousands of terabytes. For example, in 2004 AT&T had 11 exabytes
(10^7 TB) of wireline, wireless and Internet data. Wal-Mart has 500 terabytes of
transactional data and is adding 10^7 transactions per day. On average, the size of
transactional databases doubles every five years with core databases doubling every
two years. Data reporting and analysis warehouses (OLAP stores) triple in size every
three years...
Although more data is available, information, which refers to the right piece of data in the
right form and at the right time for decision making, is not always easily available. It is now a
well-recognized fact that a typical knowledge worker spends a significant amount of time searching
for the information required for doing the “real work” [110, 29]. For example, Fig. 1.1 shows
that a typical petroleum engineer spends about 60% of the time just searching for information.
Figure 1.1: Activities of a typical engineer in the petroleum engineering domain.
We refer to this problem of handling large amounts of information and making it easily
available to the decision maker as the information management problem. The following are some
of the complexities that make it hard:
• The data resides in different data sources, e.g., different databases, data warehouses, or data
produced by services. The data management systems are created and managed by different
departments, leading to heterogeneities in terms of access mechanisms and the structure
and semantics of the data. Obtaining a single view of the data in such a setting is complex [48, 82].
• The data in these data sources often has to be cleansed and integrated before it is usable in
decision making.
• This problem is especially complicated in domains where the data is stored in legacy data
sources or file systems with proprietary formats. In such settings, just accessing a required
piece of data could be a challenge.
• Many times not only the data but also the metadata for it, e.g., who created the data, is
important for decision making, but this metadata is seldom stored.
1.1.2 What is Semi-structured Data?
Data is often classified as structured, semi-structured or unstructured, depending on whether
and how it is structured and on the access mechanisms. Structured data refers to data stored
in systems which enforce well-defined structures (schemas) and which are accessed through standard
query languages. A commonly quoted example of structured data is the data stored in a relational
database (RDBMS). The relational data model forces all the data to be organized according to the
schema and makes it accessible through a specialized query language (SQL), which is generally
implemented by a highly optimized query engine. Unstructured data, on the other hand, does
not have any schema associated with it and is mostly in the form of text in a natural language.
Extracting meaningful information from unstructured data is complex and generally involves
using sophisticated natural language processing techniques. Semi-structured data lies somewhere
in the middle of this spectrum. The structure or schema of the data is typically known but may
not always be enforced. The data is seldom accessible through optimized query mechanisms,
thus making it harder or less efficient to query and retrieve. Examples of semi-structured data
include data stored in text files with specific formats (e.g., CSV files), HTML documents with defined
structures, etc. This is illustrated in figure 1.2.
Figure 1.2: Structured, Semi-structured and Unstructured data.
The information management problem for domains where data is mostly stored in structured
data sources is well understood and addressed [74, 122, 7, 52]. However, the same cannot be said
for unstructured data sources [64]. This has motivated us to study this problem.
1.1.3 Characteristics of Target Domains
In summary, some typical characteristics of the domains to which we believe our work will be
applicable are as follows:
• Pervasiveness of data silos: Data silos refer to islands of applications that were not de-
signed to interoperate or integrate [8]. Decision making typically requires the information
from these systems to be integrated and interoperated.
• Complex semi-structured data: The data in the system is stored primarily as semi-structured
data. Moreover, the data could be quite complex, meaning that a single data
object contains many pieces of information. As an example from the oil-field application
domain, a simulation model contains various pieces of useful information, including production
values of the reservoir at different timesteps, assumptions about the reservoir, etc. It is often
stored as an ASCII file with a specific grammar. The information management system needs
to support extracting and retrieving the data embedded within the objects for use in analysis
modules. Similar characteristics have been seen and addressed in [14, 98].
• Multiple organizations and classes of users: Multiple classes of users and stakeholders,
with different specializations and roles across departmental boundaries are involved in var-
ious stages of production and consumption of data. Each class of stakeholders shares data
with the other stakeholders and performs certain analysis on the data products. This is often
not conveyed across the departmental boundaries.
• Large amounts of data: As mentioned above, we focus on enterprises which produce
and consume large amounts of data.
• Real time data and operations: The information management system needs to support the
user in making decisions in real time. Data might be constantly produced from transactions
or sensor readings, and it needs to be processed and decisions made in a timely manner.
Many enterprises and other problem domains like oil-field operations [150, 137], transporta-
tion [160] and e-Science [68] have been shown to exhibit some or all of these characteristics.
1.2 Information Management Framework
In this section we present some high level requirements for a generic information management
framework for domains characterized above. We also present the motivation for and some key
research challenges to be addressed in using semantic web technologies to create such a frame-
work.
1.2.1 Requirements
We envisage an information management framework that addresses the data overload problem in
domains with semi-structured data by delivering the following broad capabilities:
• Efficient Access to Information: The users must be able to access any piece of data or
information easily. Search applications have been commonly used as a means to access
specific data objects in large information systems. A typical information management sys-
tem with a search interface must have: an index of the data in the system, tools to build the
index from the underlying data, a user interface for issuing queries and a query resolution
mechanism [28].
• Consistent and integrated view of Information: A common problem observed when
different classes of users and tools are present in the system is that the same information is
represented differently. A typical example is observed when different user classes in an oil-field
domain use different naming conventions for the same entities, different units of measurement
for quantities, etc. An information management system should present all the information
in a consistent view. The system should also be capable of integrating information from
different sources.
• Support for provenance, audit trails and quality indicators: Audit trails and reproducibility
of results are important when evaluating or revisiting a decision; it is oftentimes
important to understand the reasons for the decision, the data used and the context under which
it was made. Hence, a very desirable feature of the information management
system is to enable audit trails of data objects.
• Non-disruptive: The new information management system must integrate seamlessly with
current systems and processes. It should be possible for an engineer to continue working
using existing tools and processes, albeit inefficiently. We believe that being non-disruptive
is an important requirement for successful adoption of such a framework.
• Extensibility: The system must be easily extensible, being able to integrate new kinds
of information, data sources, (legacy) applications, analysis modules and workflows (which
use these applications and analysis modules).
• Scalability: The system must be able to handle the very large scale of the data to be managed.
• Usability: It is important to note that the ultimate purpose of an information management
system is to assist the decision maker in the rapid execution of day-to-day operations rather
than to act as a system for intelligent decision making [30]. In fact, our experience has shown
that “soft” factors such as the design of user interfaces, novel visualization tools, guided
workflow wizards, etc., are as important from the end users’ perspective as the underlying
technological solutions.
Addressing this problem is critical to the success of an enterprise [29]. Consequently, our work
is by no means the first to address it. A variety of technologies and approaches like Enterprise
Application Integration (EAI), data warehousing, etc. have been used to address it in the past
(mostly for structured data sources). However, there is no consensus on the perfect approach to
solve the problem, and each technological choice brings its own set of opportunities and risks.
This motivates us to explore an alternative approach to the problem.
1.2.2 Applying Semantic Web Technologies
Semantic web technologies refer to a set of standards proposed by the World Wide Web Consortium
(W3C) to represent the data on the web in a richer format to enable better search applications
and ultimately autonomous agents. There has been much interest and thrust in recent times in
applying these technologies to enterprise data management, as seen by the tool support announced
by major industry vendors like Oracle [104] and IBM [72, 73], as well as work from the research
and industry communities in solving real-life problems [77]. The motivation for applying the
semantic-web standards to the information management problem in the enterprise setting comes from the
following observations:
1. The RDF data-model of representing data as graphs is highly expressive and hence ideal [31,
49].
2. The standards provide a query language, SPARQL, to retrieve data from an RDF datastore or
from data that is virtualized as RDF at run-time.
3. The expressiveness of OWL provides the ability to create rich data-models that can be used
to define ontologies or schemas for the integrated data.
4. OWL ontologies can form the basis for inferring additional data from the presented
data. This enables the user to encode business logic without any code, thus saving
development and maintenance costs. The semantic-web standards also include a rule
language specification that can further encode such business logic.
1.2.3 Research Challenges
Although semantic web applications for information management have been proposed in the literature,
building such applications is still largely a one-off exercise. Many challenges have to be
addressed, and techniques for bringing rigor, reproducibility and reusability are required to elevate
this field to the next level of technological maturity. Some of the challenges that need to be
addressed to do so can be classified into the following categories (a similar categorization of
challenges in the EAI realm has been presented in [8]):
1. Methodological: This challenge pertains to the development methodologies used for large
scale information management frameworks as well as applications based on semantic web
technologies in general. Although various methodologies have been suggested in the literature
for more traditional systems, very little work has been reported from the semantic web
community on the applicability and/or adaptation of existing methodologies or otherwise
on alternate methodologies.
2. Architectural: The software architecture of a system is the overall organization of its major
components and their interaction patterns [127]. It is widely believed that studying and
cataloging the architectures, architectural styles and patterns of software systems is an important
way of codifying software development knowledge. Very little work has been done in
the area of software architectures for information management frameworks using semantic
web technologies, especially in enterprise settings, and thus this is an important area of research.
3. Implementation techniques: This challenge refers to the techniques used to design system
components and the strategies to implement them. Most mature areas of work (e.g., object-oriented
programming) have distilled a set of best practices and design patterns [53], which
provide a wealth of information to the practitioner during the implementation of the system.
Similar studies in the area of applying semantic web technologies, including ontology design
patterns, programming interfaces to semantic data stores, automatic code generation,
etc., are an interesting area of work.
4. Tools and Infrastructure: The success and pervasiveness of relational database technology
is widely attributed to the excellent set of tools and supporting infrastructure in the market.
The semantic web area is not close to the level of maturity of RDBMS technology. Tools
that perform scalable reasoning, storage and querying of large datasets are the need of the
hour, and this is currently an area of active work in both industry and academia.
5. Business and organizational: Last but not least, for a technology to be a commercial
success and to obtain large-scale applicability, it is important that it adds value to the
business domain to which it is applied. Moreover, real-life solutions cannot address only
the technical aspects of the problem, but must pay close attention to the organizational processes
and the impact on existing practices. A better understanding of some of
these issues is required as the technology tries to transition into the mainstream.
Clearly a single thesis cannot solve all of these challenges, and ours makes a humble contribution
toward addressing these hard issues. Our work focuses primarily on the methodological and
the tools and infrastructure challenges mentioned above. However, in our case study for a real-life problem
setting in an oil-field domain we also address the other challenges.
1.3 Contributions and Organization of the Thesis
Our thesis demonstrates the usefulness of semantic web technology for information management
applications and addresses some challenges in building large-scale applications using it. First we
propose a methodology for building large-scale applications using semantic web technologies,
then we demonstrate applications of it for a real problem domain, and finally we propose
techniques to address an observed scalability problem. The contributions and organization of this thesis
are as follows:
In Chapter 3 we propose a generic iterative development methodology for building semantic
web applications. A problem that occurs in building a large-scale application using an iterative
approach is that the ontology or schema is continuously modified to incorporate new requirements.
This leads to a situation where other artifacts in the application (code, messages, queries,
documentation, etc.) become inconsistent with respect to the new ontology. Our work addresses the specific problem
of detecting queries that need to be modified as the ontology is modified, for applications that use
OWL ontologies and SPARQL queries. We propose an algorithm, implement it, evaluate it using
popular benchmarks and integrate it with an open-source ontology development environment. To
the best of our knowledge ours is the first work that addresses this change management problem for
an iterative development methodology involving semantic web technology.
In Chapter 4 we present a case study which demonstrates two components using semantic
web technologies and applications for them in the oil-field domain. We first describe a design
for the ontology that forms the basis of the components and applications. Then we describe a
component called the metadata-catalog, which serves as a repository that contains metadata re-
lated to the data objects containing the semi-structured data as well as key pieces of information
from the data. We also present an implementation and evaluation of the metadata catalog, appli-
cations built using it and lessons we have learnt during building it. Then we propose a technique
to ease the development and deployment of aggregation services. This technique uses a semantic
lookup component that uses the extensibility features of UDDI and demonstrates the benefits of
enhancing service descriptions with semantics.
In Chapter 5 of our thesis we address an important bottleneck observed in OWL-based
applications: the performance of inferencing. Our approach to this problem works by partitioning
the workload of the inferencing process and processing each partition independently. We
present two approaches to partitioning the workload and a parallel algorithm that uses these partitions.
Further, we demonstrate a real implementation based on a popular open-source tool and
show promising results obtained from running it on parallel clusters and multi-core servers. To
the best of our knowledge ours is the first work which proposes such a parallel approach and provides
a real implementation and a thorough evaluation for OWL inferencing.
Finally in Chapter 6 we conclude and provide interesting areas of future work and extensions
to our existing work.
Chapter 2
Background
chrono-synclastic infundibulum: those places in the universe where all
the different kinds of truths fit together. - The Sirens of Titan, Kurt Vonnegut
2.1 Ontologies and Knowledge Bases
An ontology is a shared representation or a data model of a set of concepts in a domain and the
relationships between them [2]. Ontologies have been commonly used to solve two important
and related problems occurring in large organizations: information integration and knowledge
management. The information integration problem occurs because different systems and databases
represent and store information in different ways. These differences are not just syntactic, i.e.,
using different technologies (XML, RDBMS, object-oriented, etc.) to represent and store the
information, but also semantic in nature. Semantic differences in information representation
mean that data is named or encoded differently in different data sources. For example, a very commonly
observed phenomenon in oil-field operations is that the same entity (say a well) is called by different
names in different sources (aliasing). As another example, different unit systems are used
to represent information in different information systems. Such issues have been successfully addressed
by using an ontology. The definitions of the concepts and relationships are only a means
to categorize and capture the real instance data in the domain. A data store which contains data
that are instances of the concepts in the ontology is called a knowledge base. The ontology can
be considered to be the schema for the knowledge base. Shared data models like PRODML and
WITSML, defined by Energistics (formerly POSC), also address similar problems and can also
be considered as ontologies. In this thesis, we consider ontologies defined and encoded using
semantic web technologies and argue that some of these applications are better implemented using
such data models. Ontologies are captured or represented by ontology languages. Since an
ontology is used as a shared representation, it is desirable that it be represented in a language
which is non-proprietary or open. Further, the ontology language must provide enough features
to represent rich definitions of concepts. Thus, another important requirement for an ontology
representation language is expressiveness. Finally, the ontology language should support
the ability to represent and query instance data.
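To make the aliasing example above concrete, the following sketch shows two sources referring to the same physical well under different names, an owl:sameAs link recorded in a small knowledge base, and the merged view of facts obtained through that link. The sketch uses Python with the rdflib library; the namespace, well identifiers and property names are hypothetical and chosen only for illustration.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import OWL

EX = Namespace("http://example.org/oilfield#")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Source A and source B name the same physical well differently (aliasing).
g.add((EX["W001"], EX.producedBarrels, Literal(1200)))          # fact from source A
g.add((EX["WELL-NORTH-1"], EX.operatedBy, Literal("Team X")))   # fact from source B

# The knowledge base records that the two names denote one entity.
g.add((EX["W001"], OWL.sameAs, EX["WELL-NORTH-1"]))

def facts_about(graph, entity):
    """Collect facts asserted under the entity's own name or any owl:sameAs alias."""
    aliases = {entity}
    aliases |= set(graph.objects(entity, OWL.sameAs))
    aliases |= set(graph.subjects(OWL.sameAs, entity))
    for alias in aliases:
        for p, o in graph.predicate_objects(alias):
            if p != OWL.sameAs:
                yield alias, p, o

for s, p, o in facts_about(g, EX["W001"]):
    print(s, p, o)
```

In a full semantic web stack an OWL reasoner would perform this owl:sameAs merging automatically; the explicit merge here only keeps the sketch self-contained.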
2.2 Semantic Web Technologies
Tim Berners-Lee, the pioneer of the World Wide Web, has put forth the idea of the semantic web
as the next-generation web, in which computers become capable of analyzing all the data on the web
[21]. The World Wide Web Consortium (W3C), which is the main body defining the standards for the
web, has proposed a set of standards that build upon the currently prevalent ones like XML and
address some of the key needs of such a semantic web. These standards address areas such as
languages for rich knowledge representation, querying, security, etc. Figure 2.1 below shows the
semantic web standards being defined by the W3C. In this thesis we will focus on the
standards highlighted in the figure to define ontologies and create knowledge bases.
Figure 2.1: The Semantic-web layer cake
2.2.1 RDF
Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) specification
originally designed as a metadata model, but it has come to be used as a general method of modeling
information. The RDF paradigm is based upon the idea of making statements about resources.
Thus, the basic unit of data representation in RDF is a statement or triple, which is of the form
subject-predicate-object. A set of related statements forms an RDF graph. Although the graph-based
RDF paradigm can be encoded in different ways, the most common and important way of
encoding it is as an XML document, using a well-defined convention. Two important specifications
closely related to RDF are RDF Schema (RDFS) and SPARQL. RDFS is used to define the
meaning of the concepts used in an RDF document. The relationship between RDFS and RDF
is similar to the relationship between XML Schema and XML. The difference between XML Schema (XMLS)
and RDFS is that, unlike XMLS, which only allows the definition of syntactic structures, RDFS
supports the notions of class, class hierarchies, etc. This difference can be likened to the difference
between imperative programming languages like C, which support the definition of structs, versus
object-oriented languages, which support richer notions of classes, class hierarchies, etc.
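As a small illustration of the triple model just described (Python with rdflib; the namespace and resource names are hypothetical), the sketch below builds a tiny RDF graph of subject-predicate-object statements and serializes it both to the RDF/XML encoding mentioned above and to the more readable Turtle syntax.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/oilfield#")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Each add() records one statement (triple): subject, predicate, object.
g.add((EX.W001, RDF.type, EX.Well))
g.add((EX.W001, EX.locatedIn, EX.NorthField))
g.add((EX.W001, EX.status, Literal("producing")))

# The same graph can be serialized in several encodings.
print(g.serialize(format="xml"))      # RDF/XML, the encoding discussed above
print(g.serialize(format="turtle"))   # Turtle, a more readable alternative
```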
2.2.2 OWL
Although the RDF and RDFS standards provide a richer set of primitives than XML, the semantic
web community found the need for a language that improves on RDF by adding constructs
that make it more expressive. OWL (Web Ontology Language) is the resultant language, which
builds on the RDF standards and improves their expressiveness while making sure that the computational
complexity of the language remains reasonable. By layering it on RDF, much of the existing RDF
tool support and querying infrastructure can be re-used. The OWL specification itself is designed as three
flavors (OWL-Lite, OWL-DL and OWL-Full) to support trade-offs between semantics and computational
complexity. OWL-Lite is the least expressive dialect of OWL while being the simplest
to implement, and OWL-Full is the most expressive but is also computationally the most expensive.
OWL-DL falls in the middle of the spectrum of expressiveness and computational complexity.
An important functionality that OWL and RDFS support is the ability to derive new information
based on the existing information and the schema definition, which we refer to as inferencing. For
instance, we could define a class called SubsurfaceEntity and define the class Well as a sub-class.
When a record asserting “W001 is a Well” is added to the knowledge base, it implicitly adds a
record asserting “W001 is a SubsurfaceEntity”. More sophisticated inferencing based on OWL
semantics can be used for smart querying of the knowledge base.
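The subclass inference just described can be sketched in a few lines of Python with rdflib, reusing the Well/SubsurfaceEntity example from the text. This hand-rolled closure applies only the RDFS-style type-propagation rule; a production system would rely on an OWL reasoner instead.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/oilfield#")  # hypothetical namespace

g = Graph()
g.add((EX.Well, RDFS.subClassOf, EX.SubsurfaceEntity))  # TBox: every Well is a SubsurfaceEntity
g.add((EX.W001, RDF.type, EX.Well))                     # ABox: "W001 is a Well"

def infer_types(graph):
    """Propagate rdf:type along rdfs:subClassOf (RDFS rule rdfs9) to a fixpoint."""
    changed = True
    while changed:
        changed = False
        for inst, cls in list(graph.subject_objects(RDF.type)):
            for supercls in graph.objects(cls, RDFS.subClassOf):
                if (inst, RDF.type, supercls) not in graph:
                    graph.add((inst, RDF.type, supercls))
                    changed = True

infer_types(g)
# The implicit record "W001 is a SubsurfaceEntity" is now explicit.
print((EX.W001, RDF.type, EX.SubsurfaceEntity) in g)  # True
```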
Table 2.1: A comparison of the features supported by OWL and other competitors

Feature | Category | OWL | RDF/RDFS | XML/XMLS | UML
Classes and class hierarchies | Data modeling | Y | Y | Partial | Y
Properties and property hierarchies | Data modeling | Y | Y | N | N
Functional properties (primary keys), transitive properties, inverse properties, etc. | Data modeling | Y | N | N | N
Class definition as constraints (e.g., ClosedWell is a Well that has status=closed) | Data modeling | Y | N | N | N
Ability to infer new information based on existing information | Reasoning | Y | Y | N | N
Standard representation based on open standards | Representation | Y | Y | Y | Y
Ability to represent and query instance data | Instance data | Y | Y | Y | Y
Commonly used paradigms for creating data models are relational, ER, UML, XML and the
semantic web languages (RDF/OWL). Note that the term data model is used interchangeably
in the literature to refer to the data-modeling paradigm (relational, XML, etc.) as well as to data
model instances (also called schemas). We have used the word to refer to the latter. Table 2.1
summarizes some of the features of these ontology languages.
OWL satisfies all the key requirements of an ontology language: it is based on open standards,
expressive, and able to store and query instance data. The semantic web standards are emerging
technologies, and key risks in terms of tool availability and scalability need to be addressed.
2.2.3 SPARQL
SPARQL is a query language recommended by the W3C for querying RDF data sources [116].
Queries in SPARQL are primarily expressed through the SELECT operator. The most important
part of a SELECT query is the triple pattern, which is like a triple, except that variables can be
present in the subject, predicate or object positions. Triple patterns can be combined using four
connectors to form more complex graph patterns. The four connectors that can be used
are AND, UNION, OPTIONAL and FILTER. SPARQL queries are resolved by matching the
graph patterns of the query against the input RDF graph. Further, SPARQL provides facilities to
query multiple RDF graphs by specifying each named graph and the graph pattern to be matched
in it. Finally, the standard specifies result set modifiers like ORDER BY, which presents the results in
a sorted order, DISTINCT, which removes duplicate results, and LIMIT, which returns only a subset of
the results.
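A brief example of the constructs just listed, run with Python and rdflib (the query text itself is plain SPARQL; the resource and property names reuse the hypothetical Well vocabulary from earlier sketches):

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/oilfield#")  # hypothetical namespace

g = Graph()
g.add((EX.W001, RDF.type, EX.Well))
g.add((EX.W001, EX.status, Literal("producing")))
g.add((EX.W002, RDF.type, EX.Well))  # W002 has no status triple

# Triple patterns joined implicitly, an OPTIONAL block, a FILTER,
# and the result-set modifiers DISTINCT, ORDER BY and LIMIT.
query = """
PREFIX ex: <http://example.org/oilfield#>
SELECT DISTINCT ?well ?status
WHERE {
    ?well a ex:Well .
    OPTIONAL { ?well ex:status ?status . }
    FILTER (?well != ex:W999)
}
ORDER BY ?well
LIMIT 10
"""

for row in g.query(query):
    print(row[0], row[1])  # ?status is unbound (None) for W002 because it is OPTIONAL
```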
2.3 Metadata
Metadata has often been suggested as a way to address this problem of finding information.
Metadata is commonly and simplistically defined as data about data. In this work, we have used
the term metadata to mean the data that describes the structure and workings of an organization’s
use of information, and the systems it uses to manage that information [67]. Metadata addresses
the following five questions: what data do we have, what does it mean, where is it, how did it
get there, and how do I get it? [147]. Figure 2.2 below shows a categorization of metadata types
based on the kinds of information they denote. At the bottom of the hierarchy is syntactic
metadata, such as file size. Structural metadata stores richer metadata, such as the format of the data. At
the higher levels, semantic metadata and ontologies provide the richest description of the data in
terms of the meaning of the information within the business context of the organization.
Figure 2.2: Data and Metadata types [128]
The ontology in our solution is centered on data objects like simulation models, production
forecasts, etc. For each data object, we define three basic kinds of metadata: (i) access info
metadata which defines how the data object can be accessed, typically its file location, (ii) prove-
nance metadata which describes how the data object was created including the name of the person
or application that created it, other objects that are related to this object, etc., and (iii) datatype-
specific metadata that is a concise summarization of the key attributes of the data object that are
specific to that particular type of data and are of interest from the domain experts’ perspective.
For instance, for the Well class, the datatype-specific metadata will capture whether the well is a
producer or injector, number of completions, zones it produces from, etc. This allows the user to
search for data objects which model certain realizations of the asset (e.g., a query such as “Show
me all simulation models in which the estimate of the initial oil in place is greater than one bil-
lion barrels”). A similar categorization of the metadata elements has been proposed in the grid
community. The concepts and relationships captured in the ontology are further elaborated in
Chapter 4.
The metadata for the data objects is itself stored in a knowledge base called the Metadata Catalog.
The metadata is created when an engineer publishes a data object into the system. Since much of
the metadata is present in the data object itself, special metadata extraction components for each
type of data object were written to parse the data objects and obtain the metadata. This metadata is then persisted
in the metadata catalog and consumed by various applications. Some of the tools that use this
metadata to provide useful functionality to the end users are presented in Chapter 4.
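As a sketch of what the three kinds of metadata might look like in the catalog (Python/rdflib again; the class, property and file names are hypothetical placeholders rather than the actual IAM ontology), together with the kind of search the text mentions:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/oilfield#")  # hypothetical namespace

catalog = Graph()
model = EX["simmodel-2008-03"]  # one published data object

catalog.add((model, RDF.type, EX.SimulationModel))

# (i) access info metadata: where the data object lives
catalog.add((model, EX.fileLocation, Literal("//fileserver/models/field_a/run42.dat")))

# (ii) provenance metadata: who created it and what it was derived from
catalog.add((model, EX.createdBy, EX.engineer_jdoe))
catalog.add((model, EX.derivedFrom, EX["geomodel-2007-11"]))

# (iii) datatype-specific metadata: key attributes extracted from the object
catalog.add((model, EX.initialOilInPlaceBbl, Literal(1.3e9, datatype=XSD.double)))

# "Show me all simulation models in which the estimate of the initial
#  oil in place is greater than one billion barrels."
query = """
PREFIX ex: <http://example.org/oilfield#>
SELECT ?model WHERE {
    ?model a ex:SimulationModel ;
           ex:initialOilInPlaceBbl ?ooip .
    FILTER (?ooip > 1000000000)
}
"""
for row in catalog.query(query):
    print(row[0])
```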
2.4 State-of-the-art in semantic web applications
As semantic web technologies transition from research to the mainstream, more and
more applications of this technology, particularly in the area of information management, have
been reported. Below we provide a taxonomy and summary of some interesting case studies that
have been reported in the last few years.
The advantages of using semantic web technologies have been demonstrated in many domains: life
sciences [81, 80], health care [71, 108, 63], energy [100, 137], education [37, 56], etc. In nearly
all of these applications, legacy data mostly stored in relational databases [52, 32, 7, 124] or in
unstructured data sources like HTML pages or text documents [100, 56, 118] is addressed through
an ontology. For applications in which the source data is stored in relational databases, the most
common method of accessing it is by mapping the tables and fields of the RDBMS to classes and
properties of an ontology. Queries, which are generally provided in SPARQL or a similar language,
are translated to one or more SQL queries to the source databases. For systems in which
most of the legacy data is stored in unstructured text documents, information extraction (IE) or
text-mining components, which use NLP techniques, are used to extract information from these
documents and store it in RDF format in a triple store. Since these IE techniques tend to be
quite inaccurate, the main application in such systems is search. A search differs from a query in
that the results obtained are expected to be approximate and unstructured for a search but accurate
and structured for a query. However, not much work has been done on similar solutions for domains
which contain a lot of semi-structured data (an exception is [139]). An interesting consequence
of this is that, for such domains, the architecture and knowledge extraction techniques are a mix of techniques found
in structured and unstructured information management.
Another important dimension of the classification is the methodology used in developing
the ontologies and the applications. Several strategies are commonly used in defining
ontologies. In top-down ontology definition, a comprehensive ontology defining a large number
of concepts in the domain is first defined, and applications using it are then built. This is perhaps
the most commonly used method in vogue today, and a few examples of such ontologies and applications
using them are [114, 118, 71, 108]. On the other hand, in bottom-up development the
ontologies are not comprehensive but are built to address the generally growing set of requirements
of the applications. Such an approach is generally used in traditional information system design
and is useful for applications which an entire community is not expected to use. Although
not as popular as the top-down approach, it has been used in some applications like
[85, 81, 100]. Our own work uses this approach and demonstrates some ideas for structuring and
developing such ontologies to be modular and agile.
Existing work can also be classified based on the scale of the application. This dimension
is somewhat harder to specify due to the lack of a set of ideal metrics for measuring applications.
For solutions that use a triple store, one obvious metric is the number of triples stored in
it. Most of the real applications we have reviewed were relatively small (<1M triples), although
the largest application had 60 million triples [108]. However, studies from vendors and open-source
tools have published support for much larger triple stores using standard benchmarks.
E.g., OWLIM supports storage and querying of 1B triples [105], as does Oracle [156]. However,
both of these results were obtained for the LUBM benchmark, which uses a very simple ontology.
A real-life implementation is generally much more complex than LUBM (especially the queries).
New benchmarks have been proposed which are likely to be closer to real-life ontologies
and queries, and results for such benchmarks are still awaited. Another factor that can be used
to measure scale is the scale of the reasoning. The most impressive results in this area are
achieved by IBM SHER [42], which scales well for very expressive (complex) ontologies and
large knowledge bases. Oracle [156] and OWLIM [105] have also shown impressive results in
this arena. The parallel reasoning technique that we propose can be utilized with the state-of-the-art
reasoners for larger datasets.
Further, most work in the literature today does not share the thoughts and experiences gathered from
developing applications based on an engineering methodology. We address this gap by proposing
a solution based on an agile methodology to develop ontology applications, and we look at various
challenges and techniques to address them. Our experiences are based on a significant effort of
about 2 man-years in research and more than 5 man-years in development. The related work is
summarized in Table 2.2.
Table 2.2: Summary of Related work.

System/Work | Domain | Semantic Web Usage | Input Data | Knowledge Extraction | Ontology Engg. | Applications | App. Dev. Methodology | Scale | Others
AKSIO [100] | Petroleum | Triple store, Annotated documents | RDBMS and unstructured data | Mapping, Manual annotation | Bottom up, Decentralized | Search | n/s | | Address socio/organizational issues
Ontoprise [7, 124] | Auto, CRM | Runtime | RDBMS, Web services | Mediator | Mapping | Query, Search | n/s | - | FLogic for ontology representation and reasoning
Business Process Management [44, 39, 120] | - | Reasoner, Dynamically generated triples | Web service definitions | Novel mapping function | Top Down | Web service composition | | |
TIF [85] | Finance | Runtime, Mediator | RDBMS | Mapping | Bottom up | Search | n/s | KB: 2.7M triples |
CNR [56] | Education | KB | RDBMS | Mapping + IE/text-mining | Reverse engineering | Search, Query | n/s | not known | -
S-CMS [37] | Education | Mediator | RDBMS, XML, HTML | Custom adapters | Simple | Query | n/s | - | Rule layer to apply rules
AKTivePSI [2] | e-Gov | Runtime, Triple Store | RDBMS dumps | Custom extractors | Reverse engineer | Query, Mashups | n/s | small |
International Relations [118] | e-Gov | Runtime, Triple Store | HTML pages | I.E. | Top Down | Search | n/s | Concepts: 85, Prop: 335; facts: 60K |
HealthFinland [71] | Health | Ontology mashup services, KB | RDF | - | Top Down | Search, Quality indicators, semantic content creation | n/s | Ontology: 50K concepts | -
Cohort Matching [108] | Health | Reasoning | RDBMS | Custom transformation | Top Down | Novel semantic matching application | n/s | Ontology size: 500K concepts, KB size: 60m triples | Scalable reasoning
SOMNet [63, 62] | Health | Knowledge base | User-generated RDF | - | Top Down / Port existing ontology to OWL | Search | n/s | - | -
Biomed [81, 80] | Life science | Runtime, Mediator, Triple store | RDBMS, XML | Mapping, Extractors | Bottom up | Query | n/s | |
Oracle [139, 140] | Life sciences | Runtime | Semi-structured and structured | Extractors | Top down | Query | n/s | 20M |
VSTO [52] | eScience | Runtime, Mediator | Multiple RDBMS | Mediator | Reuse, Port existing ontology to OWL + Top down | Smart search and data location | n/s | - | Emphasis on reasoning for smarter search
catcm [32] | eScience | Runtime, Mediator | Multiple RDBMS | Mediator | Reverse engineer | Search and Query | n/s | Ontology: 70 classes, 800 prop; 70 DBs |
Provenance and metadata stores [97, 36, 159, 83] | eScience/Grid | Triple Store | - | Generated by mygrid components | Top down | Provenance Query, Service lookup | n/s | |
Semantic lookup and workflows [46, 20, 16] | eScience/Grid | Semantic Lookup, Reasoner | User-provided annotations | - | Top Down | Automated workflow composition | n/a | 375K triples [46] |
Chapter 3
Development Methodology for Semantic Web Applications
If you want to make an apple pie from scratch, you must first create the universe.
-Carl Sagan, Cosmos
Don’t tell me we can’t change... Yes, we can. Yes, we can change. - Barack Obama
3.1 Methodology
We propose an agile approach for developing the ontologies and the applications based on them.
The key stakeholders involved in the development process are:
1. The business user community, who decide which business entities and relationships are of
interest. The key to ontology development is to engage a wide and representative set of
business users, playing different roles in the oilfield operation process and with different
perspectives on which metadata items and relationships are important enough to be cap-
tured in the knowledge base.
2. The ontology designer interacts with the business user community to capture the elements
of the ontology, and encodes it in OWL. Additionally, it is the job of the ontology de-
signer to communicate the ontology design to the system architect and the engineers on the
development side.
3. The solution architect designs the overall software solution and manages the development
process. He is also responsible for coordinating with the ontology designer and the business
user community with respect to issues related to the ontology, especially the
implementation issues that affect or constrain the design of the ontology.
4. The software developers build the applications that use the ontologies and thus are the
consumers of the ontology.
As in the agile development framework, work is planned in short cycles (sprints). In each
sprint we followed these steps:
1. Requirement Specification: The domain engineers and the ontologists are involved in
defining the applications, e.g., the different search parameters, the main parameters for the
audit trails, etc. The commonly used technique to capture requirements in an iterative approach
is through user stories. A user story captures a scenario of usage of the application in detail.
2. Ontology Definition: Based on the user stories, the main entities and their attributes were
laid out and formalized as OWL axioms. The two distinct steps of Conceptualization and
Formalization found in some ontology methodologies [112] are merged into this single step.
This seems to be a common trend, as observed in [130]. For users who are not very
familiar with the representation constructs of the semantic web standards it may make sense
to use the two-step approach.
3. Review: Once the ontologies are created, they are reviewed by the domain experts and
feedback is provided. If the ontology is small enough, then the visualization provided by the ontology
development tool itself can be used. Otherwise, the ontology engineer should present it in
a format that the domain experts can easily understand.
4. Application Development: Application development is carried out in parallel with ontology
development. In the spirit of the iterative software development methodology, the ontologies
developed during an iteration were passed on to the developers in the next iteration.
Constantly changing ontologies could hamper the software development, because a change
in the ontology could potentially affect the user interfaces, the XML schemas used for data
transfer, the query formats used to retrieve data from the knowledge base, etc. Therefore, the modifications
should be carefully planned, and successive iterations of the ontologies should only add
elements as far as possible. In section 3.6 we propose a novel technique to alleviate this
drawback.
Automatic code generation packages like Jastor [75, 78], which create Java code from
ontology definitions, are also recommended to make this more efficient (a minimal sketch of
this idea follows this list). We think that the availability of such automated tools will be critical
to the adoption and success of this emerging technology.
5. Demonstration and Feedback: Finally the applications were demonstrated to the end
users for feedback. Not surprisingly, demonstration of functionality and the user interface
is a powerful trigger for many ideas and extensions by the user base.
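The code-generation idea mentioned in step 4 can be illustrated with a small sketch. This is not Jastor itself (which generates Java), but a Python/rdflib stand-in that reads an OWL class and its datatype properties and emits a simple accessor class; the ontology URI and naming conventions are assumptions for illustration only.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDF, RDFS

def generate_accessor(ontology: Graph, class_uri: URIRef) -> str:
    """Emit Python source for a thin accessor class over one OWL class:
    one read-only property per owl:DatatypeProperty whose rdfs:domain is the class."""
    class_name = class_uri.split("#")[-1]
    props = [p for p in ontology.subjects(RDF.type, OWL.DatatypeProperty)
             if (p, RDFS.domain, class_uri) in ontology]

    lines = [
        f"class {class_name}:",
        "    def __init__(self, uri, kb):",
        "        self.uri, self.kb = uri, kb",
    ]
    for p in props:
        prop_name = p.split("#")[-1]
        lines += [
            "    @property",
            f"    def {prop_name}(self):",
            f"        return self.kb.value(self.uri, URIRef('{p}'))",
        ]
    return "\n".join(lines)

# Example: generate an accessor for a hypothetical ex:Well class.
onto = Graph()
onto.parse(data="""
@prefix ex:   <http://example.org/oilfield#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Well a owl:Class .
ex:status a owl:DatatypeProperty ; rdfs:domain ex:Well .
""", format="turtle")

print(generate_accessor(onto, URIRef("http://example.org/oilfield#Well")))
```

The generated class still needs the surrounding rdflib imports and a knowledge-base handle when it is used; a full generator such as Jastor covers much more of the language (object properties, class hierarchies, write access).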
3.2 Change Management
As mentioned earlier, frequent changes may thus be made to the ontology in order to accommodate
new requirements. Since other artifacts that form a part of the system are closely tied to
the ontology, changes to the ontology necessitate changes to them. This is shown in figure 3.1,
where a change to the ontology necessitates changes to the queries, the messages passed to the
components, and the application code. Thus we may end up in a situation in which changes
to an ontology trigger an expensive chain of changes to other artifacts. Tools that ease the
detection and execution of such changes increase the productivity of the software engineers
and hence reduce the cost of building software, and are therefore very important for the success of such a
development methodology.
Figure 3.1: The change-management problem
We address a specific case of this broader problem for the class of applications that use OWL
for representing the ontologies and SPARQL for the queries. In general, we assume that the ontologies
are built from the ground up, perhaps reusing existing ontologies. Our technique uses the changes
made to an OWL TBox to detect which queries need to be modified because of them; we call such queries
dirty queries. To understand why this is non-trivial, consider the simple scenario shown
in Fig. 3.2.
Figure 3.2: An ontology change scenario
The original setup consists of a TBox with three classes (Sub-SurfaceEntity, Well, Producer)
and a query to retrieve all Sub-SurfaceEntity elements. A new class Injector is then added to the
TBox and specified as a sub-class of Well (this is recorded in the change log). A naive change
detection algorithm [89] would have compared the entity names from the log (Well, Injector)
to those in the query (Sub-SurfaceEntity), and determined that the query need not be modified.
However, there could be a knowledge base consistent with the new ontology which contains
statements asserting certain elements to be of type Injector. Since these elements/type assertions
are not valid in any knowledge base consistent with the original ontology but will be returned
as results of the said query, we consider the query to be dirty. The naive algorithm does not
detect this because it does not consider the semantics (in this case, the class hierarchy).
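For concreteness, the query in this scenario could be written (prefixes omitted) as "SELECT ?e WHERE { ?e rdf:type SubSurfaceEntity }". Assuming, as the figure suggests, that Well is a sub-class of Sub-SurfaceEntity, an individual asserted only to be of type Injector is an answer to this query under the new TBox, yet cannot appear in any knowledge base consistent with the original TBox; this is exactly why the query must be flagged as dirty.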
Thus our approach goes beyond simple entity matching by considering the semantics of
the ontology, the changes and the queries. A challenge we face in our approach arises because
SPARQL is defined as a query language for RDF graphs and its relationship with OWL ontologies
is not very obvious. We address this challenge by defining a novel evaluation function that maps
the SPARQL queries to the domain of OWL semantic elements (Sect. 3.5). Then the semantics
of the changes are also defined on the same set of semantic elements (Sect. 3.6). This enables us
to compare and match the SPARQL queries with the changes made to the ontology and
hence determine which queries have been affected (Sect. 3.7). We present an implementation
of our technique, which seamlessly integrates with a popular, openly available OWL ontology
development environment (Sect. 3.8). We show an evaluation of our technique for a real-life
application for the oil industry and two publicly available OWL benchmark applications (Sect.
3.9). Finally, we describe some related work (Sect. 3.10) and close with discussions and conclusions (Sect.
3.11).
3.3 Preliminaries
3.3.1 OWL
The OWL specification is organized into the following three sections [109]:
1. Abstract Syntax: In this section, the modeling features of the language are presented using
an abstract (non-RDF) syntax.
2. RDF Mapping: This section of the specification defines how the constructs in the abstract
syntax are mapped into RDF triples. Rules are provided that map valid OWL ontologies to
a certain sub-set of the universe of RDF graphs. Thus the RDF mappings define a subset of all
RDF graphs, called well-formed graphs (WF_OWL), that represent valid OWL ontologies.
3. Semantics: The semantics of the language is presented in a model-theoretic form. The
OWL-DL vocabulary is defined over an OWL universe given by the triple (IOT, IOC, ALLPROP), where:
(a) IOT is the set of all owl:Things, which defines the set of all individuals.
(b) IOC is the set of all owl:Classes, which comprises all classes of the universe.
(c) ALLPROP is the union of the sets of owl:ObjectProperty (IOOP), owl:DatatypeProperty (IODP), owl:AnnotationProperty (IOAP) and owl:OntologyProperty (IOXP).
In addition, the following notations are defined in the specification and used in the rest of this chapter:
• A mapping function T is defined to map elements from the OWL universe to RDF form.
• An interpretation function EXT_I : ALLPROP → P(R_I × R_I) is used to define the semantics of the properties.
• The notation CEXT_I is used for a mapping from IOC to P(R_I), defining the extension of a class C from IOC.
From now on, we will use ontology to refer to the TBox and ABox combined, as is common in OWL terminology.
3.3.2 SPARQL
SPARQL is the language recommended by the W3C to query RDF graphs [116].
The language is based on the idea of matching graph patterns. We use the following inductive
definition of a graph pattern [111]; an illustrative query combining the operators follows the definition.
1. A tuple of the form (I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V) is a graph pattern (also called a triple pattern), where I is the set of IRIs, L the set of literals and V the set of variables.
2. If P1 and P2 are graph patterns, then P1 AND P2, P1 OPT P2 and P1 UNION P2 are graph patterns.
3. If P is a graph pattern and R is a built-in condition, then P FILTER R is a valid graph pattern.
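As a purely illustrative example of such patterns (the vocabulary is hypothetical), the query
SELECT ?x ?n ?e WHERE { ?x rdf:type Person . ?x hasName ?n . OPTIONAL { ?x hasEmail ?e } FILTER (?n != "Smith") }
combines a conjunction of two triple patterns (AND, written "." in the concrete syntax) with an OPT(IONAL) sub-pattern and a FILTER condition.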
The semantics of SPARQL queries are defined using a mapping function μ, which is a partial
function μ: V → τ, where V is the set of variables appearing in the query and τ is the triple space.
For a set of such mappings Ω, the semantics of the AND (⋈), UNION (∪) and OPT (⟕, left outer join) operators
are given as follows:

Ω1 ⋈ Ω2 = {μ1 ∪ μ2 | μ1 ∈ Ω1, μ2 ∈ Ω2 are compatible mappings}
Ω1 ∪ Ω2 = {μ | μ ∈ Ω1 or μ ∈ Ω2}
Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2)

where
Ω1 \ Ω2 = {μ ∈ Ω1 | ∀μ' ∈ Ω2, μ and μ' are not compatible}

(Two mappings μ1 and μ2 are compatible if for every variable x, μ1(x) = μ2(x), or μ1(x) = φ, or μ2(x) = φ.)
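As a small worked example of these definitions, let Ω1 = {μ1} with μ1 = {x → a}, and Ω2 = {μ2, μ3} with μ2 = {x → a, y → b} and μ3 = {x → c}. Then μ1 is compatible with μ2 (they agree on x) but not with μ3, so Ω1 ⋈ Ω2 = {{x → a, y → b}}, Ω1 \ Ω2 = φ, and therefore Ω1 ⟕ Ω2 = {{x → a, y → b}} as well.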
Since SPARQL is defined over RDF graphs, its semantics with respect to OWL is not very
easy to understand. In an attempt to clarify this, a subset of SPARQL that can be applied to
OWL-DL ontologies is presented in SPARQL-DL [133]. The kind of queries we use in our work
are the same as those presented in their work, but we also consider graph patterns that SPARQL-DL
does not; more specifically, the authors consider only conjunctive (AND) queries, whereas
we consider all SPARQL operators, viz. AND, UNION, OPTIONAL and FILTER. Note that the
main goal of [133] is to define a (sub-set of the) language that can be implemented using current
reasoners, whereas the goal of our work is to detect queries that are affected by ontology changes.
3.4 Overview
In Sect. 3.2 we provided the intuition that a dirty query with respect to two TBoxes is one which
can match some triple from an ontology consistent with either one of the TBoxes but not both.
We further formalize this notion here, using the following notation:
• O is the original ontology.
• C is the set of changes applied to O.
• O' is the new ontology obtained after C is applied to O.
• Q is a SPARQL query, for which we need to determine whether it is dirty or not.
• WF_O is the set of RDF graphs which represent ontologies that are consistent wrt. the statements in O.
• Similarly, the set of well-formed OWL graphs wrt. O' is given by WF_O'.
The extension of a query is defined as follows:
Definition: The extension of a query Q containing a graph pattern GP, wrt. an OWL T-Box O (denoted EXT_O(Q) or EXT_O(GP)), is defined as the set of all triples that match GP and are valid statements in some RDF graph from WF_O.
A more formal definition of a dirty query is given as:
Definition: A query Q is said to be dirty wrt. two OWL TBoxes O and O', if it matches some triple in WF_O' \ WF_O or WF_O \ WF_O', i.e., EXT_O(Q) ∩ ((WF_O' \ WF_O) ∪ (WF_O \ WF_O')) ≠ φ.
Thus, to determine if a given query is dirty, we find the extension of the query and compare it with the set of triples that are present in (WF_O' \ WF_O) ∪ (WF_O \ WF_O'). To do this we consider the changes C and determine the set of triples added, removed or modified in WF_O due to them; we call this the semantics of the change.
Thus our overall approach to detect dirty queries consists of the following four steps:
Thus our overall approach to detect dirty queries consists of the following four steps:
1. Capture ontology change: The changes made to the ontology are logged. Ideally, the
change capture tool must be integrated with the ontology design tool, so that the changes are
tracked in a manner that is invisible to the ontology engineer. Since many other works [102,
101, 84] have focused on this aspect of the problem we re-use much of their work and hence
do not delve into it.
2. Determine the extension of the query.
3. Determine the semantics of change.
4. Matching: Determine if the ontology change can lead to an inconsistent result for the
given queries, by matching the extension of the query with the changed semantics of the
ontology.
Each of these steps is detailed in the following sections.
3.5 Extension of SPARQL Queries
Due to the complexity of consistency checking for DL ontologies [43, 70], it is very hard to
accurately determine the EXT of a query. In order to alleviate this we use a simplified function
called NEXT, which determines the set of triples that satisfy a graph pattern by using a necessary
(but not sufficient) condition for a triple to be a valid statement in an ontology K
O
. From the
SPARQL semantics point of view, NEXT can be thought of as a function that provides the range
for each variable in a query, in the evaluation function Ω. The range itself is defined in terms
of the semantic elements of an OWL TBox. In other words, Ω:Q →T(NEXT(Q))- where T is
the function to map OWL semantic elements to triples [109]. The semantics presented in the
RDF-Compatible Model-Theoretic Semantics section of the OWL specification has been used as
the basis for defining NEXT. We first show how NEXT is defined for simple triple patterns and
then generalize it to complete queries.
3.5.1 Triple Patterns
Queries to OWL ontologies can be classified into three types: those that only query A-Box state-
ments (A-Box queries), those that query only T-Box statements (T-Box queries) or those that
contain a mix of both (mixed queries) [133]. In the interest of space, and because all the queries in the
applications and benchmarks we have considered are A-Box queries, we present the NEXT
values only for them. A similar method to the one below can be used to create the corresponding tables
for mixed and TBox queries.
Our evaluation of NEXT for triple patterns in A-Box queries is based on the following obser-
vations:
1. The facts/statements in an OWL A-Box can only be of three kinds: type assertions, or
identity assertions (sameAs/differentFrom) or property values.
2. A triple pattern contains either a constant (URI/literal) or a variable in each of the subject,
object and predicate position. Correspondingly, triple patterns are evaluated differently
based on whether a constant or a variable occurs in the subject, property, or object position
of the query.
We illustrate how NEXT values for triple patterns are determined through an example. Consider
a triple pattern of the form ?var1 constProperty ?var2, where ?var1 and ?var2
are variables and constProperty is a URI. We know that for this triple to match any triple in
the A-Box, constProperty must be either rdf:type, owl:sameAs/differentFrom,
or some datatype/object property defined in the T-Box.
Consider the case where constProperty is rdf:type. The only valid values that can
be bound to ?var2 are the URIs that are defined as a class (or a restriction) in the T-Box; in
other words, ?var2 belongs to the set IOC. The valid values of the subject (?var1) are the set of all
valid objects in O, i.e., IOT, because every element in IOT can have a type assertion. Therefore
NEXT(tp) is given as P(IOT × {rdf:type} × IOC), the power-set of all the triples from {IOT
× rdf:type × IOC}.
Note that this is a necessary but not sufficient condition: not every triple in
NEXT can be proved to be a valid statement with respect to O (not sufficient), but by the
definition of these semantic elements, any triple that is a valid match must be in NEXT (necessary). As a simple example
to illustrate this, consider a TBox with two classes Man and Woman that are defined to be disjoint
and a triple pattern ?x rdf:type ?var. An implication of Man and Woman being
specified as disjoint classes is that an individual cannot be an instance of both classes, i.e.,
EXT(tp) does not contain {aInd rdf:type Man ∧ aInd rdf:type Woman}. However, as described
above, the NEXT for the triple pattern is P(IOT × {rdf:type} × IOC) and does not preclude such
a combination of triples from being considered in it.
Using similar analysis, we evaluate the NEXT values for other kinds of triple patterns as
shown in Table 3.1.
Table 3.1: NEXT values for SPARQL queries to OWL A-Boxes

Type 1. TP: ?var1 ?var2 ?var3
  NEXT_O(TP) = P(IOT × Prop × (IOT ∪ LV_I))

Type 2. TP: ?var1 ?var2 Value
  Value is a URI from the TBox (class name): P(IOT × {rdf:type} × {Value})
  Value is an unknown URI: P(IOT × {IOOP ∪ owl:sameAs ∪ owl:differentFrom} × {Value})
  Value is a literal: P(IOT × {IODP} × Value)

Type 3. TP: ?var1 Property ?var3
  Property is rdf:type: P(IOT × {rdf:type} × IOC)
  Property is owl:sameAs (differentFrom): P(IOT × {owl:sameAs} × IOT)
  Property is an object property, i.e. Property ⊂ IOOP: P(∪_{D ∈ DOM_P} CEXT(D) × {Property} × ∪_{R ∈ RAN_P} CEXT(R))
  Property is a data-type property, i.e. Property ⊂ IODP: P(∪_{D ∈ DOM_P} CEXT(D) × {Property} × LV)

Type 4. TP: ?var1 Property Value
  Property is rdf:type (and Value ⊂ IOC): P(CEXT(C) × {rdf:type} × Value), where C = T^(-1)(Value)
  Property is sameAs/differentFrom (Value is a URI ⊂ IOT): P(IOT × {owl:sameAs} × {Value})
  Property is an object property or data-type property (correspondingly, Value is a URI or a literal): P(∪_{D ∈ DOM_P} CEXT(D) × {Property} × {Value})

Type 5. TP: Value ?var1 ?var2
  NEXT_O(TP) = P({Value} × ALLPROP × {IOT ∪ LV ∪ IOC})

Type 6. TP: Value ?var1 Value2
  Same as case 2.

Type 7. TP: Value Property ?var2
  Same as case 3.

Type 8. TP: Value Property Value2
  NEXT_O(TP) = TP
3.5.2 Compound Graph Patterns
We now extend this notion to arbitrary graph patterns. Recall from Sect. 3.3 that a graph pattern
Q is recursively defined as Q = Q1 AND Q2 | Q1 UNION Q2 | Q1 OPT Q2 | Q1 FILTER R.
The NEXT value for a query Q is defined based on the connecting operator, as follows:
1. Consider a simple example of the first case in which both Q1 and Q2 are triple patterns
connected through AND: (?x type A AND ?x type B). For the variable x to satisfy the
first (second) triple pattern, it has to have a value in CEXT(A) (CEXT(B)). However,
due to the AND, x has to be a compatible mapping. Thus the valid values of x are in
CEXT(A) ∩ CEXT(B). For a variable that only appears in one of the sub-patterns, the
NEXT does not depend on the other sub-pattern.
2. For a UNION query, the mappings of the variables occurring in Q1 and Q2 are completely
independent of each other, and thus the evaluation can be performed independently.
3. If the two sub-queries are connected by OPT, which has left-join semantics, the variables
on the left side (Q1) are independent of the variables in Q2. However, the extension of the
variables in Q2 is computed as in an AND query.
4. When expressions are connected using the FILTER operator, the extension is determined
as that of Q1 (we examine two special cases later).
Once the NEXT values of the variables in each sub-query are computed, the NEXT value of
the query can be computed as follows. For a constant c, NEXT(c, Q) = {c}. The extension of the query Q is given as

NEXT(Q) = ∪_{tp ∈ Q} P(NEXT(sub_tp, Q) × NEXT(prop_tp, Q) × NEXT(obj_tp, Q))

where tp ranges over the triple patterns in Q and sub_tp, prop_tp and obj_tp represent the constant/variable in
the respective position of tp.
This procedure is summarized in Algorithm 1. The algorithm takes as input a SPARQL
query that is fully parenthesized, such that the innermost parentheses contain the expression that
is to be evaluated next. For each of the expressions surrounded by parentheses, we maintain
the NEXT value to which each variable is mapped. When this is modified during the
evaluation of an expression in a different sub-query, it is updated to the new value of the
variable, based on the operator semantics described above.
Exceptions: Two exceptional cases which are treated separately are:
• An interesting use of the FILTER expression is to express negation in queries [123].
E.g., to query for the complement of the instances of a class C one can write a query of the
form:
Algorithm 1 Algorithm to compute the NEXT of a compound query
Input: Fully parenthesized query in normal form Q, ontology O
Output: NEXT of Q
1: while all patterns are not evaluated do
2:   P ← innermost unevaluated expression in Q
3:   if P is a triple pattern (tp) then
4:     NEXT(var, P) ← NEXT_S(var, tp) for each variable var in tp
5:     NEXT(var, tp) ← NEXT_S(var, tp)
6:   else if P is of the form (P1 AND P2) then
7:     for each variable v in P1, P2 do
8:       if v occurs in both P1 and P2 then
9:         NEXT(v, P) = NEXT(v, P1) ∩ NEXT(v, P2)
10:        Update the NEXT of v in P1 and P2, as well as in all sub-patterns it occurs in, to NEXT(v, P)
11:      else if v occurs in only one of P1, P2 (say Pi) then
12:        NEXT(v, P) = NEXT(v, Pi)
13:      end if
14:    end for
15:  else if P is of the form (P1 OPT P2) then
16:    for each variable v in P1, P2 do
17:      if v occurs in P1 then
18:        NEXT(v, P) = NEXT(v, P1)
19:      end if
20:      if v occurs in P2 then
21:        if v also occurs in P1 then
22:          NEXT(v, P2) = NEXT(v, P1) ∩ NEXT(v, P2)
23:        else
24:          NEXT(v, P) = NEXT(v, P2)
25:        end if
26:      end if
27:    end for
28:  else if P is of the form (P1 FILTER R) then
29:    for each variable v in P1 do
30:      NEXT(v, P) = NEXT(v, P1)
31:    end for
32:  else if P is of the form (P1 UNION P2) then
33:    for each variable v in P1, P2 do
34:      NEXT(v, P) = NEXT(v, Pi), where Pi is the sub-pattern in which v occurs
35:    end for
36:  end if
37: end while
38: return the union of the NEXT of each triple pattern in the query
(?x type owl:Thing) OPT (?a type C FILTER (?x = ?a)) FILTER (!bound(?a))
In this query, ?x is bound to all objects that are not of type C, i.e., the NEXT value for the
variable ?x should be assigned as IOT \ CEXT(C).
• Another interesting case is the use of the isLiteral condition in a FILTER expression. Consider
the pattern ?c type Student. ?c ?p ?val. FILTER(isLiteral(?val)). Without the FILTER
clause, we might conclude that the variable ?p is bound to all properties with domain Student.
But since the filter condition specifies that ?val has to be a literal, ?p can be restricted to
the set of data-type properties with domain Student. Note that by not considering the FILTER we
would obtain a super-set of the possible bindings; therefore any change to one of these properties
would still be detected, but some false positives may be present.
3.6 Semantics of Change
The second step of our change detection process is to map the changes made to the ontology to
OWL semantic elements, which enables the queries and the changes to be compared. We
observe that the changes to a TBox can be classified as lexical changes and semantic changes.
Lexical changes are changes made to the names (URIs) of OWL classes or properties.
Such changes can be handled easily by a simple string match and replace in the query.
Semantic changes are more interesting because they affect one or more OWL semantic elements
and need to be carefully considered. They can be further classified as:
• Extensional changes: Extensional changes are changes that modify the extensional sets
of a class or a property. E.g., adding an axiom that specifies a class as a super-class of
another is an example of this, because the extension of the super-class is now changed to
include the instances of the sub-class.
• Assertional/rule changes: Assertional changes do not modify the extensions of TBox elements
but add additional inference rules or assertions. E.g., specifying a property to be
transitive does not change the extension of the domain or range of the property but adds a
rule to derive additional triples from asserted ones.
• Cardinality changes: These specify constraints on the cardinality of a relationship.
The complete list of semantic changes that can be made to an OWL (Lite) ontology is presented
in [84]. We have used it as the basis for capturing and representing the ontology changes
in our system. The semantics of a change, i.e., the effect of the change on the extensions of the model, is
represented as the set of all OWL semantic elements that are affected by the change. By matching
this with the extension (NEXT value) of the query, we can determine if the query is dirty or not. In
Table 3.2, we show some examples of changes and their semantics.
Object, Operation, Argument(s): Semantics of Change (primed sets refer to the changed ontology O')
Ontology, Add Class, Class definition (C): IOC ≠ IOC'
Ontology, Remove Class, Class ID (C): IOC ≠ IOC'; CEXT(SC) ≠ CEXT'(SC); CEXT(Dom(P)) ≠ CEXT'(Dom(P)) and CEXT(Ran(P)) ≠ CEXT'(Ran(P)) for all P with C ∈ Dom(P) or Ran(P)
Class (C), Add SuperClass, Class ID (SC): CEXT(SC) ≠ CEXT'(SC)
Class (C), Remove SuperClass, Class ID (SC): CEXT(SC) ≠ CEXT'(SC)
Property (P), Set Transitivity, Property ID: - (Assertional Change)
Property (P), UnSet Transitivity, Property ID: - (Assertional Change)
Table 3.2: Changes to OWL ontologies and their semantics
• Example 1: A class C is added to the TBox. IOC, the set of classes defined in the TBox (see footnote 2), is different in the new ontology from the original one.
• Example 2: A more interesting case is when a class C is removed from the TBox. Not only
is IOC changed as before, but also the extensions of the super-classes of C, because all the
instances of C, which were also instances of the super-class(es) in the original TBox, are
not valid in the modified TBox. The modification also affects the extensions of the class
descriptions (restrictions, intersectionOf) in which the class C appears. Finally, the domains and ranges
of the properties in which C appears are also modified. Note that a complete DL
reasoner (like Pellet or Racer) can be used to fully derive the class subsumption hierarchy,
which can then be used to derive the semantics of the change.
• Example 3: In the third example, an OWL axiom which defines a class C as a sub-class
of SC is added. In this case, the extension of the class SC changes. Setting and un-setting
transitivity of a property is an example of an assertional change to the ontology, as described
above.
In the interest of space, the entire set of changes and the OWL semantic entities that they affect
is not presented here, but the interested reader can find it online (footnote 3); it is also reproduced in Sect. 3.12.
3.7 Matching
The matching algorithm is fairly straightforward and is presented in pseudo-code form in Algorithm 2.
Footnote 2: The OWL spec [109] defines IOC as the set of all OWL classes; here we (ab)use the notation to denote the set of classes defined in the ontology (T-Box).
Footnote 3: http://pgroup.usc.edu/iam/papers/supplemental-material/SemanticsOfChange.pdf
Algorithm 2 Algorithm to detect dirty queries for a set of ontology changes
Input: Ontology Change log L, Query Q
Output: Dirty queries
1: Aggregate changes in L
2: for each lexical change l in L do
3: Modify the name of the ontology entity if it appears in it
4: end for
5: Let N←NEXT of Q
6: Let C←set of semantically changed extensions of O due to L
7: for each element n in N do
8: if n matches any element in C then
9: Mark Q as dirty
10: end if
11: end for
12: return
In the first step, the log entries are aggregated to eliminate redundant edits. E.g., it is possible
that the log contains two entries, one which deletes a class C and another which adds the same
class C. Such changes are commonly observed when the changes are tracked through a user interface
and the user retraces some of the changes made. Clearly these need to be aggregated
to conclude that the TBox has not been modified. Then the lexical changes are matched and the
query is automatically modified to refer to the new names of the TBox elements. Finally, the
NEXT value of Q and the semantic implications of each change in L are matched. This is done by
comparing the extension or element bound to the subject, property and object positions of each triple
pattern of Q with the extensions modified by the changes made to the TBox. If any of these sets
is affected, the query is marked as dirty.
3.8 Implementation
Our technique has been implemented as a plug-in to the popular and openly available Protégé
ontology management tool [115]. Since we have used Protégé for ontology development in our
work, providing this service as a plug-in gives the ontology engineer a seamless environment.
Moreover, the Protégé toolkit is equipped with another plug-in that tracks the changes made to an
ontology, and we utilize this to capture the changes. After the user makes changes
to the ontology in the design tab of the tool, he proceeds to the dirty query detection panel
and points to a file containing the SPARQL queries to be validated. The dirty queries are highlighted
and the user can then decide if the changes should be kept or discarded. The query validation
plug-in is shown in Fig. 3.3(a) and the Protégé change tracking plug-in in Fig. 3.3(b).
(a) Query validation service implemented as a plug-in to the Protégé toolkit
(b) Change tracking service in Protégé
Figure 3.3: Support for incremental development in the Protégé environment
Our implementation of the dirty query detection algorithm is in Java and uses an openly available
grammar for SPARQL to create a parser for the queries (footnote 4). Since all the queries in our
applications were stored in a file, the problem of finding the queries was easy. However,
many of the queries were parameterized and we had to pre-process them to convert them to a
form that is compliant with the specification. E.g., a query for a keyword search to find people
with a user-specified name is usually parameterized as follows: "SELECT ?persons WHERE
{?persons rdf:type Person. ?persons hasName $userParm$}". Here $userParm$ is replaced
with a dummy string literal to make it a parseable SPARQL query.
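The pre-processing itself is straightforward; a minimal sketch, assuming the $...$ placeholder convention shown above, is:

// Minimal sketch of the parameterized-query pre-processing described above:
// placeholders of the form $name$ are replaced by a dummy string literal so
// that the query can be handed to a standard SPARQL parser.
public class QueryPreprocessor {
    public static String makeParseable(String parameterizedQuery) {
        // \$\w+\$ matches tokens such as $userParm$
        return parameterizedQuery.replaceAll("\\$\\w+\\$", "\"dummy\"");
    }

    public static void main(String[] args) {
        String q = "SELECT ?persons WHERE {?persons rdf:type Person. "
                 + "?persons hasName $userParm$}";
        System.out.println(makeParseable(q));
    }
}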
Once the queries are parsed, the NEXT values for the queries are evaluated as described in
Sect. 3.5. We have used the Pellet reasoner (footnote 5) for determining the class and property lattices
needed for evaluating NEXT. The changes tracked by the Protégé ontology plug-in are logged in an
RDF file. We extract these changes using the Jena API (footnote 6), perform the necessary aggregations and
evaluate the semantics of the changes. Again, the Pellet reasoner is used here for computing the
class lattices, etc. Finally, the matching is done, and all the details of the dirty queries (the triple
pattern in the query that is dirty, the TBox change that caused it to be invalidated, the person who
made the change, and a suggested fix to the problem) are displayed to the user.
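As an illustration of the change-extraction step, the sketch below uses the Jena API to load the RDF change log written by the Protégé change-tracking plug-in and iterate over its statements; the file name is illustrative, and interpreting the statements against the plug-in's change vocabulary is omitted.

// Sketch: load the RDF change log and iterate over its statements with Jena.
import com.hp.hpl.jena.rdf.model.*;
import java.io.FileInputStream;

public class ChangeLogReader {
    public static void main(String[] args) throws Exception {
        Model log = ModelFactory.createDefaultModel();
        log.read(new FileInputStream("changes.rdf"), null); // parse RDF/XML

        StmtIterator it = log.listStatements();
        while (it.hasNext()) {
            Statement st = it.nextStatement();
            // Each logged change is represented by one or more RDF statements.
            System.out.println(st.getSubject() + " " + st.getPredicate() + " " + st.getObject());
        }
    }
}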
3.9 Evaluation
We have evaluated our algorithm on three data-sets. The first two, LUBM [90] and UOBM [92],
are popular OWL knowledge base benchmarks which consist of OWL ontologies related to
universities and about 15 queries each. The third benchmark we have used (called CiSoft) is based on
a real application that we have built for an oil company (see Chapter 4). The schema for LUBM is relatively
simple: it has about 40 classes. Although UOBM is of a similar size, it ensures that all the OWL
constructs are exercised in the TBox. Both these benchmarks have simple queries: each query has
on average about 3 triple patterns, and they are all conjunctive queries. The TBox of the
CiSoft benchmark is larger than the other two (about 100 classes), and the queries we have chosen
from the CiSoft application form a set of about 25 queries, each with on average 6 triple
patterns. These queries exercise all the SPARQL connectors (AND, OPT, UNION, FILTER).
Footnote 4: http://antlr.org/grammar/1200929755392/index.html
Footnote 5: http://pellet.owldl.com/
Footnote 6: http://jena.sourceforge.net/
To evaluate our algorithm, we compare it with two other algorithms. The simpler of the
two is the Entity Name algorithm, which checks by string matching whether the names of the entities
modified in the TBox occur in the triple patterns of the query. If they occur, it declares the query
dirty; if not, it declares it clean. The second algorithm, called Basic Triple Pattern (BTP), is a
sub-set of our Complete algorithm: it does not consider the connectors between the
SPARQL triple patterns, i.e., it only implements the rules presented in Table 3.1.
We have used the two standard metrics from information retrieval, precision and recall, for
the evaluation. Recall is the ratio of the number of dirty queries retrieved by the algorithm to the
total number of dirty queries in the data-set. Precision is the ratio of the number of dirty
queries detected by the algorithm to the total number of results returned by the algorithm. The results
shown in Table 3.3 are the average of 50 runs for each data-set; in each run a small number (< 10)
of random changes to the ontology was simulated and the algorithms were then used to detect the
dirty queries with respect to those changes.
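In symbols, if D is the set of dirty queries in a data-set and R is the set of queries reported as dirty by an algorithm, then Recall = |D ∩ R| / |D| and Precision = |D ∩ R| / |R|.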
We see that the Basic TP algorithm has a recall of 1, i.e., it always returns all the dirty queries
in the data-set, but it also returns a number of false positives (low precision). Since the results
returned by BTP are always a super-set of those of the Complete algorithm, its recall is always 1.
Benchmark   Algorithm     Recall   Precision
CiSoft      Complete      1        1
            Basic T.P.    1        0.2
            Entity name   0.4      0.45
LUBM        Complete      1        1
            Basic T.P.    1        0.6
            Entity name   0.4      0.85
UOBM        Complete      1        1
            Basic T.P.    1        0.7
            Entity name   0.25     0.6
Table 3.3: Dirty query detection results for the three algorithms.
To understand why a low precision is observed for BTP (especially for the CiSoft data-set), consider
a query of the form ?a rdf:type Student. ?a ?prop ?value. Since the algorithm considers each triple
pattern in isolation, it infers that every valid triple in the ontology will match the second triple
pattern (?a ?prop ?value). Therefore, any ontology change will invalidate the query. However,
this is incorrect, because the first triple pattern ensures that only triples which refer to instances of
Student will match the query, and therefore only changes related to the OWL class Student can
invalidate it. The LUBM and UOBM queries do not have many triple patterns of this
form, so the precision of BTP for these data-sets is higher.
The Entity Name algorithm does not always pick out the dirty queries (recall < 1). Its main
shortcoming is that it cannot detect ontology changes that affect the values that may be bound to a variable.
3.10 Related Work
Much work has been done in the general area of ontology change management [84, 141, 87].
Most of these works deal with semantic web applications in which ontologies are imported or
built in a distributed setting. In such a setting, the main challenge is to ensure that the ontologies
are kept consistent with each other. In our work, we address the problem of keeping SPARQL
queries consistent with OWL ontologies. Although some aspects of the problem, e.g., the set of
changes that can be made to an OWL ontology, are the same, our key contributions are in defining
the notion of dirty queries and the evaluation function which maps queries and (the implications of)
ontology changes onto OWL semantic elements, which makes it possible to compare them to
decide if a query is invalidated.
In [141], although the authors define evolution quite broadly as the timely adaptation of an ontology
to arising changes and the consistent propagation of these changes to dependent artifacts,
they do not address the issue of keeping queries based on ontology definitions consistent with the
new ontology. The authors do define a generic four-stage change handling mechanism, consisting of (change)
representation, semantics of change, propagation, and implementation, which is applicable to any
artifact that depends on the ontology. Our own four-step process is somewhat similar to this.
An important sub-problem in the ontology evolution problem is the change detection problem.
Various approaches have been proposed in the literature to address it. Most of these address the
setting of distributed ontology development and thus provide sophisticated mechanisms
to compare two ontologies and find the differences between them [101, 113]. We, on the other hand,
assume a more centralized setting in which the ontology engineers modify
the same copy of the ontology definition file. We have used an existing plug-in developed for the
Protégé toolkit [102], which tracks the changes made to the ontology.
An important artifact in an ontology-based system is the knowledge base. In [152], the authors
address the problem of efficiently maintaining a knowledge base when the ontology (logic program)
changes. Similar to the work on view maintenance in the datalog community, the authors
use a delta program to efficiently detect the data tuples that need to be added to or deleted from
the existing data store. This is an important piece of work addressing the needs of the class of
applications that we target, and is complementary to our work.
In the area of software engineering, the idea of agile databases [6] addresses the similar problem
of developing software in an environment in which the database schema is constantly evolving.
The authors present various techniques and best practices to facilitate efficient development
of software in such a dynamic methodology. Unlike our work, however, they do not
address the problem of detecting the queries that are affected by changes to the schema.
3.11 Discussion and Conclusions
We have addressed a problem seen in the context of OWL-based application development using an
iterative methodology. In such a setting, as the (OWL) TBox is frequently modified, it becomes
necessary to check if the queries used in the application also need to be modified. The key
element of our technique is a SPARQL evaluation function that is used to map a query to OWL
semantic elements. This is then matched with the semantics of the changes to the TBox to detect
dirty queries. Our evaluation shows that simpler approaches might not be enough to effectively
detect such queries.
Although originally intended to detect dirty queries, we have found that our evaluation function
can also be used as a quick way to check if a SPARQL query is semantically incorrect with
respect to an ontology. Semantically incorrect queries are those that do not match any valid graph
for the ontology, i.e., always return an empty result-set. For such queries, our evaluation function
will not find a satisfactory binding for all the triple patterns.
An assumption we make in our work is that all the queries to the A-Box are available for
checking when changes are made. In many application development scenarios this may not be
feasible. Therefore, whenever possible, it is good practice for an application development team
working in such an agile methodology to structure the application so that the queries used in it
can be easily extracted for this kind of analysis. If that is not possible, one
option for the ontology engineer is to use the OWL built-in mechanism to mark the changed entity
as deprecated, and phase it out after a sufficiently long time.
Oftentimes, queries are dynamically generated based on user input. In such cases it
is harder to check the validity of the queries. However, it might still be possible to detect
dirty queries because such queries are generally written as parameterized templates which are
customized to the user input. If such templates are made available, it might still be possible to
check if they are valid.
3.12 List of Changes made to OWL Ontology
Table 3.4: Semantics of change for the complete set of changes to an OWL ontology (primed sets refer to the changed ontology O')

Object, Operation, Argument(s): Semantics of Change
Ontology, Add Class, Class definition (C): IOC ≠ IOC'
Ontology, Remove Class, Class ID (C): IOC ≠ IOC'; CEXT(SC) ≠ CEXT'(SC); CEXT(Dom(P)) ≠ CEXT'(Dom(P)) and CEXT(Ran(P)) ≠ CEXT'(Ran(P)) for all P with C ∈ Dom(P) or Ran(P)
Class, Add SuperClass, Class IDs (SC, C): CEXT(SC) ≠ CEXT'(SC)
Class, Remove SuperClass, Class IDs (SC, C): CEXT(SC) ≠ CEXT'(SC)
Class, Add Equivalent Class, Class description or ID (C1, C2): CEXT(C1) ≠ CEXT'(C1); CEXT(C2) ≠ CEXT'(C2); CEXT(SC) ≠ CEXT'(SC) ∀SC super-class of C1 or C2
Class, Remove Equivalent Class, Class description or ID (C1, C2): CEXT(C1) ≠ CEXT'(C1); CEXT(C2) ≠ CEXT'(C2); CEXT(SC) ≠ CEXT'(SC) ∀SC super-class of C1 or C2
Class, Add Disjoint Class, Class description or ID (C1, C2): CEXT(C1) ≠ CEXT'(C1); CEXT(C2) ≠ CEXT'(C2); CEXT(SC) ≠ CEXT'(SC) ∀SC super-class of C1 or C2
Class, Remove Disjoint Class, Class description or ID (C1, C2): CEXT(C1) ≠ CEXT'(C1); CEXT(C2) ≠ CEXT'(C2); CEXT(SC) ≠ CEXT'(SC) ∀SC super-class of C1 or C2
Class, Add Restriction, Restriction S or SF: IOC ≠ IOC'
Class, Remove Restriction, Restriction: IOC ≠ IOC'
ValueRestriction, Change To Universal Restriction, Restriction R: CEXT(R) ≠ CEXT'(R)
ValueRestriction, Change To Existential Restriction, Restriction R: CEXT(R) ≠ CEXT'(R)
CardinalityRestriction, Add Lowerbound, Integer: MINCARD(P) ≠ MINCARD'(P)
CardinalityRestriction, Remove Lowerbound, Integer: MINCARD(P) ≠ MINCARD'(P)
CardinalityRestriction, Modify Lowerbound, Two integers: MINCARD(P) ≠ MINCARD'(P)
CardinalityRestriction, Add Upperbound, Integer: MAXCARD(P) ≠ MAXCARD'(P)
CardinalityRestriction, Remove Upperbound, Integer: MAXCARD(P) ≠ MAXCARD'(P)
CardinalityRestriction, Modify Upperbound, Two integers: MAXCARD(P) ≠ MAXCARD'(P)
Ontology, Add Property, Property definition: IOP ≠ IOP'
Ontology, Remove Property, Property ID (P): IOP ≠ IOP'; EXT(SP) ≠ EXT'(SP)
Property, Add Domain, Class description or ID (P, C): DOM(P) ≠ DOM'(P); DOM(SP) ≠ DOM'(SP) ∀SP super-property of P
Property, Remove Domain, Class description or ID (P, C): DOM(P) ≠ DOM'(P); DOM(SP) ≠ DOM'(SP) ∀SP super-property of P
Property, Add Range, Class description or ID: RAN(P) ≠ RAN'(P); DOM(SP) ≠ DOM'(SP) ∀SP super-property of P
Property, Remove Range, Class description or ID: RAN(P) ≠ RAN'(P); DOM(SP) ≠ DOM'(SP) ∀SP super-property of P
Property, Set Functionality, Property ID (p): MINCARD(p) ≠ MINCARD'(p); MAXCARD(p) ≠ MAXCARD'(p)
Property, UnSet Functionality, Property ID (p): MINCARD(p) ≠ MINCARD'(p); MAXCARD(p) ≠ MAXCARD'(p)
Property, Add Symmetry, Property ID (p): Assertional change
Property, Remove Symmetry, Property ID (p): Assertional change
Property, Set Transitivity, Property ID (p): Assertional change
Property, UnSet Transitivity, Property ID (p): Assertional change
Property, Set InverseFunctionality, Property ID (p): MINCARD(P) ≠ MINCARD'(P); MAXCARD(P) ≠ MAXCARD'(P)
Property, UnSet InverseFunctionality, Property ID (p): MINCARD(P) ≠ MINCARD'(P); MAXCARD(P) ≠ MAXCARD'(P)
Property, Add SuperProperty, Property IDs (p, sp): DOM(sp), RAN(sp)
Property, Remove SuperProperty, Property IDs (p, sp): DOM(sp), RAN(sp)
Property, Add Equivalent Property, Property ID: DOM(p), RAN(p)
Property, Remove Equivalent Property, Property ID: DOM(p), RAN(p)
Property, Add Inverse Property, Property ID: Assertional change
Property, Remove Inverse Property, Property ID: Assertional change
Property, Change To DatatypeProperty, Property ID: IOOP ≠ IOOP'; IODP ≠ IODP'
Property, Change To ObjectProperty, Property ID: IOOP ≠ IOOP'; IODP ≠ IODP'
Resource, Add Label, Value: -
Resource, Remove Label, Value: -
Resource, Add Comment, Value: -
Resource, Remove Comment, Value: -
Resource, Add Annotation Property, ID Value: IOAP ≠ IOAP'
Resource, Remove Annotation Property, ID Value: IOAP ≠ IOAP'
Individual, Add Equivalent Individual, Individual
Individual, Remove Equivalent Individual, Individual
Individual, Add Disjoint Individual, Individual
Individual, Remove Disjoint Individual, Individual
Ontology, Add Individual, Individual definition
Ontology, Remove Individual, Individual ID
Chapter 4
Case Study: Information Management Applications for Oil Field Operations
"A scientist builds in order to learn; an engineer learns in order to build." -- Fred Brooks
4.1 Integrated Asset Management
The work described in this chapter is part of the Integrated Asset Management (IAM) project at the
Chevron-funded Center for Interactive Smart Oilfield Technologies (CiSoft) at the University of Southern
California, Los Angeles [34]. The current focus of the IAM project is on enabling model-driven
reservoir management. In model-driven reservoir management, the reservoir engineer relies on
simulations (and hence simulation models) to make key operational decisions pertaining to the
reservoir on a day-to-day basis. As a motivating example, consider a typical operational
setting for an oil field that is just starting to produce. Since little or no performance-related data
for the field exists, the production engineer has to rely on simulations for making the initial set
of asset development decisions. Different simulation models of the oil field are created and used
- these include earth models, reservoir simulation models, network models, integrated (coupled)
simulation models, etc. These simulation models are built and used at different times, different
locations, and by different asset team members - earth scientists, reservoir engineers, production
engineers, asset managers, etc. A particular member of the team (say, the reservoir engineer) is
typically an expert in a particular modeling and simulation technology and intimately familiar
with a few software toolkits in that area. This also means that models, workflows, and results
created by software tools in other areas are not readily usable or accessible by that expert. As a
result, the insights and understanding of a team member in one role (say, geologist) are not fully
utilized by another role (say, production engineer). Moreover, these simulation models could be
constantly modified as new data is continuously produced in the oil-field and interpreted by one
or more members of the asset team. In this situation, changes made to the model(s) by one team
member should be immediately communicated to other team members who may be using that
model as the basis of scenario planning and forecasting, or who may need to modify their own
models to match the updates. Three (of the many) problems that are observed in this setting are:
• Efficient access to information: No engineer has complete knowledge of all the data in
the system and finding the relevant piece of information required to make a decision is a
challenge.
• Unified view of the information: Each simulation model models one facet (reservoir,
network, etc.) of the oilfield in detail. However, a unified view of the information related to
the asset elements is generally not accessible from one place or application.
• Knowledge management: As the models are constantly being calibrated and decisions are
taken, the rationale (knowledge) behind the changes and decisions is generally lost. Such
knowledge could be extremely useful for auditing the decisions made and also for training new
engineers.
Similar problems are observed widely in IT-enabled businesses and IT-enabled science (e-science),
as large-scale instrumentation of physical and non-physical elements has led to increasing
amounts of data being generated and used. Users are increasingly overwhelmed by the
large volumes of data generated, and systems that help them quickly search for the right data and
access it are the need of the hour. Such systems, which we call knowledge bases, should capture the key
information in the data objects, the business context in which the data was created, the context in
which the information in the data set is applicable, etc. This makes it possible for the user to
search for information using terms relevant to and within the context of the business.
4.2 Ontology Design
In this section we briefly discuss our approach to building ontologies, modularizing them, and
their salient elements. Our ontologies were built for an in-house application, and thus our
design goals are different from those of POSC Caesar's Oil and Gas Ontology (OGO), which
is intended to be a comprehensive domain vocabulary for the industry [114]. Our ontologies
were also designed to be used as the basis of a knowledge base, and for efficiency reasons we
have designed them to be small and modular. Currently we have designed three ontologies with
approximately 300 classes; the OGO ontology, in contrast, has more than 10,000 classes. Fig.
4.1 shows the modularization of our ontologies as a set of ontologies that capture different aspects
of the problem space; this is similar to the approach presented in [58]. The ontologies in the lower
levels of the figure use the ontologies in the upper layers. OWL provides the ability to import
ontologies defined elsewhere, which makes it easy to modularize ontologies. The elements of the
three layers in the figure are described below.
Figure 4.1: Ontology design in IAM
Domain-independent or upper ontologies: These describe concepts that are independent
of oil-field semantics. We considered (re-)using the time ontology from Pan et al. [69]
and the Units ontology from the SWEET ontologies [145]. However, we found that both these
ontologies are too detailed for our needs, and we hence created smaller subsets of them.
Larger ontologies are undesirable because they decrease the performance of reasoning.
Domain ontology: This defines the elements of the oil field domain. The elements of the domain
ontology are shown on the left-hand side of Figure 4.2. It mainly consists of physical entities
in the oil field such as Well, Completion, Reservoir, Fault, Zone, etc. Important properties of
these entities, like initial hydrocarbon estimates for a reservoir or on-stream dates for wells,
are also captured in the domain ontology. Furthermore, important relationships between entities, like the
drawsFrom relation which links Well and Region, are also captured in the ontology. As can be
seen below, the domain ontology itself uses elements of the domain-independent ontologies.
Figure 4.2: The domain ontology
The domain ontology is an important artifact in our system, and building it poses many interesting
challenges. Since the domain ontology needs to be application independent, i.e., store
metadata about the domain elements in the different data stores, it must have minimal ontological
commitment, i.e., the amount of information in the domain model needs to be minimal. For
example, the semantics of an entity called Well is different in different simulation models in the
system. The domain model must commit to the least common properties of these applications.
When designing a generic, application- and tool-independent domain model, we must also address the
heterogeneities in the way the elements of the domain model are represented
in different data sources. Two examples of such heterogeneities most commonly observed in
our domain are aliasing, where the same entity is addressed by different names in different data
sources, and scaling heterogeneities, where different unit systems (SI vs. oil field units, etc.)
are used to encode values. Aliasing is addressed in our system by allowing each entity to have
multiple names. To deal with this problem we have created a logical entity called the asset inventory,
which contains the physical entities that are part of the asset. Every model models
one or more of these physical entities, and thus a reference from the entities in the model to the
entities in the asset inventory is created. Even if an entity is addressed by a different name in
different models, since it points to the same entity in the asset inventory, the system understands
it to be the same physical entity. The use of a units ontology allows us to address the problem
where different data sources use different unit systems to record the information.
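As an illustration of how the asset inventory resolves aliases, a query of roughly the following form can be used to find all data objects that reference the same physical well, regardless of the local name each model uses (the refersTo property and the well name "A-12" are hypothetical): "SELECT ?model WHERE { ?model refersTo ?well . ?well rdf:type Well . ?well hasName "A-12" }". Because every model element points to the corresponding inventory entity, the match is made on the inventory well rather than on the model-local name.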
Application specific ontologies: We envisage our ontologies to be used in many IAM ap-
plications. An example of an application is the metadata store for simulation models. Another
example is the Design Space Exploration tool, which enables the engineer to efficiently explore
the design space of a problem by using simulations of different granularities. Although these
also model elements in the same (oil field) domain, they differ from the domain ontologies
in that their usage scenarios are restricted to certain applications. Typically, application-specific
ontologies build upon the domain and upper ontologies by using some of the entities described in them.
As an example, the metadata catalog has its own application-specific ontology, which is shown in
Figure 4.3. As described earlier, each class in the metadata ontology is used to define the metadata
associated with one kind of data object, e.g., geological model, reservoir simulation model, network model,
etc. Each data object has metadata summarizing its contents in terms of the entities in the domain
ontology. For example, for a simulation model, the domain
elements it models and the key properties of these elements are captured.
Figure 4.3: The metadata ontology
4.3 Metadata Catalog
4.3.1 Workflow and Components
Figure 4.4 shows the workflow which describes how the metadata from simulation cases is added
to the metadata catalog and then queried and used by various IAM applications. The main components
of the system are listed below; the workflow for publishing comprises steps 1-3 and for querying
step 4.
Figure 4.4: Metadata Catalog workflow
1. Publish UI/IAM Agent: In a system without the metadata catalog, a simulation case is
created and validated by an engineer and copied onto a shared network directory. In our
system, the engineer either publishes it through a user interface or, as before, just copies it to an
agreed-upon shared network directory. In the former case, when the user submits a file,
the system copies the file to a shared directory and proceeds through the rest of the
workflow. An IAM agent constantly polls this location for new simulation cases and accesses
the model. The rest of the steps are performed asynchronously and the user is notified
through email when they are finished.
2. Metadata extraction: The metadata is extracted from the simulation model by using metadata
extractors. Custom metadata extractors are created for each kind of simulation model
in the system. We have created metadata components that parse ASCII-based documents
as well as components that use custom APIs to access information in the simulation models.
After the metadata is extracted, reasoning is performed to materialize all the information
inferred from the information extracted from the model.
3. Inference and Upload: Based on the information (metadata) presented and the OWL
ontology definitions, additional information can be inferred. In contrast to a materialized
knowledge base, which performs inferencing to create all the newly inferred information
when new information is added to the KB, a non-materialized KB derives the inferred
data only when a query is issued to it. The advantage of a materialized KB over
a non-materialized one is that query answering is faster; on the other hand, adding information
to a materialized KB can be a slow process, and materialized KBs occupy more space. We have
adopted the materialized approach in our application because the frequency of publishing information
into our KB is relatively low compared to the frequency of queries.
4. Query: After the information is uploaded into the KB, IAM users can access it through
the various applications which in turn query the KB and present the information in intuitive
ways.
4.3.2 Implementation
4.3.2.1 Metadata Extractors
Another important component in our system is the metadata extraction component. Our implementation
of this component consists of a set of parsers that extract the metadata from the models.
Since the abstractions in the metadata catalog may be quite different from the abstractions in the
simulation models, the parsers often contain logic that transforms and aggregates the data in the
simulation models to make it appropriate for the metadata catalog (MDC). So far, we have developed three classes of
metadata extractors:
• Parsers that extract from plain text based models. Many simulators use as input a model
specified in a plain text file. The formats of the text files are generally fixed and the formats
are typically specified in user manuals.
• Extractors from tools that allow access to the simulation parameters through specialized
API.
• Extractors for the IAM tools that generate and consume data in XML format.
Building metadata extractors for the text-based simulation models can be extremely tedious
and time consuming because of the complexity of the simulation models. For example, one of the
extractors was for simulation models used by a tool developed in-house at Chevron, which contains
hundreds of keywords and where the model itself tends to be hundreds of megabytes in
size. The parser we have created is based on an (LALR) grammar that we have reverse engineered
for the model file. This reverse engineering, based on the user manual, makes the component fragile,
and we are currently investigating techniques to improve the metadata extraction components.
The only research project that we are aware of that addresses this problem is [98]. On the
other hand, some of the tools expose the data in the simulation models through API calls, and creating
extractors for such models is a much easier task. Finally, the IAM tools create XML models, and
since the schemas and tools are defined by us, creating metadata extraction components for these
models is a lot of, but straightforward, engineering.
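As a much-simplified illustration of the first class of extractors, the sketch below scans a plain-text simulation deck for a keyword and returns the value that follows it on the same line; the keyword and file name are hypothetical, and a real extractor for the in-house simulator is driven by a full LALR grammar rather than line-by-line scanning.

// Much-simplified sketch of a text-based metadata extractor: find a keyword in
// a plain-text simulation deck and return the value that follows it.
import java.io.BufferedReader;
import java.io.FileReader;

public class SimpleKeywordExtractor {
    public static String extract(String fileName, String keyword) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.startsWith(keyword)) {
                    // Return whatever follows the keyword on this line.
                    return trimmed.substring(keyword.length()).trim();
                }
            }
        } finally {
            in.close();
        }
        return null; // keyword not found
    }
}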
4.3.2.2 OWL Knowledge base
Our implementation of the metadata catalog uses the Jena package for OWL support and SPARQL-based
querying of the OWL/RDF graphs. The RDBMS-backed RDF store (triple store) implemented
by Jena is ontology independent and consists of a simple database schema that stores a triple as
a single row in a table (additional tables are used for reification, etc., as described in [15]). As
mentioned earlier, OWL inferencing is performed when the data is published and uploaded into
the RDF store. We have used the Jena rule-based reasoner for reasoning; its expressiveness is
similar to that of OWL-Horst [148].
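A minimal sketch of this setup with the Jena API is shown below (in-memory models and illustrative file names are used here; our deployment writes the materialized triples to Jena's RDBMS-backed store instead).

// Sketch: materialize OWL inferences at publish time with Jena's rule-based
// reasoner, then keep the materialized triples for querying.
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.*;

public class PublishWithInference {
    public static void main(String[] args) {
        Model data = ModelFactory.createDefaultModel();
        data.read("file:ontology.owl");            // ontology definitions (illustrative name)
        data.read("file:extracted-metadata.rdf");  // metadata produced by the extractors

        // Wrap the asserted triples with Jena's OWL rule reasoner.
        OntModel inf = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RULE_INF, data);

        // Materialize: copy asserted plus inferred statements into a plain model.
        Model materialized = ModelFactory.createDefaultModel();
        materialized.add(inf);
        System.out.println("Triples after inference: " + materialized.size());
    }
}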
A web service similar to the one specified in the SPARQL protocol [35] is used to wrap
the database so that the implementation is abstracted from the underlying data store used. This was
done because we wanted to be able to write clients in any language (e.g., .NET languages) and also
so that a different implementation of a semantic data store (e.g., Oracle or Sesame) can be used
when it becomes available. The Apache Axis2 web service engine has been used to implement
the web service.
4.3.2.3 Tool 1: Metadata Catalog Browser
The metadata catalog browser allows the user to look at all the data that has been published and
to search for particular data based on the entities defined in the domain and metadata ontologies.
It also allows the user to search the metadata catalog in intuitive ways. As an example, the
user could search for the reservoir models in which the OOIP of the reservoir is greater than a
certain number (a sketch of such a query is given after the figure). The user could also search for
and navigate through data objects based on their relationships to other data objects. For example,
a search could be to find all the reservoir simulation models which are coarse-grained versions of
a given model. A screenshot from the tool we developed is shown below.
Figure 4.5: Screenshot of the IAM search utility
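As a concrete sketch of the first kind of search mentioned above, the OOIP search maps onto a SPARQL query of roughly the following form (the class and property names are illustrative, not the exact terms of our ontology): "SELECT ?model WHERE { ?model rdf:type ReservoirSimulationModel . ?model describes ?res . ?res rdf:type Reservoir . ?res hasOOIP ?v . FILTER (?v > 1000000) }".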
4.3.2.4 Tool 2: OOIP Comparison utility
An important uncertainty parameter that is used in different kinds of models like the geological
model, reservoir model and the field model is the OOIP information of the whole field as well
as different regions of the field. A tool that allows the geologists or the reservoir engineers to
compare the assumptions about the OOIP values in each of the models is invaluable
and provides insight into how the other engineers understand the asset. The screenshot
below shows a tool that provides such functionality. The regions in one model are not always
the same as those modeled in another and thus a way to map the regions is provided, so that
the user can compare regions in different models. This application also provides a unified view
of the information across models, because the estimates in different models can be compared
irrespective of the various heterogeneities across the models.
Figure 4.6: Comparing OOIP estimates across models
4.3.3 Performance
We performed our performance and scalability studies to understand the characteristics of the
inferencing and query answering times of the OWL data store. Our studies are based on an
implementation that uses the Jena API and reasoners. We have used the ontology developed as
described above; our domain ontology has 60 classes and our metadata ontology has 50 classes.
We have created synthetic data generators which create and populate the metadata for the KB.
To make the data more realistic, we used a real simulation model of each kind as the prototype,
which is cloned and modified by the synthetic data generator. We performed our scalability
testing for three data sets: small, typical, and large. All the experiments
were performed on a commodity Windows machine with a 2 GHz dual core processor and 2 GB
of RAM.
4.3.3.1 Performance of Inferencing
We have used the Jena default OWL reasoner as well as the Pellet reasoner for performing
inferencing on the data. The Jena default reasoner is rule based and uses forward chaining to
infer data, whereas the Pellet reasoner uses tableaux algorithms to perform the reasoning. To
measure the performance we recorded the time and maximum memory consumption of the process.
The performance of OWL inferencing for an OWL-DL ontology is shown below. Although
the rule-based implementation of Jena seems to outperform the tableaux-based Pellet, it is quite
slow: reasoning over 1 million triples took nearly 6 hours. Moreover, its memory requirements
grow steeply and the reasoning fails for a KB with 2M triples. The Pellet implementation, on
the other hand, is more compute intensive but its memory requirements are much smaller. This
observation motivated us to use a parallelization approach to scale the OWL reasoning process
for the rule-based reasoner. More details of this study are presented in Chapter 5.
4.3.3.2 Performance of SPARQL queries
To benchmark the query performance and scalability of the implementation we used a real dataset
as well as synthetic datasets. The real dataset consisted of metadata extracted from the simulation
cases for a real deployment scenario.
Figure 4.7: OWL-DL reasoning
To create the synthetic datasets we cloned the real metadata
and modified the values of the metadata fields to obtain data sets of various sizes. The bench-
marked operations are chosen from those that were typically observed when using the current
IAM system. Figure 4.8 shows the time taken to execute the queries for one real dataset with
600K triples and three others with 800K, 1.3M and 2.5M triples. We expect that our metadata
catalog will not exceed this size in the next few years.
The first thing to note is that some of the queries are slow even for the simplest dataset.
This is because the queries are quite complex. Complexity of a query can be measured based
on the number of triple patterns in the query- particularly those joined by AND or OPTIONAL
connectors. This is because for the most state-of-the-art triple stores like Jena, Oracle, Sesame etc.
that use giant triple tables [1] to store the RDF data, each AND connector gets translated to a SQL
self-join operation. This is a very expensive operation, thus slowing down the SPARQL execution
time. Using other approaches like property tables [154] or column oriented databases [1] might
be a good way to reduce the query execution time and is an area of future work.
Another interesting characteristic of this graph is the scalability of query time with respect to
the size of the dataset. For some queries, the query time seems to scale linearly until a certain
dataset size.
Figure 4.8: Graph showing the execution time for queries from the metadata catalog implementation
For datasets of larger sizes the performance seems to get worse. This is further
illustrated in Figure 4.9. This is observed because for smaller datasets the database can cache
many of its data structures (data, indexes, etc.) and little I/O is performed. For larger datasets,
however, the cache is not big enough to hold these data structures, leading to more I/O activity.
One simple way to overcome this is to allocate more memory to the database, but of course this
is not a scalable approach; finding a more general and scalable solution remains an open problem.
4.3.4 From research prototype to deployment: Challenges, lessons learnt
The ideas were first developed as a prototype at USC and, after a successful demonstration,
were developed into a full-strength deployable application.
Figure 4.9: Graph showing the trend of query execution time for datasets of different sizes
On the business side, one of the most challenging steps during the technology transition was
selling the idea to the business unit. Although we obtained reasonable buy-in on the business
value of the functionalities and applications offered by the metadata catalog and related tools,
questions were raised on the feasibility and the choice of OWL/RDF for the solution. The
arguments in favor of the use of
OWL were:
1. The longer term prospects of using RDF as the data model for integrating various kinds of
data aligned with the key goals of IAM.
2. Success of previous W3C standards. Thus we felt that we could expect much better tools in
the near future. Announced support from major vendors like Oracle and IBM encouraged
us further.
3. The recognized need for, and active research on, the use of OWL for knowledge management
applications.
Two key risks identified were the performance and scalability of inferencing and query answering
in RDF, and the lack of a skilled workforce familiar with these standards. To address the
first risk, extensive performance studies (above) on the prototype data and for the expected scales
of typical deployments were presented to the managers, and we committed to addressing this
issue at every stage of development. The second issue was addressed by committing the USC
researchers to educate the engineers hired for the project about OWL, RDF and the relevant tools.
Moreover, the ownership of the key task of ontology development was retained by the researchers
at USC. Since the overall architecture of the system to be developed did not change significantly,
a vast amount of code was either reused or formed an excellent starting point for the engineers
to become familiar with and quickly develop the technology. Based on this input and planning,
the project was approved and a software development company was hired to develop it.
An agile methodology, as specified in Chapter 3, was used for software development. These
were some of the lessons learnt during the development of the project:
1. The OWL ontology was kept small and modular. We followed a bottom-up development of
ontologies, in which only elements relevant to the required functionalities were included.
Integration with existing but not widely used ontologies was not done, and any future
requirement for this was to be handled by ontology mapping.
2. The performance of the system was carefully planned and tracked throughout the product
development cycle. Standard performance-enhancing techniques, such as caching, pre-fetching
of data and threading, were applied. Moreover, the publish process was made completely
asynchronous.
3. Be cognizant of which OWL features your tools support; most tools are not fully compliant
with the standards.
4. Since the change management technique was not available for much of the development
process, the schema changes were planned carefully at the beginning of each sprint and the
schema was not changed often. Even our change management technique addresses only a
small part of this problem, so schema changes between sprints should be planned carefully.
5. Design for change. It is very likely that future RDF data stores will offer much better
performance and scalability than the existing ones. Although obvious, it is always good
to reiterate that KB query/data access components should be cleanly separated from the
business logic and UI; a minimal sketch of such a separation is shown after this list.
Preferably use SPARQL (as opposed to the many non-standard query languages) because
most products will be forced to support this language in the future.
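A minimal sketch of this separation follows, assuming a hypothetical MetadataCatalogGateway interface; neither the interface nor the implementation class names correspond to actual classes in the deployed system.

import java.util.List;
import java.util.Map;

// Business logic and UI code depend only on this interface; all SPARQL/RDF-specific
// details live in a single implementation that can be swapped when a better store appears.
public interface MetadataCatalogGateway {

    // Run a SELECT query (expressed in SPARQL) and return each solution as a variable-to-value map.
    List<Map<String, String>> select(String sparqlQuery);

    // Publish a set of metadata statements for a newly registered data object.
    void publish(String dataObjectUri, Map<String, String> metadata);
}

// Possible implementations (not shown): JenaCatalogGateway, OracleCatalogGateway, SesameCatalogGateway.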
Figure 4.10 shows an approximate breakup of the overall time spent on development of the
important components of the system. These statistics were obtained from the actual planning
documents created during the execution of the project over a period of nearly one year, amounting
to about five person-years of effort.
Some observations are as follows:
1. Ontology engineering contributes to a small part (about 2%) of the whole development
cost. This might suggest the use of a lightweight ontology engineering process like RapidOWL
[9] as opposed to comprehensive ones (like Methontology [47]). We must note that prior
experience in building such ontologies, keeping the ontologies simple and focused on the
requirements, and the use of fairly shallow/less expressive semantics might have been factors
in driving the effort down significantly.
Figure 4.10: Graph showing the breakup of development time spent on various components of
the system
2. The user interface components were the most cost intensive, contributing to about 33% of
the whole development cost/time. We think that this is indicative of the importance and
difficulty of creating carefully crafted usability paradigms for the non-technical user. All
the effort expended in this task seems to have paid off, as the early users of the system have
reported that it provides a “pleasurable” user experience.
3. The business logic and queries constituted a significant chunk of the development cost.
Our applications did not involve much complex business logic, so most of this effort went
into SPARQL query design. We believe that the engineers' limited familiarity with this
language and the lack of tools for SPARQL (such as syntax checkers) made this task quite
complicated.
4. Since we were able to reuse much of the code and ideas from our research prototype,
the time spent on the two major components, the metacatalog service implementation and
the parsers for the datasets, was relatively low.
5. The misc category consists of activities related to security and report generation.
Research in the ontology community has focused on engineering and related tasks for ontologies,
such as cost estimation [144]. However, there has not been much work that addresses the whole
development cycle for an ontology-based application, and there is a need for empirical and
theoretical work on this. Intuitively, traditional and mature software engineering techniques like
COCOMO [26] should be applicable to ontology engineering, but there is a dearth of experience
with, or validation of, such a hypothesis; this should be an interesting area of future work.
4.4 Virtual data integration using semantic lookup and services
While the Metadata Catalog can be used for information that changes very infrequently, like
data in simulation cases, it is inappropriate for data that changes frequently (e.g., measurement
data). For such data, we need a framework which allows us to create components that pull
information from the data sources as required. Our virtual data integration approach is based
on the following observations:
1. Raw data from the data sources usually needs to be transformed and aggregated before it
can be used in analysis.
2. Programming languages provide the right degree of expressiveness and ease of use for these transformations.
3. The transformed and aggregated data should be accessible through a standard interface.
4.4.1 Motivating Example
To motivate our technique we examine a simple data aggregation workflow from the petroleum
industry. Before we explain the workflow itself, we introduce the following terminology from
the domain. A well is an entity that produces oil, water, and gas. A block is a set of wells. The
production of a block is the sum of the production of its constituent wells. The oil, water, or gas
production for a well or a block is often represented by a “recovery curve” or a “decline curve”
for that well or block. A decline curve is a plot of production volume versus time. Decline curves
can also be plotted to show production volume versus the fraction of total oil in place recovered
up to that time step.
One of the inputs to our aggregation workflow is a data-structure called WellProduction-
Data. This data structure holds the production information for a well and the data is pro-
duced by a simulator. However, higher-level tools require the aggregated production data at
the Block level. We will refer to the aggregated data as BlockProductionData. To obtain the
BlockProductionData from the WellProductionData, an additional data structure called
BlockData is used. This data structure contains information about a block, such as the set of
wells contained in the block and some other pieces of data used for aggregation. It is generated
either from a database or from another workflow, i.e., it is retrieved from another web service.
A simplified flow chart of the workflow that computes BlockProductionData from
WellProductionData is shown below. Note that well-to-block aggregation is
not a straightforward summation, and involves some complex calculations like finding moving
averages.
Figure 4.11: Flow-chart of an example aggregation workflow
We want the developer of this aggregation program to be unconcerned with the issues in web
service development. These include service discovery, service invocation, and data serialization
and de-serialization. The developer may also want the data aggregation program to be deployed
into the IAM framework, so that it can be re-used in other aggregation workflows. Thus we
require a framework that enables us to create and deploy shared data aggregation workflows
without the complexity of understanding web service standards and deployment issues.
4.4.2 Approach
To this end we have created a Metadata based virtual data integration framework which eases
the development of such components by allowing the user to create components that perform
aggregation in a normal programming language (like Java). The framework provides the user
with simplified mechanisms to access the data in the framework through metadata based data
requests and abstracts the complexities of deploying the component in a service environment. For
example, the code in Figure 4.12 shows a sample data aggregation program in our framework.
The most interesting part of this code snippet is lines 19-21, which contain the code to obtain
the data required for aggregation. The requests for the data are specified as calls to the DataFactory,
which abstracts all the complexity of service discovery, service invocation and data serialization
and deserialization. The parameters passed to the DataFactory are the type of object required and
a set of predicates that the data is required to satisfy. The user in turn obtains a Java bean which is
used to perform the aggregations.
The other important parts of this code snippet are lines 1-5 and 7-13. These lines contain
annotations within the source code which specify the metadata used while deploying the
aggregation program into our framework. The contents of the metadata descriptions are similar
to those of a data request: they contain the type of object returned by the service and some
predicates describing it. Once this program is written, it is compiled and debugged as a normal
program would be, and then deployed into our framework. The aggregation program becomes
accessible as a web service and can be similarly searched for and invoked in other aggregation
programs.
Figure 4.12: Code snippet for the example aggregation program
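The following is an illustrative sketch (not the actual code of Figure 4.12) of what such an annotated aggregation program might look like. Every framework type in it (the @DataService annotation, DataFactory, and the data beans) is a hypothetical stand-in for the interfaces described above, stubbed out only so that the sketch is self-contained.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the IAM framework types described in the text.
@Retention(RetentionPolicy.RUNTIME)
@interface DataService { String outputType(); String[] predicates() default {}; }

class WellProductionData { String well; Map<String, Double> volumeByDate; }
class BlockProductionData { String block; Map<String, Double> volumeByDate; }
class BlockData { String name; List<String> wells; }

class DataFactory {
    // In the real framework these calls are resolved to web services via the metadata-based
    // discovery of Section 4.4.4; here they are only stubs.
    static <T> T get(Class<T> type, String... predicates) { throw new UnsupportedOperationException(); }
    static <T> List<T> getAll(Class<T> type, String... predicates) { throw new UnsupportedOperationException(); }
}

// The aggregation program itself: deployment metadata as annotations plus plain Java logic.
@DataService(outputType = "BlockProductionData",
             predicates = { "domainObject IN Config1.Block-A" })
public class BlockProductionAggregator {

    public BlockProductionData aggregate(String blockName) {
        // Metadata-based data requests; discovery, invocation and (de)serialization
        // are hidden behind the DataFactory abstraction.
        BlockData block = DataFactory.get(BlockData.class, "name = " + blockName);
        List<WellProductionData> wells =
            DataFactory.getAll(WellProductionData.class, "domainObject IN " + blockName);

        BlockProductionData result = new BlockProductionData();
        result.block = blockName;
        // Well-to-block aggregation (summation plus smoothing such as moving averages)
        // over the returned Java beans would be implemented here, using block and wells.
        return result;
    }
}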
4.4.3 Modeling
1. Domain model: This is an ontology that models the domain i.e. the entities in an oil-field
and their inter-relationships. This model provides an intuitive query vocabulary for the user.
For example, a user may specify that the data required is the WellProductionData
for all the Wells in a Block called Block-A. The notions of what a Well and a Block are, and
their containment relationship, are defined in the domain model. A more detailed description
of the domain model is out of the scope of this work and is given elsewhere [157]. A
simplified version, which only considers parent-child relationships among the entities of a
domain, is currently used and is defined below.
A domain object is a 4-tuple Do = (K, Nd, C, P) where:
• K is the kind/class of the domain object (for simplicity we assume here that it is a
string),
• Nd is the name of the domain object,
• C = {Do_1, Do_2, ..., Do_n} is the set of domain objects contained by Do,
• P is the parent of the domain object.
A domain configuration (or scenario) is a 2-tuple DC = (Na, D) where:
• Na is the name given to the configuration,
• D = {Do_1, Do_2, ..., Do_n} is the set of domain objects in the configuration.
A domain configuration is the context under which relationships between domain objects
are defined. Thus a relationship between domain objects must also specify the configura-
tion under which the relationship will be resolved (e.g. all Wells in Config1.Block-A).
2. Data Model: The data model is defined as an ontology of classes where each class is defined
by a set of meta-data properties and a data schema. The meta-data properties in our system
are similar to those in [131, 159] and contain information to identify objects, track their audit
trails, etc. The schema of a class defines the various properties of the data object. Although the
meta-data properties and the data schema are very similar (both define key-value pairs) and
the difference between data and metadata is often tenuous, this distinction is very important
to us because the metadata is defined and manipulated within our system whereas the data
schema is defined, stored and manipulated by “external” systems. This approach also
allows us to reuse the data schemas published by standards bodies, software vendors, etc.
More formally, a class is defined by a 4-tuple C = (N, R, M, S) where:
• N is the name of the class,
• R = {C_1, C_2, ..., C_k} is the set of parent classes,
• M = {pm_1, pm_2, ..., pm_n} is the set of meta-data properties, where each pm_i is a
2-tuple (T, Np) in which T is the type of the property and Np is its name,
• S is the schema for the class (which defines its properties). S is given by a 2-tuple
(Ns, P) where Ns is the name of the schema and P is the set of properties, each given
as a 2-tuple (T, Np). A special kind of class, called an opaque class, has a schema
with only one property, its value, i.e., P = {value}. This kind is used to represent
binary and other legacy objects (like ASCII files) in the system.
An object is given by a 4-tuple O = (C, Vmd, Vd, π) where:
• C is the class the object belongs to,
• Vmd = {vm_1, vm_2, ..., vm_n} is the set of values assigned to the meta-data fields,
where each vm_i is a 2-tuple (Np, Vp) in which Np is the name of the meta-data
property and Vp is the value attribute,
• Vd = {vd_1, vd_2, ..., vd_k} is the set of values attached to fields from the schema,
• π = {v_1, v_2, ..., v_w}, where each v_i is a 2-tuple (Np, Vp); together these represent
the set of parameters needed to create the object O. Our framework models both real
objects and virtual objects. Virtual objects are created by services and (in general)
have π ≠ ∅, whereas real objects already exist in some persistent store and have π = ∅.
Thus, to identify or create virtual objects, we also need to specify the input parameters
π for the service.
Finally, we define an ontology as a 3-tuple (C, D, O) where:
• C = {DC_1, DC_2, ..., DC_m} is a set of domain configurations,
• D = {Do_1, Do_2, ..., Do_n} is a set of data objects,
• O : S → bool is a function (an oracle) that takes a statement and indicates whether
that statement is entailed by the ontology or not.
3. Data source Model: The data source model captures the semantics of the services, so
that they can be advertised, discovered and invoked. Since all the web services in our
framework are data providers, the captured semantics of the web services are all related
to the data. In particular they describe the data provided by the web service in terms of
our data and domain ontology. A service advertisement in our framework consists of the
type of the output the service delivers, the types of the input parameters, meta-data predicates
that describe the output provided, and the range of the data provided by the service, defined
as predicates over the fields of the output type. A predicate is given by a 3-tuple (P, op,
val) where P is the property asserted upon, op is an operator and val is the value. The
predicate values define a (possibly infinite) set of objects that constitutes the range of objects
catered for by the service. They can be defined as expressions over primitive data types (int, float,
Date, etc.) or over objects/fields from the data and domain models. For the meta-data predicates,
the field comes from the data model described above, while for the data predicates it is obtained
from the schema in which the output type is defined.
A service profile is given by a 4-tuple S = (C_S^out, I_S, MP_S, DP_S) where:
• C_S^out is the type of the output of the service,
• I_S = {p_1, p_2, ..., p_k}, where each p_i is a 2-tuple (N_i, T_i), represents the input
parameters of the service,
• MP_S = {mp_1, mp_2, ..., mp_m} is the set of predicates over the meta-data,
• DP_S is the set of predicates over the fields from the schema of C_S^out.
A plain-Java sketch of these structures is given below.
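As referenced above, the following is an illustrative plain-Java rendering of the profile, request and predicate tuples; the record names are hypothetical and do not correspond to actual framework classes.

import java.util.List;
import java.util.Set;

public class ServiceModel {

    // A predicate (P, op, val) over a meta-data field or a schema field.
    record Predicate(String field, String op, String value) { }

    // An input parameter (N_i, T_i) of a service.
    record Parameter(String name, String type) { }

    // A service profile S = (C_out, inputs, meta-data predicates, data predicates).
    record ServiceProfile(String outputType,
                          List<Parameter> inputs,
                          Set<Predicate> metadataPredicates,
                          Set<Predicate> dataPredicates) { }

    // A request has the same shape; it identifies the set of objects needed for an aggregation.
    record Request(String outputType,
                   List<Parameter> inputs,
                   Set<Predicate> metadataPredicates,
                   Set<Predicate> dataPredicates) { }

    public static void main(String[] args) {
        ServiceProfile profile = new ServiceProfile(
            "WellProductionData",
            List.of(new Parameter("startDate", "Date"), new Parameter("endDate", "Date")),
            Set.of(new Predicate("Producer", "=", "Simulator")),
            Set.of(new Predicate("ProductionDate", ">=", "startDate")));
        System.out.println(profile);
    }
}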
4.4.4 Automatic service discovery
Service discovery in our framework involves matching service advertisements (profiles) with
requests. Recall that a service advertisement is given by a 4-tuple S = (C_S^out, I_S, MP_S,
DP_S). Similarly, a request is a 4-tuple R = (C_R^out, I_R, mp_R, dp_R). Note that a request
tuple is quite similar to an object tuple, because the request identifies a set of objects needed for
the aggregation. Our matching algorithm matches the outputs (C_S^out, C_R^out) and input
parameters (I_S, I_R) as described in [11]. The match is rated as exact (C_S^out = C_R^out),
plug-in (C_S^out subsumes C_R^out), subsumed (C_S^out is subsumed by C_R^out), or fail.
In addition to inputs and outputs, we also need to match the service and the request predicates.
To define predicate matching, we define a function INT : P_f → D, with D ⊆ range(f).
Intuitively, INT is an interpreter function that maps a predicate P over a field f to a set of values
D; the set D is a subset of all the permissible values of f. We then say that a predicate P_S^f
satisfies P_R^f iff INT(P_R^f) ⊆ INT(P_S^f). This is written as P_S^f ⇒ P_R^f. While
matching the predicates of advertisements with those of a request, three cases occur:
1. Perfect match: P_S^f ⇒ P_R^f.
2. Failed match: ¬(P_S^f ⇒ P_R^f).
3. Indeterminate: this occurs when, for a predicate P_R^f in the request, the service does not
advertise any predicate P_S^f over the same field. We make an open-world assumption and
consider an indeterminate match as a potential candidate.
The scoring function is ordered perfect > indeterminate > fail. All services with even one failed
predicate match are discarded as possible candidates. This is intuitive because, if the user wants
data from a Sensor (mp_R: Producer = “Sensor”), it is not acceptable for her to get data from a
Simulator (mp_S: Producer = “Simulator”), even if it is for the same entity and with the same
timestamp. Thus a service is a match for a given request if its outputs and inputs are compatible,
i.e., have an exact, plug-in or subsumed relation, and the (metadata and data) predicates all match
perfectly or indeterminately. Although the INT function is a good abstraction for defining the
notion of predicate satisfiability, from a more practical standpoint the check is performed by
translating P_S^f ⇒ P_R^f to an equivalent statement that can be answered by the (oracle
function O of the) ontology.
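The following sketch illustrates the rating of a single request predicate against a service's advertised predicates, under the simplifying assumption that the ontology-backed satisfiability check (P_S^f ⇒ P_R^f) is available as a boolean helper; only the trivial equality case is stubbed in here.

import java.util.Set;

public class PredicateMatcher {

    record Predicate(String field, String op, String value) { }

    enum Match { PERFECT, INDETERMINATE, FAIL }

    // Rate how well a service's advertised predicates cover one request predicate.
    static Match rate(Set<Predicate> advertised, Predicate requested) {
        boolean sameFieldSeen = false;
        for (Predicate p : advertised) {
            if (!p.field().equals(requested.field())) continue;
            sameFieldSeen = true;
            if (satisfies(p, requested)) return Match.PERFECT;
        }
        // Open-world assumption: no advertised predicate over this field still leaves
        // the service as a potential candidate (indeterminate).
        return sameFieldSeen ? Match.FAIL : Match.INDETERMINATE;
    }

    // Placeholder for the ontology-backed satisfiability check; richer operators would be
    // translated to a statement answered by the oracle function O of the ontology.
    static boolean satisfies(Predicate service, Predicate request) {
        return service.op().equals("=") && request.op().equals("=")
            && service.value().equals(request.value());
    }

    public static void main(String[] args) {
        Predicate req = new Predicate("Producer", "=", "Sensor");
        Set<Predicate> adv = Set.of(new Predicate("Producer", "=", "Simulator"));
        System.out.println(rate(adv, req));   // FAIL: a Simulator cannot stand in for a Sensor
    }
}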
4.4.5 Implementation
The domain and data model are implemented using OWL in our prototype implementation. We
have used the OWL-S [94] ontology to represent web service descriptions. An OWL-S service
description consists of three parts: the service profile, which describes what a service does and
is used for advertising and discovering services; the service model which gives a description of
how the service works and the grounding which provides details on how to access a service.
The OWL-S standard prescribes one particular realization of each of these descriptions, but also
allows application specific service descriptions when the default ontologies are not sufficient. We
have used this to define our own ontologies to describe web services.
1. Service Profile: This ontology contains the vocabulary to describe service advertisements.
We store the information described in the web service model here. As recommended in the
OWL-S specification, we store these predicates as string literals in the OWL description. In the
previous section, we described how these advertisements are matched with user specifications.
2. Service Model: The service model ontology describes the details of how a service works
and typically includes information regarding the semantic content of requests, the pre-
conditions for the service and the effects of the service. In many data-producing ser-
vices it is common that the parameters for the service actually define predicates over
the data-type. For example a service that returns WellProductionData may con-
tain parameters startDate and endDate that define the starting and ending dates
for which the data is returned. We capture the semantics of such parameters as predi-
cates over the fields. Thus the parameter startDate can be defined by the predicate
“WellProductionData.ProductionDate >= startDate”. By doing this, we
alleviate the need for the user to learn all the parameters of a service and rather let the
user define the queries as predicates over the data object fields. Please note that not all
parameters can be described as predicates over the data fields. For example, a fuzzy logic
algorithm producing a report may require a parameter which describes the probability dis-
tribution function used. This parameter has no relation to the data being produced and is
modeled here. The default service model defined in the OWL-S standard, also defines the
semantics of input parameters using the parameterType property which points to a specifi-
cation of the class. Currently we do not model the pre-conditions and effects of the services.
We do not model the effects of the service because of our assumption that the services are
data producing services and do not change the state of the world. We do acknowledge that
pre-conditions may be required in many cases and we intend to address this as part of our
future work.
3. Service Grounding: The service grounding part of a model describes protocol and message
formats, serialization, transport and addressing. We have used the WsdlAtomicProcess-
Grounding as described in the specification to store the service-access-related (WSDL) data.
This class stores information related to the WSDL document and other information that
is required to invoke the web services like the mapping of message parts in the WSDL
document to the OWL-S input parameters.
Service advertisements are stored in a UDDI repository in our framework as described in
[106]. Recall from Section 4.4.2 that a data request is programmed as a call to the DataFactory.
A request in the program is handled by first inferring the closure of data-types that are
compatible with the required data. This is used to query the UDDI store to retrieve the service
profiles of all compatible services. These are matched according to the ranking criteria scheme
described in the previous section and the best candidate is chosen. Currently our predicates can
have one of the following operators: <, >, <>, =, with values over basic data types (number
types, String, Date) and domain objects. Predicate matching for basic types is quite trivial. For
predicates involving domain objects the only operator allowed is IN, used to define the range of
the data sources as (all or some) of the children of a domain object. For example, the range of
the objects served by a data source can be defined by assigning a meta-data field “domainObject
= ’Well IN Config1.Block A’”. A query requesting data for Well X, i.e., with a predicate
“domainObject = Well X”, can be resolved by querying the domain model. Note that once
the user has written an aggregation, she can debug it as a normal program. We expect that during
that process the request can be refined and the web services are bound as she intended. Once the
candidate web services are found, the information from the Profile and the Grounding part of the
web service model is used to construct a call to the appropriate web service. Please note that some
of the parameters are explicitly specified while some of them are a part of the predicates. The
parameters which map to these predicates are constructed using the information in the Profile. To
improve the performance of the system, we store all the Profile information and the Grounding
information in the UDDI itself. Thus all required information can be retrieved with one call to the
UDDI. We also “remember” the binding associated with a query, thus not incurring the overhead
of a UDDI-access. When a call to the system fails (service may be un-deployed or re-deployed),
the query is re-sent and the new service information is obtained.
4.4.6 Service Deployment
After the aggregation program is written it is deployed into a web service engine. For a typi-
cal web services engine like Apache Axis2 [10], this involves creating deployment description
document(s), packaging the classes and the dependent libraries as a jar file, and copying it into
a specified directory. Apart from this, the OWL-S model for the aggregated service needs to
be constructed and saved into a UDDI store. Most of the semantic information describing the
service is provided as annotations in the source code by the author of the service. This style of
embedding deployment specific information into the source code is a widely accepted technique
in the Java programming community. So, after the author writes an aggregation service as plain
Java code with annotations, it is deployed by executing a pre-defined Ant script, which creates the
deployment descriptors for the web service engine as well as the OWL-S description documents.
In the future we envisage a system with multiple web service engines where service deployment
additionally involves choosing the “best” server. For example an important factor to consider is
to minimize the amount of data that needs to be moved. Thus it may be best to deploy an aggre-
gated service on the same machine as (or one “nearest” to) the data producer. A high-level view
of the various elements of our framework and their relationships is summarized in the UML class
diagram below.
Figure 4.13: Major elements of our framework and their interrelationships
4.5 Related Work
4.5.1 Metadata Catalog
The areas of work most pertinent to this paper are those which focus on data management on the
grid including work on metadata catalogs [131], provenance capture, tracking and management
[129] and use of semantically rich models and semantic web technologies in particular to enable
the above services [121, 83]. In this work we discuss how our system has evolved to incorporate
many key ideas from these works to suit our application needs. Our use of GME and the model
based approach is novel, and has allowed us to build a domain model/ontology with a huge degree
of active participation from the domain expert. The tailored GME environment also doubles as a
tool for definition of what-if scenarios, which acts as a portal for the underlying services in the
system.
Although integrated asset management is of great interest to the petroleum engineering com-
munity, we believe that our project is the only research effort in the computer science community
that is focusing on the challenges of high level workflow orchestration and technical knowledge
management for IAM. There are, however, comparable efforts in other domains that are focused
on exploiting distributed computing (including grid computing) and semantic web technologies.
For instance, the GEODISE project [33] aims to “bring together and further the technologies
of Design Optimization, CFD, GRID computation, Knowledge Management and Ontology in a
demonstration of solutions to a challenging industrial problem” of optimization and design search
for engineering.
Our work is inspired by a wealth of work on model/ontology based information and tool
integration [153, 12]. Our work uses a single ontology approach to data integration [153]. One
important difference between the techniques presented in other work and in our work is that in
[153], the author assumes that all the data sources are databases. Our problem is more complex as
the information stored in a simulation model (or simulation results) is neither as well structured
as in databases nor as easily accessible. In that respect, [12] is closer to our work because they
used the approach to integrate embedded system simulators. However, the simulation models,
their results, and their inter-relationships in the petroleum domain tend to be much more complex,
which makes our problem harder.
4.5.2 VDIF
WS-BPEL [38] is the OASIS standard for writing web service compositions. BPEL allows us to
script composed services and expose them as yet another web service. However, BPEL does
not provide support for adding complex computations within a composition, a key feature required
to support aggregations. BPELJ [25] is a specification that addressed this shortcoming to some
extent by allowing snippets of Java code to be embedded within the BPEL script. The main goal
was to enable fairly simple calculations and transformations to be embedded within the
compositions; it is not clear whether complex computations (such as integrating and invoking a third-party
tool) can be expressed. Creating BPEL scripts also requires the user to have a good understanding
of web services specifications as well as of where the web services are deployed. This conflicts
with one of our key requirements.
SSIS [138] is a tool integrated with MSSQL server which allows the user to build such ag-
gregation workflows. It consists of a visual composer, where data from various sources including
web services can be retrieved and aggregated. It also allows the user to specify complex transfor-
mations and aggregations in a .NET supported language. Although the user need not write much
web-service-specific code, he still needs to be aware of the deployed services. The other problem
with SSIS is that it is tightly integrated with the rest of the toolkit (.NET, SQL Server), and thus
integrating with it is not straightforward.
Much work in recent years has been performed on the automatic discovery and composition of
web services by providing semantic descriptions of the constituent services [94, 134]. However,
typical semantic web service composition frameworks like [146, 93, 134] do not seem to address
the general class of applications where the composed service could also include complex
computations and transformations. We have used these techniques for describing web services to
facilitate discovery and invocation. Our system is built for the more controlled setting of an
enterprise rather than the internet, and we confine ourselves to applications that produce data rather
than those that also change the “state of the world”. These differences have helped us to define
focused domain models which allow us to tailor the semantic-web techniques to our requirements.
Our system is also similar to web service based data integration systems like Prometheus [149].
These systems typically provide a unified database abstraction over a set of web services and
address the problem of how a query is resolved to calls to appropriate web services. However, it
is not clear whether data aggregations can be defined in these frameworks. Compared to these
approaches, we make the simplifying assumption that the data produced by each source is a whole
relation and thus do not concern ourselves with issues like view integration. Data aggregation
and similar workflows occur
commonly in scientific workflows. Specialized frameworks like Kepler [91] and Chimera [51]
(with concomitant service composition languages) have been employed for implementing such
workflows. The major difference between our framework and the above mentioned frameworks
is that they make the assumption that the modules containing the aggregation logic already exist
and (only) provide methods to “wire” them together (using their special language). In our frame-
work, the aggregation and wiring logic are both developed as part of a single program. This is
consistent with one of the primary goals of our system, to avoid the need for the user to learn
a new formalism. Programming languages provide the right level of expressiveness and support
we require to create aggregated services. Although the support for creating and consuming web
services in these platforms is becoming more and more seamless, the user still needs to have a
good understanding of web services, platforms and related issues like XML serialization, SOAP
messaging and stub generation to be able to author such services. Moreover, these languages do
not address the problems of intuitive addressing and discovery of web services. Our technique
addresses some of these concerns by providing the user with an abstract, data-centric interface
for writing aggregations. The hard tasks of discovering and invoking web services are performed by
our framework. In essence, our framework marries the expressiveness of programming languages
with the semantic web services idea of using service metadata to facilitate service discovery for
authoring web service compositions.
Chapter 5
Parallel Inferencing for OWL Knowledge Bases
The best thing about being me... There are so many “me”s.- Agent Smith, Matrix.
OWL provides a rich suite of features for data modeling/knowledge representation. The abil-
ity to represent classes, properties of these classes and class hierarchies is one such feature. An-
other important feature is the ability to define a property as a functional property, a symmetric
property, a transitive property, etc. A functional property is analogous to a single-valued column in
relational databases, i.e., an object can have only one value for that property. A transitive property
is best explained through an example: the brotherOf relationship is a transitive property, i.e.,
if A brotherOf B and B brotherOf C, then A brotherOf C.
The OWL language specification is presented as a set of three dialects- OWL-Lite, OWL-
DL and OWL-Full. Each of these has a varying degree of expressiveness and computational
complexity. OWL-Lite is the least expressive dialect, easiest to implement and least expensive
computationally. On the other hand OWL-Full is the most expressive but undecidable. OWL-DL
falls in between the two in terms of expressiveness and complexity. In this chapter, we use a
popular but non-standard subset of the OWL specification called OWL-Horst, first presented
in [148]. The semantics of OWL-Horst diverges from the standard in a few ways:
• It only covers a subset of the OWL-Lite constructs; mainly, it specifies if-based semantics as
opposed to the iff-based semantics of the OWL specification.
• Unlike OWL-Lite or OWL-DL, it is fully compatible with RDFS, especially with respect to
the notion of using a class as an instance.
Many of the currently available semantic-web tools, both open-source (Jena) and commercial
(OWLIM, Oracle), implement the OWL-Horst rule-set or variants/extensions of it.
We study the parallelization of rule based implementations of OWL reasoning. In general, the
rule based engines for OWL work by compiling the ontology into a set of rules. These are then
applied to the data-set to create the inferred data. The semantics of the rules are defined using
negation-free datalog [151]. Datalog is a simplified subset of Prolog, and reasoning using datalog
has polynomial time complexity. All the rules of OWL-Horst can be written as datalog rules.
Datalog rules are written as head ← body (read: if body then head). The head of the
rule has only one clause and the body of the rule is a conjunction of sub-goals. In fact,
we have observed that a small class of rules called single-join rules can be used to represent
clause and both these sub-goals share a variable. An example of such a rule is the rule used for
reasoning for transitive properties:
(?A brotherOf ?B) AND (?B brotherOf ?C) →?A brotherOf ?C
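As an illustration, this single-join rule can be written directly in Jena's rule syntax and executed with the GenericRuleReasoner; the namespace used below is illustrative.

import java.util.List;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class SingleJoinRuleDemo {
    public static void main(String[] args) {
        String ns = "http://example.org/family#";
        Model data = ModelFactory.createDefaultModel();
        Property brotherOf = data.createProperty(ns, "brotherOf");
        Resource a = data.createResource(ns + "Adam");
        Resource b = data.createResource(ns + "Bob");
        Resource c = data.createResource(ns + "Charlie");
        data.add(a, brotherOf, b);
        data.add(b, brotherOf, c);

        // The single-join (transitive) rule in Jena rule syntax.
        List<Rule> rules = Rule.parseRules(
            "[trans: (?A <" + ns + "brotherOf> ?B) (?B <" + ns + "brotherOf> ?C) " +
            "   -> (?A <" + ns + "brotherOf> ?C)]");

        InfModel inf = ModelFactory.createInfModel(new GenericRuleReasoner(rules), data);
        System.out.println(inf.contains(a, brotherOf, c));   // true: inferred by the rule
    }
}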
5.1 Rule Classes and OWL-Horst
In this section we introduce a classification of datalog rules based on the organization of the
clauses in the body of the rules. The relevance of such a classification is that each class of rules
affords a different level of ease with which the base tuples can be partitioned. The datalog
semantics we consider here is that of negation-free conjunctive datalog as applied to RDF triples.
• Single sub-goal rules: These are the simplest kind of rules, which have only one sub-goal
in the body. Simple class and property hierarchies can be expressed as single sub-goal
rules. E.g., the rule shown below states that if a resource is of type Person, it is also of type
owl:Thing.
R1: ?X rdf:type lubm:Person→ ?X rdf:type owl:Thing
If a rule-set only contains single sub-goal rules, the base-triples can be arbitrarily parti-
tioned and each partition processed independently.
• Chained sub-goal rules: These are rules which have multiple sub-goals, but with the
constraint that at least one of the variables in every sub-goal is joined with a variable in
some other sub-goal. E.g., the rule shown below defines the owl:differentFrom predicate
based on two classes declared to be disjoint. Here the first and second sub-goals share the
join variable ?C, and the second and third share ?D, so the three sub-goals are chained
through the two join variables ?C and ?D:
R2: (?A rdf:type ?C) (?C owl:disjoint ?D) (?B rdf:type ?D)→?A owl:differentFrom ?B
Depending on the number of sub-goals in the rule, chained sub-goal rules can be further
categorized into:
– Single-join rules: Chained rules which have only two sub-goals, and therefore
share at least one (join) variable, are called single-join rules. E.g., the recursive rule
shown below is used to implement an OWL transitive property. The join variable ?B is
shared between the two sub-goals of the rule.
R3: (?A lubm:SO ?B) (?B lubm:SO ?C)→?A lubm:SO ?C
Note that every single sub-goal rule can be correctly re-written as a single-join rule by
repeating the sub-goal twice in the body. E.g., rule R1 is re-written as follows in
R1’: (?X rdf:type lubm:Person) (?X rdf:type lubm:Person)→?X rdf:type owl:Thing
Data partitioning for a rule-base that only contains single-join rules is harder than
partitioning for single sub-goal rules because we must ensure that each pair of tuples
that can fire a rule is present in the same partition. This can be achieved by a
partitioning strategy that uses a static hash function mapping a tuple onto a processor
based on its subject or object. All the tuples which could produce an inferred triple
have a join value in common, and will be hashed onto the same partition.
– Multi-join rules: Chained rules that contain n > 2 sub-goals, and hence at least
n−1 join variables, are called multi-join rules. The differentFrom rule R2 given above
is an example of such a rule, containing three sub-goals and two join variables. Data
partitioning for a multi-join rule-base is harder than for a single-join rule-base because
the static hash function needs to partition a larger combinatorial space of resources.
E.g., for a two-join rule and a data set consisting of n resources, the n² space of
possible combinations of resources must be partitioned such that, for a given pair of
join values, all the tuples that can potentially be joined through them are available on
the same processor. The problem is harder still when we also have to make sure that
the partitions have equal workloads.
• Independent sub-goal rules: These are rules containing more than one sub-goal such that
the sub-goals do not share any variables. E.g., in the rule shown below, the two sub-goals
do not share any variables.
R4: (?A P1 ?B) (?C P2 ?D)→?A P1 ?D
Partitioning a data-set for a rule-set consisting of independent sub-goal rules can be complex
because each rule stipulates a different partitioning of the space, and hence the partitioning
function can be very complex.
All but one of the rules from OWL-Horst can be written as single-join rules, when the T-Box
is known beforehand and no new blank nodes are introduced into the RDF graph. The one rule
that cannot be written as a single-join rule is a two-join rule.
5.2 Performance Model for Reasoning
The theoretical worst-case complexity of datalog reasoning is given as O(n^m), where m is the
maximum number of free variables present in the body of any rule and n is the number of constants
in the Herbrand base, i.e., the number of nodes in the input graph. For single-join rules there can
be at most three variables in the body of the rule; therefore, for OWL-Horst reasoning
the worst-case complexity is O(n^3). However, the actual complexity of the rule-set might vary
from implementation to implementation. In the rest of this work, we assume a simple polynomial
model, where the time T for executing a ruleset r is given as:
T = k_1 r_1 n^3 + k_2 r_2 n^2 + k_3 r_3 n + c        (5.1)
Intuitively, k_1, k_2 and k_3 are hardware-specific constants, r_1, r_2 and r_3 correspond to the
rules that cause cubic, quadratic or linear reasoning times, respectively, and c is a constant
overhead for setup etc.
To verify this model, we performed extensive experimentation with the Jena reasoner and datasets
of various sizes for two OWL benchmarks. Figure 5.1 shows the time taken for reasoning over
data-sets of different sizes (numbers of nodes). The dark curve represents the trend observed for
LUBM data-sets whereas the lighter curve represents the trend observed for UOBM data-sets
with different numbers of nodes. Based on this we have found that a quadratic model (i.e., r_1 = 0)
is the best fit for the performance model. As further validation of our model, we experimented by
removing some of the quadratic rules from the ruleset and obtained linear scaling, as shown by
the dotted line in the figure.
Note that the constants for the two curves are different because the sizes and complexity of
the rules are different for the two benchmarks. The UOBM benchmark contains more rules
than the LUBM benchmark and thus scales more rapidly. We will use this performance model to
guide our workload partitioning techniques in the rest of this chapter.
Figure 5.1: Regressing a performance model from observed reasoning times for LUBM and
UOBM data-sets.
5.3 Partitioning
Since the computational workload depends on the rule-set and the data-set, a viable partitioning
scheme can partition either of them or both (hybrid). Irrespective of how the computational
workload is partitioned, the partitioning scheme must achieve the following goals, so that the
parallel processing is efficient:
1. Balanced partitioning: The amount of work done on each processor should be nearly the
same. Otherwise, a partition may have to wait for another one to complete its processing,
thereby wasting time.
2. Minimize communication: Data to be transferred between processors should be minimized.
3. Efficiency: Ideally each triple in the result must be derived by exactly one processor. In
general, we want to minimize the number of times the inferences are duplicated in the
processors.
4. Speed and Scalability: The partitioning itself should be fast and scale for extremely large
data-sets.
5.3.1 Data Partitioning
The goal of data partitioning is to partition the data-set such that each partition processes a sub-
set of the data using the complete rule-set. Our data partitioning algorithm makes use of the fact
that all the rules for OWL reasoning can be written as single join rules. For reasoning on single-
join rules to be correct, a data partitioning algorithm must make sure that any two tuples that
can potentially join must be present on the same partition. Let us consider a simple rule which
specifies a transitive property:
(?A brotherOf ?B) AND (?B brotherOf ?C) →?A brotherOf ?C
If there are two tuples {Adam brotherOf Bob} and {Bob brotherOf Charlie},
then both these tuples must be present in the same partition for the rule to fire successfully. To
ensure this, we make sure that all tuples with Bob as a subject or an object are present in the
same node. In other words, we assign the ownership of each resource in the graph to a particular
node and ensure that each tuple that contains the resource as subject or object is always
present on that node. Thus every rule that can be fired will be fired correctly. Various algorithms
can be used to partition the data as long as they have the following two characteristics.
1. Lossless: Each triple must be in at least one partition.
2. Join co-location: Every pair of triples which can potentially be joined must be present in the
same partition. This means that triples that share subject, object or predicate values
must be present in the same partition. The rule-sets for OWL can be written such that the
predicate of each sub-goal in the body of the rules is always a constant. Therefore, the
join co-location requirement stipulates that if two tuples have the same subject or object,
they should be present in the same partition.
We will call partitioning algorithms that have these characteristics single-join correct algorithms,
and a partition created by such an algorithm a single-join correct partition. A generic initial
partitioning algorithm is given in Algorithm 3.
Algorithm 3 Data partitioning
Input: Initial tuples
Output: Set of partitions of the original tuples, partition (ownership) table
1: Remove all the tuples involving the schema elements from the initial tuples.
2: Partition the resulting graph based on the partitioning policy.
3: for all tuples do
4: Assign the tuple to the partition that owns its subject and to the partition that owns its object
5: end for
6: return
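A minimal sketch of this generic partitioning step follows, using a simple hash of the resource identifier as the ownership policy; the graph-based and domain-specific policies described below would only replace the owner() function. Schema (T-Box) triples are assumed to have already been removed, as in step 1 of the algorithm.

import java.util.ArrayList;
import java.util.List;

public class TriplePartitioner {

    record Triple(String subject, String predicate, String object) { }

    // Ownership policy: here a plain hash; Metis-style graph partitioning or a
    // domain-specific assignment could be substituted without changing partition().
    static int owner(String resource, int numPartitions) {
        return Math.floorMod(resource.hashCode(), numPartitions);
    }

    static List<List<Triple>> partition(List<Triple> triples, int numPartitions) {
        List<List<Triple>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());

        for (Triple t : triples) {
            int subjOwner = owner(t.subject(), numPartitions);
            int objOwner = owner(t.object(), numPartitions);
            // Join co-location: the triple goes to the owner of its subject and, if the
            // owners differ, is replicated to the owner of its object (at most two copies).
            partitions.get(subjOwner).add(t);
            if (objOwner != subjOwner) partitions.get(objOwner).add(t);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Triple> data = List.of(
            new Triple("Adam", "brotherOf", "Bob"),
            new Triple("Bob", "brotherOf", "Charlie"));
        System.out.println(partition(data, 2));
    }
}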
Figure 5.2 illustrates this for a simple graph with six nodes (A-F). The partition divides
the nodes equally into two halves as shown in the first step. It cuts the edge between B and E.
Partition P1 owns the nodes A, D, E and the partition P2 owns B, C, F. Since the edge B-E is cut,
and E is owned by P1, the node B is replicated in P1 and E in P2 and the edge B-E is present in
both partitions.
Figure 5.2: Illustration of the partitioning algorithm.
The generic data partitioning algorithm above only says that an ownership list has to be gen-
erated but does not specify how that list is generated. Various algorithms can be used to generate
such a list and we have implemented the following algorithms.
5.3.1.1 Graph partitioning
In the classical weighted/multi-constraint graph partitioning problem, the input is a graph G={V,
E, W}, where V is the set of vertices, E is the set of edges and W is a function that assigns weights
to vertices. The goal is to divide it into k sub-graphs such that the sum of weights of vertices
in each sub-graph is nearly the same and the number of edges cut, i.e., edges between nodes
in different partitions, is minimized. The input RDF graph, in which each triple is represented
by two vertices (one each for the subject and the object) and an edge representing the property,
is considered for partitioning. All the vertices are uniformly weighted. The partitioning results
in k sub-graphs, from which we extract an owner-list, which is the set of vertices in a partition.
For every edge cut by the partitioning, the subject and the object of the corresponding triple are
owned by different processors; such triples are replicated on both machines. Therefore, a triple
from the dataset can be present in at most two partitions.
Since the rule-set used in each partition is the same, the load is partitioned by dividing the
number of nodes equally among the processors. This requirement is satisfied by one of the goals
of the graph partitioning problem: to ensure that the number of vertices in each partition is nearly
equal. Further, by minimizing the edge-cut of the input graph, we address both efficiency and
minimum communication. A statement R1 P R2 can be derived in one partition only if both R1
and R2 are present in that partition; since minimizing the edge cut also minimizes the number of
vertices that are replicated across partitions, efficiency is addressed. Similarly, a tuple R1 P R2
is communicated iff R1 and R2 are owned by different partitions. Finally, although the graph
partitioning problem is NP-complete, many heuristics have been proposed which are
fast and scalable [45]. The graph partitioning package that we use in our implementation, called
Metis, has been shown to work for graphs with millions of nodes.
5.3.1.2 Hash based partitioning
In the hash based approach, a (generic/arbitrary) hash function is used to determine which pro-
cessor a node is assigned to. The hash based approach is easy to implement and has been shown
to work well for data partitioning problems in different domains [40]. One advantage of the
algorithm is that it can be implemented as a streaming algorithm, i.e., the whole data graph need
not be loaded into memory for the partitioning. Moreover, if a sufficiently cheap hashing function
is used, the owner-list need not be replicated in each partition, thus making for more efficient and
scalable implementations. On the other hand, the disadvantage is that the hashing algorithm does
not minimize edge-cuts and therefore, the replication in the partitions could be very high.
5.3.1.3 Domain specific partitioning
A domain-specific partitioning algorithm uses knowledge about the characteristics of particular
data-sets to create a partitioning. As an example, LUBM is an OWL benchmark that models
concepts in a university setting [90]. Typical concepts modeled in this benchmark include
departments, students, professors, publications, etc. The entities (nodes) are organized such that
those belonging to a certain university are more likely to be related to each other than those
belonging to different universities, and within a university, the entities in a department are likely
to be related. We have used this characteristic of the data to create a partitioning algorithm which
assigns a resource to a partition based on the university and department it belongs to. The
partitioning is simplified because the data set is quite uniform: the number of nodes belonging
to each university and each department is more or less equal. Thus partitioning along university
and department boundaries partitions the number of nodes, and hence the workload, quite evenly.
Like the hash-based algorithm, the domain specific algorithm can be implemented as a stream-
ing algorithm, so that the whole input graph need not be loaded into the memory, making the
algorithm more scalable. A well thought out domain specific partitioning scheme could lead to
better partitions than the hash algorithm. The disadvantage of this algorithm is that it cannot be
reused across data-sets and a different partitioning algorithm must be implemented for different
data-sets.
5.3.2 Rule-Base Partitioning Approach
An alternate partitioning approach is rule-base partitioning, in which we partition
the rule-set so that the above mentioned goals are achieved. To do this, we first create a
rule-dependency graph, in which each rule is represented by a vertex and an edge indicates that a
clause in the head of one rule (r1) is present in the body of another rule (r2). This
means that a tuple generated due to r1 will be used by r2. Thus we need to minimize the number
of such edges that are cut in the rule-dependency graph. Finally to divide the workload evenly, we
weigh the nodes representing the rules differently. The quadratic rules are weighed much higher
than the linear rules. Ideally the weight of a quadratic rule is the square of the number of nodes in
the input data graph. Since this might not be known in advance, we use an approximation on the
size of the graph as the weight. For example, an approximation might use the size of the input file
as an indicator for the number of nodes in the graph. If such an approximation cannot be found
easily, we use a large number as the estimate for the number of nodes in the graph. The graph
obtained is then partitioned using the standard graph partitioning algorithm, which minimizes the
edge cut while keeping the sum of weights of nodes in each partition nearly equal.
To further improve the partitioning, we can also weigh the edges of the graph based on the
number of triples they may contribute. For example, let us assume that r1 is dependent on r2 and
r3 on r4. If r1 produces many more triples than r3 then the edge between r1 and r2 should be
weighed more than the edge between r3 and r4. When available, a priori knowledge about the
distribution of different predicates in the dataset can be used to weigh the edges of the graph. The
algorithm is summarized in algorithm 4:
Algorithm 4 Rule partitioning
Input: Rule-base created from an ontology
Output: Partition of the rule-base
1: Create the rule dependency graph as follows:
2: Each rule is represented by a vertex
3: If the head of a rule contains a clause that is in the body of another rule, then add an edge between the two rules
4: Partition the rule-dependency graph to minimize the edge cut while balancing the (weighted) number of rules in each partition (standard graph partitioning)
5: return
Once a rule-base partitioning is produced, each partition (a subset of the rules) is used at a node and applied to the original set of tuples. When a new tuple is generated, it is matched with the sub-goals in the bodies of the rules in the other partitions to determine whether it can potentially be used there. It is then sent to each of the partitions for which it is a potential match.
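The sketch below illustrates the construction behind Algorithm 4. The Rule class here is a simplified stand-in (not Jena's rule class) in which each rule is reduced to the sets of predicates in its body and head; an edge from r1 to r2 is recorded whenever a head predicate of r1 occurs in the body of r2, i.e. a tuple produced by r1 can trigger r2. The weighting helper reflects the discussion above, with an assumed estimate of the data-graph size.

import java.util.*;

/** Minimal sketch of the rule-dependency graph behind Algorithm 4. */
public class RuleDependencyGraph {
    public static class Rule {
        final String name;
        final Set<String> body;      // predicates in the rule body
        final Set<String> head;      // predicates in the rule head
        final boolean quadratic;     // two-premise (join) rule?
        public Rule(String name, Set<String> body, Set<String> head, boolean quadratic) {
            this.name = name; this.body = body; this.head = head; this.quadratic = quadratic;
        }
    }

    /** Edges of the dependency graph, keyed by the producing rule. */
    public static Map<String, Set<String>> build(List<Rule> rules) {
        Map<String, Set<String>> edges = new HashMap<>();
        for (Rule producer : rules) {
            Set<String> targets = new HashSet<>();
            for (Rule consumer : rules) {
                if (producer == consumer) continue;
                for (String p : producer.head) {
                    // A tuple produced by 'producer' can trigger 'consumer'.
                    if (consumer.body.contains(p)) { targets.add(consumer.name); break; }
                }
            }
            edges.put(producer.name, targets);
        }
        return edges;
    }

    /** Vertex weight handed to the partitioner: an assumed estimate of the
     *  data-graph size stands in for the unknown number of nodes. */
    public static long weight(Rule r, long estimatedNodes) {
        return r.quadratic ? estimatedNodes * estimatedNodes : estimatedNodes;
    }
}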
5.3.3 Metrics
Since a multitude of such partitioning algorithms can be devised, we need a way to determine the effectiveness of an algorithm. To this end, we propose the following metrics, which address the four goals of such algorithms (a small sketch that computes them follows the list).
1. Balanced partition: The standard deviation of the CPU time of each processor in the system is the ideal parameter for measuring balanced partitions. However, since the CPU time is not known before a parallel algorithm is run, a diagnostic metric that can be used is bal, given as the standard deviation of the number of nodes in each partition. This is a useful metric because the computational time of the reasoning is directly proportional to the number of nodes in the RDF graph.
2. Efficiency: The metric is the output replication, given by
OR = Σ (no. of tuples in the result of each processor) / (total no. of tuples in the unioned output graph)
A diagnostic metric is the input replication, given by
IR = Σ (no. of nodes in each processor) / (total no. of nodes in the input graph)
3. Minimize communication: As above, since the exact number of tuples communicated cannot be determined beforehand, a reasonable diagnostic metric for communication is the input replication. This is because only the tuples related to replicated nodes are transmitted.
4. Speed: The time taken for the partitioning itself is used as the metric.
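A minimal sketch that computes these diagnostic metrics from per-partition counts is shown below; the inputs are assumed to be gathered after partitioning (for bal and IR) or after reasoning (for OR).

/** Minimal sketch of the diagnostic metrics defined above. */
public class PartitionMetrics {
    /** bal: standard deviation of the number of nodes assigned to each partition. */
    public static double bal(long[] nodesPerPartition) {
        double mean = 0;
        for (long n : nodesPerPartition) mean += n;
        mean /= nodesPerPartition.length;
        double var = 0;
        for (long n : nodesPerPartition) var += (n - mean) * (n - mean);
        return Math.sqrt(var / nodesPerPartition.length);
    }

    /** OR: sum of per-partition result sizes over the size of the unioned output. */
    public static double outputReplication(long[] tuplesPerPartition, long unionedOutputTuples) {
        long sum = 0;
        for (long t : tuplesPerPartition) sum += t;
        return (double) sum / unionedOutputTuples;
    }

    /** IR: sum of per-partition node counts over the number of nodes in the input graph. */
    public static double inputReplication(long[] nodesPerPartition, long inputGraphNodes) {
        long sum = 0;
        for (long n : nodesPerPartition) sum += n;
        return (double) sum / inputGraphNodes;
    }
}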
5.4 Parallel Algorithm
The generic algorithm for parallel processing of OWL data is shown in Algorithm 5. The algorithm is very similar to the parallel algorithms for datalog programs presented in [54, 158, 155]. The input to the parallel reasoner is the set of base tuples and a rule-base that is compiled from an OWL ontology. The master node partitions either the data-set or the rule-base and sends the appropriate partition to each processor in the system. In the data partitioning approach, the base tuples received at each processor are a subset of the input tuples and the rule-base is the same as the original. In the rule-base partitioning approach, on the other hand, the base tuples assigned to each partition are the same as the original tuples presented to the algorithm, whereas the rule-base is a subset of the original rule-base. Apart from this, the master node also sends a partition table to each processor.
The algorithm at each node works in rounds; in each round, every processor in the system applies the rule-base to its data set to obtain the inferred tuples. It then checks whether any of the inferred tuples are to be transmitted to another processor. The exact method for determining whether a tuple should be sent to another partition depends on the partitioning strategy. For the data partitioning scheme, the partition/owner table, which contains the information regarding which processor in the system owns a node, is used to determine whether a tuple needs to be transmitted to other node(s). For the rule-base partitioning, we match the newly generated tuple with all the rules of the other partitions to determine whether it can trigger any of them; the tuple is sent to every partition in which it can be used. Once the messages from all the processors are received, the next round ensues: the received tuples are added to the tuples produced at the end of the previous rounds, the result is used as the input tuples, and the process is repeated. The algorithm terminates when each processor in the system has finished a round without creating tuples that need to be communicated to another processor and there are no tuples in transit. Note that the master node itself has no role to play once the initial partition is done and the input is supplied to each node. It can therefore be used as one of the nodes of the system and thus process a partition.
Algorithm 5 Parallel Reasoning
Input: Initial base tuples, rule-base
Output: Base tuples and Inferred tuples
1: Partition the data or rule-base. Assign a partition to each node in the system.
At each node:
2: while !terminate do
3: Create all the new tuples for the given rule base and initial base-tuples
4: Send any of the newly generated tuples to other processors as necessary
5: Receive tuples from other processors and add them to the base tuples.
6: end while
7: return
One nice characteristic of this parallelization algorithm is that it uses an existing reasoner for
creating additional tuples. Thus it can be built as a wrapper over an existing reasoner.
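The sketch below shows the per-node round structure of Algorithm 5. Tuples are represented as N-Triples strings, and Reasoner, Router and Messenger are hypothetical interfaces standing in for the wrapped serial reasoner (Jena in our implementation), the partitioning-dependent routing test, and the MPI-style transport (MPJ-Express in our implementation); the actual MPJ-Express calls and the distributed termination test are abstracted away.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of the per-node round loop of Algorithm 5. */
public class PartitionWorker {
    public interface Reasoner { Set<String> closure(Set<String> tuples); } // assumed to return a fresh set
    public interface Router   { Set<Integer> destinationsOf(String tuple); }
    public interface Messenger {
        void send(int partition, Set<String> tuples);
        Set<String> receive();                      // tuples addressed to this partition
        boolean terminated(boolean sentNothing);    // global "all quiet and nothing in transit"
    }

    public Set<String> run(Set<String> base, Reasoner reasoner, Router router, Messenger net) {
        Set<String> known = new HashSet<>(base);
        while (true) {
            // Local fixpoint for this round, computed by the wrapped serial reasoner.
            Set<String> outgoing = new HashSet<>();
            for (String t : reasoner.closure(known)) {
                if (known.add(t) && !router.destinationsOf(t).isEmpty()) outgoing.add(t);
            }
            for (String t : outgoing)
                for (int dest : router.destinationsOf(t))
                    net.send(dest, Collections.singleton(t));
            if (net.terminated(outgoing.isEmpty())) break;
            known.addAll(net.receive()); // tuples produced by other partitions this round
        }
        return known;
    }
}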
5.4.1 Correctness
The correctness of the algorithm is asserted in the following lemma.
Lemma 1. Every tuple that is present in a model generated by a serial reasoner is also present
in a model generated by the above algorithm, for a single-join rule set with known universe of
atoms and reasoning defined by negation free datalog semantics when the partition is produced
by a single-join correct partition.
Proof sketch: Intuitively, our algorithm ensures that any two tuples that can be joined to form
a new tuple are present in the same partition. This applies to tuples that are present in the base
partition as well as those that are generated in the later rounds of the algorithm. Thus, any tuple
that is derived in the serial algorithm is also derived in the parallel algorithm and the resulting
model is bound to be the same. The complete proof is presented in Appendix A.
5.4.2 Optimization 1: Incremental addition algorithm
An important challenge of the implementation is to ensure that the later rounds of reasoning, when only a few tuples are added to the data-set, are executed efficiently. A naive approach would query the reasoner for all the tuples in the knowledge base, which would cause the reasoner to perform many of the derivations again. Our implementation instead considers only the statements that are added to the partition in the current round and all the statements that may in turn be created due to their addition. Algorithm 6 is presented below (followed by a Jena-based sketch of it) and is similar in spirit to the work done on incremental view maintenance algorithms [59].
Our incremental addition algorithm is based on the observation that the OWL rules are range-restricted, i.e., all the variables that appear in the head of a rule also appear in its body. Moreover, the subject term in every rule is always a variable. The correctness of this algorithm is stated in the following lemma.
Algorithm 6 Incremental addition algorithm
Input: oldTriples, addedTriples
Output: inferredTriples
1: Initialize the resource list as all resources that appear as either the subject or the object in addedTriples
2: for all resources in the resource list do
3: Query for all statements that have the current resource as subject or object
4: for all new statements do
5: If the subject or object does not already appear in the resource list, add it to the list
6: end for
7: end for
8: return
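A minimal sketch of this traversal against the Jena API (package names as in the Jena 2.x releases we used) is shown below; it is an illustration of Algorithm 6 rather than our exact implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

/** Minimal sketch of Algorithm 6: starting from the resources touched by the
 *  newly added triples, repeatedly ask the inference model for all statements
 *  in which those resources appear, pulling in any new resources encountered. */
public class IncrementalAddition {
    public static Set<Statement> affectedStatements(InfModel inf, Set<Statement> addedTriples) {
        Deque<Resource> work = new ArrayDeque<>();
        Set<Resource> seen = new HashSet<>();
        for (Statement s : addedTriples) {
            enqueue(s.getSubject(), work, seen);
            if (s.getObject().isResource()) enqueue((Resource) s.getObject(), work, seen);
        }
        Set<Statement> result = new HashSet<>();
        while (!work.isEmpty()) {
            Resource r = work.poll();
            // Statements with r as subject, then as object; both drive the reasoner.
            collect(inf.listStatements(r, null, (RDFNode) null), result, work, seen);
            collect(inf.listStatements(null, null, r), result, work, seen);
        }
        return result;
    }

    private static void collect(StmtIterator it, Set<Statement> result,
                                Deque<Resource> work, Set<Resource> seen) {
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            if (result.add(s)) {
                enqueue(s.getSubject(), work, seen);
                if (s.getObject().isResource()) enqueue((Resource) s.getObject(), work, seen);
            }
        }
    }

    private static void enqueue(Resource r, Deque<Resource> work, Set<Resource> seen) {
        if (seen.add(r)) work.add(r);
    }
}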
Lemma 2. A statement appears in the derived model iff the statement appears in the LFP of a KB that contains the union of the old and the added triples.
Proof. (→) Let us assume that there is a tuple in the LFP of the model consisting of the union of the input and added triples that is not present in the model we derived. Obviously, neither the subject nor the object of such a tuple is one of the resources that appears as a subject or object in one of the newly added triples; otherwise it would have been generated by our algorithm. Since all our rules are join rules, this tuple must have been generated by a rule that was matched only by tuples in the model from the previous iteration. But such a tuple would already have been generated in a previous iteration and hence exists in the model generated by our algorithm. Thus, every tuple that is in the LFP of the union of the KB from the previous iteration and the newly added tuples is also present in the model generated by our algorithm.
(←) Because our algorithm uses a sound datalog reasoner, by definition every tuple generated by our algorithm has to be in the LFP of the union of the old model and the newly added tuples.
Thus the two models are equivalent.
5.4.3 Optimization 2: T-Box exclusion
This optimization is based on the observation that in OWL-DL and OWL-Lite reasoning, facts in the A-Box do not add any facts about elements in the T-Box. Thus, the facts from the T-Box can be shared/replicated across all the partitions, and the corresponding nodes are excluded during partitioning. Since a large number of facts in a typical OWL graph are type assertions, the number of edges incident on a T-Box element is very high. If such nodes are not excluded during partitioning, the partition that owns a T-Box element tends to have a much larger number of statements than the others, resulting in a partitioning in which one of the parts has a much larger processing time than the others. By excluding these nodes during partitioning, the partitions obtained are more balanced. This optimization does not hold for RDF, OWL-Full, or full OWL-Horst reasoning, where there is no separation between the A-Box and the T-Box.
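A minimal sketch of how the excluded nodes could be collected with Jena is shown below; the heuristic of treating every object of an rdf:type assertion as a T-Box element is an assumption that fits OWL-DL/Lite data, where classes and instances are disjoint.

import java.util.HashSet;
import java.util.Set;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.NodeIterator;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

/** Minimal sketch of the T-Box exclusion step: collect the resources that act
 *  as classes (objects of rdf:type) so that edges incident on them can be
 *  dropped before the data graph is handed to the partitioner, and the
 *  corresponding schema statements replicated on every node instead. */
public class TBoxExclusion {
    public static Set<String> tboxNodes(Model data) {
        Set<String> tbox = new HashSet<>();
        NodeIterator classes = data.listObjectsOfProperty(RDF.type);
        while (classes.hasNext()) {
            RDFNode c = classes.nextNode();
            if (c.isURIResource()) tbox.add(((Resource) c).getURI());
        }
        return tbox;
    }
}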
5.5 Implementation
We have used the Jena package [76] for OWL processing. Jena uses a hybrid reasoner, which combines the forward and backward chaining methods. The forward chaining is implemented using the Rete algorithm, whereas the backward chaining is implemented using standard SLD resolution with tabling. The Rete algorithm [50] uses an acyclic directed graph (rete network) where the nodes represent patterns of the rules and the paths from the root to the leaves represent the left-hand sides of the rules. The algorithm works by matching facts with the patterns appearing in the rules. It is fast because it reuses/shares the patterns occurring across rules, thus minimizing the number of match operations, and because it remembers the facts that match a pattern in memory nodes.
SLD resolution is the technique used in most theorem provers and is used to implement Prolog. This process works by constructing an AND-OR tree to prove that the negation of a statement, which is an instance of the consequent of a rule, is false. The AND nodes represent the conditions that must all hold true (the various sub-goals of the rule), and the OR nodes represent the choices that can be used (the different elements that can be bound to the variables in the rule). A statement is added to the KB when the procedure proves that the statement is valid with respect to the current KB. The SLD resolution in Jena is implemented quite similarly to a standard Prolog interpreter [27], although some optimizations and/or control primitives like the cut operator are not included.
In general, the hybrid engine works by first compiling the ontology into rules using the forward engine. These rules are then used by the backward engine to derive new tuples. To materialize a KB, a query of the form "for each resource, select all triples from the KB with that resource as subject" is issued. This triggers the reasoner and generates the inferred tuples in the KB. Although a different reasoning strategy could be used, e.g., bottom-up datalog evaluation, the contribution of our work is not diminished, because our approach is applicable to any reasoner that adheres to datalog semantics. Since Jena is a Java-based package, we have also developed our parallel implementation in the same language. The inter-partition communication is through the MPJ-Express [99] package, which is an MPI-like API for Java programs.
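The sketch below shows how a single partition might be materialized with Jena's OWL rule reasoner: the base triples are wrapped in an inferencing model and the closure is pulled by asking for all statements of every subject, which is the query pattern that drives the backward engine. File locations and class names are placeholders; this is an illustration, not our exact code.

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.ResIterator;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.StmtIterator;

/** Minimal sketch of materializing one partition with Jena's hybrid reasoner. */
public class MaterializePartition {
    public static Model materialize(String ontologyFile, String dataFile) {
        // OWL_MEM_RULE_INF wires in Jena's hybrid (forward + backward) rule reasoner.
        OntModel inf = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RULE_INF);
        inf.read("file:" + ontologyFile);   // T-Box / ontology (replicated on every node)
        inf.read("file:" + dataFile);       // A-Box / base tuples of this partition

        Model result = ModelFactory.createDefaultModel();
        ResIterator subjects = inf.listSubjects();
        while (subjects.hasNext()) {
            Resource r = subjects.nextResource();
            // "Select all triples with this resource as subject" triggers the backward engine.
            StmtIterator stmts = inf.listStatements(r, null, (RDFNode) null);
            while (stmts.hasNext()) {
                result.add(stmts.nextStatement());
            }
        }
        return result;
    }
}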
5.6 Experimental Results
To measure the performance of our algorithm, we conducted experiments on a networked cluster of machines managed by the HPCC at USC. The experiments were conducted on AMD Opteron 2.6 GHz, 64-bit dual-core machines, with main memories between 4 and 64 GB. All these machines ran the Linux OS and Java v1.5. We have also performed experiments on an AMD Opteron 2.0 GHz quad-core machine with 16 GB RAM and a similar software setup to understand the scalability of our algorithms on a multi-core system. In both cases, each partition was executed on a separate processor core, and the cores communicated using MPJ-Express. Our evaluation was based on two standard benchmarks, the LUBM-10 (1M triples) and UOBM-4 data-sets, and our own data-set, called MDC.
5.6.1 Data Partitioning
5.6.1.1 Networked Cluster
Figure 5.3 shows the speedups we have obtained using the data partitioning approach, implemented using the graph partitioning algorithm. For the LUBM-10 and MDC data-sets we see super-linear speedups. On the other hand, for the UOBM data-set we observe sub-linear speedups.
The super-linear speedups observed for some benchmarks can be explained by examining the reasoning process. To materialize the KB, a query of the form "find all statements with a given resource as subject" is issued for each resource in the graph. To answer this query, the reasoner creates kn candidate triples, where k is the number of properties in the ontology and n is the number of nodes in the graph: each candidate triple has the given resource as subject and one of the n nodes of the graph as object. It then tries to prove that the KB entails each such triple. The worst-case complexity of this algorithm is polynomial in the number of resources in the KB (nodes of the RDF graph). For some data-sets, like LUBM and MDC, the reasoner exhibits this worst-case polynomial complexity.
Figure 5.3: Speedup for the LUBM-10 and UOBM benchmarks on different numbers of processors.
For such data-sets, our partitioning algorithm reduces the search space of the LP theorem prover and thus shows super-linear speedups. More specifically, the OWL constructs for expressing intersection classes and someValuesOf are compiled into rules similar to the one shown below:
(?X lubm:headOf ?A) (?A rdf:type lubm:University) → (?X rdf:type lubm:Dean)
Such rules scale quadratically over the number of nodes (n) in the RDF graph, because the theorem prover tries each of the n values for each of the two variables (?X, ?A).
The poor performance on the UOBM data-set is due to it being a dense graph. Thus even the best partition cuts a number of edges, leading to the replication of many nodes. Since the time taken is quadratic with respect to the number of nodes in the graph, the speedups do not scale well for the UOBM dataset. In Table 5.1 we compare the ratio of the average partition size to the complete data-set size for LUBM and UOBM. More densely connected graphs, often occurring due to symmetric and transitive properties (including owl:sameAs and differentFrom), are less amenable to parallelization. In such cases, it might be a good idea to perform partial materialization, where such reasoning is not performed at materialization time but rather at query time, and the rest of the reasoning is parallelized.
No. of partitions   LUBM   UOBM
 2                  0.53   0.70
 4                  0.28   0.50
 8                  0.14   0.28
16                  0.07   0.17
Table 5.1: Ratio of average partition size to complete data-set size for LUBM and UOBM
Recently, some (commercial) OWL reasoners have been shown to scale linearly with the size of the data [156] for similar rule-sets. Although evaluating our methodology with every such reasoner is out of the scope of this work, Figure 5.4 illustrates the (potential) speedups that our algorithm would obtain for such reasoners, using only the linear subset of the Jena ruleset. We believe that such a reasoner can be implemented using the algorithms used in Jena by employing the cut operator to reduce the search space explored by the theorem prover for the constructs described above.
Both LUBM and MDC scale well for this subset of the rules, but UOBM (again) scales worse than the other two. This is because the graph generated as a result of reasoning for UOBM is more inter-connected than the others, due to transitive properties that form very long chains in the graph.
Figure 5.4: Speedup for the LUBM-10 and UOBM benchmarks on different numbers of processors, using the linear subset of the rules.
5.6.1.2 Multi-core
The speedups of our data partitioning implementation on a multi-core processor are shown in Figure 5.5. We see that, as expected, super-linear speedups are observed on the multi-core processor as well.
5.6.2 I/O overhead and ideal speed-up
Figure 5.6 shows the amount of time spent on reasoning, IO, synchronization (waiting for other partitions to finish) and aggregation by the parallel algorithm for the LUBM-10 benchmark; the figure shows the maximum values over the partitions. We see that as the number of partitions increases, the amount of time spent in IO and synchronization correspondingly increases. For larger data-sets the large IO time is due to the final step, in which each partition sends all the tuples generated by local reasoning to a central node for aggregation.
Figure 5.5: Speedup for the LUBM-5 benchmark on a multi-core processor.
Figure 5.6: Overhead of various sub-tasks of parallel processing for LUBM-10.
In Figure 5.7 we compare the speedup achieved by our implementation with the theoretical maximum speedup for the LUBM data set. The theoretical maximum was calculated using an empirically derived performance model of the reasoning for the LUBM data set. We ran the serial reasoner on various LUBM data sets (LUBM-1, LUBM-5, LUBM-10 etc.) and, based on the reasoning times obtained for them, regressed a cubic model for the execution time, as shown in Figure 5.1. Since the worst case of the reasoning for this rule set is cubic, fitting a cubic model is reasonable. The theoretical maximum is calculated as the speedup obtained by a partitioning in which all the partitions are of equal size and there is no replication among them. The graph shows the reasoning time for the slowest partition as well as the overall reasoning time for the data-set. We see that the speedups observed are quite close to the theoretical maximum predicted by the model.
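The sketch below shows how the theoretical maximum is derived from the fitted cubic model: with p equal-sized partitions and no replication, each partition reasons over roughly n/p nodes, so the predicted speedup is T(n)/T(n/p). The coefficients are placeholders for the regressed values.

/** Minimal sketch of the cubic cost model T(n) = a3*n^3 + a2*n^2 + a1*n + a0
 *  and the resulting ideal speedup for p equal partitions with no replication. */
public class SpeedupModel {
    private final double a3, a2, a1, a0;

    public SpeedupModel(double a3, double a2, double a1, double a0) {
        this.a3 = a3; this.a2 = a2; this.a1 = a1; this.a0 = a0;
    }

    /** Predicted serial reasoning time for a graph with n nodes (Horner form). */
    public double time(double n) { return ((a3 * n + a2) * n + a1) * n + a0; }

    /** Ideal speedup: serial time over the time of one (equal, unreplicated) partition. */
    public double idealSpeedup(double n, int partitions) {
        return time(n) / time(n / partitions);
    }
}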
Figure 5.7: Speedup for the LUBM-10 benchmark on different numbers of processors.
Figure 5.8 shows the speedups for LUBM data-sets of different sizes. We see that, as the number of processors grows, the speedups obtained for the larger data-sets are much larger than those for the smaller data-sets. This further corroborates our performance model and the explanation above.
Figure 5.8: Speedups for LUBM data-sets of different sizes.
5.6.3 Comparison of data partitioning algorithms
Figure 5.9 shows the comparison between the speedups obtained from the three data partitioning algorithms presented earlier. (Speedups for hash partitioning using 8 and 16 nodes are not shown because those experiments did not complete due to memory size limitations.)
Figure 5.9: Comparison of the performance of the data-partitioning algorithms for LUBM-10.
The domain specific partitioning performs nearly as well as the graph partitioning algorithm. Both of these algorithms create partitions that are equal in size and have few edge-cuts across partitions. The naive hash algorithm, on the other hand, performs very badly because it does not minimize the edge-cut. Therefore, the amount of duplication, i.e., the number of partitions in which the same tuple is created, is very high. For example, for the LUBM data-set using 4 partitions, the duplication for the graph-based partitioning algorithm is nearly 10%, whereas for the hash algorithm it is about 100%. Thus, for a partitioning algorithm to work efficiently, it must ensure that the duplication is minimized. The table below summarizes the metrics proposed earlier for the partitioning algorithms:
No. of partitions   Algorithm   Bal     OR     IR     Part. time
 2                  Graph        1500   0.1    0.07   240
                    Dom. sp.     9186   0.1    0.07   156
                    Hash          168   0.5    0.7    120
 4                  Graph        1533   0.1    0.13   235
                    Dom. sp.    10950   0.1    0.11   154
                    Hash          772   0.95   1.5    120
 8                  Graph         986   0.15   0.16   235
                    Dom. sp.     9304   0.13   0.13   154
                    Hash          788   -      2.1    120
16                  Graph         604   X      0.19   235
                    Dom. sp.     2963   X      0.19   154
                    Hash         2800   -      1.3    120
Table 5.2: Partitioning metrics for the LUBM data-set
5.6.4 Rule-base Partitioning
Figure 5.10 shows the speedups obtained by the rule-partitioning approach. We show the results for the three benchmarks: LUBM, UOBM, and MDC. Note that the speed-ups achieved for the UOBM data-set using this approach are better than those using the data partitioning approach. For dense graphs like UOBM, the rule partitioning approach does less replicated work than the data partitioning approach. This is because, in the data partitioning approach, all the rules are applied in every partition, and since in dense graphs each partition ends up holding a large fraction of the nodes, each partition duplicates a large amount of work.
A reason this partitioning shows less than optimal speed-up is that the volume of data communicated across processors is much higher. To illustrate this, the rule dependency graph for one of our benchmarks is shown in Figure 5.11. The rule depicted as the node in the center of the graph has a large number of dependencies associated with it. An optimization we have used to alleviate this is to duplicate some simple rules in all the partitions, which does not significantly add to the computational overhead but decreases the amount of data transmitted.
Figure 5.10: Speedup for the different benchmarks for rule-base partitioning.
5.6.5 Incremental addition
Figure 5.12 demonstrates the advantage of using our incremental addition algorithm. We see that without the incremental addition algorithm, the time taken for the second iteration is more than half that of the first iteration; because Jena implements a tabled top-down resolution, which avoids multiple re-derivations, the time taken is less than that of iteration one. In the second figure we see that, with the incremental addition algorithm, iteration 2 takes about 1 percent of the time of iteration 1.
5.6.6 Serial processing
Figure 5.13 shows that even if we process a partitioned model serially, the reasoning executes faster. At the same time, the memory footprint of the application shrinks. For the case with 16 partitions, the memory footprint is about 15% of that of the serial reasoner, whereas the time taken is about half of it. This is because the partitioned version does less work than the serial version of the reasoning.

Figure 5.11: Rule dependency graph for MDC.
5.7 Related Work
To the best of our knowledge, ours is the first work to examine the problem of parallelizing OWL inferencing. However, much work has been done in the closely related areas of parallelizing rule-based systems and deductive database systems. We present a brief overview of the literature on such systems below.
Figure 5.12: Time spent per iteration on each partition.
Figure 5.13: Time vs. Memory trade-off for the LUBM benchmark.
The existing work can be classified based on the kind of application of the rule-based system for which the parallelization technique was devised. Techniques developed for production systems (OPS5 is a commonly used language) are presented in [60, 4]. In this class of applications, unlike our application domain, the data sizes are generally small whereas the rule-bases can be quite large; hence the techniques devised in these works are not ideally suited to our problem. Techniques devised for deductive databases, which use the Datalog language for rules, have been presented in [54, 158, 155]. This is the body of work most relevant to us, because OWL reasoning can be implemented using a datalog reasoner and because, as in deductive databases, the data-sets are much larger than the rule-bases. The parallel algorithm that we have presented is very similar to those proposed in the above works. However, to the best of our knowledge all the work in this community is theoretical, and we are not aware of a system that actually implements the above techniques for a Datalog implementation. Moreover, to the best of our knowledge there are no studies based on standard benchmark data-sets that would help us compare our results with the work done in this community or map them directly to the OWL reasoning problem.
Data partitioning based approaches to parallelizing deductive databases have been proposed in [158, 155, 142]. In [158, 155], the authors propose an abstract hash function, which maps a data tuple onto a processor, for partitioning the data. The main challenge of such a system is to choose an appropriate hash function for the data set, in order to partition the computational workload equally among the processors and minimize the communication of tuples across partitions; the authors propose some useful guidelines for such a hash function in [158]. The partitioning algorithms that we propose can be seen as realizations of such abstract hash functions for datalog programs realizing OWL semantics, and they employ some of the lessons from the above-mentioned works. Two other data partitioning approaches are proposed in [142]. In cluster-based partitioning, a distance metric over the tuples referenced by a rule is used to partition the data-set using a clustering algorithm; this is somewhat similar to the graph partitioning approach that we have used. The statistics-based approach assumes a (more or less) stationary dataset, so that statistics can be gathered and used to perform the partitioning; such a policy is mentioned in [142]. Our domain specific partitioning algorithm is similar to this approach because it also uses the characteristics of a particular data-set to create good partitions.
The rule-base partitioning approach for datalog programs has been presented in [17]. Our own rule-base partitioning approach is somewhat similar to theirs, although our algorithm is specific to OWL datalog programs. Two other partitioning approaches that we plan to explore in our future work are the hybrid partitioning approach [126] and the control partitioning approach [61]. In hybrid partitioning, both the rule-set and the data-set are partitioned to obtain better results. In control partitioning, typically seen in Prolog systems, different branches or flows of control are explored in parallel.
The existing systems can also be classified based on how they handle load balancing to provide equitable workloads to each node in the system. In static load-balanced systems [54], the workload partitioning is done at the beginning and is not modified as data is processed by the nodes. On the other hand, dynamic load-balancing systems like [155, 41] reallocate workloads if the initial partitioning scheme did not provide a balanced partition. Our own system uses a static load balancing strategy, which, as we see from our results, works quite well.
The systems can be further classified based on the parallel programming model used. In shared-memory (shared-everything) systems, like [60], both the working memory/RAM and the secondary storage are shared. In shared-nothing systems, like [155, 54], the nodes share neither memory nor secondary storage/database: the data is partitioned and processed independently by each node, and the nodes communicate using message passing mechanisms. In shared-database systems, nodes have their own RAM or working memory but share the database or secondary store containing the original data. We have used the shared-nothing model in our work.
Finally, the problem of scaling and speeding up OWL reasoning is an exciting area of active research. Most current approaches work on scaling serial reasoners. OWLIM [105] uses a rule-based forward chaining reasoner called TRREE that achieves efficiency and speed by converting rules into chunks of Java code, together with an efficient in-memory representation of the OWL graph. Oracle [104] implements a forward-chaining rule-set in SQL and performs the reasoning within the database; they thus leverage much of their work in optimizing their database to achieve very good performance for OWL reasoning. Since our approach is based on datalog semantics, and the rule-sets used in the two reasoners mentioned above are similar to those we have studied, our partitioning techniques can be applied to these reasoners.
5.8 Conclusions
We have demonstrated two techniques for partitioning the workload of the OWL reasoning process and hence parallelizing it. Our approach can be used not only on parallel clusters, but also on the popular multi-core processors. Our data partitioning algorithm is based on the observation that the rule-set generated for OWL-Horst ontologies uses only single-join rules. This enables us to use a fairly simple technique to partition the datasets and to process them in parallel. In the rule partitioning approach, the rule-base is partitioned and each simpler rule-base is applied to the original data-set. For data-sets that are dense graphs, the rule partitioning approach provides better speedups than the data partitioning approach. Moreover, the rule partitioning step is much cheaper than the data partitioning step because the data-sets are typically much larger than the rule-sets. We note that for some rule reasoning algorithms like Rete, the analysis of time complexity is not straightforward [3], and our simple performance model might not be applicable to them. The results we have obtained on standard benchmarks are promising, with speedups of up to 18x observed on a 16-node parallel cluster for a particular benchmark.
Chapter 6
Conclusions and Future Work
As for the future, your task is not to foresee it, but to enable it. - Antoine de Saint-Exupéry

Dream on, dream on... Dream until your dreams come true. - Aerosmith (Dream On, 1973)
6.1 Conclusions
Applying semantic web technologies is becoming an increasingly popular approach to addressing the information management problems in a large enterprise setting. However, only a handful of studies have addressed the methodological and scalability challenges in using this approach. Our thesis addresses this gap by first proposing an adaptation of the agile methodology for engineering such solutions. We address an interesting problem in change management that arises while applying this methodology. We then present a case study demonstrating two components that address the needs of a real-world problem setting. The case study presents experiences and lessons learnt from a significant development effort to build these components and applications using them. Finally, we address the scalability of OWL reasoning, a key bottleneck in building such solutions. We use a parallelization approach to address this problem and show significant speedups. Based on our experiences, we believe that although this technology was not explicitly designed for it, it is indeed a promising approach to the enterprise information management problem. We believe that on the technical side, the scalability and performance of reasoning and query resolution are the two biggest challenges. As with most emerging technologies, the bigger challenges are on the non-technical side: adoption, and the push and tool support from major vendors, will likely determine whether this technology is adopted as a de facto approach.
6.2 Future Work
6.2.1 Semantic Middleware: A vision
Middleware is software, or a set of services, that hides the various heterogeneities (e.g. hardware, programming language etc.) between applications or components, thus facilitating communication and co-ordination [22, 11]. Traditional middleware has relied on shared object models (e.g. CORBA), XML message formats (web services), or a combination of both (e.g. Biztalk, JMS) for communication among the components connected to it. The semantic web standards, especially OWL, provide features that combine the expressiveness of the object model with the openness, simplicity and flexibility of the XML formats. Thus, we envision a new kind of middleware, called semantic middleware, which uses semantic web technologies as the basis for message passing and other functions. Figure 6.1 shows a layered architecture of a system with semantic middleware as well as some of the components that constitute such a middleware.
Figure 6.1: Semantic middleware

At the lowest layer are the legacy data sources and applications in the system, which include the semi-structured as well as the structured data sources. These data sources are abstracted through a semantic middleware that provides a common interface to all the data in the system. Some of the other services that will be part of the semantic middleware fabric and that use RDF/OWL are:
• Semantic Data-stores or Knowledge bases: These are databases that store large amounts of RDF data and are queryable through the SPARQL interface. Examples of semantic data-stores include Jena [76], Oracle [104], Boca [73], Sesame [125] etc. In a domain with semi-structured data this is one of the key components of the semantic middleware, because data access from the semi-structured sources is generally slow; for efficiency reasons, it is therefore necessary to extract the data and store it in a semantic data store.
• Semantic Lookup: A lookup service is a key component of any middleware; it provides a simplified means to find services in a large system. Unlike normal lookup services like JNDI [86] or UDDI [18], their semantic counterparts allow components to find matches based on the functionality of the target component, as opposed to arbitrary URIs etc. Examples of such services are UDDI registries enhanced with OWL-S descriptions [106, 136].
• Semantic Messaging: Message Oriented Middleware (MOM) is a popular way of integrating disparate systems, using a message bus for communication between components. In semantic MOMs the messages are enhanced with semantics, and rule technologies are used to express routing routines. Examples of such systems are [143, 79].
• Semantic services and workflows: Service Oriented Architecture (SOA) is a new style of designing software in which functionalities and sub-systems are encapsulated and made available as self-describing components with an open interface, i.e., as services [107]. Semantics are generally added to describe a service better, including its dependencies, inputs and outputs, pre-conditions and post-conditions etc., using shared vocabularies [132, 65]. These semantic services form the basis for defining simpler and higher-level abstract workflows. These abstract workflows are then automatically translated to concrete workflows by binding them to services that are available and transparently applying transformations to make the data from these services compatible with one another [20, 5, 135].
• Semantic Mediator: A mediator is a component that provides a database-like abstraction over data that is present in many different legacy data-sources [122, 55]. Unlike a semantic data-store, where the data is extracted a priori, the mediator extracts the data when a query is issued to it. A semantic mediator provides an RDF interface to legacy data [23, 119].
Although a semantic counterpart to comprehensive middleware stacks like Microsoft Biztalk [24], J2EE [96] etc. does not exist today, efforts have started to build such a comprehensive framework [103].
6.2.2 Methodology and Change Management
Our change management technique addresses only a very small part of the complete problem and thus can be extended in many ways. Two approaches have been proposed in the database community to handle the change problem: versioning and evolution [117, 13, 88]. In versioning, many RDF graphs, each compliant with a version of the schema, need to be maintained. Queries generally specify the version of the schema/ontology and are answered by using the appropriate version of the database/RDF graph. An even higher level of support for changing ontologies transparently translates queries compliant with older schemas to those for the new schemas. Intuitively, this works well for database schemas that are refactored by normalization or de-normalization, which changes the schema structure but not necessarily the information content of the database. An analysis of whether, and under what circumstances, this approach is applicable is an interesting problem for future research.
A much more ambitious goal, an automated mechanism to manage or even detect changes to all artifacts based on changes to the ontology, is an important but complex problem. The most complex artifact to handle is the source code of the system; even the simpler sub-problem of mapping units of code to a domain schema can be quite complex [66]. A more practical alternative to the automatic handling approach, which could add much value for practitioners, is a collection of best practices and design patterns that simplify this process. On the other hand, it might be simpler and more tractable to detect and/or handle the errors in other artifacts and run-time structures like messages, service definitions, lookup entries etc. However, to the best of our knowledge there is no work in this direction for ontology-based applications, and it is an interesting area of future research.
6.2.3 Parallel Inferencing for OWL
A direct extension of our work in parallel reasoning is the application of the hybrid partitioning approach, in which both the rule set and the data set are partitioned. The main problem is to determine what such a partitioning strategy might be. As an example, recall that we have demonstrated in our work that data partitioning is better suited to sparsely connected graphs, whereas rule partitioning works better for denser graphs. On non-uniform graphs, which have some subgraphs denser than others, it might be a good approach to perform data partitioning on the sparser subgraphs and rule partitioning on the denser subgraphs. We observe that the data partitioning approach scales badly for dense datasets largely due to the presence of long chains of transitive relations, e.g. the sameAs and differentFrom relations. These chains are generally the result of the application of only a few rules. Thus a hybrid partitioning in which some partitions apply the rules that create long chains in the graph, while data partitioning is used for the rest of the rule-base, might be an interesting approach. Finally, a hybrid partitioning algorithm that adapts the strategy to be used based on the input data is an intriguing problem.
We have addressed the inferencing problem for materialized knowledge bases. Another problem that extends our work is that of parallel query answering, where reasoning is performed only when a query is issued and only for the information relevant to it. Intuitively, much of our algorithm can be reused, since it can be thought of as a variant that answers the query "find all tuples". However, the complete algorithm and its performance are far from obvious, and this is an interesting area of future work.
We note that our algorithm works correctly for a popular subset of the features specified in the OWL specification. Thus an obvious extension is to address more expressive subsets and the complete OWL, including OWL-Lite, DL and Full. Since complete OWL reasoning cannot be performed by rule-based reasoners, the parallelization of the most complete subset of reasoning possible with rules, Description Logic Programs [57], is an interesting area of work. Parallel versions of tableau algorithms [19, 95] are also an area of interesting and active research. However, applications of such algorithms to large ontologies and knowledge bases are not available and are needed to validate the applicability of such techniques.
Reference List
[1] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. Scalable se-
mantic web data management using vertical partitioning. In VLDB ’07: Proceedings of
the 33rd international conference on Very large data bases, pages 411–422. VLDB En-
dowment, 2007.
[2] Harith Alani, David Dupplaw, John Sheridan, Kieron O’Hara, John Darlington, Nigel
Shadbolt, and Carol Tullo. Unlocking the potential of public sector information with se-
mantic web technology. In ISWC/ASWC, pages 708–721, 2007.
[3] Luc Albert and François Fages. Average case complexity analysis of the rete multi-pattern match algorithm. In ICALP, pages 18–37, 1988.
[4] J. Amaral. A Parallel Architecture for Serializable Production Systems. PhD thesis, The
University of Texas at Austin, Austin, TX. Electrical and Computer Engineering., 1994.
[5] José Luis Ambite and Dipsy Kapoor. Automatically composing data workflows with rela-
tional descriptions and shim services. In ISWC/ASWC, pages 15–29, 2007.
[6] Scott Ambler. Agile Database Techniques: Effective Strategies for the Agile Software
Developer. John Wiley and Sons, 2003.
[7] Jürgen Angele and Michael Gessmann. Integration of customer information using the
semantic web. In Cardoso Jorge, Hepp Martin, and Lytras Miltiadis, editors, The Semantic
Web. Real-world Applications from Industry, pages 191–208. Springer, 2007.
[8] Ali Arsanjani. Introduction to the special issue on developing and integrating enterprise
components and services. Commun. ACM, 45(10):30–34, 2002.
[9] Soren Auer. The rapidowl methodology–towards agile knowledge engineering. In 15th
IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative
Enterprises, 2006.
[10] Apache axis2, http://ws.apache.org/axis2/.
[11] David Bakken. Middleware. In Urban and P. Dasgupta, editors, Encyclopedia of
Distributed Computing. Kluwer, 2001.
[12] Amol Bakshi, Vaibhav Mathur, Sumit Mohanty, Victor K. Prasanna, Cauligi S. Raghaven-
dra, Mitali Singh, Aditya Agrawal, James Davis, Brandon Eames, Akos Ledeczi, Sandeep
Neema, and Greg Nordstrom. MILAN: A model based integrated simulation framework
for design of embedded systems. ACM SIGPLAN Notices, 36(8):82–87, 2001.
[13] Jay Banerjee, Won Kim, Hyoung-Joo Kim, and Henry F. Korth. Semantics and imple-
mentation of schema evolution in object-oriented databases. ACM SIGMOD Record, 16,
1987.
[14] M. Beckerle and M. Westhead. Ggf dfdl primer. Technical report, Global Grid Forum,
2004.
[15] Dave Beckett and Jan Grant. Swad-europe deliverable 10.2: Mapping semantic web data with rdbmses. Technical report, http://www.w3.org/2001/sw/europe/reports/scalable rdbms mapping report/, 2003.
[16] Khalid Belhajjame, Katy Wolstencroft, Óscar Corcho, Tom Oinn, Franck Tanoh, Alan Williams, and Carole A. Goble. Metadata management in the taverna workflow system. In CCGRID, pages 651–656, 2008.
[17] David A. Bell, J. Shao, and M. Elizabeth C. Hull. A pipelined strategy for processing
recursive queries in parallel. Data Knowl. Eng., 6:367–391, 1991.
[18] Tom Bellwood, Luc Clement, David Ehnebuske, Andrew Hately, Maryann Hondo,
Yin Leng Husband, Karsten Januszewski, Sam Lee, Barbara McKee, Joel Munter, and
Claus von Riegen. Uddi version 3.0.
[19] Frank W. Bergmann and Joachim Quantz. Parallelizing description logics. In 19th Annual
German Conference on Artificial Intelligence: Advances in Artificial Intelligence, 1995.
[20] Chad Berkley, Shawn Bowers, Matthew B. Jones, Bertram Ludäscher, Mark Schildhauer,
and Jing Tao. Incorporating semantics in scientific workflow authoring. In In Proceedings
of the 17th International Conference on Scientific and Statistical Database Management
(SSDBM’05, 2005.
[21] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. The Scientific
American, 2001.
[22] Philip A. Bernstein. Middleware: a model for distributed system services. Communica-
tions of the ACM, 39(2), 1996.
[23] Christian Bizer and Andy Seaborne. Treating non-rdf data as virtual rdf data. In 3rd
International Semantic Web Conference (ISWC), 2004.
[24] Biztalk server 2004 architecture, whitepaper
http://www.microsoft.com/technet/prodtechnol/biztalk/2004/whitepapers/architecture.mspx,
2003.
[25] Michael Blow, Yaron Goland, Matthias Kloppmann, Frank Leymann, Gerhard Pfau, Dieter
Roller, and Michael Rowley. Bpelj: Bpel for java, 2004.
[26] Barry W. Boehm, Chris Abts, A. Winsor Brown, Sunita Chulani, Bradford K. Clark, Ellis
Horowitz, Ray Madachy, Donald J. Reifer, and Bert Steece. Software Cost Estimation with
Cocomo II. Prentice Hall PTR, 2000.
[27] Patrice Boizumault. The Implementation of Prolog. Princeton series in Computer Sci-
ence, 1993.
[28] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search
engine. In Computer Networks and ISDN Systems, pages 107–117, 1998.
[29] Michael L. Brodie. Computer science 2.0: A new world of data management. In VLDB,
page 1161, 2007.
[30] Fred Brooks. The computer scientist as toolsmith ii. Communications of the ACM, 1996.
[31] Peter Buneman. Semistructured data. In 16th ACM Symposium on Principles of Database
Systems, pages 117–121, 1997.
[32] Huajun Chen, Yimin Wang, Heng Wang, Yuxin Mao, Jinmin Tang, Cunyin Zhou, Ainin
Yin, and Zhaohui Wu. Towards a semantic web of relational databases: A practical seman-
tic toolkit and an in-use case from traditional chinese medicine. In International Semantic
Web Conference, pages 750–763, 2006.
[33] Liming Chen, Nigel Shadbolt, Feng Tao, and Carole Goble. Engineering grid resources
metadata for resource and knowledge sharing. International Journal of Web Service Re-
search (JWSR) on Bridging Communities: Semantically Augmented Metadata for Services,
Grids, and Software Engineering, 2005.
[34] Center for interactive smart oilfield technologies, http://cisoft.usc.edu/.
[35] Kendall Grant Clark, Lee Feigenbaum, and Elias Torres. Sparql protocol for rdf, w3c
recommendation 15 january 2008; http://www.w3.org/trrdf-sparql-protocol/.
[36] Óscar Corcho, Pinar Alper, Paolo Missier, Sean Bechhofer, and Carole A. Goble. Grid metadata management: Requirements and architecture. In GRID, pages 97–104, 2007.
[37] Jorge Cardoso. Developing course management systems using the semantic web. In Car-
doso Jorge, Hepp Martin, and Lytras Miltiadis, editors, The Semantic Web. Real-world
Applications from Industry, pages 97–122. Springer, 2007.
[38] F. Curbera, Y. Goland, J. Klein, F. Leymann, D. Roller, and S. Weerawarana. Business
process execution language for web services, version 1.1. specification, May 2003.
[39] Jos de Bruijn, Dieter Fensel, Uwe Keller, and Ruben Lara. Using the web services mod-
elling ontology to enable semantic ebusiness. Communications of the ACM (CACM), Spe-
cial Issue on Semantic eBusiness, 48(12), 12 2005.
[40] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large
clusters. In OSDI, pages 137–150, 2004.
[41] Hasanat M. Dewan, Salvatore J. Stolfo, Mauricio Hernández, and Jae-Jun Hwang. Pre-
dictive dynamic load balancing of parallel and distributed rule and query processing. In
ACM-SIGMOD Intl. Conf. on Management of Data, pages 277–288, 1994.
[42] Julian Dolby, Achille Fokoue, Aditya Kalyanpur, Aaron Kershenbaum, Edith Schonberg,
Kavitha Srinivas, and Li Ma. Scalable semantic retrieval through summarization and re-
finement. In AAAI, pages 299–304, 2007.
[43] Francesco M. Donini. Complexity of reasoning. In Description Logic Handbook, pages
96–136, 2007.
[44] Christian Drumm, Jens Lemcke, and Daniel Oberle. Business process management and
semantic technologies. In Cardoso Jorge, Hepp Martin, and Lytras Miltiadis, editors, The
Semantic Web. Real-world Applications from Industry, pages 97–122. Springer, 2007.
[45] Ulrich Elsner. Graph partitioning - a survey, 1997.
[46] Weijian Fang, Simon Miles, and Luc Moreau. Performance analysis of a semantics-
enabled service registry. Concurrency and Computation: Practice and Experience,
20(3):207–224, March 2007.
[47] Mariano Fernández, Asunción Gómez-Pérez, and Natalia Juristo. Methontology: From onto-
logical art towards ontological engineering. In AAAI Spring Symposium, 1997.
[48] Aykut Firat. Information Integration Using Contextual Knowledge and Ontology Merging.
PhD thesis, Sloan School of Management, 2003.
[49] Daniela Florescu. Managing semi-structured data. ACM Queue, 3(8), 2005.
[50] Charles Forgy. Rete: A fast algorithm for the many patterns/many objects match problem.
Artif. Intell., 19(1):17–37, 1982.
[51] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for repre-
senting, querying, and automating data derivation. In 14th International Conference on
Scientific and Statistical Database Management (SSDBM), 2002.
[52] Peter Fox, Deborah L. McGuinness, Don Middleton, Luca Cinquini, J. Anthony Darnell,
Jose Garcia, Patrick West, James L. Benedict, and Stan Solomon. Semantically-enabled
large-scale science data repositories. In International Semantic Web Conference, pages
792–805, 2006.
[53] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Ele-
ments of Reusable Object-Oriented Software. Addison-Wesley.
[54] Sumit Ganguly, Abraham Silberschatz, and Shalom Tsur. A framework for the parallel
processing of datalog queries. In SIGMOD Conference, pages 143–152, 1990.
[55] Hector Garcia-molina, Yannis Papakonstantinou, Dallan Quass, Yehoshua Sagiv, Jeffrey
Ullman, Vasilis Vassalos, and Jennifer Widom. The tsimmis approach to mediation: data
models and languages. Journal of Intelligent Information Systems, 8:117–132, 1997.
[56] Alfio Massimiliano Gliozzo, Aldo Gangemi, Valentina Presutti, Elena Cardillo, Enrico
Daga, Alberto Salvati, and Gianluca Troiani. A collaborative semantic web layer to en-
hance legacy systems. In ISWC/ASWC, pages 764–777, 2007.
[57] Benjamin N. Grosof, Ian Horrocks, Raphael Volz, and Stefan Decker. Description logic
programs: combining logic programs with description logic. In 12th International Con-
ference on the World Wide Web (WWW-2003), pages 48–57, 2003.
[58] N Guarino. Formal ontology and information systems. In 1st International Conference on
Formal Ontologies in Information Systems, FOIS’98, 1998.
[59] Amarnath Gupta and Inder Singh Mumick, editors. Materialized views: techniques, im-
plementations, and applications. MIT Press, Cambridge, MA, USA.
[60] Anoop Gupta, Charles Forgy, Allen Newell, and Robert G. Wedig. Parallel algorithms and
architectures for rule-based systems. In ISCA, pages 28–37, 1986.
[61] Gopal Gupta, Enrico Pontelli, Khayri A. M. Ali, Mats Carlsson, and Manuel V.
Hermenegildo. Parallel execution of prolog programs: a survey. Programming Languages
and Systems, 23(4):472–602, 2001.
[62] Marie Gustafsson and Göran Falkman. Representing clinical knowledge in oral medicine
using ontologies. In Medical Informatics Europe, pages 743–748, 2005.
[63] Marie Gustafsson, Göran Falkman, Fredrik Lindahl, and Olof Torgersson. Enabling an
online community for sharing oral medicine cases using semantic web technologies. In
International Semantic Web Conference, pages 820–832, 2006.
[64] Alon Y. Halevy, Naveen Ashish, Dina Bitton, Michael Carey, Denise Draper, Jeff Pol-
lock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes,
challenges and controversies. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD
international conference on Management of data, pages 778–787, New York, NY, USA,
2005. ACM Press.
[65] Armin Haller, Emilia Cimpian, Adrian Mocan, Eyal Oren, and Christoph Bussler. Wsmx
- a semantic service-oriented architecture. In ICWS, pages 321–328, 2005.
[66] Joachim Hammer, Mark Schmalz, William O'Brien, Sangeetha Shekar, and Nikhil Halde-
vnekar. Seeking knowledge in legacy information systems to support interoperability. In
International Workshop on Ontologies and Semantic Interoperability(ECAI), 2002.
[67] David Hay. Data Model Patterns: A Metadata Map. Morgan Kaufmann, 2006.
[68] A. J. G. Hey and A. E. Trefethen. The data deluge: An e-science perspective. In Grid
Computing - Making the Global Infrastructure a Reality. Wiley and Sons.
[69] Jerry R. Hobbs and Feng Pan. An ontology of time for semantic web. ACM Transactions
on Asian Language Processing (TALIP)., 3(1), 2004.
[70] Ian Horrocks and Ulrike Sattler. A tableau decision procedure for SHOIQ. J. of Automated Reasoning, 39(3):249–276, 2007.
[71] Eero Hyvönen, Kim Viljanen, and Osma Suominen. Healthfinland - finnish health information on the semantic web. In ISWC/ASWC, pages 778–791, 2007.
[72] Ibm integrated ontology development toolkit,
http://www.alphaworks.ibm.com/tech/semanticstk.
[73] Ibm semantic layered research platform,
http://ibm-slrp.sourceforge.net/.
[74] William H Inmon. Building the Data Warehouse. John Wiley and sons inc., 2002.
[75] Jastor: Typesafe, ontology driven rdf access from java.
[76] Jena semantic web framework,
http://jena.sourceforge.net/.
[77] Cardoso Jorge, Hepp Martin, and Lytras Miltiadis, editors. The Semantic Web. Real-world
Applications from Industry. Springer, 2007.
[78] Aditya Kalyanpur and Daniel Jimnez. Automatic mapping of owl ontologies into java. In
Software Engineering and Knowledge Engineering, 2004.
[79] Dimka Karastoyanova, Bob Wetzstein, Tommo van Lessen, Daniel Wutke, Joerg Nitzsche,
and Frank Leymann. Semantic service bus: Architecture and implementation of a next
generation middleware. In 2nd International Workshop on Services Engineering (SEIW),
2007.
[80] Vipul Kashyap. From the bench to the bedside - the role of semantics in enabling the
vision of translational medicine. In HEALTHINF (1), pages 23–24, 2008.
[81] Vipul Kashyap, Kei-Hei Cheung, Donald Doherty, Matthias Samwald, Scott Marshall,
Joanne Luciano, Susie Stephens, Ivan Herman, and Raymond Hookway. Ontology based
data integration for biomedical research. In Cardoso Jorge, Hepp Martin, and Lytras Mil-
tiadis, editors, The Semantic Web. Real-world Applications from Industry, pages 97–122.
Springer, 2007.
[82] Vipul Kashyap and Amit Sheth. Semantic and schematic similarities between database
objects: a context-based approach. The VLDB Journal The International Journal on Very
Large Data Bases, 5(4), 2004.
[83] Jihie Kim, Yolanda Gil, and Varun Ratnakar. Semantic metadata generation for large
scientific workflows. In International Semantic Web Conference, pages 357–370, 2006.
[84] Michel Klein. Change Management for Distributed Ontologies. PhD thesis, Vrije Univer-
siteit Amsterdam, August 2004.
[85] Rubén Lara, Iván Cantador, and Pablo Castells. Semantic web technologies for the finan-
cial domain. In Cardoso Jorge, Hepp Martin, and Lytras Miltiadis, editors, The Semantic
Web. Real-world Applications from Industry, pages 41–74. Springer, 2007.
[86] Rob Lee and Steve Seligman. The Jndi API Tutorial and Reference: Building Directory-
Enabled Java Applications. Addison-Wesley Longman Publishing Co., Inc. Boston, MA,
USA, 2000.
148
[87] Pieter De Leenheer and Tom Mens. Ontology evolution: State of the art and future di-
rections. In Martin Hepp, Pieter De Leenheer, Aldo De Moor, and York Sure, editors,
Ontology Management. Springer, 2007.
[88] Xue Li. A survey of schema evolution in object-oriented databases. In International
Conference on Technology of Object-Oriented Language and Systems, 1999.
[89] Yaozhong Liang. Enabling active ontology change management within semantic web-
based applications. Technical report, School of Electronics and Computer Science, Uni-
versity of Southampton, 2006.
[90] Lehigh university benchmark,
http://swat.cse.lehigh.edu/projects/lubm/.
[91] Bertram Ludascher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew
Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the
kepler system. Concurrency and Computation: Practice and Experience, 18(10), 2006.
[92] Li Ma, Yang Yang, Zhaoming Qiu, GuoTong Xie, Yue Pan, and Shengping Liu. Towards
a complete owl ontology benchmark. In ESWC, pages 125–139, 2006.
[93] Daniel J. Mandell and Sheila A. McIlraith. Adapting bpel4ws for the semantic web: The
bottom-up approach to web service interoperation. In Second International Semantic Web
Conference (ISWC2003), 2003.
[94] David Martin, M Burstein, J Hobbs, Ora Lassila, D Mcdermott, S Mcilraith, S Narayanan,
M Paolucci, B Parsia, T Payne, E Sirin, N Srinivasan, and K Sycara. Owl-s: Semantic
markup for web services, w3c member submission 22 november 2004.
[95] Michael Mendler and Stephan Scheele. Towards constructive dl for abstraction and refine-
ment. In Description Logics, 2008.
[96] Sun Microsystems. Java 2 platform, enterprise edition (j2ee).
http://java.sun.com/j2ee/, 2004.
[97] Simon Miles, Sylvia C. Wong, Weijian Fang, Paul Groth, Klaus-Peter Zauner, and Luc
Moreau. Provenance-based validation of e-science experiments. Web Semantics: Science,
Services and Agents on the World Wide Web, 5(1):28–38, 2007.
[98] L. Moreau, Y . Zhao, J. V oeckler I. Foster, and M. Wilde. Xdtm: Xml dataset typing and
mapping for specifying datasets. In European Grid Conference, 2005.
[99] Mpj-express,
http://www.acet.rdg.ac.uk/projects/mpj/.
[100] David Norheim and Roar Fjellheim. Aksio - active knowledge management in the
petroleum industry. In European Semantic Web Conference (ESWC ’06) Industry Forum,
2006.
[101] Natalya F Noy and Mark A Musen. Promptdiff: A fixed-point algorithm for comparing
ontology versions. In Eighteenth National Conference on Artificial Intelligence (AAAI).
149
[102] Natalya Fridman Noy, Sandhya Kunnatur, Michel Klein, and Mark A Musen. Tracking
changes during ontology evolution. In Third International Conference on the Semantic
Web, 2004.
[103] The open anzo project: Semantic application middleware,
http://www.openanzo.org/.
[104] Oracle semantic technologies,
http://www.oracle.com/technology/tech/semantic technologies.
[105] Owlim semantic repository,
http://www.ontotext.com/owlim/index.html/.
[106] M Paolucci, T Kawamura, T R Payne, and Katie Sycara. Importing the semantic web in
uddi. In E-Services and the Semantic Web Workshop, 2002.
[107] M. P. Papazoglou and D. Georgakopoulos. Service oriented computing. Communications
of the ACM (CACM), 24(10), 2003.
[108] Chintan Patel, James J. Cimino, Julian Dolby, Achille Fokoue, Aditya Kalyanpur, Aaron
Kershenbaum, Li Ma, Edith Schonberg, and Kavitha Srinivas. Matching patient records to
clinical trials using ontologies. In ISWC/ASWC, pages 816–829, 2007.
[109] Peter F Patel-Schneider and Ian Horrocks. Owl web ontology language semantics and
abstract syntax, w3c recommendation
http://www.w3.org/tr/owl-semantics/.
[110] Lesslar P.C. and van den Berg F.G. Managing data assets to improve business performance.
In SPE Asia Pacific Conference on Integrated Modelling for Asset Management, 1998.
[111] Jorge Perez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of sparql.
In The Semantic Web - ISWC 2006, pages 30–43. Springer, 2006.
[112] Helena Sofia Pinto and Joo P. Martins. Ontologies: How can they be built? Knowledge
and Information Systems, 6, 2004.
[113] Peter Plessers, Olga De Troyer, and Sven Casteleyn. Understanding ontology evolution:
A change detection approach. Web Semantics: Science, Services and Agents on the World
Wide Web, 5, 2007.
[114] Posc caesar association, oil and gas ontology (ogo)
http://www.posccaesar.com/.
[115] The protege ontology editor and knowledge acquisition system,
http://protege.stanford.edu/.
[116] Eric Prudhommeaux and A Seaborne. Sparql query language for rdf, w3c recommendation
http://www.w3.org/tr/2008/rec-rdf-sparql-query-20080115/.
[117] John F. Roddick. A survey of schema versioning issues for database systems. Information
and Software Technology, 37:383–393, 1995.
150
[118] Luis Rodrigo, V . Richard Benjamins, Jes´ us Contreras, Diego Pat´ on, D. Navarro, R. Salla,
Mercedes Bl´ azquez, P. Tena, and I. Martos. A semantic search engine for the international
relation sector. In International Semantic Web Conference, pages 1002–1015, 2005.
[119] Jes´ us B. Rodr´ ıguez,
´
Oscar Corcho, and Asunci´ on G´ omez-P´ erez. R2o, an extensible and
semantically based database-to-ontology mapping language. In Second Workshop on Se-
mantic Web and Databases (SWDB2004). Toronto, Canada.
[120] Dumitru Roman, Uwe Keller, Holger Lausen, Jos de Bruijn, Michael Stollberg, Axel
Polleres, Cristina Feier, Christoph Bussler, and Dieter Fensel. Web service modeling on-
tology. Applied Ontology, 1:77–106, 2005.
[121] D. De Roure, N. R. Jennings, and N. R. Shadbolt. The semantic grid: Past, present and
future. Procedings of the IEEE, 93(3):669–681, 2005.
[122] Mary Tork Ruth and Peter Shwartz. A wrapper architecture for legacy data sources. In
International Conference on Very Large Databases (VLDB), 1997.
[123] Simon Schenk and Steffen Staab. Networked graphs: a declarative mechanism for sparql
rules, sparql views and rdf data integration on the web. In WWW ’08: Proceeding of the
17th international conference on World Wide Web, pages 585–594, New York, NY , USA,
2008. ACM.
[124] Hans-Peter Schnurr and J¨ urgen Angele. Do not use this gear with a switching lever! auto-
motive industry experience with semantic guides. In International Semantic Web Confer-
ence, pages 1029–1040, 2005.
[125] Sesame: Rdf schema querying and storage, http://www.openrdf.org/.
[126] J. Shao, David A. Bell, and M. Elizabeth C. Hull. Combining rule decomposition and
data partitioning in parallel datalog program processing. In Proceedings of the First In-
ternational Conference on Parallel and Distributed Information Systems (PDIS 1991),
Fontainebleu Hilton Resort, Miami Beach, Florida, December 4-6, 1991, pages 106–115.
IEEE Computer Society, 1991.
[127] Mary Shaw and David Garlan. Software Architecture: Perspectives on an Emerging Dis-
cipline. Prentice Hall.
[128] Amit Sheth. Enterprise applications of semantic web: The sweet spot of risk and compli-
ance. In Conference on Industrial Applications of Semantic Web, 2005.
[129] Y . Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIG-
MOD Record, 2005.
[130] Elena Paslaru Bontas Simperl and Christoph Tempich. Ontology engineering: A reality
check. In International Conference on Ontologies, Databases and Applications of Seman-
tics (ODBASE), 2006.
[131] Gurmeet Singh, Shishir Bharathi, Ann Chervenak, Ewa Deelman, Carl Kesselman, Mary
Manohar, Sonal Patil, and Laura Pearlman. A metadata catalog service for data intensive
applications. In ACM/IEEE conference on Supercomputing, 2003.
151
[132] Munindar Singh and Michael Huhns. Service-Oriented Computing: Semantics, Processes,
Agents. John Wiley and sons, 2005.
[133] Evren Sirin and Bijan Parsia. Sparql-dl: Sparql query for owl-dl. In 3rd OWL Experiences
and Directions Workshop (OWLED), 2007.
[134] Kaarthik Sivashanmugam, John A. Miller, Amit P. Sheth, and Kunal Verma. Framework
for semantic web process composition. International Journal of Electronic Commerce,
9(2), 2004.
[135] Ramakrishna Soma, Amol Bakshi, and Viktor Prasanna. An architecture of a workflow
system for integrated asset management in the smart oil field domain. In First IEEE Inter-
national Workshop on Scientific Workflows(SWF), 2007.
[136] Ramakrishna Soma, Amol Bakshi, Viktor Prasanna, and Will DaSie. A model-based
framework for developing and deploying data aggregation services. In 4th International
Conference on Service-Oriented Computing (ICSOC), 2006.
[137] Ramakrishna Soma, Amol Bakshi, Viktor Prasanna, Will Da Sie, and Birlie Bourgeois.
Semantic web technologies for smart oil field applications. In 2nd SPE Intelligent Energy
Conference and Exhibition, February 2008.
[138] Microsoft sql server integration service, http://msdn.microsoft.com/sql/bi/integration/.
[139] Susie Stephens, David LaVigna, Mike DiLascio, and Joanne Luciano. Aggregation of
bioinformatics data using semantic web technology. J. Web Sem., 4(3):216–221, 2006.
[140] Susie Stephens, Alfredo Morales, and Matthew Quinlan. Applying semantic web tech-
nologies to drug safety determination. IEEE Intelligent Systems, 21(1):82–86, 2006.
[141] Lilijana Stojanovic. Methods and Tools for Ontology Evolution. PhD thesis, University of
Karlsruhe, 2004.
[142] SJ Stolfo, HM Dewan, D Ohsie, and M Hernandez. A parallel and distributed environment
for database rule processing: open problems and future directions. Emerging Trends in
Database and Knowledge-Based Machine, 1995.
[143] Suzette Stoutenburg, Leo Obrst, Deborah Nichols, Ken Samuel, and Paul Franklin. Dy-
namic service oriented architectures through semantic technology. In ICSOC, pages 581–
590, 2006.
[144] York Sure, Christoph Tempich, and Elena Paslaru Bontas Simperl. Ontocom: A cost
estimation model for ontology engineering. In International Semantic Web Conference,
2006.
[145] Sweet ontologies: Semantic web for earth and environmental technologies,
http://sweet.jpl.nasa.gov/ontology/.
[146] Katie Sycara, Mike Paolucci, Anupriya Ankolekar, and Naveen Srinivasan. Automated
discovery, interaction and composition of semantic web services. Journal of Web Seman-
tics, 2003.
152
[147] Adrienne Tannenbaum. Metadata solutions: using metamodels, repositories, XML, and
enterprise portals to generate information on demand. Addison-Wesley, 2002.
[148] Herman J. ter Horst. Combining rdf and part of owl with rules: Semantics, decidability,
complexity. In International Semantic Web Conference, pages 668–684, 2005.
[149] Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, optimizing, and
executing plans for bioinformatics web services. VLDB Journal, Special Issue on Data
Management, Analysis and Mining for Life Sciences, 14(3), 2005.
[150] Ganesh C. Thakur and Abdus Satter. Integrated Waterflood Asset Management. PennWell
Books, 1998.
[151] Victor Vianu. Rule-based languages. Annals of Mathematics and Artificial Intelligence,
19(1-2):215–259, 1997.
[152] R. V olz, S. Staab, and B. Motik. Incremental maintenance of materialized ontologies. In
2nd Int.Conf. on Ontologies and Databases (ODBASE)., 2003.
[153] H. Wache, T. V¨ ogele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and
S. H¨ ubner. Ontology-based integration of information — a survey of existing approaches.
In H. Stuckenschmidt, editor, IJCAI–01 Workshop: Ontologies and Information Sharing,
pages 108–117, 2001.
[154] Kevin Wilkinson. Jena property table implementation. In Scalable Semantic Web Knowl-
edge Base Systems (SSWS), 2006.
[155] Ouri Wolfson and Aya Ozeri. Parallel and distributed processing of rules by data reduction.
IEEE Trans. Knowl. Data Eng., 5(3):523–530, 1993.
[156] Zhe Wu, George Eadon, Souripriya Das, Eugene Inseok Chong, Vladimir Kolovski, Mel-
liyal Annamalai, and Jagannathan Srinivasan. Implementing an inference engine for
rdfs/owl constructs and user-defined rules in oracle. In 24th International Conference
on Data Engineering (ICDE), 2008.
[157] Cong Zhang, Abdollah Orangi, Amol Bakshi, Will Da Sie, and Viktor K. Prasanna. Model-
based framework for oil production forecasting and optimization: A case study in inte-
grated asset management. In SPE Intelligent Energy Conference and Exhibition, April
2006.
[158] Weining Zhang, Ke Wang, and Siu-Cheung Chau. Data partition and parallel evaluation
of datalog programs. IEEE Trans. Knowl. Data Eng., 7(1):163–176, 1995.
[159] Jun Zhao, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, and Mark Green-
wood. Using semantic web technologies for representing e-science provenance. In Third
International Semantic Web Conference (ISWC 2004 ), 2004.
[160] Qunzhi Zhou, Amol Bakshi, Viktor Prasanna, and Ramakrishna Soma. Towards an inte-
grated modeling and simulation framework for freight transportation in metropolitan areas.
In IEEE International Conference on Information Reuse and Integration (IRI), 2008.
153
Abstract
The two main goals of this thesis are to demonstrate applications of semantic web technologies in domains with semi-structured data and to propose scalable techniques for building such applications. To this end, we first describe an adaptation of the agile methodology for building large-scale semantic web applications. Under this methodology the ontology is modified continually, so the artifacts that depend on it, including queries, messages, and application code, must also be updated to remain consistent with the new ontology. We propose a novel technique that detects the SPARQL queries that need to be modified as a result of changes to an OWL ontology. We present an implementation of our technique as an extension to a popular ontology development tool, making it a convenient environment for the ontology engineer in our methodology.
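As a rough illustration of the kind of check the change-detection technique performs, the sketch below flags SPARQL query strings that mention any ontology term affected by a change. It is a minimal approximation, not the technique described in the thesis: the class AffectedQueryDetector, the example IRIs, and the plain textual matching on full IRIs are assumptions made for illustration, whereas the actual approach analyzes the ontology change and the query structure at the semantic level.

// Illustrative sketch only (standard Java, no semantic web libraries).
// Flags SPARQL query strings that mention ontology terms affected by a change.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AffectedQueryDetector {

    // Returns the queries that reference at least one changed ontology term.
    static List<String> findAffectedQueries(List<String> queries, Set<String> changedTerms) {
        List<String> affected = new ArrayList<>();
        for (String query : queries) {
            for (String term : changedTerms) {
                if (query.contains(term)) {   // naive textual match on full IRIs
                    affected.add(query);
                    break;
                }
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        // Hypothetical ontology change: the class ex:OilWell was renamed or removed.
        Set<String> changedTerms = Set.of("http://example.org/ontology#OilWell");

        List<String> queries = List.of(
            "SELECT ?w WHERE { ?w a <http://example.org/ontology#OilWell> }",
            "SELECT ?s WHERE { ?s a <http://example.org/ontology#Sensor> }");

        // Only the first query is flagged as needing revision.
        findAffectedQueries(queries, changedTerms).forEach(System.out::println);
    }
}

In practice one would also distinguish the kind of change (for example, a renamed property versus a deleted class), since that determines whether an affected query can be rewritten automatically or must be revised by the ontology engineer.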