A STATISTICAL ONTOLOGY-BASED APPROACH TO RANKING FOR MULTI-WORD SEARCH
by
Jinwoo Kim
______________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2013
Copyright 2013 Jinwoo Kim
Table of Contents
List of Figures and Tables
Abstract
Chapter 1. Motivation
Chapter 2. Hypothesis
Chapter 3. Related Work
3.1. N-Gram
3.2. Query Prediction
3.3. Proximity Search
3.4. Key Word in Context (KWIC)
3.5. Query Answering
3.6. Observations
Chapter 4. Research Approach
4.1. Overview and General Architecture
4.2. Corpus and Test Data
4.3. Analyzer and Indexer
4.4. New Query Interface
4.5. Ontology Builder
4.5.1. Local Ontology Building
4.5.2. Global Ontology Building
4.6. User Interface
Chapter 5. Experiment Validation
5.1. Experiment Data
5.2. Experiment Methodology
5.3. Experiment Result
5.4. Experiment Conclusion
Chapter 6. Experiment Demonstration of Ontology Construction
6.1. Utilizing Commercial Search Engine
6.2. Extracting Topic Relevant Objects
6.3. Extracting Ontological Describer
6.4. Analysis of Experiment Results
6.4.1. Data Analysis of Topic Relevant Objects
6.4.2. Data Analysis of Ontological Describer
6.4.3. Refining Built Ontology
6.4.4. Structure of Ontology
6.4.5. Built Ontology
6.4.6. Utilizing Built Ontology
6.4.7. Extracting Topic from Describer and Object
6.5. Experiment Conclusion
Chapter 7. Contributions
References
List of Figures and Tables
Figure 1. Ontology Based Search System Architecture.
Figure 2. New Query Interface
Figure 3. Building Local Ontology
Figure 4. Global Ontology Building
Figure 5. User Interface for Search Results
Figure 6. Precision Comparison Result
Figure 7. Recall Rate Comparison Result
Figure 8. Average of TERC’s evaluation value comparison result
Figure 9. Average precision of each factoid-type query
Figure 10. Extracting the most frequent matching keyword.
Figure 11. Process of extracting ontological describer
Figure 12. Structure of built ontology
Table 1. Average precision of each query
Table 2. Average Recall Rate of Each Search Approach
Table 3. Average Evaluation Value of Each Query
Table 4. Summary of Experiment Results
Table 5. Average precision of each query
Table 6. Notation
Table 7. Indexed objects and its frequency
Table 8. Comparison between unfiltered ontological describer (up) and
filtered ontological describer (bottom).
Table 9. Ontological describer for General Ti and On
Table 10. Built ontology
Abstract
Keyword search is a prominent data retrieval method for the Web, largely because the
simple and efficient nature of keyword processing allows a large amount of information to
be searched with fast response times. However, keyword search approaches do not formally
capture the clear meaning of a keyword query and fail to address the semantic relationships
between keywords. As a result, the accuracy (precision and recall rate) is often
unsatisfactory, and the ranking algorithms fail to properly reflect the semantic relevance of
keywords. Our research particularly focuses on increasing the accuracy of search results for
multi-word search. We propose a statistical ontology-based semantic ranking algorithm
based on sentence units, and a new type of query interface including wildcards. First, we
allocate higher-ranking scores to keywords located in the same sentence compared with
keywords located in separate sentences. While existing statistical search algorithms such as
N-gram only consider sequences of adjacent keywords, our approach is able to calculate
sequences of non-adjacent keywords as well as adjacent keywords. Second, we propose a
slightly different type of query interface, which considers a wildcard as an independent unit
of a search query to reflect what users are actually seeking by way of the function of query
prediction based on not query data but actual Web data. Unlike current information retrieval
approaches such as proximity, statistical language modeling, query prediction and query
answering, our statistical ontology-based model synthesizes the proximity concept and statistical approaches into a form of ontology. This ontology helps to improve web information retrieval accuracy.
We validated our methodology with a suite of experiments using the Text Retrieval
Conference document collection. We focused on two-word queries in our experiments, as
two-word queries are quite common. After applying our statistical ontology-based
algorithm to the Nutch search engine, we compared the results with results of the original
Nutch search and Google Desktop Search. The results demonstrate that our methodology improves accuracy significantly.
Chapter 1. Motivation
At present, keyword search is the dominant search method due to its efficiency. However, it does not systematically capture a semantic understanding of keywords, because it is difficult to identify the correct meaning of each keyword without considering its semantic relations or its meaning in the context of a complete sentence. Consequently, the ranking of search results is often disappointing. When a user searches for information on the web through a search engine, if the information he or she is trying to find does not appear among the top-ranked results, the user normally takes the trouble to search again with a new query rather than flipping through the next pages [3, 41]. This trouble arises because current ranking algorithms are not able to properly capture the semantic relevance between the query and the web content.
In this context, knowledge-based information retrieval methods have been introduced to improve keyword search. Among them, one approach uses domain-specific rules to extract information from the web [9, 10, 11], and another extracts ontology from the specific structure of HTML [12, 13]. However, existing knowledge-based search methods still fail to capture semantic relationships between keywords, so their application is limited to domain-specific information retrieval. In order to extend the coverage of text retrieval over a non-topic-specific corpus, a search method that understands keywords on a semantic level is needed. However, capturing semantic relationships between keywords on the Web is difficult for previous search engines, because most of them are based mainly on statistical techniques, which remove information deemed unnecessary, including the semantic relationships between keywords, while indexing documents for keyword search.
The primary advantage of our statistical ontology-based search model is that it gives the user a semantically more relevant ranking of search results. Our research is an attempt to improve the ranking algorithm of multi-word search, as this statistical ontology-based approach applies only to multi-word search. Existing search engines are known to use a number of factors to determine keyword search rankings, such as title, anchor, URL, large-font plain text, small-font plain text, PageRank, within-document frequencies, inverse document frequencies, document lengths, etc. [1, 8, 40]. Among them, for multi-keyword search, the most important ranking factors are frequency and proximity [1, 39, 40, 44, 45, 46, 47, 48]. One of the main problems with the current ranking algorithms for multi-word search arises from the fact that they calculate the relevance of keywords only by proximity, without considering whether the keywords occur in the same sentence.
Another problem is that current search algorithms fail to capture the semantic relevance of keyword sequences when the keywords are not adjacent, due to the insertion of other words not included in the query, such as adverbs or adjectives.
There is yet another problem with current search algorithms. Users often encounter situations where they do not know exactly what information they are actually searching for, but only know keywords related to it. Most current search engines cannot handle this situation effectively, because they produce search results based mainly on the query terms, which are all that users know. To solve this problem, our approach introduces a new form of interface that includes a wildcard.
Our research does not try to replace existing search algorithms with our new model; rather, it tries to improve the accuracy (the precision and the recall rate) of multi-word search results by solving the aforementioned problems of current search algorithms. This means that existing search engines can adopt our method in addition to their existing search methods, helping users get more semantically relevant search rankings.
Chapter 2. Hypothesis
This research attempts to solve the problem with a couple of hypotheses. The first
hypothesis is that if we create a statistical ontology-based semantic search model, which
allocates higher-ranking scores to keywords in the same sentence compared with keywords
located in separate sentences, we will be able to produce more semantically relevant search
rankings. This is because calculating the relevance of a query by sentence unit reflects what a user is actually searching for better than simply considering distances between keywords.
The second hypothesis is that introducing a new form of query interface which places one
or more wildcards between keywords or at the beginning or at the end of a multi-word
query will allow search engines to return what users are searching for more effectively. For
example, if a user wonders what a shark did to a victim, he or she can enter the query [shark], [wildcard], [victim]. This new query interface helps our model treat keywords that occur more frequently in the position of the wildcard as more relevant to what users are actually looking for. We expect that our new query interface will work particularly well for question-type queries.
Chapter 3. Related Work
While the exact details of how current search engines perform their indexing and rank
query results are kept mostly secret for competitive reasons and to prevent manipulation by
end users, it is generally understood that [1, 23, 24, 25]:
1. Crawlers are fed numerous seed URLs, tokenize the text of the web pages they find for analysis, then follow the links on those pages and repeat the process.
2. “Stop words,” which are words that are high in frequency but provide mostly noise
with regard to search precision, are not indexed.
3. Word proximity, word frequency, and document frequency are all taken into
consideration in ranking documents.
4. Trust metrics are applied to further boost or decrease result rankings before they are
returned to the user running the search.
The most important factors which current search engines, including Google, adopt to
determine their ranking results for multi-keyword search are frequency and proximity [1,
25]. One of the main problems with the current ranking algorithm of multi-word search
arises from the fact that its methodology calculates the relevance of keywords only by their
proximity without considering whether they exist in the same sentence or not. For this
reason, this method fails to consider the possibility that multiple neighboring keywords
have no relevance to each other, for example, when one word is placed at the end of the first sentence and the other at the beginning of the second sentence. Another problem is
that even when multiple keywords are semantically closely relevant, if other words such as
modifiers are inserted between them, the current ranking methodology calculates their
relevancy as low. A third problem is that this methodology cannot recognize semantic differences between keywords whose order is reversed, for instance, “dog eat” and “eat dog”. To overcome these problems, the following technologies have been introduced.
3.1. N-Gram
The N-Gram is the most representative methodology introduced to solve these problems with the ranking algorithms of current search engines. It assigns the highest scores to the words that are statistically most likely to appear in the keyword group, by referring to the keyword orders saved in the document collection. There is a similarity between the N-Gram and our statistical ontology-based
relevancy of the relation of the query and the web documents. However, the N-Gram
cannot correctly calculate the relevancy of keywords when other words, such as modifiers,
are inserted between the keywords. This is because the N-gram builds ‘N-grams’ based
only on contiguous sequences of keywords in the document collection. For this reason, the
N-Gram has to create automata to calculate every case statistically. This means that as the
size of the n-gram increases, the size of the language model increases exponentially; for example, trigram models are likely to saturate even within a few billion words [16]. For this reason, the N-Gram is not a very efficient tool for large-scale web data retrieval [15].
There is another problem with the N-Gram. The higher the n, the more accurate the resulting language model. However, as n increases, so does the chance that new data will contain configurations never encountered in the training corpus; this is called the sparse data issue [16]. To mitigate it, various smoothing techniques have been introduced, such as counting the (n-1)-, (n-2)-, ..., 1-grams and using a weighted sum of them to derive the probability estimate [16, 17, 18]. However, these techniques have not been able to entirely alleviate the sparse data issue.
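
To make this contiguity limitation concrete, the following minimal Python sketch (our own illustration, not code from the thesis) counts contiguous n-grams; because only adjacent tokens form a gram, an inserted modifier means that two related keywords are never counted as a pair:

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count contiguous n-grams in a token sequence.
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    tokens = "the shark attacked the helpless victim".split()
    bigrams = ngram_counts(tokens, 2)
    # ("shark", "attacked") is counted, but ("shark", "victim") never is:
    # the intervening words break the contiguous sequence.
    print(bigrams[("shark", "attacked")])   # 1
    print(bigrams[("shark", "victim")])     # 0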
Unlike the N-Gram, our statistical ontology-based model can express connections between
keywords by means of the vector space model [14]. Using the vector space model, our
approach can calculate where and how frequently the keywords are located in the document collection and create statistical ontology-based language models. As a result,
unlike the N-Gram, which only considers keyword order and frequency for information retrieval, our methodology additionally adopts sentence units and the wildcard as a basis for
analysis. Considering whether the keywords are located in the same sentence or not allows
us to better determine the semantic relevancy of the keywords placed non-contiguously.
This also helps us to reduce the amount of data to be calculated because our methodology
can recognize and categorize non-contiguous keywords as an equally semantically relevant
group, whereas the N-Gram categorizes non-contiguous keywords as a semantically different group, so that it has to calculate non-contiguous and contiguous keywords separately for information retrieval [33, 34].
While the N-Gram only predicts words following or preceding the query keywords, our new query interface using the wildcard allows us to predict words between the query keywords.
Moreover, the wildcard helps our methodology to recognize the keyword sequence as
equally semantically relevant even when other words such as modifiers are inserted
between the keywords. For this reason, with our methodology, the sparse data issue of the
N-Gram does not occur. However, the N-Gram, which only calculates contiguous keywords,
cannot perform this function because other words, such as modifiers between the keywords,
serve as obstacles preventing the N-Gram from recognizing the keywords as semantically
relevant.
3.2. Query Prediction
Query prediction [35, 36, 37, 43] is a methodology that suggests or completes a predicted query based on the user's original query, mainly to improve the precision of search results. It adopts the N-gram to perform this function, which means it inherits the N-gram's drawbacks: it can neither process a large amount of data efficiently nor predict non-contiguous keywords.
While our approach adopts a predictive manner to find subsets of query-relevant documents
on the actual web, the query prediction uses keyword prediction to complete users’ queries
for their convenience and reference as well as for the improvement of search precision.
Therefore, while the query prediction uses only a complete form of queries for their
information retrieval, our approach retrieves information by means of an incomplete form
of query including wildcard. One of the advantages of our methodology is that if users do
not know exactly what the answer is to their question, the incomplete form of query
including wildcard is useful for finding the possible answers in the actual web data because
our system produces a list of keywords which most frequently occur in the position of
wildcard based on the statistically developed ontology. By contrast, the current prediction
system cannot perform information retrieval with an incomplete form of queries; therefore
it cannot perform the function of predicting what is likely to be in the location of wildcard
in the query assigned.
The current query prediction is only able to predict keywords that are placed contiguously, whereas our approach can calculate and predict both contiguous and non-contiguous keywords. The other difference between our statistical ontology-based approach and current query prediction systems is that while they return results based only on keyword logs [35, 36, 37, 43] or a limited amount of actual web data, our system returns query prediction results based on all the web data related to the query keywords.
3.3. Proximity Search
Proximity search is based on the idea that semantically relevant keywords are placed close together. As several studies on proximity search show [39, 40, 44, 45, 46, 47, 48], this method has improved information retrieval a great deal. To calculate proximity, proximity search considers the distance between keywords. The difference between our statistical ontology-based approach and current proximity search is that while the proximity approach considers only the distance between keywords, our approach additionally calculates the distance between keywords and the wildcard. The main purpose of calculating the distance between the wildcard and the keywords is that it allows our model to calculate proximity between the user's query and what the user is actually looking for when the user does not know the exact answer. Our method is especially useful when users know only keywords related to the answer but do not know the answer itself.
In addition, among many proximity algorithms, our approach adopts the method of
calculating proximity based on sentence unit. This method calculates the proximity of
keywords by presenting the links of keywords based on sentence unit and assigning the same value to keyword groups within the same sentence. Unlike proximity search, which
only considers distance between keywords, this method allows us to calculate how
frequently the keywords given occur by sentence unit, thereby creating statistical ontology
based on proximity and frequency of keywords. The merit of considering both proximity
and frequency of keywords helps our model to produce more semantically accurate search
17
results than proximity search because human language mostly expresses different meanings
by sentence unit. This means that our method of categorizing human language by sentence
unit can offer a more semantically relevant way of calculating proximity values of
keywords.
Another benefit of our approach over current proximity approaches concerns the case where other words such as modifiers are inserted between query keywords in the same sentence: in current proximity approaches the distance between the query keywords grows, and as a result their semantic relevance is scored low. In the same situation, our approach assigns keywords with in-between modifiers the same proximity value as keywords without them. This way, our approach produces more semantically relevant search results by ignoring elements such as modifiers, which are not semantically relevant to what the user is actually looking for.
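
To make the sentence-unit idea concrete, here is a minimal Python sketch (our own illustration; the score values and function name are hypothetical, not the thesis implementation) that gives a keyword pair the same, higher score whenever both terms fall inside one sentence, regardless of intervening modifiers:

    import re

    def sentence_unit_score(text, kw1, kw2,
                            same_sentence=2.0, same_document=1.0):
        # Naive sentence-unit scoring: a pair co-occurring inside one
        # sentence earns the higher score no matter how many words
        # separate the two terms.
        score = 0.0
        for sentence in re.split(r"[.!?]", text.lower()):
            words = re.findall(r"[a-z]+", sentence)
            if kw1 in words and kw2 in words:
                score += same_sentence
        if score == 0.0:
            words = re.findall(r"[a-z]+", text.lower())
            if kw1 in words and kw2 in words:
                score = same_document  # co-occurs only across sentences
        return score

    text = "The shark suddenly attacked the victim. Rescuers came too late."
    print(sentence_unit_score(text, "shark", "victim"))  # 2.0: same sentence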
3.4. Key Word in Context (KWIC)
The Key Word in Context (KWIC) is the most common format for concordance lines,
which was introduced to present data in a more efficient form by hiding information. As Parnas, the inventor of the system, explains, this system “accepts an ordered set of
lines, each line is an ordered set of words, and each word is an ordered set of characters.
Any line may be ‘circularly shifted’ by repeatedly removing the first word and appending it
at the end of the line” [38].
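
Parnas's description translates directly into code; the following small Python sketch (our own, for illustration only) generates the circular shifts of a line:

    def circular_shifts(line):
        # Per Parnas's KWIC description: repeatedly remove the first
        # word and append it at the end of the line.
        words = line.split()
        return [" ".join(words[i:] + words[:i]) for i in range(len(words))]

    for shift in sorted(circular_shifts("statistical ontology based ranking")):
        print(shift)
    # based ranking statistical ontology
    # ontology based ranking statistical
    # ranking statistical ontology based
    # statistical ontology based ranking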
There is a similarity between the KWIC and our approach in the sense that our approach
also represents the document collection as an ordered set of words through a modified vector space model. This allows our model to present and process data much more efficiently.
The difference between our statistical ontology-based index and the KWIC index system
[38] is that while the KWIC only considers the sequences of keywords, our model
considers the frequency of keywords located in the position of wildcard as well as the
sequences of keywords. Therefore, our algorithm allows us to ignore irrelevant words such
as modifiers by way of calculating the frequency of keywords located in the position of
wildcard. That is, our model recognizes words which occur in the position of wildcard less
frequently as irrelevant to what we are actually looking for. For instance, we could say, the
following keywords groups are retrieved in the location of wildcard: “music player IPod”,
“Apple IPod”, and “IPod 15G version”. In this case, our model recognizes “IPod” as the
most semantically relevant to what users are looking for because the word ‘IPod” occurs
most frequently.
3.5. Query Answering
A query answering system is a model that produces an answer to a question-type query by referring to a given corpus [49, 50, 51]. The query answering system and our model have in common that both try to answer what users are asking. The difference between them is that while the query answering system only accepts a sentence-type question, our model accepts a group of keywords including a wildcard.
There is a particular type of query answering system suggested by Kisuh Ahn and Bonnie
Webber, which adopts a hypothesis similar to ours in order to produce more precise
answers [50]. They hypothesized and proved that “the terms that more frequently appear in
… document characterise the relation between the two topics in statistical fashion, and this
document would be given a higher score for retrieval with respect to a question” [50]. Our
model also allocates higher-ranking values to the documents which contain more frequently
occurring keywords. The difference between our model and the query answering system is
that our statistical approach produces answers as a form of statistically created ontology,
thereby increasing the accuracy (the precision and the recall rate) of multi-word search.
The query answering system interprets a sentence-type question to ensure a precise answer-mining process, and it requires a large amount of pre- and post-processing of data [50] because it extracts answers by first creating a group of question-relevant local corpora from the whole corpus. As a result of this complex processing, users can expect relatively precise answers to their questions, but it takes a long time because a large amount of data must be processed. Our approach takes less time than the query answering system requires, because our model considers only the sequences and frequency of keywords and does not need to create local corpora. Hence, our algorithm is more suitable for large-scale web information retrieval.
3.6. Observations
In order to overcome the aforementioned problems with information retrieval approaches,
our approach considers several elements together, such as whether keywords are located in the same sentence, how frequently keywords appear in the location of the wildcard, and the distance between keywords and the wildcard. Moreover, our approach creates statistical
language models based on the order and frequency of keywords found in the target web
documents. We calculate the relevancy of the query and the document collections by
referring to our ontology, which actually works as a statistical language model. This way,
our approach helps to offer more semantically precise values of the relationships between
the inserted query and the document collection, thereby placing more semantically relevant
web pages higher in the rankings.
Chapter 4. Research Approach
In order to improve information retrieval, we have adopted a method of developing a
statistical ontology. This approach is based on the idea that in order to improve search
results, search engines should understand the semantic relevancy of keywords by sentence
unit because human languages tend to express different meanings by sentence unit. Our
ontology-based search model better reflects the semantic relevancy of keywords because
our ontology has been built based on data extracted by sentence unit. We have also
developed a new query interface to handle the situation where users do not know exactly what the answer to their question is, letting them express what they are looking for in the incomplete form of a keyword query. This method also helps us build our statistical ontology more efficiently.
4.1. Overview and General Architecture
In order to test our hypothesis that referring to the ontology [27, 31, 32] and adopting a new
query interface produce more semantically relevant search rankings, we have developed a
statistical ontology-based semantic search model. Using the model, we built the ontology, created a new query interface, and generated and re-ranked a query-relevant subset of the corpus of English news text from the TREC [7], with a list of questions and answers for each query. As shown in Fig. 1, our statistical ontology-based
search system consists of a new query interface, an ontology builder, an analyzer, an indexer, a user interface, and the corpus. The detailed process of each step is described in the following sections.
Figure 1. Ontology-Based Search System Architecture
4.2. Corpus and Test Data
Evaluating search results quantitatively is a difficult task. For this reason, to show the effectiveness of our statistical ontology-based ranking algorithms, we used the 2007 Question Answering data from TREC (Text REtrieval Conference) [7, 30]. This data set, called AQUAINT 2, contains about a million news articles and was developed for the TREC, whose main goal is to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. To this end, AQUAINT 2 supplies a set of questions and answers that allow search performance to be evaluated quantitatively, and we used it to evaluate our search model in this way.
4.3. Analyzer and Indexer
As Fig. 1 shows, our statistical ontology-based semantic search is built on the index structure of search engines. To index our data, we used and modified the open source Apache Lucene indexer (version 2.9.1) [6] and pointed it at all of our individual TREC documents. We began with the built-in stop words analyzer while modifying the whitespace tokenizer and filter. Whereas existing search engines remove sentence delimiters while indexing, and so cannot process data by sentence unit, our approach has the tokenizer discard all symbols other than sentence delimiters such as periods. In addition to removing tags, whitespace, and stop words such as “and”, “the”, and “to”, we added “www”, “http”, “copyright”, and other words that appear frequently in the footers of web pages and are unnecessary within the domain of our corpus. Users can edit a text file to add or remove stop words, or specify additional ones on the command line, and so can change the stop word list for various types of corpus.
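
The following Python sketch approximates the tokenizer behavior described above (it is our own simplification, not the modified Lucene code, which is written in Java): symbols are discarded except sentence delimiters, and an explicit boundary token is emitted so the indexer can still see sentence units:

    import re

    DEFAULT_STOP_WORDS = {"and", "the", "to", "www", "http", "copyright"}

    def tokenize_keep_sentences(text, stop_words=DEFAULT_STOP_WORDS):
        # Lower-case the text, strip tags and all symbols other than the
        # sentence delimiters . ! ?, drop stop words, and emit an explicit
        # sentence-boundary token.
        text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
        text = re.sub(r"[^a-z0-9.!?\s]", " ", text.lower())   # keep delimiters
        tokens = []
        for raw in text.split():
            word = raw.rstrip(".!?")
            if word and word not in stop_words:
                tokens.append(word)
            if raw != word:              # a sentence delimiter was attached
                tokens.append("</s>")
        return tokens

    print(tokenize_keep_sentences("The shark attacked the victim. It escaped!"))
    # ['shark', 'attacked', 'victim', '</s>', 'it', 'escaped', '</s>']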
4.4. New Query Interface
Our statistical ontology-based search model adopts a new form of query interface, which
we have developed for our approach, as shown in Fig. 2. The query interface window has three input boxes. To enter a two-word query, users type the two keywords into any two of the three boxes. The remaining box is always left blank, acting as a wildcard that predicts what users are actually looking for. For example, if a user wonders what a shark did to a victim, he or she can enter the query [shark], [wildcard], [victim]; if a user wonders what the shark attacked, he or she can enter [shark], [attack], [wildcard].
Figure 2. New Query Interface
One of the main functions of our new query interface is to allow users to express what they
are looking for in the form of the wildcard if they do not exactly know what it is. Users
often encounter situations where they do not exactly know the information they are actually
searching for, but where they only know keywords related to it. However, most current
search engines are not able to handle this situation effectively because they produce search results based mainly on the query terms, which are all that users know. By contrast, our query interface is able to tackle this problem by means of the wildcard.
In addition, our new query system allows users to predict words between the query
keywords, whereas the N-gram only predicts words following or preceding the query
keywords. This new query interface using the wildcard significantly reduces the amount of data to process for building the ontology. Moreover, our ontology is able to return the most
frequently used keywords in the location of wildcard from the actual web data. This method
can also work as a query prediction system.
Our new query system calculates the frequency of keywords occurring in the location of the wildcard within our ontology, thereby creating a statistical language model for the wildcard position. Based on this statistical language model, we build the ontology and increase ranking values by referring to it.
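
As a minimal sketch of the interface logic (our own representation; the thesis does not publish this code), the three input boxes can be normalized into a query pattern in which the single blank box becomes an explicit wildcard slot:

    WILDCARD = None  # the blank input box

    def parse_query(box1, box2, box3):
        # Normalize the three input boxes into a query pattern; exactly
        # one empty box acts as the wildcard slot to be predicted.
        boxes = [b.strip() or WILDCARD for b in (box1, box2, box3)]
        if boxes.count(WILDCARD) != 1:
            raise ValueError("exactly one box must be left blank as the wildcard")
        return boxes

    print(parse_query("shark", "", "victim"))  # ['shark', None, 'victim']
    print(parse_query("shark", "attack", ""))  # ['shark', 'attack', None]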
4.5. Ontology Builder
4.5.1. Local Ontology Building
Figure 3. Building Local Ontology
We build a local ontology by referring to the frequency of keywords located in the place of the wildcard in a document. The first step in building the local ontology is that when keywords are entered into our query interface, our ontology search model finds documents containing the keywords through an inverted index. The inverted index we have built allows our statistical ontology-based search model to recognize whether keywords are located in the same sentence. For this reason, our search model is able to find sentences containing both the query keywords and the keywords in the location of the wildcard within the same sentence. Our model categorizes all the keywords located at the wildcard position within such a sentence as candidate answers for the wildcard and decides which word is a correct answer by calculating their frequency. This method allows our search model to recognize separately located keywords as equally semantically relevant if they are located within the same sentence, even when other words such as modifiers are inserted between the keywords.
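
A minimal Python sketch of this step (our own illustration, using a plain sentence scan rather than the thesis's inverted index) finds sentences containing both query keywords and counts the words between them as wildcard candidates:

    import re
    from collections import Counter

    def wildcard_candidates(document, kw1, kw2):
        # For every sentence containing both query keywords, collect the
        # words between them as wildcard candidates and count frequency.
        counts = Counter()
        for sentence in re.split(r"[.!?]", document.lower()):
            words = sentence.split()
            if kw1 in words and kw2 in words:
                i, j = words.index(kw1), words.index(kw2)
                counts.update(words[min(i, j) + 1:max(i, j)])
        return counts

    doc = ("The shark attacked the victim. "
           "A shark brutally attacked another victim. "
           "The shark circled the boat.")
    print(wildcard_candidates(doc, "shark", "victim").most_common(3))
    # [('attacked', 2), ('the', 1), ('brutally', 1)]; stop words such as
    # 'the' are filtered out in the full pipeline before counting.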
When we count the frequency of each keyword, we treat its various language forms, such as plurals or past tense, the same as the original form. For this reason, all keywords are lower-cased, and we normalize terms with the Porter stemming algorithm [4, 26]. Normalizing search results using the stemming algorithm allows us to return more accurate results and cover a wider range of the corpus, since we do not have to be concerned with verb tenses or conjugation. Doing this comes with a performance cost, however, as we need separate indices for stemmed and non-stemmed versions of words in order to accurately return document positions as well as term frequencies. Once we stem keywords during the indexing process, it is not possible to recover the original form of a word. Hence, in order to supply and utilize the original forms of words, we need to keep both a stemmed and a non-stemmed index.
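
As a small usage sketch (assuming NLTK's implementation of the Porter stemmer; the thesis cites the algorithm [4, 26] but does not show its code), stemming collapses surface forms into one index term, and its lossiness is exactly why the non-stemmed index must be kept alongside:

    from nltk.stem.porter import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ("attack", "attacked", "attacking", "attacks")])
    # ['attack', 'attack', 'attack', 'attack']  (one index term for all forms)
    print(stemmer.stem("victims"))
    # 'victim'  (the original surface form is no longer recoverable)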
Even though processing data by sentence unit increases the precision of search results, it can also decrease the recall rate. However, normalizing search results using the stemming algorithm helps to mitigate the drop in recall. This normalization also allows us to better capture the precise meanings of keywords, thereby creating a more precise ontology.
4.5.2. Global Ontology Building
As Fig. 4 shows below, we construct an ontology using keyword links within the same sentence, and we allocate higher scores to keywords located in the place of the wildcard if they are located in the same sentence as the query keywords. The link scores of each sentence, each document, and all query-relevant documents are calculated and summed through global ontology merging.
Figure 4. Global Ontology Building
We build the global ontology by merging local ontologies. Referring to the global ontology, we can calculate the frequency of keywords located in the place of the wildcard across the whole document collection. Merging local ontologies lets our model process a smaller amount of data for statistical language modeling than the N-gram has to process. Whereas the N-gram has to pre-process data due to the excessive amount of data required for statistical language modeling [16, 17, 18, 19], our statistical ontology-based model can create the ontology on the fly while the keyword search is being processed, because it only has to process a small amount of data for the statistical language model. For this reason, our model can be used for information retrieval over large-scale web data, while the N-gram cannot effectively perform this task because retrieving information over web-scale data takes it too much time [15].
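
Because each local ontology is just a frequency table, global merging reduces to summing counts; a minimal sketch (our own, assuming the Counter representation from the local-ontology sketch above):

    from collections import Counter

    def build_global_ontology(local_ontologies):
        # Merge per-document wildcard-candidate counts (local ontologies)
        # into one corpus-wide frequency table (the global ontology).
        global_ontology = Counter()
        for local in local_ontologies:
            global_ontology.update(local)
        return global_ontology

    doc1 = Counter({"attacked": 2, "bit": 1})
    doc2 = Counter({"attacked": 1, "killed": 1})
    print(build_global_ontology([doc1, doc2]).most_common())
    # [('attacked', 3), ('bit', 1), ('killed', 1)]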
4.6. User Interface
Figure 5. User Interface for Search Results
In order to show search results to users effectively, our approach produces graphical user interface (GUI) windows, which display the triple ontology, links to the documents, and the frequency of keywords located in the place of the wildcard. As shown in the results of Fig. 4 in the previous section, our statistical ontology-based search produces the search results by referring to the frequency of keywords located in the place of the wildcard. Each keyword has its global ontology frequency and its linked documents, each of which contains the keyword with its own local ontology frequency. Presenting search results under the category of each keyword helps users predict and capture the whole content of each document better than the method of current search engines, which present search results only as document links.
Chapter 5. Experiment Validation
5.1. Experiment Data
Even though our statistical ontology-based search model is mainly designed for web data retrieval, the TREC collection, an offline corpus, is used for quantitative evaluation of search performance. This is because the TREC data supplies questions and correct answers, which helps us show the effectiveness of our search model in a quantitative way.
In detail, the TREC data, which we have used, comprises approximately 2.5 GB of text
(about 907K documents) covering the time period of October 2004 - March 2006. The
articles are written in English and collected from a variety of sources such as Agence
France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and The Associated Press.
The contents of the TREC data include people, organizations, events, and other entities. Each question series is made up of some factoid and some list questions. Factoid-type questions have only a single answer, so they do not allow us to easily evaluate the recall rate of search results, whereas list-type questions with multiple correct answers make that task easier. For the experiment, we calculated the precision and recall rate using twenty-two list-type questions.
5.2. Experiment Methodology
For the evaluation of our model, we compare our statistical ontology-based semantic search with Google Desktop Search and an open source search engine, Nutch [5], comparing the precision and recall rates of each approach. We chose Google Desktop Search [28] over Google Web Search because there is no way to force Google Web Search to crawl and index our whole experiment data set, whereas we can with Google Desktop Search.
Even though Google Desktop Search does not use the PageRank algorithm [8], one of Google's major Web search ranking algorithms, it produces the same results for our experiment as we would expect from Google Web Search, because the TREC data does not have the hyperlinks that the PageRank algorithm requires. Since the PageRank algorithm of Google Web Search can be used in addition to our statistical ontology-based algorithms, the fact that our experiment does not involve page-link algorithms does not undermine the possibility that our approach can improve current search engine technology.
The TREC news data consists of about 1,000 news files, and each news file has about 1,000
articles. The correctness of our search results for the questions the TREC data supplies is evaluated by whether our search results contain links to articles with correct answers. Therefore, we created an HTML file for each article and had the three search engines in our experiment index about one million HTML files in total.
Our experiment mainly dealt with a two-word query because a two-word query is the most
common form of query employed by users [3, 42]. We selected the two semantically most important words from each question on the list provided by the TREC data and applied the two-word queries to Nutch, to our statistical ontology-based search model, and to Google Desktop Search. We evaluated the precision and recall rate of the top ten search results produced by each engine, because most users look only at the first page of results and the first page usually shows 10 results [41]. We then produced final evaluation scores by combining the precision and recall rates following the evaluation method offered by the TREC.
5.3. Experiment Result
In order to test both recall rates and precision rates of each search approach, we have used
questions offered by the TREC data that have multiple correct answers. A total of twenty-two queries were used, after excluding queries that returned no results for all three search approaches. Fig. 6 below compares the precision rates of each search approach; the X axis shows the TREC query ID, and the Y axis the precision rate.
Figure 6. Precision Comparison Result
Search Approach                   Nutch          Ontology       Google
Precision (avg. of 22 queries)    0.192424242    0.429383117    0.543362193

Table 1. Average precision of each query
As Table 1 shows, our statistical ontology-based search improved the precision rate by about 123% over the original Nutch without the ontology. Our approach, by considering whether keywords are placed in the same sentence, adds one more constraint to the search conditions of previous search algorithms, thereby filtering out more irrelevant results than the original Nutch can. For this reason, our approach produces more correct search results, so its precision rate is expected to be higher than the original Nutch's. Meanwhile, Google Desktop Search showed a very high precision rate because it is sensitive to verb tenses and conjugation during the search process. Hence, we conclude that Google Desktop Search focuses more on producing precise search results than on covering a wide range of the target corpus.
Fig. 7 below shows the recall rates of each search approach. The X axis shows the TREC query ID, and the Y axis the recall rate.
Figure 7. Recall Rate Comparison Result
Search Approach                    Nutch          Ontology       Google
Recall rate (avg. of 22 queries)   0.197663944    0.183466602    0.06486014

Table 2. Average Recall Rate of Each Search Approach
As we expected, checking whether keywords are placed in the same sentence caused our ontology-based model to return fewer search results than calculating frequency and distance alone. When the number of search results decreases, the recall rate is expected to decrease; in our experiment, our statistical ontology-based search's recall rate decreased by about 7% compared with the original Nutch's.

Meanwhile, Google Desktop Search shows a lower recall rate than our model and Nutch. This was to be expected, since Google Desktop Search showed a higher precision rate in the previous experiment: it evidently applies more constraints during the search process in order to obtain higher-precision results than our model and Nutch do.
To evaluate search engines properly, both precision and recall rate are generally considered together, and our experiment evaluated search results this way. In order to evaluate the three search approaches quantitatively and to weigh precision and recall equally, we adopted the standard that the TREC suggests, shown below, to aggregate the recall and precision of each search approach.
IR = (# instances judged correct & distinct) / |final answer set|
IP = (# instances judged correct & distinct) / (# instances returned)
F = (2 × IP × IR) / (IP + IR)
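
For instance, if a system returns 10 instances of which 4 are judged correct and distinct, and the final answer set holds 8 instances, then IR = 4/8 = 0.5, IP = 4/10 = 0.4, and F = (2 × 0.4 × 0.5)/(0.4 + 0.5) ≈ 0.444. A one-line Python version of the measure (our own helper, not TREC code):

    def trec_f_score(num_correct_distinct, answer_set_size, num_returned):
        # Harmonic mean of instance recall (IR) and instance precision (IP).
        ir = num_correct_distinct / answer_set_size
        ip = num_correct_distinct / num_returned
        return 0.0 if ip + ir == 0 else (2 * ip * ir) / (ip + ir)

    print(round(trec_f_score(4, 8, 10), 3))  # 0.444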
The following Fig. 8 shows the evaluation value of each search approach. The X axis
shows the TREC’s query ID and the Y axis is the TREC’s search evaluation value.
Figure 8. Average of TREC’s evaluation value comparison result
Search Approach                          Nutch          Ontology       Google
Evaluation value (avg. of all queries)   0.141727328    0.181746032    0.06349681

Table 3. Average Evaluation Value of Each Query
We calculated the evaluation value for each query, and the average over all queries is used to compare the search approaches.
Search Approach                          Nutch          Ontology       Google
Recall rate (avg. of all queries)        0.197663944    0.183466602    0.06486014
Precision (avg. of all queries)          0.192424242    0.429383117    0.543362193
Evaluation value (avg. of all queries)   0.141727328    0.181746032    0.06349681

Table 4. Summary of Experiment Results
In general, recall is inversely proportional to precision, and our experiment shows the same pattern. Applying our sentence-based filter to the original Nutch lowered the coverage of documents, and as a result the recall rate went down. However, in our experiment, precision increased much more than recall decreased.
Accordingly, our algorithms and query interface improved the final evaluation value by 28% compared with Nutch without our methods. Google Desktop Search uses a number of search algorithms, one of which makes it highly sensitive to different forms of words; as a result, although its precision was relatively high, its recall rate was so low that its final TREC evaluation value was lower than our methodology's.
To evaluate search engines properly, both precision and recall are normally considered. However, in order to test more queries and gain more credibility, we also used about one hundred factoid-type questions with only a single answer, for which only precision can be evaluated. The following result shows that the precision improvement is similar to that for list-type questions.
Figure 9. Average precision of each factoid-type query
Search Approach                    Nutch          Ontology
Precision (avg. of 121 queries)    0.176752906    0.357627866

Table 5. Average precision of each query
Google Desktop Search was excluded from this experiment because it returned no results for many queries, as Figure 6 shows; its precision therefore ranks high, as in the previous experiments, but a high precision rate computed over queries with no returned results is not valid. Moreover, the recall rate cannot be evaluated with factoid-type questions because there is only one correct answer. In total, 102 factoid questions and 19 list-type questions were used for this experiment, and precision increased by 102%.
5.4. Experiment Conclusion
We have verified our hypothesis that allocating higher ranking scores to keywords in the same sentence produces more semantically relevant rankings among the top-ranked documents for multi-word search than Nutch and Google Desktop Search. We have also supported the hypothesis that placing wildcards between keywords, or at the beginning or end of a multi-word query, helps indicate the user's information need more clearly. As a result, our model makes more precise and efficient information retrieval possible. Furthermore, our statistical ontology-based model, which adopts a statistical language model for multi-keyword search, helps generate semantically more relevant retrieval results.
Chapter 6. Experiment Demonstration of Ontology Construction
In this chapter, we focus on building a general-purpose ontology using single keywords. Through experiments with real-world data, we show that single keywords can still be used with our method to generate a broader ontology, although doing so takes a long time and many repeated executions.
6.1. Utilizing Commercial Search Engine

To refine web information efficiently, the new approach uses the Google search engine [62]. The search engine serves both as the initial corpus acquisition method and as a preliminary data filter (see Figure 10). Utilizing a commercial search engine brings many benefits; in particular, it improves the quality of the corpus. According to Google [63], it reached 8 billion indexed web pages in November 2004, so a commercial search engine's results represent the essence of broadly covered web documents. Moreover, the search engine returns only topic-specific information, which is extremely efficient compared with finding topic-relevant data in a general offline corpus such as newspaper data [64]. If we use an open source search engine such as Nutch [5], we need to crawl the web data from seed URLs, and due to limited computing power and network bandwidth, covering a wide range of web documents is challenging [24]. Topic-relevant crawling is impossible, since we do not know where the topic-relevant documents exist; the data crawled by an open source search engine is therefore a random collection of web documents that may lack relevance to the information we want. We could set up domain-specific seed URLs to increase the relevance between the crawled results and the data we want, but setting up domain-specific rules works against the automated ontology building process and against the goal of this new approach, which is to build a general-purpose large ontology that does not rely on domain-specific rules or data. To ensure the quality of crawled data, an open source search engine would have to cover as wide an area of the web as commercial search engines do; otherwise, the crawled results will be biased. In fact, neither commercial nor open source search engines can escape bias, since commercial web sites dominate the meaning of specific keywords (e.g., the term “apple” is dominated by the computer company Apple's website). This can be interpreted as meaning that many people are looking for the commercial sense of “apple”; then, at least, the search-engine-based new method can serve the major information requests, just as the search engine does. To enrich the quality of the ontology, the ideal algorithm needs to gather as much topic-relevant corpus as possible. To extend the topic-relevant crawled results beyond a commercial search engine's, the ideal algorithm would extract topic-relevant seed URLs from the commercial search engine's results and crawl web documents from those seed URLs. This research is based on Google search engine results.
Symbol   Definition
Ti       Topic keyword
Tfi      Frequency of the topic keyword
On       Topic-relevant object
Ofn      Frequency of the topic-relevant object
Orn      Frequency rank among topic-relevant objects
Dj       Ontological describer between Ti and On
Dfj      Frequency of the ontological describer for Ti and On
Drj      Frequency rank among ontological describers for Ti and On
Pijn     Popularity of ontological relation describer Dj: how often Dj is used between Ti and On
Si       Set of web documents from search engine results for query Ti
Sin      Set of web documents from search engine results for query Ti + On

Table 6. Notation
6.2. Extracting Topic Relevant Objects
The goal of this experiment is that if the user inputs a topic Ti, the algorithm builds an ontology about Ti. For example, if the user inputs Ti = “guy”, the algorithm builds “guy”-“likes”-“girl”, “guy”-“loves”-“sports”, and so on, each with its corresponding web link, in a fully automated way. Specifically, the algorithm extracts target objects On such as “girl”, “sports”, and “car”, and finds the ontological describers Dj connecting the meanings of Ti and On, such as “likes”, “gets”, and “loves”. Eventually it generates multiple (Ti - Dj - On) ontologies. The new approach uses a recursive term frequency algorithm and is based on the basic concept of the semantic web. Connecting objects has many benefits: it supplies richer meaning than keyword-based technologies and supplies a link to the source sentence. The new approach is based on two hypotheses: topic-relevant keywords occur more frequently than irrelevant keywords, and people use similar terms when they describe the same object.
Figure 10. Extracting the most frequent matching keyword.
In Figure 10, the topic keyword Ti is queried to Google. The Google API [62] returns 1000 topic-relevant documents, which the algorithm indexes with the open source indexer Apache Lucene [6]. Lucene is specialized for keyword indexing and guarantees fast index building and short seek times. This is very efficient
since the algorithm retrieves a large number of web documents. Lucene also records the positions of indexed keywords, so the ontological describer Dj can easily be retrieved from the built Lucene index. The indexed objects and their frequencies, viewed with the Lucene index toolbox Luke [65], are shown in Table 7.
Table 7. Indexed objects and their frequencies
The algorithm generates the most frequent topic-relevant keyword set On by accessing the indexed data. The goal of this process is to find topic-related objects, so the algorithm chooses only nouns from the index list. General stop words and web-related noisy keywords such as “www”, “http”, “policy”, “privacy”, “terms”, “contact”, “search”, and “copyright” are eliminated, and HTML tags are removed prior to indexing. Table 7 still contains some noisy data, since the method relies on an automatic term frequency filter. On the other hand, relying on the term frequency filter lets the corpus reflect up-to-date information and people's current interests. The approach does not cover the broad range of lexical meanings that a human could list, and the most frequent terms are not always good target objects to connect ontologically to the topic. However, utilizing the index makes it possible to keep all possible combinations with minimal space and computation, so the index retains all possible ontologies even if the target objects are noisy. The built ontologies are served to users, who choose the data they want, as in a search engine. Fortunately, the experiment results show that the term frequency filter works very well.
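
A condensed Python sketch of this object extraction step (our own illustration: the search-engine retrieval is abstracted into a list of documents, and the thesis's noun filter, which needs a POS tagger, is approximated by stop word removal):

    import re
    from collections import Counter

    WEB_NOISE = {"www", "http", "policy", "privacy", "terms", "contact",
                 "search", "copyright"}
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at", "for"}

    def extract_topic_objects(documents, top_k=10):
        # Rank candidate objects On for a topic by raw term frequency
        # across the topic-relevant documents, after removing stop words
        # and web-related noise terms.
        counts = Counter()
        for doc in documents:
            doc = re.sub(r"<[^>]+>", " ", doc)        # strip HTML tags
            words = re.findall(r"[a-z]+", doc.lower())
            counts.update(w for w in words
                          if w not in STOP_WORDS and w not in WEB_NOISE)
        return counts.most_common(top_k)

    docs = ["Apple released a new iPod.", "Buy the iPod at the Apple store."]
    print(extract_topic_objects(docs, top_k=3))
    # [('apple', 2), ('ipod', 2), ('released', 1)]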
6.3. Extracting Ontological Describer

Now the indices contain the topic (Ti) and the topic-relevant objects (On). New queries are then built by merging the topic with each object {O1, O2, ..., On}. The number of generated queries is Ti × On (one query per topic-object pair), which means the number of web documents used to build the ontology for the topic Ti is Ti × On × 1000. Each generated query is input to the Google search engine. As Figure 11 shows, the algorithm extracts the “Ti + On”-related web documents and generates indices for each query.
Figure 11. Process of extracting ontological describer.
Whereas the most frequent terms became the topic-relevant data in the object extraction process, finding the ontological describer Dj between Ti and On must consider the position of terms, because natural language is based on a Subject + Verb + Object structure. The ontological describer for a topic and an object is usually placed between the subject and the object. However, due to natural language's high variation, a computer cannot know which term is the best ontological describer in a sentence. Natural language processing (NLP) research has studied this problem, but it remains an open issue; interpreting, decomposing, and extracting the exact meaning of terms from natural language is impossible [21, 22], and the noisiness of web documents aggravates the situation. Hence, NLP usually yields poorer results on noisy web corpora [61] than on clean corpora. Per our stated hypotheses, the new approach assumes that people use popular verbs or describers to express the relation of a specific subject and object, so the verbs or describers used for that subject and object are more frequent than irrelevant describers. Therefore, the algorithm recognizes the most frequent terms occurring between the topic and the object as potential describers. The research does not limit the describer to verbs, since a noun can also carry ontological meaning depending on the topic; end users can choose either verb-only results or non-filtered results. The algorithm therefore extracts multiple ontological describer candidates from one sentence. For example, when the topic Ti = “apple”, the object On = “iPod”, and the extracted web sentence is “Recently, Apple released their new iPod”, the algorithm extracts the ontological describer candidates {released, their, new} from the sentence. It indexes the ontological describer candidates from each result of the query Ti + On. “Their” and “new” are treated as stop words during describer extraction from the indexed data, but all describer candidates are kept in the index for possible extensions even though they are stop words.
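
A minimal Python sketch of describer extraction (our own illustration of the hypothesis above, not the thesis implementation): for each sentence containing both the topic and the object, the terms between them are counted as describer candidates:

    import re
    from collections import Counter

    def extract_describers(sentences, topic, obj):
        # Count candidate ontological describers Dj: the terms occurring
        # between the topic Ti and the object On within one sentence.
        counts = Counter()
        for sentence in sentences:
            words = re.findall(r"[a-z]+", sentence.lower())
            if topic in words and obj in words:
                i, j = words.index(topic), words.index(obj)
                counts.update(words[min(i, j) + 1:max(i, j)])
        return counts

    sents = ["Recently, Apple released their new iPod.",
             "Apple announced their iPod today."]
    print(extract_describers(sents, "apple", "ipod").most_common())
    # [('their', 2), ('released', 1), ('new', 1), ('announced', 1)];
    # 'their' and 'new' are then dropped as stop words, leaving the verbs.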
Table 8. Comparison between unfiltered ontological describers (top) and filtered ontological describers (bottom)
The bottom of Table 8 shows a fine example of extracted ontological describers for Ti = “Shark” and On = “Victim”. According to the experiment results, extracting the ontological describers between Ti and On filters out a larger amount of noisy terms (bottom of Table 8) than extracting the most frequent terms from the results of the query Ti + On (top of Table 8). This is discussed further in Section 5.2.
6.4. Analysis of Experiment Results
6.4.1 Data Analysis of Topic Relevant Objects
In this section, we analyze whether the topic-related objects are valid, since the method of finding objects relies on an automated frequency filter. Table 7 shows the most frequent objects when Ti = "apple". We chose "apple" for Ti since it fulfills the keyword requirements discussed in the introduction: it is polysemous, one of the noisiest keywords, and a sense not supported by WordNet (the computer company "Apple"). "Apple" is also one of the most biased topics, because Google search results show that more than 30% of the websites related to Ti = "apple" are from "apple.com" or "mac.com". Those websites are built for commercial purposes; hence most of their content is non-text binary data and information about Apple's products. The results reflect this biased situation. The results are represented as [Orn] On (Ofn) = [frequency rank among topic-relevant objects] Object (frequency of object): [Orn] On (Ofn) = {[1st] "Mac" (766), [2nd] "Store" (761), [3rd] "Inc" (757), [4th] "IPod" (705), [5th] "support" (698), [6th] "Itunes" (682)…}.
In other words, these On reflect the current trend, since the computer company "Apple" has become a general and popular keyword, especially in the web environment, even though WordNet still only has the lexical meaning of "apple" as a fruit. Table 10 shows the automatically built ontologies about "apple", including the describers between "apple" and the listed Ons. Does the unsupervised term frequency filter really work? The experiment results show that extracting topic-relevant objects is noisier than extracting ontological describers, since object finding relies only on the frequency filter apart from stop-word removal and noun filtering, whereas finding ontological describers uses the filtering algorithm that finds the connection word between Ti and On. Table 8 shows finding ontological describers with and without the filter; as it shows, the unfiltered data is as noisy as the extracted objects. Therefore, finding the right On for Ti became the most important factor in improving the quality of the built ontology. The current object extraction is based on the frequency filter, which chooses the most frequent terms as candidates. The results show that most objects carried the meaning of the company "Apple"; consequently, the company Apple dominates the built ontologies.
For example, fruit-related "apple" objects are ranked low: Ons (Ofn) = {tree (31), fruit (21), acid (5), vitamin (3)}. However, the built indices also contain the information of the fruit-related "apple" keywords; hence the algorithm can respond with the corresponding ontologies if end users request "apple tree" or "apple vitamin" as a query. In 6.4.2, we discuss how the relation between Ti and On can affect the quality of the ontology.
6.4.2 Data Analysis of Ontological Describer
In this section, we analyze the results to show whether the suggested algorithm can produce valuable ontological describers for Ti and On. We analyze 3 built ontologies; in terms of the generality of topic and object, each ontology has a different character to analyze. For the first example, Table 8 shows one node of the shark ontology. The top of Table 8 shows the most frequent terms from the Ti + On relevant documents, and the bottom shows the extracted ontological describers and their frequencies (Dj, Dfj). In the unfiltered top table, the overall term frequency is high, since the algorithm counts terms wherever they are located. However, the frequency Dfj of a filtered ontological describer is lower, since terms must be placed between Ti and On, which is the smart filtering in this algorithm.
The top ontological describers and their frequencies when Ti = "shark" and On = "victim" are [Drj] Dj (Dfj) = {[1st] "attack" (443), [2nd] "bite" (138), [3rd] "white" (51), [4th] "great" (47)…}. The top frequent terms and their frequencies in the unfiltered documents are {[1st] attack (822), [2nd] world (656), [3rd] water (600), [4th] video (586)…}. All shown results have stop words removed.
The most significant difference between the filtered and unfiltered results is "bite". One of the most appropriate ontological describers for this Ti + On, "bite" is ranked 101st in the unfiltered results but 2nd in the filtered results, right after the first strong result "attack". Moreover, the term frequencies in the unfiltered results decrease almost linearly, which makes it hard to recognize the importance of the extracted data, whereas the filtered results show a significant frequency drop after the 2nd result, "bite". This example does not represent the character of all results: the algorithm's performance varies with the topic, since web documents are extremely noisy and the new approach uses a fully automated method. However, this result was chosen to show that extracting the terms (Dj) between Ti + On can be one solution for refining noisy web documents without domain-specific rules, and it is a necessary technology for a large automated ontology. Extracting appropriate ontological terms in the top ranks is ideal but difficult to achieve, especially with a non-topic-specific method. However, the algorithm returned ideal results when Ti = "shark" and On = "victim". Why? Because Ti = "shark" and On = "victim" exactly meet the second hypothesis, "People use similar terms when they describe the same object." People generally think of "attack" or "bite" when Ti = "shark" and On = "victim" are given, and this trend is reflected in the frequencies: Dfj of "attack" = 443, Dfj of "attacks" = 99, Dfj of "attacking" = 4 and Dfj of "attacked" = 3, so the total "attack"-related Dfj = 549; in the same way, Dfj of "bite" = 138 becomes 179. The algorithm can calculate such totals by using the Porter stemming algorithm [4] or the Snowball stemming algorithms [26], which remove the more common morphological and inflexional endings from English words. However, two limitations led this research to keep Dj and Dfj untouched except in obvious cases such as "likes" and "like", or "attacks" and "attack". The first limitation is that once terms are stemmed, reversing a stemmed term back to the original term is difficult, since a morphological ending sometimes carries a different word meaning, as in "computer" versus "computes". Therefore, following the philosopher L. Wittgenstein's approach [21, 22], the algorithm will not interpret the meaning of words; obvious cases are handled by preset keywords. In our experiments, the Porter stemming analyzer did not improve the results over the standard analyzer with presetting, and it usually removes important morphological endings. The Porter stemming analyzer could be supplied as an option to the end user, and research that generates two indices, one with the standard analyzer and one with the Porter stemming analyzer, for merging purposes is ongoing.
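As a sketch of the aggregation described above, the following uses NLTK's Porter stemmer to sum the frequencies of morphological variants; it assumes the NLTK package is installed, and recall that the dissertation ultimately keeps terms unstemmed except for obvious cases.

    from collections import defaultdict
    from nltk.stem import PorterStemmer  # assumes the NLTK package is available

    def aggregate_by_stem(describer_freqs):
        # Sum the frequencies of morphological variants under a common stem,
        # e.g. attack/attacks/attacking/attacked -> "attack".
        stemmer = PorterStemmer()
        totals = defaultdict(int)
        for term, freq in describer_freqs.items():
            totals[stemmer.stem(term)] += freq
        return dict(totals)

    # Frequencies taken from the shark/victim example above:
    # aggregate_by_stem({"attack": 443, "attacks": 99, "attacking": 4, "attacked": 3})
    # -> {'attack': 549}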
What happens if Ti and On are general terms that do not bring a specific ontological describer to people's minds, such as "guy" and "girl", or "car" and "vehicle"?
Table 9. Ontological describers for general Ti and On
For the second example, the top of Table 9 shows the top ontological connection words and their frequencies when Ti = "guy" and On = "girl": [Drj] Dj (Dfj) = {[1st] like or likes (55), [2nd] get or gets (52), [3rd] seek or seeking (30), [4th] hits or hit (21), [5th] look or looking (21)…}. The results "guy likes girl" and "guy seeking girl" are reasonable ontology. The informal relation "guy hits girl" has also been built into the ontology, and it is a good example of the freshness of web documents. The term "hits" could carry two meanings in the built ontology, "hit" or "hit on"; however, interpretation is left to the end users, following the initial approach of this thesis. It is interesting that many of the top-ranked Djs have similar meanings, such as "seeking", "looking" and "gets". This suggests that reflecting natural language means reflecting human thinking and people's recent trends.
To show the popularity concept in the ontology, the next step is a comparison between the first example, Ti = "shark" + On = "victim", and the second example, Ti = "guy" + On = "girl". The thesis defines the popularity of an ontological relation describer as Pijn = {2 * Dfj / (Tfi + Ofn)} * 100. Pijn represents how often people use Dj when they describe the relation between Ti and On. The first example, "shark + victim", shows that the Pijn of the first and second most frequent terms are "attack" 60% and "bite" 20%. However, in the second example, "guy + girl", the Pijn of the top result "like" is only 9% and the Pijn of the second most frequent result "get" is 8%, which means there is no dominant relation describer for "guy + girl". In other words, people use diverse ontological describers for "guy + girl", since the describers for "guy + girl" have many more candidates compared to "shark + victim". This situation will be referred to as high ontological freedom from now on. As a result, other possible solutions are ranked beyond 10th, for example [Drj] Dj (Dfj) = {[11th] talk (10), [12th] dating (9), [13th] meet (9)}. Topics and objects with high ontological freedom are easily biased by the search results, since the differences among the Dfj are small.
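The popularity measure can be computed directly from the definition. In the sketch below, the Tfi and Ofn values are hypothetical, back-solved so that the ratios match the percentages reported above; the dissertation does not list them for these topics.

    def popularity(dfj, tfi, ofn):
        # Pijn = {2 * Dfj / (Tfi + Ofn)} * 100: how often people use the
        # describer Dj when they relate Ti and On.
        return 2.0 * dfj / (tfi + ofn) * 100

    # Hypothetical Tfi/Ofn values chosen to reproduce the reported ratios:
    print(round(popularity(443, 738, 738)))  # shark + victim, "attack" -> 60
    print(round(popularity(55, 611, 611)))   # guy + girl, "like" -> 9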
The third example, the bottom of Table 9, shows a topic and object that are difficult to describe and have high ontological freedom. What comes to mind when people are given Ti = "car" + On = "vehicle", and what if they are asked to find an ontological describer? "Car is a vehicle?" "Car is a kind of vehicle?" If WordNet had a manually built answer, it would be part-of or subsumption (the superclass-subclass relation). The issue here is that people do not generally express the relation between "car" and "vehicle". The top extracted ontological describers and their frequencies when Ti = "car" + On = "vehicle" are [Drj] Dj (Dfj) = {[1st] commercial (92), [2nd] auto (77), [3rd] used (76), [4th] motor (70), [5th] history (67)…}. The top-ranked ontological describers are more like keywords that frequently co-occur with Ti = "car" or On = "vehicle", as in "commercial vehicle" and "used car".
As listed in the characteristics of web documents, if Ti + On form a frequently used sentence, or if Ti + On have a describable relationship, then the algorithm generates reasonable ontological describers. Over 90% of the 1st and 2nd Djs were recognized as proper relation describers for a given Ti + On; with random Ti + On, the percentage dropped to 80%. The evaluation of an ontology is difficult, since a human needs to manually decide whether the built ontology makes sense; the evaluation section will discuss the results. However, the goal of the approach is not to provide a small set of exact results. It provides a large set of ontologies containing all possible solutions, and when a user asks for specific data in the indices, it can return the corresponding data in a short time. For example, when Ti = "guy" + On = "girl", a user may want to know the objects in queries such as "guy dating what?" or "guy meet what?". The built index can answer "guy dating girl" or "guy meet girl", and the corresponding web links will be supplied even though the ranks of "meet" and "dating" are beyond 10th. Moreover, the extended approach can calculate index values to find objects with a fixed Ti + Dj; for example, it can send the query "guy dating" to the ontologies and analyze the most frequent On. The experiment results for the extended approach are discussed in section 6.4.6.
Ti = "apple" — Onr, Tfi, On, Ofn: (Dfj) Dj
1, 749, Mac, 747: (144) support, (67) system, (63) computer, (59) operating, (50) imac
2, 965, Store, 965: (134) retail, (119) online, (91) visit, (35) itunes, (22) music
3, 930, Inc, 929: (282) computer, (100) mcintosh, (82) iphone, (64) mac, (63) ipod
4, 741, Ipod, 745: (123) store, (100) support, (98) iphone, (91) has, (81) gb
5, 822, Iphone, 828: (143) has, (129) ipod, (78) phone, (73) support, (71) announced
6, 162, Itunes, 162: (42) support, (42) ipod, (35) music, (32) mac, (32) logo
7, 763, retail, 763: (245) store, (184) online, (153) visit, (153) 1-800-my, (155) open
8, 890, Os, 868: (803) mac, (172) support, (82) computer, (76) system, (74) operating

Ti = "kdd" — Onr, Tfi, On, Ofn: (Dfj) Dj
1, 572, data, 596: (274) knowledge, (254) discovery, (209) conference, (198) international, (81) workshop
2, 256, information, 284: (54) nuggets, (54) list, (52) related, (51) moderated, (30) conference
3, 533, mining, 562: (530) data, (362) knowledge, (354) discovery, (330) conference, (316) international
4, 508, knowledge, 532: (189) conference, (163) international, (88) data, (71) process, (61) mining
5, 201, text, 245: (128) workshop, (59) kdd-2000, (30) knowledge, (24) discovery, (21) cup
6, 469, discovery, 489: (441) knowledge, (320) conference, (305) international, (86) data, (72) mining
7, 728, conference, 748: (471) international, (208) acm, (203) sigkdd, (105) mining, (96) data
8, 674, international, 690: (85) acm, (80) sigkdd, (34) Japan, (29) premier, (29) mining

Ti = "WordNet" — Onr, Tfi, On, Ofn: (Dfj) Dj
1, 233, search, 224: (25) database, (25) interface, (21) based, (20) internet, (18) lexical
2, 258, text, 243: (71) synsets, (70) improve, (65) word, (63) disambiguate, (60) senses
3, 183, information, 187: (52) proceedings, (50) senses, (48) acm, (44) international, (42) research
4, 63, language, 64: (40) natural, (11) wordnets, (8) cross, (8) English, (7) workshop
5, 537, database, 537: (244) lexical, (73) electronic, (25) English, (22) large, (21) cup
6, 536, word, 529: (271) has, (188) English, (122) related, (62) dictionary, (58) noun
7, 96, English, 96: (23) database, (19) Princeton, (18) lexical, (16) edition, (12) thesaurus
8, 47, system, 47: (15) language, (13) natural, (13) processing, (10) reference, (10) performance
Table 10. Built ontology
6.4.3. Refining Built Ontology
After analysis of the built indices, Djs that meet the following conditions are regarded as strong results. The remaining indices still leave room for refinement, and the ranking algorithm is only required when the user wants refined data. The algorithm does not determine whether returned data is valid; it keeps all indices, and the end user evaluates the final data, as with a search engine.
1. Pijn > 20%: if the ontological describer Dj is used to describe the relation of Ti + On more than 20% of the time, then Dj is a very strong result.
2. Drj < 10: the top-ranked ontological describers usually have better descriptive meaning for Ti and On. Since the describer frequency Dfj varies with the topic and the object, the rank Drj must be considered in the evaluation function.
3. Tfi > 300: if the frequency of Ti or On is too small, people do not generally use the sentence. In other words, Ti and On are not related to each other even though On is ranked as a strong object.
4. Dj is a verb: given the structure of natural language, a verb usually has better descriptive meaning for Ti and On than a noun, although a noun sometimes also plays the role of an ontological describer.
5. No frequency tie above 50 (Dfj ≠ Dfk when Dfj > 50): if the frequency Dfj of an ontological describer is exactly the same as the frequency Dfk of another describer, it usually means Dj and Dk are used together as one expression, such as "Michael Jackson". Among the low-frequency Dfjs there are many ties (Dfj = Dfk), because the frequency differences are small; a tie at a frequency over 50, however, usually indicates that the two describers are used as one word.
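A minimal sketch of a filter applying these five heuristics follows. The strict verb check in rule 4 is a simplification (the text notes that nouns sometimes qualify), and the tie check in rule 5 is expressed against the frequencies of the other describers.

    def is_strong_describer(pijn, drj, tfi, dfj, is_verb, other_freqs):
        # Apply the five heuristics above. `other_freqs` holds the
        # frequencies of the other Djs for the tie (collocation) check.
        if pijn <= 20:                       # rule 1: popularity over 20%
            return False
        if drj >= 10:                        # rule 2: must be in the top 10
            return False
        if tfi <= 300:                       # rule 3: Ti must be frequent enough
            return False
        if not is_verb:                      # rule 4 (strict; nouns may qualify)
            return False
        if dfj > 50 and dfj in other_freqs:  # rule 5: a tied high frequency
            return False                     # suggests a multi-word expression
        return True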
6.4.4. Structure of Ontology
As shown in Figure 12, a built ontology has links among 3 elements (Ti, Dj and On). Each topic Ti has multiple objects On, and each object has multiple describers Djs. The ontology can return the corresponding values when any two elements are fixed. For example, when Ti and On are given, the Djs {D1, D2…Dj} will be returned, and if Ti and Dj are given, the corresponding Ons {O1, O2…On} will be returned. The experiment results are shown in the evaluation of ontology section.
Figure 12. Structure of built ontology
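One way to realize this two-way lookup is to index the (Ti, Dj, On) triples under both (Ti, On) and (Ti, Dj) keys. The following is a minimal sketch under that assumption, not the dissertation's actual data structure.

    from collections import defaultdict

    class Ontology:
        # Triples (Ti, Dj, On) with frequencies, indexed both ways so that
        # fixing any two elements returns candidates for the third.
        def __init__(self):
            self.by_topic_object = defaultdict(list)     # (Ti, On) -> [(Dj, Dfj)]
            self.by_topic_describer = defaultdict(list)  # (Ti, Dj) -> [(On, Dfj)]

        def add(self, ti, dj, on, dfj):
            self.by_topic_object[(ti, on)].append((dj, dfj))
            self.by_topic_describer[(ti, dj)].append((on, dfj))

        def describers(self, ti, on):   # Ti and On fixed -> Djs by frequency
            return sorted(self.by_topic_object[(ti, on)], key=lambda p: -p[1])

        def objects(self, ti, dj):      # Ti and Dj fixed -> Ons by frequency
            return sorted(self.by_topic_describer[(ti, dj)], key=lambda p: -p[1])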
6.4.5. Built Ontology
Table 10 shows automatically built ontologies. Due to space considerations, it only shows 3 topics and the top 40 ontological relations for each topic (8 objects Ons and 5 link-words Djs for each object). However, the actual "apple" ontology can represent over 10,000 ontological relationships and links to the corresponding web documents, since the original "apple" ontology consists of 21 Ons and about 500~2000 Djs for each On. Therefore, during the experiment, over a million ontological relationships were generated from one hundred test topics. The research did not filter for topics Ti that happen to produce good results in order to show the exact result of the new approach; Table 10 was chosen randomly among the automatically built ontologies.
After analyzing many ontologies, we found that the built ontologies maintain strong relevancy to the topic keyword, since the term frequency filter usually returns topic-relevant keywords and the built ontologies are based on combinations of those returned terms. The system is fully automated; the only input from the user is Ti = {"apple", "kdd", "WordNet"}, and it returns Table 10.
6.4.6. Utilizing Built Ontology
Sentence-based query and answering is simple when we utilize the built ontology, whereas Google and WordNet cannot support it. Built ontologies have links among 3 elements (Ti, Dj and On). Each topic Ti has multiple objects On, which have multiple link-words Djs. The ontology can return corresponding values when any one or two elements are given in a sentence-based query. For example, when Ti and On are given in a sentence-based query, multiple Djs {D1, D2…Dj} will be returned, and if Ti and Dj are given, the corresponding Ons {O1, O2…On} will be returned.
If the end user wants to know "apple support what?", the built ontology returns "apple support {(319) leopard, (271) system, (184) music, (172) OS, (159) video, (158) applications, (114) macbook, (114) Mac, (100) iPod, (100) software, (100) information, (73) iPhone, (63) computer, (54) product, (42) itunes, (17) store…}". The results are from the built "apple" ontology, which reflects the Google search engine's results in February 2008, and "(number)" denotes Dfj, the frequency of the Dj "support" between "apple" and each On. If the user enters the natural language query "what apple recently released?", the algorithm can remove "what" and "recently" as stop words, and "apple" and "released" are input to the built "apple" ontology. Since "released" is a verb located after the term "apple", it is input as Dj. The input to the ontology is Ti = "apple" and Dj = "released", and the built ontology returns the corresponding Ons with their web links: "apple released {(48) leopard, (25) software, (24) OS, (23) Mac, (22) iPod, (19) iPhone, (17) macbook, (9) product, (9) music, (9) system, (5) computer, (4) video, (3) application, (2) information, (2) itunes, (2) retail}". The results may contain incorrect ontological describers and Ons; however, the user can choose from the supplied information using the frequency ranking, a concept similar to search engine results. The ranking algorithm discussed in 6.4.3, Refining Built Ontology, is used to determine the importance of results, and the highest-ranked results are served first.
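Reusing the Ontology sketch above, the routing of such a sentence query could look roughly like this. The stop-word list and the rule "the term after the topic acts as Dj" are simplifications of the behavior described in the text.

    QUERY_STOP_WORDS = {"what", "who", "recently", "a", "an", "the"}

    def answer(query, topic, ontology):
        # Strip stop words, treat the term after the topic as the describer
        # Dj, and look up the corresponding Ons in the built ontology.
        terms = [t.strip("?.,").lower() for t in query.split()]
        terms = [t for t in terms if t not in QUERY_STOP_WORDS]
        if topic in terms:
            i = terms.index(topic)
            if i + 1 < len(terms):  # the term after the topic acts as Dj
                return ontology.objects(topic, terms[i + 1])
        return []

    # Assuming `onto` is an Ontology populated as sketched earlier:
    # answer("what apple recently released?", "apple", onto)
    # -> [("leopard", 48), ("software", 25), ("OS", 24), ...]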
The keyword-based WordNet and the search engine are not able to support such natural language queries and results. WordNet does not even have the meaning of the company "Apple"; accordingly, the word similarity between Ti = "Apple" and the On terms was very low or did not exist in WordNet. WordNet's similarity between "shark" and "victim" is 0.25, and the similarity between "guy" and "girl" is 0.2. Moreover, WordNet does not have the natural language relations "support", "attack" and "likes", whereas the new approach can supply natural language relations Djs, as in "guy {likes, gets, seeking, hits, looking} girl". The new approach can also supply precise links to the web documents that match the user's sentence query. For example, if the user enters the query "guy likes girl", the new approach can return links to documents containing variations of the input sentence, such as "guy really likes the girl" or "guy likes sexy girl", whereas the search engine cannot.
The new 3-tuple data representation method can represent the close form of a sentence, which resolves the issues of keyword-based technologies. Moreover, the 3-tuple data structure supports reasoning, although it is different from the reasoning the semantic web supports: the system does not interpret the meaning of relations into a computer-recognizable expression. For example, in the built ontology "guy likes girl", the logical relation between "guy" and "girl" is "likes"; if the "girl" ontology has "girl has pet", we can reason "guy likes (freq) girl (freq) who has (freq) pet (freq)". The frequency information can supply the weight of the relations.
The evaluation of an ontology is difficult, since humans need to manually determine whether the built ontology makes sense. Therefore, researchers from the USC semantic information research lab evaluated 100 ontologies manually. As listed in the characteristics of web documents, if topic and object appear in frequently used sentences, or if topic and object have describable relationships, the algorithm will generate reasonable link-words. Over 90% of the 1st and 2nd link-words were recognized as proper link-words for a given topic and object; with random topic and object, the percentage was reduced to about 70%. However, the percentage is not that important in this approach, since the goal is not to provide a small set of exact results. It provides a large set of ontologies containing all possible solutions and corresponding web links, and when the user asks for specific data in the indices, it can return the corresponding data in a short time. For example, when Ti = "guy" + On = "girl", and a tester writes a sentence-based query to the built ontology such as "guy dating what?" or "guy meet what?", the built program recognizes "what" as the missing object and returns "guy dating girl" and "guy meet girl", and the corresponding web links are supplied even though the ranks of "meet" and "dating" are well below 10th.
6.4.7. Extracting Topic from Describer and Object
The purpose of this part is to show why this approach had to extract ontological describers between the Topic and the Object. Consider the alternative approach of finding Ti from a given Dj or a given On: for example, Dj = "care" and On = "child", to answer the natural language query "who care child?", which asks for the topic. The results are Ti (Tfi) = {quality (260), family (228), parents (214), providers (170), information (161), families (159)…}. As in the evaluation criteria discussed in 6.4.3, the top-ranked results are the terms with Pijn > 20%, since "care" and "child" have an obvious relation with Ti, like the first ontology example in section 6.4.2, "shark" and "victim". However, when we tested examples with challenging features, such as the second and third example ontologies in section 6.4.2, the program usually returned irrelevant and noisy results. It turns out that the topic is the most important filter for extracting relevant data from the web; the absence of the topic caused downloading of irrelevant data, and irrelevant data generates irrelevant results and irrelevant ontologies. The experiment shows that the algorithm relies on the filtering role of the search engine. We also tried to find On from a given Ti and Dj, which also returned irrelevant results. According to our analysis of the results, the search engine is keyword-based: it usually removes meaningless verbs or keywords when it handles information, so the search engine may disregard Dj. Moreover, as discussed previously, the search engine does not return the same results if input queries do not exactly match each other; for example, "guy likes girl" and "guy likes a girl" produce different search engine results, so the query "guy like" does not retrieve the web documents that contain "guy likes" or "guy really like". Therefore, finding Dj between Ti and On provides the strongest filtering.
6.5. Experiment Conclusion
The experiment’s results show that the approach is capable of building a million of
relationships automatically. The goal of this approach is to maintain the index of each term
and its location so that the algorithm can easily support sentence-based query and ontology.
A natural language parser would require significant computation, and space and therefore it
is not optimized for large web data. However, our new approach supplies a response as
quickly as a search engine can because the search engine’s index is reused and utilized as a
73
source of corpus. Thus, the functionality is easily plugged into a search engine or any
applications to build a general knowledge base without any limitations in its expression.
To enrich the quality of the ontologies, the program needs to gather the topic-relevant
corpus as widely as possible. Once Nutch acquires as much data [61] as Google, the
algorithm would be capable of supplying an extremely large ontology without Google. Also,
word’s sense sensitive crawling is needed since more than 50% of word sense “apple” was
the company “apple” among top 1000 Google search results. Therefore, the company
“apple” was the dominant sense in the built ontologies. In order to cover more diverse
lexical meanings of keywords as WordNet does, the algorithm would need to detect the
meaning of word senses and would need to retrieve sense-corresponding data as corpora.
Therefore, our ideal goal is to supply plural meanings of each term by the ontology built.
74
Chapter 7. Contributions
A primary contribution of this research is to introduce a more semantic understanding of web documents by adopting a method of retrieving web data by sentence unit. Our sentence-based ranking algorithms were able to improve the semantic relevancy of the original Nutch search ranking by 28% by better reflecting the semantic relevance of keywords. Another new method has also contributed to this good result: by marking what a user is searching for as a wildcard, we predict an answer to the wildcard based on the ontology we have developed from actual web data, not from a query log. This means that we create a query sequence including a wildcard, which allows us to more easily find an answer to what a user is searching for, because the ontology enables us to find the words that appear most frequently in the place of the wildcard in the web data. By displaying such words, this method can provide query prediction, query answering, and automated categorization based on actual web data. The difference between existing query prediction and our methodology is that, while current search engines' query-log-based prediction only helps complete a query, our algorithms help users find data more relevant to what they are searching for on the web, thereby improving the rankings of search results. Our statistical ontology-based algorithms also save a great deal of users' time while they are searching for information on the web by improving the semantic relevance of search result rankings.
We were able to build a useful ontology with a large scale of data collected from a huge number of documents and sources on the web. Users can draw upon our ontology for many purposes besides assisting data retrieval, because our ontology has many advantages over other knowledge bases; for example, it automatically reflects the most up-to-date information on a large scale. Hence, our ontology is able to reinforce WordNet [2, 29], because WordNet was built manually a long time ago and does not contain up-to-date information. In this context, this research can lay the groundwork for future studies to develop a large-scale lexical web database like WordNet by expanding the links of words in the sentence-based ontology. The merit of our ontology model is that it can be applied to existing search engines without much modification, because it is based on the vector space model [14].
The statistical ontology-based algorithm we have suggested here has been used to calculate the relations between multi-word queries and document collections. However, since we generate the ontology through statistical language modeling, our statistical ontology-based algorithm can be widely applied to N-gram applications such as information retrieval, query expansion, query answering, and speech recognition.
References
[1] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web
Search Engine, WWW7 Proceedings of the seventh international conference on World
Wide Web 7. 1998
[2] George A. Miller. WordNet: A lexical database for English. In HLT, 1994.
[3] Bernard J. Jansen, Amanda Spink, Tefko Saracevic. Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web, 2000.
[4] M.F. Porter. The Porter Stemming Algorithm, 1980. Retrieved from http://tartarus.org/martin/PorterStemmer/
[5] Nutch, open source search software. Retrieved from http://lucene.apache.org/nutch/
[6] The Apache Lucene, open source search software. Retrieved from http://lucene.apache.org/
[7] Ellen M. Voorhees, Overview of the TREC-9 Question Answering Track 2001.
[8] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript in progress, 1999.
[9] Naveen Ashish, Sharad Mehrotra, Pouria Pirzadeh: XAR: An Integrated Framework for
Information Extraction. CSIE (4) 2009: 462-466
[10] Wilks, Y., B.M. Slator, and L. Guthrie. 1996. Electric Words: Dictionaries, Computers, and Meanings. Cambridge: MIT Press.
[11] Harabagiu, S.M., G.A. Miller, and D.I. Moldovan. 1999. WordNet 2 - A
Morphologically and Semantically Enhanced Resource. Proc. Of the SIGLEX
Workshop.
[12] Minoru Yoshida, "Extracting ontologies from World Wide Web via HTML tables",
PACLING 2002, 2002.
[13] Wolfgang Holzinger, Bernhard Krüpl, and Marcus Herzog. 2006. Using Ontologies for Extracting Product Features from Web Pages. The 5th International Semantic Web Conference, Athens, GA, USA.
[14] Gerard Salton, A. Wong, and C. S. Yang. A vector space model for information retrieval. Communications of the ACM, 18(11):613–620, November 1975.
[15] Ethan Miller, Dan Shen, Junli Liu, Charles Nicholas, and Ting Chen, Techniques for
Gigabyte-Scale N-gram Based Information Retrieval on Personal Computers 1999.
[16] Ronald Rosenfeld. Two Decades of Statistical Language Modeling. IEEE, 2000.
[17] Slava M. Katz. Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal
Processing, 35(3):400–401, March 1987.
[18] Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic
dependences in stochastic language modeling. Computer Speech and Language, 8:1–38,
1994.
[19] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 181–184, Detroit, Michigan, May 1995.
[20] T. Strzalkowski, L. Guthrie, J. Karlgren, J. Leistensnider, F. Lin, J. Perez-Carballo, T.
Straszheim, J.Wang, and J. Wilding. Natural language information retrieval: TREC-5
report. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), 1997.
[21] Wittgenstein, Ludwig. Tractatus Logico-Philosophicus. Trans. C. K. Ogden. London:
Routledge & Kegan Paul, 1922.
[22] Wittgenstein, Ludwig. Philosophical Investigations. Trans. G. E. M. Anscombe. 3rd
ed. Oxford: Basil Blackwell, 1968.
[23] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram
Raghavan "Searching the Web." ACM Transactions on Internet Technology, 1(1):
August 2001.
[24] Marc Najork, Allan Heydon. High-Performance Web Crawling, 2001.
[25] Sean A. Golliher, , Search Engine Ranking Variables and Algorithms, SEMJ.ORG
VOLUME 1, SUPPLEMENTAL ISSUE, AUGUST 2008
[26] M.F. Porter, Snowball: A language for stemming algorithms 2001 retrieved from
http://snowball.tartarus.org/texts/introduction.html
[27] Boris Wyssusek, Queensland University of Technology. On Ontological Foundations of Conceptual Modeling, 2005.
[28] Google Desktop, Inside Google Desktop retrieved from
http://googledesktop.blogspot.com/
[29] Pedersen, Patwardhan, and Michelizzi, WordNet::Similarity - Measuring the
Relatedness of Concepts - Appears in the Proceedings of Fifth Annual Meeting of the
North American Chapter of the Association for Computational Linguistics (NAACL-
04), pp. 38-41, May 3-5, 2004, Boston, MA. (Demonstration System)
[30] Ellen M. Voorhees and Donna K. Harman, National Institute of Standards and Technology. TREC: Experiment and Evaluation in Information Retrieval, 2005.
[31] S. Chung, J. Jun, and D. McLeod. A web-based novel term similarity framework for
ontology learning. In OTM Conferences (1), pages 1092–1109, 2006.
[32] Nicola Guarino, National Research Council, LADSEB-CNR, Corso Stati Uniti 4, I-35127 Padova, Italy. Formal Ontology and Information Systems, 1998.
[33] Amit Singhal Google, Inc. ,Modern Information Retrieval: A Brief Overview 2001.
[34] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern information Retrieval 1999.
[35] R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query
logs in search engines. In International Workshop on Clustering Information 2004.
[36] R. Baeza-Yates and A. Tiberi. Extracting semantic relations from query logs. In SIGKDD, pages 76–85, 2007.
[37] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log.
In KDD, pages 407–416, 2000.
[38] D.L. Parnas, Carnegie-Mellon University, On the Criteria To Be Used in
Decomposing Systems into Modules 1972.
[39] Benny Kimelfeld. Yehoshua Sagiv, Efficient Engines for Keyword Proximity Search.
2006.
[40] Tao Tao, ChengXiang Zhai, An Exploration of Proximity Measures in Information
Retrieval 2007.
[41] A. Spink, D. Wolfram, B.J. Jansen, and T. Saracevic. Searching the web: the public and
their queries. J. American Society for Information Science and Technology, 52(3):226–234,
2001.
[42] V. Mittal, S. Baluja, and M. Sahami. Google tutorial on web information retrieval. In
RIAO-2004, 2004.
[43] Gilad Mishne, Maarten de Rijke, BOOSTING WEB RETRIEVAL THROUGH
QUERY OPERATIONS 2005.
[44] E. M. Keen. The use of term position devices in ranked output experiments. The
Journal of Documentation, 47(1):1–22, 1991.
[45] E. M. Keen. Some aspects of proximity searching in text retrieval systems. Journal of
Information Science, (18):89–98, 1992.
[46] D. Hawking and P. Thistlewaite. Proximity operators – so near and yet so far. In
Proceedings of the Fourth Text REtrieval Conference (TREC-4), pages 131–143, 1995.
[47] Y. Rasolofo and J. Savoy. Term proximity scoring for keyword-based retrieval systems. In Proceedings of the 25th European Conference on IR Research (ECIR 2003), pages 207–218, 2003.
[48] S. Buttcher, C. Clarke, and B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006.
[49] Katz, B., S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A. McFarland, and B.
Temelkuran. 2002. Omnibase: Uniform access to heterogeneous data for question
answering.
[50] Kisuh Ahn, Bonnie Webber, Topic Indexing and Retrieval for Factoid QA 2008
[51] Kisuh Ahn, Johan Bos, James R. Curran, Dave Kor, Malvina Nissim & BonnieWebber
Question Answering with QED at TREC-2005
[52] M. Ross Quillian, Word concepts: A theory and simulation of some basic semantic
capabilities, Bolt, Beranek, and Newman, Cambridge, Massachusetts.
[53] Iain McGilchrist, “The Master and His Emissary,” Yale University Press, 2010
[54] Shamsfard Mehrnoush, Barforoush Abdollahzadeh. The State of the Art in Ontology
Learning: A Framework for Comparison. The Knowledge Engineering Review,
Volume 18, Issue 4. December 2003
[55] Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek. Ontology Learning from Text Tutorial. ECML/PKDD 2005, Porto, Portugal; 3rd October - 7th October, 2005. In conjunction with the Workshop on Knowledge Discovery and Ontologies (KDO-2005).
[56] Thomas K Landauer, Peter W. Foltz, Darrell Laham, An Introduction to Latent
Semantic Analysis
[57] Ferdinand de Saussure, Structuralism. Retrieved from http://en.wikipedia.org/wiki/Structuralism
[58] A. Maedche, and S. Staab. Ontology learning for the Semantic Web. IEEE Intelligent
Systems, 16(2), 2001.
[59] Eneko Agirre, German Rigau. Word Sense Disambiguation using Conceptual Density.
[60] M. Missikoff, P. Velardi, and P. Fabriani. Text mining techniques to automatically
enrich a domain ontology. Applied Intelligence, 18(3):323-340, 2003.
[61] E. Agirre, O. Ansa, E. Hovy, and D. Martinez. Enriching very large ontologies using
the WWW. In Proceedings of the ECAI Workshop on Ontology Learning, 2000.
[62] Google developers, Google Web Search API retrieved from
https://developers.google.com/web-search/
[63] Google, Google History retrieved from http://www.google.com/corporate/history.html
[64] Patrick Pantel and Deepak Ravichandran. 2004. Automatically Labeling Semantic
Classes. In Proceedings of Human Language Technology / North American Association
for Computational Linguistics (HLT/NAACL-04). pp. 321-328. Boston, MA.
[65] Apache Lucene, Index Toolbox, Luke retrieved from http://www.getopt.org/luke/
[66] Ferdinand de Saussure, Course in General Linguistics, lectures between the years 1906
and 1911
[67] M. Ross Quillian ,Semantic Memory, Air Force Cambridge Research Laboratories,
Office of Aerospace Research, United States Air Force, 1966 - 222 pages