WebSim: A Novel Term Similarity Metric based on a Web Search Technology

Seokkyung Chung¹*, Jongeun Jun²**, and Dennis McLeod²

¹ Yahoo! Inc., 2821 Mission College Blvd, Santa Clara, CA 95054, USA
² Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
schung@yahoo-inc.com, {jongeunj, mcleod}@usc.edu

* This research was conducted when the author was at the University of Southern California.
** To whom correspondence should be addressed.
Abstract. Given that pairwise similarity computations are essential in ontology learning and data mining, we propose WebSim (Web-based term Similarity metric), whose feature extraction and similarity model is based on a conventional Web search engine. There are two main aspects in which we can benefit from utilizing a Web search engine. First, we can obtain the freshest content for each term, representing the up-to-date knowledge on the term. This is particularly useful for dynamic ontology management in that ontologies must evolve with time as new concepts or terms appear. Second, in comparison with approaches that use a fixed amount of crawled Web documents as a corpus, our method is less sensitive to the problem of data sparseness because we access as much content as possible using a search engine. At the core of WebSim, we present two different methodologies for similarity computation, a mutual information based metric and a feature-based metric. Moreover, we show how WebSim can be utilized for modifying existing ontologies. Finally, we demonstrate the characteristics of WebSim by coupling it with WordNet. Experimental results show that WebSim can uncover topical relations between terms that are not shown in conventional concept-based ontologies.
1 Introduction
With the rapid growth of the World Wide Web, Internet users are now experiencing overwhelming quantities of online information. Since manually analyzing the data is nearly impossible, the analysis must be performed by intelligent information management techniques to fulfill users' information needs quickly.

Representation and extraction of semantic meanings from information contents is essential in intelligent information management. This issue has been explored in diverse research disciplines including artificial intelligence, data mining, information retrieval, natural language processing, etc. One of the widely used approaches to addressing this problem is to exploit ontologies. For example, when users use irrelevant keywords (due to their broad and vague information needs or unfamiliarity with the domain of interest), query expansion based on ontologies can improve retrieval accuracy by providing an intelligent information selection.

A knowledge acquisition problem (i.e., how to build ontologies) is one of the main bottlenecks in ontology-based approaches. Although there exist hand-crafted ontologies such as WordNet [13] or Cyc [9], significant numbers of domain-specific terms (e.g., scientific or engineering terms) and neologisms are not present in general-purpose ontologies. Thus, it is essential to build ontologies that can characterize given applications.
However, although ontology-authoring tools have been developed in the past decades [17, 25], constructing ontologies by hand whenever new domains are encountered requires a significant amount of time and effort. Additionally, since ontologies must evolve with time as new concepts or terms appear, it is essential to keep existing ontologies up-to-date. Therefore, ontology learning, which is a process of integrating knowledge acquisition with data mining, becomes a must. Consequently, a knowledge expert can build and maintain domain ontologies more efficiently with the support of ontology learning. Given that computation of the similarity between terms is at the core of ontology learning³ problems, we focus our attention on computing similarity between terms.
As the Web continues to grow as a vehicle for the distribution of information, massive amounts of useful information can be found on the Web. Given this wide availability of knowledge on the Web, we present WebSim (Web-based Similarity metric), whose feature extraction and similarity model is based on a conventional Web search engine. The proposed approach takes advantage of two main aspects of Web search engine technology. First, as many thousands of Web pages are published daily, the Web reflects and characterizes the current trend of knowledge. Thus, we can obtain the freshest content for each term, representing the up-to-date knowledge on the term. This is particularly useful for dynamic ontology management in that ontologies must evolve with time as new concepts or terms appear. Second, because we access as much content as possible using search engines, our method is less sensitive to the problem of data sparseness. Although previous text mining work crawls large amounts of Web pages for feature extraction, since the crawled contents are just a snapshot of the entire Web, it still suffers from a data sparseness problem.

At the core of WebSim, we present two similarity metrics, a frequency-based metric and a feature-based metric. The frequency-based one is a mutual information based measure that utilizes the number of Web pages associated with each term. In contrast, the feature-based similarity metric extracts relevant features for each term, and performs similarity computation based on the extracted features. With the feature-based metric, we present how to deal with ambiguous terms for the similarity computation, which is one of the difficult problems in a
³ Our definition of ontologies is a collection of key concepts and terms along with their inter-relationships. Although ontologies are often equipped with a set of inference rules and constraints that are used to reason about concepts, this notion is beyond the scope of this paper.
Web search. We also show how ontologies can be restructured or enriched with WebSim. Finally, we demonstrate the characteristics of WebSim by coupling it with WordNet.

One of the main problems in concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between Dell-notebook, Apple-iPod, etc. Although there exist different types of term association relationships in WordNet [13], such as "Bush" versus "President of the US" as synonyms, or "G.W. Bush" versus "R. Reagan" as coordinate terms, these types of relationships are limited in addressing topical relationships. Thus, concept-based ontologies have a limitation in supporting topical search. For example, consider the Sports domain ontology that we developed in our previous work [7]. In this ontology, "Kobe Bryant", who is an NBA basketball player, is related to terms/concepts in the Sports domain. However, for the purpose of query expansion, "Kobe Bryant" also needs to be connected with a "court trial" concept if a user has "Kobe Bryant court trial" in mind. Therefore, it is essential to provide explicit links between topically related concepts/terms. To address this problem, we also demonstrate how topical relations are generated in WebSim, and compare WebSim with semantic similarity in WordNet. In sum, the purpose of this research is to move one step forward toward the development of a novel feature extraction and similarity model that can be utilized for any ontology learning framework.
The remainder of this paper is structured as follows. In Section 2, we briefly review related work, and highlight the strengths and weaknesses of previous work in comparison with ours. In Section 3, we present an information-theoretic similarity measure. In Section 4, we explain our feature extraction algorithm and explore how to compute similarity between terms based on the extracted features. Section 5 explains how WebSim can be utilized for ontology modification. In Section 6, we discuss the characteristics of WebSim in relation to general-purpose ontologies. Finally, we conclude the paper and present our future plans in Section 7.
2 Related Work
Computation of the similarity between terms is at the core of the ontology learning problem. There have been many attempts at automatic detection of similar words from text corpora. One of the widely used approaches in similarity computation is based on the distributional hypothesis [5, 19]. That is, if words occur in a similar context, then they tend to have similar meanings. The context can be defined in diverse ways. For example, it can be represented by co-occurrence of words within grammatical relationships (e.g., a set of verbs which take the word as a subject or object, a set of adjectives which modify the word, etc.), or co-occurring words within a certain length of window. Each context is referred to as a feature of a term. Thus, the set of all features of a term t_i constitutes the feature vector of t_i.
Recently, there have been research efforts on building ontologies automatically. In particular, text mining tools have been widely used to build ontologies [14, 11, 24, 16, 6, 23]. In order to obtain context, current text mining research usually utilizes the Web as a corpus. Our approach is different from the previous work in that we utilize a Web search engine to exploit the full content of the Web. That is, rather than relying on a snapshot of the Web, we can access as much content as possible (depending on how much content a search engine spider can crawl). Thus, our method is less sensitive to the problem of data sparseness. In addition, our feature extraction methodology is different from other approaches in that the context of a term is defined by a set of highly relevant documents returned by a search engine. Note that our research is complementary to the previous ontology learning efforts in that the extracted features or term similarity can be utilized for any ontology learning framework.

Besides ontology learning, WebSim can be extended to other applications as well. For example, ontology matching, which aims at identifying mappings between related entities of multiple ontologies, has been widely studied recently [2, 26]. The many different matching solutions proposed so far exploit various properties of data such as the structures of ontologies, data instances, and string similarity using edit distance. WebSim is expected to reinforce previous matching technologies in that it is a similarity metric that is orthogonal to existing ones. Moreover, with the popularity of online sellers recommending products to their customers based on purchasing patterns, ontologies can be utilized to customize a system to a user's preferences in e-commerce [27]. That is, ontologies can be effectively used for modelling customers' behavior and users' profiles. WebSim is particularly useful in this type of application in that it can enrich existing taxonomies maintained by online shopping malls (e.g., amazon.com). In sum, WebSim is expected to be a key enabling technology for tasks where pairwise similarity computations play a central role.
3 WebSim: A Simple Mutual Information Approach
In this section, we present the first similarity metric, which is simple but powerful. The measure is referred to as the MI-based (Mutual Information) WebSim. Table 1 lists the notations that will be used throughout this paper.

The underlying assumption behind the MI-based WebSim is that two terms co-occur frequently if they are similar to each other. Mutual information is an information-theoretic metric that quantifies the relatedness between two words. The mutual information between t_i and t_j is defined as follows:

MI(t_i, t_j) = \log \frac{p(t_i, t_j)}{p(t_i) \times p(t_j)}    (1)

The higher the value of MI(t_i, t_j), the stronger the association between t_i and t_j.
Notation            Meaning
t_i                 The i-th term
df(t_i)             The total number of Web pages matched by the query t_i
N                   The total number of Web pages that a search engine indexes
p(t_i)              The probability that a search engine returns results for t_i
D_i                 The set of Web pages (returned by a search engine) for t_i
f_ij                The j-th feature of a term t_i
d_k                 The k-th document returned by a search engine
l_k                 The document length of d_k
freq_ijk            Term frequency of a feature f_ij in d_k
tf_ijk              Normalized term frequency of a feature f_ij in d_k
N_i                 The size of D_i
n_ij                The number of documents in D_i where f_ij occurs at least once
w_ij                The weight of f_ij
IC(t_i)             The information content of t_i
prob(t_i)           The concept probability of how much t_i occurs
count(t_i)          The term frequency of t_i in the corpus
concept_freq(t_i)   The concept frequency of t_i in the corpus
C                   The size of the corpus

Table 1. Notations for WebSim
Mutual information has been widely used in previous text mining research as a criterion for measuring term association. The probability is usually defined by term frequency divided by the total number of terms observed in a corpus. However, this probability is restricted by the size of the corpus. In particular, if a term is a newly coined one, then it suffers from a data sparseness problem. To address this problem, WebSim utilizes a search engine to estimate the probability approximately. Figure 1 illustrates the idea. An FE (Front-End) scraper sends a query to a Web search engine, and extracts the number of documents that the query matches⁴. To estimate the total number of documents a search engine indexes, based on the fact that most search engines support boolean queries, the numbers of documents for two different queries ("t_i" and "NOT t_i") are summed up. Thus, WebSim defines p(t_i) as follows:

p(t_i) = \frac{df(t_i)}{N}    (2)

where N is the total number of Web pages that a search engine indexes, and df(t_i) is the total number of returned Web pages with respect to t_i.
⁴ Although Google is used in this paper, because most search engines display the total number of documents that match a query, WebSim can use other search engines as well. It is also worthwhile to investigate how different search engines affect WebSim, but this is out of the scope of this paper.
Fig. 1. Overview of an MI-based WebSim
Consequently, in contrast to previous text mining approaches, WebSim uses a different notion of p(t_i): the probability that a search engine returns results for t_i.
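To make the procedure concrete, the following Python sketch assembles the MI-based WebSim from raw hit counts. The `hit_count` helper is hypothetical (it stands for whatever search API is available, and the `NOT` query syntax is engine-dependent), and the joint probability is estimated from the conjunctive query "t_i t_j", matching the three-query protocol discussed in Section 5.

```python
import math

def hit_count(query: str) -> int:
    """Hypothetical wrapper around a Web search API that returns the
    engine's reported number of pages matching `query`."""
    raise NotImplementedError  # depends on the chosen engine's API

def mi_websim(t_i: str, t_j: str) -> float:
    """MI-based WebSim (Equations 1 and 2).

    The index size N is estimated by summing the hit counts of the two
    complementary boolean queries `t_i` and `NOT t_i`."""
    n = hit_count(t_i) + hit_count(f"NOT {t_i}")   # estimate of N
    p_i = hit_count(t_i) / n                       # p(t_i) = df(t_i) / N
    p_j = hit_count(t_j) / n
    p_ij = hit_count(f"{t_i} {t_j}") / n           # pages matching both terms
    if p_ij == 0.0:
        return float("-inf")                       # no observed co-occurrence
    return math.log(p_ij / (p_i * p_j))
```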
Most Web search engines provide advanced search features whereby a user can specify how a query is matched with a page. That is, a user can retrieve the Web pages where a query matches the title of the page (M_1), the text of the page (M_2), the URL of the page (M_3), links to the page (M_4), or anywhere in the page (M_5). Although it is worthwhile to investigate how different matching modes affect the accuracy of WebSim, this is beyond the scope of this paper. However, we briefly compare M_1 and M_5.
Table 2 presents sample results. As shown, in most cases (Type 1), using both M_1 and M_5, WebSim captures the similarity between related term pairs fairly well. WebSim with M_1 is better than WebSim with M_5 in some cases (Type 2), such as DAML-OIL or notebook-computer. Moreover, M_1 is able to adjust the similarity values that M_5 overestimates (text mining-computational biology). This is expected in that M_1 provides more accurate context for term co-occurrence than M_5 does. In some cases (Type 3), WebSim fails to capture the similarity between terms. This is primarily because mutual information prefers low-frequency terms to high-frequency ones (i.e., it gives higher similarity values to low-frequency terms). For example, in the case of DAML-OIL and OWL-OIL, because of the relatively low frequency of "DAML" in comparison with "OWL", WebSim captures the association for DAML-OIL while it cannot for OWL-OIL. This also holds for apple-computer. Thus, although mutual information can detect pairwise similarity fairly well, rather than relying only on p(t_i), there is a need to incorporate content (associated with the term) into WebSim, which motivates the second similarity metric.
Type  t_i                          t_j                       M_5    M_1
1     Natural Language Processing  NLP                       7.71   7.31
      Self Organizing Maps         SOM                       6.32   3.59
      Artificial Intelligence      AI                        5.35   8.04
      Genetic Algorithm            Evolutionary Computation  8.60   8.47
      Data Mining                  Clustering                4.18   4.06
      Data Mining                  Knowledge Discovery       6.85   12.5
      Text Mining                  Ontology                  4.40   3.62
      Text Mining                  ODBASE                    5.33   7.12
      Text Mining                  Bioinformatics            4.87   4.19
      Computer Security            Firewalls                 3.41   3.31
      Clustering                   Classification            4.49   5.34
      Classification               Neural Networks           3.92   6.49
      Machine Learning             Text Mining               6.57   3.77
      Semantic Web                 Ontology                  5.39   7.33
      Semantic Web                 DAML                      5.86   4.87
      Bioinformatics               Computational Biology     5.47   9.96
      OWL                          DAML                      6.17   5.30
      ODBASE                       CoopIS                    15.4   22.7
      ODBASE                       DOA                       10.0   14.6
      Neural Networks              Perceptron                8.35   6.84
      Neural Networks              Multi Layer Perceptron    8.92   10.4
2     DAML                         OIL                       2.89   7.28
      Notebook                     Computer                  0.05   4.38
      Operating System             Unix                      1.54   5.28
      Text Mining                  Computational Biology     3.70   -2.18
3     OWL                          Metadata                  0.71   -3.23
      OWL                          OIL                       -0.16  -3.98
      Apple                        Computer                  -0.90  0.28
      Data Mining                  Classification            1.75   1.27

Table 2. Mutual information for sample term pairs
4 WebSim: A Similarity Model based on Feature Extraction

This section presents another similarity model based on feature extraction, which is referred to as the feature-based WebSim. Section 4.1 discusses the feature extraction methodology. Section 4.2 explores similarity computation.
4.1 Feature Extraction
In this section, we explain how to extract features for each term. We first clarify the distinction between "term" and "word". Although "term" and "word" have the same meaning in general, as a notational convention, "term" will be used to refer to the entity of similarity computation (i.e., we measure similarity between terms) while "word" is used to represent a feature of a "term". Thus, a term is represented by a set of words, where each word corresponds to a feature of the term.

Extracting meaningful features for WebSim consists of three phases: retrieval of Web documents for each term, preprocessing of the retrieved documents, and construction of a vector space model with a relevant feature extraction method.

A Web search engine is necessary to obtain the initial set of relevant documents for each term. Toward this end, we use the open source software, Google Web API [28]⁵.

⁵ Alternatively, we can use the front-end scraper presented in Section 3.
In the preprocessing step, meaningful information is extracted from Web pages using standard IR tools. This process includes HTML preprocessing (e.g., removing irrelevant HTML tags, Javascript code, etc.), tokenization, stemming with a Porter stemmer [20] and the lexical database [12], and stopword removal [22]⁶.

⁶ Web-specific stopwords (e.g., host, click) are also added to our list.
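A minimal sketch of this preprocessing pipeline is shown below, assuming NLTK's implementation of the Porter stemmer; the stopword set and the tag-stripping regexes are placeholders for the full tooling described above.

```python
import re
from nltk.stem import PorterStemmer  # Porter stemmer [20]

# Placeholder stopword set; a real list (e.g., [22]) plus Web-specific
# entries such as "host" and "click" would be used in practice.
STOPWORDS = {"the", "a", "an", "of", "and", "host", "click"}

def preprocess(html: str) -> list[str]:
    """Turn a retrieved Web page into a list of stemmed content words."""
    # Drop script/style blocks, then strip the remaining HTML tags.
    text = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    # Crude tokenizer: lowercase alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]
```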
After preprocessing, a term t_i is represented as a vector in a vector space [22]. The simplest way to do this is to employ a bag-of-words approach. We treat each word as a feature of t_i, and represent each term as a vector of weighted word frequencies in this feature space. The weight of a word for each term is determined based on the following two heuristics.

- Important words occur more frequently within a document than unimportant words do.
- The more times a word occurs throughout the documents within D_i, the stronger its predictive power becomes.
The term frequency (TF) is based on the first heuristic. In WebSim, the term frequency for t_i is counted in each document in D_i, where D_i is the set of top most relevant documents for t_i (returned by a search engine). In addition, TF can be normalized to reflect different document lengths. Let f_ij be the j-th feature of t_i, and freq_ijk be the number of occurrences of f_ij in a document d_k where d_k ∈ D_i. Then, the term frequency tf_ijk of f_ij in d_k is defined as follows:

tf_ijk = \frac{freq_ijk}{l_k}    (3)

where l_k is the length of d_k.
The second heuristic is related to the document frequency (DF) of the word (the percentage of the documents that contain the word). In traditional IR research, the inverse document frequency (IDF) has been widely used, based on the observation that low document frequency words tend to be particularly important in identifying relevant documents with respect to a query. That is, words with high document frequency tend to occur in many irrelevant documents because the number of documents relevant to a query is generally small. However, in WebSim, since only relevant documents with respect to a term are retrieved, and stopwords are removed in the preprocessing step, a word with high document frequency within D_i is considered to be a particularly relevant feature for the term.
A combination of TF and DF introduces a new ranking scheme, which is defined as follows:

w_ij = \frac{n_ij}{N_i} \times \frac{\sum_{d_k \in D_i} tf_ijk}{N_i}    (4)

where w_ij is the weight of f_ij, N_i is the total number of documents in D_i, and n_ij is the number of documents in D_i where f_ij occurs at least once.
By exploiting the fact that only the top (high-weighted) few features contribute substantially to the norm of the term vector, we keep only the high-weighted features that make up most of the norm (80% or so). This approach reduces the number of features significantly while minimizing the loss of information.
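As one concrete reading of Equations (3) and (4) together with this pruning rule, the sketch below computes the feature weights of a term from its preprocessed document set D_i. Interpreting "80% of the norm" as 80% of the squared L2 norm is our assumption, not a detail fixed by the paper.

```python
from collections import Counter, defaultdict

def feature_weights(docs: list[list[str]]) -> dict[str, float]:
    """Compute w_ij for every feature of one term from its documents D_i
    (Equation 4): w_ij = (n_ij / N_i) * (sum_k tf_ijk / N_i),
    with tf_ijk = freq_ijk / l_k (Equation 3)."""
    n_docs = len(docs)                 # N_i
    doc_freq = defaultdict(int)        # n_ij: documents containing the feature
    tf_sum = defaultdict(float)        # sum over d_k of normalized frequency
    for doc in docs:
        length = len(doc)              # l_k
        for feat, freq in Counter(doc).items():
            doc_freq[feat] += 1
            tf_sum[feat] += freq / length
    return {f: (doc_freq[f] / n_docs) * (tf_sum[f] / n_docs) for f in doc_freq}

def prune(weights: dict[str, float], mass: float = 0.8) -> dict[str, float]:
    """Keep the top-weighted features accounting for ~80% of the vector
    norm (read here as 80% of the squared L2 norm, an assumption)."""
    total = sum(w * w for w in weights.values())
    kept, acc = {}, 0.0
    for feat, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        kept[feat] = w
        acc += w * w
        if acc >= mass * total:
            break
    return kept
```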
Table 3 shows the features (with high weights) for sample terms. As shown, the top features for each term characterize descriptive concepts of the term. For example, consider "knowledge discovery" and "association rules". As expected, key concepts that describe both terms are extracted as features. Note that the extracted features do not always correspond to definitions of terms. For example, for the term "knowledge discovery", "sigkdd" is not a feature for defining the term. However, it is an important feature in that SIGKDD is one of the largest organizations in data mining. Similarly, "agraw" (Rakesh Agrawal), who is an inventor of association rule mining, is extracted as a feature for "association rules". Therefore, the features extracted by WebSim reflect the current trend on a term as well as the definition of the term.
4.2 A Similarity Model based on Extracted Features
Once each term is represented as a vector in a feature space, the next step is to measure the closeness between two terms. Toward this end, we employ the cosine metric, which has been widely used in the information retrieval literature [22]. It measures the similarity of two items according to the angle between them. Thus, vectors pointing in similar directions are considered as representing similar concepts. The cosine of the angle between two terms t_i and t_j is defined by

Sim_1(t_i, t_j) = Cosine(v_i, v_j) = \frac{\sum_{k=1}^{n} w_{ik} \cdot w_{jk}}{||v_i|| \cdot ||v_j||}    (5)

where v_i and v_j correspond to the vectors of t_i and t_j, respectively. Cosine(v_i, v_j) ranges from 0 to 1.
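Over the sparse, pruned weight vectors produced in Section 4.1, Sim_1 reduces to a few lines; this sketch simply spells out Equation (5).

```python
import math

def cosine(v_i: dict[str, float], v_j: dict[str, float]) -> float:
    """Sim_1 (Equation 5): cosine of the angle between two sparse feature
    vectors; ranges from 0 to 1 for non-negative weights."""
    dot = sum(w * v_j.get(f, 0.0) for f, w in v_i.items())
    norm_i = math.sqrt(sum(w * w for w in v_i.values()))
    norm_j = math.sqrt(sum(w * w for w in v_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # a term with no features is similar to nothing
    return dot / (norm_i * norm_j)
```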
The underlying assumption of the proposed approach is simple but effective: if t_i (e.g., iPod) and t_j (e.g., Apple) have some relationship, then the Web pages returned for t_i and t_j will be somewhat similar; consequently, the similarity between v_i and v_j becomes high (by the cosine metric). Table 4 illustrates this. This is generally true if a term is specific or non-ambiguous. However, it does not always hold, due to the ambiguity of terms.
Term                            Features
Knowledge discovery             data mine knowledg discoveri kdd number confer sigkdd acm research inform volum web search scienc
Association rules               rule associ data item mine transact databas inform agraw algorithm confid analysi stream discoveri knowledg itemset
Sequential patterns             pattern mine sequenti data time sequenc databas stream algorithm associ agraw rule transact srikant frequent tempor
Decision trees                  tree decis data inform node learn classif algorithm test split predict gain class train text machin model classifi attribut
Data warehouses                 data warehous inform wareh busi manag softwar databas integr enterpris solut intellig servic server view decis
Object recognition              object recognit imag model vision base featur comput match visual scene view research geometr invari data shape recogn
Multi layer perceptron          layer perceptron multi network output function input neural weight learn hidden unit model neuron error linear mlp
Self organizing maps            organ data som neural kohonen network learn vector wsom model visual cluster text similar inform
Parallel computer architecture  comput parallel architectur softwar hardwar design program memori multiprocessor perform morgan kaufmann network
Firewalls                       firewal internet secur network comput connect protect softwar servic window filter packet inform server host port
Cryptography                    cryptographi secur crypto inform kei privaci encrypt rsa archiv research cryptolog softwar algorithm pgp code
Machine translation             translat machin english languag french onlin text mt spanish comput german public chines associ research
Multimedia                      multimedia inform video time grolieronlin photo media histori nation web archiv imag stori real flash audio view
Virtual reality                 virtual realiti 3d world vr research model list time comput simul vrml applic interact commun environ motion view
Push down automata              push state context languag stack free pda formal automata grammar program correct inform input finit regular theori
Finite automata                 finit automata state algorithm determinist languag regular machin string dfa comput transit express nfa
Turing machines                 machin ture comput state tape program number symbol halt run problem instruct left alan cell blank function
Computational biology           comput biologi research bioinformat journal genom inform scienc center databas molecular analysi search life sequenc
Gene expressions                gene express research genet human molecular biologi genom link develop articl project evolut cell protein journal time
Gene Ontology                   gene ontolog databas link term protein annot search data tool consortium product genom function biolog molecular

Table 3. Sample features for terms. Note that all features are stemmed. For example, "inform" refers to "information", and so on.
t_i                          t_j                        Sim_1(t_i, t_j)
Semantic Web                 XML                        0.405
Genetic algorithms           Evolutionary computation   0.433
Encryption                   Computer security          0.535
Information warfare          Computer security          0.444
Parallel computing           Computer architecture      0.623
Parallel programming         MPI                        0.383
Data warehouses              Knowledge discovery        0.510
Data mining                  Knowledge discovery        0.778
Natural language processing  Computational linguistics  0.439
Natural language processing  NLP                        0.583

Table 4. Sample term pairs that have relatively high Sim_1(t_i, t_j)
One of the challenging problems in the feature-based WebSim is how to deal with the ambiguity of terms. That is, if a term has multiple meanings, then the returned Web pages are somewhat uncorrelated with each other. For example, "clustering" has two meanings in the top 10 ranked pages returned by Google (a data mining and a computer architecture context). Consequently, due to the ambiguity of "clustering", the actual similarity between "clustering" and "data mining" becomes low even though clustering is one of the subfields of data mining. This problem is also inherent in Web search. Because user queries usually tend to be short, they may be ambiguous, which often leads to irrelevant search results.

Table 5 shows the top high-weighted features for six ambiguous terms. For example, consider "classification". Due to the generality of this term, none of the top 10 ranked pages returned by Google is related to classification in the data mining context. Moreover, regarding "oil", most of the top 50 pages returned by Google are pointers to information on gas and oil, and only 2 pages are about the Ontology Inference Layer. Consequently, all extracted features for "oil" are related to gasoline. This is because Web search engines tend to rank a page based on how a query is matched with the page (i.e., whether the query matches the title, etc.) and its popularity, using the notion of authorities and hubs [8, 1]. The extracted features for "oil" are useful if a domain expert wants to add "oil" (in the energy context) into ontologies. However, if he/she considers OIL (Ontology Inference Layer) in the Semantic Web context, then the extracted features are problematic. Assuming "oil" in the energy sense is already in ontologies (because it was coined a long time ago), OIL in the Semantic Web sense is of particular interest in terms of enriching ontologies (because it is a neologism). Moreover, consider "selection", "crossover" and "mutation", which are the three main operators in genetic algorithms. Due to the generality of these terms, the extracted features do not represent the distinctive characteristics of genetic algorithms.
Term            Features
Clustering      cluster server softwar data technolog linux base inform window high applic search servic product load analysi releas featur group
Classification  classif link search onlin inform extern econom literatur number org list north web bookmark journal
OIL             oil energi industri shell locat ga chang product price servic bp drill origin compani petroleum
Mutation        mutat research issu volum databas journal copi page science direct elsevi regist login
Crossover       crossov network web offic capabl linux custom softwar secur microsoft profession featur copi window bui
Selection       select search inform web servic internet scienc map natur access public onlin journal research technolog human

Table 5. Features for ambiguous terms

In previous information retrieval research, query expansion has been widely studied in order to provide more useful search results. That is, a query can be refined by adding additional relevant search terms. The key point here is that the added terms should be somewhat related to the original query term. Otherwise, query expansion leads to a degradation of precision [7].
Suppose t_i and t_j are under consideration for similarity computation. If the similarity between t_i and t_j is not high enough, then the combined queries (i.e., t_i t_j and t_j t_i) are issued to a Web search engine, and the top N documents for both are retrieved. As discussed, if t_i and t_j are not related to each other, then adding an additional term will not be helpful; consequently, the similarity between t_i and t_j t_i (and between t_j and t_i t_j) will still not be high. However, if t_i and t_j are related to each other, then expanding t_i with t_j will result in high similarity (i.e., the similarity between t_i and t_j t_i is expected to be high). In this step, both t_i t_j and t_j t_i are submitted to the Web search engine because the order of query terms affects the search results. Thus, when the similarity between t_i and t_j is not high enough, the similarity is refined as follows:

Sim_2(t_i, t_j) = Average(Cosine(t_i, t_j t_i), Cosine(t_j, t_i t_j))    (6)
Table 6 shows how term expansion can be used to refine the sense of a term. Since "classification" and "clustering" are related in the data mining context, adding "clustering" to "classification" refines the meaning of "classification". This is because there exists a sense that both terms share even though each has multiple meanings. As a result, the features extracted for "classification clustering" are on the data mining subject. However, expanding "linux" with "automata" destroys the true characteristics of the term rather than refining its meaning. Consequently, the resulting features are distorted (distorted features are shown in italics in the original table). Therefore, term expansion is helpful only when a relevant term is added.
Term                       Features
Oil odbase                 ontolog oil web semant confer inform logic descript system odbase knowle languag gobl base proceed databas model
Agent coopis               agent system inform confer cooper univers comput coopi base paper web distribut knowledg data model servic
Agent insurance            insur agent life compani agenc state servic term licens inform financi busi brok onlin quot health nationwid auto
Classification clustering  cluster classif data class analysi method algorithm text inform distanc group list network imag fuzzi vector type similar variabl model hierarch program point document
Clustering architecture    cluster architectur manag server applic network databas servic group avail softwar replic microsoft
Linux automata             linux automata program softwar version cellular simul org game life file comput window java state model 3d

Table 6. Sample features for terms with context
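Putting Equations (5) and (6) together, the sketch below outlines the refinement logic, reusing the `cosine` helper above. `term_vector` is a hypothetical end-to-end helper, and the 0.3 cut-off is an illustrative assumption, since the paper does not fix a value for "not high enough".

```python
def term_vector(query: str) -> dict[str, float]:
    """Hypothetical end-to-end helper: retrieve the top-N pages for
    `query`, preprocess them (Section 4.1), and return the pruned
    feature-weight vector, e.g. prune(feature_weights(docs))."""
    raise NotImplementedError

def websim_feature(t_i: str, t_j: str, threshold: float = 0.3) -> float:
    """Feature-based WebSim with query-expansion refinement (Equation 6).
    The 0.3 threshold is an illustrative assumption."""
    sim1 = cosine(term_vector(t_i), term_vector(t_j))
    if sim1 >= threshold:
        return sim1
    # Both orders are issued because term order affects the ranked results.
    return (cosine(term_vector(t_i), term_vector(f"{t_j} {t_i}")) +
            cosine(term_vector(t_j), term_vector(f"{t_i} {t_j}"))) / 2
```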
5 Ontology Modification with WebSim

Ontology modification is composed of two parts: ontology enrichment and ontology restructuring. In this section, we explore how to deal with these issues with WebSim. Figure 2 provides an overview of the proposed method.
One of the key issues in ontology enrichment is how to identify candidate terms that should be added into an ontology. In the past, we presented topic mining, which effectively identifies useful patterns (e.g., news topics or events, key terms at multiple levels of abstraction) from news streams [4, 3]. Topic mining is a key enabling technology in ontology enrichment. The idea of coupling topic mining and ontology enrichment is as follows.

Topic mining sends a Web crawler to a collection of key sites that are related to the domain of interest (or a collection of popular Web sites like CNN if we want to enrich general-purpose ontologies), and retrieves a set of domain-specific documents. One of the main capabilities of topic mining is that the key topical terms are dynamically generated based on incremental hierarchical document clustering. Thus, the topic mining framework can complement ontology enrichment in that it can automatically identify key candidate terms/concepts from Web document streams. The identified candidates can then be given to a domain expert.
Moreover, the feature extraction methodology (Section 4.1) can be effectively used for candidate term generation. That is, as shown in Table 3, since the features for each term are key concepts that describe the main characteristics of the term, WebSim can use a term in the ontology to derive features for the term. If the obtained features do not exist in the ontology, then domain experts can add the features for the purpose of ontology enrichment. Alternatively, terms identified by topic mining can be given as input to WebSim to populate candidate terms for ontology enrichment.

Fig. 2. Overview of the process for ontology modification with WebSim
Besides adding a new term into an existing ontology, the ontology should be restructured as time evolves. That is, existing term relationships in ontologies need to be changed. Since it is extremely difficult to update an ontology manually, it is necessary to suggest which parts of the ontology are candidates for modification. Toward this end, we present a WebSim-based approach to select candidate term pairs that should be modified. Due to the large number of terms in ontologies, it is computationally expensive to compute pairwise similarity between all terms in the ontology. This is because we cannot predict which part of the ontology should be modified; e.g., consider OIL and XML, which are far from each other in conventional conceptual ontologies. Thus, if the size of the ontology is m, then we need O(m²) pairwise similarity computations. However, the number of computations can be significantly reduced if WebSim is employed. As discussed, for each term, the features extracted by WebSim represent the up-to-date knowledge on the term. Thus, rather than examining all term pairs in the ontology, for each term t_i, we only need to compute the similarity between t_i and those features of t_i that are in the ontology. Assuming that the number of features for each term is constant (because only the top weighted features are considered), the complexity is reduced from O(m²) to O(m), which is a significant improvement.
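The sketch below illustrates this O(m) candidate selection, reusing the hypothetical `term_vector` and `websim_feature` helpers from Section 4; the similarity threshold is an illustrative assumption, and we gloss over the fact that extracted features are stemmed single words while ontology terms may be multi-word phrases.

```python
def restructuring_candidates(ontology_terms: set[str],
                             threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Select candidate pairs for ontology restructuring in O(m) WebSim
    calls: each term is compared only against its own extracted features
    that are also ontology terms, never against all m-1 other terms.
    The 0.5 threshold is illustrative."""
    candidates = []
    for term in ontology_terms:
        # The extracted features represent up-to-date knowledge on the term.
        for feature in term_vector(term):
            if feature in ontology_terms and feature != term:
                score = websim_feature(term, feature)
                if score >= threshold:
                    candidates.append((term, feature, score))
    return candidates
```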
It is worthwhile to compare the MI-based WebSim and the feature-based WebSim, so that domain experts can choose the metric for their own purpose. In terms of computational cost, the MI-based WebSim is cheaper than the feature-based WebSim. In our simulation, the average difference between the number of documents for t_i t_j and that for t_j t_i is 912.70. This number can be ignored considering that the average number of returned documents for the terms in our test is 8,267,363.81. Thus, for the MI-based WebSim, only 3 queries need to be submitted (i.e., t_i, t_j and t_i t_j). In contrast, for the feature-based WebSim, both t_i t_j and t_j t_i (as well as t_i and t_j) need to be submitted to a Web search engine because the order of query terms affects the top ranked Web pages and, consequently, the extracted features. In sum, the MI-based WebSim needs 3 query submissions while the feature-based WebSim needs 4 query submissions. Moreover, the feature-based WebSim needs the feature extraction step and cosine similarity computations. Thus, the MI-based WebSim is computationally cheaper than the feature-based WebSim. However, the feature-based WebSim produces much more reliable similarity values than the MI-based WebSim does, because mutual information is strongly influenced by the marginal probabilities of terms.
6 Semantic Similarity versus WebSim
In this section, we present a methodology for investigating the relatedness between WebSim and existing ontologies like WordNet.

Recently, semantic similarity metrics have been proposed to evaluate the similarity between two terms in a taxonomy based on information content [10, 21]. These approaches rely on the incorporation of empirical probability estimates into a taxonomic structure. Previous studies have shown that this type of approach is significantly less sensitive to link density variability.

The information content of a term t_i, IC(t_i), can be quantified as follows:

IC(t_i) = -\log(p(t_i))    (7)
where p(t_i) is the probability of how much a term t_i occurs. Frequencies of terms can be estimated by counting the number of occurrences in a corpus; each term that occurs in the corpus is counted as an occurrence of each concept containing it:

concept_freq(t_i) = \sum_{t_j \in C_{t_i}} count(t_j)    (8)

where C_{t_i} is the set of terms subsumed by a term t_i. Then, the concept probability for t_i can be defined as follows:

prob(t_i) = \frac{concept_freq(t_i)}{C}    (9)

where C is the size of the corpus, which corresponds to the total number of terms observed in the corpus.
Equation (7) states that informativeness decreases as concept probability increases. Thus, the more abstract a concept, the lower its information content. This quantification of information provides a new approach to measuring semantic similarity: the more information two terms share, the more similar they are.

t_i         t_j             Lin    MI     Sim_1  Sim_2
Semantics   Metadata        0      2.222  0.171  0.653
Firewall    Encryption      0      4.566  0.218  0.699
Automata    Turing machine  0      6.640  0.078  0.500
Apple       Computer        0.121  0.909  0.153  0.556
Yahoo       Messenger       0.230  3.795  0.102  0.666
Tomato      Vegetable       0.853  6.466  0.127  0.740
Doctor      Nurse           0.797  3.093  0.091  0.637
Microsoft   Windows         Undef  3.559  0.541  0.783
Apple       Ipod            Undef  2.408  0.644  0.876

Table 7. WebSim versus semantic similarity. Undef denotes that the term does not exist in WordNet.

Resnik [21] defines the information shared by two terms as the maximum information content of the common parents of the terms in the ontology (Equation (10)):
Resnik(t_i, t_j) = \max_{t \in CP(t_i, t_j)} [-\log(p(t))]    (10)

where CP(t_i, t_j) represents the set of parent terms shared by t_i and t_j.

Because the value of Equation (10) can vary between 0 and infinity, we use Lin's metric instead [10], which varies between 0 (dissimilar) and 1 (similar):

Lin(t_i, t_j) = \frac{2 \times \max_{t \in CP(t_i, t_j)} [-\log(p(t))]}{IC(t_i) + IC(t_j)}    (11)
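For reference, the sketch below is a compact reading of Equations (7)-(11) in Python; enumerating CP(t_i, t_j) from the taxonomy is left abstract here, and in practice a package such as WordNet::Similarity [18] performs the full computation.

```python
import math

def information_content(p_t: float) -> float:
    """IC(t) = -log(p(t)) (Equation 7)."""
    return -math.log(p_t)

def lin_similarity(p_i: float, p_j: float,
                   common_parent_probs: list[float]) -> float:
    """Lin's metric (Equation 11), built on Resnik's shared information
    (Equation 10). `common_parent_probs` holds prob(t) for each concept
    t in CP(t_i, t_j); an empty set yields similarity 0."""
    if not common_parent_probs:
        return 0.0
    shared = max(-math.log(p) for p in common_parent_probs)  # Resnik (Eq. 10)
    return 2 * shared / (information_content(p_i) + information_content(p_j))
```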
Table 7 shows the semantic similarity and WebSim values for selected term pairs. Semantic similarity is obtained using WordNet::Similarity [18]. As expected, due to WordNet's limited ability to express topical relations, we observed low semantic similarity for the first five term pairs. In contrast, our WebSim model successfully captured these similarity relations. Sim_1(t_i, t_j) was low for "automata" and "Turing machine", but this is because of the ambiguity of "automata", which has two meanings (in the theory of computation and in the computational learning context). For the term pairs that semantic similarity detected very well (such as "nurse" vs. "doctor"), WebSim could also identify high similarity using refinement. Finally, for terms that do not exist in WordNet (e.g., iPod or Microsoft), WebSim could capture high similarity. In sum, WebSim performs well on term pairs with high semantic similarity by using refinement, while also uncovering topical relations that do not exist in WordNet.
7 Conclusion and Future Work
In order to accommodate dynamically changing knowledge, we presented a Web search engine based similarity framework referred to as WebSim. WebSim is composed of two similarity metrics, an MI-based one and a feature-based one. The former is computationally cheap, while the latter produces more reliable similarity values. In addition, we suggested diverse ways in which WebSim can be utilized for ontology modification. Finally, by coupling WebSim with semantic similarity, we demonstrated how WebSim is able to identify relations unknown to WordNet.

We intend to extend this work in the following three directions. First, we plan to study more sophisticated feature weighting schemes. Based on the observation that a Web page with high rank is generally more informative than ones with low rank, each Web page can be weighted by its rank when features are extracted. That is, the features on the first page should be weighted higher than the features on the 20-th page, and so on. Second, it is worthwhile to study how different Web search engines affect WebSim. Finally, we plan to investigate the applicability of WebSim to ontology matching.
8 Acknowledgement

This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.
References
1. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference, 1998.
2. S. Castano, A. Ferrara, and S. Montanelli. H-MATCH: an algorithm for dynamically matching ontologies in peer-based systems. In Proceedings of the 1st VLDB International Workshop on Semantic Web and Databases, 2003.
3. S. Chung and D. McLeod. Dynamic topic mining from news stream data. In Proceedings of the 2nd International Conference on Ontologies, Databases, and Application of Semantics for Large Scale Information Systems, 2003.
4. S. Chung and D. McLeod. Dynamic pattern mining: an incremental data clustering approach. Journal on Data Semantics, 2:85-112, 2005.
5. I. Dagan, F. Pereira, and L. Lee. Similarity-based estimation of word cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
6. E.J. Glover, D.M. Pennock, S. Lawrence, and R. Krovetz. Inferring hierarchical descriptions. In Proceedings of the ACM International Conference on Information and Knowledge Management, 2002.
7. L. Khan, D. McLeod, and E.H. Hovy. Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1):71-85, 2004.
8. J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
9. D. Lenat, R.V. Guha, K. Pittman, D. Pratt, and M. Shepherd. Cyc: toward programs with common sense. Communications of the ACM, 33(8):30-49, 1990.
10. D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, 1998.
11. A. Maedche and S. Staab. Ontology learning for the Semantic Web. IEEE Intelligent Systems, 16(2), 2001.
12. I.D. Melamed. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Proceedings of the 3rd Workshop on Very Large Corpora, 1995.
13. G. Miller. WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235-312, 1990.
14. M. Missikoff, P. Velardi, and P. Fabriani. Text mining techniques to automatically enrich a domain ontology. Applied Intelligence, 18(3):323-340, 2003.
15. M. Reinberger, P. Spyns, W. Daelemans, and R. Meersman. Mining for lexons: applying unsupervised learning methods to create ontology bases. In Proceedings of the International Conference on Ontologies, Databases and Applications of SEmantics, 2003.
16. J. Nemrava and V. Svátek. Text mining tool for ontology engineering based on use of product taxonomy and web directory. In Proceedings of the Dateso Annual International Workshop on DAtabases, TExts, Specifications and Objects, 2005.
17. N.F. Noy, M. Sintek, S. Decker, M. Crubézy, R.W. Fergerson, and M.A. Musen. Creating and acquiring Semantic Web contents with Protégé-2000. IEEE Intelligent Systems, 16(2):60-71, 2001.
18. T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In Proceedings of the 5th Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2004.
19. F. Pereira, N.Z. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, 1993.
20. M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
21. P. Resnik. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 1999.
22. G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
23. M. Sanderson and W.B. Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
24. P. Spyns and M. Reinberger. Lexically evaluating ontology triples generated automatically from texts. In Proceedings of the 2nd European Semantic Web Conference, 2005.
25. Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer, and D. Wenke. OntoEdit: collaborative ontology development for the Semantic Web. In Proceedings of the International Semantic Web Conference, 2002.
26. M. Ehrig and Y. Sure. Ontology mapping - an integrated approach. In Proceedings of the 1st European Semantic Web Symposium, 2004.
27. C. Ziegler, G. Lausen, and L. Schmidt-Thieme. Taxonomy-driven computation of product recommendations. In Proceedings of the ACM International Conference on Information and Knowledge Management, 2004.
28. Google Web APIs. http://www.google.com/apis/.