USC Computer Science Technical Reports, no. 637 (1996)
A Hierarchic Architecture for Conceptual Information Retrieval
Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
{shli, danzig}@cs.usc.edu
Abstract

Conceptual retrieval returns information related to a specific topic but not restricted to a query term. A common approach is to compare the query with all the documents in the database. When the number of documents is large, the searching time becomes significant. In this paper, we propose a hierarchic architecture which integrates latent semantic indexing (LSI) and hierarchic agglomerative clustering to reduce the searching time. We employ three clustering algorithms (single link, complete link, and group average) and conduct experiments on four standard document collections: CACM, CISI, CRAN, and MED. The experimental results show that our method requires less searching time while maintaining retrieval effectiveness comparable to that of nonclustered search.

Introduction

Searching information by conceptual meaning often suffers from the vocabulary problem, which states that users may fail to obtain desired information if the query terms they use differ from those indexed by the retrieval system. To address the vocabulary problem, Deerwester et al. proposed Latent Semantic Indexing (LSI), where queries and documents are represented as vectors of conceptual meanings. LSI compares a query with all the documents in the database, then returns those with higher similarity to the user. It has been tested on several information systems with promising results.

A deficiency of LSI is that it searches through the whole database. The searching time becomes significant when the database is large. One way to ameliorate this problem is to search documents by clusters instead of by individual record. This approach is based on van Rijsbergen's cluster hypothesis, which states that closely associated documents tend to be relevant to the same queries. He also proposed cluster-based retrieval on hierarchically clustered collections to improve retrieval effectiveness and efficiency.

In this paper, we describe a hierarchic architecture that applies hierarchic clustering to LSI documents and compare its performance with nonclustered LSI retrieval on several document collections. We review the methodology of LSI and hierarchic clustering in the Background section, describe our architecture in the System Architecture section, and show the experimental results in the Experiments section. The Conclusions section summarizes our findings.

Background

Latent Semantic Indexing

LSI is an extension of Salton's Vector Space Model, in which documents and queries are represented as vectors of term frequencies or weights. To capture the semantic structure among documents in a database, LSI applies Singular Value Decomposition (SVD) to a term-document matrix representing the database and generates vectors of k (typically … to …) orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The k-dimensional vectors are used to represent both documents and terms in the same semantic space, while their values indicate the degrees of association with the k underlying concepts. Figure 1 shows SVD applied to a term-document matrix.

Figure 1: SVD applied to an m × n term-document matrix, where m and n are the numbers of terms and documents in the database and k is the indexing dimension used by SVD. The decomposition yields an m × k term matrix and an n × k document matrix.

A query vector in LSI is the weighted sum of its component term vectors. For example, a p-term query is represented as the average of the p decomposed term vectors. To determine relevant documents, the query vector is compared with all the document vectors, and those with the highest cosine coefficient are returned. Because the indexing dimension k is chosen to be much smaller than the number of terms and documents in the database (i.e., the number of rows and columns in the term-document matrix), those k concepts are neither term nor document frequencies but compressed forms of both. Therefore, a query can hit documents that share no terms with it but do share concepts.
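
The query step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the SVD step has already produced k-dimensional term and document vectors, and the vectors, document ids, and function names below are hypothetical (k = 2 for readability).

```python
import math

def cosine(u, v):
    """Cosine coefficient of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query_vector(terms, term_vectors):
    """Average the k-dimensional vectors of the query terms that are indexed."""
    known = [term_vectors[t] for t in terms if t in term_vectors]
    if not known:
        raise ValueError("no query term is indexed")
    k = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(k)]

def rank_documents(query, doc_vectors):
    """Return (doc_id, cosine) pairs in descending order of similarity to the query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical k = 2 vectors standing in for the output of the SVD step.
TERM_VECTORS = {"car": [0.9, 0.1], "automobile": [0.8, 0.2], "flower": [0.1, 0.9]}
DOC_VECTORS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}

q = query_vector(["car", "automobile"], TERM_VECTORS)
ranking = rank_documents(q, DOC_VECTORS)
```

Because the comparison happens in concept space, a document can rank highly even if it contains none of the literal query terms. Note that this full search performs one comparison per document, which is the cost the hierarchic architecture aims to reduce.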

Hierarchic Agglomerative Clustering

To cluster LSI documents, we apply the hierarchic agglomerative clustering method. Hierarchic agglomerative clustering has been studied as a way to increase retrieval effectiveness and efficiency compared with the conventional search of nonclustered data. A typical hierarchic clustering method can be described as follows:

1. Compute all pairwise interdocument similarity coefficients.
2. Place each document in a cluster of its own.
3. Form a new cluster by merging the most similar pair of current clusters.
4. Recompute the similarity between the newly merged cluster and the remaining clusters.
5. Repeat from step 3 while there is more than one cluster.

The output of a hierarchic clustering algorithm is a cluster hierarchy, as shown in Figure 2. Various clustering methods differ in the manner in which they define the similarity between clusters. Three of the most commonly used methods in information retrieval are single link, complete link, and group average. Single link clustering uses the similarity between the most similar pair of documents, one in each cluster, as the similarity between the two clusters. Complete link clustering uses that of the least similar document pair in the two clusters. Group average clustering uses the average similarity of all document pairs as the intercluster similarity. Early experiments showed that the performance of hierarchic clustering methods varies when tested on different document collections.

Figure 2: Sample cluster hierarchy over documents d1 to d6, where documents are denoted as squares and the similarities between merged clusters (0.9, 0.7, 0.5, 0.3, and 0.2) are shown in circles.
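
The clustering steps and the three inter-cluster similarity definitions can be sketched as follows. This is a naive illustration under stated assumptions, not the Voorhees-based implementation used in the experiments: the similarity table, document ids, and function names are hypothetical, and cluster similarities are recomputed from document pairs on every pass rather than updated incrementally.

```python
from itertools import combinations

def cluster_similarity(a, b, sim, method):
    """Similarity between clusters a and b (tuples of document ids)."""
    pair_sims = [sim[frozenset((x, y))] for x in a for y in b]
    if method == "single":      # most similar document pair
        return max(pair_sims)
    if method == "complete":    # least similar document pair
        return min(pair_sims)
    if method == "average":     # mean over all cross-cluster document pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown method: {method}")

def agglomerate(docs, sim, method):
    """Merge clusters bottom-up; return [(cluster_a, cluster_b, similarity), ...]."""
    clusters = [(d,) for d in docs]          # each document starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(pair[0], pair[1], sim, method))
        merges.append((a, b, cluster_similarity(a, b, sim, method)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical pairwise similarities for four documents.
SIM = {
    frozenset(("d1", "d2")): 0.9, frozenset(("d3", "d4")): 0.8,
    frozenset(("d1", "d3")): 0.6, frozenset(("d2", "d3")): 0.3,
    frozenset(("d1", "d4")): 0.2, frozenset(("d2", "d4")): 0.1,
}
DOCS = ["d1", "d2", "d3", "d4"]
top_merge = {m: agglomerate(DOCS, SIM, m)[-1][2]
             for m in ("single", "complete", "average")}
```

On this toy input all three methods first merge d1 with d2 and d3 with d4; they differ only in the similarity assigned to the final merge, which is the best pair for single link, the worst pair for complete link, and the mean for group average.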

System Architecture

We integrate LSI with hierarchic agglomerative clustering to improve the efficiency of conceptual retrieval. As shown in Figure 3, we build a hierarchic system by applying one of the above clustering methods to arrange documents at different levels.

For an N-level system, we choose N similarity thresholds, one for each level. At the i-th level, the clusters with similarity larger than or equal to the i-th lowest threshold are grouped together. Each of the remaining clusters forms a group of its own. Thus the whole database can be represented as a set of clusters at a specific level. Note that a higher-level cluster can be further divided into subclusters at a lower level. The cluster hierarchy of a database is created only once.
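
Grouping clusters by a similarity threshold amounts to cutting the cluster hierarchy at that threshold. The sketch below illustrates the idea on a hypothetical merge list; the tree shape, document ids, and the `cut_hierarchy` name are assumptions made for illustration and are not taken from the figures.

```python
def cut_hierarchy(docs, merges, threshold):
    """Group documents by applying every merge whose similarity is >= threshold.

    `merges` lists (left_cluster, right_cluster, similarity) bottom-up, with
    each cluster given as a tuple of all the document ids it contains.
    """
    cluster_of = {d: frozenset([d]) for d in docs}
    for left, right, similarity in merges:
        if similarity >= threshold:
            merged = frozenset(left) | frozenset(right)
            for d in merged:
                cluster_of[d] = merged
    groups = set(cluster_of.values())
    return sorted(sorted(g) for g in groups)

# Hypothetical hierarchy over six documents (bottom-up merge order).
DOCS = ["d1", "d2", "d3", "d4", "d5", "d6"]
MERGES = [
    (("d1",), ("d2",), 0.9),
    (("d3",), ("d4",), 0.7),
    (("d1", "d2"), ("d5",), 0.5),
    (("d1", "d2", "d5"), ("d3", "d4"), 0.3),
    (("d1", "d2", "d3", "d4", "d5"), ("d6",), 0.2),
]

fine = cut_hierarchy(DOCS, MERGES, 0.6)
medium = cut_hierarchy(DOCS, MERGES, 0.4)
coarse = cut_hierarchy(DOCS, MERGES, 0.3)
```

Lower thresholds admit more merges and therefore yield fewer, larger groups, and every group at a coarse cut is a union of groups from a finer cut. That nesting is what allows a cluster at one level to be divided into subclusters at the level below.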
Figure 4 shows a 3-level system hierarchy generated from the cluster hierarchy in Figure 2.

Figure 3: System architecture. A query is matched through a similarity measure against clusters at level N, level N-1, down to level 1, above the documents at level 0.

Figure 4: A 3-level system hierarchy with similarity thresholds 0.3, 0.4, and 0.6, generated from the cluster hierarchy in Figure 2. Clusters are labeled a to i, with their sizes shown in brackets.

To search relevant documents, a query is compared with each cluster centroid, defined as the average vector of all the documents in the cluster. To keep the volume of query results acceptable, we set a cutoff C for the desired number of returned documents. In the following, we describe the
steps for searching and returning relevant documents. First, the cluster centroids at the top level (i.e., level N) are compared with the query and ranked in descending order of their similarity to it. The required number of documents is then taken from the top of the ranking. If the first-ranked cluster has fewer than C documents, all of its documents are returned. The next cluster in the list is then checked, and so on until a sufficient number of documents has been returned. If a cluster has more than C documents, the search moves down one level, and the same procedure is performed at that level. This process continues until it reaches the lowest cluster level (i.e., level 1), where clusters contain no subclusters, only documents. Under the cluster hypothesis, we treat documents in the same cluster as equally similar to the same query, so we randomly select documents from the last of the retrieved clusters until sufficient documents are obtained.
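
The search procedure above can be sketched as a short recursive function. This is one possible reading of the description, not the authors' code: it compares each cluster's size with the number of documents still needed (rather than with C itself), and the `Cluster` class, centroid values, and document ids are hypothetical. The centroids here are set by hand rather than averaged from document vectors.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Cluster:
    centroid: list                                  # vector representing the cluster
    docs: list                                      # ids of all documents in the cluster
    children: list = field(default_factory=list)    # subclusters one level down

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(clusters, query, cutoff, rng=random):
    """Return (doc ids, number of query-centroid comparisons) for one level."""
    ranked = sorted(clusters, key=lambda c: cosine(query, c.centroid), reverse=True)
    comparisons = len(clusters)
    results = []
    for cluster in ranked:
        needed = cutoff - len(results)
        if needed <= 0:
            break
        if len(cluster.docs) <= needed:             # small enough: take it whole
            results.extend(cluster.docs)
        elif cluster.children:                      # too large: move down one level
            found, extra = search(cluster.children, query, needed, rng)
            results.extend(found)
            comparisons += extra
        else:                                       # lowest level: pick at random
            results.extend(rng.sample(cluster.docs, needed))
    return results, comparisons

# Hypothetical three-level hierarchy with hand-set centroids.
h = Cluster([0.8, 0.2], ["d1", "d2"])
g = Cluster([0.2, 0.8], ["d3"])
d = Cluster([0.6, 0.4], ["d1", "d2", "d3"], [g, h])
c = Cluster([0.1, 0.9], ["d4"])
a = Cluster([0.5, 0.5], ["d1", "d2", "d3", "d4"], [c, d])
b = Cluster([0.9, 0.1], ["d5", "d6"])

results, comparisons = search([a, b], [1.0, 0.0], cutoff=3, rng=random.Random(0))
```

In this toy run the search returns both documents of cluster b, descends through a and d, and draws one document from h, using six centroid comparisons in total. The comparison count is the quantity the experiments report as the efficiency measure.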

Below, we use Figure 4 as an example with a fixed cutoff C. At the top level, we assume cluster b has higher similarity with the query than cluster a. We return all the documents in cluster b because it has fewer than C documents. Then we check cluster a. Since it has more than C documents, we compare all its child clusters with the query. Assume clusters d and h have higher similarity than clusters c and g, respectively. We randomly select one of the documents in cluster h as the last document satisfying C.

Experiments

To evaluate our method, we conduct experiments on four standard document collections (CACM, CISI, CRAN, and MED) for which queries and relevance judgments are available. We apply all three hierarchic clustering methods (single link, complete link, and group average) to generate the cluster hierarchy and compare their retrieval performance on each collection.

Methodology

Documents in each database are indexed with terms occurring in the title and abstract but not on a stop list of common words. While queries are written in natural language, terms in a query are used only if they do not appear on the same stop list and if they appear in at least one document. All indexed terms are stored in their original forms without stemming. Table 1 shows the characteristics of each collection.

                                               CACM   CISI   CRAN   MED
Number of documents
Number of queries
Mean number of terms per document
Mean number of terms per query
Mean number of relevant documents per query

Table 1: Collection characteristics.

Each document is represented by a vector calculated by LSI with indexing dimension k. We compute the cosine similarity between each pair of documents and apply the three hierarchic clustering methods. The implementations of those clustering methods are based on the algorithms developed by Voorhees. In our experiments, we use a two-level architecture. The upper level consists of clusters generated using similarity threshold …, while the lower level contains clusters with similarity threshold …. Table 2 shows the number of clusters and their average size generated in the two-level architecture using different clustering methods.

Method                    CACM   CISI   CRAN   MED
single link      level
                 level
complete link    level
                 level
group average    level
                 level

Table 2: Number of clusters and their average size (in parentheses) generated in a two-level architecture.

To evaluate the retrieval performance of our experiments, we calculate the effectiveness measure E, defined as

E = 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R}

where P and R are the precision and recall values and β is a parameter, defined by the user, reflecting the relative importance of recall to precision. For example, β = 2 indicates that recall is twice as important as precision. The E measure has been commonly used in studies of document clustering. It varies from 0 to 1, with low values corresponding to high retrieval effectiveness.
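
As a small worked illustration of the E measure, the sketch below computes precision, recall, and E for one query. The function names and the retrieved and relevant document ids are hypothetical, and returning 1.0 when both precision and recall are zero is a convention assumed here to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the relevant set."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def e_measure(precision, recall, beta=1.0):
    """E = 1 - (1 + beta^2) P R / (beta^2 P + R); lower is better."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 1.0          # assumed convention: nothing relevant was retrieved
    return 1.0 - (1 + beta ** 2) * precision * recall / denominator

# Hypothetical query: four documents returned, two of them relevant,
# and those two are the only relevant documents in the collection.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2"])
e_recall_heavy = e_measure(p, r, beta=2.0)
e_precision_heavy = e_measure(p, r, beta=0.5)
```

Here precision is 0.5 and recall is 1.0, so weighting recall more heavily (β = 2) gives a lower, that is better, E value than weighting precision more heavily (β = 0.5).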

In addition to the E measure, we also measure the efficiency of our methods by counting the number of comparisons between the query and the cluster centroids needed to obtain sufficient documents. We compare the results of using the three clustering methods with our standard full search, which matches queries with each document in the database, ranks the documents in descending order of their similarity with the query, and returns only the top cutoff documents.

Results

Figures 5, 6, and 7 show the average E values for the tested values of β, over all the queries on the four databases, using a cutoff of 20, 50, and 100 documents, respectively. Full search has, in general, the lowest E values compared with the three clustering methods. This is because full search returns the most similar documents selected from the whole database instead of from a subset of clusters. Next to full search, complete link has the second-best overall performance, followed by group average and single link. Because complete link tends to generate a large number of small and tightly bound clusters, most of the documents in a cluster are likely to be relevant to a query if their cluster centroid has a high similarity with the query. In contrast to complete link, single link usually forms a small number of large clusters with little internal cohesion. It is therefore likely to cause low precision, resulting in poor retrieval effectiveness. Group average is a compromise between single link and complete link: it generates clusters of medium size.

Figure 8 shows the average number of comparisons as a percentage of full search for the four databases using different methods. In Figure 8, we can see that group average is the most efficient method in terms of the number of comparisons needed to obtain sufficient documents. It requires as few as …% of the comparisons made by full search. Complete link is a close second, ranging from …% to …% of full search. Single link requires as many as …% in the worst case. In contrast to the retrieval effectiveness, the efficiency is improved significantly when using cluster-based search. Table 3 shows the percentage of average improvement on retrieval effectiveness and efficiency of the three clustering methods against full search.

Figure 5: The average E values (y-axis, 0 to 1) against β (x-axis, 0 to 2.5) using a cutoff of 20 documents for full search, complete link, group average, and single link on the four databases: (a) CACM, (b) CISI, (c) CRAN, and (d) MED.

Figure 6: The average E values (y-axis, 0 to 1) against β (x-axis, 0 to 2.5) using a cutoff of 50 documents for full search, complete link, group average, and single link on the four databases: (a) CACM, (b) CISI, (c) CRAN, and (d) MED.

Figure 7: The average E values (y-axis, 0 to 1) against β (x-axis, 0 to 2.5) using a cutoff of 100 documents for full search, complete link, group average, and single link on the four databases: (a) CACM, (b) CISI, (c) CRAN, and (d) MED.

Method            CACM       CISI       CRAN       MED
                  E    F     E    F     E    F     E    F
single link
complete link
group average

Table 3: Percentage of improvement on retrieval effectiveness (E) and efficiency (F) using the three clustering methods against full search. Negative values represent a loss of performance.

As shown in Table 3, the improvements in retrieval efficiency are significantly larger than the loss of retrieval effectiveness when using cluster-based search. The results show that clustering methods improve retrieval efficiency by …% to …%, while still providing over …% of the retrieval effectiveness of full search.

Conclusions

We have proposed a hierarchic architecture to reduce the searching time for conceptual information retrieval. Our method employs hierarchic clustering on documents indexed by latent semantic indexing. The experimental results show that cluster-based search improves retrieval efficiency significantly over nonclustered search, while the loss of retrieval effectiveness is less than …%. This indicates that our method can provide much faster query response in conceptual information retrieval while maintaining retrieval effectiveness comparable to that of the conventional method.

Figure 8: The average number of comparisons as a percentage of full search using a cutoff of (a) 20, (b) 50, and (c) 100 documents for full search, complete link, group average, and single link on the four databases (CISI, CRAN, MED, and CACM).

References

George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, September.

C. J. van Rijsbergen. Information Retrieval. Butterworth & Co (Publishers) Ltd, London, second edition.

N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval.

Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company.

Gerard Salton. Automatic Information Organization and Retrieval. McGraw-Hill Book Company.

C. J. van Rijsbergen and W. Bruce Croft. Document clustering: An evaluation of some experiments with the Cranfield collection. Information Processing and Management.

W. Bruce Croft. A model of cluster searching based on classification. Information Systems.

Alan Griffiths, Lesley A. Robinson, and Peter Willett. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, September.

Abdelmoula El-Hamdouchi and Peter Willett. Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal.

Alan Griffiths, H. Claire Luckhurst, and Peter Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science.

Ellen M. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management.

Description
Shih-Hao Li and Peter B. Danzig. "A hierarchic architecture for conceptual information retrieval." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 637 (1996).
Asset Metadata
Creator: Danzig, Peter B. (author), Li, Shih-Hao (author)
Core Title: USC Computer Science Technical Reports, no. 637 (1996)
Alternative Title: A hierarchic architecture for conceptual information retrieval (title)
Publisher: Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag: OAI-PMH Harvest
Format: 12 pages (extent), technical reports (aat)
Language: English
Unique identifier: UC16270864
Identifier: 96-637 A Hierarchic Architecture for Conceptual Information Retrieval (filename)
Legacy Identifier: usc-cstr-96-637
Rights: Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type: application/pdf
Copyright: In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source: 20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions: The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: USC Viterbi School of Engineering Department of Computer Science
Repository Location: Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email: csdept@usc.edu
Inherited Values
Title: Computer Science Technical Report Archive
Description: Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal: 1991/2017