USC Computer Science Technical Reports, no. 632 (1996)
Vintage: A Visual Information Retrieval Interface Based on Latent Semantic Indexing
Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
{shli, danzig}@cs.usc.edu
Abstract
A visual information retrieval interface helps searchers browse a large number of documents and locate interesting ones. This article describes a new method that displays the relationships between query terms and documents in a two-dimensional space. Users can refine their query by marking returned documents as relevant or nonrelevant and resubmit it for higher recall and precision. We have implemented this method in a Web-based prototype system called Vintage.
Introduction
Traditional information retrieval systems return documents in a sorted or ranked list. Such lists, as they get long, become tedious to browse.
By arranging documents in a two-dimensional space based on their interrelationships, users can more easily identify topics and browse topical clusters. Gower and Digby introduce several methods, such as principal components analysis, biplot, and correspondence analysis, for expressing complex relationships in two dimensions. These methods use Singular Value Decomposition (SVD) to transform documents into one or more matrices that carry the compressed relationships of the original data. In Information Agents, Cybenko et al. use Self-Organizing Maps (SOM) to map multidimensional data onto a two-dimensional space. SOM is based on neural networks and needs iterative training and adjustments. Similarly, Lin uses SOM to generate a two-dimensional map display that shows the document distribution in a database. GUIDO (Graphical User Interface for Document Organization) organizes documents in a two-dimensional display according to their distances to two reference terms, called Points of Interest (POI), chosen by the user. In GUIDO, documents lying in the same direction with respect to a given set of POIs are judged similar.
In this paper, we propose a two-dimensional visualization scheme based on Latent Semantic Indexing (LSI) for retrieving documents. Our method is similar to GUIDO but uses different coordinates to arrange documents. In addition, it needs no training or adjustment, in contrast to SOM. Our method displays data by clusters according to their cross-similarities and ranks documents by colors with respect to the user query. We introduce LSI in the section "Latent Semantic Indexing" and describe our method and Vintage in the section "Visualization Using LSI". The section "Experiments" shows the experimental results for relevance feedback, and the section "Conclusions" presents our conclusions.
Latent Semantic Indexing
LSI was originally developed to address the vocabulary problem, which states that people with different backgrounds or in different contexts describe information differently. LSI assumes that some underlying semantic structure exists in the pattern of term usage across documents and uses SVD to capture this structure. LSI represents documents as vectors of term frequencies, following Salton's Vector Space Model (VSM). An entire database is represented as an $m \times n$ term-document matrix, where $m$ and $n$ are the number of terms and documents in the database. To capture the semantic structure among documents, LSI applies SVD to this matrix and generates vectors of $k$ orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors are used to represent both documents and terms in the same semantic space, while their values indicate the degrees of association with the $k$ underlying concepts. Because $k$ is chosen much smaller than the number of documents and terms in the database, the decomposed vectors are represented in a compressed dimensional space and are not independent. Therefore, two documents can be relevant to each other without having common terms, as long as they share common concepts. Figure 1 shows SVD applied to a term-document matrix, and Example 1 illustrates the document representation before and after SVD.
Figure 1: SVD applied to an $m \times n$ term-document matrix, producing an $m \times k$ term matrix and an $n \times k$ document matrix, where $m$ and $n$ are the number of terms and documents in the database and $k$ is the dimensionality of the SVD representation.
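The decomposition described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes NumPy is available, and the function names `lsi_decompose` and `reconstruction_error` and the small 4-term by 3-document matrix are hypothetical. It also checks the SVD property that keeping more singular values brings the rank-k approximation closer to the original matrix.

```python
import numpy as np


def lsi_decompose(term_doc, k):
    """Return (term_vectors, singular_values, doc_vectors) truncated to k dimensions."""
    a = np.asarray(term_doc, dtype=float)
    # Thin SVD: a = u @ diag(s) @ vt, with singular values in decreasing order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T


def reconstruction_error(term_doc, k):
    """Frobenius norm of the difference between the matrix and its rank-k approximation."""
    a = np.asarray(term_doc, dtype=float)
    terms, sigma, docs = lsi_decompose(a, k)
    approx = (terms * sigma) @ docs.T
    return float(np.linalg.norm(a - approx))


# Hypothetical data: rows are terms, columns are documents (1 = term occurs).
A = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]

term_vecs, sigma, doc_vecs = lsi_decompose(A, k=2)
print(term_vecs.shape, doc_vecs.shape)  # one k-dimensional row per term / per document
print([round(reconstruction_error(A, k), 3) for k in (1, 2, 3)])
```

Each row of `term_vecs` and `doc_vecs` is a point in the same k-dimensional space, which is what lets terms and documents be compared and plotted together.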
Example 1. Let $d_i$ ($1 \le i \le 5$) and $t_j$ ($1 \le j \le 8$) be a set of documents and their associated terms in an information system, where each document is a set of three or four of these terms. We apply LSI to these documents and choose the indexing dimension $k$ as 2. Table 1 shows their vector representations in VSM (i.e., before SVD) and LSI (i.e., after SVD). Figure 2 shows the two-dimensional plot of the decomposed document and term vectors in LSI.
Table 1: The vector representations of documents $d_i$ ($1 \le i \le 5$) in VSM (i.e., before SVD) and LSI (i.e., after SVD). For each document, the table lists its term description; the $t_1, \ldots, t_8$ columns and the dim 1 and dim 2 columns show the vectors in VSM and LSI, respectively.
Figure 2: Two-dimensional plot of terms and documents (Dimension 1 versus Dimension 2, each ranging from -1 to 1) for the decomposed vectors in Example 1, where documents $d_i$ ($1 \le i \le 5$) and terms $t_j$ ($1 \le j \le 8$) are drawn with different markers. Two clusters, A and B, are marked.
The above example shows that documents having a number of common terms are close to each other, as well as to their common terms. In Figure 2, we can roughly see two clusters, A and B. In cluster A, three documents are close to each other because each pair of them shares at least one term. Similarly, the remaining two documents are close to each other because both of them contain the same two terms. By presenting documents and terms in this way, we can easily see their relationships in a two-dimensional space. In an information retrieval interface based on this technique, users can find interesting documents by their surrounding terms and locate the relevant ones by clusters.
Visualization Using LSI
Based on the clustering feature of LSI, we develop a new visual interface to display the relationships between query terms and documents for information retrieval. In this section, we describe our method and its prototype system, Vintage.
Methodology
Let $k$ be the indexing dimension used in LSI when transforming the original term-document matrix into decomposed term and document vectors. Those decomposed vectors are computed only once and are stored on the server for further query processing. LSI represents a query as a $k$-dimensional vector that is the average of its component term vectors. To compute the answer set of a query, LSI compares the query vector with all document vectors and returns those with the highest cosine coefficient.
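The query step just described can be sketched as follows. This is an illustration under stated assumptions, not Vintage's code: the helper names and the two-dimensional vectors are hypothetical, and in practice the term and document vectors would be the precomputed LSI vectors.

```python
import math


def average(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine coefficient of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_terms, term_vectors, doc_vectors):
    """Return (doc_id, similarity) pairs sorted by decreasing cosine coefficient."""
    # The query vector is the average of the vectors of its known terms.
    query = average([term_vectors[t] for t in query_terms if t in term_vectors])
    scored = [(doc, cosine(query, vec)) for doc, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical k = 2 vectors, for illustration only.
term_vectors = {"performance": [0.8, 0.1], "evaluation": [0.6, 0.5]}
doc_vectors = {"d1": [0.7, 0.3], "d2": [-0.2, 0.9], "d3": [0.9, -0.4]}

ranking = rank_documents(["performance", "evaluation"], term_vectors, doc_vectors)
print([doc for doc, _ in ranking])
```

Because the document vectors are computed once and stored, answering a query costs only one pass of cosine comparisons over them.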
To display query terms and returned documents in a two-dimensional space, we need to reduce their $k$-dimensional vectors to two dimensions. One solution is to run SVD on those $k$-dimensional vectors, choosing the new indexing dimension as 2. This approach can precisely present the relationships between terms and documents in a two-dimensional space. However, it needs to recompute SVD for each query, which generates a high computation overhead on the server. Another solution is to use the first two of the $k$ dimensions. This approach needs no SVD recomputation and can be viewed as an approximation due to SVD properties. When computing SVD, the singular values corresponding to the indexing dimensions are of decreasing importance. The original term-document matrix can be approximated by a linear combination of the top singular values and their corresponding rows and columns from the decomposed document and term matrices. The more singular values are included, the more closely the original matrix is approximated. Based on this concept, we implemented a prototype system called Vintage.
Vintage
Vintage (A Visual Information Retrieval Interface Based on Latent Semantic Indexing) is a Web-based prototype system built using HTML and CGI scripts.
In Vintage, users can enter a list of query terms and submit them to the server. The server consists of two modules, the Server Gateway and the Query Processing Unit, as shown in Figure 3. The server gateway parses user requests and transforms them into a standard query format for further processing. The query processing unit composes a query vector from its component term vectors and compares it with the precomputed document vectors. All documents with similarity larger than a predefined threshold are considered relevant and are returned to the server gateway. The server gateway generates a 2-D image and an HTML page for the relevant documents, which it forwards to the user as both a list and a two-dimensional image.
Figure 3: Vintage system architecture. The Web browser sends a user request to the Server Gateway, which formats the request and passes the query to the Query Processing Unit. The Query Processing Unit applies the similarity measure to the LSI documents and returns the relevant documents to the Server Gateway, which generates the 2-D image and Web page and sends the results to the browser.
In the list, document titles are displayed and sorted in decreasing order of their similarities with respect to the query. In the 2-D image, query terms are printed in a constant color, and documents are represented as squares painted with different colors showing their degree of similarity to the query: the higher the similarity, the darker the color. The numbers inside the squares indicate the ranks of the associated documents in the linear list. In this way, users can see the relationships between query terms and documents by their distances, as well as their individual similarities by colors, in the same two-dimensional space. In addition, users can click on documents in the list or the 2-D image to see their detailed descriptions.
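The display rules above can be sketched as a small layout routine. This is only an illustration: the source does not specify how Vintage maps similarity to color, so the linear gray-level mapping, the function names, and the sample vectors below are assumptions. Positions use the first two of the k LSI dimensions, as described in the Methodology section.

```python
def layout_point(vector):
    """Use the first two of the k LSI dimensions as (x, y) coordinates."""
    return vector[0], vector[1]


def shade(similarity, threshold):
    """Map a similarity in [threshold, 1] to a gray level (0 = black, 255 = white).

    Assumed linear mapping: higher similarity gives a darker square.
    """
    span = 1.0 - threshold
    fraction = (similarity - threshold) / span if span > 0 else 1.0
    fraction = min(max(fraction, 0.0), 1.0)
    return round(255 * (1.0 - fraction))


def build_display(ranked, doc_vectors, threshold):
    """Return one record per returned document: rank label, position, gray level."""
    items = []
    for rank, (doc, sim) in enumerate(ranked, start=1):
        if sim <= threshold:
            continue  # only documents above the threshold are returned
        x, y = layout_point(doc_vectors[doc])
        items.append({"rank": rank, "doc": doc, "x": x, "y": y,
                      "gray": shade(sim, threshold)})
    return items


# Hypothetical ranked results and k = 3 document vectors.
ranked = [("d1", 1.0), ("d3", 0.68), ("d2", 0.185)]
doc_vectors = {"d1": [0.7, 0.3, 0.1], "d2": [-0.2, 0.9, 0.4], "d3": [0.9, -0.4, 0.2]}

items = build_display(ranked, doc_vectors, threshold=0.5)
for item in items:
    print(item)
```

The rank label on each square ties the 2-D image back to the ranked list, so the two views can be read together.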
Vintage also provides relevance feedback. Users can refine a query by changing the original query terms or by marking returned documents as relevant or nonrelevant and resubmitting them to the server. In Vintage, we adopt the Ide dec-hi relevance feedback method, which Salton and Buckley found to have the best overall performance among several commonly used methods. In Ide dec-hi, a refined query consists of the original query plus all user-judged relevant documents and minus the topmost nonrelevant document. Specifically,
$$Q_{new} = Q_{old} + \sum_{i=1}^{r} R_i - N$$
where $Q_{old}$ and $Q_{new}$ are the vectors of the original and the refined query, $R_i$ and $N$ are the vectors of the relevant documents and the topmost nonrelevant document judged by the user, and $r$ is the number of relevant documents. This method can be iterated for further query refinement.
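The update rule above translates directly into code. The sketch below is a minimal illustration with hypothetical names and vectors; it assumes the nonrelevant documents are passed in ranked order, so that the first one is the topmost nonrelevant document.

```python
def ide_dec_hi(query, relevant, nonrelevant_ranked):
    """Ide dec-hi update: Q_new = Q_old + sum of relevant vectors - top nonrelevant vector.

    query: the current query vector (Q_old).
    relevant: vectors of all documents the user judged relevant (R_1 .. R_r).
    nonrelevant_ranked: vectors of nonrelevant documents in rank order;
        only the first (topmost) one, N, is subtracted.
    """
    new_query = list(query)
    for doc in relevant:
        new_query = [q + d for q, d in zip(new_query, doc)]
    if nonrelevant_ranked:
        top = nonrelevant_ranked[0]
        new_query = [q - n for q, n in zip(new_query, top)]
    return new_query


# Hypothetical two-dimensional vectors, for illustration only.
q_old = [1.0, 0.0]
relevant = [[0.5, 0.5], [0.25, 0.25]]
nonrelevant = [[0.0, 1.0], [1.0, 1.0]]

q_new = ide_dec_hi(q_old, relevant, nonrelevant)
print(q_new)
```

Iterating the method means feeding `q_new` back in as the query for the next round of judgments.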
Below we present an example of a Vintage user session. The underlying database contains CACM document abstracts indexed by LSI. Suppose the user submits the query terms performance and evaluation and sets the maximum number of results. The result is shown in Figure 4, where the returned documents are displayed in a ranked list and a 2-D image. In the 2-D image, documents are scattered at different places. Among them, four of the top five documents are closer to the term evaluation than to performance. The user looks at the titles of these documents and finds that some of them are closer to what he wants. He then refines the query by adding a new term, model, and marking returned documents as relevant or nonrelevant (see Figure 5). The new result set is reduced to nine documents, as shown in Figure 6, where seven of them gather in a cluster with roughly equal distance (relevance) to the three query terms. The user clicks on the square of the document with the highest similarity to the query and sees its detailed description in Figure 7.
Figure 4: Vintage query result.
Experiments
To measure the performance of relevance feedback in Vintage, we conducted experiments on four standard datasets, CACM, CISI, CRAN, and MED, for which queries and relevance judgments are available. In each dataset, documents are indexed with terms that occur in the title and abstract but not on a stop list of common words. While queries are written in natural language, terms in a query are used only if they do not appear on a stop list and if they appear in at least one document. All indexed terms are stored in their original forms, without stemming. We apply SVD to each term-document matrix using the indexing dimension suggested in Deerwester's LSI experiments. Table 2 summarizes the characteristics of each dataset.
During relevance feedback, users need to mark returned documents as relevant or nonrelevant for query refinement. For a long returned list, users are likely to browse only those at the top of the list instead of all of them. Therefore, we focus only on the top-ranked documents in our experiments. We set the maximum number of results to four different values, 10, 25, 50, and 100, each representing the number of returned documents that users browse and mark for relevance. In our experiments, we do not have real users to judge each returned document. Instead, we simulate users' relevance feedback by using the standard relevance judgments of each dataset. The standard relevance judgments contain the perfect results for each query. If a returned document is in these judgments, it is considered judged relevant by the user. Otherwise, it is nonrelevant.
Figure 5: Query refinement.
As described previously, we apply the Ide dec-hi method to compute relevance feedback in Vintage. The initial queries are taken from the test queries in each dataset. We compare their results with the standard relevance judgments and calculate the associated precision and recall. For query refinement, we add the vectors of all returned documents that are judged relevant to the current query and subtract the topmost vector of those judged nonrelevant. The whole procedure is repeated five times for each query. Figure 8 shows the average precision and recall over all queries for each dataset.
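The evaluation step above can be sketched as follows, assuming the usual definitions of precision (relevant returned over returned) and recall (relevant returned over all relevant). The function names and document identifiers are hypothetical; the standard relevance judgments play the role of the simulated user.

```python
def precision_recall(returned, relevant):
    """returned: ranked list of document ids; relevant: set of ids judged relevant."""
    hits = sum(1 for doc in returned if doc in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def simulate_judgments(returned, relevant):
    """Split a ranked result list the way the simulated user would mark it."""
    marked_relevant = [doc for doc in returned if doc in relevant]
    marked_nonrelevant = [doc for doc in returned if doc not in relevant]
    return marked_relevant, marked_nonrelevant


# Hypothetical result list and relevance judgments for one query.
returned = ["d3", "d7", "d1", "d9"]
judgments = {"d1", "d3", "d5"}

precision, recall = precision_recall(returned, judgments)
marked_relevant, marked_nonrelevant = simulate_judgments(returned, judgments)
print(precision, recall)
print(marked_relevant, marked_nonrelevant[0])  # relevant docs, topmost nonrelevant doc
```

The relevant list and the first nonrelevant entry are exactly the inputs the Ide dec-hi update needs for the next feedback round.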
In Figure 8, we can roughly see that both precision and recall improve with each relevance feedback. In the four datasets we tested, recall gets higher as the number of returned documents increases, while precision becomes lower. The reason is that most relevant documents have higher ranks in the returned list. Therefore, when more documents are returned, the additional documents are more often nonrelevant than relevant.
Figure 6: The result after query refinement.
As shown in Figure 8, precision and recall increase as more relevance feedbacks are performed. However, the increasing rate between each two contiguous feedbacks does not go up accordingly. Figure 9 shows the average increasing rate of precision and recall for the five relevance feedbacks. In Figure 9, we can see that the first feedback yields the highest improvement for both precision and recall. As more feedbacks proceed, the increasing rate drops to almost zero. This indicates that users can get most of the improvement from the early feedbacks on these datasets.
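The source does not spell out how the increasing rate is computed. One plausible reading, assumed here, is the percentage change of a measure between two contiguous feedback rounds. The sketch below uses that assumption with made-up precision values, not the paper's data.

```python
def increasing_rates(values):
    """Percentage change between each pair of contiguous rounds.

    values[0] is the measure for the initial query and values[i] the measure
    after the i-th relevance feedback. The definition is an assumption.
    """
    rates = []
    for previous, current in zip(values, values[1:]):
        if previous:
            rates.append(100.0 * (current - previous) / previous)
        else:
            rates.append(float("inf"))  # undefined growth from zero
    return rates


# Hypothetical precision values for an initial query and three feedback rounds.
precision_by_round = [0.25, 0.5, 0.625, 0.625]
rates = increasing_rates(precision_by_round)
print(rates)
```

Under this reading, a curve that keeps rising can still show a falling increasing rate, which matches the diminishing returns described above.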
The experimental results show that Vintage's relevance feedback can improve precision and recall on the four standard datasets. In Vintage, users can set a small number of returned documents to get high precision on the initial query. Then they can mark relevant documents in the returned list and increase this number on further relevance feedbacks for higher recall. When the returned number is too large, users can look at the 2-D image to locate interesting documents.
Figure 7: Vintage document description.
In our experiments, we use the same initial query with different relevant and nonrelevant documents to generate a new query. We do not add or delete query terms during each feedback. In a real situation, precision and recall could improve further if query terms were changed during relevance feedback.
Conclusions
A visual information retrieval interface can help searchers browse a large number of documents and locate the interesting ones. In this paper, we have proposed a two-dimensional visualization scheme based on Deerwester's latent semantic indexing. Our method displays data according to their cross-similarities and still retains the rankings with respect to the query. We have implemented a prototype system, Vintage, and shown that the embedded relevance feedback method can improve both precision and recall on four standard datasets.
Figure 8: Precision-recall curves (precision versus recall, both from 0 to 1) for different numbers of returned documents (n = 10, n = 25, n = 50, and n = 100) on four datasets: (a) CACM, (b) CISI, (c) CRAN, and (d) MED. The data are obtained by measuring the average precision and recall of all test queries, each consisting of the initial query and five successive relevance feedbacks.
Table 2: Characteristics of datasets. For each of CACM, CISI, CRAN, and MED, the table lists the number of documents, the number of queries, the number of indexing terms, the average number of terms per document, and the average number of terms per query.
Figure 9: Average increasing rate (%) of relevance feedbacks for (a) precision and (b) recall on four datasets, CACM, CISI, CRAN, and MED.
The current version of Vintage puts all computations on the server side, including measuring similarities and generating 2-D images. We believe that moving the visualization part to the client side can not only alleviate the load on the server but also provide more functionality to the user. For example, users could zoom in on a specific portion of the image or rearrange documents and terms based on their profiles. To achieve this goal, we are developing a new Vintage using Sun Microsystems' Java language and HotJava browser.
References
J. C. Gower and P. G. N. Digby, "Expressing complex relationships in two dimensions," in Interpreting Multivariate Data (V. Barnett, ed.), John Wiley & Sons, Inc.
G. Cybenko, Y. Wu, A. Khrabrov, and R. Gray, "Information agents as organizers," in CIKM Workshop on Intelligent Information Agents, December.
T. Kohonen, "The self-organizing map," Proceedings of the IEEE Transactions on Computers.
X. Lin, "Visualization for the document space," in Proceedings of Visualization, Boston, MA, October.
A. Nuchprayoon and R. R. Korfhage, "GUIDO, a visual tool for retrieving documents," in Proceedings of the IEEE Symposium on Visual Languages.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, September.
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication," Communications of the ACM, November.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
G. Salton, Automatic Information Organization and Retrieval. McGraw-Hill Book Company.
E. Ide, "New experiments in relevance feedback," in The SMART Retrieval System: Experiments in Automatic Document Processing (G. Salton, ed.), Englewood Cliffs, NJ: Prentice-Hall, Inc.
G. Salton and C. Buckley, "Improving retrieval performance by relevance feedback," Journal of the American Society for Information Science, June.
S.-H. Li and P. B. Danzig, "Two-dimensional visualization for Internet resource discovery," Technical Report USC-CS, University of Southern California.
Description
Shih-Hao Li and Peter B. Danzig. "Vintage: A visual information retrieval interface based on latent semantic indexing." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 632 (1996).
Asset Metadata
Creator
Danzig, Peter B.
(author),
Li, Shih-Hao
(author)
Core Title
USC Computer Science Technical Reports, no. 632 (1996)
Alternative Title
Vintage: A visual information retrieval interface based on latent semantic indexing (title)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
12 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16271174
Identifier
96-632 Vintage A Visual Information Retrieval Interface Based on Latent Semantic Indexing (filename)
Legacy Identifier
usc-cstr-96-632
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017