USC Computer Science Technical Reports, no. 609 (1995)
Two-Dimensional Visualization for Internet Resource Discovery
Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
{shli, danzig}@cs.usc.edu
Abstract
Traditional information retrieval systems return documents in a list, where documents are sorted according to their publication dates, titles, or similarities to the user query. Users select documents of interest by searching through the returned list. In a distributed environment, documents are returned from more than one information server. A simple document list is not an efficient way to present a large volume of data to the users. In this paper, we propose a two-dimensional visualization scheme for Internet resource discovery. Our method displays data by clusters according to their cross-similarities and still retains the rankings with respect to the user query. In addition, it provides a customized view that arranges data in favor of users' preference terms.
Index Terms: information retrieval, resource discovery, similarity measure, visualization
Introduction
Traditional information retrieval systems return documents in lists or sets, where documents are sorted according to their publication dates, titles in alphabetical order, or computed similarities to the user query. In a distributed environment, documents are returned from more than one information server, and a simple document list is not an efficient way to present a large volume of data to the users.
One solution to this problem is to present data using a two-dimensional graphical display. By arranging documents in a two-dimensional space based on their interrelationships, users can more easily identify topics and browse topical clusters. Gower and Digby introduce several methods, such as principal components analysis, biplot, and correspondence analysis, for expressing complex relationships in two dimensions. These methods use Singular Value Decomposition (SVD) to transform documents into one or more matrices that carry the compressed relationships of the original data. In Information Agents, Cybenko et al. use Self-Organizing Maps (SOM) to map multidimensional data onto a two-dimensional space. SOM is based on neural networks and needs iterative training and adjustments.
In this paper, we propose a two-dimensional visualization scheme based on Latent Semantic Indexing (LSI) for Internet resource discovery. Our method displays data by clusters according to their cross-similarities and ranks documents with respect to the user query. In addition, it provides a customized view that arranges data in favor of users' preference terms. Below we introduce LSI and describe our method in Internet resource discovery.
Latent Semantic Indexing
LSI was originally developed for solving the vocabulary problem, which states that people with different backgrounds or in different contexts describe information differently. LSI assumes there is some underlying semantic structure in the pattern of term usage across documents and uses SVD to capture this structure. Our goal is to present this structure, instead of a simple ranked list, to the users.
In LSI, documents are represented as vectors of term frequencies, following Salton's well-established Vector Space Model (VSM). An entire database is represented as a term-document matrix, where the rows and columns indicate the terms and documents, respectively. To capture the semantic structure among documents, LSI applies SVD to this matrix and generates vectors of k orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors are used to represent both documents and terms in the same semantic space, while their values indicate the degrees of association with the k underlying concepts. Because k is chosen much smaller than the number of documents and terms in the database, the decomposed vectors are represented in a compressed dimensional space and are not independent. Therefore, two documents can be relevant without having common terms but with common concepts. Figure 1 shows SVD applied to a term-document matrix, and Example 1 illustrates the document representation before and after SVD.
[Figure 1: diagram. SVD(k) decomposes the term-document matrix of the database (m x n) into a term matrix (m x k) and a document matrix (n x k).]
Figure 1: SVD applied to an m x n term-document matrix, where m and n are the numbers of terms and documents in the database and k is the dimensionality of the SVD representation.
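The decomposition in Figure 1 can be sketched in a few lines. The following is a minimal illustration, assuming NumPy is available; the function name `lsi_coordinates` and the 5 x 4 matrix are hypothetical and are not the data of Example 1. Scaling each axis by its singular value is one common plotting convention, not necessarily the one used for the figures in this report.

```python
import numpy as np

def lsi_coordinates(term_doc, k=2):
    """Return (term_coords, doc_coords) in a k-dimensional LSI space.

    term_doc is an m x n matrix: rows are terms, columns are documents.
    """
    A = np.asarray(term_doc, dtype=float)
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Scale each axis by its singular value (an assumed convention).
    term_coords = U[:, :k] * s[:k]    # m x k term matrix
    doc_coords = Vt[:k, :].T * s[:k]  # n x k document matrix
    return term_coords, doc_coords

# Hypothetical 5-term x 4-document matrix (1 = term occurs in document).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
terms, docs = lsi_coordinates(A, k=2)
print(terms.shape, docs.shape)
```

In this toy matrix the first two documents share terms and land near each other, while the last two documents are identical and land on the same point.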
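Example 1. Let $d_i$ ($1 \le i \le 5$) and $t_j$ ($1 \le j \le 8$) be a set of documents and their associated terms in an information system, where each document is given as a set of terms: $d_1$, $d_2$, and $d_5$ contain four terms each, and $d_3$ and $d_4$ contain three terms each. [The term subscripts of the five sets were lost in extraction.]
Table 1 shows their vector representations in VSM (i.e., before SVD) and LSI (i.e., after SVD). Figure 2 shows the two-dimensional plot of the decomposed document and term vectors in LSI.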
[Table 1: one row per document $d_1$ to $d_5$. Columns: document, term description, VSM vector ($t_1$ to $t_8$), LSI vector (dim 1, dim 2). The numeric entries were lost in extraction.]
Table 1: The vector representations of documents $d_i$ ($1 \le i \le 5$) in VSM (i.e., before SVD) and LSI (i.e., after SVD). The $t_1$ to $t_8$ and dim 1, dim 2 columns show the vectors in VSM and LSI, respectively.
[Figure 2: scatter plot titled "2-D Plot of Terms and Documents", Dimension 1 against Dimension 2, both axes running from -1 to 1. It shows documents d1 to d5 and terms t1 to t8, with two groups marked cluster A and cluster B.]
Figure 2: Two-dimensional plot of decomposed vectors in Example 1, where documents $d_i$ ($1 \le i \le 5$) and terms $t_j$ ($1 \le j \le 8$) are drawn with different markers.
The above example shows that documents having a number of common terms are close to each other, as are their common terms. In Figure 2, we can roughly see two clusters, A and B. In cluster A, three documents are close to each other because each pair of them shares at least one of four terms. Similarly, two other documents are close to each other because both of them contain the same two terms. Presenting documents and terms in this way lets us easily see their relationships from their x-y coordinates.
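In the report the clusters are identified by eye. As a rough illustration of what "close to each other" could mean computationally, the sketch below groups labelled points whenever a chain of neighbours within a fixed radius connects them. This is a swapped-in technique for illustration only; the function name `threshold_clusters`, the radius, and the coordinates are all hypothetical and are not the values plotted in Figure 2.

```python
from math import dist

def threshold_clusters(points, radius):
    """Group labelled 2-D points: two points share a cluster when a chain of
    neighbours, each within `radius` of the next, connects them."""
    labels = list(points)
    cluster_of = {}
    clusters = []
    for start in labels:
        if start in cluster_of:
            continue
        cluster_of[start] = len(clusters)
        members, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            for other in labels:
                near = dist(points[current], points[other]) <= radius
                if other not in cluster_of and near:
                    cluster_of[other] = len(clusters)
                    members.append(other)
                    frontier.append(other)
        clusters.append(sorted(members))
    return clusters

# Hypothetical coordinates for a few documents and terms.
points = {
    "d1": (0.60, 0.30), "d2": (0.70, 0.20), "d3": (0.50, 0.40),
    "d4": (-0.60, 0.50), "d5": (-0.70, 0.40),
    "t1": (0.65, 0.25), "t7": (-0.65, 0.45),
}
clusters = threshold_clusters(points, radius=0.3)
print(clusters)
```

Because terms and documents live in the same space, each group contains both the documents and the terms they have in common.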
Internet Resource Discovery
In a distributed environment, people search for information by sending requests to associated information servers using a client-server model (Figure 3(a)). In this approach, the client needs to know the server's name or address before sending a request. In the Internet, where thousands of servers provide information, it becomes difficult and inefficient to search all the servers manually.
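[Figure 3: diagram with two panels. (a) A client sends a request to a server and receives a result. (b) A client sends a query to a directory, which returns the relevant server(s); the client then sends a request to those servers and receives a result.]
Figure 3: (a) Client-Server Model. (b) Client-Directory-Server Model.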
One step toward solving this problem is to keep a directory of services that records the description or summary of each information server. A user sends his query to the directory of services, which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the information servers to query directly. We call this the client-directory-server model (Figure 3(b)).
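The ranking step performed by the directory of services can be sketched as follows. This is a minimal illustration under stated assumptions: each server is summarized by a term-frequency dictionary and servers are ranked by cosine similarity to the query. The report does not specify the similarity measure used by the directory, and the server names and summaries here are hypothetical.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_servers(directory, query):
    """Return (server, score) pairs sorted by descending similarity."""
    scored = [(name, cosine(summary, query)) for name, summary in directory.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical summaries: term -> frequency in that server's collection.
directory = {
    "server1": {"network": 4, "routing": 3, "protocol": 1},
    "server2": {"retrieval": 5, "indexing": 2, "query": 2},
    "server3": {"indexing": 1, "network": 1},
}
query = {"indexing": 1, "retrieval": 1}
ranking = rank_servers(directory, query)
for name, score in ranking:
    print(name, round(score, 3))
```

The client would then contact the highest-ranked servers directly, as in Figure 3(b).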
Internet resource discovery services, such as Archie, WAIS, WWW, Gopher, and Indie, allow users to retrieve information throughout the Internet. All of the above systems provide services similar to the client-directory-server model. For example, Archie, WAIS, and Indie support a global index, like the directory of services, in their systems. WWW and Gopher do not have a global index by themselves. Instead, they have an added-on indexing scheme built by other tools, such as Harvest and WWWW for WWW, and Veronica for Gopher.
Visualization in Client-Directory-Server Model
In Internet resource discovery, users search data from thousands of information servers. A simple ranked list is an inefficient way to present a large volume of data to the users. To solve this problem, we propose an LSI-based two-dimensional visualization scheme.
Our method presents documents in a two-dimensional space based on their cross-similarities. Documents having a number of common terms are placed close to each other, as are their common terms. In addition, documents are colored based on their similarities to the user query. Documents with higher similarities are painted with darker colors or gray levels. The darkest document in a cluster is used as the representative document of this cluster. The documents with similarities below a certain value are colorless. By doing this, relevant documents can be grouped together and still show their individual similarities to the query. Users can browse the common terms or representative documents and decide whether to see the details. In contrast, if returned documents are presented in a ranked list, the first and second documents may have identical similarity values with the query but cover completely different topics. If the user wants to focus on the second topic, he still needs to search the whole list to find other similar ones.
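The coloring rule can be sketched as a small mapping from query similarity to gray level. This is a minimal illustration, assuming similarities lie in [0, 1]; the cutoff of 0.2, the 256 gray levels, and the function names are assumptions, since the report only says that documents below "a certain value" are colorless.

```python
def gray_level(similarity, threshold=0.2, levels=256):
    """Map a query similarity in [0, 1] to a gray level.

    Returns None (colorless) below the threshold; otherwise an integer
    where 0 is black (most similar) and levels - 1 is the lightest shade.
    """
    if similarity < threshold:
        return None
    clipped = min(similarity, 1.0)
    return round((1.0 - clipped) * (levels - 1))

def representative(cluster, similarities):
    """Pick the document with the highest query similarity (the darkest)."""
    return max(cluster, key=lambda doc: similarities[doc])

# Hypothetical similarities of five returned documents to the query.
similarities = {"d1": 0.9, "d2": 0.5, "d3": 0.1, "d4": 0.7, "d5": 0.6}
shades = {doc: gray_level(sim) for doc, sim in similarities.items()}
print(shades)
print(representative(["d1", "d2", "d3"], similarities))
```

Documents left colorless still appear at their position in the plot, so cluster membership stays visible even for weak matches.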
To implement our visualization method, we collect all the returned documents at the client site, generate a term-document matrix out of them, and apply SVD to this matrix. The resulting terms and documents are plotted on a two-dimensional graphical display and presented to the users. Similarly, the relevant servers returned by the directory of services are displayed in the same manner. Figure 4 shows the architecture of our proposed method in the client-directory-server environment.
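The client-side step of turning returned documents into a term-document matrix can be sketched as follows. This is a minimal illustration that assumes the documents have already been tokenized into term lists; the function name `build_term_document_matrix` and the sample documents are hypothetical.

```python
def build_term_document_matrix(documents):
    """Build a term-frequency matrix from {doc_id: list_of_terms}.

    Returns (terms, doc_ids, matrix) with matrix[i][j] equal to the
    frequency of terms[i] in doc_ids[j].
    """
    doc_ids = list(documents)
    terms = sorted({term for words in documents.values() for term in words})
    row = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(doc_ids) for _ in terms]
    for j, doc_id in enumerate(doc_ids):
        for term in documents[doc_id]:
            matrix[row[term]][j] += 1
    return terms, doc_ids, matrix

# Hypothetical results returned by two servers, collected at the client.
from_server1 = {"doc1": ["svd", "matrix", "svd"], "doc3": ["matrix", "index"]}
from_server2 = {"doc2": ["gopher", "index"]}
returned = {**from_server1, **from_server2}
terms, doc_ids, matrix = build_term_document_matrix(returned)
print(terms)
print(doc_ids)
print(matrix)
```

The resulting matrix (terms as rows, documents as columns) has the layout expected by an SVD routine, after which the two leading dimensions give the plot coordinates.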
[Figure 4: diagram with two panels, each ending in a plot of Dimension 1 against Dimension 2. (a) The servers returned from the directory of services (server 1 to server 4) and the user profile pass through SVD(2), giving decomposed server and term vectors. (b) The documents returned from server 1 and server 2 (doc 1 to doc 6) and the user profile pass through SVD(2), giving decomposed document and term vectors.]
Figure 4: Two-dimensional visualization for (a) returned servers from the directory of services and (b) returned documents from relevant servers in the client-directory-server environment. The returned data are colored based on their similarities to the user query.
Because users have their own preference terms or taxonomies, displaying data in a customized view can assist users in searching and browsing documents. To achieve this, we rearrange returned data by merging them with a user profile that consists of a set of documents, called pseudo documents, selected by users to represent their interests. For each user profile, we generate a term-profile matrix like the term-document matrix. We merge it with the term-document matrix of the returned documents and apply SVD to them. The result is a customized view that contains the decomposed term and document vectors arranged in favor of the terminology used by the merged user profile. To scale to the number of returned documents, we replicate the pseudo documents or increase their weights in the merged matrix. Below we define the merge operation and demonstrate the customization in Example 2.
Definition 1. Let $A$ and $B$ be two term-document matrices, where $A$ consists of $m_1$ terms (rows) and $n_1$ documents (columns) and $B$ consists of $m_2$ terms (rows) and $n_2$ documents (columns). Let $T_A$ and $T_B$ be the sets of terms in $A$ and $B$, respectively. If $A$ merges $B$ is equal to $C$, then $C$ is a term-document matrix consisting of $m$ terms (rows) and $n$ documents (columns), where $n = n_1 + n_2$ and $m = |T_A \cup T_B|$.
Example 2. Let $p_1$ and $p_2$ be two pseudo documents in a user profile, each consisting of three of the terms $t_j$. [The term subscripts of the two sets were lost in extraction.]
We generate a term-profile matrix from $p_1$ and $p_2$ and merge it with the original term-document matrix in Example 1 (see Figure 5). Figure 6 shows the two-dimensional plot of the merged matrix after SVD.
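[Figure 5: diagram. The term-document matrix (eight terms by documents $d_1$ to $d_5$) is merged with the term-profile matrix (four terms by pseudo documents $p_1$ and $p_2$), giving a matrix of eight terms by seven columns ($d_1$ to $d_5$, $p_1$, $p_2$).]
Figure 5: The term-document matrix in Example 1 merges with the term-pseudo-document matrix generated from a user profile.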
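The merge in Definition 1 can be sketched directly: the rows of the result are the union of the two term sets and the columns are the documents of both matrices. This is a minimal illustration; the column-wise storage, the function name `merge`, and the `weight` parameter (standing in for "increase their weights") are assumptions, and the sample terms are hypothetical.

```python
def merge(terms_a, cols_a, terms_b, cols_b, weight=1.0):
    """Merge two term-document matrices stored column-wise.

    terms_x lists the row labels; cols_x is a list of columns, each a list
    aligned with terms_x. Columns taken from B are scaled by `weight`.
    Returns (terms_c, cols_c) with |T_A union T_B| rows and n_1 + n_2 columns.
    """
    seen = set(terms_a)
    terms_c = list(terms_a) + [t for t in terms_b if t not in seen]

    def realign(terms, column, scale):
        # Spread one column over the merged term list, zero where absent.
        values = dict(zip(terms, column))
        return [scale * values.get(t, 0) for t in terms_c]

    cols_c = [realign(terms_a, c, 1.0) for c in cols_a]
    cols_c += [realign(terms_b, c, weight) for c in cols_b]
    return terms_c, cols_c

# Hypothetical matrices: A holds two documents, B holds one pseudo document.
terms_a, cols_a = ["t1", "t2", "t3"], [[1, 1, 0], [0, 1, 1]]
terms_b, cols_b = ["t2", "t4"], [[1, 1]]
terms_c, cols_c = merge(terms_a, cols_a, terms_b, cols_b, weight=2.0)
print(terms_c)
print(cols_c)
```

Replicating a pseudo document instead of weighting it amounts to passing the same column several times in `cols_b`.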
To examine the effect of merging a user profile, we calculate the Euclidean distance between documents $d_i$ and $d_j$ in each cluster before and after the merging. Table 2 and Figure 7 show the distances and two-dimensional coordinates of the decomposed document vectors before and after merging with the user profile.
[Figure 6: scatter plot titled "2-D Plot of Terms and Documents", Dimension 1 against Dimension 2, both axes running from -1 to 1. It shows documents d1' to d5', terms t1' to t8', and pseudo documents p1 and p2.]
Figure 6: Two-dimensional plot of the term-document matrix after merging with a user profile. The pseudo documents $p_1$ and $p_2$, the documents $d'_i$ ($1 \le i \le 5$), and the terms $t'_j$ ($1 \le j \le 8$) are drawn with different markers.
[Table 2: one row per document pair (four pairs). Columns: document distance, before merging, after merging, change (difference). The document subscripts and numeric entries were lost in extraction.]
Table 2: Document distances before and after merging with a user profile. The change (difference) column shows the distance increasing or decreasing rates.
[Figure 7: scatter plot titled "2-D Plot of Terms and Documents", Dimension 1 against Dimension 2, both axes running from -1 to 1. It shows documents d1 to d5, documents d1' to d5', and pseudo documents p1 and p2.]
Figure 7: The documents in two-dimensional space when merging with a profile. The pseudo documents $p_1$ and $p_2$ are drawn with their own marker. The documents before the merging are represented by $d_i$ ($1 \le i \le 5$) and linked by dashed lines. The documents after the merging are represented by $d'_i$ ($1 \le i \le 5$) and linked by solid lines.
In Figure 6, the documents and terms in each cluster are closer to each other than those in Figure 2. The distance changes between documents before and after the merging can be easily seen in Figure 7. In Table 2, the distances between three of the documents all decrease. These changes occur because one pseudo document contains two terms, which increases the correlation between any documents containing either of them. Similarly, the distance between two other documents decreases because a pseudo document contains two of their terms.
From the example above, we can see the distance changes reflecting the correlations between terms and documents in the semantic space. When merging with a user profile, those documents having common terms with pseudo documents move toward each other.
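The before-and-after comparison can be reproduced on a toy matrix. This is a minimal sketch, assuming NumPy; the 4 x 3 matrix, the single pseudo document, and the function names are hypothetical and are not the data of Example 2. The change rate is computed as (after - before) / before, which is one reading of the "change (difference)" column.

```python
import numpy as np

def doc_coords_2d(term_doc):
    """Document coordinates on the two leading SVD dimensions."""
    A = np.asarray(term_doc, dtype=float)
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:2, :].T * s[:2]

def distance_changes(before, after, pairs):
    """Return (i, j, distance_before, distance_after, rate) for each pair."""
    rows = []
    for i, j in pairs:
        d_before = float(np.linalg.norm(before[i] - before[j]))
        d_after = float(np.linalg.norm(after[i] - after[j]))
        rows.append((i, j, d_before, d_after, (d_after - d_before) / d_before))
    return rows

# Hypothetical data: three documents over four terms (rows are terms).
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
# One hypothetical pseudo document containing the first and third terms.
profile = np.array([[1], [0], [1], [0]])

before = doc_coords_2d(A)
# Keep only the rows that belong to the original documents after merging.
after = doc_coords_2d(np.hstack([A, profile]))[: A.shape[1]]
table = distance_changes(before, after, [(0, 1), (0, 2), (1, 2)])
for i, j, d_before, d_after, rate in table:
    print(f"d{i + 1}-d{j + 1}: {d_before:.4f} -> {d_after:.4f} ({rate:+.2%})")
```

In this toy case the pairs that share a term with the pseudo document move slightly closer; how far a given pair moves in general depends on the data and on the weight given to the profile.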
Summary
In Internet resource discovery, a simple ranked list is an inefficient way to present a large volume of data to the users. In this paper, we have proposed a two-dimensional visualization scheme based on Deerwester's latent semantic indexing. Our method displays data according to their cross-similarities and still retains the rankings with respect to the query. We believe integrating two-dimensional visualization with traditional information retrieval techniques, such as query reformation and relevance feedback, can greatly improve users' searching process.
References
J. C. Gower and P. G. N. Digby, "Expressing complex relationships in two dimensions," in Interpreting Multivariate Data (V. Barnett, ed.), John Wiley & Sons, Inc.
G. Cybenko, Y. Wu, A. Khrabrov, and R. Gray, "Information agents as organizers," in CIKM Workshop on Intelligent Information Agents, December.
T. Kohonen, "The self-organizing map," Proceedings of the IEEE Transactions on Computers.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, September.
G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication," Communications of the ACM, November.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
K. Obraczka, P. B. Danzig, and S.-H. Li, "Internet resource discovery services," Computer, September.
A. Emtage and P. Deutsch, "Archie: An electronic directory service for the Internet," in Proceedings of the Winter USENIX Conference.
B. Kahle and A. Medlar, "An information system for corporate users: Wide Area Information Servers," ConneXions: The Interoperability Report.
T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann, "World-Wide Web: The information universe," Electronic Networking: Research, Applications and Policy.
M. McCahill, "The Internet Gopher protocol: A distributed server information system," ConneXions: The Interoperability Report, July.
P. B. Danzig, S.-H. Li, and K. Obraczka, "Distributed indexing of autonomous Internet services," Computing Systems.
C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz, "Harvest: A scalable, customizable discovery and access system," Technical Report CU-CS, University of Colorado.
O. A. McBryan, "GENVL and WWWW: Tools for taming the Web," in Proceedings of the First International World-Wide Web Conference, May.
S. Foster, "About the Veronica service," November. Electronic bulletin board posting on the comp.infosystems.gopher newsgroup.
Description
Shih-Hao Li and Peter B. Danzig. "Two-dimensional visualization for internet resource discovery." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 609 (1995).
Asset Metadata
Creator
Danzig, Peter B.
(author),
Li, Shih-Hao
(author)
Core Title
USC Computer Science Technical Reports, no. 609 (1995)
Alternative Title
Two-dimensional visualization for internet resource discovery (
title
)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
10 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16270542
Identifier
95-609 Two-Dimensional Visualization for Internet Resource Discovery (filename)
Legacy Identifier
usc-cstr-95-609
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Description
Archive of computer science technical reports published by the USC Department of Computer Science from 1991 - 2017.
Coverage Temporal
1991/2017