USC Computer Science Technical Reports, no. 636 (1996)
Searching Information Servers Based on Customized Profiles
Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
Email: {shli, danzig}@cs.usc.edu
Abstract
We investigate the effect of using customized profiles to help search for relevant servers on the Internet. Our experiments demonstrate that the use of customized profiles with the latent semantic indexing (LSI) technique can improve the performance of Internet searching.
Introduction
When searching for information in a retrieval system, people use different terms to describe their information needs. The retrieval system searches through its database and returns documents indexed with matching terms. Since a concept can be represented by a variety of terms, users may fail to obtain the information they require. This is called the vocabulary problem.

The vocabulary problem occurs not only in traditional information retrieval but also in Internet resource discovery, where users seek relevant information servers to which they can submit their queries. Previously, we proposed using Latent Semantic Indexing (LSI) to ameliorate the vocabulary problem in Internet search. Here we expand the idea by integrating a customized profile with LSI to assist the search. We demonstrate that customized profiles can help a retrieval system understand a user's terminology better and thus improve performance.
Background
Originally, LSI was developed to address the vocabulary problem in Salton's Vector Space Model (VSM), where documents and queries are represented as vectors of term frequencies or weights. It assumes that some underlying semantic structure exists in the pattern of term usage across documents. To capture this information, LSI applies Singular Value Decomposition (SVD) to a term-document matrix representing a database and generates vectors of k (typically [...] to [...]) orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors are used to represent both documents and terms in queries in the same semantic space, while their values indicate the degrees of association with the k underlying concepts.

A query vector in LSI is the weighted sum of its component term vectors. For example, a p-term query is represented as the average of the p decomposed term vectors. To determine relevant documents, the query vector is compared with all document vectors, and those with the highest cosine coefficient are returned. Notice that a query can hit documents without having common terms, because the k indexing dimensions indicate the concepts, not the exact terms used. Hence LSI improves search performance by ameliorating the vocabulary problem.
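As a minimal sketch of the matching step described above, the following Python/NumPy code decomposes a toy term-document matrix with SVD, forms a query vector as the average of its term vectors, and scores documents by cosine. The function names (`lsi_decompose`, `query_vector`, `cosine_scores`), the toy vocabulary, and the choice to scale both term and document vectors by the singular values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def lsi_decompose(term_doc, k):
    """Truncated SVD of a term-document matrix (rows = terms, columns = documents)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = u[:, :k] * s[:k]       # one k-dimensional vector per term (assumed scaling)
    doc_vecs = vt[:k, :].T * s[:k]     # one k-dimensional vector per document (assumed scaling)
    return term_vecs, doc_vecs

def query_vector(term_vecs, term_ids):
    """Average of the decomposed term vectors of the query terms."""
    return term_vecs[term_ids].mean(axis=0)

def cosine_scores(query_vec, doc_vecs):
    """Cosine coefficient between the query vector and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    norms[norms == 0] = 1.0            # avoid division by zero for empty vectors
    return doc_vecs @ query_vec / norms

# Hypothetical toy collection: two "car" documents and two "flower" documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
term_doc = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

term_vecs, doc_vecs = lsi_decompose(term_doc, k=2)
q = query_vector(term_vecs, [terms.index("automobile")])
scores = cosine_scores(q, doc_vecs)
ranking = np.argsort(-scores)          # document indices, best match first
```

In this toy matrix the query term "automobile" never occurs in the first document, yet that document scores close to 1 because both load on the same latent dimension, while the two flower documents score close to 0.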
A profile, or user profile, is a collection of data specified by users to reflect their interests. It can be used as a filter to select new documents or information that match users' interests, or it can be used to augment the query to improve retrieval effectiveness. Foltz and Dumais used LSI to filter new incoming documents based on user profiles. They compared new documents against users' word and document profiles and ranked them based on their similarities to the profile. For the word profile, users indicate words or phrases of interest; each is represented as a separate vector and compared with new documents using the standard vector and LSI vector methods. Similarly, each document in the document profile is expressed as a vector and compared to all new documents using the same matching methods. In their experiment, they found that LSI matching with a document profile has the best performance.
Earlier, we proposed Two-Level LSI in the Internet environment, where a directory of services records the descriptions of information servers using LSI. A user sends a query to the directory of services, which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the most relevant information servers to query directly. Here we investigate the use of customized profiles in two-level LSI. In this research, a profile provides background information to the query. It could be a set of documents reflecting a user's interests or a discipline's taxonomy representing specialized knowledge. To distinguish these two types of profile, we call the former the user profile and the latter the taxonomy in this paper. Below we compare the effect of merging a taxonomy at the directory of services against expanding queries with a user profile at the client site.
Experiment
In previous work we showed that two-level LSI can outperform VSM in estimating server rankings in the Internet environment. Here we focus on comparing the use of a user profile and a taxonomy with LSI. We generate three server rankings using the original LSI, LSI with user profile, and LSI with taxonomy. In this experiment, a user profile is used to expand a user's query before it is sent to the directory of services, while a taxonomy is merged with server descriptions at the directory of services. Figure 1 shows the three processes. We use the standard CACM and MED document collections, for which queries and relevance judgments are available. We compute the rankings estimated by the three methods and calculate their rank-order coefficient and accumulated recall.
Methodology
We combine the documents from both the CACM and MED collections and divide them into nine subcollections, each representing a server's database. Notice that these documents may use the same terms with totally different meanings because they belong to two different disciplines, computer science and medicine. A severe vocabulary problem could exist in such an environment.

Documents in each database are indexed with terms occurring in the title and abstract but not on a stop list of common words. While queries are written in natural language, terms in a query are used only if they do not appear on the same stop list and if they appear in at least one document. All indexed terms are stored in their original forms without stemming. Table 1 gives the additional characteristics of our experiment.
[Figure 1 diagram: Client (query, user profile); Directory of Services (server descriptions, ACM taxonomy, LSI match); Server 1, Server 2, ..., Server N; server rankings #1, #2, #3.]
Figure 1: The three ranking processes: the original LSI, LSI with user profile, and LSI with taxonomy.

In LSI ranking, we apply the single-link clustering algorithm to construct server descriptions. We cluster documents when their similarity is greater than a predefined threshold. Each of the remaining documents forms a cluster of its own. Each cluster is represented by the mean vector of its component document vectors, and the server description is the set of its cluster vectors. The directory of services collects the server descriptions from all the servers and determines the ranking using SVD for each user query. In this experiment, server descriptions are decomposed into vectors of [...] dimensions, as suggested in Deerwester's LSI experiments. The ranking is based on the cosine similarity between server descriptions and the user query.

In LSI-with-user-profile ranking, we select half of the relevant documents of a query to construct the user profile for that query. Since those documents have been judged relevant by the user, they can reflect the interests of the user. Before sending a query to the directory of services, we expand it by adding the profile vector, which is the centroid of all the document vectors in the profile. The directory of services applies the typical LSI algorithm to rank servers for the remaining half of the relevant documents.
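The query expansion step above can be sketched as follows: the profile vector is the centroid of the profile's document vectors, and it is added to the query vector before matching. The function names and the two-dimensional toy vectors are hypothetical and chosen only for illustration; the paper does not state whether the profile vector is weighted before being added, so plain addition is assumed here.

```python
import numpy as np

def profile_vector(profile_doc_vecs):
    """Centroid (mean) of the document vectors that make up the user profile."""
    return np.asarray(profile_doc_vecs, dtype=float).mean(axis=0)

def expand_query(query_vec, profile_doc_vecs):
    """Expand a query by adding the profile centroid (unweighted sum is an assumption)."""
    return np.asarray(query_vec, dtype=float) + profile_vector(profile_doc_vecs)

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical 2-dimensional LSI vectors, for illustration only.
query = [1.0, 0.0]
profile_docs = [[0.0, 2.0], [0.0, 4.0]]   # documents the user judged relevant
target = [0.0, 1.0]                        # a server description near the profile

expanded = expand_query(query, profile_docs)
before = cosine(query, target)
after = cosine(expanded, target)
```

In this toy case the unexpanded query is orthogonal to the target description, while the expanded query is pulled toward it, which is the intended effect of supplying background information through the profile.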
Table 1: The characteristics of the test collection.
Number of documents: [...]
Number of queries: [...]
Number of indexing terms: [...]
Mean number of terms per document: [...]
Mean number of terms per query: [...]

In LSI-with-taxonomy ranking, we generate pseudo-documents from the ACM taxonomy, which contains a listing of computer science classification schemes. We then merge these pseudo-documents with the server descriptions in the directory of services before applying the LSI algorithm. We postulate that adding the taxonomy as pseudo-documents may reinforce the computer science interpretation of the terms in the CACM documents. Therefore, it can help increase the likelihood that computer science rather than medical documents are returned from the new superset collection.
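The server descriptions and the taxonomy merge described in the preceding paragraphs can be sketched as follows. Single-link clustering with a similarity threshold is implemented here as the connected components of the "similarity above threshold" graph, each cluster is summarized by its mean vector, and taxonomy pseudo-documents are appended as extra rows before LSI would be applied. The function names, the use of cosine as the clustering similarity, the threshold value, and the toy vectors are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

def single_link_clusters(doc_vecs, threshold):
    """Group documents linked, directly or transitively, by similarity > threshold."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    n = len(doc_vecs)
    norms = np.linalg.norm(doc_vecs, axis=1)
    norms[norms == 0] = 1.0
    unit = doc_vecs / norms[:, None]
    sim = unit @ unit.T                      # pairwise cosine similarity (assumed measure)
    parent = list(range(n))                  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)    # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())           # unlinked documents stay singletons

def server_description(doc_vecs, threshold):
    """One mean vector per cluster; the set of these vectors describes the server."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    groups = single_link_clusters(doc_vecs, threshold)
    return np.array([doc_vecs[members].mean(axis=0) for members in groups])

def merge_taxonomy(server_descriptions, taxonomy_pseudo_docs):
    """Stack all servers' cluster vectors and the taxonomy pseudo-documents row-wise."""
    rows = list(server_descriptions) + [np.asarray(taxonomy_pseudo_docs, dtype=float)]
    return np.vstack(rows)

# Hypothetical term-space vectors for one server and one taxonomy pseudo-document.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
clusters = single_link_clusters(docs, threshold=0.8)
description = server_description(docs, threshold=0.8)
merged = merge_taxonomy([description], [[0.5, 0.5]])
```

The merged matrix is what a directory of services would then decompose with SVD; the first two toy documents fall into one cluster and the third remains a cluster of its own.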
Below we use two methods to evaluate the rankings estimated by the above three approaches. Our criterion is to give high ranks to the servers that contain the most relevant documents.
Rank-Order Correlation
To verify the estimated rankings, we generate a standard ranking, denoted as STD, by sorting servers based on their number of relevant documents (excluding those used in the user profile). We calculate the Spearman rank-order correlation coefficient $r_s$ to measure the closeness of STD and the estimated ranking. The coefficient $r_s$ ranges between $-1$ and $1$. If two rankings are identical, $r_s = 1$. If one ranking is the reverse of the other, $r_s = -1$. The larger the $r_s$, the closer the rankings. The $r_s$ coefficient allows us to determine which of the above methods generates a ranking closest to that of STD.

To compare the rankings generated using the original LSI (denoted as LSI), LSI with user profile (denoted as LSIPRO), and LSI with taxonomy (denoted as LSITAX), we calculate their $r_s$ against STD for each query. Among the [...] samples, $r_s(\mathrm{LSI}, \mathrm{STD})$ is larger than, equal to, and less than $r_s(\mathrm{LSITAX}, \mathrm{STD})$ for [...], [...], and [...] times, respectively. This indicates that, when using [...] indexing dimensions, LSI with taxonomy generates a ranking closer to STD than LSI without it for [...] out of [...] times, whereas the latter has the closer order for only [...] out of [...] times. Similarly, $r_s(\mathrm{LSITAX}, \mathrm{STD})$ is larger than, equal to, and less than $r_s(\mathrm{LSIPRO}, \mathrm{STD})$ for [...], [...], and [...] times, respectively. Therefore, LSI with user profile generates closer rankings than LSI with taxonomy.

To measure the confidence that LSI with user profile outperforms the other methods, we calculate the confidence interval for proportion, defined as follows:

$$\text{Sample proportion: } p = \frac{\max(n_1, n_2)}{n_1 + n_2}$$

$$\text{Confidence interval for proportion: } p \pm z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n_1 + n_2}}$$

where $n_1$ is the number of times one method is better than the other and $n_2$ is the number of times it is worse. The $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a unit normal variate. For a [...] confidence level, $z_{1-\alpha/2} = $ [...]. If the confidence interval does not include 0.5, we can say with [...] confidence that one method is superior to the other. For $r_s(\mathrm{LSIPRO}, \mathrm{STD})$ and $r_s(\mathrm{LSITAX}, \mathrm{STD})$, the confidence interval is [...]. Because it does not include 0.5, we can say with [...] confidence that LSI with user profile is superior to LSI with taxonomy. Similarly, the confidence interval for $r_s(\mathrm{LSITAX}, \mathrm{STD})$ and $r_s(\mathrm{LSI}, \mathrm{STD})$ is [...]. Therefore, LSI with taxonomy is superior to LSI with [...] confidence.
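The confidence interval for a proportion used above can be computed with a few lines of Python. This is a sketch under stated assumptions: the function name, the default confidence level of 0.95, and the win/loss counts in the example are hypothetical and are not the values from the experiment.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(n1, n2, confidence=0.95):
    """Confidence interval for the proportion p = max(n1, n2) / (n1 + n2).

    n1: number of queries on which one method ranks closer to STD than the other.
    n2: number of queries on which it ranks farther (ties are left out).
    """
    n = n1 + n2
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    p = max(n1, n2) / n
    # (1 - alpha/2)-quantile of the unit normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def is_superior(n1, n2, confidence=0.95):
    """True if the interval excludes 0.5, i.e. the better method wins significantly."""
    low, _ = proportion_confidence_interval(n1, n2, confidence)
    return low > 0.5

# Hypothetical counts: one method is better on 30 queries and worse on 10.
low, high = proportion_confidence_interval(30, 10)
```

Because the sample proportion is defined with a maximum, it is never below 0.5, so checking the lower bound against 0.5 is enough to decide whether the interval includes 0.5.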
From the two ranking comparisons, we conclude that LSI with either user profile or taxonomy gives a better ranking than LSI without it. Additionally, LSI with user profile performs better than LSI with taxonomy. The reason could be that the user profile is query-specific, changing from query to query, while the taxonomy acts as a generic profile for all queries. Therefore, LSI with user profile gives more accurate results.
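The rank-order comparison underlying this subsection can be sketched in Python as follows. The code uses the textbook formula $r_s = 1 - 6\sum d_i^2 / (n(n^2-1))$, which assumes rankings without ties; how ties among servers with equal numbers of relevant documents were handled is not stated in the paper. The function names and the example scores are hypothetical.

```python
def ranks_from_scores(scores):
    """Convert per-server scores into ranks (1 = highest score); ties are not handled."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, server in enumerate(order, start=1):
        ranks[server] = position
    return ranks

def spearman_rank_correlation(rank_a, rank_b):
    """Spearman r_s for two tie-free rankings of the same servers."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have the same length of at least 2")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical example with nine servers, as in the experiment's setup.
std = list(range(1, 10))                 # standard ranking STD
reversed_ranking = std[::-1]             # the exact reverse of STD
r_same = spearman_rank_correlation(std, std)
r_reverse = spearman_rank_correlation(std, reversed_ranking)
estimated = ranks_from_scores([0.2, 0.9, 0.5])
```

An estimated ranking would be produced by `ranks_from_scores` from the cosine scores of the servers and then compared against STD for each query.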
Accumulated Recall
To measure the performance of using estimated server rankings, we calculate the accumulated recall for the top n out of a total of N servers in the ranking. Let $\mathrm{rel}_i$ be the set of relevant documents and $\mathrm{retr}_i$ the set of retrieved documents for a given query on server i. We define:

Document Recall, denoted as $R_d(n)$, is the ratio of the number of relevant documents retrieved in the top n servers over the number of relevant documents in all servers:

$$R_d(n) = \frac{\sum_{i=1}^{n} |\mathrm{rel}_i \cap \mathrm{retr}_i|}{\sum_{i=1}^{N} |\mathrm{rel}_i|}$$

Server Recall, denoted as $R_s(n)$, is the ratio of the number of the top n servers having relevant documents over the total number of servers having relevant documents:

$$R_s(n) = \frac{|\{\text{server } i \mid \mathrm{rel}_i \neq \emptyset,\ 1 \le i \le n\}|}{|\{\text{server } i \mid \mathrm{rel}_i \neq \emptyset,\ 1 \le i \le N\}|}$$

Because the returned documents are determined by the query processing engine in each server, we assume for simplicity that all relevant documents are returned. Table 2 shows the average document and server recalls as a function of the number of servers for queries retrieved on the test collection.
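The two recall measures defined above can be sketched in Python as follows. The sketch adopts the same simplifying assumption as the text, namely that every relevant document on a queried server is returned, so $|\mathrm{rel}_i \cap \mathrm{retr}_i| = |\mathrm{rel}_i|$. The function names and the example counts are hypothetical.

```python
def document_recall(relevant_per_server, ranking, n):
    """R_d(n): share of all relevant documents held by the top-n ranked servers.

    relevant_per_server[i] is the number of relevant documents on server i;
    ranking lists server indices from best to worst.
    """
    total = sum(relevant_per_server)
    if total == 0:
        return 0.0
    return sum(relevant_per_server[i] for i in ranking[:n]) / total

def server_recall(relevant_per_server, ranking, n):
    """R_s(n): share of the servers holding relevant documents found in the top n."""
    with_relevant = sum(1 for count in relevant_per_server if count > 0)
    if with_relevant == 0:
        return 0.0
    hits = sum(1 for i in ranking[:n] if relevant_per_server[i] > 0)
    return hits / with_relevant

# Hypothetical query: four servers holding 5, 0, 3 and 2 relevant documents.
relevant = [5, 0, 3, 2]
estimated_ranking = [0, 2, 1, 3]         # server indices, best first
rd_top2 = document_recall(relevant, estimated_ranking, 2)
rs_top2 = server_recall(relevant, estimated_ranking, 2)
```

Both measures reach 1 once n covers every server that holds relevant documents, so rankings are distinguished by how quickly the values rise for small n.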
In Table 2, LSI with user profile has the highest document recall except when the number of servers n is [...] and [...], and the original LSI has the lowest value except when n = [...] and [...]. This means that when retrieving the top [...] and [...] servers in the ranking estimated by LSI with user profile, we can get more relevant documents than with LSI or LSI with taxonomy. If we retrieve the servers ranked by LSI only, we will obtain fewer relevant documents most of the time. This is consistent with LSI's lower rank-order correlation coefficient. The average orders of the nine document recalls for LSI, LSIPRO, and LSITAX are [...], [...], and [...], respectively. Clearly, LSI with user profile performs best among the three methods.

For server recall, both LSIPRO and LSITAX get first place [...] out of [...] times. The average orders for LSI, LSIPRO, and LSITAX are [...], [...], and [...], respectively. Thus, LSI with user profile performs slightly better than LSI with taxonomy, while both of them are much better than the original LSI. Therefore, users can get more relevant servers using the ranking order estimated by LSI with either user profile or taxonomy than without it.
Table 2: The average document recall $R_d$ and server recall $R_s$ as a function of the number of servers n for queries retrieved on the test collection. The numbers in parentheses indicate the order among the three methods for a given n.
Columns: Recall, n, LSI, LSIPRO, LSITAX. Rows: $R_d$, $R_s$. Values: [...]

Conclusions
We proposed using Deerwester's latent semantic indexing with customized profiles to search and rank information servers on the Internet. We conducted experiments on standard document collections and compared the performance of LSI, LSI with user profile, and LSI with taxonomy using the rank-order coefficient and accumulated recall. The results show that LSI with user profile performs best, LSI with taxonomy second, and the original LSI third in estimating and ranking relevant servers for user queries.
A customized profile provides background information to the query. In practice, novice users can use LSI with taxonomy to find initial documents in a specific field and then construct their own profile for further searching. Users having a variety of interests can use a different profile for each query to get higher recall. As the number of servers on the Internet grows rapidly, we believe this technique can ameliorate the vocabulary problem and improve users' searching process.
References
George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, November.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, September.
Shih-Hao Li and Peter B. Danzig. Vocabulary problem in Internet resource discovery. In Proceedings of the Second International Workshop on Next Generation Information Technologies and Systems, Naharia, Israel, June. Available from ftp://catarina.usc.edu/shli/ngits.ps.gz
Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
Gerard Salton. Automatic Information Organization and Retrieval. McGraw-Hill Book Company.
K. H. Packer and D. Soergel. The importance of SDI for current awareness in fields with severe scatter of information. Journal of the American Society for Information Science.
Shoshana Loeb. Architecting personalized delivery of multimedia information. Communications of the ACM, December.
Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, December.
H. Grzelak and K. Kowalski. Automatic construction of information queries. Information Processing and Management.
Robert R. Korfhage. Query enhancement by user profiles. In Proceedings of the Third Joint BCS and ACM Symposium.
Ellen M. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management.
Jean E. Sammet and Anthony Ralston. The new computing review classification system, final version. Communications of the ACM, January.
Maurice Kendall and Jean D. Gibbons. Rank Correlation Methods. Edward Arnold, London, fifth edition.
Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., New York.
Description
Shih-Hao Li and Peter B. Danzig. "Searching information servers based on customized profiles." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 636 (1996).
Asset Metadata
Creator
Danzig, Peter B. (author), Li, Shih-Hao (author)
Core Title
USC Computer Science Technical Reports, no. 636 (1996)
Alternative Title
Searching information servers based on customized profiles (title)
Publisher
Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA (publisher)
Tag
OAI-PMH Harvest
Format
7 pages (extent), technical reports (aat)
Language
English
Unique identifier
UC16271000
Identifier
96-636 Searching Information Servers Based on Customized Profiles (filename)
Legacy Identifier
usc-cstr-96-636
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf (batch), Computer Science Technical Report Archive (collection), University of Southern California. Department of Computer Science. Technical Reports (series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Coverage Temporal
1991/2017