Searching Information Servers Based on Customized Profiles
Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
Email: {shli, danzig}@cs.usc.edu
We investigate the effect of using customized profiles to help search for relevant servers in the
Internet. Our experiments demonstrate that the use of customized profiles with the latent semantic
indexing (LSI) technique can improve the performance of Internet searching.
Introduction
When searching for information in a retrieval system, people use different terms to describe their
information needs. The retrieval system searches through its database and returns documents
indexed with matching terms. Since a concept can be represented by a variety of terms, users may
fail to obtain the information they require. This is called the vocabulary problem.

The vocabulary problem occurs not only in traditional information retrieval but also in
Internet resource discovery, where users seek relevant information servers to which they can submit
their queries. Previously, we proposed using Latent Semantic Indexing (LSI) to ameliorate the
vocabulary problem in Internet search. Here we expand the idea by integrating a customized profile
with LSI to assist the search. We demonstrate that customized profiles can help a retrieval
system understand a user's terminology better and thus improve its performance.
Background
Originally, LSI was developed to address the vocabulary problem in Salton's Vector Space Model
(VSM), where documents and queries are represented as vectors of term frequencies or weights. It
assumes that some underlying semantic structure exists in the pattern of term usage across documents.
To capture this information, LSI applies Singular Value Decomposition (SVD) to a term-document
matrix representing a database and generates vectors of k (typically [...] to [...]) orthogonal indexing
dimensions, where each dimension represents a linearly independent concept. The decomposed
vectors are used to represent both documents and terms in queries in the same semantic space,
while their values indicate the degrees of association with the k underlying concepts.
A query vector in LSI is the weighted sum of its component term vectors. For example, a
p-term query is represented as the average of the p decomposed term vectors. To determine
relevant documents, the query vector is compared with all document vectors, and those with the
highest cosine coefficient are returned. Notice that a query can hit documents without having
common terms, because the k indexing dimensions indicate the concepts, not the exact terms used.
Hence LSI improves search performance by ameliorating the vocabulary problem.
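The matching procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy term-document matrix, and the scaling convention (term vectors taken from the left singular vectors, document vectors scaled by the singular values) are assumptions chosen for illustration.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Truncated SVD of a term-document matrix (rows = terms, columns = documents)."""
    u, s, vt = np.linalg.svd(np.asarray(term_doc, dtype=float), full_matrices=False)
    term_vecs = u[:, :k]             # one k-dimensional row per term
    doc_vecs = vt[:k, :].T * s[:k]   # one k-dimensional row per document (assumed scaling)
    return term_vecs, doc_vecs

def query_vector(term_vecs, term_ids):
    """Represent a p-term query as the average of its p term vectors."""
    return term_vecs[term_ids].mean(axis=0)

def cosine_scores(query_vec, doc_vecs):
    """Cosine coefficient between the query vector and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    return doc_vecs @ query_vec / np.where(norms > 0, norms, 1.0)

# Hypothetical toy collection: documents 0 and 1 describe the same concept
# with different words; document 2 is about something else.
terms = ["car", "automobile", "engine", "flower", "petal"]
term_doc = np.array([
    [1, 0, 0],   # car
    [0, 1, 0],   # automobile
    [1, 1, 0],   # engine
    [0, 0, 1],   # flower
    [0, 0, 1],   # petal
])
term_vecs, doc_vecs = lsi_fit(term_doc, k=2)
scores = cosine_scores(query_vector(term_vecs, [terms.index("car")]), doc_vecs)
```

In this toy case the query "car" scores document 1 about as highly as document 0, although document 1 contains only "automobile" and "engine", while document 2 scores near zero.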
A profile, or user profile, is a collection of data specified by users to reflect their interests. It
can be used as a filter to select new documents or information that match users' interests, or used
to augment the query to improve retrieval effectiveness. Foltz and Dumais used
LSI to filter new incoming documents based on user profiles. They compared new documents
against users' word and document profiles and ranked them based on their similarities to the
profile. For the word profile, users indicate words or phrases of interest; each is represented as
a separate vector and compared with new documents using the standard vector and LSI vector
methods. Similarly, each document in the document profile is expressed as a vector and compared
to all new documents using the same matching methods. In their experiment, they found that LSI
matching with a document profile has the best performance.
Earlier we proposed Two-Level LSI in the Internet environment, where a directory of
services records the descriptions of information servers using LSI. A user sends a query to the
directory of services, which determines and ranks the information servers relevant to the user's
request. The user employs the rankings when selecting the most relevant information servers to
query directly. Here we investigate the use of customized profiles in two-level LSI. In this research,
a profile provides background information to the query. It could be a set of documents reflecting
a user's interests or a discipline's taxonomy representing specialized knowledge. To distinguish
these two types of profile, we call the former a user profile and the latter a taxonomy in this paper.
Below we compare the effect of merging a taxonomy at the directory of services against expanding
queries with a user profile at the client site.
Experiment
In earlier work we showed that two-level LSI can outperform VSM in estimating server rankings in the
Internet environment. Here we focus on comparing the use of a user profile and a taxonomy with
LSI. We generate three server rankings using the original LSI, LSI with user profile, and LSI with
taxonomy. In this experiment, a user profile is used to expand a user's query before it is sent
to the directory of services, while a taxonomy is merged with server descriptions at the directory
of services. Figure 1 shows the three processes. We use the standard CACM and MED document
collections, for which queries and relevance judgments are available. We compute the rankings
estimated by the three methods and calculate their rank-order coefficient and accumulated recall.
Methodology
We combine the documents from both the CACM and MED collections and divide them into nine
subcollections, each representing a server's database. Notice that these documents may use the
same terms for totally different meanings because they belong to two different disciplines, computer
science and medicine. A severe vocabulary problem could exist in such an environment.

Documents in each database are indexed with terms occurring in the title and abstract but
not on a stop list of common words. While queries are written in natural language, terms in a
query are used only if they do not appear on the same stop list and if they appear in at least one
document. All indexed terms are stored in their original forms without stemming. Table 1 gives
the additional characteristics of our experiment.
Figure 1: The three ranking processes: the original LSI, LSI with user profile, and LSI with
taxonomy. (Diagram labels: query, ranking, Directory of Services, server descriptions, LSI match,
Server 1, Server 2, ..., Server N.)

In LSI ranking, we apply the single-link clustering algorithm to construct server descriptions.
We cluster documents when their similarity is greater than a predefined threshold. Each of
the remaining documents forms a cluster of its own. Each cluster is represented by the mean vector
of its component document vectors, and the server description is the set of its cluster vectors. The
directory of services collects the server descriptions from all the servers and determines the ranking
using SVD for each user query. In this experiment, server descriptions are decomposed into vectors
of [...] dimensions, as suggested in Deerwester's LSI experiments. The ranking is based on the
cosine similarity between server descriptions and the user query.

In LSI with user profile ranking, we select half of the relevant documents of a query to
construct the user profile for that query. Since those documents have been judged relevant by the
user, they can reflect the interests of the user. Before sending a query to the directory of services,
we expand it by adding the profile vector, which is the centroid of all the document vectors in
the profile. The directory of services applies the typical LSI algorithm to rank servers for the remaining
half of the relevant documents.
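The two constructions above, server descriptions built by single-link clustering and query expansion by a profile centroid, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, similarity is taken to be the cosine, single-link clustering at a fixed threshold is computed as the connected components of the "similarity above threshold" graph, and the profile centroid is added to the query with equal weight, since the text gives no weighting.

```python
import numpy as np

def single_link_clusters(doc_vecs, threshold):
    """Single-link clustering at a fixed threshold: documents end up in the same
    cluster when a chain of pairwise cosine similarities above `threshold` joins them."""
    x = np.asarray(doc_vecs, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms > 0, norms, 1.0)
    sim = unit @ unit.T
    parent = list(range(len(x)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(x)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())  # unlinked documents form singleton clusters

def server_description(doc_vecs, threshold):
    """One mean vector per cluster; the set of these vectors describes the server."""
    x = np.asarray(doc_vecs, dtype=float)
    return np.array([x[c].mean(axis=0) for c in single_link_clusters(x, threshold)])

def expand_query(query_vec, profile_doc_vecs):
    """Add the centroid of the profile documents to the query (equal weight assumed)."""
    centroid = np.asarray(profile_doc_vecs, dtype=float).mean(axis=0)
    return np.asarray(query_vec, dtype=float) + centroid
```

The sketch does not depend on the vector space used: the same functions apply to raw term vectors or to decomposed LSI vectors.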
In LSI with taxonomy ranking, we generate pseudo-documents from the ACM taxonomy,
which contains a listing of computer science classification schemes. We then merge these
pseudo-documents with the server descriptions in the directory of services before applying the LSI algorithm.
We postulate that adding the taxonomy as pseudo-documents may reinforce the computer science
interpretation of the terms in the CACM documents. Therefore, it can help increase the likelihood that
computer science rather than medical documents are returned from the new superset collection.

Table 1: The characteristics of the test collection.
  Number of documents
  Number of queries
  Number of indexing terms
  Mean number of terms per document
  Mean number of terms per query
Below we use two methods to evaluate the rankings estimated by the above three approaches.
Our criterion is to give high ranks to the servers that contain the most relevant documents.
Rank-Order Correlation
To verify the estimated rankings, we generate a standard ranking, denoted as STD, by sorting
servers based on their number of relevant documents, excluding those used in the user profile. We
calculate the Spearman rank-order correlation coefficient $r_s$ to measure the closeness of STD
and the estimated ranking. The $r_s$ ranges between $-1$ and $1$. If two rankings are identical,
$r_s = 1$. If one ranking is the reverse of the other, $r_s = -1$. The larger the $r_s$, the closer
the rankings. The $r_s$ coefficient allows us to determine which of the above methods generates a
ranking closest to that of STD.

To compare the rankings generated using the original LSI (denoted as LSI), LSI with user
profile (denoted as LSIPRO), and LSI with taxonomy (denoted as LSITAX), we calculate their $r_s$
against STD for each query. Among the [...] samples, $r_s(\mathrm{LSI}, \mathrm{STD})$ is larger than,
equal to, and less than $r_s(\mathrm{LSITAX}, \mathrm{STD})$ for [...], [...], and [...] times, respectively.
This indicates that, when using [...] indexing dimensions, LSI with taxonomy generates a ranking closer
to STD than LSI without it for [...] out of [...] times, whereas the latter has the closer order for only
[...] out of [...] times. Similarly, $r_s(\mathrm{LSITAX}, \mathrm{STD})$ is larger than, equal to, and
less than $r_s(\mathrm{LSIPRO}, \mathrm{STD})$ for [...], [...], and [...] times, respectively. Therefore,
LSI with user profile generates closer rankings than LSI with taxonomy.

To measure the confidence that LSI with user profile outperforms the other methods, we
calculate the confidence interval for proportion, defined as follows.

Sample proportion:
$$p = \frac{\max(n_1, n_2)}{n}$$

Confidence interval for proportion:
$$p \mp z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$

where $n_1$ is the number of times one method is better than the other and $n_2$ is the number of times
it is worse. The $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a unit normal variate. For a
confidence level of [...], $z_{1-\alpha/2} = $ [...]. If the confidence interval does not include 0.5,
we can say with [...] confidence that one method is superior to the other. For
$r_s(\mathrm{LSIPRO}, \mathrm{STD})$ versus $r_s(\mathrm{LSITAX}, \mathrm{STD})$, the confidence
interval is ([...], [...]). Because it does not include 0.5, we can say with [...] confidence that LSI
with user profile is superior to LSI with taxonomy. Similarly, the confidence interval for
$r_s(\mathrm{LSITAX}, \mathrm{STD})$ versus $r_s(\mathrm{LSI}, \mathrm{STD})$ is ([...], [...]).
Therefore, LSI with taxonomy is superior to LSI with [...] confidence.
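As an illustration of the confidence interval for proportion, the following Python sketch computes the interval from the win and loss counts. It rests on assumptions that the text does not state: the function name is hypothetical, the sample size is taken as n = n1 + n2 (ties excluded), and the counts in the usage note are invented for the example rather than taken from the experiment.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(n1, n2, confidence):
    """Two-sided confidence interval for the proportion p = max(n1, n2) / n.

    n1: number of queries where one method is better than the other
    n2: number of queries where it is worse
    Assumption: n = n1 + n2, i.e. ties are left out of the sample.
    """
    n = n1 + n2
    if n == 0:
        raise ValueError("at least one non-tied comparison is required")
    p = max(n1, n2) / n
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # (1 - alpha/2)-quantile of N(0, 1)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

def is_superior(n1, n2, confidence):
    """True when the interval lies entirely above 0.5 (p is at least 0.5 by construction)."""
    low, _ = proportion_confidence_interval(n1, n2, confidence)
    return low > 0.5
```

For example, with hypothetical counts of 30 wins and 10 losses at a 0.90 confidence level, the interval lies above 0.5, so the better method would be called superior; with 10 wins and 10 losses it would not.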
From the two ranking comparisons, we conclude that LSI with either user profile or taxonomy
gives a better ranking than LSI without it. Additionally, LSI with user profile performs better than
LSI with taxonomy. The reason could be that the user profile is query-specific, changing from
query to query, while the taxonomy acts as a generic profile for all queries. Therefore, LSI with
user profile gives more accurate results.
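The per-query comparison used in this subsection can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and tied values are given their average rank (the text does not say how ties in STD were handled), with the coefficient computed as the Pearson correlation of the ranks.

```python
import math

def average_ranks(values):
    """Rank values in descending order (1 = largest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def spearman(x_values, y_values):
    """Spearman rank-order correlation between two per-server score lists."""
    rx, ry = average_ranks(x_values), average_ranks(y_values)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("a ranking in which every server is tied has no defined correlation")
    return cov / math.sqrt(vx * vy)

def tally(coeffs_a, coeffs_b):
    """Count queries where method A's coefficient is larger than, equal to, or less than B's."""
    larger = sum(a > b for a, b in zip(coeffs_a, coeffs_b))
    equal = sum(a == b for a, b in zip(coeffs_a, coeffs_b))
    smaller = sum(a < b for a, b in zip(coeffs_a, coeffs_b))
    return larger, equal, smaller
```

Here `x_values` would hold the number of relevant documents per server (the STD ranking) and `y_values` the similarity scores a method assigns to the same servers; `tally` then summarizes the per-query coefficients of two methods.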
Accumulated Recall
To measure the performance of using estimated server rankings, we calculate the accumulated
recall for the top n out of total N servers in the ranking. Let $rel_i$ be the set of relevant documents
and $retr_i$ the set of retrieved documents for a given query on server i. We define:

Document Recall, denoted as $R_d(n)$, is the ratio of the number of relevant documents retrieved
in the top n servers over the number of relevant documents in all servers:

$$R_d(n) = \frac{\sum_{i=1}^{n} |rel_i \cap retr_i|}{\sum_{i=1}^{N} |rel_i|}$$

Server Recall, denoted as $R_s(n)$, is the ratio of the number of the top n servers having relevant
documents over the total number of servers having relevant documents:

$$R_s(n) = \frac{|\{server_i \mid rel_i \neq \emptyset,\ i \le n\}|}{|\{server_i \mid rel_i \neq \emptyset,\ i \le N\}|}$$

Because the returned documents are determined by the query processing engine in each server, we
assume for simplicity that all relevant documents are returned. Table 2 shows the average document
and server recalls as a function of the number of servers for [...] queries retrieved on the test collection.
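The two recall measures can be computed as in the Python sketch below. It adopts the simplification stated above, that every relevant document on a selected server is returned, so the number of relevant documents retrieved from a server equals its number of relevant documents. The function names and the server identifiers in the usage note are hypothetical.

```python
def document_recall(ranking, relevant_counts, n):
    """Share of all relevant documents that reside on the top-n ranked servers.

    ranking: server identifiers ordered from most to least relevant
    relevant_counts: mapping from server identifier to its number of relevant documents
    """
    total = sum(relevant_counts.values())
    if total == 0:
        raise ValueError("the query has no relevant documents on any server")
    found = sum(relevant_counts[server] for server in ranking[:n])
    return found / total

def server_recall(ranking, relevant_counts, n):
    """Share of the servers holding relevant documents that appear in the top n."""
    holders = {server for server, count in relevant_counts.items() if count > 0}
    if not holders:
        raise ValueError("no server holds a relevant document for this query")
    hits = sum(1 for server in ranking[:n] if server in holders)
    return hits / len(holders)
```

For instance, with an estimated ranking `["s3", "s1", "s2", "s4"]` and relevant-document counts `{"s1": 2, "s2": 0, "s3": 6, "s4": 2}`, querying only the top-ranked server already covers 6 of the 10 relevant documents but only one of the three servers that hold any.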
In Table 2, LSI with user profile has the highest document recall except when the number of
servers n is [...] and [...], and the original LSI has the lowest value except when n = [...] and [...].
This means that when retrieving the top [...] servers in the ranking estimated by LSI with user profile,
we can get more relevant documents than with LSI or LSI with taxonomy. If we retrieve the servers
ranked by LSI only, we will obtain fewer relevant documents most of the time. This is consistent
with LSI's lower rank-order correlation coefficient. The average orders of the nine document recalls
for LSI, LSIPRO, and LSITAX are [...], [...], and [...], respectively. Clearly, LSI with user
profile performs best among the three methods.

For server recall, both LSIPRO and LSITAX get first place [...] out of [...] times. The average
orders for LSI, LSIPRO, and LSITAX are [...], [...], and [...], respectively. Thus LSI with user
profile performs slightly better than LSI with taxonomy, while both of them are much better than the
original LSI. Therefore, users can get more relevant servers using the ranking order estimated by
LSI with either user profile or taxonomy than without it.
Table 2: The average document recall $R_d(n)$ and server recall $R_s(n)$ as a function of the number
of servers n for [...] queries retrieved on the test collection. The numbers in parentheses indicate
the order among the three methods for a given n.

We proposed using Deerwester's latent semantic indexing with customized profiles to search and
rank information servers in the Internet. We conducted experiments on standard document collections
and compared the performance of LSI, LSI with user profile, and LSI with taxonomy using the
rank-order coefficient and accumulated recall. The results show that LSI with user profile performs best,
LSI with taxonomy second, and the original LSI third in estimating and ranking relevant servers
for user queries.
A customized profile provides background information to the query. In practice, novice users
can use LSI with taxonomy to find initial documents in a specific field, then construct their own
profile for further searching. Users having a variety of interests can use a different profile for each
query to get higher recall. As the number of servers on the Internet grows rapidly, we believe
this technique can ameliorate the vocabulary problem and improve users' searching process.
References

George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, November.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, September.

Shih-Hao Li and Peter B. Danzig. Vocabulary problem in Internet resource discovery. In Proceedings of the Second International Workshop on Next Generation Information Technologies and Systems, Naharia, Israel, June. Available from [...].

Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company.

Gerard Salton. Automatic Information Organization and Retrieval. McGraw-Hill Book Company.

K. H. Packer and D. Soergel. The importance of SDI for current awareness in fields with severe scatter of information. Journal of the American Society for Information Science.

Shoshana Loeb. Architecting personalized delivery of multimedia information. Communications of the ACM, December.

Peter W. Foltz and Susan T. Dumais. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, December.

H. Grzelak and K. Kowalski. Automatic construction of information queries. Information Processing and Management.

Robert R. Korfhage. Query enhancement by user profiles. In Proceedings of the Third Joint BCS and ACM Symposium.

Ellen M. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management.

Jean E. Sammet and Anthony Ralston. The new computing review classification system: final version. Communications of the ACM, January.

Maurice Kendall and Jean D. Gibbons. Rank Correlation Methods. Edward Arnold, London, fifth edition.

Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., New York.
