USC Computer Science Technical Reports, no. 736 (2000)
Feature Matrices: A Model for Efficient and Anonymous Mining of Web Navigations
Cyrus Shahabi, Farnoush Banaei-Kashani, Jabed Faruque, Adil Faisal
Email: [shahabi, banaeika, faruque, faisal]@usc.edu
Integrated Media Systems Center and Computer Science Department
University of Southern California, Los Angeles, CA
Abstract
We propose a new model, which is a generalization of the Vector model, for efficient mining of web navigations. Recent growth of startup companies in the area of web navigation mining is a strong indication of the effectiveness of this data in understanding user behaviors. However, the approach taken by industry towards web navigation is offline, and hence intrusive, static, and unable to differentiate between the various roles a single user might play. This is because online collection, preparation, and analysis of the entire data is cumbersome, if not impossible, due to the large volume of clickstreams generated within a popular website. Towards this end, several researchers have studied probabilistic (e.g., Markov) and distance-based (e.g., Vector) models to summarize or compress the collected data and maintain only the important features for analysis. The proposed models are either not flexible enough to trade off accuracy for performance based on the application requirements, or not adaptable in real time due to the high complexity of updating the model. In this paper, we demonstrate that our proposed model, the FM model, is flexible, tunable, adaptable, and can be used for both anonymous and online analysis. We propose several similarity measures for comparison among FM models of navigation paths or clusters of paths. We conducted several experiments to evaluate and verify our FM model. The results show that one of our similarity measures, PPED, is the best for comparing partial navigation paths with cluster representatives. This partial comparison is essential for anonymous and real-time web mining.
Introduction
Understanding and modeling user behaviors by analyzing users' interactions with digital environments, such as web sites, is a significant topic that has resulted in vast recent commercial interest. Commercial products such as Personify™, Verbind™, WebSideStory™, BlueMartini™, WatchWise™, WebTrends™, and EnterpriseReporter™, and acquired companies such as Matchlogic™, Trivida™, Andromedia™, Rightpoint™, and DataSage™, are all witnesses of such interest. In addition, several researchers in various industrial and academic research centers are focusing on this topic. A nice survey is provided by Mobasher et al. Meaningful interpretation of the users' digital behavior is necessary in the disparate fields of e-commerce, distance education, online entertainment, and management: for capturing individual and collective profiles of customers, learners, and employees; for targeting customized/personalized commercials or information; and for evaluating the information architecture of the site by detecting the bottlenecks in the information space.
Understanding user behavior from web navigation involves the following three steps: (1) collection/preparation of web navigation (a.k.a. web-usage or clickstream) data, (2) knowledge discovery from web navigation data, and (3) personalization/recommendation based on discovered patterns. One typical but naive approach taken by industry is to first build a static profile for each user based on his/her past navigations (during the first and second steps). Subsequently, whenever the user revisits the site, contents are recommended or personalized based on the user profile. There are several drawbacks with this approach. First, for this approach to work, the user identity should be detected using some intrusive technique, such as through cookies. Second, the profile cannot distinguish between different roles the user might play during various navigations, for example, buying rock music CDs as a gift for a friend versus purchasing classical music CDs for oneself. Third, if multiple users share the same computer/client, the profile becomes a mixture of several, possibly conflicting, tastes. For example, some of us have experienced amazon.com suggesting to us those books that are favorites of our significant others. Finally, the profile becomes static and cannot capture changes in the user's taste.
To address these problems, Zarkesh et al. proposed statistical approaches where first a profile is built for a collection of users with similar navigation paths. Then, recommendations/personalizations are mapped to these profiles through some learning mechanism (e.g., by an expert or via utilization of training data). Finally, the navigation of a new user is compared to different profiles, and recommendations/personalizations are based on either the closest profile (hard membership) or some weighted aggregate of several profiles (soft memberships). Therefore, the reaction to a user is not based on his/her previous actions, but on his/her recent activities. As a result, the system can deal with different users sharing the same computer, various roles the user plays, or variable tastes of the user. Moreover, the system does not rely on cookies and has no knowledge of an individual user. Although this approach (henceforth denoted as anonymous web mining) is preferred, it imposes new challenges on the design of all three steps of web mining.
In this paper, we propose a novel model, Feature Matrices (FM), that not only addresses the new design challenges of anonymous mining but also can be used for better conventional web mining. For the remainder of this section, we first describe the new design challenges for anonymous mining. Then, we explain why conventional web mining models such as the Vector model and the Markov model cannot meet the new requirements. Finally, we briefly discuss how the FM model addresses the new design challenges. It is important to note that although we motivate and discuss the FM model in the context of anonymous web mining, it is as much applicable to conventional web mining through offline analysis of weblogs.
Challenges in Anonymous Web Mining
Most of the work on the first step of web mining concerns offline utilization of server-log files. The problem with the server log is its inaccuracy (e.g., missing client cache hits), and the problem with offline processing is its reliance on cookies to match offline data with online users. For anonymous web mining, every single action of a new user should be accurately tracked, or else the system cannot react to the user in a timely manner (i.e., before she/he leaves the site). Shahabi et al. strive to address the first problem by introducing a client-side Java agent, which collects accurate data. The second problem, however, is the result of the large volume of navigation data generated per site. For example, Yahoo™ has millions of visitors every day, generating gigabytes of clickstream data per hour. Even if this data can be logged accurately online, it cannot be analyzed in real time unless some aspects of the data are dropped. Hence, in order for a model to be used in real time, it should be flexible enough to capture fewer or more features of the logged data in real time, depending on the volume of navigation.
During the second step, user clusters must be generated out of several navigation paths, either in real time or offline. A model is needed to conceptualize each user navigation path and/or a cluster of user paths. The model should be tunable so that one can trade performance for accuracy or vice versa. Also, in order for the model to be adaptive (i.e., to get updated as the trend changes), it should be possible to update it incrementally as a new path or batch of paths becomes available.
Finally, the third step deals with personalizing the content or recommending new contents according to the user's preferences by comparing a new user's partial navigation path to the model. This process should be performed efficiently enough to complete before the user leaves the site.
The Vector model and the Markov model are the two conventional web mining models most appropriate for anonymous web mining. The Vector model was originally proposed for web mining and then extended by other researchers. It can handle a large volume of navigation data by discarding some aspects of the data and capturing only a minimum amount of important features. Hence, although it allows data analysis in real time, it is not flexible enough to capture more features (resulting in more accurate analysis) if needed by an application. For example, the Vector model cannot capture the sequence of web navigation. Therefore, some attempts have been made to extend the Vector model to support sequences, at the expense of violating the orthogonality of the basis of the extended vector space.
The Markov model has been used by Heckerman et al. for web mining. Different orders of the Markov model can capture various subsequence lengths, and hence it is a flexible approach to capturing the data features. However, it is a static model that cannot efficiently adapt to short-term changes in user behavior. This is due to the high complexity of updating the model in real time. Moreover, the Markov model is unable to capture some useful aspects of the navigation data, such as its temporal aspects.
The FM model is in fact a generalization of the Vector model that is flexible enough to strike a compromise between accuracy and efficiency, given the specific requirements of an application. Meanwhile, the FM model can be updated both offline and incrementally online, so that it can adapt to both short-term and long-term changes in user behaviors.
The FM model benefits from two types of flexibility. First, it can be tuned to capture both spatial and temporal features of web navigation. In addition, it is an open model, so that new features can be incorporated as necessary by an application domain. For example, instead of (or in addition to) keeping track of webpages within a navigation, it can conceptualize data features on the products available on an e-commerce site. Second, it has the concept of order, similar to that of the Markov model, which can be increased to capture more data about the sequence or any other feature of navigations.
Another contribution of this paper is the proposal of several similarity measures for comparing the FM models of paths and/or clusters of paths. With anonymous web mining, it is critical to compare a partial navigation path of a new user to the cluster representatives both accurately and efficiently. In this paper, we study different similarity measures and compare their accuracy and performance. Our experiments demonstrate that one of the similarity measures, PPED, is superior to the others. Therefore, we employed PPED in a new dynamic clustering algorithm to make the system adaptable to short-term changes in user behaviors. Although the dynamic clustering can be executed in real time and incrementally, its accuracy is only marginally worse than that of K-Means. The concept of dynamic clustering and the similarity measures are also applicable to the Vector model. A thorough experimental study is also included, which verifies and evaluates the FM model and its similarity measures. In addition, the experiments compare the FM model with the Vector model. The results indicate that the 2nd order of the FM model ($FM^2$) is more accurate than the Vector model.
The remainder of this paper is organized as follows. We first formally define the FM model. We then explain our various similarity measures and discuss our dynamic clustering technique. Next, we report the results of our experiments. Finally, we conclude the paper and propose our future directions.
The Feature Matrices Model
Here, we present a novel model to represent both sessions and clusters in the context of website navigation analysis. We denote this model as the Feature Matrices (FM) model. With FM, features are indicators of the information embedded in sessions. In order to quantify the features, we consider the universal set of segments in a concept space as the basis for the session space. Thus, features of a session are modeled and captured in terms of features of its building segments. This conceptualization is analogous to the definition of a basis for a vector space, i.e., a set of linearly independent vectors that construct the vector space. Therefore, the FM model allows analyzing sessions by analyzing features of their corresponding segments.
For the remainder of this section, we explain, analyze, and formalize the FM model. First, we define our terminology. Next, the basics of the FM model are explained: the features captured as indicators of the embedded information, and the main data structure used to present these features. Subsequently, we discuss how to extract the session FM model and the cluster FM model separately. Finally, we analyze the complexity and completeness of the model and formalize its uniqueness.
Terminology
Website: A website can be modeled as a finite set of static and/or dynamic webpages.
Concept Space (Concept): Each website, depending on its application, provides information about one or more concepts. For example, amazon.com includes concepts such as Books, Music, Video, etc. The webpages within a website can be categorized based on the concepts to which they belong. A concept space, or simply concept, in a website is defined as the set of webpages that contain information about a certain concept. Note that the contents of a webpage may address more than one concept; therefore, the concept spaces of a website are not necessarily disjoint sets.
Path: A path P in a website is a finite or infinite sequence of pages
$$x_1 \to x_2 \to \cdots \to x_i \to \cdots \to x_s$$
where $x_i$ is a page belonging to the website. Pages visited in a path are not necessarily distinct.
Path Feature (Feature): Any temporal or spatial attribute of a path is termed a path feature, or feature. The number of times a page has been accessed, the time spent on viewing a page, and the spatial position of a page in the path are examples of features. We elaborate more on the notion of feature in the next section.
Session: The path traversed by a user while navigating a concept space is considered a session. Whenever a navigation leaves a concept space, the session is considered to be terminated, even though the user has not exited the website but only entered a page in a different concept. In such a case, a new session is initiated, which will include the remainder of the user navigation path. This is the reason we differentiate between the session and the path notions. While this assumption does not impact the generality of our proposed model, it makes the comparison between sessions even more efficacious, since they belong to the same concept. Therefore, hereafter we assume all sessions belong to the same concept space. Similar analysis can be applied to sessions in any concept space.
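The session definition above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `split_into_sessions` and the `concept_of` mapping are hypothetical, and the sketch assumes every page belongs to exactly one concept, which is a simplification because the paper allows concept spaces to overlap.

```python
from typing import Dict, List


def split_into_sessions(path: List[str], concept_of: Dict[str, str]) -> List[List[str]]:
    """Cut a navigation path into sessions, one per stay in a concept space.

    Assumption (not from the paper): each page maps to a single concept.
    """
    sessions: List[List[str]] = []
    current: List[str] = []
    current_concept = None
    for page in path:
        concept = concept_of[page]
        # Leaving the current concept space terminates the session.
        if current and concept != current_concept:
            sessions.append(current)
            current = []
        current.append(page)
        current_concept = concept
    if current:
        sessions.append(current)
    return sessions


# Hypothetical pages and concepts, for illustration only.
concept_of = {"b1": "Books", "b2": "Books", "m1": "Music"}
sessions = split_into_sessions(["b1", "b2", "m1", "b1"], concept_of)
```

Returning to the Books concept after visiting Music starts a new session rather than resuming the old one, which matches the rule that a session ends as soon as the navigation leaves its concept space.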
Session Space: The set of all possible sessions in a concept space is termed the session space.
Path Segment (Segment): A path segment, or simply segment, E is an n-tuple of pages $(x_1, x_2, \ldots, x_i, \ldots, x_n)$. We denote the value n as the order of the segment E. Note that there is a one-to-one correspondence between tuples and sequences of pages, i.e.,
$$(x_1, x_2, \ldots, x_i, \ldots, x_n) \equiv x_1 \to x_2 \to \cdots \to x_i \to \cdots \to x_n$$
We use the tuple representation because it simplifies our discussion. Any subsequence of pages in a path can be considered a segment of the path. For example, a path of five page visits contains several segments, such as 1st-order segments (single pages), 2nd-order segments (page pairs), and 4th-order segments (four-page tuples). We exploit the notion of segment as the building block of sessions in order to model their features.
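As a small sketch of the segment notion, the helper below lists the order-n segments of a path. The name `segments` is hypothetical, and the code assumes that a segment is a run of consecutive pages (a sliding window over the path), which is one reading of "subsequence" in the definition above.

```python
from typing import Hashable, List, Sequence, Tuple


def segments(path: Sequence[Hashable], order: int) -> List[Tuple[Hashable, ...]]:
    """Return every order-`order` segment of `path`, in order of appearance.

    Assumption: a segment is a window of consecutive pages.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    return [tuple(path[i:i + order]) for i in range(len(path) - order + 1)]


path = ["a", "b", "c", "b", "c"]  # hypothetical page identifiers
first_order = segments(path, 1)
second_order = segments(path, 2)
fourth_order = segments(path, 4)
```

A path of length L yields L - n + 1 segments of order n under this reading, so the same pair can appear more than once, as `("b", "c")` does here.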
Universal Set of Segments: The universal set of order-n segments is the set of all possible n-tuple segments in the concept space C. Hereafter, since we focus on analysis within a single concept, we drop the subscript C from the notation.
Cluster: A cluster is defined as a set of similar sessions. The similarity is measured quantitatively based on an appropriate similarity measure. A number of similarity measures are defined in the section on similarity measures.
Basics
Features
We characterize sessions through the following features:
Hit (H): Hit is a spatial feature that reflects which pages are visited during a session. The FM model captures H by recording the number of times each segment is encountered in a traversal of the session. The reader may consider H as a generalization of the conventional hit-count notion, since hit-count counts the number of hits per page, which is a segment of order 1.
Sequence (S): Sequence is an approximation for the relative location of pages traversed in a session. As compared to H, it is a spatial feature that reflects the location of visits instead of the frequency of visits. With the FM model, S is captured by recording the relative location of each segment in the sequence of segments that construct the session. If a segment has been repeatedly visited in a session, S is approximated by aggregating the relative positions of all occurrences. Thus, S does not capture the exact sequence of segments. Exact sequences can be captured through higher orders of H.
View Time (T): View time captures the time spent while traversing a session. The FM model captures the total time spent on each segment while traversing the session. As opposed to H and S, T is a temporal feature.
An agent-based navigation profiler has been proposed that is able to capture these features accurately. Using such a tracking system, features of each session are captured in terms of features of the segments within the session.
We may apply various orders of universal sets as the basis to capture different features. In our example, we have used the order-1 universal set for T and the order-2 universal set for H and S, unless otherwise stated. Therefore, we extract the feature T for single-page segments $(x_i)$, and features H and S for ordered page-pair segments $(x_i, x_j)$. In the analysis of the model, we explain how using higher-order bases results in a more complete characterization of the session by the FM model, at the expense of higher complexity.
The FM model is an open model. It is capable of capturing any other meaningful session features in addition to those mentioned above. The same data structure can be employed to capture the new features. This is another option with which the completeness of the FM model can be enhanced. However, our experiments demonstrate that the combination of our proposed features is comprehensive enough to detect the similarities and dissimilarities among sessions appropriately.
Data Structure
Suppose the order-n universal set is the basis to capture a feature F for session U. We deploy an n-dimensional feature matrix $M^F_{r^n}$ to record the F feature values for all order-n segments of U. The n-dimensional matrix $M_{r^n}$ is a generalization of the 2-dimensional square matrix $M_{r \times r}$. Each dimension of $M_{r^n}$ has r rows, where r is the cardinality of the concept space. For example, $M_{r^3}$ is a cube with r rows in each of its dimensions, and is a feature matrix for an r-page concept space with the order-3 universal set as the basis. Dimensions of the matrix are assumed to be in a predefined order. The value of F for each order-n segment $(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ is recorded in element $a_{i_1 i_2 \ldots i_n}$ of $M^F_{r^n}$. To simplify the understanding of this structure, the reader may assume that rows in all dimensions of the matrix are indexed by a unique order of the concept space pages; then the feature value for the order-n segment $(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ is located at the intersection of row $x_{i_1}$ on the 1st dimension, row $x_{i_2}$ on the 2nd dimension, ..., and row $x_{i_n}$ on the nth dimension of the feature matrix. Note that $M_{r^n}$ covers all order-n segment members of the universal set; that is, it has $r^n$ elements. On the other hand, the number of segments existing in a session is usually in the order of tens. Therefore, $M_{r^n}$ is usually a sparse matrix. The elements for which there is no corresponding segment in the session are set to zero.
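To make the sparsity argument concrete, the sketch below compares the number of elements in a dense order-n feature matrix with the largest number of nonzero entries a single session can produce. The function names and the sample values of r, n, and the session length are illustrative assumptions, not figures from the paper, and segments are assumed to be windows of consecutive pages.

```python
def dense_elements(r: int, n: int) -> int:
    """Number of elements in an n-dimensional matrix with r rows per dimension."""
    return r ** n


def max_nonzero_elements(session_length: int, n: int) -> int:
    """Upper bound on nonzero entries: a session of L pages has L - n + 1 segments."""
    return max(session_length - n + 1, 0)


# Illustrative values only (assumptions, not taken from the paper).
r, n, session_length = 100, 2, 30
dense = dense_elements(r, n)
nonzero = max_nonzero_elements(session_length, n)
fill_ratio = nonzero / dense
```

With such a low fill ratio, a practical implementation would likely store only the nonzero entries, for example in a dictionary keyed by the segment tuple.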
To map a session to its equivalent FM model, the appropriate feature matrices are extracted for the features of the session. The entire set of feature matrices generated for a session constitutes its FM model:
$$U^{fm} = \left\{ M^{F_1}_{r^{n_1}},\ M^{F_2}_{r^{n_2}},\ \ldots,\ M^{F_m}_{r^{n_m}} \right\}$$
If $n = \max(n_1, n_2, \ldots, n_m)$, then $U^{fm}$ is an order-n FM model.
In subsequent sections, we explain how values of different features are derived for each segment from the original session, and how they are aggregated to construct the cluster model.
Session Model
In the previous section, we defined the features that characterize a session and the data structure that records them, but we did not explain how values of different features are extracted from a session to form the feature matrices of its FM model. Recall that we record features of a session in terms of features of its segments. Thus, it suffices to explain how to extract the various features for a sample segment E.
For Hit (H), we count the number of times E has occurred in the session. Segments may partially overlap. As long as there is at least one non-overlapping page in two segments, the segments are assumed to be distinct. For example, a session of five page visits has a total of four order-2 segments, and the same page pair may occur more than once among them.
For Sequence (S), we find the relative positions of every occurrence of E and record their arithmetic mean as the value of S for E. To find the relative positions of segments, we number them sequentially in order of appearance in the session. For example, a segment that occurs at positions 2 and 3 has an S value of 2.5.
For View Time (T), we add up the time spent on each occurrence of E in the session.
Example: This example illustrates how to extract the FM model of a sample session. Assume session U, a path of nine page visits, is captured in concept space $Z = \{x_1, x_2, x_3, x_4, x_5\}$. Suppose that the user has spent the same amount of time on each page while navigating the website. To simplify, we assume the order-1 universal set as the basis of T and the order-2 universal set as the basis for both H and S:
$$\{(x_1), (x_2), (x_3), (x_4), (x_5)\} \qquad \{(x_i, x_j) \mid 1 \le i, j \le 5\}$$
In this paper, we use path features H, S, and T. Thus, here m equals 3. To find the relative positions of the segments within the session, we first number them in order of appearance. The H and S values of each order-2 segment and the T value of each order-1 segment are then computed as described above, and
$$U^{fm} = \{ M^H, M^S, M^T \}$$
is the final FM model extracted for U.
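The extraction rules for H, S, and T can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, segments are taken to be windows of consecutive pages, positions are numbered from 1, and the sample session and view times are invented for the sketch rather than recovered from the paper's example. Feature matrices are stored sparsely as dictionaries keyed by segment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[str, ...]


def hit_and_sequence(pages: List[str], order: int) -> Tuple[Dict[Segment, int], Dict[Segment, float]]:
    """Return sparse H (occurrence counts) and S (mean relative position)."""
    positions: Dict[Segment, List[int]] = defaultdict(list)
    for pos in range(len(pages) - order + 1):
        # Segments are numbered 1, 2, 3, ... in order of appearance (assumption).
        positions[tuple(pages[pos:pos + order])].append(pos + 1)
    hits = {seg: len(p) for seg, p in positions.items()}
    sequence = {seg: sum(p) / len(p) for seg, p in positions.items()}
    return hits, sequence


def view_time(pages: List[str], seconds: List[float]) -> Dict[Segment, float]:
    """Return sparse T on order-1 segments: total time spent per page."""
    total: Dict[Segment, float] = defaultdict(float)
    for page, spent in zip(pages, seconds):
        total[(page,)] += spent
    return dict(total)


# Hypothetical session: five visits, 10 seconds each.
pages = ["x1", "x2", "x2", "x2", "x1"]
seconds = [10.0] * len(pages)
H, S = hit_and_sequence(pages, order=2)
T = view_time(pages, seconds)
session_fm = {"H": H, "S": S, "T": T}
```

In this sample the pair `("x2", "x2")` occurs at positions 2 and 3, so its hit count is 2 and its sequence value is the mean position 2.5, while the two overlapping occurrences are still counted separately.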
Cluster Model
Clustering is an inseparable part of any system created to analyze website navigations. User sessions must be grouped into a set of clusters based on the similarity of their features. To cluster sessions, since the FM model is a distance-based model, we need a similarity measure to quantify the similarity between sessions, and a clustering algorithm to construct the clusters. Moreover, we need a scalable model for the cluster. Nowadays, any popular website is visited by a huge number of users. For example, Yahoo™ has millions of visitors per day. At such a scale, we may employ any similarity measure and clustering algorithm to group the sessions (or, better to say, session models) into clusters, but this is not sufficient. If a cluster is naively modeled as a set of session models, any analysis on a cluster will be dependent on the number of sessions in the cluster, which is not a scalable solution. Therefore, we need a cluster model that can be used for analysis independent of the number of cluster members. In this section, we describe our cluster model. Subsequently, in later sections, we introduce several applicable similarity measures for the purpose of clustering, and finally we propose a variation of conventional clustering algorithms to make them adaptable in real time to varying behaviors.
With our approach to modeling a cluster, we aggregate the feature values of all clustered sessions into corresponding feature values of a virtual session. This virtual session is considered a representative of all the sessions in the cluster, or equally, the model of the cluster. Consequently, the complexity of any analysis on a cluster becomes independent of the cluster cardinality.
Suppose we have mapped all the sessions belonging to a cluster into their equivalent session models. In order to aggregate the features of the sessions into the corresponding features of the cluster model, it is sufficient to aggregate features for each basis segment. Assume we denote the value of a feature F for any segment E in the basis by F(E). We apply a simple aggregation function, namely arithmetic averaging, to the F(E) values in all sessions of a cluster to find the aggregated value of F(E) for the cluster model. Thus, if $M^F$ is the feature matrix for feature F of the cluster model, and $M^F_i$ is the feature matrix for feature F of the ith session in the cluster, each element of $M^F$ is computed by aggregating the corresponding elements of all $M^F_i$ matrices. This procedure is repeated for every feature of the FM model. The final result of the aggregation is a set of aggregated feature matrices that constitute the FM model of the cluster:
$$C^{fm} = \left\{ M^{F_1},\ M^{F_2},\ \ldots,\ M^{F_m} \right\}$$
Therefore, the FM model can uniquely model both sessions and clusters.
As mentioned before, the aggregation function we use for all features is the simple arithmetic averaging function. In matrix notation, the aggregated feature matrix for every feature F of the cluster model $C^{fm}$ is computed as follows:
$$M^F = \frac{1}{N} \sum_{i=1}^{N} M^F_i$$
where N is the cardinality of the cluster C. The same aggregation function can be applied incrementally when the cluster model has already been created and we want to update it as soon as a new session $U_j$ joins the cluster:
$$M^F \leftarrow \frac{1}{N+1} \left( N \times M^F + M^F_j \right)$$
This property is termed dynamic clustering. In a later section, we leverage this property to modify the conventional clustering algorithms to become real-time and adaptive.
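The two averaging formulas can be checked against each other with a short sketch. The function names and the sparse dictionary representation are assumptions made for illustration; the arithmetic is the batch mean over N session matrices and the incremental update that folds one new session into an existing cluster matrix.

```python
from typing import Dict, List, Tuple

Matrix = Dict[Tuple[str, ...], float]  # sparse feature matrix keyed by segment


def batch_average(matrices: List[Matrix]) -> Matrix:
    """Element-wise arithmetic mean of all session matrices in a cluster."""
    count = len(matrices)
    keys = set().union(*matrices)
    return {k: sum(m.get(k, 0.0) for m in matrices) / count for k in keys}


def incremental_update(cluster: Matrix, size: int, new: Matrix) -> Matrix:
    """Fold one new session matrix into a cluster matrix built from `size` sessions."""
    keys = set(cluster) | set(new)
    return {
        k: (size * cluster.get(k, 0.0) + new.get(k, 0.0)) / (size + 1)
        for k in keys
    }


# Hypothetical H matrices of three sessions.
m1 = {("a", "b"): 2.0}
m2 = {("a", "b"): 4.0, ("b", "c"): 2.0}
m3 = {("b", "c"): 1.0}

cluster = batch_average([m1, m2])              # model of a two-session cluster
updated = incremental_update(cluster, 2, m3)   # third session joins
recomputed = batch_average([m1, m2, m3])       # same result from scratch
```

The incremental form touches only the cluster matrix and the new session matrix, so its cost does not depend on how many sessions the cluster already holds.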
Example: This example illustrates how to extract the FM model of a simple cluster. Assume $U_1$ and $U_2$ are the only two members of cluster C, and $M^F_1$ and $M^F_2$ are their feature matrices for feature F. The cluster feature matrix is then
$$M^F = \frac{1}{2} \left( M^F_1 + M^F_2 \right)$$
Again, note that since we have defined features H, S, and T, here $C^{fm}$ includes 3 matrices, namely $M^H$, $M^S$, and $M^T$.
Analysis of the Model
Website navigation analysis basically involves three categories of tasks: constructing clusters of sessions (clustering), comparing sessions with clusters, and assigning a session to a cluster. Regardless of the model employed for analysis and the algorithm used for clustering, the complexity of constructing the clusters is dependent on N, which is the number of sessions to be clustered. This is true simply because during clustering each session should be analyzed at least once to detect how it relates to other sessions. Standard hierarchical clustering techniques scale as $O(N^2)$, while more advanced algorithms such as CLARANS, CLUDIS, and BIRCH improve the complexity to $O(kN)$, which is still dependent on N. The FM model does not address this issue; instead, the FM cluster model is defined so that it reduces the time complexity of the other two tasks. If the complexity of comparing a session with a cluster and assigning it to the cluster is independent of the cluster cardinality, user classification and cluster updating can be fulfilled in real time.
The price we have to pay to achieve lower space and time complexity is to sacrifice completeness. If the cluster model is merely the set of member sessions stored in their complete form, then although the model is complete in representing the member sessions, it does not scale. On the other hand, if we aggregate member sessions to construct the cluster model, the model will lose its capability to represent its members with perfect accuracy. The more extensive the aggregation applied, the less complete the cluster model. The FM model is flexible in balancing this tradeoff based on the specific application requirements.
FM Complexity versus the Vector and Markov Models
Let $FM^n$ be an FM model of order n (see the table of term definitions), where $n = \max(n_1, n_2, \ldots, n_m)$. In the worst case, $FM^n$ comprises m n-dimensional matrices $M_{r^n}$, one for each of the model features. Thus, the space cost of $FM^n$ is $O(m r^n)$. The time complexity for user classification is $O(mL)$, and for updating a cluster by assigning a new session to the cluster it is $O(m r^n)$. (We derive these complexity values after describing our similarity measures.) Therefore, the space and time complexity of the $FM^n$ model are both independent of M.
From $O(m r^n)$, complexity increases exponentially with n, which is the order of the FM model. The property below illustrates that as the order n increases, the FM model becomes more complete in describing its corresponding session or cluster. Thus, added complexity is the price for a more accurate model. An appropriate order should be selected based on the accuracy requirements of the specific application.
Prop ert y If p
p
then FM
p is more complete than FM
p Without loss of generalit yw e base our discussion on the H feature The same deduction applies to anyother
feature W e pro veb y induction Supp ose FM
and FM
are used to model the session $U$ into $U^{fm_1}$ and $U^{fm_2}$, respectively. Suppose we want to reconstruct $U$ based on the data captured in its model:

$$U : x_1 \rightarrow x_2 \rightarrow \cdots \rightarrow x_i \rightarrow x_{i+1} \rightarrow \cdots \rightarrow x_L$$

If the first $i$ pages of $U$ are already known, the H matrix of $U^{fm_1}$ suggests $L - i$ choices for $x_{i+1}$, while with $U^{fm_2}$ the number of choices is limited to the sum of the elements located in the single row of the H matrix that is indexed by the page $x_i$. The latter is always less than or equal to $L - i$. This is reasonable because the H matrix in $U^{fm_2}$ not only records the number of hits on pages but also implicitly captures information about their order in the session, at least up to the following page. Since the same deduction holds for the selection of each following page during reconstruction of the session from the first page to the last, the total number of sessions suggested by $U^{fm_2}$ is always less than or equal to the number suggested by $U^{fm_1}$. Therefore, by recording information about segments instead of pages, we achieve a more complete model, though at a higher complexity. Higher-order FM models inductively restrict the number of choices for each next page even further, simply because visiting the order-$p$ segment $(x_{i-p+1}, x_{i-p+2}, \ldots, x_{i-1}, x_i)$ is a special case of visiting the order-$(p-1)$ segment $(x_{i-p+2}, \ldots, x_{i-1}, x_i)$. In the extreme case, when $p = L$, $U^{fm_L}$ uniquely identifies $U$ and the model is absolutely complete.

The other crucial parameter in $O(mr^n)$ is $m$, the number of features captured by the FM model. Features are attributes of the sessions used as the basis for comparison. The relative importance of these attributes in comparing sessions is entirely application dependent. The FM model is an open model in the sense that its structure allows new features to be incorporated as the need arises for different applications. Performing comparisons based on more features results in more accurate clustering, though again at the cost of increased complexity.

Now let us compare the performance of FM with two other conventional models, namely the Vector model and the Markov model. The Vector model, one of the classical models applied in distance-based approaches, can be considered a special case of the FM model. As used in the literature, the Vector model is equivalent to an order-1 FM model with H as the only captured feature. Thus, the Vector model scales as $O(r)$, but, as discussed above, since it is an order-1 FM model it performs poorly in capturing information about sessions. Our experiments illustrate that an FM model with S and H as its features outperforms the Vector model in accuracy (see the section on the performance of the FM model).

The other model, typically employed in probability-based approaches, is the Markov model. Although whether or not web navigation is a Markovian behavior has been the subject of much controversy, the Markov model has demonstrated acceptable performance in the context of website navigation analysis. The transition matrix of an order-$n$ Markov model is extractable from the H feature matrix of the corresponding FM model. Thus, the FM model captures at least as much information as an equivalent Markov model. Both also benefit from the same time complexity of $O(L)$ for dynamic user classification. However, the Markov model cannot be updated in real time, because the complexity of updating a cluster depends on the cardinality of the cluster. Moreover, the Markov model is not an open model in the sense described for FM, because it is defined to capture only order and hit.
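To make the reconstruction argument and the Markov comparison concrete, the following minimal sketch counts the candidate next pages under an order-1 and an order-2 H feature, and then row-normalizes the order-2 H counts into a first-order transition matrix. The function names, the dictionary-based sparse representation, and the four-page session are illustrative assumptions, not part of the original model definition.

```python
from collections import Counter

def hit_feature(session, order):
    """Count occurrences of every order-`order` segment (run of consecutive pages)."""
    return Counter(tuple(session[i:i + order])
                   for i in range(len(session) - order + 1))

def choices_order1(session_length, known):
    """Order-1 H only records hits, so any of the L - i remaining hits may come next."""
    return session_length - known

def choices_order2(h2, last_page):
    """Order-2 H limits the next page to the row indexed by the last known page."""
    return sum(count for (src, _dst), count in h2.items() if src == last_page)

def transition_matrix(h2):
    """Row-normalize the order-2 H counts into first-order transition probabilities."""
    row_totals = Counter()
    for (src, _dst), count in h2.items():
        row_totals[src] += count
    return {(src, dst): count / row_totals[src] for (src, dst), count in h2.items()}

session = ["a", "b", "a", "c"]             # hypothetical session, L = 4
h2 = hit_feature(session, 2)               # segments (a,b), (b,a), (a,c), one hit each
print(choices_order1(len(session), 1))     # 3 candidates after the first page
print(choices_order2(h2, "a"))             # 2 candidates after the first page
print(transition_matrix(h2)[("a", "b")])   # 0.5
```

In this toy session the order-2 feature narrows the candidates for the second page from three to two, which is the "less than or equal to $L - i$" relation described above.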
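Formalization

In this section, we formally prove uniqueness of the FM session and cluster models.

Theorem (sessions). Two identical sessions have identical FM models, i.e., $U_1 = U_2 \Rightarrow U_1^{fm_n} = U_2^{fm_n}$.

Based on the session model definition,

$$U_1^{fm_n} = \{\, M^{1}_{F_i} \mid 1 \le i \le m \,\}, \qquad U_2^{fm_n} = \{\, M^{2}_{F_i} \mid 1 \le i \le m \,\},$$

but if the two sessions $U_1$ and $U_2$ are identical, their feature matrices are correspondingly equal. Hence,

$$U_1 = U_2 \;\Rightarrow\; \forall i \le m:\; M^{1}_{F_i} = M^{2}_{F_i} \;\Rightarrow\; U_1^{fm_n} = U_2^{fm_n}.$$

Theorem (clusters). Two identical clusters have identical FM models, i.e., $C_1 = C_2 \Rightarrow C_1^{fm_n} = C_2^{fm_n}$.

Based on the cluster model definition,

$$C_1^{fm_n} = \{\, M^{1}_{F_i} \mid 1 \le i \le m \,\}, \qquad C_2^{fm_n} = \{\, M^{2}_{F_i} \mid 1 \le i \le m \,\},$$

but if the two clusters $C_1$ and $C_2$ are identical, for each session $U_j$ in $C_1$ there is an identical session $U_k$ in $C_2$, and vice versa:

$$C_1 = \{\, U_j \mid 1 \le j \le M_{C_1} \,\}, \qquad C_2 = \{\, U_k \mid 1 \le k \le M_{C_2} \,\},$$

$$C_1 = C_2 \;\Rightarrow\; \left(\forall U_j \in C_1,\ \exists U_k \in C_2\right),\ \left(\forall U_k \in C_2,\ \exists U_j \in C_1\right),\ M_{C_1} = M_{C_2},$$

hence

$$M^{1}_{F_i} = \sum_{j=1}^{M_{C_1}} M^{j}_{F_i}, \qquad M^{2}_{F_i} = \sum_{k=1}^{M_{C_2}} M^{k}_{F_i} \;\Rightarrow\; M^{1}_{F_i} = M^{2}_{F_i} \;\Rightarrow\; C_1^{fm_n} = C_2^{fm_n}.$$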
Similarity Measures

A similarity measure is a metric that quantifies the notion of similarity. To capture the behaviors of website users, user sessions are grouped into clusters such that each cluster is composed of similar sessions. Similarity is an application-dependent concept, and in a distance-based model such as FM a domain expert should encode a specific definition of similarity into a pseudo-distance metric that allows the similarity among the modeled objects to be evaluated. With the FM model, these distance metrics, termed similarity measures, are used to impose an order of similarity upon user sessions. Sorting user sessions by similarity is the basis for clustering the users.

In earlier work, the authors introduce a similarity measure for navigation analysis that does not satisfy an important precondition: the basis segments used to measure the similarity among sessions must be orthogonal. Here we define several other similarity measures based on the FM model, all of which satisfy this precondition. Before defining these similarity measures, let us explain how they interpret the FM model.

With all the similarity measures discussed here, each feature matrix of FM is treated as a one-dimensional matrix. To illustrate, assume all rows of an $n$-dimensional feature matrix are concatenated in a predetermined order of dimensions and rows. The result is a one-dimensional ordered list of feature values. This ordered list is considered a vector of feature values in $\mathbb{R}^{r^n}$, where $r$ is the cardinality of the concept space. Now suppose we want to measure the quantitative dissimilarity between two sessions $U_1^{fm}$ and $U_2^{fm}$, assuming that the similarity measure in use is an indicator of dissimilarity (an analogous procedure applies when the similarity measure expresses similarity instead). Each session model comprises a series of feature vectors, one for each feature captured by the FM model. For each feature $F_i$, the similarity measure is applied to the two $F_i$ feature vectors of $U_1^{fm}$ and $U_2^{fm}$ to compute their dissimilarity $D_{F_i}$. Since the dissimilarity between $U_1^{fm}$ and $U_2^{fm}$ must be based on all the FM features, the total dissimilarity is computed as the weighted average of the dissimilarities for all features:

$$D_F = \sum_{i=1}^{m} w_i \, D_{F_i}, \qquad \sum_{i=1}^{m} w_i = 1,$$

where $m$ is the number of features in the FM model. $D_F$ can be applied in both hard and soft assignment of sessions to clusters. The weight factor $w_i$ is application dependent and is determined based on the relative importance and efficacy of the features as similarity indicators. In the performance evaluation section, we report on the results of our experiments in finding the compromise set of weight factors for the H and S features.
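The weighted aggregation can be sketched in a few lines. This is a minimal illustration, assuming the per-feature dissimilarities have already been computed; the function name and the sample values are hypothetical. Dividing by the sum of the weights is an added safeguard so that the result is still a weighted average when the supplied weights do not sum to one.

```python
def total_dissimilarity(per_feature, weights):
    """Weighted average of the per-feature dissimilarities D_Fi with weights w_i."""
    total_weight = sum(weights[name] for name in per_feature)
    if not total_weight > 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * d for name, d in per_feature.items())
    return weighted / total_weight

# Hypothetical dissimilarities for the H and S features of two sessions.
d = total_dissimilarity({"H": 2.0, "S": 4.0}, {"H": 0.25, "S": 0.75})
print(d)  # 3.5
```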
In the following sections, we introduce our alternative similarity measures. They are classified based on the type of distance they use to estimate the similarity between feature vectors. These similarity measures are applicable within any clustering algorithm. Throughout this discussion, assume $\vec{A}$ and $\vec{B}$ are feature vectors equivalent to the $n$-dimensional feature matrices $M^{1}_{F}$ and $M^{2}_{F}$, and $a_i$ and $b_i$ are their $i$-th elements, respectively. Vectors are assumed to have $N = r^n$ elements, where $r$ is the cardinality of the concept space.

Footnote: Some similarity measures are defined to be indicators of dissimilarity instead of similarity. For the purpose of clustering, both approaches are applicable.

[Figure: Two similarity measures. (a) VA: $VA(\vec{A}, \vec{B})$ is the cosine of the angle between $\vec{A}$ and $\vec{B}$. (b) PED: $PED(\vec{A}, \vec{B}) = |\vec{A} - \vec{B}|$.]
Angular Distance

Vector Angle (VA)

VA measures the similarity between feature vectors based on their angular distance (see panel (a) of the figure on the two similarity measures):

$$VA(\vec{A}, \vec{B}) = \cos\angle(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}|\,|\vec{B}|} = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2 \, \sum_{i=1}^{N} b_i^2}}$$

where $\angle(\vec{A}, \vec{B})$ is the angle between the two vectors and, for nonnegative feature values $a_i, b_i \ge 0$, $0 \le VA \le 1$.

VA expresses similarity: the greater $VA(\vec{A}, \vec{B})$, the more similar $\vec{A}$ and $\vec{B}$. To obtain an intuition about VA, recall the structure of a feature vector. A feature vector is a list of values of a certain feature F, each value relevant to a segment in the basis universal set. Together, these values convey the status of the corresponding session in terms of the feature F. Since the direction of a feature vector is determined by the feature values it comprises, if VA finds two feature vectors close in direction they must include more or less analogous values. Consequently, the corresponding sessions should be similar.

However, if we look at VA in more detail, we notice that it cannot differentiate between a vector and its scaled variations:

$$\forall k \in \mathbb{R}^{+}: \quad VA(\vec{A}, k\vec{A}) = \cos(0) = 1$$

This characteristic might be undesirable in comparing feature vectors of certain features, such as S. For these feature vectors, the mere fact that $\vec{V}_1 = k\vec{V}_2$ does not necessarily convey similarity between the corresponding sessions of $\vec{V}_1$ and $\vec{V}_2$, whereas VA finds $\vec{V}_1$ and $\vec{V}_2$ identical: $VA(\vec{V}_1, \vec{V}_2) = VA(k\vec{V}_2, \vec{V}_2) = 1$. However, for some other path features, such as H, this characteristic may be meaningful. For example, suppose VA is used to compare the H feature vectors of two sessions, a three-page session $U_1$ and a five-page session $U_2$ [page indices not recoverable from the transcript]. Users navigating these paths may have had similar intentions, because they have repeatedly traversed a certain set of segments. VA detects this similarity by comparing the H vectors $\vec{V}_1$ and $\vec{V}_2$ obtained from $M^{1}_{H}$ and $M^{2}_{H}$ [vector values not recoverable from the transcript].

An order-$n$ FM model, in the worst case, comprises $m$ $n$-dimensional matrices $M_{r^n}$, one for each of the model features (refer to the table of term definitions). If VA is used to compare two sessions according to the total-dissimilarity equation, the time complexity of user classification is $O(mr^n)$.
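A minimal sketch of VA over flattened feature vectors follows, with a check of the scale-invariance property discussed above. The function name, the sample vectors, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math

def vector_angle(a, b):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # assumed convention: an empty session is similar to nothing
    return dot / norm

v1 = [1, 2, 0]            # hypothetical H vector
v2 = [2, 4, 0]            # the same vector scaled by k = 2
print(math.isclose(vector_angle(v1, v2), 1.0))  # True: VA ignores the scale
print(vector_angle([1, 0], [0, 1]))             # 0.0 for orthogonal vectors
```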
Euclidean Distance

In this section, we discuss the similarity measures devised based on the Euclidean distance between two feature vectors. These similarity measures quantify the dissimilarity between feature vectors: the greater their value, the more dissimilar the compared vectors.

Pure Euclidean Distance (PED)

PED simply computes the Euclidean distance between two feature vectors (see panel (b) of the figure on the two similarity measures):

$$PED(\vec{A}, \vec{B}) = |\vec{A} - \vec{B}| = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$$

where $PED \ge 0$. In contrast to VA, PED can differentiate between a vector and its scaled variations:

$$PED(\vec{A}, k\vec{A}) = |1 - k|\,|\vec{A}| \ne 0 \quad \text{if } k \ne 1 \text{ and } \vec{A} \ne \vec{0}$$

However, if PED is used to compare a session with its cluster, the dissimilarity may be overestimated. To illustrate, suppose a user navigates the session $U$ that belongs to cluster $C$. It is not necessarily the case that the user traverses every segment captured by $C^{fm}$. In fact, in most cases the user navigates a path similar to only a subset of the navigation pattern represented by $C^{fm}$, not the entire pattern. In evaluating the similarity between $U^{fm}$ and $C^{fm}$, we should avoid comparing them on the part of the navigation pattern not covered by $U$, or else their dissimilarity will be overestimated. Overestimation of dissimilarity occasionally results in failure to classify a session into the most appropriate cluster. The following example illustrates this problem.

Example (PED). This example demonstrates how the overestimation problem may cause PED to mistarget a session. Suppose clusters $C_1$ and $C_2$ are represented by virtual sessions of six and three pages, respectively, and session $U$ is captured as a three-page path [page indices not recoverable from the transcript]. The objective is to select the cluster that is more similar to $U$. Assume that the similarity criterion is based only on the S feature, with the S feature vectors $\vec{V}^{S}_{C_1}$, $\vec{V}^{S}_{C_2}$, and $\vec{V}^{S}_{U}$ obtained from $M^{S}_{C_1}$, $M^{S}_{C_2}$, and $M^{S}_{U}$ [vector values not recoverable from the transcript]. As observed from the sequences of pages, $U$ itself is a subpath of $C_1$, while it does not have any segment in common with the $C_2$ cluster. However, PED wrongly finds $U$ more similar to $C_2$ than to $C_1$:

$$PED(\vec{V}^{S}_{U}, \vec{V}^{S}_{C_1}) > PED(\vec{V}^{S}_{U}, \vec{V}^{S}_{C_2})$$

In the next section, we provide a solution for the overestimation problem. Finally, note that, according to the total-dissimilarity equation, PED has the same time complexity as VA, namely $O(mr^n)$ (refer to the table of term definitions).
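The overestimation effect can be reproduced with a small sketch. The three vectors below are hypothetical stand-ins, not the values of the original example: `u` covers two segments, `c1` contains those two segments plus five more, and `c2` shares no segment with `u`.

```python
import math

def ped(a, b):
    """Pure Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors over a basis of nine segments.
u  = [1, 1, 0, 0, 0, 0, 0, 0, 0]   # session: two segments
c1 = [1, 1, 1, 1, 1, 1, 1, 0, 0]   # cluster that contains the session's segments
c2 = [0, 0, 0, 0, 0, 0, 0, 1, 1]   # cluster with nothing in common with the session

print(ped(u, c1) > ped(u, c2))     # True: PED prefers the unrelated cluster c2
print(ped([3, 4], [6, 8]))         # 5.0, i.e. |1 - k| * |A| with k = 2 and |A| = 5
```

The five segments of `c1` that the session never visits contribute to the distance, which is enough to push the session toward the unrelated but shorter cluster.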
Projected PED (PPED)

PPED is a variant of PED that alleviates the overestimation problem. Assume $\vec{A}$ and $\vec{B}$ are two feature vectors of the same type, belonging to a session and a cluster model, respectively. Each vector comprises $N$ components. To estimate the dissimilarity between $\vec{A}$ and $\vec{B}$, PPED computes the pure Euclidean distance between $\vec{A}$ and the projection of $\vec{B}$ on those coordinate planes at which $\vec{A}$ has nonzero components:

$$PPED(\vec{A}, \vec{B}) = \sqrt{\sum_{i=1,\ a_i \ne 0}^{N} (a_i - b_i)^2}$$

where $PPED \ge 0$. Note that PPED is not commutative.

Nonzero components of $\vec{A}$ belong to those segments that exist in the session. Zero values, on the other hand, are related to the remainder of the segments in the basis universal set. By contrasting $\vec{A}$ with the projected $\vec{B}$, we compare the session and the cluster based on just the segments that exist in the session, not on the entire basis. Thus, the part of the cluster not covered by the session is excluded from the comparison to avoid overestimation. The impact of this correction is demonstrated in the following example.

Example (PPED). Consider the same scenario described in the PED example. PPED assigns $U$ to the correct cluster, $C_1$:

$$PPED(\vec{V}^{S}_{U}, \vec{V}^{S}_{C_2}) > PPED(\vec{V}^{S}_{U}, \vec{V}^{S}_{C_1})$$
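A minimal sketch of PPED follows, using hypothetical vectors in which the session `u` covers two segments that also appear in cluster `c1` and none that appear in cluster `c2`. The function name and the values are assumptions for illustration only.

```python
import math

def pped(a, b):
    """Projected PED: Euclidean distance restricted to the nonzero components of `a`."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b) if x != 0))

# Hypothetical feature vectors over a basis of nine segments.
u  = [1, 1, 0, 0, 0, 0, 0, 0, 0]   # session
c1 = [1, 1, 1, 1, 1, 1, 1, 0, 0]   # cluster that contains the session's segments
c2 = [0, 0, 0, 0, 0, 0, 0, 1, 1]   # cluster with nothing in common with the session

print(pped(u, c1))                  # 0.0: the session matches c1 on every segment it visits
print(pped(u, c2) > pped(u, c1))    # True: c1 is now the closer cluster
print(pped(c1, u) == pped(u, c1))   # False: PPED is not commutative
```

Because only the components that are nonzero in the first argument are visited, a sparse representation of the session vector would let the loop run over the session's own segments rather than over the full basis.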
Since PPED can compare sessions with different lengths, it is an attractive measure for real-time clustering, where only a portion of a session is available at any given time (see the Dynamic Clustering section). PPED also helps reduce the time complexity of the similarity measurement. According to the total-dissimilarity equation, the time complexity of PPED improves to $O(mL)$ (refer to the table of term definitions). In the performance evaluation section, we report on the superiority of PPED's performance over that of PED and VA.

Dynamic Clustering
As discussed earlier, since the FM model of a cluster is independent of the cluster cardinality, any cluster manipulation with FM has a reasonably low complexity. Leveraging this property, we can apply the FM model in real-time applications.

[Figure: Algorithm for dynamic clustering.]

One benefit of this property is that FM clusters can be updated dynamically and in real time. Note that in most common cluster representations, the complexity of adding a new session to a cluster depends on the cardinality of the cluster. Therefore, in practice, large-scale systems built on such representations are not capable of updating the clusters dynamically. By exploiting dynamic clustering, any navigation analysis system can adapt itself to changes in users' behaviors in real time. New clusters can be generated dynamically, and existing clusters adapt themselves to changes in users' tendencies. Delay-sensitive environments, such as the stock market, are among the applications for which this property is most advantageous. The figure on the dynamic clustering algorithm depicts a simple procedure to perform dynamic clustering when a new session is captured.

Periodic reclustering is the typical approach to updating the clusters. This approach results in high accuracy, but it cannot be performed in real time. According to our experiments comparing the accuracy of dynamic clustering with that of periodic reclustering (see the section on the performance of dynamic clustering), dynamic clustering shows lower accuracy in updating the cluster set. In fact, with dynamic clustering we are trading accuracy for adaptability. Thus, dynamic clustering should not be used instead of classical clustering algorithms; rather, a hybrid solution is preferable. That is, the cluster set should be updated at longer intervals through periodic reclustering to avoid divergence of the cluster set from the trends of real user behaviors. Meanwhile, dynamic clustering can be applied in real time to adapt the clusters and the cluster set to short-term behavioral changes.
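Since the algorithm figure did not survive in this transcript, the following is only a plausible sketch of such a procedure, not the authors' algorithm. It assumes that a cluster is summarized by the running mean of its sessions' feature vectors, that vectors are stored sparsely as dictionaries keyed by segment, and that a hypothetical distance `threshold` decides when a new cluster is opened. The cost of one update depends on the number of segments involved, not on how many sessions the cluster already holds.

```python
import math

def pped(session_vec, cluster_vec):
    """PPED over sparse vectors: only segments present in the session are compared."""
    return math.sqrt(sum((value - cluster_vec.get(seg, 0.0)) ** 2
                         for seg, value in session_vec.items() if value != 0))

class Cluster:
    """Cluster summary: running mean of member vectors plus a member count."""
    def __init__(self, session_vec):
        self.count = 1
        self.mean = dict(session_vec)

    def add(self, session_vec):
        # Incremental mean update; never revisits earlier member sessions.
        self.count += 1
        for seg in set(self.mean) | set(session_vec):
            old = self.mean.get(seg, 0.0)
            self.mean[seg] = old + (session_vec.get(seg, 0.0) - old) / self.count

def assign(session_vec, clusters, threshold):
    """Join the nearest cluster if it is close enough, otherwise open a new one."""
    best = min(clusters, key=lambda c: pped(session_vec, c.mean), default=None)
    if best is not None and threshold >= pped(session_vec, best.mean):
        best.add(session_vec)
        return best
    new_cluster = Cluster(session_vec)
    clusters.append(new_cluster)
    return new_cluster

clusters = []
first = assign({"ab": 1, "bc": 1}, clusters, threshold=0.5)   # opens a cluster
second = assign({"ab": 1}, clusters, threshold=0.5)           # joins the same cluster
third = assign({"xy": 1}, clusters, threshold=0.5)            # opens another cluster
print(len(clusters), first is second, first.mean["bc"])       # 2 True 0.5
```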
Performance Evaluation

We conducted several experiments to compare the efficacy of the path features in characterizing user sessions, study the accuracy of our similarity measures in detecting the similarity among user sessions, compare the performance of the FM model with that of the traditional Vector model, and investigate the accuracy of dynamic clustering.

Experimental Methodology

Although there are accurate tracking mechanisms to capture user sessions, here we preferred to use synthetic data so that we could have more control over our input characteristics. To generate $N$ synthetic user sessions, we start by partitioning our user space into $k$ groups of almost equal size. The assumption is that members of the same group are users with similar behavior. Subsequently, we enforce that all of the users within a group navigate a website almost identically. Here we use different techniques to introduce noise into these navigations. The objective is to use our FM algorithms to form clusters as close as possible to the original groups by examining only the navigations. We use precision and recall metrics to compute the distance between the FM clusters and the original groups as our measures of success. The website is not real but simply a randomly generated single-source directed graph (DG).

Our original $k$ groups are formed based on the possible combinations of the following user demographics: sex (male, female), age (young, middle-aged, old), and financial status (poor, middle-class, wealthy), which yields $k = 2 \times 3 \times 3 = 18$ groups. To enforce that similar users navigate similar paths, first a core path was created for each group as its representative path. Next, each core path was considered as the centroid to construct similar sessions around it.

Our website DG consisted of a fixed set of vertices (pages), one of which was selected as the source (homepage). Each page had the same number of outgoing edges (links), $l_1, l_2, \ldots$, to other pages. The destination pages of the links were selected randomly, once, during website construction. The core path for a cluster $C_i$ was generated by traversing the link $l_i$ of each page, starting from the homepage. The length of a core path was fixed. As a result, we created 18 core paths of equal length, $P_1$ to $P_{18}$, all starting from the homepage:

$$P_i : X_{i,0}\,(= X_H) \rightarrow X_{i,1} \rightarrow \cdots \rightarrow X_{i,j-1} \rightarrow X_{i,j} \rightarrow \cdots$$

where $X_{i,j}$ is the destination page of the link $l_i$ in the page $X_{i,j-1}$, and $X_H$ is the homepage.
Next, user sessions are constructed around the core paths. There are two knobs, $v$ and $p$, to control the similarity of a user session to a core path. $v$ controls the variation in the length of a user session and is defined relative to $m$, the minimum length of a session. Hence, the higher $v$, the more variation exists in the lengths of the created sessions. $p$ determines how similar a user session is to the core path. Specifically, $p$ is the probability that the user selects a link identical to the core path's link at each and every page. In the extreme case, when $p = 1$, the sessions follow the exact pattern of the core path, page by page, though they might have different lengths. For example, consider a user session $U$ belonging to cluster $C_i$, with the core path $P_i$ as its representative:

$$U : x_1 \rightarrow x_2 \rightarrow \cdots \rightarrow x_{j-1} \rightarrow x_j \rightarrow \cdots \rightarrow x_L$$

where $x_j$ is the page visited at the $j$-th position. $x_j$ will be identical to $X_{i,j}$ of the core path with probability $p$; otherwise, from $x_{j-1}$ we select a wrong link, with total probability $1 - p$, and hence $x_j \ne X_{i,j}$.
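The data generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the graph size, and the number of links are hypothetical; session lengths are drawn uniformly between the core length minus $v$ and the core length; and a wrong link is drawn uniformly from the remaining links. Because link destinations are random, a wrong link may occasionally lead to the same page as the core link, which this sketch does not prevent.

```python
import random

def build_site(num_pages, num_links, rng):
    """Random directed graph: site[page][link] is the destination page of that link."""
    return [[rng.randrange(num_pages) for _ in range(num_links)]
            for _ in range(num_pages)]

def core_path(site, link, length, home=0):
    """Follow the same link index from the homepage until the path has `length` pages."""
    path = [home]
    while length > len(path):
        path.append(site[path[-1]][link])
    return path

def make_session(site, link, core_len, v, p, rng, home=0):
    """Noisy session for the group whose core path follows `link`."""
    length = rng.randint(core_len - v, core_len)   # assumed length model
    other_links = [l for l in range(len(site[0])) if l != link]
    path = [home]
    while length > len(path):
        # With probability p take the group's link, otherwise a wrong one.
        chosen = link if p > rng.random() else rng.choice(other_links)
        path.append(site[path[-1]][chosen])
    return path

rng = random.Random(7)
site = build_site(num_pages=30, num_links=4, rng=rng)
core = core_path(site, link=1, length=8)
session = make_session(site, link=1, core_len=8, v=3, p=1.0, rng=rng)
print(session == core[:len(session)])  # True: with p = 1 the session is a prefix of the core path
```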
[Figure: Experimental results. (a) Compromising weight factors for path features (comparing the efficacy of the spatial path features): HM (%) versus $w_S$, one curve per dataset, for p = 1.0, 0.9999, 0.99, 0.95, 0.90, and 0.80. (b) Comparing the two similarity measures VA and PPED: HM versus $p$.]

In our experiments, we varied both $v$ and $p$. In each dataset, each user is assigned to a cluster randomly; hence, the clusters are of approximately equal size. As mentioned before, we used precision and recall to estimate the accuracy (the Y axis in all graphs, in percentage). Whenever these two measures behaved similarly, we combined them into the Harmonic Mean (HM) function, defined as:

$$HM = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

HM assumes a high value only when precision and recall are both high.

Finally, note that, for simplicity, T is excluded from the experiments. Thus, the analysis is performed based only on the spatial features of the sessions. We used segments of the same order to capture both S and H.
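The accuracy metrics can be sketched as below. The per-cluster definitions of precision and recall used here (overlap between a generated cluster and its original group) are an assumption, since the transcript does not spell them out; the function names and sample sets are hypothetical.

```python
def precision_recall(cluster, group):
    """Precision and recall of one generated cluster against its original group."""
    common = len(set(cluster) & set(group))
    precision = common / len(cluster) if cluster else 0.0
    recall = common / len(group) if group else 0.0
    return precision, recall

def harmonic_mean(precision, recall):
    """HM is high only when both precision and recall are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical session identifiers.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(p)                              # 0.5
print(round(harmonic_mean(p, r), 3))  # 0.4
```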
Efficacy of the Path Features

A set of experiments was conducted to study the relative efficacy of the path features H and S in detecting similarities between user sessions. In the total-dissimilarity equation, the weight factor $w_i$ indicates the relative importance of the path feature $F_i$ in computing the aggregated similarity measure. Higher weights are assigned to the features that are more effective in capturing the similarities. Our experiments were intended to find the compromise set of weight factors, $w_S$ (the weight factor for S) and $w_H$, that results in the optimum accuracy in capturing the similarities.

In this set of experiments, we used FM to model the sessions and K-Means with PPED to cluster the sessions. We performed each experiment with a different $p$ value. Panel (a) of the experimental-results figure summarizes the results. In this figure, the X axis is $w_S$ and the Y axis is the Harmonic Mean (HM). Each curve corresponds to a different dataset with a different $p$ value.

As observed in the figure, the accuracy remains high for every weight setting, which indicates that both features (Hit and Sequence) are comparably successful in identifying the spatial similarities, though S demonstrates slight superiority (see the end points). The optimum accuracy is achieved by employing a compromise combination of the similarities detected in Hit and Sequence. The compromise $w_S$ depends on $p$: the less distinguishable the dataset is (that is, the lower $p$), the less weight should be assigned to Sequence. In other words, when the similarity among users of the same group decreases, it is more important to track which pages they visit than where in the session they visit each page. Hence, for real data, which is less distinguishable, a correspondingly lower $w_S$ may result in the optimum accuracy. In sum, if we assume user behaviors are similar when the spatial characteristics of their sessions are similar, using H and S effectively categorizes user behaviors.
Accuracy of the Similarity Measures

Earlier, we introduced three similarity measures. Here, the accuracies of these similarity measures are compared.

First, we applied VA and PPED to cluster six datasets with different $p$ values and a common $v$. We used FM to model the sessions and K-Means for clustering. The weight factors $w_S$ and $w_H$ were held fixed. As observed in panel (b) of the experimental-results figure, PPED outperforms VA. With $p = 1$, both similarity measures performed perfectly, but as the datasets become less distinguishable the margin between their performances increases. Thus, for real data, which is assumed to have low distinguishability, PPED is definitely preferable.

Second, we conducted some experiments to verify the effect of employing PPED to alleviate the overestimation problem of PED. For these experiments, three datasets were used, with a common $p$ and various $v$ values. The accuracies of the clusters generated by applying PED and PPED are contrasted in the figure comparing PED and PPED. First, note that PPED outperforms PED in both precision and recall (panels (a) and (b)). We attribute this superiority to the alleviation of the overestimation problem. The accuracy of PPED remains high, though it decreases for higher $v$ values. This is reasonable, because a higher $v$ implies that more short sessions exist in the dataset. The shorter a session is, the less information about the user behavior it contains. Thus, shorter sessions are harder to assign to the appropriate clusters, and failure to assign them to their clusters results in lower accuracy. However, PED shows a different behavior. For PED, the decrease in recall at higher $v$ values is larger than expected. On the other hand, unlike PPED, its precision does not show noticeable variation as $v$ increases. For the remainder of this section, we explain this behavior.

According to the experimental results, when PED is used as the similarity measure for clustering, sometimes a cluster with a short core path attracts several sessions from other clusters, even though they do not have much in common with the cluster. A cluster with a short core path represents a group of users about whom we do not have much information. This problem can also be explained when paths are represented by feature vectors. In the figure illustrating the overestimation problem, $U$ and $C$ are the feature vectors for a session and its corresponding cluster, respectively, and $C_x$ is the feature vector for a cluster with a short core path. As observed from the figure, although $C$ and $U$ are close in direction (so they are similar), since $C$ is long and $U$ is short PED estimates a long distance between them. On the other hand, since $C_x$ is short, it is considered similar to every short session regardless of its direction. Thus, PED finds $U$ more similar to $C_x$ than to $C$.
[Figure: Comparing performances of PED and PPED as a function of $v$. (a) Recall measure (%). (b) Precision measure (%). (c) Harmonic Mean (%).]

[Figure: Illustration of the overestimation problem, showing the vectors $C$, $U$, and $C_x$ and the differences $U - C$ and $U - C_x$.]

[Figure: Experimental results as a function of $p$. (a) FM vs. the Vector model: HM (%). (b) Dynamic clustering: HM (%) for KM (PPED, init. 18), DC (PPED, init. 14), and DC (PPED, init. 18), that is, the accuracy of dynamic clustering in creating (init. 14) and updating (init. 18) the clusters in real time.]
With PPED, on the other hand, this problem is avoided by taking the direction of the feature vectors into account. To estimate the similarity, PPED computes the Euclidean distance between a session and the projected component of a cluster on the direction of the session. In other words, with PPED the similarity between a user and a group is computed based on user characteristics rather than group characteristics. Thus, the user joins the group that has the most in common with the user. Overestimation of the distance between the session and its intended cluster is therefore avoided by disregarding unnecessary group characteristics in the distance estimation.

Due to the problem described above, with PED most clusters experience low recall, because many of their sessions are accepted by the cluster that is not well characterized. However, since most misclustered sessions join a single cluster, the average precision of clustering among the clusters is high. The decrease in recall intensifies with higher $v$ values, because short sessions are more exposed to misclustering. HM combines the precision and recall figures and is a more realistic measure of accuracy for PED and PPED (see panel (c) of the figure comparing PED and PPED).
Performance of the FM Model

Since we consider the FM model an extended form of the Vector model, we conducted some experiments to compare the performance of a sample FM model, with H and S as its features, against the traditional Vector model, which is considered equivalent to an order-1 FM model with H as its only feature.

Results of this study are depicted in panel (a) of the figure on FM and dynamic clustering. As observed in the graph, the performance of the Vector model decreases as the user sessions become less distinguishable, while the FM model maintains its accuracy at lower values of $p$. This superiority is due to incorporating S into the model and capturing features based on higher-order segments. Note that even with $p = 1$ the Vector model fails to achieve perfect accuracy. That is because of the variation in session lengths: although with $p = 1$ all sessions exactly follow the pattern of the core paths, they might have different lengths. The FM model, by contrast, is perfect at $p = 1$.

Performance of the Dynamic Clustering
In the Dynamic Clustering section, we introduced dynamic clustering as an approach to update cluster models in real time. However, we also mentioned that dynamic clustering trades accuracy for adaptability. We conducted several experiments to study the degradation in accuracy caused by applying dynamic clustering. For this purpose, we compared dynamic clustering with K-Means.

For the experiments, we initiated dynamic clustering in two ways: in one, we initiated all 18 clusters with the core paths; in the other, 14 clusters were initiated and the remaining clusters were left for dynamic clustering to create. With the former, dynamic clustering is applied to update the existing clusters in real time; with the latter, besides updating the existing clusters, dynamic clustering also initiates new clusters.

Results of the experiments are illustrated in panel (b) of the figure on FM and dynamic clustering. It is notable that when all clusters are initialized, dynamic clustering performs as accurately as K-Means for highly distinguishable datasets (high $p$), though, as expected, its accuracy decreases steeply for less distinguishable datasets. Moreover, the performance of dynamic clustering is much better in updating existing clusters than in creating new clusters. Thus, dynamic clustering can be applied to achieve adaptability, but it should be complemented by periodic reclustering.
Conclusions and Future Work

We defined a new model, the FM model, which is a generalization of the Vector model, to allow for flexible, adaptive, and real-time analysis of web navigation data. We argued that these characteristics are not only useful for offline and conventional web navigation mining but also critical for anonymous analysis. We demonstrated the flexibility of FM, which can conceptualize new navigation features (e.g., view time of web pages) as well as trade off performance for accuracy by increasing the order of the model. For FM, we proposed a similarity measure, PPED, that can work with partial navigation paths, which is essential for real-time and anonymous web navigation mining. We then utilized PPED within a dynamic clustering algorithm to make FM adaptable to short-term changes in user behaviors. Dynamic clustering is possible since, unlike with the Markov model, incremental updating of the FM model as new sessions arrive has low complexity. Finally, we conducted several experiments that demonstrated the high accuracy of PPED, the superiority of FM over the Vector model, and the tolerable accuracy of dynamic clustering as compared to K-Means while being adaptable.

We intend to extend this study in three ways. First, we would like to run more experiments with real data, both to verify our results and to include the viewing time of web pages and hyperlink identities in our analysis. Towards this end, we are building a real application so that we can evaluate the efficacy of our techniques. Second, we plan to investigate other aggregation functions that might be more appropriate for certain features, as opposed to simple averaging over all the features. Finally, for our cluster and session models, we want to compress the matrices even further, possibly through Singular Value Decomposition (SVD).
References

I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. Visualization of Navigation Patterns on Web Site Using Model Based Clustering. Tech. Report MSR-TR, Microsoft Research, Microsoft Corporation, Redmond, WA, March.
T. Raymond and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proc. of VLDB Conf., September.
M. Ester, H. P. Kriegel, and X. Xu. Knowledge Discovery in Large Spatial Databases: Focusing on Techniques for Efficient Class Identification. In Proc. of the Intl. Symposium on Large Spatial Databases.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In SIGMOD, Montreal, Canada, June.
C. Shahabi, A. Zarkesh, J. Adibi, and V. Shah. Knowledge Discovery from Users Web-Page Navigation. In Proc. of the IEEE RIDE Workshop, April.
http://www.watchwise.com
http://www.webtrends.com
http://www.webmanage.com
B. Mobasher, H. Dai, T. Luo, Y. Sun, and J. Zhu. Integrating Web Usage and Content Mining for More Effective Personalization. To appear in Proc. of the Intl. Conf. on E-Commerce and Web Technologies (ECWeb), Greenwich, UK, September.
M. S. Chen, J. S. Park, and P. S. Yu. Data Mining for Traversal Patterns in a Web Environment. In Proc. of the Intl. Conf. on Distributed Computing Systems.
R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the Intl. Conf. on Data Engineering (ICDE), Taipei, Taiwan, March.
R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In Intl. Conf. on Tools with Artificial Intelligence, IEEE, Newport Beach.
A. Joshi and R. Krishnapuram. Robust Fuzzy Clustering Methods to Support Web Mining. In Proc. of SIGMOD Workshop on Data Mining and Knowledge Discovery, Seattle, June.
B. Huberman, P. Pirolli, J. Pitkow, and R. Lukose. Strong Regularities in World Wide Web Surfing. Science.
http://docs.yahoo.com/docs/pr/release[...].html
B. Mobasher, R. Cooley, and J. Srivastava. Automatic Personalization Based on Web Usage Mining. Communications of the ACM, August.
A. Zarkesh, J. Adibi, C. Shahabi, R. Sadri, and V. Shah. Analysis and Design of Server Informative WWW-Sites. In Proc. of ACM CIKM.
K. P. Joshi, A. Joshi, and Y. Yesha. Warehousing and Web Logs. WIDM, Kansas City, MO.
T. W. Yan, M. Jacobsen, H. G. Molina, and U. Dayal. From User Access Patterns to Dynamic Hypertext Linking. In Intl. World Wide Web Conf., Paris, France, May.
http://www.personify.com
http://[...]
http://www.websidestory.com
http://www.bluemartini.com
Description
Cyrus Shahabi, Farnoush Banaei-Kashani, Jabed Faruque, Adil Faisal. "Feature matrices: A model for efficient and anonymous mining of web navigations." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 736 (2000).
Asset Metadata
Creator
Banaei-Kashani, Farnoush
(author),
Faisal, Adil
(author),
Faruque, Jabed
(author),
Shahabi, Cyrus
(author)
Core Title
USC Computer Science Technical Reports, no. 736 (2000)
Alternative Title
Feature matrices: A model for efficient and anonymous mining of web navigations
(title)
Publisher
Department of Computer Science, USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
25 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16270214
Identifier
00-736 Feature Matrices A Model for Efficient and Anonymous Mining of Web Navigations (filename)
Legacy Identifier
usc-cstr-00-736
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu
Inherited Values
Title
Computer Science Technical Report Archive
Coverage Temporal
1991/2017