USC Computer Science Technical Reports, no. 877 (2006)
Structural Analysis of User Association Patterns in Wireless LAN
Wei-jen Hsu¹, Debojyoti Dutta², and Ahmed Helmy¹
¹ Department of Electrical Engineering, University of Southern California
² Department of Computational Biology, University of Southern California
Email: {weijenhs, ddutta, helmy}@usc.edu
Abstract
Due to the rapid growth in wireless local area networks (WLANs), it is important to characterize the fine structure of user association patterns. In this paper we focus on understanding the structure in users' daily access point (AP) association patterns in WLANs in the long run. We propose new methodologies to systematically analyze WLAN user traces and use the USC WLAN trace as a case study. We first define a novel measure for consistent user behavior and apply a clustering technique to each individual user. We observe that the daily association patterns fall into multiple clusters for many (> 50%) users, indicating multi-modal user association behavior. However, the intrinsic dimensionality of the constructed user association matrix is low, hence it is possible to represent a user's whole association history with a few eigenvectors and the corresponding eigenvalues with minimal error. Using only the five most important eigenvectors and eigenvalues, the reconstruction error is about 5%. We further define metrics to quantify similarity between the association patterns of different users, and utilize the proposed metrics to group users into clusters with similar association patterns. We rigorously validate the efficacy of our metrics by ensuring that the inter- and intra-cluster distance distributions have a clear separation. The distance is below 0.5 for most user pairs in the same cluster, and above 0.9 for user pairs in different clusters. The analytical techniques we introduce in the paper could be very useful to network administrators and could provide a better understanding of their users.
1 Introduction
There has been a rapid increase in wireless LAN (WLAN) deployments, users and traffic in recent years. Such explosive growth mandates the design of sophisticated network management schemes for a multitude of tasks, ranging from better deployment of access points to meet user needs to detection of malicious activity on these WLANs. In order to investigate the above problems, one of the first fundamental challenges that needs to be addressed is to determine the user access patterns. A detailed study of how users behave within a WLAN environment could have far-reaching consequences, from network deployment optimization and usage pattern detection, to new applications that push content to the end-user based on the access patterns.
Most of the current studies of WLAN access patterns have focused on usage statistics (e.g., number of users at the access points (APs), online day counts for a user, or percentage of traffic using a protocol [13, 11, 12]). Other researchers have attempted to model user association to the APs (e.g., the user arrival process at APs [14], or user session lengths and preferences in AP selection [9, 18, 19]). However, it is important to do a fine-grained quantitative study of the structures and trends in long-term user association patterns. Understanding such detailed structure is especially important for large-scale WLANs with users that re-visit the environment over a period of time (i.e., the user population and its behavior do not vary drastically in the studied period), such as university campus WLANs or corporate WLANs.
There are very few known efforts to determine and quantify the fine-grained structures and trends in user association patterns to APs in the long run, and to classify users based on such patterns. In this paper, we propose novel directions to understand such trends of user associations in a campus WLAN, and ways of quantifying those trends. Our goal is to quantify repetitive and consistent structure of user association patterns.
In wireless LANs, user association to APs is a rough indicator of the user's location, and this information reveals the set of locations the user visits. The structure of user behavior could be defined and studied at various time-scales. In this paper, we choose to study the composition of the user's daily association patterns within the USC campus: for example, whether they are similar or different from day to day, and how to represent the major trends in these association patterns. Structural study of user association behavior can be useful for the aforementioned networks in several ways. First, by understanding the typical structure in user behavior, we could establish metrics for association patterns for the users and classify users into groups based on the metrics. Also, such metrics could help us detect abnormal user behavior (e.g., significant short-term change in association pattern), and eventually help us to plan for future network deployments based on long-term user behavior trends. Second, profiling the structures in user association patterns helps the network administrator to understand and serve users better. The administrator may group users based on structural characteristics of the user association patterns and provide location-aware services to user groups. Third, profiling the structures in user association patterns can also help users to find others with similar association patterns and provide useful information to context-aware protocols.
Specifically, we focus on the following questions in this paper:
(1) Do users show single-modal or multi-modal association patterns across days? More precisely, does a user show a similar association pattern each day, or does the user choose the daily association pattern from one of several different association modes or classes?
(2) Is it possible to summarize the user association patterns for multiple days in a concise fashion for current WLAN users, regardless of whether the users display single- or multi-modal behavior?
(3) How do we quantify users as having similar or dissimilar association patterns? Can we utilize such metrics to partition the whole user population into clusters of similar users?
1.1 Our contributions
In this paper we provide new methodologies to analyze WLAN traces. To illustrate the usage of our tools, we analyze the association patterns of 5,000 users in the campus WLAN traces at the University of Southern California. The trace was collected during the spring semester of 2006 for 94 days. However, our methodologies are not limited by the choice of data set, and can be applied to study other WLAN traces (e.g., [7], [8]).
Our primary contribution is the set of methodologies we use to systematically analyze the association patterns. To this effect, we define novel features that can be extracted from traces and similarity metrics based on eigen decomposition, and we robustly answer the questions we ask using unsupervised learning methods such as clustering, which are subsequently rigorously validated. Our specific contributions are as follows:
• Multi-modal user behavior: We find that in our university campus WLAN, most users display multi-modal association behavior. Only a few users remain consistent in their association patterns (i.e., are single-modal users) for all the days we studied. Specifically, more than 50% of users are classified as multi-modal under an intermediate inter-pattern distance, and very few users are uni-modal in nature.
• Summarize user-association patterns: We define a new association matrix by concatenating the association patterns of each day as different columns. Using an eigen decomposition technique to summarize the data set, we observe that for most users, the intrinsic dimensionality of the association matrix is quite low, i.e., it has very few modes of variation. For more than 99% of users, we can use at most 7 eigenvectors and eigenvalues to capture more than 90% of the power in the association matrix.
• Metrics to compare different user-association patterns: We propose two novel metrics to quantify the similarity between the association pattern structures of different users. In the straightforward approach, we derive the metric based on a detailed, comprehensive comparison of the association patterns of the users. In the feature-based approach, we use the eigenvectors and eigenvalues obtained from the association matrix as a feature set to determine the similarity between association patterns. We show that these two metrics are closely related, with a correlation coefficient of 0.9119, but the feature-based approach is computationally much simpler than the straightforward approach. Finally, we utilize these metrics to obtain a partition of the user population by clustering. We validate our metrics by demonstrating that the mean inter-cluster distance of the resulting partition is much larger than the mean intra-cluster distance. More than 90% of intra-cluster user pairs have distance smaller than 0.5, and more than 90% of inter-cluster user pairs have distance larger than 0.9. There is a clear separation of the distributions of the inter- and intra-cluster distances. Thus, with the proposed metric, we are able to identify users with similar association patterns.
The rest of the paper is organized as follows. Related work is briefly outlined in section 2. In section 3, we describe the trace collection infrastructure that we use and introduce the mathematical representation of user association patterns. In section 4, we identify multi-modal user behavior using clustering techniques. In section 5 we present the matrix representation of user association patterns. The metrics for user similarity and user population partition are presented in section 6. Finally, we discuss the potential of our findings in section 7, and conclude with future work directions in section 8.
2 Related Work
Recently, there have been numerous papers on the empirical study of wireless LANs to understand their users. Earlier papers focused on the trace collection infrastructure [13, 11, 12, 5], presenting user statistics and a basic understanding of user behavior, and contributing those traces to the research community.
One of the recent directions has been to understand user behavior more deeply, which is also the goal of our paper. Most of these works suggest models to describe aspects of user association behavior. In [9, 18, 19] the focus is on modeling user preferences and association durations at the access points. In [14] the authors propose a model for user arrival patterns. In this paper, we take an alternate approach: instead of modeling each user, we seek to understand user association by observing the long-term daily patterns empirically from traces and quantitatively describing the structure in user association patterns. We aim to establish a methodology to understand and explain the characteristics of user association in WLANs as a multi-variate analysis problem.
To the best of our knowledge, the questions we seek to answer in the paper are not fully addressed in the current literature. The multi-modal association behavior of WLAN users has not been investigated before, and there is no previous work on quantifying similarity metrics between user association patterns. Identifying major trends of association patterns using singular value decomposition is also discussed in [22] for cell-phone users, but unlike in our study, similarity metrics are not defined there. More specifically, we leverage techniques from machine learning and data mining to analyze the WLAN trace, which is a new approach that augments the existing work on the traces. We mainly use hierarchical clustering [1] techniques to classify users in the study.
We also utilize eigen decomposition [4] to decompose the association matrix of a user into its eigenvectors and eigenvalues. This is related to principal component analysis (PCA) [2], which is used in [10] to find common trends in flows on network links, and in [22] to find trends in cell-phone users' associations. We adopt a variant of traditional PCA, called uncentered PCA, which has been used in ecology to study the diversity of species at various sites [15]. Uncentered PCA is suitable if the mean is also a useful feature for comparison, which is certainly true in our study. Comparing the similarity of data sets using the corresponding PCs of each set is also done in [16, 17]. However, unlike previous work, we define novel similarity metrics, use those metrics to cluster wireless users' association patterns into groups, and robustly validate our metrics.
3 Preliminaries
In this section we first discuss the wireless LAN trace collection process at the University of Southern California (USC) and the basic facts about the trace. Then we discuss how we choose to define and represent user association patterns in the paper.
3.1 Trace collection
USC has a wide-scale wireless LAN deployed on campus. We have been collecting traces from the USC WLAN since early 2004 [5, 6], and they are available to the community on our webpage [7]. Since the winter of 2005, the user verification process for our campus network has changed significantly¹, and the trace collection process has also been modified. Interested readers can check the release notes on the MobiLib webpage [7]. From the traces, we derive user association history (i.e., the start and end times at which the user associates with a particular location) at per-switch-port granularity, which approximately corresponds to buildings on the USC campus.
In this paper we analyze the WLAN traces collected at USC after the network policy changes. Specifically, we use the traces collected between Jan. 25, 2006 and Apr. 28, 2006, which correspond to the spring semester. There are 137 unique switch ports (corresponding to different locations) in the trace. We have seen a total of 25,481 unique MAC addresses², a significant increase compared to the 14,856 unique MAC addresses seen in the fall 2004 semester or the 4,258 seen in summer 2005. We pick the top 5,000 users among the 25,481 by ranking them in terms of total online time during the trace period. We consider a user to be online whenever the user is associated with one of the switch ports. Since we are interested in finding suitable methods to represent the structure in association patterns, we choose to focus on the more active users and disregard the majority of less active users. Among the 5,000 chosen users, the most active user is online for 99.9% of the time during this period (almost always on), and the least active user is online for 4.2% of the time. Note that even though we have disregarded about 80% of the users, the 5,000 chosen users still span a wide range of user activeness.
¹ Specifically, a VPN connection to a central server is no longer necessary, and users can start using the WLAN on campus upon a verification process using the USC e-mail account. This has improved the accessibility of the USC WLAN to students and faculty on campus. We have seen the number of users increase significantly after the change.
² In this paper, we apply the common assumption that each unique MAC address represents a unique device, which is owned and used by a unique user.
3.2 Representation of User Association Pattern
In this section we introduce our representation of the user association patterns using the information obtained from the traces.
The first question we address in this paper is how to represent a user association pattern. User association patterns can be described at various time intervals: hours, days, or even weeks. All these representations are valid, but in this paper, we choose the daily association pattern. Hence, later in the paper, by association pattern, we implicitly mean the daily association pattern (to be defined later) of a user. We pick one day as the time interval because USC is a commuter campus and daily patterns capture sufficient uniqueness without representing highly averaged behavior or user-group specificity. For example, students on university campuses move from class to class in periods of 90 to 120 minutes, but such structure is absent for staff on campus. Hence, if we had chosen 90 minutes as the period for identifying consistency in association behavior, the results of the study would depend heavily on the population studied and on how we apply the 90-minute intervals (e.g., whether they happen to align with the class schedule).
Now we can formally define the daily association pattern for a user. We represent the user association pattern for a day as an n-entry vector, $(a_1, a_2, \ldots, a_n)$, where $n$ is the number of switch ports in our trace. Each entry in the vector, $a_i$, represents the fraction of online time during the day that the user spends at switch port $i$ (i.e., the time the user spends at that switch port divided by the user's online time for the day). We normalize the user association time with respect to online time because we want to assess the relative importance of the locations to the user. The importance of a location for a user is better reflected by the fraction of online time the user spends at the location. In this way, the conclusions we draw are not influenced by the absolute value of online time, a factor that varies over a wide range among users. Note that the sum of the entries in the daily association pattern, $\sum_{i=1}^{n} a_i$, is always 1 if the user has been online during the day. We use a zero vector to represent the association pattern when the user is completely offline for the day.
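The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `daily_association_pattern`, the 0-based port indices, and the use of seconds as the time unit are hypothetical choices, not taken from the trace format.

```python
def daily_association_pattern(seconds_at_port, num_ports):
    """Return the n-entry vector of online-time fractions for one day.

    `seconds_at_port` (hypothetical input format) maps a 0-based switch-port
    index to the time the user was associated with that port during the day.
    """
    total_online = sum(seconds_at_port.values())
    pattern = [0.0] * num_ports
    if total_online == 0:
        return pattern  # user offline all day: zero vector
    for port, seconds in seconds_at_port.items():
        # Fraction of the day's online time spent at this port.
        pattern[port] = seconds / total_online
    return pattern
```

For a day with two hours at port 0 and one hour at port 2, the entries at those ports are 2/3 and 1/3, and all other entries are zero.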
In this paper, when we assess the similarity between two association pattern vectors, we use the Manhattan distance, or the L1 norm, as the distance measure since it is more robust to statistical noise. The distance between two vectors $a$ and $b$, denoted as $d(a, b)$, is defined as:
$$d(a, b) = \sum_{i=1}^{n} |a_i - b_i| \qquad (1)$$
where $a_i$ and $b_i$ are the $i$-th entries in vectors $a$ and $b$, respectively.
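As a small illustration of Eq. (1), the distance can be computed as follows. The function name and the three example patterns are hypothetical and chosen only to show the range of values the measure takes.

```python
def manhattan_distance(a, b):
    """Manhattan (L1) distance between two association patterns, as in Eq. (1)."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of entries")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical patterns over three switch ports.
class_day = [0.5, 0.5, 0.0]
library_day = [0.0, 0.0, 1.0]
offline_day = [0.0, 0.0, 0.0]

print(manhattan_distance(class_day, library_day))  # 2.0
print(manhattan_distance(class_day, offline_day))  # 1.0
```

Two online days spent at disjoint sets of locations are at distance 2, and an offline day is at distance 1 from any online day, because the entries of an online pattern sum to 1.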
4 Clustering of User Association Patterns
Given the daily association patterns of a user for the duration of a semester, the first question we ask is whether the user shows a single mode in the daily association patterns, or switches between several modes of association patterns through the course of the semester. By single-modal users, we refer to those who display similar association patterns (i.e., the distances between association patterns for different days are small). By multi-modal users, we refer to those who display several distinct groups of association patterns across different days (e.g., if a student goes to classroom buildings A and B for 2 hours each on the days he attends classes, and stays in the library for the whole day if there is no class on that day, there will be two unique and well-separated association patterns, a "class" mode and a "library" mode, for the student). One natural way to answer the question is to apply clustering techniques [1] to the association patterns of the user. If multiple clusters can be identified from the association patterns, then the user under consideration is multi-modal. In this section we briefly describe the clustering technique and apply it to the association patterns of each user.
4.1 Clustering Technique
The user association pattern for each day is a vector in an n-dimensional space. If the user shows a similar association pattern each day, the vectors should be close to each other in this space. On the other hand, if the user shows drastically different behavior each day, the vectors will be scattered in the space. Clustering is a huge area in itself, but clustering schemes can be roughly classified as hierarchical or partitional. In this paper we use hierarchical clustering, in which each vector is initially considered as a cluster containing one member. Then, at each step, based on the specific distance measure between clusters, the two clusters that are closest to each other among all cluster pairs are merged into one cluster with larger membership. Therefore, at each step the total number of clusters decreases by exactly one. This process continues until all vectors are merged into one cluster containing every vector.
A dendrogram can be created through the process of hierarchical clustering. It contains a tree structure showing the order in which clusters merge with each other and the distance between the clusters that are chosen to merge at each step. Therefore, with the complete dendrogram, one could choose a proper distance threshold to stop the merging process once all the inter-cluster distances are larger than the threshold (i.e., all the remaining clusters are separated by distances larger than the threshold), or a cluster number threshold to stop the process when the vectors are merged into a pre-defined target number of clusters.
However, one issue with applying the above technique to a data set with unknown characteristics is that it is hard to select adequate values in advance for the clustering threshold or the target number of clusters. An indication of a good clustering result is that the distances between vectors in the same cluster are low, the distances between vectors in different clusters are high, and there is a clear separation between the inter-cluster and intra-cluster distance distributions.
The important parameters for clustering are: (1) the distance measure between the vectors, (2) the metric to calculate the distance between clusters, and (3) the clustering threshold (distance or cluster number). As discussed in the last section, we use the Manhattan distance between association patterns. For calculating the distance between clusters, there are also various candidate methods. In this section we adopt two different widely used methods for cluster distance calculation: the distance between the furthest vectors in the two clusters (known as the complete-link algorithm) and the average distance over all vector pairs between the two clusters.
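The merging procedure and the two cluster-distance methods described in this subsection can be sketched as below. This is a minimal, unoptimized illustration, not the implementation used for the results in the paper; the function names and the example days are hypothetical.

```python
from itertools import combinations


def manhattan(a, b):
    """L1 distance between two equal-length vectors (Eq. 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters of vectors under the chosen linkage."""
    pair_distances = [manhattan(a, b) for a in c1 for b in c2]
    if linkage == "complete":
        return max(pair_distances)  # furthest pair of vectors
    if linkage == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_clusters(vectors, threshold, linkage="average"):
    """Merge the closest pair of clusters until every remaining pair is
    farther apart than `threshold`."""
    clusters = [[v] for v in vectors]  # every vector starts as its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j).
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], linkage))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # all remaining clusters are separated by more than the threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical user: two "class" days, two "library" days, one offline day.
days = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0],
]
print(len(hierarchical_clusters(days, threshold=0.9, linkage="complete")))
```

With a threshold of 0.9, the example days separate into a class cluster, a library cluster, and an offline cluster under either linkage.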
4.2 Multi-modal Behavior of WLAN Users
We apply clustering techniques to individual user association patterns. If the association patterns of a user can be arranged into several clusters, we say that this user displays multi-modal behavior. In other words, the user's association patterns have several distinct modes, and the user follows one of these modes on a given day. In contrast, if there is only one cluster in the association patterns for a user, this user is referred to as a single-modal user.
We apply two different ways to calculate inter-cluster distance (complete-link and average distance) and use various clustering threshold distances. We show that regardless of the clustering threshold chosen, the association patterns for most users fall into multiple clusters.
We first apply the hierarchical clustering scheme to a sample user. The result is shown in Fig. 1, in which we plot the number of clusters obtained from the user's association patterns against the clustering threshold, under both methods of cluster distance calculation. Naturally, as the clustering threshold increases, the resulting number of clusters decreases. However, notice that the association patterns for this particular user fall into at least 2 clusters until the clustering threshold reaches 1.4, which is quite high given that the distance between any two association patterns is at most 2.0, since we use the Manhattan distance and the entries in each pattern sum to 1.0. This is a strong argument that this particular user shows at least two drastically different association modes: the average distance between the patterns in the last two clusters is at least 1.4.
Figure 1: Number of clusters obtained by various distance measures and clustering thresholds. The user displays multi-modal behavior in most cases. [Plot not reproduced; legend: Average distance, Complete-link.]
Now we show, in Fig. 2, the distribution of the number of clusters obtained from the clustering analysis for all 5,000 users with clustering threshold 0.9. The exact number of clusters obtained depends on the distance calculation method. The average distance method leads to more aggregation of clusters than the complete-link method. With 0.9 as the distance threshold, the patterns when the user is offline (i.e., the zero vectors) are separated from the patterns when the user is online (i.e., the vectors whose entries sum to 1), since the Manhattan distance between a zero vector and any online pattern is exactly 1. Hence there should be at least two clusters in the association patterns. Users with two clusters can be referred to as "on-off" users with a consistent association pattern: one cluster corresponds to the patterns when the user is offline (the zero vectors), and the other corresponds to the patterns when the user is online. These users switch between online and offline behavior from day to day, and when they are online, the association patterns are consistent and fall in a single cluster. However, for both distance calculation methods, we also observe many multi-modal users. These users show more than two clusters, which indicates that their association patterns fall into different clusters, or association modes, when they are online.
If we consider users with more than two clusters (i.e., users with more than one association mode when online) as multi-modal users, the ratio of multi-modal users (out of the total of 5,000) can be considered as a function of the clustering threshold, and we show it for both distance calculation methods in Fig. 3. Here we see that, independent of the distance calculation method, a non-negligible portion of users are classified as multi-modal even if a high clustering threshold (above 1.0) is used. This implies that many users have at least two clusters in their association patterns when they are online. In other words, they have more than one mode of association to different sets of locations when they come online.
Figure 2: Distribution of number of clusters with clustering threshold 0.9 and various distance calculation methods. [Plot not reproduced; legend: Complete-link, Average distance.]
Figure 3: Percentage of multi-modal users under various clustering thresholds and distance calculation methods. [Plot not reproduced; legend: Average distance, Complete-link.]
5 Summarizing Trends in User Association Matrix
After validating in the previous section that most users display multi-modal association patterns, we move on to the second question we raised for the paper, which is designing a succinct way to express the major trends of user association patterns that dominate during the trace period. As an analogy, if one is asked to give a summary of the locations he usually stays at on campus, a sample response might be: "I usually spend eight hours in my office each day. Once a week, we have a long group meeting so I am in the meeting room for that afternoon. I go to the gym two to three times per week for an hour. I also visit the engineering library from time to time for short periods. I rarely visit other places besides the above." Note that in such a summary, the desirable order of presentation is to first give the component that describes most of the visit pattern (i.e., the long duration spent in the office), and then explain the different deviations from it in decreasing order of importance.
We wish to have a mechanism that provides an insightful but concise summary of user association patterns similar to the example, such that it captures the major trends together with how much weight the major trends carry compared to the variations in the association patterns. In this section, we seek a procedure to generate a summary of the association pattern set for a user using a small number of descriptive vectors, with a quantitative measure of the importance of each vector, without the need to fine-tune parameters for each user.
To represent the major trend of user association concisely, one intuitive candidate would be the average of the user's daily association patterns. Indeed, taking the average would be suitable if a user shows only a single mode in the daily association patterns (i.e., has only one cluster in the analysis of the last section). However, as this is not the case for most users, using only the average can sometimes be misleading: it yields an average association pattern that falls in between the common patterns for the user and is not at all a representative vector for the user.
In this section we present a novel way to summarize the user association patterns. Instead of taking the average, we arrange user association patterns in an association matrix, and we perform singular value decomposition on the matrix. Interestingly, although most users exhibit multi-modal behavior, the intrinsic dimensionality of these association matrices is actually low. This implies that, with only a few eigenvectors and their corresponding eigenvalues, we can summarize the association matrices with low reconstruction errors. We introduce our definition of the association matrix for a user and the necessary background on singular value decomposition in subsection 5.1, and our findings from applying this technique to users in the traces in subsection 5.2.
5.1 Association Matrix and Singular Value Decomposition
For a detailed description of a user over the studied period, all of the user's daily association patterns during this period are necessary. In this paper, we construct the association matrix for a user by concatenating the daily association patterns into a single matrix. In the association matrix, each column corresponds to a daily association pattern within the studied period, and each row corresponds to the fractions of online time the user associates with a given location over all the days. If there are $n$ distinct locations and the trace period is $d$ days, the association matrix for a single user is an $n$-by-$d$ matrix. A zero column vector corresponds to a day in which the user is never online.
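The construction above amounts to placing the daily patterns side by side as columns. A minimal sketch follows, assuming the patterns are plain Python lists and the matrix is stored as a list of rows; both choices, and the function name, are illustrative only.

```python
def association_matrix(daily_patterns):
    """Stack d daily patterns (each of length n) as the columns of an n-by-d matrix.

    The result is a list of n rows; row i holds the fraction of online time
    spent at location i on each of the d days.
    """
    if not daily_patterns:
        return []
    n = len(daily_patterns[0])
    if any(len(p) != n for p in daily_patterns):
        raise ValueError("all daily patterns must have the same length")
    return [[pattern[i] for pattern in daily_patterns] for i in range(n)]


# Hypothetical user with n = 2 locations observed over d = 3 days;
# the second day is an offline day (zero column).
X = association_matrix([[1.0, 0.0], [0.0, 0.0], [0.25, 0.75]])
print(X)  # [[1.0, 0.0, 0.25], [0.0, 0.0, 0.75]]
```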
From linear algebra [4], we know that for any $n$-by-$d$ matrix $X$, it is possible to perform singular value decomposition, such that:
$$X = U \cdot \Sigma \cdot V^T \qquad (2)$$
where $U$ is an $n$-by-$n$ matrix, $\Sigma$ is an $n$-by-$d$ matrix with $r$ non-zero entries on its main diagonal, and $V^T$ is a $d$-by-$d$ matrix, where the superscript $T$ in $V^T$ indicates the transpose of matrix $V$. Here $r$ is the rank of the original association matrix $X$.
From the properties of singular value decomposition (SVD) [4], we know that the columns of matrix $V$ are the eigenvectors of the covariance matrix $X^T X$, and $\Sigma$ is a diagonal matrix with the singular values corresponding to the eigenvectors on its diagonal, denoted as $\sigma_1, \sigma_2, \ldots, \sigma_r$. These singular values on the main diagonal of $\Sigma$ are ordered by their values (i.e., $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$), and they are also the square roots of the eigenvalues of matrix $X^T X$. We denote the eigenvalues as $\lambda_1, \lambda_2, \ldots, \lambda_r$. Each eigenvalue is a measure of the power captured by its corresponding eigenvector, a column of matrix $V$.
We can re-write Eq. (2) by taking column vectors $u_i$ and $v_i$ from matrices $U$ and $V$, and using them as building blocks to reconstruct the original matrix $X$, with the following relationship (since $\Sigma$ is a diagonal matrix):
$$\tilde{X}_k = \sum_{i=1}^{k} u_i \sigma_i v_i^T \qquad (3)$$
This yields a rank-$k$ approximation to the original matrix $X$. If the intrinsic dimensionality of the original matrix $X$ is low, then by applying SVD to the matrix and storing only the most important eigenvectors and singular values (i.e., the $u_i$'s and $v_i$'s that correspond to large $\sigma_i$'s), significantly less storage is required compared to storing the original matrix $X$. The percentage of power in the original matrix $X$ captured by the rank-$k$ reconstruction in Eq. (3) can be calculated by
$$\frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{r} \sigma_i^2} \qquad (4)$$
where r is the rank of the original matrix X. If the percentage
of power captured in the rank-k reconstruction
is large (i.e., close to 1) with a small value of k, we say that
the original matrix X has a small intrinsic dimensionality.
In some cases, many non-zero singular values and
the corresponding eigenvectors are not important for
reconstruction of the original matrix because the relative
weight of the component is low. Following Eq. (4), the
relative weight (or the importance) of an eigenvector $v_j$
is represented by $\sigma_j^{2} / \sum_{i=1}^{r} \sigma_i^{2}$.
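Eqs. (3) and (4) translate directly into a few lines of NumPy. The sketch below is illustrative only; the function names `rank_k_reconstruction` and `power_captured` are hypothetical and do not come from the paper.

```python
import numpy as np

def rank_k_reconstruction(X, k):
    """Rank-k approximation of X as the sum of k outer products, Eq. (3)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scaling the first k columns of U by sigma_i and multiplying by the
    # first k rows of V^T equals sum_i u_i * sigma_i * v_i^T.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def power_captured(X, k):
    """Fraction of power captured by the rank-k reconstruction, Eq. (4).

    Summing over all returned singular values is the same as summing up
    to the rank r, because the remaining singular values are zero.
    """
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
```

For a matrix whose columns are all multiples of one pattern, `power_captured(X, 1)` is 1 and the rank-1 reconstruction reproduces the matrix exactly.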
SVD can be viewed as calculating the eigenvalues and
eigenvectors of the covariance matrix, $X^{T}X$. This is the
procedure typically used to perform Principal Component
Analysis (PCA) for matrix X. In our case, we do
not remove the mean of each dimension (i.e., each row)
before performing the SVD. Such a technique is known
as uncentered PCA, which is applicable if the origin of
the data set is an important point of reference [2]. In
our case, we want to capture how the average trend in
the association patterns (captured in the first principal
component (PC) if uncentered PCA is applied) relates
to the variations (captured in PCs other than the first
component), so we choose not to perform SVD on a
zero-mean centered matrix. Furthermore, by adopting
uncentered PCA, the reference point (origin) remains
the same for all the association matrices, which enables
us to compare PCs obtained from different association
matrices. This point will be utilized further in the next
section. Using our notation, the principal components
(PCs) are the column vectors, $v_1, v_2, \ldots, v_r$, in the matrix
V, and the corresponding eigenvalues are the squares of the
singular values, i.e., $\lambda_i = \sigma_i^{2}$. The PCs show the trends
in the user's association patterns in decreasing importance,
with the first PC corresponding to the average
association pattern and the other PCs corresponding to
variation around the average.
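The difference between uncentered and centered PCA can be illustrated on synthetic data. In this sketch the data, the noise level, and the layout (one observation per row, so that the right singular vectors act as the PCs as in the notation above) are all assumptions made for the example and may not match the orientation of the paper's association matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical average pattern plus small random variation around it.
mean_pattern = np.array([3.0, 1.0, 0.0, 0.0])
data = mean_pattern + 0.05 * rng.standard_normal((200, 4))

def principal_components(M, center):
    """Return (PCs as rows, eigenvalues sigma_i^2); optionally remove column means."""
    if center:
        M = M - M.mean(axis=0)
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt, s ** 2

pcs_unc, eig_unc = principal_components(data, center=False)
pcs_cen, eig_cen = principal_components(data, center=True)

# |cos| of the angle between the first uncentered PC and the average pattern.
unit_mean = mean_pattern / np.linalg.norm(mean_pattern)
alignment = abs(pcs_unc[0] @ unit_mean)

# Share of power in the first PC with and without centering.
share_unc = eig_unc[0] / eig_unc.sum()
share_cen = eig_cen[0] / eig_cen.sum()
```

Without centering, the first PC points almost exactly along the average pattern and carries nearly all the power; after centering, only the variation remains and no single component dominates.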
5.2 Low Dimensionality of Association Matrices
Following the procedure outlined in the previous subsection,
we create the association matrices for the 5,000
chosen users and apply singular value decomposition to
them. In this section we explain the observations we
make from the results of SVD.
From the SVD results, the first property we look into
is whether the association matrices can be decomposed
into a small number of representative eigenvectors. In
other words, do these association matrices have low
intrinsic dimensionality? Low dimensionality will correspond
to few intrinsic modes of association and will also
help us store the association pattern in a compact space.
From Eq. (3), we see that if the matrices have low
dimensionality, the original matrices can be summarized
with only the vectors corresponding to high singular values.
Discarding the vectors corresponding to low singular
values would lead to only marginal reconstruction
error for the original matrices.
Interestingly, although in the last section we show
that most users display multi-modal behavior, from the
results of SVD we find that the major trend in association
patterns can be captured in a few eigenvectors for most
users, and the dimensionality of the association matrices
is actually low. To visualize this, we calculate
the percentage of power in the association matrices captured
by the rank-k reconstruction using Eq. (4). In
Fig. 4, we show the fraction of users for whom a given
percentage of the power in the association matrix is
captured by the rank-k reconstruction. From the graph we
observe that most users have a high percentage of the power in their
association matrices captured by low-rank reconstructions.
For example, rank-1 reconstruction
captures 50% or more of the power in the association matrices
for more than 98% of users, and rank-3 reconstruction
is sufficient to capture more than 50% of the power
in the association matrices for all users. Even if we consider
an extreme requirement, capturing 90% of the power,
it is achievable for 68% of users using the rank-1
reconstruction, and for more than 99% of users using at
most rank-7 reconstruction. Comparing the findings of
the matrix decomposition approach to those of the clustering
approach in the last section, we conclude
that although some of the users show multi-modal
association patterns, for most users the top eigenvectors
are much more important than the remaining
ones. This implies that the eigenvectors with high
corresponding eigenvalues obtained from SVD capture the
important trends in the association matrices. In other
words, although there are several clusters in user
association patterns, only a few are important.
Figure 4: Low association matrices dimensionality: A
high target percentage of power is captured with low
reconstruction matrix rank for many users. Legend entries: 90, 80, 70, 60, 50.
Since a few important eigenvectors capture most of
the power in the association matrices, it should be possible
to build a good reconstruction of the original association
matrices with a few eigenvectors and eigenvalues.
We verify the goodness of low-rank reconstruction by
calculating the matrix reconstruction error using the best
rank-k reconstruction, $\tilde{X}_k$, for association matrix X.
We define the L-p norm of the relative error as:
$$\frac{\|X - \tilde{X}_k\|_p}{\|X\|_p} \qquad (5)$$
where $\|X\|_p$ is the L-p norm for matrix X, defined as
$$\|X\|_p = \sqrt[p]{\sum_{\forall (i,j)} |X_{(i,j)}|^{p}} \qquad (6)$$
where $X_{(i,j)}$ is the entry at the i-th row and j-th column of
matrix X. The L-p norm of relative error is a standard way
to quantify relative errors in matrices [3]. In this paper,
we calculate the cases for p = 1, 2 with Eq. (5) (i.e.,
the L1 and L2 norms of relative error) to capture the
reconstruction error with low-rank matrices. We show
the average L1 and L2 norms of relative error over all the
users in Fig. 5. We can see from the figure that the
reconstruction error is also small under these standard
metrics. Using 5 eigenvectors, it is possible to achieve a
relative error of around 0.05.
Figure 5: L1-norm and L2-norm of relative error with
rank-k reconstruction. Legend entries: L1-norm, L2-norm.
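The relative error of Eqs. (5) and (6) can be computed as follows. This is a sketch under the assumption that the norm is taken entry-wise, as Eq. (6) states; the function names are hypothetical.

```python
import numpy as np

def entrywise_norm(M, p):
    """L-p norm of Eq. (6): p-th root of the sum of |entry|^p over all entries."""
    return float(np.sum(np.abs(M) ** p) ** (1.0 / p))

def relative_error(X, k, p):
    """Relative error of Eq. (5) for the best rank-k reconstruction of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return entrywise_norm(X - X_k, p) / entrywise_norm(X, p)
```

For p = 2 the entry-wise norm is the Frobenius norm, so the squared relative error equals one minus the fraction of power captured in Eq. (4); the two views of reconstruction quality are therefore directly linked in that case.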
The relationship between multi-modal association
pattern sets and low dimensionality of association matrices
can be investigated further. We show the percentage
of power captured by the first eigenvector of the
association matrix with respect to the user's ranking by
activeness (i.e., the total online time for the trace
period). In Fig. 6 (a), we observe that there is an interesting
relationship between the user activeness ranking
and the percentage of power in the association matrix
captured by the first eigenvector. Comparing with the
online time fraction for users (the total online time divided
by the trace period) shown in Fig. 6 (b), we see
that for the most active users, who are online almost
all the time, the power captured by the first eigenvector
alone is relatively high. This is due to the fact that most
of them are stationary users using the WLAN as a substitute
for the wired network. Hence, these users do not
have much variation in their association patterns, and the
first eigenvector is a good descriptor for their association
matrices. As user activity decreases, those users
who use the WLAN sporadically are more dynamic in
their association patterns, and hence the power captured
in the first eigenvector is lower.
In Fig. 6 (c) we show the number of clusters found
from the association patterns of users using average distance
between clusters with clustering threshold 0.9. We
observe that the less active users also tend to have more
clusters in their association patterns than the highly
active users. These conclusions can be linked together:
For the current WLAN users on the USC campus, the highly
active users mostly remain stationary at a single location,
so the dimensionality of their association matrices
is very low, and the association patterns tend to form
fewer clusters. Meanwhile, the users with higher variation
in the association patterns are those who are less
active, and their association patterns form more clusters.
However, the bottom line is that, for almost all users,
the intrinsic dimensionality of their association matrices
is much lower than the number of columns and rows of the
matrices.
6 User Clustering with Similar Association Pattern
After understanding the composition of individual user
association patterns and their characteristics, we further
ask the following questions: How do we quantify users as
having similar or dissimilar association patterns? Can
we define a good metric to describe the similarity between
user association patterns, and utilize such a metric
to partition the whole user population into clusters
of similar users?
In this section we first propose two different approaches
to quantify similarity between users. In the
first, straight-forward approach, we directly perform a
complete pair-wise comparison of association patterns
between two users to evaluate their similarity. This approach,
albeit intuitive, requires significant computation.
To reduce the computation requirement, we devise
the second, feature-based approach, in which we
compare only some carefully selected features that capture
the essence of user association patterns; the
computation is reduced if the selected feature set
has a smaller cardinality than the original association
pattern set. From the results in the previous section,
the eigenvectors and the eigenvalues are good
candidates.
Based on the two different approaches above, we could
define distances between users based on their similarity.
Using these distances, we can group similar users into
clusters. We further verify that the proposed distances
and clustering techniques are appropriate, as the inter-cluster
and intra-cluster distance distributions show a
clear separation. We also show that the users in the
same cluster indeed have similar association patterns
by verifying the dimensionality of the joint association
matrices of these clusters.
6.1 Similarity Measure of Association Pattern Sets
6.1.1 The Straight-forward Approach
We first introduce the straight-forward approach to
compare similarity between two sets of association patterns
(each set of association patterns belongs to a single
user). Intuitively, the sets are similar if they satisfy
the following property: For each pattern in one set, we
can find a pattern in the other set such that the distance
between these two patterns is small. To
quantify this intuitive definition, we propose the average
minimum pattern distance (AMPD) between sets A and
B of association patterns, AMPD(A, B), calculated by
the following:
$$AMPD(A, B) = \frac{1}{|A|} \sum_{\forall a \in A} \min_{\forall b \in B} d(a, b) \qquad (7)$$
where a and b denote a single association pattern in set
A and B, respectively, d(a, b) denotes the Manhattan
distance between the patterns as defined in Eq. (1), and |A|
denotes the cardinality of set A. The average minimum
pattern distance between sets A and B is the average of
the distances from each of the patterns in set A to the closest
pattern in set B. If this distance is small, then these two
association pattern sets are similar to each other. Note
that, with this definition, AMPD(A, B) is not necessarily
equal to AMPD(B, A). To obtain a symmetric distance
measure between association pattern sets A and
B, we further normalize the average minimum pattern
distances from set A to all other sets between (0, 1) (see footnote 3), and
define the straight-forward distance between set A and
B as D(A, B) = (AMPD(A, B) + AMPD(B, A))/2.
Now we have D(A, B) = D(B, A) for any given sets A
and B. We will utilize this distance measure to perform
cluster analysis of users in section 6.2.
Figure 6: Metrics of users ranked by online time fraction.
The curves show averaged value for 10 close-by
points for better visualization. (a) Percentage of power captured
by first eigenvector. (b) User online time fraction. (c) Clusters
obtained with clustering threshold = 0.9 using average distance.
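The straight-forward distance can be sketched in Python as below. The plain L1 distance stands in for Eq. (1), which is defined earlier in the paper and may include a normalization not reproduced here; the row-wise division by the maximum follows footnote 3. Function names and the example sets are hypothetical.

```python
import numpy as np

def manhattan(a, b):
    """Plain Manhattan (L1) distance; a stand-in for the paper's Eq. (1)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def ampd(A, B):
    """Average minimum pattern distance from set A to set B, Eq. (7)."""
    return sum(min(manhattan(a, b) for b in B) for a in A) / len(A)

def straightforward_distances(pattern_sets):
    """Symmetric distance D between every pair of pattern sets.

    Each row of the AMPD table is divided by its maximum (footnote 3)
    before the two directions are averaged.
    """
    m = len(pattern_sets)
    raw = np.array([[ampd(pattern_sets[i], pattern_sets[j]) for j in range(m)]
                    for i in range(m)])
    row_max = raw.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0   # guard: every other set is identical to this one
    norm = raw / row_max
    return (norm + norm.T) / 2.0
```

The nested loops make the cost visible: every pattern of one user is compared with every pattern of the other, which is the computational burden that motivates the feature-based approach.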
6.1.2 The Feature-based Approach
Identifying similar users by comparing the complete set
of association patterns is computationally very expensive.
Hence, it is desirable to have a computationally
simpler measure for association pattern set similarity.
From the last section we know that using the eigenvectors
and eigenvalues is a good way to summarize the association
matrices, so we propose to leverage them as the features
based on which we compare the similarity between
two association pattern sets. Since the eigenvectors are
unit-length vectors and orthogonal to each other, the
problem of comparing the eigenvectors between users
is equivalent to comparing the similarity between two
sets of orthonormal vectors, where each of these vectors
is associated with a weight that indicates its relative
importance in its set. To carry out such a comparison,
we propose the following procedure.
Suppose $u_i$'s and $v_j$'s are the eigenvectors of two users,
$i = 1, \ldots, r_u$ and $j = 1, \ldots, r_v$, where $r_u$ and $r_v$ are the
ranks of the corresponding association matrices. The
similarity of any pair of unit-length vectors among these
two sets can be obtained by the absolute value of their
inner product, $|u_i \cdot v_j|$. This is equivalent to calculating
$|\cos\theta|$, where $\theta$ is the angle between vectors $u_i$ and $v_j$.
The similarity of the two sets can be calculated by the
sum of pair-wise inner products of individual vectors
$u_i$'s and $v_j$'s, adjusted by the weights of $u_i$ and $v_j$. We
use $w_{u_i}$ to represent the weight of eigenvector $u_i$ in its
set, calculated by $w_{u_i} = \sigma_i^{2} / \sum_{k=1}^{r_u} \sigma_k^{2}$. The weights
$w_{u_i}$'s sum up to 1, and the $w_{v_j}$'s are defined similarly. After
considering the weights, we propose the similarity index
between two sets of eigenvectors, $U = \{u_1, \ldots, u_{r_u}\}$ and
$V = \{v_1, \ldots, v_{r_v}\}$, as:
$$Sim(U, V) = \sum_{i=1}^{r_u} \sum_{j=1}^{r_v} w_{u_i} w_{v_j} |u_i \cdot v_j| \qquad (8)$$
A higher similarity index Sim(U, V) indicates that the
eigenvector sets U and V are more similar, and hence
the corresponding users have similar association patterns.
Since in most cases only a few eigenvectors
capture most of the power in the association matrices,
the above equation can be reduced to comparing only
these important eigenvectors, in order to simplify the
calculation. In the following computations, we consider
only the eigenvectors that capture at least 0.1% of the total
power.
Footnote 3: Among all sets, we find the set X such that
$AMPD(A, X) = \max_{\forall N} AMPD(A, N)$. We then normalize
AMPD(A, B) = AMPD(A, B)/AMPD(A, X) for all sets B.
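Eq. (8) amounts to a weighted sum over a matrix of absolute inner products. The sketch below assumes each user's eigenvectors are supplied as the columns of an array together with the matching singular values; the function names and the `min_power` parameter (mirroring the 0.1% cut-off) are hypothetical.

```python
import numpy as np

def weights(singular_values):
    """Relative weight of each eigenvector: sigma_i^2 / sum_k sigma_k^2."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    return s2 / s2.sum()

def similarity_index(U, s_u, V, s_v, min_power=0.001):
    """Sim(U, V) of Eq. (8).

    Columns of U and V are unit-length eigenvectors of the two users;
    s_u and s_v are the corresponding singular values.
    """
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    w_u, w_v = weights(s_u), weights(s_v)
    # Keep only eigenvectors that capture at least min_power of the total power.
    keep_u, keep_v = w_u >= min_power, w_v >= min_power
    # |u_i . v_j| for every retained pair (i, j).
    cos = np.abs(U[:, keep_u].T @ V[:, keep_v])
    return float(w_u[keep_u] @ cos @ w_v[keep_v])
```

With only a handful of retained eigenvectors per user, the inner-product matrix is tiny, which is where the saving over the pattern-by-pattern comparison comes from.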
Now we define the distance between association pattern
sets of users U and V based on the similarity
index between their eigenvector sets. We normalize
the similarity indexes from set U to all other sets between
(0, 1) (see footnote 4), and define the feature-based distance
between set U and V as $D'(U, V) = 1 - (Sim(U, V) + Sim(V, U))/2$.
Note that the similarity indexes are
higher for similar users, so we need to subtract the similarity
indexes from 1 to get a distance measure, in which
a larger value implies that the two users are further separated.
We need to verify that the two distance measures,
straight-forward distance and feature-based distance,
are correlated with each other. For this purpose we calculate
the correlation coefficient between the two distance
measures for all the possible user pairs among
the 5,000 chosen users. The correlation coefficient is
very high, with the value 0.9119. The strong correlation
indicates that the distance measures generated by the
two methods are consistent with each other. In other
words, the feature-based distance is a valid substitute
for the straight-forward distance between user association
patterns, with much less computation required.
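The feature-based distance and the consistency check between the two measures can be sketched as follows. The input is assumed to be a table of raw Sim values for all ordered user pairs; taking the maximum over the other users only is one reading of footnote 4, and setting the self-distance to zero is a convention of this sketch. The text does not name the correlation coefficient used, so Pearson correlation is an assumption here, as are the function names.

```python
import numpy as np

def feature_based_distances(sim):
    """D'(U, V) = 1 - (Sim(U, V) + Sim(V, U)) / 2 after row-wise normalization."""
    sim = np.asarray(sim, dtype=float)
    off_diag = sim.copy()
    np.fill_diagonal(off_diag, -np.inf)      # exclude self-similarity from the maximum
    row_max = off_diag.max(axis=1, keepdims=True)
    row_max[~(row_max > 0)] = 1.0            # guard against empty or all-zero rows
    norm = sim / row_max
    dist = 1.0 - (norm + norm.T) / 2.0
    np.fill_diagonal(dist, 0.0)              # zero self-distance by convention
    return dist

def pairwise_correlation(d1, d2):
    """Pearson correlation of two distance tables over all unordered user pairs."""
    iu = np.triu_indices_from(d1, k=1)
    return float(np.corrcoef(d1[iu], d2[iu])[0, 1])
```

Applying `pairwise_correlation` to the straight-forward and feature-based distance tables of the same users gives the kind of consistency figure reported above.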
6.2 Clustering Users based on the Similarity Measures
Now we apply the clustering techniques introduced in
section 4 to the users using the distance metrics developed
in the previous subsection. In this subsection we
discuss the results of clustering and show that we can
obtain meaningful clusters using both the straight-forward
distance and the feature-based distance.
Footnote 4: Among all sets, we find the set X such that
$Sim(U, X) = \max_{\forall N} Sim(U, N)$. We then normalize
Sim(U, V) = Sim(U, V)/Sim(U, X) for all sets V.
A good clustering solution must be robust to the
choice of clustering parameters. One way to justify the
choice of clustering threshold is to compare the distributions
of inter-cluster distance and intra-cluster distance.
If the clustering of users is a meaningful one, then
the users that fall into the same cluster should have much
smaller distances between each other, as compared to
users in different clusters. We generate the distance
distributions for both categories. If there is clear separation
between the two distributions, then the clustering
is meaningful.
As a case study, we present the clustering of users
using the feature-based distance, $D'(U, V)$. We consider
the average distance between clusters when we merge
clusters in hierarchical clustering, and we proceed until
the 5,000 users are merged into 200 clusters. In Fig. 7
(a), we show the pdf's for inter-cluster and intra-cluster
distances. We see from the figure that the peaks of
these two pdf's are well separated (i.e., the pdf of intra-cluster
distance is almost zero for distances larger than
0.5, and the pdf of inter-cluster distance is almost zero
for distances smaller than 0.9). We show the cdf's for the
same distributions in Fig. 7 (b). From the figure we
observe that more than 90% of intra-cluster user pairs
have distance smaller than 0.5, and more than 90% of
inter-cluster user pairs have distance larger than 0.9.
We could use a cut-off threshold of around 0.6 to 0.8 to
separate most intra-cluster user pairs from inter-cluster
user pairs. The separation between the two distributions
is clear, hence we have a meaningful clustering
result. Once we find the cut-off threshold, it
can also be applied to evaluate the similarity between
two users: if their distance is lower than the threshold,
they should be considered similar and belong to the
same cluster. The users are hence able to directly judge
whether they belong to the same cluster without global
knowledge of all user distances. We have to emphasize
once again that the actual number of clusters obtained from
the data set is data-dependent, and currently there is no
standard method to tell how many clusters exist in the
data set beforehand. However, with our proposed distance
measure, well-separated clusters can be obtained
from the data set. We also try using the other, straight-forward
distance, D(A, B), and get results similar to
those shown in Fig. 7.
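The merge-until-target procedure and the intra/inter-cluster comparison can be illustrated with a deliberately naive implementation. This sketch is not the algorithm of section 4, whose details are outside this excerpt; it simply merges the pair of clusters with the smallest average distance until a target count is reached, and is only practical for small inputs. All names are hypothetical.

```python
import numpy as np

def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average inter-cluster distance.

    dist is a symmetric matrix of pairwise user distances.
    Returns one integer cluster label per user.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance over all cross-cluster user pairs.
                avg = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

def split_pair_distances(dist, labels):
    """Return (intra-cluster, inter-cluster) distances over all user pairs."""
    i, j = np.triu_indices(len(dist), k=1)
    same = labels[i] == labels[j]
    return dist[i, j][same], dist[i, j][~same]
```

Histograms of the two arrays returned by `split_pair_distances` correspond to the pdf's compared in the text; a gap between the largest intra-cluster and the smallest inter-cluster distances indicates a clean separation.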
We need to further verify that we have indeed grouped
similar users together in those clusters. The verification
process we propose is the following: We compose the
joint association matrix by concatenating the daily
association patterns of a cluster of m similar users in a
larger n-by-md matrix, where n is the number of locations
and d is the number of days. The percentage
of power captured by the top eigenvectors of this joint
association matrix should be high, as the association
patterns in the matrix follow similar trends. On the
other hand, if association patterns of users with different
trends are put in one joint association matrix,
the percentage of power captured by its top eigenvectors
should be much lower. We carry out this verification
process first for some sample clusters. We form three
different joint association matrices: The first joint
association matrix contains the users in one of the clusters
formed with the feature-based distance. There are
20 users in total in this cluster. The second matrix contains
the users in one of the clusters formed with the
straight-forward distance. There are 37 users in total in
this group. For the third matrix we randomly pick 20
users from the whole population and put their association
patterns in the matrix.
Figure 7: Distributions of distances for inter-cluster and
intra-cluster user pairs. (a) Probability density function.
(b) Cumulative distribution function. Legend entries: Inter-cluster, Intra-cluster.
The ratio of cumulative power captured by the top
eigenvectors for these three matrices is shown in Fig. 8.
From the figure we see clearly that both clusters formed
with the distance measures proposed in the last subsection
lead to joint association matrices with low dimensionality.
The power captured by the first eigenvector is above
75% for both matrices. On the other hand, although the
matrix with random users has the same size as the first
matrix, its dimensionality is much higher, as more than
9 eigenvectors are required to capture more than 75% of
its power. Clearly, the users in clusters identified by the
proposed distance measures are indeed similar in their
association trends.
Figure 8: Cumulative ratio of power captured by the
top eigenvectors of joint association matrices for sample
clusters under various user clustering methods. Legend entries:
feature-based, straight-forward, random users.
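The joint-matrix check can be reproduced on synthetic data. In the sketch below the users, the matrix sizes, and the helper names are all made up for illustration: "similar" users spend most of every day at the same location, while "mixed" users each favour a different one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 30   # hypothetical numbers of locations and days

def cumulative_power(M):
    """Cumulative fraction of power captured by the top 1, 2, ... eigenvectors."""
    s2 = np.linalg.svd(M, compute_uv=False) ** 2
    return np.cumsum(s2) / s2.sum()

def joint_matrix(user_matrices):
    """Concatenate the n-by-d matrices of m users into one n-by-(m*d) matrix."""
    return np.hstack(user_matrices)

def synthetic_user(home):
    """Hypothetical user who spends most of every day at location `home`."""
    X = 0.05 * rng.random((n, d))
    X[home, :] += 0.9
    return X

similar = joint_matrix([synthetic_user(2) for _ in range(5)])
mixed = joint_matrix([synthetic_user(h) for h in range(5)])

power_similar = cumulative_power(similar)
power_mixed = cumulative_power(mixed)
```

In this toy setting the first eigenvector of the `similar` matrix captures nearly all of the power, whereas the `mixed` matrix needs several eigenvectors, mirroring the contrast between clustered and random users described above.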
We further carry out a large-scale verification of the
dimensionality of all clusters. Among the 200 clusters
formed, we pick those with more than five users in the
cluster, and check whether these clusters indeed contain
users with similar association patterns. There are
130 such clusters in total when the feature-based distance
is used, and 117 such clusters when the straight-forward
distance is used. We form the joint association matrices for
users in these clusters. At the same time, we also form
random matrices consisting of the same number of randomly
picked users. We compare the dimensionality of matrices
formed with users in the clusters to that of the matrices
formed with random users by plotting the cumulative
power captured in the top four eigenvectors of the matrices
in scatter graphs, Fig. 9. In these graphs, each
dot has as its Y-coordinate the cumulative ratio of power captured by
the top four eigenvectors of the joint association matrix
for users in a cluster, and as its X-coordinate the
cumulative ratio of power captured by the top four eigenvectors
of the corresponding random matrix.
Clearly, most of the dots are above the 45-degree line
regardless of whether the feature-based distance (Fig. 9 (a)) or
the straight-forward distance (Fig. 9 (b)) is used, indicating
that similar users can be found with our definition
of distances between users.
To sum up, in this section we propose two metrics for
distance between users in terms of the similarity of their
association patterns. Such metrics can be utilized to form
well-separated clusters in the user population. This further
implies that the characteristics of the association pattern are a
distinguishing feature of users: they can be classified
into groups with different association pattern features.
Such information can be useful for the network operator
in several ways, as we discuss further in the following
section.
7 Discussion
The eigen-decomposition based representation of user
association patterns can help the network administrator
in several ways. First, due to the low dimensionality
of user association matrices, the network operator
can efficiently store the major trends in user association
patterns. By checking whether the new association patterns
generated by the same user can be represented as a
linear combination of the eigenvectors obtained from the user's
previous association matrix, the network administrator
could determine whether the user has changed its
association behavior. If a user with consistent previous
association behavior suddenly changes its association
behavior, it could be viewed as an abnormal activity:
either it is a significant change in the user's usage pattern, or
a potential ongoing impersonation attack. Depending
on the policy, the administrator may want to look into
the actual reason for the change. Second, if the administrator
wants to deploy some personalized services
in the network (e.g., storing a user's email and files on
separate machines close to the user's frequently visited
locations), the eigenvectors can be utilized to determine
where to store the user's data. The clustering of users
based on similarity of association patterns can further
help the network operator to enumerate users with different
association patterns, find out which type of
association trend dominates the network, and plan accordingly.
Third, although not discussed in detail in
this paper, it is also possible for the network administrator
to incorporate the single-day association patterns
from all users into a matrix, and apply the same singular
value decomposition based grouping technique to
observe the trends of all users on campus for the given
day, and perhaps create some daily norm (i.e., typical
behavior on Monday, Tuesday, etc.) for the network.
Creating such reference data would help the administrator
to understand the users and the network better.
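The first use case, checking whether a new pattern is a linear combination of a user's stored eigenvectors, can be sketched as a projection residual. The sketch assumes each column of the history matrix is one past daily pattern and uses the leading left singular vectors as the stored basis; this choice, the function names, and any alert threshold an operator would apply are assumptions, not details given in the paper.

```python
import numpy as np

def top_eigenvectors(history, k):
    """Orthonormal basis (as columns) of the k leading left singular vectors
    of `history`, whose columns are assumed to be past daily patterns."""
    U, _, _ = np.linalg.svd(history, full_matrices=False)
    return U[:, :k]

def residual_ratio(pattern, basis):
    """Share of the pattern's squared length that lies outside span(basis)."""
    pattern = np.asarray(pattern, dtype=float)
    projection = basis @ (basis.T @ pattern)
    total = float(pattern @ pattern)
    if total == 0.0:
        return 0.0                      # an all-zero (offline) day has no residual
    return float(np.sum((pattern - projection) ** 2) / total)
```

A ratio near 0 means the new day is well explained by the stored eigenvectors; a ratio near 1 means the user appeared at locations the stored trends do not cover, which would be a candidate for the administrator's attention.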
The summary of association patterns may also benefit
the users. For example, a user can profile herself
based on her association pattern, and determine whether
other users are similar to her in this respect. As shown
in section 6, the users have a simpler way to express,
exchange, and compare their association patterns using
the eigenvectors. In particular, the insights developed in
our study are essential in efficient protocol design. Identifying
users with similar association or movement patterns
is useful, for example, in making better forwarding
decisions to deliver packets in an infrastructure-less
network. In [21], the authors proposed a mechanism
in which packets are delivered towards the destination
node by forwarding to nodes with increasing similarity
in movement patterns to the final destination. Our
proposal of summarizing user association patterns with
the eigenvectors and the similarity index proposed in
Eq. (8) provides a good way to serve this purpose. In
[20], the authors proposed to spread multiple copies of a
packet to several intermediate nodes to expedite the delivery.
This multi-copy strategy is better suited when
the sender does not know the exact association pattern
of the destination. In this scheme, it should be better to
spread the copies to nodes with different association pattern
trends to improve the probability that one of these
copy-carriers meets the destination node. Again,
our proposal is a good candidate method to determine
similarity between nodes in this case.
Figure 9: Scatter graph: Cumulative power captured in top four eigenvectors of random joint association matrices
(X) and matrices formed by users in the same cluster (Y). (a) Feature-based distance used for clustering. (b) Straight-forward distance used for clustering.
8 Conclusion and Future Work
In this paper, we systematically unearth the underlying
structure in users' daily association patterns. Through
analysis of WLAN traces of 5,000 users collected on the USC
campus, we are able to identify the overall association
trends of WLAN users in a university campus network,
and answer the three questions posed in the introduction
as follows:
(1) Most WLAN users show multi-modal association behaviors.
The vectors representing their daily association
behavior fall into separate clusters. For each user there
are at least two clusters: one corresponds to the days
the user is offline, and the other corresponds to the days
the user is online. Furthermore, for many users the association
patterns for online days fall into multiple clusters,
suggesting that multi-modal association patterns exist.
(2) Regardless of the multi-modal association patterns,
the association matrix for a user usually has low
dimensionality. This property suggests that usually one
or only a few association patterns dominate the behavior
of the user. Due to the low dimensionality of the association
matrix, it is possible to summarize the matrix with
its eigenvectors and eigenvalues. Such a representation
summarizes the association patterns of the user with
eigenvectors in decreasing importance.
(3) We propose metrics for distances between the
association pattern sets of the users and the similarity
index between the eigenvector sets. These metrics are
utilized to identify users with similar association trends.
The feature-based distance calculated from eigenvectors
leads to a significant saving in calculation due
to the low dimensionality of most users' association
matrices.
We believe that our methodical approach to mining
the WLAN user-association patterns, along with these
findings, paves a new way to understand the structure
in user association patterns in WLANs. Although we
have chosen the USC trace to illustrate the usefulness
of the methodology and metrics, these methods are not
limited by the choice of data set, and can be applied to
study other traces available to the research community.
We intend to apply the techniques to other traces in the
research community (e.g., [7, 8]) as the next step. We
also plan to leverage the understanding of WLAN
user behavior in designing better mobility models and
ad hoc routing protocols.
Finally, this paper is a small first step towards the systematic
mining of information from WLAN traces, especially
user association patterns. We asked basic questions
in mining and, to answer those, we robustly motivated,
defined, and validated the choice of our similarity
metrics, which is the key to most mining tasks. Investigating
better metrics is ongoing work. We are also
working on leveraging more sophisticated clustering
algorithms and statistical tools for validating our metrics.
References
[1] A. Jain, M. Murty, and P. Flynn, "Data Clustering:
A Review," ACM Computing Surveys, vol. 31, no.
3, September, 1999.
13
[2] I.T. Jolli®e, Principal Component Analysis, second
ed., Springer series in statistics, published 2002.
[3] R. Horn and C. Johnson, Matrix Analysis, Cam-
bridge University Press, published 1990.
[4] G. Strang, Linear Algebra and Its Applications,
third ed., Brooks Cole, published 1988.
[5] W. Hsu and A. Helmy, "On Important Aspects
of Modeling User Associations in Wireless LAN
Traces," the Second International Workshop On
Wireless Network Measurement (WiNMee 2006),
April 2006.
[6] W. Hsu and A. Helmy, "On Nodal Encounter Pat-
terns in Wireless LAN Traces," the Second Interna-
tionalWorkshopOnWirelessNetworkMeasurement
(WiNMee 2006), April 2006.
[7] MobiLib: Community-wide Library of Mobility
and Wireless Networks Measurements (Investigat-
ing User Behavior in Wireless Environments). USC
WLAN trace and pointers to many WLAN trace
archives available at http://nile.usc.edu/MobiLib.
[8] CRAWDAD: A Community Resource for
Archiving Wireless Data At Dartmouth.
Many archived WLAN traces available at
http://crawdad.cs.dartmouth.edu/index.php.
[9] R.Jain,D.Lelescu,andM.Balakrishnan,"ModelT:
An Empirical Model for User Registration Patterns
inaCampusWirelessLAN,"inProceedingsofACM
MobiCom 2005, August 2005.
[10] A.Lakhina,K.Papagiannaki,M.Crovella,C.Diot,
E.D.Kolaczyk,andN.Taft,"StructuralAnalysisof
Network Tra±c Flows," ACM SIGMETRICS, New
York, June 2004.
[11] M. Balazinska and P. Castro, "Characterizing Mo-
bility and Network Usage in a Corporate Wireless
Local-Area Network," In Proceedings of MobiSys
2003, May 2003.
[12] M. McNett and G. Voelker, "Access and mobility
of wireless PDA users," ACM SIGMOBILE Mobile
Computing and Communications Review, v.7 n.4,
October 2003.
[13] T.Henderson,D.KotzandI.Abyzov,"TheChang-
ing Usage of a Mature Campus-wide Wireless Net-
work,"inProceedingsofACMMobiCom2004, Sep-
tember 2004.
[14] M.Papadopouli,H.Shen,andM.Spanakis,"Mod-
eling Client Arrivals at Access Points in Wireless
Campus-wide Networks," 14th IEEE Workshop on
Local and Metropolitan Area Networks, Chania,
Crete, Greece, September 2005.
[15] C. J. F. ter Braak, "Principal Components Biplots
andAlphaandBetaDiversity,"Ecology,vol.64,pp.
454-462, 1983.
[16] W. J. Krzanowski, "Between-groups Comparison
of Principal Components," J. Amer. Statist. Assoc.,
vol. 74, pp. 703-707, 1979.
[17] K.YangandC.Shahabi,"APCA-basedSimilarity
Measure for Multivariate Time Series," ACM Inter-
national Workshop On Multimedia Databases, No-
vember 2004.
[18] C.TuduceandT.Gross, "AMobilityModelBased
on WLAN Traces and its Validation," in Proceed-
ings of IEEE INFOCOM, March 2005.
[19] M.Papadopouli,H.Shen,andM.Spanakis,"Char-
acterizing the Duration and Association Patterns of
WirelessAccessinaCampus,"11thEuropeanWire-
less Conference 2005, Nicosia, Cyprus, April, 2005.
[20] T. Spyropoulos, K. Psounis, and C. S. Raghavendra,
"Spray and Wait: An Efficient Routing Scheme
for Intermittently Connected Mobile Networks,"
workshop on delay tolerant networking and related
networks (WDTN-05), August, 2005.
[21] J. Leguay, T. Friedman, and V. Conan, "Evaluat-
ing Mobility Pattern Space Routing for DTNs," in
Proceedings of IEEE INFOCOM, April, 2006.
[22] N. Eagle, "Machine Perception and Learning of
Complex Social Systems," Ph.D. Thesis, Program in
Media Arts and Sciences, Massachusetts Institute of
Technology, June, 2005.
Description
Wei-jen Hsu, Debojyoti Dutta, Ahmed Helmy. "Structural analysis of user association patterns in wireless LAN." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 877 (2006).
Asset Metadata
Creator
Dutta, Debojyoti
(author),
Helmy, Ahmed
(author),
Hsu, Wei-jen
(author)
Core Title
USC Computer Science Technical Reports, no. 877 (2006)
Alternative Title
Structural analysis of user association patterns in wireless LAN (title)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
14 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16269783
Identifier
06-877 Structural Analysis of User Association Patterns in Wireless LAN (filename)
Legacy Identifier
usc-cstr-06-877
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles, CA, 90089
Repository Email
csdept@usc.edu