AN EFFICIENT APPROACH TO CATEGORIZING ASSOCIATION RULES

by

Dongwoo Won

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2010

Copyright 2010 Dongwoo Won
Acknowledgments
First and foremost, I would like to thank my advisor and chair of my committee, Dr. Dennis McLeod, for his inspiring guidance over the past five years. He was a truly exemplary role model for me. I always felt that he had the perfect blend of micro-management and macro-management skills while guiding his students.

I would also like to specially thank two other members of my committee, Dr. Aiichiro Nakano and Lawrence Pryor. I am also very thankful to the remaining members of my committee: Dr. Barry Boehm and Dr. Shahram Ghandeharizadeh.

I was fortunate to be surrounded by many gifted and smart colleagues like Sang Su Lee, Seongwook Yeon, Seokkyung Chung, Hyun Woong Shin, Sangsoo Sung, and Bomi Song, and I am also very thankful for the help and support I received from the other members of the Semantic Information Research Laboratory.

Finally, I would like to thank my parents and my brother, who were a constant source of inspiration and always helped me to believe in myself.
Table of Contents

Acknowledgments

List of Tables

List of Figures

List of Algorithms

Abstract

Chapter 1: Introduction
1.1 Motivation and Objectives
1.2 Hypothesis
1.3 Major Contributions
1.4 Thesis Outline

Chapter 2: Related Work
2.1 Data Mining
2.1.1 Association Rule
2.1.1.1 Problem
2.1.1.2 Approaches
2.1.2 Classification
2.1.3 Clustering
2.2 Data Analysis
2.2.1 Market Basket Data
2.2.2 Radio Frequency Identification
2.3 Ontology

Chapter 3: Preliminaries
3.1 Dataset
3.2 Definition
3.3 Analysis Types
3.3.1 Interestingness Measure
3.3.2 Application-specific Conditions
3.3.3 Rule Conditions

Chapter 4: Approach
4.1 Introduction
4.2 Rule Generalization (RG)
4.3 Rule Categorization (RC)
4.3.1 HARC Algorithm
4.3.2 Relevance
4.3.3 Grouping Rules
4.4 Applications

Chapter 5: Evaluation
5.1 Environment
5.2 Performance
5.2.1 Search Space Reduction
5.2.2 Relevance
5.3 Search Quality
5.3.1 Precision and Recall
5.3.2 Evaluation

Chapter 6: Contribution and Future Challenges
6.1 Contribution
6.2 Future Challenges
6.2.1 Applicability
6.2.2 Performance

References

List of Tables
3.1 Data Format
4.1 Relevance for different Cases
5.1 Results for Rule Generalization (Rule per Item)
5.2 Results of Rule Generalization (Rule per Customer)
5.3 Results for Identifying a Specific Rule
5.4 Results for Identifying Related Items

List of Figures
2.1 Knowledge Discovery in Database processes
3.1 Example of Calculating Interestingness Values
3.2 Part of an Item Ontology
4.1 System Overview
4.2 Steps for Rule Generalization
4.3 Overview of Rule Categorization
4.4 Steps for Rule Categorization
4.5 Structure for Rule Tree Implementation
4.6 Steps for Relevance Calculation
4.7 Grouping Rules using Relevance
5.1 Graph for Number of Rules per Items
5.2 Graph for Number of Rules per Customer
5.3 Graph for precision per items
5.4 Graph for recall per items
5.5 Graph for f-measure per items

List of Algorithms
1 Hierarchical Association Rule Categorization
2 groupRules() Function
Abstract

The application of association rules, which specify relationships among large sets of items, is a fundamental data mining technique used for various applications. In this dissertation, we present an efficient method of using association rules for identifying rules from a stream of transactions consisting of a collection of items purchased, referred to as market basket data. A common problem encountered with market basket analysis is that it results in a number of weakly associated rules that are of little interest to the user. To mitigate this problem, we propose an efficient approach to managing the data so that only a reasonable number of rules need to be analyzed. First, we apply an ontology, a hierarchical structure that defines the relationships among concepts at different abstraction levels, to minimize the search space, thereby allowing the user to avoid having to search the large original result set for useful and important rules. Next, we apply a novel metric called relevance to categorize the rules using the Hierarchical Association Rule Categorization (HARC) algorithm, an algorithm that efficiently categorizes association rules by searching the compact generalized rules first and then the specific rules that belong to them, rather than scanning the entire list of rules. The efficiency and effectiveness of our approach are demonstrated in our experiments on high-dimensional synthetic data sets.
Chapter 1

Introduction

1.1 Motivation and Objectives

This work describes an efficient approach for grouping association rules using an ontology and a metric called relevance for data sets consisting of a large number of collections of objects, focusing on sets of market basket data. Market basket data sets, artificially generated transactional data sets represented by a set of items purchased, are characterized by the large dimensions of their attributes, the sparseness in the distribution of their data, and the large number of their outliers. The most common way of mining market basket data is association rule mining, an important data mining technique used in various applications for identifying the correlation between items in shopping baskets. An association rule is an expression symbolized by X ⇒ Y, where X and Y are sets of data items sometimes called attributes. The intuitive meaning of the rule is that transactions in the data that contain the items in X tend to also contain the items in Y. An association rule algorithm helps in identifying patterns that relate items from transactions. When analyzing transaction records from the market in market basket analysis, association rule algorithms can be used to identify different purchasing behaviors and patterns in order to design an optimal goods arrangement that maximizes sales.

However, [1, 2, 29, 32, 37, 55, 58, 66] have identified several challenges with mining association rules from market basket data. A well-known challenge in association rule generation is that the number of rules grows exponentially as the number of attributes increases, making it difficult to determine which rules are most useful, interesting, and important. To overcome this problem, attempts have been made to derive new similarity measures and apply new algorithms. [4] has constructed an index structure along the similarity function, [60] uses a highly efficient graph-partitioning algorithm, and [75] uses the so-called "small-large ratio" to perform clustering. We address this problem from a different angle by using a grouping method, an ontology, and a metric called relevance to use association rules to identify relevant knowledge from market basket data. The goal of our approach is to provide users such as marketers or researchers with high-quality rules, along with a faster and more efficient way to search for relevant rules.
1.2 Hypothesis

By using a conceptual taxonomy, which defines the generalization relationships for concepts at the different abstraction levels utilized to minimize the search space, a user can avoid having to search a large original result set for interesting rules. By clustering association rules, the user can see categorized relevant association rules.

1.3 Major Contributions

Our approach uses a conceptual taxonomy (i.e., an ontology) to generalize and reduce the number of association rules for high-dimensional data. By doing so, our approach produces a smaller search space of more relevant clusters of rule sets that are easier to understand and more appropriate for marketing purposes compared with the results obtained using prior techniques. Our approach thus allows a user to search a smaller number of rules instead of the entire rule set to identify interesting rules, therefore increasing the speed of the search process. Due to the relevance metric, our approach also provides additional information when searching for rules compared to other approaches, which may produce results that omit important and interesting rules.
1.4 Thesis Outline

The remainder of this work is organized into five chapters:

• Chapter 2 discusses related research and describes how our approach differs from previous approaches.

• Chapter 3 discusses the specific applications and definitions of the concepts relevant to our approach.

• Chapter 4 describes specific rule and search conditions and the algorithms used for generalizing and categorizing rules.

• Chapter 5 illustrates an experimental evaluation of our algorithm and evaluates its feasibility and usefulness.

• Chapter 6 recommends future research and describes the contributions of this work.
Chapter 2

Related Work

2.1 Data Mining

Data mining, the process of extracting interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from large information repositories, is one of the three core processes of the Knowledge Discovery in Databases (KDD) process, which is shown in Figure 2.1 [17]. The first KDD process is pre-processing the data sets by cleaning, integrating, selecting, and transforming them. The second process is the main process, during which different algorithms are applied to produce the knowledge. The last process is post-processing the data to evaluate the mining results according to user requirements and domain knowledge. The specific processes are as follows:

1. Clean and integrate the data sets. The data sets may come from different sources, leading them to have inconsistencies and duplications that must be eliminated or integrated. When different databases use different words to refer to the same thing, one word must be chosen, and when data sets are incomplete and noisy due to manual input mistakes, they must be cleaned. The integrated data sources can then be stored as data repositories.

2. Select and transform data. The database may contain customer information, item information, time, and prices. If the goal is to identify the relationships among the items, only item information is needed. As a result, the data set becomes much smaller and the process more efficient.

3. Apply data mining techniques. Different data mining techniques (e.g., classification, clustering) are applied to the data set, and knowledge is produced as a result.

4. Evaluate the results. Evaluate the knowledge according to domain knowledge or concepts. If the results do not satisfy the requirements, redo the mining process or modify the user requirements.

5. Visualize the results. Try to make the data mining results easier to use and understand.
Figure 2.1: Knowledge Discovery in Database processes

Data mining is divided into the subtypes of descriptive mining and predictive mining. Descriptive mining is used to summarize or characterize general properties of data from data sources, while predictive mining is used to draw inferences from current data or make predictions based on past data. Of the various types of data mining techniques, such as association rule mining, classification, and clustering, this study focuses on association rule mining.

2.1.1 Association Rule

Association rule mining, one of the most important and well-researched data mining techniques, was first introduced by [1]. Association rule mining aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories. Association rules are widely used in various areas, such as telecommunication networks, market and risk management, and inventory control. Association mining techniques and algorithms are briefly introduced in the sections below.
2.1.1.1 Problem

The formal statement of the association rule-mining problem was first stated by [1]. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of $m$ distinct attributes, $T$ be a transaction that contains a set of items such that $T \subseteq I$, and $D$ be a database with different transaction records $T$. An association rule is an implication in the form of $X \Rightarrow Y$, where $X, Y \subset I$ are sets of items called itemsets, $X \cap Y = \emptyset$, $X$ is an antecedent, $Y$ is a consequent, and the rule means $X$ implies $Y$.

The two basic parameters of association rule mining are support and confidence. Since the database is large and users are concerned with only frequently purchased items, the thresholds of support (minimal support) and confidence (minimal confidence) are usually predefined by users to drop those rules that are not interesting or useful.
The support of an association rule is defined as the percentage of records that contain $X \cup Y$ relative to the total number of records in the database. The support count does not take the quantity of an item into account. For example, we would only increase the support count of beer by one even if a customer were to buy three bottles of beer in a transaction. Support is calculated using the following formula:

$$\mathrm{Support}(XY) = \frac{\text{Support count of } XY}{\text{Total number of transactions in } D}$$
From the definition above, support indicates the statistical significance of an association rule. If the support of an item is 0.5%, it means that only 0.5% of the transactions contain this item. Users will typically disregard such rules, so a high support percentage is desired for more interesting association rules. Usually, users can initially specify the minimum support level as a threshold so that all rules below the threshold value are ignored. However, in many cases generated rules are important even though they are below the threshold value.

The confidence of an association rule is defined as the percentage of transactions that contain $X \cup Y$ relative to the total number of records that contain $X$; if the percentage exceeds the confidence threshold, an interesting association rule $X \Rightarrow Y$ can be generated. Confidence is calculated using the following formula:
$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(XY)}{\mathrm{Support}(X)} = P(Y \mid X)$$
If the confidence of the association rule X ⇒ Y is 80%, it means that 80% of the transactions that contain X also contain Y. As with support, users can initially specify a minimum confidence level as a threshold so that all rules below the threshold value are ignored.

The lift of an association rule indicates how many times more often X and Y occur together than expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare item problem, but it is susceptible to noise in small databases. Rare itemsets with low counts (low probability) that occur a few times (or only once) can together produce enormous lift values [10]. Lift is calculated using the following formula:

$$\mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Confidence}(X \Rightarrow Y)}{\mathrm{Support}(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)}$$

Association rule mining aims to identify association rules that meet pre-defined minimum support and minimum confidence values [2]. The first task is identifying itemsets whose occurrence exceeds a predefined threshold in the database, which are called frequent or large itemsets. The second task is generating association rules from those large itemsets within the constraints of minimal confidence.
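As a concrete illustration of the three measures, the short Python sketch below computes support, confidence, and lift over a toy list of transactions; the data and the set-based rule encoding are our own illustration, not taken from the dissertation.

transactions = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese"},
    {"milk"},
    {"bread", "cheese", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(x, y, transactions):
    # Support(X u Y) / Support(X).
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    # Confidence(X => Y) / Support(Y).
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"bread", "cheese"}, {"milk"}
print(support(x | y, transactions))   # 0.5
print(confidence(x, y, transactions)) # 0.666...
print(lift(x, y, transactions))       # 0.888... (< 1: negatively correlated here)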
2.1.1.2 Approaches

This section briefly describes milestones in the development of association rule mining algorithms.

The AIS algorithm, the first algorithm used for association rule mining, was introduced by [1]. Using this algorithm, only one-item consequents of association rules are generated, such that it only generates rules such as X ∪ Y ⇒ Z but not rules such as X ⇒ Y ∪ Z. The main drawbacks of this algorithm are that it requires too many passes over the entire database and generates too many small candidate itemsets, wasting space and effort.

The Apriori algorithm was first introduced by [2] as being more efficient during the candidate generation process than AIS. The Apriori algorithm employs a different candidate generation method and a new pruning technique according to the following steps:
1. Scan the transaction database to obtain the support S of each 1-itemset, compare S with min_sup, and obtain a set of frequent 1-itemsets, $L_1$.

2. Join $L_{k-1}$ with $L_{k-1}$ to generate a set of candidate k-itemsets, then use the Apriori property to prune the infrequent k-itemsets from this set.

3. Scan the transaction database to obtain the support S of each candidate k-itemset in the final set, compare S with min_sup, and obtain a set of frequent k-itemsets, $L_k$.

4. If the candidate set is not null, return to step 2; otherwise, go to step 5.

5. For each frequent itemset l, generate all nonempty subsets of l.

6. For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if the confidence C of the rule "s ⇒ (l − s)" (= support S of l / support S of s) ≥ min_conf.
Many algorithms based on the Apriori algorithm have been suggested to improve efficiency, though it still has a complex candidate generation process that consumes most of the time, space, and memory. It also scans the database multiple times.
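The following Python sketch is a minimal, unoptimized rendering of steps 1-4 above (frequent itemset generation only; the rule-output steps 5-6 are omitted). It is meant to make the join-and-prune loop concrete, not to reproduce any reference implementation.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    # Steps 1-4: grow frequent k-itemsets level by level.
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    # Step 1: frequent 1-itemsets L1.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = set(level)
    k = 2
    while level:
        # Step 2: join L(k-1) with itself, then prune by the Apriori
        # property (every (k-1)-subset must itself be frequent).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Steps 3-4: keep candidates meeting min_sup; stop when none remain.
        level = {c for c in candidates if support(c) >= min_sup}
        frequent |= level
        k += 1
    return frequent

ts = [{"bread", "cheese", "milk"}, {"bread", "cheese"}, {"bread", "milk"}]
print(apriori_frequent_itemsets(ts, min_sup=0.5))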
The FP-Tree algorithm, which was proposed by [30, 31], uses a tree structure to mine association rules and overcomes the drawbacks of the Apriori algorithm when generating frequently appearing itemsets. First, the FP-Tree is a compressed representation of the original database, because only the frequently appearing items are used to construct the tree, with irrelevant information being pruned. By ordering the items by their supports, the overlapping parts appear only once, with a different support count. Second, it only scans the database twice, decreasing the computation cost dramatically. Third, it uses a divide-and-conquer method that considerably reduces the size of subsequent conditional FP-Trees, with longer, frequently appearing patterns generated by adding a suffix to the shorter frequently appearing patterns. However, the FP-Tree algorithm has difficulties with interactive systems and incremental mining. Users cannot change the support threshold or insert a new data set, which may lead to repetition of the entire mining process, unlike our approach, in which users can set various constraints.
In real applications, it is difficult to identify strong associations between data items at low or primitive levels of abstraction due to the sparseness of data within multidimensional space. Multiple-level association rule mining tries to mine strong association rules among and between different levels of abstraction. Usually, items in a database can be classified into different concept levels according to the knowledge of a corresponding concept hierarchy, which may be provided in the database or by domain experts. [28, 29, 53] addressed the problem of multiple-level association rule mining using a conceptual taxonomy for generalizing primitive-level concepts to higher-level concepts. However, simply applying a concept hierarchy leads to difficulty in identifying useful rules. To overcome such a problem, our approach tries to generalize rules instead of items, which helps avoid overlooking important rules, and supports cross-level rules; that is, it uses items from different concept levels in a taxonomy for generating rules.

Another concern is the mining of association rules within single-attribute and Boolean data, where all rules concerning the same attribute and value can only be yes/1 or no/0. By mining multi-dimensional association rules, we can generate rules such as age (X, "20-29") ∧ occupation (X, "student") ⇒ buys (X, "laptop"), rather than rules in a single dimension, such as buys (diaper) ⇒ buys (beer). Multi-dimensional association rule mining allows discovery of the correlation between different attributes. The mining process is similar to the process of multiple-level association rule mining. First, the frequent 1-dimensions are generated, and then all frequent dimensions are generated using the Apriori algorithm. As discussed earlier, an ontology can support different dimensions through various relationships between different hierarchies.
The last concern regarding association rule mining is whether a user can view only the needed and useful rules. The Apriori algorithm uses the minimal support and confidence values as basic constraints to generate rules. [10], on the other hand, has created an interestingness measure other than support and confidence values, called conviction, as a new constraint. [18, 37] introduced meta-rule mining, in which a template specifies the format of an interesting set of rules, with the algorithm only generating rules that fit the format. For example, for a rule template "X & Y ⇒ Z," only rules with three items in that format can be produced, e.g., "bread & cheese ⇒ milk," "bagel & Pepsi ⇒ Coke," etc. Our approach includes templates of rules as well as a minimal relevance value constraint to select a compact set of rules that is more relevant to the user.
2.1.2 Classification

Classification consists of building a model that can classify a group of objects in a manner that predicts the classification or missing attribute value of future objects. Based on the training data set, a model is created to describe the characteristics of a set of data classes or concepts during the first process (supervised learning). Next, the model is used to predict the classes of future objects or data. There are many classification techniques available, such as decision-tree classification, Bayesian classification, nearest-neighbor methods, neural networks, and many other forms of machine learning.
2.1.3 Clustering

Clustering, the process of grouping a set of physical or abstract objects into classes by their similarity to each other and their dissimilarity from other objects, is used to reduce the number of rules by grouping a large number of rules into categories. In clustering, objects are defined by similarity functions that are usually quantitatively specified in terms of distance or another measure by corresponding domain experts. Since so many rules are generated, attempts have been made to reduce the size of large data sets to allow easier analysis. Subspace clustering is an extension of traditional clustering that seeks to identify clusters in different subspaces within a data set. Subspace clustering can localize the search for relevant dimensions by allowing identification of clusters that exist in multiple, possibly overlapping subspaces [3, 41, 50]. There have been other attempts to use the clustering technique for grouping association rules to overcome the problem of having to view many unrelated association rules. [26] proposed a new distance metric between two association rules that uses agglomerative clustering of association rules to obtain a more concise and abstract description of the data. [6, 64, 70] introduced the idea of rule cover to prune rules and used clustering for grouping association rules. [39] suggested using clustering for two-dimensional association rules. However, unlike our approach, these approaches do not support multiple levels of association rules. Recently, AT [78] and CBA [76] introduced taxonomy-based item clustering to overcome the above drawbacks. As previously discussed, using a concept hierarchy (taxonomy) can support multi-level association rule mining. However, clustering items and then generating association rules using only those clustered items can restrict the rules to contain only the selected items.

To support multi-dimensional space, our approach generates association rules first and then clusters the rules. Clustering association rules can categorize relevant association rules, enabling the user to search a smaller number of rules instead of the entire rule set for identifying interesting rules. Our proposal also uses a new similarity measure called relevance to group rules. Generally, in computer science, relevance is defined as the capability of a search engine or function to retrieve data or information appropriate to a user's needs [63, 72, 73]. In our approach, however, relevance refers to the similarity among data (rules) themselves rather than the similarity between the data and the user. Therefore, we concentrated on designing a new metric for the distance-based similarity calculation among rules [26]. To date, we have been able to calculate the relevance value between pairs of rules, and will enhance the technique to enable calculation over multiple rules.
2.2 Data Analysis

Data mining techniques can be applied to various types of data. Our approach supports the following data sets.
2.2.1 Market Basket Data

Market basket analysis, one of the most common and useful types of data analysis for marketing purposes, aims to determine which products customers purchase together. It takes its name from the idea of customers throwing all their purchases into a shopping cart (a "market basket") during grocery shopping. Knowing what products people purchase as a group can be very helpful to a retailer or any other company. A store could use this information to place products frequently sold together in the same area, while catalog or online merchants could use it to determine the layout of their catalogs and order forms. Direct marketers could use basket analysis results to determine which new products to offer their current customers.

Bar code technology has made it possible to store the items purchased on a per-transaction basis. Market basket data is synthetic transactional data, in most cases sales records, which is known to be of high dimensionality and sparsity and to have massive outliers. Data mining of market basket data focuses on the mining of association rules to identify the correlation among items in a shopping basket.

The limitations of market basket data mining are that it requires a large number of real transactions to obtain meaningful data, and that the accuracy of the data is compromised if all of the products do not occur with similar frequency. The problem of mining association rules from market basket data has been studied by [1, 2, 29, 37, 58]. As discussed earlier, the use of taxonomies or ontologies, as in our approach, can overcome these limitations.
2.2.2 Radio Frequency Identification

Spatial databases usually contain not only traditional data but also the location or geographic information about the corresponding data. Spatial association rules describe the relationship between one set of features and another set of features in a spatial database. In the example "Most business centers in Singapore are around City Hall," the spatial operations used to describe the correlation can be within, near, next to, etc.

Radio Frequency Identification (RFID) is a wireless technology currently receiving much attention from many organizations. RFID can identify objects using radio frequency by storing customized information on RFID tags, which consist of an antenna that detects radio waves and responds with signals, and a chip that stores and manipulates data. The data can then be read with an RFID reader, which recognizes the stored information. Unlike a bar code, which requires contact with the reader, an RFID tag does not require line of sight for communication, increasing the speed of reading, enabling multiple readings, and decreasing cost. It is also possible to modify the stored information and track the location of the tags.

Recent research on RFID challenges has focused on three big issues: privacy and security, data standards, and large data management. Privacy is needed because RFID tags enable tracking of items or customers without their knowledge. [46] discusses the necessity of privacy for every stakeholder involved in the deployment of RFID technology. [44] describes a specific technology used to protect privacy using trusted computing by splitting the RFID reader. A call for RFID standards to unify different tag data, application-device communications, and device-tag communications is receiving wide attention due to the immaturity of the RFID market and the few existing RFID standards [20, 54, 67].

Of all the challenges, the generation of large sets of data by RFID readers is the greatest. The generated data has to be filtered to remove any duplicate or redundant data, and the consolidated data has to be stored in databases or data warehouses for effective use. Past work on RFID application-specific issues, such as the existence of noisy and duplicate readings in large-volume real-time RFID data streams, has been discussed by [7]. A temporal data model is proposed by [69] that supports RFID data tracking and monitoring. [21, 22, 23] have presented a model for warehousing RFID data to support high-level analysis in multidimensional space. Although our model is application-specific, it supports large RFID data sets, removes redundant data using ontology-based rule generalization, and categorizes rules using hierarchical association rule clustering [71, 72].
2.3 Ontology

Our work uses ontologies to generalize and reduce the item set to produce fewer but more closely associated rules. An ontology, the formal explicit description of concepts in a domain and the relationships among them [24], together with a set of concept instances constitutes a knowledge base. In reality, there is a fine line between where the ontology ends and the knowledge base begins. Ontologies are an important factor in data mining [47]. Using an "interest ontology" to improve the support value for association rule mining has been proposed by [13], and related work regarding ontologies and clustering has been conducted by [9, 14]. [41] introduced an ontology for subspace clustering for generating clusters and minimizing redundancy by pruning. Our domain ontology defines the generalization relationships for the concepts at the different abstraction levels that are utilized to minimize the search space, allowing the user to avoid having to search the large original result set for interesting rules.
Chapter 3

Preliminaries

3.1 Dataset

The data set that we used in our experiment is an extension of aicomponents.com's supermarket data set, a kind of market basket data [5]. The structure of the data set is presented in Table 3.1. Each transaction consists of customer information, such as age, address, gender, name, etc.; basket information, such as size, time, location, etc.; and all the items that a customer has bought. The data set consists of 600 rows for customers and 40 columns of different items. When a customer purchases an item, the value for the corresponding item and customer is set to 1.

In order to evaluate our method properly, we divided the data set into 10 different data sets. To test the results, we changed the number of items for different data sets with a fixed number of 10 customers and named the data sets c10_i10, c10_i20, c10_i30, and c10_i40 (c stands for customer and i for item). We also changed the number of customers for different data sets with a fixed number of 20 items and named the data sets m_c10_i20, m_c20_i20, m_c50_i20, m_c100_i20, m_c300_i20, and m_c600_i20 (m, which stands for modification, was added to differentiate this set from the previous data sets).
Table 3.1: Data Format

        Items               Customer             Basket
TID   pork  beef  ...    age  address  ...    time  location  ...
 1     1     0    ...    28   90036    ...    8am   90028     ...
 2     0     0    ...    45   40098    ...    9pm   49012     ...
 3     1     1    ...    22   60001    ...    4pm   60980     ...
...   ...   ...   ...   ...   ...      ...    ...   ...       ...
3.2 Definition

The following are definitions of the components of our approach.

Definition 1. A supermarket consists of sections S and items i that belong to those sections, i.e., i ∈ S.

Definition 2. A cart Cart_i in a supermarket can contain an Itemlist. A cart can be compared to one transaction T, i.e., Cart_i = T = {Itemlist}.

Definition 3. An Itemlist contains items I_j and also includes basket and customer information, i.e., Itemlist ⊃ (customer information, RELATION, I_j, basket information), where RELATION represents predicates such as "buys", "purchases", "at", "from", etc.
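Definitions 1-3 can be pictured with a small Python sketch; the type and field names below (Item, Cart, age, time, etc.) are our own illustration, not part of the definitions.

from dataclasses import dataclass, field

@dataclass
class Item:
    name: str
    section: str            # Definition 1: every item i belongs to a section S

@dataclass
class Cart:                 # Definition 2: a cart corresponds to one transaction T
    items: list[Item] = field(default_factory=list)
    customer: dict = field(default_factory=dict)   # Definition 3: customer info
    basket: dict = field(default_factory=dict)     # Definition 3: basket info
    relation: str = "buys"  # RELATION predicate: "buys", "purchases", "at", ...

cart = Cart(items=[Item("pork", "Meat")],
            customer={"age": 28}, basket={"time": "8am"})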
3.3 Analysis Types

As indicated in the previous section, our architecture consists of both a user and a system component. The user handles the pre/post-process stage, while the system manages the detailed algorithms. Generally, the number of rules is too large for a user to make an efficient and relevant search. Unlike previous works, we support data and rule selection conditions during the pre/post-process stage. We not only support (1) the interestingness value for data selection, but also provide (2) application-specific conditions to select more relevant data. During the post-process, we support (3) rule conditions to obtain more relevant and compact results for searching. Lastly, in our main system, we generate, generalize, and then categorize rules to group the rules for efficient analysis.
3.3.1 Interestingness Measure

Like other approaches, we support use of the interestingness value to reduce the number of generated rules. Listed below are brief descriptions of the interestingness measures we use:

• Minimum support: A minimum support threshold is used to select the most frequent item combinations, which are called frequent itemsets. In the equation below, the support value indicates how often the items in X and Y occur together in the same transaction as a fraction of the total number of tuples. Suppose we want to identify the support value for bread, cheese, and milk, as in Figure 3.1. The total number of transactions is 10, and bread, cheese, and milk occur together 3 times, so the support value is 3/10.

$$\mathrm{Support}(XY) = \frac{\text{Support count of } XY}{\text{Total number of transactions in } D} \tag{3.1}$$
• Minimum confidence: A minimum confidence threshold is used to exclude rules that are not strong enough to be interesting. In the equation below, the confidence value indicates how frequently X and Y occur together as a fraction of the number of tuples in which X occurs. In the example in Figure 3.1, the confidence value of bread, cheese, and milk can be calculated as support of XY over support of X. Support of XY is 0.3, and support of X can be calculated as the support of bread and cheese, which is 4/10, so the final confidence value is 0.3/0.4.

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(XY)}{\mathrm{Support}(X)} \tag{3.2}$$

Users can specify a minimum threshold value so that only rules above the specified value are generated, in order to produce fewer rules.
Figure 3.1: Example of Calculating Interestingness Values
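The arithmetic of this example can be checked with a few lines of Python; the counts 3 and 4 are taken from the Figure 3.1 example above.

# Out of 10 transactions, {bread, cheese, milk} occurs 3 times
# and the antecedent {bread, cheese} occurs 4 times.
n_transactions = 10
count_xy = 3   # bread & cheese & milk together
count_x = 4    # bread & cheese

support_xy = count_xy / n_transactions   # 3/10 = 0.3  (Eq. 3.1)
support_x = count_x / n_transactions     # 4/10 = 0.4
confidence = support_xy / support_x      # 0.3/0.4 = 0.75  (Eq. 3.2)
print(support_xy, confidence)            # 0.3 0.75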
3.3.2 Application-specific Conditions

We support the use of application-specific conditions to select a data set. A transactional database normally includes information such as customer or basket information besides item information. Most previous research only considered the relationships between items, since other information, such as customer, time, and location information, changes from application to application. However, we wanted to make use of all information to generate more relevant and interesting itemsets for the user. To do so, we created application-specific concepts and their instances as ontologies. Using an ontology, which provides a way of representing information or knowledge that includes the key concepts and the relationships among them [24, 61], enables our approach to be highly scalable and extensible. It is always possible to further extend our item ontology by integrating other domain ontologies to generate new ontologies. As described earlier, other item information as well as other customer and basket information can be integrated into our existing ontology. Because of these ontology characteristics, relationships such as "buy," "purchase," "at," and "from" can be created, and constraints can be used to specify more complex semantics. The integration of ontologies is an active area of research, and many methods and approaches to support it have been developed by [11, 52, 56, 68].

Ontology creation requires domain experts and knowledge experts to identify concepts and the most relevant relationships among them [61]. Our domain ontology was mainly created manually because the size of our shopping cart ontology is small. Occasionally, a small ontology is helpful for obtaining global agreement among people [15], and there is no automatic ontology creation tool available to generate large ontologies. Our ontology was designed in OWL [49] and implemented in Jena [34]. The implementation allows us to query or search for an item in the ontology easily. Our item domain ontology consists of 5 levels and 85 nodes, including the root node. We call the second-level nodes "category nodes", the lowest nodes "item nodes", and the remainder "sub-category nodes". Part of the item ontology used throughout our approach is shown in Figure 3.2. The code mapped to each item node is shown next to each item name for convenience.

Figure 3.2: Part of an Item Ontology
3.3.3 Rule Conditions

Rule conditions help reduce the search space for generated rules and allow users to view the most relevant rules. The first condition is to search by the form of the rule, filtering out rules that do not match. If a user can specify the form of rules by including a wild card (*), it is possible for the user to analyze only a limited number of rules. For example, a user can specify the form template as "A ⇒ B & C" or "A ⇒ B & *" to view only rules that match that format. The second condition is to specify a list of items. For example, a user can say "show rules that include bread and cheese" to view only rules that include those items. The third condition is to use a new measure called relevance, a similarity metric that indicates the degree of closeness among rules based on the number of common items and the similarity in the form of the rules. The actual relevance value varies from 0 to n, where n is the number of literals per rule; it is then normalized to range between 0 and 1 for convenience. At the beginning of an analysis, a user can specify the minimum relevance threshold value. A more detailed discussion regarding relevance is provided later in this dissertation.
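The first two rule conditions amount to simple filters. The sketch below is one hypothetical way to express them in Python; the (antecedent, consequent) tuple encoding of rules is our own, not the dissertation's.

def matches_template(rule, template):
    """Template has the same shape as a rule, with '*' as a wild card,
    e.g. (('A',), ('B', '*')) matches any rule A => B & <anything>."""
    for got, want in zip(rule, template):
        if len(got) != len(want):
            return False
        if any(w != '*' and w != g for g, w in zip(got, want)):
            return False
    return True

def contains_items(rule, items):
    """Second condition: keep rules that include all the given items."""
    antecedent, consequent = rule
    return set(items) <= set(antecedent) | set(consequent)

rule = (("A",), ("B", "C"))                          # A => B & C
print(matches_template(rule, (("A",), ("B", "*"))))  # True
print(contains_items(rule, ["B", "C"]))              # True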
Chapter 4

Approach

4.1 Introduction

Figure 4.1: System Overview

Figure 4.1 shows the general overview of our Hierarchical Association Rule Categorization (HARC) system. The HARC system works as follows: First, the user collects a large data set, such as a market basket data set, web usage data set, or genome data set, from various sources. Next, the user generates association rules using the Apriori [1] algorithm, a well-known and widely used algorithm. Next, using a domain ontology, the user merges and generalizes the original rules into multi-level association rules. To generalize means to increase the concept level of an item. For example, the item cheese bagels may be generalized to the concept level baked food. Therefore, the result will be a generalized rule and the original rules that belong to it. The number of rules in the rule generalization box indicates how many original rules are within that generalized rule. The final step is to categorize the original rules using the relevance measure. Using the analogy of a tree, the root node is the generalized rule and the original rules that belong to the generalized rule are the item nodes. Lastly, the user can search or analyze the data by specifying various conditions or restrictions that fit the user's needs.
4.2 Rule Generalization (RG)

This section examines the detailed steps for rule generalization, the process shown in Figure 4.2. From the given data, we generate a very specific set of association rules using the Apriori algorithm, as shown in step (1). Next, we encode the items using an item symbol table, which includes the codes for a current item node, a parent item node, and a top item node. A top item node is the item node into which a current item node is to be generalized. In step (2), all the items are encoded with a current item node code. Then, we generalize the rules using the top item code. Step (3) shows the generalized form of the rules. In our last step, we merge duplicate rules into one rule and store the original rules and their frequency under the generalized rule. For example, "Soda & Baked Food ⇒ Dairy & Meat" has the original rules "Diet Cherry Coke & Cherry Bagels ⇒ Mozzarella & Pork Rib" and "Pepsi ONE & Cheese Cake ⇒ 3% low Fat & Pork Rib" with a frequency of 2. Lastly, we convert the codes back to the original codes for the sake of convenience. The idea behind our approach is that higher-level concepts should be easier for a user to understand. By searching the generalized rules first and then the detailed rules that belong to (are within the category of) the generalized rules, users can search rules in an easier and faster manner.

Figure 4.2: Steps for Rule Generalization
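As a sketch of steps (2)-(4), the following Python fragment lifts each item to its top node via a hypothetical parent_of map (a stand-in for the item symbol table and ontology, using the example items above) and merges duplicate generalized rules while recording their originals and frequency.

parent_of = {
    "Diet Cherry Coke": "Soda",    "Pepsi ONE": "Soda",
    "Cherry Bagels": "Baked Food", "Cheese Cake": "Baked Food",
    "Mozzarella": "Dairy",         "3% low Fat": "Dairy",
    "Pork Rib": "Meat",
}

def generalize(rule):
    # Lift every item of an (antecedent, consequent) rule to its top node.
    lhs, rhs = rule
    return (tuple(sorted(parent_of[i] for i in lhs)),
            tuple(sorted(parent_of[i] for i in rhs)))

rules = [
    (("Diet Cherry Coke", "Cherry Bagels"), ("Mozzarella", "Pork Rib")),
    (("Pepsi ONE", "Cheese Cake"), ("3% low Fat", "Pork Rib")),
]

# Merge duplicates: each generalized rule keeps its originals and frequency.
groups = {}
for r in rules:
    groups.setdefault(generalize(r), []).append(r)
for g, originals in groups.items():
    print(g, "frequency:", len(originals))  # both originals fall under
                                            # Soda & Baked Food => Dairy & Meat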
4.3 Rule Categorization (RC)

We have seen that rule generalization helps significantly reduce the number of rules that must be evaluated. However, even when using rule generalization, the search space remains large and the relationship among the selected rules remains complex. Therefore, we propose the use of an efficient approach to categorizing rules by their relevance value. The key advantage of this approach is that once the original rules have been clustered under a generalized rule, such as "Soda & Baked Food ⇒ Dairy & Meat," as shown in Figure 4.3, we do not need to re-scan the entire rule set each time for analysis. Instead, we select a generalized rule and work only with the rules that belong to it. Searching only a category instead of the entire rule set can be of great benefit to users by increasing the efficiency of the analysis (see Tables 5.1 and 5.2). As shown in Figure 4.3, when rules are categorized using the relevance measure to reduce the number of steps required for scanning, we no longer need to scan the entire list of rules, only the levels of the rules.
4.3.1 HARC Algorithm

Figure 4.3: Overview of Rule Categorization

In this section, we illustrate our approach using the HARC algorithm, with a list of generalized rules and original rules as the inputs and the created categories as the outputs. After performing rule generalization, we have a collection of original rules organized under the generalized rule to which they belong. For example, in Figure 4.3, rules R_1, R_2, and R_3 belong to the generalized rule "Soda & Baked Food ⇒ Dairy & Meat." For the same relevance value, we group rules together until no rule with the same relevance value remains. Further explanation regarding the relevance measure and the groupRules() function appears in Section 4.3.2 of this dissertation. The main idea behind the HARC algorithm is to recursively group rules according to the relevance value. The complete process is shown later in this section.
Figure 4.4 shows the rough data structure of the HARC algorithm and provides an overview of rule categorization. First, generalized rules G_1, G_2, ..., G_n are stored as keys in a hash table, and original rules R_1, R_2, ..., R_n are stored as values. Hash tables are used for the sake of efficiency, as we need frequent access to specific rules. Using hash tables, accessing a value is performed within a time complexity of O(1), which makes them very efficient.

Figure 4.4: Steps for Rule Categorization
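A minimal sketch of this first step, using a Python dict as the hash table; the rule strings are placeholders of our own.

rule_table = {
    "Soda & Baked Food => Dairy & Meat": ["R1", "R2", "R3"],
    "Soda & Snack => Dairy": ["R4", "R5"],
}

# Select one generalized rule and work only with the rules that belong to it,
# instead of re-scanning the entire rule set; the lookup itself is O(1).
for original in rule_table["Soda & Baked Food => Dairy & Meat"]:
    print(original)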
The second step is to categorize the rules that belong to a generalized rule. Relevance is measured between rules. Rules are iteratively grouped by the same relevance value until one root node is left, creating a hierarchical cluster tree. The HARC algorithm is shown in Algorithm 1. R_1, R_2, ..., R_n are categories for the rule set. Each category contains a tree of sub-categories (clusters), where each non-leaf sub-category is the unification of all its child nodes and each leaf sub-category is equivalent to an original rule. The algorithm prunes rules whose relevance value is below the user-defined relevance threshold value, that have a value of zero, or that have surpassed the maximum relevance value. Generalized association rules R and original association rules r are generated as in the previous section and stored in a hash table H. The HARC algorithm outputs all the generated categories and sub-categories.
Algorithm 1 Hierarchical Association Rule Categorization
Require: Generalized rules R_1, R_2, R_3, ..., R_n; original rules r_1, r_2, r_3, ..., r_N
Ensure: Rule categories C_1, C_2, C_3, ..., C_m
  M_n = number of original rules for each R_1, R_2, R_3, ..., R_n
  for all R do
    calculate relevance of (r_i, r_j) where i = 1, 2, ..., M_n; j = i+1, i+2, ..., M_n
  end for
  sort result by relevance
  while relevance ≥ ε and relevance != max(relevance) do
    nnode.p ← groupRules()
    nnode.r:value = r
  end while
  root.c ← nnode.v:value
Our algorithm is implemented using a tree structure that consists of linked lists, as shown in Figure 4.5. A generalized rule is represented by a linked list that has a key, a value, and a pointer p. An original rule is also represented by a linked list, which has a pointer p, a virtual value v:value, and a real value r:value. A pointer p points to neighbor nodes, while the v:value of an original rule points to the key value of a root node for sub-categories.

Figure 4.5: Structure for Rule Tree Implementation

The v:value, which points to a parent node p within a node, is used for grouping rules.

In sum, our algorithm runs iteratively until all rules are grouped into generalized rules. First, we calculate the relevance value for each generalized rule R_1, R_2, R_3, ..., R_n. Rules are then sorted by relevance, frequency, or the name of an item (as given by a user), and the sorted rules are stored back into a linked list L. Next, we calculate the relevance value among the original rules for each generalized rule. Similar to what we have done with generalized rules, we sort the original rules by the relevance value and store them in a linked list. After we have created a linked list for each rule, we group rules that have the same relevance value and set the v:value to the parent's pointer p. The process is repeated until all rules are grouped into a single root node.
As the time complexity of our algorithm is O(n^2), much more time is required for creating a rule tree when we have a very large data set. As discussed earlier, because our approach focuses on the post-analysis of association rules rather than on their creation, the searching of a tree is much more important than the creation of a tree. Rule tree creation can occur occasionally as a background process, while rule tree searching occurs frequently in real time. Therefore, we emphasize that storing the output is necessary for fast analysis. Currently, we use a general tree structure, but we could further enhance our approach by using a kd-tree for storing rules.
4.3.2 Relevance

Relevance is the similarity measure used for grouping rules. In our case, rules are said to be similar or relevant to each other when the number of literals on each side is equal, the number of common literals on each side is equal, and the sum of the total number of literals is small. The equations used for calculating the relevance between two association rules r_1 and r_2 are given below.

$$\mathrm{relevance}(r_1, r_2) = \frac{A + B + C + D}{E} \cdot \frac{1}{F} \tag{4.1}$$

$$A = w_L \cdot n(\mathrm{lhs}(r_1) \cap \mathrm{lhs}(r_2)) \tag{4.2}$$

$$B = w_R \cdot n(\mathrm{rhs}(r_1) \cap \mathrm{rhs}(r_2)) \tag{4.3}$$

$$C = w_{c1} \cdot n(\mathrm{lhs}(r_1) \cap \mathrm{rhs}(r_2)) \tag{4.4}$$

$$D = w_{c2} \cdot n(\mathrm{rhs}(r_1) \cap \mathrm{lhs}(r_2)) \tag{4.5}$$

$$E = n(\mathrm{all}(r_1) \cup \mathrm{all}(r_2)) \tag{4.6}$$

$$F = \max(n(\mathrm{all}(r_1)), n(\mathrm{all}(r_2))) \tag{4.7}$$

where r_1 has the form lhs(r_1) ⇒ rhs(r_1), r_2 has the form lhs(r_2) ⇒ rhs(r_2), lhs(r_1) means all the items on the left side of rule r_1, rhs(r_1) means all the items on the right side of rule r_1, and all(r_1) means all the items in rule r_1.

$$w_L = \left\lfloor \frac{\text{total \# of literals}}{2} \right\rfloor - |l_1 - l_2| \tag{4.8}$$

where |l_1 − l_2| is the difference in the number of literals for the left side.

$$w_R = \left\lfloor \frac{\text{total \# of literals}}{2} \right\rfloor - |l_3 - l_4| \tag{4.9}$$

where |l_3 − l_4| is the difference in the number of literals for the right side.

$$w_{c1} = \left\lfloor \frac{\text{total \# of literals}}{2} \right\rfloor - |l_5 - l_6| \tag{4.10}$$

where |l_5 − l_6| is the difference in the number of literals for the opposite side (see arrow (3) in Figure 4.6).

$$w_{c2} = \left\lfloor \frac{\text{total \# of literals}}{2} \right\rfloor - |l_7 - l_8| \tag{4.11}$$

where |l_7 − l_8| is the difference in the number of literals for the opposite side (see arrow (4) in Figure 4.6).
Figure 4.6: Steps for Relevance Calculation

The steps used in the relevance calculation are shown in Figure 4.6. We give more weight to the difference in the number of literals on each side (i.e., depending on the total number of literals and the number of literals on each side), and we then multiply the number of common items by the calculated weight. The relevance value ranges from 0 to n, where n is the number of literals in a rule. When a rule has four literals, the relevance value ranges from 0 to 4. That is why (4.7) is used for normalization, to make relevance range from 0 to 1. Several examples are shown below to illustrate how the relevance value is calculated for various cases.
Case 1: Rules exactly match

r_1: A ⇒ B ∩ C and r_2: A ⇒ B ∩ C

$$w_L = \left(\left\lfloor \tfrac{6}{2} \right\rfloor - |1 - 1|\right) = 3$$

$$w_R = \left(\left\lfloor \tfrac{6}{2} \right\rfloor - |2 - 2|\right) = 3$$

$$w_{c1} = \left(\left\lfloor \tfrac{6}{2} \right\rfloor - |1 - 2|\right) = 2$$

$$w_{c2} = \left(\left\lfloor \tfrac{6}{2} \right\rfloor - |2 - 1|\right) = 2$$

First, we calculate the weights for (4.8), (4.9), (4.10), and (4.11) using the floor function of the total number of literals (6/2 = 3). Then we subtract the difference in the number of literals from the previous result, which is zero for w_L and w_R and one for w_c1 and w_c2, to obtain each weight. The floor function of a real number x, denoted ⌊x⌋ or floor(x), is a function whose value is the largest integer less than or equal to x. The reason we use a floor function is to give less weight to those rules whose total number of literals differs from those of the others. For example, if the number of literals is 3 for both rules, the weight will be ⌊6/2⌋ = 3. However, if the numbers of literals are 2 and 3, the weight is ⌊5/2⌋ = ⌊2.5⌋ = 2. Next, we calculate the results for (4.1) through (4.6) by multiplying the number of common items by the weight value calculated for each side. Then, we divide the sum of those values by the number of literals (combining the literals from the two rules). The denominator is included to give more weight to rules that have a greater number of common literals.

A = 3 * 1 = 3, B = 3 * 2 = 6, C = 2 * 0 = 0, D = 2 * 0 = 0

E = n({A, B, C} ∪ {A, B, C}) = 3

F = max(3, 3) = 3

$$\mathrm{relevance}(r_1, r_2) = \frac{3 + 6 + 0 + 0}{3} \cdot \frac{1}{3} = 1$$
In Table 4.1, we show the results for the other cases, which can be calculated exactly as we have described. We have shown in detail the first case, that of an exact match, whose result value is 1.0; here the number of literals matches exactly, in this case 3. In case 2, only two common items with the same literal numbers exist, so the result value is less than 1.0. In case 3, there are also two common items, but they have a different number of literals, so the result value is less than the value in case 2. In case 4, there is only one common item with the same number of literals; hence, the resulting value is less than that of the previous case. In case 5, there is also one common item, but it has a different number of literals, so the result value is 0.08, less than the value in case 4. The last case has no common items, so the result has the minimum value of 0.
Table 4.1: Relevance for different Cases

Case   R_1            R_2                  Relevance Calculation
1      A ⇒ B ∩ C      A ⇒ B ∩ C            1.0
2      A ⇒ B ∩ C      D ⇒ B ∩ C            0.5
3      A ⇒ B ∩ C      A ∩ B ⇒ D ∩ C        0.33
4      A ⇒ B ∩ C      A ⇒ D ∩ E            0.2
5      A ⇒ B ∩ C      D ⇒ C                0.08
6      A ⇒ B ∩ C      D ⇒ E ∩ F            0.0
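The following Python function is a direct transcription of Eqs. (4.1)-(4.11), with each rule given as a pair of item sets (our own encoding); it reproduces, for example, cases 1, 2, and 4 of Table 4.1.

def relevance(r1, r2):
    # Transcription of Eqs. (4.1)-(4.11); rules are (lhs, rhs) pairs of sets.
    lhs1, rhs1 = r1
    lhs2, rhs2 = r2
    total = len(lhs1) + len(rhs1) + len(lhs2) + len(rhs2)
    # Each weight is floor(total/2) minus the difference in literal counts
    # of the two sides being compared (Eqs. 4.8-4.11).
    w = lambda a, b: total // 2 - abs(len(a) - len(b))
    A = w(lhs1, lhs2) * len(lhs1 & lhs2)          # (4.2)
    B = w(rhs1, rhs2) * len(rhs1 & rhs2)          # (4.3)
    C = w(lhs1, rhs2) * len(lhs1 & rhs2)          # (4.4)
    D = w(rhs1, lhs2) * len(rhs1 & lhs2)          # (4.5)
    E = len(lhs1 | rhs1 | lhs2 | rhs2)            # (4.6)
    F = max(len(lhs1 | rhs1), len(lhs2 | rhs2))   # (4.7)
    return (A + B + C + D) / E / F                # (4.1)

r1 = ({"A"}, {"B", "C"})
print(relevance(r1, ({"A"}, {"B", "C"})))   # case 1: 1.0
print(relevance(r1, ({"D"}, {"B", "C"})))   # case 2: 0.5
print(relevance(r1, ({"A"}, {"D", "E"})))   # case 4: 0.2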
4.3.3 Grouping Rules

To perform rule grouping, we have introduced a value called a virtual value to calculate the relevance value for a parent node. We combine the virtual values of two rules when the relevance value between them is equal to or greater than the default relevance threshold value. As the level of a category increases, the relevance threshold decreases. Since a higher category includes more rules that are weakly related, it is more likely to have a smaller relevance threshold value. The groupRules() function works as shown in Algorithm 2, where the v:value becomes part of a new parent node. For example, given the two association rules { a ⇒ b & c, b ⇒ a & c }, the HARC algorithm generates a new sub-category of { a|b ⇒ b|a & c } if the relevance value of the two rules is larger than the relevance threshold value. We repeat this process until only one root node exists.

Figure 4.7 shows an example of the virtual value calculation, with 10 original rules on the left side of the table and the relevance values calculated on the right side. Grouping is performed iteratively, with merging occurring when the rules have the same relevance value. In our example, the first five pairs are merged as a group. The result has three relevance values; in consequence, the result tree will have three levels. When we group rules, rule one and rule two are merged, and the parent node has real values of one and two. The virtual value is calculated based on the groupRules() algorithm. For example, the union of each side will be 1, (0 or 7), and (0 or 7), so the final virtual value is "1 & (0|7) ⇒ (0|7)," as shown in Figure 4.7. The process is reiterated until all relevance values have been used. Pruning is performed when the relevance value is equal to 0 or below the relevance threshold value, or when a rule does not belong to any of the existing groups.
Algorithm 2 groupRules() Function
Require: R_1, R_2    // a ⇒ b & c; b ⇒ a & c
Ensure: R_3
  T = number of literals
  for i = 1 to T + 1 do
    R_3[i] = R_1[i] UNION R_2[i]
  end for
  v:value ← R_3[i]    // a|b ⇒ (b|a) & c
  r:value ← R_1, R_2
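A sketch of Algorithm 2 in Python, using a flat literal-list encoding of rules that is our own; the position-wise union yields the virtual value described above.

def group_rules(r1, r2):
    # Parent's virtual value: the position-wise union of the two children's
    # literals. Rules are flat literal lists with a marker for the arrow,
    # e.g. ["a", "=>", "b", "c"] for a => b & c.
    virtual = []
    for lit1, lit2 in zip(r1, r2):
        # A set of size 1 is the literal itself; larger sets are
        # alternatives, printed as "a | b".
        virtual.append(sorted({lit1, lit2}))
    return virtual

v = group_rules(["a", "=>", "b", "c"], ["b", "=>", "a", "c"])
print([" | ".join(lits) for lits in v])   # ['a | b', '=>', 'a | b', 'c']
# i.e. the new sub-category a|b => (b|a) & c, with r:value = {r1, r2}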
The virtual value is used for determining the range of a child node when searching for a rule. If we want to identify the specific rule "6 & 7 ⇒ 0," starting from the root node, we must examine the virtual values of the child nodes, with each part of a virtual value serving as the range of the child nodes. Here, 1, 5, and 6 are values for the first literal; 0 and 7 for the second literal; and 0 and 7 for the third literal. Next, we search for a child node that is within that range. Likewise, we iteratively continue to scan the virtual values of child nodes at the next level. The values for the first literal of each node are 1, 5, and 6. The only way to reach the needed rule is to progress through nodes where the first value is 6 until we identify the specified rule.
To allow the above approach to work efficiently, we use a bitmap to store the codes and perform a bitwise AND for bit masking. Since fixed item codes are used throughout the system, a fixed-size bitmap can be used to set the bits for existing codes. For example, if the values for the first literal are 1, 5, and 6, then the first, fifth, and sixth bits are set in the bitmap for that particular position. When a user searches for "1 & (0|7) ⇒ (0|7)," as above, a bitwise AND between the first literal 1 and the bitmap is performed (...0110001 & ...0000001 = ...0000001). If the result is not equal to 0, the searched literal is within the range, so the next literal is checked. If the result is equal to 0, the child node does not contain the searched literal, so the search can skip the whole child tree and move to the next group/child. This makes the search more time and space efficient.
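The bit-masking test can be sketched in a few lines of Python; the code values below follow the "6 & 7 ⇒ 0" search example above, while the helper names are ours.

def bitmap(codes):
    # Set one bit per item code, e.g. {1, 5, 6} -> 0b1100010.
    mask = 0
    for c in codes:
        mask |= 1 << c
    return mask

node_first_literal = bitmap({1, 5, 6})   # codes stored at the first literal
searched = bitmap({6})                   # looking for rule "6 & 7 => 0"

if node_first_literal & searched:        # nonzero: 6 is within this child's range
    print("descend into this child and check the next literal")
else:
    print("skip the whole child subtree and move to the next group")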
When we use the algorithms and the relevance calculation shown above, our result
is a tree consisting of categories and sub-categories with different relevance values for
each level, and which may have sub-categories with overlapping nodes such that the
resultbecomesagroupofhierarchicalclusters. Duringthisstep,itispossibletoprune
rulesthatdonotbelongtoanysub-categoriesorwhoserelevancevalueisequaltozero.
One drawback of traditional multi-level association rule approaches is that they produce a large number of rules, making it difficult for users to identify important and interesting rules. Our approach groups relevant rules together to make the analysis easier, faster, and more efficient. Instead of manually scanning the entire rule set for each analysis, our approach produces a group of hierarchical sub-categories containing rules. Then, without needing to know how the approach works, the user simply selects an interesting rule, in our case the generalized rule, and analyzes rules within that category only, or the user selects several items and searches only for rules that include those items. This approach reduces the time needed to evaluate interesting rules from a large rule set, and the simplicity of the entire process allows for easier analysis by the user.
Figure 4.7: Grouping Rules using Relevance
4.4 Applications
There are many fields in desperate need of a tool or approach that can efficiently extract the most relevant and useful data for the user. As discussed earlier, association rule mining is well suited to identifying correlations among attributes. Therefore, any kind of data that is used to identify the relationships among attributes, or to see the impact of one attribute on others, can be good input to our HARC system. The HARC system is also a user-oriented system that provides a better and more efficient way of analyzing unknown data, increasing the probability that users can identify useful patterns or information from the data.
When analyzing past data to forecast future events, the effect or dependence of some features on other features is the most important consideration. Therefore, it is very important to identify the strongest relationships among the relevant transactions. The HARC system can do so very efficiently by associating numerous features into groups of related rules to identify the most highly correlated features, which serve as predictors of future events and locations. The HARC system thus has many applications in the fields of earthquake science, natural resource analysis, and intrusion detection as part of a decision support system. These applications are of tremendous national significance; scientists around the world have been seeking a tool like our HARC system for over a decade. For example, when drilling for oil, whether in the desert or offshore, it is critical to identify the correct location because of the large investment necessary in terms of time and money. Correct identification requires substantial investment simply to analyze a large amount of unstructured raw data from which it is very difficult to identify useful patterns. Use of the HARC approach decreases the time and money necessary to identify these patterns in an efficient manner.
Chapter 5
Evaluation
5.1 Environment
We conducted several experiments to test and validate the use of our approach and algorithms. We conducted all the experiments using an Intel Pentium 1.4GHz system with 512MB of RAM running Windows XP. We generated association rules using a free data mining Java tool called Tanagra [62], which uses the Apriori algorithm presented in [2]. To implement the item ontology, we used a semantic web framework called Jena [34] and the Web Ontology Language OWL [49].
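Although our implementation used Jena in Java, the same kind of taxonomy lookup can be illustrated in Python with rdflib; the file name and class IRI below are hypothetical placeholders, not artifacts of our system.

# A hedged sketch: querying an item taxonomy for rule generalization.
# "items.owl" and the IRIs are hypothetical; the implementation used Jena.
from rdflib import Graph, RDFS, URIRef

g = Graph()
g.parse("items.owl", format="xml")  # hypothetical OWL file for the taxonomy

item = URIRef("http://example.org/items#Milk")
# Walk rdfs:subClassOf edges to find the parent concept used for generalization.
for parent in g.objects(item, RDFS.subClassOf):
    print(parent)  # e.g., http://example.org/items#DairyProduct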
5.2 Performance
As previously described, our two primary goals were to (1) reduce the number of rules by performing rule generalization and (2) identify interesting and relevant rules by performing rule categorization. To fulfill these goals, our experiments were performed by changing (i) the number of items and (ii) the number of customers, as explained in section 3.1, using the Apriori algorithm to generate association rules in Tanagra according to minimum default support, confidence, and lift values. The minimum support value was set to 0.33, the minimum confidence value to 0.75, and the maximum itemset size to 4. The minimum lift value was initially set at 1.1 for case (i), but later changed to 1.005 for case (ii) because the data set had not produced a significant number of rules with the original default lift value. Although [71] emphasizes evaluating interestingness measures such as support, confidence, and lift, we focused on reducing the number of rules to be scanned.
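For readers who wish to reproduce a comparable rule set without Tanagra, the following hedged sketch applies the same thresholds using the mlxtend Python library; the toy transactions are illustrative only.

# A hedged sketch: association rules with the thresholds from the text
# (support 0.33, confidence 0.75, max itemset size 4, lift filter 1.1).
# The experiments used Tanagra; mlxtend is an illustrative substitute.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread"],
                ["milk", "bread", "cereal"],
                ["beer", "diaper"],
                ["beer", "diaper", "bread"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

itemsets = apriori(df, min_support=0.33, use_colnames=True, max_len=4)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.75)
print(rules[rules["lift"] >= 1.1][["antecedents", "consequents",
                                   "support", "confidence", "lift"]])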
5.2.1 Search Space Reduction
Building on section 4.2, which demonstrated how rule generalization is performed, this section describes the results obtained from rule generalization. As expected, the search space was greatly reduced for both customer and item rules. As shown in Tables 5.1 and 5.2, the original number of rules is the number of rules generated from Tanagra using the Apriori algorithm, with the "number of rules after RG" referring to the number of generalized rules produced by rule generalization. As discussed earlier, every generalized rule stores the original rules that belong to it. Therefore, the "number of rules after RG-max" and the "number of rules after RG-min" are calculated as the number of generalized rules plus the largest number of original rules under a single generalized rule, and the number of generalized rules plus the smallest number of original rules under a single generalized rule, respectively; these correspond to the worst and best cases for scanning for a particular rule.
Table 5.1: Result for Rule Generalization (Rule per Item)

ITEM                       10     20      30      40
Original # of rules        472    2432    15056   87867
# of rules after RG        94     122     173     643
# of rules after RG-max    134    410     3581    17009
# of rules after RG-min    95     123     174     644
As shown in Table 5.1, as the number of items increases arithmetically, the number of original rules increases exponentially. The sum of the number of generalized rules and the number of original rules under the largest group (RG-max) also increases exponentially, although at a slower rate. For example, 10 items produce 472 rules, and the number of generalized rules is 94. The maximum number of rules to scan after rule generalization is 94 + 40 = 134, and the minimum is 94 + 1 = 95.
Table 5.2: Results of Rule Generalization (Rule per Customer)

CUSTOMER                   10     20     50      100     300    600
Original # of rules        3882   5026   19801   19214   4578   2244
# of rules after RG        201    202    312     320     185    156
# of rules after RG-max    537    802    2596    2894    1003   616
# of rules after RG-min    202    203    313     321     186    157
As shown in Table 5.2, as the number of customers increases arithmetically, the number of original rules increases exponentially. The sum of the number of generalized rules and the number of original rules under the largest group (RG-max) also increases exponentially, although at a slower rate. For example, 10 customers produce 3,882 rules, and the number of generalized rules is 201. The maximum number of rules to scan after rule generalization is 201 + 336 = 537, and the minimum is 201 + 1 = 202. The results show that the number of rules a user must scan is reduced by a factor of roughly 3 to 5 in Table 5.1 and 6 to 7 in Table 5.2. As we mentioned earlier, the complexity can be reduced further by using rule categorization; the results after rule categorization are shown in section 5.2.2.
Figure 5.1: Graph for Number of Rules per Items

As shown in Figures 5.1 and 5.2, the difference in the distance between the two graphs "# of Rules" and "# of Rules after RG-max" shows the decrease in the number of rules to scan after rule generalization has been performed. One interesting phenomenon to note is that the number of rules first increases up to 50 customers, but decreases thereafter. The reason for this fluctuation is the sparseness of a data set in which more items are located near the first and the last rows. However, this fluctuation can be overlooked because our experiment only aimed to reduce the search space by using our method and not to save time.
From our evaluation, it is clear that performing rule generalization contributes greatly to market basket data analysis by decreasing the number of rules that must be analyzed. As the number of items or customers increases, the number of rules after reduction also grows, though at a much slower rate. In the following section, we use the results of rule generalization to categorize the original rules into sub-categories, to demonstrate that our approach to rule categorization successfully contributes to the identification of relevant rules.

Figure 5.2: Graph for Number of Rules per Customer
5.2.2 Relevance
Thus far, we have demonstrated the means of reducing the search space for generated association rules. We now aim to demonstrate the means of performing two important tasks in rule categorization: (1) identifying a specific rule, where we count the number of rules that must be examined to identify a specific rule, such as "cereal ⇒ milk & bread"; and (2) identifying rules that include items specified by the user. When a user wants to identify rules that include milk, bread, and cereal, we examine the number of rules that must be scanned to identify all the rules that include the items specified by the user.
Table 5.3, which shows the results of performing rule categorization to identify a specific rule, compares the use of Tanagra and the use of the HARC approach for identifying a specific rule. As discussed earlier, Tanagra uses the Apriori algorithm and default minimum interestingness values for rule generation. Here, the number in the table is the number of rules to be scanned to identify a rule specified by a user. The experiment uses the same data set that we used in the rule generalization. Table 5.3 shows a clear difference between the number of rules to be scanned using Tanagra and using the HARC approach, which requires scanning fewer rules than Tanagra to identify a rule specified by a user. The result is calculated by adding the number of generalized rules scanned to the number of virtual rules scanned to obtain the specified rule. For example, to identify the rule "6 & 0 ⇒ 7" for data set c10 i10, Tanagra must scan 66 rules, but the HARC approach first searches for the generalized rule "59 & 59 ⇒ 59" and then the categorized original rules. As explained earlier, our approach examines the virtual value of the child nodes. In this case, we need 10 rules to identify the generalized rule and then the virtual values of 9 nodes to search for the rule specified by the user, for a total of 19 steps. The result indicates that the time complexity O(n) for a linear scan has been reduced to O(√n ∗ log n). Although our approach reduces the time complexity, it is slower than the time complexity for a binary search tree, O(log n), because our approach does not restrict the number of child nodes to two.
Table 5.3: Results for Identifying a Specific Rule

Data Set      Rule            Tanagra   HARC
c10 i10       6 & 0 ⇒ 7       66        19
c10 i20       6 & 0 ⇒ 7       372       28
c10 i30       6 & 0 ⇒ 7       2815      181
c10 i40       6 & 0 ⇒ 7       16705     302
m c10 i20     6 & 0 ⇒ 7       588       48
m c20 i20     6 & 0 ⇒ 7       667       50
m c50 i20     6 & 0 ⇒ 7       3081      240
m c100 i20    9 & 7 ⇒ 8 & 6   5919      324
m c300 i20    9 & 7 ⇒ 8 & 6   1443      146
m c600 i20    9 & 7 ⇒ 8 & 6   729       73
The HARC approach is even more appropriate for identifying rules that include items specified by the user. Since it hierarchically categorizes rules using the relevance measure, the HARC approach does not need to search the lowest level, increasing the rate at which rules are identified compared to Tanagra. Suppose a user wants to identify rules that contain fresh fruit, marmalade, and fruit juice, with the codes for these items being 0, 7, and 9, respectively. Using Tanagra, the user would have to search line by line for related rules. In the case of 10 customers and 20 items, a user would have to search 3,882 rules to identify all the rules that include the specified items. The HARC approach, on the other hand, searches for the generalized rule first, which in this case should include three 59s. Since the related rules are already grouped together by the relevance value, it progresses down level by level while checking the virtual value. Simply by examining the virtual values of its child nodes, our algorithm identifies most of the related rules before reaching the lowest node. As the size of our data set increases, the reduction in the number of rules that must be scanned to identify the related rules becomes larger. While Tanagra must scan the entire list to identify the matching rules, our approach only needs to scan the generalized rules and the virtual values of nodes that belong to the generalized rule. Since rules are grouped by the relevance value, there is a high probability that rules with the specified items are located together. For example, when identifying rules that include the specified items for data set c10 i10, we must scan all the existing generalized rules, a total of 94 rules, among which are 33 rules that include the generalized items of the specified rules. Because each generalized rule must scan an average of 7 nodes to identify all the related rules, the result is calculated as 94 + 33 ∗ 7 = 325. Table 5.4, which shows the results when this calculation is applied to the remainder of the data set, indicates that the time complexity has been reduced to O(50 ∗ log n), providing a much faster search than the O(n) time complexity of a linear scan.
Table 5.4: Results for Identifying Related Items

Data Set     Items        Tanagra   HARC
c10 i10      0, 7 and 9   472       325
c10 i20      0, 7 and 9   2432      378
m c10 i20    0, 7 and 9   3882      417
m c20 i20    0, 7 and 9   5026      622
5.3 Search Quality
5.3.1 Precision and Recall
The success of a search engine algorithm lies in its ability to retrieve information for a given query. There are two ways in which one might consider the returned results to be successful: either one can obtain very accurate results, or one can find many results which have some connection with the search query. In information retrieval, these are termed precision and recall, respectively.
In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the search:

precision = |{relevant rules} ∩ {retrieved rules}| / |{retrieved rules}|

For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results. Precision is also used with recall, the percentage of all relevant documents that is returned by the search. The two measures are sometimes used together in the F1 score (or f-measure) to provide a single measurement for a system.
Recall in information retrieval is the fraction of the documents relevant to the query that are successfully retrieved:

recall = |{relevant rules} ∩ {retrieved rules}| / |{relevant rules}|

For example, for a text search on a set of documents, recall is the number of correct results divided by the number of results that should have been returned. In binary classification, recall is called sensitivity, so it can be viewed as the probability that a relevant document is retrieved by the query. It is trivial to achieve a recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough; one also needs to measure the number of non-relevant documents, for example by computing the precision.
A measure that combines precision and recall is their harmonic mean, the traditional F-measure or balanced F-score:

F = 2 · (precision · recall) / (precision + recall)
59
This is also known as the F
1
measure, because recall and precision are evenly
weighted. ItisaspecialcaseofthegeneralF
measure(fornon-negativerealvaluesof
):
F
= (1+
2
)·
precision·recall
2
·precision+recall
Two other commonly used F measures are the F
2
measure, which weights recall
twice as much as precision, and the F
0:5
measure, which weights precision twice as
muchasrecall.
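These definitions translate directly into code; the following illustrative Python helpers use toy rule sets of our own devising.

# Illustrative helpers for the metrics defined above; the rule sets are toy data.

def precision(relevant, retrieved):
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    return len(relevant & retrieved) / len(relevant)

def f_beta(p, r, beta=1.0):
    """General F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

relevant = {"r1", "r2", "r3", "r4"}
retrieved = {"r1", "r2", "r5"}
p, r = precision(relevant, retrieved), recall(relevant, retrieved)
print(p, r)                    # 0.666..., 0.5
print(f_beta(p, r))            # balanced F1
print(f_beta(p, r, beta=0.5))  # weights precision twice as much as recall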
5.3.2 Evaluation
For our approach, we have measured precision and recall for each search query. Queries are based on the 40 items we used for our rule generation. In addition, another 1600 queries based on combinations of two items have been used (e.g., apple for the head of the rule, beer for the body).

Figure 5.3 shows the precision measured for each item query. The 40 single-item queries are located on the right of the graph, which is why they show lower precision than the two-item queries.
Figure 5.3: Graph for precision per items

Figure 5.4 shows the recall measured for each item query. Like the precision graph, the 40 single-item queries are located on the right of the graph. Since the goal of our system is to provide additional information (relevance) and also to avoid losing important rules below the threshold value, the results show higher recall values for the single-item queries.
Generally, recall and precision have an inversely proportional relationship, and a good search engine or search system follows a similar pattern. Combining the previous two results, we can easily show that our HARC system does follow the general inverse relationship of a search system: for the single-item queries we obtain higher recall values, and for the two-item queries we obtain higher precision values.
Figure 5.4: Graph for recall per items

Our last graph, in Figure 5.5, shows the f-measure for the precision and recall. As we have mentioned, the f-measure is the harmonic mean of the two measures. Here we show the f-measure where β is 0.5. Even though our system generally provides higher recall values, it is also important for the user to have high-precision results, so we have given more weight (twice as much as recall) to precision when calculating the f-measure.
F0.5 = (1 + 0.25) · (precision · recall) / (0.25 · precision + recall) = 5 · (precision · recall) / (precision + 4 · recall)
The result is more significant when the f-measure value is higher, and our graph shows that the results are mostly above 0.5.
Figure 5.5: Graph for f-measure per items
Chapter 6
Contribution and Future Challenges
6.1 Contribution
Traditional methods used for rule mining of transactional data have faced limitations in terms of efficiency in identifying relevant rules. Due to the large number of generated rules, it has always been very difficult to identify relevant rules within a set of generated rules. Our approach, which refines a large problem space into smaller, hierarchically structured search spaces, is the first approach that can generalize and categorize generated rules to support multi-level association rules while providing users with a more efficient and relevant way of searching for rules.

Specifically, our approach uses an ontology, which contains a conceptual taxonomy that allows us to generalize an item to a higher level more easily, to support the generalization of association rules. Moreover, it employs a relevance measure in rule categorization by iteratively grouping the rules, using the relevance metric, into a group of clusters, with each cluster containing the most relevant rules. The results of our experiments indicate that our approach reduces the search space necessary for identifying the needed rules, providing users with a more effective and efficient means of analysis.
6.2 Future Challenges
Data analysis is currently receiving a great deal of attention, and it is more important than ever given the tremendous growth of Internet-connected data and transactions and the ever-increasing size of data sets. This rapid growth has created an interesting challenge for businesses and consumers everywhere, including online advertising and recommendation companies, which need to accurately measure and analyze what users and transactions are doing. Our approach can be a good fit for selecting the best-matched features/rules by categorizing them using our metric of relevance. However, challenges exist in applying our approach to very large scale applications. The main challenges are applicability and performance, as discussed immediately below.
6.2.1 Applicability
As noted, our approach is based on association rule mining: we generalize and categorize the generated rules for better analysis. This means our approach can be used with any data mining technique that is able to generate rules, e.g., the naive Bayesian and decision tree techniques. Many organizations today use classification learning techniques but not association rule mining, due to the known problem mentioned earlier. If our approach can be integrated with these existing techniques, it will be a major functional addition to currently existing systems.
6.2.2 Performance
Our approach manipulates a large data set and requires heavy calculation of relevance values. We also need an efficient way to store our results for fast retrieval. One way is to optimize our data storage and access structure. The current tree structure for rule storage can degrade to O(n) performance as the branching factor increases. We hope to develop a better data structure in the form of a multi-dimensional tree, called a kd-tree, to achieve a retrieval time complexity of O(n^(1-1/k) + m).
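As an indicative sketch only, such a kd-tree lookup could be prototyped with SciPy; the rule-to-vector encoding below is a hypothetical assumption of ours.

# A hedged sketch of kd-tree retrieval over rule feature vectors.
# The feature encoding is hypothetical; scipy provides the kd-tree, whose
# range/nearest queries run in roughly O(n^(1-1/k) + m) time.
import numpy as np
from scipy.spatial import KDTree

# Each rule encoded as a k-dimensional point (here k = 3 literal positions).
rule_vectors = np.array([[6, 0, 7],
                         [1, 0, 7],
                         [9, 7, 8]])
tree = KDTree(rule_vectors)

dist, idx = tree.query([6, 0, 7])  # nearest stored rule to the query rule
print(idx, rule_vectors[idx])      # 0 [6 0 7]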
An alternative method is to make use of the map-reduce framework. Our approach performs substantial repeated calculation over the rules and also requires significant memory space to accommodate a hash map storing the necessary values. The map-reduce framework can support parallel computing on large data sets on clusters of computers. In our case, we calculate the relevance value for all specific rules within a generalized rule, and we have to repeat this step for every generalized rule. Due to this repetitive process, we can use parallel data processing (map/reduce) for each generalized rule: we store the data needed for the calculation in the map phase and reduce it afterwards for the relevance calculation. We expect an improvement of N_G / N_p in time, where N_G is the number of generalized rules and N_p is the number of processes used for the map-reduce framework.
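A minimal sketch of this idea follows, with a Python process pool standing in for a full map-reduce cluster and a placeholder relevance function of our own devising:

# A hedged sketch: parallelizing per-generalized-rule relevance calculation.
# compute_relevance() is a hypothetical placeholder for the HARC metric;
# a process pool stands in for a map-reduce cluster.
from multiprocessing import Pool

def compute_relevance(generalized_rule):
    """Map step: score every specific rule under one generalized rule."""
    rules = generalized_rule["specific_rules"]
    return [(r, len(r)) for r in rules]  # placeholder scoring

if __name__ == "__main__":
    generalized_rules = [{"specific_rules": ["6&0=>7", "1&0=>7"]},
                         {"specific_rules": ["9&7=>8&6"]}]
    with Pool(processes=2) as pool:            # N_p worker processes
        mapped = pool.map(compute_relevance, generalized_rules)
    # Reduce step: collect scores across all generalized rules.
    print([score for group in mapped for score in group])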
References
[1] Agrawal, R., Imielinski, T., and Swami, A., "Mining Association Rules between Sets of Items in Large Databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, pages 207-216, May 1993.
[2] Agrawal, R., and Srikant, R., "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, pages 487-499, September 1994.
[3] Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P., "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pages 94-105, June 1998.
[4] Aggarwal, C. C., Wolf, J. L., and Yu, P. S., "A New Method for Similarity Indexing of Market Basket Data", Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, pages 407-418, June 1999.
[5] AI Components - Artificial Intelligence Data Components, http://www.aicomponents.com/Default.aspx
[6] An, A., Khan, S., and Huang, X., "Objective and Subjective Algorithms for Grouping Association Rules", Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, FL, pages 477-480, November 2003.
[7] Bai, Y., Wang, F., and Liu, P., "Efficiently Filtering RFID Data Streams", Proceedings of the 1st International VLDB Workshop on Clean Databases (CleanDB'06), Seoul, Korea, pages 50-57, September 2006.
[8] Bayardo, R. J., Agrawal, R., and Gunopulos, D., "Constraint-Based Rule Mining in Large, Dense Databases", Proceedings of the 15th International Conference on Data Engineering (ICDE'99), Sydney, Australia, pages 188-197, March 1999.
[9] Breaux, T. D., and Reed, J. W., "Using Ontology in Hierarchical Information Clustering", Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05), Hilton Waikoloa Village, Big Island, HI, pages 111-112, January 2005.
[10] Brin, S., Motwani, R., Ullman, J. D., and Tsur, S., "Dynamic Itemset Counting and Implication Rules for Market Basket Data", Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, pages 255-264, May 1997.
[11] Calvanese, D., De Giacomo, G., and Lenzerini, M., "A Framework for Ontology Integration", Proceedings of the 1st Semantic Web Working Symposium, Stanford, CA, pages 303-316, 2001.
[12] Chawathe, S. S., Krishnamurthy, V., Ramachandran, S., and Sarma, S., "Managing RFID Data (Extended Abstract)", Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04), Toronto, Canada, pages 1189-1195, 2004.
[13] Chen, X., Zhou, X., Scherl, R., and Geller, J., "Using an Interest Ontology for Improved Support in Rule Mining", Proceedings of the 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'03), Prague, Czech Republic, LNCS vol. 2737, pages 320-329, September 2003.
[14] Clerkin, P., Cunningham, P., and Hayes, C., "Ontology Discovery for the Semantic Web Using Hierarchical Clustering", Proceedings of the Workshop on Semantic Web Mining at ECML/PKDD, Freiburg, Germany, pages 27-38, September 2001.
[15] Ding, Y., and Foo, S., "Ontology Research and Development: Part 2 - A Review of Ontology Mapping and Evolving", Journal of Information Science, 28(5), pages 375-388, October 2002.
[16] Enterprise Miner, SAS Institute, http://www.sas.com/technologies/analytics/datamining/miner/
[17] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., "From Data Mining to Knowledge Discovery in Databases", The AI Magazine, 17(3), pages 37-54, Fall 1996.
[18] Fu, Y., and Han, J., "Meta-Rule-Guided Mining of Association Rules in Relational Databases", Proceedings of the 1st International Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases (KDOOD'95), Singapore, pages 39-46, December 1995.
[19] Garofalakis, M. N., Rastogi, R., and Shim, K., "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints", Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99), Edinburgh, Scotland, pages 223-234, September 1999.
[20] Gerst, M., Bunduchi, R., and Graham, I., "Current Issues in RFID Standardisation", Proceedings of the Workshop on Interoperability Standards - Interop-ESA Conference, Geneva, Switzerland, Hermes Science Publishing, February 2005.
[21] Gonzalez, H., Han, J., Li, X., and Klabjan, D., "Warehousing and Analyzing Massive RFID Data Sets", Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, pages 1-10, April 2006.
[22] Gonzalez, H., Han, J., and Li, X., "FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows", Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), Seoul, Korea, pages 834-845, September 2006.
[23] Gonzalez, H., Han, J., and Li, X., "Mining Compressed Commodity Workflows from Massive RFID Data Sets", Proceedings of the International Conference on Information and Knowledge Management (CIKM'06), Arlington, VA, pages 162-171, November 2006.
[24] Gruber, T. R., "Toward Principles for the Design of Ontologies Used for Knowledge Sharing", in Guarino, N., and Poli, R. (eds.), Formal Ontology in Conceptual Analysis and Knowledge Representation, Kluwer Academic Publishing, Deventer, pages 907-928, 1993.
[25] Guha, S., Rastogi, R., and Shim, K., "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proceedings of the 15th International Conference on Data Engineering (ICDE'99), Sydney, Australia, pages 512-521, March 1999.
[26] Gupta, G., Strehl, A., and Ghosh, J., "Distance Based Clustering of Association Rules", Proceedings of Intelligent Engineering Systems through Artificial Neural Networks (ANNIE), St. Louis, MO, vol. 9, pages 759-764, ASME Press, November 1999.
[27] Halkidi, M., Batistakis, Y., and Vazirgiannis, M., "On Clustering Validation Techniques", Journal of Intelligent Information Systems, 17(2-3), pages 107-145, December 2001.
[28] Han, J., "Mining Knowledge at Multiple Concept Levels", Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM'95), Baltimore, MD, pages 19-24, November 1995.
[29] Han, J., and Fu, Y., "Discovery of Multiple-Level Association Rules from Large Databases", Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), Zurich, Switzerland, pages 420-431, September 1995.
[30] Han, J., and Pei, J., "Mining Frequent Patterns by Pattern-Growth: Methodology and Implications", ACM SIGKDD Explorations Newsletter, 2(2), pages 14-20, December 2000.
[31] Han, J., Pei, J., and Yin, Y., "Mining Frequent Patterns without Candidate Generation", Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, pages 1-12, May 2000.
[32] Hilderman, R. J., Carter, C. L., Hamilton, H. J., and Cercone, N., "Mining Association Rules from Market Basket Data using Share Measures and Characterized Itemsets", International Journal on Artificial Intelligence Tools, 7(2), pages 189-220, June 1998.
[33] Jeffery, S. R., Garofalakis, M., and Franklin, M. J., "Adaptive Cleaning for RFID Data Streams", Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06), Seoul, Korea, pages 163-174, September 2006.
[34] Jena - A Semantic Web Framework for Java, http://jena.sourceforge.net/
[35] Jiang, N., and Gruenwald, L., "Research Issues in Data Stream Association Rule Mining", ACM SIGMOD Record, 35(1), pages 14-19, March 2006.
[36] Jorge, A., "Hierarchical Clustering for Thematic Browsing and Summarization of Large Sets of Association Rules", Proceedings of the 4th SIAM International Conference on Data Mining (SDM'04), Orlando, FL, April 2004.
[37] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., and Verkamo, I., "Finding Interesting Rules from Large Sets of Discovered Association Rules", Proceedings of the 3rd ACM International Conference on Information and Knowledge Management (CIKM'94), Gaithersburg, MD, pages 401-407, November 1994.
[38] Lakhal, L., and Stumme, G., "Efficient Data Mining Based on Formal Concept Analysis", LNAI vol. 3626, pages 180-195, July 2005.
[39] Lent, B., Swami, A., and Widom, J., "Clustering Association Rules", Proceedings of the 13th International Conference on Data Engineering (ICDE'97), Birmingham, U.K., pages 220-231, April 1997.
[40] Liu, B., Hsu, W., and Ma, Y., "Mining Association Rules with Multiple Minimum Supports", Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, pages 337-341, September 1999.
[41] Liu, J., Wang, W., and Yang, J., "A Framework for Ontology-Driven Subspace Clustering", Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, pages 623-628, August 2004.
[42] Mennis, J., and Liu, J. W., "Mining Association Rules in Spatio-Temporal Data", Proceedings of the 7th International Conference on GeoComputation, University of Southampton, U.K., September 2003.
[43] Mills-Harris, M. D., Soylemezoglu, A., and Saygin, C., "RFID Data-Based Inventory Management of Time-Sensitive Materials", Proceedings of the 31st Annual Conference of the IEEE Industrial Electronics Society (IECON'05), Raleigh, NC, November 2005.
[44] Molnar, D., Soppera, A., and Wagner, D., "Privacy for RFID through Trusted Computing", Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES'05), Alexandria, VA, pages 31-34, November 2005.
[45] Ng, R., Lakshmanan, L. V. S., Han, J., and Pang, A., "Exploratory Mining and Pruning Optimizations of Constrained Association Rules", Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, pages 13-24, June 1998.
[46] Nguyen, D., and Kobsa, A., "Better RFID Privacy Is Good for Consumers, and Manufacturers, and Distributors, and Retailers", Proceedings of the CHI Workshop on Privacy-Enhanced Personalization (PEP'06), Montreal, Canada, April 2006.
[47] Nigro, H. O., Gonzalez, S. C., and Xodo, D., "Data Mining with Ontologies: Implementations, Findings and Frameworks", Information Science Reference - IGI Global, London, 2008.
[48] Ordonez, C., "A Model for Association Rules Based on Clustering", Proceedings of the ACM Symposium on Applied Computing (SAC'05), Santa Fe, NM, pages 545-546, March 2005.
[49] McGuinness, D., and van Harmelen, F., "OWL Web Ontology Language Overview", available at http://www.w3.org/TR/owl-features/
[50] Parsons, L., Haque, E., and Liu, H., "Subspace Clustering for High Dimensional Data: A Review", ACM SIGKDD Explorations Newsletter, 6(1), pages 90-105, June 2004.
[51] Pei, J., and Han, J., "Can We Push More Constraints into Frequent Pattern Mining?", Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, pages 350-354, August 2000.
[52] Pinto, H. S., Gomez-Perez, A., and Martins, J. P., "Some Issues on Ontology Integration", Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5), Stockholm, Sweden, pages 7.1-7.12, August 1999.
[53] Psaila, G., and Lanzi, P. L., "Hierarchy-based Mining of Association Rules in Data Warehouses", Proceedings of the ACM Symposium on Applied Computing (SAC'00), Como, Italy, pages 307-312, March 2000.
[54] ScanSource's RFID Edge, Legislation & Standards, http://www.scansource.com/rfidedge/pages/legislation_standards.html
[55] Sherkat, R., and Rafiei, D., "Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data", Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, page 19, April 2006.
[56] "Semantic Information Research Laboratory of the Computer Science Department at the University of Southern California (USC)", available at http://sir-lab.usc.edu
[57] Smyth, P., and Goodman, R. M., "An Information Theoretic Approach to Rule Induction from Databases", IEEE Transactions on Knowledge and Data Engineering, 4(4), pages 301-316, August 1992.
[60] Strehl A. and Ghosh J., ”A Scalable Approach to Balanced, High-dimensional
Clustering of Market-baskets”, Proceedings of the 7th International Conference
on High Performance Computing (HiPC’00), Bangalore, India, LNCS vol. 1970,
pages525-536,Springer,December2000.
[61] Sugumaran, V., and Storey, V. C., ”The Role of Domain Ontologies in Database
Design: An Ontology Management and Conceptual Modeling Environment”,
ACM Transaction on Database Systems (TODS), 31(3), pages 1064-1094,
September2006.
[62] TANAGRA: A Free Data Mining Software for Research and Education,
http://eric.univ-lyon2.fr/ricco/tanagra/,2005.
[63] TheFreeDictionary-Relevance(ComputerScience)
http://encyclopedia.thefreedictionary.com/Relevance+(Computer+Science)
[64] ToivonenH.,KlemettinenM.,RonkainenP.,HdtvnenK.,andMannilaH.,”Prun-
ing and Grouping Discovered Association Rules”, Proceedings of MLnet Work-
shop on Statistics, Machine Learning, and Discovery in Databases, Heraklion,
Crete,Greece,pages47-52,April1995.
[65] Tsur,D.,Ullman,J.D.,Abiteboul,S.,Clifton,C.,Motwani,R.,Nestorov,S.,and
Rosenthal, A., ”Query Flocks: A Generalization of Association-Rule Mining”,
Proceedings of the 1998 ACM SIGMOD International Conference on Manage-
mentofData,Seattle,WA,pages1-12,June1998.
[66] Tzanis,G.,Berberidis,C.,andVlahavas,I.,”OntheDiscoveryofMutuallyExclu-
sive Items in a. Market Basket Database”, Proceedings of the 2nd ADBIS Work-
shop on Data Mining and Knowledge Discovery, Thessaloniki, Greece, pages
1-12,September2006.
[67] UnitedStatesGovernmentAccountabilityOffice,”RadioFrequencyIdentification
TechnologyintheFederalGovernment”,GAO-05-551,
http://www.gao.gov/new.items/d05551.pdf,May2005.
[68] Wache, H., Vogele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann,
H. and Hubner, S., “Ontology-Based Integration of Information : A Survey of
Existing Approaches”, Proceedings of IJCAI 2001 Workshop on Ontologies and
InformationSharing,Seattle,WA,pages108-117,August2001.
[69] Wang,F.,andLiu,P.,”TemporalManagementofRFIDData”,Proceedingsofthe
31stInternationalConferenceonVeryLargeDataBases(VLDB’05),Trondheim,
Norway,pages1128-1139,August2005.
74
[70] Webb, G. I., and Zhang, S., "Removing Trivial Associations in Association Rule Discovery", Proceedings of the 1st International NAISO Congress on Autonomous Intelligent Systems (ICAIS'02), Geelong, Australia, NAISO Academic Press, Canada/The Netherlands, 2002.
[71] Won, D., Song, B. M., and McLeod, D., "An Approach to Clustering Marketing Data", Proceedings of the 2nd International Advanced Database Conference (IADC'06), San Diego, CA, June 2006.
[72] Won, D., and McLeod, D., "Ontology-Driven Rule Generalization and Rule Categorization for Market Data", Proceedings of the 23rd ICDE Workshops on Data Mining and Business Intelligence (DMBI 2007), Istanbul, Turkey, April 2007.
[73] Wikipedia - Relevance (information retrieval), http://en.wikipedia.org/wiki/Relevance_(information_retrieval)
[74] Ya, X., "Research Issues in Spatio-temporal Data Mining", Proceedings of the University Consortium for Geographic Information Science Workshop on Geospatial Visualization and Knowledge Discovery, Lansdowne, VA, November 2003.
[75] Yun, C.-H., Chuang, K.-T., and Chen, K.-T., "An Efficient Clustering Algorithm for Market Basket Data Based on Small Large Ratios", Proceedings of the 25th International Computer Software and Applications Conference (COMPSAC'01), Chicago, IL, pages 505-510, October 2001.
[76] Yun, C.-H., Chuang, K.-T., and Chen, K.-T., "Using Category-Based Adherence to Cluster Market-Basket Data", Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM'02), Maebashi City, Japan, pages 546-553, December 2002.
[77] Yun, C.-H., Chuang, K.-T., and Chen, K.-T., "Self-Tuning Clustering: An Adaptive Clustering Method for Transaction Data", Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'02), Aix-en-Provence, France, LNCS vol. 2454, pages 42-51, September 2002.
[78] Yun, C.-H., Chuang, K.-T., and Chen, K.-T., "Clustering Item Data Sets with Association-Taxonomy Similarity", Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, FL, pages 697-700, November 2003.
Abstract
The application of association rules, which specify relationships among large sets of items, is a fundamental data mining technique used for various applications. In this dissertation, we present an efficient method of using association rules for identifying rules from a stream of transactions consisting of a collection of items purchased, referred to as market basket data. A common problem encountered with market basket analysis is that it results in a number of weakly associated rules that are of little interest to the user. To mitigate this problem, we propose an efficient approach to managing the data so that only a reasonable number of rules need to be analyzed. First, we apply an ontology, a hierarchical structure that defines the relationships among concepts at different abstraction levels, to minimize the search space, thereby allowing the user to avoid having to search the large original result set for useful and important rules. Next, we apply a novel metric called relevance to categorize the rules using the Hierarchical Association Rule Categorization (HARC) algorithm, an algorithm that efficiently categorizes association rules by searching the compact generalized rules first and then the specific rules that belong to them, rather than scanning the entire list of rules. The efficiency and effectiveness of our approach is demonstrated in our experiments on high-dimensional synthetic data sets.
relevance