Guidance Committee
Prof. Dennis McLeod (Chairperson)
Prof. Aiichiro Nakano
Prof. Lawrence Pryor (Outside Member)
An Efficient Approach to Clustering Datasets
with Mixed Type Attributes in Data Mining
Ph.D. Dissertation
submitted by
Jongwoo Lim
December 18, 2013
Acknowledgements
First and foremost, I would like to thank my advisor and chair of my committee, Dr. Dennis
McLeod, for his guidance over the past 5 years. I am also very thankful to other members of
my committee: Prof. Aiichiro Nakano, Prof. Shri Narayanan, Prof. William GJ Halfond
and Prof. Larry Pryor.
I was fortunate to be surrounded by many gifted and smart colleagues like Dr. Seon Ho Kim,
Jongeun Jeon and Seongwook Youn. I am also very thankful for the help and support I
received from the members of the Semantic Information Research Lab. It made my stay in Los Angeles over the past 9 years truly memorable.
Finally, I would like to thank my parents, my sisters, and my nieces who are a constant
source of inspiration and always help me to believe in myself.
Table of Contents
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Objective and Approach
1.2 Anticipated Contributions
1.3 Thesis Outline
Chapter 2: Preliminary
2.1 Data Mining
2.2 Similarity Measures
2.3 Clustering Algorithms
2.3.1 Categorical Attribute Clustering
2.3.2 Numerical Attribute Clustering
2.3.3 Mixed Attributes Clustering
Chapter 3: Proposed Clustering Framework
3.1 Introduction
3.2 Entropy based Similarity Measure
3.3 Extract Candidate Cluster Numbers
3.4 Weighting Scheme
3.5 Clustering Methods for Mixed Attribute
Chapter 4: Experiments
4.1 Experimental Setup
4.2 Experiments and Results
4.2.1 A Heart Disease Dataset
4.2.2 A German Credit Dataset
Chapter 5: Conclusion and Future Challenges
5.1 Conclusion
5.2 Future Challenges
References
List of Figures
Figure 1 Overview of Proposed Clustering Framework
Figure 2 Initialization of the difference of entropy
Figure 3 Process of Merging Clusters
Figure 4 Notations of the extracting cluster number algorithm
Figure 5 Overview of the extracting cluster number algorithm
Figure 6 Overview of heart disease dataset
Figure 7 Categorical attributes and numerical attributes (heart disease dataset)
Figure 8 Overview of German credit dataset
Figure 9 Categorical attributes and numerical attributes (German credit dataset)
Figure 10 Average entropy for each cluster number
Figure 11 The difference of average entropy for heart disease dataset
Figure 12 An example of well-balanced clusters
Figure 13 Accuracy of all cluster numbers with different weights
Figure 14 Average accuracy of all cluster numbers with different weights
Figure 15 Average accuracy of each cluster number
Figure 16 Comparison with k-means algorithm (heart disease dataset)
Figure 17 The difference of average entropy for German credit dataset
Figure 18 Average accuracy of all cluster numbers with different weights
Figure 19 Average accuracy of each cluster number
Figure 20 Comparison with k-means algorithm (German credit dataset)
List of Tables
Table 1 Result of Center-based Clustering
Table 2 Result of Object-based Clustering
Table 3 Comparison of average accuracy for heart disease dataset
Table 4 Average accuracy of German credit dataset
Table 5 Average accuracy with different weights
Abstract
We propose an efficient approach to clustering datasets with mixed type attributes (both numerical and categorical) while minimizing information loss during clustering. Real-world datasets, such as medical, biological, and transactional datasets and their associated ontologies, often contain mixed attribute types.
However, most conventional clustering algorithms have been designed and applied to
datasets containing single attribute type (either numerical or categorical). Recently,
approaches to clustering for mixed attribute type datasets have emerged, but they are
mainly based on transforming attributes to straightforwardly utilize conventional
algorithms. The problem with such approaches is the possibility of distorted results due to information loss: a significant portion of the attribute values can be removed during transformation without background knowledge of the dataset, which may result in lower clustering accuracy.
To address this problem, we propose a clustering framework for mixed attribute type
datasets without transforming attributes. We first utilize an entropy based measure of
categorical attributes as our criterion function for similarity. Second, based on the results of
entropy based similarity, we extract candidate cluster numbers and verify our weighting
condition that is based on the degree of well balanced clusters with pre-clustering results
and the ground truth ratio from the given dataset. Finally, we cluster the mixed attribute
type datasets with the extracted candidate cluster numbers and the weights.
We have conducted experiments with a heart disease dataset and a German credit dataset, using the entropy function as a similarity measure and the proposed method of extracting the number of clusters. We also experimentally explore the relative degree of balance of the categorical vs. numerical attribute sub-datasets in the given datasets. Our experimental results demonstrate that the proposed framework effectively improves accuracy for the given mixed-attribute datasets.
CHAPTER 1
INTRODUCTION
Knowledge discovery in databases (KDD) is a concept that describes a process of
automatically searching large datasets for patterns or relationships that can be considered
knowledge about the datasets. It consists of data preprocessing, data mining, and postprocessing of the data mining results.
Preprocessing entails the transformation of raw data into an appropriate format for subsequent analysis, for example through feature selection and dimensionality reduction. Postprocessing ensures that valid and useful results from data mining are turned into usable information.
For our purposes here, we define data mining as the process of analyzing data from
various perspectives in order to discover useful information such as relationships among
concepts that remain unknown. Data mining includes four major tasks: classification,
association analysis, clustering, and anomaly detection. In classification, one assigns
objects to one of predefined categories (e.g., detecting spam email). Association analysis
involves discovering previously unknown relationships in datasets that can be represented
in the form of association rules. Clustering can be used to group data that are meaningful,
useful, or both. Anomaly detection is the process of finding objects that are different from
most other objects.
Our research focuses on clustering in data mining. As our long-term goal is gaining the
ability to find relationships among concepts that are implicit in datasets, we first need to
investigate clustering concepts for our research goal. Clustering, the technique of grouping similar objects, has been an efficient and common technique for data analysis in many fields such as knowledge discovery, machine learning, pattern recognition, image analysis, bioinformatics, and medical informatics.
Clustering algorithms (e.g., [1], [2], [3] and [4]) have been developed and refined in the
literature. Significantly, most of them have focused only on single characteristic attributes
such as numerical attributes or categorical attributes. However, clustering algorithms for
single characteristic attributes are not appropriate for mixed type attribute datasets, since
the characteristics of attributes are different from each other. For example, numerical
attributes are ordered and continuous, in contrast to categorical attributes.
As mixed attribute datasets become more common in real life, clustering techniques for
this type of dataset are required in various informatics fields such as medical informatics,
bioinformatics, geoinformatics, information retrieval, and so on.
Many previous approaches to clustering for mixed attributes have been based on
transformation utilizing traditional clustering algorithms. There are two approaches to
transformation. In one approach, numerical attribute values are discretized, and categorical
clustering methods are applied. However, discretization leads to loss of information. In the
other approach, categorical attribute values are transformed into numerical attribute values
and numerical distance measures are applied for similarity between those object pairs.
However, it is not obvious that this approach yields correct numerical values for categorical
values such as color, shape, and so on.
Recently, clustering techniques for mixed-type attribute datasets have been developed in
an effort to use traditional algorithms without transformation. What these techniques have
in common is that they divide mixed type attributes into categorical and numerical attributes and apply data analysis techniques to each, for example applying a similarity measure to each attribute type and then integrating the results to cluster under certain criteria such as the cluster number.
For example, k-prototype clustering [6], which is an extension of k-means clustering, is
one of the clustering algorithms for mixed-attribute datasets without transformation.
However, the clustering performance of k-prototype clustering depends on the selection of the cluster number k used to initialize the prototypes, which is chosen randomly or defined by the user. Determining the optimal cluster number has therefore been a long-standing challenge in clustering.
To address the problems outlined above, we propose a clustering framework that supports clustering of mixed-attribute datasets without transformation and that systematically determines the cluster number based on an entropy function.
1.1 Objective and Approach
The goal of our proposed framework is to cluster mixed-attribute datasets efficiently
without transformation. Within this framework, we propose an effective measure to
determine the distribution of clusters based on an entropy function. A weighting scheme
supports the accuracy of clustering results for a given mixed-attribute dataset.
As output formats for knowledge discovery in databases, domain ontologies and knowledge representations will be our experimental objects for establishing the effectiveness of our techniques toward our long-term research goal. By developing a clustering framework for mixed type attribute datasets, we can produce a significant output format for knowledge discovery in databases. As part of the knowledge discovery process, clustering, which extracts useful higher-level knowledge from raw datasets, becomes a necessity in analyzing and understanding information. Higher-level knowledge can be extracted by discovering clusters (groups) of similar objects in a raw dataset.
Increasingly, raw datasets (e.g., medical datasets) consist of objects that have mixed type attributes, i.e., both categorical and numerical attributes. However, traditional clustering algorithms are limited in clustering mixed-attribute datasets because the transformation they require causes information loss. Another challenge for clustering performance is determining reasonable cluster numbers (the distribution of clusters) for a given raw dataset.
Our proposed framework for clustering uses a heart disease dataset and a German credit
dataset from the UCI repository as mixed type attribute dataset. The heart disease dataset
consists of 270 objects which have 13 attributes each. We first divide 13 attributes into 7
categorical attributes and 6 numerical attributes to experiment with the proposed clustering
framework. The German credit dataset consists of 1000 objects with 20 attributes each; those 20 attributes are composed of 13 categorical attributes and 7 numerical attributes.
As an element of information theory, the entropy function can be used to measure the uncertainty of random variables. Therefore, we utilize an entropy function to extract the distribution of clusters by measuring this uncertainty. The entropy value of each cluster reflects the similarity of the objects in that cluster. We first regard each object in the dataset as a singleton cluster, so the initial entropy value of each cluster is 0. The entropy value of a cluster increases as other clusters are merged into it, which means the cluster contains more dissimilar objects. We also found that the rate at which entropy increases differs from one merge to another. Thus, we evaluate the increase in entropy at each merging step and identify where the rate of increase changes. Finally, we extract the distribution of cluster numbers from these differences in the increase of entropy between clusters.
In addition, we experimentally explore the use of weighting schemes to identify the
relative balance of categorical vs. numerical attributes. The weighting scheme is based on
the ground truth ratio of datasets and the degree of the balance of object numbers in each
cluster. When the ground truth ratio is close to 0.5:0.5, our weighting scheme assigns more weight to whichever of the categorical and numerical attribute sub-datasets yields the better-balanced clusters. With the distribution of cluster numbers and
the weighting scheme, we clustered a heart disease dataset and a German credit dataset with
hierarchical clustering in an agglomerative way.
1.2 Contributions
The main contribution of our proposed framework is an effective approach to clustering mixed-attribute datasets without transforming the type of each attribute, whether categorical or numerical. A key aspect of this approach is determining a reasonable distribution of cluster numbers in terms of accuracy, which requires an objective measure of clustering quality and appropriate weights for the attributes in mixed type attribute datasets.
Our proposed framework presents an effective measure for the optimal distribution of clusters based on an entropy function. We have conducted experiments with a heart disease dataset and a German credit dataset, for which an entropy function has been developed and is being generalized. We also experimentally explore weighting schemes based on the relative balance of categorical vs. numerical attributes for a given dataset.
1.3 Thesis Outline
The remainder of this thesis is organized as follows:
Chapter 2 introduces data mining concepts, similarity measurements, and clustering
algorithms.
Chapter 3 introduces the proposed clustering framework for mixed type attribute
datasets.
Chapter 4 discusses experimental results for our proposed framework.
Chapter 5 introduces the contribution of the research and concludes the thesis.
CHAPTER 2
PRELIMINARY
This chapter is organized as follows. First, in section 2.1, we introduce techniques of data
mining or knowledge discovery related to the long-term goal of our research. As our
proposed clustering framework has focused on mixed-attribute datasets, we investigate
measures of similarity and clustering algorithms of each type of attribute dataset. In section
2.2, we summarize four basic similarity measures in an effort to understand means of
measuring similarity within clusters. In section 2.3, we explore conventional clustering
algorithms related to the properties of attributes and types of clustering methods.
2.1 Knowledge Discovery and Data Mining
Our long-term goal is to identify relationships among the concepts that are implicit in the
datasets. To attain this goal, we should first understand the higher-level concepts related to it.
The process of knowledge discovery consists of a series of transformation steps such as
data preprocessing, data mining and post processing of data mining results [19], [21]. Data
preprocessing entails the transformation of raw data into an appropriate format for
subsequent analysis. Data preprocessing includes fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and selecting records and
features related to the data mining task such as data subsetting, normalization,
dimensionality reduction, and feature selection.
Data mining is the process of automatically discovering useful information and finding
novel and useful patterns that might otherwise remain unknown. Data mining tasks can be
divided into two categories: Predictive tasks and descriptive tasks.
Predictive tasks predict the value of a particular attribute based on the value of other
attributes. The attribute to be predicted is known as the target or dependent variable,
whereas the attributes used for making the prediction are known as the explanatory or
independent variables.
Descriptive tasks involve the derivation of patterns (e.g., clusters and correlations) that
summarize the basic relationships in data [22], [23].
Post processing is the integration of data mining results into useful information such as a
decision support system. An example of post processing is visualization, which allows
analysts to explore data and data mining results from a variety of viewpoints.
2.2 Similarity Measures
Similarity measurement is one of the key elements of the clustering process in data mining. The goal of similarity measurement is to estimate how similar the objects in a domain are to each other. Many studies have proposed similarity measures in the data mining, data analysis, and information retrieval fields [42]. The notion of similarity for numerical attribute data is relatively well understood (e.g., Euclidean distance), but measuring similarity for categorical data is not straightforward. Therefore, developing a similarity measure for categorical data remains a challenging issue [42].
A similarity coefficient indicates the strength of relationship between objects [38]. The
more similar two objects are to one another, the higher the similarity coefficient is. We
summarize similarity measures in terms of the type of data, such as dense data and sparse
data. For dense data, there are two representative similarity measures which are correlation
coefficient and Euclidean distance. Jaccard and cosine similarity measures are useful for
sparse data. We summarize these similarity measures as follows:
▪ Euclidean distance

As one of the most common distance measures, Euclidean distance measures the dissimilarity (or, inversely, the similarity) between objects with a distance function. It is used for numerical attribute datasets. The formula is as follows:

$$ d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} $$

where $n$ is the number of dimensions and $x_k$ and $y_k$ are the $k$-th attributes of $X$ and $Y$.
▪ Correlation (Pearson's correlation) coefficient

Pearson's correlation coefficient, commonly called the correlation coefficient, measures similarity in terms of a linear relationship between the attributes of objects. Pearson's correlation coefficient between two data objects $X$ and $Y$ is defined as follows:

$$ corr(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} $$

where $X$ and $Y$ are random variables, $\mu_X$ and $\mu_Y$ are their expected values (e.g., mean values), and $\sigma_X$ and $\sigma_Y$ are their standard deviations. The coefficient is defined only when both standard deviations are finite and nonzero.
▪ Jaccard coefficient

The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the datasets. The Jaccard coefficient is used to handle objects consisting of asymmetric information on binary attributes. The formula is as follows:

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

where $J(A, B)$ is the Jaccard coefficient and $A$ and $B$ are datasets.
▪ Cosine similarity

Cosine similarity is used as a document similarity measure. Documents are often represented as vectors, where each attribute represents the frequency with which a particular term occurs in the document. The formula is as follows:

$$ \cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} $$

where $X \cdot Y = \sum_{k=1}^{n} x_k y_k$ is the vector dot product and $\|X\| = \sqrt{\sum_{k=1}^{n} x_k^2}$ is the length of vector $X$.
▪ Entropy

The entropy concept has also been used as a similarity measure for categorical data in several studies. As an element of information theory, entropy is a measure of the uncertainty of a random variable. Average entropy values are derived from the classical entropy theory, Shannon entropy [12] [13].

The basic idea of entropy-based clustering is that the lowest entropy value between two objects corresponds to the closest objects in the dataset. Entropy-based clustering is a method of finding similar objects in clusters based on their average entropy values, determining the number of clusters, and identifying the location of the cluster centers. By using the expected entropy value, which follows the MDL (Minimum Description Length) principle, it can process incremental data without consuming excessive memory and provides a concise representation of the clusters [15].
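For concreteness, the following Python sketch (illustrative only; the toy vectors and sets are hypothetical and not taken from any dataset used in this thesis) computes the four measures summarized above: Euclidean distance, Pearson's correlation coefficient, the Jaccard coefficient, and cosine similarity.

```python
import math

def euclidean(x, y):
    # d(X, Y) = sqrt(sum_k (x_k - y_k)^2)
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def pearson(x, y):
    # corr(X, Y) = cov(X, Y) / (std(X) * std(Y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xk - mx) * (yk - my) for xk, yk in zip(x, y)) / n
    sx = math.sqrt(sum((xk - mx) ** 2 for xk in x) / n)
    sy = math.sqrt(sum((yk - my) ** 2 for yk in y) / n)
    return cov / (sx * sy)

def jaccard(a, b):
    # J(A, B) = |A intersect B| / |A union B| for sets of asymmetric binary attributes
    return len(a & b) / len(a | b)

def cosine(x, y):
    # cos(X, Y) = (X . Y) / (|X| |Y|)
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk ** 2 for xk in x))
    norm_y = math.sqrt(sum(yk ** 2 for yk in y))
    return dot / (norm_x * norm_y)

# Hypothetical toy data
x, y = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
print(euclidean(x, y), pearson(x, y), cosine(x, y))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```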
2.3 Clustering Algorithms
In general, conventional clustering algorithms have been classified into two categories: hierarchical clustering algorithms and partitional clustering algorithms [31]. A hierarchical algorithm proceeds in either a divisive or an agglomerative way, whereas a partitional algorithm produces a single non-overlapping partitioning of the dataset. Clustering algorithms are also associated with data types: each data type has different characteristics, and clustering algorithms are designed around them. From this point of view, based on the characteristics of the attributes in a dataset, current clustering algorithms can be classified into three categories: categorical, numerical, and mixed type attribute algorithms. We introduce conventional clustering algorithms for these categories in the following subsections.
2.3.1 Categorical Attribute Clustering Algorithms
A categorical attribute takes values from a finite set that is discrete or discontinuous and has no inherent useful ordering. We introduce conventional clustering algorithms for categorical attribute datasets as follows:

The Squeezer algorithm [1] reads each tuple t in sequence through the dataset and then either assigns t to an existing cluster (initially empty) or creates a new cluster from t, based on the similarities between t and the existing clusters. It is robust to noise because it handles outliers, and it is appropriate for clustering data streams. However, when the dataset is extremely large, the clusters consume a large amount of main memory.
The ROCK algorithm [2] is an adaptation of an agglomerative hierarchical clustering algorithm. It starts by regarding each tuple in a dataset as a singleton cluster and then repeatedly merges clusters according to their closeness, using a link-based method: the closeness between two clusters is defined as the sum of the number of links between all pairs of their tuples.

CACTUS [3] is a fast summarization-based algorithm consisting of three phases: summarization, in which information about the collected data is summarized; clustering, in which the summarized information is used to discover a set of candidate clusters; and validation, in which the valid set of clusters is selected from the candidate clusters.
2.3.2 Numerical Attribute Clustering Algorithms
In contrast to categorical attributes, numerical attributes can take values from an infinite set, are ordered and continuous, and can be manipulated with operations such as min, max, and mean. From this point of view, the following conventional algorithms have been used for numerical attribute datasets.

The k-means clustering algorithm [30] partitions the data into k clusters, where each data point belongs to the cluster with the nearest mean. K-means has known problems [32], [33], [35], and [36]. One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution; this algorithm is often called the k-means algorithm [40], [45].
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for very large databases [5]. It clusters multi-dimensional metric data points under a given set of resource constraints, such as memory and time. As a local clustering method, each clustering decision is made without scanning all data points or all currently existing clusters.

The CURE algorithm [29] approaches the clustering of large-scale datasets in two ways that differ from BIRCH. First, it starts by drawing a random sample from the database instead of pre-clustering all data points. Second, it partitions the random sample and clusters the data points in each partition to speed up clustering.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based algorithm for large spatial databases [4], can also be used for numerical attribute datasets. It relies on a density-based notion of clusters to discover clusters of arbitrary shape. The number of clusters is not required as input because the algorithm automatically detects the natural number of clusters [37].
2.3.3 Mixed Attribute Clustering Algorithms
The traditional algorithms summarized in 2.3.1 and 2.3.2 concentrate on clustering datasets with a single attribute type, but they can also be applied to mixed type attribute datasets by converting the data or adapting the algorithms.

Z. Huang [6] presented two algorithms. One is k-modes, which extends the k-means algorithm with a new distance measure for categorical attributes. The other is k-prototypes, which uses a weighted sum that includes Euclidean distance for the numerical attribute values. However, inappropriate weights decided a priori may result in unexpected effects.

C. Li and G. Biswas [7] proposed the SBAC algorithm, which is based on a weighted similarity measure and uses an agglomerative algorithm. It is not appropriate for large-scale datasets due to its computational complexity.
Yosr Naija, Salem Chakhar, Kaouther Blibech, and Riadh Robbana [8] proposed an extension of partitional clustering methods devoted to mixed type attribute datasets. Zengyou He, Xiaofei Xu, and Shengchun Deng [9] proposed a cluster ensemble approach for mixed attribute data. J. Suguna and M. Arul Selvi [24] also proposed ensemble fuzzy clustering for mixed numeric and categorical data. These algorithms convert categorical attribute values into numerical ones before applying their clustering method.

Recently, Amir Ahmad and Lipika Dey [10] proposed a k-means clustering algorithm for mixed numerical and categorical data. Ming-Yi Shih et al. [18] proposed a two-step method for clustering mixed categorical and numerical data: it constructs similarity relationships among categorical attributes based on their co-occurrence, converts the categorical attributes into numeric data, and finally applies hierarchical and partitioning clustering algorithms to the data, including the converted values. As introduced above, several approaches to clustering mixed attribute datasets have been described in the literature.
CHAPTER 3
PROPOSED CLUSTERING FRAMEWORK
3.1 Introduction
In this chapter, we introduce our proposed framework, which we call clustering for
mixed attributes. An overview of the proposed clustering framework appears in Fig 1.
Fig 1 Overview of Proposed Clustering Framework
Our proposed framework starts with the division of mixed attribute datasets into
categorical and numerical attributes sub datasets.
The proposed clustering framework consists of three main steps, as shown in Fig. 1.
In Step 1, we use an entropy based similarity measure on only the categorical attributes and extract candidate cluster numbers by evaluating differences in this measure. We analyze the difference of total entropy among clusters in an exhaustive manner, reducing the number of clusters until all clusters merge into one cluster, and extract candidate cluster numbers from the differences in entropy values.
Second, we apply the extracted candidate cluster numbers K from Step 1 to cluster the
dataset using only numerical attributes (Step 2). Now, we have two clustering results, one
by using only categorical attributes and the other by using numerical ones. Note that the
number of clusters is decided solely by categorical attributes.
In Step 3, a weighting scheme is applied using the degree of balance in number of objects
in the clusters. After the pre-clustering, we can compare how two clustering results are
balanced. The main point of the weighting scheme is to put more weight onto the
better-balanced result between the categorical and the numerical clustering. After determining the weights, the final clustering is performed on the mixed attribute type dataset using the extracted candidate cluster numbers from Step 1 and the weights.
We have two hypotheses in designing our proposed framework:
1) The candidate cluster numbers obtained from the categorical attributes can also serve as candidate cluster numbers for the mixed attribute type dataset. When total similarity is measured on the categorical attributes first and the numerical attributes of the same objects are combined afterward, the variance of the clustering result is smaller than in the opposite case. Since categorical attribute values are neither continuous nor ordered, it is not appropriate to use a classical distance measure for similarity in a categorical attribute dataset, and converting numerical attributes into categorical ones can distort the candidate cluster numbers. We therefore utilize an entropy based similarity measure focused on the categorical attribute sub-dataset.
2) One of the critical conditions for effective clustering is the degree of balance in the number of objects across clusters. Based on the ground truth ratio, we determine that the better-balanced attribute type, categorical or numerical, will receive the higher weight in a mixed attribute type dataset.
We will show that our proposed clustering framework works well for mixed type
attributes datasets in the experimental section.
3.2 Entropy-based similarity measure
Measuring similarity between objects as a criterion function is one of the primary steps in the clustering process. There are many well known methods for measuring distance between objects for the purpose of clustering, but each has its pros and cons. As an element of information theory, entropy can be used to measure the uncertainty of random variables, and on that basis we utilize it as a similarity measure for the categorical attributes in mixed attribute type datasets.

Distance functions such as Euclidean distance are used as similarity measures for numerical attributes, since they capture the inherent notion of distance between numerical values, but they are not suitable for categorical attributes. Measuring similarity for categorical attributes is difficult because their values cannot be directly compared with each other: they are neither ordered nor continuous, whereas numerical attributes are. On account of this problem, we developed an entropy-based similarity measure that can be an effective and practical similarity measure for categorical clustering [16]. We first give the notation of the classical entropy definition, Shannon's entropy [13], which underlies our entropy based similarity measure.
The entropy H(X) is simply defined as follows:

$$ H(X) = - \sum_{x \in X} p(x) \log_2 p(x) $$

where $p(x)$ is the probability mass function of the random variable $x$ and $X$ is the set of possible outcomes of $x$.

We consider a dataset $X = SC + SN$ (where $SC$ is the subset of categorical attributes and $SN$ is the subset of numerical attributes) containing $R$ objects. Let $m = cn + nn$ be the total number of attributes in a given dataset, where $cn$ is the number of categorical attributes and $nn$ is the number of numerical attributes. Then $SC = \{D_1, D_2, \ldots, D_{cn}\}$, where $D_i$ is the $i$-th categorical attribute, and $SN = \{N_1, N_2, \ldots, N_{nn}\}$, where $N_i$ is the $i$-th numerical attribute. $At_i$ is the set of distinct values in the $i$-th categorical attribute $D_i$.

The definition of total entropy in a given dataset can then be stated as [18]:

$$ H(SC) = - \sum_{i=1}^{cn} \sum_{v \in At_i} p(v) \log_2 p(v) $$

where $p(v)$ is the probability of occurrence of value $v$ in the $i$-th categorical attribute $D_i$.
We estimate the entropy of the categorical attribute subset in order to extract candidate cluster numbers for the mixed attribute dataset; these will be used in the subsequent pre-clustering process.
In order to extract candidate cluster numbers, we assume that the dataset $SC$ can be partitioned into $K$ clusters. This can be represented as $C_K = \{C_k\}$ for $1 \le k \le K$, where $1 \le K \le R$ and $C_k$ is a cluster having $n_k$ records, $1 \le n_k \le R - (K - 1)$, in the categorical attribute sub-dataset.
By maximizing the entropy criterion [16], the classical entropy based clustering attempts to find the optimal candidate cluster number $C_K$. The entropy criterion for the optimal candidate cluster number $C_K$ is as follows:

$$ OC(C_K) = \frac{1}{cn} \left( H(SC) - \frac{1}{K} \sum_{k=1}^{K} n_k H(C_k) \right) \qquad (1) $$

where $H(SC)$ is the total entropy of the given dataset and $AH(C_K) = \frac{1}{K} \sum_{k=1}^{K} H(C_k)$ is the average entropy of $C_K$. From equation (1), the weighted cluster entropy term is to be minimized in order to maximize $OC(C_K)$. We denote the average entropy of partition $C_K$ as $AH(C_K)$ in the following sections.
The minimization of expected entropy is conceptually equivalent to other clustering criteria such as maximum likelihood and the dissimilarity coefficient [16]. We therefore first developed an entropy based similarity measure, and we adapt it to extract candidate cluster numbers in the sub-datasets.
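As a worked illustration of equation (1), the short sketch below evaluates the entropy criterion OC(C_K) for a candidate partition, assuming clusters are represented as lists of categorical tuples; the helper names and the toy partition are illustrative assumptions rather than the thesis's code.

```python
import math
from collections import Counter

def entropy(records):
    # Shannon entropy summed over all categorical attributes of the records.
    n = len(records)
    h = 0.0
    if n == 0:
        return h
    for i in range(len(records[0])):
        for c in Counter(r[i] for r in records).values():
            p = c / n
            h -= p * math.log2(p)
    return h

def entropy_criterion(dataset, partition, cn):
    """OC(C_K) = (1/cn) * (H(SC) - (1/K) * sum_k n_k * H(C_k))   -- equation (1)."""
    K = len(partition)
    weighted = sum(len(cluster) * entropy(cluster) for cluster in partition)
    return (entropy(dataset) - weighted / K) / cn

# Hypothetical categorical sub-dataset and a candidate partition with K = 2 clusters
sc = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
partition = [sc[:2], sc[2:]]
print(entropy_criterion(sc, partition, cn=2))
```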
3.3 Extract Candidate Cluster Numbers
As we mentioned in the first assumption in 3.1, to prevent distorting results by
converting numerical attributes into categorical ones, we utilize only the categorical
attribute sub dataset to measure similarity by using an entropy function.
As the difference of total entropy between $AH(C_K)$ and $AH(C_{K+1})$ equals the increase in expected entropy, which correlates with the similarity between clusters, we can extract candidate cluster numbers by exploring the difference of each cluster's average total entropy while clusters are merged in an agglomerative way.

We make the assumption that each object in a dataset is initially regarded as a singleton cluster. As clusters are merged based on the entropy criterion, the difference between $AH(C_K)$ and $AH(C_{K+1})$ varies at each merging step, because the probability distribution of values in the clusters changes in uncertain ways when two clusters are merged. If two highly similar clusters are merged into one, then the average entropy does not change much. However, it varies significantly when two very different clusters are merged.
When two clusters are merged into one, the resulting cluster's entropy increases. To show this, let $C_a C_b$ denote the merger of two clusters $C_a$ and $C_b$, where $C_a$ has $n_a$ records and $C_b$ has $n_b$ records. The following relation based on expected entropy was proved in [18]:

$$ (n_a + n_b) H(C_a C_b) \ge n_a H(C_a) + n_b H(C_b) \qquad (2) $$

This relation shows that the expected entropy always stays the same or increases when clusters are merged, and the average entropy satisfies the analogous relation.

We define the difference of total entropy between clusters ($Diff_{ent}$) as follows:

$$ Diff_{ent}(C_a, C_b) = (n_a + n_b) H(C_a C_b) - n_a H(C_a) - n_b H(C_b) \ge 0 \qquad (3) $$

$$ Diff_{ent}(C_a, C_b) = 0 \text{ when } C_a \text{ is identical to } C_b \qquad (4) $$
Using Eq. (3), our algorithm recursively merges clusters. Initially, each object is considered a singleton cluster, and the initial $Diff_{ent}$ is computed for all possible cluster pairs in the dataset, as shown in Fig. 2. Thus, when n objects exist in the dataset, $n^2/2$ $Diff_{ent}$ values are calculated initially.

Fig 2 Initialization of the difference of entropy ($Diff_{ent}$)

As shown in Fig. 2, we first initialize $Diff_{ent}$ between the singleton clusters in order to merge clusters in a bottom-up way. Since $|Diff_{ent}(C_a, C_b)| = |Diff_{ent}(C_b, C_a)|$, the number of $Diff_{ent}$ values for n objects is $(n \times n)/2$.
Fig 3 Process of Merging Clusters

Fig 3 shows a snapshot of merging clusters. The merging process consists of the following steps (a sketch of the loop in code follows this list):

1) In the presence of n clusters, calculate $Diff_{ent}$ for all possible pairs of clusters. Calculate the average entropy of partition $C_n$ and store it.

2) Find the pair with the minimum entropy difference, say $Diff_{ent}(C_i, C_j)$; the clusters $C_i$ and $C_j$ are then merged. This is because the minimum change in entropy implies potentially better clustering.

3) Determine which cluster will be deleted and which updated: between i and j, the higher-numbered cluster is merged into the lower-numbered cluster. That is, the lower-numbered cluster is updated and the other is deleted. See Fig. 3.

4) Update the $Diff_{ent}$ table by recalculating $Diff_{ent}$ for all possible pairs of the remaining clusters. Note that n = n − 1 now. Calculate the average entropy of partition $C_n$ and store it.

5) Iterate steps 2-4 until the clusters have been merged into one whole cluster.
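The following self-contained Python sketch illustrates the merging loop above, using the Diff_ent of Eq. (3); the per-step recomputation of all pairwise values is intentionally naive, and the function names and toy records are illustrative assumptions rather than the thesis's actual implementation.

```python
import math
from collections import Counter

def entropy(records):
    # Shannon entropy of a cluster, summed over its categorical attributes.
    n = len(records)
    h = 0.0
    for i in range(len(records[0])):
        for c in Counter(r[i] for r in records).values():
            p = c / n
            h -= p * math.log2(p)
    return h

def diff_ent(ca, cb):
    # Diff_ent(C_a, C_b) = (n_a + n_b) H(C_a C_b) - n_a H(C_a) - n_b H(C_b)   (Eq. 3)
    merged = ca + cb
    return len(merged) * entropy(merged) - len(ca) * entropy(ca) - len(cb) * entropy(cb)

def merge_clusters(objects):
    """Agglomerative merging driven by Diff_ent; returns AH(C_K) for K = R .. 1."""
    clusters = [[obj] for obj in objects]        # initially every object is a singleton cluster
    avg_entropy = {}                             # K -> AH(C_K)
    while True:
        K = len(clusters)
        avg_entropy[K] = sum(entropy(c) for c in clusters) / K
        if K == 1:
            break
        # steps 1-2: find the pair of clusters with minimum Diff_ent
        i, j = min(((a, b) for a in range(K) for b in range(a + 1, K)),
                   key=lambda ab: diff_ent(clusters[ab[0]], clusters[ab[1]]))
        # step 3: merge the higher-numbered cluster into the lower-numbered one
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # steps 4-5: Diff_ent values are recomputed on the next iteration
    return avg_entropy

# Hypothetical categorical records
objects = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "z")]
print(merge_clusters(objects))
```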
Using the results of the above merging process, specifically $\{AH(C_i)\}$ for $1 \le i \le R$, we determine the candidate cluster numbers to be used in the final clustering. The main point is to monitor the changes in average entropy during the merging process. As shown in Eq. (2), the average entropy of the resulting clusters increases after merging two different clusters. By comparing $AH(C_K)$ to $AH(C_{K+1})$, we can identify sudden changes in entropy.

Let $D_K$ be the difference of entropy between $AH(C_{K-1})$ and $AH(C_K)$. The algorithm computes the set of entropy differences $D = \{D_K\}$ for all $2 \le K \le R$, until the number of clusters reaches one. By monitoring the changes in $D_K$, the algorithm determines the candidate cluster numbers as follows: 1) find a subset of D, denoted $D_S$, containing all $D_K$ that satisfy $D_{K-1} < D_K$ and $D_K > D_{K+1}$; 2) select the P greatest values of $D_K$ in $D_S$ (P is an input parameter); the corresponding set of K values becomes the candidate cluster numbers. See Fig. 5.

We present the algorithm for extracting cluster numbers in Fig 5, with Fig 4 showing the notations. The extracting cluster number algorithm consists of three steps: initialize average entropy, merge clusters, and extract candidate cluster numbers. These are explained in detail in sections 3.1 and 3.2.
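The candidate-number selection itself can be sketched as follows; it assumes a mapping from K to AH(C_K) such as the one returned by a merging pass like the sketch above, and the AH values in the example are made up purely for illustration.

```python
def candidate_cluster_numbers(avg_entropy, P):
    """Return candidate K values: local maxima of D_K = AH(C_{K-1}) - AH(C_K),
    keeping the P largest. avg_entropy maps K -> AH(C_K)."""
    R = max(avg_entropy)
    D = {K: avg_entropy[K - 1] - avg_entropy[K] for K in range(2, R + 1)}
    # local maxima: D_{K-1} < D_K and D_K > D_{K+1}
    peaks = [K for K in range(3, R) if D[K - 1] < D[K] and D[K] > D[K + 1]]
    peaks.sort(key=lambda K: D[K], reverse=True)
    return sorted(peaks[:P])

# Hypothetical AH(C_K) values for K = 1..8 (entropy grows as clusters are merged)
ah = {1: 2.9, 2: 2.4, 3: 2.1, 4: 1.5, 5: 1.4, 6: 1.2, 7: 0.8, 8: 0.0}
print(candidate_cluster_numbers(ah, P=2))
```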
Fig 4 Notations for algorithm for extracting cluster number
Fig 5 Overview of the extracting cluster number algorithm
3.4 Weighting Scheme for Mixed Attributes
Based on the algorithm described above, the candidate cluster numbers are decided. Using these numbers, the given dataset is clustered with only categorical attributes and with only numerical attributes, respectively, in Step 2 (pre-clustering) of Fig. 1. Then, we analyze how well each clustering result is balanced.
In general, well structured clustering shows balanced numbers of objects across clusters. By comparing the balance of the clustering result obtained with categorical attributes to that obtained with numerical ones, our approach gives priority to one attribute type over the other. As stated in the second hypothesis above, we thus consider the weight of each attribute type in a dataset before finally clustering the mixed attribute type dataset. The ground truth ratio of the given dataset is considered as another condition of our weighting scheme: a ground truth ratio of true to false that is roughly even, e.g., true : false ≅ 1 : 1, is regarded as corresponding to a well balanced clustering. Based on the ground truth ratio, we give more weight to the better balanced attribute type, categorical or numerical, to improve the results of the final clustering.
However, it is hard to formulate a generally optimal weighting scheme for mixed attribute type datasets due to the difficulty of extracting correlations between attribute types. Our weighting scheme therefore focuses only on the given datasets. The weight condition of our mixed attribute clustering is defined as follows:

1) $\omega_t = \omega_c + \omega_n$, where $\omega_c$ is the weight for categorical attributes and $\omega_n$ is the weight for numerical attributes;

2) $\omega_t = 1$, $0 \le \omega_c \le 1$, and $0 \le \omega_n \le 1$.
The last condition is based on the experimental results of pre-clustering with the candidate cluster numbers. After extracting the candidate cluster numbers and applying the weighting scheme for mixed attributes, we perform the clustering of the mixed attribute dataset, and we expect the greatest accuracy to be achieved under these settings in our experimental results.
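The thesis does not give a closed-form balance measure, so the sketch below is only one possible reading of the weighting condition: the degree of balance is measured here as the deviation of cluster sizes from a uniform split, and the 0.6/0.4 split given to the better-balanced attribute type mirrors the weights that perform well in the later heart disease experiments; all names and numbers are assumptions made for illustration.

```python
def balance_degree(cluster_sizes):
    # Degree of balance: 1.0 when all clusters have equal size, lower otherwise.
    # (Illustrative measure; the thesis does not prescribe a specific formula.)
    n = sum(cluster_sizes)
    ideal = n / len(cluster_sizes)
    deviation = sum(abs(s - ideal) for s in cluster_sizes) / n
    return 1.0 - deviation

def choose_weights(cat_sizes, num_sizes, favored_weight=0.6):
    """Give the larger weight to the better-balanced pre-clustering result,
    keeping w_c + w_n = 1 as required by conditions 1) and 2) above."""
    if balance_degree(cat_sizes) >= balance_degree(num_sizes):
        w_c = favored_weight
    else:
        w_c = 1.0 - favored_weight
    return w_c, 1.0 - w_c

# Hypothetical pre-clustering results (cluster sizes) for K = 5 clusters
print(choose_weights([60, 55, 50, 58, 47], [120, 30, 40, 50, 30]))
```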
3.5 Clustering methods for mixed attribute
Once the weights are determined, the final clustering is performed based on the following similarity measure using the weights:

$$ S_M = \omega_C S_C + \omega_N S_N \qquad (5) $$

where $S_M$ is the similarity for the mixed attribute type dataset (i.e., the total similarity) and $S_C$ and $S_N$ are the similarities of the categorical and numerical attributes, respectively. With the $S_M$ values, our algorithm utilizes the agglomerative hierarchical clustering method for the final result.
We utilize two hierarchical clustering methods: moving center-based clustering and object-based clustering. In moving center-based clustering, the distance between two clusters A and B is taken to be the average of all distances between pairs of objects a in A and b in B, that is, the mean distance between elements of the two clusters, and the pair of clusters with the minimum distance is merged. The concept of center-based clustering is similar to mean or average linkage clustering. The equation is as follows:

$$ d(A, B) = \frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} d(a, b) $$

where $d(a, b)$ is the Euclidean distance measure.
However, this equation applies to numerical attributes, not to categorical attributes. For categorical attributes, we use the difference of average entropy between cluster n and cluster n−1. In contrast to center-based clustering, object-based clustering merges clusters when the distance between all of the objects in each cluster is minimal. Although the complexity of object based clustering is higher than that of center based clustering, its experimental results are superior to those of center based clustering.
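As a rough illustration of this final step, the sketch below combines per-attribute-type pairwise dissimilarity matrices with the weights of Eq. (5) (applied here to dissimilarities rather than similarities) and runs average-linkage agglomerative clustering with SciPy; the input matrices, the weights, and the use of SciPy in place of the thesis's own implementation are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def mixed_clustering(d_cat, d_num, w_c, w_n, k):
    """Combine per-type pairwise dissimilarities with the weights of Eq. (5)
    and cut an average-linkage dendrogram into k clusters."""
    d_mixed = w_c * d_cat + w_n * d_num             # weighted combination of dissimilarities
    condensed = squareform(d_mixed, checks=False)   # condensed form expected by SciPy
    tree = linkage(condensed, method="average")     # average linkage ~ center-based clustering
    return fcluster(tree, t=k, criterion="maxclust")

# Hypothetical 4x4 symmetric dissimilarity matrices for 4 objects
d_cat = np.array([[0, 1, 3, 3], [1, 0, 3, 3], [3, 3, 0, 1], [3, 3, 1, 0]], dtype=float)
d_num = np.array([[0, 2, 4, 5], [2, 0, 4, 4], [4, 4, 0, 2], [5, 4, 2, 0]], dtype=float)
print(mixed_clustering(d_cat, d_num, w_c=0.6, w_n=0.4, k=2))
```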
CHAPTER 4
EXPERIMENTAL RESULTS
4.1 Experimental Setup
In this chapter, we describe experiments conducted with a heart disease dataset and a German credit dataset from the UCI Data Repository [19], for which the proposed framework has been developed and is being generalized.
The first dataset, the heart disease dataset, consists of 270 objects with 13 attributes. There are 7 categorical attributes: sex, chest pain type, fasting blood sugar, resting electrocardiographic results, exercise induced angina, the slope of the peak exercise ST segment, and thal (3 = normal; 6 = fixed defect; 7 = reversible defect). There are 6 numerical attributes: age, resting blood pressure, serum cholesterol in mg/dl, maximum heart rate achieved, oldpeak (ST depression induced by exercise relative to rest), and number of major vessels (0-3) colored by fluoroscopy. The dataset has ground truth per object: the confirmed positive or negative diagnosis of heart disease.
We first divide the 13 attributes into 7 categorical attributes and 6 numerical attributes to
experiment with the proposed clustering framework. Fig 6 is an overview of our experimental dataset. Fig 7 depicts the sub-datasets created by dividing the heart disease dataset into
categorical attributes and numerical attributes.
Fig 6 Overview of heart disease dataset
Fig 7 categorical attributes (Left) and numerical attributes (Right).
The second dataset, the German credit dataset, consists of 1000 objects with 20 attributes. There are 13 categorical attributes, such as 'status of existing checking account', 'credit history', 'purpose', 'savings account/bond', 'present employment since', 'personal status and sex', 'other debtors/guarantors', 'property', 'housing', 'job', 'telephone', and 'foreign worker'. There are 7 numerical attributes: 'duration in month', 'credit amount', 'installment rate in percentage of disposable income', 'present residence since', 'age in years', 'number of existing credits at this bank', and 'number of people being liable to provide maintenance for'. The dataset has ground truth per object about the confirmed credit of each customer, i.e., a positive or negative credit classification.
Fig 8 Overview of German credit dataset
Fig 9 categorical attributes (Left) and numerical attributes (Right).
We divide the 20 attributes into 13 categorical attributes and 7 numerical attributes to
experiment with the proposed clustering framework. Fig 8 is an overview of our experimental dataset. Fig 9 depicts the sub-datasets created by dividing the German credit dataset into
categorical attributes and numerical attributes.
4.2 Experimental Results
4.2.1 A Heart Disease Dataset
We show the results of our experiments with our proposed framework. Fig 3 in section 3
illustrated the increase of average entropy caused by merging clusters for the heart disease dataset.
Fig 10 Average entropy for each cluster number
Although the experimental data, the heart disease dataset, contains 270 objects, we omit cluster numbers above 20 in order to show the increase in average entropy clearly. As described in section 3, Fig 10 verifies the increase in entropy when merging clusters.
Based on the result of increasing average entropy by merging all clusters, we extract the
difference between cluster number K and cluster number K+1 as we described in section 3.
Fig 11 The difference of average entropy for heart disease dataset

Fig 11 demonstrates how the difference of the average entropy, i.e., $D_K$, varies while merging clusters. For example, $D_5$ denotes the difference of the average entropies between K=5 and K=4. When 6 clusters are merged into 5 clusters, the difference of average entropy is relatively small. However, when 5 clusters are merged into 4 clusters, the difference of average entropy increases sharply. When 4 clusters are merged into 3 clusters, it becomes relatively small again.
An abrupt change in entropy means that dissimilar clusters have been merged, so such a merge is not desirable. Thus, K=5 becomes a candidate cluster number. In the same way, we identify cluster numbers 8 and 11. Finally, the algorithm generates the subset $D_S$ by choosing the top three values (i.e., P=3 is the input parameter used in this experiment). So $D_S$ corresponds to the candidate cluster numbers {5, 8, 11}.
After extracting the candidate cluster numbers, we need to determine the weight values for the given dataset. Following the weighting scheme proposed in section 3, we cluster each attribute-type sub-dataset of the whole dataset with the candidate cluster numbers.
Fig 12 An example of well-balanced clusters
Fig 12 shows a comparison of the balance of a categorical attribute clustering and a numerical attribute clustering under the two clustering methods, center based clustering and object based clustering. Table 1 shows that the categorical attribute clusters are better balanced than the numerical ones as the number of clusters changes (e.g., K=5, K=6) in the given dataset of 270 records, where each object initially forms a singleton cluster. For example, when cluster C4 in [K=5] is divided into C4 and C6 in [K=6], the numbers of objects in the clusters become better balanced. Based on this result, we can apply the weighting scheme outlined in section 3 to the given dataset; the condition of our weighting scheme is verified by this experimental result.
With the set of candidate cluster numbers and our proposed weighting scheme for attributes, we can cluster the given dataset. We measure the accuracy of the clustering result, defined as follows:

$$ Accuracy = \frac{1}{N} \sum_{i=1}^{k} O_i $$

where N is the number of all instances in the given dataset and $O_i$ is the number of instances occurring in both cluster i and its corresponding class, i.e., the class with the maximal overlap.
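A small sketch of this accuracy measure, assuming the cluster assignments and ground-truth class labels are given as plain Python lists; the example labels are hypothetical.

```python
from collections import Counter

def clustering_accuracy(cluster_labels, class_labels):
    """Accuracy = (1/N) * sum_i O_i, where O_i counts the majority
    ground-truth class inside cluster i."""
    n = len(class_labels)
    correct = 0
    for c in set(cluster_labels):
        classes_in_c = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        correct += Counter(classes_in_c).most_common(1)[0][1]   # O_i for cluster c
    return correct / n

# Hypothetical result: 2 clusters over 6 objects with a positive/negative ground truth
print(clustering_accuracy([1, 1, 1, 2, 2, 2], ["pos", "pos", "neg", "neg", "neg", "pos"]))
```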
One class is the negative class and the other is the positive class. We used two kinds of hierarchical clustering methods: center based clustering and object based clustering. Center based clustering uses the average distance of the objects in a cluster as the measure of inter-cluster similarity and updates this average while merging with another cluster, whereas object based clustering measures the similarity of all objects between clusters.
Table 1 Center-based clustering
Table 1 shows the result of center based clustering. We find that when more weight is given to the categorical attributes, the accuracy is higher; the weights [0.2-0.8] refer to the categorical attributes. We show cluster numbers from 5 to 11 to focus on the candidate cluster numbers. The greatest accuracy of center based clustering is achieved when the cluster number K is 5 and the weight on the categorical attributes is 0.7.
Table 2 Object-based Clustering
Table 2 shows the result of object based clustering. The results of object based clustering follow a similar trend to center based clustering. Comparing center based clustering with object based clustering, we found that object based clustering gives better accuracy. We therefore determine that the highest clustering accuracy for this dataset is achieved when the cluster number K is 8 and the weight on the categorical attributes is 0.6.
Fig 13 Accuracy of all cluster numbers with different weights
Fig 13 shows that the variance of the average accuracy across cluster numbers is more stable when more weight is placed on the categorical attributes than on the numerical attributes. For example, with weight [0.4] on the categorical attributes, the variance of each cluster number's accuracy is larger than with weight [0.6]. Fig 13 supports the hypothesis that the better balanced attribute type, categorical or numerical, should receive the higher weight in a mixed attribute dataset, which is consistent with the proposed clustering framework.
Fig 14 Average accuracy of all cluster numbers with different weights

Fig 14 shows the average accuracy over all cluster numbers for the different weights from Fig 13. It also shows that the weight [0.6] is the best weight across all cluster numbers.
Table 3 Comparison of average accuracy for the heart disease dataset

Table 3 shows the average accuracy of clustering for categorical weights in [0.2-0.4] and in [0.6-0.8]. From Table 3, we see that the accuracy with categorical weights [0.6-0.8] is higher than with [0.2-0.4], which indicates that our proposed weighting scheme is reasonable for the given dataset.
Fig 15 Average accuracy of each cluster number

Fig 15 shows the average accuracy of each cluster number. The solid line represents object based clustering, whereas the dotted line represents center based clustering. As shown in Fig 15, the average accuracy of object based clustering is better than that of center based clustering overall. Cluster number 8 with object based clustering is also the best cluster number for this dataset.
We also compared our proposed clustering framework with a previous clustering method, the K-means algorithm. To cluster the given dataset using the K-means algorithm, we first transformed the numerical attributes into categorical attributes and evaluated the accuracy of the clustering results. Fig 16 shows the comparison of average accuracy between the proposed clustering framework and the K-means algorithm.
Fig 16 Comparison proposed approach with K-mean algorithm
Based on our experimental results, the weight ratio for this dataset in our proposed framework is 0.6 for categorical attributes and 0.4 for numerical attributes. As shown in Fig 16, the proposed clustering framework works well from 5 clusters to 11 clusters on the given dataset.
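One plausible way to set up such a K-means baseline, sketched here under our own assumptions rather than as the exact preprocessing used in the experiments, is to bin the numerical attributes, one-hot encode the resulting all-categorical data, and run scikit-learn's KMeans on the encoding; all names below are illustrative.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from sklearn.cluster import KMeans

def kmeans_baseline(X_num, X_cat, n_clusters=5, n_bins=5, random_state=0):
    # Illustrative K-means baseline: discretize numerical attributes into bins,
    # then one-hot encode all (now categorical) attributes before clustering.
    binner = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform")
    X_num_binned = binner.fit_transform(X_num).astype(int).astype(str)
    X_all_cat = np.hstack([np.asarray(X_cat, dtype=str), X_num_binned])
    X_encoded = OneHotEncoder().fit_transform(X_all_cat).toarray()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return km.fit_predict(X_encoded)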
4.2.2 German Credit Dataset
As described in 4.2, the German credit dataset consists of 1000 objects with 20 attributes each, so we focused on cluster numbers from 1 to 20 in our experiments, because the increase in average entropy between clusters was insignificant beyond 20 clusters. As described in section 3, Fig 10 verifies the increase in entropy when merging clusters. Based on the increase of average entropy obtained by merging all clusters of the given dataset, we extract the difference between cluster number K and cluster number K+1, as described in section 3.
Fig 17 The difference of average entropy for the German credit dataset
Fig 17 demonstrates how the difference of the average entropy, i.e., D_K, varies while merging clusters. For example, D_4 denotes the difference of the average entropies between K=4 and K=3. When 4 clusters are merged into 3 clusters, the difference of average entropy is relatively small. However, when 5 clusters are merged into 4 clusters, the difference of average entropy increases sharply. When 3 clusters are merged into 2 clusters, it becomes relatively small again. An abrupt change in entropy means that dissimilar clusters are being merged, so such a merge is not desirable. Thus, K=4 becomes a candidate cluster number.
In the same way, we identify cluster numbers 8 and 11. Finally, the algorithm generates a subset, D_S, by choosing the top three values (i.e., P=3 is the number of candidate cluster numbers used as the input parameter in the experiment). So D_S = {4, 7, 11}, which represents the candidate cluster numbers.
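To illustrate the selection step, the sketch below (with hypothetical values and names) computes the differences D_K from a table of average entropies indexed by cluster number, reading D_K as the entropy change incurred when K clusters are merged into K-1 as described above, and keeps the P largest jumps as candidate cluster numbers; this is one reading of the procedure, not a verbatim reimplementation.

def candidate_cluster_numbers(avg_entropy, p=3):
    # avg_entropy: dict mapping a cluster number K to the average entropy at K clusters
    # (average entropy grows as clusters are merged, i.e., as K decreases).
    # p: how many candidates to keep (P in the text).
    ks = sorted(avg_entropy)
    # D_K: jump in average entropy between K clusters and K - 1 clusters.
    differences = {k: avg_entropy[k - 1] - avg_entropy[k]
                   for k in ks if (k - 1) in avg_entropy}
    # A large jump means dissimilar clusters would be merged, so that K is a candidate.
    top = sorted(differences, key=differences.get, reverse=True)[:p]
    return sorted(top)

# Hypothetical entropies for K = 2..6; the largest jumps occur at K = 4 and K = 5.
example = {2: 0.95, 3: 0.90, 4: 0.60, 5: 0.45, 6: 0.43}
print(candidate_cluster_numbers(example, p=2))  # -> [4, 5]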
After extracting the candidate cluster numbers, we need to determine the weight values for the given dataset. Following the weighting scheme proposed in section 3, we cluster each single-type attribute sub dataset of the whole dataset with the candidate cluster numbers. With the set of candidate cluster numbers and the proposed attribute weights, we then cluster the given dataset and measure the accuracy of the clustering results. As shown in 4.1.2, object based clustering yields better accuracy than center based clustering, so we used object based clustering for the German credit dataset.
Table 4 Average accuracy of German credit data
Table 4 again shows that object based clustering yields better accuracy than center based clustering. We therefore conclude that, for this dataset, the highest accuracy is achieved when the cluster number K is 4 and the weight on the categorical attributes is 0.2. Based on the ground truth ratio, which is the other condition of our proposed weighting scheme as described in 3.4, we found better accuracy when more weight was placed on the numerical attributes than on the categorical ones for this dataset.
Fig 18 Average accuracy over all cluster numbers with different weights
Fig 18 shows the average accuracy over all cluster numbers for each weight on the given dataset. It shows that the weight 0.2 is the best weight across all cluster numbers.
Table 5 Average accuracy for different weights
Table 5 compares the average clustering accuracy for the weight ranges [0.2-0.4] and [0.6-0.8]. From Table 5 we see that the accuracy with categorical weights in [0.2-0.4] is higher than with weights in [0.6-0.8], which indicates that our proposed weighting scheme is reasonable for the given dataset.
Fig 19 Average accuracy for each cluster number
As shown in Fig 19, the highest accuracy is achieved when the number of clusters is 4 for the given dataset.
We also compared our proposed clustering framework with the K-means algorithm. As before, to cluster the given dataset with K-means, we first transformed the numerical attributes into categorical attributes and then evaluated the accuracy of the clustering results. Fig 20 presents the comparison of average accuracy between the proposed clustering framework and the K-means algorithm. Based on our experimental results, the weight ratio for this dataset in our proposed framework is 0.2 for categorical attributes and 0.8 for numerical attributes.
Fig 20 Comparison of the proposed approach with the K-means algorithm (German credit dataset)
As shown in Fig 20, the average accuracy of our proposed clustering framework is higher than that of the K-means approach from 4 clusters to 11 clusters on the given dataset.
As we expected, our experiments showed that the cluster numbers extracted from the categorical attribute subset are plausible candidate cluster numbers for the mixed attributes dataset, and that our weighting scheme improves the accuracy of the mixed attributes clustering results. The results of our experiments on the heart disease dataset and the German credit dataset showed that our proposed framework works effectively with mixed attribute datasets.
CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 Conclusion
In this dissertation, we propose a clustering framework that is effective, in terms of accuracy, for a given mixed type attributes dataset, and we show its potential for knowledge discovery in databases.
Previous clustering algorithms have focused on a single attribute type, either numerical or categorical. Therefore, those algorithms are not appropriate, by themselves, for mixed type attribute datasets. Existing mixed attribute clustering algorithms still have challenges to be solved; one of these challenges is the loss of information caused by converting attributes in mixed type attribute datasets.
Our clustering algorithm performed well without any attribute transformation, which prevents the loss of information. We developed a similarity measure based on an entropy function and designed an algorithm for extracting the distribution of clusters for a given mixed type attributes dataset.
The experimental results showed that a candidate cluster number extracted from the categorical attributes can serve as a candidate cluster number for the mixed attribute dataset as a whole. Finally, we proposed a weighting scheme based on the degree of balance in the number of objects and the ground truth ratio of the given dataset, which improved our experimental results in terms of accuracy.
With the proposed clustering framework, we have made three contributions:
1) The proposed framework clusters mixed type attribute datasets without losing the characteristics of the attributes.
2) The proposed framework extracts candidate cluster numbers with an entropy-based similarity measure, and the best clustering accuracy for the given dataset is obtained at one of these candidate cluster numbers.
3) We experimentally explored weighting schemes based on the relative balance of categorical vs. numerical attributes and the ground truth ratio of the given dataset.
As research on clustering mixed type attribute datasets grows, we increasingly need to consider conditions such as large scale and rapidly increasing quantities of data. Our next research will focus on datasets with such conditions.
5.2 Future Work
Our research has revealed a number of directions and possibilities for extension and future work. Specifically, regarding clustering large scale datasets with mixed type attributes, we plan to: (1) analyze existing subspace clustering algorithms [61], which focus on reducing dimensions (attributes) in a dataset, for the task of detecting all clusters in large scale datasets with mixed type attributes, (2) investigate feature selection techniques to detect correlations between different attribute types, and (3) survey alternative weighting schemes to improve the proposed framework in terms of accuracy.
Given the rapidly increasing interest in Big Data, such as SNS (Social Network Service) data and transportation data, the scalability of our framework and approach to very large datasets is a clear priority. The clustering of such big datasets in real time is a key issue here. We plan to investigate clustering algorithms for very large, dynamically growing complex datasets and conduct experiments with them.
As our long term goal, as stated in the introduction, we plan to improve the clustering framework for mixed type attribute datasets in terms of scalability. In order to discover unknown useful concepts and build more robust relationships among those concepts in a domain ontology (semantic specification), we plan to integrate the clustering framework for large-scale datasets with mixed type attributes with a framework for domain ontology learning.
REFERENCES
[1] Z. He, X. Xu, S. Deng, “Squeezer: An Efficient Algorithm for Clustering Categorical
Data,” Journal of Computer Science and Technology, vol. 17, no. 5, pp. 611-625,
2002.
[2] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, “ROCK: A Robust Clustering
Algorithm for Categorical Attributes”, Int. Conf. Data Engineering, pp.512-521,
Sydney, Australia, Mar 1999.
[3] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, “CACTUS- Clustering
Categorical Data Using Summaries”, Int. Conf. Knowledge Discovery and Data
Mining, pp. 73-83, 1999.
[4] M. Ester, H. P. Kriegel, J. Sander, X. Xu, “A Density-based Algorithm for Discovering
Clusters in Large Spatial Databases”, Int. Conf. Knowledge Discovery and Data
Mining(KDD ’96), pp. 226-231, Aug 1996.
[5] T. Zhang, R. Ramakrishnan, M. Livny, “An Efficient Data Clustering Method for Very
Large Databases”, Proc. of the ACM SIGMOD Int’l Conf. Management of Data, pp.
73-84, 1999.
[6] G. Karypis, E. H. Han, V. Kumar, “A Hierarchical Clustering Algorithm Using
Dynamic Modeling”, IEEE Computer, Vol. 32, No. 8, pp. 68-75, 1999.
[7] Z. Huang, “Extensions to the k-Means Algorithm for Clustering Large Data Sets with
Categorical Values”, Data Mining and Knowledge Discovery, pp. 283-304, 1998.
[8] C. Li, G. Biswas, “Unsupervised Learning with Mixed Numeric and Nominal Data”,
IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 4, 2002.
[9] Yosr Naija, Salem Chakhar, Kaouther Blibech, Riadh Robbana, “Extension of
Partitional Clustering Methods for Handling Mixed Data“, ICDMW, pp. 257-266
IEEE Intl. Conf on Data Mining Workshops, 2008.
[10] He, Z, Xu, X. and Deng, S. “Clustering Mixed Numeric and Categorical Data: A
Cluster Ensemble Approach”, ARXiv Computer Science e-prints, 2008.
[11] Amir Ahmad, Lipika Dey, “A k-mean clustering algorithm for mixed numeric and
categorical data”, Data & Knowledge Engineering, Vol. 63, Issue 2, pp. 503-527, 2007.
[12] Han, J. Kamber, M., “Data Mining: Concepts and Techniques”, Morgan Kauffman,
ISBN 1-55860-489-8, CA, USA
[13] C.E. Shannon, “A mathematical theory of communication,” Bell System Technical
Journal, 1948.
[14] Yao, J., Dash, M., Tan, S.T., and Liu, H., “Entropy-based Fuzzy Clustering and Fuzzy
Modeling”, Fuzzy Sets and Systems (Elsevier), Vol. 113, pp. 381-388, ISSN 0165-0114,
2000.
[15] Barbara, D., Couto, J., Li, Y., “COOLCAT: an entropy-based algorithm for categorical
clustering”, Proceedings of the Eleventh ACM CIKM Conference, pp. 582-589, 2002
[16] Tao Li, Sheng Ma, Mitsunori Ogihara, “Entropy-Based Criterion in Categorical
Clustering”, Proceeding of the twenty-first international conference on Machine
Learning, pp. 68, Canada, 2004
[17] C.H. Cheng, A. W.C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining
numerical data. Proc. of ACM SIGKDD Conference, 1999.
[18] Keke Chen, Ling Liu, "The Best K for Entropy-based Categorical Data Clustering",
SSDBM 2005: 235-262.
[19] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, “Introduction to Data Mining”,
Addison Wesley, 2006.
[20] Youn, S. and McLeod, D., "Spam Email Classification using an Adaptive Ontology",
Journal of Software (JSW), ISSN:1796-217X, Volume 2, Issue 3, pages 43-55, 2007.
[21] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth., “From Data Mining to Knowledge
Discovery: An Overview”, In Advances in Knowledge Discovery and Data Mining,
Pages 1-34. AAAI Press, 1996.
[22] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu., “Mining Frequent Patterns in Data
Streams at Multiple Time Granularities”, Next Generation Data Mining, pages 191-212.
AAAI, 2003.
[23] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan., “Clustering Data
Streams: Theory and Practice”, IEEE Transactions on Knowledge and Data
Engineering, 15(3):515-528, 2003.
[24] J. Suguna and M. Arul Selvi., " ensemble fuzzy clustering for mixed numeric and
categorical data", International Journal of Computer Application, Vol 42-No3, Mar
2012
[25] Fisher and Douglas H., "Improving inference through conceptual clustering",
Proceedings of the 1987 AAAI Conferences, pp. 461–465, July 1987.
[26] Huang Z., "A fast clustering algorithm to cluster very large categorical data sets in data
mining", In Proc SIGMOD workshop on Research Issues on Data Mining and
Knowledge Discovery, pp. 146-151, May 1997.
[27] Z. He, X. Xu, & S. Deng, "Scalable algorithms for clustering categorical data",
Journal of Computer Science and Intelligence Systems, 1077-1089, 2005.
[28] Parul Agarwal, M. Afshar Alam, Ranjit Biswas, "Issues, Challenges and Tools of
Clustering Algorithms", IJCSI, Vol. 8, Issue 3, No. 2, 2011.
[29] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim, "CURE: An efficient clustering
algorithm for large databases", Information Systems, Vol. 26, No. 1, pp. 35-58, 2001.
[30] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth
Silverman, and Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis
and Implementation", IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 24, No.7, July 2002.
[31] Guojun Gan, Chaoqun Ma, and Jianhong Wu, "Data clustering: theory, algorithms, and
applications", ASA-SIAM series on statistics and applied probability, 2007
[32] P.K. Agarwal and C.M. Procopiuc, " Exact and Approximation Algorithms for
Clustering", Proc. Ninth Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 658-667, Jan.
1998.
[33] S. Arora, P. Raghavan, and S. Rao, "Approximation Schemes for Euclidean k-median
and Related Problems", Proc. 30th Ann. ACM Symp. Theory of Computing, pp.
106-113, May 1998.
[34] S. Kolliopoulos and S. Rao, "A Nearly Linear-Time Approximation Scheme for the
Euclidean k-median Problem", Proc. Seventh Ann. European Symp. Algorithms, pp.
362-371, July 1999.
[35] M.R. Garey and D.S. Johnson, "Computers and Intractability: A Guide to the Theory
of NP-Completeness", New York: W.H. Freeman, 1979.
[36] A. Moore, "Very Fast EM-Based Mixture Model Clustering Using Multiresolution
kd-Trees", Proc. Conf. Neural Information Processing Systems, 1998.
[37] El-Sonbaty, Y., Ismail, M., and Farouk, M., "An efficient density based clustering
algorithm for large databases", In 16th IEEE international conference on tools with
artificial intelligence, pp. 673-677, 2004.
[38] Everitt,B., Landau, S., and Leese, M., "Cluster Analysis", 4th edition. New York:
Oxford University Press, 2001.
[39] R.A. Jarvis and Edward A. Patrick, "Clustering Using a Similarity Measure Based on
Shared Near Neighbors", IEEE Transactions on Computers, Vol. C-22, No. 11, Nov
1973.
[40] E. Forgey, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of
Classification", Biometrics, vol. 21, p. 768, 1965.
[41] Guadalupe J. Torres, Ram B. Basnet, Andrew H. Sung, and Srinivas Mukkamala, "A
Similarity Measure for Clustering and its Applications", Proceedings of World
Academy of Science, Engineering and Technology, Vol. 31, ISSN 1307-6884,
pp.490-496, Vienna, Austria, July 2008.
[42] Shyam Boriah, Varun Chandola, Vipin Kumar, "Similarity Measures for Categorical
Data: A Comparative Evaluation", In Proceedings of SIAM Data Mining Conference,
2008.
[43] D. Gibson, J. Kleinberg, and P. Raghavan., "Clustering categorical data: an approach
based on dynamical systems", The VLDB Journal, 8(3) pp.222-236, 2000
[44] Loet Leydesdorff, "Similarity Measures, Author Cocitation Analysis, and Information
Theory", Journal of the American Society for Information Science & Technology, 56(7),
pp.769-772, 2005.
[45] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate
Observations", Proc. Fifth Berkeley Symp. Math. Statistics and Probability, vol. 1, pp.
281-296, 1967.
[46] Gautam Das, Heikki Mannila, "Context-Based Similarity Measures for Categorical
Databases", PKDD 2000, LNAI 1910, pp. 201−210, 2000.
[47] M-J. Lesot and M. Rifqi, H. Benhadda, "Similarity measures for binary and numerical
data: a survey", Int. J. Knowledge Engineering and Soft Data Paradigms, Vol. 1, No. 1,
2009.
[48] Dekang Lin, "An Information-Theoretic Definition of Similarity", ICML Proceedings
of the Fifteenth International Conference on Machine Learning, pp.296-304, 1998
[49] S. Ferilli, T.M.A. Basile, N. Di Mauro, M. Biba, and F. Esposito, "
Generalization-based Similarity for Conceptual Clustering", LNCS Vol. 4944 pp.13-26,
2008.
[50] Shoji Hirano, Shusaku Tsumoto, Tomohiro Okuzaki, and Yutaka Hata, " A Clustering
Method for Nominal and Numerical Data Based on Rough Set Theory", JSAI 2001
Workshops , LNAI 2253, pp. 400-405, 2001.
[51] Cheng Xiao, Dequan Zheng, Yuhang Yang, "Automatic Domain-Ontology
Structure and Example Acquisition from Semi-Structured Texts", Sixth International
Conference on Fuzzy Systems and Knowledge Discovery, 2009.
[52] Yen-Ting Kuo, Andrew Lonie, Liz Sonenberg, "Domain Ontology Driven Data
Mining", ACM SIGKDD Workshop on Domain Driven Data Mining, 2007.
[53] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, "From Data Mining
to Knowledge Discovery", American Association for Artificial Intelligence Vol. 17
Num 3, 1996.
[54] David Sánchez and Antonio Moreno, "Learning Medical Ontologies from the Web",
LNAI 4924, pp. 32–45, 2008
[55] Wen ZHOU, Zong-tian LIU, Yan ZHAO, "Ontology Learning by Clustering Based on
Fuzzy Formal Concept Analysis", 31st Annual International Computer Software and
Applications Conference(COMPSAC 2007), 2007.
[56] Bin Zhang, "Comparison of the Performance of Center-Based Clustering Algorithms",
PAKDD 2003, LNAI 2637, pp. 63–74, 2003.
[57] Nor Hafizah Abd. Razak, Noridayu Manshor, Mandava Rajeswari, and Dhanesh
Ramachandram, "Object based Clustering using Hybrid Algorithms", Proceedings of
the 2008 Fifth International Conference on Computer Graphics, Imaging and
Visualisation, pp. 12-17, 2008.
[58] Rui Xu, Donald Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on
Neural Networks, Vol. 16, 2005.
[59] Enhong CHEN and Gaofeng WU, "An Ontology Learning Method Enhanced by
Frame Semantics", Proceedings of the Seventh IEEE International Symposium on
Multimedia, 2005.
[60] D. Barbara, Y. Li, and J. Couto., "Coolcat: an entropy-based algorithm for categorical
clustering", Proc. of ACM Conf. on Information and Knowledge Mgt. (CIKM), 2002.
[61] Chung, S. and McLeod, D., “Dynamic Pattern Mining: An Incremental Data
Clustering Approach”, Journal on Data Semantics, Lecture Notes in Computer
Science, Springer, Vol. 2 pp. 85-112, 2005
[62] Jun, J., Chung S., and McLeod, D., “Subspace Clustering of Microarray Data based on
Domain Transformation”, VLDB Workshop on Data Mining on Bioinformatics, 2006.
[63] Xiao Cai, Feiping Nie, Heng Huang, “Multi-View K-Means Clustering on Big Data”,
Proceedings of the Twenty-Third International Joint Conference on Artificial
Intelligence, 2013.
[64] R. L. F. Cordeiro, A. J. M. Traina, C. Faloutsos, and C. Traina Jr., “Finding clusters in
subspaces of very large, multi-dimensional datasets”, In ICDE, pp. 625–636, 2010.
[65] Robson L. F. Cordeiro, Caetano Traina Jr., and Agma J. M. Traina, “Clustering Very
Large Multi-dimensional Datasets with MapReduce”, 17th ACM SIGKDD Conference
on Knowledge Discovery and Data Mining, 2011.
Abstract
We propose an efficient approach to clustering datasets with mixed type attributes (both numerical and categorical), while minimizing information loss during clustering. Real world datasets, such as medical datasets, bio datasets, transactional datasets and their ontologies, contain mixed attribute types.
However, most conventional clustering algorithms have been designed for and applied to datasets containing a single attribute type (either numerical or categorical). Recently, approaches to clustering mixed attribute type datasets have emerged, but they are mainly based on transforming attributes so that conventional algorithms can be applied directly. The problem with such approaches is the possibility of distorted results due to the loss of information, because a significant portion of the attribute values can be removed in the transformation process without background knowledge of the datasets. This may result in lower clustering accuracy.
To address this problem, we propose a clustering framework for mixed attribute type datasets that does not transform attributes. We first utilize an entropy based measure on categorical attributes as our criterion function for similarity. Second, based on the results of the entropy based similarity, we extract candidate cluster numbers and verify our weighting condition, which is based on the degree of balance of the pre-clustering results and the ground truth ratio of the given dataset. Finally, we cluster the mixed attribute type dataset with the extracted candidate cluster numbers and the weights.
We have conducted experiments with a heart disease dataset and a German credit dataset, for which an entropy function as a similarity measure and the proposed method of extracting the number of clusters were utilized. We also experimentally explore the relative degree of balance of the categorical vs. numerical attribute sub datasets in the given datasets. Our experimental results demonstrate that the proposed framework effectively improves accuracy for the given mixed type attribute datasets.