DBSSC : Density-Based Searchspace-limited Subspace Clustering
by
Jongeun Jun
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
Dec 2013
Copyright 2013 Jongeun Jun
Dedication
I dedicate my dissertation work to my family and many friends. A very special thanks to my loving parents, Namchan Jun and Heesun Uhm, for their endless love and all their support during my hard times. My wife Hyewook has always been by my side and is very special. I also dedicate this dissertation to my many friends and church members who have supported me throughout the process. I will always appreciate all they have done, especially Seokkyung Chung, for helping me overcome technical difficulties and for the many hours of proofreading. I dedicate this work and give special thanks to my two wonderful and lovely daughters, Dahnby and Eunby, for being there for me throughout the entire doctorate program. Both of you have been my eternal sunshine.
Acknowledgements
I wish to thank my committee members, who were more than generous with their expertise and precious time. A special thanks to Dr. Dennis McLeod, my committee chairman, for his countless hours of reflecting, reading, encouraging, and most of all patience throughout the entire doctorate program. Thank you Dr. Cyrus Shahabi, Dr. Daniel O'Leary, Dr. Ellis Horowitz, Dr. Shri Narayanan, Dr. Roger Zimmermann, and Dr. Larry Pryor for agreeing to serve on my committee. Special thanks goes to Dr. Parag Havaldar for allowing me to serve as a teaching assistant for several years. I would like to acknowledge and thank my school division for allowing me to conduct my research and for providing any assistance requested. Finally, I would like to thank the staff members of the Computer Science Department for their continued support.
Contents
Dedication ii
Acknowledgements iii
List of Tables vi
List of Figures vii
Abstract ix
Chapter 1 Introduction 1
1.1 Focus of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2 Density-based Gene Expression Clustering using a Mutual
Neighborhood 10
2.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Background for the Proposed Method . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Similarity Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Challenges in Density-based Gene Expression Clustering . . . . . . 17
2.4 Density-based Clustering using a Mutual Neighborhood . . . . . . . . . . 19
2.4.1 Construction of Mutual Neighborhood Graph . . . . . . . . . . . . 20
2.4.2 Identification of Rough Cluster Structure . . . . . . . . . . . . . 21
2.4.3 Cluster Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.3 Comparative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chapter 3 Searchspace-limited subspace clustering 31
3.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 The Subspace Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Identification of Candidate Subspace Clusters based on Domain
Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3 Efficient Super-pattern Search based on Inverted Index . . . . . . 39
3.3.4 Identification of Meaningful Subspace Clusters . . . . . . . . . . 42
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 44
Chapter 4 Density-based Searchspace-limited Subspace clustering 49
4.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 The Density-based Subspace Clustering Algorithm . . . . . . . . . . . . . 52
4.3.1 Basic notation of Density-based clustering . . . . . . . . . . . . . . 52
4.3.2 Identification of Searchspace-limited Subspace Clusters . . . . . . 53
4.3.3 Application of Density-based Approach to Subspace Clustering . . 54
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.1 Scalability of Density-based Searchspace-limited Subspace Clustering 60
4.4.2 Accuracy w.r.t the number of cores and dimensionality . . . . . . . 63
4.4.3 Equi-width bins versus Homogeneity-based bins . . . . . . . . . . . 69
Chapter 5 Future Work 71
5.1 Improving subspace clustering algorithm . . . . . . . . . . . . . . . . . . . 71
5.2 DBSSC into general-purpose algorithm . . . . . . . . . . . . . . . . . . . . 73
5.3 Efficient analysis for the result . . . . . . . . . . . . . . . . . . . . 73
Chapter 6 Conclusions 75
Reference List 77
List of Tables
2.1 Summary of symbols and corresponding meanings . . . . . . . . . . . . . 20
2.2 Proportion of biologically characterized genes in meaningful clusters versus
those in [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 A comparison between the proposed method and other approaches on the
Cho's data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Notations for subspace clustering . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Number of patterns generated in Cho data . . . . . . . . . . . . . . . . . . 61
4.2 Percentage number of patterns generated in Cho data . . . . . . . . . . . 61
4.3 Number of patterns generated in Spellman data . . . . . . . . . . . . . . . 62
4.4 Percentage number of patterns generated in Spellman data . . . . . . . . 62
4.5 Detailed accuracy of 4-bin case in Cho data w.r.t. # of cores and # of
dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Accuracy w.r.t. # of cores and # of bins in Cho data . . . . . . . . . . . 64
4.7 Accuracy w.r.t # of cores and # of bins in Spellman data . . . . . . . . . 64
List of Figures
1.1 Plot of sample gene expression data across whole conditions (shown in left
hand side) versus subset of conditions (shown in right hand side) . . . . . 2
2.1 Correlation between k-NN density (when k=30) and ε-density (when ε=0.8)
for the Cho's data that will be discussed in Chapter 2.5.1. The horizontal
axis represents k-NN density and the vertical axis represents ε-density. . .
2.2 Mean expression patterns for the Cho's data obtained by the proposed
method. The horizontal axis represents time points between 0-80 and 100-
160 minutes. The vertical axis represents normalized expression level. The
error bar at each time point delineates the standard deviation. . . . . . . 30
3.1 An overview of the proposed clustering algorithm . . . . . . . . . . . . . . 36
3.2 An illustration of domain transformation . . . . . . . . . . . . . . . . . . . 37
3.3 A sample example of subspace cluster discovery . . . . . . . . . . . . . . . 40
3.4 Inverted index at 4-dimensional subspace . . . . . . . . . . . . . . . . . . 41
3.5 Identification of subspace clusters . . . . . . . . . . . . . . . . . . . 43
3.6 Sample plots for subspace clusters . . . . . . . . . . . . . . . . . . . . . . 46
3.7 The number of genes versus pattern rate . . . . . . . . . . . . . . . . . . . 47
3.8 Different number of symbols versus running time rate . . . . . . . . . 48
3.9 The number of dimensions versus running time rate with the use of inverted
index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 An illustration on how core points attract border points in different sub-
space clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 A plot of two sample core clusters, which are far from each other. The red
line shows k-nearest neighbors for the gene YBR089w, and the blue line
shows k-nearest neighbors for the gene YDR471w, in Cho data. . . . . . . 56
4.3 An overview of the proposed clustering algorithm . . . . . . . . . . . . . . 58
4.4 An illustration of scalability . . . . . . . . . . . . . . . . . . . . . . 63
4.5 An illustration of accuracy results . . . . . . . . . . . . . . . . . . . . . . 65
4.6 An illustration of result in Equi-width vs. K-means . . . . . . . . . . . . 69
Abstract
We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, the expression levels of thousands of genes can be measured simultaneously. The availability of huge volumes of microarray data leads us to focus our attention on mining gene expression datasets. We apply a density-based approach to identify clusters in full-dimensional microarray datasets and obtain meaningful results. In general, microarray technologies provide multi-dimensional data. In particular, given that genes are often co-expressed under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous approaches, our method is based on the observation that the number of subspace clusters is related to the number of maximal subspace clusters to which any gene pair can belong. By performing discretization of gene expression profiles, the similarity between two genes is transformed into a sequence of symbols that represents the maximal subspace cluster for the gene pair. This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes. Based on the symbolic representations of genes, we present an efficient subspace clustering algorithm that is scalable to the number of dimensions. In addition, the running time can be drastically reduced by utilizing an inverted index and pruning non-interesting subspaces. Furthermore, by incorporating the density-based approach into the above searchspace-limited subspace clustering, we develop a fast-running subspace clustering algorithm which finds important subspace clusters. Experimental results indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for the yeast cell cycle datasets.
Chapter 1
Introduction
With the recent advancement of DNA microarray technologies, the expression levels of
thousands of genes can be measured simultaneously. The obtained data are usually
organized as a matrix (also known as a gene expression profile), which consists of m columns and n rows. The rows represent genes (usually genes of the whole genome), and
the columns correspond to the samples (e.g., various tissues, experimental conditions, or
time points).
Given this rich amount of gene expression data, the goal of microarray analysis is to
extract hidden knowledge (e.g., similarity or dependency between genes) from this ma-
trix. The analysis of gene expression may identify mechanisms of gene regulation and interaction, which can be used to understand the function of a cell [23]. Moreover, a comparison between expression in a diseased tissue and a normal tissue will further enhance our understanding of the disease pathology [26]. Therefore, data mining, which transforms a
raw dataset into useful higher-level knowledge, becomes a must in bioinformatics.
One of the key steps in gene expression analysis is to perform clustering of genes that show similar patterns. By identifying a set of gene clusters, we can hypothesize that the genes clustered together tend to be functionally related.

[Figure 1.1: Plot of sample gene expression data across whole conditions (shown on the left-hand side) versus a subset of conditions (shown on the right-hand side); the axes show conditions versus expression levels.]

Traditional clustering algorithms
have been designed to identify clusters in the full dimensional space rather than subsets
of dimensions [23, 53, 13, 47]. By developing and applying a density-based approach, we figure out the possibilities and limitations of traditional full-dimensional clustering. When correlations among genes are not apparently visible across the whole set of dimensions, as shown in the left-side graph of Figure 1.1, the traditional approaches fail to detect clusters. However, it is well known that genes can manifest a coherent pattern under subsets of experimental conditions, as shown in the right-side graph of Figure 1.1. Therefore, it is essential to identify such local patterns in microarray datasets, which is a key to revealing biologically meaningful clusters. Thus, we investigate a particular setting of the clustering problem, where we are interested in the identification of gene expression clusters in subspaces.
In this dissertation, we propose a mining framework that supports the identification of meaningful subspace clusters. When $m$ (the number of dimensions) is equal to 50 ($m$ for gene expression data usually varies from 20 to 100), the number of possible subspaces is $2^{50} - 1 \approx 1.1259 \times 10^{15}$. Thus, it is computationally expensive to search all subspaces to identify clusters. To cope with this problem, many subspace clustering algorithms first identify clusters in low-dimensional spaces, and use them to find clusters in higher-dimensional spaces based on the a priori principle [4]¹. However, this approach is not scalable to the number of dimensions in general.
In contrast to the previous approaches, our method is based on the observation that the maximum number of subspaces is related to the number of maximal subspace clusters to which any two genes can belong. In particular, by performing discretization of gene expression profiles, we can transform the similarity between two genes into a sequence of symbols that represents the maximal subspace cluster for the gene pair. This transformation allows us to limit the number of possible subspace clusters to $\frac{n(n-1)}{2}$, where $n$ is the number of genes. Based on the transformed data, we present an efficient searchspace-limited subspace clustering algorithm that is scalable to the number of dimensions. Moreover, by utilizing an inverted index and pruning (even conservatively) non-interesting subspace clusters, the running time can be drastically reduced. Note that the presented clustering algorithm can detect subspace clusters regardless of whether their coordinate values are consecutive to each other or not. Furthermore, in order to make our searchspace-limited subspace clustering more scalable, we incorporate our density-based clustering approach. In the density-based clustering approach, points with high density (i.e., core genes) and points with low density (i.e., outlier points) are identified. Non-core, non-outlier points are defined as border points. Since a core point has high density, it is expected to be located well inside the cluster. Thus, core points and surrounding border points form multiple subspace clusters. That is, core points have high potential of belonging to multiple subspace clusters. Therefore, instead of performing subspace clustering on the whole dataset, by performing subspace clustering on only the core points, we can further reduce the running time drastically. After that, border points are used to expand the cluster structure by assigning them to the most relevant cluster. Coupled with the density-based approach, experimental results indicate that our subspace clustering improves running time significantly.

¹ If a collection of points C forms a dense cluster in a k-dimensional space, then any subset of C should form a dense cluster in a (k−1)-dimensional space.
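To make the domain transformation concrete, the following is a minimal Python sketch of the idea described above: two expression profiles are discretized into symbol sequences, and the dimensions on which the symbols agree form the maximal subspace for that gene pair. The equi-width bin scheme, function names, and toy data are illustrative assumptions, not the dissertation's actual implementation.

```python
import numpy as np

def discretize(profile, n_bins=4):
    """Map a numeric expression profile to a sequence of bin symbols,
    here using simple per-gene equi-width bins (one possible choice)."""
    lo, hi = profile.min(), profile.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # interior edges only, so the symbols range over 0 .. n_bins - 1
    return np.digitize(profile, edges[1:-1])

def maximal_subspace(sym_a, sym_b):
    """Return the dimensions (conditions) on which two symbolized genes
    share the same symbol: the maximal subspace for this gene pair."""
    return tuple(np.nonzero(sym_a == sym_b)[0])

# Toy example: two genes measured under 6 conditions.
gene1 = np.array([0.1, 0.9, 2.0, -1.0, 0.2, 1.8])
gene2 = np.array([0.2, 1.1, 1.9, 2.5, 0.1, 1.7])
s1, s2 = discretize(gene1), discretize(gene2)
print(maximal_subspace(s1, s2))  # conditions under which the pair agrees
```

Since one such symbol sequence is produced per gene pair, the number of candidate subspace clusters generated this way is bounded by the number of pairs, n(n-1)/2, as stated above.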
1.1 Focus of the Dissertation
With the rapid progress in data acquisition, sensing, and storage technologies, the size of
available datasets is increasing at a rapid rate. Given this overwhelming amount of data,
data mining, which transforms raw datasets into useful higher-level knowledge, becomes
a must in analyzing and understanding the information. A raw dataset consists of objects that are usually represented as vectors of numeric values in a multidimensional space. Higher-level knowledge can be extracted by discovering groups (clusters) of similar objects in the raw dataset. Traditional clustering methods, which consider all of the dimensions (attributes) for measuring similarity between objects, have taken a leading role in data mining. However, traditional clustering frameworks have several weaknesses when dealing with high-dimensional data. This is caused by inherent characteristics of high-dimensional raw data: the increasing irrelevance of some dimensions, and the tendency of objects to become nearly equidistant from one another as the number of dimensions in the dataset increases. Feature selection methods have been used to discover relevant dimensions of a high-dimensional space, and clusters on them; however, feature selection has a critical limitation in that it cannot uncover relations between objects in multiple, overlapping sub-dimensional spaces.
Subspace clustering algorithms have been suggested to overcome the inherent problems of full-dimensional clustering frameworks and feature selection methods. The main purpose of subspace clustering algorithms is to find meaningful groups of objects in multiple sub-dimensions that may overlap, because clusters may be embedded in lower-dimensional subspaces. In addition, different sets of dimensions may be relevant for different sets of objects, and an object can have multiple characteristics in different subspaces. Although all high-dimensional data can be considered as input data for subspace clustering, datasets in certain problem domains have recently received scientific attention:

Gene expression analysis. Since gene expression datasets consist of measurements across various conditions (or time points), they are characterized by high dimensionality, huge volume, and noise. It is well known that genes can manifest a coherent pattern under subsets of experimental conditions, and these patterns are closely related to certain functionalities of genes. Therefore, it is essential to identify such local patterns in gene expression datasets, which is a key to revealing biologically meaningful clusters.

E-commerce. With the fast-growing online market, finding user groups who show similar interests in products is more important than in any other era in history. Furthermore, users can be identified by diverse characteristics, and current technologies allow substantial user information to be gathered. However, it is difficult to analyze the user information data because of its high dimensionality and the heterogeneity of users' interests. A subspace clustering with the ability to find locally correlated patterns is a promising method for recommendation systems and target marketing in e-commerce.

Others. Besides gene expression analysis and e-commerce, there are several problem domains that are suitable for subspace clustering. For example, creating a hierarchy of data sources in an information integration system and organizing unstructured web search results are possible problem domains where subspace clustering can be utilized.
There are two main streams of work in subspace clustering. Top-down algorithms (e.g., PROCLUS [2], FINDIT [51], etc.) find full-dimensional clusters first, and then refine them to find subspace clusters. Bottom-up strategies (e.g., CLIQUE [4], MAFIA [25], etc.) start by finding low-dimensional dense regions, and then use them to build relevant subspace clusters. However, both of them have a major weakness in scalability with the number of dimensions. For example, the number of dimensions (m) for gene expression data usually varies from 20 to 100, and the number of dimensions of feature vectors for documents is well over 100,000. When m is equal to 50, the number of possible subspaces is $2^{50} - 1$. It is computationally expensive to search all subspaces to identify clusters. To cope with this problem, bottom-up algorithms first identify clusters in low-dimensional spaces, and use them to find clusters in higher-dimensional spaces based on the a priori algorithm. However, this approach is not scalable with the number of dimensions in general. This motivates the necessity of a subspace clustering algorithm that is scalable with the number of dimensions. To this end, we develop a novel subspace clustering algorithm that utilizes a domain transformation from objects to object-object relations. This problem domain transformation allows us to limit the size of the search space of subspace clusters. Furthermore, coupling the density-based clustering approach with the subspace clustering reduces computational complexity significantly and identifies important subspace clusters at the same time.
1.2 Contributions
Our efforts on data mining are mainly focused on gene expression analysis [13, 33]. This is because there can be valuable hidden high-level knowledge in gene expression datasets that consist of the expression levels of thousands of genes under hundreds of conditions. The other reason that we use gene expression data is that most genes are expected to be strongly correlated under certain sub-conditions, and these genes have similar biological meanings. Upon completion of this dissertation, our research will have the following key expected contributions.

Density-based clustering using a mutual neighborhood [13]. We present a clustering algorithm for gene expression data. The main distinctions of the algorithm are threefold. First, it is robust to outliers by taking a density-based approach. Second, it can identify biologically meaningful clusters with a high detection ratio. Finally, it provides a cluster structure with reasonable homogeneity and separation. In addition, we establish a non-intuitive relationship between k-NN density and a connected component in a mutual neighborhood graph.
Identifying coherent patterns on subsets of gene expression datasets [33]. Given that genes are often co-expressed under subsets of experimental conditions, we propose a novel searchspace-limited subspace clustering algorithm. By performing discretization of gene expression profiles, each gene can be expressed as a sequence of symbols instead of a sequence of numeric values. As gene pairs may be similar under certain conditions and not under others, the maximum number of common symbols in a gene pair is less than or equal to the total number of conditions, even though each gene has a symbol sequence of the same length (the total number of conditions). Now, the similarity between two genes is transformed into a sequence of symbols that represents the maximal subspace cluster for the gene pair. This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes. Based on the symbolized genes, we present an efficient subspace clustering algorithm that is linearly scalable to the number of dimensions. In addition, the running time can be drastically reduced by utilizing inverted indexing and pruning non-interesting subspaces. Experimental results so far indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for a yeast cell cycle dataset. However, the searchspace-limited subspace clustering is not limited to gene expression mining, but is applicable to subspace clustering in general for high-dimensional massive datasets.
Improving running time with high accuracy on searchspace-limited subspace clustering. In our density-based approach, we identify core, border, and outlier genes. We investigate the applicability of this identification to subspace clustering. Even though our searchspace-limited subspace clustering is scalable to the number of dimensions, and the search space is limited to an upper bound based on the number of genes, the computational cost is still high for high-dimensional big data. By incorporating density-based clustering into the searchspace-limited subspace clustering, we reduce the computational cost significantly. In addition, our density-based searchspace-limited subspace clustering approach leverages the strengths of both the top-down and bottom-up approaches. Thus, it generates high-accuracy overlapping clusters in a much shorter running time than other subspace clustering algorithms.
1.3 Organization of the Dissertation
The remainder of this dissertation is organized as follows. In Chapter 2, we present
density-based gene expression clustering using a mutual neighborhood. In Chapter 3,
we propose searchspace-limited subspace clustering based on domain transformation. In
Chapter 4, we present density-based searchspace-limited subspace clustering. We present
directions for future work in Chapter 5, and provide conclusions of this dissertation in
Chapter 6.
Chapter 2
Density-based Gene Expression Clustering using a Mutual
Neighborhood
2.1 Preliminary
As described in the Introduction, discovering new biological knowledge from the data obtained by high-throughput experimental technologies is a major challenge in bioinformatics. A huge volume of microarray data is available thanks to recent advancements in DNA microarray technologies. The obtained data are represented as a matrix, and statistical analysis can be applied to this matrix form of microarray data.

There are several ways to analyze microarray data, including classification and clustering; however, clustering methods have been receiving more attention these days. This is because, based on the hypothesis that genes clustered together tend to be functionally related, functions of unknown genes can be predicted from genes (with known functions) in the same cluster. Recently, genome-wide expression data clustering has received significant attention in the bioinformatics research community, ranging from hierarchical clustering [19, 46], self-organizing maps [47], and neural networks [27], to algorithms based on Principal Components Analysis [54] or Singular Value Decomposition [15, 19, 29], and graph-based approaches [53]. However, previous algorithms are limited in addressing the following three challenges.
1. Gene expression data often contain a large number of outliers, which are considerably dissimilar from a significant portion of the data. Sometimes, the discovery of clusters is hampered by the presence of outliers. For example, a small number of outliers can substantially influence the mean values of clusters in center-based clustering (e.g., K-means). Thus, clustering algorithms must be able to identify outliers and remove them if necessary.

2. Co-expressed gene clusters may be highly connected by a large number of intermediate genes that are located between one cluster and another [31]. Clustering algorithms should not be confused by genes in a transition region. That is, simply merging two clusters connected by a set of intermediate genes should be avoided. Thus, the ability to detect the "genes in the transition region" would be helpful in gene expression clustering.

3. Since gene expression data consist of measurements across various conditions (or time points), they are characterized as multi-dimensional data. Clusters in high-dimensional spaces are, in general, of diverse densities. Although we are mainly interested in highly co-expressed gene clusters, medium-density clusters may be biologically meaningful as well. For example, a time-shift relationship is often exemplified in some biological processes such as the cell cycle. Moreover, some mechanisms can limit the number of expressed genes based on the principle of efficiency; consequently, the expression relationships among genes along the same biological pathway may be only partially revealed in a single microarray experiment [60]. As a consequence, biologically meaningful clusters may not always correspond to highly co-expressed clusters. Thus, it is essential to identify clusters over a wide range of densities (especially high-density and medium-density clusters). However, most clustering algorithms have difficulty in identifying clusters with diverse densities.
To address the above three issues, we propose a novel clustering algorithm that utilizes a density-based approach. In the density-based clustering approach, the density of each gene is estimated first. Next, genes with high density (i.e., core genes) and genes with low density (i.e., outlier genes) are identified. Non-core, non-outlier genes are defined as border genes. Since a core gene has high density, it is expected to be located well inside the cluster (i.e., to be a representative of a cluster). Thus, instead of conducting clustering on the whole data, performing clustering on core genes alone can avoid the first and second problems stated above while producing a skeleton of the cluster structure. After that, border genes are used to expand the cluster structure by assigning them to the most relevant cluster.

In density-based clustering, outlier genes may not be clustered into any cluster. That is, since the goal of gene expression clustering is to identify sets of genes with similar patterns, it may be necessary to discard outlier genes during the clustering process. While this approach does not provide a complete organization of all genes, it can extract the "essentials" of the given data. However, if a complete clustering is necessary, the outlier genes can be added to the closest clusters (or they can form singleton clusters).
Density-based clustering approaches have been studied in previous data mining lit-
erature [20, 35]. However, since they have been studied in a non-biological context, the presented approaches are not directly applicable to gene expression data. Our primary contribution is to tackle the three problems stated above, and to exploit the strengths and
address the limitations of the previous density-based clustering approaches.
The remainder of this chapter is organized as follows. In Chapter 2.2, we review the
related work, and highlight the strengths and weaknesses of the previous approaches.
We present background information in Chapter 2.3. We explain the proposed clustering
algorithm in Chapter 2.4. Finally, we present experimental results in Chapter 2.5.
2.2 Related Work
In this chapter, we briefly review previous gene expression clustering approaches. Note that this chapter should not be considered a comprehensive survey of all published gene expression clustering algorithms. It only aims to provide a concise overview of algorithms that are directly related to our approach. Jiang et al. [32] provide a comprehensive review of gene expression clustering. For details, refer to that paper [32].
Partition-based clustering decomposes a collection of genes into groups that are optimal with respect to some pre-defined function, as in the center-based approach [23, 49]. Center-based algorithms find the clusters by partitioning the entire dataset into a pre-determined number of clusters [17, 23, 49]. Although center-based clustering algorithms have been widely used in gene expression clustering, they have the following drawbacks. First, the algorithm is sensitive to the initial seed selection. Depending on the initial points, it is susceptible to a local optimum. Second, as discussed in Chapter 3.4.2, it is sensitive to noise. Third, the number of clusters should be determined beforehand.
Hierarchical (agglomerative) clustering (HAC) finds the clusters by initially assigning
each gene to its own cluster and then repeatedly merging pairs of clusters until a certain
stopping condition is met [17, 19, 46]. However, a user should determine where to cut
the dendrogram to produce actual clusters.
The graph-based approach [53] utilizes graph algorithms (e.g., minimum spanning
tree or minimum cut) to partition the graph into connected subgraphs. However, due to
the points in the transition region, this approach may end up with a highly connected set
of genes.
To better explain time-course gene expression datasets, new models are proposed to
capture the relationships between time-points [6, 5]. However, they assume that the
data fits into a certain distribution, which does not hold in gene expression datasets as
discussed in Yeung et al. [55].
As we have discussed, our work is motivated by previous density-based clustering
approaches such as DBSCAN [20] or SNN [18]. However, since these approaches utilize a notion of connectivity to build a cluster, they might not be relevant for datasets with moderately dense transition regions. In addition, both approaches need user-defined parameters (e.g., MinPts), which are difficult to determine in advance. Recently, density-based clustering algorithms have been applied to gene expression datasets [30, 31]. Both approaches are promising in that a meaningful hierarchical cluster structure (rather than a dendrogram) can be built. However, these approaches have drawbacks in that MinPts must be determined beforehand [30], or they are computationally expensive due to kernel density estimation [31].
2.3 Background for the Proposed Method
2.3.1 Similarity Metric
To estimate density for each gene, we need to decide the distance metric (or similarity
metric). One of the most commonly used metrics to measure the distance between two
data items is the Euclidean distance. The distance between $x_i$ and $x_j$ in $m$-dimensional space is defined as follows:

$$d(x_i, x_j) = \mathrm{Euclidean}(x_i, x_j) = \sqrt{\sum_{d=1}^{m} (x_{id} - x_{jd})^2} \qquad (2.1)$$
Since the Euclidean distance emphasizes the individual magnitudes of each feature, it does not account for shifting or scaling patterns very well. In gene expression datasets, the overall shape of a gene expression pattern is more important than its magnitude. To address the shifting and scaling problem, each gene can be standardized as follows:

$$\hat{x}_{id} = \frac{x_{id} - \mu_i}{\sigma_i} \qquad (2.2)$$

where $\hat{x}_i$ is the standardized vector of $x_i$, $\mu_i$ is the mean of $x_i$, and $\sigma_i$ is the standard deviation of $x_i$, respectively.
Another widely used metric for time-series similarity is Pearson's correlation coefficient. Given two genes $x_i$ and $x_j$, Pearson's correlation coefficient $r(x_i, x_j)$ is defined as follows:

$$r(x_i, x_j) = \frac{\sum_{d=1}^{m} (x_{id} - \mu_i)(x_{jd} - \mu_j)}{\sqrt{\sum_{d=1}^{m} (x_{id} - \mu_i)^2}\,\sqrt{\sum_{d=1}^{m} (x_{jd} - \mu_j)^2}} \qquad (2.3)$$

Note that $r(x_i, x_j)$ has a value between 1 (perfect positive linear correlation) and -1 (perfect negative linear correlation). In addition, a value of 0 indicates no linear correlation. Throughout this chapter, we will explain our methodology using the Pearson correlation coefficient.
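As a small worked illustration of Equations 2.2 and 2.3, the sketch below standardizes two toy profiles and computes their Pearson correlation with NumPy. It is only an illustrative example under the definitions above; the variable names and data are assumptions, not part of the dissertation's implementation.

```python
import numpy as np

def standardize(x):
    """Equation 2.2: subtract the gene's mean and divide by its standard deviation."""
    return (x - x.mean()) / x.std()

def pearson(x_i, x_j):
    """Equation 2.3: Pearson correlation between two expression profiles."""
    xi, xj = x_i - x_i.mean(), x_j - x_j.mean()
    return np.sum(xi * xj) / (np.sqrt(np.sum(xi ** 2)) * np.sqrt(np.sum(xj ** 2)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(x, y))                                     # 1.0: perfect positive correlation
print(np.linalg.norm(standardize(x) - standardize(y)))   # 0.0: identical after standardization
```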
2.3.2 Density Estimation
In order to estimate the density of each gene, a neighborhood needs to be defined first. Traditionally, there are two ways to define a neighborhood: an ε-neighborhood and a k-nearest neighborhood. In an ε-neighborhood, the neighborhood of $x_i$ is defined as the set of genes whose similarity to $x_i$ is greater than or equal to a pre-defined threshold (ε). On the other hand, the k-nearest neighborhood of $x_i$ is defined by the set of the $k$ genes most similar to $x_i$, excluding $x_i$ itself. We denote the k-nearest neighborhood of $x_i$ as $N_k(x_i)$. Thus, a k-nearest neighborhood is more flexible than an ε-neighborhood in that the former can model diverse shapes of neighborhoods while the latter can only model spherical shapes.

After the type of neighborhood is determined, the next step is to define a density based on the neighborhood. In ε-density, the density of an object is defined by the number of objects in a region of specified radius (ε) around the point [20]. In this proposal, we mainly focus on k-NN density estimation. The conventional k-NN approach to probability density first fixes the number of neighbors ($k$) beforehand. If the k-th nearest neighbor of $x_i$ is close to $x_i$, then $x_i$ is more likely to be in a region with a high probability density. Thus, the distance between $x_i$ and its k-th nearest neighbor provides a measure of the density. However, since this approach considers only the k-th neighbor, it is susceptible to random variation.

To address this problem, we define the k-NN density of $x_i$ in terms of the sum of correlations between $x_i$ and its k-nearest neighbors, as follows:

$$\sum_{x_j \in N_k(x_i)} r(x_i, x_j)/k \qquad (2.4)$$

Since the proposed approach considers all $k$ neighbors, it is less sensitive to $k$ than considering only the k-th neighbor. Based on the proposed notion of density estimation, the estimated density can be viewed from a graph perspective, in that the density at a vertex (i.e., a gene) is defined as the sum of the weights of the $k$ strongest edges. Thus, the k-NN density indicates how strongly a gene is connected to its k-nearest neighbor genes.
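The following sketch illustrates one straightforward way to compute the k-NN density of Equation 2.4 from a gene-by-time-point matrix, using Pearson correlation as the similarity. The data and parameter values are toy assumptions, not the settings used in the experiments.

```python
import numpy as np

def knn_density(X, k):
    """Equation 2.4: for each gene (row of X), average the correlations to
    its k most similar genes (its k-nearest neighbors under Pearson r)."""
    R = np.corrcoef(X)             # n x n Pearson correlation matrix
    np.fill_diagonal(R, -np.inf)   # exclude the gene itself from its neighborhood
    topk = np.sort(R, axis=1)[:, -k:]   # the k largest correlations in each row
    return topk.sum(axis=1) / k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))     # toy data: 100 genes, 16 time points
print(knn_density(X, k=30)[:5])
```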
2.3.3 Challenges in Density-based Gene Expression Clustering
Figure 2.1 shows the correlation between ε-density and k-NN density. In general, a strong correlation can be observed between them. That is, high (low) k-NN density implies high (low) ε-density, and vice versa.

[Figure 2.1: Correlation between k-NN density (when k=30) and ε-density (when ε=0.8) for the Cho's data that will be discussed in Chapter 2.5.1. The horizontal axis represents k-NN density and the vertical axis represents ε-density.]

As discussed, clusters in high-dimensional spaces are commonly of various densities. Density-based algorithms therefore have difficulty in choosing a proper value for the user-defined density threshold that cuts off between core genes and border genes. That is, if the threshold is too high, then we may miss relevant genes that might be representative of the cluster. On the other hand, if the threshold is set too low in order to include all relevant core genes, then border genes may be included as well.
Thus, rather than identifying genes with globally high density, our first task is to identify the genes that are located in uniform regions of diverse densities. That is, regardless of the density of a gene, as long as the region around the gene has relatively uniform density, the gene is considered a relevant candidate for the core genes. By doing so, we can capture clusters of widely varying density.
However, although a clustering algorithm may be able to classify regions (with different densities) into clusters, in gene expression clustering it is not relevant to find a cluster with too low a density (even if the region has uniform density), because the main goal of gene expression clustering is to capture sets of highly co-expressed genes. Therefore, our algorithm is composed of two tasks: identification of all clusters with diverse densities, and filtering out the non-relevant clusters with low density.
The neighborhood list size ($k$) may determine the granularity of the clusters. If $k$ is too small, then the algorithm will tend to identify a large number of small-size clusters. In contrast, if $k$ is too large, then a few large-size clusters are formed. In what follows, we explain how to decide the values of the core and noise thresholds when $k$ is fixed.
2.4 Density-based Clustering using a Mutual Neighborhood
The proposed algorithm proceeds in four phases: mutual neighborhood graph construction, identification of core genes and outlier genes, core gene clustering, and assignment of border genes. First, we define a mutual neighborhood graph, which is essential in our further discussions.

Notation            Meaning
n                   Total number of genes
m                   Total number of time points
X                   An m x n gene expression profile matrix
x_i                 The i-th gene
x_ij                The j-th feature of the i-th gene
N_k(x_i)            The k-nearest neighbor list for x_i (excluding x_i)
M_k(x_i)            The k-nearest mutual neighbor list for x_i (excluding x_i)
P                   A set of core genes
CP                  A set of core clusters (before refinement)
C                   A set of clusters (after refinement)
K                   Number of clusters
(core threshold)    A threshold that determines a core point
(noise threshold)   A threshold that determines a noise point

Table 2.1: Summary of symbols and corresponding meanings
2.4.1 Construction of Mutual Neighborhood Graph
We first provide the definition of a k-nearest mutual neighborhood (k-MNN).

Definition 2.4.1: [k-nearest mutual neighborhood]. Given $x_i$ and $x_j$, if $N_k(x_i)$ contains $x_j$ and $N_k(x_j)$ contains $x_i$, then $x_i$ and $x_j$ are referred to as k-nearest mutual neighbors of each other. The k-nearest mutual neighborhood is defined by the k-nearest mutual neighbor list for $x_i$, and is denoted by $M_k(x_i)$.

With respect to k-MNN and k-NN, the following properties hold.

1. $N_k(x_i) \subseteq N_{k+1}(x_i)$ and $M_k(x_i) \subseteq M_{k+1}(x_i)$ for any $x_i$ and $k \geq 1$.

2. $M_k(x_i) \subseteq N_k(x_i)$ for any $x_i$ and $k \geq 1$. Thus, k-MNN provides a tighter form of neighborhood than k-NN does.

Thus, given an input gene expression dataset and a maximum number of neighbors ($k$) where $0 < k < n$, we build a k-nearest mutual neighborhood graph $MG_k = (U, E)$. In this graph, $U$ is the set of genes, and $(u, v) \in E$ if and only if $u$ and $v$ are k-nearest mutual neighbors of each other. Note that the value of $k$ in the k-NN density and the value of $k$ in $MG_k$ are different.
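A minimal sketch of the construction just described is given below: it builds the k-nearest neighbor lists $N_k(x_i)$ under Pearson correlation, keeps only mutual edges to form $MG_k$, and extracts its connected components by breadth-first search. The helper names and toy data are illustrative assumptions rather than the dissertation's implementation.

```python
import numpy as np
from collections import defaultdict, deque

def knn_lists(X, k):
    """N_k(x_i): indices of the k genes most similar to gene i (Pearson), excluding i."""
    R = np.corrcoef(X)
    np.fill_diagonal(R, -np.inf)
    return [set(np.argsort(row)[-k:]) for row in R]

def mutual_knn_graph(X, k):
    """Build MG_k = (U, E): an edge (i, j) exists iff i is in N_k(j) and j is in N_k(i)."""
    N = knn_lists(X, k)
    adj = defaultdict(set)
    for i in range(len(N)):
        for j in N[i]:
            if i in N[j]:              # mutual neighbors only
                adj[i].add(j)
                adj[j].add(i)
    return adj, len(N)

def connected_components(adj, n):
    """Connected components of MG_k via breadth-first search."""
    seen, comps = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))               # toy data: 200 genes, 16 time points
adj, n = mutual_knn_graph(X, k=10)
print(len(connected_components(adj, n)))     # number of components (candidate rough clusters)
```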
2.4.2 Identication of Rough Cluster Structure
A rough cluster structure can be derived by performing clustering on core genes. This step can be devised based on the following observation. Since border and outlier genes are excluded in the rough cluster identification step, each cluster is expected to be well separated from the others. That is, if two core genes belong to different clusters, then they are expected to be far from each other. Similarly, if two core genes should belong to the same cluster, then the two genes are expected to be significantly proximate to each other. Thus, the key part in rough cluster structure identification is to find a set of relevant core genes. A set of connected components (denoted by $C^k$) in a mutual neighborhood graph ($MG_k$) can provide a clue for this step. We first define intra-component connectivity as follows:

$$\lambda_1(c^k) = \frac{1}{|c^k|^2} \sum_{x_i, x_j \in c^k} r(x_i, x_j) \qquad (2.5)$$

Note that $\lambda_1$ measures how strongly genes in the same component are connected to each other. Thus, $\lambda_1$ can be utilized to measure the density of a component. Similarly, $\lambda_1$ can also be utilized to measure the density of a cluster. In the following, we establish the possible key relationships for connected components.
1. The value of $\lambda_1$ that a connected component can have is fairly diverse. The rationale behind this corresponds to the fact that a cluster in a high-dimensional space can be of diverse densities.

2. Densities of genes belonging to the same component are similar. This implies that the components in a mutual neighborhood graph can capture uniform density regions.

3. Different mutual neighborhood graphs can be defined by different values of $k$. As the value of $k$ increases, in general, the number of connected components decreases, and the size of the connected components increases. However, in some cases, we can find the same components in different mutual neighborhood graphs. We define such components as invariant components. Most of such components turn out to be collections of outlier genes. Thus, our argument here is that invariant components do not contribute to core gene selection.

4. A large-size component (with a very low intra-component connectivity) may be obtained. The rationale behind this corresponds to the fact that several dissimilar components can be connected by genes in intermediate regions (we refer to this as a chaining effect).

5. A large number of genes may form singleton components (components whose size is equal to 1). If the density of such a gene is low, then it would be an outlier gene. If a high-density gene falls into a singleton component, then it might be located on the boundary of a dense region. Finally, if a medium-density gene forms a singleton component, then it is located on the boundary of a cluster.

6. If a high-density gene falls into a medium-size component, then it might be located in the center of a dense region; consequently, the intra-connectivity of the component is expected to be very high. Similarly, if a medium-density gene belongs to a medium-size component, then it might be located at the center of a medium-density region. If a low-density gene belongs to a medium-size component, then it is the center of a low-density cluster.
As the value of $k$ increases, chaining effects are encountered; consequently, the size of the connected components tends to increase rapidly. Since we want to make our algorithm not overly sensitive to the value of $k$, to avoid this chaining effect, we introduce the following subcomponent decomposition theorem.
Lemma 2.4.2: Let $C^{k+1} = \{c^{k+1}_0, c^{k+1}_1, \ldots, c^{k+1}_{l_0}\}$ and $C^{k} = \{c^{k}_0, c^{k}_1, \ldots, c^{k}_{l_1}\}$ be the sets of connected components obtained from $MG_{k+1}$ and $MG_k$, respectively. If $g \in c^{k+1}_j$, then there exists $i_0$ such that $g \in c^{k}_{i_0}$ and every gene in $c^{k}_{i_0}$ also belongs to $c^{k+1}_j$.

Proof: Since $\bigcup_i c^{k}_i = \bigcup_i c^{k+1}_i$, for every $g \in c^{k+1}_i$ there exists $i_0$ such that $g \in c^{k}_{i_0}$. Thus, it is sufficient to show that every gene in $c^{k}_{i_0}$ belongs to $c^{k+1}_i$. We will prove this by contradiction. Suppose there exists $g' \in c^{k}_{i_0}$ such that $g' \notin c^{k+1}_i$. Since $g, g' \in c^{k}_{i_0}$, one of the following must hold: $g \in M_k(g')$ (case 1), or there exist $g_{i_0}, \ldots, g_{i_n} \in c^{k}_{i_0}$ such that $g \in M_k(g_{i_0})$, $g_{i_0} \in M_k(g_{i_1})$, ..., $g_{i_n} \in M_k(g')$ (case 2).

Case 1: Since $M_k(g') \subseteq M_{k+1}(g')$, $g$ must belong to $M_{k+1}(g')$. That is, $g$ and $g'$ are $(k+1)$-nearest mutual neighbors of each other. Thus, there exists a direct edge between $g$ and $g'$ in $MG_{k+1}$.

Case 2: Since $M_k(x) \subseteq M_{k+1}(x)$ for any $x$, $g \in M_{k+1}(g_{i_0})$, $g_{i_0} \in M_{k+1}(g_{i_1})$, ..., $g_{i_n} \in M_{k+1}(g')$. That is, there exists a path between $g$ and $g'$ in $MG_{k+1}$.

By case 1 and case 2, $g$ and $g'$ must belong to the same connected component in $MG_{k+1}$. Since $g \in c^{k+1}_i$, $g'$ must also belong to $c^{k+1}_i$. This contradicts our assumption that $g' \notin c^{k+1}_i$. Therefore, every gene in $c^{k}_{i_0}$ must belong to $c^{k+1}_i$.
Theorem 2.4.3: For any $c^{k+1}_i \in C^{k+1}$, there exists $I = \{i_0, \ldots, i_l\}$ such that $c^{k+1}_i = \bigcup_{j \in I} c^{k}_j$.

Proof: This is straightforward by Lemma 2.4.2.
Thus, based on the above theorem, we can provide filtering heuristics as follows: (1) If genes form singleton components or invariant components, then they have a low possibility of being core genes. (2) If a large-size component (with a low intra-component connectivity) is obtained, then this component is composed of disparate sub-components due to the chaining effect. Thus, we decompose this component into subcomponents, and recursively apply the first rule to these subcomponents. In the following, we provide the formal definitions of core genes and outlier genes to encompass the above relationships.

Definition 2.4.4: [Core genes]. Given $MG_k$, core genes are formalized as locally dense areas in the original data, which are captured by the local maxima of the remaining components (after filtering) in $MG_k$.

Definition 2.4.5: [Outlier genes]. Given $MG_k$, genes with low k-NN density which form singleton components or invariant components are defined as outlier genes.
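The sketch below gives one simplified reading of Definitions 2.4.4 and 2.4.5, assuming the connected components and k-NN densities have already been computed: low-density singleton components are treated as outliers, and each remaining non-singleton component contributes its density maximum as a core gene. Taking only the single maximum per component (rather than all local maxima), and the noise threshold value, are simplifying assumptions.

```python
import numpy as np

def select_core_and_outlier_genes(components, density, noise_threshold):
    """Simplified reading of Definitions 2.4.4 and 2.4.5: low-density singleton
    components are outliers; each remaining non-singleton component contributes
    its density maximum as a core gene (other singletons remain border genes)."""
    core, outliers = [], []
    for comp in components:
        if len(comp) == 1:
            g = comp[0]
            if density[g] < noise_threshold:
                outliers.append(g)
            continue
        comp = np.asarray(comp)
        core.append(int(comp[np.argmax(density[comp])]))
    return core, outliers

# Toy example: 6 genes, their k-NN densities, and 4 precomputed components.
density = np.array([0.9, 0.85, 0.2, 0.7, 0.65, 0.1])
components = [[0, 1], [3, 4], [2], [5]]
print(select_core_and_outlier_genes(components, density, noise_threshold=0.3))
# -> ([0, 3], [2, 5])
```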
Once core genes are selected, the next step is to perform clustering on the core genes. Since core genes are expected to be well separated, rather than employing complex algorithms, we use a standard implementation of greedy agglomerative clustering. That is, each gene is initialized as a cluster of size one, the similarities between all pairs of clusters are computed, and the two closest clusters are repeatedly merged until the number of clusters is reduced to a target number or the maximum similarity becomes less than a pre-defined threshold. In this chapter, we selected the group average scheme (i.e., the similarity between two clusters is defined by the average pairwise similarity between their members) for the similarity update.
There are three main computationally expensive steps here. The first step is the k-nearest neighborhood construction for all genes, which takes $O(n^2)$. However, for low-dimensional data, the complexity can be reduced to $O(n \log n)$ if spatial data structures such as the X-tree [8] are utilized (note that if the number of dimensions is high enough, then dimensionality reduction needs to be performed as a preprocessing stage). The second step is the construction of a mutual neighborhood graph and connected component identification in the mutual neighborhood graph. Identification of a set of maximally connected components in a graph can be efficiently performed within $O(n)$, where $n$ is the number of vertices [14]. The third step is concerned with core gene clustering. Regarding the group average scheme, the similarity between clusters $C_i$ and $C_j$ is not changed during the agglomerative steps unless $C_i$ or $C_j$ is selected to be merged. Thus, all similarities can be computed once for each pair of clusters and inserted into a priority queue. If clusters $C_i$ and $C_j$ are selected to form a new cluster $C_p$, then any similarities involving either $C_i$ or $C_j$ are removed from the queue, and the similarities of the remaining clusters with $C_p$ are inserted. This step takes $O(n^2 \log n)$ if the priority queue is implemented using a binary heap. In sum, the worst-case time complexity is $O(n^2 \log n)$.
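The following sketch illustrates the priority-queue formulation of greedy group-average agglomerative clustering described above, using lazy invalidation of stale queue entries. The similarity matrix, stopping criteria, and toy data are illustrative assumptions rather than the dissertation's implementation.

```python
import heapq
import numpy as np

def group_average_hac(S, target_k, min_sim=0.0):
    """Greedy group-average agglomerative clustering over a similarity matrix S
    (e.g., Pearson correlations between core genes). The two most similar
    clusters are merged until only target_k clusters remain or the best
    available similarity drops below min_sim. A max-heap with lazy
    invalidation plays the role of the priority queue described above."""
    n = S.shape[0]
    clusters = {i: [i] for i in range(n)}       # cluster id -> member indices
    heap = [(-S[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)
    next_id = n
    while len(clusters) > target_k and heap:
        neg_sim, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue                             # stale entry: one side already merged
        if -neg_sim < min_sim:
            break
        merged = clusters.pop(a) + clusters.pop(b)
        # group-average similarity between the new cluster and every remaining cluster
        for c, members in clusters.items():
            sim = S[np.ix_(merged, members)].mean()
            heapq.heappush(heap, (-sim, next_id, c))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

rng = np.random.default_rng(2)
core = rng.normal(size=(50, 16))                 # toy data: 50 core genes, 16 time points
S = np.corrcoef(core)
print([len(c) for c in group_average_hac(S, target_k=5)])
```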
2.4.3 Cluster Expansion
Once a rough cluster structure is obtained, the next step is to identify relevant clusters
that can host each border gene. This step can be easily achieved by selecting the cluster that has the largest proximity to the border gene.
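A minimal sketch of this expansion step is shown below, reading "largest proximity" as the highest mean Pearson correlation between the border gene and a cluster's members; this interpretation and the toy data are assumptions.

```python
import numpy as np

def assign_border_genes(X, clusters, border_idx):
    """Assign each border gene to the cluster whose members have the highest
    mean Pearson correlation with it (one reading of 'largest proximity')."""
    R = np.corrcoef(X)
    assignment = {}
    for g in border_idx:
        proximities = [R[g, members].mean() for members in clusters]
        assignment[g] = int(np.argmax(proximities))
    return assignment

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 16))                              # toy data: 30 genes
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5])]      # rough core clusters
print(assign_border_genes(X, clusters, border_idx=[10, 11]))
```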
2.5 Experimental Results
2.5.1 Experimental Setup
The proposed clustering algorithm was tested on yeast cell cycle data. The data is a
subset from the original 6,220 genes with 17 time points listed by [12]. Cho et al. sampled
17 time points at 10-minute intervals, covering nearly two full cell cycles of the yeast
Saccharomyces cerevisiae. Among 6,220 genes, 2,321 genes were selected based on the
largest variance in their expression. In addition, one abnormal time point was removed
from the data set, as suggested by [47]; consequently, the resulting data consists of 2,321
genes with 16 time points. Our analysis primarily focuses on this data set.
2.5.2 Evaluation Metrics
For the empirical evaluation of the proposed clustering algorithm, we first describe the evaluation criteria. There are three measures for evaluating the accuracy of a clustering algorithm in general. When a true solution exists, we can compare the true solution with the obtained result by computing the proportion of correctly identified ones. On the other hand, when the true solution is unknown, the most widely used metrics are how homogeneous each cluster is (Equation 2.6), which can be estimated using the sum of average pairwise similarities between genes that are assigned to the same cluster, and how well the clusters are separated from each other (Equation 2.7).

$$\lambda_1 = \frac{1}{K} \sum_{r=1}^{K} \left( \frac{1}{|C_r|^2} \sum_{x_i, x_j \in C_r} r(x_i, x_j) \right) \qquad (2.6)$$

$$\lambda_2 = \frac{1}{K} \sum_{i=1}^{K} \max\{\, r(x_i, x_j) \mid x_i \in C_i,\ x_j \in C_j,\ i \neq j \,\} \qquad (2.7)$$

In general, $\lambda_1$ favors a large number of small-size clusters, and $\lambda_2$ favors a small number of large-size clusters. In sum, we favor a clustering solution with the largest value of $\lambda_1$ and the smallest value of $\lambda_2$ (i.e., maximize within-cluster similarity and minimize between-cluster similarity).
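The sketch below computes these two measures for a toy clustering, following the reconstruction of Equations 2.6 and 2.7 above (mean pairwise correlation within clusters, and the largest between-cluster correlation averaged over clusters); it is an illustrative example, not the evaluation code used in the experiments.

```python
import numpy as np

def homogeneity(R, clusters):
    """Equation 2.6: average, over clusters, of the mean pairwise correlation
    within each cluster (matching the 1/|C_r|^2 normalization)."""
    return np.mean([R[np.ix_(c, c)].mean() for c in clusters])

def separation(R, clusters):
    """Equation 2.7: for each cluster, the largest correlation to a gene in
    any other cluster, averaged over clusters."""
    vals = []
    for i, c in enumerate(clusters):
        others = np.concatenate([o for j, o in enumerate(clusters) if j != i])
        vals.append(R[np.ix_(c, others)].max())
    return np.mean(vals)

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 16))                    # toy data: 40 genes, 16 time points
R = np.corrcoef(X)
clusters = [np.arange(0, 20), np.arange(20, 40)]
print(homogeneity(R, clusters), separation(R, clusters))
```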
2.5.3 Comparative Algorithms
For the comparison, we utilized the EXPANDER [45] software, which implements K-means, SOM, and CLICK.
Regarding K-means, since we are using Pearson correlation instead of Euclidean distance, there is a question of whether K-means will still work. However, if the data are standardized by subtracting off the mean and dividing by the standard deviation, K-means will work with Pearson correlation, because the Pearson correlation and the Euclidean distance are then monotonically related as follows:

$$r(x_i, x_j) = 1 - \frac{d^2(x_i, x_j)}{2m} \qquad (2.8)$$
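A quick numeric check of Equation 2.8 is sketched below: after standardizing with the population standard deviation, the squared Euclidean distance and the Pearson correlation of two random profiles satisfy the identity. The data are arbitrary toy values.

```python
import numpy as np

def standardize(x):
    return (x - x.mean()) / x.std()   # population standard deviation (ddof=0)

rng = np.random.default_rng(5)
x, y = rng.normal(size=16), rng.normal(size=16)
m = len(x)

r = np.corrcoef(x, y)[0, 1]                       # Pearson correlation of the raw profiles
d2 = np.sum((standardize(x) - standardize(y)) ** 2)   # squared Euclidean distance after standardization
print(np.isclose(r, 1 - d2 / (2 * m)))            # True: Equation 2.8 holds
```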
Famili et al. showed that 21 is the most relevant number of clusters when K-means
is applied to Cho's data [21]. Thus, we can fix the number of clusters in advance. More-
over, to overcome the sensitivity of initial seed selection, we perform K-means clustering
multiple times.
In CLICK, a homogeneity value needs to be set to control the homogeneity of the resulting clustering. This parameter serves as a threshold in various steps of the algorithm, including the definition of cluster kernels, singleton adoptions, and kernel merging. The higher the value assigned to this parameter, the tighter the resulting clusters. In our
experiment, the default value was used.
2.5.4 Experimental Results
The proposed clustering algorithm clustered the genes into 17 clusters. These clusters
are shown in Figure 2.2. Table 2.2 shows how well the proposed algorithm successfully
separated biologically meaningful patterns into five clusters. The detection ratio of our approach is 65.3%. CLICK failed to separate the biologically meaningful patterns into five clusters. With respect to K-means, it sometimes failed to detect five clusters due to its random nature. In other cases, it detected five cell cycle clusters with detection ratios (average: 59.5%, maximum: 63.5%) over 50 runs. SOM also sometimes failed to detect five clusters due to its random nature. In other cases, it detected five cell cycle clusters with detection ratios (average: 48.4%, maximum: 55.8%) over 50 runs. Thus, in terms of biologically meaningful cluster detection, our clustering works better than K-means, SOM, or CLICK.

Cell cycle   Proportion   Meaningful Cluster
Early G1       12/32       Cluster 1
Late G1        70/87       Cluster 2
S phase        19/48       Cluster 10
G2 phase       18/28       Cluster 11
M              28/30       Cluster 16

Table 2.2: Proportion of biologically characterized genes in meaningful clusters versus those in [12].
A summary of the separation and homogeneity of each clustering algorithm is shown in Table 2.3. As shown, in terms of homogeneity, since K-means tries to maximize within-cluster similarity, it worked best (but it did not work well in between-cluster similarity). Similarly, since CLICK tries to minimize between-cluster similarity, it worked best in between-cluster similarity (but it did not work well in within-cluster similarity). However, we could observe that our clustering algorithm produced reasonable between- and within-cluster similarity, although we employed a simple greedy agglomerative approach for core gene clustering. Thus, this supports our argument that our core gene selection method works effectively.
[Figure 2.2: Mean expression patterns for the Cho's data obtained by the proposed method, shown as 17 panels (Cluster 1 through Cluster 17). The horizontal axis represents time points between 0-80 and 100-160 minutes. The vertical axis represents normalized expression level. The error bar at each time point delineates the standard deviation.]
                    K-means   CLICK   SOM   Our Method
no. clusters           21       18     30       17
λ1 (homogeneity)      0.52     0.42   0.43     0.49
λ2 (separation)       0.32     0.19   0.38     0.26

Table 2.3: A comparison between the proposed method and other approaches on the Cho's data
Chapter 3
Searchspace-limited subspace clustering
3.1 Preliminary
The problem of the curse of dimensionality in high-dimensional spaces has been known in the statistics community since the 1960s [7]. In noisy high-dimensional data, the number of irrelevant dimensions increases as the dimensionality of the data increases. These irrelevant dimensions make it hard for traditional clustering methods to perform clustering in the full-dimensional space. Moreover, whatever similarity metric is used, the similarities between objects in very high-dimensional data tend to be very close to each other, so the input parameters of traditional clustering methods become even harder to determine.
Agrawal et al. [4] suggested an algorithm, named CLIQUE, that is the rst to address
the subspace clustering problem. In this chapter, we describe a formal denition of
subspace clustering problem, and special requirements of the algorithm in following three
categories.
1. Effective treatment of high dimensionality. A subspace clustering algorithm has to deal with problems caused by the high dimensionality of the data. These problems include the low average density of objects anywhere in high-dimensional data space, uniformly distributed data or noise in certain dimensions, and the ineffectiveness of distance functions in the full-dimensional space. Moreover, objects that are similar on different combinations of dimensions may be functionally related.
2. Interpretability of results. Because data mining cannot stand alone but works with external experts, interpretability of results is important. A subspace clustering algorithm should provide a simple representation and description of the resulting clusters that can be easily interpreted by end users.
3. Scalability and usability. Scalability of the algorithm is important because recent advances in technology produce huge volumes of data. Also, a subspace clustering method should not be affected by the order of the data.
A number of subspace clustering algorithms have been proposed since Agrawal defined the problem [2, 3, 10, 11, 22, 25, 37, 43, 51, 52]. However, in the sense of unsupervised clustering, it is difficult to fulfill all three requirements. For example, if an algorithm focuses on scalability, it may accept losing some quality in the results [2, 10]. In particular, many subspace clustering algorithms are not scalable to the number of dimensions.
To address the above three issues, we propose a novel clustering algorithm that utilizes domain transformation. In our proposed framework, by limiting the size of the search space through domain transformation, we provide a subspace clustering method that is scalable to the number of dimensions without losing quality in the results. There are four steps in our framework, and the details of each step are explained in Section 3.3.
3.2 Related Work
In this section, we briefly discuss previous approaches to subspace clustering. Jiang et al. [32] and Parsons et al. [41] provide comprehensive reviews of gene expression clustering and subspace clustering, respectively. For details, refer to those papers [32, 41].
Besides subspace clustering methods, feature transformation [28] and feature selection techniques [9, 39, 42, 58] are used for analyzing high-dimensional data. In general, feature transformation uses all dimensions and transforms linear combinations of dimensions into new features in order to reduce the number of features. However, because the method preserves the relations between objects in the original data, it cannot be free from noise and irrelevant dimensions. On the other hand, feature selection attempts to find relevant features among all dimensions and uses the extracted features for data mining. Noise and irrelevant dimensions can be removed by feature selection, but the method limits the number of subspaces, so it cannot detect other interesting subspaces that may hold important information.
Clustering on subsets of conditions has received significant attention during the past few years [4, 2, 3, 10, 11, 22, 25, 37, 43, 51, 52]. Each algorithm can be classified into different sets of categories. In this chapter, we use two categories (top-down and bottom-up approaches), as suggested by Parsons et al. [41]. Bottom-up approaches [4, 11, 25, 10, 37, 43] first identify clusters in low dimensions (C) and derive clusters in higher dimensions based on C using the Apriori principle. On the other hand, top-down approaches [2, 3, 51, 52, 22] find rough clusters in the full-dimensional space by sampling objects and assigning the same importance to each dimension. Then, over several iterations, the importance (weight) of each dimension is changed by adding objects that were not included during previous iterations. Highly weighted dimensions form the subspaces of the resulting clusters. The main obstacle for bottom-up approaches is achieving scalability with respect to the number of dimensions [4, 11, 25, 10, 43]. To resolve this scalability problem, CLTree [37] utilizes a decision tree. However, because it does not utilize all dimensions for constructing subspace clusters, it cannot guarantee the quality of the resulting subspaces. Top-down algorithms do not allow instances to overlap in multiple clusters, which makes it difficult to find multiple functionalities of certain instances.
However, our method differs from other approaches in that all maximal subspaces for object pairs (whose maximum number is bounded by n(n-1)/2) are generated by transforming a set of objects into a set of object-object relations. The identification of meaningful subspace clusters is based on the maximal patterns for object pairs. This limited search space makes our algorithm scalable to the number of dimensions and allows it to fulfill the requirements of subspace clustering proposed by Agrawal [4].
3.3 The Subspace Clustering Algorithm
In this section, we present the details of each step of our approach. Table 3.1 shows the notation used throughout this chapter.

Notation   Meaning
n          The total number of genes in the gene expression data
m          The total number of dimensions in the gene expression data
X          The n × m gene expression profile matrix
x_i        The i-th gene
x_ij       The value of x_i in the j-th dimension
S_i        The set of symbols for the i-th dimension (S_i ∩ S_j = ∅ if i ≠ j)
s_ij       A symbol for x_ij
s_i        The sequence of symbols for x_i
K          The maximum number of symbols
p_i        A pattern
mp_ij      The maximal pattern for x_i and x_j
P_a        The set of maximal patterns that contain a symbol a
P_i        The set of all maximal patterns in an i-dimensional subspace
P          The set of all maximal patterns in X
σ_i        The minimum number of elements that an i-dimensional subspace cluster should have
δ          The minimum dimension of subspace clusters
Table 3.1: Notations for subspace clustering

The first step of our clustering is to quantize the gene expression dataset (Section 3.3.1). The primary reason for discretization is that we need to efficiently extract the maximal subspace cluster to which every gene pair belongs. In-depth discussions of why we need discretization, together with the details of our algorithm, are presented in Section 3.3.2. In Section 3.3.3, we explain how we can further reduce the computational cost by utilizing an inverted index and pruning. Finally, in Section 3.3.4, we explain how to select meaningful subspace clusters. Figure 3.1 sketches the proposed subspace clustering algorithm.
3.3.1 Discretization
In general, discretization approaches can be categorized in three ways:
Equi-width bins. Each bin has approximately the same size.
Equi-depth bins. Each bin has approximately the same number of data elements.
Homogeneity-based bins. The data elements in each bin are similar to each other.
Step 1. Discretization and symbolization of gene expression data:
    Initialization: i = 1.
    1.1 Perform K-means clustering on the i-th dimension.
    1.2 Using S_i, assign symbols to each cluster.
    1.3 i = i + 1.
    Repeat Step 1 for i <= m.
Step 2. Enumeration of all possible maximal patterns for each gene pair,
        and construction of an inverted index for each dimension and a hash table:
    2.1 Extract a maximal pattern (mp_ij) between x_i and x_j using s_i and s_j.
    2.2 Insert mp_ij into the inverted index at dimension k if the length of mp_ij is equal to k.
    2.3 Insert the gene pair x_i and x_j into the hash table indexed by mp_ij.
    Repeat Step 2 for all gene pairs.
Step 3. Computation of support, pruning, and generation of candidate clusters:
    Initialization: i = δ, where δ is a user-defined threshold for the minimum dimension of a subspace cluster.
    3.1 Compute the support of each maximal pattern (p) at i using the inverted index.
    3.2 Prune p at i, and the super-patterns of p at i + 1, if the support of p is less than σ_i.
    3.3 i = i + 1.
    Repeat Step 3 for i <= m - 1.
Step 4. Identification of interesting subspace clusters:
    Initialization: i = δ.
    4.1 For each pattern (p) of the candidate clusters at i, calculate the difference between
        its support and the support of each super-pattern of p at i + 1, and assign the biggest
        difference as the maximum difference of p.
    4.2 Remove p from the candidate clusters at i if the maximum difference of p is less than
        a user-defined threshold.
    4.3 i = i + 1.
    Repeat Step 4 for i <= m - 1.
Figure 3.1: An overview of the proposed clustering algorithm
In this chapter, we use a homogeneity-based bins approach. In particular, we utilize K-means clustering with the Euclidean distance metric [17]. Because we apply K-means to 1-dimensional data, each cluster corresponds to an interval. Additionally, K corresponds to the number of symbols used for discretization.
Once we identify clusters, genes belonging to the same cluster are discretized with the same symbol. That is, x_id and x_jd are represented by the same symbol if and only if x_id, x_jd ∈ C_kd, where C_kd is the k-th cluster in the d-th dimension. The complexity of this step is O(nmK), where n and m correspond to the number of genes and conditions, respectively.
In this chapter, we use the same value of K across all dimensions for simplicity (a more effective discretization algorithm for gene expression data is currently under development). However, in Section 3.4, we investigate the effect of different values of K in terms of running time.
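To make the discretization step concrete, the following Python sketch illustrates one way to discretize each dimension with 1-dimensional K-means and to assign dimension-specific symbols; the function name, the symbol naming scheme, and the use of scikit-learn's KMeans are illustrative assumptions rather than the exact implementation used in our experiments.

```python
# A minimal sketch of Step 1 (discretization/symbolization); names and the use of
# scikit-learn's KMeans on 1-dimensional data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def symbolize(X, K=3, seed=0):
    """Discretize each dimension of an n x m matrix X into K symbols.

    Returns an n x m array of strings such as 'd3_1', where the prefix encodes
    the dimension so that symbol sets of different dimensions never overlap
    (S_i and S_j are disjoint for i != j).
    """
    n, m = X.shape
    symbols = np.empty((n, m), dtype=object)
    for d in range(m):
        column = X[:, d].reshape(-1, 1)          # 1-dimensional data for dimension d
        labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(column)
        for i in range(n):
            symbols[i, d] = f"d{d}_{labels[i]}"  # same cluster -> same symbol
    return symbols
```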
Figure 3.2: An illustration of domain transformation
3.3.2 Identification of Candidate Subspace Clusters based on Domain Transformation
The core of our approach lies in domain transformation to tackle subspace clustering. That is, the problem of subspace clustering is significantly simplified by transforming the gene expression data into the domain of gene-gene relations (Figure 3.2). In this section, we explore how to identify candidate subspace clusters based on the notion of domain transformation. Before we present detailed discussions of the proposed clustering algorithm, definitions of basic terminology are provided first.
Definition 3.3.1: [Induced-string] A string string_1 is referred to as an induced-string of string_2 if every symbol of string_1 appears in string_2 and the length of string_1 is less than that of string_2.
Definition 3.3.2: [(Maximal) pattern] Given s_i and s_j, which are the symbolic representations of x_i and x_j, respectively, any induced-string of both s_i and s_j is referred to as a pattern of x_i and x_j. A pattern is referred to as a maximal pattern of x_i and x_j (denoted as mp_ij) if and only if every pattern of x_i and x_j (except mp_ij) is an induced-string of mp_ij.
For example, if s_1 = a_1 b_1 c_2 d_3 and s_2 = a_1 b_1 c_1 d_3, then the maximal pattern for x_1 and x_2 is a_1 b_1 d_3. Since the length of the maximal pattern is 3, the maximum dimension of a candidate subspace cluster that can host x_1 and x_2 is 3. Thus, the set of genes that have the same maximal pattern can be a candidate maximal subspace cluster. By discretizing the gene expression data, we can obtain the upper bound for the number of maximal patterns, which is equal to n(n-1)/2.
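As a concrete illustration, the following sketch extracts the maximal pattern of one gene pair under the symbol representation produced above; because the symbol sets of different dimensions are disjoint, two symbol sequences can only agree position by position, so the maximal pattern reduces to the positions where the two sequences match. The (dimension, symbol) tuple representation is an assumption made for illustration.

```python
# A minimal sketch of maximal-pattern extraction for a single gene pair; the
# (dimension, symbol) tuple representation of a pattern is an illustrative choice.
def maximal_pattern(s_i, s_j):
    """Return the maximal pattern of two symbol sequences as a tuple of (dim, symbol)."""
    return tuple((d, a) for d, (a, b) in enumerate(zip(s_i, s_j)) if a == b)

# Example from the text: s_1 = a1 b1 c2 d3 and s_2 = a1 b1 c1 d3 share a1 b1 d3.
s_1 = ["a1", "b1", "c2", "d3"]
s_2 = ["a1", "b1", "c1", "d3"]
print(maximal_pattern(s_1, s_2))   # ((0, 'a1'), (1, 'b1'), (3, 'd3')) -> length 3
```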
Because a pattern is represented in the form of a string, based on the definition of an induced-string, the notion of a super-pattern is defined as follows:
Definition 3.3.3: [Super-pattern] A pattern p_i is a super-pattern of p_j if and only if p_j is an induced-string of p_i.
Figure 3.3 illustrates how we approach the problem of subspace clustering. Each vertex represents a gene, and the edge between two vertices shows the maximal pattern between the two genes. In order to find meaningful subspace clusters, we define the minimum number of objects (σ_i) that an i-dimensional subspace cluster should have. For this purpose, the notion of support is defined as follows:
Definition 3.3.4: [Support] Given a pattern p_k, the support of p_k is the total number of gene pairs (x_i and x_j) such that p_k is an induced-string of mp_ij or p_k is a maximal pattern of x_i and x_j.
Thus, the problem of subspace clustering is reduced to computing the support of each maximal pattern and identifying the maximal patterns whose support exceeds σ_i, where i is the length of the maximal pattern. The maximal pattern between x_1 and x_3 in Figure 3.3 is a_1 b_1. This means that x_1 and x_3 have the potential to be grouped together in a 2-dimensional subspace. However, we need to consider whether there exists a sufficient number of gene pairs that have the pattern a_1 b_1. This is achieved by computing the support for a_1 b_1 and checking whether the support exceeds σ_2 or not. In this example, the support for a_1 b_1 is 6.
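The following sketch shows a brute-force version of this support computation under the same (dimension, symbol) representation as above; it is only meant to spell out the definition, since Section 3.3.3 replaces the pairwise scan with an inverted index.

```python
# A brute-force sketch of Definition 3.3.4: the support of a pattern p is the number
# of gene pairs whose maximal pattern contains p (or equals p). For illustration only;
# the inverted index of Section 3.3.3 avoids this quadratic scan.
from itertools import combinations

def support(p, symbol_rows):
    """symbol_rows: list of symbol sequences, one per gene; p: iterable of (dim, symbol)."""
    p = set(p)
    count = 0
    for i, j in combinations(range(len(symbol_rows)), 2):
        mp_ij = set(maximal_pattern(symbol_rows[i], symbol_rows[j]))
        if p <= mp_ij:               # p is an induced-string of mp_ij, or p == mp_ij
            count += 1
    return count
```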
3.3.3 Efficient Super-pattern Search based on Inverted Index
Achieving an efficient super-pattern search for support computation is important in our algorithm. The simple approach to searching for the super-patterns of a given maximal pattern at dimension d is to conduct a sequential scan over all maximal patterns at dimensions d' (d' > d). The following shows the running time of this approach.
Figure 3.3: A sample example of subspace cluster discovery
\sum_{i=\delta}^{m-1} |P_i| \sum_{j=i+1}^{m} \left( j \cdot |P_j| \right) \qquad (3.1)

where |P_j| is the number of maximal patterns at dimension j and δ is the minimum dimension of subspace clusters.
However, we can reduce this time by utilizing an inverted index, which has been widely used in modern information retrieval. In an inverted index [44], the index associates a set of documents with each term. That is, for each term t_i, we build a document list (D_i) that contains all documents containing t_i. Thus, when a query q is composed of t_1, ..., t_k, to identify the set of documents that contain q, it is sufficient to examine the documents that contain all of the t_i's (i.e., the intersection of the D_i's) instead of checking the whole document dataset.
Figure 3.4: Inverted index at 4-dimensional subspace
By implementing each document list as a hash table, looking up the documents that contain t_i takes constant time.
In our super-pattern search problem, each term and document correspond to a symbol and a pattern, respectively. Figure 3.4 illustrates the idea. Each pattern list is also implemented as a hash table. When a query pattern is composed of multiple symbols, the symbols are sorted (in ascending order) according to their pattern frequency. We then take the intersection of the pattern lists, starting with the one whose pattern frequency is lowest. For example, given a query pattern b_2 c_0 d_0, we search for super-patterns as follows:

(P_{b_2} \cap P_{d_0}) \cap P_{c_0} = (\{p_3\} \cap \{p_2, p_4, p_5, p_6\}) \cap P_{c_0} = \emptyset

where P_a is the pattern list for a symbol a. (If we reach an empty set while taking intersections, there is no need to continue intersecting pattern lists.)
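The sketch below shows how such an inverted index can be built per dimension and queried by intersecting the smallest pattern lists first, with an early exit when the running intersection becomes empty; the data structures and the small example are illustrative assumptions, not the exact contents of Figure 3.4.

```python
# A minimal sketch of the per-dimension inverted index for super-pattern search.
# Structures and the toy example are illustrative assumptions.
from collections import defaultdict

def build_inverted_index(patterns_at_k):
    """patterns_at_k: dict pattern_id -> iterable of symbols (a maximal pattern of length k)."""
    index = defaultdict(set)
    for pid, pattern in patterns_at_k.items():
        for symbol in pattern:
            index[symbol].add(pid)          # pattern list P_a for each symbol a
    return index

def super_patterns(query_symbols, index):
    """Return the ids of patterns at this dimension that contain every symbol of the query."""
    lists = sorted((index.get(sym, set()) for sym in query_symbols), key=len)
    result = set(lists[0])                  # start from the shortest pattern list
    for plist in lists[1:]:
        result &= plist
        if not result:                      # empty intersection: stop early
            break
    return result

# Toy query: no 4-dimensional pattern here contains all of b2, c0 and d0.
idx = build_inverted_index({2: ["a0", "b1", "c1", "d0"],
                            3: ["a1", "b2", "c1", "d1"],
                            4: ["a1", "b1", "c2", "d0"]})
print(super_patterns(["b2", "c0", "d0"], idx))   # -> set()
```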
By taking the intersection of the pattern lists whose sizes are small first, we can reduce the number of operations. The worst-case time complexity for the identification of all super-patterns for all query patterns a = (a_1, a_2, ..., a_k) in P_k is as follows:

T_k = \sum_{i=k+1}^{m} \sum_{a \in P_k} \left( \min(|P_{a_1}^{i}|, \ldots, |P_{a_k}^{i}|) \cdot (k - 1) \right) \qquad (3.2)

where |P_{a_k}^{i}| corresponds to the length of the pattern list for the symbol a_k at dimension i.
Note that the worst-case time complexity for the construction of the inverted index at each dimension k is k \cdot |P_k|. Thus, the total time complexity for super-pattern search is given below:

\sum_{k=\delta}^{m} \left( k \cdot |P_k| \right) + \sum_{k=\delta}^{m-1} \left( k \log k + |P_k| \cdot T_k \right) \qquad (3.3)
Moreover, we can further reduce the running time through a pruning step. In particular, we use the notion of expected support from Zaki et al. [59]. That is, if the support for p_ij at dimension k is less than σ_k, then besides eliminating p_ij from future consideration in subspace clustering, we no longer consider the super-patterns of p_ij at dimension k + 1.
3.3.4 Identification of Meaningful Subspace Clusters
After pruning maximal patterns using the notion of support, the final step is to identify meaningful subspace clusters, which corresponds to Step 4 in Figure 3.1. Since we are interested in maximal subspace clusters, we scan the maximal patterns (denoted as p) at dimension i and compare the support of p with the support of all of its super-patterns (denoted as p') at dimension i + 1. A maximal pattern p at dimension i is not considered for a subspace cluster if the difference between the support of p and that of p' is less than a user-defined threshold.
Figure 3.5: Identification of subspace clusters
We illustrate this step with Figure 3.5. As shown, there are three maximal patterns at dimension 2 (A1B1, A1C1, and A1C2) and two maximal patterns at dimension 3 (A1B1C1 and A1B1C2), respectively. The cluster with pattern A1B1 (C_1) contains the clusters with patterns A1B1C1 (C_2) and A1B1C2 (C_3). If the support difference between C_1 and C_2 (or between C_1 and C_3) is less than the user-defined threshold, then C_1 is ignored and both C_2 and C_3 are kept as meaningful subspace clusters. Otherwise, C_1 is also considered a meaningful cluster besides C_2 and C_3.
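As a concrete rendering of this filtering rule, the following sketch keeps a candidate pattern at dimension i only when its support exceeds the support of every super-pattern at dimension i + 1 by at least the user-defined threshold; the variable names and the treatment of patterns without super-patterns are illustrative assumptions.

```python
# A minimal sketch of Step 4 (identification of meaningful subspace clusters).
# Patterns are tuples of (dim, symbol); support_of maps a pattern to its support.
def select_meaningful(candidates_at_i, candidates_at_next, support_of, threshold):
    kept = []
    for p in candidates_at_i:
        supers = [q for q in candidates_at_next if set(p) <= set(q)]
        # Biggest drop in support when moving up one dimension; a pattern with no
        # super-pattern is always kept (an assumption for the boundary case).
        max_diff = max((support_of[p] - support_of[q] for q in supers),
                       default=float("inf"))
        if max_diff >= threshold:
            kept.append(p)
    return kept
```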
3.4 Experimental Results
In this section, we present experimental results that demonstrate the effectiveness of the proposed clustering algorithm. Section 3.4.1 describes our experimental setup, and the experimental results are presented in Section 3.4.2.
3.4.1 Experimental Setup
For empirical evaluation, the proposed clustering algorithm was tested on yeast cell cycle data. The data is a subset of the original 6,220 genes with 17 time points listed by [12]. Cho et al. sampled 17 time points at 10-minute intervals, covering nearly two full cell cycles of the yeast Saccharomyces cerevisiae. Among the 6,220 genes, 2,321 genes were selected based on the largest variance in their expression. In addition, one abnormal time point was removed from the dataset as suggested by [47]; consequently, the resulting data consists of 2,321 genes with 16 time points. Our analysis is primarily focused on this dataset.
3.4.2 Experimental Results
Figure 3.6 demonstrates how well our clustering algorithm was able to capture highly correlated clusters under a subset of conditions. The x-axis represents the conditions, and the y-axis represents the expression level. As shown, subspace clusters do not necessarily manifest high correlations across all conditions. We also verified the biological meanings of the subspace clusters with the Gene Ontology [61]. For example, YBR279w and YLR068w (Figure 3.6(b)) have the most specific common parent "nucleobase, nucleoside, nucleotide and nucleic acid metabolism" in the Gene Ontology. However, they show low correlation in the full dimensions. In addition, YBR083w and YDL197C (Figure 3.6(d)), belonging to subspace cluster 625, participate in a similar biological function, "transcriptional control".
In order to investigate the number of maximal patterns in various scenarios, the pattern rate, which represents how many maximal patterns are generated in comparison with the number of possible maximal patterns, is defined as follows:

\text{Pattern Rate} = \frac{\#\text{maximal patterns}}{n(n-1)/2} \qquad (3.4)
Figure 3.7 shows the relationship between the number of genes and the pattern rate. The x-axis represents the number of genes (n), and the y-axis represents the pattern rate. As shown, the pattern rate decreases as n increases. This property is related to the nature of the datasets. In general, datasets should have a certain amount of correlation among object pairs; otherwise, clustering cannot detect meaningful clusters from the dataset. Thus, the growth rate of the number of actual maximal patterns is much smaller than the growth rate of the number of possible maximal patterns (n(n-1)/2).
Figure 3.8 shows how much the running time can be improved by using the inverted index. The x-axis represents the number of symbols, and the y-axis represents the ratio of Equation (3.3) to Equation (3.1), which is defined as the running time rate. As shown, we observed significant improvement. For example, when the number of symbols is 7, the running time was reduced by more than a factor of 20. Moreover, we observed performance improvement as the number of symbols increases. This is primarily because the length of the pattern lists in the inverted index decreases as the number of symbols increases.
[Figure 3.6 comprises six panels plotting expression level against conditions: (a) expression level of genes in subspace cluster 453, (b) subspace cluster 453, (c) expression level of genes in subspace cluster 625, (d) subspace cluster 625, (e) expression level of genes in subspace cluster 392, (f) subspace cluster 392.]
Figure 3.6: Sample plots for subspace clusters
Figure 3.7: The number of genes versus pattern rate
Figure 3.9 shows the scalability of our algorithm in terms of the number of dimensions. The x-axis represents the number of dimensions, and the y-axis represents the running time rate, which is defined as the running time at d divided by the running time when d is 4. As shown, we observed a linear increase in the running time rate, which supports the scalability of our algorithm.
Figure 3.8: Different numbers of symbols versus running time rate
Figure 3.9: The number of dimensions versus running time rate with the use of the inverted index
Chapter 4
Density-based Searchspace-limited Subspace clustering
4.1 Preliminary
As described in the Introduction of this dissertation, the main goal of applying a clustering approach to gene expression datasets is to find functionally related genes in the raw data. The rich amount of data generated by recent microarray technologies provides more chances to reveal the information hidden in the raw data. On the other hand, the huge volume of data makes it difficult to apply traditional full-dimensional clustering approaches. Thus, we developed the searchspace-limited subspace clustering described in Chapter 3. Even though the searchspace-limited subspace clustering approach is scalable to the number of dimensions and its computational cost is limited by the upper bound, the complexity of clustering is still high for a high-dimensional, huge-volume dataset.
In this chapter, we propose a novel mining framework that supports the identification of meaningful subspace clusters using a density-based approach. When the number of dimensions (m) is equal to 50, the number of possible subspaces is 2^{50} - 1 \approx 1.1259 \times 10^{15}. However, this number only counts the possible subspaces; if we also consider the possible number of bins (S) in each dimension, then the number of possible subspace clusters increases to the number calculated by the following equation. For m equal to 50 and S equal to 3, the possible number of subspace clusters is 1.2677 \times 10^{30}!

\sum_{i=1}^{m} \binom{m}{i} S^{i} \qquad (4.1)
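The two counts quoted above can be checked with a few lines of Python; this is only a verification sketch of Equation 4.1 (note that the sum telescopes to (S + 1)^m - 1), not part of the clustering algorithm.

```python
# Verification sketch for the subspace counts quoted in the text.
from math import comb

m, S = 50, 3
num_subspaces = 2**m - 1                                                   # possible subspaces
num_subspace_clusters = sum(comb(m, i) * S**i for i in range(1, m + 1))    # Equation 4.1
print(f"{num_subspaces:.4e}")              # 1.1259e+15
print(f"{num_subspace_clusters:.4e}")      # 1.2677e+30, i.e. (S + 1)**m - 1
```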
Consequently, it is computationally expensive to search all subspaces to identify clusters. To address this computational issue, many subspace clustering algorithms utilize a bottom-up approach, but it is not scalable to the number of dimensions. On the other hand, to address the scalability issue, other subspace clustering algorithms utilize a top-down approach, but this approach does not allow overlapping subspace clusters.
To cope with both the scalability and the overlapping issue, we presented a novel searchspace-limited subspace clustering approach based on domain transformation in Chapter 3. In contrast to previous subspace clustering approaches, our method is based on the observation that each gene pair generates a maximal subspace cluster. This observation lets us transform the dataset from genes to gene-gene pairs, and this transformation allows us to limit the number of possible subspace clusters to n(n-1)/2, where n is the number of genes. Based on the transformed data, we presented an efficient subspace clustering algorithm that is scalable to the number of dimensions. At the same time, the subspace clustering algorithm allows overlapping clusters.
Furthermore, by investigating our density-based clustering approach, we discovered a way to reduce the computational cost of the searchspace-limited subspace clustering approach. In the density-based clustering approach, we can identify core genes and outlier genes based on their k-NN density. Since a core point has high density, it is expected to lie near the center of a cluster and be surrounded by similar points (other core points or border points). That is, core points have a high potential of belonging to multiple subspace clusters. In addition, a subspace cluster generated by core points is expected to contain more genes because of the high density of the points. Therefore, instead of performing subspace clustering on the whole dataset, by performing subspace clustering on only the core points, we can reduce the running time drastically and still find the important subspace clusters. After that, border points are used to expand the cluster structure by assigning them to the most relevant cluster. Coupled with the density-based approach, experimental results indicate that our subspace clustering improves running time significantly.
The remainder of this chapter is organized as follows. In Section 4.2, we briefly review the related work and highlight the strengths and weaknesses of previous approaches in comparison with ours. Section 4.3 explores how to enhance our subspace clustering using a density-based approach. Section 4.4 presents our experimental results on yeast cell cycle datasets.
4.2 Related Work
Usually, subspace clustering approaches are categorized into top-down or bottom-up approaches. As mentioned, a top-down approach does not allow overlapping clusters, because it starts clustering from the full-dimensional space. However, allowing overlapping clusters might be one of the natural reasons to do subspace clustering. Examples of top-down approaches are PROCLUS [2], DOC [43], SSPC [56], and MINECLUS [57]. On the other hand, bottom-up approaches allow overlapping clusters, but they struggle with a huge search space. Because a bottom-up approach starts clustering from single-dimensional spaces, the theoretical search space increases exponentially with the number of dimensions. That is why all bottom-up approaches utilize some kind of search heuristic to reduce the running time. Examples of bottom-up approaches are CLIQUE [4], EDSC [1], ENCLUS [11], SUBCLU [34], nCluster [38], and MAFIA [40]. Among the bottom-up approaches, SUBCLU and EDSC utilize densities as one of their search heuristics, but the densities used in these approaches are local densities, not the full-dimensional density that we use. For details about diverse subspace clustering approaches, refer to [41] and [36].
4.3 The Density-based Subspace Clustering Algorithm
We start this section by briefly describing our density-based clustering approach and our searchspace-limited subspace clustering approach. After that, we present the details of the density-based searchspace-limited subspace clustering.
4.3.1 Basic notion of Density-based clustering
The following steps sketch the basic steps of density-based clustering.
1. The density of each gene is estimated first. Section 2.3.2 gives the details of this step.
2. Genes with high density (i.e., core genes) and genes with low density (i.e., outlier genes) are identified. Non-core, non-outlier genes are defined as border genes. Section 2.4.1 gives the details of this step.
3. Since a core gene has high density, it is expected to lie well inside a cluster (i.e., to be a representative of the cluster). Thus, instead of conducting clustering on the whole data, performing clustering on the core genes alone can produce a skeleton of the cluster structure. Section 2.4.2 gives the details of this step.
4. Border genes are used to expand the cluster structure by assigning them to the most relevant cluster. Section 2.4.3 gives the details of this step.
In density-based clustering, outlier genes may not be clustered into any cluster. That is, since the goal of gene expression clustering is to identify sets of genes with similar patterns, it may be necessary to discard outlier genes during the clustering process. While this approach does not provide a complete organization of all genes, it can extract the "essentials" of the information in the given data. However, if a complete clustering is necessary, the outlier genes can be added to the closest clusters.
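A simplified sketch of steps 1 and 2 above is given below: the density of each gene is taken as the inverse of its mean distance to its k nearest neighbors, and fixed quantile cut-offs separate core, border, and outlier genes. The density definition and the quantile thresholds are illustrative assumptions; the actual selection described in Sections 2.3.2 and 2.4.1 is based on a mutual neighborhood graph.

```python
# A simplified sketch of k-NN density estimation and core/border/outlier labeling.
# The density definition and the quantile cut-offs are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_points(X, k=10, core_q=0.75, outlier_q=0.10):
    """Return an array of labels in {'core', 'border', 'outlier'} for the rows of X."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-12)     # skip the distance to self
    core_cut, outlier_cut = np.quantile(density, [core_q, outlier_q])
    labels = np.full(len(X), "border", dtype=object)
    labels[density >= core_cut] = "core"
    labels[density <= outlier_cut] = "outlier"
    return labels
```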
4.3.2 Identification of Searchspace-limited Subspace Clusters
The following steps sketch the outline of the searchspace-limited subspace clustering algorithm without the density-based approach.
1. Discretization and symbolization of gene expression data. The first step of our subspace clustering is to quantize the gene expression dataset. Section 3.3.1 gives the details of this step.
2. Enumeration of all possible maximal patterns for each gene pair. Section 3.3.2 describes this key part of our algorithm.
3. Super-pattern search. In Section 3.3.3, we explain how we perform the super-pattern search by utilizing an inverted index.
4.3.3 Application of Density-based Approach to Subspace Clustering
In order to make our subspace clustering more scalable, we incorporate a density-based approach into subspace clustering. Toward this end, we investigated the relations between the densities of points and subspace clusters across different dimensions, and our use of the density-based approach in subspace clustering is motivated by the following three observations.
1. Core points attract surrounding border points to form multiple subspace clusters. That is, core points have a high potential of belonging to multiple subspace clusters. In addition, core points and a set of border points surrounding them form medium-dimensional subspace clusters. On the other hand, if a border point and a core point are far from each other, then they form a low-dimensional subspace cluster.
2. If a set of core points are similar to each other, then they form a high-dimensional subspace cluster. On the other hand, if two core points are far from each other, then they form a low-dimensional subspace cluster.
3. Outlier points belong to low-dimensional subspace clusters. Otherwise, i.e., if they belonged to medium-dimensional subspace clusters, that would mean there exist many other points that are similar to the outlier points. This would result in medium or high density for the outlier points, which contradicts the definition of outlier points.
Figure 4.1: An illustration of how core points attract border points in different subspace clusters
Figure 4.1 presents a sample example for the first observation. Let R_i be the set of points located at region i. Given that all the points in R_2 (also represented as A_2 B_4) are very close to each other, the points in R_2 are defined as core points. Similarly, the points in R_1, R_3, and R_4 are assigned as border points. The following shows the subspace clusters resulting from this dataset:

C_1 = R_2
C_2 = R_2 ∪ R_4
C_3 = R_1 ∪ R_2 ∪ R_3
Figure 4.2: A plot of two sample core clusters that are far from each other. The red line shows the k-nearest neighbors of the gene YBR089w, and the blue line shows the k-nearest neighbors of the gene YDR471w, in the Cho data.
As shown, cluster C_1 is a 2-dimensional subspace cluster (dimensions x and y) formed by core points only. Similarly, cluster C_2 is formed by the core points and the border points in R_4, resulting in a subspace cluster in dimension y. In addition, Figure 4.2 illustrates the second observation. As shown, when two core points have high similarity, they belong to a high-dimensional subspace cluster. On the other hand, when two core points have low similarity, for example the genes YBR089w and YDR471w, the possible subspace cluster that hosts the two core points is of very low dimension.
Furthermore, based on the third observation on outliers, we present a systematic approach to determine one of the key thresholds in subspace clustering, which generally has many parameters to be set through experiments. One of the key parameters is the minimum number of dimensions for meaningful subspace clusters. In previous density-based clustering, outlier points tend to be considered noise points and removed in the preprocessing step. However, instead of setting this parameter through many experiments, we utilize the outlier points, which carry valuable information, in subspace clustering.
Let O_1, ..., O_n be the outlier points, and let P_i be the set of maximal patterns generated from O_i and the points in N_k(O_i). Then, the minimum number of dimensions (δ) for subspace clustering is defined as follows:

\delta = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} l_{ij} \qquad (4.2)

where l_{ij} is the length of the maximal pattern generated by O_i and the j-th nearest neighbor of O_i.
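Under the same helper functions sketched earlier (maximal_pattern and label_points), Equation 4.2 can be computed as the average maximal-pattern length between each outlier and its k nearest neighbors; the neighbor search shown here is an illustrative assumption.

```python
# A minimal sketch of Equation 4.2; it reuses maximal_pattern() and label_points()
# from the earlier sketches. The neighbor search is an illustrative assumption.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minimum_dimension(X, symbols, labels, k=10):
    outliers = np.where(labels == "outlier")[0]
    _, nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X[outliers])
    lengths = [len(maximal_pattern(symbols[i], symbols[j]))
               for row, i in zip(nbrs, outliers)
               for j in row[1:]]                      # skip the outlier itself
    return sum(lengths) / len(lengths)                # delta = (1 / nk) * sum of l_ij
```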
Thus, equipped with the above three observations, we present a highly scalable density-based subspace clustering algorithm, shown in Figure 4.3. Based on the first observation, given that medium- and high-dimensional subspace clusters generally carry more valuable information than low-dimensional subspace clusters, what we actually need is all the patterns generated by core points. That is, instead of performing subspace clustering on all data points, conducting subspace clustering on the core points only produces the skeleton of the subspace clusters. After that, we assign border points to the closest subspace clusters. Instead of generating a pattern for each border point and core point, we generate a pattern for each border point and core pattern, which drastically reduces the number of patterns.
Therefore, instead of performing subspace clustering on the whole dataset, by performing subspace clustering on only the core points, we can reduce the running time drastically.
Step 1. K-NN density estimation for each point:
    Select core, border, and outlier points.
Step 2. Discretization and symbolization of gene expression data.
Step 3. Apply searchspace-limited subspace clustering to the core points:
    3.1 Generation of all the maximal patterns.
    3.2 Computation of support.
    3.3 Super-pattern search.
Step 4. Expansion of the subspace clusters resulting from the core points:
    Assign border points to a representative core pattern.
Step 5. Filter subspaces whose dimension is less than δ, where δ is determined by Equation 4.2.
Figure 4.3: An overview of the proposed clustering algorithm
After that, the border genes are used to expand the cluster structure by assigning them to the most relevant cluster.
For each border point, we find the similarity between the border point and the core clusters. We only generate patterns when the similarity between a border point and a core cluster exceeds a certain threshold. Note that the accuracy of the algorithm does not depend on this threshold; it only affects the running time.
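The following sketch renders this expansion step: each border gene is compared with the mean profile of each core cluster, and it is assigned to the most similar one only when the similarity clears the threshold. The use of Pearson correlation as the similarity and the threshold value are illustrative assumptions.

```python
# A minimal sketch of the border-point expansion step; the similarity measure and
# the threshold value are illustrative assumptions.
import numpy as np

def assign_border_points(X, border_ids, core_clusters, threshold=0.7):
    """core_clusters: dict cluster_id -> list of row indices of its core genes."""
    assignment = {}
    for b in border_ids:
        best_id, best_sim = None, threshold
        for cid, members in core_clusters.items():
            centroid = X[members].mean(axis=0)
            sim = np.corrcoef(X[b], centroid)[0, 1]      # similarity to the core cluster
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None:                          # otherwise the border gene is left out
            assignment[b] = best_id
    return assignment
```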
4.4 Experimental Results
For the purpose of an efficiency comparison of our subspace clustering algorithm, we need to define which subspace clusters are more important than others. As is well known in the data mining community, good clustering can be defined in several ways. For example, some algorithms aim to achieve higher similarity between objects in a cluster and, at the same time, higher dissimilarity between objects in different clusters. On the other hand, some others aim to group conceptually related objects. In addition, the size of the clusters may be an issue, which is related to the granularity of the clusters. Fine-grained clusters can provide higher inner similarities, but there is a high risk of misplacing naturally similar objects in different clusters. Especially in subspace clustering, when considering the importance of the subspace clusters, the number of dimensions of the clusters should also be considered.
In this chapter, we assume that larger clusters in higher dimensional spaces are more important than the others, and we do not apply inner-similarity measures for cluster analysis. Because our subspace clustering algorithm allows overlapping clusters, we expect a larger cluster to include more valuable relational information among the objects. In addition, we assume that higher dimensional subspace clusters are more informative than lower dimensional ones, because they utilize more features of the raw data.
The goal of developing our subspace clustering algorithm is to reduce the running time while maintaining the ability to find important subspace clusters. Because the running time of a subspace clustering algorithm is highly affected by the size of its search space, we show the efficiency of our algorithm by comparing the sizes of the search spaces instead of measuring the actual running time.
The proposed clustering algorithm was tested on two yeast cell cycle datasets. The first dataset is a subset of the original 6,220 genes with 17 time points listed by [12]. Cho et al. sampled 17 time points at 10-minute intervals, covering nearly two full cell cycles of the yeast Saccharomyces cerevisiae. Among the 6,220 genes, 2,321 genes were selected based on the largest variance in their expression. In addition, one abnormal time point was removed from the dataset as suggested by [47]; consequently, the resulting data consists of 2,321 genes with 16 time points.
The second dataset is Spellman et al.'s yeast cell cycle dataset (a.k.a. the Spellman dataset) [46]. Using cDNA arrays, Spellman et al. measured the genome-wide mRNA levels of 6,108 yeast ORFs simultaneously over approximately two cell cycle periods in a yeast culture synchronized by α factor, relative to a reference mRNA from an asynchronous yeast culture. The yeast cells were sampled at 7-minute intervals for 119 minutes, with a total of 18 time points after synchronization. Among the 6,108 genes, we removed the genes with missing values and obtained 4,489 genes. Thus, the dataset is organized as an 18 × 4,489 matrix with equally spaced sampling time points. In this chapter, rather than trying to identify cell-cycle regulated gene clusters (by relying on external knowledge), we perform unsupervised clustering on the 4,489 genes.
4.4.1 Scalability of Density-based Searchspace-limited Subspace Clustering
In general, the scalability of subspace clustering depends on the number of dimensions of the data, because the search space increases exponentially with the number of dimensions. However, in our subspace clustering algorithm, the size of the search space is limited by the number of objects in the data. So, we demonstrate the scalability of our clustering algorithm with respect to an increasing number of objects (i.e., cores) instead of the number of dimensions.
Table 4.1 through Table 4.4 support the scalability of our subspace clustering algorithm. The columns of the tables correspond to the number of cores used to generate the density-based searchspace-limited subspace clusters, and the rows correspond to the number of bins used in the discretization step.

# cores       200    300    400    500     600     700     800     900     2321
upper bound   19900  44850  79800  124750  179700  244650  319600  404550  2692360
3 bins        6248   12822  21716  34120   49071   66054   86077   107891  650736
4 bins        6233   10852  16692  24448   33560   44170   56008   68994   329901
5 bins        6111   9813   14035  19248   25049   31679   39043   46696   155434
6 bins        6163   8890   11750  15528   19606   23253   27105   31176   80185
7 bins        5138   7045   9186   11748   14424   16678   19025   21520   44836
8 bins        4033   5034   6351   7866    9326    10526   11747   13020   24257
Table 4.1: Number of patterns generated in the Cho data

# cores   200    300    400    500    600    700    800    900    2321
3 bins    0.96   1.97   3.34   5.24   7.54   10.15  13.23  16.58  100
4 bins    1.89   3.29   5.06   7.41   10.17  13.39  16.98  20.91  100
5 bins    3.93   6.31   9.03   12.38  16.12  20.38  25.12  30.04  100
6 bins    7.69   11.09  14.65  19.37  24.45  29.00  33.80  38.88  100
7 bins    11.46  15.71  20.49  26.20  32.17  37.20  42.43  48.00  100
8 bins    16.63  20.75  26.18  32.43  38.45  43.39  48.43  53.68  100
Table 4.2: Percentage of patterns generated in the Cho data
Table 4.1 and Table 4.3 show the actual sizes of the search spaces (i.e., numbers of generated patterns) for each number of cores and number of bins in the Cho and Spellman data. The search spaces are the sums of subspace clusters in 8- through 16-dimensional subspaces in the Cho data, and in 8- through 18-dimensional subspaces in the Spellman data. Table 4.2 and Table 4.4 are the percentage representations of Table 4.1 and Table 4.3; these percentages are calculated relative to the last column of Table 4.1 and Table 4.3, respectively.
# cores       400    600     800     1000    1200    1400    1600     1800     4489
upper bound   79800  179700  319600  499500  719400  979300  1279200  1619100  10073316
3 bins        11805  17835   23331   29829   36638   43515   49777    56263    140930
4 bins        20635  35220   53722   75067   92953   116870  137523   159310   592988
5 bins        28040  62050   108079  163548  223289  292651  356018   415633   1517273
6 bins        33434  66202   105772  155417  204763  269309  331132   404430   1643909
7 bins        28719  68179   123501  194885  282576  382959  492349   606683   2776547
8 bins        29156  63631   111379  173042  241019  327265  421991   522972   2607985
Table 4.3: Number of patterns generated in the Spellman data

# cores   400   600    800    1000   1200   1400   1600   1800   4489
3 bins    8.38  12.66  16.56  21.17  26.00  30.88  35.32  39.92  100
4 bins    3.48  5.94   9.06   12.66  15.68  19.71  23.19  26.87  100
5 bins    1.85  4.09   7.12   10.78  14.72  19.29  23.46  27.39  100
6 bins    2.03  4.03   6.43   9.45   12.46  16.38  20.14  24.60  100
7 bins    1.03  2.46   4.45   7.02   10.18  13.79  17.73  21.85  100
8 bins    1.12  2.44   4.27   6.64   9.24   12.55  16.18  20.05  100
Table 4.4: Percentage of patterns generated in the Spellman data
Figure 4.4: An illustration of scalability
Figure 4.4 is a graphical representation of the values in the percentage tables. The x-axis in Figure 4.4 through Figure 4.6 represents the approximate percentage of the number of cores in the tables. As shown in the figure, as the number of cores increases, the size of the search space increases linearly for both datasets. Furthermore, as shown in Figure 4.4(c), the growth rates of the average percentage search spaces for the Cho and Spellman data are very similar, with only a minor shift. This result shows that our density-based searchspace-limited subspace clustering algorithm scales in a data-independent way.
4.4.2 Accuracy w.r.t. the number of cores and dimensionality
We compute accuracy by comparing the subspace clusters generated by non-density-based searchspace-limited subspace clustering (i.e., utilizing all the data to generate the search space) and density-based searchspace-limited subspace clustering (i.e., utilizing a limited number of cores to generate the search space). We use the result of the non-density-based subspace clustering as ground truth, because it utilizes all the objects in the dataset and produces all possible subspace clusters. On the other hand, as shown in Table 4.1 through Table 4.4, density-based searchspace-limited subspace clustering utilizes only a few core objects among all the objects in the data, and produces far fewer subspace clusters than the whole set of possible subspace clusters.

# cores   200    300    400    500    600    700    800    900
16 dim    100    100    100    100    100    100    100    100
15 dim    96.43  96.43  100    100    100    100    100    100
14 dim    94.29  94.29  97.14  97.14  100    100    100    100
13 dim    100    100    100    100    100    100    100    100
12 dim    100    100    100    100    100    100    100    100
11 dim    86.05  93.94  96.97  96.97  100    100    100    100
10 dim    71.43  97.14  97.14  97.14  97.14  100    100    100
9 dim     35.48  60.38  88.57  94.29  94.29  97.14  97.14  100
8 dim     16.95  39.74  70.83  93.75  96.88  96.88  96.88  96.88
Average   77.85  86.88  94.52  97.70  98.70  99.34  99.34  99.65
Table 4.5: Detailed accuracy of the 4-bin case in the Cho data w.r.t. # of cores and # of dimensions

# cores   200    300    400    500    600    700     800     900
3 bins    80.04  90.74  93.81  96.14  97.05  97.60   99.35   99.35
4 bins    77.85  86.88  94.52  97.70  98.70  99.34   99.34   99.65
5 bins    92.88  95.45  95.73  95.73  95.98  97.62   98.27   98.27
6 bins    80.86  86.81  91.08  95.26  97.09  97.45   97.45   98.54
7 bins    91.86  94.98  98.28  98.53  99.51  99.51   99.51   99.75
8 bins    97.21  97.68  98.17  98.17  98.65  100.00  100.00  100.00
Table 4.6: Accuracy w.r.t. # of cores and # of bins in the Cho data

# cores   400    600    800    1000   1200   1400   1600   1800
3 bins    49.63  62.84  75.19  81.45  86.31  93.86  95.05  97.49
4 bins    52.61  74.03  81.66  88.45  91.83  94.13  94.38  95.21
5 bins    33.93  65.45  80.19  88.55  94.94  98.63  98.92  99.19
6 bins    33.19  61.06  80.76  87.90  92.98  95.27  96.06  96.46
7 bins    27.38  52.72  70.20  86.63  90.35  92.60  93.16  93.97
8 bins    20.57  52.22  71.49  82.00  87.53  88.74  91.67  93.35
Table 4.7: Accuracy w.r.t. # of cores and # of bins in the Spellman data
Figure 4.5: An illustration of accuracy results
As mentioned earlier in this section, we assume that larger and higher dimensional clusters are more important than the others. In the sense of this definition, from the ground truth data (i.e., the subspace clusters generated from all objects), we collect approximately the top 10% largest clusters in each dimensional subspace. After that, we determine the minimum size (i.e., number of objects) of the top 10% clusters in each dimension from the collected set of clusters. We denote this set as G_d(m), where d is the dimension and m is the minimum number of objects at dimension d. For the comparison, we collect another set of clusters, from the subspace clusters generated by the density-based searchspace-limited subspace clustering, that have at least this boundary size. We denote this set as T_d(m), where d is the dimension and m is the same minimum number of objects as in the ground truth set.
Now, we can compare the sets of clusters resulting from the density-based and non-density-based methods in each dimensional subspace, and compute the per-dimension accuracy by Equation 4.3. In other words, the accuracy measures how many of the most important subspace clusters found by the non-density-based clustering are preserved in the result of the density-based clustering.

\frac{n(G_d(m) \cap T_d(m))}{n(G_d(m))} \qquad (4.3)
A sample result for the 4-bin case in the Cho data is shown in Table 4.5. As shown there, the accuracy in low-dimensional subspaces is lower than in high-dimensional subspaces. This is caused by the difference in the number of clusters generated in each dimension: because more clusters are generated in lower dimensional subspaces than in higher dimensional ones, there is a higher chance of missing clusters in the lower dimensional subspaces. In addition, the table shows the trend of accuracy with respect to the increasing number of cores.
For the purpose of comparison, average accuracies are used in Table 4.6 and Table 4.7. The average accuracy is calculated by Equation 4.4, where i is 8 and j is 16 in the Cho data, and 8 and 18 in the Spellman data, respectively.

\frac{\sum_{d=i}^{j} n(G_d(m) \cap T_d(m))}{\sum_{d=i}^{j} n(G_d(m))} \qquad (4.4)
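For completeness, the two accuracy measures can be computed as in the sketch below, where clusters are assumed to be represented by hashable identifiers and G and T map each dimension to the corresponding sets G_d(m) and T_d(m).

```python
# A minimal sketch of Equations 4.3 and 4.4; cluster identifiers and the dict-of-sets
# representation are illustrative assumptions.
def accuracy_per_dim(G_d, T_d):
    """Fraction of ground-truth top clusters preserved at one dimension (Equation 4.3)."""
    return len(G_d & T_d) / len(G_d)

def average_accuracy(G, T, i, j):
    """G, T: dicts mapping dimension d -> set of cluster identifiers (Equation 4.4)."""
    hit = sum(len(G[d] & T[d]) for d in range(i, j + 1))
    total = sum(len(G[d]) for d in range(i, j + 1))
    return hit / total
```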
As expected, these results show that utilizing a larger number of cores provides higher accuracy than using a smaller number of cores. By overlaying pairs of tables (Table 4.2 & Table 4.6, and Table 4.4 & Table 4.7), we can see the relation between the accuracy and the corresponding size of the search space. In general, a larger search space gives higher accuracy. But, as shown in Table 4.2 and Table 4.6, only 5.24% of the search space already gives 96.14% accuracy when the number of cores is 500 and the number of bins is 3. This means that even though the accuracy is generally correlated with the size of the search space, it also depends on the characteristics of the dataset.
Figure 4.5 summarizes our experimental results. It illustrates the change in average accuracy (i.e., the average over the 3- to 8-bin cases) with respect to an increasing number of cores. It also shows the corresponding average percentage of search space out of the actual full search space (the second bar of each group in the bar graph); the actual full search space is the search space generated from all the objects in the dataset. Finally, it shows the corresponding average percentage of search space out of the upper bound of our algorithm (the last bar of each group in the bar graph).
In the Cho data, we achieve over 90% accuracy with only 20% of the core objects. The corresponding search space is 9.85% of the full search space and 0.49% of the upper bound. In the Spellman data, achieving over 90% accuracy requires around 35% of the cores, and the size of the corresponding search space is 14.71% of the full search space and 1.79% of the upper bound. This means that, by generating less than 20% of the actual full search space (i.e., roughly 80% less runtime than the non-density-based approach), we can achieve over 90% accuracy with our density-based searchspace-limited subspace clustering.
At this point, we should recall and discuss the size of the search spaces again. As mentioned in Section 4.1, there are two upper bounds on the number of possible subspace clusters. The first one is the upper bound of our algorithm, n(n-1)/2, which is used for calculating the percentage of search space in Figure 4.5. The other is the theoretical upper bound, which is calculated by Equation 4.1. If we consider both upper bounds, then we can see the significant effect of our algorithm with regard to efficiency. For example, as mentioned earlier, for the Cho data, only 5.24% of the search space gives 96.14% accuracy when the number of cores is 500 and the number of bins is 3. But this ratio is merely relative to the actual full search space; it drops to 1.27% relative to the upper bound of our algorithm, and, surprisingly, it drops to 0.00079% relative to the theoretical upper bound!
In summary, under our assumption about the definition of good-quality subspace clusters, our density-based subspace clustering approach identifies important subspace clusters in an efficient running time.
Figure 4.6: An illustration of results for Equi-width vs. K-means bins
4.4.3 Equi-width bins versus Homogeneity-based bins
We tested the effect of two different discretization methods (the equi-width bins approach and the homogeneity-based bins approach) on the Cho data. As shown in Figure 4.6, there are no significant differences in accuracy or scalability between the two methods. The scalability result shows a minor shifting effect, but the growth rates appear to be similar. In addition, the homogeneity-based bins approach (K-means clustering) provides slightly higher accuracy with bigger search spaces than the equi-width bins approach. This provides further evidence of the trade-off between accuracy and scalability that we showed in the last section. At this point, it is difficult to say which method is better than the other. The size of the search space is closely related to the distribution of the data. Usually, when the data in each dimension follow a Gaussian distribution with a large standard deviation, the two methods provide similar results. This is because the homogeneity-based approach clusters objects in as balanced a way as it can with respect to the sizes of the clusters; in other words, in the case of a large standard deviation, the equi-width bins approach provides a discretization output similar to that of the K-means clustering approach. We will consider the distribution of the data in future work.
Chapter 5
Future Work
At the core of our current research efforts, DBSSC plays a key role. An efficient subspace clustering algorithm promises huge benefits in diverse problem domains, and DBSSC may be a stepping stone toward a successful subspace clustering framework.
We intend to extend this work in the following three directions. First, we are currently investigating how we can improve DBSSC in several ways. Second, we plan to conduct more comprehensive experiments on diverse datasets as well as compare our approach with other competitive algorithms. Finally, in order to interpret the obtained gene clusters, external knowledge needs to be involved. Toward this end, we plan to explore a methodology to quantify the relationship between subspace clusters and known biological knowledge by utilizing the Gene Ontology [61].
5.1 Improving subspace clustering algorithm
To improve our subspace clustering algorithm itself, we intend to extend the presented framework in the following three directions.
Improving the discretization step. Discretization of each condition is the first step in DBSSC and plays a key role in improving the quality of the subspace clusters. Currently, we utilize a K-means algorithm for this step, but it limits the number of possible symbols in each condition (dimension) to a pre-defined fixed number. However, each condition can have different characteristics (e.g., the density distribution of its values, a pre-determined importance, etc.); this means the algorithm should be able to freely determine the number of symbols in each dimension. So, we plan to apply different known algorithms for this step and conduct experiments to find the most efficient method. In addition, because there is inherent information loss during discretization, we should minimize this loss. One possible way is to utilize a sliding-window technique. Usually, information loss happens at the border lines between two adjacent bins, and sliding windows that cover the border lines will be useful for minimizing single-dimensional clustering errors.
Finding concrete subspace clusters. There are still open issues in DBSSC: how can we systematically determine the bound on the number of objects in a subspace cluster, and what is the relevant minimum number of dimensions that provides meaningful clusters? Currently, a probabilistic model is used to determine the minimum number of objects in a cluster, and the minimum number of dimensions is determined experimentally. These two issues strongly affect the average running time and the quality of the results, because the threshold for the minimum number of objects in a cluster is related to the pruning step and the total number of obtained subspace clusters, and the minimum number of dimensions determines the size of the search space. We plan to investigate these issues in diverse ways.
Extending DBSSC to pattern-based subspace clustering. Our current subspace clustering algorithm utilizes value similarity for finding subspace clusters, but there is a high possibility of extending it to pattern-based clustering. Pattern-based clustering is performed by pattern similarity, and it has been widely studied for the purpose of dealing with time-series data. By adding the numeric order of symbols as another factor for the clustering, we can achieve pattern-based subspace clustering.
5.2 DBSSC into general-purpose algorithm
Besides improving our subspace clustering algorithm itself, we plan to extend our research by applying DBSSC to other application domains. DBSSC is a general-purpose subspace clustering algorithm; therefore it may be suitable for e-commerce and association rule mining by combining SCUDOT with a classification method. So, we plan to investigate whether DBSSC is suitable for domains such as association rule mining and recommendation systems.
5.3 Efficient analysis of the results
In order to interpret the obtained clusters, external knowledge needs to be involved. This depends on the problem domain, because the available external knowledge differs from domain to domain. For gene expression analysis, we plan to explore a methodology to quantify the relationship between subspace clusters and known biological knowledge by utilizing the Gene Ontology.
Chapter 6
Conclusions
In this dissertation, we presented a mining framework that is vital to intelligent information analysis. An experimental prototype has been developed, implemented, and tested to demonstrate the effectiveness of the proposed framework.
In order to analyze static data effectively, we developed a density-based clustering algorithm that works in batch mode. We tested the developed algorithm on a yeast cell cycle dataset. The proposed algorithm utilizes a mutual neighborhood graph in order to effectively determine core and outlier genes.
In addition, we presented a subspace mining framework that is vital to high-dimensional data analysis. The uniqueness of our approach is based on the observation that the maximum number of subspaces is limited by the number of objects. By transforming an object-object relation into a maximal pattern represented as a sequence of symbols, we can identify subspace clusters very efficiently by utilizing an inverted index and pruning non-interesting subspaces. As the experimental results based on gene expression data show, our method is scalable to the number of dimensions and finds interesting subspace clusters.
Furthermore, in order to improve the searchspace-limited subspace clustering, we developed an efficient subspace clustering algorithm based on k-NN density estimation and a sub-dimensional search space limited by transforming the problem. By limiting the search space to our upper bound and utilizing full-dimensional density information, our density-based searchspace-limited subspace clustering algorithm leverages the strengths of both the top-down and the bottom-up approaches. As the experimental results show, our approach generates highly accurate overlapping clusters while achieving a significant improvement in running time.
Abstract
We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, the expression levels of thousands of genes can be measured simultaneously. The availability of such large volumes of microarray data leads us to focus our attention on mining gene expression datasets. We apply a density-based approach to identify clusters in full-dimensional microarray datasets and obtain meaningful results. In general, microarray technologies provide multi-dimensional data. In particular, given that genes are often co-expressed only under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous approaches, our method is based on the observation that the number of subspace clusters is related to the number of maximal subspace clusters to which any gene pair can belong. By discretizing gene expression profiles, the similarity between two genes is transformed into a sequence of symbols that represents the maximal subspace cluster for that gene pair. This domain transformation (from genes into gene-gene relations) makes the number of possible subspace clusters dependent on the number of genes. Based on the symbolic representations of genes, we present an efficient subspace clustering algorithm that is scalable in the number of dimensions. In addition, the running time can be drastically reduced by utilizing an inverted index and pruning non-interesting subspaces. Furthermore, by incorporating the density-based approach into the above searchspace-limited subspace clustering, we develop a fast subspace clustering algorithm that finds important subspace clusters. Experimental results indicate that the proposed method efficiently identifies co-expressed gene subspace clusters in yeast cell cycle datasets.