USC Computer Science Technical Reports, no. 833 (2004)
Mining Gene Expression Datasets using
Density-based Clustering
Seokkyung Chung, Jongeun Jun, Dennis McLeod
Department of Computer Science
University of Southern California
Los Angeles, California 90089–0781, USA
[seokkyuc, jongeunj, mcleod]
Given the recent advancement of microarray technologies, we present a density-based clustering approach for the purpose of identifying co-expressed gene clusters. The underlying hypothesis is that a set of co-expressed gene clusters can be used to reveal a common biological function. By addressing the strengths and limitations of previous density-based clustering approaches, we present a novel clustering algorithm that utilizes a neighborhood defined by k-nearest neighbors. Experimental results indicate that the proposed method identifies biologically meaningful and co-expressed gene clusters.
Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering
General Terms
Density-based Clustering, Gene Expression Analysis
With the recent advancement of DNA microarray technologies, the expression levels of thousands of genes can be measured simultaneously. The obtained data are usually organized as a matrix where the columns represent genes (usually genes of the whole genome), and the rows correspond to the samples (e.g., various tissues, experimental conditions, or time points). Given this rich amount of gene expression data, it is essential to extract hidden knowledge from this dataset.
This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-
CIKM’04, November 8–13, 2004, Washington, DC, USA.
One of the key steps in gene expression analysis is to cluster genes that show similar patterns. By identifying a set of gene clusters, we can hypothesize that the genes clustered together tend to be functionally related. Thus, gene expression clustering may be used to identify mechanisms of gene regulation and interaction, which can be used to understand the function of a cell.
Since gene expression data consist of measurements across various conditions (or time points), they are characterized by high dimensionality, large volume, and noise. Thus, clustering algorithms must be able to address and exploit such features of the datasets. Recent database mining research has proposed density-based clustering algorithms, which are relevant for multi-dimensional noisy datasets. By addressing the limitations of previous density-based clustering methods, we present a novel KNN (k-nearest neighbor) density estimation clustering algorithm that is relevant for producing co-expressed gene clusters.
In this paper, we mainly focus on time-course gene expression data (i.e., expression levels of genes are monitored during some time interval). In particular, we utilize the yeast cell cycle dataset introduced in Spellman et al. [3]. However, the proposed algorithm can be extended to other kinds of microarray datasets.
Conventional density-based clustering starts by estimating the density of each point in order to identify core, border, and noise points. A core point is a point whose density is greater than a user-defined threshold. Similarly, a noise point is a point whose density is less than a user-defined threshold. Noise points are usually discarded in the clustering process. A non-core, non-noise point is considered a border point. Hence, clusters can be defined as dense regions (i.e., sets of core points), and each dense region is separated from the others by low density regions (i.e., sets of border points). Figure 1 illustrates the intuition behind this approach. As shown, for a high density gene x, the sum of distances between x and its nearest neighbors is relatively smaller than for a low density gene.
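The labeling rule above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes each point already has a density score where larger means denser, and it assumes two separate user-defined thresholds (the hypothetical parameters core_threshold and noise_threshold), since border points exist only if the noise threshold lies below the core threshold.

```python
def classify_points(density, core_threshold, noise_threshold):
    """Label each point 'core', 'border', or 'noise' from its density score.

    Assumption: larger scores mean denser neighborhoods, and
    noise_threshold <= core_threshold (both are user-defined).
    """
    if noise_threshold > core_threshold:
        raise ValueError("noise_threshold must not exceed core_threshold")
    labels = []
    for score in density:
        if score > core_threshold:
            labels.append("core")      # dense enough to seed a cluster
        elif score < noise_threshold:
            labels.append("noise")     # too sparse; usually discarded
        else:
            labels.append("border")    # neither core nor noise
    return labels
```

For instance, with core_threshold = 0.8 and noise_threshold = 0.2, the scores 0.9, 0.5, and 0.1 are labeled core, border, and noise, respectively.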
Figure 1: Plot of top k-nearest neighbors for a high density gene and a low density gene (when k = 30). Panels: (a) high density gene; (b) low density gene. Axes: time points (x) and expression level (y).
By incorporating the ideas of density-based clustering, our clustering algorithm proceeds in three phases: (1) density estimation for each gene; (2) rough cluster identification using core genes (i.e., core points); (3) cluster refinement using border genes (i.e., border points). Due to space limitations, we only sketch the main idea of our algorithm. For details, refer to Chung et al. [1].
For density estimation, the proposed algorithm mainly relies on KNN (k-nearest neighbor) density estimation. That is, the density of a gene x is defined by the sum of the distances between x and its k nearest neighbors. In addition, since the overall shape of a gene expression pattern is more important than its magnitude in gene expression datasets, we use the Pearson correlation coefficient as the similarity metric between two genes.
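A minimal sketch of this density estimate is given below. The function names are hypothetical, the conversion from similarity to distance (1 minus the Pearson correlation) is an assumption that the text does not specify, and the brute-force neighbor search is for illustration only.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / math.sqrt(var_u * var_v)


def knn_distance_sums(profiles, k):
    """For each gene, sum the distances to its k nearest neighbors.

    Distance is taken as 1 - Pearson r (an assumption). A smaller sum
    indicates a denser neighborhood, i.e., a higher density gene.
    """
    sums = []
    for i, x in enumerate(profiles):
        dists = sorted(
            1.0 - pearson(x, y) for j, y in enumerate(profiles) if j != i
        )
        sums.append(sum(dists[:k]))
    return sums
```

Because the score is a sum of distances, genes would be ranked as core candidates by small values rather than large ones; any threshold on it has to be applied in that direction.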
Since a core gene has high density, it is expected to be located well inside a cluster. Thus, instead of performing clustering on the whole dataset, conducting clustering on the set of core genes can produce a rough cluster structure. Since border and noise genes are excluded in the rough cluster identification step, the clusters are expected to be well separated from each other. Once the skeleton of each cluster is identified, border genes are used to refine the cluster structure by assigning them to the most relevant cluster.
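The refinement step can be illustrated with the sketch below. The text does not define how relevance is measured, so the sketch assumes that the most relevant cluster is the one whose core genes have the highest mean Pearson correlation with the border gene; the function names and the dictionary-based data layout are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / math.sqrt(var_u * var_v)


def assign_border_genes(border, core_clusters):
    """Assign each border gene to the core cluster it correlates with best.

    border:        dict mapping gene id -> expression profile
    core_clusters: dict mapping cluster id -> list of core gene profiles
    Relevance is the mean correlation with a cluster's core genes
    (an assumption made for this sketch).
    """
    assignment = {}
    for gene_id, profile in border.items():
        def mean_similarity(cluster_id):
            members = core_clusters[cluster_id]
            return sum(pearson(profile, m) for m in members) / len(members)

        assignment[gene_id] = max(core_clusters, key=mean_similarity)
    return assignment
```

With one rising and one falling core cluster, a border gene with a rising profile is assigned to the former and a border gene with a falling profile to the latter.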
For some biological processes (e.g., the cell cycle), expression relationships may be revealed at different time points. In the presence of such a time-shift, Pearson's correlation is limited in capturing the relationship between two expression profiles. Moreover, some mechanisms (e.g., negative feedback loops) can limit the number of expressed genes based on the principle of efficiency. Thus, the expression relationships among genes along the same biological pathway may be only partially revealed in a single microarray experiment. Therefore, in order to address these two problems, the proposed clustering algorithm exploits neighborhood-based clustering [2].
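The effect of a time-shift on Pearson's correlation can be seen in a small synthetic example. The sketch below is only an illustration of the limitation, not part of the proposed algorithm: it uses two sinusoidal profiles (an assumption standing in for periodic cell cycle genes), one delayed by a quarter period, and compares the correlation before and after undoing the delay.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / math.sqrt(var_u * var_v)


def time_shift_demo(n=16, lag=4):
    """Correlate a periodic profile with a delayed copy of itself.

    Returns (r_direct, r_realigned): the correlation with the delayed
    profile as measured, and after rotating the delay away.
    """
    base = [math.sin(2 * math.pi * i / n) for i in range(n)]
    delayed = [math.sin(2 * math.pi * (i - lag) / n) for i in range(n)]
    r_direct = pearson(base, delayed)
    # Rotate the delayed profile back by `lag` samples to realign it.
    realigned = delayed[lag:] + delayed[:lag]
    r_realigned = pearson(base, realigned)
    return r_direct, r_realigned
```

With the default quarter-period delay, the direct correlation is essentially zero even though the two profiles have the same shape, while the realigned correlation is essentially one.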
To utilize a spatial index structure for efficient density estimation, we conduct dimensionality reduction based on Singular Value Decomposition (SVD) [1]. Although SVD has been utilized in gene expression clustering research, our purpose is different: the main goal of the previous approaches was to preprocess the data before clustering, while our main aim is to efficiently support similarity search in the truncated SVD space.
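A minimal sketch of projecting expression profiles into a truncated SVD space is shown below. It assumes NumPy is available and that the input matrix holds one row per gene; the function name and the rank parameter are hypothetical, and the spatial index itself is not shown.

```python
import numpy as np


def truncated_svd_projection(X, rank):
    """Return the coordinates of each row of X in the top-`rank` SVD space.

    X is assumed to hold one gene per row. The result has shape
    (number of genes, rank) and equals X projected onto the leading
    `rank` right singular vectors.
    """
    X = np.asarray(X, dtype=float)
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scaling the leading left singular vectors by their singular values
    # gives the same coordinates as X @ Vt[:rank].T.
    return U[:, :rank] * s[:rank]
```

The low-dimensional coordinates returned here are the kind of input a spatial index could be built on, so that nearest-neighbor queries run in the reduced space rather than over the full profiles.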
Figure 2: Examples of core clusters and the corresponding coherent expression patterns. Panels: (a) core clusters; (b) coherent patterns.
Figure 2 plots sample clusters, where the x-axis and y-axis represent time points and expression values, respectively. Figure 2(a) shows sample core clusters, and Figure 2(b) illustrates the corresponding coherent patterns that characterize the trend of expression levels of genes within a cluster. A coherent pattern of a cluster is defined by a medoid of the cluster.
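The medoid of a cluster can be computed as in the sketch below: it is the member whose total distance to all other members is smallest. The distance function used for the medoid is not stated in the text, so the sketch takes it as a parameter and defaults to Euclidean distance (math.dist, Python 3.8 or later) purely as an assumption.

```python
import math


def medoid(profiles, distance=math.dist):
    """Return the index of the profile with the smallest total distance
    to all profiles in the cluster (the cluster's medoid).

    `distance` is any function of two profiles; Euclidean distance is
    an assumed default for this sketch.
    """
    best_index, best_total = None, math.inf
    for i, candidate in enumerate(profiles):
        total = sum(distance(candidate, other) for other in profiles)
        if total < best_total:
            best_index, best_total = i, total
    return best_index
```

Unlike a mean profile, the medoid is always an actual gene in the cluster, so the coherent pattern corresponds to a real measured expression profile.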
The first cluster (cluster 1) is mainly composed of genes that are involved in the assembly and arrangement of cell structures. In particular, YBR009C, YNL030W, and YDR224C (whose biological function corresponds to chromatin assembly/disassembly) are classified into cluster 1. YMR307W (whose function is cell wall organization and biogenesis) and YCL063W (whose function is vacuole inheritance) are also classified into cluster 1. YNL339C, YLR467W, YAL007C, and YAL014C (whose biological process is telomerase-independent telomere maintenance) were detected as the second cluster. In addition, YOR033C and YDR097C (whose biological function corresponds to mismatch repair) were also classified together.
We presented a mining framework that is vital to microarray data analysis. An experimental prototype system has been developed, implemented, and tested to demonstrate the effectiveness of the proposed model. In order to identify co-expressed genes in a yeast cell cycle dataset, we developed a clustering algorithm based on KNN density estimation. In the future, we plan to incorporate biological annotation into the clustering process.
[1] S. Chung, J. Jun, and D. McLeod. Mining gene expression datasets using density-based clustering. In USC/IMSC Technical Report, 2004.
[2] S. Chung and D. McLeod. Dynamic topic mining from news stream data. In Proceedings of ODBASE, 2003.
[3] P. T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.
Seokkyung Chung, Jongeun Jun, Dennis McLeod. "Mining gene expression datasets using density-based clustering." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 833 (2004).
