USC Computer Science Technical Reports, no. 833 (2004)
Mining Gene Expression Datasets using
Density-based Clustering
Seokkyung Chung, Jongeun Jun, Dennis McLeod
Department of Computer Science
University of Southern California
Los Angeles, California 90089–0781, USA
[seokkyuc, jongeunj, mcleod]@usc.edu
ABSTRACT
Given the recent advancement of microarray technologies, we
present a density-based clustering approach for the purpose of
identifying co-expressed gene clusters. The underlying hypothesis
is that a set of co-expressed gene clusters can be used to reveal
a common biological function. By addressing the strengths and
limitations of previous density-based clustering approaches, we
present a novel clustering algorithm that utilizes a neighborhood
defined by k-nearest neighbors. Experimental results indicate that
the proposed method identifies biologically meaningful and
co-expressed gene clusters.
Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering
General Terms
Algorithms
Keywords
Density-based Clustering, Gene Expression Analysis
1. INTRODUCTION
With the recent advancement of DNA microarray technologies, the
expression levels of thousands of genes can be measured
simultaneously. The obtained data are usually organized as a
matrix where the columns represent genes (usually genes of the
whole genome), and the rows correspond to the samples (e.g.,
various tissues, experimental conditions, or time points). Given
this rich amount of gene expression data, it is essential to
extract hidden knowledge from this matrix.
This research has been funded in part by the Integrated Media
Systems Center, a National Science Foundation Engineering Research
Center, Cooperative Agreement No. EEC-9529152.
CIKM’04, November 8–13, 2004, Washington, DC, USA.
One of the key steps in gene expression analysis is to cluster
genes that show similar patterns. By identifying a set of gene
clusters, we can hypothesize that the genes clustered together
tend to be functionally related. Thus, gene expression clustering
may be used to identify mechanisms of gene regulation and
interaction, which can be used to understand the function of a
cell.
Since gene expression data consist of measurements across various
conditions (or time points), they are characterized by high
dimensionality, large volume, and noise. Thus, clustering
algorithms must be able to address and exploit such features of
the datasets. Recent database mining research has proposed
density-based clustering algorithms, which are relevant for
multi-dimensional noisy datasets. By addressing the limitations of
previous density-based clustering methods, we present a novel KNN
(k-nearest neighbor) density estimation clustering algorithm that
is relevant for producing co-expressed gene clusters.
In this paper, we mainly focus on time-course gene expression data
(i.e., expression levels of genes are monitored during some time
interval). In particular, we utilize the yeast cell cycle dataset
introduced in Spellman et al. [3]. However, the proposed algorithm
can be extended to other kinds of microarray datasets.
2. PROPOSED ALGORITHM
Conventional density-based clustering starts by estimating the
density of each point in order to identify core, border, and noise
points. A core point is a point whose density is greater than a
user-defined threshold. Similarly, a noise point is a point whose
density is less than a user-defined threshold. Noise points are
usually discarded in the clustering process. A non-core, non-noise
point is considered a border point. Hence, clusters can be defined
as dense regions (i.e., sets of core points), and each dense
region is separated from the others by low density regions (i.e.,
sets of border points). Figure 1 illustrates the intuition behind
this approach. As shown, for a high density gene x, the sum of
distances between x and its nearest neighbors is relatively
smaller than for a low density gene.
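The core/border/noise distinction above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say whether the two user-defined thresholds coincide, so the sketch assumes two separate values (the hypothetical names core_threshold and noise_threshold), and it assumes a density score where a larger value means a denser neighborhood.

```python
from typing import List


def classify_points(densities: List[float],
                    core_threshold: float,
                    noise_threshold: float) -> List[str]:
    """Label each point 'core', 'border', or 'noise' from its density score.

    Assumptions (not from the paper): a higher score means a denser
    neighborhood, and noise_threshold <= core_threshold.
    """
    if noise_threshold > core_threshold:
        raise ValueError("noise_threshold must not exceed core_threshold")
    labels = []
    for density in densities:
        if density > core_threshold:
            labels.append("core")      # dense enough to seed a cluster
        elif density < noise_threshold:
            labels.append("noise")     # too sparse; usually discarded
        else:
            labels.append("border")    # neither core nor noise
    return labels
```

If density is instead measured as a sum of neighbor distances (smaller means denser), the comparisons would be reversed.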
[Figure 1: Plot of top k-nearest neighbors for a high density gene
and a low density gene (when k = 30). Panels: (a) High density
gene, (b) Low density gene. Axes: time points (x) and expression
level (y).]
By incorporating the ideas of density-based clustering, our
clustering algorithm proceeds in three phases: (1) density
estimation for each gene; (2) rough cluster identification using
core genes (i.e., core points); (3) cluster refinement using
border genes (i.e., border points). Due to space limitations, we
only sketch the main idea of our algorithm. For details, refer to
Chung et al. [1].
For density estimation, the proposed algorithm mainly focuses on
KNN (k-nearest neighbor) density estimation. That is, the density
of a gene, x, is defined by the sum of distances between x and its
k-nearest neighbors. In addition, since the overall shape of a
gene expression pattern is more important than its magnitude in
gene expression datasets, we use the Pearson correlation
coefficient as the similarity metric between two genes.
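A minimal sketch of this density estimate follows. The paper specifies Pearson correlation as the similarity but does not state how similarity is turned into a distance, so the conversion 1 - r used here is an assumption, as are the function names. A brute-force neighbor search is used for clarity.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


def knn_distance_sum(profiles: List[Sequence[float]], index: int, k: int) -> float:
    """Sum of distances from profiles[index] to its k nearest neighbors.

    Distance is assumed to be 1 - Pearson correlation (an assumption,
    not taken from the paper). A smaller sum indicates a denser gene.
    """
    target = profiles[index]
    distances = sorted(
        1.0 - pearson(target, other)
        for j, other in enumerate(profiles)
        if j != index
    )
    return sum(distances[:k])
```

Because the sum of distances shrinks as the neighborhood gets denser, a gene with a small value of knn_distance_sum corresponds to a high density gene in the sense of Figure 1.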
Since a core gene has high density, it is expected to be located
well inside a cluster. Thus, instead of performing clustering on
the whole dataset, conducting clustering on the set of core genes
can produce a rough cluster structure. Since border and noise
genes are excluded in the rough cluster identification step, the
clusters are expected to be well separated from each other. Once
the skeleton of each cluster is identified, border genes are used
to refine the cluster structure by assigning them to the most
relevant cluster.
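The refinement step can be illustrated as follows. The paper does not define "most relevant" in this summary, so this sketch assumes one plausible reading: each border gene joins the cluster whose representative pattern has the highest Pearson correlation with it. The function names and the notion of a per-cluster representative pattern are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


def assign_border_genes(border_profiles: List[Sequence[float]],
                        cluster_patterns: List[Sequence[float]]) -> List[int]:
    """For each border profile, return the index of the best-matching cluster.

    "Best-matching" is assumed to mean the highest Pearson correlation
    with the cluster's representative pattern (hypothetical criterion).
    """
    assignments = []
    for profile in border_profiles:
        best = max(
            range(len(cluster_patterns)),
            key=lambda c: pearson(profile, cluster_patterns[c]),
        )
        assignments.append(best)
    return assignments
```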
For some biological processes (e.g., the cell cycle), expression
relationships may be revealed at different time points. In the
presence of such a time-shift, Pearson's correlation is limited in
capturing the relationship between two expression profiles.
Moreover, some mechanisms (e.g., negative feedback loops) can
limit the number of expressed genes based on the principle of
efficiency. Thus, the expression relationships among genes along
the same biological pathway may be only partially revealed in a
single microarray experiment. Therefore, in order to address these
two problems, the proposed clustering algorithm exploits
neighborhood-based clustering [2].
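The time-shift limitation can be seen on synthetic data. The sketch below uses made-up sinusoidal profiles (not the yeast dataset): a periodic profile and a copy delayed by a quarter period have a Pearson correlation near zero, even though they are the same pattern. Realigning the two series is shown only to confirm that the relationship exists; it is not the neighborhood-based approach of [2].

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


PERIOD = 8      # synthetic cycle length in time points (illustrative value)
N_POINTS = 16   # two full cycles
LAG = 2         # a quarter-period delay

profile = [math.sin(2 * math.pi * t / PERIOD) for t in range(N_POINTS)]
shifted = [math.sin(2 * math.pi * (t - LAG) / PERIOD) for t in range(N_POINTS)]

# Direct comparison: the delayed copy looks unrelated to the original.
r_shifted = pearson(profile, shifted)

# Comparison after undoing the delay: the two series match on the overlap.
r_realigned = pearson(profile[:-LAG], shifted[LAG:])
```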
To utilize a spatial index structure for efficient density
estimation, we conduct dimensionality reduction based on Singular
Value Decomposition (SVD) [1]. Although SVD has been utilized in
gene expression clustering research, our purpose is different: the
main goal of the previous approaches was to preprocess the data
before clustering, while our main aim is to efficiently support
similarity search in the truncated SVD space.
3. EXPERIMENTAL RESULTS
[Figure 2: Sample examples of core clusters and the corresponding
coherent expression patterns. Panels: (a) Core clusters, (b)
Coherent patterns.]
Figure 2 plots sample clusters, where the x-axis and y-axis
represent time points and expression values, respectively. Figure
2(a) shows sample core clusters, and Figure 2(b) illustrates the
corresponding coherent patterns that characterize the trend of
expression levels of genes within a cluster. A coherent pattern of
a cluster is defined by the medoid of the cluster.
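A medoid is the cluster member with the smallest total distance to all other members, so the coherent pattern is always an actual gene profile rather than an average. The sketch below is a minimal illustration; the distance 1 - r (with r the Pearson correlation) and the function names are assumptions, since the distance used for the medoid is not stated here.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


def medoid(profiles: List[Sequence[float]]) -> Sequence[float]:
    """Return the member whose total distance to the other members is smallest.

    Distance is assumed to be 1 - Pearson correlation (an assumption).
    """
    def total_distance(i: int) -> float:
        return sum(
            1.0 - pearson(profiles[i], other)
            for j, other in enumerate(profiles)
            if j != i
        )

    best_index = min(range(len(profiles)), key=total_distance)
    return profiles[best_index]
```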
The first cluster (cluster 1) is mainly composed of genes that are
involved in the assembly and arrangement of cell structures. In
particular, YBR009C, YNL030W, and YDR224C (whose biological
function corresponds to chromatin assembly/disassembly) are
classified into cluster 1. YMR307W (whose function is cell wall
organization and biogenesis) and YCL063W (whose function is
vacuole inheritance) are also classified into cluster 1. YNL339C,
YLR467W, YAL007C, and YAL014C (whose biological process is
telomerase-independent telomere maintenance) were detected as the
second cluster. In addition, YOR033C and YDR097C (whose biological
function corresponds to mismatch repair) were also classified
together.
4. CONCLUSION AND FUTURE WORK
We presented a mining framework that is vital to microarray data
analysis. An experimental prototype system has been developed,
implemented, and tested to demonstrate the effectiveness of the
proposed model. In order to identify co-expressed genes in a yeast
cell cycle dataset, we developed a clustering algorithm based on
KNN density estimation. In the future, we plan to incorporate
biological annotation into the clustering process.
5. REFERENCES
[1] S. Chung, J. Jun, and D. McLeod. Mining gene expression
datasets using density-based clustering. In USC/IMSC Technical
Report, 2004.
[2] S. Chung and D. McLeod. Dynamic topic mining from news stream
data. In Proceedings of ODBASE, 2003.
[3] P. T. Spellman et al. Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by
microarray hybridization. Molecular Biology of the Cell,
9(12):3273-3297, 1998.
Description
Seokkyung Chung, Jongeun Jun, Dennis McLeod. "Mining gene expression datasets using density-based clustering." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 833 (2004).
Asset Metadata
Creator
Chung, Seokkyung
(author),
Jun, Jongeun
(author),
McLeod, Dennis
(author)
Core Title
USC Computer Science Technical Reports, no. 833 (2004)
Alternative Title
Mining gene expression datasets using density-based clustering (
title
)
Publisher
Department of Computer Science,USC Viterbi School of Engineering, University of Southern California, 3650 McClintock Avenue, Los Angeles, California, 90089, USA
(publisher)
Tag
OAI-PMH Harvest
Format
2 pages
(extent),
technical reports
(aat)
Language
English
Unique identifier
UC16269461
Identifier
04-833 Mining Gene Expression Datasets using Density-based Clustering (filename)
Legacy Identifier
usc-cstr-04-833
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/)
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu