CONCEPT, TOPIC, AND PATTERN DISCOVERY USING CLUSTERING
by
Seokkyung Chung
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2005
Copyright 2005 Seokkyung Chung
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 3219869
INFORMATION TO USERS
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.
®
UMI
UMI Microform 3219869
Copyright 2006 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Dedication

This dissertation is dedicated to my parents and my wife, Hyejin Kim, who have always been with me in every phase of this Ph.D. journey.
Acknowledgements

I could never have been able to finish this work without the help, support, and encouragement of several people.

First of all, I would like to express my deep thanks to my mentor, Dennis McLeod. His excellent guidance, encouragement, and flexibility throughout my Ph.D. study contributed substantially to the understanding of the complex intersection of information retrieval, data mining, and artificial intelligence that is presented in this dissertation. I feel fortunate to have worked under his supervision for my Ph.D.

I am also very grateful to the other members of my qualifying examination and dissertation committees: Kevin Knight, Cyrus Shahabi, Roger Zimmermann, and Larry Pryor.

Finally, I thank all fellow students and colleagues in the USC Semantic Information Research Lab for their ideas and comments on my research, and most importantly their friendship. I feel extremely lucky to be a member of this wonderful group.
Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Focus of the Dissertation
  1.2 Contributions
  1.3 Organization of the Dissertation

2 Background for Topic Mining
  2.1 Topic Detection and Tracking
  2.2 Document Clustering
    2.2.1 Partition-based Clustering
    2.2.2 Hierarchical Agglomerative Clustering
    2.2.3 Projective Techniques
  2.3 Intelligent News Services Systems
  2.4 Information Preprocessing for Topic Mining

3 Topic Mining for Document Streams
  3.1 Preliminary
  3.2 A Motivating Example
  3.3 A Proposed Incremental Non-hierarchical Document Clustering Algorithm using a Neighborhood Search
    3.3.1 Neighborhood search
    3.3.2 Identification of an appropriate cluster
    3.3.3 Re-clustering
  3.4 How to Extend the Non-hierarchical Clustering Algorithm into a Hierarchical Version?
  3.5 Building a Topic Ontology
  3.6 Experimental Results for Information Analysis
  3.7 Experimental Setup
  3.8 Experimental Results
    3.8.1 Parameterization
    3.8.2 Ability to identify clusters with same density, but different shapes
    3.8.3 Ability to discover clusters with different densities and shapes
    3.8.4 Event confusion

4 Density-based Gene Expression Clustering using a Mutual Neighborhood
  4.1 Preliminary
  4.2 Related Work
  4.3 Background for the Proposed Method
    4.3.1 Similarity Metric
    4.3.2 Density Estimation
    4.3.3 Challenges in Density-based Gene Expression Clustering
  4.4 Density-based Clustering using a Mutual Neighborhood
    4.4.1 Construction of Mutual Neighborhood Graph
    4.4.2 Identification of Rough Cluster Structure
    4.4.3 Cluster Expansion
  4.5 Experimental Results
    4.5.1 Experimental Setup
    4.5.2 Evaluation Metrics
    4.5.3 Comparative Algorithms
    4.5.4 Experimental Results

5 Conclusions

6 Future Work
  6.1 Topic Mining
  6.2 Investigation of Applicability of Topic Mining to Earth Science Datasets
  6.3 Crisis Management

Reference List
List of Tables

3.1 A sample illustrative example of a document × term matrix. For simplicity, each document vector is represented with boolean values instead of TF-IDF values
3.2 Notations for incremental non-hierarchical document clustering
3.3 Notations for incremental hierarchical document clustering
3.4 Sample specific terms for the clusters at level 1. Terms in regular font denote NEs. This supports the argument that NEs play a key role in defining the specific details of events
3.5 Sample specific terms for the clusters at level 2
3.6 General terms for the court trial cluster 1 in Table 3.4
3.7 Examples of selected topics and events
3.8 Top 15 high term frequency words in the Colorado wildfire and Arizona wildfire events. Num represents the average number of term occurrences per document in each event (without document length normalization). Terms in italic font carry event-specific information for each wildfire event
4.1 Summary of symbols and corresponding meanings
4.2 A comparison between the proposed method and other approaches on Cho's data
4.3 Proportion of biologically characterized genes in meaningful clusters versus those in [26]
List of Figures

2.1 The incremental document clustering algorithm in TDT
3.1 Overview of the proposed framework
3.2 The incremental non-hierarchical document clustering algorithm
3.3 Illustration of the re-clustering phase
3.4 Illustration of ε's sensitivity to clustering results
3.5 Illustration of non-spherical document clusters
3.6 Comparison of the clustering algorithms on datasets-1. Datasets-1 consists of five different datasets where each cluster has approximately the same density. The values of precision and recall shown in this table are obtained by averaging the accuracy of the algorithm on each dataset
3.7 Comparison of the accuracy of the clustering algorithms at level 1 on datasets-2. Datasets-2 consists of ten different datasets where each cluster has an arbitrary number of documents. The values of precision and recall shown in this table are obtained by averaging the accuracy of the algorithm on each dataset
4.1 Correlation between k-NN density (when k=30) and ε-density (when ε=0.8) for Cho's data, which will be discussed in Chapter 4.5.1. The horizontal axis represents k-NN density and the vertical axis represents ε-density
4.2 Mean expression patterns for Cho's data obtained by the proposed method. The horizontal axis represents time points between 0-80 and 100-160 minutes. The vertical axis represents the normalized expression level. The error bar at each time point delineates the standard deviation
6.1 MIESIS system architecture
Abstract

In this dissertation, we present a mining framework to extract useful patterns, concepts, and topics from multi-dimensional datasets using clustering. In general, there are two kinds of datasets: incremental data, in which data items are inserted over time, and static data, in which there is no incremental insertion. Depending on the nature of the data, relevant data mining algorithms should be developed. Thus, this dissertation is composed of two parts: incremental clustering for incremental data, and batch clustering for static data. For incremental data, we target news streams, and for static data, we target gene expression data.

In the first part, we propose a mining framework that supports the identification of useful patterns based on incremental data clustering. Given the popularity of Web news services, we focus our attention on news stream mining. A key challenge in news repository management is the high rate of document insertion. To address this problem, we present an incremental hierarchical document clustering algorithm using a neighborhood search. The novelty of the proposed algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally. In addition, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters, and the topic ontology provides interpretations of news topics at different levels of abstraction.
In the second part, we focus our attention on mining a yeast cell cycle dataset. In molecular biology, a set of co-expressed genes tends to share a common biological function. Thus, it is essential to develop an effective clustering algorithm to identify sets of co-expressed genes. Toward this end, we propose genome-wide expression clustering based on a density-based approach. By addressing the strengths and limitations of previous density-based clustering approaches, we present a novel density clustering algorithm that utilizes a neighborhood defined by k-nearest mutual neighbors. Experimental results indicate that the proposed method successfully identifies co-expressed and biologically meaningful gene clusters.
Chapter 1
Introduction
With the rapid progress in data acquisition, sensing, and storage technologies, the size of available datasets is increasing at an overwhelming rate. Given this massive amount of data, data mining, which transforms raw datasets into useful higher-level knowledge, has become essential for analyzing and understanding the information. The obtained knowledge can then be used for diverse applications ranging from business and engineering design to scientific analysis and modelling.
Clustering is one of the fundamental techniques for exploring the underlying structure of data. The main purpose of clustering is to automatically classify objects into groups such that data items within the same cluster are similar and data items from different clusters are dissimilar.

The huge amount of available data is a major driving force making clustering one of the most attractive research areas. Since people do not have enough time to manually analyze whole datasets, cluster analysis can assist human users in exploring the structure of the data. Moreover, cluster analysis can uncover non-intuitive relationships that human users would overlook. Therefore, clustering has been widely investigated for many decades in many areas, including artificial intelligence, data mining, information retrieval, machine learning, statistics, and pattern recognition.
Clustering can provide potential benefits in many applications, including Earth science, molecular biology, electronic commerce, intelligence, and the Web. In Earth science, rapid growth in remote sensing systems has made it possible to obtain data about nearly every part of our globe, including the solid earth, ocean, and atmosphere. In this circumstance, the size and spatio-temporal nature of the data raise significant challenges for earth scientists. To address these challenges, clustering can be utilized to sift through this enormous amount of raw data. Cluster analysis can then identify regions of Earth where neighboring points in each region have similar short-term and long-term climate patterns. This knowledge can be used to find correlations of climate patterns over different regions (e.g., teleconnections).
In computational molecular biology, one of the key demands is the detection of groups of genes that manifest similar expression patterns. Clustering co-expressed genes into groups can assist biologists in understanding gene functions in the biological processes of a cell.
In electronic commerce, customizing a system to a user's preferences is extremely important. For instance, many online sellers recommend products to their customers based on purchasing patterns. Since a massive amount of data can be collected from Web click streams, the data contains valuable knowledge of customer behavior. In this case, a clustering analysis module can be effectively used for modelling customers' behavior and for profiling users (subject to privacy concerns).
Finally, data in intelligence applications (e.g., homeland security or crisis management) arrives as high-speed information streams, and this decision-critical information must be processed very quickly. Hence, the algorithm must be able to perform fast updates of the existing mining structure. In this case, incremental clustering can process incoming data online, and further identify potentially anomalous events for the purpose of intrusion detection or crisis alerts. Therefore, it is worthwhile to pursue clustering in many applications.
The rest of this chapter is organized as follows. Chapter 1.1 discusses the focus of this dissertation, Chapter 1.2 gives a brief summary of its contributions, and Chapter 1.3 outlines the organization of the rest of the dissertation.
1.1 Focus of the Dissertation
In general, there are two kinds of datasets: incremental data and static data. Incremental data is data in which items are added into a database over time; news streams belong to this category. To deal with this type of data, the clustering algorithm must be incremental. However, not all datasets are incremental. In some cases, there is no continuous insertion into the data; gene expression data falls into this category, and batch clustering is sufficient. Thus, depending on the nature of the data, we need to develop relevant clustering algorithms. Therefore, this dissertation is composed of two parts: incremental clustering for news streams, and batch clustering for gene expression data.
For news stream mining, we present topic mining. Topic mining is analogous to a search engine in that both deal with documents. However, while search users must provide keywords to initiate the search process and locate the information they need, identifying users' information needs is not the primary purpose of topic mining. Rather, topic mining dynamically extracts valuable and unknown knowledge from news streams. Therefore, topic mining is best suited for "discovery" purposes, i.e., learning and discovering knowledge that was previously unknown.
Coupled with world-wide efforts in bioinformatics, recent advances in high-throughput technologies have resulted in a vast amount of life science data. For example, with DNA microarray technologies, the expression levels of thousands of genes can be measured simultaneously [40]. The obtained data are usually organized as a matrix where the columns represent genes (usually genes of the whole genome), and the rows correspond to the samples (e.g., various tissues, experimental conditions, or time points). Given this rich amount of gene expression data, the goal of microarray analysis is to extract hidden knowledge (e.g., similarity or dependency between genes) from this matrix. Due to the static nature of the data, effective batch clustering needs to be developed. Toward this end, we focus our attention on a density-based clustering approach for the purpose of co-expressed gene cluster identification.
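As a small illustration of the matrix layout described above, the following sketch (with invented numbers, not data from the dissertation) measures co-expression as the Pearson correlation between gene columns:

```python
import numpy as np

# Toy expression matrix: rows are samples (e.g., time points),
# columns are genes -- the layout described in the text.
expr = np.array([[1.0, 1.1, 5.0],
                 [2.0, 2.1, 4.0],
                 [3.0, 2.9, 3.0],
                 [4.0, 4.2, 2.0]])

# Pearson correlation between genes (columns): co-expressed genes
# rise and fall together across the samples.
gene_corr = np.corrcoef(expr, rowvar=False)
```

Here genes 1 and 2 rise together across the samples (high positive correlation), while gene 3 falls as they rise (negative correlation), so a similarity-based clustering would group the first two.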
1.2 Contributions
The goal of this dissertation is to present clustering algorithms for high-dimensional and noisy data. To build a novel paradigm for cluster analysis, this dissertation utilizes ideas from multiple disciplines, such as pattern recognition, machine learning, statistics, data mining, and graph theory. The main contribution of this dissertation is to present two types of algorithms to deal with incremental as well as static datasets. In the following, we describe the specific contributions of both frameworks.
• Efficient incremental hierarchical news document clustering. Since several hundred news stories are published every day at a single Web news site, to cope with such dynamic environments we must provide efficient incremental data mining algorithms. Despite the huge body of research on document clustering, little work has been conducted in the context of incremental hierarchical news document clustering. Our clustering algorithm, based on a neighborhood search, has several key advantages, including scalability with high dimensionality, the capability to discover clusters of different sizes, and the ability to provide succinct descriptions of clusters.
• Topic ontology learning from a news stream. In order to achieve rich semantic information retrieval, an ontology-based approach can be employed. However, as discussed in Agirre et al. [85], one of the main problems with concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between court-attorney, kidnap-police, etc. Although there exist different types of term association relationships in WordNet [99], such as "Bush versus President of the US" as synonyms, or "G.W. Bush versus R. Reagan" as coordinate terms, these types of relationships are limited in addressing topical relationships. Thus, concept-based ontologies have a limitation in supporting topical search. For example, consider the Sports domain ontology that we developed in our previous work [75, 76, 77]. In this ontology, "Kobe Bryant", an NBA basketball player, is related to terms/concepts in the Sports domain. However, for the purpose of query expansion, "Kobe Bryant" needs to be connected with a "court trial" concept if a user has "Kobe Bryant court trial" in mind. Therefore, it is essential to provide explicit links between topically related concepts/terms. To address this issue, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. The obtained topic ontology can provide interpretations of news topics at different levels of abstraction.
• Density-based clustering using a mutual neighborhood. We present a clustering algorithm for gene expression data. The main distinctions of the algorithm are threefold. First, it is robust to outliers by taking a density-based approach. Second, it can identify biologically meaningful clusters with a high detection ratio. Finally, it provides a cluster structure with reasonable homogeneity and separation. In addition, we establish a non-intuitive relationship between k-NN density and connected components in the mutual neighborhood graph.
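A minimal sketch of the mutual-neighborhood idea in this bullet, assuming Euclidean distance and a plain depth-first search over the graph (both illustrative choices, not the dissertation's exact construction):

```python
import numpy as np

def mutual_knn_components(X, k=2):
    """Connect two points only when each is among the other's k nearest
    neighbours (a mutual k-NN graph), then return the connected
    components as a rough cluster structure."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    knn = [set(np.argsort(d[i])[:k]) for i in range(n)]
    # Keep only mutual edges: i-j survives iff j in knn(i) and i in knn(j).
    adj = [[j for j in knn[i] if i in knn[j]] for i in range(n)]
    # Depth-first search to label connected components.
    comp, labels = 0, [-1] * n
    for s in range(n):
        if labels[s] != -1:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if labels[u] != -1:
                continue
            labels[u] = comp
            stack.extend(v for v in adj[u] if labels[v] == -1)
        comp += 1
    return labels
```

The mutual requirement is what gives the robustness to outliers mentioned above: an isolated point may nominate distant neighbours, but they rarely nominate it back, so no edge is formed.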
1.3 Organization of the Dissertation
The remainder of this dissertation is organized as follows. In Chapter 2, we present background knowledge for topic mining. Chapter 3 discusses the main content of topic mining. In Chapter 4, we propose density-based gene expression clustering using a mutual neighborhood. We provide the conclusions of this dissertation in Chapter 5, and present directions for future work in Chapter 6.
Chapter 2

Background for Topic Mining

The research areas most relevant to our work are Topic Detection and Tracking (TDT) and document clustering. Chapter 2.1 presents a brief overview of TDT work. In Chapter 2.2, a survey of previous document clustering work is provided. Finally, Chapter 2.3 introduces previous work on intelligent news services, which utilizes document clustering and TDT.
2.1 Topic Detection and Tracking
Over the past six years, the information retrieval community has developed a new research area called Topic Detection and Tracking (TDT) [3, 4, 5, 104, 21, 134, 135]. The main goal of TDT is to detect the occurrence of a novel event in a stream of news stories, and to track known events. In particular, there are three major components in TDT.

1. Story segmentation. It segments a news stream (e.g., including transcribed speech) into topically cohesive stories. Since online Web news (in HTML format) is supplied in segmented form, this task only applies to audio or TV news.
2. First Story Detection (FSD). It identifies whether a new document belongs to an existing topic or a new topic.

3. Topic tracking. It tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed before. It can also be asked to monitor the news stream for further stories on the same topic.
In Allan et al. [3], the notion of an event was first defined: an event is "some unique thing that happens at some point in time". Hence, an event is different from a topic. For example, "airplane crash" is a topic, while "Chinese airplane crash in Korea in April 2002" is an event. Thus, there exists an M-to-1 mapping between events and topics (i.e., multiple events can be on the same topic). Note that it is important to identify events as well as topics. Although a user may not be interested in the flood topic in general, she may be interested in documents about a flood event in her home town. Thus, a news recommendation system must be able to distinguish different events within the same topic.
Yang et al. introduced an important property of news events, referred to as temporal locality [134]. That is, news articles discussing the same event tend to be temporally proximate. In addition, most events (e.g., flood, earthquake, wildfire, kidnapping) have a short duration (e.g., one week to one month). They exploited these heuristics when computing the similarity between two news articles.
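One simple way to fold the temporal-locality heuristic into a similarity function is a time-decay factor; the exponential form and the two-week half-life below are illustrative assumptions, not the exact formulation of Yang et al.:

```python
def time_decayed_similarity(content_sim, t1_days, t2_days, half_life=14.0):
    """Down-weight the content similarity of two articles by their time
    gap: stories far apart in time are unlikely to describe the same
    event (temporal locality). Exponential decay with a two-week
    half-life is an illustrative choice."""
    gap = abs(t1_days - t2_days)
    return content_sim * 0.5 ** (gap / half_life)
```

A same-day pair keeps its full content similarity, while a pair published two weeks apart has it halved, making cross-event confusion between temporally distant stories less likely.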
The most popular method in TDT is to use a simple incremental clustering algorithm, shown in Figure 2.1. Our work starts by addressing the limitations of this algorithm.
1. Initially, only one news article is available, and it forms a singleton cluster.
2. For an incoming document (d*), we compute the similarity between d* and the pre-generated clusters. The similarity is computed by the distance between d* and the representative of each cluster.
3. Select the cluster (Ci) that has the maximum proximity with d*.
4. If the similarity between d* and Ci exceeds the pre-defined threshold, then all documents in Ci are considered related stories to d* (topic tracking), and d* is assigned to Ci. Otherwise, d* is considered a novel story (first story detection), and a new cluster for d* is created.
5. Repeat steps 2-4 whenever a new document appears in the stream.

Figure 2.1: The incremental document clustering algorithm in TDT
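The steps in Figure 2.1 can be sketched as follows; the cosine similarity, the centroid as cluster representative, and the threshold value are illustrative assumptions rather than the exact choices of any particular TDT system:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two term-weight vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def incremental_cluster(docs, threshold=0.5):
    """Single-pass TDT-style clustering: each cluster is represented by
    the centroid of its member vectors (an illustrative choice)."""
    clusters = []  # each cluster: {"members": [...], "centroid": vector}
    for d in docs:
        if not clusters:
            clusters.append({"members": [d], "centroid": d.copy()})
            continue
        # Steps 2-3: find the closest pre-generated cluster.
        sims = [cosine(d, c["centroid"]) for c in clusters]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            # Topic tracking: assign d to the existing cluster.
            c = clusters[best]
            c["members"].append(d)
            c["centroid"] = np.mean(c["members"], axis=0)
        else:
            # First story detection: d starts a new cluster.
            clusters.append({"members": [d], "centroid": d.copy()})
    return clusters
```

Because each document is compared only against cluster representatives and never revisited, the pass is fast but, as discussed next, its clustering quality is limited, which is the starting point of our work.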
2.2 Document Clustering

In this chapter, we classify the widely used document clustering algorithms into two categories (partition-based clustering and hierarchical clustering), and provide a concise overview of each.
2.2.1 Partition-based Clustering
Partition-based clustering decomposes a collection of documents into a partition that is optimal with respect to some pre-defined function. Typical methods in this category include center-based clustering, the Gaussian Mixture Model, etc.
Center-based algorithms identify the clusters by partitioning the entire dataset into a pre-determined number of clusters (e.g., K-means clustering), or an automatically derived number of clusters (e.g., X-means clustering) [19, 65, 35, 46, 81, 102, 106].
The most popular and best understood clustering algorithm is K-means clustering [35]. The K-means algorithm is a simple but powerful iterative clustering method that partitions a dataset into K disjoint clusters, where K must be determined beforehand. The idea of the algorithm is to assign points to clusters such that the sum of the squared distances of points to the center of their assigned cluster is minimized.
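The iterative scheme just described can be sketched as Lloyd's algorithm; the random initialization, the iteration cap, and the optional explicit seed centres are illustrative choices, not details from the dissertation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Lloyd's algorithm for the K-means objective: alternate between
    assigning each point to its nearest centre and recomputing each
    centre as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    if init is None:
        centres = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centres = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: each centre moves to the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres
```

Both steps can only decrease the objective, so the loop converges, although, as noted below, only to a local optimum that depends on the initial seeds.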
While the K-means clustering approach works in a metric space, the medoid-based method works in a similarity space [65, 102]. It uses medoids (representative sample objects) instead of means (i.e., the centers of clusters), such that the sum of the distances of points to their closest medoid is minimized.
Although center-based clustering algorithms have been widely used in document clustering, they have at least five serious drawbacks. First, in many center-based clustering algorithms, the number of clusters (K) needs to be determined beforehand. Second, the algorithm is sensitive to the initial seed selection; depending on the initial points, it is susceptible to a local optimum. Third, it can model only a spherical (K-means) or ellipsoidal (K-medoid) shape of clusters; thus, non-convex cluster shapes cannot be modeled in center-based clustering. Fourth, it is sensitive to outliers, since a small number of outliers can substantially influence the mean value. Finally, due to the iterative scheme used to produce clustering results, it is not suitable for incremental datasets.
2.2.2 Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering (HAC) finds clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met [35, 53, 74, 54, 141]. Consequently, its result is in the form of a tree, referred to as a dendrogram. A dendrogram is represented as a tree with numeric levels associated with its branches.

The main advantage of HAC lies in its ability to provide a view of the data at multiple levels of abstraction. However, since HAC builds a dendrogram, a user must determine where to cut the dendrogram to produce actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Moreover, the computational complexity of HAC is higher than that of partition-based clustering. In partition-based clustering, the computational complexity is O(nKI), where n is the number of documents, K is the number of clusters, and I is the number of iterations. In contrast, HAC takes O(n^3) if pairwise similarities between clusters are recomputed when two clusters are merged. However, the complexity can be reduced to O(n^2 log n) by utilizing a priority queue [141].
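The merge loop and its cost can be made concrete with a deliberately naive average-linkage sketch (the function names are ours; this is the cubic variant, recomputing all pairwise cluster similarities after every merge rather than maintaining a priority queue):

```python
def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) ** 0.5 * sum(b * b for b in y) ** 0.5)

def hac(vectors, sim):
    """Naive agglomerative clustering: start from singletons and repeatedly
    merge the most similar pair under average linkage, recording the merge
    level and the merged size -- this trace is the dendrogram.  Recomputing
    every pairwise cluster similarity after each merge is what yields the
    O(n^3) cost; keeping the candidate pairs in a priority queue instead
    brings it down to O(n^2 log n)."""
    clusters = [[v] for v in vectors]
    dendrogram = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise similarity between the two clusters.
                s = sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                s /= len(clusters[i]) * len(clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        dendrogram.append((s, len(merged)))
    return dendrogram

levels = hac([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)], cos)
```

The returned list of (similarity level, merged size) pairs is exactly the information a user inspects when deciding where to cut the dendrogram.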
2.2.3 Projective Techniques
In document clustering, since many different terms can be used to express a similar meaning, the vector space model contains highly correlated evidence. Thus, the text data can be projected onto a small number of dimensions corresponding to principal components, and clustering can be performed in the projected subspace. Singular Value Decomposition (SVD) has been employed for this purpose. The main benefit of SVD is to
decompose a matrix into three component matrices, exposing many interesting properties of the original one. It has been widely used in time-series databases [79], text information retrieval [16, 36, 37, 38], and multimedia information retrieval [105].
The basic intuition behind SVD is to examine the entire dataset and rotate the original axes to maximize the variance along the first few dimensions. Hence, we can achieve a dimensionality reduction effect by keeping only the first few dimensions while losing the least information. The following theorem provides the mathematical background of SVD [51].
Theorem 2.2.1: Given an N × M matrix X, we can decompose X as follows:

X = U Σ V^T    (2.1)

where U is a column-orthonormal N × r matrix, r is the rank of X, Σ is a diagonal r × r matrix with the singular values of X, and V is a column-orthonormal M × r matrix.

Proof: Refer to [51].
Note that the rank of a matrix is defined as the number of linearly independent columns. In addition, column orthonormality means that the columns are mutually orthogonal and have length 1. The columns of U and V are referred to as the left and right singular vectors, respectively. For detailed definitions of these terms, refer to [51].
Without loss of generality, we can assume that the components of Σ (the singular values of X) are arranged in decreasing order. The beauty of SVD lies in the fact that we can reduce the number of dimensions by discarding the insignificant dimensions (i.e., the smallest singular values). Hence, X_k can be obtained by keeping the first k singular values and
discarding the remaining r − k singular values and the corresponding left and right singular vectors of X. We denote the reduced versions as X_k, U_k, Σ_k, and V_k.
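The decomposition and the rank-k truncation can be checked with NumPy (the toy matrix and the choice k = 2 are ours for illustration; np.linalg.svd returns the singular values already sorted in decreasing order):

```python
import numpy as np

# Toy 4 x 3 document-by-term style matrix.
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [1., 1., 1.]])

# Thin SVD: X = U @ diag(s) @ Vt, with the singular values in s
# sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k truncation X_k: keep the k largest singular values and the
# corresponding singular vectors.  By the Eckart-Young theorem this is
# the best rank-k approximation of X in the Frobenius norm, and its
# error equals the discarded singular value(s).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```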
In Latent Semantic Indexing (LSI), the term × document matrix is modeled by SVD. While LSI was originally used for improved query processing in information retrieval [16, 36, 37, 38], the basic idea can be employed for clustering as well. Thus, we can employ SVD to reduce the number of dimensions as a preprocessing step, and then use partition-based or hierarchical clustering for the document clustering.

Recent studies showed that random subspace projections perform very well for high-dimensional data, with accuracy close to that of the optimal projection given by the SVD [18, 116].
2.3 Intelligent News Services Systems
One of the most successful intelligent news services is NewsBlaster [96]. The basic idea of NewsBlaster is to group articles on the same story using clustering, and to present each story using multi-document summarization. Thus, the main goal of NewsBlaster is similar to ours in that both aim to provide intelligent news analysis/delivery tools. However, the underlying methodology is different. For example, with respect to clustering, NewsBlaster is based on the clustering algorithm of Hatzivassiloglou et al. [61]. The main contribution of their work is to augment the document representation using linguistic features. However, rather than developing their own clustering algorithm, they used conventional HAC, which has the drawbacks discussed in Chapter 2.2.
Recent attempts present other intelligent news services such as NewsInEssence [108, 109] and QCS (Query, Cluster, Summarize) [39]. Both services take an approach similar to NewsBlaster in that they separate the retrieved documents into topic clusters and create a single summary for each topic cluster. However, their main focus does not lie in developing a novel clustering algorithm. For example, QCS utilizes generalized spherical K-means clustering, whose limitations have been addressed in Chapter 2.2.
Therefore, it is worthwhile to develop a sophisticated document clustering algorithm that can overcome the drawbacks of previous document clustering work. In particular, the developed algorithm must address the special requirements of news clustering, such as the high rate of document insertion and the ability to identify event-level clusters as well as topic-level clusters.
2.4 Information Preprocessing for Topic Mining
The information preprocessing step extracts meaningful information from unstructured
text data and transforms it into structured knowledge. As shown in Figure 3.1, this step
is composed of the following standard IR tools.
• HTML preprocessing. Since downloaded news articles are in HTML format, we
remove irrelevant HTML tags for each article and extract meaningful information.
• Tokenization. Its main task is to identify the boundaries of the terms.
• Stemming. There can be different forms of the same term (e.g., students and student, go and went). These different forms need to be converted to their roots. Toward this end, instead of solely relying on the Porter stemmer [107],
in order to deal with irregular plurals/tenses, we combine the Porter stemmer with a lexical database [98].
• Stopword removal. Stopwords are terms that occur frequently in text but do not carry useful information. For example, have, did, and get are not meaningful. Removing such stopwords provides a dimensionality reduction effect. We employ the stopword list that was used in the Smart project [114].
After preprocessing, a document is represented as a vector in an n-dimensional vector space [114]. The simplest way to do this is to employ the Bag-Of-Words (BOW) approach. That is, all content-bearing terms in the document are kept, and any structure of the text or term sequence is ignored. Thus, each term is treated as a feature, and each document is represented as a vector of weighted term frequencies in this feature space.
There are several ways to determine the weight of a term in a document. However, most methods are based on the following two heuristics.

• Important terms occur more frequently within a document than unimportant terms do.

• The more times a term occurs throughout all documents, the weaker its discriminating power becomes.
The term frequency (TF) is based on the first heuristic. In addition, TF can be normalized to reflect different document lengths. Let freq_ij be the number of occurrences of t_i in a
document j, and let l_j be the length of document j. Then, the term frequency (tf_ij) of t_i in document j is defined as follows:

tf_ij = freq_ij / l_j    (2.2)
The document frequency (DF) of a term (the percentage of documents that contain the term) is based on the second heuristic. Combining TF and DF yields the TF-IDF ranking scheme, which is defined as follows:

w_ij = tf_ij × log(n / n_i)    (2.3)

where w_ij is the weight of t_i in document j, n is the total number of documents in the collection, and n_i is the number of documents in which t_i occurs at least once.
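Equations (2.2) and (2.3) combine as follows; this is a minimal sketch on toy token lists (the natural-log base and the unsmoothed IDF are our choices here, so a term occurring in every document receives weight 0):

```python
import math

def tfidf(docs):
    """TF-IDF weights following Equations (2.2)-(2.3):
    tf_ij = freq_ij / l_j and w_ij = tf_ij * log(n / n_i)."""
    n = len(docs)
    # n_i: number of documents containing term t_i at least once.
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        l_j = len(doc)  # document length, for TF normalization
        w = {}
        for t in doc:
            w[t] = w.get(t, 0) + 1
        for t in w:
            w[t] = (w[t] / l_j) * math.log(n / df[t])
        weights.append(w)
    return weights

docs = [["kidnap", "child", "police"],
        ["kidnap", "child", "suspect"],
        ["boy", "return", "home"]]
w = tfidf(docs)
```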
The above ranking scheme is referred to as static TF-IDF since it is based on a static document collection. However, since documents are inserted incrementally, the IDF values are initialized using a sufficient amount of documents (i.e., the document frequency is generated from a training corpus). Thereafter, the IDF is incrementally updated as subsequent documents are processed. In particular, we employ the incremental update of IDF values proposed by Yang et al. [134].
Finally, to measure the closeness between two documents, we use the Cosine metric, which measures the similarity of two vectors according to the angle between them [114]. Thus,
vectors pointing in similar directions are considered as representing similar concepts. The cosine of the angle between two m-dimensional vectors x and y is defined by

Similarity(x, y) = Cosine(x, y) = (x · y) / (||x|| ||y||)    (2.4)
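Equation (2.4) is a direct transcription in code (the function name is ours):

```python
def cosine(x, y):
    """Cosine of the angle between two equal-length vectors, Equation (2.4):
    the dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)
```

Identical directions give 1, orthogonal vectors give 0, which is why the metric is insensitive to document length once vectors are TF-IDF weighted.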
Chapter 3
Topic Mining for Document Streams
3.1 Preliminary
With the rapid growth of the World Wide Web, Internet users are now experiencing overwhelming quantities of online information. Since manually analyzing the data is nearly impossible, the analysis must be performed by automatic data mining techniques to fulfill users' information needs quickly.

On most Web pages, vast amounts of useful knowledge are embedded in text. Given such large text datasets, mining tools that organize them into structured knowledge would enhance efficient document access. This facilitates information search and, at the same time, provides an efficient framework for document repository management as the number of documents becomes extremely large.
Given that the Web has become a vehicle for the distribution of information, many
news organizations are providing newswire services through the Internet. Given this
popularity of Web news services, we have focused our attention on mining patterns from news streams.1
The simplest document access method within Web news services is keyword-based retrieval. Although this method seems effective, it has at least three drawbacks. First, if a user chooses irrelevant keywords (due to broad and vague information needs or unfamiliarity with the domain of interest), retrieval accuracy will be degraded. Second, since keyword-based retrieval relies on the syntactic properties of information (e.g., keyword counting),2 the semantic gap cannot be overcome. Third, only expected information can be retrieved, since the specified keywords are generated from the user's knowledge space. Thus, if users are unaware of an airplane crash that occurred yesterday, they cannot issue a query about that accident even though they might be interested.
The first two drawbacks stated above have been addressed by query expansion based on domain-independent ontologies [130]. However, it is well known that this approach leads to a degradation of precision. That is, given that the terms introduced by term expansion may have more than one meaning, using additional terms can improve recall but decrease precision. Exploiting a manually developed ontology with a controlled vocabulary is helpful in this situation [77, 75, 76]. However, although ontology-authoring tools have been developed over the past decades, manually constructing ontologies whenever new domains are encountered is an error-prone and time-consuming process. Therefore,
1In this dissertation, we are concerned with (news) articles, which are also referred to as documents.
2Like Latent Semantic Indexing (LSI) [16], the vector space model based on keyword counting can be augmented with semantics by combining other methods (e.g., Singular Value Decomposition). However, keyword-based retrieval in this paper refers to the method relying only on simple keyword counting.
integration of knowledge acquisition with data mining, which is referred to as ontology
learning, becomes a must [91, 20, 33].
In this paper, we propose a mining framework that supports the identification of meaningful patterns (e.g., topical relations, topics, and events that are instances of topics) from news stream data. To build a novel framework for intelligent news database management and navigation, we utilize techniques from information retrieval, data mining, machine learning, and natural language processing.
To facilitate information navigation and search on a news database, we first identify
three key problems.
1. Vague information needs. Sometimes, defining keywords for a search is not an easy task, especially when a user has vague information needs. Thus, a reasonable starting point should be provided to assist the user.
2. Lack of topical relations in concept-based ontologies. In order to achieve rich semantic information retrieval, an ontology-based approach can be employed. However, as discussed in Agirre et al. [85], one of the main problems with concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between court-attorney, kidnap-police, etc. Thus, concept-based ontologies are limited in their support for topical search. For example, consider the Sports domain ontology that we developed in our previous work [77, 75, 76]. In this ontology, "Kobe Bryant", who is an NBA basketball player, is related with terms/concepts in the Sports domain. However, for the purpose of query expansion, "Kobe Bryant" needs to be connected with a "court trial" concept if a
user has "Kobe Bryant court trial" in mind. Therefore, it is essential to provide explicit links between topically related concepts/terms.
3. High rate of document insertion. As several hundred news articles are published every day at a single Web news site, triggering the whole mining process whenever a document is inserted into the database is computationally impractical. To cope with such a dynamic environment, efficient incremental data mining tools need to be developed.
The first of the three problems can be approached using clustering. A collection of documents is easy to skim if similar articles are grouped together. If the news articles are hierarchically classified according to their topics, then a query can be formulated while a user navigates the cluster hierarchy. Moreover, clustering can be used to identify and deal with near-duplicate articles. That is, when news feeds repeat stories with minor changes from hour to hour, presenting only the most recent articles is probably sufficient.
To remedy the second problem, we present a topic ontology, which is defined as a collection of concepts and relations. In a topic ontology, a concept is defined as a set of terms that characterize a topic. We define two generic kinds of relations: generalization and specialization. The former can be used when a query is generalized to increase recall or broaden the search. On the other hand, the latter is useful when refining a query. For example, when a user is interested in someone's court trial but cannot remember the name of the person, specialization can be used to narrow down the search.
To address the third problem, we propose a sophisticated incremental hierarchical
document clustering algorithm using a neighborhood search. The novelty of the proposed
[Figure: the framework pipeline. Information gathering (a Web crawler sent to Web news services) feeds information preprocessing (HTML parsing, tokenization, stemming, stopword removal), followed by information analysis ((incremental) hierarchical document clustering and ontology learning, producing a topic ontology), all connected through a database server and aided by a general ontology; an information delivery component on the WWW supports cluster visualization, term suggestion, keyword-based retrieval, and topic detection/tracking.]
Figure 3.1: Overview of a proposed framework
algorithm is its ability to identify news event clusters as well as news topic clusters while reducing the amount of computation by maintaining the cluster structure incrementally. Learning topic ontologies can then be performed on the obtained document hierarchy.
Figure 3.1 illustrates the main parts of the proposed framework. In the information gathering stage, a Web crawler retrieves a set of news documents from a news Web site (e.g., CNN). Developing an intelligent Web crawler is a separate research area and not our main focus; hence, we implement a simple Web spider, which downloads news articles from a news Web site on a daily basis. The retrieved documents are processed by data mining tools to produce useful higher-level knowledge (e.g., a document hierarchy, a topic ontology, etc.), which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to present an answer in response to a user request.
     kidnap abduct child boy police search missing investigate return home
d1     1      0      1    0    1      1      0         1         0     0
d2     1      1      1    1    1      0      1         1         0     0
d3     0      1      0    1    0      0      1         0         1     1

Table 3.1: An illustrative example of a document × term matrix. For simplicity, each document vector is represented with boolean values instead of TF-IDF values
The main contributions of our work are twofold. First, despite the huge body of research on document clustering [94, 81, 61, 89, 141], little work has been conducted in the context of incremental hierarchical news document clustering. To address the problem of frequent document insertions into a database, we have developed an incremental hierarchical clustering algorithm using a neighborhood search. Since the algorithm produces a document cluster hierarchy, it can identify event-level clusters as well as topic-level clusters. Second, to address the lack of topical relations in concept-based ontologies, we propose a topic ontology learning framework, which can interpret news topics at multiple levels of abstraction.
The remainder of this chapter is organized as follows. Chapter 3.2 illustrates a motivating example for the proposed incremental clustering algorithm. In Chapter 3.3, a non-hierarchical incremental document clustering algorithm using a neighborhood search is presented. Chapter 3.4 explains how to extend the algorithm into a hierarchical version. Chapter 3.5 shows how to build a topic ontology based on the obtained document hierarchy. Finally, Chapter 3.6 presents experimental results.
Notation     Meaning
n            The total number of documents in the database
d*           A new document
d_i          The i-th document
ε            Threshold for determining the neighborhood
N_ε(d_i)     ε-neighborhood of d_i
D_{d_i}      The set of documents that contain any term of d_i
C_{d_i}      The set of clusters that contain any neighbor of d_i
|A|          The size of a set A, where A can be a neighborhood or a cluster
df_ij        Document frequency of a term t_i within a set A_j
w_ij         TF-IDF value of a term t_i for a document d_j
S_j          Signature vector for a set A_j
s_j^i        The i-th component of S_j

Table 3.2: Notations for incremental non-hierarchical document clustering
3.2 A Motivating Example
To illustrate a simple example, consider the following three documents (whose document × term matrix is shown in Table 3.1).

1. d1: A child is kidnapped so the police start searching.
2. d2: The police found the suspect of the child kidnapping.
3. d3: An abducted boy safely returned home.

Among the above three documents, although d1 and d2 are similar, and d2 and d3 are similar, d1 and d3 are completely dissimilar since they share no terms. Consequently, the transitivity relation does not hold. Why does this happen? We provide explanations to this question from three different perspectives.
1. Fuzzy similarity relation. As discussed in fuzzy theory [138], the similarity relation does not satisfy transitivity. To make it satisfy transitivity, a fuzzy transitivity
closure approach was introduced. However, this approach is not scalable with the
number of data points.
2. Inherent characteristics of news. As discussed in Allan et al. [3], an event is considered an evolving object through some time interval (i.e., the content of news articles on the same story changes over time). Hence, although documents may belong to the same event, the terms they use will differ if they discuss different aspects of the event.

3. Language semantics. The diverse term usage for the same meaning (e.g., kidnap and abduct) needs to be considered. Using only a syntactic property (e.g., keyword counting) aggravates the problem.
Transitivity is related to the document insertion order in incremental clustering. Consider the TDT incremental clustering algorithm in Figure 2.1. If the order of document insertion is "d1 d2 d3", then one cluster ({{d1, d2, d3}}) is obtained. However, if the order is "d1 d3 d2", then two clusters ({{d1, d2}, {d3}}) are obtained. Although the order of document insertion is fixed (because a document is inserted whenever it is published), it is undesirable if the clustering result significantly depends on the insertion order. Regardless of the input order, a successful algorithm should produce a single cluster, {{d1, d2, d3}}.
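This order dependence can be reproduced with a single-pass threshold clusterer in the style of the TDT baseline (the threshold ε = 0.4 and the max-over-members cluster similarity are our assumptions for illustration; the vectors are the boolean rows of Table 3.1):

```python
def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) ** 0.5 * sum(b * b for b in y) ** 0.5)

def threshold_cluster(docs, sim, eps):
    """Single-pass incremental clustering: each new document joins the
    most similar existing cluster if that similarity reaches eps,
    otherwise it starts a new cluster.  Similarity to a cluster is the
    maximum over its members."""
    clusters = []
    for name, vec in docs:
        best, best_sim = None, eps
        for c in clusters:
            s = max(sim(vec, v) for _, v in c)
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([(name, vec)])
        else:
            best.append((name, vec))
    return [[n for n, _ in c] for c in clusters]

# Boolean rows of Table 3.1.
d1 = ("d1", (1, 0, 1, 0, 1, 1, 0, 1, 0, 0))
d2 = ("d2", (1, 1, 1, 1, 1, 0, 1, 1, 0, 0))
d3 = ("d3", (0, 1, 0, 1, 0, 0, 1, 0, 1, 1))

one_order = threshold_cluster([d1, d2, d3], cos, eps=0.4)    # one cluster
other_order = threshold_cluster([d1, d3, d2], cos, eps=0.4)  # two clusters
```

Inserting d2 before d3 lets d2 bridge the zero-similarity pair d1 and d3, while inserting d3 second leaves it stranded in its own cluster, exactly the asymmetry described above.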
3.3 A Proposed Incremental Non-hierarchical Document Clustering Algorithm using a Neighborhood Search
Before we present detailed discussions of the proposed clustering algorithm, definitions of the basic terminology are provided. In addition, Table 3.2 shows the notations that will be used throughout this paper.
Definition 3.3.1: [similar] If Similarity(d_i, d_j) > ε, then a document d_i is referred to as similar to a document d_j.

Definition 3.3.2: [N_ε(d_i)] The ε-neighborhood of d_i is {x : Similarity(x, d_i) > ε}. That is, the ε-neighborhood of a document d_i is defined as the set of documents that are similar to d_i. In this paper, ε-neighborhood and neighborhood are used interchangeably.

Definition 3.3.3: [neighbor] A document d_j is defined as a neighbor of d_i if and only if d_j ∈ N_ε(d_i).
The proposed clustering algorithm is based on the observation that a property of an object is influenced by the attributes of its neighbors. Examples of such attributes are the properties of the neighbors, or the percentage of neighbors that fulfill a certain constraint. This idea can be translated into a clustering perspective as follows: the cluster label of an object depends on the cluster labels of its neighbors.

Recent data mining research has proposed density-based clustering such as Shared Nearest Neighbors (SNN) clustering [43, 68]. In SNN, the similarity between two objects is defined as the number of k-nearest neighbors they share. Thus, the basic motivation of
SNN clustering is similar to ours; however, as we will explain in Chapter 3.4, the detailed approach is completely different.

Figure 3.2 shows the proposed incremental clustering algorithm. Initially, we assume that only one document is available. Thus, this document itself forms a singleton cluster. Adding a new document to the existing cluster structure proceeds in three phases: neighborhood search, identification of an appropriate cluster for the new document, and re-clustering based on local information. In what follows, these three steps are explained in detail.
3.3.1 Neighborhood search
Achieving an efficient neighborhood search is important in the proposed clustering algorithm. Since we deal with documents in this research, we can rely on an inverted index for the purpose of the neighborhood search.3 In an inverted index [114], the index associates a set of documents with each term. That is, for each term t_i, we build a document list that contains all documents containing t_i. Given that a document d_i is composed of t_1, ..., t_k, to identify documents similar to d_i, instead of checking the whole document dataset, it is sufficient to examine the documents that contain any t_i. Thus, given a document d_i, identifying the neighborhood can be accomplished in O(|D_{d_i}|).
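A minimal inverted index makes the O(|D_{d_i}|) bound concrete (the function and variable names are ours): only documents sharing at least one term with d_i can have nonzero cosine similarity, so only they need to be scored.

```python
def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = {}
    for doc_id, terms in docs.items():
        for t in set(terms):
            index.setdefault(t, set()).add(doc_id)
    return index

def candidates(index, terms):
    """The set D_d of documents sharing at least one term with the query
    document.  Every document outside this set has cosine similarity 0,
    so only |D_d| documents need to be examined."""
    out = set()
    for t in set(terms):
        out |= index.get(t, set())
    return out

# Term lists corresponding to the Table 3.1 documents.
docs = {
    "d1": ["kidnap", "child", "police", "search", "investigate"],
    "d2": ["kidnap", "abduct", "child", "boy", "police", "missing", "investigate"],
    "d3": ["abduct", "boy", "missing", "return", "home"],
}
index = build_index(docs)
D_d1 = candidates(index, docs["d1"])
```

On this toy collection the neighborhood search for d1 touches only {d1, d2}; d3, which shares no term with d1, is never scored.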
3.3.2 Identification of an appropriate cluster
To assign an incoming document (d*) to an existing cluster, the cluster that can host d* needs to be identified using the neighborhood of d*. If there exists such a cluster,
3Note that the neighborhood search can be supported with a Multi-Dimensional Index (MDI) structure [12, 57, 14] coupled with dimensionality reduction (e.g., wavelet transforms [24] or Fourier transforms [2]) if the proposed algorithm is extended to other data types such as time-series.
Step 1. Initialization:
    Document d0 forms a singleton cluster C0.
Step 2. Neighborhood search:
    Given a new incoming document d*, obtain N_ε(d*) by performing a neighborhood search.
Step 3. Identification of a cluster that can host a new document:
    Compute the similarity between d* and each cluster C_i ∈ C_{d*}.
    Based on the values obtained above, if there exists a cluster (C_i) that can host d*, then add d* to the cluster and update the DCF_i. Otherwise, create a new cluster for d* and a corresponding DCF vector for this new cluster.
Step 4. Re-clustering:
    Let C_i be the cluster that hosts d*. If C_i is not a singleton cluster, then trigger the merge operation.
Step 5.
    Repeat Steps 2-4 whenever a new document appears in the stream.

Figure 3.2: The incremental non-hierarchical document clustering algorithm
then d* is assigned to that cluster. Otherwise, d* is identified as an outlier and forms a singleton cluster.

Toward this end, the set of candidate clusters (C_{d*}) is identified by selecting the clusters that contain any document belonging to N_ε(d*). Subsequently, the cluster that can host the new document is identified using one of the following three methods.

1. Considering the size of the overlapped region. Select the cluster that has the largest number of its members in N_ε(d*). This approach only considers the number of
documents in the overlapped region, and ignores the proximity between neighbors
and d*.
2. Exploiting weighted voting. The similarities between each neighbor of d* and the candidate clusters are measured. Then, the similarity values are aggregated using weighted voting. That is, the weight is determined by the proximity of a neighbor to the new document. Thus, each neighbor can vote for its cluster with a weight proportional to its proximity to the new document.

Let w_j be a weight representing the proximity of n_j to the new document (e.g., the Cosine similarity between n_j and the new document). Then, the most relevant cluster (C*) is selected based on the following formula:

C* = argmax_{C_k ∈ C_{d*}} Σ_{n_j ∈ N_ε(d*)} w_j · Similarity(n_j, S_k)    (3.1)

Equation (3.1) mitigates the problem of the previous method by considering the weight w_j. Moreover, it still favors clusters with a large overlap with N_ε(d*) by summing up the weighted similarities.
3. Exploiting a signature vector. While the weighted voting approach is effective, it is computationally inefficient since the similarities between all neighbors and all candidate clusters need to be computed. Instead, we employ a simple but effective approach, which measures the similarity between the signature vector of the neighborhood and that of each candidate cluster.
[Figure: Step 1: check whether d1 can be added to cluster 1; Step 2: add d1 to cluster 1; Step 3: merge cluster 1 and cluster 2 if they satisfy the merge constraint.]
Figure 3.3: Illustration of a re-clustering phase
The signature vector should be composed of terms that reflect the main characteristics of the documents within a set. For example, the center of a cluster would be a signature vector for the cluster. For each term t_i in a set A_j (e.g., a cluster or a neighborhood), we compute the weight for the signature vector using the following formula:

s_j^i = (df_ij / |A_j|) × Σ_{d_k ∈ A_j} w_ik    (3.2)

In Equation (3.2), the first factor measures the normalized document frequency within the set, and the second factor measures the sum of the weights of the term over the whole set of documents.

Next, the notion of the Document Cluster Feature (DCF) vector4 is presented as follows:
4The basic notion of DCF is motivated by the Cluster Feature (CF) in BIRCH clustering [140].
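The signature computation can be sketched on per-document TF-IDF weight maps. Note the caveat: the multiplicative combination of the two factors described for Equation (3.2) (normalized document frequency within the set, and the summed weight of the term over the set) is our reconstruction, and the function name and data layout are ours.

```python
def signature(doc_weights):
    """Signature vector of a set A_j given per-document {term: w_ij}
    maps: for each term, (document frequency within the set / set size)
    times (summed weight of the term over the set's documents).
    The multiplicative combination reconstructs Equation (3.2)."""
    size = len(doc_weights)
    acc = {}
    for w in doc_weights:
        for t, v in w.items():
            df, total = acc.get(t, (0, 0.0))
            acc[t] = (df + 1, total + v)
    return {t: (df / size) * total for t, (df, total) in acc.items()}

docs_w = [{"kidnap": 0.5, "police": 0.3}, {"kidnap": 0.4}]
sig = signature(docs_w)
```

A term appearing in every document of the set keeps its full summed weight, while a term appearing in only one of many documents is discounted, which matches the intent described for the two factors.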
Notation       Meaning
ST_{C_i}       A collection of specific terms for C_i
T              Virtual time
df_i(T)        Document frequency of a term t_i in the whole document set at time T
df_i^in(T)     Document frequency of a term t_i within C_j at time T
K              Number of clusters at level 1 at time T
df_i^out(T)    A quantitative value representing how much t_i occurs outside the cluster C_j at time T
Sel_i^j(T)     Selectivity of a term t_i for the cluster C_j at time T
C_i^j          The i-th cluster at level j

Table 3.3: Notations for incremental hierarchical document clustering
Definition 3.3.4: [DCF] The Document Cluster Feature (DCF) vector for a cluster C_i is defined as a triple DCF_i = (N_i, DF_i, W_i), where N_i is the number of documents in C_i, DF_i is a document frequency vector for C_i, and W_i is a weight sum vector for C_i.

Theorem 3.3.5: [Additivity of DCF] Let DCF_i = (N_i, DF_i, W_i) and DCF_j = (N_j, DF_j, W_j) be the document cluster feature vectors for C_i and C_j, respectively. Then, the DCF of the new cluster obtained by merging C_i and C_j is (N_i + N_j, DF_i + DF_j, W_i + W_j).
Proof: It is straightforward by simple linear algebra.
To compute the similarity between a document and a cluster, we only need the signature vectors of the cluster and the document. However, the signature vector does not need to be recomputed as a new document is inserted into the cluster. This property is based on the additivity of DCF. Since S_i (the signature vector for C_i) can be directly reconstructed from DCF_i, instead of recomputing S_i whenever a new document is inserted into C_i, only DCF_i needs to be updated using the additivity of DCF.
In sum, if there exists a cluster C_i that can host a new document, then the new document is assigned to C_i and DCF_i is updated. Otherwise, a new cluster for d* and a DCF vector for this cluster are created.
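The additivity property lends itself to a simple incremental implementation. The following is a minimal sketch in Python, assuming term vectors are stored as dictionaries; the class and field names are illustrative, not taken from the implementation described in this work.

```python
from dataclasses import dataclass, field

@dataclass
class DCF:
    """Minimal sketch of a Document Cluster Feature vector.

    Term vectors are modeled as dicts (term -> value); names are illustrative.
    """
    n: int = 0                                # number of documents in the cluster
    df: dict = field(default_factory=dict)    # document frequency vector
    w: dict = field(default_factory=dict)     # weight sum vector

    def insert(self, doc_weights):
        """Update the triple in place when a new document is inserted."""
        self.n += 1
        for term, weight in doc_weights.items():
            self.df[term] = self.df.get(term, 0) + 1
            self.w[term] = self.w.get(term, 0.0) + weight

    def merge(self, other):
        """Additivity of DCF: component-wise sum of the two triples."""
        merged = DCF(self.n + other.n, dict(self.df), dict(self.w))
        for term, count in other.df.items():
            merged.df[term] = merged.df.get(term, 0) + count
        for term, weight in other.w.items():
            merged.w[term] = merged.w.get(term, 0.0) + weight
        return merged
```

A signature vector can then be derived from the stored triple on demand, rather than recomputed per insertion.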
3.3.3 Re-clustering
If d* is assigned to C_i, then a merge operation needs to be triggered. This is based on a locality assumption [110]. Instead of re-clustering the whole dataset, we only need to focus on the clusters that are affected by the new document. That is, a new document is placed in the cluster, and a sequence of cluster re-structuring processes is performed only in regions that have been affected by the new document. Figure 3.3 illustrates this idea. As shown, only the clusters that contain a document belonging to the neighborhood of the new document need to be considered.
3.4 How to Extend the Non-hierarchical Clustering Algorithm into a Hierarchical Version?
When the algorithm in Figure 3.2 is applied to a news article dataset, different event clusters5 can be obtained. Since our goal is to generate a cluster hierarchy, all event clusters on the same topic need to be combined together. For example, to reflect a court trial topic, all court trial event clusters at level 1 should be merged into a single cluster at level 2. However, in many cases, this becomes a difficult task due to the
5These event clusters are defined at level 1. Note that level 0 corresponds to the lowest level in a cluster tree (i.e., each document itself forms a singleton cluster at level 0). Thus, clusters at level 1 are expected to contain similar documents on a certain event (i.e., event clusters) while clusters at level 2 are expected to contain similar documents on a certain topic (i.e., topic clusters).
extremely high term frequency of named entities within a document. Named entities are people/organizations, times/dates, and locations, which play a key role in defining the "who", "when", and "where" of a news event. Thus, although two different event clusters may belong to the same topic, the similarity between the clusters becomes extremely low; consequently, the task of merging different event clusters (on the same topic) is not simple.
To address the above problem, we illustrate how to extend the algorithm (in Figure 3.2) into a hierarchical version. Table 3.3 summarizes the notations that will be used in this chapter. Before presenting a detailed discussion, the necessary terminology is first defined.
Definition 3.4.1: Specific term (ST). A specific term for a cluster C_i is a term which frequently occurs within the cluster C_i, but rarely occurs outside of C_i. The collection of specific terms for C_i is denoted by ST_{C_i}.
Definition 3.4.2: Virtual time (T). Virtual time T is initialized to 0. At any time T, only one operation (e.g., document insertion or cluster merge) can be performed. In addition, T is increased by one only when an operation is performed.
Let df^i(T) be the document frequency of a term t_i in the whole document dataset at time T. Then, the document frequency of t_i at time T + 1 is defined as follows:

df^i(T + 1) = df^i(T) + 1, if d* is inserted at T and d* contains t_i;
df^i(T + 1) = df^i(T), otherwise.    (3.3)
Let df^i_IN(T) be the document frequency of a term t_i within C_j at time T. Then df^i_IN(T + 1) is recursively defined as follows:

df^i_IN(T + 1) = df^i_IN(T) + 1, if d* is inserted to C_j at T and d* contains t_i;
df^i_IN(T + 1) = df^i_IN(T), otherwise.    (3.4)
We denote by K(T) the number of clusters at level 1 at time T. Then, K(T + 1) is defined as follows:

K(T + 1) = K(T), if d* is inserted to an existing cluster at T;
K(T + 1) = K(T) + 1, if d* itself forms a new cluster at T;
K(T + 1) = K(T) - 1, if two clusters are merged at T.    (3.5)
Although df^i(T + 1) - df^i_IN(T + 1) could be considered for representing how much t_i occurs outside C_j at T + 1, it is not sufficient if our goal is to quantify how informative t_i is for C_j. This is because the number of clusters can also affect how much t_i discriminates C_j from other clusters. Thus, df^i_OUT(T + 1), which represents how much t_i occurs outside C_j at time T + 1, can be defined as follows:

df^i_OUT(T + 1) = (df^i(T + 1) - df^i_IN(T + 1)) / (K(T + 1) - 1)    (3.6)

Finally, the selectivity of a term t_i for the cluster C_j at time T + 1 is defined as follows:

Sel^i_j(T + 1) = log( df^i_IN(T + 1) / sqrt(df^i_OUT(T + 1)) )    (3.7)
In sum, Equation (3.7) assigns more weight to the terms occurring frequently within C_j and rarely outside of C_j. Therefore, a term with high selectivity for C_i can be a candidate for ST_{C_i}.
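Equations (3.3) through (3.7) reduce to a few lines of arithmetic. The sketch below computes the selectivity of a term from its document frequencies, assuming the normalization by the number of other clusters in Equation (3.6); the function and parameter names are illustrative.

```python
import math

def selectivity(df_total, df_in, k):
    """Sketch of Equations (3.6)-(3.7); assumes df_in > 0, df_total > df_in,
    and k > 1. Parameter names are illustrative.

    df_total : df^i(T), frequency of the term over the whole collection
    df_in    : df^i_IN(T), frequency of the term within the cluster
    k        : K(T), the current number of clusters at level 1
    """
    # Equation (3.6): occurrences outside the cluster, normalized by
    # the number of other clusters
    df_out = (df_total - df_in) / (k - 1)
    # Equation (3.7): high when the term is frequent inside, rare outside
    return math.log(df_in / math.sqrt(df_out))
```

A term concentrated inside one cluster scores higher than the same term dispersed over many clusters, which is exactly what Definition 3.4.1 asks of a specific term.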
Based on the definition of ST, the proposed hierarchical clustering algorithm is described. While clusters at level 1 are generated using the algorithm in Figure 3.2, if no more documents are inserted to a certain cluster at level 1 during a pre-defined time interval, then we assume that the event for the cluster ends,6 and associate ST with this cluster at level 1. We then perform a neighborhood search for this cluster at level 2. Since ST reflects the most specific characteristics of the cluster, it is not helpful if two topically similar clusters (but different events) need to be merged. Hence, when we build a vector for C_i, terms in ST (for C_i) are not included in the cluster vector.
At this moment, it is worthwhile to compare our algorithm with the SNN approach [68, 43]. The basic strategy of SNN clustering is as follows: it first constructs the nearest neighbor graph from the sparsified similarity matrix, which is obtained by keeping only the k-nearest neighbors of each entry. Next, it identifies representative points by choosing the points that have high density, and removes noise points that have low density. Finally, it takes connected components of points to form clusters.
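The SNN strategy just summarized can be sketched as follows. This is an illustrative reconstruction from the description (sparsify, estimate density, drop noise, take connected components), not the published implementation; the parameter names and defaults are assumptions.

```python
import numpy as np

def snn_clusters(sim, k=3, min_density=1):
    """Illustrative sketch of SNN-style clustering on a similarity matrix.

    sim: symmetric n x n similarity matrix (larger = more similar).
    Returns a dict mapping point index -> cluster id; noise is omitted.
    """
    n = sim.shape[0]
    # 1. sparsify: keep only each point's k nearest neighbors (skip self)
    knn = [set(np.argsort(-sim[i])[1:k + 1]) for i in range(n)]
    # 2. density: number of neighbors that also list this point back
    density = [sum(1 for j in knn[i] if i in knn[j]) for i in range(n)]
    # 3. keep sufficiently dense points; low-density noise is dropped
    keep = {i for i in range(n) if density[i] >= min_density}
    # 4. connected components of the mutual k-NN graph on kept points
    labels, current = {}, 0
    for i in sorted(keep):
        if i in labels:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            u = stack.pop()
            for v in knn[u]:
                if v in keep and u in knn[v] and v not in labels:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels
```

The contrast with our algorithm, discussed next, lies in when this computation happens (batch vs. incremental) and in how the neighborhood is defined.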
The key difference between SNN and our approach is that SNN is defined on static datasets while ours can deal with incremental datasets. The re-clustering phase and special data structures (e.g., DCF or signature vector) make our algorithm more suitable for incremental clustering than SNN. The second distinction is how a neighborhood is defined. In SNN, a neighborhood is defined as a set of k-nearest neighbors while we use
6This assumption is based on the temporal proximity of an event [134].
ε-neighborhood. Thus, as discussed in Han et al. [58], the neighborhood constructed from k-nearest neighbors is local in that the neighborhood is defined narrowly in dense regions while it is defined more widely in sparse regions. However, for document clustering, a global neighborhood approach produces more meaningful clusters. The third distinction is that we intend to build a cluster hierarchy incrementally. In contrast, SNN does not focus on hierarchical clustering. Finally, our algorithm can easily identify singleton clusters. This is especially important in our application since an outlier document in a news stream may imply a valuable fact (e.g., a new event or technology that has not been mentioned in previous articles). In contrast, SNN overlooks the importance of singleton clusters.
3.5 Building a Topic Ontology
A topic ontology is a collection of concepts and relations. One view of a concept is as a set
of terms that characterize a topic. We define two generic kinds of relations, specialization
and generalization. The former is useful when refining a query while the latter can be
used when generalizing a query to increase recall or broaden the search.
Table 3.4 and Table 3.5 illustrate sample specific terms for selected events/topics. As shown, with respect to a news event, we observed that the specific details are captured by the lower levels (e.g., level 1), while the higher levels (e.g., level 2) are abstract. We can also generate general terms for a node, which are defined as follows:
Event Specific features
Court trial 1 winona, ryder, actress, shoplift, beverly
Court trial 2 andrea, yates, drown, insanity
Court trial 3 blake, bakley, actor
Court trial 4 moxley, martha, kennedy, michael
Kidnapping 1 elizabeth, smart, utah, salt, lake
Kidnapping 2 jessica, holly, soham, Cambridgeshire, england
Kidnapping 3 weaver, ashlei, miranda, gaddis
Kidnapping 4 avila, samantha, runnion
Earthquake 1 san, giuliano, puglia, italy, sicily, etna
Earthquake 2 china, bachu, beijing, xinjiang
Earthquake 3 algeria, algerian
Earthquake 4 iran, qazvin
Table 3.4: Sample specific terms for the clusters at level 1. The terms in regular font denote NEs. Thus, this supports the argument that NE plays a key role in defining the specific details of events
Topic Specific features
Court trial attorney court defense evidence jury kill law legal
murder prosecutor testify trial
Kidnapping abduct disappear enforce family girl kidnap miss
parent police
Earthquake body collapse damage earthquake fault hit injury
magnitude quake victim
Airplane crash accident air aircraft airline aviate boeing collision crash
dead flight passenger pilot safety traffic warn
Table 3.5: Sample specific terms for the clusters at level 2
Event General features
Court trial 1 arm arrest camera count delay drug hill injury order
store stand target victim
Table 3.6: General terms for the court trial cluster 1 in Table 3.4
Definition 3.5.1: General term (GT). A general term for a cluster C_i is a term which frequently occurs within the cluster C_i, and also frequently occurs outside of C_i. The collection of general terms for C_i is denoted by GT_{C_i}.
Thus, in comparison with ST, the selectivity of GT is less than that of ST. These ST and GT constitute the concepts of a topic ontology.7
Table 3.6 shows GT for the "court trial 1" cluster in Table 3.4. When the "Winona Ryder court trial" cluster (C_i) is considered, ST_{C_i} represents the most specific information for the "Winona Ryder court trial" event, GT_{C_i} carries the next most specific information for the event, and the specific terms for the court trial cluster describe the general information for the event. Therefore, we can conclude that a topic ontology can characterize a news topic at multiple levels of abstraction.
Human-understandable information needs to be associated with the cluster structure such that clustering results are easily comprehensible to users. Since a topic ontology provides an interpretation of a news topic at multiple levels of detail, an important use of a topic ontology is automatic cluster labeling. In addition, a topic ontology can be effectively used for suggesting alternative queries in information retrieval.
There exists research work on the extraction of hierarchical relations between terms from a set of documents [50, 115] or term associations [119]. However, our work is unique in that the topical relations are dynamically generated based on incremental hierarchical clustering rather than based on human-defined topics such as the Yahoo directory (http://www.yahoo.com).
7There are two thresholds for selectivity, one for ST (λ1) and one for GT (λ2), which are determined by experiments.
Sample topic       Sample events
Earthquake         Algeria earthquake, Alaska earthquake, Iran earthquake, etc.
Flood              Russia flood, Texas flood, China flood, etc.
Wildfire           Colorado wildfire, Arizona wildfire, New Jersey wildfire, etc.
Airplane crash     Ukraine airplane crash, Taiwan airplane crash, etc.
Court trial        David Westerfield, Andrea Yates, Robert Blake, etc.
Kidnapping         Samantha Runnion, Elizabeth Smart, Patrick Dennehy, etc.
National security  Mailbox pipebomb, Shoebomb, Dirty bomb, etc.
Health             2002-West Nile virus, 2003-West Nile virus, SARS, etc.
Table 3.7: Examples for selected topics and events
3.6 Experimental Results for Information Analysis
In this chapter, we present experimental results that demonstrate the effectiveness of the information analysis component. Chapter 3.7 illustrates our experimental setup. Experimental results are presented in Chapter 3.8.
3.7 Experimental Setup
For the empirical evaluation of the proposed clustering algorithm, approximately 3,000 news articles downloaded from CNN (http://www.cnn.com) are used. The total number of topics and events used in this research is 15 and 180, respectively. Thus, the maximum possible number of clusters we can obtain (at level 1) is 180. Note that the number of documents per event ranges from 1 to 151. Table 3.7 illustrates examples of topics and events.
The quality of a generated cluster hierarchy was determined by two metrics, precision and recall. Let T_r be a class on topic/event r.8 Then, a cluster C_r is referred to as a topic
8A class is determined by the ground-truth dataset. Thus, a class on topic/event r contains all documents on r, and does not contain any document on other topics or events. In contrast, a cluster is
Figure 3.4: Illustration of ε's sensitivity to clustering results (x-axis: epsilon)
r cluster if and only if the majority of subclusters of C_r belong to T_r. The precision and recall of the clustering at level i (where K_i is the number of clusters at level i) can then be defined as follows:

Precision_i = (1 / K_i) * sum_{r=1}^{K_i} |C_r ∩ T_r| / |C_r|    (3.8)

Recall_i = (1 / K_i) * sum_{r=1}^{K_i} |C_r ∩ T_r| / |T_r|    (3.9)
Thus, if there is large topic overlap within a cluster, then the precision will drop. Precision and recall are relevant metrics in that they can measure a "meaningful theme". That is, if a cluster C is about "Turkey earthquake", then C should contain all documents about "Turkey earthquake". In addition, documents which do not talk about "Turkey earthquake" should not belong to C.
determined by clustering algorithms. Note that there exists a 1-1 mapping between events and clusters at level 1 of the hierarchy, and between topics and clusters at level 2 of the hierarchy.
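Under the 1-1 cluster-to-class mapping described in the footnote, Equations (3.8) and (3.9) can be computed as sketched below; the function name and the representation of clusters as sets of document ids are illustrative assumptions.

```python
def cluster_precision_recall(clusters, classes):
    """Sketch of Equations (3.8) and (3.9).

    clusters: list of sets of document ids, one per discovered cluster
    classes:  list of sets of document ids, the ground-truth class
              matched to each cluster (the 1-1 mapping assumed in the text)
    Returns the averaged precision and recall over the K_i clusters.
    """
    k = len(clusters)
    precision = sum(len(c & t) / len(c) for c, t in zip(clusters, classes)) / k
    recall = sum(len(c & t) / len(t) for c, t in zip(clusters, classes)) / k
    return precision, recall
```

For instance, a cluster {1, 2, 3, 4} matched to class {1, 2, 3} contributes precision 3/4 and recall 1 before averaging.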
3.8 Experimental Results
For the purpose of comparison, we decided to use K-means clustering. However, since K-means is not suitable for incremental clustering, K-means clustering is performed retrospectively on the datasets. In contrast, the proposed algorithm was tested on incremental datasets after learning IDF. Moreover, since we already knew the number of clusters at level 1 based on the ground-truth data, K could be fixed in advance. Furthermore, to overcome K-means' sensitivity to initial seed selection, seeds are selected with the condition that the chosen seeds are far from each other. Since we deal with document datasets, this intelligent seed selection9 can be easily achieved by using an inverted index.
3.8.1 Parameterization
The size of a neighborhood, which is determined by ε, influences clustering results. To observe the effect, we performed an experiment as follows. From the 3,000 documents, we organized sample datasets, each consisting of 500 documents in 50 clusters of different sizes. Then, while changing the value of ε, our clustering was conducted on the dataset. In Figure 3.4, the x-axis represents the value of ε, and the y-axis represents the number of clusters in the result (k1) over the number of clusters determined by the ground-truth data (k2). Thus, if the clustering algorithm guesses the exact number of clusters, then the value of y corresponds to one. As observed in Figure 3.4, we could find the best result when ε varies between 0.1 and 0.25, i.e., the algorithm guessed the exact number of clusters. If the value of ε was too small, then the algorithm found a few large-size clusters. In
9Two documents are mutually orthogonal if they share no terms. This holds true when the Cosine metric is used for the similarity measure.
Figure 3.5: Illustration of non-spherical document clusters (o: a document on wildfire, x: a document on court trial)
contrast, many small-size clusters were identified if the value of ε was too large. Thus, the proposed algorithm might be considered sensitive to the choice of ε. However, once the value of ε (i.e., ε = 0.2) was fixed, approximately the right number of clusters was always obtained whenever we performed clustering on different datasets. Therefore, the number of clusters does not need to be given to our algorithm as an input parameter, which is a key advantage over partition-based clustering.
3.8.2 Ability to identify clusters with the same density, but different shapes
To illustrate a simple example of the shapes of document clusters with the same density, approximately the same number of documents were randomly chosen from two different events (a wildfire event and a court trial event), and the document×term matrix on this dataset was decomposed by Singular Value Decomposition. By keeping the first two largest singular values, the dataset could be projected onto a 2D space corresponding to principal
(a) Proposed algorithm
          Precision   Recall
Level 1      91.5%    90.3%
Level 2     100.0%    76.4%

(b) Modified K-means
          Precision   Recall
Level 1      83.1%    86.7%

Figure 3.6: Comparison of the clustering algorithms on datasets-1. Datasets-1 consists of five different datasets where each cluster has approximately the same density. The values of precision and recall shown in this table are obtained by averaging the accuracy of the algorithm on each dataset
(a) Proposed algorithm: Precision 87.5%, Recall 88.6%
(b) Modified K-means algorithm: Precision 78.7%, Recall 79.5%

Figure 3.7: Comparison of the accuracy of clustering algorithms at level 1 on datasets-2. Datasets-2 consists of ten different datasets where each cluster has an arbitrary number of documents. The values of precision and recall shown in this table are obtained by averaging the accuracy of the algorithm on each dataset
components. Figure 3.5 illustrates the plot of the documents. As shown, since the shape of a document cluster can be arbitrary, a shape of a document cluster cannot be assumed in advance (e.g., a hyper-sphere in K-means).
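The projection step described above can be sketched with NumPy's SVD. The two synthetic document groups below stand in for the wildfire and court trial events; the matrix values are illustrative, not the data used in this work.

```python
import numpy as np

# Two synthetic groups of 20 documents over a 4-term vocabulary
# (values illustrative): wildfire-like docs load on term 1,
# court-trial-like docs load on terms 3 and 4.
rng = np.random.default_rng(0)
docs_wildfire = rng.normal(loc=[3, 0, 0, 0], scale=0.5, size=(20, 4))
docs_trial = rng.normal(loc=[0, 0, 3, 3], scale=0.5, size=(20, 4))
X = np.vstack([docs_wildfire, docs_trial])   # document x term matrix

# Keep the two largest singular values of the (centered) matrix and
# project every document onto the corresponding 2D space.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
coords_2d = U[:, :2] * s[:2]                 # one (x, y) point per document
```

Plotting `coords_2d` produces a scatter like Figure 3.5, with each group occupying its own region of the plane.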
To test the ability to identify different shapes of clusters, we organized datasets where each cluster consists of approximately the same number of documents (but, as illustrated in Figure 3.5, each document cluster will have a different shape). As shown in Figure 3.6, the proposed algorithm outperforms the modified K-means algorithm in terms of precision and recall.10 This is because the proposed algorithm measures similarity
10We did not compare the modified K-means algorithm with ours at level 2. To do this, we would also need to develop a feature selection algorithm to extend the modified K-means algorithm into a hierarchical version.
    Colorado wildfire   Num      Arizona wildfire   Num
1   fire                14.33    fire               17.68
2   forest               5.72    rodeo               4.76
3   firefighter          4.83    blaze               4.38
4   acre                 3.94    firefighter         4.21
5   evacuate             3.77    burn                3.92
6   hayman               3.22    arizona             3.46
7   blaze                3.11    paxon               3.15
8   weather              3.06    acre                3.00
9   official             2.89    wildfire            3.00
10  national             2.83    chediski            2.89
11  burn                 2.72    resident            2.61
12  area                 2.66    center              2.46
13  wildfire             2.56    national            2.46
14  denver               2.43    area                2.46
15  colorado             2.33    evacuate            2.65

Table 3.8: Top 15 high term frequency words in the Colorado wildfire and Arizona wildfire events. Num represents the average number of term occurrences per document in each event (without document length normalization). Terms in italic font carry event-specific information for each wildfire event
between a cluster and a neighborhood of a document, while K-means clustering measures similarity between a cluster and a document. Note that the 10% increase in accuracy is significant considering the fact that we provided the correct number of clusters (K) and chose the best initial seed points for K-means. As illustrated in Figure 3.6, the recall of our algorithm decreases as the level increases. The main reason for this poor recall at level 2 is related to the characteristics of news articles. As discussed, a named entity (NE) plays a key role in defining the who/when/where of an event. Hence, NEs contribute to high-quality clustering at level 1. However, at level 2, since the strength of topical terms is not very strong (unlike named entities), it was not easy to merge different event clusters (belonging to the same topic) into the same topical cluster.
3.8.3 Ability to discover clusters with different densities and shapes
Since the sizes of clusters can be arbitrary, clustering algorithms must be able to identify clusters with wide variance in size. To test the ability to identify clusters with different densities, we organized datasets where each dataset consists of document clusters with diverse densities. As shown in Figure 3.7, when the density of each cluster is not uniform, the accuracy of the modified K-means clustering algorithm degraded. In contrast, the accuracy of our algorithm remained similar. Therefore, based on the experimental results on datasets-1 and datasets-2, we can conclude that our algorithm has a better ability than K-means clustering to find arbitrary shapes of clusters with variable sizes.
3.8.4 Event confusion
There are some events that we could not correctly separate. For example, on the wildfire topic, there exist different events such as "Oregon wildfire", "Arizona wildfire", etc. However, at level 1, it was hard to separate those events into different clusters. Table 3.8 illustrates the reason for this event confusion at level 1. As shown, the term frequency of topical terms (e.g., fire, firefighter, etc.) is relatively higher than that of named entities (e.g., Colorado, Arizona, etc.). Similarly, for the airplane crash topic, it was difficult to separate different airplane crash events since distinguishing lexical features like the plane number have extremely low term frequency.
The capability of distinguishing different events on the same topic is important. One possible solution is to use temporal information. The rationale behind this approach is based on the assumption that news articles on the same event are temporally proximate. However, if
two events occur during the same time interval, then this temporal information might not be helpful. Another approach is to use classification, i.e., the training dataset is composed of multiple topic classes, and each class is composed of multiple events. Then, we learn the weights of topic-specific terms and named entities [135]. However, this approach is not relevant here since it cannot accommodate dynamically changing topics. Therefore, further study of the event confusion problem is needed.
Chapter 4
Density-based Gene Expression Clustering using a Mutual Neighborhood
4.1 Preliminary
Discovering new biological knowledge from the data obtained by high-throughput experimental technologies is a major challenge in bioinformatics. For example, with the recent advancement of DNA microarray technologies, the expression levels of thousands of genes can be measured simultaneously. The obtained data is organized as a matrix where the columns represent genes (usually genes of the whole genome) and the rows correspond to the samples (e.g., various tissues, experimental conditions, or time points).
Given this massive amount of gene expression data, the goal of microarray analysis is to extract hidden knowledge (e.g., similarity or dependency between genes) from this matrix. The analysis of gene expression may identify mechanisms of gene regulation and interaction, which can be used to understand the function of a cell [48]. Moreover, comparison between expression in a diseased tissue and a normal tissue will further enhance our
understanding of the disease pathology [52]. Therefore, data mining, which transforms raw data into useful higher-level knowledge, becomes a must in bioinformatics.
One of the key steps in gene expression analysis is to cluster genes that show similar patterns. Based on the hypothesis that genes clustered together tend to be functionally related, the functions of unknown genes can be predicted from genes (with known functions) in the same cluster. Recently, genome-wide expression data clustering has received significant attention in the bioinformatics research community, ranging from hierarchical clustering [40, 120], self-organizing maps [126], and neural networks [62], to algorithms based on Principal Components Analysis [136] or Singular Value Decomposition [34, 40, 64], and graph-based approaches [133]. However, previous algorithms are limited in addressing the following three challenges.
1. Gene expression data often contains a large amount of outliers, which are considerably dissimilar from the significant portion of the data. Sometimes, the discovery of clusters would be hampered by the presence of outliers. For example, a small amount of outliers can substantially influence the mean values of clusters in center-based clustering (e.g., K-means). Thus, clustering algorithms must be able to identify outliers and remove them if necessary.
2. Co-expressed gene clusters may be highly connected by a large amount of intermediate genes that are located between one cluster and another [70]. Clustering algorithms should not be confused by genes in a transition region. That is, simply merging two clusters connected by a set of intermediate genes should be avoided. Thus, the ability to detect the "genes in the transition region" would be helpful in gene expression clustering.
3. Since gene expression data consist of measurements across various conditions (or time points), they are characterized as multi-dimensional data. Clusters in high-dimensional spaces are, in general, of diverse densities. Although we are mainly interested in highly co-expressed gene clusters, medium density clusters may be biologically meaningful as well. For example, a time-shift relationship is often exemplified in some biological processes such as the cell cycle. Moreover, some mechanisms can limit the number of expressed genes based on the principle of efficiency; consequently, the expression relationships among genes along the same biological pathway may be only partially revealed in a single microarray experiment [142]. As a consequence, biologically meaningful clusters may not always correspond to highly co-expressed clusters. Thus, it is essential to identify clusters of wide-ranging densities (especially high density and medium density clusters). However, most clustering algorithms have difficulty in identifying clusters with diverse densities.
To address the above three issues, we propose a novel clustering algorithm that utilizes a density-based approach. In the density-based clustering approach, the density of each gene is estimated first. Next, genes with high density (i.e., core genes) and genes with low density (i.e., outlier genes) are identified. Non-core, non-outlier genes are defined as border genes. Since a core gene has high density, it is expected to be located well inside the cluster (i.e., a representative of a cluster). Thus, instead of conducting clustering on the whole data, performing clustering on core genes alone can avoid the first and second problems stated above while producing a skeleton of the cluster structure. After that, border genes are used to expand the cluster structure by assigning them to the most relevant cluster.
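The labeling step just described can be sketched as follows, assuming density is measured as the number of genes within a fixed distance of each gene; the thresholds and function name are illustrative, not the parameters used in this work.

```python
import numpy as np

def label_genes(X, eps=1.0, core_min=4, outlier_max=1):
    """Sketch of the core/border/outlier labeling step.

    X is a genes-by-conditions expression matrix; eps, core_min and
    outlier_max are illustrative thresholds. Returns one of 'core',
    'border', or 'outlier' per gene.
    """
    # density of a gene = number of *other* genes within distance eps
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (dists < eps).sum(axis=1) - 1      # subtract the gene itself
    labels = []
    for d in density:
        if d >= core_min:
            labels.append('core')       # well inside a cluster
        elif d <= outlier_max:
            labels.append('outlier')    # considerably dissimilar
        else:
            labels.append('border')     # non-core, non-outlier
    return labels
```

Clustering would then run on the 'core' genes alone, after which 'border' genes are attached to the most relevant cluster and 'outlier' genes are discarded or kept as singletons.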
In density-based clustering, outlier genes may not be clustered into any cluster. That is, since the goal of gene expression clustering is to identify sets of genes with similar patterns, it may be necessary to discard outlier genes during the clustering process. While this approach does not provide a complete organization of all genes, it can extract the "essentials" of the information in the given data. However, if a complete clustering is necessary, the outlier genes can be added to the closest clusters (or they can form singleton clusters).
Density-based clustering approaches have been studied in the previous data mining literature [44, 74]. However, since they have been studied in non-biological contexts, the presented approaches are not directly applicable to gene expression data. Our primary contribution is to tackle the three problems stated above, and to exploit the strengths and address the limitations of the previous density-based clustering approaches.
The remainder of this chapter is organized as follows. In Chapter 4.2, we review the
related work, and highlight the strengths and weaknesses of the previous approaches.
We present background information in Chapter 4.3. We explain the proposed clustering
algorithm in Chapter 4.4. Finally, we present experimental results in Chapter 4.5.
4.2 Related Work
In this chapter, we briefly review previous gene expression clustering approaches. Note that this chapter should not be considered a comprehensive survey of all published gene
expression clustering algorithms. It only aims to provide a concise overview of algorithms that are directly related to our approach. Jiang et al. [71] provide a comprehensive review of gene expression clustering. For details, refer to that paper [71].
Partition-based clustering decomposes a collection of genes into a partition that is optimal with respect to some pre-defined function; a representative example is the center-based approach [48, 128]. Center-based algorithms find the clusters by partitioning the entire dataset into a pre-determined number of clusters [35, 48, 128]. Although center-based clustering algorithms have been widely used in gene expression clustering, they have the following drawbacks. First, the algorithm is sensitive to the initial seed selection. Depending on the initial points, it is susceptible to a local optimum. Second, as discussed in Chapter 3.8, it is sensitive to noise. Third, the number of clusters should be determined beforehand.
Hierarchical (agglomerative) clustering (HAC) finds the clusters by initially assigning each gene to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met [35, 40, 120]. However, as discussed in Chapter 2.2.2, the user should determine where to cut the dendrogram to produce actual clusters.
The graph-based approach [133] utilizes graph algorithms (e.g., minimum spanning
tree or minimum cut) to partition the graph into connected subgraphs. However, because of
the points lying in transition regions, this approach may end up with a single highly
connected set of genes.
To better explain time-course gene expression datasets, new models have been proposed
to capture the relationships between time points [11, 10]. However, they assume that the
data fits a certain distribution, which does not hold for gene expression datasets, as
discussed in Yeung et al. [137].
As we have discussed, our work is motivated by previous density-based clustering
approaches such as DBSCAN [44] and SNN [43]. However, since these approaches utilize
a notion of connectivity to build a cluster, they might not be appropriate for datasets
with moderately dense transition regions. In addition, both approaches need user-defined
parameters (e.g., MinPts), which are difficult to determine in advance. Recently,
density-based clustering algorithms have been applied to gene expression datasets [69,
70]. Both approaches are promising in that a meaningful hierarchical cluster structure
(rather than a dendrogram) can be built. However, these approaches have drawbacks in
that MinPts must be determined beforehand [69], or they are computationally expensive
due to kernel density estimation [70].
4.3 Background for the Proposed Method
4.3.1 Similarity Metric
To estimate density for each gene, we need to decide the distance metric (or similarity
metric). One of the most commonly used metrics to measure the distance between two
data items is Euclidean distance. The distance between x_i and x_j in m-dimensional space
is defined as follows:

d(x_i, x_j) = Euclidean(x_i, x_j) = \sqrt{\sum_{d=1}^{m} (x_{id} - x_{jd})^2}    (4.1)
Since Euclidean distance emphasizes the individual magnitude of each feature, it does not
account for shifting or scaling patterns very well. In gene expression datasets, the overall
shapes of gene expression patterns are more important than magnitude. To address the
shifting and scaling problem, each gene can be standardized as follows:

x_i^* = \frac{x_i - \mu_i}{\sigma_i}    (4.2)

where x_i^* is the standardized vector of x_i, \mu_i is the mean of x_i, and \sigma_i is
the standard deviation of x_i.
Another widely used metric for time-series similarity is Pearson's correlation
coefficient. Given two genes x_i and x_j, Pearson's correlation coefficient r(x_i, x_j)
is defined as follows:

r(x_i, x_j) = \frac{\sum_{d=1}^{m} (x_{id} - \mu_i)(x_{jd} - \mu_j)}{\sqrt{\sum_{d=1}^{m} (x_{id} - \mu_i)^2} \sqrt{\sum_{d=1}^{m} (x_{jd} - \mu_j)^2}}    (4.3)

Note that r(x_i, x_j) has a value between 1 (perfect positive linear correlation) and -1
(perfect negative linear correlation), with a value of 0 indicating no linear correlation.
Throughout this dissertation, we explain our methodology using the Pearson correlation
coefficient.
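To make the two measures concrete, the following is a minimal Python sketch of standardization (Equation 4.2) and Pearson's correlation (Equation 4.3). The function names and toy profiles are illustrative only, not part of the implementation used in this dissertation.

```python
import math

def standardize(x):
    """Return the z-standardized copy of gene vector x (Equation 4.2)."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / m)
    return [(v - mu) / sigma for v in x]

def pearson(x, y):
    """Pearson's correlation coefficient r(x, y) (Equation 4.3)."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * \
          math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

# A shifted-and-scaled copy of a profile is perfectly correlated with it:
g1 = [1.0, 3.0, 2.0, 5.0, 4.0]
g2 = [v * 2.0 + 10.0 for v in g1]   # same shape, different magnitude
print(round(pearson(g1, g2), 6))    # -> 1.0
```

Because a shifted and scaled copy of a profile has correlation 1 with the original, Pearson's coefficient captures the shape of an expression pattern rather than its magnitude, which is exactly the property motivated above.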
4.3.2 Density Estimation
In order to estimate the density for each gene, a neighborhood needs to be defined first.
Traditionally, there are two ways to define a neighborhood: an ε-neighborhood and a
k-nearest neighborhood. In an ε-neighborhood, the neighborhood of x_i is defined as the
set of genes whose similarity to x_i is greater than or equal to a pre-defined threshold
(ε). On the other hand, the k-nearest neighborhood of x_i is defined as the set of the k
most similar genes, excluding x_i itself. We denote the k-nearest neighborhood of x_i as
N_k(x_i). A k-nearest neighborhood is more flexible than an ε-neighborhood in that the
former can model diversely shaped neighborhoods while the latter can only model
spherical ones.
After the type of neighborhood is determined, the next step is to define a density
based on the neighborhood. In ε-density, the density of an object is defined by the
number of objects in a region of specified radius (ε) around the point [44]. In this
dissertation, we mainly focus on k-NN density estimation. The conventional k-NN
approach to probability density first fixes the number of neighbors (k) beforehand. If
the k-th nearest neighbor of x_i is close to x_i, then x_i is more likely to be in a region
with a high probability density. Thus, the distance between x_i and its k-th nearest
neighbor provides a measure of the density. However, since this approach considers only
the k-th neighbor, it is susceptible to random variation.
To address this problem, we define the k-NN density of x_i in terms of the sum of the
correlations between x_i and its k nearest neighbors:

density(x_i) = \frac{1}{k} \sum_{x_j \in N_k(x_i)} r(x_i, x_j)    (4.4)

Since the proposed approach considers all k neighbors, it is less sensitive to k than an
approach that considers only the k-th neighbor. The proposed density estimate can also be
viewed from a graph perspective, in that the density at a vertex (i.e., a gene) is defined
by the sum of the weights of its k strongest edges. Thus, k-NN density implies how
strongly a gene is connected to its k-nearest neighbor genes.
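The estimate of Equation 4.4 can be sketched as follows. This is an illustrative Python fragment with a toy dataset, not the dissertation's implementation; `pearson` and `knn_density` are names we introduce here.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two gene vectors."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def knn_density(genes, i, k):
    """k-NN density of gene i: mean correlation to its k most similar genes (Eq. 4.4)."""
    sims = sorted((pearson(genes[i], genes[j])
                   for j in range(len(genes)) if j != i), reverse=True)
    return sum(sims[:k]) / k

# Three tightly co-expressed genes and one outlier: the outlier's density is lowest.
genes = [[1, 2, 3, 4], [1.1, 2.0, 3.2, 3.9], [0.9, 2.1, 2.8, 4.1], [4, 1, 4, 1]]
dense = knn_density(genes, 0, k=2)
sparse = knn_density(genes, 3, k=2)
print(dense > sparse)  # -> True
```

Averaging over all k neighbors, rather than looking only at the k-th one, is what makes the estimate robust to random variation in a single neighbor.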
4.3.3 Challenges in Density-based Gene Expression Clustering
Figure 4.1 shows the correlation between ε-density and k-NN density. In general, a strong
correlation can be observed between them: high (low) k-NN density implies
high (low) ε-density, and vice versa. As discussed, clusters in high-dimensional spaces
commonly have various densities. Density-based algorithms therefore have difficulty in
choosing a proper threshold value for α (the user-defined threshold that separates
core genes from border genes). That is, if α is too high, then we may miss relevant genes
that might be representative of a cluster. On the other hand, if the value of α is set
small enough to include all relevant core genes, then border genes may be included as well.

Thus, rather than identifying genes with globally high density, our first task is to
identify the genes that are located in uniform regions of diverse densities. That is,
regardless of the density of a gene, as long as the region around it has relatively
uniform density, the gene is considered a relevant candidate for a core gene. By
doing so, we can capture clusters of widely varying density.
However, although a clustering algorithm may be able to classify regions with different
densities into clusters, in gene expression clustering it is not relevant to find a cluster
with very low density (even if the region has uniform density), because the main goal of
gene expression clustering is to capture sets of highly co-expressed genes. Therefore, our
algorithm is composed of two tasks: identification of all clusters with diverse densities,
and filtering out the non-relevant clusters with low density.
The neighborhood list size (k) may determine the granularity of the clusters. If k is
too small, then the algorithm will tend to identify a large number of small-size clusters.
Figure 4.1: Correlation between k-NN density (when k=30) and ε-density (when ε=0.8)
for Cho's data, which will be discussed in Chapter 4.5.1. The horizontal axis represents
k-NN density and the vertical axis represents ε-density.
Notation   Meaning
n          Total number of genes
m          Total number of time points
X          An m x n gene expression profile matrix
x_i        The i-th gene
x_ij       The j-th feature of the i-th gene
N_k(x_i)   The k-nearest neighbor list for x_i (excluding x_i)
M_k(x_i)   The k-nearest mutual neighbor list for x_i (excluding x_i)
P          A set of core genes
CP         A set of core clusters (before refinement)
C          A set of clusters (after refinement)
K          The number of clusters
α          A threshold that determines a core gene
β          A threshold that determines a noise gene

Table 4.1: Summary of symbols and corresponding meanings
In contrast, if k is too large, then a few large-size clusters are formed. In what follows,
we explain how to decide the values of α and β when k is fixed.
4.4 Density-based Clustering Using a Mutual Neighborhood
The proposed algorithm proceeds in four phases: mutual neighborhood graph construction,
identification of core genes and outlier genes, core gene clustering, and assignment
of border genes. First, we define a mutual neighborhood graph, which is essential to our
further discussions.
4.4.1 Construction of the Mutual Neighborhood Graph
We first provide the definition of a k-nearest mutual neighborhood (k-MNN).

Definition 4.4.1: [k-nearest mutual neighborhood]. Given x_i and x_j, if N_k(x_i)
contains x_j and N_k(x_j) contains x_i, then x_i and x_j are referred to as k-nearest
mutual neighbors of each other. The k-nearest mutual neighborhood is defined by the
k-nearest mutual neighbor list for x_i, and is denoted by M_k(x_i).
With respect to k-MNN and k-NN, the following properties hold.

1. N_k(x_i) ⊆ N_{k+1}(x_i) and M_k(x_i) ⊆ M_{k+1}(x_i) for any x_i and k ≥ 1.

2. M_k(x_i) ⊆ N_k(x_i) for any x_i and k ≥ 1. Thus, k-MNN provides a tighter form of
neighborhood than k-NN does.

Given an input gene expression dataset and a maximum number of neighbors k, where
0 < k < n, we build a k-nearest mutual neighborhood graph MG_k = (U, E). In this graph,
U is the set of genes, and (u, v) ∈ E if and only if u and v are k-nearest mutual
neighbors of each other. Note that the value of k in the k-NN density and the value of k
in MG_k are different.
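The construction of MG_k can be sketched as follows. The similarity matrix and the function names are illustrative; this is not the dissertation's implementation.

```python
def k_nearest(sim, i, k):
    """Indices of the k genes most similar to gene i (excluding i)."""
    others = sorted((j for j in range(len(sim)) if j != i),
                    key=lambda j: sim[i][j], reverse=True)
    return set(others[:k])

def mutual_neighborhood_graph(sim, k):
    """Edge set of MG_k: {i, j} is an edge iff i is in N_k(j) and j is in N_k(i)."""
    n = len(sim)
    nk = [k_nearest(sim, i, k) for i in range(n)]
    return {frozenset((i, j)) for i in range(n) for j in nk[i] if i in nk[j]}

# Toy 4-gene similarity matrix: genes 0,1 and genes 2,3 are mutual neighbors at k=1.
sim = [[1.0, 0.9, 0.2, 0.1],
       [0.9, 1.0, 0.3, 0.2],
       [0.2, 0.3, 1.0, 0.8],
       [0.1, 0.2, 0.8, 1.0]]
edges = mutual_neighborhood_graph(sim, k=1)
print(sorted(sorted(e) for e in edges))  # -> [[0, 1], [2, 3]]
```

The mutuality requirement is what makes M_k(x_i) a subset of N_k(x_i): an asymmetric nearest-neighbor relation, typical of a point on a cluster boundary, does not produce an edge.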
4.4.2 Identification of a Rough Cluster Structure
A rough cluster structure can be derived by performing clustering on the core genes. This
step is based on the following observation. Since border and outlier genes are excluded
in the rough cluster identification step, each cluster is expected to be well separated
from the others. That is, if two core genes belong to different clusters, then they are
expected to be far from each other; similarly, if two core genes should belong to the same
cluster, then they are expected to be significantly proximate to each other. Thus, the key
part of rough cluster structure identification is to find a set of relevant core genes.
The set of connected components (denoted by C_k) in a mutual neighborhood graph (MG_k)
provides a clue for this step. We first define intra-component connectivity as follows:

T_1(c_k) = \frac{2}{|c_k|(|c_k| - 1)} \sum_{x_i, x_j \in c_k,\, i < j} r(x_i, x_j)    (4.5)

Note that T_1 measures how strongly the genes in the same component are connected to each
other. Thus, T_1 can be utilized to measure the density of a component; similarly, T_1
can also be utilized to measure the density of a cluster. In the following, we establish
the possible key relationships for connected components.
1. The value of T_1 that a connected component can have is fairly diverse. The rationale
behind this corresponds to the fact that a cluster in high-dimensional space can have
diverse densities.

2. Densities of genes belonging to the same component are similar. This implies that
the components in a mutual neighborhood graph can capture uniform-density regions.

3. Different mutual neighborhood graphs can be defined by different values of k. As
the value of k increases, in general, the number of connected components decreases,
and the size of connected components increases. However, in some cases, we can
find the same components in different mutual neighborhood graphs. We define such
components as invariant components. Most such components turn out to be
collections of outlier genes. Thus, our argument here is that invariant components
do not contribute to core gene selection.
4. A large-size component (with a very low intra-component connectivity) may be
obtained. The rationale behind this corresponds to the fact that several dissimilar
components can be connected by genes in intermediate regions (we refer to this as
a chaining effect).

5. A large number of genes may form singleton components (components of size 1). If the
density of such a gene is low, then it is likely an outlier gene. If a high-density gene
falls into a singleton component, then it might be located on the boundary of a dense
region. Finally, if a medium-density gene forms a singleton component, then it is located
on the boundary of a cluster.

6. If a high-density gene falls into a medium-size component, then it might be located
at the center of a dense region; consequently, the intra-connectivity of the component is
expected to be very high. Similarly, if a medium-density gene belongs to a medium-size
component, then it might be located at the center of a medium-density region. If
a low-density gene belongs to a medium-size component, then it is at the center of a
low-density cluster.
As the value of k increases, chaining effects are encountered; consequently, the sizes
of connected components tend to increase rapidly. Since we want our algorithm to be
insensitive to the value of k, we introduce the following subcomponent decomposition
result to avoid this chaining effect.
Lemma 4.4.2: Let C^{k+1} = {c_1^{k+1}, c_2^{k+1}, ...} and C^k = {c_1^k, c_2^k, ...} be
the sets of connected components obtained from MG_{k+1} and MG_k, respectively. If
g ∈ c_l^{k+1}, then there exists i_0 such that g ∈ c_{i_0}^k, and every gene in c_{i_0}^k
also belongs to c_l^{k+1}.
Proof: Since ∪_i c_i^k = ∪_i c_i^{k+1}, for every g ∈ c_l^{k+1} there exists i_0 such
that g ∈ c_{i_0}^k. Thus, it is sufficient to show that every gene in c_{i_0}^k belongs
to c_l^{k+1}. We prove this by contradiction. Suppose there exists g' ∈ c_{i_0}^k such
that g' ∉ c_l^{k+1}. Since g, g' ∈ c_{i_0}^k, one of the following must hold:
g ∈ M_k(g') (case 1), or there exist g_{i_0}, ..., g_{i_n} ∈ c_{i_0}^k such that
g ∈ M_k(g_{i_0}), g_{i_0} ∈ M_k(g_{i_1}), ..., g_{i_n} ∈ M_k(g') (case 2).

Case 1: Since M_k(g') ⊆ M_{k+1}(g'), g must belong to M_{k+1}(g'). That is, g and g' are
(k+1)-nearest mutual neighbors of each other. Thus, there exists a direct edge between
g and g' in MG_{k+1}.

Case 2: Since M_k(x) ⊆ M_{k+1}(x) for any x, we have g ∈ M_{k+1}(g_{i_0}),
g_{i_0} ∈ M_{k+1}(g_{i_1}), ..., g_{i_n} ∈ M_{k+1}(g'). That is, there exists a path
between g and g' in MG_{k+1}.

In either case, g and g' must belong to the same connected component in MG_{k+1}. Since
g ∈ c_l^{k+1}, g' must also belong to c_l^{k+1}. This contradicts our assumption that
g' ∉ c_l^{k+1}. Therefore, every gene in c_{i_0}^k belongs to c_l^{k+1}.

Theorem 4.4.3: For any c^{k+1} ∈ C^{k+1}, there exists I = {i_0, ..., i_t} such that
c^{k+1} = ∪_{i ∈ I} c_i^k.

Proof: This is straightforward by Lemma 4.4.2.
Thus, based on the above theorem, we can provide filtering heuristics as follows:
(1) If genes form singleton components or invariant components, then they have a low
possibility of being core genes. (2) If a large-size component (with a low intra-component
connectivity) is obtained, then this component is composed of disparate subcomponents
due to the chaining effect. Thus, we decompose this component into subcomponents, and
recursively apply the first rule to these subcomponents. In the following, we provide
formal definitions of core genes and outlier genes that encompass the above relationships.
Definition 4.4.4: [Core genes]. Given MG_k, core genes are formalized as locally
dense areas in the original data, which are captured by the local maxima of the remaining
components (after filtering) in MG_k.

Definition 4.4.5: [Outlier genes]. Given MG_k, low k-NN density genes that form
singleton components or invariant components are defined as outlier genes.
Once core genes are selected, the next step is to perform clustering on the core genes.
Since core genes are expected to be well separated, rather than employing complex
algorithms, we use a standard implementation of greedy agglomerative clustering. That is,
each gene is initialized as a cluster of size one, the similarities between all pairs of
clusters are computed, and the two closest clusters are repeatedly merged until the number
of clusters is reduced to a target number or the maximum similarity falls below a
pre-defined threshold. In this dissertation, we selected the group average scheme (i.e.,
the similarity between two clusters is defined as the average pairwise similarity between
their members) for the similarity update.
There are three main computationally expensive steps here. The first step is the
k-nearest neighborhood construction for all genes, which takes O(n²). However, for
low-dimensional data, the complexity can be reduced to O(n log n) if spatial data
structures such as the X-tree [14] are utilized (note that if the number of dimensions is
high, dimensionality reduction needs to be performed as a preprocessing stage). The
second step is the construction of the mutual neighborhood graph and the identification
of connected components in it. Identification of the set of maximally connected
components in a graph can be performed efficiently in O(n) time, where n is the number
of vertices [31]. The third step concerns core gene clustering. In the group average
scheme, the similarity between clusters C_i and C_j does not change during the
agglomerative steps unless C_i or C_j is selected to be merged. Thus, all similarities can
be computed once for each pair of clusters and inserted into a priority queue. If clusters
C_i and C_j are selected to form a new cluster C_p, then any similarities involving either
C_i or C_j are removed from the queue, and the similarities of the remaining clusters with
C_p are inserted. This step takes O(n² log n) if the priority queue is implemented using
a binary heap. In sum, the worst-case time complexity is O(n² log n).
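The priority-queue scheme described above can be sketched as follows. This is an illustrative fragment, not the dissertation's implementation; it uses lazy deletion (stale heap entries are discarded when popped) instead of explicit removal, which is a common simplification with the same asymptotic behavior.

```python
import heapq

def group_average_clustering(sim, target_k):
    """Greedy agglomerative clustering with group-average similarity updates.

    Pairwise similarities live in a max-heap (stored negated); entries whose
    clusters were already merged are lazily discarded when popped.
    """
    clusters = {i: [i] for i in range(len(sim))}

    def group_sim(a, b):
        # average pairwise similarity between the members of clusters a and b
        return sum(sim[i][j] for i in clusters[a] for j in clusters[b]) / (
            len(clusters[a]) * len(clusters[b]))

    heap = [(-group_sim(a, b), a, b) for a in clusters for b in clusters if a < b]
    heapq.heapify(heap)
    next_id = len(sim)
    while len(clusters) > target_k and heap:
        _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue                      # stale entry: a or b already merged
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        for c in clusters:
            if c != next_id:
                heapq.heappush(heap, (-group_sim(c, next_id), c, next_id))
        next_id += 1
    return [sorted(members) for members in clusters.values()]

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(sorted(group_average_clustering(sim, 2)))  # -> [[0, 1], [2, 3]]
```

With a binary heap, each of the O(n) merges pushes O(n) similarities at O(log n) cost, matching the O(n² log n) bound given above for this step.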
4.4.3 Cluster Expansion
Once a rough cluster structure is obtained, the next step is to identify the relevant
cluster that can host each border gene. This is achieved by selecting the cluster with
the largest proximity to the border gene.
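A sketch of this assignment step, assuming the average similarity to a cluster's members as the proximity measure (the text above does not pin the measure down, so this particular choice is illustrative):

```python
def assign_border_genes(sim, clusters, border):
    """Attach each border gene to the cluster with the largest average similarity."""
    assignment = {}
    for g in border:
        best = max(range(len(clusters)),
                   key=lambda c: sum(sim[g][m] for m in clusters[c]) / len(clusters[c]))
        assignment[g] = best
    return assignment

sim = [[1.0, 0.9, 0.1, 0.7],
       [0.9, 1.0, 0.2, 0.6],
       [0.2, 0.2, 1.0, 0.3],
       [0.7, 0.6, 0.3, 1.0]]
clusters = [[0, 1], [2]]          # rough clusters over the core genes
print(assign_border_genes(sim, clusters, border=[3]))  # -> {3: 0}
```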
4.5 Experimental Results
4.5.1 Experimental Setup
The proposed clustering algorithm was tested on yeast cell cycle data. The data is a
subset of the original 6,220 genes with 17 time points listed by [26]. Cho et al. sampled
17 time points at 10-minute intervals, covering nearly two full cell cycles of the yeast
Saccharomyces cerevisiae. Among the 6,220 genes, 2,321 genes were selected based on the
largest variance in their expression. In addition, one abnormal time point was removed
from the data set as suggested by [126]; consequently, the resulting data consists of
2,321 genes with 16 time points. Our analysis primarily focuses on this data set.
4.5.2 Evaluation Metrics
For the empirical evaluation of the proposed clustering algorithm, we first describe the
evaluation criteria. There are, in general, three measures for evaluating the accuracy
of a clustering algorithm. When a true solution exists, we can compare the true solution
to the obtained result by computing the proportion of correctly identified items. On the
other hand, when the true solution is unknown, the most widely used metrics are how
homogeneous each cluster is (Equation 4.6), which can be estimated using the sum of
average pairwise similarities between genes assigned to the same cluster, and how well
separated the clusters are from each other (Equation 4.7).

In general, T_1 favors a large number of small-size clusters, and T_2 favors a small
number of large-size clusters. In sum, we favor a clustering solution with the largest
value of T_1 and the smallest value of T_2 (i.e., maximize within-cluster similarity and
minimize between-cluster similarity).

T_1 = \sum_{k=1}^{K} \frac{1}{|C_k|(|C_k| - 1)} \sum_{x_i, x_j \in C_k,\, i \ne j} r(x_i, x_j)    (4.6)

T_2 = \sum_{i \ne j} \max\{ r(x_i, x_j) \mid x_i \in C_i,\, x_j \in C_j \}    (4.7)
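Assuming T_1 sums the average pairwise within-cluster similarity over clusters, and T_2 sums the single strongest between-cluster similarity over cluster pairs (one plausible reading of Equations 4.6 and 4.7), the two metrics can be sketched as follows; the names and toy matrix are illustrative.

```python
def homogeneity(sim, clusters):
    """T1: sum over clusters of average pairwise within-cluster similarity."""
    total = 0.0
    for c in clusters:
        pairs = [(i, j) for i in c for j in c if i < j]
        total += sum(sim[i][j] for i, j in pairs) / len(pairs)
    return total

def separation(sim, clusters):
    """T2: sum over cluster pairs of their single strongest between-cluster link."""
    total = 0.0
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            total += max(sim[i][j] for i in clusters[a] for j in clusters[b])
    return total

sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
good = [[0, 1], [2, 3]]
bad = [[0, 2], [1, 3]]
print(homogeneity(sim, good) > homogeneity(sim, bad))  # -> True
```

On the toy matrix, the well-separated partition scores higher on T_1 and lower on T_2 than the mismatched one, as the criterion above requires.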
4.5.3 Comparative Algorithms
For the comparison, we utilized the EXPANDER [118] software, which implements K-means,
SOM, and CLICK.

Regarding K-means, since we are using Pearson correlation instead of Euclidean
distance, there is a question of whether K-means will still work. However, if the data
is standardized by subtracting off the mean and dividing by the standard deviation,
K-means will work with Pearson correlation, since the Pearson correlation and the
Euclidean distance are then monotonically related as follows:

r(x_i^*, x_j^*) = 1 - \frac{d^2(x_i^*, x_j^*)}{2m}    (4.8)
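The relation in Equation 4.8 can be checked numerically, assuming the vectors are standardized with the population standard deviation so that each has mean 0 and squared norm m. The sketch below verifies r = 1 - d²/(2m) on two toy vectors; the names are illustrative.

```python
import math

def standardize(x):
    """Z-standardize with the population standard deviation (divide by m)."""
    m = len(x)
    mu = sum(x) / m
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / m)
    return [(v - mu) / sigma for v in x]

def pearson(x, y):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

x = standardize([1.0, 4.0, 2.0, 8.0, 5.0])
y = standardize([2.0, 3.0, 7.0, 6.0, 4.0])
m = len(x)
# Equation 4.8: for z-standardized vectors, r = 1 - d^2 / (2m)
print(abs(pearson(x, y) - (1 - sq_dist(x, y) / (2 * m))) < 1e-9)  # -> True
```

The identity follows from d² = Σx² + Σy² - 2Σxy = 2m - 2m·r when Σx² = Σy² = m, which is why minimizing squared Euclidean distance on standardized data is equivalent to maximizing Pearson correlation.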
Famili et al. showed that 21 is the most relevant number of clusters when K-means
is applied to Cho's data [45]. Thus, we can fix the number of clusters in advance.
Moreover, to overcome the sensitivity to initial seed selection, we perform K-means
clustering multiple times.
In CLICK, a homogeneity value needs to be set to control the homogeneity of the
resulting clustering. This parameter serves as a threshold in various steps of the
algorithm, including the definition of cluster kernels, singleton adoption, and kernel
merging. The higher the value assigned to this parameter, the tighter the resulting
clusters. In our experiment, the default value was used.
4.5.4 Experimental Results
The proposed clustering algorithm grouped the genes into 17 clusters. These clusters
are shown in Figure 4.2. Table 4.3 shows how well the proposed algorithm successfully
separated biologically meaningful patterns into five clusters. The detection ratio of our
approach is 65.3%. CLICK failed to separate the biologically meaningful patterns into
five clusters. With respect to K-means, it sometimes failed to detect the five clusters
due to its random nature; in the other cases, it detected the five cell cycle clusters
with a detection ratio of 59.5% on average (maximum: 63.5%) over 50 runs. SOM also
sometimes failed to detect the five clusters due to its random nature; in the other
cases, it detected the five cell cycle clusters with a detection ratio of 48.4% on
average (maximum: 55.8%) over 50 runs. Thus, in terms of biologically meaningful cluster
detection, our clustering works better than K-means, SOM, or CLICK.
A summary of separation and homogeneity for each clustering algorithm is shown
in Table 4.2. As shown, in terms of homogeneity, since K-means tries to maximize
within-cluster similarity, it worked best (but it did not work well on between-cluster
similarity). Similarly, since CLICK tries to minimize between-cluster similarity, it
worked best on between-cluster similarity (but it did not work well on within-cluster
similarity). However, we could observe that our clustering algorithm produced reasonable
between- and within-cluster similarity even though we employed a simple greedy
agglomerative approach for core gene clustering. Thus, this supports our argument that
our core gene selection method works effectively.
              K-means   CLICK   SOM   Our Method
no. clusters  21        18      30    17
T_1           0.52      0.42    0.43  0.50
T_2           0.32      0.19    0.38  0.23

Table 4.2: A comparison between the proposed method and other approaches on Cho's data
Figure 4.2: Mean expression patterns for Cho's data obtained by the proposed
method. The horizontal axis represents time points between 0-80 and 100-160 minutes.
The vertical axis represents normalized expression level. The error bar at each time
point delineates the standard deviation.
Cell cycle Proportion Meaningful Cluster
Early G1 12/32 Cluster 1
Late G1 70/87 Cluster 2
S phase 19/48 Cluster 10
G2 phase 18/28 Cluster 11
M 28/30 Cluster 16
Table 4.3: Proportion of biologically characterized genes in meaningful clusters versus
those in [26]
Chapter 5
Conclusions
In this dissertation, we presented a mining framework that is vital to intelligent
information analysis. An experimental prototype has been developed, implemented, and
tested to demonstrate the effectiveness of the proposed framework.
In order to accommodate data that are frequently inserted over time, we developed an
incremental document clustering algorithm based on a neighborhood search. The presented
clustering algorithm can identify news event clusters as well as topic clusters
incrementally. We also showed that the presented topic ontologies can characterize news
topics at multiple levels of abstraction. The proposed model therefore extends the state
of the art in information retrieval and data mining by assisting users in locating and
viewing information.
In addition, in order to analyze static data effectively, we developed a density-based
clustering algorithm that works in batch mode. We tested the developed algorithm on a
yeast cell cycle dataset. The proposed algorithm utilizes a mutual neighborhood graph in
order to effectively determine core and outlier genes.
Chapter 6
Future Work
6.1 Topic Mining
We intend to extend the presented topic mining framework in the following five
directions. First, although a document hierarchy can be obtained using unsupervised
clustering, as shown in Aggarwal et al. [1], the cluster quality can be enhanced if a
pre-existing knowledge base is exploited. That is, based on this a priori knowledge, we
can exercise some control while building a document hierarchy.
Second, besides exploiting text data, we can utilize other information, since Web news
articles are composed of text, hyperlinks, and multimedia data. For example, as described
in [72], both terms and hyperlinks (which point to related news articles or Web pages)
can be used for feature selection.
Third, coupling with WordNet [99], we plan to extend the topic ontology learning
framework to accommodate rich semantic information extraction. To this end, we will
annotate a topic ontology within Protege [103].
Fourth, we plan to develop an ontology-based document representation model. By
generalizing a term into a concept, we expect to correctly identify similarity
relationships between documents. Toward this end, our ontology-based document
representation strategy consists of two phases. First, representative term sets are
extracted from documents. Second, using term set T_1 for document d_1 and term set T_2
for document d_2, we can define the similarity between d_1 and d_2 using the aggregated
similarity between T_1 and T_2. The simple way to realize the first phase is to take
terms from documents with high term frequency. However, the second phase is not simple.
Recently, some approaches have been proposed to evaluate similarity between concepts in
an ontology [86, 112]. One of them is semantic similarity, which relies on the
incorporation of empirical probability estimates into an ontology. To this end, it
utilizes statistical and/or topological information about concepts and their
interrelationships. Once we define similarity between documents, we can perform
clustering on the obtained similarity matrix. In sum, our key objective is to incorporate
an ontology-based document representation model (instead of the BOW model) into the
topic mining framework, and to answer several important questions, including the
following: (1) Is the ontology-based document representation model useful for improving
the accuracy of document clustering? Why or why not? (2) Can the ontology-based document
representation model speed up the convergence of clustering when we use
iterative-optimization clustering?
Finally, based on the observation that visual presentation of topics/concepts enables
rapid understanding of the underlying characteristics of data, we plan to develop a
visualization tool for topic mining.
6.2 Investigation of the Applicability of Topic Mining to
Earth Science Datasets
To strengthen our work in terms of generality, it is worthwhile to investigate the
potential applicability of topic mining to earth science information streams. There have
been several attempts to apply data mining techniques to earth science datasets in the
past few years [121, 122, 124]. However, our work will be unique relative to the previous
work in that we are dealing with information streams. In many emerging science and
business applications, data takes the form of streams rather than static datasets [7].
For example, such applications require handling data in the form of a stream, such as
sensor data, stock price data, network traffic flow, Web click streams, and so on.
Similarly, in the earth science domain, different kinds of measurements (e.g., water
vapor pressure, temperature, etc.) at certain coordinates on earth can be gathered
through Global Positioning System (GPS) receivers on an hourly basis, which raises new
research challenges for the data mining community. In traditional data mining, mining
operations are performed over static datasets, i.e., the input data can be loaded into
main memory several times. However, when we deal with data streams, data mining in batch
mode becomes technically infeasible, since not all the data can be read into memory
several times. Thus, conventional data mining algorithms for static datasets will break
down in our scenario. Therefore, the data mining toolkits must be equipped with the
ability to analyze information streams within one pass (unlike in traditional data
mining), and to leverage that data in real time.
Toward this end, we define a new research project, referred to as MIESIS (Mining
Earth Science Information Streams). In order to enable scientists to adapt to new earth
science phenomena quickly, MIESIS will present innovative data mining toolkits to
discover interesting patterns from earth science datasets. To cope with the very dynamic
nature of the problem, it is crucial to develop efficient and scalable incremental data
mining tools for information streams. Hence, methods for efficiently analyzing and
retrieving these information streams will be central challenges in MIESIS. Research areas
closely related to MIESIS include time-series data mining [24, 2] and data stream
processing [55, 7]. We will complement our research by exploring and implementing
existing algorithms from these fields.

Our vision for MIESIS is to build a solid foundation for earth science information
stream mining. By providing intelligent and scalable data mining toolkits, earth
scientists can easily comprehend and interpret the available data, and adapt to new
scientific changes quickly. To enable this vision, besides the Cluster Analysis component,
we have identified the following four core requirements:
• Information Streams Segmentation. Segmentation is the process of transforming
information streams into a piecewise linear representation. To determine a set of
segments, we must find breakpoints (i.e., isolation points), each of which separates
two adjacent segments.
• Discretization of Information Streams. After we construct a cluster model, the next
step is to discretize the information streams. By substituting each segment with a
symbol, we can represent a continuous information stream as a set of symbols. This
discretized representation is easy to manipulate and is particularly useful when
processing similarity queries.
• Novelty Detection Model Construction. Novelty detection, also referred to as anomaly
detection, is the process of identifying unforeseen events in a large data collection.
In earth science, scientists are often more interested in unknown patterns than in
known events. For instance, if the temperature of a certain region increases abnormally
sharply, then that is a candidate for a novel event.
• Prediction Model Construction. Since conventional prediction models deal with
numerical time-series datasets, they can only predict future numerical values. However,
in real-life applications, the utility of a prediction model lies in detecting events
rather than in predicting numerical values. An event here refers to a meaningful
object to which we can assign some semantics, e.g., an earthquake or a flood.
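To make the segmentation and discretization requirements above concrete, the following Python sketch (hypothetical; the sliding-window fit, the error threshold, and the three-symbol alphabet U/D/F are our illustrative choices, not the MIESIS design) grows each segment until a linear fit breaks down, then labels every segment by the sign of its slope:

```python
# Hypothetical sketch of segmentation + discretization. A segment grows
# while a least-squares line fits it within max_error; each finished
# segment is mapped to a symbol by its slope: U(p), D(own), F(lat).

def fit_slope(series, start, end):
    """Least-squares slope of series[start:end] against its indices."""
    xs = list(range(start, end))
    n = end - start
    sx, sy = sum(xs), sum(series[start:end])
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, series[start:end]))
    denom = n * sxx - sx * sx
    return (n * sxy - sx * sy) / denom if denom else 0.0

def max_residual(series, start, end):
    """Largest absolute error of the fitted line over series[start:end]."""
    a = fit_slope(series, start, end)
    xs = list(range(start, end))
    b = (sum(series[start:end]) - a * sum(xs)) / (end - start)
    return max(abs(series[x] - (a * x + b)) for x in xs)

def segment(series, max_error=0.5):
    """Greedy sliding-window segmentation: close a segment at the first
    point where the linear fit's worst residual exceeds max_error."""
    segments, start, end = [], 0, 2
    while end <= len(series):
        if end - start >= 3 and max_residual(series, start, end) > max_error:
            segments.append((start, end - 1))
            start = end - 1
        end += 1
    segments.append((start, len(series)))
    return segments

def discretize(series, segments):
    """Replace each segment with a symbol derived from its slope."""
    symbols = []
    for s, e in segments:
        a = fit_slope(series, s, e)
        symbols.append('U' if a > 0.1 else 'D' if a < -0.1 else 'F')
    return ''.join(symbols)
```

A rise-plateau-fall series thus collapses to the string "UFD", which is the kind of compact symbolic representation the similarity queries operate on.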
Each of these requirements enables a core functionality of the earth science
information streams mining system. Information Streams Segmentation divides a
data stream into a set of piecewise linear segments. The set of all segments is
then clustered in the Cluster Analysis module, and different symbols are used to label
the clusters. By substituting each segment with the symbol associated with its cluster,
we can represent a continuous information stream as a set of discrete symbols. Based
on this representation, we implement new ideas to build the Novelty Detection and
Prediction Models using grammatical inference, regression, classification, and Hidden
Markov Models [129]. Spatio-temporal events such as teleconnections can also be detected
using similar methodologies. Figure 6.1 illustrates the functional architecture of the
proposed system.

Figure 6.1: MIESIS system architecture
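As a toy stand-in for the grammatical-inference and Markov-model machinery (all names and the probability threshold below are our own, hypothetical choices), a novelty detector over the discretized stream can learn bigram transition counts and flag transitions whose estimated probability is too low:

```python
# Hypothetical novelty-detection sketch over a symbol stream: learn
# bigram transition counts from history, then flag transitions that were
# rare or unseen. A full system would use richer models (HMMs, grammars).

from collections import Counter, defaultdict

def train_transitions(symbols):
    """Count how often each symbol is followed by each other symbol."""
    counts = defaultdict(Counter)
    for a, b in zip(symbols, symbols[1:]):
        counts[a][b] += 1
    return counts

def is_novel(counts, a, b, min_prob=0.05):
    """Flag the transition a -> b if its estimated probability is tiny."""
    total = sum(counts[a].values())
    if total == 0:
        return True  # symbol a was never followed by anything before
    return counts[a][b] / total < min_prob
```

For instance, after training on a history of rise-plateau-fall cycles, an abrupt rise-to-fall transition never seen before would be flagged as a candidate novel event.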
Since the datasets in earth science consist of various time-series measurements of
climate variables (e.g., temperature, water vapor pressure, precipitation), they are
characterized by high dimensionality, spatio-temporal structure, huge volume,
heterogeneous formats, and noisy continuous streams. Thus, clustering algorithms must
be able to address and exploit these features of the datasets.
A key issue in Cluster Analysis is how to apply our previous clustering algorithm to
earth science information streams. Since the target of clustering in MIESIS is a set of
segments, a multidimensional index structure is needed to discover the neighborhood
of each segment. However, due to the high dimensionality of the segments, neighborhood
search based on a multidimensional index structure is computationally impractical.
Hence, the proposed algorithm is applicable to MIESIS only if we can find a fast
similarity search method for the segments.
To overcome the curse of dimensionality, we first transform the segments into a lower-dimensional
space. In time-series data mining research, the Discrete Fourier Transform
(DFT) and the Discrete Wavelet Transform (DWT) have been widely used for this
purpose [24, 2]. That is, since DFT or DWT preserves the essentials of the data in the
first few coefficients, a multidimensional index structure can be constructed over the
first few coefficients in the transformed space. In MIESIS, we will employ wavelet
transforms for dimensionality reduction.
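A minimal orthonormal Haar transform illustrates the claim that the first few coefficients capture the essentials of a signal (a sketch restricted to power-of-two lengths; an actual implementation would rely on a wavelet library):

```python
# Minimal orthonormal Haar DWT sketch (power-of-two lengths only).
# Coarse averages accumulate at the front of the output; because the
# transform is orthonormal, total energy (sum of squares) is preserved.

import math

def haar(signal):
    """Full Haar decomposition of a power-of-two-length signal.
    Returns [overall average, coarse details, ..., finest details]."""
    coeffs = list(signal)
    n = len(coeffs)
    out = []
    while n > 1:
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2)
                for i in range(n // 2)]
        dets = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2)
                for i in range(n // 2)]
        out = dets + out  # finer details go toward the end
        coeffs = avgs
        n //= 2
    return coeffs + out
```

Because the transform is orthonormal, Euclidean distances are preserved, so truncating a segment to a prefix of its coefficients can only underestimate the true distance between two segments.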
One of the key requirements in our clustering is to support fast neighborhood search.
Since we can represent the segments in the low-dimensional transformed space using
wavelets, a multidimensional index structure can serve to discover the neighborhood.
Toward this end, we will adopt XXL (eXtensible and fleXible Library), which is a
Java library that implements advanced query processing functionalities [15]. The library
provides a high-level framework for many different index structures, which are ready to
use. Moreover, it includes query processing packages for these index structures, ranging
from nearest-neighbor queries to range queries.
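XXL itself is a Java library; the Python sketch below (illustrative, with a linear scan standing in for a real index structure) only shows the filter-and-refine idea behind indexing truncated coefficients: since the distance over a prefix of orthonormal transform coefficients lower-bounds the full distance, most candidates can be pruned before refinement:

```python
# Illustrative filter-and-refine nearest-neighbor sketch (a linear scan
# stands in for a real index). The prefix distance over orthonormal
# coefficients lower-bounds the full distance, so it can prune safely.

def prefix_dist2(a, b, k):
    """Squared distance over the first k coefficients (a lower bound)."""
    return sum((x - y) ** 2 for x, y in zip(a[:k], b[:k]))

def nearest(query, dataset, k=4):
    """dataset: list of (coefficient_vector, payload) pairs."""
    best, best_d = None, float('inf')
    for coeffs, payload in dataset:
        if prefix_dist2(query, coeffs, k) >= best_d:
            continue  # lower bound already too large: prune
        d = sum((x - y) ** 2 for x, y in zip(query, coeffs))  # refine
        if d < best_d:
            best, best_d = payload, d
    return best
```

In the full system the pruning step would be performed by the index structure itself rather than by a scan.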
Previous wavelet-based data mining research assumes that all data is regular, i.e.,
data values exist at equally-spaced points. However, in MIESIS, data need not be regular,
since values at some points may be unavailable. For instance, due to a GPS failure, we
may not gather climate data from a certain region during some time interval, resulting in
missing values. In this situation, the wavelet transform is not directly applicable. The
conventional approach to this problem is interpolation, which fills in the missing values.
In MIESIS, instead of relying on this naïve approach, we propose a novel
and unique method based on second-generation wavelets using lifting [12, 28].
Second-generation wavelets consider more general scenarios. For instance, the data
need not be located at equally-spaced points (i.e., it may be irregular). Moreover, they
can deal with finite data whose size need not be an integral power of two (an assumption
made by most wavelet-based data mining research). In MIESIS, we will employ the
"unbalanced Haar" transform, which generalizes the Haar wavelet transform to the
second-generation setting. An alternative way to build second-generation wavelets is
based on lifting. In MIESIS, we will use both frameworks for wavelet transforms in
the irregular setting.
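One lifting step on irregularly spaced samples can be sketched as follows (a simplified illustration with linear prediction and no update step; the function names are our own, not from the lifting literature):

```python
# Simplified lifting-step sketch for irregular samples: odd samples are
# predicted by linear interpolation between their even neighbours (using
# the actual positions), and only the prediction error is stored.

def lifting_step(xs, ys):
    """xs: sorted, possibly unevenly spaced positions; ys: values.
    Returns ((coarse_x, coarse_y), details): the even-indexed samples
    plus one residual per odd-indexed sample."""
    coarse_x = [xs[i] for i in range(0, len(xs), 2)]
    coarse_y = [ys[i] for i in range(0, len(xs), 2)]
    details = []
    for i in range(1, len(xs), 2):
        if i + 1 < len(xs):
            # interpolate between the surrounding even samples
            t = (xs[i] - xs[i - 1]) / (xs[i + 1] - xs[i - 1])
            pred = (1 - t) * ys[i - 1] + t * ys[i + 1]
        else:
            pred = ys[i - 1]  # boundary: extend the last even sample
        details.append(ys[i] - pred)
    return (coarse_x, coarse_y), details
```

On any locally linear signal the details vanish even when the spacing is uneven, so missing or irregularly sampled values no longer force an interpolation preprocessing pass.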
6.3 Crisis Management
In crisis management, since data arrives as high-speed information streams, we need to
process this decision-critical information very quickly. Hence, the mining algorithm must
be able to process the information stream in one pass and leverage the data in real time.
That is, it should be equipped with the capability of mining data online and identifying
(anomalous) patterns continuously. This online mining ability is particularly important
in homeland security applications, since issuing warnings about potential threats as early
as possible is critical. Furthermore, intelligence data can be distributed over several law
enforcement organizations (e.g., FBI, CIA). These data sources can be heterogeneous
(i.e., different database systems, different database schemas, etc.), since they are usually
maintained by different agencies. Hence, sharing and exchanging this information
effectively becomes a must in this situation. In sum, monitoring information streams and
sharing/exchanging heterogeneous information are two essential functionalities in
homeland security.
Reference List
[1] C. C. Aggarwal, S. C. Gates, and P. S. Yu. On the merits of using supervised clustering for building categorization systems. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[2] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms, 1993.
[3] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking: pilot study final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[4] J. Allan, V. Lavrenko, and H. Jin. First story detection in TDT is hard. In Proceedings of the 9th ACM International Conference on Information and Knowledge Management, 2000.
[5] J. Allan, R. Gupta, and V. Khandelwal. Temporal summaries of news topics. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[6] J. Allan. Topic detection and tracking: event-based information organization. Kluwer Academic Publishers, Norwell, MA, 2002.
[7] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the ACM Symposium on Principles of Database Systems, 2002.
[8] L. Badea. Functional discrimination of gene expression patterns in terms of the gene ontology. In Proceedings of the Pacific Symposium on Biocomputing, 2003.
[9] A. Bagga, and B. Baldwin. Algorithms for scoring coreference chains. In Proceedings of the 7th Message Understanding Conference, 1998.
[10] Z. Bar-Joseph, G. Gerber, D. Gifford, T. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In Proceedings of the Annual Conference on Research in Computational Molecular Biology, 2002.
[11] Z. Bar-Joseph. Analyzing time series gene expression data. Bioinformatics, 20(16):2493-2503, 2004.
[12] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD Record, 19(2):322-331, 1990.
[13] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281-297, 1999.
[14] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: an index structure for high dimensional data. In Proceedings of the 22nd International Conference on Very Large Data Bases, 1996.
[15] J. Bercken, B. Blohsfeld, J. P. Dittrich, J. Kramer, T. Schafer, M. Schneider, and B. Seeger. XXL - a library approach to supporting efficient implementations of advanced database queries. In Proceedings of the 27th International Conference on Very Large Data Bases, 2001.
[16] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.
[17] D. Bickel. Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically. Bioinformatics, 19(7):818-824, 2003.
[18] E. Bingham, and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[19] P. S. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998.
[20] J. D. Bo, P. Spyns, and R. Meersman. Assisting ontology integration with existing thesauri. In Proceedings of the 3rd International Conference on Ontologies, Databases, and Application of Semantics for Large Scale Information Systems, 2004.
[21] T. Brants, F. Chen, and A. Farahat. A system for new event detection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.
[22] J. Carbonell, and J. Goldstein. The use of MMR: diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
[23] K. Chakrabarti, and S. Mehrotra. Local dimensionality reduction: a new approach to indexing high dimensional spaces. In Proceedings of the 26th International Conference on Very Large Data Bases, 2000.
[24] K. Chan, and A. W. Fu. Efficient time series matching by wavelets. In Proceedings of the IEEE International Conference on Data Engineering, 1999.
[25] A. Y. Chen, A. Donnellan, D. McLeod, G. Fox, J. Parker, J. Rundle, L. Grant, M. Pierce, M. Gould, S. Chung, and S. Gao. Interoperability and semantics for heterogeneous earthquake science data. In Proceedings of the International Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
[26] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2(1):65-73, 1998.
[27] S. Chung, and D. McLeod. Dynamic topic mining from news stream data. In Proceedings of the 2nd International Conference on Ontologies, Databases, and Application of Semantics for Large Scale Information Systems, 2003.
[28] S. Chung, J. Jun, and D. McLeod. Mining gene expression datasets using density-based clustering. In Proceedings of the ACM Conference on Information and Knowledge Management, 2004.
[29] S. Chung, and D. McLeod. Dynamic pattern mining: an incremental data clustering approach. Journal on Data Semantics, 2:85-112, 2005.
[30] S. Chung, J. Jun, and D. McLeod. Incremental mining from news streams. To appear in Encyclopedia of Data Warehousing and Mining, Idea Group Inc.
[31] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to algorithms. The MIT Press, Cambridge, MA, 1989.
[32] H. Cui, M. Kan, and T. Chua. Unsupervised learning of soft patterns for generating definitions from online news. In Proceedings of the 13th Conference on World Wide Web, 2004.
[33] H. Davulcu, S. Vadrevu, and S. Nagarajan. OntoMiner: bootstrapping and populating ontologies from domain specific web sites. In Proceedings of the 1st VLDB International Workshop on Semantic Web and Databases, 2003.
[34] C. H. Q. Ding, X. He, H. Zha, and H. D. Simon. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2002 IEEE International Conference on Data Mining, 2002.
[35] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Ed.). Wiley, New York, 2001.
[36] S. T. Dumais. LSI meets TREC: a status report. In Proceedings of the 1st Text REtrieval Conference, 1992.
[37] S. T. Dumais. Latent semantic indexing (LSI) and TREC-2. In Proceedings of the 2nd Text REtrieval Conference, 1993.
[38] S. T. Dumais. Latent semantic indexing (LSI): TREC-3 report. In Proceedings of the 3rd Text REtrieval Conference, 1994.
[39] D. M. Dunlavy, J. Conroy, and D. P. O'Leary. QCS: a tool for querying, clustering, and summarizing documents. In Proceedings of the Human Language Technology Conference, 2003.
[40] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863-14868, 1998.
[41] S. Erdal, O. Ozturk, D. L. Armbruster, H. Ferhatosmanoglu, and W. C. Ray. A time series analysis of microarray data. In Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering, 2004.
[42] L. Ertoz, M. Steinbach, and V. Kumar. Finding topics in collections of documents: a shared nearest neighbor approach. In Clustering and Information Retrieval, Kluwer Academic Publishers, 2003.
[43] L. Ertoz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the SIAM International Conference on Data Mining, 2003.
[44] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996.
[45] A. F. Famili, G. Liu, and Z. Liu. Evaluation and optimization of clustering in gene expression data analysis. Bioinformatics, 20(10):1535-1545, 2004.
[46] U. M. Fayyad, C. Reina, and P. S. Bradley. Initialization of iterative refinement clustering algorithms. In Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998.
[47] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: the concept revisited. ACM Transactions on Information Systems, 20(1):116-131, 2002.
[48] A. Gasch, and M. Eisen. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11):1-22, 2002.
[49] I. Gat-Viks, R. Sharan, and R. Shamir. Scoring clustering solutions by their biological relevance. Bioinformatics, 19(18):2381-2389, 2003.
[50] E. J. Glover, D. M. Pennock, S. Lawrence, and R. Krovetz. Inferring hierarchical descriptions. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, 2002.
[51] G. H. Golub and C. F. van Loan. Matrix computations. North Oxford Academic, Oxford, UK, 1983.
[52] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(15):531-537, 1999.
[53] S. Guha, R. Rastogi, and K. Shim. CURE: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.
[54] S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. Information Systems, 25(5):345-366, 2000.
[55] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
[56] V. Guralnik, and G. Karypis. A scalable algorithm for clustering protein sequences. In Proceedings of the Workshop on Data Mining in Bioinformatics, in conjunction with the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[57] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984.
[58] J. Han, and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann Publishers, 2001.
[59] D. Harel and Y. Koren. Clustering spatial data using random walks. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining, 2001.
[60] D. Harel and Y. Koren. On clustering using random walks. In Proceedings of the 21st Foundations of Software Technology and Theoretical Computer Science, 2001.
[61] V. Hatzivassiloglou, L. Gravano, and A. Maganti. An investigation of linguistic features and clustering algorithms for topical document clustering. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
[62] J. Herrero, A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2):126-136, 2001.
[63] L. Hirschman, J. C. Park, J. Tsujii, L. Wong, and C. H. Wu. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12):1553-1561, 2002.
[64] D. Horn, and I. Axel. Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics, 19(9):1110-1115, 2003.
[65] P. J. Huber. Robust Statistics. Wiley, New York, 1981.
[66] T. R. Hvidsten, A. Lagreid, and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using Gene Ontology. Bioinformatics, 19(9):1116-1123, 2003.
[67] V. R. Iyer, M. B. Eisen, D. T. Ross, G. Schuler, T. Moore, J. C. F. Lee, J. M. Trent, L. M. Staudt, J. Hudson Jr, M. S. Boguski, D. Lashkari, D. Shalon, D. Botstein, and P. O. Brown. The transcriptional program in the response of human fibroblasts to serum. Science, 283(5398):83-87, 1999.
[68] R. A. Jarvis, and E. A. Patrick. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22:1025-1034, 1973.
[69] D. Jiang, J. Pei, and A. Zhang. DHC: a density-based hierarchical clustering method for time series gene expression data. In Proceedings of the 3rd IEEE International Symposium on Bioinformatics and BioEngineering, 2003.
[70] D. Jiang, J. Pei, and A. Zhang. Towards interactive exploration of gene expression patterns. ACM SIGKDD Explorations, 6(1):79-90, 2004.
[71] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370-1386, 2004.
[72] T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning, 2001.
[73] K. V. Kanth, D. Agrawal, A. E. Abbadi, and A. K. Singh. Dimensionality reduction for similarity searching in dynamic databases. Computer Vision and Image Understanding: CVIU, 75(1):59-72, 1999.
[74] G. Karypis, E. H. Han, and V. Kumar. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, 1999.
[75] L. Khan, and D. McLeod. Effective retrieval of audio information from annotated text using ontologies. In Proceedings of the ACM SIGKDD Workshop on Multimedia Data Mining, 2000.
[76] L. Khan, and D. McLeod. Disambiguation of annotated text of audio using ontologies. In Proceedings of the ACM SIGKDD Workshop on Text Mining, 2000.
[77] L. Khan, D. McLeod, and E. H. Hovy. Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1):71-85, 2004.
[78] E. M. Knorr, and R. T. Ng. Algorithms for mining distance-based outliers. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.
[79] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1997.
[80] W. Lam, P. Cheung, and R. Huang. Mining events and new name translations from online daily news. In Proceedings of the Joint ACM/IEEE Conference on Digital Libraries, 2004.
[81] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[82] D. Lawrie, and W. Croft. Discovering and comparing topic hierarchies. In Proceedings of RIAO, 2000.
[83] D. Lawrie, W. B. Croft, and A. Rosenberg. Finding topic words for hierarchical summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[84] D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 2000.
[85] C. Y. Lin, and E. H. Hovy. Automated multi-document summarization in NeATS. In Proceedings of the Human Language Technology (HLT) Conference, 2002.
[86] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, 1998.
[87] B. Liu, Y. Ma, and P. S. Yu. Discovering unexpected information from your competitors' web sites. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[88] B. Liu, C. W. Chin, and H. T. Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the 12th International World Wide Web Conference, 2003.
[89] X. Liu, Y. Gong, W. Xu and S. Zhu. Document clustering with cluster refinement and model selection capabilities. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[90] S. Lloyd. An optimization approach to relaxation labeling algorithms. Image and Vision Computing, 1(2), 1983.
[91] A. Maedche, and S. Staab. Ontology learning for the Semantic Web. IEEE Intelligent Systems, 16(2), 2001.
[92] A. Maguitman, D. Leake, T. Reichherzer, and F. Menczer. Dynamic extraction of topic descriptors and discriminators: towards automatic context-based topic search. In Proceedings of the 13th ACM Conference on Information and Knowledge Management, 2004.
[93] C. Manning, and H. Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
[94] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[95] K. R. McKeown, J. L. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. Towards multidocument summarization by reformulation: progress and prospects. In Proceedings of AAAI/IAAI, 1999.
[96] K. R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference, 2002.
[97] K. R. McKeown, R. Barzilay, J. Chen, D. K. Elson, D. K. Evans, J. Klavans, A. Nenkova, B. Schiffman, and S. Sigelman. Columbia's Newsblaster: new features and future directions. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2003.
[98] I. D. Melamed. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Proceedings of the 3rd Workshop on Very Large Corpora, 1995.
[99] G. Miller. WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235-312, 1990.
[100] A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11, 1999.
[101] S. Morishita, T. Hishiki, and K. Okubo. Towards mining gene expression database. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 1999.
[102] R. T. Ng, and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases, 1994.
[103] N. F. Noy, M. Sintek, S. Decker, M. Crubezy, R. W. Fergerson, and M. A. Musen. Creating Semantic Web contents with Protege-2000. IEEE Intelligent Systems, 16(2):60-71, 2001.
[104] R. Papka, J. Allan, and V. Lavrenko. UMass approaches to detection and tracking at TDT-2. In Proceedings of the Broadcast News Transcription and Understanding Workshop, 1999.
[105] Z. Pecenovic. Image retrieval using latent semantic indexing. Master's Thesis, Swiss Federal Institute of Technology, 1997.
[106] D. Pelleg, and A. Moore. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, 2000.
[107] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[108] D. R. Radev, S. Goldensohn, Z. Zhang, and R. S. Raghavan. NewsInEssence: a system for domain-independent, real-time news clustering and multi-document summarization. In Proceedings of the Human Language Technology Conference, 2001.
[109] D. R. Radev, S. Goldensohn, Z. Zhang, and R. S. Raghavan. Interactive, domain-independent identification and summarization of topically related news. In Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries, 2001.
[110] L. Ralaivola, and F. d'Alche-Buc. Incremental support vector machine learning: a local approach. In Proceedings of the Annual Conference of the European Neural Network Society, 2001.
[111] A. Ratnaparkhi. A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1996.
[112] P. Resnik. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 1999.
[113] M. Sahami. Using machine learning to improve information access. Ph.D. Thesis, Stanford University, 1999.
[114] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[115] M. Sanderson, and W. B. Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[116] H. Schütze and C. Silverstein. Projections for efficient document clustering. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997.
[117] C. Shahabi, S. Chung, M. Safar, and G. Hajj. A wavelet-based approach to improve the efficiency of multi-level trend mining. In Proceedings of the International Conference on Scientific and Statistical Database Management, 2001.
[118] R. Sharan, A. Maron-Katz, and R. Shamir. CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics, 19(14):1787-1799, 2003.
[119] D. Song, and P. D. Bruza. Towards context sensitive information inference. Journal of the American Society for Information Science and Technology, 54(4):321-334, 2003.
[120] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.
[121] M. Steinbach, P. Tan, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Clus
tering earth science data: goals, issues and results. In Proceedings of AC M SIGKDD
Workshop on Mining Scientific Dataset, 2001.
[122] M. Steinbach, P. Tan, V. Kumar, S. Klooster, and C. Potter. Discovery of climate
indices using clustering. In Proceedings of AC M SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2003.
[123] R. Stevens, C. Wroe, P. Lord, and C. Goble. Ontologies in bioinformatics. In S.
Staab and R. Studer, editors, Handbook on Ontologies, pages 635-657. Springer, 2003.
[124] P. E. Stolorz, and C. Dean. Quakefinder: A scalable data mining system for de
tecting earthquakes from space. In Proceedings of ACM SIGKDD International Con
ference on Knowledge Discovery and Data Mining, 1996.
[125] W. Sweldens, and P. Schroder. Building your own wavelets at home. Wavelets in
Computer Graphics, ACM SIGGRAPH Course Notes, 1996.
[126] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S.
Lander, and T. R. Golub. Interpreting patterns of gene expression with self-organizing
maps. Proceedings of the National Academy of Sciences, 96(6):2907-2912, 1999.
[127] A. Tanay, R. Sharan and R. Shamir. Biclustering algorithms: a survey. To appear
in the Handbook of Bioinformatics.
[128] S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. Systematic
determination of genetic network architecture. Nature Genetics, 22(3):281-285, 1999.
[129] V. N. Vapnik. The nature of statistical learning theory. Springer, 1995.
[130] E. M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings
of the 17th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 1994.
[131] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large
data sets. In Proceedings of ACM SIGMOD International Conference on Management
of Data, 2002.
[132] R. Weber, H. J. Schek, and S. Blott. Quantitative analysis and performance study
for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th
International Conference on Very Large Data Bases, 1998.
[133] Y. Xu, V. Olman, and D. Xu. Clustering gene expression data using a graph-
theoretic approach: an application of minimum spanning trees. Bioinformatics,
18(4):536-545, 2002.
[134] Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. T. Archibald, and X. Liu. Learning
approaches for detecting and tracking news events. IEEE Intelligent Systems,
14(4):32-43, 1999.
[135] Y. Yang, J. Zhang, J. Carbonell, and C. Jin. Topic-conditioned novelty detection.
In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2002.
[136] K. Y. Yeung and W. L. Ruzzo. An empirical study on principal component analysis
for clustering gene expression data. Bioinformatics, 17(9):763-774, 2001.
[137] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based
clustering and data transformations for gene expression data. Bioinformatics,
17(10):977-987, 2001.
[138] L. A. Zadeh. Similarity relations and fuzzy orderings. Information Sciences, 3:177-
200, 1971.
[139] H. Zeng, Q. He, Z. Chen, W. Ma, and J. Ma. Learning to cluster web search
results. In Proceedings of the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2004.
[140] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering
method for very large databases. In Proceedings of the ACM SIGMOD International
Conference on Management of Data, 1996.
[141] Y. Zhao, and G. Karypis. Evaluations of hierarchical clustering algorithms for
document datasets. In Proceedings of the 11th ACM International Conference on
Information and Knowledge Management, 2002.
[142] X. Zhou, M. J. Kao, and W. H. Wong. Transitive functional annotation by shortest
path analysis of gene expression data. Proceedings of the National Academy of
Sciences, pages 12783-12788, 2002.
Asset Metadata
Creator: Chung, Seokkyung
Title: Concept, topic, and pattern discovery using clustering
School: Graduate School, University of Southern California
Degree: Doctor of Philosophy (Computer Science)
Publisher: University of Southern California
Advisors: Dennis McLeod (committee chair); Kevin Knight, Larry Pryor, Cyrus Shahabi, and Roger Zimmermann (committee members)
Document Type: Dissertation
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-598333
Repository: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA