CLUSTERING 16S RRNA SEQUENCES:
AN ACCURATE AND EFFICIENT APPROACH
by
Yichao Dong
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
Thesis Committee:
Dr. Fengzhu Sun (Chair)
Dr. Liang Chen
Dr. Jed Fuhrman
August 2019
Copyright 2019 Yichao Dong
Acknowledgments
First and foremost, I would like to express my most sincere gratitude to my Ph.D.
advisors, Dr. Ting Chen and Dr. Fengzhu Sun, for the unreserved support and
the most enlightening teaching that they have given me throughout my study.
Working alongside them and being able to see their dedication to science and
their rigorous attitude toward scholarship have been the most inspiring experience for
me, and these good qualities, without a doubt, will continue to guide me through
the rest of my life.
My heartfelt appreciation also goes to my Ph.D. qualification and dissertation
committee faculty members, Dr. Michael S. Waterman, Dr. Liang Chen, Dr. Jed
Fuhrman and Dr. Aiichiro Nakano, for their words of encouragement, and their
extremely insightful comments that have greatly helped expand my horizon in the
subject.
I would also like to thank my collaborator, Linhao, and my good friends and
classmates Wenzheng, Haifeng, Tsu-Pei, Mingyoung, Satya, Ben, Erik and many others,
for working with me, for teaching me so many things, and for all the great moments
we spent together over the years during the Ph.D. program.
I am also immensely thankful to my parents, for instilling values of hard work
and good faith into me and for all the sacrifice they made for me. I also want
to thank my daughter, Hannah, for keeping me motivated and teaching me the
responsibilities of being a father. Lastly, words are simply not enough to describe
how grateful I am to my dear wife, Mengge Zhang, who gave me hope and strength
in moments of despair, and has been standing by my side in both good times and
hard times.
Contents
Acknowledgments
List of Figures
List of Tables
Abstract
1 Introduction
1.1 Overview of the 16S rRNA gene clustering problem
1.2 The hard threshold problem in OTU clustering
1.3 Current methods and challenges
1.3.1 Hierarchical clustering methods
1.3.2 Greedy heuristic clustering methods
1.3.3 Statistical model-based clustering method
CROP
DACE
2 Materials and Methods
2.1 Overview of the HCLUST algorithm
2.2 Building feature vectors from read sequences
2.3 Data partitioning with Locality Sensitive Hashing (LSH)
2.4 Clustering algorithm of HCLUST
2.4.1 Improved sequence alignment strategy
2.4.2 Reduction of sequence alignments with feature vector heuristics
2.4.3 Improvement of cluster center re-election strategy
2.4.4 HCLUST-G, the fast mode
2.5 Clustering evaluation methods and criteria
2.5.1 NMI score
2.5.2 Precision, Recall, and F-score
2.6 Test Data
2.6.1 In silico simulated data across three ecosystems
2.6.2 Clone43 dataset
2.6.3 Human gut V6 dataset
2.6.4 Lake Taihu dataset
2.7 Test configurations and environments
3 Results
3.1 In silico simulated data results
3.2 Clone43 dataset results
3.3 Human gut V6 dataset results
3.4 Running time benchmarking results with the Taihu dataset
4 Conclusion and Future Directions
4.1 Conclusion
4.2 Future Directions
4.2.1 Further enhance the scalability of HCLUST
4.2.2 Incorporate platform-based sequencing error correction models
Reference List
A Supplementary materials
A.1 Supplementary Figures
List of Figures
1.1 Birth-Death Process
2.1 The main workflow of the HCLUST program
3.1 In silico dataset human gut NMI score
3.2 In silico dataset human gut Precision, Recall and F-score
3.3 In silico dataset ocean NMI score
3.4 In silico dataset ocean Precision, Recall and F-score
3.5 In silico dataset soil NMI score
3.6 In silico dataset soil Precision, Recall and F-score
3.7 Clone43 NMI score
3.8 Clone43 Precision, Recall and F-score
3.9 Human gut V6 dataset NMI score
3.10 Human gut V6 dataset Precision, Recall and F-score
4.1 The main workflow of the HCLUST program with parallel computing at thread level
A.1 Experiments investigating different word counting models and word-size selection
List of Tables
2.1 Summary table of the simulation experiments
2.2 Different alignment similarity values with corresponding $C_{feat}$ cut-off values
2.3 Correspondence between different alignment similarity values, the MLE estimator $\hat{s}$ of $Sim_{feat}$, and $p = \Pr[h_v(x) = h_v(y)]$
3.1 Predicted OTU numbers in the simulated datasets
3.2 Predicted OTU numbers in the Clone43 dataset
3.3 Predicted OTU numbers in the human gut V6 dataset
3.4 HCLUST vs. CROP running time
3.5 HCLUST vs. other algorithms running time
Abstract
The 16S rRNA gene is central to metagenomic sequencing analysis as a marker
gene for taxonomic classification, in which distinct 16S rRNAs reveal the profile of
microbial composition in a given sample. Crucial for this kind of analysis is a type
of algorithm that performs de novo OTU (Operational Taxonomic Unit) clustering,
which generates OTU clusters from the 16S rRNA amplicon sequences based on
inter-sequence similarities, without the help of known 16S rRNA reference data.
Several mainstream de novo OTU algorithms use a greedy heuristic approach
that is prone to produce spurious OTUs, and part of the reason is the use of a
hard-cut similarity threshold based on sequence alignment, such as 97%, to group
sequences into OTUs. On the other hand, CROP is a statistical model-based
method that applies a Bayesian approach to the clustering problem, and avoids the
use of a hard-cut threshold. For this reason, CROP is able to generate OTUs more
accurately, but at the same time it suffers from poor running time performance.
In this dissertation we propose a new method, HCLUST, which also uses Bayesian
clustering but with a more efficient implementation. In addition, to make the
algorithm even more scalable, it utilizes a divide-and-conquer approach to partition
the input data into individual buckets, where the Bayesian clustering procedure
can be carried out independently in each individual bucket. Importantly, the data
partitioning occurs in a way that similar sequences are more likely to be assigned to
the same bucket, via a special hashing scheme known as locality-sensitive hashing
(LSH).
We performed multiple benchmarking studies comparing the OTU clustering
accuracies of HCLUST and other mainstream algorithms, with datasets comprised
of computer-simulated reads as well as real sequencing reads, drawn from samples
based on different ecological systems. HCLUST, along with CROP, was identified
as the most accurate algorithm across all of the datasets tested. We further
demonstrated that HCLUST runs more than 10 times faster than CROP on
datasets with relatively large input sizes.
Chapter 1
Introduction
1.1 Overview of the 16S rRNA gene clustering problem
Microbes are estimated to have an abundance of approximately $5 \times 10^{30}$ cells in the
biosphere and are well known for their immense diversity. They play a wide variety
of roles in ecosystems ranging from environments such as the ocean, fresh water,
and soil to the human body [1–4].
The 16S ribosomal RNA (16S rRNA) gene is a section of prokaryotic DNA
found across all bacteria and archaea. Like all ribosomal RNA genes, the 16S rRNA
gene encodes an indispensable component of the ribosome (the RNA of the small
subunit in bacteria), the cellular organelle that translates messenger RNAs
into proteins. In the course of evolution, there is little change in the ribosome
structure, and likewise its ribosomal RNA components are well conserved amongst
different species. In the case of the 16S rRNA product, however, due to the
intrinsic folding and internal bonding of the ribosomal RNAs within the ribosome,
the sequences in certain regions of the rRNA are usually more conserved and
are known as the conserved regions, while other regions are more susceptible to
mutations evolutionarily and are known as the hypervariable regions, or simply,
the variable regions. The 16S rRNA gene is roughly 1.5 kilobases in length, and it
contains nine hypervariable regions (denoted V1–V9).
As the 16S rRNA sequence is highly taxonomy-specific, it is widely used to
identify microbes with genus- or even species-level resolution [5] in a metagenomic
sequencing study. In a typical metagenomic sequencing study workflow, the full-
length(oroftentimes, oneormorehypervariableregions)ofthe16SrRNAgeneare
sequenced from the collected samples, and the resultant amplicon reads (derived
from the 16S rRNA genes from different microbial species) are grouped together
based upon sequence similarity into bins that are commonly known as Operational
Taxonomic Units (OTUs) [5,6]. After the OTUs are derived, the center sequence
of each OTU is selected as a representative to take part in more downstream
analysis,suchasmultiplesequencealignments[7]andmoreimportantly,taxonomic
classification using methods such as the RDP classifier [8] and the 16S classifier
[9], which attempts to assign OTUs to particular species (or higher orders of the
taxonomy such as genus and family).
Traditionally, the microbial species of interest are first cultured in a laboratory
environment before they can be subjected to sequencing and taxonomic studies [10],
but this approach is severely limited by the fact that fewer than 1% of microbial
organisms can be cultivated in a laboratory setting [11]. With next-generation
sequencing methods, we are now able to sequence 16S rRNA amplicons directly
from samples obtained from various ecological systems, including human body
sites, fecal samples, and environmental sources such as oceans, lakes and soil [12].
Depending on whether the 16S rRNA gene analysis procedure relies on known
sequences deposited in public curated databases (e.g. SILVA [13], GreenGenes [14],
RDP [15], etc.), two general approaches are currently being used to characterize
microbial communities by analyzing 16S rRNA sequences: taxonomy-dependent
methods and taxonomy-independent methods [7,16–18].
The taxonomy-dependent methods classify 16S rRNA amplicon sequences into
the corresponding taxonomies by matching the query sequences against refer-
ence sequences deposited in annotated sequence databases mentioned earlier. In
contrast, the taxonomy-independent methods cluster the 16S rRNA amplicon
sequences into OTUs without referencing any existing curated data.
The major drawback of taxonomy-dependent methods is the relatively small
number of known microbial reference sequences (whole genome or the 16S rRNA
gene) currently being curated in public databases [16]. This particularly limits
16S rRNA analysis of samples containing novel species, which are especially
common in environmental samples, where the great majority of microbial
species are uncultivable in a laboratory and their 16S rRNA sequences are not well
characterized.
On the other hand, the taxonomy-independent methods are able to generate
OTUs (or "clusters", a term used interchangeably with OTUs in this text)
in a de novo manner, based only on the intrinsic similarities between the amplicon
sequences, and are therefore much more useful for characterizing sequencing samples
obtained from a wide range of ecological environments such as oceans,
fresh water, and soil. The most widely accepted sequence similarity measure
between a pair of amplicon sequences is the percentage of base positions that agree
in a pairwise sequence alignment. A common similarity threshold for distinguish-
ing the microbes roughly at the species level is 97% (and lower thresholds are used
for identifying OTUs at higher taxonomic levels), which is supported by empirical
studies [19].
Due to the versatility as well as the intrinsic complexity of the taxonomy-
independent methods, many algorithms have been developed to perform this task.
Our method, named HCLUST, which we will examine in more detail in the
following text, also belongs to this category of de novo clustering algorithms.
1.2 The hard threshold problem in OTU clustering
The most widely used de novo sequence clustering methods today, such as Uclust
[20] and CD-hit [21], are based on examining the alignment similarities between
sequences at a specified threshold, such as 97% for species-level OTU identification,
as described earlier. However, as popular as it is, using a hard-cut threshold such
as 97% for species-level OTU clustering is controversial for the following reasons:
The first problem arises from the subtle difference in the biological interpretation
of 97% OTUs [6]: the 97% similarity between 16S rRNAs is only a rough
estimate for delineating species in a generalized manner. For example, two different
species, such as Bacillus globisporus and B. psychrophilus, may have up to 99%
16S similarity between them [22]. In other cases, the same bacterial species, such
as E. coli, can carry multiple copies of the 16S rRNA gene that are less than 97%
similar to one another [23].
The second problem is that the exact similarity threshold to use can vary from
study to study [24–27]. Some recent studies proposed that a threshold of at
least 99% can be used to produce species- or strain-level OTUs on high-quality
16S sequencing datasets [28], whereas others proposed that if the quality of the
sequences cannot be guaranteed, especially in the case of pyrosequencing, 97% is
still an optimal threshold to avoid overestimating the biodiversity [24].
The third problem with a hard and inflexible sequence alignment similarity
threshold such as 97% is that it inevitably fails to account for the variations that
originate from artificial sources, such as (1) sequencing errors that occur to varying
degrees on different sequencing platforms [29]; (2) differences in experimental
procedures such as DNA extraction methods and the choice of PCR primers [30–32];
and (3) differences in sequence alignment implementations as well as alignment
similarity definitions [20,21]. For this reason, methods that use an inflexible similarity
threshold are more prone to produce spurious OTUs and overestimate the total
number of OTUs in a sample [16].
Recent advances in sequencing technologies have enabled the generation of
massive volumes of sequences, including the 16S rRNA gene amplicon sequences,
with increasingly lower cost [33,34]. Therefore there is an ever-growing need for
de novo 16S rRNA clustering methods that are both accurate and scalable.
Here we present our method, HCLUST, which primarily aims to tackle the
third problem discussed above, the use of a hard similarity threshold, by using
soft thresholds defined by an underlying statistical model. It aims to
generate accurate OTUs with high biological relevance, by being more tolerant of
sequence variations introduced from the artificial sources described earlier. Combined
with an input data partitioning strategy known as locality-sensitive hashing (LSH),
HCLUST is able to perform highly accurate clustering of 16S rRNA gene sequences
into OTUs, and to do so in a computationally efficient manner.
1.3 Current methods and challenges
As mentioned previously, the mainstream methods of de novo 16S rRNA clustering
are based on examining pairwise sequence alignment similarities between
the input sequences, and in general these methods fall into one of three
categories: hierarchical clustering, greedy heuristic clustering and statistical
model-based clustering [16].
1.3.1 Hierarchical clustering methods
The representative methods in this category are MOTHUR [7] and ESPRIT [18],
which use approaches akin to classic hierarchical clustering. MOTHUR requires that
a pairwise distance matrix be calculated between all sequences in the input data,
which is then used to construct the hierarchical tree for clustering and linkage
analysis. Computing the pairwise distance matrix is an extremely costly operation
in terms of both time and space complexity, which is on the order of $O(N^2 K)$,
where $N$ is the total number of sequences in the input and $K$ is the total number
of resulting clusters.
ESPRIT improves the efficiency of the hierarchical clustering approach with
a $k$-mer distance filtering strategy that reduces unnecessary pairwise sequence
alignments, and it lowers memory usage by storing the reduced distances in a
sparse matrix [18].
Yet in practice, even with these improvements in computational efficiency, hierarchical
clustering methods are still not suitable for processing even medium-scale
sequencing data due to their intrinsically high computational cost [16].
1.3.2 Greedy heuristic clustering methods
To address the computational efficiency issues, two closely related methods, CD-hit
[21] and Uclust [20], have been developed, and they have become the most popular
approaches for working with large data sets.
Both methods use the following incremental greedy approach (with minor vari-
ations) to assign sequences into OTUs:
At the start of the algorithm, the input sequences are first de-duplicated,
and the unique reads form a list of unprocessed sequences. The list is sorted
by certain criteria, such as sequence length [21]. Next, the first sequence (the
"query sequence") in the unprocessed list automatically becomes the representative
sequence of the first cluster. Then, starting from the second query sequence in the
unprocessed list, each is compared to the existing cluster representatives using the
sequence similarity threshold criterion, and depending on its alignment similarities
to the existing cluster representatives, it either gets merged into an existing cluster,
or itself becomes the representative of a new cluster.
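As a minimal sketch of this incremental greedy scheme (illustrative only; `similarity` stands in for a pairwise alignment similarity function, and the helper names are hypothetical, not CD-hit's or Uclust's actual code):

def greedy_cluster(seqs, similarity, threshold=0.97):
    """Incremental greedy clustering in the style of CD-hit/Uclust."""
    clusters = []  # each entry: [representative_sequence, member_list]
    for query in seqs:  # seqs: de-duplicated reads, pre-sorted (e.g. by length)
        placed = False
        for rep, members in clusters:
            # Merge into the first cluster whose representative is similar enough.
            if similarity(query, rep) >= threshold:
                members.append(query)
                placed = True
                break
        if not placed:
            # The query itself becomes the representative of a new cluster.
            clusters.append([query, [query]])
    return clusters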
The greedy nature of these approaches makes them a lot faster compared to
hierarchical methods, as the greedy algorithm effectively has time complexity of
$O(NK)$, where $N$ is the total number of distinct sequences in the input, and $K$
is the final number of clusters. As $K$ is much less than $N$ in almost all scenarios,
they scale much better than $O(N^2 K)$ methods when $N$ gets large.
Yet the greedy nature of this approach inevitably causes problems that impair
the accuracy of the derived OTU clusters, in particular:
1. Poor cluster representative selection: Each new query sequence is compared
to the cluster representatives formed so far, but fundamentally these
cluster representative sequences are chosen somewhat arbitrarily depending on
their position in the (sorted) input list. In the case of CD-hit, the input list is sorted
in order of decreasing sequence length [21], and therefore longer sequences in
the input tend to become the representative sequences. Ideally, the representative
sequence should be as close to the real OTU center as possible so as to minimize
the splitting of OTUs. This is because in OTU clustering studies, it is typically
assumed that the similarity between any member sequence and the OTU centroid
is at or above the pre-set threshold (e.g. 97%), but this does not necessarily
hold true for any two sequences within the same OTU. Also, as pointed out in [35],
using a single and invariant OTU representative loses a lot of information about the
true OTUs, because sequences in the middle or at the end of the dataset are compared
only to the existing OTU representatives that appear in the beginning portion of the
input dataset.
2. Hard alignment similarity threshold: As described earlier in Section 1.2,
both CD-hit and Uclust use a fixed alignment-based similarity threshold to define
OTUs. Any sequence whose similarity to the existing cluster representatives falls
below the threshold is strictly separated and becomes a new cluster. Although the
sequencing error rate is unlikely to exceed 1% on the more modern sequencing
platforms, a poor choice of cluster centers, as explained earlier, could cause reads
belonging to the same species group to split into different OTUs, due to issues with
the hard-cut threshold.
In fact, many studies have shown that this type of algorithm, especially Uclust,
often overestimates the number of species-level OTUs in 16S rRNA analysis, and
that the clustering result is heavily dependent on the order of sequences in the
input list [16,18,36].
1.3.3 Statistical model-based clustering method
CROP
To address the hard-cut sequence similarity threshold problem associated with the
above-mentioned approaches, Hao et al. [36] proposed CROP, which uses a statistical
model-based, Bayesian clustering approach that avoids the use of a hard-cut threshold.
Here we briefly review the Bayesian clustering algorithm in CROP, as it is
highly relevant to HCLUST; more details are available in the original CROP
paper [36].
First of all, CROP uses a Gaussian mixture density to model each input
sequence, denoted $x$, as:

$$p(x \mid K, \vec{\pi}, \vec{\mu}, \vec{\sigma}^2) = \sum_{j=1}^{K} \pi_j f(x; \mu_j, \sigma_j^2)$$
Note that we define $\vec{x} = (x_1, x_2, \cdots, x_N)$ to be the $N$ input sequences that are
independently drawn from the above Gaussian mixture density, where $K$ is
the number of clusters in the Gaussian mixture model, and $\vec{\mu} = (\mu_1, \mu_2, \cdots, \mu_K)$,
$\vec{\sigma}^2 = (\sigma_1^2, \sigma_2^2, \cdots, \sigma_K^2)$ and $\vec{\pi} = (\pi_1, \pi_2, \cdots, \pi_K)$ are the cluster centers, variances
and mixture proportions for all $K$ clusters, respectively. The sum of all the $\pi_j$'s is
equal to 1.
For any sequence $x_i$, given a certain assignment to a particular cluster $j$, it
follows a modified Gaussian probability density:

$$f(x_i; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \, e^{-\frac{D^2(x_i, \mu_j)}{2\sigma_j^2}}$$

where $D(x_i, \mu_j)$ denotes the global alignment distance between the sequence $x_i$
and the center of the $j$th cluster, $\mu_j$. In addition, $\vec{z} = (z_1, z_2, \cdots, z_N)$ is defined
as the vector of cluster assignment variables, where $z_i$ can take any value in $(1, 2, \cdots, K)$,
denoting the cluster assignment of the $i$th sequence.
Therefore $K$, $\vec{z}$ and $\vec{\mu}$ determine the identity of each resulting cluster and will be
the parameters we are interested in inferring. In addition, CROP assumes that $\vec{\pi}$
and $\vec{\sigma}^2$ follow the prior distributions:

$$\vec{\pi} \mid K \sim \mathrm{Dirichlet}(\gamma_1, \cdots, \gamma_K)$$
$$\sigma_i^2 \sim \mathrm{Inverse\text{-}Gamma}(\alpha, \beta)$$

where in practice, $\gamma$ and $\alpha$ are pre-set values, and $\beta$ is directly related to a
"soft" threshold value that the user can configure as an input to the program, which
effectively restricts the distance between member sequences and their OTU centroids
(namely, the $\vec{\mu}$), leading to different cluster formations.

Next, in order to estimate these unknown parameters, CROP aims to maximize
the posterior probability $P(\vec{\pi}, \vec{\mu}, \vec{\sigma}^2 \mid \vec{x})$ by a process known as Markov chain
Monte Carlo (MCMC). The transition between two successive iterations of the MCMC
is defined by the Birth-Death Process (Figure 1.1).
Figure 1.1: Birth-Death Process
At any given iteration of the MCMC (time $t$), the current $K$ (call it $K_t$)
clusters collectively make a decision (based on probabilistic sampling from the
maximized likelihood function) whether in the next round of the MCMC (time $t+1$,
where $K_{t+1}$ clusters will be generated) they undergo "birth" (one new cluster is
created, $K_{t+1} = K_t + 1$) or "death" (one cluster gets destroyed, $K_{t+1} = K_t - 1$).
In either case, all input sequences undergo a round of sampling to be re-assigned
to any of the $K_{t+1}$ clusters, based on the following probability density function:

$$P(z_i = j \mid \cdots) = \frac{\pi_j}{\sqrt{\sigma_j^2}} \, e^{-\frac{D^2(x_i, \mu_j)}{2\sigma_j^2}}$$

where $\mu_j$ and $\sigma_j^2$ are the center and the variance of the $j$th cluster immediately
after the Birth-Death decision at time $t+1$.
After the cluster assignment variable $\vec{z}$ is updated, to complete this iteration
of the MCMC Birth-Death process, the other parameters, including $\vec{\pi}$, $\vec{\sigma}^2$ and $\vec{\mu}$,
are also updated accordingly.
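As an illustrative sketch of this re-assignment sweep (assuming NumPy; this is not CROP's actual implementation, and `dists` stands in for the precomputed alignment distances $D(x_i, \mu_j)$):

import numpy as np

def resample_assignments(dists, pi, sigma2, rng):
    """One re-assignment sweep: sample a cluster index for every sequence.

    dists:  (N, K) array of alignment distances D(x_i, mu_j)
    pi:     (K,) mixture proportions; sigma2: (K,) cluster variances
    """
    # Unnormalized log-density: log(pi_j) - 0.5*log(sigma2_j) - D^2/(2*sigma2_j)
    logw = np.log(pi) - 0.5 * np.log(sigma2) - dists**2 / (2.0 * sigma2)
    logw -= logw.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)         # normalize each row to a distribution
    return np.array([rng.choice(len(pi), p=row) for row in w])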
In particular, $\vec{\pi}$ and $\vec{\sigma}^2$ are updated by sampling from their prior distributions
[36], and for each cluster, its center $\mu_j$ is re-elected by randomly sampling from
the likelihood term $L(\mu_j)$, which is the likelihood of the $j$th cluster:

$$L(\mu_j) = \prod_{i=1}^{n_j} f(x_{ji}; \mu_j, \sigma_j^2)$$

where $\mu_j$ is the center of cluster $j$, $n_j$ is the number of member sequences in
this cluster, $\sigma_j^2$ is the variance, and $f$ is the Gaussian probability density function
described in Section 1.3.3. CROP then computes a total of $n_j$ values of $L(\mu_j)$,
with every member sequence serving as the candidate cluster center $\mu_j$, and the
final center is elected by sampling.
The MCMC process is repeated until the birth-death process stabilizes or a
maximum pre-set number of iterations is reached, at which point we have obtained
the inferred values for $K$ (the number of clusters), $\vec{z}$ (the cluster assignment of each
input sequence), and $\vec{\mu}$ (the center sequence of each cluster).

Compared with other approaches, CROP defines a "soft" threshold which
is directly related to the parameter $\vec{\sigma}^2$, which intrinsically controls the relative
similarity between each member sequence and its OTU centroid, hence avoiding an
arbitrary hard-cut similarity threshold, say 97%. This model-based approach is
thought to be more tolerant of sequencing errors when inferring OTUs [18].
However, in terms of computational complexity, CROP still requires computation
of all pairwise distances between the input sequences (a distance matrix containing
$N^2$ entries is computed at the start of CROP, before entering the MCMC
iterations). This inevitably incurs a high cost of $O(N^2 K)$ as mentioned earlier. In
addition, the iterative nature of the Birth-Death Process and the intrinsic computational
cost of carrying out the MCMC sampling of parameters make CROP a very
time-consuming algorithm.
DACE
Jiang et al. proposed DACE [37] which uses Dirichlet Process Means (DP-means)
[38] to perform sequence clustering. Here we will briefly review the DP-means
algorithm as described in [37,38].
The latent cluster assignment variable $\vec{z} = (z_1, z_2, \cdots, z_N)$ is defined in the
same way as in CROP described above, and the probability of assigning a sequence
$x_i$ to a cluster $k$ is:

$$f(z_i = k) = \begin{cases} \dfrac{n_{-i,k}}{\psi} \exp\left(-\dfrac{1}{2\sigma^2} D^2(x_i, \mu_k)\right) & k = 1 \cdots K \\[2ex] \dfrac{1}{\psi} \exp\left(-\dfrac{\lambda}{2\sigma^2} - \dfrac{D^2(x_i, x_i)}{2(\rho + \sigma^2)}\right) & k = K+1 \end{cases}$$

where $\psi$ is the normalizing constant and $n_{-i,k}$ is the number of data points, not
counting $x_i$, assigned to cluster $k$. The top case of the equation above refers to
assigning $x_i$ to an existing cluster $k$, while the bottom case means the sequence
$x_i$ is going to belong to a new $(K+1)$th cluster.
Note that $\lambda$ is a distance threshold: if $\lambda$ is smaller than all of the distances in
$(D^2(x_i, \mu_1), \cdots, D^2(x_i, \mu_K))$, a new cluster will be created with $x_i$ becoming its
new member.
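As a simplified, deterministic sketch of this assignment rule (the probabilistic normalization via $\psi$ is dropped; `dist` and `lam` are illustrative stand-ins for the (squared) distance used in the rule and the threshold $\lambda$):

def dp_means_assign(x, centers, dist, lam):
    """One DP-means assignment step: join the nearest cluster, or found a new one.

    centers: list of current cluster centers; dist: the distance used in the rule.
    Returns the index of the chosen cluster, appending a new center when
    every existing center is farther away than the threshold lambda.
    """
    if centers:
        d = [dist(x, mu) for mu in centers]
        k = min(range(len(d)), key=d.__getitem__)   # nearest existing center
        if d[k] <= lam:
            return k
    centers.append(x)   # x becomes the center of the new (K+1)-th cluster
    return len(centers) - 1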
Intuitively, the DP-means approach is very similar to greedy approaches like
Uclust and CD-hit [37], in that it also uses a hard threshold based on sequence
alignment similarity (i.e. $\lambda$ in the formulation above), and the clustering approach
is incremental: the first input sequence becomes the center of the first cluster, then
the second and following input sequences are evaluated one by one through the
DP-means procedure to determine whether the particular query sequence should
join any of the existing $K$ clusters, or itself become the center of the $(K+1)$th cluster.

In addition, to make DP-means more scalable, the input data are first partitioned
using locality sensitive hashing (LSH), in which sequences that are closer together
have a higher probability of landing in the same bucket. In particular, DACE uses
the $p$-stable family of LSH functions [39] that works on vector distances defined
by an $L_p$ norm, such as the Euclidean distance ($L_2$ norm). HCLUST also utilizes LSH
to partition input reads to speed up its running time, but it uses the LSH family
based on cosine distance instead [40]. We will go into much more detail about
using LSH for sequence data pre-processing and pre-binning in the main body of
the HCLUST method discussion below.
Chapter 2
Materials and Methods
2.1 Overview of the HCLUST algorithm
The main goal of HCLUST is to achieve accurate 16S rRNA clustering result
without sacrificing scalability, meaning that it should be able to handle large input
data sets in a time frame comparable to mainstream methods.
At the core of HCLUST is the unsupervised Bayesian clustering method derived
from CROP [36], but we implemented a few important modifications that aim to
enhance the stability of clusters as well as running time performance, while still
maintaining much of its soft threshold clustering characteristics. This will be
discussed in more detail in Section 2.4.
Recall that the MCMC-driven Bayesian clustering approach is computationally
expensive (Section 1.3.3), as the distance term (based on sequence alignment)
is required in estimating the likelihood at every iteration of the MCMC. Sequence
alignment is a notoriously expensive operation: for each pair of sequences, it
takes up to $O(L^2)$ time and space complexity ($L$ being the sequence length) using
the classic Needleman–Wunsch algorithm [41]. Hence, like many modern scalable
methods [20,21], a critical requirement of the HCLUST algorithm is to avoid doing
unnecessary pairwise alignments between any two sequences in the input data set,
especially when two sequences are just too far apart to produce any meaningful
sequence alignment.
HCLUST tackles this problem by incorporating a highly effective data parti-
tioning procedure, which separates the input sequences into individual partitions
(or "buckets"), in a manner that highly similar sequences are more likely to appear
in the same bucket. HCLUST is then able to perform Bayesian clustering inde-
pendently within each individual bucket and generate clusters. The Bayesian clus-
tering process typically runs much faster when dealing with a smaller input size,
as it is intrinsically an $O(N^2)$ algorithm as mentioned earlier. Also, note that the
MCMC procedure also converges faster if the true number of clusters is smaller.
To summarize, the input data partitioning strategy in HCLUST is based on
the following two essential concepts:
1. Numeric feature vectors are first built from the input DNA sequences by
counting frequencies of fixed-sized k-mers (also known as k-tuples) that
appear in the sequence, and
2. The use of a Locality Sensitive Hashing (LSH) scheme to assign sequences
(represented by their corresponding feature vectors) to particular buckets in
the hash table, with the statistical property that if two feature vectors are
close together in a certain metric space in which they are defined, then there
is a greater probability that the two vectors will have the same hash value
(making them appear in the same bucket).
The overall HCLUST algorithm follows the high-level workflow illustrated in
Figure 2.1:
Figure 2.1: The main workflow of the HCLUST program. (Flowchart: the inputs, FASTA sequences and a similarity threshold, are de-duplicated and turned into feature vectors; LSH partitioning splits the data into partitions, Bayesian clustering runs in each partition, and the LSH iteration repeats until the number of clusters converges, followed by a final round of Bayesian clustering before the result is reported.)
In particular:
1. HCLUST first performs a quality check (removing reads containing ambiguous
bases and discarding very short reads), then it de-duplicates identical reads
in the input data, while keeping the abundance information of each unique
read, which is required in the Bayesian inference procedure.
2. Next, we compute a feature vector based on the k-mer counting strategy
described in Section 2.2, and use the cosine similarity measure to define the
similarity between the feature vectors built from the sequences.
3. In an iterative manner (Figure 2.1), HCLUST performs LSH iterations; in
each iteration, HCLUST applies a cosine similarity-preserving LSH
transformation to all the feature vectors built from the sequences, partitioning
the data into individual buckets in the hash table and aggregating
similar sequences in the same bucket as described earlier.
4. HCLUST then applies the CROP-derived Bayesian clustering (Section 2.4)
to each bucket separately to generate clusters. In particular, in the first iter-
ation of the LSH loop, clusters are generated in each bucket from individual
de-duplicated reads. In subsequent iterations, we use the cluster representatives
obtained from the previous iteration, which are further merged
(or split) into different clusters. More details of the Bayesian clustering
procedure of HCLUST will be discussed in Section 2.4.
5. The LSH iteration is repeated until convergence, i.e., either the total
number of clusters stabilizes between iterations or a maximum
number of iterations has been reached. In Section 2.3 (Equations 2.8
and 2.9) we discuss how to estimate the expected number of LSH
iterations that HCLUST needs to execute before the result converges.
6. After the termination of the LSH partitioning loop, HCLUST does a final
round of cluster merging (by pooling all clusters obtained so far) before the
final result is returned.
Note that HCLUST also provides an ultra-fast mode of the same LSH iteration-based
clustering, which uses the incremental greedy approach similar to Uclust and
CD-hit instead of the Bayesian procedure; for this reason, we named this mode
"HCLUST-G", and we will discuss it in more detail in Section 2.4.4.
For both HCLUST and HCLUST-G, two inputs are needed from the user: the
input file containing the sequences to be clustered, and the similarity threshold
(e.g. 0.97). This similarity threshold is converted internally in HCLUST
to a soft constraint that defines the parameter $\sigma$ as described earlier in Section
1.3.3; in the case of HCLUST-G, the threshold is used as-is, following the same
approach as the incremental greedy algorithms.
In summary, below is the pseudo-code for the LSH iteration procedure in the
HCLUST algorithm:
Repeat LSH iteration until convergence:
    Initialize empty cluster list CList
    For each feature vector v:
        Calculate the LSH hash value of v and
        insert v into a bucket in the hash table
    End for
    For each bucket b in the hash table:
        Perform Bayesian clustering on the contents of b
        Add the resulting clusters to CList
    End for
    If CList converges in size compared to the last iteration:
        Break loop
    Else:
        Continue loop
End repeat loop
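A minimal Python rendering of the same loop (a sketch with hypothetical helper names, `make_hash` and `bayesian_cluster`, and hypothetical `.vector`/`.representative` attributes; not the actual HCLUST implementation):

from collections import defaultdict

def hclust_lsh_loop(reps, make_hash, bayesian_cluster, max_iter=30):
    """Iterative LSH partitioning with per-bucket Bayesian clustering (sketch).

    reps:             de-duplicated reads at first, then cluster
                      representatives in later rounds
    make_hash:        returns a freshly randomized composite LSH function
                      g(.) mapping a feature vector to a bucket key
    bayesian_cluster: clusters one bucket's contents, returning clusters
                      that each expose .representative and .vector
    """
    prev, clusters = None, []
    for _ in range(max_iter):
        g = make_hash()                      # new random g(.) every iteration
        buckets = defaultdict(list)
        for r in reps:
            buckets[g(r.vector)].append(r)   # similar items share buckets w.h.p.
        clusters = [c for b in buckets.values() for c in bayesian_cluster(b)]
        if prev is not None and len(clusters) == prev:
            break                            # cluster count converged
        prev = len(clusters)
        reps = [c.representative for c in clusters]
    return clusters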
2.2 Building feature vectors from read sequences
The feature vector built from the DNA input sequences serves several important
purposes in HCLUST:
1. The Locality Sensitive Hashing (LSH) partitioning scheme only works on
numeric vectors.
2. The feature vector distance serves as a good approximation to real alignment
distance between two sequences, and therefore in the Bayesian clustering
procedure of HCLUST (Section 2.4), feature vector distances are used to
filter out unnecessary pairwise alignments between a query sequence and a
cluster representative sequence.
3. In HCLUST-G mode, where the incremental greedy algorithm is used instead
of the default Bayesian method, the cosine similarity computed from the feature
vector representations is used as a fast heuristic to choose potential candidate
cluster representatives for the query sequence to perform real alignment
against.
Given the unprecedented volumes of genetic sequences generated by next-generation
sequencing, using a numeric feature vector to represent a DNA
sequence, the so-called alignment-free approach, enables rapid sequence
comparison without having to undergo the expensive sequence alignment procedure.
Various alignment-free schemes based on word-counting statistics have been
studied [42–46] and adopted in a variety of bioinformatics problems [7,18,47,48].
In order to find the best feature vector construction scheme, we experimented
with two models for building the feature vector: the simple word-frequency counting
model and the $d_2^*$ model. First, for every $k$-tuple $w \in \mathcal{A}^k$
of length $k$, where $\mathcal{A} = \{A, T, C, G\}$ represents the four nucleotides, denote by
$N_w$ the count of occurrences of $w$ along the sequence. The two feature
vector definitions are as follows:
$$F_{simple} = \frac{N_w}{L - k + 1}, \; w \in \mathcal{A}^k, \qquad F_{d_2^*} = \frac{N_w - E(N_w)}{\sqrt{E(N_w)}}, \; w \in \mathcal{A}^k$$
where $L$ is the total length of the target sequence and $E(N_w)$ is the expectation
of $N_w$ under a certain assumed background Markov chain (MC) model distribution,
which has been shown to be useful in representing genomic sequences [49].

A good alignment-free feature definition should have the property that,
after transforming the original DNA sequences into feature vectors, the similarity
between two feature vectors closely reflects the real alignment similarity between
the two sequences that they represent. In order to identify the best feature
definition, as well as the best $k$ and background MC model selections satisfying
this property, we define the cosine similarity between two feature vectors $u$ and $v$
with angle $\theta_{u,v} \in [0, \pi]$ as:
$$\cos(\theta_{u,v}) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$
Since the cosine similarity ranges over $[-1, 1]$, we further normalize it and define
the alignment-free feature similarity $Sim_{feat}$ as:

$$Sim_{feat} = \frac{1 + \cos(\theta_{u,v})}{2}$$
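As a concrete sketch of the $F_{simple}$ construction and the $Sim_{feat}$ computation (assuming plain Python with NumPy; illustrative code, not the HCLUST source):

import itertools
import numpy as np

def kmer_frequency_vector(seq, k=4):
    """F_simple: the frequency N_w / (L - k + 1) of every k-mer w."""
    index = {"".join(w): i
             for i, w in enumerate(itertools.product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        if w in index:                  # skip k-mers with ambiguous bases
            counts[index[w]] += 1
    return counts / max(len(seq) - k + 1, 1)

def sim_feat(u, v):
    """Normalized cosine similarity (1 + cos(theta)) / 2, ranging over [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (1.0 + cos) / 2.0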
In our study, for the feature definitions $F_{simple}$ and $F_{d_2^*}$, we considered two
different Markov chain background models (the i.i.d., i.e. 0th-order MC, and
1st-order MC models) and two tuple lengths ($k = 4$ and $k = 5$). Considering that
the 16S rRNA is a short gene (even more so if only the variable regions are
considered), our experiments with $k$ greater than 5 did not outperform smaller $k$'s
(Supplementary Figure A.1), and a longer $k$ leads to an exponential increase
in the dimension of the feature vectors, which has adverse effects on the program
running time.
In order to identify the best model (namely, which feature definition, what
size of $k$-mer to use for word counting, and what Markov order to use), we
compared the alignment-free feature similarities $Sim_{feat}$ under all models with the
alignment similarity $Sim_{align}$ on a simulated dataset prepared using the
Grinder software [50]. From a total of 2 million pairwise comparisons in this
dataset, only those with $Sim_{align} \geq 90\%$ (the range a 16S rRNA gene clustering
analysis is mostly interested in) were chosen, for a total of 177,000 comparisons.
$Sim_{align}$ is plotted against $Sim_{feat}$ (as shown in Supplementary Figure A.1),
and both the Pearson and Spearman correlations between $Sim_{align}$ and $Sim_{feat}$
are calculated for each model (No. 1 to 6) (Table 2.1).
We decided to use Model No. 1 (simple word frequency counting with word
size $k = 4$) (Table 2.1 and Supplementary Figure A.1) to compose the feature
vectors used in HCLUST, for the following reasons:

• This model (Model No. 1) has a relatively high correlation between
$Sim_{align}$ and $Sim_{feat}$.

• Compared with Model No. 2 (simple word counting with $k = 5$), this model is
much more computation- and memory-efficient to compute and store.

• The simple word frequency counting model does not need to assume an underlying
Markov chain model of the 16S rRNA gene sequence, as opposed to the $d_2^*$
methods (Models No. 3 to 6).
Table 2.1: Summary table of the simulation experiments. A total of 6 simulation experiments was performed to investigate the optimal model (simple word frequency counting vs. the $d_2^*$ model) and the associated parameters ($k$-mer length and Markov order, if applicable) to use as the alignment-free strategy.

Model No.   Feature Definition   k   MC order   Pearson Correlation   Spearman Correlation
1           F_simple             4   N/A        0.844                 0.780
2           F_simple             5   N/A        0.865                 0.793
3           F_d2*                4   0          0.765                 0.688
4           F_d2*                4   1          0.819                 0.745
5           F_d2*                5   0          0.846                 0.766
6           F_d2*                5   1          0.834                 0.754
Finally, by looking at the distributions of the $Sim_{align}$ and $Sim_{feat}$ data points, we
have obtained estimates for the cut-off values $C_{feat}$, as shown in Table 2.2 below:

Table 2.2: Different alignment similarity values with corresponding $C_{feat}$ cut-off values.

Sim_align   90%    91%    92%    93%    94%    95%    96%    97%    98%    99%
C_feat      0.80   0.81   0.82   0.83   0.85   0.86   0.88   0.91   0.93   0.95
Here $C_{feat}$ is obtained by taking the $Sim_{feat}$ value that covers 95% of the
observed distributions of the $Sim_{align}$ and $Sim_{feat}$ data points (Supplementary
Figure A.1). It serves as a useful heuristic to quickly filter out candidates to be
compared against in a single LSH iteration of the HCLUST algorithm. More details
will be covered in Section 2.4.
2.3 Data partitioning with Locality Sensitive Hashing (LSH)
Locality-sensitive hashing [39,40,51] is a powerful and extremely efficient hashing-based
algorithm, which projects high-dimensional numeric data points to certain
hash positions in a hash table, with the property that the hashing preserves (to
some degree) the original distances between the vectors before the LSH projection.
Because of this property, LSH is most widely used for finding approximate
nearest neighbors (ANN) for objects [51].
At the core of LSH is a randomized scalar projection obtained by taking the dot-product
of a query point $x$ in a high-dimensional metric space with another vector
$v$, with each component of $v$ selected at random. Starting with this principle,
different LSH schemes have been proposed, and each scheme is somewhat geared
towards preserving a particular kind of distance metric. Some of the most popular
LSH schemes include the $p$-stable distributions-based LSH [39], which is particularly
useful for preserving the $l_p$ norm (e.g. Euclidean distance), the Minhash scheme [52]
for preserving the Jaccard similarity between two sets, and the cosine similarity-preserving
LSH [40,46,53]. Out of all the LSH schemes investigated, we found the
cosine similarity-preserving LSH most suitable for our application because,
compared with the $l_p$ norm-preserving LSH scheme [37,39], it uses fewer
parameters (2 instead of 4), making LSH parameter estimation easier (Section 2.3).
Cosine similarity-preserving LSH

For a feature vector $x$ of a certain length $T$, the family of hash functions $h_v(x)$ that
preserves cosine similarity is defined simply as:
$$h_v(x) = \begin{cases} 1 & \text{if } v \cdot x \geq 0 \\ 0 & \text{if } v \cdot x < 0 \end{cases} \tag{2.1}$$

where the function parameter vector $v$ is of length $T$, and each component
of $v$ is drawn from a standard normal distribution. Note that $h(\cdot)$ produces only
1 bit (either 0 or 1) of hash value; the final complete LSH function will be a
composition formed by concatenating multiple different $h(\cdot)$ functions
(cf. Equation 2.6).
Then for any two (feature) vectors $x$ and $y$, the probability that $h_v(x) = h_v(y)$
can be proven [54] to be

$$\Pr[h_v(x) = h_v(y)] = 1 - \frac{\theta_{x,y}}{\pi} \tag{2.2}$$

As a simple explanation of Equation 2.2, it is obvious that if $x$ and $y$ point in
the same direction, this probability is equal to 1, and if they point in opposite
directions, then $\theta_{x,y} = \pi$ and this probability is equal to 0.
LSH parameter estimation
Recall that in Section 2.2 we defined $Sim_{feat}$ as the alignment-free similarity
between $x$ and $y$. Based on the definition of $Sim_{feat}$, we can express
$\theta_{x,y}$ in terms of $Sim_{feat}$:

$$Sim_{feat} = \frac{1 + \cos(\theta_{x,y})}{2} \implies \theta_{x,y} = \arccos(2 \times Sim_{feat} - 1) \tag{2.3}$$
Based on Equation 2.3, the probability $\Pr[h_v(x) = h_v(y)]$ in Equation 2.2
can then be expressed in terms of $Sim_{feat}$ as

$$\Pr[h_v(x) = h_v(y)] = 1 - \frac{\theta_{x,y}}{\pi} = 1 - \frac{\arccos(2 \times Sim_{feat} - 1)}{\pi} \tag{2.4}$$
Denote by $\hat{p}$ the estimator of $\Pr[h_*(x) = h_*(y)]$, where $*$ represents any
parameter vector $v$ with each of its components drawn from $N(0,1)$. In order
to obtain the estimator $\hat{p}$, we need to estimate $Sim_{feat}$ first. Denoting by $\hat{s}$ the
estimator of $Sim_{feat}$, we use the Maximum Likelihood Estimate (MLE) [55] for the
estimation of $\hat{s}$. Plugging the calculated $\hat{s}$ values into Equation 2.4, we obtain
the corresponding estimators of the probabilities that two sequences will have the same
binary hash value (i.e. $h_*(x) = h_*(y)$) based on the estimator $\hat{s}$ of their feature
vector similarity $Sim_{feat}$ (Table 2.3).

Let $p = \Pr[h_*(x) = h_*(y)]$ from Equation 2.4; the estimator of $p$, namely
$\hat{p}$, is given by the closed form:

$$\hat{p} = 1 - \frac{\arccos(2 \times \hat{s} - 1)}{\pi} \tag{2.5}$$
Table 2.3: Correspondence between different alignment similarity values, the MLE estimator $\hat{s}$ of $Sim_{feat}$, and $p = \Pr[h_v(x) = h_v(y)]$.

Sim_align   90%    91%    92%    93%    94%    95%    96%    97%    98%    99%
$\hat{s}$   0.84   0.85   0.86   0.88   0.89   0.91   0.92   0.94   0.96   0.99
$\hat{p}$   0.74   0.75   0.76   0.77   0.78   0.80   0.82   0.84   0.87   0.93
According to the entries in Table 2.3, for example, if we are interested in
97% real sequence alignment similarity, then we can use the corresponding
estimator of the alignment-free similarity, $\hat{s} = 0.94$, which further gives the
estimate $\hat{p} = 0.84$, meaning that there is a 0.84 (84%) probability that the hash values
given by $h_*(\cdot)$ for these two sequences (that are within 97% real alignment similarity)
are equal.
This probability seems high enough, but using $h_*(\cdot)$ alone to bin our data points
will produce at most 2 partitions ($h_*(\cdot)$ can either be 0 or 1, after all), which is
hardly useful, because each partition will certainly contain mostly points that
are dissimilar but happen to have the same $h_*(\cdot)$ value. Therefore, to make LSH
more selective, instead of using the single-bit $h_*(\cdot)$ function, we can compose a
$G$-bit composite function $g(\cdot)$, formed by concatenating $G$ independent 1-bit
$h_i(\cdot)$ functions:

$$g(\cdot) = [h_1(\cdot), h_2(\cdot), \cdots, h_G(\cdot)] \tag{2.6}$$
where $g(x) = g(y) \iff h_i(x) = h_i(y)$ for all $i \in \{1, 2, \cdots, G\}$. Then, with
the composite function $g(\cdot)$ defined in Equation 2.6, the probability
that two feature vectors $x$ and $y$ have the same hash value under the composite
function $g(\cdot)$ is:

$$\Pr[g(x) = g(y)] = \{\Pr[h_*(x) = h_*(y)]\}^G \implies \Pr[g(x) = g(y)] = p^G \approx (\hat{p})^G \tag{2.7}$$
Where ˆ p is the estimator of p as described in Equation 2.5.
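A small sketch of the 1-bit hash of Equation 2.1 and the $G$-bit composite of Equation 2.6 (assuming NumPy; illustrative, not the HCLUST source):

import numpy as np

def make_h(dim, rng):
    """One random hyperplane hash h_v(.): Equation 2.1."""
    v = rng.standard_normal(dim)            # components drawn from N(0, 1)
    return lambda x: 1 if np.dot(v, x) >= 0 else 0

def make_g(dim, G, rng):
    """Concatenate G independent 1-bit hashes into a bucket key: Equation 2.6."""
    hs = [make_h(dim, rng) for _ in range(G)]
    return lambda x: tuple(h(x) for h in hs)

# Usage: vectors with the same G-bit key land in the same bucket.
rng = np.random.default_rng(0)
g = make_g(dim=256, G=10, rng=rng)          # 256 = 4^4 features for k = 4
bucket_key = g(np.ones(256))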
Compared with the original hash function $h_*(\cdot)$, the composite hash function
$g(\cdot)$ reduces the false positive rate (dissimilar points accidentally having the same
hash value); in other words, $\Pr[g(x) = g(y) \mid \theta_{x,y} \text{ is large}]$ will be much smaller
than $\Pr[h_*(x) = h_*(y) \mid \theta_{x,y} \text{ is large}]$.
However, at the same time, we need to increase the true positive rate (similar points
having the same hash value) when using the composite hash function $g(\cdot)$; in
other words, to increase the probability $\Pr[g(x) = g(y) \mid \theta_{x,y} \text{ is small}]$. As a result,
we apply multiple iterations of LSH partitioning to hash the data points. In each
iteration, the composite function $g(\cdot)$ is generated randomly from scratch. Here
we apply an analysis similar to the one done for an LSH application in approximate
nearest neighbor (ANN) search in [39]:
Let $M$ denote the number of iterations of LSH partitioning; then the probability
that a data point $x$ is hashed to the same bucket as its nearest neighbor (NN) data
point $y$ at least once across all $M$ iterations is:

$$\Pr[x \text{ finds its close neighbor } y] = 1 - (1 - p^G)^M \tag{2.8}$$
where $p$ is the probability $\Pr[h_*(x) = h_*(y)]$, and $G$ is the number of concatenated
$h_*(\cdot)$'s in the composite hash function $g(\cdot)$.

Once $x$ and $y$ appear in a single bucket at least once across the $M$ iterations
of LSH partitioning, they will be processed together by the Bayesian clustering
procedure (Section 2.4), which works on each LSH partition separately.
Earlier in this section, we derived a closed-form estimator for $p$ (Equation
2.5 and Table 2.3). Denoting $q = \Pr[x \text{ finds its close neighbor } y]$ in Equation 2.8,
the relationship between $G$ (the length of the concatenated composite hash function $g(\cdot)$)
and $M$ (the number of LSH iterations) can be expressed as:

$$q = 1 - (1 - p^G)^M \implies M = \frac{\log(1 - q)}{\log(1 - p^G)} \approx \frac{\log(1 - q)}{\log(1 - (\hat{p})^G)} \tag{2.9}$$
(2.9)
Here q is a fixed value in HCLUST and ˆ p can be inferred from Table 2.3
depending on the user-provided similarity threshold value. For example, for a
clusteringtaskwithalignmentsimilaritythresholdof0.97(forspecies-levelOTUs),
p = 0.84 (read from Table 2.3), and ifq is fixed at 0.99 and we choseG = 10 as a
28
good length of concatenated composite hash function, then the expected number
of LSH iterations for the clustering result to converge will be 24.
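The quoted figure can be checked directly from Equation 2.9 (a worked example, not part of the HCLUST code):

import math

p_hat, q, G = 0.84, 0.99, 10         # values quoted above for a 0.97 threshold
M = math.log(1 - q) / math.log(1 - p_hat**G)
print(math.ceil(M))                  # -> 24 expected LSH iterations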
2.4 Clustering algorithm of HCLUST
To achieve an accurate 16S rRNA clustering outcome and avoid the problems
associated with using a hard-cut sequence similarity threshold, HCLUST uses the
Bayesian clustering method derived from CROP [36], in which sequences are modeled
with a Gaussian mixture distribution, and the birth-death MCMC procedure
is likewise used to estimate the parameters that yield the total number of clusters,
the cluster assignments, and the center of each cluster.
But importantly, HCLUST makes the following modifications that aim to
enhance the stability of clusters as well as running time performance, while still
maintaining much of its soft threshold clustering characteristics.
2.4.1 Improved sequence alignment strategy
HCLUST improves the sequence alignment implementation from the full Needleman-Wunsch
alignment used in CROP to the banded version of the algorithm [56],
which works particularly well for 16S rRNA clustering studies, where sequences share
relatively high similarity (clustering usually happens at thresholds $\geq 90\%$):
we can restrict the sequence alignment computation to a narrow band along
the diagonal of the matrix, instead of filling the full matrix as in the classic
Needleman-Wunsch dynamic programming algorithm, hence avoiding the high
computational cost of $O(L^2)$ to align two sequences, where $L$ is the sequence length.
Note that after an alignment is obtained, in HCLUST we define the alignment
distance between two sequences $x_i$ and $x_j$ as

$$Dist(x_i, x_j) = 1 - Sim(x_i, x_j) = 1 - \frac{\text{number of matched positions in the alignment}}{\text{length of the alignment}}$$
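As a minimal sketch of the banded idea (scored here as plain edit distance for brevity; HCLUST's actual scoring, per the definition above, uses matched positions over the alignment length):

import math

def banded_edit_distance(a, b, band):
    """Global alignment cost restricted to a diagonal band of half-width `band`.

    Cells with |i - j| > band are treated as unreachable (infinite cost),
    so only O(L * band) cells are filled instead of the full O(L^2) matrix.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None                    # band too narrow to connect the ends
    INF = math.inf
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= band:
            cur[0] = i
        for j in range(max(1, i - band), min(m, i + band) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)   # sub / del / ins
        prev = cur
    return prev[m]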
2.4.2 Reduction of sequence alignments with feature vector heuristics
As described earlier in Section 1.3.3, all the pairwise distances between the input
sequences must be computed at the start of the MCMC birth-death procedure,
as the pairwise distance between any two sequences $x_i$ and $x_j$ is required for the
algorithm to elect a center sequence $\mu$ [36]. We note that even within an LSH
partition (a "bucket"), there will be sequences that get hashed to the bucket by
chance rather than by sequence similarity. Aligning these dissimilar sequences
would be a more costly task, as the band size in the banded alignment (Section
2.4.1) would need to increase as well. Reducing unnecessary pairwise alignments
is always key to the high computational efficiency of clustering algorithms, and
here HCLUST takes advantage of the feature vector representation that we
pre-computed for each input sequence: we perform the actual pairwise alignment
of two sequences $x_i$ and $x_j$ in the bucket only if their feature vector similarity
$Sim_{feat}(x_i, x_j)$ is greater than the empirical cut-off value $C_{feat}$ shown in Table
2.2. In any case where $Sim_{feat}(x_i, x_j) < C_{feat}$ for the two sequences, we set their
distance to $D(x_i, x_j) = 0.5$, an empirical value chosen so that it does not interfere
with the center election process within each cluster.
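In code, the heuristic amounts to a simple guard around the expensive alignment (a sketch with illustrative helper names):

def pairwise_distance(xi, xj, sim_feat, align_dist, c_feat):
    """Skip the banded alignment for pairs failing the C_feat cut-off."""
    if sim_feat(xi, xj) >= c_feat:
        return align_dist(xi, xj)   # real banded alignment (Section 2.4.1)
    return 0.5                      # fixed placeholder distance for dissimilar pairs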
2.4.3 Improvement of cluster center re-election strategy
In addition to the fast heuristic described in Section 2.4.2 above, HCLUST also
improves the efficiency of cluster center re-election after each round of MCMC.
Recall that CROP uses an extremely costly sampling method to compute the
likelihood terms L(μ
j
) (Section 1.3.3), which is the likelihood of the j-cluster
given a particular cluster center μ:
L(μ
j
) =
n
j
Y
i=1
f(x
ji
;μ
j
,σ
2
j
)
for every member sequence being the cluster candidate center μ
j
.
HCLUST adopts a more practical and intuitive approach of updating/electing
the cluster center, by defining the center $\mu_j$ of the $j$th cluster (call this sequence
$x_c$) as the sequence that has the minimum average distance from all other sequences
in the same cluster, i.e.

$$x_c = \underset{x_i \in \{x_1, \cdots, x_{n_j}\}}{\operatorname{argmin}} \sum_{x_m \in \{x_1, \cdots, x_{n_j}\}} Dist(x_i, x_m)$$

where $Dist(x_i, x_m)$ is the alignment distance between sequences $x_i$ and $x_m$.
Note that in the actual HCLUST implementation we take into account the relative
abundance of each sequence (because the input sequences are de-duplicated) when
calculating the sum of distances, meaning a more abundant sequence carries more
"weight" in the vote.
Note that this makes the choice of cluster center in HCLUST more deterministic
(compared with the likelihood-based random sampling of the cluster center in CROP),
in that the cluster center in HCLUST is the sequence that minimizes the sum of
center-to-member distances. It therefore avoids cases where poor cluster centers
are picked as a result of the more probabilistic nature of sampling in CROP.
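A sketch of this abundance-weighted medoid election (the exact weighting is our reading of the description above, so treat it as illustrative):

def elect_center(members, dist, abundance):
    """Pick the member with the minimum abundance-weighted sum of distances."""
    def weighted_cost(xi):
        # Each distance is weighted by the other member's read abundance,
        # so abundant sequences pull the center toward themselves.
        return sum(abundance[xm] * dist(xi, xm) for xm in members if xm is not xi)
    return min(members, key=weighted_cost)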
2.4.4 HCLUST-G, the fast mode
In addition to the Bayesian clustering algorithm that HCLUST uses by default, to
achieve even higher scalability when dealing with extremely large datasets, we also
developed a fast mode of HCLUST that uses the same feature vector building and
LSH iteration scheme, but, instead of Bayesian clustering, uses the incremental
greedy heuristic approach to cluster the sequences in each bucket produced by the
LSH. For this reason, we named this fast mode HCLUST-G, where
G stands for "greedy mode".
Essentially, HCLUST-G is able to take advantage of the HCLUST algorithm
in the following manner:
• Since the greedy approach has time complexity of $O(NK)$ ($N$ is the number of
sequences to cluster and $K$ is the final number of clusters; more about this in
Section 1.3.2), having the LSH mechanism concentrate similar sequences
into the same bucket effectively reduces $K$ in each clustering sub-task.
• HCLUST-G potentially has an advantage over CD-hit and Uclust in the sense
that between successive LSH iterations the cluster centers are updated (using
the same scheme as HCLUST, described in Section 2.4.3), as opposed to
the invariant cluster centers in CD-hit and Uclust.
2.5 Clustering evaluation methods and criteria
To evaluate the accuracy of an OTU-clustering algorithm, it is necessary to first
obtain the ground truth data, in which the OTU assignment (e.g. at a particular
taxonomic level, such as genus and species) for each input sequence is known.
This is particularly challenging for large-scale sequencing experiments done in
ecological environments (such as soil, ocean, etc.), as there is usually a large number
of unknown microbial species present in the samples.
As a result, we used different types of 16S sequence datasets (more details in
Section 2.6 below) that contain the true (or very close to true) OTU assignments,
and evaluated the OTU clustering accuracy of HCLUST (both its default
mode and its fast mode, HCLUST-G) against the following state-of-the-art algorithms
introduced in Section 1.3:
• CROP [36]
• DACE [37]
• CD-hit [21]
• Uclust [20]
In particular, we used three different evaluation metrics: (1) the Normalized Mutual Information (NMI) score; (2) the F-score along with Precision and Recall; and (3) the resulting OTU number compared with the true OTU number.
2.5.1 NMI score
The Normalized Mutual Information (NMI) score [57] is a well-recognized method for evaluating the quality of unsupervised clustering given a truth set [28,37,58]. The NMI is based on mutual information (MI) from information theory. Given two cluster assignments, G and P, where G is the ground-truth cluster assignment and P is the predicted assignment, the NMI score is defined as

NMI(G,P) = I(G,P) / √(H(G)·H(P))
where I(G,P) is the mutual information between the assignment sets G and P, and H(G), H(P) are the entropies of sets G and P, respectively.
The NMI score has a bounded range of [0,1], with 0 indicating purely independent cluster assignments, and 1 being a perfect assignment that is identical to the ground truth.
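As a minimal worked example (assuming the scikit-learn library is available; this is not part of the HCLUST code base), the NMI with the geometric-mean normalization above can be computed as:

from sklearn.metrics import normalized_mutual_info_score

G = [0, 0, 1, 1, 2, 2]   # toy ground-truth cluster labels
P = [0, 0, 1, 2, 2, 2]   # toy predicted cluster labels

# "geometric" matches the sqrt(H(G) * H(P)) denominator used above.
score = normalized_mutual_info_score(G, P, average_method="geometric")
print(round(score, 3))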
2.5.2 Precision, Recall, and F-score
In addition to the entropy-based NMI score discussed above, the F-score (or F-measure, which is computed from Precision and Recall) is another popular measure for evaluating how well a cluster assignment matches the truth set of cluster assignments.
Here we follow the formulation in Steinbach et al.'s work [59]. In particular, let us define i as a cluster label in the true cluster assignment, and j as a cluster label in the test cluster assignment:

i = 1, 2, ···, I
j = 1, 2, ···, J

where I is the total number of clusters (e.g. known OTUs) in the truth set, and J is the total number of clusters in the test set (e.g. OTUs produced by the unsupervised clustering algorithm). For the convenience of the following discussion, we will call any cluster in the truth set a "truth cluster" (in other literature this is also known as the "class"), and any cluster produced by the clustering algorithm a "test cluster" (known simply as a "cluster" in other literature).
Then Precision and Recall between any truth cluster i and any test cluster j can be defined as:

Precision(i,j) = n_{ij} / n_j
Recall(i,j) = n_{ij} / n_i

where n_{ij} is the number of members that appear in both the truth cluster i and the test cluster j; n_i is the size of truth cluster i and n_j the size of test cluster j.
The F-score, F(i,j), of the truth cluster i and the test cluster j is then

F(i,j) = (2 × Precision(i,j) × Recall(i,j)) / (Precision(i,j) + Recall(i,j))

which then allows the overall F-score, F, to be computed as

F = Σ_i (n_i / n) · max_{j∈{1,···,J}} F(i,j)

where n is the total number of sequences.
Note taking the maximum F(i,j) across all values of j for a particular i makes
sense intuitively, because here we are essentially selecting the test cluster j that is
"most similar" (hence has the highest F(i,j)) to the truth cluster i.
Finally, note that the overall Precision and Recall can also be calculated as:

Precision = Σ_j (n_j / n) · max_{i∈{1,···,I}} Precision(i,j)

Recall = Σ_i (n_i / n) · max_{j∈{1,···,J}} Recall(i,j)
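A minimal sketch computing these overall scores from two per-sequence label lists (illustrative Python; the function and variable names are hypothetical):

from collections import Counter

def overall_scores(truth_labels, test_labels):
    # truth_labels / test_labels: the cluster label of each sequence (equal length)
    n = len(truth_labels)
    n_i = Counter(truth_labels)                     # truth-cluster sizes
    n_j = Counter(test_labels)                      # test-cluster sizes
    n_ij = Counter(zip(truth_labels, test_labels))  # overlap counts

    def f(i, j):
        o = n_ij[(i, j)]
        if o == 0:
            return 0.0
        p, r = o / n_j[j], o / n_i[i]
        return 2 * p * r / (p + r)

    precision = sum(n_j[j] / n * max(n_ij[(i, j)] / n_j[j] for i in n_i) for j in n_j)
    recall = sum(n_i[i] / n * max(n_ij[(i, j)] / n_i[i] for j in n_j) for i in n_i)
    f_score = sum(n_i[i] / n * max(f(i, j) for j in n_j) for i in n_i)
    return precision, recall, f_score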
2.6 Test Data
We used the following four publicly available datasets:
• in silico simulated 16S V4 region data [60]
• Clone43 dataset [61]
• Human gut 16S V6 region dataset [62]
• Lake Taihu dataset [63]
2.6.1 In silico simulated data across three ecosystems
Almeida et al. [60] created this dataset based on some of the most abundant genera found across publicly available metagenomes from three major ecosystems: the human gut, ocean, and soil. From the known species genomes they used in silico PCR to extract the hypervariable regions (V1-V2, V3-V4, V4, and V4-V5) of the genes (with commonly used primer sequences), then randomly mutated 2% of the positions of each variable region. Finally, read sequencing was simulated with ART [64] using the MiSeq error profile, generating 250bp-long paired-end reads.
We obtained the unprocessed simulated paired-end reads for the V4 region for all three ecosystem types; each dataset was amplified from the 16S genes of 100 different species and contains 200,000 paired-end reads. We then followed the pre-processing and quality-control steps described in [60] by first assembling the paired-end reads into contigs and then filtering out sequences with ambiguous base calls (a minimal sketch of this filter is shown below).
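As a minimal sketch of the ambiguous-base filtering step (illustrative Python only; the actual pre-processing followed the steps described in [60]):

def has_only_unambiguous_bases(contig):
    # Keep contigs composed solely of A, C, G and T (no N or other IUPAC ambiguity codes).
    return set(contig.upper()) <= set("ACGT")

contigs = ["ACGTACGT", "ACGTNCGT"]   # toy contigs; 'N' marks an ambiguous base call
kept = [c for c in contigs if has_only_unambiguous_bases(c)]
print(kept)                           # ['ACGTACGT']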
After contig assembly and quality filtering, we derived 193,776 sequences for the human gut data, 199,775 for the ocean data, and 199,807 for the soil data.
Next, we used QIIME's closed-reference taxonomy assignment tool with default settings to obtain taxonomic information for each sequence in each of the three datasets. We extracted ground-truth taxonomic information from the resulting taxonomic assignments, but only down to the genus level (which is also the lowest taxonomic level that Almeida et al. examined in their original paper [60]).
Finally, we obtained 126,072, 159,616, and 154,064 sequences with known genus assignments for the human-gut, ocean, and soil simulated datasets, respectively, which were then used to evaluate the clustering accuracy of the algorithms.
2.6.2 Clone43 dataset
This is a well-known real dataset produced by Huse et al. [61,65] from a mixture of 43 plasmid clones containing the V6 hypervariable region of very distinct 16S rRNA genes amplified from a deep-sea hydrothermal vent community, sequenced on a Roche GS20 (454 pyrosequencing) platform. The V6 region of the Clone43 mixture ranges from 57 to 145 nt in length.
This is an ideal dataset for testing the baseline performance of an OTU clustering algorithm at the species level, because it contains a known number (i.e. 43) of species-level OTUs, and the 43 species are divergent, meaning that their 16S rRNA genes are distinct enough for OTU identification on this dataset to have unambiguous OTU assignments.
2.6.3 Human gut V6 dataset
This is a real dataset covering the V6 region of the 16S gene from metagenomic sequencing of the human gut flora, produced in a study of the gut microbiome in obese and lean twins [62]. We performed a procedure similar to that used for the in silico simulated data (Section 2.6.1) to assign taxonomic information to every sequence in the dataset, but this time only sequences with species-level assignments were chosen as ground truths for our comparison. As a result, we derived a truth dataset with a total of 135,790 sequences carrying 59 distinct species-level assignments.
2.6.4 Lake Taihu dataset
This real dataset contains 81 microbial community samples collected from different surface sites of Lake Taihu, China [63]. It is a very large 16S dataset that contains more than 300 million reads generated on Illumina's sequencing platform. Due to the vast diversity of microbial species that often exists in ecological samples like these, there is no reliable way to generate a truth set of OTU assignments for the samples. As a result, we primarily used the Taihu dataset to benchmark the running-time performance of HCLUST against the other algorithms.
2.7 Test configurations and environments
The following versions of the clustering algorithms were used in our performance tests and benchmark studies:
• HCLUST is implemented in C++ conforming to the C++11 standard. The program can be compiled from source (with a relatively modern GCC compiler such as GCC 4.7.0+) on a UNIX system. The program runs the Bayesian clustering algorithm by default, and can be configured to run in the fast mode, HCLUST-G, via a user flag. The source code of HCLUST will be made available under an open-source license.
• CROP is downloaded and built from its GitHub repository at https://github.com/tingchenlab/CROP.
• DACE is downloaded and built from its GitHub repository at https://github.com/tinglab/DACE.
• CD-hit is downloaded from its author's GitHub repository at https://github.com/weizhongli/cdhit; release V4.6.7 is used in our study.
• Uclust is executed with the pick_otus.py procedure as part of the QIIME
package (version 1.9.1).
The running-time benchmarking study (Section 3.4) was done on a DELL workstation with an Intel Xeon E5-1620 CPU at 3.6 GHz and 8 GB of memory. All programs except DACE were run on a single CPU core (since DACE needs to run on at least two cores), and program execution times were measured in CPU time with the time command in Linux.
Chapter 3
Results
3.1 In silico simulated data results
For the computer-simulated datasets of the 16S rRNA V4 region across the three ecosystems of human gut, ocean and soil, we evaluated the NMI score, Precision, Recall and F-score of the six algorithms (including HCLUST and its fast mode HCLUST-G).
First, we look at the simulated human gut dataset results in Figure 3.1 and Figure 3.2. CROP has the best performance overall, although by a very small margin compared with the other algorithms, which are mostly on par with each other.
Figure 3.1: In silico human gut dataset: NMI score
Figure 3.2: In silico human gut dataset: Precision, Recall and F-score
Next is the simulated ocean dataset (Figure 3.3 and Figure 3.4), where HCLUST and CROP clearly have better overall NMI scores and F-score measures than the other methods.
Figure 3.3: In silico ocean dataset: NMI score
Figure 3.4: In silico ocean dataset: Precision, Recall and F-score
Lastly, in the simulated soil dataset (Figure 3.5 and Figure 3.6), all methods again have similar NMI scores and F-scores, except that DACE has a much lower F-score.
Figure 3.5: In silico soil dataset: NMI score
Figure 3.6: In silico soil dataset: Precision, Recall and F-score
While HCLUST and CROP only have a clear advantage in the NMI and F-score measures on the ocean dataset out of the three simulated datasets, the results are much more interesting when we look at the total number of genus-level OTUs produced by these algorithms (Table 3.1). Although all methods overestimate the number of OTUs compared with the truth, both HCLUST and CROP predicted OTU numbers that are closest to the corresponding truth OTU numbers across all three ecosystem datasets.
Table 3.1: Predicted OTU numbers in the simulated datasets

             gut   ocean   soil
Truth OTUs    41      41     48
HCLUST        61      63     54
CROP          58      57     70
DACE          69      84     84
CD-hit        67      80     82
Uclust        67      79     83
HCLUST-G      65      69     77
3.2 Clone43 dataset results
The Clone43 dataset, which was generated from 43 known 16S genes, produced the following evaluation outcomes for the different algorithms (Figure 3.7 and Figure 3.8).
Figure 3.7: Clone43 NMI score
Figure 3.8: Clone43 Precision, Recall and F-score
It is immediately clear that both HCLUST and CROP achieve near-perfect clustering accuracy on this dataset, with their NMI scores, Precision, Recall and F-scores all nearly equal to 1. Also note that among the greedy heuristic algorithms - CD-hit, Uclust and HCLUST-G (HCLUST running in greedy mode) - HCLUST-G has the highest NMI score of the three, and it also has a relatively high F-score.
Next, we examine how many clusters (OTUs) were generated by each of the algorithms and compare them with the truth (43); the result is summarized in Table 3.2.
Again, both HCLUST and CROP generated OTU numbers (40 and 41, respectively) that are very close to the real number of 43, although slightly underestimating it. On the other hand, all the other methods vastly overestimate the number of OTUs, calling out hundreds of OTUs in their results.
Table 3.2: Predicted OTU numbers in the Clone43 dataset
Truth OTUs 43
HCLUST 40
CROP 41
DACE 999
CD-hit 421
Uclust 438
HCLUST-G 673
3.3 Human gut V6 dataset results
The human gut V6 region dataset is the last dataset used to evaluate the clustering accuracy of the algorithms. Again, we plotted the NMI score and the F-score-related metrics (Figure 3.9 and Figure 3.10):
Figure 3.9: Human gut V6 dataset NMI score
Figure 3.10: Human gut V6 dataset Precision, Recall and F-score
On this dataset, HCLUST has the best overall performance of all six algorithms, with a clear advantage in terms of NMI score and F-score. CROP's performance comes in a close second. The greedy heuristic-based algorithms also obtained relatively accurate results, with CD-hit performing slightly better than Uclust and HCLUST-G.
Next, as Table 3.3 reveals, although all methods overestimate the OTU number this time, HCLUST predicts an OTU number (136) that is closest to the truth number of 59. All other methods predicted 2- to 10-fold as many OTUs as HCLUST does.
Table 3.3: Predicted OTU numbers in the human gut V6 dataset
Truth OTUs 59
HCLUST 136
CROP 251
DACE 1353
CD-hit 361
Uclust 653
HCLUST-G 1029
3.4 Running time benchmarking results with the Taihu dataset
To assess the scalability of HCLUST (both the default Bayesian clustering implementation and its fast mode HCLUST-G), we conducted two different benchmark studies. The first benchmark experiment compares HCLUST with CROP, as both approaches use Bayesian clustering and both demonstrated very good clustering accuracy on our test datasets. In this first benchmark study, we used small to medium-sized input files containing between 100,000 and 1,000,000 reads sampled from the complete Taihu dataset.
In the second benchmark, we compared the running-time performance of HCLUST, HCLUST-G, DACE, CD-hit and Uclust on relatively large datasets containing between 1 million and 10 million reads sampled from the complete Taihu dataset.
The results from both benchmarking comparisons, generated under a local desktop environment (Section 2.7) and measured in minutes of CPU time, are shown in Table 3.4 and Table 3.5 below:
Table 3.4: HCLUST vs. CROP running time (minutes of CPU time)
100K 200K 400K 600K 800K 1M
HCLUST 26 40 71 96 104 113
CROP 133 260 657 959 1244 1397
speed up 5X 6X 9X 10X 12X 12X
Table 3.5: HCLUST vs. other algorithms running time (minutes of CPU time)
1M 2M 4M 6M 8M 10M
HCLUST 127 277 296 411 511 569
HCLUST-G 1 3 6 9 11 13
DACE 19 27 41 72 90 132
CD-hit 7 19 42 67 94 121
Uclust 1 1 1 2 3 4
Table 3.4 shows a huge running-time advantage of HCLUST over CROP, achieving a 5- to 12-fold speed-up in CPU time. For the 1-million-read dataset, for example, HCLUST shortens a roughly 23-hour CROP run to less than 2 hours.
In the second running-time benchmark experiment, we scaled the test data up 10-fold (with data ranging from 1 to 10 million reads). This time, the Bayesian clustering-based HCLUST inevitably becomes the most time-consuming algorithm compared with the other greedy heuristic-based algorithms (note that DACE's DP-means approach also has a greedy-like property, as described earlier in Section 1.3.3).
On another note, HCLUST-G is the second fastest algorithm, only slower than Uclust (which is currently the fastest 16S rRNA gene clustering algorithm running on a single core) by a small margin, and almost 10 times faster than DACE or CD-hit. Considering that HCLUST-G has clustering accuracy comparable to these greedy heuristic-based methods (Sections 3.1 to 3.3), we consider HCLUST-G a useful addition to the default HCLUST: it can be used on extremely large datasets when running time is a strict constraint for the users.
Chapter 4
Conclusion and Future Directions
4.1 Conclusion
In this study, we introduced HCLUST, a highly accurate method for clustering 16S rRNA gene sequences. The two core concepts of HCLUST are 'divide-and-conquer' and model-based Bayesian clustering.
For divide-and-conquer, we first explained our strategy of using word frequency
counting to convert DNA sequences into numeric feature vectors, which leads to
our discussion of using cosine angle measure to represent distance between the
feature vectors. Next we presented the locality-sensitive hashing (LSH) function
formulation that preserves the cosine similarity when projecting the feature vectors
to hash values.
In particular, a total of six different feature-vector building schemes were investigated, and the simple word frequency counting model with a word length of 4 was chosen as the best model to use. We also proposed empirical cut-off thresholds to be used as a filtering criterion for estimating the real alignment similarities between two DNA sequences by computing the cosine similarity of their feature vectors (Section 2.2).
We performed simple analyses and calculations on the cosine distance-preserving LSH formulation used in HCLUST, from which we derived the analytical relationship between the parameters we set for the LSH (Section 2.3).
For the model-based Bayesian clustering, HCLUST adopted the general approach used in CROP [36], which models sequences and clusters using a Gaussian mixture and derives estimates of the parameters of interest using an MCMC-based procedure. At the same time, we made significant improvements to the Bayesian clustering algorithm to enhance its efficiency while largely preserving the accuracy of clustering (Section 2.4).
In addition to the default Bayesian clustering mode, we implemented a fast mode, named HCLUST-G, in the same HCLUST software; it uses an incremental greedy clustering strategy similar to that of CD-hit and Uclust, with the same LSH iteration workflow as the default HCLUST (Section 2.4.4).
To evaluate the clustering accuracy of HCLUST, we used different 16S rRNA datasets that were either simulated in silico or generated in real sequencing experiments. The datasets cover a wide range of ecosystems including the human gut, ocean, lake and soil; they also cover different hypervariable regions of the 16S rRNA gene, including V4 and V6; and finally, different sequencing platforms, such as Illumina's paired-end sequencing and 454 pyrosequencing, are also covered.
For the datasets where the ground truths of the cluster assignments are known, the clustering accuracy of HCLUST was evaluated using different metrics such as the NMI score, Precision, Recall and F-score, and compared with mainstream algorithms including Uclust, CD-hit, DACE and CROP. HCLUST (along with CROP) gave the most accurate predictions in the majority of the datasets, both in terms of having the highest NMI score and F-score and in making the closest estimates of the total number of clusters (OTUs) present in the datasets (Sections 3.1 to 3.3).
In terms of scalability, given the comparable clustering accuracy of HCLUST and CROP, HCLUST is more than 10 times faster than CROP when benchmarked on a medium-sized dataset (Section 3.4).
When compared against other more scalable algorithms on larger datasets, HCLUST took longer to finish, due to the intrinsic computational cost of the Bayesian clustering approach; but it was still able to cluster 10 million input sequences in under 10 hours. The fast mode, HCLUST-G, on the other hand, was able to beat all other algorithms (except Uclust) in terms of running time, and we consider it a useful option to accompany the default HCLUST when dealing with very large datasets where execution time is a serious constraint for the users (Section 3.4).
4.2 Future Directions
4.2.1 Further enhance the scalability of HCLUST
Although we have seen that HCLUST produces highly accurate OTU clustering outcomes compared with state-of-the-art algorithms such as CD-hit and Uclust, it still runs considerably slower than these greedy heuristic-based algorithms. This is mostly due to the nature of Bayesian clustering, where a massive amount of computational work is carried out at each iteration of the MCMC process.
Fortunately, we will be able to further improve HCLUST's running time using parallel computing approaches such as multi-threading or multi-CPU processing, thanks to the fact that the LSH partition scheme is fully compatible with parallel processing. This is because, in a single LSH iteration, each of the buckets containing the partitioned sequences undergoes the clustering step independently. This means the clustering tasks can be distributed to different threads (or processors) for concurrent execution. Figure 4.1 shows the high-level design of a multi-threaded version of HCLUST, and a minimal sketch of the idea follows the figure.
[Figure omitted: flowchart in which input FASTA sequences and a similarity threshold pass through de-duplication, feature-vector building, and LSH partitioning into partitions 1..M; each partition is clustered by Bayesian clustering on its own thread, the LSH iteration repeats until the clusters converge, and a final round of Bayesian clustering is run before the result is reported.]
Figure 4.1: The main workflow of the HCLUST program with parallel computing at the thread level
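As a minimal sketch of this per-bucket parallelism (illustrative Python using a thread pool; the actual HCLUST is written in C++, so all names here are hypothetical):

from concurrent.futures import ThreadPoolExecutor

def cluster_one_lsh_iteration(partitions, cluster_bucket, n_threads=4):
    # partitions     : list of sequence buckets produced by the LSH step
    # cluster_bucket : clustering routine applied to a single bucket
    # Buckets are mutually independent, so they can be processed concurrently.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        per_bucket_clusters = list(pool.map(cluster_bucket, partitions))
    # The per-bucket clusters feed into the next LSH round until convergence.
    return per_bucket_clusters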
4.2.2 Incorporate platform-based sequencing error correction models
Recently, the field of 16S rRNA analysis has seen the emergence of sequencing-error denoising algorithms that aim to distinguish true biological sequences from sequences that arise from sequencing artifacts such as sequencing errors, and many of these approaches claim to have resolution down to a single base pair [66–70]. This finer resolution can be useful for analyzing OTUs that are extremely close in evolutionary distance, for example, different strains within the same species.
Typically, this class of sequencing-error denoising algorithms makes use of known sequencing-error models associated with a specific sequencing platform, such as the error profiles of amplicons produced by Illumina machines [66,67]. In contrast, HCLUST currently operates in a sequencing platform-agnostic manner, and for this reason it lacks the ability to denoise errors with the help of platform-specific error models. We therefore think that a good future direction for HCLUST is to integrate platform-specific error detection models to enhance its resolution in OTU identification.
Reference List
[1] Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, et al.
Microbial diversity in the deep sea and the underexplored “rare biosphere”.
Proceedings of the National Academy of Sciences. 2006;103(32):12115–12120.
[2] van Elsas JD, Costa R, Jansson J, Sjöling S, Bailey M, Nalin R, et al. The
metagenomics of disease-suppressive soils–experiences from the METACON-
TROL project. Trends in Biotechnology. 2008;26(11):591–601.
[3] Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490(7418):55–60.
[4] Djikeng A, Kuzmickas R, Anderson NG, Spiro DJ. Metagenomic analysis of
RNA viruses in a fresh water lake. PLoS One. 2009;4(9):e7264.
[5] Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification
in the diagnostic laboratory: pluses, perils, and pitfalls. Journal of Clinical
Microbiology. 2007;45(9):2761–2764.
[6] Nguyen NP, Warnow T, Pop M, White B. A perspective on 16S rRNA opera-
tional taxonomic unit clustering using sequence similarity. NPJ Biofilms and
Microbiomes. 2016;2:16004.
[7] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB,
et al. Introducing mothur: open-source, platform-independent, community-
supported software for describing and comparing microbial communities.
Applied and Environmental Microbiology. 2009;75(23):7537–7541.
[8] Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian classifier for rapid
assignment of rRNA sequences into the new bacterial taxonomy. Applied and
Environmental Microbiology. 2007;73(16):5261–5267.
[9] Chaudhary N, Sharma AK, Agarwal P, Gupta A, Sharma VK. 16S classifier: a tool for fast and accurate taxonomic classification of 16S rRNA hypervariable regions in metagenomic datasets. PloS One. 2015;10(2):e0116106.
[10] Klijn N, Weerkamp AH, de Vos WM. Identification of mesophilic lactic acid
bacteria by using polymerase chain reaction-amplified variable regions of 16S
rRNA and specific DNA probes. Applied and Environmental Microbiology.
1991;57(11):3390–3393.
[11] Kellenberger E. Exploring the unknown: the silent revolution of microbiology.
EMBO Reports. 2001;2(1):5–7.
[12] Rothberg JM, Leamon JH. The development and impact of 454 sequencing.
Nature Biotechnology. 2008;26(10):1117–1124.
[13] Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, et al.
SILVA: a comprehensive online resource for quality checked and aligned ribo-
somal RNA sequence data compatible with ARB. Nucleic Acids Research.
2007;35(21):7188–7196.
[14] Bowman JL, Floyd SK, Sakakibara K. Green genes—comparative genomics
of the green branch of life. Cell. 2007;129(2):229–234.
[15] Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al. Ribosomal
Database Project: data and tools for high throughput rRNA analysis. Nucleic
Acids Research. 2013;42(D1):D633–D642.
[16] Chen W, Zhang CK, Cheng Y, Zhang S, Zhao H. A comparison of methods
for clustering 16S rRNA sequences into OTUs. PloS One. 2013;8(8):e70837.
[17] Liu Z, DeSantis TZ, Andersen GL, Knight R. Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research. 2008;36(18):e120–e120.
[18] Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, et al. ESPRIT:
estimating species richness using large collections of 16S rRNA pyrosequences.
Nucleic Acids Research. 2009;37(10):e76–e76.
[19] Konstantinidis KT, Tiedje JM. Genomic insights that advance the species
definition for prokaryotes. Proceedings of the National Academy of Sciences.
2005;102(7):2567–2572.
[20] Edgar RC. Search and clustering orders of magnitude faster than BLAST.
Bioinformatics. 2010;26(19):2460–2461.
[21] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–
1659.
[22] Fox GE, Wisotzkey JD, Jurtshuk P Jr. How close is close: 16S rRNA sequence identity may not be sufficient to guarantee species identity. International Journal of Systematic and Evolutionary Microbiology. 1992;42(1):166–170.
[23] Eren AM, Maignien L, Sul WJ, Murphy LG, Grim SL, Morrison HG, et al.
Oligotyping: differentiating between closely related microbial taxa using 16S
rRNA gene data. Methods in Ecology and Evolution. 2013;4(12):1111–1119.
[24] Kunin V, Engelbrektson A, Ochman H, Hugenholtz P. Wrinkles in the rare
biosphere: pyrosequencing errors can lead to artificial inflation of diversity
estimates. Environmental Microbiology. 2010;12(1):118–123.
[25] Sullam KE, Rubin BE, Dalton CM, Kilham SS, Flecker AS, Russell JA. Divergence across diet, time and populations rules out parallel evolution in the gut microbiomes of Trinidadian guppies. The ISME Journal. 2015;9(7):1508–1522.
[26] Needham DM, Sachdeva R, Fuhrman JA. Ecological dynamics and co-
occurrence among marine phytoplankton, bacteria and myoviruses shows
microdiversity matters. The ISME Journal. 2017;11(7):1614–1629.
[27] Tapolczai K, Vasselon V, Bouchez A, Stenger-Kovács C, Padisák J, Rimet F.
The impact of OTU sequence similarity threshold on diatom-based bioassess-
ment: A case study of the rivers of Mayotte (France, Indian Ocean). Ecology
and Evolution. 2019;9(1):166–179.
[28] Edgar RC. Updating the 97% identity threshold for 16S ribosomal RNA
OTUs. Bioinformatics. 2018;34(14):2371–2375.
[29] Salipante SJ, Kawashima T, Rosenthal C, Hoogestraat DR, Cummings LA,
Sengupta DJ, et al. Performance comparison of Illumina and ion torrent next-
generation sequencing platforms for 16S rRNA-based bacterial community
profiling. Applied and Environmental Microbiology. 2014;80(24):7583–7591.
[30] Kennedy NA, Walker AW, Berry SH, Duncan SH, Farquarson FM, Louis P,
et al. The impact of different DNA extraction kits and laboratories upon the
assessment of human gut microbiota composition by 16S rRNA gene sequenc-
ing. PloS One. 2014;9(2):e88982.
[31] Schirmer M, Ijaz UZ, D’Amore R, Hall N, Sloan WT, Quince C. Insight into
biases and sequencing errors for amplicon sequencing with the Illumina MiSeq
platform. Nucleic Acids Research. 2015;43(6):e37–e37.
[32] Walker AW, Martin JC, Scott P, Parkhill J, Flint HJ, Scott KP. 16S rRNA
gene-based profiling of the human infant gut microbiota is strongly influenced
by sample processing and PCR primer choice. Microbiome. 2015;3(1):26.
[33] Li J, Jia H, Cai X, Zhong H, Feng Q, Sunagawa S, et al. An integrated cat-
alog of reference genes in the human gut microbiome. Nature Biotechnology.
2014;32(8):834–841.
[34] Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Con-
treras M, et al. Human gut microbiome viewed across age and geography.
Nature. 2012;486(7402):222–228.
[35] Zou Q, Lin G, Jiang X, Liu X, Zeng X. Sequence clustering in bioinformatics: an empirical study. Briefings in Bioinformatics. 2018.
[36] Hao X, Jiang R, Chen T. Clustering 16S rRNA for OTU prediction: a method
of unsupervised Bayesian clustering. Bioinformatics. 2011;27(5):611–618.
[37] Jiang L, Dong Y, Chen N, Chen T. DACE: a scalable DP-means algorithm for
clustering extremely large sequence data. Bioinformatics. 2016;33(6):834–842.
[38] Kulis B, Jordan MI. Revisiting k-means: New algorithms via Bayesian nonparametrics. In: Proceedings of the 29th International Conference on Machine Learning; 2012. p. 513–520.
[39] Datar M, Immorlica N, Indyk P, Mirrokni VS. Locality-sensitive hashing
scheme based on p-stable distributions. In: Proceedings of the 20th Annual
Symposium on Computational Geometry. ACM; 2004. p. 253–262.
[40] Charikar MS. Similarity estimation techniques from rounding algorithms. In:
Proceedings of the 34th Annual ACM Symposium on Theory of Computing.
ACM; 2002. p. 380–388.
[41] Needleman SB, Wunsch CD. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. Journal of molecular
biology. 1970;48(3):443–453.
[42] Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence
comparison (I): statistics and power. Journal of Computational Biology.
2009;16(12):1615–1634.
[43] Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence com-
parison based on next-generation sequencing reads. Journal of Computational
Biology. 2013;20(2):64–79.
[44] Vinga S, Almeida J. Alignment-free sequence comparison—a review. Bioin-
formatics. 2003;19(4):513–523.
[45] Zhang M, Yang L, Ren J, Ahlgren NA, Fuhrman JA, Sun F. Prediction
of virus-host infectious association by supervised learning methods. BMC
Bioinformatics. 2017;18(3):60.
[46] Behnam E, Waterman MS, Smith AD. A geometric interpretation for local
alignment-free sequence comparison. Journal of Computational Biology.
2013;20(7):471–485.
[47] Jun SR, Sims GE, Wu GA, Kim SH. Whole-proteome phylogeny of prokary-
otes by feature frequency profiles: An alignment-free method with opti-
mal feature resolution. Proceedings of the National Academy of Sciences.
2010;107(1):133–138.
[48] Domazet-Lošo M, Haubold B. Alignment-free detection of local similarity
among viral and bacterial genomes. Bioinformatics. 2011;27(11):1466–1472.
[49] Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of Marko-
vian properties of molecular sequences from NGS data and applications to
comparative genomics. Bioinformatics. 2015;32(7):993–1000.
[50] Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW. Grinder: a
versatile amplicon and shotgun sequence simulator. Nucleic Acids Research.
2012;40(12):e94–e94.
[51] Slaney M, Casey M. Locality-sensitive hashing for finding nearest neighbors.
IEEE Signal Processing Magazine. 2008;25(2):128–131.
[52] Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the
web. Computer Networks and ISDN Systems. 1997;29(8-13):1157–1166.
[53] Behnam E, Smith AD. The Amordad database engine for metagenomics.
Bioinformatics. 2014;30(20):2949–2955.
[54] Goemans MX, Williamson DP. Improved approximation algorithms for maxi-
mum cut and satisfiability problems using semidefinite programming. Journal
of the ACM. 1995;42(6):1115–1145.
[55] Myung IJ. Tutorial on maximum likelihood estimation. Journal of Mathe-
matical Psychology. 2003;47(1):90–100.
[56] Chao KM, Pearson WR, Miller W. Aligning two sequences within a specified
diagonal band. Bioinformatics. 1992;8(5):481–487.
[57] Studholme C, Hill DL, Hawkes DJ. An overlap invariant entropy measure of
3D medical image alignment. Pattern Recognition. 1999;32(1):71–86.
[58] Sun Y, Cai Y, Huse SM, Knight R, Farmerie WG, Wang X, et al. A large-scale
benchmark study of existing algorithms for taxonomy-independent microbial
community analysis. Briefings in Bioinformatics. 2011;13(1):107–121.
[59] Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD Workshop on Text Mining; 2000.
[60] Almeida A, Mitchell AL, Tarkowska A, Finn RD. Benchmarking taxonomic
assignments based on 16S rRNA gene profiling of the microbiota from com-
monly sampled environments. GigaScience. 2018;7(5):giy054.
[61] Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM. Accuracy
and quality of massively parallel DNA pyrosequencing. Genome Biology.
2007;8(7):R143.
[62] Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley
RE, et al. A core gut microbiome in obese and lean twins. Nature.
2009;457(7228):480–484.
[63] Li J, Zhang J, Liu L, Fan Y, Li L, Yang Y, et al. Annual periodicity in
planktonic bacterial and archaeal community composition of eutrophic Lake
Taihu. Scientific Reports. 2015;5:15488.
[64] Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing
read simulator. Bioinformatics. 2011;28(4):593–594.
[65] Huse SM, Welch DM, Morrison HG, Sogin ML. Ironing out the wrinkles in
the rare biosphere through improved OTU clustering. Environmental Micro-
biology. 2010;12(7):1889–1898.
[66] Eren AM, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML.
Minimum entropy decomposition: unsupervised oligotyping for sensitive par-
titioning of high-throughput marker gene sequences. The ISME Journal.
2015;9(4):968–979.
[67] Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes
SP. DADA2: high-resolution sample inference from Illumina amplicon data.
Nature Methods. 2016;13(7):581–583.
[68] Edgar RC. UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv. 2016; 081257.
[69] Hathaway NJ, Parobek CM, Juliano JJ, Bailey JA. SeekDeep: single-base
resolution de novo clustering for amplicon deep sequencing. Nucleic Acids
Research. 2017;46(4):e21–e21.
[70] Nearing JT, Douglas GM, Comeau AM, Langille MG. Denoising the Denois-
ers: an independent evaluation of microbiome sequence error-correction
approaches. PeerJ. 2018;6:e5364.
Appendix A
Supplementary materials
A.1 Supplementary Figures
[Figure panels (a)–(f) omitted.]
Figure A.1: Experiments investigating different word counting models and word-size selection
Abstract
The 16S rRNA gene is central to metagenomic sequencing analysis as a marker gene for taxonomic classification, in which distinct 16S rRNAs reveal the profile of microbial composition in a given sample. Crucial for this kind of analysis is a type of algorithm that performs de novo OTU (Operational Taxonomic Unit) clustering, which generates OTU clusters from the 16S rRNA amplicon sequences based on inter-sequence similarities, without the help of known 16S rRNA reference data.

Several mainstream de novo OTU algorithms use a greedy heuristic approach that is prone to produce spurious OTUs, and part of the reason is the use of a hard-cut similarity threshold based on sequence alignment, such as 97%, to group sequences into OTUs. On the other hand, CROP is a statistical model-based method that applies a Bayesian approach to the clustering problem, and avoids the use of a hard-cut threshold. For this reason, CROP is able to generate OTUs more accurately, but at the same time it suffers from poor running time performance.

In this article we propose a new method, HCLUST, which also uses Bayesian clustering but with a more efficient implementation, and in addition, to make the algorithm even more scalable, it utilizes a divide-and-conquer approach to partition the input data into individual buckets, where the Bayesian clustering procedure can be carried out independently in each individual bucket. Importantly, the data partitioning occurs in a way that similar sequences are more likely to be assigned to the same bucket, via a special hashing scheme known as locality-sensitive hashing (LSH).

We performed multiple benchmarking studies comparing the OTU clustering accuracies between HCLUST and other mainstream algorithms, with datasets comprised of computer-simulated reads as well as real sequencing reads, and from samples based on different ecological systems. HCLUST, along with CROP, were identified as the most accurate algorithms across all of the datasets being tested. We further demonstrated that HCLUST runs more than 10 times faster than CROP on datasets with relatively large input sizes.