Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Sharpening the edge of tools for microbial diversity analysis
(USC Thesis Other)
Sharpening the edge of tools for microbial diversity analysis
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Sharpening the Edge of Tools for Microbial Diversity Analysis by Xiaolin Hao A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (MOLECULAR AND COMPUTATIONAL BIOLOGY) December 2012 Copyright 2012 Xiaolin Hao ii Table of Contents List of Tables iv List of Figures v Abstract vi Chapter 1: Introduction 1 Metagenomics and Microbial Diversity Analysis 1 Current Methods 1 Major Challenges 2 Our Solutions 3 Chapter 2: An Unsupervised Bayesian Clustering Method for OTU Prediction 7 Statement of Problem 7 Method 7 Gaussian Mixture Model 7 The Birth-Death Process 9 Markov Chain Monte Carlo (MCMC) Sampling 11 Prior Setting 13 Pairwise Alignment and Distance 14 Implementation 14 Computation 14 CROP Workflow 15 Results 18 Estimating the Number of OTUs 18 Validation Results Using a Dataset of 90 Artificial Bacteria Clones 20 Estimating Abundance Using the Human Skin Microbiome Dataset 23 Comparing Different Microbial Communities 24 Discussion 28 Chapter 3:Dealing with Bias in OTU Prediction Using 16S rRNA Gene Fragment Data 30 Statement of Problem 30 Method 31 Data Description 31 Mapping Procedure 32 OTU Prediction 33 iii Metrics for Performance Evaluation for Simulated Datasets 36 Performance Evaluation of Simulated Datasets and Model Building 37 Statistical Tests Used For Real Data 37 Diversity analysis 39 Results 39 Databases for mapping and extracting taxonomic information 39 Test of computational methods for integrating taxonomic information 40 Simulated Data 41 Experimental Data 45 Discussion 49 Chapter 4: Future Perspectives 50 References 52 iv List of Tables Table 1: Summary of Shotgun and Full-length Simulated Datasets 32 Table 2: E-value and Identity Cut-offs 33 Table 3: Comparison between the Complete Database and the Compressed Database 40 v List of Figures Figure 1: CROP Workflow 15 Figure 2: Comparison of the Number of OTUs Predicted by CROP, ESPRIT and Mothur 20 Figure 3: Comparison of the Number of OTUs Found by CROP, PyroNoise and Mothur 22 Figure 4: The Clustering Results (by CROP) of the Human Microbial Sequencing Data 24 Figure 5: The Taxonomy Information of Sebaceous, Moist and Dry Skin Locations 25 Figure 6: The Theta Index Values and the Phylogenetic Tree Shown as a Heatmap 27 Figure 7: The Rarefaction Curves for All Three Human Skin Types 28 Figure 8: Comparison of Four Computational Methods for Integrating Taxonomic Information 40 Figure 9: Bray-Curtis (A), Sensitivity (B) and Specificity (C) for Simulated Datasets 42 Figure 10: A Plot Showing Model Based on Bray-Curtis (A), Sensitivity (B) and Specificity (C) 44 Figure 11: Abundance Estimations of Major Phylum for All Four Data Types 46 Figure 12: Rarefaction Curve of Lean and Obese Samples 48 vi Abstract Metagenomics studies have prospered from the rapid development of next-generation sequencing. However, microbial diversity analysis as an essential component of metagenomics is still facing three major challenges: handling errors in data, performing analysis efficiently for large data and avoiding primer bias issue. Since 16S rRNA gene sequences have been frequently used to profile microbial diversity, we focus on this data and successfully provide solutions to all three challenges: our proposed unsupervised Bayesian clustering method termed Clustering 16S rRNA for OTU Prediction (CROP) can find clusters based on the natural organization of data without setting a hard cutoff threshold (3%/5%) as required by hierarchical clustering methods. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it efficiently produces more accurate results than conventional hierarchical clustering methods. We also built a generic model for comparing 16S rRNA gene fragment data extracted from metagenomic shotgun sequencing data with targeted 16S rRNA sequencing data. This model, when combined with future benchmarking studies, could help validating 16S rRNA gene fragment data’s ability to avoid primer bias and provide unbiased microbial diversity estimates. Our proposed analysis pipeline could also be implemented for future 16S rRNA gene fragment-based studies. 1 Chapter 1: Introduction 1.1. Metagenomics and Microbial Diversity Analysis In recent years, the development of next-generation sequencing technology has made it possible to directly sequence a huge amount of high-quality DNA/RNA fragments extracted from environmental samples in an acceptable time period [12, 41] and thus made a new area of research termed Metagenomics [29] to prosper. Metagenomics research directly studies the mixture of microorganisms in an environmental sample, and provides insight into “unculturable” portion of microbial community, which is now known to be an order of magnitude larger than what microbiologists have been studying using culturing techniques[12]. Meanwhile, analytical tools have been developed, including multiple sequence alignments [10, 24], pairwise alignments (global/local) [34], and clustering [43, 44, 49], by which we are able to address the fundamental problems in Metagenomics, in particular, the distribution, abundance and co-occurrence of microorganisms in given environmental samples. These problems are all categorized under the context of microbial diversity analysis. Recent studies have shown that such information may indicate environmental changes or disease conditions [3, 6, 35, 37, 47, 51]. 1.2. Current Methods 16S rRNA gene sequences are widely used to infer the phylogenetic relationship between organisms. That is, by comparing two 16S rRNA gene sequences, biologists can determine whether they belong to the same genus using a conventional threshold of 5% dissimilarity, or the same species using a conventional threshold of 3% dissimilarity, although the validity of using these thresholds has not been proven. Currently, 16S rRNA gene sequences for microbial diversity analysis are generated by using a targeted 2 sequencing technology. Universal primers are used to identify the 16S rRNA gene or its partial sequence of interest, followed by PCR amplification and next generation sequencing. Then by combining sequence comparison with hierarchical clustering methods, programs such as DOTUR [43] and its current version mothur [44], and ESPRIT [49] can partition 16S rRNA sequence data into clusters. Then, by using a predefined dissimilarity threshold, these methods can, in turn, report sequence clusters satisfying certain criteria (complete/average/single-linkage). In hierarchical clustering, the complete linkage and the average linkage methods are more widely used than the single linkage method [38]. 1.3. Major Challenges There are three major challenges for such analysis. First, these clustering methods are highly sensitive to the dissimilarity threshold such that even a slight change can result in very different clusters. Furthermore, the inevitable sequencing errors, as well as the unknown structure of microorganisms underlying an environmental sample, make it very difficult to find the optimal threshold for hierarchical clustering. For these reasons, the otherwise simple choice of 5% and 3% thresholds becomes problematic [20, 29] and may result in the overestimation of operational taxonomic units (OTUs). Quince et al [38] showed that hierarchical clustering methods overestimated the number of OTUs in a given sample when sequencing errors occurred, potentially leading to faulty conclusions. Sun et al [49] also explained that distance calculated from multiple alignment was, in general, larger than those calculated from pairwise alignments and thus might also result in overestimation of the number of OTUs. Recently, Schloss listed and analyzed most known factors which may result in overestimation of hierarchical clustering [42]. Many groups have already taken steps to address these problems. For example, mothur included a single linkage pre-clustering module to merge similar sequences before clustering [44], 3 and PyroNoise [38] was developed to remove sequencing errors to alleviate the overestimation of the number of OTUs. The second challenge comes from large data. Alongside the development of sequencing technology, metagenomic studies now allows the generation of billions of metagenomic shotgun sequences [37] or targeted 16S rRNA gene sequences directly from an environmental sample [3]. Considering the fast pace of the development of sequencing technology, we believe these numbers will only grow faster in the future. Such data explosion makes current computational methods a bottleneck for microbial diversity analysis since In the process of clustering, it is necessary to calculate a distance matrix in order to find the proper clusters to merge. Hierarchical clustering (and most of other clustering methods) needs a whole N× N matrix [22]. As a result, the computational complexity is at least quadratic, O(N 2 ), in terms of the number of sequences (N). Thus, for large-scale data sets, traditional clustering methods quickly become invalid in terms of computational time and memory usage. Although ESPRIT [49] has improved these factors by using an approximated k-mer distance to reduce the burden of calculating pairwise alignments and by parallelizing the pairwise alignment process on a computer cluster and clustering sequences on-the-fly, its algorithm is still hierarchical clustering and can only support complete linkage instead of average linkage clustering. The third challenge is related to the sequencing protocol used for extracting 16S rRNA gene sequences. Due to the lack of true “universality” for primers and difference between sub-regions of 16S rRNA genes, results of targeted 16S rRNA gene sequencing data surveys are affected by potential primer bias [50] and the type of data used, i.e., full-length or various hyper-variable regions [13, 54]. 1.4. Our Solutions For the first and the second challenge regarding clustering accuracy and efficiency, the area is calling for a more efficient and accurate method which can generate more 4 biologically meaningful clustering results from 16S rRNA sequences. Probabilistic model-based clustering algorithms are more robust to sequencing errors thus should perform better than hierarchical clustering with respect to the number of OTUs detected. There are several algorithms which have shown to be successful in clustering proteins into different functional families. For example, some research [2, 30] considered every column of a multiple alignment to be a sample from an underlying multinomial distribution, therefore, the product of the probability mass function of all columns could form a likelihood function. Another probabilistic approach called Markov Clustering [14] used random walk and matrix transformations on a given distance matrix to form clusters to detect protein families. However, these approaches currently cannot be applied directly to 16S rRNA sequence clustering, since the clustering results at different similarity levels are desirable as they correspond to different phylogenetic levels. Thus, applying a probabilistic approach to the clustering of 16S rRNA sequences requires the identification of parameter sub-spaces in which the optimal clustering results correspond to the partitions of the dataset at the desired phylogenetic level. Though aforementioned methods all have some parameters that can be tuned to achieve different clustering results, direct relationships between these parameters and similarity levels are hard to establish. Thus to address the first and the second challenge, we use the Gaussian mixture model to describe the data (Chapter 2). The key concept of our method replaces the mean value of a Gaussian distribution and instead uses a “center” sequence to characterize a specific cluster. Thus, if we consider the sequences as data points in a high dimensional space and we calculate the pairwise distances as the distance between two data points, then the probability that a sequence belongs to a cluster becomes a function of the distance between the sequence and the center [18]. The nature of Gaussian distributions can handle sequencing errors as well as sequence variations. However, by restricting the parameter space of the standard deviations of the Gaussian distributions, we could limit 5 our probabilistic search to the parameter sub-space in which the clustering results reflect the desired partitions of the datasets and, hence, the accurate number of underlying OTUs (Chapter 2 Methods). Based on this model, we can define the likelihood of the data and use a Markov Chain Monte Carlo (MCMC) approach to sample from the posterior distribution of the parameters to obtain the optimal clustering. The optimal result, which maximizes the posterior probability, will give all the quantities of interest, including the number of clusters, their relative abundance levels, and the sequences in each cluster. Richardson [39] and Stephens [48] have proposed MCMC methods to study the mixture model with an unknown number of components. In this application, we used a Markov birth-death process to build the Markov Chain with appropriate stationary distribution, as proposed by Stephens [48]. That is, in each step, a new cluster would be created, or an existing one would be deleted, according to which operation is more likely to increase the posterior probability. To enhance computational efficiency, we further introduced a hierarchical approach by splitting the data into small blocks, running Bayesian Clustering on each block independently, and then later merging these clustering results. We also introduced several criteria to reduce the burden of calculating Gaussian density functions, thereby accelerating the MCMC process (Chapter 2). For the third challenge, we recommend the usage of 16S rRNA gene fragments extracted from metagenomic shotgun sequencing data. Shogun sequencing data has natural immunity to primer bias, however, limited availability (<1%) of known bacterial genomes makes it difficult to associate sequencing reads with OTUs, and the highly dynamic structure of the underlying bacterial genomes [25] introduces extra complexity to the data analysis. On the other hand, 16S rRNA gene fragments may leverage the advantage of both 16S rRNA genes and metagenomic shotgun sequencing data. Previous surveys have shown that targeted 16S rRNA gene sequencing data tended to overestimate or underestimate the abundance of certain bacteria phyla when compared with 16S rRNA 6 gene fragments from the metagenome of the same community [51]. This difference is potentially caused by the primer bias of targeted sequencing, thus lending support to the use of 16S rRNA gene fragments. Some researchers have developed computational methods for α- and β-diversity analysis using this type of data [32, 45], while others have taken advantage of the current public 16S rRNA gene databases [5, 9, 36] to identify OTUs through database search. Despite the increasingly popular metagenomic analysis using this new type of data, its compatibility with traditional targeted 16S rRNA gene sequencing data is not well studied as a consequence of the difficulty of extracting 16S rRNA gene fragments from massive shotgun sequencing reads and the unknown sequencing coverage for the underlying metagenomes. More importantly, there is no current benchmarking which can prove its effectiveness. The only way of benchmarking in microbial diversity analysis is to build an artificial community consists of known microbes with known abundance. However, since universal primers are optimally designed for most known bacteria, thus we usually cannot observe significant primer bias issue for such artificial community. Even if we do, the observed difference may be arisen from the difference in the method of analyzing targeted 16S rRNA data and 16S rRNA gene fragments, not due to primer bias. As a result, lack of statistically rigorous comparison between 16S rRNA gene fragments and other data types makes it difficult to validate 16S rRNA gene fragments. Thus, it is of great interest to investigate this new data type and compare it with traditional 16S rRNA gene-based surveys. We performed extensive simulation studies using four different methods (Chapter 3) and successfully built a model to bridge the gap between 16S rRNA gene fragments and targeted 16S rRNA sequences with a simple conclusion: 16S rRNA gene fragments, with an average read length of at least 150bp, could provide the same level of resolution as full-length 16S rRNA gene sequences in terms of α-diversity, Bray-Curtis dissimilarity, sensitivity and specificity. 7 Chapter 2: An Unsupervised Bayesian Clustering Method for OTU Prediction 2.1. Statement of Problem Though widely used, current hierarchical clustering-based OTU prediction methods fail to deal with errors in data which result in overestimation of microbial diversity in environmental samples due to the hard cut-offs set by the algorithm to cut a hierarchical tree. Besides this, rapid development of sequencing technology is generating a data explosion which calls for more efficient methods to handle large data. Thus we want to design a clustering algorithm CROP (Clustering 16S rRNA For OTU Prediction) which has characteristics including: ability to tolerant sequencing errors; The method can get clustering results at desired phylogenetic levels by tuning certain parameters; There should not be a hard cutoff as in hierarchical clustering thus provide more flexibility to the method; A direct relationship between tunable parameters and phylogenetic (similarity) levels can be established and the method needs to be computationally and memory efficient in order to handle most current datasets). 2.2. Method 2.2.1. Gaussian Mixture Model We apply a Gaussian mixture model to 16S rRNA sequence data in which sequences x=(x 1 ,…,x N ) are assumed to be independently drawn from a mixture density with k clusters, where k is an unknown parameter, 22 1 ( | , , , ) ( ; , ), k i i i i p x k f x 8 where μ=(μ 1,…, μ k ) and σ 2 =(σ 1 2 ,…,σ k 2 ) are parameters for the centers and the variances, respectively, μ i and σ i 2 are specific to cluster i, and π=(π 1 ,…, π k ) are the nonnegative mixture proportions which sum up to 1. We introduce the missing data z=(z 1 ,…,z N ) in which z i specifies a mixture cluster to the observation of x i . Suppose that each z i is a realization of a discrete random variable Z i and that all Z i s (i=1,…,N) are independent and identically distributed with the following probability mass function (pmf), 2 ( | , , ) . ij p Z j Conditional on the latent variable Z=(Z 1 ,…,Z N ), the data {x 1 ,…,x N } are assumed to be independently observed from the density: 2 2 2 ( | , , , ) ( ; , ) ( ; , ). i i i j j i j j p x Z j p x f x In our studies, we consider the situation in which the density f(x i; μ j ,σ j 2 ) is a modified univariate Gaussian probability density function (pdf) with μ j as the center of the cluster and σ j 2 as the variance: 2 2 ( , ) 2 2 2 1 ( ; , ) , 2 ij j Dx i j j j f x e where D(x i , μ j ) denotes the distance between the sequence x i and the center of the j-th cluster μ j . We also follow common approaches to assume a hierarchical model for the priors (with standard conjugacy): | ~ ( ,..., ), k Dirichlet 2 ~ ( , ), i Inverse Gamma where μ i is chosen randomly from x 1 ,…,x N without replacement. The final form of the prior probability is 22 ( , , , | , , ) ( | , , ) ( | , ) ( | , , ) ( | , , , ) P k P k P k P k P k 9 and that of the posterior probability is 2 2 2 ( , , , | ) ( | , , , ) ( | , , ) ( , , | , , ). P k x P x k P k P 2.2.2. The Birth-Death Process To compute the parameters (π, μ, σ 2 ) which maximize P(π, μ, σ 2 |x), we apply the Markov Chain Monte Carlo (MCMC) method to construct an ergodic Markov Chain with the posterior distribution P(π, μ, σ 2 |x) as its stationary distribution. We design a Markov Birth-Death process as follows. When the process is at a certain step t i , and the number of existing clusters is k, the next step may be a “birth” with a probability P B /(P B +P D ), or a “death” with a probability P D /(P B +P D ), where P B and P D are the probability of a “birth” and a “death”, respectively. P B is given as a constant in our study, and P D is defined as: 1 , k Di i Pd where d i is the probability for a cluster to “die”. The way to compute d i is shown in a later section. Births: the (k+1)-th cluster is born with a new center μ k+1 randomly chosen from x 1 ,…,x N and μ k+1 ≠μ i (i=1,…,k). The choice of μ k+1 is based on the density: 1 2 1 ( ) , ( ; , ) ki i t t Px fx where (μ t ,σ t 2 ) are the parameters of the cluster to which x i is currently being assigned. Hence, μ k+1 is preferred over those x i s which are far away from the center of the cluster to which they currently belong. For other parameters of the new cluster, the priors are given as follows: 1 ~ ( , ), k Beta k 10 2 1 ~ ( , ). k Inverse Gamma Then the existing parameters need to be updated (i=1,…,k) by: 1 (1 ). i i k Deaths: an existing cluster dies independently of others with a probability ( 1) , () i iB L P NC k dP L P NC k where L -i indicates the likelihood without the i-th cluster and L indicates the likelihood with the i-th cluster: 2 1 1 ( ; , ), N k i j i i i j L f x 2 1, 1 ' ( ; , ). N k t i j i i i i t j L f x Let N c be the number of clusters. We use the Poisson distribution as its prior to simplify the form of d i , as suggested by Stephens [48]: ~ ( ). cB N Poisson P Then we have a simplified form for the death probability d i for each cluster: . i i L d L We update π j (j=1,..i-1,i+1,…,k) to reflect the death effect of the i-th cluster '. 1 j j i For more details about the Birth-Death process, refer to [48]. 11 2.2.3. Markov Chain Monte Carlo (MCMC) Sampling We update the parameters for the hierarchical mixture model in each step: 1) Update z by 2 2 ( , ) 2 2 ( |...) . ji i Dx j i i P z j e 2) Update π by 1 |... ~ ( ,..., ), k Dirichlet n n where (n 1,…, n k ) are the numbers of data points belonging to individual mixture clusters. In other words, for j=1,…,k, 1 1( ). N ji i n z j 3) Update σ 2 using the general approach as follows: 2 :1 2 ( , ) ~ ( , ). 22 j ji jz i i Dx n Inverse Gamma Richardson and Stephens suggested another hierarchical level over β 21 1 ~ ( ,( ) ), k i i g k h where g and h are hyperparameters. However, as Richardson and Green [39] mentioned, when applying the mixture model to the Bayesian classification, such update procedure would result in some clusters with a large variance, and its heavy tail would have a negative influence on the clustering result. This is because most methods assume that variances of clusters are similar to one another. Also, as mentioned before, restricting the variances is fundamental to obtain clustering results at different phylogenetic levels. In our method, we nested σ 2 by upper bound U and lower bound L. That is to say, if σ i 2 <L or σ i 2 >U, we would restart the updating procedure of σ j 2 as indicated below: 12 2 ~ ( , '), i Inverse Gamma where β’ is a given constant satisfying ' . 12 LU 4) Update μ: Since μ is a vector of sequences chosen from data instead of a common numerical vector, we sample the center for each cluster in the following way: For the j-th cluster, there are n j sequences. Each n j sequence is a potential candidate to be the center of this cluster. Let these sequences be X j ={x j1 ,…,x jnj }. The likelihood of this cluster is defined as 2 1 ( ) ( ; , ), j n j ji j j i L f x where μ j is the center of the cluster, σ j 2 is the variance, and f is the Gaussian probability density function. Thus, for two candidate centers μ j and μ j ’, we have 2 2 1 2 2 1 ( ') ( ; ', ) ( ' | , , ) , ( | , , ) ( ) ( ; , ) j j n j ji j j j j j j i n j j j j j ji j j i P f x P n x P n x P f x which suggests a brute force method to sample the center sequence. Assume that ( ) ( ') for all ' , j j j j P P X and then for each x ji ∈ X j , let δ i denote the probability of choosing x ji as the new center of this cluster 2 1 ( ; , ). j n i jl ji j l f x x So we can define 13 2 1 2 1 1 ( ; , ) ( ; , ) j j j n jl ji j l i n n jl ji j i l f x x f x x and then sample a new center for each cluster. 2.2.4. Prior Setting Based on the priors suggested by Richardson and Green [39] and Stephens [48], with some minor modifications, we set 1 2 0.2 g 2 100g h R 1 , B P where R=Max-Min is the variation of the data. L and U, as the lower and upper bounds for variances, are determined by the different levels of accuracy to be achieved. For the dissimilarity threshold at roughly 3%, we suggest setting L=1 and U=2.25 (mentioned as clustering at ~3% in later sections) so that the standard deviations of the Gaussian distributions in the mixture model will range from 1 to 1.5. For the threshold at 5%, we suggest setting L=2.25 and U=6.25 (mentioned as clustering at ~5% in later sections) so that the standard deviations of Gaussian distributions in the mixture model will range from 1.5 to 2.5. It is the nature of Gaussian distribution that 95% of its density falls into the interval [μ-2σ,μ+2σ]. As such, the choice of Us, as noted above, virtually guarantees that at least 95% of the sequences in a given cluster will be less than 3% (if U=2.25) or 5% (if U=6.25) dissimilar from the center sequence in the cluster. At the same time, the choice of Ls, as noted above, helps to maintain the assumption that all Gaussian distributions in a mixture model will have similar variances. Under these circumstances, the settings suggested above for both Ls and Us should give results comparable to those of conventional hierarchical clustering methods that used 3% and 5% dissimilarity cutoffs, respectively. Although our approach is still associated with previously mentioned 14 controversial 3% and 5% thresholds, it produces clusters with different standard deviations and its probabilistic nature fits better with real data. The settings of Ls and Us, as noted above, were used for all downstream analysis. Users can define their own Ls and Us in CROP. 2.2.5. Pairwise Alignment and Distance The Needleman-Wunsch algorithm[34] was used for pairwise alignment. However, we do not perform the dynamic programming on the whole matrix, only on a band near the diagonal with a width of 5% of the sequence length at each side. By doing this, we mainly focus on those sequence pairs with more than 90% similarity. The Quickdist algorithm [47], which considers consecutive gaps in an alignment as one gap, is used to calculate the pairwise distance from the alignment results. Every distance is calculated as a percentage number in our approach. That is to say, a distance of 5.2 indicates 5.2% dissimilarity. 2.3. Implementation 2.3.1. Computation The most time-consuming part of Bayesian clustering is the pairwise alignment. The second most time-consuming part of the process involves updating centers, which is O(N 2 /k) (where N is the number of sequences and k is the number of clusters), whereas calculating the Gaussian probability density functions for all iterations and sequences is also a costive step. In order to improve the computational efficiency of Bayesian clustering, we introduce three operations, as noted below: 1) When the calculated dissimilarity D(x i , x j ) between two sequences x i and x j is greater than 15%, we set the Gaussian probability density function involving this distance to be 0. That is to say, we consider 15 2 2 ( , ) 2 2 1 0 2 ij D x x e , if D(x i , x j )≥15%. 2) if the standard deviation of a cluster is σ (t) and in next iteration, the updated value σ (t+1) satisfies 0.9σ (t) ≤σ (t+1) ≤1.1σ (t) , then for this cluster, we consider 22 22 ( ) ( 1) ( , ) ( , ) 22 22 ( ) ( 1) 11 , 22 tt D x D x tt ee where μ is the center of this cluster and x is a sequence in this cluster. Update all cluster centers if the current iteration gives birth to a new cluster; otherwise, we do not update centers. In addition, we do not update cluster centers for those clusters consisting of fewer than 5 sequences. The first and second operations dramatically reduce the computational time for calculating the Gaussian pdfs, while the third operation avoids unnecessary updating of cluster centers. In practice, these operations do not compromise the accuracy of the results. 2.3.2. CROP Workflow Figure 1 shows a flowchart of CROP. First, the dataset is randomly split into blocks of 100-1,000 sequences each. Generally, a smaller block size is preferred for longer sequences, such as full-length 16S rRNA gene sequences, while a larger block size is preferred for shorter sequences, such as one single hyper-variable region. Then, an independent Bayesian clustering is applied to each block. A distance matrix is generated for each block using the pairwise alignment algorithm. We run 20*(block size) iterations of MCMC, considering the first 10*(block size) iterations as burn-in. From all the iterations after burn-in, we choose the one with the largest posterior probability and report it as our clustering result for this block. 16 Fig. 1. CROP workflow. At the next level of the hierarchical approach, every cluster in each block is treated as one sequence with the center sequence as the representative, and all these center sequences are pooled and further split into blocks. However, a slightly different distance 17 matrix is computed for each block. The matrix is a “center sequences against clusters” matrix, in which the (i, j) entry indicates the distance between the i-th center sequence and the j-th cluster. We calculate this distance by the average distance between the i-th center sequence and C randomly chosen sequences from the j-th cluster (C=20 is set for clusters with >20 sequences): 1 1 l C ij ij l DD C Then using this distance matrix, we apply a weighted Bayesian clustering on these clusters such that the weight is proportional to the size of the cluster, and this process will continue until one of the conditions noted below is satisfied. 1) The number of the clusters is more than 90% of the number of sequences. (This means most sequences are forming a cluster by themselves. Thus, split and merge process will not be able to reduce the dimension of the data efficiently any more.) 2) The number of the clusters is smaller than a predetermined threshold. 3) The process has been running for N times, where N is a predetermined threshold. After the previous process, we will be able to reduce the size of data to a much smaller scale. However, if we run one more round of Bayesian clustering on all the remaining clusters, we still have to face at least O(k 2 ) time complexity where k is the true number of OTUs for the data. Thus for highly diverse data, this could still be incapable. In order to better address these large and diverse data, we introduce another heuristic: 1) Separate all clusters at this stage into “abundant” and “rare” ones where “rare” is user-defined. For example, we may consider all clusters which contain less than 5 sequences are “rare”. 2) Calculate the distance between each pair of “rare” and “abundant” clusters using the same method as above. If the distance is less than 3*dissimilarity level we are 18 clustering at and the size of the “rare” cluster is less than 1% of the “abundant” one, merge this “rare” cluster into the “abundant” one. 3) After all the merging is done, do independent Bayesian clustering on “abundant” and “rare” set separately, and combine two results to constitute the final report. This deterministic heuristic takes advantage of the tail distribution of Gaussian and may significantly improve performance since most clusters for real data tend to be “rare” ones. However, this heuristic can do affect the accuracy of the result thus we doesn’t recommend using it unless the data is too diverse to handle. For all the following analysis, we consider “rare” cluster to be cluster with no sequence in it, thus not using this heuristic at all (All clusters are “abundant” in this case). As a probabilistic approach, CROP may get slightly different results in different runs. Thus, in later sections, 10 runs were performed for all experiments, and the result with the highest posterior probability was chosen if not specified. 2.4. Results 2.4.1. Estimating the number of OTUs We first validated our method using a dataset described by Huse et al[19], which consisted of amplicons of V6 regions from 43 16S rRNA templates which were at least 3% different from each other. The sequenced reads were further categorized into two datasets. The first one (A) contained only those reads that were within 3% of one of the 43 templates, and the second dataset (B) contained all the reads. These two datasets consisted of 191,387 and 202,340 reads, respectively. To obtain the ground truth, we ran hierarchical clustering (using ESPRIT) on 43 template sequences. Then we compared CROP, ESPRIT and mothur (with and without the single linkage pre-clustering) using datasets (A) and (B) to obtain the number of OTUs. Default parameters were used for ESPRIT and for mothur. For mothur, we used the same pairwise alignment program as 19 implemented in ESPRIT to calculate a phylip format distance matrix and then specified the average linkage in clustering. We ran the three programs on dataset (A) and the results were compared in Figure 2A. CROP identified 43 clusters when clustering at ~3% (By using variance interval [1, 2.25], if unspecified). In 10 runs, CROP averaged 43.5± 0.5 clusters, among which the result with 43 clusters gave the largest posterior probability. When clustering at ~5% (By using variance interval [2.25, 6.25], if unspecified), CROP produced 42 clusters. In 10 runs, CROP averaged 43± 0.8 clusters, among which the result with 42 clusters gave the largest posterior probability. Both results were in exact agreement with the ground truth. To judge the accuracy of the predicted clusters, we compared cluster center sequences with the template sequences and assigned the cluster center with its most similar template sequence. We found that each template sequences, both at 3% and 5% thresholds, was uniquely assigned by a cluster center in ~3% and ~5% results, respectively, indicating 100% accuracy. Then, we ran the three programs on the dataset (B). CROP identified 65 clusters when clustering at ~3%. In 10 runs, CROP averaged 67.7± 2.4 clusters. When clustering at ~5%, the number of clusters decreased to 45, and averaged 45± 1 in 10 runs. We also mapped the resulting center sequences to their nearest template sequences to access the accuracy of the results. In the ~5% result, 43 out of 45 cluster centers were less than 5% dissimilar from their nearest template sequences, while the remaining 2 cluster centers were more than 10% dissimilar from their nearest templates. These 2 clusters were quite small, containing <10 sequences. Similar observations were found for the ~3% results where 43 out of 65 cluster centers were less than 3% dissimilar from their nearest template sequences and each of the 43 template sequences is uniquely mapped by one of these 43 cluster centers, while the remaining 22 cluster centers were more than 7% from their nearest template sequences and all of them are small (with <10 sequences). 20 In comparison, CROP outperformed both ESPRIT and mothur, both of which overestimated the number of OTUs. As expected, mothur did a better job than ESPRIT when using the average linkage algorithm and the single linkage pre-clustering significantly improved the performance of the hierarchical clustering. However, in general, CROP is still more robust in dealing with sequencing errors and produces more accurate clustering results. Fig. 2. Comparison of the number of OTUs predicted by CROP, ESPRIT and mothur (with and without single linkage pre-clustering) using two sequencing datasets generated from 43 16S rRNA templates (by Huse et al. 2007). Dataset (A) contains reads that are within 3% of the templates, and Dataset (B) contains all the raw reads. 2.4.2. Validation results using a dataset of 90 artificial bacteria clones We also obtained Quince’s dataset [38] that consisted of 34,308 reads (12,360 unique reads) sequenced from V5 and V6 regions of 90 different clones of bacteria. This dataset contains very similar species (<3% difference) and thus is appropriate for testing CROP’s effectiveness on distinguishing closely related species. We applied CROP to this dataset using various intervals for the standard deviations of the Gaussian distributions. In 21 addition to ~3% and ~5%, interval [0.2,0.5] was, for example, used to compare with the 1% dissimilarity threshold in hierarchical clustering, and interval [0.5,1] was, for example, used to compare with the 2% dissimilarity threshold in hierarchical clustering. The results were compared with the ground truth, as well as with those obtained from mothur [44] with and without the single linkage pre-clustering (only unique reads used), and PyroNoise [38] followed by mothur (The dataset was pre-processed with PyroNoise in order to reduce the noise in original data to improve the performance of hierarchical clustering in this case). The input for mothur is the distance matrix we calculated using the pairwise alignment instead of the default multiple alignment. We chose the average linkage over the complete linkage in mothur, as the average linkage had been shown to produce better results by Quince et al [38]. As shown in Figure 3, mothur (pre-clustering) and PyroNoise (plus mothur) overestimated the number of OTUs. By mapping the cluster center sequences in CROP’s result to the clusters in the result of PyroNoise’s (plus mothur), we found that CROP achieved a result similar to that of PyroNoise’s (plus mothur) at the regions with ≥2% dissimilarity in terms of the number of clusters and cluster contents detected. However, as the threshold decreased, especially at ≤1% dissimilarity regions, CROP did overestimate the number of OTUs because the sequencing error rate was around 1%. The single linkage pre-clustering slightly improved the performance of the hierarchical clustering. Although the improvement was not as significant as the previous case, it did reduce the number of sequences to be clustered from 12,360 to 5,370, and thus could be still helpful in dealing with large data sets. In general, CROP achieved a performance similar to that of PyroNoise (plus mothur) at 5% and 3% and did so using significantly less computational time and without modeling the sequencing errors. PyroNoise alone used more than 1 day on cluster computers with 128 CPUs to process these datasets as 22 reported by the study, while CROP completed the work within 3 hours using a single CPU. Fig. 3. Comparison of the number of OTUs found by CROP, PyroNoise (plus mothur) and mothur, in which the results for CROP are shown both as a straight dashed line indicating the number of OTUs when using different parameters and as an approximated average lineage-through-time curve (CROP Assumption). 23 2.4.3. Estimating abundance using the human skin microbiome dataset In this application, we applied CROP to the human skin microbiome data by Grice et al. [17]. We first chose a skin site, the axillary vault (AV) of patient HV5 consisting of 1,130 nearly-full-length 16S rRNA sequences. Then we used the RDP classifier [5] to infer the taxonomy for each sequence. These results were considered to be the ground truth in this experiment. According to this ground truth, we found that the original result [17] underestimated the abundance of the genus Propionibacteria. Then we applied CROP to cluster this dataset at ~5%. To compare our clustering results with the ground truth, we assigned genera to each cluster by searching the center sequence in the RDP database. CROP identified 34 clusters. Figure 4A shows that our results are very similar to the ground truth and that among the 33 detected genera in the ground truth, 32 are uniquely mapped by 32 clusters in our results. The only difference between our results and the ground truth occurs in the class Actinobacteria, where 4 sequences belonging to the genus Williamsia are assigned to the genus Mycobacterium and Corynebacterium (Fig. 4B). All three of these genera are in the same suborder Corynebacterineae. In comparison, mothur (using the average linkage) produces 47 clusters, overestimating the number of genera by 43%. 24 Fig. 4. The clustering results (by CROP) of the human microbial sequencing data at the axillary vault skin location: (A) ~5% results and the ground truth and (B) more detailed results inside the “Other Actinobacteria”. 2.4.4. Comparing different microbial communities Finally, we applied CROP to three human skin microbiome datasets: sebaceous (including glabella, alar crease, external auditory canal, occiput, manubirum and back), moist (including nare, axillary vault, antecubital fossa, interdigital web space, inguinal crease, gluteal crease, popliteal fossa, plantar heel, and umbilicus) and dry (including volar forearm, hypothenar palm, and buttock). After clustering, every cluster center was searched against the RDP database to determine the taxonomy of this cluster. The results are shown in Figure 5. Betaproteobacteria is shown to dominate the dry locations, while Corynebacteria and Propionibacteria, both belonging to Actinobacteria, are shown to dominate the moist and sebaceous locations, respectively. 25 Fig. 5. The taxonomy information of sebaceous, moist and dry skin locations, as determined by CROP clustering results at ~5%. To study the species similarity and difference between different skin locations, while also to validate CROP at species level using real environmental data, we merged all sequences, clustered them using CROP at ~3%, and computed the Jaccard index value and the Theta index value [17] to see if we got consistent species-level result with previous studies. The Jaccard index value measures the sample membership by the proportion of shared OTUs between the two samples. The Theta index value measures the sample structure by taking OTU abundance levels into consideration. Basically, the 26 similarity of the two samples tends to increase as both Jaccard and Theta index values increase. These results show that the Jaccard index values of the two skin locations with the same environmental conditions (dry, moist or sebaceous) were not significantly different from those with different environmental conditions (P=0.53, two-sided t test); however, the Theta index values did show significant differences (P=2.33 ∗10 -25 , two-sided t test). These results suggest that the environmental conditions might not significantly affect the composition of the bacterial communities located on the human skin, but they could instead affect the relative abundance levels. The pairwise Theta index values between skin locations are shown in Figure 6B. Using these pairwise Theta index values as a distance measure, we clustered these 18 skin locations using hierarchical clustering (Figure 6A). The figure shows that skin locations are not clustered together strictly following their environmental conditions. For example, the antecubital fossa and the interdigital web space, both moist, are more similar to the volar forearm and the hypothenar palm, both dry, than to each other. Thus, it appears most likely that a spatial relationship dominates the environmental conditions in these cases, as all four skin locations are found on the human forearms. It might therefore be concluded that environments and locations on the human body co-determine the human microbiome distribution, a theory which agrees with what was found in another human skin microbiome study [6]. 27 Fig. 6 The Theta index values and the phylogenetic tree constructed based on these values. (A) Phylogenetic tree of 18 human skin locations with their environmental conditions annotated in brackets. (B) Theta index values shown as a heatmap. Rarefaction curves for each of the three types of skin locations are drawn in Figure 7, showing that the sebaceous locations are less diverse than the moist or dry locations with respect to the number of OTUs. These conclusions are consistent with a previous study [17]. Results from mothur (with pre-clustering, pairwise alignment, and the average linkage) are also shown in Figure 7 for comparison. Hierarchical clustering still overestimates the number of OTUs. 28 Fig. 7. The rarefaction curves for all three human skin types. 2.5. Discussion CROP provides a clustering tool that automatically determines the best clustering result for 16S rRNA sequences at different phylogenetic levels. Yet, at the same time, it is able to manage large datasets and to overcome sequencing errors. Our study shows that CROP gives accurate clustering results, both in terms of the number of clusters and their 29 abundance levels, for various types of 16S rRNA datasets. In contrast, the standard hierarchical clustering strategy, even with the pre-clustering process and the average linkage method, still frequently overestimates the number of OTUs in the presence of sequencing errors, resulting in an underestimation of the abundance level of the underlying OTUs. In addition, we demonstrate by using the human skin dataset that the results produced by CROP can provide compelling biological insights into different microbial communities. 30 Chapter 3: Dealing with Bias in OTU Prediction Using 16S rRNA Gene Fragment Data 3.1. Statement of Problem 16S rRNA gene fragments refer to metagenomic shotgun sequencing reads which can be mapped to a known 16S rRNA gene. The data itself is a subset of metagenomic shotgun sequences and is thus naturally immune to primer bias since no targeted priming is necessary for shotgun sequencing. Technically, this data was not widely used during the past mainly for two reasons: low coverage and lack of reference information. However, with the explosion of data in recent years and rapid development of reference 16S rRNA databases, the availability of this data should definitely be re-evaluated. Besides these technical problems, lack of benchmarking results also makes this data hard to convince researchers about its usefulness, especially when targeted 16S rRNA sequencing-based surveys have already proved to be successful. In this case, experimental benchmarking (which refers to designing an artificial microbial community and do both targeted and shotgun sequencing and compare the results) alone is not enough since universal primers are optimally designed for most known bacteria and thus may not be significant for an artificial community. Besides this, given the complexity of microbial communities, it is also highly possible that the difference between 16S rRNA gene fragments and targeted 16S rRNA sequencing data on a simple artificial community merely by chance or due to bias in the analysis methods used for 16S rRNA gene fragments. Thus in order to prove 16S rRNA gene fragments to be useful for microbial diversity analysis, we need to establish a generic model to compare the performance achieved by 16S rRNA gene fragments and targeted 16S rRNA sequencing data. Such model provides us with ability to compare the results of two data types generated from the same community, and the 31 resulting difference will only be due to technical biases. When combined with future experimental benchmarking, this model can help prove 16S rRNA gene fragments as an unbiased data source for estimating microbial diversity. In order to generate such a generic model, we first generated simulated datasets and studied the optimal way of extracting 16S rRNA gene fragments and assigning taxonomic information to them. Based on these analysis results, we performed a comparative analysis on all the simulated datasets and used Mann-Whitney U-test [28] to build the model. With the established model, we further used this model to compare the performance of different data types using real data [51]. The result suggests that the significant difference between microbial diversity estimation of different data types is very likely to be due to potential primer bias and thus encourages the usage of unbiased 16S rRNA gene fragments. 3.2. Method 3.2.1. Data Description Two data types were used in the study: simulated data and real data. Two groups of simulated data, as shown in Table 1, were used in the study. The first group of datasets consisted of a total of 540 simulated 16S rRNA gene fragment datasets with various read lengths (50 – 300bp) and sizes (100 – 10,000 sequences) generated by random sampling from 9,773 nearly full-length 16S rRNA gene sequences uniformly [51]. For each specific choice of read length and data size, 10 simulated datasets were generated in order to study the variation of results. The choice of these parameters reflected current technology in metagenomic sequencing. As such, the range of read length covered 50bp to 150bp from the Illumina sequencing platform to 200bp or longer from the 454 sequencing platform. Although we tested other read lengths between the numbers chosen, the results did not significantly improve. The dataset size followed an 32 assumption that the average coverage was less than 0.1x. The second group of datasets consisted of 5 simulations of full-length 16S rRNA gene sequence datasets, each with a different data size and 10 replicates. Each replicate was generated by the random sampling from the original 9,773 sequences. The sizes of these datasets were designed to provide comparable coverage (<0.1x) with the simulated 16S rRNA gene fragment datasets. Table 1. Summary of Shotgun and Full-length Simulated Datasets 16S rRNA Gene Fragments Full-length 16S rRNA Sequences Read Length (bp) 50,65,85,100,150,200 ~1200-1600 Data Size 100, 200, 400, 800, 1000,2000,4000,8000,10000 100,200,400,800,1000 Total Datasets 6x9x10=540 1x5x10=50 The real data consisted of nearly full-length 16S rRNA sequences, V2 amplicon of 16S rRNA gene, V6 amplicon of 16S rRNA gene, and 454 whole genome shotgun sequencing data introduced by Turnbaugh et al. [51]. This study used the following 18 samples, all having four data types: TS1, TS2, TS3, TS4, TS5, TS6, TS7, TS8, TS9, TS19, TS20, TS21, TS28, TS29, TS30, TS49, TS50 and TS51. 3.2.2. Mapping Procedure Two databases were used to map metagenomic shotgun sequencing data and extract taxonomic information. The first dataset, referred to as “complete database”, consisted of 1,428,381 high-quality 16S rRNA gene sequences annotated in RDP. The second dataset, referred to as “compressed database”, was built by clustering the 1,428,381 sequences at 3% distance threshold using a program called uclust [11]. The final size of the compressed database is 252,314. The two datasets were then built into BLAST databases. Sequences were shuffled before construction of the databases since they were originally grouped sequentially by their RDP taxonomy annotation. The 9,773 original 33 full-length sequences were removed to avoid self-hitting. BLASTN was used to map the data to the databases. We used very stringent E-value and similarity cut-offs, as shown in Table 2, so that boundary reads across the 16S gene and other parts of bacteria genomes could be excluded. Table 2. E-value and Identity Cut-offs Read Length (bp) E-value cut-offs Similarity cut-offs 50 1e-10 90 65 1e-13 90 85 1e-17 90 100 1e-20 90 150 1e-30 90 200 1e-40 90 Full-Length 1e-200 90 3.2.3. OTU Prediction By mapping metagenomic shotgun sequencing data to a large 16S gene database, we are able to extract out reads which are potential 16S rRNA gene fragments. However, a significant portion of the 16S rRNA gene fragments will get multiple hits in closely related taxonomic groups, even under very stringent criteria, i.e., similarity cut-off > 99% and mapping length > 95%, based on the presence of highly conserved regions in 16S rRNA genes. Methods have been proposed to handle multiple hits. One popular approach, which is used in MEGAN[21], assigns a read to the Lowest Common Ancestor (LCA) of its mapped taxonomic groups. In practice, however, this approach fails to assign most 16S rRNA fragments to the genus-level taxa. Thus, we propose the following four approaches where we only consider those BLAST hits with >90% of read length mapped and >95% similarity for the genus-level OTU analysis. 34 1) Maximum Likelihood Estimation (MLE) The likelihood model is described as follows: Assume that read , , has BLAST-hits in taxonomic group , . For the -th hit, let be the length of the local alignment reported by the BLAST, and let be the percentage of identities in the local alignment. Then the probability that read is generated from the taxonomic group is 1 ' 1 k m ij m Il P kl where is the length of the reference full-length 16S rRNA gene sequence. Let π j be the abundance of a taxonomic group in a sample, and let be the latent variable indicating that read is assigned to a specific taxonomic group . Thus for all the reads and taxonomic groups, the full likelihood function is then: 1 1 () N M i j ij j i L I Z j P A log transformation gives: 11 ( )(log log ) ' NM i j ij ij I Z j P Le 11 log ' ( )(log log ) NM i j ij ij L I Z j P In order to get the MLE of , we impute in the above formula with its conditional expectation: 1 ( ( ) | ) ( | ) : j ij i i ij M l il l P E I Z j data P Z j data T P A simple EM algorithm [8] is used to determine the maximum likelihood estimate of the abundance of a mixture. For the t-th iteration, 35 E-step: () () () 1 t j ij t ij M t l il l P T P M-step: ( 1) ( ) 1 1 N tt j ij i T N For most reads in our study, the reported BLAST hits in different taxonomic groups were not mathematically different in terms of such parameters as alignment length or percentage of identities, thus calling for this simple setting. Therefore, using a different likelihood model, or further fine tuning the above parameters, should not provide any significant difference in result. 2) A greedy algorithm for solving the Minimum set covering (MSC) problem The second approach belongs to the category of maximum parsimony. If we define taxonomic groups (such as genus) as “sets”, , and reads as “elements”, , then , or read , is covered by group , if read gets a BLAST hit in the taxonomic group . Using such definition, the objective of the maximum parsimony becomes finding the smallest number of taxonomic groups that can cover all the reads, which is exactly the same as the minimum set covering problem, which is known to be NP-complete[23]. Since a polynomial-time solution does not exist, we use a greedy algorithm to determine a covering set. The greedy algorithm first selects a set covering the maximum number of uncovered elements and then repeatedly adds a new set which can cover the maximum number of uncovered elements until every element is covered by the selected sets. Finally, we select a set of taxonomic groups (sets) that explains all sequenced reads. 36 To estimate the abundance of the selected taxonomic groups, we weight each read by the following rule: if a read hits selected groups, it carries weight in each group. Thus the abundance of a taxonomic group is estimated as the total weights of its reads. Since this approach is clearly biased to high-abundance OTUs, the results are expected to be high in specificity and low in sensitivity. 3) Maximum Likelihood Estimate of Minimum set cover (MSC & MLE) This approach is similar to the aforementioned greedy algorithm, except that we apply an EM algorithm to estimate the abundance of the selected taxonomic groups. 4) Nearest Neighbor (NN) Each taxonomic group consists of multiple 16S rRNA gene sequences to which a read may have multiple hits. The nearest neighbor approach assigns each read to the taxonomic group with the highest BLAST score, or, if two or more groups have the same score, to all of those with equal weights summing up to 1. The abundance of a taxonomic group is the number of the weights of its reads. 3.2.4. Metrics for Performance Evaluation for Simulated Datasets Based on their taxonomy in RDP, we consider the abundance estimation of the original 9,773 nearly full-length 16S rRNA gene sequences to be the ground truth. Several metrics, including Bray-Curtis dissimilarity, sensitivity and specificity, are used to compare the ground truth with abundance estimated from each simulated dataset using the NN method. Here, sensitivity is defined as the percentage of the taxonomic groups in the ground truth found by analyzing the simulated data. Specificity is defined as the percentage of the taxonomic groups estimated from the simulated data which are also in the ground truth. 37 3.2.5. Performance Evaluation of Simulated Datasets and Model Building Implementing the above three performance metrics is conditional on addressing the question based on the result from the taxonomic analysis of n 16S rRNA gene fragments of an average length of k-bps, i.e., the size of a full-length 16S rRNA gene sequence dataset that can produce an equivalent result. To address this question, we use the approach described below. First, we denote 1) the simulated 16S rRNA gene fragment dataset with read length k 1 and read number n 1 as dataset and 2) the simulated full-length 16S rRNA gene sequence dataset with read number n 2 as dataset below. Since each such defined dataset has 10 replicates, we can use the Mann-Whitney U-test [28] to test the difference between each simulated 16S rRNA gene fragment dataset and each full-length 16S rRNA gene sequence dataset measured by the Bray-Curtis dissimilarity, sensitivity and specificity, using 5% significance level We define two datasets, i.e., datasets and dataset to be equivalent if (1) there is no significant difference between their performance measures, in terms of sensitivity, specificity, or the Bray-Curtis dissimilarity, or (2) dataset gives better performance than dataset , but worse performance than dataset where . 3.2.6. Statistical Tests Used For Real Data 1) Likelihood Ratio Test (LRT) We use a Likelihood Ratio Test (LRT) to determine if the observed difference between the genus-level abundance estimation of 16S rRNA gene fragments and that of full-length 16S rRNA gene sequences is from low coverage (null hypothesis) or some other biases (alternative hypothesis). If we assume that each dataset is a sample from a multinomial distribution , the maximum likelihood estimates of the cell probabilities are then the corresponding genus-level abundance estimation. Under the null hypothesis, this abundance estimation is determined by 38 estimating the abundance of different genera after pooling two datasets together. The test statistic is the log ratio of the likelihood under the null and the alternative hypotheses: where N 1 and N 2 are the numbers of the 16S rRNA gene fragments and the full-length 16S rRNA sequences for this sample, respectively. k 1 and k 2 are the numbers of detected genera, and the estimated abundances of the i-th detected genera, and n 1i and n 2i the numbers of sequences assigned to the i-th genus, respectively. is the pooled estimate of the genus abundance. k is the number of the genera detected by at least one sample. This test statistic follows a chi-square distribution with the degree of freedom equal to k 1 +k 2 -k-1 asymptotically. A p-value is thus calculated correspondingly. If we assume that the two samples are drawn from the same underlying multinomial distribution and there is no other factor affecting the sampling process, then the test will not reject the null hypothesis, even if the coverage is low (small N 1 and N 2 ). Since there are 18 samples to be tested simultaneously, the p-value is then corrected using the Benjamini and Hochberg False Discovery Rate (FDR) [1]. Although the following analysis is for the genus-level only, the test can be applied to any taxonomic level by simply using the abundance estimation at the desired taxonomic level. 2) Wilcoxon signed-rank test [53] Wilcoxon signed-rank test is used to find the genera whose abundance estimation is significantly different between two different data types. The abundance estimation of a certain genus for a data type forms an 18-dimensional vector representing the estimated 1 2 1 2 1 2 1 2 1 2 1 2 12 1 1 1 1 1 2 2 1 2 1 1 1 1 ( )( ) 2log 2log ( )( ) i i i i i i i i k k k n n n n i i i i i i k k k k n n n n i i i i i i i i N p N p p N p N p p p 39 abundance levels from the 18 samples. The Wilcoxon signed-rank test is then performed on each pair of such vectors with a significant difference of 0.05. 3.2.7. Diversity analysis Mothur [44] software with default parameter settings is used for diversity analysis. First, the RDP-aligned best hit full-length 16S sequences (determined as described above) for each 16S rRNA gene fragment is put into a fasta file, while information, which identifies the sequence from which the sample is generated, is stored in a separate “group” file. These 2 files are then imported into mothur for downstream analysis, including the hierarchical clustering and the rarefaction curve calculation. 3.3. Results 3.3.1. Databases for mapping and extracting taxonomic information We compared the mapping results using the complete database with those using the compressed database (Chapter 3 Methods). As shown in Table 2, we measured the mapping speed by how many bases could be mapped per second on a single Opteron 885, 2.6 GHz CPU when using two difference databases. Given the results, we would recommend using the compressed database since the mapping process is at least several orders of magnitude faster than using the complete database, while the performance, in terms of Bray-Curtis dissimilarity, sensitivity and specificity, is not significantly different (two-sided t-test, p-value <0.0001 for all three metrics). 40 Table 3. Comparison between the Complete Database and the Compressed Database Complete Database Compressed Database Bray-Curtis Dissimilarity 0.1205142± 0.064813 0.1213089± 0.065371 Sensitivity 0.620783± 0.203592 0.604533± 0.227941 Specificity 0.877164± 0.019811 0.879328± 0.020517 Mapping Speed (Base/Sec) 55 6 3.3.2. Test of computational methods for integrating taxonomic information Figure 8 compares the performance of the four computational methods. Each colored point indicates the sensitivity and specificity achieved for applying a specific method into one of the 540 simulated shotgun datasets. Fig. 8. Comparison of four computational methods for integrating taxonomic information, where MLE stands for the maximum likelihood estimation method, MSC stands for the Minimum Set Covering method, MSC-MLE stands for the MSC method followed by the MLE method, and NN stands for the nearest neighbor method. 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Specificity Sensitivity MSC & MLE MSC NN MLE 41 Overall, results suggest that the nearest neighbor (NN) method produces the most accurate results. The maximum likelihood estimate (MLE) method consistently has the lowest specificity, while the greedy algorithm (MSC) consistently has the lowest sensitivity. The MSC-MLE method has better sensitivity than the MSC method, but with lower specificity. The NN method achieves the best combination of sensitivity and specificity (the most upper-right corner). Also, in terms of Bray-Curtis dissimilarity, the NN method consistently achieves the best performance, which indicates the most accurate overall abundance estimation. Although we lose some specificity for datasets with short read length, with current sequencing platforms where read length usually ranges from 100 to 400, NN approach is the best choice from the above results Thus, in later sections, all the results are generated using this approach by default. 3.3.3. Simulated Data On average, 99.5% of the simulated 16S rRNA gene fragments were mapped. For the 10 datasets generated from each combination of read length and the read number, we calculated the average Bray-Curtis dissimilarity, sensitivity and specificity based on the weighted-nearest-neighbor method. Results are shown in Figure 9. 42 Fig. 9. Bray-Curtis dissimilarity (A), Sensitivity (B) and Specificity (C) for all simulated datasets. The result in Figure 9 shows that the read number is the factor dominating sensitivity, while read length is negligible. Thus, by carefully choosing mapping criteria, reads as short as 50bp can achieve sensitivity similar to the full-length sequences. As such, the Illumina platform with its high throughput will have the best sensitivity in detecting new taxa. However, in terms of the Bray-Curtis dissimilarity and specificity, both read length and the read number play an important role. Together, they give a better sense about the 43 accuracy of the abundance estimation. As expected, when read number is high, specificity is low, mainly because more reads will fall into highly conserved regions of 16S rRNA genes, and these reads are harder to be correctly assigned. In addition, when reads are longer, the data tend to have better immunity to this specificity-diminishing effect. This can be largely explained by the fact that longer reads are more likely to contain taxon-specific information and are thus less likely to be assigned incorrectly. Although the above results show that the simulated 16S rRNA gene fragments can produce highly accurate abundance estimation, the conclusion only holds true for the community represented by the 9,773 full-length 16S sequences from which we sampled the data. Both changes in community membership and structure could affect the above results. For example, based on the above results and given the limited genus-level diversity for this specific community, we conclude that a shotgun sample with an average read length of 200 bp and ~300 reads mapped to a 16S database provides an abundance estimation equivalent to 1000 full-length 16S sequences generated from the same community. However, this is apparently not true if the community is more diverse, say, containing 500 genera. Thus, we used a comparative method to predict the abundance estimation accuracy when using the 16S rRNA gene fragments. By comparing the results of the simulated 16S rRNA gene fragments and full-length samples, we are able to establish a relationship between them (Figure 10). This model was built by using pair-wise Wilcoxon rank-sum test to test the significance of the difference between Bray-Curtis dissimilarity, sensitivity and specificity of full-length and shotgun samples at 5% significance level (Chapter 3 Methods). From these results, we draw a conclusion that when read length is ≥150bp, n 16S rRNA gene fragments would achieve the same level of accuracy as n full-length 16S rRNA sequences. In other words, it is highly likely that a read from a 16S rRNA gene ≥150bp in length would contain enough information to characterize the whole gene. 44 Fig. 10. A plot showing model based on Bray-Curtis (A), Sensitivity (B) and Specificity (C). 45 Depending on the average read length, another useful conclusion based on the model is that exhaustive sequencing of environmental samples may not be a good choice for detecting new or rare taxa by low specificity. For example, sequencing billions of 50bp reads from a sample may easily give a sensitivity of 1.0, but because of low specificity, there is no way to tell if a detected taxon is real or not. For current technologies, 454 or pair-end Illumina sequencing may be the best platform for such detecting purposes since they could achieve read length >150bp (pair-end for illumina) and thus, according to our conclusion, n 16S rRNA fragments data from these two platforms would achieve the same level of accuracy as n full length 16S rRNA sequences. 3.3.4. Experimental Data We studied the human gut microbiome data introduced by Turnbaugh et al. [51]. There were 18 samples which had all four data types (WGS, full-length 16S, V2 region and V6 region) and were thus used for comparative study. The phylum-level abundance estimations for all four data types are shown in Figure 11. According to our model shown in Figure 10, the shotgun data (>200bp, (standard deviation) 16S rRNA fragments) should have a level of accuracy similar to the full-length data ( full-length 16S rRNA sequences). However, the Bray-Curtis dissimilarity between the two data is highly significant, using the two-sided t-test with p-value < 10 -5 . This could result from potential technical bias in 16S rRNA gene sequencing or just by chance due to that the amount of reads for both are low. Therefore, we used a likelihood ratio test (see Methods) to test if the low read number alone provides strong support to the observed difference between the genus-level abundance estimation of the 16S rRNA gene fragments and full-length 16S rRNA sequences. After correcting for multiple testing, 15/18 results showed that the coverage was not a significant predictor (FDR < 0.05, with an average FDR of 0.00076). For samples TS1, TS8 and TS9, the test gives FDR values of 0.06, 0.05 and 0.16, respectively, which are 46 not significant at level 0.05. However, these results are significant enough to show that the difference between the abundance estimation of the 16S rRNA gene fragments and full-length 16S rRNA gene sequences for this gut microbiome data does not, at least merely, result from the low read number, but rather some other source of technical bias. Comparing sequencing protocols of the two data, we tend to believe that the most likely source of bias occurred during priming. Thus the observed difference between the two data reflects the effect of primer bias on abundance estimation. These results are also consistent with the original study where the significant difference among the full-length, V2, V6 and 16S rRNA gene fragment data was reported [51]. Fig. 11. Abundance estimations of major phylum for all four data types. It was also claimed in the original survey that the full-length and V6 data tended to show a depletion of Bacteroidetes, while the V2 and 16S fragments data tended to show a 47 relative depletion of Firmicutes when compared with the full-length and V6 data. In order to prove this more rigorously, we used the Wilcoxon signed-rank test to test the significance of the difference between the abundance estimation of each genus in the 16S fragments, full-length, V2 and V6 data. The results suggested that the abundance estimation of the 13 genera supported by the 16S fragments data was significantly different from the other three data types. Among them, 3 genera belonged to the phylum Bacteroidetes in which the 16S rRNA gene fragments data gave significantly higher abundance estimation than the other three. 9 genera belonged to the phylum Firmicutes. For these 9 Firmicutes genera, 16S rRNA gene fragment estimation was lower than the full-length and V6 data. We also found 1 genus belonged to the phylum Actinobacteria in which the abundance estimation of the 16S rRNA gene fragments data were higher than the full-length data, but lower than the V2 and V6. The above analysis suggests that the abundance estimation using the 16S rRNA gene fragments would be significantly different from those using other data types due to potential primer bias and for human gut microbiome, the difference mainly resides in the phylum Bacteroidetes in which the abundance estimated from the 16S fragments is higher and the phylum Firmicutes in which the abundance estimated from the 16S fragments is lower. Above analysis suggests that primer bias could result in highly biased abundance estimation. Furthermore, for validation purposes, we used 16S rRNA gene fragments data to reproduce another previous conclusion reached by using only the V6 data before [51], indicating that the diversity of the gut microbiome in the lean population is higher than that in the obese population (Chapter 3 Methods). The result is shown in Figure 12. This result suggests that even if abundance estimation is highly biased, the overall high-level conclusions reached using targeted 16S rRNA sequencing data may still hold true. However, it is also easily imagined that due to the significant bias in abundance 48 estimation, the importance of a taxonomic group will be over-/under-estimated (as Firmucutes and Bacteroidetes in human). Results of any method (Bray-Curtis Dissmilarity, weighted-UniFrac [27], etc) taking the abundance estimation into account will be biased when using targeted 16S rRNA gene sequencing data thus need to be treated cautiously. Fig. 12. Rarefaction curve of lean and obese samples; error bars for 95% confidence interval. 0 100 200 300 400 500 600 700 800 900 1000 0 500 1000 1500 2000 2500 3000 3500 4000 Number of OTUs at 3% Number of Sequences Lean Obese 49 3.4. Discussion By building up a generic model, we bridged the gap between 16S rRNA gene fragments and targeted 16S rRNA sequencing data. Our conclusion that “a 16S rRNA gene fragment with >150bp in length provides same accuracy as a full-length 16S rRNA sequence” could serve as a good starting point for experimental design and make the comparison between 16S rRNA gene fragment-based and targeted 16S rRNA sequencing-based surveys possible. With further development of computational methods [32, 45] and databases [5, 9, 36], we believe this new data type may further improve our understanding about the diversity of various microorganism communities. 50 Chapter 4: Future Perspectives With our research focus on improving current metagenomic analysis tools, especially in the context of microbial diversity analysis, we successfully provide solution to deal with three major challenges in the area: improving clustering accuracy and efficiency and avoid primer bias. In the near future, our research will focus on validating previous targeted 16S rRNA sequencing survey results using our improved tools and help the community build a solid foundation to perform large scale microbial diversity analysis projects on. Looking even further, microbial diversity analysis is just a component of metagenomics, though an essential one. Generally, the two major questions of interest in studying a microbial community are “what are they” and “what do they do”, while microbial diversity analysis can only answer the first one. By incorporating with large annotated gene databases, metagenomic shotgun sequencing data could also provide a pool of potential functionalities for a specific microbial community. In case of mammalian gut microbiome, these function profiles show more consistency among different individuals than taxonomic profiles [33, 51]. However, even if we can generate a general function profile for a community, nowadays we still lack an efficient way to connect these gene function information to taxonomic information in the community. Thus we are only able to answer the question “what does a community do” but not “what does a microbe do in a community”. This problem may not be easily solved computationally. Traditional binning methods [4, 31], due to their limited accuracy at species level, could not satisfy current demand. Since they require either long reads or use prior knowledge about taxonomic abundance information, this nature will prevent them from being implemented for future metagenomic RNA-seq projects. Thus the area is calling for new methods which can establish solid connection between a gene (and its function) and a taxonomic group (or an 51 OTU). In order to statistically model this connection, we need multiple samples of the same community. For example, recently researchers studied the effect of antibody treatment on human gastrointestinal micobiota [40]. We believe such treatment which can result in change in community structure could provide us with a good source for establishing aforementioned connection between genes and taxonomic groups. The development of biotechnology has also indicated a new path for solutions. Single-cell sequencing [26] and single-molecule sequencing [7] are the two most attractive technology for researchers. These new technology could provide researchers with sequence data generated from a single cell or ultra-long reads generated from a single molecule. In terms of metagenomics, we can link genes (and their functions) to a taxonomic group by directly looking at single-cell sequencing data or single-molecule sequencing data (if a whole bacteria genome can be sequenced as one read). This may all completely change the way we conduct metagenomic studies today. However, new technology always comes with new computational challenges. We had normalization problem for Microarray analysis [46], base calling problem for next generation sequencing technology [15, 16] , bias in RNA-seq data [52], etc. Thus, researching proactively in such potential challenges could also benefit and provide strong support to the whole community. 52 References 1. Benjamini Y , Hochberg Y: On the Adaptive Control of the False Discovery Rate in Multiple Tesing WIth Independent Statistics. J Edu Behav Stat 2000, 25:23. 2. Brown DP: Efficient functional clustering of protein sequences using the Dirichlet process. Bioinformatics 2008, 24:1765-1771. 3. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A 2011, 108 Suppl 1:4516-4522. 4. Chatterji S, Yamazaki I, Bai Z, Eisen JA: CompostBin: a DNA composition-based algorithm for binning environmental shotgun reads. In Book CompostBin: a DNA composition-based algorithm for binning environmental shotgun reads (Editor ed.^eds.). pp. 17-28. City: Springer-Verlag; 2008:17-28. 5. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 2009, 37:D141-145. 6. Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R: Bacterial community variation in human body habitats across space and time. Science 2009, 326:1694-1697. 7. Deamer DW, Akeson M: Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol 2000, 18:147-151. 8. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 1977, 39:1-38. 9. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006, 72:5069-5072. 10. DeSantis TZ, Jr., Hugenholtz P, Keller K, Brodie EL, Larsen N, Piceno YM, Phan R, Andersen GL: NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res 2006, 34:W394-399. 11. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26:2460-2461. 12. Eisen JA: Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol 2007, 5:e82. 13. Engelbrektson A, Kunin V , Wrighton KC, Zvenigorodsky N, Chen F, Ochman H, Hugenholtz P: Experimental factors affecting PCR-based estimates of microbial species richness and evenness. ISME J 2010, 4:642-647. 14. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30:1575-1584. 15. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8:186-194. 53 16. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8:175-185. 17. Grice EA, Kong HH, Conlan S, Deming CB, Davis J, Young AC, Bouffard GG, Blakesley RW, Murray PR, Green ED, et al: Topographical and temporal diversity of the human skin microbiome. Science 2009, 324:1190-1192. 18. Hao X, Jiang R, Chen T: Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering. Bioinformatics 2011, 27:611-618. 19. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 2007, 8:R143. 20. Huse SM, Welch DM, Morrison HG, Sogin ML: Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environmental Microbiology 2010, 12:1889-1898. 21. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17:377-386. 22. Johnson SC: Hierarchical clustering schemes. Psychometrika 1967, 32:241-254. 23. Karp RM (Ed.). Reducibility Among Combinatorial Problems. New York: Plenum; 1972. 24. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33:511-518. 25. Lan R, Reeves PR: Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends Microbiol 2000, 8:396-401. 26. Lasken RS: Single-cell genomic sequencing using Multiple Displacement Amplification. Current Opinion in Microbiology 2007, 10:510-516. 27. Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R: UniFrac: an effective distance metric for microbial community comparison. ISME J 2011, 5:169-172. 28. Mann HB, Whitney DR: On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Annals of Mathematical Statistics 1947, 18:50-60. 29. Marco D (Ed.). Metagenomics: Theory, Methods and Applications.: Caister Academic Press; 2010. 30. Marttinen P, Corander J, Toronen P, Holm L: Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics 2006, 22:2466-2474. 31. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4:63-72. 32. Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF: EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol 2011, 12:R44. 33. Muegge BD, Kuczynski J, Knights D, Clemente JC, Gonzalez A, Fontana L, Henrissat B, Knight R, Gordon JI: Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science 2011, 332:970-974. 34. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. 54 35. Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V , McEwen JE, Wetterstrand KA, Deal C, et al: The NIH Human Microbiome Project. Genome Res 2009, 19:2317-2323. 36. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILV A: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 2007, 35:7188-7196. 37. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, et al: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010, 464:59-65. 38. Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT: Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods 2009, 6:639-641. 39. Richardson S, Green PJ: On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B (Methodological) 1997, 59:731-792. 40. Robinson CJ, Young VB: Antibiotic administration alters the community structure of the gastrointestinal micobiota. Gut Microbes 2010, 1:279-284. 41. Rothberg JM, Leamon JH: The development and impact of 454 sequencing. Nat Biotechnol 2008, 26:1117-1124. 42. Schloss PD: The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol 2010, 6:e1000844. 43. Schloss PD, Handelsman J: Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol 2005, 71:1501-1506. 44. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, et al: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009, 75:7537-7541. 45. Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS: PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput Biol 2011, 7:e1001061. 46. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods 2003, 31:265-273. 47. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A 2006, 103:12115-12120. 48. Stephens M: Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods. Annals of Statistics 2000, 28:40-74. 49. Sun Y , Cai Y , Liu L, Yu F, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species 55 richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res 2009, 37:e76. 50. Suzuki MT, Giovannoni SJ: Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol 1996, 62:625-630. 51. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, et al: A core gut microbiome in obese and lean twins. Nature 2009, 457:480-484. 52. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10:57-63. 53. Wilcoxon F: Individual comparisons by ranking methods. Biometrics Bulletin 1945, 1:80-83. 54. Youssef N, Sheik CS, Krumholz LR, Najar FZ, Roe BA, Elshahed MS: Comparison of species richness estimates obtained using nearly complete fragments and simulated pyrosequencing-generated fragments in 16S rRNA gene-based environmental surveys. Appl Environ Microbiol 2009, 75:5227-5236.
Abstract (if available)
Abstract
Metagenomics studies have prospered from the rapid development of next-generation sequencing. However, microbial diversity analysis as an essential component of metagenomics is still facing three major challenges: handling errors in data, performing analysis efficiently for large data and avoiding primer bias issue. Since 16S rRNA gene sequences have been frequently used to profile microbial diversity, we focus on this data and successfully provide solutions to all three challenges: our proposed unsupervised Bayesian clustering method termed Clustering 16S rRNA for OTU Prediction (CROP) can find clusters based on the natural organization of data without setting a hard cutoff threshold (3%/5%) as required by hierarchical clustering methods. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it efficiently produces more accurate results than conventional hierarchical clustering methods. We also built a generic model for comparing 16S rRNA gene fragment data extracted from metagenomic shotgun sequencing data with targeted 16S rRNA sequencing data. This model, when combined with future benchmarking studies, could help validating 16S rRNA gene fragment data’s ability to avoid primer bias and provide unbiased microbial diversity estimates. Our proposed analysis pipeline could also be implemented for future 16S rRNA gene fragment-based studies.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Clustering 16S rRNA sequences: an accurate and efficient approach
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Enhancing phenotype prediction through integrative analysis of heterogeneous microbiome studies
PDF
Computational algorithms and statistical modelings in human microbiome analyses
PDF
Predicting virus-host interactions using genomic data and applications in metagenomics
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Exploring the genetic basis of complex traits
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Exploration of human microbiome through metagenomic analysis and computational algorithms
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Deep learning in metagenomics: from metagenomic contigs sorting to phage-bacterial association prediction
PDF
Techniques for de novo sequence assembly: algorithms and experimental results
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Isoform quantification and splicing regulation analysis in RNA-seq studies
PDF
Integrating high-throughput sequencing data to study gene regulation
Asset Metadata
Creator
Hao, Xiaolin
(author)
Core Title
Sharpening the edge of tools for microbial diversity analysis
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/27/2012
Defense Date
04/16/2012
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Bayesian,bioinformatics,clustering,computational biology,metagenomics,OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Ting (
committee chair
), Finkel, Steven E. (
committee member
), Liu, Yan (
committee member
), Sun, Fengzhu Z. (
committee member
), Wang, Kai (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
haoxiaolin86@gmail.com,xiaolinh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-120471
Unique identifier
UC11292438
Identifier
usctheses-c3-120471 (legacy record id)
Legacy Identifier
etd-HaoXiaolin-1351.pdf
Dmrecord
120471
Document Type
Dissertation
Rights
Hao, Xiaolin
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
Bayesian
bioinformatics
clustering
computational biology
metagenomics