TECHNIQUES FOR DE NOVO SEQUENCE ASSEMBLY: ALGORITHMS AND EXPERIMENTAL RESULTS

by

Sungje Cho

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2012

Copyright 2012 Sungje Cho

Dedication

This thesis is dedicated to my family. They bore with my weaknesses all this time and encouraged me to move forward. My wife, especially, has devoted her life to me and supports me in every aspect. My parents are praying for me to our God, and my daughters enjoyed their American life even though I could not support them well.

Acknowledgements

I want to give my greatest honor to my advisors, Professor C.-C. Jay Kuo and Professor Ting Chen. Had I not met either of them, I would not have finished this journey. Professor Kuo was like a father to me; he always watched over me so that I would not swerve from the right path. Professor Chen was like a brother; he was unbelievably kind to me and never stopped trying to help.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
  1.3 Contributions of the Research
    1.3.1 Quality Measures for Contigs
    1.3.2 Reads Alignment with Contigs
  1.4 Organization of the Dissertation
Chapter 2: Background Review
  2.1 Sequence Assembly
  2.2 De Bruijn Graph
  2.3 Maximum Likelihood Estimation
Chapter 3: WEAV: De Novo Whole Transcriptome Assembly
  3.1 Introduction
  3.2 Method
    3.2.1 Refining Sequences into Contigs
    3.2.2 Correctness Measure
    3.2.3 Graph Simplification
    3.2.4 Read Clustering: WEAV-CLUSTER
    3.2.5 The Idea about Paired-End Reads
  3.3 Implementation
    3.3.1 Read Clustering: WEAV-CLUSTER
    3.3.2 Construction of the de Bruijn Graph and Graph Simplification
    3.3.3 Sequence Assembly
    3.3.4 Bridging the Proposed Contigs: WEAV-STITCHING
  3.4 Results
    3.4.1 Ground Truth and Evaluation
    3.4.2 Simulation Study
      3.4.2.1 The Best-Performing K-mers
      3.4.2.2 The Comparison of Assemblers
    3.4.3 Experimental Study
  3.5 Discussion
Chapter 4: MetaSEQ: De Novo Sequence Assembly of Short Regions in Metagenomics
  4.1 Introduction
  4.2 Results
    4.2.1 Datasets
    4.2.2 Performance of MetaSEQ
    4.2.3 Tests on Multiple Read Sets
  4.3 Discussion
  4.4 Methods
    4.4.1 Resolving Orientations of Reads
    4.4.2 Correcting Sequencing Errors
    4.4.3 Constructing a de Bruijn Graph
    4.4.4 Assembling Using the Partition-Ligation Algorithm
    4.4.5 Clustering Template Sequences
Chapter 5: Conclusion and Future Work
  5.1 Summary of the Research
  5.2 Future Work
    5.2.1 Mathematics for Choosing the Best K-mer
    5.2.2 Minor Upgrades
    5.2.3 Major Upgrade: Distributed Memory
Bibliography

List of Tables

3.1 Statistics of 6 simulated datasets
3.2 The best K-mers for each assembler
3.3 Assembly results for the high-coverage orientation-free dataset, H-10M.fa
3.4 Assembly results for the middle-coverage orientation-free dataset, M-1M.fa
3.5 Assembly results for the low-coverage orientation-free dataset, L-100K.fa
3.6 The ground truth of the sequenced genes, transcripts, and segments for the human brain transcriptome experimental dataset
3.7 Some wrong clustering results
4.1 Comparison of expected abundance levels of clustered species (truth) and estimated abundance levels of clustered template sequences (results of running MetaSEQ) under various similarity thresholds T_sim. Δ(L_1) is the weighted L_1 distance between the expected abundances and the estimated abundances.
4.2 Performance of MetaSEQ on assembling reads sampled from Datasets A and B. Δ(L_1) is a metric based on the weighted L_1 distance between the expected clusters and the assembled clusters. The running time was measured on a regular 32-bit desktop computer.
4.4 Performance of MetaSEQ on assembling reads sampled from Datasets C and D. Δ(L_1) is a metric based on the weighted L_1 distance between the expected clusters and the assembled clusters. The running time was measured on a regular 32-bit desktop computer.

List of Figures

1.1 An average quality score graph for an Illumina Solexa sequencer
1.2 A sequencing-error- or SNP-driven bubble
3.1 The implementation flow of WEAV
3.2 A tangled (complex) structure
3.3 Ambiguously clustered genes
3.4 A graph simplification example
3.5 For the orientation-free high-coverage dataset, H-10M.fa
3.6 For the orientation-free middle-coverage dataset, M-1M.fa
3.7 For the orientation-free low-coverage dataset, L-100K.fa
3.8 Comparison statistics for the orientation-free high-coverage simulated dataset, H-10M.fa
3.9 Comparison statistics for the orientation-free middle-coverage simulated dataset, M-1M.fa
3.10 Comparison statistics for the orientation-free low-coverage simulated dataset, L-100K.fa
3.11 Distribution of classes of contigs (mapped to RefSeq only, mapped to Gencode only, mapped to both RefSeq and Gencode, new transcripts, and unmapped contigs) for different contig lengths
3.12 The length distribution of contigs assembled by WEAV (a), and of sequenced transcript segments (b)
3.13 (a) shows a new transcript with new exons, and (b) shows a new one with a new combination of exons
3.14 (length of WEAV contig)/(length of transcript) vs. (length of ground-truth segment)/(length of transcript)
3.15 Comparison statistics for the ambiguous-orientation high-coverage simulated dataset, H-Ori-10M.fa
3.16 For the high-coverage ambiguous-orientation dataset, H-Ori-10M.fa
3.17 For the middle-coverage ambiguous-orientation dataset, M-Ori-1M.fa
3.18 For the low-coverage ambiguous-orientation dataset, L-Ori-100K.fa
3.19 Comparison statistics for the ambiguous-orientation high-coverage simulated dataset, H-Ori-10M.fa
3.20 Comparison statistics for the ambiguous-orientation middle-coverage simulated dataset, M-Ori-1M.fa
3.21 Comparison statistics for the low-coverage ambiguous-orientation simulated dataset, L-Ori-100K.fa
4.1 Relationships between the sequencing error rate and the number of reads (out of 720K) whose orientations cannot be determined by the MetaSEQ algorithm (solid line), and the computation time measured in seconds (dashed line)
4.2 Distribution of coverage for read set 3 of Dataset B

Abstract

The deep sequencing of second-generation sequencing technology has enabled us to study complex biological structures that contain multiple DNA units simultaneously, such as transcriptomes and metagenomes. Unlike general genome sequence assembly, a DNA unit of these biological structures may have multiple copies with small or substantial structural variations and/or SNPs in a single experimental sample. Deep sequencing is therefore necessary to resolve such variations concurrently.
This dissertation focuses on de novo transcriptome assembly, which requires the simultaneous assembly of multiple alternatively spliced gene transcripts. In practice, de novo transcriptome assembly is the only option for studying the transcriptomes of organisms that do not have reference genome sequences, and it can also be applied to identify novel transcripts and structural variations in the gene regions of model organisms.

We propose WEAV for de novo transcriptome assembly, which consists of two separate processes: clustering and assembly. In the clustering process, WEAV reduces the complexity of an RNA-seq dataset by partitioning it: a diverse read set containing many genes is divided into many smaller clustered read sets, each containing only one or a few genes. The underlying idea is straightforward. A sequencer samples reads from random positions, so reads from one gene will overlap each other if the sequencing depth is sufficient, and these overlaps are the keys that connect reads from the same gene. We can transform a dataset into a graph where each read is a node and two reads are connected by an edge when they overlap; each connected component then becomes a clustered read set. As a result, we can assume that a cluster contains one or only a few genes and is therefore not a mixture of many.

After this process, WEAV assembles each clustered read set on a de Bruijn graph backbone, and a novel error-correction process simplifies the backbone using a fast mapping tool, PerM. Roughly speaking, WEAV attempts to solve the classic Shortest Common Superstring problem [61] on the graph in order to identify multiple alternatively spliced gene transcripts simultaneously, and it approaches the problem as a Set Cover problem. We propose novel statistical measures to make this NP-hard problem manageable: explainability, based on the likelihood of sequences, and correctness, based on bootstrapping.

We compared WEAV with other assemblers on various simulated read sets, evaluating performance with widely used measures such as specificity, sensitivity, N50, and the length of the longest sequence. We then tested WEAV on an experimental dataset of 58.58 million 100bp human brain transcriptome reads. WEAV assembled 156,494 contigs longer than 300bp. 96.3% (specificity) of these contigs mapped onto RefSeq, Gencode, or the human genome sequence (hg19), and they covered >72% of the sequenced bases annotated in RefSeq and Gencode. This high sensitivity and specificity shows the exceptional power of WEAV for transcriptome assembly.

Chapter 1

Introduction

1.1 Significance of the Research

Frederick Sanger invented a DNA sequencing method in 1975, and it earned him his second Nobel prize in 1980. Building on this breakthrough, researchers have worked to determine DNA sequences and, more recently, mRNA (transcript) sequences as well. However, all sequencers generate sets of short fragments, called read sets. The fragment lengths depend on the sequencing platform: the chain-termination method of Sanger produces 800-1,000bp [35] or longer, the pyrosequencing of 454 Life Sciences produces 250-400bp, the reversible dye-terminators of Illumina Solexa produce 50-100bp, the sequencing-by-ligation of SOLiD produces about 50bp [60], and so on. Although each sequencer uses its own chemistry, they all similarly generate overlapping reads by fragmenting and amplifying the target sequence.
Therefore, the problem of reconstructing a long target genome from many short fragments based on their overlaps became prominent; it is called Sequence Assembly. Sequence assembly is a well-known NP-complete problem. In addition, the repetitive regions of genomes, sequencing errors, PCR biases, and other biological and technical problems make it even harder. Moreover, as sequencing technology shifted from high quality, represented by the Sanger sequencer, to high throughput, represented by the Illumina Solexa sequencer, another problem, ultra-short reads, was introduced [1]. High-throughput sequencing, also known as second-generation sequencing, has several characteristic features.

- Short Read Length: Although read length has improved a lot (for example, Illumina Solexa generated 32-35bp reads three years ago but now generates 100bp reads), it is still much shorter than the 800-1,000bp of the Sanger sequencer. For this reason, many repetitive regions cannot be reconstructed, and the assembly cannot avoid gaps.

- Unreliable and Short Paired-End Reads: All the sequencer companies claim to support paired-end reads, just as the Sanger sequencer does. A genome with long repetitive regions demands a long insert size, meaning a long distance between paired reads. However, second-generation sequencers cannot accommodate long insert sizes; their inserts are even shorter than a Sanger single-end read. As a result, current state-of-the-art second-generation sequencing technology has limitations for whole-genome sequence assembly.

- Higher Coverage: Although high-throughput platforms boast their potential to provide higher coverage at a lower unit price, this is not a merit but a mandatory requirement. The reason is that sampling reads from a target sequence follows a Poisson process whose expected number of reads is cG/L, where c, G, and L are the coverage, the length of the target sequence, and the length of a read, respectively [61]. The interarrival distance between two adjacent read starts therefore follows an exponential distribution with rate c/L. A sufficient overlap is generally defined by an absolute length in bp; let us call it O. Then the probability of an overlap of at least O bp between two adjacent reads is 1 − e^{−c(L−O)/L} for L > O, so the shorter the read length L, the less likely two adjacent reads are to share an O bp overlap. Thus a second-generation sequencer requires higher coverage to guarantee a minimum overlap of O bp between adjacent reads (a short numerical illustration follows this list).

- Sequencing Errors: Until a few years ago, second-generation sequencers had much higher sequencing error rates; today the rate is not much higher than that of the Sanger sequencer. However, even when the error rate is low, the absolute number of sequencing errors is large when combined with high coverage. The overlap-layout-consensus method is the common approach for Sanger reads, whereas the majority of second-generation datasets are assembled with de Bruijn graph-based methods. The problem is that each sequencing error adds an erroneous edge to the graph. Moreover, an erroneous edge can randomly match somewhere and be hidden in the graph because of the relatively short overlap required by de Bruijn graph-based methods. Therefore, higher coverage does not always guarantee tractability of sequencing errors, in contrast to the overlap-layout-consensus method for the Sanger sequencer, where higher coverage almost surely guarantees reliable elimination of sequencing errors by majority vote. Indeed, the history of de Bruijn graph-based methods is the history of struggling to remove erroneous edges efficiently.
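To make the coverage requirement concrete, the overlap probability above can be evaluated and inverted numerically. The following Python sketch is ours, purely for illustration (the function names are not from any assembler), under the stated Poisson assumptions:

import math

def p_overlap(c, L, O):
    # P(two adjacent reads overlap by >= O bp) when read starts follow
    # a Poisson process with rate c/L per base: the gap between adjacent
    # starts is exponential with mean L/c, so
    # P(gap <= L - O) = 1 - exp(-c * (L - O) / L).
    return 1.0 - math.exp(-c * (L - O) / L)

def coverage_needed(p, L, O):
    # Coverage c required so that adjacent reads share >= O bp with
    # probability p (inverting the formula above).
    return -L * math.log(1.0 - p) / (L - O)

# Same coverage c = 5, same absolute overlap requirement O = 50 bp:
print(p_overlap(c=5, L=900, O=50))         # Sanger-length reads: ~0.99
print(p_overlap(c=5, L=100, O=50))         # short reads:         ~0.92
# Coverage needed for a 99% chance of a 50 bp overlap:
print(coverage_needed(0.99, L=900, O=50))  # ~4.9x
print(coverage_needed(0.99, L=100, O=50))  # ~9.2x

Under these assumptions, the 100bp reads need almost twice the coverage of the 900bp reads for the same absolute overlap guarantee, which is exactly the point of the third item above.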
Because of the above features of second-generation sequencing, assembly demands massive computing power and huge memory. For example, ABySS assembled a human genome from 3.5 billion paired-end reads of 35-46bp using 21 eight-core machines (168 cores) with 16GB of memory each (336GB in total) in 4 days [78]. More recently, SOAPdenovo assembled a human genome at 90x coverage from <40bp reads on a 32-core machine with 512GB of memory, taking 40 hours [52]. That is, even the state-of-the-art genome assembler, SOAPdenovo, requires resources beyond what an individual can afford.

The sequencing error rate is still not tame. For instance, Illumina Solexa claimed to have mitigated the problem that the tail parts of their reads had much higher sequencing error rates, and they published some carefully quality-controlled datasets. We, however, have recently seen poor-quality datasets generated by a new Illumina machine. As Figure 1.1 shows, the read length of one such dataset is 101bp, and the average quality score at each base position is very poor except in the first few positions; the average quality score drops quickly from about the 20th base position. This shows that the quality of a dataset relies on the skill of the technician who operates the sequencer, and we still need to plan for harsh conditions with high sequencing error rates.

Figure 1.1: An average quality score graph for an Illumina Solexa sequencer

With the above-mentioned problems, such as short reads, low quality, and massive resource demands, sequence assembly still has a long way to go. On the other hand, if we could obtain genomes, transcriptome structures, metagenomes, and so on easily, there would be unlimited potential for discovering new therapies and medicines and for inventing new materials with special features. We propose a new de novo sequence assembler for transcriptomics and metagenomics that improves on the performance of existing assemblers, and we propose a new way to untangle complex problems in this area.

1.2 Review of Previous Work

Assembly programs depend strongly on the sequencing technology, and there are remarkably clear differences between assemblers for Sanger sequencers and those for second-generation sequencers. We briefly summarize previous work according to the following classes [63], [76]: greedy graph-based assemblers, overlap-layout-consensus assemblers, and de Bruijn graph-based assemblers.

- Greedy Graph-based Assemblers: SHARCGS [19], SSAKE [88], and VCAKE [42] are in this class. For the sequence reconstruction problem discussed in Sec. 2.1, a greedy algorithm exists that is a 2.75-approximation, and it is conjectured to be a 2-approximation [61]. However, greedy algorithms become impractical as datasets grow larger, because performing pairwise alignments over all reads to find the best alignment takes too long.

- Overlap-Layout-Consensus Assemblers: Celera Assembler [66], Arachne [4], [41], CAP and PCAP [37], CABOG [62], Newbler [56], Shorty [36], and Edena [34] are in this class. The overlap-layout-consensus approach is the most mature method, and it produces the best assemblies for the comparatively long reads and low sequencing error rates of the Sanger sequencer. However, it is generally not applicable to the ultra-short reads of second-generation sequencers, although it remains useful for 454 datasets.
454 Life Sciences distributes Newbler, and CABOG (Celera Assembler with the Best Overlap Graph) is a revision of the Celera Assembler for 454 datasets. Although Shorty and Edena were published for ultra-short-read sequencers, they have been applied only to bacterial genomes and are not widely used now. Note that because multiple sequence alignment is NP-complete [22], [43], [86], these assemblers use an approximation algorithm to construct the overlap graph.

- De Bruijn Graph-based Assemblers: Euler [11]-[13], [72]-[74], AllPaths [9], [53], SOAP and SOAPdenovo [48], [50], [52], Velvet [90], [91], and ABySS [78] are in this class. The de Bruijn graph method was originally proposed for sequencing by hybridization and is now the most famous and active approach for second-generation sequencing. We discuss it in detail in Sec. 2.1.

1.3 Contributions of the Research

We propose two computational solutions: WEAV for transcriptome assembly in Chapter 3 and MetaSEQ for metagenomics assembly in Chapter 4. The two assemblers share their main ideas: the de Bruijn graph, traversal of the graph, and abundance-level estimation of contigs by the EM algorithm. Moreover, the transcriptome and the metabiome share similar characteristics: both consist of many objects, such as genes and species; each object may have variants, such as alternatively spliced transcripts or different individuals; and the objects are not evenly distributed.

However, they are also different enough. The number of alternatively spliced transcripts of a gene is much smaller than the number of different individuals of a species. Moreover, different transcripts of a gene contain different sets of segments, whereas different individuals of a species have almost the same DNA sequence apart from several SNP alterations. Because of these differences, WEAV and MetaSEQ differ in their details. We summarize the major contributions of our research in the following.

1.3.1 Quality Measures for Contigs

Importance Measure

This idea comes from the estimation of the relative abundance levels of species in a sample, which tells how frequent a given species is in a population relative to the other species. Abundance can be viewed as a parameter producing an observation (a dataset), in the sense that an abundant species is sampled more and a rare species is sampled less. This problem is well studied in statistics through maximum likelihood estimation (MLE) via the EM algorithm.

We propose a probability model that applies this tool to metagenomic sequence assembly, whose aim is to discover species and their relative abundances. In RNA-seq assembly, however, we change the viewpoint. In that context, the relative abundance level is the expression level, and many convenient tools can estimate expression levels when template sequences exist; the goal is therefore to hand a true transcriptome (a complete set of template sequences) to such a program. On the other hand, the concept is still useful because it tells which sequence is more valuable than another, in the sense of how large a portion of a given dataset the sequence can explain. This concept plays a crucial role in the assembly algorithm discussed in Sec. 3.2. Because it does not say which contig is incorrect, we never remove a contig because of low importance.

Correctness Measure

Together with the importance score discussed above, we propose a correctness score based on bootstrapping.
Bootstrapping was proposed to assess the accuracy of an estimate by simulating the distribution of the estimator. In assembly, we estimate a set of contigs, and we want to give each contig a score saying how correct it is. The basic idea is straightforward: if a given contig is present in a dataset, it is likely to be present in a resampled dataset too. Thus, we resample the dataset many times and test whether a given contig still exists; the correctness measure is the rate of existence. The correctness score can serve as an indicator of which contigs are incorrect, which is the opposite role of the importance score. We designed the estimator to be sensitive to chimeras, which must be avoided in RNA-seq assembly. It also performs well at flagging wrong base calls, although this is less crucial because such errors can easily be corrected by many SNP-calling programs [17], [49], [51], [58], [68].

1.3.2 Reads Alignment with Contigs

As mentioned in Sec. 1.1, suppressing the effect of sequencing errors is a key barometer of assembler performance, and each assembler has its own logic for doing so. However, we found that the complicated evaluation process that screens erroneous edges from true edges can be postponed, and the responsibility can be shifted to a mapping program. This approach differs from others, which use mapping for evaluation rather than for assembly; we use the mapping program as a part of the assembler.

Figure 1.2 shows a simple example. Three reads (blue) were sampled from a target sequence (black), but one read has a sequencing error in the middle (red); part of the de Bruijn graph for the three reads is shown at the bottom. Although the sequencing error can easily be screened out by majority vote, that is not strictly necessary: simply ignore one edge of the bubble. All three reads can be mapped to either path, because there is only one base mismatch, and the majority-vote step for a contig is easy once reads are mapped. As a result, we claim that using a mapping program can speed up the graph simplification process, which all de Bruijn graph-based assemblers perform, and guarantee more accurate contig sequences. Details are given in Chapter 3.

Figure 1.2: A sequencing-error- or SNP-driven bubble

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. The background of this research is reviewed in Chapter 2: the state-of-the-art assembly methods with their historical background, and the EM algorithm for measuring the likelihood of a sequence relative to a set of sequences, which is a crucial part of the assembly. Chapters 3 and 4 present two different applications of sequence assembly: Chapter 3 discusses WEAV, a de novo transcriptome assembler, and Chapter 4 discusses MetaSEQ, a de novo metagenomics assembler. Chapters 3 and 4 contain their own background reviews as well, because their research targets differ. Finally, concluding remarks and future research issues are given in Chapter 5.

Chapter 2

Background Review

In this chapter, we discuss the general assembly problem, the de Bruijn graph, and the EM algorithm for estimating abundance levels.

2.1 Sequence Assembly

Second-generation sequencers such as Illumina Solexa, SOLiD, and 454 are the target sequencing platforms for this work. Although all sequencing platforms suffer from the problems listed below, the problems may be even worse for second-generation sequencers than for the Sanger sequencer. Let us look at the problems first.
1. Errors: There are at least two kinds of errors.

- Sequencing Error: Also known as wrong base-calling. Most second-generation sequencers are based on sequencing-by-synthesis: by controlling synthesis cycles, they read one base per cycle. However, the synthesis is the result of a chemical reaction and follows a stochastic process; by chance, a cycle may synthesize no nucleotide or more than one nucleotide. This is the main source of sequencing errors.

- Chimeric Reads: Two different parts of a genome may be joined side by side. This error can come from chimeric inserts generated by the amplification process. Such reads drive wrong assemblies.

2. Unknown Orientation: Although directed sequencing techniques exist and we used a dataset generated by one, most datasets still have the orientation problem; that is, a read can come from either DNA strand, which makes assembly more difficult.

3. Repetitive Regions: In general, these cause at least two problems.

- Repeats longer than a read: If a repeat is longer than a read, nothing can be done. This is a common problem, especially for second-generation sequencers. Using Sanger and second-generation sequencers together in a hybrid approach, or using mate-pair reads, can help solve this problem.

- Ambiguity in ordering: Let five segments be A, B, C, D, and E, appearing in the target genome in the order A→E→B→E→C→E→D. If reads are shorter than the segments, we can determine A and D, but the order of B and C is ambiguous.

We first consider a simplified assembly problem called the Shortest Common Superstring Problem (SSP), where we ignore all the problems discussed above. Gallant et al. [25] described the problem as below and proved that it is NP-complete.

Let Σ be a finite alphabet, s ∈ Σ* a string, and R = {r_1, ..., r_n} a set of reads in a dataset, where r_i ∈ Σ*, 1 ≤ i ≤ n.

Definition 1. A superstring of R is a string s containing each r_i, 1 ≤ i ≤ n, as a substring. The superstring problem is: given a set of reads R and a positive integer K, find a superstring of R of length at most K.

The superstring problem is NP-complete [25], so SSP is NP-complete too. Let us define SSP mathematically.

Definition 2. Let S_K be the set of superstrings of R of length K. The SSP is: find S_{K*}, where K* = min{K ∈ Z+ : S_K ≠ ∅}.

However, a real dataset probably has errors and the orientation problem, and more general problems can be defined in various ways. One of them is the sequence reconstruction problem (SRP), which allows for errors and the orientation problem.

Definition 3. With an error rate ε ∈ [0, 1), the sequence reconstruction problem (SRP) is to find a superstring s of R that contains each r_i, 1 ≤ i ≤ n, as a substring a_i such that max_i min{d(a_i, r_i), d(a_i, r_i^r)} ≤ ε|a_i|, where r_i^r denotes the reverse complement of r_i and d(·,·) is the number of mismatched positions between two sequences.

Kececioglu [44] proved that SRP is also NP-complete. The real assembly problem is even more complicated than SRP. There may be no single superstring explaining all reads in R, but rather a set of superstrings, each different from the others and each explaining only a partial set of reads of R because of the lack of coverage. This set of superstrings is called the contigs, and if some contigs can be ordered by other information, such as mate-paired reads, they can form a scaffold, which is longer than an individual contig but may have gaps between contigs. Moreover, repetitive regions violate the fundamental assumption that the shortest superstring is the best, because repetitive regions require multiple copies of subsequences in the superstring.
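As an aside, the greedy heuristic mentioned in Sec. 1.2 (a 2.75-approximation for SSP, conjectured to be a 2-approximation) is easy to state for the error-free case. The following Python sketch is ours, for illustration only; it repeatedly merges the pair of strings with the largest suffix-prefix overlap:

def overlap(a, b):
    # Length of the longest suffix of a that is a prefix of b.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    # Greedy SSP heuristic: drop contained reads, then repeatedly merge
    # the pair with the largest overlap until one superstring remains.
    strings = [r for r in reads
               if not any(r != s and r in s for s in reads)]
    strings = list(dict.fromkeys(strings))   # dedupe, keep order
    while len(strings) > 1:
        best = (-1, None, None)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for t, s in enumerate(strings) if t not in (i, j)]
        strings.append(merged)
    return strings[0]

print(greedy_superstring(["ACGTAC", "GTACGG", "TACGGA"]))   # ACGTACGGA

The quadratic search for the best pair in every round is exactly why greedy assemblers become impractical on large datasets, as noted in Sec. 1.2.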
Moreover, repetitive regions violate the very fundamental assumption that the shortest is the best because repetitive regions require multiple copies of subsequences in the superstring. 2.2 De Bruijn Graph To dene the de Bruijn graph mathematically, let is a character set andK is a parameter to specify dimension. Then, a K dimensional de Bruijn graph ofjj characters, G = (N;E), can be dened as follows. The node set is N =fc 1 c 2 :::c K :c i 2 ; 1iKg, and with the following overlap indicator function between two nodes,n =c 1 c 2 :::c K 2N and n =c 0 1 c 0 2 :::c 0 K 2N, o(n;n 0 ) = 8 > > < > > : 1 if c 2 =c 0 1 ,c 3 =c 0 2 ,:::,c K =c 0 K1 0 otherwise the edge set E can be dened as E =f(n;n 0 ) :o(n;n 0 ) = 1; for all n2N and n 0 2Ng. Therefore, all nodes havejj incoming and outgoing nodes. However, it was dierently used in Bioinformatics. For a given dataset, there may be several read sets, and G = (N;E) is from the dataset. N is all K-mers, which mean K-tuples, in the dataset, and e = (n;n 0 )2 E is only dened when there is (K + 1)-mer whose the rst K-mer forms n and the last K-mer forms n 0 . The graph is called a de Bruijn graph of K-mer for the given dataset. We will use this denition for the rest of this paper. 15 AsK grows, the size of node setN and edge setE can grow exponentially. However, in a real situation, the size of the graph is controlled by the size of dataset, an underlying sequence, and sequencing error rate. Interestingly, if no sequencing error is in the dataset, the underlying sequence for a given K determines the graph size. However, by the sequencing error rate, the graph can grow exponentially in size. Idury et al. [40] are one of the pioneers to apply this graph for the sequence assembly. Sec. 1.2 shows various applications using this concept. As we have mentioned in Sec. 1.2, the history of de Bruijn graph-based approach is the history of removing sequencing errors. Although the size of a de Bruijn graph must be bounded by the underlying target DNA sequence, the sequencing errors trigger the exponential property of the de Bruijn graph. With the following section, we also tried to escape from the curse of sequencing errors in various ways. 2.3 Maximum Likelihood Estimation Abundance levels of species, also known as relative species abundance, was introduced to measure the biological diversity together with species richness. Species richness shows the number of dierent species in a given area and a given period, and abundance level of a species shows the relative abundance and the rarity of a species among the sample. The biological diversity with these two statistical indicators can tell how dierent a given sample from the other samples. 16 However, we use the term abundance level dierently. We will use it for general objects, which means not only species but also sequences. Denition 4. Abundance level of an object means how common or rare the object is relative to the other objects. Species, contigs, transcripts or others can be the object. Bacteria are the objects of metagenomics; the abundance level of a species in a given metabiome tells the proportion of the individuals of the species to the total population in the metabiome. The abundance levels of a transcript is also dened similarly. Especially, the abundance level of a contig is used to lter out fake contigs such as contigs having wrong insertions or deletions of short fragments or having too many sequencing errors, or chimeric. The way of calculation will be discussed in the following. 
2.3 Maximum Likelihood Estimation

Abundance levels of species, also known as relative species abundance, were introduced to measure biological diversity together with species richness. Species richness gives the number of different species in a given area over a given period, and the abundance level of a species gives the relative abundance or rarity of the species within the sample. Biological diversity, through these two statistical indicators, can tell how different a given sample is from other samples.

However, we use the term abundance level differently: we apply it to general objects, meaning not only species but also sequences.

Definition 4. The abundance level of an object means how common or rare the object is relative to the other objects.

Species, contigs, transcripts, and other entities can all be the object. Bacteria are the objects of metagenomics; the abundance level of a species in a given metabiome gives the proportion of individuals of that species in the total population of the metabiome. The abundance level of a transcript is defined similarly. In particular, the abundance level of a contig is used to filter out fake contigs, such as contigs with wrong insertions or deletions of short fragments, contigs with too many sequencing errors, or chimeric contigs. The method of calculation is discussed in the following.

Problem Definition

A read set is given, together with a set of sequences derived from the read set. The goal is to estimate the abundance levels of the sequences in the set. The following notation is used in discussing the EM algorithm:

R = {r_1, r_2, ..., r_n}: a read set containing n reads
S = {s_1, s_2, ..., s_m}: a set of m sequences
A = {a_1, a_2, ..., a_m}: the abundance levels of the m sequences, with 0 ≤ a_i ≤ 1 and Σ_{1≤i≤m} a_i = 1
|r|: the length of a read r
r_i^j: the base at the j-th position of the i-th read, 1 ≤ i ≤ n and 1 ≤ j ≤ |r_i|
s_i^j, |s|: the same, for sequences

R and S are given information, and A is to be estimated. Note that if a sequence s ∈ S represents the metagenome of a species, the corresponding a ∈ A represents how common the species is relative to all the species that S stands for. Likewise, if s ∈ S represents a contig, the corresponding a ∈ A represents how likely the contig is relative to all the other contigs in S.

Obtaining a maximum likelihood estimate (MLE) of A using the expectation-maximization (EM) algorithm is well studied [59], but the likelihood function varies from author to author. For example, Pasaniuc et al. [69] proposed their own likelihood function, but they did not consider the length of each sequence. If the sequences vary greatly in length, this matters; transcripts in particular vary in length from <100bp to about 100,000bp. Therefore, our likelihood function is defined with sequence length taken into account:

λ(A) = P(R = {r_1, r_2, ..., r_n} | S, A)
     = Π_{i=1}^n P(r_i ∈ R | S, A)   (reads are independent of each other)
     = Π_{i=1}^n Σ_{j=1}^m P(r_i | s_j) a_j
     = Π_{i=1}^n Σ_{j=1}^m [I(r_i, s_j) / (|s_j| − |r_i| + 1)] a_j,   (2.1)

where I(r_i, s_j) is the indicator defined by

I(r, s) = 1 if there exists i ∈ {0, ..., |s| − |r|} such that Σ_{1≤j≤|r|} 1{r^j ≠ s^{i+j}} ≤ M, and 0 otherwise,

where 1{·} is an indicator function equal to 1 when the condition in braces is satisfied and 0 otherwise, and M is a threshold specifying the maximum allowable number of mismatches for a read to be called mappable.

Definition 5. If a given read r and a sequence s satisfy this condition, we say r is mappable on s.

The mapping program PerM [15] provides I(r, s) for all r ∈ R and s ∈ S.

EM-algorithm

The likelihood function above is not exactly right, because not all reads mappable on s actually came from the sequence s. Moreover, according to Eq. 2.1, a read may be attributed to multiple sequences simultaneously, as if portions of a read came from different sequences. This can never happen: each read comes from exactly one sequence. Two identical reads may come from different sequences, but the one-to-one correspondence between reads and source sequences must still be kept.

Therefore, let us introduce the missing data Z_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ m, telling which read came from which sequence, to form complete data:

Z_ij = 1 if the i-th read r_i came from the j-th sequence s_j, and 0 otherwise.

Note that if Z_ij = 1, then Z_ik = 0 for all k ≠ j. The likelihood function for the complete data is

λ_complete(A) = P(R, Z | A, S) = Π_{i=1}^n Π_{j=1}^m [a_j / (|s_j| − |r_i| + 1)]^{Z_ij},   (2.2)

where the property of Z_ij replaces the summation of Eq. 2.1. Note that I(r_i, s_j) no longer appears, because Z_ij already carries that information, and this likelihood function is different from λ(A) of Eq. 2.1, which is the likelihood using the observed data only. Now we can define the log-likelihood function l(A):

l(A) = log(λ_complete(A)) = Σ_{i=1}^n Σ_{j=1}^m Z_ij (log(a_j) − log(|s_j| − |r_i| + 1)).   (2.3)

Maximizing l(A) maximizes Eq. 2.2.

Initial Step: a_i^(0) = 1/m, 1 ≤ i ≤ m.

E-Step: The Q-function is the expectation of l(A) given the current abundance estimate A^(k), R, and S:

Q(A, A^(k)) := E(l(A) | A^(k), R, S) = Σ_{i=1}^n Σ_{j=1}^m E(Z_ij | A^(k), R, S) (log(a_j) − log(|s_j| − |r_i| + 1)),

Ẑ_ij^(k) := E(Z_ij | A^(k), R, S) = P(Z_ij = 1 | A^(k), R, S) = [a_j^(k) / (|s_j| − |r_i| + 1)] / Σ_{h=1}^m [I(r_i, s_h) / (|s_h| − |r_i| + 1)] a_h^(k).

To explain this derivation, let the random variable H be an assignment variable for a given read r, where H = h means that read r came from sequence s_h. Note that the sampling process is a stochastic process. Then

P(Z_ij = 1 | A^(k), R, S) = P(Z_ij = 1 | A^(k), r_i, S)   (1)
  = P(Z_ij = 1, A^(k), r_i, S) / P(A^(k), r_i, S)
  = P(r_i | Z_ij = 1, A^(k), S) P(Z_ij = 1, A^(k), S) / [P(r_i | A^(k), S) P(A^(k), S)]
  = [1 / (|s_j| − |r_i| + 1)] P(Z_ij = 1 | A^(k), S) / P(r_i | A^(k), S)   (2)
  = [a_j^(k) / (|s_j| − |r_i| + 1)] / Σ_{h=1}^m P(r_i, H = h | A^(k), S)   (3)
  = [a_j^(k) / (|s_j| − |r_i| + 1)] / Σ_{h=1}^m [I(r_i, s_h) / (|s_h| − |r_i| + 1)] a_h^(k)   (4)

(1) Z_ij depends only on r_i.
(2) P(r_i | Z_ij = 1, A^(k), S) = 1 / (|s_j| − |r_i| + 1); note again that Z_ij = 1 implies Z_ik = 0 for all k ≠ j.
(3) P(Z_ij = 1 | A^(k), S) = a_j^(k).
(4) P(r_i, H = h | A^(k), S) = P(H = h | A^(k), S) P(r_i | H = h, A^(k), S) = a_h^(k) I(r_i, s_h) / (|s_h| − |r_i| + 1).

Although the random variable H looks similar to the random vector Z_i· = (Z_i1, Z_i2, ..., Z_im), it is different. For example, P(Z_ij = 1 | Z_ik = 1, k ≠ j) = 0, whereas P(H_1 = j | H_2 = k, k ≠ j) can still be non-zero, where H_1 and H_2 are two copies of H. This is why the indicator function I(r_i, s_h) appears again in (4), in the absence of Z_ij's constraint, and why P(r_i | Z_ij = 1, A^(k), S) = 1 / (|s_j| − |r_i| + 1) rather than I(r_i, s_j) / (|s_j| − |r_i| + 1).

M-Step: Q(A, A^(k)) = Σ_{i=1}^n Σ_{j=1}^m Ẑ_ij^(k) (log(a_j) − log(|s_j| − |r_i| + 1)) is maximized subject to Σ_{j=1}^m a_j = 1 by

a_j^(k+1) = Σ_{i=1}^n Ẑ_ij^(k) / Σ_{i=1}^n Σ_{j=1}^m Ẑ_ij^(k).

To derive this result, we use a Lagrange multiplier (written here as φ). Let

F(A, φ) = Σ_{i=1}^n Σ_{j=1}^m Ẑ_ij^(k) (log(a_j) − log(|s_j| − |r_i| + 1)) + φ (1 − Σ_{j=1}^m a_j).

Then ∂F(A, φ)/∂φ = 1 − Σ_{j=1}^m a_j = 0 and ∂F(A, φ)/∂a_j = Σ_{i=1}^n Ẑ_ij^(k) / a_j − φ = 0, which gives a_j = (1/φ) Σ_{i=1}^n Ẑ_ij^(k). Combining with Σ_{j=1}^m a_j = 1 yields φ = Σ_{j=1}^m Σ_{i=1}^n Ẑ_ij^(k), and therefore

a_j^(k+1) = Σ_{i=1}^n Ẑ_ij^(k) / Σ_{j=1}^m Σ_{i=1}^n Ẑ_ij^(k).

Now we have the E-step and the M-step.
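The E- and M-steps above translate directly into code. The following self-contained Python sketch is ours; it assumes exact-match mappability (M = 0), so that I(r, s) reduces to a substring test, whereas the real pipeline obtains I(r, s) from PerM:

def em_abundance(reads, seqs, iters=100):
    # EM for the abundance vector A derived above (assumes every read
    # is no longer than every sequence it is tested against).
    # E-step: Z_ij proportional to I(r_i, s_j) * a_j / (|s_j| - |r_i| + 1)
    # M-step: a_j = sum_i Z_ij / sum_i sum_j Z_ij
    n, m = len(reads), len(seqs)
    I = [[1 if r in s else 0 for s in seqs] for r in reads]
    a = [1.0 / m] * m                      # uniform initialization
    for _ in range(iters):
        counts = [0.0] * m
        for i, r in enumerate(reads):
            w = [I[i][j] * a[j] / (len(seqs[j]) - len(r) + 1)
                 for j in range(m)]
            tot = sum(w)
            if tot == 0:                   # unmappable read: skip it
                continue
            for j in range(m):
                counts[j] += w[j] / tot    # expected Z_ij
        total = sum(counts)
        a = [cnt / total for cnt in counts]
    return a

seqs = ["ACGTACGTAC", "TTGCATTGCA"]
reads = ["ACGTA", "CGTAC", "GTACG", "TTGCA"]
print(em_abundance(reads, seqs))           # ~[0.75, 0.25]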
Chapter 3

WEAV: De Novo Whole Transcriptome Assembly

3.1 Introduction

Studying the transcriptome has become easier with the availability of RNA-seq technology [87]: it is cheaper, faster, and has a higher dynamic range. Wall et al. [85] claimed that second-generation sequencing is superior to conventional Sanger sequencing for transcriptome profiling, despite drawbacks stemming from short read lengths and machine-specific sequencing error patterns, and Marguerat et al. [55] showed the superiority of RNA-seq over the microarray-based approach.

Although many biological studies have taken advantage of this technology, for example in the human nervous system [33] and in gene expression profiling [18], [82], [32], [81], computational methods for the analysis of RNA-seq datasets remain at a primitive level because of many technical difficulties.
These difficulties include incomplete and inaccurate gene annotations; biological complications such as alternatively spliced gene transcripts, alternative 3'- and 5'-ends, overlapping genes, and paralogous genes; and technical complications such as base-call errors, indel errors, and sequencing biases. Moreover, third-generation sequencing will exacerbate the technical difficulties [75], [20].

Currently, there are three leading approaches to the analysis of RNA-seq datasets when reference genome sequences are available. The first approach maps RNA-seq reads directly to the gene transcripts to estimate gene expression levels and to identify differentially expressed genes; it requires accurate gene annotations and depends on them heavily [65], [64], [3], [27], [67], [89], [16]. Gene annotations, however, are far from complete for most model organisms. For example, well-known human genome databases such as RefSeq and Gencode are still frequently updated, and the number of human genes has been predicted to be anywhere between 25K (RefSeq) and 120K (AceView). This large variation arises from the use of different gene structures and different sources of data for gene annotation [30], [2]; in other words, the numbers vary with the definition of a gene as well as with the reliability of the evidence. The second approach adopts a mapping step similar to the first, but its goal is to identify novel gene transcripts or splice junctions [31], [83], [82]. This approach takes the RNA-seq reads that cannot otherwise be mapped to the genome and/or gene transcript sequences and maps them to the genome using specialized gap-mapping tools, such as TopHat [82], to identify exon/intron junctions; it then enumerates all possible candidate gene transcripts from combinations of the mapped reads and regions, and screens out wrong combinations using multiple criteria. Obviously, this approach depends strongly on the ability of the mapping program. In some cases, mapping programs must map reads across multiple gaps in order to locate multiple short exons, insertions, or deletions (genome variants). Moreover, a read can be spread over long gaps or across chromosomes because of structural variations. No program yet covers all these cases together efficiently. The third approach uses local assembly, not only to increase the resolution of finding genomic variations but also to avoid the biases of incomplete gene annotations [77], [10], [26]. The local assembly is similar to resequencing in that it uses only the reads that differ from the reference. Reconstructing these variable regions by local assembly enables better mapping results and bridges the invariable regions. However, the efficiency of this third method still depends on the existing annotation.

De novo transcriptome assembly is not only the obvious choice when no reference sequence exists but also a competitive choice when reference sequences do exist, because it is free from the aforementioned problems: (1) inaccurate and incomplete annotations, and (2) SNP or indel mutations and structural variations [80]. It is independent of existing annotations, and it has more potential to discover new transcripts, especially when the target genome has structural variations.

The history of de novo genome assembly began in the Sanger sequencing era, when Kececioglu and Myers proposed an overlap-layout-consensus framework [44], formulating multiple computational problems and solving them efficiently with different heuristics.
Programs such as TigrAssembler [29] and Phred [23], [24] used the overlap-layout-consensus approach. Idury and Waterman [40] later proposed a de Bruijn graph-based method for genome assembly, and Pevzner and his colleagues followed up with implementations and improvements [71]-[73]. Subsequently, many de novo genome assemblers for second-generation sequencing technology have been proposed, including the Euler family [13], SHARCGS [19], Velvet [90], ABySS [78], ALLPATHS [9], and SOAPdenovo [52]. Miller et al. [63] classified those assemblers into three groups: the greedy graph-based assemblers, the overlap-layout-consensus assemblers, and the de Bruijn graph-based assemblers. Although each has its own advantages, the de Bruijn graph approach has exceptional strength for short-read assembly in the presence of sequencing errors or redundant reads (see Sec. 3.2.3).

However, no de novo transcriptome assembly program specializes in second-generation sequencing technology. Thus, genome assemblers such as Velvet and ABySS have been applied to transcriptome assembly. Surget-Groba and Montoya-Burgos [80] used Velvet with multiple K-mers to reduce the de Bruijn graph's strong dependency on K and to obtain multi-resolution results. However, since these programs essentially assume one or two underlying sequences in the assembly, they are unable to assemble multiple alternatively spliced transcripts that partially overlap. In practice, the graph-traversal process in these programs tends to stop at splice junctions, producing many short contigs. Birol et al. [5] used ABySS for their assembly project, but it produced too many contigs, some remarkably short, so they had to process the contigs again to obtain reasonable results.

We propose a mathematical framework for obtaining the optimal set of contigs for a dataset using the de Bruijn graph method. Idury et al. [40] and Pevzner et al. [70], [74] addressed the mathematics of the de Bruijn graph method, but how to find the optimal set of contigs on the graph had not yet been studied. We approach the problem in the following way: (1) for a given set of contigs and a given dataset, we propose an algorithm to select a subset of contigs that fully explains the dataset; (2) we propose an algorithm to eliminate fake paths in the de Bruijn graph to reduce the size of the set of contigs.

Our new de novo transcriptome assembly method consists of two programs, WEAV-CLUSTER and WEAV. In the first step, WEAV-CLUSTER divides the read set into small clusters so that each cluster consists of the transcripts of one or a few genes; WEAV then assembles each cluster into multiple alternatively spliced gene transcripts. WEAV uses a simple strategy to remove edges caused by sequencing errors and literally traverses the graph to identify multiple overlapping contigs that correspond to alternatively spliced gene transcripts. Finally, two statistical measures, based on the EM algorithm and bootstrapping, differentiate correct contigs from fake ones. We tested WEAV on two small datasets to show that it is capable of assembling multiple alternatively spliced gene transcripts. We then applied WEAV to a human brain transcriptome dataset of 58.58 million reads of 100bp in length. The gene transcripts assembled by WEAV were evaluated against the ground truth inferred by mapping reads to the human genome annotation databases RefSeq and Gencode.
The results showed that WEAV is an efficient transcriptome assembler with >72% sensitivity and >96% specificity.

3.2 Method

Figure 3.1 shows the implementation flow of WEAV, but we developed the idea in the reverse order. This section therefore proceeds from explainability to clustering, with the following logic. (1) We reduce the sequence assembly problem to the process of selecting the most probable or correct sequences, called contigs, from a given pool of sequences using the explainability measure (Sec. 3.2.1). The selection process is computationally demanding if the pool is immense, so (2) we screen contigs out of the pool when they are highly unlikely to be correct, using the correctness measure (Sec. 3.2.2). The pool may still occupy vast space in the form of a complex de Bruijn graph, so (3) we simplify the graph, which includes an error-correction process (Sec. 3.2.3). Even this may not be enough when the input dataset is huge and its underlying structure is complex, so (4) we break the dataset into smaller read sets, a step called clustering (Sec. 3.2.4).

Figure 3.1: The implementation flow of WEAV

3.2.1 Refining Sequences into Contigs

There have been several attempts to provide a mathematical framework for the sequence assembly problem. The first was based on the Sequence Reconstruction Problem [61], an extension of the Shortest Common Superstring Problem; its goal was to find a shortest superstring explaining all reads in a read set R, and the overlap-layout-consensus method reflects this well. However, it was hard to formulate a mathematical framework for the de Bruijn graph method, and thus Pevzner et al. [73] proposed the only such framework, the Eulerian Superpath Problem, which considers paths covering every edge exactly once. Nonetheless, this formulation becomes impractical as the input dataset grows, and it is not useful for RNA-seq, because multiple paths must be found and edges may be visited multiple times, since multiple isoforms of a gene may share exons.

Scaffolders such as Bambus 2 [46] and Cufflinks [83] gave us insights, although scaffolding is different from sequence assembly. Cufflinks assembles a read set with positional information obtained by mapping to reference sequences: the positional information induces a partial order on paths, and Cufflinks finds the minimum set of chains that explains the read set after transitive reduction, grounded in Dilworth's theorem. The algorithm of Bambus 2 is based on the Optimal Linear Arrangement problem on the contig connection graph. Although their problems differ, both try to find paths; moreover, both try to find a minimal set of superstrings explaining the read set, and they produce those superstrings as output. WEAV is conceptually not very different, but WEAV's algorithm is mapping-free (unlike Cufflinks) and paired-end-free (unlike both). We use the de Bruijn graph as the backbone, where contigs are equivalent to paths.

Contig Selection

For a read set R, we use the de Bruijn graph G_db = (V_db, E_db) and its conventional construction method [40]. E_db is the collection of all K-mers in R, and V_db is the collection of all (K−1)-mers in E_db; the endpoints v and v′ of e = (v, v′) ∈ E_db are the first and the last (K−1)-mers of e. Practically, the graph is no more than a sorted list of edges, and that is enough to define G_db. We call this graph the K-dimensional de Bruijn graph.

A path on G_db is a series of edges e_1 e_2 ... e_l, with e_i = (v_i, v′_i) ∈ E_db for i ∈ {1, ..., l} and v′_i = v_{i+1} for i ∈ {1, ..., l−1}. Let P be the path space containing all possible paths of G_db, and let S be a sequence space; then a one-to-one function f_s : P → S can be defined. Before defining it, let s[i, j] denote the subsequence of s from position i to position j, and let s + s′ denote the concatenation of sequences s and s′. Then f_s(p) = f_s(e_1 ... e_l) = f_s(e_1) + f_s(e_2)[K, K] + ... + f_s(e_l)[K, K], where f_s(e ∈ E_db) is the K-mer of e. Therefore, S is the transformation of P by f_s.

The supporting read set for an s ∈ S is defined as R_s = {r ∈ R | r is a subsequence of s}. We then say that s explains R_s, as in Bambus 2, and that R_s supports s. Sequence assembly on the graph is now the problem of finding a set of sequences S* = {s_1, s_2, ..., s_T} ⊆ S that satisfies ∪_{i=1}^T R_{s_i} = R. Suppose {R_{s_1}, ..., R_{s_T}} is the smallest subset of the supporting read sets satisfying ∪_{i=1}^T R_{s_i} = R; the assembly problem is then to find the smallest such S*. This problem reduces to the Set Cover problem, a well-known NP-complete problem, for which a greedy approximation algorithm exists: "choose the largest set first," that is, "choose the sequence s with the largest |R_s| first." Note that the s attaining argmax_s |R_s| is not guaranteed to come from the longest path, and vice versa, because of chimeric edges and uneven coverage. Moreover, neither the Optimal Linear Arrangement method of Bambus 2 nor the Minimum Chain Decomposition method of Cufflinks is applicable for finding such an S* ⊆ S, because G_db is not likely to be the directed acyclic graph (DAG) required by those methods.
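The greedy "largest set first" rule reads as follows in a minimal Python sketch (ours, for illustration; it uses plain substring containment for R_s, whereas WEAV derives the candidate sequences from de Bruijn graph paths):

def select_contigs(candidates, reads):
    # Greedy set cover: R_s is the set of reads contained in candidate
    # s; repeatedly keep the candidate explaining the most still-
    # unexplained reads until every read is covered or nothing helps.
    support = {s: {r for r in reads if r in s} for s in candidates}
    uncovered = set(reads)
    chosen = []
    while uncovered:
        best = max(support, key=lambda s: len(support[s] & uncovered))
        gain = support[best] & uncovered
        if not gain:
            break            # remaining reads explained by no candidate
        chosen.append(best)
        uncovered -= gain
    return chosen

reads = ["ACGT", "CGTA", "GTAC", "TTTT"]
candidates = ["ACGTAC", "TTTTA", "CGTAC"]
print(select_contigs(candidates, reads))   # ['ACGTAC', 'TTTTA']

This greedy rule gives the classic logarithmic approximation for Set Cover; as the next subsection explains, however, counting whole reads in this way produces many near-duplicate sequences, which motivates the explainability measure.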
The denition of a path on the G db is a series of edges e 1 e 2 :::e l , e i = (v i ;v 0 i )2 E db ;i2f1;:::;lg where v 0 i = v i+1 for i2f1;:::;l 1g. Let a path space be P for a given graph G db which has all possible paths of G db , and a sequence space S. Then, a one-to-one function f s : P! S can be dened. Before dening the function, let s[i;j] be the subsequence of a sequence s which covers s from i th position to j th position, and let s +s 0 be the concatenation of the sequences s and s 0 . Then, f s (p) = f s (e 1 :::e l ) = f s (e 1 ) +f s (e 2 )[K;K] +::: +f s (e l )[K;K] wheref s (e2E db ) is theK-mer ofe. Therefore, S is a transformation ofP by f s . A supporting read set for ans2S is dened byR s =fr2Rjr is a subsequence of sg. Then, we callsexplainsR s like Bambus 2, andR s supportss. Now, sequence assembly on the graph is to nd a set of sequencesS =fs 1 ;s 2 ;:::;s T gS which satises[ T i=1 R s i = R. Let us thinkfR 1 ;:::;R T g is the smallest subset offR 1 ;:::;R jSj g satisfying[ T i=1 R s i = R. In this case, the assembly problem is to nd the smallest suchS. This problem can be reduced to Set Cover problem, which is well-known NP-complete problem, and a greedy approximation algorithm exists saying \choose the largest set rst"; that is, \choose a sequence s of the largestjR s j rst." Note that the s of the argmax s jR s j does not guarantee s is from the longest path, and vice versa for chimeric edges and uneven coverage. Moreover, none of the Optimal Linear Arrangement method of Bambus 2 and the Minimum Chain Decomposition method of Cuinks can be applicable to nd such SS because G db is not likely to be a directed acyclic graph (DAG) required by those methods. 31 Explainability Measure For a given setS, mapping tools such as PerM [15] provide polynomial time solution for constructingfR s js2Sg. Now, choosing one by one from the biggest to the smallestjR s j until[ s2S R s =R is straightforward. However, we observed that this process produced many similar sequences, and those sequences have almost identical sets of supporting reads. The reason is simple; the situation, which many paths share a short path p in the middle, is common, and those paths share R fs(p) . Moreover, this has another serious conceptual problem. Sequencing is to sample reads from DNA/RNA sequences, so there is strict one-to-one correspondence between reads and sequences. However, a read may support multiple sequences by the supporting read concept; in other words, the supporting read concept does not exactly re ect sequencing process. Therefore, we use a measure called Explainability which does not force one-to-one correspondence but a read may support multiple sequences in fractional, however, summation of the fractions should be exactly 1. Roughly speaking, r supports s as much as P (rjs), and let R s =fr 1 ;:::;r n g. Then, the explainability of the s is dened as P n i=1 P (r i js). WEAV uses this explainability of s instead ofjR s j, and we used a species relative abundance in a community concept [6] [57] for the estimation. The probability of an event where a readr2R was sampled froms2S is proportional to the chance of choosing s inS. In an experiment, it represents how well expressed the sequence is (of a gene or an isoform) in the transcriptome, and it is equivalent to the abundance level of it. Cuinks [83] used the abundance level concept too, but WEAV does 32 not drop any s due to its low abundance level because we do not believe low abundance means the incorrectness of the sequence. 
Let a be the abundance level of s, then, a/ jRpj jRjjsj . Starting from this basic idea, we proposed following EM-algorithm [59]. Assume two observations such asR =fr 1 ;r 2 ;:::;r N g and S =fs 1 ;:::;s M g, and an unknown parameter A = (a 1 ;:::;a M ) which is an abun- dance level vector where a i is related to s i . The likelihood of observingR from a given set of sequencesS is(A) =P (RjA;S) = Q n i=1 P m j=1 I(r i ;s j ) js j jjr i j+1 a j whereI(r;s) = 1 when r supports s, otherwise 0. However, we cannot nd the maximum likelihood estimation (MLE) of A because of missing information that species which read was sampled from which sequence. Let Z be the missing information with Z ij = 1 when i th read r i comes from j th sequences j , otherwise 0. Then, modify the likelihood function as 0 (A) =P (RjA;Z;S) = Q N i Q M j a j js j jjr i j+1 Z ij . Note thatZ ij = 1 impliesZ ik = 0 for allk6=j, andI(r i ;s j ) does not exist because Z ij already has the information. Finally, we have l(A) =log( 0 (A)) = P N i=1 P M j=1 Z ij (log(a j )log(js j jjr i j + 1)). The EM-algorithm is as follows. Initial Step: a (0) i = 1=M (uniform abundance as- sumption) for i2f1;:::;Mg. E-Step: We will maximize following Q-function in the M Step. Q(A;A (k) ) = E(l(A)jA (k) ;R;S) = P N i=1 P M j=1 E(Z ij jA (k) ;R;S)(log(a j ) log(js j jjr i j + 1)), and we have ^ Z (k) ij = E(Z ij jA (k) ;R;S) = P (Z ij = 1jA (k) ;R;S) = a (k) j js j jjr i j+1 = P M h=1 I(r i ;s h ) js h jjr i j+1 a (k) h . M-Step: Q(A;A (k) ) = P N i=1 P M j=1 ^ Z (k) ij (log(a j ) log(js j jjr i j+1)) is maximized with a condition P M j=1 a j = 1 whena (k+1) j = P N i=1 ^ Z (k) ij = P N i=1 P M j=1 ^ Z (k) ij . Now, let the explainability of s i 2 S be f exp (s i ), and f exp (s i ) = P N i=1 Z ij =a j P N i=1 P T j=1 ^ Z (k) ij =a j N. 33 With explainability, we will choose sequences in S from having the biggest explain- ability to having the least explainability until P s2S f exp (s) =jRj, and we callS as a contig set, and the sequences inS are called contigs. 3.2.2 Correctness Measure No matter how meritorious the rening process is, the process is unavoidably slow if S is oversized. Therefore, WEAV tries to make S smaller into S 0 by discarding incorrect sequences before the rening process. Consistency of Sequences Let us dene consistency of a given s2S with R s . For example, two reads r 1 and r 2 in Fig. 3.2 share a K-mer in the middle (K = 15), and the other parts are dierent from each other. Then, two reads make a tangled structure (also called repetitive structure) in G db . In this case, a path e i1 ee o1 ande i2 ee o2 are supported by r 1 andr 2 respectively, but the other pathse i1 ee o2 ande i2 ee o1 are not supported by any reads and called inconsistent. Therefore, ifp =:::e i1 ee o2 :::, neitherr 1 norr 2 will be inR fs(p) , ande i1 ;e; andR s does not support e o2 assuming r 1 and r 2 are the only reads supporting them. Figure 3.2: A tangled (complex) structure 34 Generally, a contig s is consistent with R s when any length mbp (L m > 1) subsequences ofs is a subsequence of at least one read inR s . Letr2R s be aligned with s[pos(r);pos(r) +L 1] whereL is the read length. Then, s is consistent withR s andm if and only if overlaps between two adjacent reads are at least (m 1)bp (Thm. 3.2.1). Longer m is more conservative avoiding chimeric contigs, but may miss correct contigs by low coverage. Theorem 3.2.1. LetR s =fr 1 ;:::;r M g be sorted by their alignment starting positions on s; pos(r i )pos(r i+1 );i2f1;:::;M 1g. 
3.2.2 Correctness Measure

No matter how meritorious the refining process is, it is unavoidably slow if $S$ is oversized. Therefore, WEAV tries to shrink $S$ into a smaller $S'$ by discarding incorrect sequences before the refining process.

Consistency of Sequences

Let us define the consistency of a given $s \in S$ with $R_s$. For example, the two reads $r_1$ and $r_2$ in Fig. 3.2 share a $K$-mer in the middle ($K = 15$), while their other parts differ from each other. The two reads then make a tangled structure (also called a repetitive structure) in $G_{db}$. In this case, the paths $e_{i1} e e_{o1}$ and $e_{i2} e e_{o2}$ are supported by $r_1$ and $r_2$ respectively, but the other paths, $e_{i1} e e_{o2}$ and $e_{i2} e e_{o1}$, are not supported by any reads and are called inconsistent. Therefore, if $p = \dots e_{i1} e e_{o2} \dots$, neither $r_1$ nor $r_2$ will be in $R_{f_s(p)}$, and no read supports the junction $e_{i1} e e_{o2}$, assuming $r_1$ and $r_2$ are the only reads supporting these edges.

[Figure 3.2: A tangled (complex) structure]

Generally, a contig $s$ is consistent with $R_s$ when every length-$m$ bp ($1 < m \le L$) subsequence of $s$ is a subsequence of at least one read in $R_s$. Let $r \in R_s$ be aligned with $s[pos(r), pos(r) + L - 1]$, where $L$ is the read length. Then $s$ is consistent with $R_s$ and $m$ if and only if the overlaps between any two adjacent reads are at least $(m-1)$ bp (Thm. 3.2.1). A longer $m$ is more conservative, avoiding chimeric contigs, but may miss correct contigs under low coverage.

Theorem 3.2.1. Let $R_s = \{r_1,\dots,r_M\}$ be sorted by alignment starting positions on $s$, $pos(r_i) \le pos(r_{i+1})$, $i \in \{1,\dots,M-1\}$. Then $s$ is consistent with $R_s$, in the sense that every length-$m$ bp ($1 < m \le L$) subsequence of $s$ is in $R_s$, if and only if the overlap threshold $(m-1)$ is satisfied; that is, $pos(r_1) = 1$, $pos(r_M) = |f_s(p)| - L + 1$, and $pos(r_i) + L - pos(r_{i+1}) \ge m - 1$ for all $i \in \{1,\dots,M-1\}$.

Proof. ($\Rightarrow$) Assume every length-$m$ bp subsequence of $s$ is in $R_s$, but some $r_i$ and $r_{i+1}$ do not satisfy the overlap threshold $m-1$. We can assume $pos(r_i) + L - pos(r_{i+1}) = m - 2$ without loss of generality. Then $s[pos(r_{i+1}) - 1,\ pos(r_i) + L]$ is a subsequence of neither $r_i$ nor $r_{i+1}$, and its length is $pos(r_i) + L - pos(r_{i+1}) + 2 = m - 2 + 2 = m$. Therefore at least one length-$m$ subsequence of $s$ is not in $R_s$, so $s$ is not consistent, which contradicts the assumption. Therefore consistency with $m$ implies the $(m-1)$ overlap threshold.

($\Leftarrow$) Conversely, assume $pos(r_1) = 1$, $pos(r_M) = |f_s(p)| - L + 1$, and $pos(r_i) + L - pos(r_{i+1}) \ge m - 1$ for all $i \in \{1,\dots,M-1\}$. This means every position of $s$ is covered by at least one read in $R_s$. Let an arbitrary subsequence of $s$ be $s[\alpha, \alpha + m - 1]$, of length $m$. By the overlap threshold, there should be at least one $r \in R_s$ which includes $s[\alpha, \alpha + m - 1]$ as a subsequence. If no such read exists, then for some $i \in \{1,\dots,M-1\}$ we have $pos(r_i) + L - 1 < \alpha + m - 1$, i.e., $pos(r_i) + L - m < \alpha$, and $\alpha < pos(r_{i+1})$; the overlap between $r_i$ and $r_{i+1}$ is then $pos(r_i) + L - pos(r_{i+1}) = (pos(r_i) + L - m + 1) + m - 1 - pos(r_{i+1}) < \alpha + m - 1 - pos(r_{i+1}) < \alpha + m - 1 - \alpha = m - 1$. Therefore the overlap between $r_i$ and $r_{i+1}$ is $< (m-1)$, which contradicts the assumption. As a result, satisfying the overlap constraint with $pos(r_1) = 1$ and $pos(r_M) = |f_s(p)| - L + 1$ implies that every length-$m$ bp ($1 < m \le L$) subsequence of $s$ is in $R_s$.
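The condition of Theorem 3.2.1 reduces to a single linear scan over the sorted alignment positions. A minimal sketch, assuming 1-based positions as in the theorem; names are illustrative.

```python
def is_consistent(positions, s_len, L, m):
    """Check consistency of a contig with its supporting reads (Thm. 3.2.1).

    positions : sorted 1-based alignment start positions of the reads on s.
    s_len     : contig length |s|; L: read length; m: window size (1 < m <= L).
    Consistent iff the reads reach both ends of s and every pair of adjacent
    reads overlaps by at least m - 1 bp.
    """
    if not positions:
        return False
    if positions[0] != 1 or positions[-1] != s_len - L + 1:
        return False
    return all(p + L - q >= m - 1 for p, q in zip(positions, positions[1:]))

# every length-m window of s is then contained in at least one read
print(is_consistent([1, 40, 80, 101], s_len=200, L=100, m=30))  # True
```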
Correctness Analysis with Bootstrapping

In addition to the consistency check, WEAV uses a novel nonparametric bootstrap method to answer how correct a given sequence is. The read set $R$ is sampled from a pool of many mRNA sequences. Our correctness measure is based on a simple observation: a true sequence $s \in S$ is likely to be sampled well, to have many supporting reads in $R_s$, and to have every region covered well by $R_s$. For example, if a sequence is a chimera combining two distantly related regions but has many supporting reads, the sequence may have high explainability. However, because the joint may not be sufficiently covered, the reads covering the joint may not be included in a resampling, and the sequence will be inconsistent in the resampled read set. We use this observation for our bootstrapping. For $R = \{r_1, r_2, \dots, r_N\}$, resample $N$ reads from $R$ with replacement and call the result $R^*$. We can check the consistency of each $s \in S$ with $R^*$, and how frequently each $s$ was consistent over $B$ repeated resamplings is our correctness measure. For example, let a $K$-mer be supported by $\lambda$ reads. Then $P(\text{the } K\text{-mer is in } R^*) = P(\text{at least one of the } \lambda \text{ reads is in } R^*) = 1 - (1 - \lambda/N)^N$, which converges quickly to $1 - e^{-\lambda}$ as $N$ increases. Also, $e^{-\lambda} \to 0$ as $\lambda$ increases. $\lambda$ is closely related to the coverage, and the coverage is generally high, about 50-100x. Roughly speaking, a $K$-mer is supported by about $c(L-K+1)/L$ reads on average under a Poisson assumption [61], where $c$ is the coverage and $L$ is the read length. Now assume $R$ supports an $s$ well, but $s$ has a $K$-mer in the middle of the path supported by only $\lambda$ reads (it might be a chimeric sequence). In this situation, that $K$-mer controls the overall correctness, which is bounded by $(1 - e^{-\lambda})$. For example, if $\lambda = 1$, then no matter how well supported the other regions are, the correctness is upper-bounded by 63%. On the other hand, if only a few reads, $\alpha$ and $\beta$ of them, support the $K$-mers representing the two ends of a sequence, the correctness is upper-bounded by $(1 - e^{-\alpha} - e^{-\beta} + e^{-(\alpha+\beta)})$, and such a sequence might be a wrong extension.

With a correctness threshold $c_{TH}$, WEAV can remove or fix wrong sequences efficiently. Assume an $s$ has a low correctness measure $c$ because only a few reads support some consecutive $K$-mers of $s$. We can estimate the correctness again assuming that many reads support those $K$-mers; denote the new correctness by $c'$. Suppose $c < c_{TH}$, which indicates a wrong sequence, but $c' \ge c_{TH}$, which indicates a correct one. Then we can fix the sequence by removing the consecutive $K$-mers: $s$ is a chimera if the $K$-mers are in the middle of $s$, and splitting $s$ will produce two correct sequences; $s$ has wrong ends if the $K$-mers are at an end, and trimming $s$ will produce a correct sequence. However, this is not always possible if the resulting sequence is short. Therefore, the correctness measure is a tool for removing incorrect sequences, and it reduces the size of $S$.

Explainability and Correctness

The explainability and correctness measures are synergistic. Explainability tells which sequence explains how much of $R$. Therefore, high explainability of a sequence justifiably means that the sequence is true, but the converse is not necessarily correct: low explainability does not necessarily mean a sequence is incorrect. Oppositely, low correctness of a sequence strongly suggests it is incorrect; however, it is hard to say that a sequence with high correctness is a true sequence, or more likely to be true than another. In short, explainability is for declaring true sequences true, but not for declaring incorrect sequences incorrect. Conversely, correctness is for declaring incorrect sequences incorrect, but not for declaring true sequences true. On the other hand, explainability is estimated in comparison with the other sequences, so the other sequences in $S$ affect the explainability of $s$; having comparatively fewer incorrect sequences in $S$ therefore guarantees a better explainability estimation, whereas the correctness of a sequence is estimated independently of the other sequences. So applying the correctness measure first and the explainability measure second is the best order to maximize the synergy, and WEAV uses the measures in this order (see Fig. 3.1).
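A sketch of this bootstrap, reusing the is_consistent sketch above. For brevity it resamples the contig's own supporting reads rather than all of $R$ as the text describes, which slightly changes the null model; the B = 100 default and all names are illustrative.

```python
import random

def bootstrap_correctness(read_positions, s_len, L, m, B=100, seed=0):
    """Nonparametric bootstrap correctness of one contig.

    read_positions : alignment start positions of the reads in R_s on s.
    Resample len(R_s) reads with replacement B times; the correctness is
    the fraction of resamples in which the contig stays consistent
    (Thm. 3.2.1). A poorly covered joint is often missing from a resample,
    so a chimera scores low.
    """
    rng = random.Random(seed)
    n, hits = len(read_positions), 0
    for _ in range(B):
        star = sorted(rng.choice(read_positions) for _ in range(n))
        hits += is_consistent(star, s_len, L, m)   # bool counts as 0/1
    return hits / B
```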
3.2.3 Graph Simplification

The size of the path space $|P|$ increases exponentially as $|E_{db}|$ increases, and this makes the problem exponentially difficult no matter how reliable the explainability and correctness measures are. Thus we need to control the size of $|P|$, and WEAV tries to remove erroneous edges and untangle repetitive structures through a so-called graph simplification process.

The edges caused by sequencing errors may introduce bubbles, dangling sources/sinks, transitive edges, and so on. On the other hand, biological variations such as SNPs, alternative splicing, and genomic variations may introduce such tangled structures too. Therefore, the goal is to remove edges arising from sequencing errors while keeping edges arising from biological variations, in order to minimize $|S|$, because such tangled structures make $|S| = |P|$ grow exponentially in proportion to the number of erroneous edges.

There are several existing graph simplification and error correction methods. Read-threading: align reads on $G_{db}$, and filter out erroneous junctions when they are not supported by the alignments [12]. Mate-threading: similar to read-threading, but align read pairs instead of reads, and use the distance statistics between paired reads for junction filtering [63]. Spectral alignment: start from the assumption that error-free $K$-mers have high copy numbers in $R$, and try to make all $K$-mers error-free [71], [74]. These methods use complicated decision-making processes and demand much memory to store additional information.

Our graph simplification and error correction process is based on recoverability between $s$ and $s'$. Let $L$ be the read length, and let $r \in R$ and $s \in S$ be aligned with $m$ mismatches. $r$ is called mappable on $s$ when $m \le \epsilon L$ for a given arbitrary number $0 < \epsilon \le 1$, and we call this $\epsilon$-tolerance. Formally, the definition of $R_s$ changes to $R_s = \{r \in R \mid \exists i \in [1, |s| - L + 1]$ such that the number of mismatches between $s[i, i+L-1]$ and $r$ is $\le \epsilon L\}$, where $|s|$ is the length of the sequence $s$. Let $s = \dots b_{i-1} b_i b_{i+1} \dots$, and suppose there exists $r \in R_s$ suggesting $b'_i \neq b_i$ for the $i$-th position; then $s' = \dots b_{i-1} b'_i b_{i+1} \dots \in S$ can be recovered from $s$ by $r$, and we call $s$ and $s'$ mutually recoverable by $r$. Let us skip the formal definition of recoverability, which is complicated but conceptually straightforward. Note that $s'$ (or equivalently $s$) does not have to be in $S$ when $s'$ is recoverable from $s$ by $R_s$.

WEAV substitutes super-edges for simple paths, a basic graph simplification step that also reduces memory consumption. Let $indegree(e)$ and $outdegree(e)$ of $e = (v_s, v_e) \in E_{db}$ be the number of edges whose ending node is $v_s$ and whose starting node is $v_e$, respectively; that is, the numbers of incoming and outgoing edges of $e$. First, a path $p = e_1 e_2 \cdots e_l$ is called a simple path when $indegree(e_i) = 1$ for all $i \in \{2,\dots,l\}$ and $outdegree(e_i) = 1$ for all $i \in \{1,\dots,l-1\}$. Then $p$ can be replaced by a longer edge $e$, called a super-edge, whose sequence is $f_s(p)$. This super-edge construction is a short-range assembly. A super-edge $e$ has three weights, in-weight, out-weight and total weight, denoted $w_I(e)$, $w_O(e)$ and $w_T(e)$ respectively; therefore $w_I(e) = w_O(e) = w_T(e)$ if $|f_s(e)| = K$. For the super-edge $e$ of the simple path $p = e_1 \cdots e_n$, where each $e_i$ may itself be a super-edge, $w_I(e) = w_I(e_1)$, $w_O(e) = w_O(e_n)$ and $w_T(e) = \sum_{i=1}^{n} w_T(e_i)$.
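The super-edge bookkeeping can be sketched directly from these rules: consecutive edge sequences overlap by $K-1$ bp, so collapsing a simple path concatenates each following edge's bases from position $K$ onward. The edge representation below is illustrative, not WEAV's sorted-array implementation.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    seq: str        # f_s(e); length K for an original edge, longer for a super-edge
    w_in: int       # in-weight  w_I(e)
    w_out: int      # out-weight w_O(e)
    w_total: int    # total weight w_T(e)

def collapse_simple_path(chain, K):
    """Merge a simple path of (super-)edges e_1 ... e_n into one super-edge."""
    # f_s(p): consecutive sequences overlap by K - 1 bp
    seq = chain[0].seq + "".join(e.seq[K - 1:] for e in chain[1:])
    return Edge(seq,
                w_in=chain[0].w_in,                      # w_I(e) = w_I(e_1)
                w_out=chain[-1].w_out,                   # w_O(e) = w_O(e_n)
                w_total=sum(e.w_total for e in chain))   # w_T(e) = sum of w_T(e_i)
```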
The graph simplification process is as follows, where "edge" means super-edge.

1. Removing erroneous edges:

(a) Removing bubbles: Multiple paths sharing start and end nodes form a bubble. We call a bubble simple when each path is a super-edge, and WEAV considers simple bubbles only. Let $e$ and $e'$ be two of them, and assume $e$ is recoverable from $e'$ (equivalently, $e'$ is recoverable from $e$). Then remove $e$ when $w_T(e) < w_T(e')$, or remove $e'$ otherwise.

(b) Removing dangling edges: Let $e = (v_{in}, v_{out})$; it is a source edge when no other edge ends at $v_{in}$, and a sink edge when no other edge starts at $v_{out}$. Let $e = (v_{in}, v_{out})$ be a source edge, but with other non-source edges ending at $v_{out}$. In this case, $e$ might be a dangling edge when $w_O(e) = 1$ and the other edges have out-weights $> 1$. When $e$ is a dangling edge and is recoverable from the other non-source edges, remove $e$. Apply the same rule to the sink-edge case.

2. Detaching edges: WEAV tries to avoid wrong traversals. Assume a node has multiple outgoing edges, and one of them, $e$, has $w_I(e) = 1$ while the other edges have in-weights $> 1$. Then $e$ might be a random match caused by sequencing error(s). In this case, detach $e$ from the node instead of removing it. Apply the same rule to the multiple-incoming-edges case using the out-weights. Later, if the detached edges are recoverable, the dangling-edge removal will remove them.

3. Removing low-weighted edges: In general, not only sequencing errors [74] but also noisy reads produce low-weight edges. We classify a read as noisy when it comes from another gene that is not supposed to be in $R$. Any statistical way of choosing a weight threshold introduces bias; therefore, we simply remove the lowest-weighted edges iteratively, in the sense of total weight. Using the total weight favors longer edges over shorter edges with high weight. Theoretically, one sequencing error introduces a new edge of length $2K-1$ bp. If $|f_s(e)| \gg 2K-1$ for an edge $e$, the edge can hardly have come from sequencing errors or noisy reads. However, a short edge with high weight is possible from one noisy read with multiple copies. Therefore, this process uses total weight rather than average weight.

4. Iterating until the graph becomes simple: Repeat the above processes until the number of paths in the graph goes below a predefined number of contigs $T_T$. $T_T$ can easily be determined by reflecting prior information about the number of genes and the number of isoforms.

Removing low-weighted edges competes with both removing erroneous edges and detaching edges during the iterations. Removing low-weighted edges guarantees a simple enough graph; the process, however, never assures that no true edges are removed. On the other hand, both removing erroneous edges and detaching edges induce longer simple paths, and hence longer super-edges, so these processes make the simple paths more tolerant of the low-weight removal process. This process is heuristic, but combined with the statistical tools of Sec. 3.2.2, the whole process becomes statistically meaningful. Moreover, the removed edges can be restored during the Finalizing of Sec. 3.3.3 and the Bridging of Sec. 3.3.4.

3.2.4 Read Clustering: WEAV-CLUSTER

Transcriptome assembly must consider multiple genes and multiple isoforms together. This differs from regular genome assembly, which assumes only one underlying sequence. Moreover, the isoforms of a gene are similar to each other because they share common exons. Therefore, dividing a whole read set into smaller read sets, called clustered read sets or clusters, is beneficial if one gene or a few genes form each clustered read set. This is the purpose of WEAV-CLUSTER, and the problem is formulated as follows.

Clustering

Let a whole read set $R$ have $N$ reads $\{r_1, r_2, \dots, r_N\}$. When $r_i, r_j \in R$ share at least one length-$O$ bp subsequence, denote it $r_i \cap_O r_j \neq \emptyset$, and say that $r_i$ and $r_j$ are connected by an $O$-mer. This can be extended to arbitrary read sets $R$ and $R'$: $R$ and $R'$ are connected ($R \cap_O R' \neq \emptyset$) if and only if there exists $(r, r') \in (R, R')$ such that $r \cap_O r' \neq \emptyset$. For a read $r$, we can define $R_r = \{r' \in R \mid r \cap_O r' \neq \emptyset\}$ and call $R_r$ a neighborhood of $r$. $R$ is called closed when $R_r \subseteq R$ for all $r \in R$. Now we can formulate the clustering problem as follows: for a given whole read set $R$ and an overlap constraint $O$, find a set of read sets $C = \{R_1, R_2, \dots, R_M\}$ satisfying $R = \cup_{i=1}^{M} R_i$, with every $R \in C$ closed. An $R \in C$ is called a clustered read set, or simply a cluster. Therefore, the clustering problem is to find the greatest partition under the $O$ bp overlap constraint; however, we relax the mutual-exclusiveness condition of the partition problem [8] later.
To solve this problem, let us define a graph $G = (V, E)$, where $V = R$ and $E = \{e = (r, r') \mid r \cap_O r' \neq \emptyset,\ r, r' \in R\}$. The clustering problem then becomes the problem of partitioning $G$ into smaller disjoint subgraphs, and the nodes of each subgraph will be an element of $C$. The partition is the greatest clustering $C$ if and only if all the nodes in each partition are pairwise connected. This can easily be solved by Breadth-First Search (BFS).

However, the graph $G$ is hard to implement under limited time and space. Potentially, any read can be connected to any other read, so the construction of $E$ would take much time and memory. Therefore, we use an $(O+1)$-dimensional de Bruijn graph as a backbone, defined as in Sec. 3.2.1 with $K$ replaced by $(O+1)$, and we call this graph $G_C = (V_C, E_C)$. $G_C = (V_C, E_C)$ is memory-efficient and easy to construct. By BFS we can partition $V_C$; let the partition be $V_1,\dots,V_M$, such that $V_i \cap V_j = \emptyset$ for $i \neq j$ and $\cup_{i=1}^{M} V_i = V_C$. For each $V_i$, WEAV-CLUSTER constructs a clustered read set $R_i \in C$: for example, $r \in R$ is an element of $R_i$ when $r$ has at least one vertex of $V_i$ as a substring, and this process takes $O(NL \log |V_C|)$.

Now let us consider the size of $O$ and its effect. With $O = 1$, all the reads will be connected and $C = \{R\}$; with $O = L$, each unique read will be its own clustered read set. Neither is desirable: we want $O$ short enough that reads of the same origin are clustered together, and long enough that reads of different origins are split into different clustered read sets. Figure 3.3 gives a rough idea of the effect of the size of $O$. We clustered RefSeq mRNA sequences with various $O$-mers from 11 bp to 100 bp and counted how many genes were clustered with other genes, called ambiguously clustered genes. As the figure shows, the curve has not yet converged at 100 bp: more than 10% of genes were ambiguously clustered even with a 100-mer, and only about 60% of genes were clustered correctly with a 35-mer. Therefore, we applied $O$-mers iteratively from short to long, and the result is discussed in Sec. 3.3.1.

[Figure 3.3: Ambiguously clustered genes]
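The backbone clustering can be sketched without materializing $G$: link reads through shared length-$(O+1)$ tuples with union-find, which yields the same connected components as BFS over the de Bruijn backbone. This is an illustrative sketch, not WEAV-CLUSTER's sorted-array implementation.

```python
def cluster_reads(reads, O):
    """Partition reads into clusters connected by shared length-(O+1) substrings."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    first_seen = {}                          # (O+1)-mer -> first read carrying it
    for i, r in enumerate(reads):
        for j in range(len(r) - O):
            kmer = r[j:j + O + 1]
            if kmer in first_seen:
                union(i, first_seen[kmer])   # shared tuple => same cluster
            else:
                first_seen[kmer] = i

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```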
3.2.5 The Idea about Paired-End Reads

The paired-end (or mate-pair) sequencing technique, which sequences both ends of each insert in a library, is commonly used on all sequencing platforms. The size of the insert library determines the distance statistics between pairs, and the library preparation method determines that size. Each sequencer provides insert-library size statistics such as the average and variance. Generally, a shorter insert library gives a smaller variance, and a longer one gives a bigger variance.

First of all, long reads are better than short reads for solving all the difficulties of sequence assembly, such as wrong basecalling, unknown orientation, lack of coverage, repeats, and chimeras (indel errors); these problems would vanish if a sequencer could generate a single read for a whole genome. We can think of paired-end sequencing from this angle. Let the insert size be $I$ bp and the read length be $L$ bp. A read pair carries the information of both ends of the insert, while the middle $(I - 2L)$ bp of the insert is unknown. The potential of the paired-end technique lies between the potentials of the length-$L$ bp and length-$I$ bp single-end techniques, because paired-end sequencing mimics long single-end sequencing. The method called read-threading is a way of solving the difficulties of assembly using single-end reads, and a method using paired-end reads should be virtually the same as applying read-threading with $I$ bp single-end reads. The remaining problem is how to resolve the ambiguity of the unknown $(I - 2L)$ bp and the variable length of $I$. This is a statistical problem, but not a very difficult one. Construct $G_{db}$ as usual and align the two reads of a pair. There may or may not be a path between the two reads. Short-range assembly can tell what lies in the middle if paths exist. Moreover, we can find the best path according to its length, which should be similar to the average insert size. This straightforward method can tell how to untangle complex structures of $G_{db}$, make $G_{db}$ simple, and enhance overall assembly performance.

3.3 Implementation

We discussed the mathematical framework in Sec. 3.2; in this section we discuss how we implement the method. The order of this section is opposite to that of Sec. 3.2, following the running order rather than the logical order.

WEAV removes low-complexity reads, which have short repetitive patterns, and low-quality reads as measured by quality scores. Reads with unusually high copy numbers might come from repeats or sequencing bias, so we suppress the copy numbers of such reads to a predefined number. This preprocessing is a common step for all assemblers.

3.3.1 Read Clustering: WEAV-CLUSTER

WEAV-CLUSTER breaks $R$ down into $C$ as discussed in Sec. 3.2.4. Ideally, each cluster consists of the complete set of reads from one gene. Fortunately, this clustering step does not have to be perfect, because the assembly algorithm of WEAV can assemble multiple homologous sequences simultaneously. Moreover, the combination of a repeat-rich human transcriptome, short reads, sequencing errors, and so on makes ideal clustering impossible (see Fig. 3.3). Therefore, the goal is not to be perfect but to avoid ambiguity as much as we can. The process has two steps: (1) partition $G_R$ and (2) cluster $R$.

Partition $G_R$: There are trade-offs between short and long $O$ (see Sec. 3.2.4). To utilize the advantages of both, we use a naive but powerful iterative approach. WEAV-CLUSTER uses a short $O$-mer first to keep related reads in the same clusters. If unreasonably large clusters exist, it clusters them again with a longer $O$-mer. Note that the order from short to long is crucial, because keeping related reads in the same clusters is more urgent than splitting unrelated reads into different clusters: WEAV can assemble multiple sequences together, but it can do nothing if related reads are split apart. The choice of $O$ depends on the target species. For the human transcriptome, we found that $O < 20$ is meaningless because almost all genes share at least one 20-mer with other genes, and about 40% of genes share at least one 35-mer (see Fig. 3.3). We cluster the read set with $O = 35$ first and increase it up to about $O = 90$, because $O > 90$ hardly ever improves the result: if two genes share a 90 bp subsequence, they will almost surely share even a 500 bp subsequence.
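The short-to-long iteration can be phrased as: cluster with a small $O$, then re-cluster only the clusters that remain unreasonably large using a longer $O$. A sketch reusing cluster_reads from the previous sketch; the $O$ schedule endpoints (35 and 90) follow the text, while the intermediate values and the size threshold are illustrative assumptions.

```python
def iterative_cluster(reads, o_values=(35, 45, 55, 70, 90), max_size=100_000):
    """Cluster with a short O first; re-cluster oversized clusters with longer O."""
    clusters = [list(range(len(reads)))]        # start with one cluster of indices
    for O in o_values:
        next_clusters = []
        for idx in clusters:
            if len(idx) <= max_size:
                next_clusters.append(idx)       # small enough: keep as is
                continue
            sub = cluster_reads([reads[i] for i in idx], O)
            next_clusters += [[idx[k] for k in part] for part in sub]
        clusters = next_clusters
    return clusters
```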
Cluster $R$: After partitioning $G_R$, clustering $R$ according to the partitions is straightforward. Let the partition be $P_{G_R} = \{G_1 = (V_1, E_1), \dots, G_D = (V_D, E_D)\}$ satisfying $V_C = \cup_{i=1}^{D} V_i$ and $E_C = \cup_{i=1}^{D} E_i$, and assign membership ID $i$ to each $e \in E_i$. Because $G_R$ is just sorted $(O+1)$-mers (see Sec. 3.2.4), searching for an edge is fast ($O(\log |E_C|)$). Searching the $(O+1)$-mers of a read then gives the membership ID, and we can cluster $R$ accordingly. However, all the edges (or vertices) of a read should be members of the same partition, so for a given read, checking the membership of only one of its $(O+1)$-mers is enough, giving $O(|R| \log |E_C|)$.

Orientation Issue for Clustering

Given the clustering method, it is straightforward to turn a bi-directional read set into a one-directional one. We developed the concepts of mirror edge and mirror graph. An edge and its reverse complement are mirror edges of each other. For a given graph partition $P_{G_R} = \{G_1 = (V_1, E_1), \dots, G_D = (V_D, E_D)\}$, let $G_i, G_j \in P_{G_R}$. If $e$ and $e'$ exist in $E_i$ and $E_j$ respectively, and $e$ and $e'$ are mirror edges, define $G_i$ and $G_j$ as mirror graphs of each other; this extends easily to clusters and reads. Now define a new graph over the partition of $G_R$, $G_{P_{G_R}} = (V, E)$, where $V = P_{G_R}$ and $E = \{(G_i, G_j) \mid G_i \text{ and } G_j \text{ are mirror graphs}\}$. Then partition $G_{P_{G_R}}$ again, and let the partition be $P_{G_{P_{G_R}}} = \{P_1,\dots,P_{D'}\}$, where each $P \in P_{G_{P_{G_R}}}$ is a graph whose vertices are in $P_{G_R}$. We cluster $R$ according to the partition of $G_{P_{G_R}}$. In this case, the orientation of each read of $R$ can be determined by searching for a bipartite structure in each partition $P \in P_{G_{P_{G_R}}}$. The idea is clear: if $(G_i, G_j) \in P \in P_{G_{P_{G_R}}}$, then the reads supporting $G_i$ and the reads supporting $G_j$ may have opposite orientations. Surely there can be exceptions, which may make the bipartite search difficult, but we saw that such cases are rare. The revised process is: (1) search for a bipartite structure in each $P \in P_{G_{P_{G_R}}}$; (2) if no bipartition exists, remove the most trivial $G \in P$, in the sense of the number of nodes in $G$, and search for a bipartition again; (3) the bipartite graph has two groups, and each $G \in P$ should be in one of them; we then form each partition $P \in P_{G_{P_{G_R}}}$ into a single graph by collecting all nodes and edges of each $G \in P$, adding the reverse complements of the nodes and edges of the opposite group during this process; (4) collect reads according to these merged graphs. Finally, searching for the reverse complement of each read is necessary too: if a read contains a reverse-complemented $(O+1)$-mer of a merged graph, the read must be added to the cluster, but after reverse-complementing. Therefore, after applying WEAV-CLUSTER, the orientation of the reads within a cluster is not ambiguous but one-directional, and WEAV is free from ambiguous orientation.
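The bipartite search amounts to 2-coloring the mirror-graph: subgraphs joined by a mirror edge must receive opposite orientations. A minimal sketch over an adjacency list, assuming the mirror relation has already been computed; it reports failure when no bipartition exists, the case the text resolves by dropping the most trivial subgraph and retrying.

```python
from collections import deque

def orient_partitions(n, mirror_edges):
    """2-color n subgraphs G_1..G_n connected by mirror edges.

    Returns a list of orientations (+1 / -1), or None if some component
    is not bipartite.
    """
    adj = [[] for _ in range(n)]
    for i, j in mirror_edges:
        adj[i].append(j)
        adj[j].append(i)
    color = [0] * n                        # 0 = unvisited
    for start in range(n):
        if color[start]:
            continue
        color[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] == 0:
                    color[v] = -color[u]   # mirror graphs get opposite orientation
                    queue.append(v)
                elif color[v] == color[u]:
                    return None            # odd cycle: not bipartite
    return color
```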
3.3.2 Construction of the de Bruijn Graph and Graph Simplification

We construct the $K$-dimensional de Bruijn graph defined in Sec. 3.2.1. However, WEAV does not use $V_{db}$ of $G_{db} = (V_{db}, E_{db})$ explicitly; it uses only $E_{db}$, as a sorted array. After graph construction, WEAV transforms simple paths into super-edges, and the weights of the super-edges play a decisive role in graph simplification. Note that the graph is just sorted edges, and an edge can now be longer than $K$ bp. In Fig. 3.4, the annotations label the edges. Let us discuss the process using an example.

[Figure 3.4: A graph simplification example. Panels: (a) original; (b) after Step 1; (c) after Step 2; (d) start of the 2nd iteration; (e) 2nd iteration, Step 1; (f) 2nd iteration, Step 2.]

Figure 3.4 shows the graph simplification process, which removes errors and ambiguities from $G_{db}$. Let the edges of Fig. 3.4 be super-edges, and let the graph of Fig. 3.4(a) be derived from 26 50 bp reads (some of them omitted). Black edges come from healthy reads without any error and thus may have high weights, whereas colored edges come from erroneous reads and thus may have low weights.

Figure 3.4(a) → 3.4(b): Low-weighted edges are removed when and only when they have a counterpart and are recoverable from it. We allow up to 4% dissimilarity between two edges for them to be considered recoverable. For example, the green arc of the nested bubble structure has a low total weight, and the sequence of its paired black arc is the same in length and similar in content. The dangling edges (red ones) are also recoverable, and have low in-weight and out-weight respectively (see Sec. 3.2.3). We can therefore remove them, and they can be recovered in the finalizing stage by read mapping if needed. Note that we call these dangling edges and bubbles simple because each has a direct counterpart in the super-edge set. Moreover, we maintain the weights of the removed edges by adding them to their counterpart edges. In this way the sequencing depth is preserved during the process, and the error-possessing edges become more noticeable by their comparably lower weights, while the remaining edges become heavier and gradually safe from the subsequent low-weight removal. The blue arc also forms a bubble, but a complex one, because its counterpart is not an edge in the super-edge set. WEAV does not handle this complex situation, only simple cases like the green arc.

Figure 3.4(b) → 3.4(c): On the other hand, let an edge $e$ have multiple outgoing edges, one of which has weight 1 while the others have much higher weights. We believe the weight-1 edge is spurious, and thus we detach it, as in Fig. 3.4(c); multiple incoming edges are handled in a similar way.

Figure 3.4(d) → 3.4(f): Figure 3.4(d) shows the simpler super-edge graph. The blue arc is eventually removed as a dangling edge in the second iteration, as in Fig. 3.4(e). Finally, only one super-edge remains in Fig. 3.4(f). This is how the graph simplification process works, and the whole process may be iterated until no edge can be simplified.

In between iterations: WEAV tries to remove the low-weight edges between adjacent iterations if needed. For this, WEAV finds the lightest total weight ($w_T^{min}$) and removes all the edges whose total weights are at most $w_T^{min} + 1$ (this step was skipped in the example). Although this process is heuristic, combined with the statistical ways of evaluating and selecting contigs, the whole process is statistically meaningful. Moreover, the removed edges can be restored later, during Step 4 of the sequence assembly of Sec. 3.3.3 and the Bridging of Sec. 3.3.4.
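The in-between pruning rule is small enough to show directly; a sketch, assuming edges carry the total weight $w_T$ defined earlier and reading "are $w_T^{min} + 1$" as "at most $w_T^{min} + 1$", which is our interpretation of the garbled original.

```python
def prune_lowest_weight(edges):
    """Remove the lightest super-edges between simplification iterations.

    edges: list of (edge_id, w_total) pairs. Drops every edge whose total
    weight is at most w_T_min + 1, where w_T_min is the lightest total weight.
    """
    if not edges:
        return edges
    w_min = min(w for _, w in edges)
    return [(e, w) for e, w in edges if w > w_min + 1]
```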
3.3.3 Sequence Assembly

We propose the following process for de novo RNA assembly.

Step 1. Traverse the graph and propose a sequence set $S$: WEAV traverses the graph from all sources to all sinks simultaneously using Depth-First Search. Note that the graph simplification process controls the size of $P$. The graph may have cycles caused by repetitive sequences and low-complexity $K$-tuples shared by multiple unrelated reads, so a true path representing a gene transcript may visit some edges multiple times. Thus we use two parameters, $T_{rep}$ and $T_{len}$, to control the maximum number of repetitions of an edge in a path and the length of the path. During graph traversal, a traveler may visit an edge multiple times, but not more than $T_{rep}$ times, and the total traversal distance cannot exceed $T_{len}$ bp.

Step 2. Validate the sequences of $S$: For a given sequence $s \in S$, we use the fast mapping program PerM [15] to get the supporting read set $R_s$, allowing up to 4% mismatches (see Sec. 3.2.1). Let $R_s = \{r_1, r_2, \dots, r_n\}$ be the read set sorted according to alignment positions in the sequence, with positions $pos_1, pos_2, \dots, pos_n$ where $0 \le pos_i \le pos_{i+1} < |s|$ for all $i$. The path is valid when every two adjacent reads overlap by at least $K$ bp; that is, $pos_i + L - pos_{i+1} \ge K$ must hold, where $L$ is the read length, for the path to be valid. Remove invalid sequences from $S$.

Step 3. Fix and prune the valid sequences with the correctness measure: For the valid sequences, we first fix wrong basecalls by majority vote according to the read alignments, and second estimate their correctness (see Sec. 3.2.2). When the correctness of a sequence is extremely low ($< .2$), WEAV tries to improve it by trimming or splitting. We remove sequences whose correctness is still low ($< .2$) or whose length is short ($\le 300$ bp). Removing invalid and incorrect sequences shrinks the size of the sequence set $S$.

Step 4. Measure explainabilities and propose a contig set $\mathcal{S}$: The explainability measure, denoted $f_{exp}$ (see Sec. 3.2.1), lets us choose a contig set $\mathcal{S} \subseteq S$. Let $S = \{s_1, s_2, \dots, s_M\}$ be sorted by explainability such that $f_{exp}(s_i) \ge f_{exp}(s_{i+1})$ for $1 \le i < M$. We want to form $\mathcal{S} = \{s_1,\dots,s_T\}$ for some $T \le M$ satisfying $\sum_{s \in \mathcal{S}} f_{exp}(s) \ge (1-\epsilon)|R|$. Ideally $\epsilon$ would be zero, but practically it should be non-zero, and WEAV uses $\epsilon = .01$.

3.3.4 Bridging the Proposed Contigs: WEAV-STITCHING

The graph simplification process can be considered a denoising process, so WEAV uses a harsh, conservative method to remove as much noise as possible (Sec. 3.3.3). As a consequence, a contiguous path might be broken into shorter paths. However, we observed that $\mathcal{S}$ can serve as a barometer to tell informative reads from noisy ones, and we can try to join contigs into longer ones using a relaxed graph simplification process where possible. We start from the assumption that $\mathcal{S}$ is correct, and WEAV wants to use reads that overlap an $s \in \mathcal{S}$ at either end. The procedure is as follows. (1) For every $s \in \mathcal{S}$, collect a fixed number of $K$-mers from both ends of $s$ (we chose 5); these form the initial edge set $E$. (2) Let $E(r)$ be the $K$-mer set of $r$. For all $r \in R$, if $E(r) \cap E \neq \emptyset$, update $E$ with $E = E \cup E(r)$. (3) Repeat step (2) until there is no update. (4) Add $\mathcal{S}$ to $E$, but after trimming the corresponding bases from both ends of each $s \in \mathcal{S}$, because those $K$-mers are already in $E$. (5) Run assembly again with $E$. This procedure is straightforward but powerful in compensating for the likely sacrifices of the graph simplification process.
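WEAV-STITCHING's fixed-point collection of $K$-mers is essentially a transitive closure over reads; the following sketch mirrors steps (1)-(3), seeding with 5 terminal $K$-mers per contig end as in the text. Names and the simple loop structure are illustrative.

```python
def stitch_kmer_pool(contigs, reads, K, n_end=5):
    """Grow the edge set E for re-assembly around contig ends (steps 1-3)."""
    def kmers(s):
        return {s[i:i + K] for i in range(len(s) - K + 1)}

    E = set()
    for s in contigs:                          # step (1): terminal K-mers
        ks = [s[i:i + K] for i in range(len(s) - K + 1)]
        E |= set(ks[:n_end] + ks[-n_end:])
    read_kmers = [kmers(r) for r in reads]
    changed = True
    while changed:                             # steps (2)-(3): expand to fixed point
        changed = False
        for rk in read_kmers:
            if rk & E and not rk <= E:         # read touches E but adds new K-mers
                E |= rk
                changed = True
    return E   # steps (4)-(5): add trimmed contigs to E, then reassemble
```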
3.4 Results

We simulated datasets by random sampling from the RefSeq database to evaluate and compare the performance of WEAV against three other de Bruijn graph methods: Velvet, ABySS and SOAPdenovo. To zoom in on the comparison, we used two more datasets to analyze the effects of equally expressed multiple isoforms of a gene and of a noisy dataset, and we applied WEAV to an experimental human brain transcriptome dataset to show its practical power.

3.4.1 Ground Truth and Evaluation

We do not use the whole set of annotated human gene transcripts in RefSeq or Gencode as the ground truth, because the complete transcriptome is unlikely to be fully covered by an RNA-seq dataset. Instead, we propose a concept, sequenced, for segments, transcripts and/or genes, to estimate a ground truth. We first map each read in $R$ to an annotation database such as RefSeq, Gencode or the hg19 Human Genome. A read from a transcript is probably aligned on the transcript; conversely, a read aligned on a transcript is probably from the transcript. Assume we sampled $n$ reads from a transcript, and the mapping results tell us their sampling positions. Name the sampling positions $pos_1$ to $pos_n$ in increasing order, $pos_i \le pos_{i+1}$ for all $1 \le i < n$. Let $pos_k + L - pos_{k+1} \ge O$ for all $i \le k < j$. Then the segment from position $pos_i$ to position $(pos_j + L - 1)$ is continuous, and if it is long enough, say $(pos_j + L - pos_i) \ge l_{min}$, we call it a sequenced segment. This concept is extended to transcripts and genes, which are sequenced if they have at least one sequenced segment. In this work, we used an overlap constraint $O = 40$ bp and a minimum segment length $l_{min} = 300$ bp. Note that sequenced is different from expressed. An expressed gene may not be a sequenced gene if the sequencing depth is not deep enough. Therefore, all sequenced genes are expressed genes, but not all expressed genes are sequenced genes. Finally, the collection of sequenced genes/transcripts/segments is the ground-truth estimate.

The evaluation process is also based on the mapping results. We mapped the contig set $\mathcal{S}$ to RefSeq or Gencode using BLAT [45], a gapped alignment tool, and mapped it (if needed) to the hg19 Human Genome using BLAST, another, slower but more precise, gapped alignment tool. If a contig is aligned with a sequenced segment/transcript/gene, the segment/transcript/gene is classified as detected. Because neither RefSeq nor Gencode is complete, the assembly results may include new transcripts, which may align to the hg19 Human Genome. If a whole contig is aligned with neither RefSeq nor Gencode but is aligned with the hg19 Human Genome, we claim it is a new transcript. On the other hand, if a contig is aligned with either RefSeq or Gencode but the alignment spans multiple genes, we call it a chimera or a chimeric assembly.
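The sequenced-segment definition translates directly into a scan over the sorted mapping positions; a minimal sketch with the $O = 40$ bp and $l_{min} = 300$ bp values from the text.

```python
def sequenced_segments(positions, L, O=40, l_min=300):
    """Find 'sequenced' segments of a transcript from read start positions.

    positions : sorted start positions of the reads mapped to the transcript.
    Adjacent reads with overlap >= O (pos_k + L - pos_{k+1} >= O) chain into
    one continuous segment; keep segments of length >= l_min.
    """
    segments, i = [], 0
    while i < len(positions):
        j = i
        while j + 1 < len(positions) and positions[j] + L - positions[j + 1] >= O:
            j += 1
        start, end = positions[i], positions[j] + L - 1
        if end - start + 1 >= l_min:              # length = pos_j + L - pos_i
            segments.append((start, end))
        i = j + 1
    return segments
```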
3.4.2 Simulation Study

Simulated Datasets and Ground Truth

To evaluate the performance of the assemblers, we prepared several simulated datasets using 1,000 genes randomly selected from RefSeq, where each gene may have multiple isoforms. The sampling process is: (1) choose a gene randomly according to a predefined, exponentially distributed expression level; (2) choose an isoform randomly with equal chance over all transcripts if the gene has multiple transcripts; (3) choose a position randomly with equal chance over all positions; (4) take 100 bp from that position; and (5) apply a 2% sequencing error rate at each position, a typical error rate for second-generation sequencers. We sampled high (10M reads, ~200x coverage), middle (1M reads, ~20x coverage) and low (100K reads, ~2x coverage) coverage datasets, and we prepared three more datasets, one for each, by giving the reads random orientations to test the orientation features of the assemblers. Table 3.1 summarizes these simulated datasets. In the simulation study, the ground truth is not estimated but given, and the table roughly shows it. The coverages of the sequenced segments were about 240x for the high-coverage dataset, 38x for the middle, and 8x for the low-coverage dataset. This wide range of coverages is enough to demonstrate the stable performance of WEAV. Note that these coverages differ from the overall coverages mentioned earlier: they are computed only over the sequenced segments, which are the only segments we care about.

Table 3.1: Statistics of 6 simulated datasets

| Dataset | # Reads | Orientation | # Seq. Genes | # Seq. mRNA | # Seq. Segments | Effective Coverage | Average Coverage | Useless Reads |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H-10M.fa | 10M | Orientation-Free | 987/1,000 | 1,513/1,600 | 2,041 | 208.20 | 242.30 | 7,662 |
| M-1M.fa | 1M | Orientation-Free | 754/1,000 | 1,088/1,600 | 1,739 | 20.82 | 38.56 | 14,519 |
| L-100K.fa | 100K | Orientation-Free | 427/1,000 | 559/1,600 | 1,076 | 2.08 | 8.76 | 14,316 |
| H-Ori-10M.fa | 10M | With Orientation | 987/1,000 | 1,513/1,600 | 2,041 | 208.20 | 242.30 | 7,662 |
| M-Ori-1M.fa | 1M | With Orientation | 754/1,000 | 1,088/1,600 | 1,739 | 20.82 | 38.56 | 14,519 |
| L-Ori-100K.fa | 100K | With Orientation | 427/1,000 | 559/1,600 | 1,076 | 2.08 | 8.76 | 14,316 |

Clustering the Datasets

The simulated read sets were clustered as described in Sec. 3.3.1 to enhance assembly efficiency, as discussed in Sec. 3.2.4. The goal of the clustering is to prepare one read set per gene, so we tested the one-to-one correspondence between read sets and genes. Figures 3.5, 3.6 and 3.7 show the results for the various coverages. WEAV-CLUSTER roughly satisfied the one-to-one correspondence between sequenced segments and clustered read sets. Some clusters contained multiple sequenced segments, while most clusters had only one sequenced gene (see panels (a) and (b) of the figures); this means that those multiple sequenced segments in a cluster came from one gene. The fact that reads from a gene were not scattered across multiple clusters (see panels (c) and (d) of the figures) supports this. Note that the genes in the figures might not be sequenced genes but merely expressed genes, so some genes might have no reads, and hence no cluster assigned to them. Similarly, some sequenced segments were not found anywhere because of their low coverage. Moreover, the one-to-one correspondence did not deteriorate even on the low-coverage dataset, where many genes were not sequenced.

[Figure 3.5: Clustering results for the orientation-free high-coverage dataset, H-10M.fa. Panels: (a) # clusters vs. # genes/cluster; (b) # clusters vs. # sequenced segments/cluster; (c) # genes vs. # clusters/gene; (d) # sequenced segments vs. # clusters/sequenced segment.]
[Figure 3.6: Clustering results for the orientation-free middle-coverage dataset, M-1M.fa. Panels as in Figure 3.5.]
[Figure 3.7: Clustering results for the orientation-free low-coverage dataset, L-100K.fa. Panels as in Figure 3.5.]

3.4.2.1 The Best-Performing K-mers

We tested four assemblers, WEAV, Velvet (v1.0.07), ABySS (v1.2.1) and SOAPdenovo (v1.05), and evaluated their performance by specificity and sensitivity analyses based on the foregoing ground truth. We used default parameters for all programs except ABySS, for which the `-e0' option was used because the eroding parameter misled the assembly. However, the most influential parameter, the $K$-mer, was determined by testing all $K \in \{21, 23, \dots, 49\}$, because there is still no reliable way of predicting the best, or at least a decent, $K$-mer for given conditions such as coverage, read length, target domain, and so on. Therefore researchers test all possible $K$-mers in some range, and we followed this practice. Note that we used only odd numbers, to avoid palindrome-like matches due to reverse complements [90]. Contigs shorter than 300 bp were discarded.

Figure 3.8 shows two statistics for the high-coverage dataset: the most widely used statistic, N50, which is the length of the longest sequence in $\{s \in S \mid \sum_{s' \in S,\ |s'| \ge |s|} |s'| \ge \sum_{s' \in S} |s'| / 2\}$, and the lengths of the longest contigs. Apparently, WEAV outperformed the other assemblers. Figure 3.8 also shows other statistics, with more detailed analyses given there.

[Figure 3.8: Comparison statistics for the orientation-free high-coverage simulated dataset, H-10M.fa. Panels: (a) # contigs (>= 300 bp); (b) length of the longest contig (bp); (c) total length of all contigs (bp); (d) mean contig length (bp); (e) N50 (bp); (f) N90 (bp).]
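The N50 statistic used throughout this section follows the definition just given; a minimal sketch (N90 is the same computation with a 0.9 factor).

```python
def n50(lengths, fraction=0.5):
    """Length of the longest contig s such that contigs of length >= |s|
    account for at least `fraction` of the total assembly length."""
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if acc >= fraction * total:
            return length
    return 0

print(n50([100, 200, 300, 400]))        # 300  (400 + 300 = 700 >= 500)
print(n50([100, 200, 300, 400], 0.9))   # 200  (the N90 variant)
```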
The performance of WEAV was maintained up to the middle-coverage dataset (see Figure 3.9), but it was merely comparable with the other assemblers on the low-coverage dataset (see Figure 3.10), whose coverage was about 8x (see Table 3.1). However, 8x coverage is rare for RNA-seq in common situations.

[Figure 3.9: Comparison statistics for the orientation-free middle-coverage simulated dataset, M-1M.fa. Panels as in Figure 3.8.]
[Figure 3.10: Comparison statistics for the orientation-free low-coverage simulated dataset, L-100K.fa. Panels as in Figure 3.8.]

The best $K$-mer for each assembler according to its N50 is shown in Table 3.2, together with other statistics. WEAV tends to perform better at short $K$-mers, the other assemblers at long $K$-mers. This can be explained by the nature of the de Bruijn method, from which WEAV differs (discussed later). WEAV needs the rich connectivity of short $K$-mers up to a certain degree, whereas a simple graph from a long $K$-mer is desirable for the others.

Table 3.2: The best K-mers for each assembler

| Assembler | K-mer | # Isoforms | Longest | Mean | Total Length | N50 | N90 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WEAV | 27-mer | 2,392 | 15,387bp | 2,047.24bp | 4,896,989bp | 3,162bp | 1,649bp |
| ABySS | 49-mer | 1,325 | 10,914bp | 1,070.05bp | 1,417,816bp | 1,647bp | 429bp |
| Velvet | 49-mer | 1,467 | 6,670bp | 796.27bp | 1,168,135bp | 991bp | 379bp |
| SOAPdenovo | 49-mer | 1,361 | 9,547bp | 922.98bp | 1,256,181bp | 1,309bp | 388bp |

3.4.2.2 The Comparison of Assemblers

For the best $K$-mer of each assembler, we collected the contigs $\ge$ 300 bp and mapped them to the RefSeq database, from which the simulated datasets were derived, using BLAT [45]. There might be multiple alignment sites for each contig, but not all of them are meaningful, so we filtered them with the following rules. First, we discarded all mappings whose similarity (matched basepairs / length of the contig) was less than 90%. When a contig has a mapping with $\ge$ 90% similarity, we say the contig is mapped, or correct. Note that even a mapped contig may have other mappings whose similarities are less than 90%. For the mapped contigs, there were two further filtering rules, the full detection rule and the partial detection rule, applied according to the target sequence on which the contig aligned. Full detection rule: the contig should be long enough to cover $\ge$ 90% of the sequenced segment; only when the contig satisfies this condition does it become a mapped contig, and does the sequenced segment become detected. Partial detection rule: the condition of covering $\ge$ 90% of a sequenced segment is relaxed; thus, regardless of how much of a sequenced segment is covered, if the segment has at least one contig mapped on it, mapped and detected both apply. A sequenced gene/transcript becomes detected when at least one of its sequenced segments is detected.
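The two rules differ only in whether the alignment must also cover $\ge$ 90% of the sequenced segment; a small illustrative sketch, with alignment records as (similarity, covered fraction) pairs. The representation is an assumption for illustration.

```python
def detects(alignments, full_rule):
    """Decide whether a sequenced segment is detected by a contig.

    alignments : (similarity, segment_covered) pairs for one contig/segment,
                 both as fractions in [0, 1].
    Partial rule: similarity >= 0.9 suffices; the full rule additionally
    requires the contig to cover >= 90% of the sequenced segment.
    """
    for sim, cov in alignments:
        if sim >= 0.9 and (cov >= 0.9 or not full_rule):
            return True
    return False
```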
According to the aforementioned rules, we compared the four assemblers on the high-coverage dataset; the results are in Table 3.3, where specificity and sensitivity are defined as (# mapped contigs)/(# proposed contigs) and (# detected segments)/(# sequenced segments) respectively. The sensitivities for sequenced basepairs/genes/transcripts are defined similarly. WEAV outperformed the others on all measures except specificity under the partial detection rule.

Surprisingly, comparing the coverages of detected and undetected sequenced segments shows that ABySS, Velvet and SOAPdenovo could not detect highly covered sequenced segments but detected lowly covered ones, which contradicts common sense, while WEAV did better on highly covered sequenced segments. This comes from the nature of conventional de Bruijn graph methods, which try to find the simple paths defined in Sec. 3.2.3 and generally emit the sequences of those simple paths as contigs. Higher coverage means more diversity in the graph, due to a larger absolute number of sequencing errors and longer sequenced regions. When there are too many ambiguities from sequencing errors and the underlying structure is complex, assemblers struggle to find long simple paths even though they apply many algorithmic/statistical tools to remove ambiguities. This is why they did not perform well on the more highly covered sequenced segments. Moreover, the reason the other assemblers performed better with longer $K$-mers in Figure 3.8 can be explained the same way. A longer $K$-mer tends to make sequencing errors less effective and is more robust to repetitive structure than a shorter $K$-mer. Simply speaking, imagine a read with a sequencing error at position $L/2$: the error introduces a bubble for $K < L/2$ but a dangling edge for $K \ge L/2$. Therefore a longer $K$ is beneficial for the other de Bruijn methods, because dangling edges are generally easier to remove than bubbles. However, when the coverage is insufficient, a long $K$ leads to fragmented, short contigs. Moreover, not all sequenced segments are sufficiently covered in RNA-seq even when the overall coverage is high. For this reason, choosing a longer $K$ is not always desirable, and WEAV overcame this intrinsic problem with its statistical methods, explainability and correctness.

Next, Table 3.3 shows that WEAV's contigs have higher detecting power than the other assemblers'. As the rule changed from the partial to the full detection rule, WEAV lost 23% of its contigs, while the other assemblers lost about 54%-67%, owing to their lack of detecting power. In other words, WEAV produced long contigs, and they fully cover the underlying sequences (see Table 3.2). As a result, WEAV could reliably pinpoint sequenced segments. WEAV achieved 73.9% specificity and > 82% basepair-level sensitivity, while the other assemblers achieved 33%-46% specificity and 36%-52% sensitivity under the full detection rule. The situation was similar for the middle-coverage dataset (see Table 3.4), but on the low-coverage dataset WEAV's performance was merely comparable to the other assemblers' (see Table 3.5). The reason is clear.
As the coverage gets lower, the de Bruijn graph gets simpler, with fewer ambiguities, and the sequenced segments get shorter, so the situation is just the opposite of the highly covered case. Therefore WEAV gains no advantage over the other assemblers on a low-coverage dataset; at the same time, WEAV does not lose its power when the coverage is low. Lastly, as the rule changes from the partial to the full detection rule, WEAV sacrificed 3.42% of base-level sensitivity while 23% of its contigs were screened out. Therefore most of those 23% of contigs should be redundant; the other assemblers look worse, sacrificing about 10% of their sensitivities while 60% of their contigs were screened out. Therefore, larger portions of their contigs were redundant than WEAV's, which demonstrates the power of WEAV's error correction process in reducing redundancy.

Table 3.3: Assembly results for the high-coverage orientation-free dataset, H-10M.fa

Partial Detection Rule
| Assembler | Specificity | # Proposed Contigs | Sens. (Bases) | Sens. (Genes) | Sens. (Transcripts) | Sens. (Segments) | Cov. of Detected Segments | Cov. of Undetected Segments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WEAV | 96.9% | 2,392 | 86.18% | 88.15% | 85.99% | 73.9% | 270.82x | 64.39x |
| ABySS | 99.9% | 1,325 | 60.44% | 58.50% | 58.10% | 52.7% | 58.30x | 523.36x |
| Velvet | 100.0% | 1,467 | 57.77% | 56.64% | 57.11% | 50.5% | 64.81x | 485.13x |
| SOAPdenovo | 100.0% | 1,361 | 55.35% | 55.72% | 54.06% | 49.6% | 58.38x | 470.26x |

Full Detection Rule
| WEAV | 73.9% | 2,392 | 82.76% | 84.90% | 82.09% | 69.2% | 281.32x | 54.98x |
| ABySS | 46.1% | 1,325 | 52.16% | 49.65% | 45.74% | 42.2% | 49.74x | 452.22x |
| Velvet | 33.2% | 1,467 | 44.05% | 40.83% | 36.42% | 33.2% | 54.32x | 390.31x |
| SOAPdenovo | 40.4% | 1,361 | 45.97% | 44.68% | 40.05% | 37.2% | 42.51x | 412.27x |

Table 3.4: Assembly results for the middle-coverage orientation-free dataset, M-1M.fa

Partial Detection Rule
| Assembler | Specificity | # Proposed Contigs | Sens. (Bases) | Sens. (Genes) | Sens. (Transcripts) | Sens. (Segments) | Cov. of Detected Segments | Cov. of Undetected Segments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WEAV | 99.4% | 1,607 | 77.80% | 78.08% | 74.63% | 63.19% | 47.23x | 8.21x |
| ABySS | 100.0% | 1,109 | 74.84% | 67.51% | 67.51% | 59.46% | 38.20x | 39.65x |
| Velvet | 100.0% | 1,179 | 69.74% | 64.06% | 64.06% | 55.66% | 41.94x | 30.79x |
| SOAPdenovo | 100.0% | 1,155 | 71.74% | 66.05% | 66.05% | 59.29% | 37.28x | 41.83x |

Full Detection Rule
| WEAV | 65.5% | 1,607 | 72.47% | 67.51% | 67.51% | 57.21% | 49.42x | 9.98x |
| ABySS | 49.5% | 1,109 | 64.58% | 56.90% | 56.90% | 47.67% | 36.31x | 42.67x |
| Velvet | 64.8% | 1,179 | 51.85% | 43.50% | 43.50% | 37.32% | 34.63x | 42.80x |
| SOAPdenovo | 43.7% | 1,155 | 58.17% | 53.05% | 53.05% | 45.14% | 33.10x | 46.16x |

Table 3.5: Assembly results for the low-coverage orientation-free dataset, L-100K.fa

Partial Detection Rule
| Assembler | Specificity | # Proposed Contigs | Sens. (Bases) | Sens. (Genes) | Sens. (Transcripts) | Sens. (Segments) | Cov. of Detected Segments | Cov. of Undetected Segments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WEAV | 98.7% | 624 | 56.46% | 50.35% | 54.74% | 42.93% | 12.77x | 3.55x |
| ABySS | 100.0% | 482 | 70.49% | 61.12% | 66.19% | 56.78% | 10.59x | 4.39x |
| Velvet | 100.0% | 448 | 61.13% | 55.74% | 59.21% | 48.70% | 11.25x | 4.84x |
| SOAPdenovo | 100.0% | 486 | 62.95% | 56.21% | 60.47% | 48.42% | 11.71x | 3.74x |

Full Detection Rule
| WEAV | 53.84% | 624 | 48.74% | 42.62% | 45.26% | 34.85% | 13.09x | 4.63x |
| ABySS | 60.37% | 482 | 59.30% | 50.35% | 54.03% | 46.10% | 10.79x | 5.80x |
| Velvet | 47.54% | 448 | 43.30% | 38.41% | 39.53% | 33.36% | 10.66x | 7.30x |
| SOAPdenovo | 53.29% | 486 | 50.68% | 44.73% | 46.69% | 37.27% | 11.87x | 5.56x |

3.4.3 Experimental Study

We used a human brain transcriptome dataset of 58.58M reads, each 100 bp long. After preprocessing, we retained 22.68M finest-quality reads.

Clustering and De Novo Assembly

The clustering algorithm consists of three steps: constructing a de Bruijn graph, partitioning the graph into subgraphs, and collecting reads according to the subgraphs.
WEAV-CLUSTER produced 25,467 clustered read sets having $\ge$ 100 reads, and 510,601 having $\ge$ 2 reads. From the 510,601 read sets, WEAV assembled 156,494 contigs longer than 300 bp. The mean length of these > 300 bp contigs was 390.3 bp, and the maximum was 6,120 bp. Among them, 5,088 (3.2%) were > 1,000 bp.

Evaluation of the Assembly Results

To compute the ground truth, we downloaded the gene transcript sequences of RefSeq (Sep. 03, 2010) and of Gencode (Feb. 16, 2010) from the UCSC Genome Browser. We mapped the reads to the transcript sequences, allowing up to 4 mismatches, using PerM, and identified the sequenced segments (> 300 bp) of the gene transcripts, shown in Table 3.6 (Sec. 3.3.3). On average, each transcript in RefSeq and Gencode had 4.16 and 2.85 sequenced segments respectively.

Table 3.6: The ground truth of the sequenced genes, transcripts and segments for the human brain transcriptome experimental dataset

| Database | # Genes | # Sequenced Genes | # Transcripts | # Sequenced Transcripts | # Sequenced Segments |
| --- | --- | --- | --- | --- | --- |
| RefSeq | 22,110 | 14,776 (66.8%) | 34,260 | 22,745 (66.4%) | 94,542 |
| Gencode | 28,553 | 12,548 (43.9%) | 79,899 | 38,245 (47.9%) | 108,959 |

Figure 3.12 shows the length distributions of the WEAV contigs and the sequenced segments. Although the sequenced segments are longer (518.8 bp on average) than the WEAV-assembled contigs (390.3 bp on average), the sensitivity analysis below shows that the majority of these sequenced transcript segments are aligned with the contigs.

[Figure 3.12: The length distributions of (a) the contigs assembled by WEAV and (b) the sequenced transcript segments.]

Specificity of the Assembled Contigs

To examine how many of the assembled contigs are true, we searched the contigs against the RefSeq sequences, the Gencode sequences and the human genome sequence (hg19) using BLAT. Since BLAT is a local alignment tool, various alignments, from short to long, may be reported. We accepted a result if and only if the contig was fully mapped, as a partially mapped contig may be a chimeric assembly. In the end, 111,385 (71.2%) of the 156,494 contigs were mapped to either RefSeq or Gencode, with average lengths of 430.5 bp and 425.5 bp respectively. Among the 45,109 unmapped contigs, 39,269 (81.6%) were mapped onto the genome, with an average length of 302.5 bp; these might be new transcripts. Lastly, 5,840 (3.73%) contigs, with an average length of 319.9 bp, were not mapped anywhere. As a result, the specificity of the WEAV assembly results was remarkably high: 96.3%.

In Figure 3.11, we divided the contigs into 7 bins according to their length, and for each bin we plotted the fraction mapped to RefSeq only, the fraction mapped to Gencode only, the fraction mapped to both RefSeq and Gencode, the fraction of new transcripts, and the fraction of unmapped contigs. Most contigs (> 50%) were mapped to both RefSeq and Gencode, and more were mapped to RefSeq than to Gencode. This seems to indicate that RefSeq has a better human genome annotation than Gencode.

[Figure 3.11: Distribution of contig classes (mapped to RefSeq only, mapped to Gencode only, mapped to both RefSeq and Gencode, new transcripts, and unmapped contigs) for different contig lengths.]

Most of the 39,269 contigs were new transcripts. Figure 3.13 shows two examples of new transcripts discovered by WEAV: Figure 3.13(a) shows a contig (a new transcript) having 3 new exons, and (b) shows a contig (also a new transcript) having a new combination of known exons.
[Figure 3.13: UCSC Genome Browser views of two WEAV contigs. (a) A new transcript with new exons at the PRKACB locus; (b) a new transcript with a new combination of known exons at the KTN1 locus.]

We discovered 959 new transcripts using known exons defined in RefSeq, and 452 new transcripts using known exons defined in Gencode.

Sensitivity of the Assembled Contigs

We computed the fraction of the ground-truth genes and bases covered by the assembled contigs. 13,339 (90.3%) of the 14,776 sequenced RefSeq genes and 9,818 (78.2%) of the 12,548 sequenced Gencode genes were covered by some contigs. Likewise, WEAV assembled 40.5M bp (74.1%) of the 54.6M bp of sequenced RefSeq basepairs, and 40.9M bp (72.4%) of the 56.5M bp of sequenced Gencode basepairs. The overall sensitivity is thus 90% and 78% for the two databases in terms of the number of genes, and over 72% in terms of the number of basepairs. Note that we did not compute the fraction of ground-truth transcripts covered by the contigs, because the transcripts of one gene overlap each other, and computing that fraction would overestimate the sensitivity. For each sequenced transcript, we computed the fraction of the total length of its sequenced segments over the length of the transcript, and then the fraction of the total length of its assembled contigs over the length of the transcript. Figure 3.14 compares these two fractions; there is a positive correlation between them. A transcript in which most of the bases are covered by sequenced segments is also highly covered by contigs, and vice versa.

[Figure 3.14: (length of WEAV contigs)/(length of transcript) vs. (length of ground-truth segments)/(length of transcript).]

Overall Analysis

The specificity analysis shows that the number of assembled contigs mapped to RefSeq or Gencode is 111,385, with an average length of 428.4 bp. Considering that the number of ground-truth sequences (sequenced transcript segments) is 94,542 for RefSeq and 108,959 for Gencode, WEAV assembled about 1.18 and 1.02 contigs, respectively, per ground-truth sequence. On the other hand, the sensitivity analysis shows that these contigs cover > 72% of the bases of the ground truth. Combining the two analyses, we conclude that WEAV assembled about one contig per ground-truth sequence, but covered only 72% of the bases.

3.5 Discussion

We have developed and tested the de novo whole-transcriptome assembler WEAV. Not only does WEAV inherit the nifty features of the de Bruijn graph method, it also implements features specific to transcriptome assembly. We have shown that WEAV can successfully assemble all the isoforms of the gene ABR, and that WEAV is robust and capable on terribly noisy datasets on which all other programs failed. Finally, we showed that WEAV can assemble a large human brain transcriptome dataset with exceptionally high specificity (> 96%) and sensitivity (> 72% at the basepair level).

Moreover, WEAV-CLUSTER could efficiently remove ambiguity of orientation. Ambiguous orientation is a common problem, though the experimental dataset was free of it because a directional sequencing technique was used. WEAV, however, can be free from the ambiguous-orientation problem by using WEAV-CLUSTER, whose method was discussed in Sec. 3.2.4.
We prepared 3 parallel versions of the previously used simulated datasets for this problem (see Table 3.1). Figure 3.15 is the counterpart of Figure 3.8. As it shows, all four assemblers showed no difference after orientation was introduced. One surprising fact is that WEAV-CLUSTER for ambiguous orientation did even better in terms of the one-to-one correspondence between sequenced segments and clusters, because of its merging nature discussed in Sec. 3.2.4 (see Figure 3.16).

[Figure 3.15: Comparison statistics for the ambiguous-orientation high-coverage simulated dataset, H-Ori-10M.fa. Panels: (a) length of the longest contig (bp); (b) N50 (bp).]

In addition, WEAV-CLUSTER determined the orientations correctly. Table 3.7 shows how many reads were misclassified in orientation. Only 0.82% of reads in the high-coverage dataset were wrong, and the lower-coverage datasets had fewer, because a smaller absolute number of sequencing errors gives fewer random matches in the wrong direction. Although there were wrong orientations, they did not affect the overall performance.

Table 3.7: Some wrong clustering results

| Dataset | # Mixed-Orientation Clusters | # Reads | # Minor-Directional Reads |
| --- | --- | --- | --- |
| H-Ori-10M.fa | 11/2075 (0.53%) | 426,014/10M (4.26%) | 82,925/426,014 (19.47%) |
| M-Ori-1M.fa | 10/1651 (0.61%) | 19,707/1M (1.97%) | 1,699/19,707 (8.62%) |
| L-Ori-100K.fa | 5/820 (0.61%) | 141/100K (0.14%) | 15/141 (11.35%) |

[Figure 3.16: Clustering results for the high-coverage ambiguous-orientation dataset, H-Ori-10M.fa. Panels as in Figure 3.5.]
[Figure 3.17: Clustering results for the middle-coverage ambiguous-orientation dataset, M-Ori-1M.fa. Panels as in Figure 3.5.]
[Figure 3.18: Clustering results for the low-coverage ambiguous-orientation dataset, L-Ori-100K.fa. Panels as in Figure 3.5.]
[Figure 3.19: Comparison statistics for the ambiguous-orientation high-coverage simulated dataset, H-Ori-10M.fa. Panels as in Figure 3.8.]
[Figure 3.20: Comparison statistics for the ambiguous-orientation middle-coverage simulated dataset, M-Ori-1M.fa. Panels as in Figure 3.8.]
[Figure 3.21: Comparison statistics for the low-coverage ambiguous-orientation simulated dataset, L-Ori-100K.fa. Panels as in Figure 3.8.]

Chapter 4

MetaSEQ: De Novo Sequence Assembly of Short Regions in Metagenomics

4.1 Introduction

Metagenomics, also known as environmental genomics, is the study of microbial mixtures from various microbiota found in mammal gut, sea water, and soil. Metagenomics has recently been defined as "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species [14]."
Using new high-throughput sequencing technologies, researchers are now able to understand microbial diversity and even study functions such as metabolic division of labor. The general procedure first involves identifying the structure of sample populations and then estimating the abundance level of each population [38]. Finally, data gathered from these steps enable investigators to study microbial functions or interactions with each other and/or with their environments.

However, there are two major obstacles in metagenomics: the intractability of cloning for microbial samples, and the wide variety of species as well as the diversity within the same species. Although conventional shotgun sequencing technology has been applied to some metagenomics studies [7], [39], these studies all required cloning followed by slow and expensive sequencing. As a result, such approaches were limited to applications with simple samples consisting of a few species, or to well-studied samples for which genome information is available.

The development of next-generation high-throughput sequencers, such as the 454 Life Sciences Genome Sequencer FLX System (www.454.com), the Illumina 1G Genome Analysis System (www.illumina.com), Helicos GSS Sequencing (www.helicosbio.com), and Applied Biosystems SOLiD Sequencing (www.appliedbiosystems.com), has profoundly changed the approach to metagenomics. Sogin et al. explored the microbial populations of the deep-sea environment using the PicoTiterPlate (454 Life Sciences) and showed the diversity of what they termed the "rare biosphere" [79]. To accomplish this, they sequenced the V6 region of bacterial 16s rRNAs using a tag sequencing strategy, assigned phylotypes for each sequence, and constructed a database called V6RefDB. Huber et al. conducted similar research [38] to study the population structure of microbes in two different vents at different sites of the deep sea. Ley et al. [47] sequenced 16s rRNA genes in a study of microbial structure in the gut of multiple mammals to understand co-evolution of the host and the microbiota. They discovered that the microbiota of a mammal is related to both its diet and its phylogeny.

All of the aforementioned work was based on either direct sequencing of short regions, such as the V6 regions of 16s rRNA that are shorter than the length of a read, or cloning the species first, followed by sequencing and assembly using traditional shotgun sequence assembly programs. De novo metagenomic sequence assembly has been limited because no computational method has yet been devised to accommodate a metagenomic sample, which may consist of thousands of different species whose pairwise similarity values may range from 60% to more than 99%. In fact, since there is no agreement on the definition of a species, any value from 95% to 99% similarity has been used to cluster sequences into species. Therefore, the specific computational challenges are to (1) distinguish sequencing errors from actual base-level differences between different microbial sequences, (2) assemble multiple, highly similar microbial genome sequences accurately, while avoiding assembling chimera sequences, which are formed by concatenating sub-sequences from different species, and (3) develop fast and memory-efficient programs that can be used by individual biological laboratories, which, most likely, do not have access to high-performance computers.
Most modern de novo assemblers for short reads use either a de Bruijn graph-based technique, such as Velvet [90], EULER-SR [13], and ALLPATHS [9], or a greedy overlap-extension technique, such as SSAKE [88], SHARCGS [19], VCAKE [42], and Edena [34]. However, none of these programs can be applied to assemble reads obtained from metagenomic samples.

In this paper, we propose a de novo metagenomic sequence assembly program for short regions, termed MetaSEQ, which simultaneously assembles millions of reads into hundreds of species at different abundance levels. Although MetaSEQ was tested with a read length of 100 bps, which is the shortest length scientists believe useful for metagenomics studies [21], [28], and on rRNA regions, the computational framework of MetaSEQ is not restricted to these settings. We believe that MetaSEQ can meet the computational challenges enumerated above and can also be scaled up for de novo assembly of whole metagenomes.

4.2 Results

Traditional sequence assembly programs find only one consensus sequence. In contrast, the goal of MetaSEQ is to assemble millions of reads into multiple species across short regions. The backbone of MetaSEQ is the de Bruijn graph method, which was proposed by Idury and Waterman [40] and later developed by Pevzner and Waterman [71]. We chose the de Bruijn graph method because it implicitly encodes every possible sequence as a path in the graph; at the same time, we have overcome its most obvious weakness in handling sequencing errors by choosing the correct paths from an exponential number of possible paths in the graph. MetaSEQ consists of the following five computational steps.

1. Resolving orientations of reads. Traditional sequence assembly programs consider both orientations of a read in the assembly process. However, in metagenomic sequencing projects, a large number of reads are sequenced over a short region, so the overlaps among these reads provide sufficient information to determine the orientations of the reads. We have designed a simple greedy randomized algorithm to determine the orientations that maximize the total pairwise overlap among reads, measured by the number of pairs of common m-tuples. This step greatly simplifies the de Bruijn graph.

2. Correcting sequencing errors. We construct a hash table for all m-tuples in the reads. A sequencing error will cause the multiple m-tuples containing the erroneous base to be singletons, or "lower-coverage" tuples than the other tuples on the same read. Based on this observation, we designed a simple algorithm to identify bases that have this property and correct them.

3. Constructing a de Bruijn graph. We choose k-tuples to create a de Bruijn graph using reads with known orientations and with most sequencing errors having been corrected.

4. Assembling using the Partition-Ligation algorithm. Each assembled sequence, or path in the de Bruijn graph, is called a "template sequence." Ideally, the purpose of sequence assembly in the de Bruijn graph is to define a source node and a sink node, after which it is only necessary to traverse the graph from the source to the sink to find all template sequences and then estimate their abundance levels. However, the number of possible paths is exponentially large, so we designed a novel partition-ligation algorithm that first partitions the graph into multiple segments, assembles template sequences in the first segment, and then extends them into the next segments, one by one.

5. Clustering template sequences. Similar template sequences are clustered into species according to a pre-defined similarity threshold.

Details of each step are described in the sections entitled "Resolving Orientations of Reads" through "Clustering Template Sequences" in Methods.
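To make step 3 concrete, the following is a minimal sketch, in Python, of the graph construction detailed later in Methods: nodes are (k − 1)-tuples and edges are k-tuples weighted by their copy number in the reads. It is an illustration, not the actual MetaSEQ implementation, and it assumes the reads have already been orientation-resolved and error-corrected (steps 1 and 2).

from collections import defaultdict

def build_de_bruijn_graph(reads, k):
    # Edges are k-tuples with weights (copy numbers); nodes are
    # (k-1)-tuples linked to their successor nodes.
    edge_weight = defaultdict(int)
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            ktuple = read[i:i + k]
            edge_weight[ktuple] += 1
            successors[ktuple[:-1]].add(ktuple[1:])
    return edge_weight, successors

# Toy example: two overlapping, error-free reads from the same strand.
weights, succ = build_de_bruijn_graph(["ACGTAC", "CGTACG"], k=4)
print(weights["CGTA"])  # 2: this k-tuple occurs in both reads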
4.2.1 Datasets

We first obtained the following metagenomic datasets.

Datasets of Similar Species

- Data A. Data A consists of 10 partial 16s rRNA gene sequences (species), with lengths between 500 and 600 basepairs, from naturally occurring marine microbes sampled from the Pacific Ocean around Catalina Island (samples provided by Professor Jed Fuhrman's laboratory at the University of Southern California). The average pairwise similarity for these 10 species is 96.9%.

- Data B. Data B comes from the same source as Data A. The only difference is that Data B consists of 100 16s rRNA sequences. The average pairwise similarity for these 100 species is 93.2%.

Datasets of Dissimilar Species

- Data C. Data C consists of 100 partial 16s rRNA sequences obtained from the SILVA rRNA database project (http://www.arb-silva.de/). At the SILVA website, we queried 'Bacteria → Acidobacteria' from the SSU database and selected 100 sequences whose lengths are between 540 and 690 basepairs. The average pairwise similarity of Data C is about 50.1%.

- Data D. Data D comes from the same source as Data C. The only difference is that Data D consists of 1000 partial 16s rRNA sequences. The average pairwise similarity for these 1000 species is 52.9%.

Multiple read sets were generated to test the performance of our assembly program. First, we randomly assign an abundance level to each sequence using an exponential function and then normalize the levels; this simulation is based on what we have observed in real metagenomic samples. Then, we simulate the shotgun sequencing process as follows. We randomly select a sequence with probability equal to its abundance level, and then sample a 100-basepair read from it at a random position and at either orientation. After having sampled the reads, we randomly introduce sequencing errors in each read at a fixed rate, typically 0.2% or 1.0%.
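A minimal sketch of this simulation procedure in Python is given below. It is illustrative rather than the actual generator used for the datasets; in particular, the rate of the exponential draw and the uniform choice among the three alternative bases for an error are assumptions.

import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def simulate_reads(species, n_reads, read_len=100, error_rate=0.002):
    # Random abundance level per sequence from an exponential draw,
    # then normalized so the levels sum to one.
    levels = [random.expovariate(1.0) for _ in species]
    total = sum(levels)
    abundances = [x / total for x in levels]
    reads = []
    for _ in range(n_reads):
        # Select a sequence with probability equal to its abundance level.
        seq = random.choices(species, weights=abundances, k=1)[0]
        start = random.randrange(len(seq) - read_len + 1)
        read = seq[start:start + read_len]
        if random.random() < 0.5:               # either orientation
            read = read.translate(COMPLEMENT)[::-1]
        # Introduce substitution errors at a fixed rate.
        read = "".join(
            random.choice([b for b in "ACGT" if b != c])
            if random.random() < error_rate else c
            for c in read)
        reads.append(read)
    return reads, abundances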
4.2.2 Performance of MetaSEQ

Resolving Orientations

We simulate sets of 720K reads with various error rates using Data A and apply our greedy algorithm to resolve the orientations of the reads. Figure 4.1 shows the relationship between the sequencing error rate and the number of reads for which the algorithm could not determine the orientations (solid line). The computation time is also shown in the plot (dashed line). Clearly, Figure 4.1 shows that both the number of undecided reads and the computation time increase almost linearly with the sequencing error rate. To our surprise, even at the highest error rate (2%), the algorithm predicts orientations with 100% accuracy, while leaving only 0.028% of the reads undecided. Thus, we can comfortably discard these reads without affecting the assembly results, while benefiting from a much simplified de Bruijn graph. If necessary, the orientation of these undecided reads can be determined after the assembly.

Figure 4.1: Relationships between the sequencing error rate and the number of reads (out of 720K) for which the orientations cannot be determined by the MetaSEQ algorithm (solid line) and the computation time measured in seconds (dashed line).

Correcting sequencing errors

We simulate a set of 720K reads with an error rate of 0.2% using Data A. After resolving orientations, we scan each read for bases that cause a sharp increase or decrease (t = 10-fold) of the weights of m-tuples before and after these bases, and then correct them. We discover that scanning reads with multiple, variously sized m performs better than using m of a fixed size. For example, scanning reads 6 times using m = {50, 45, 40, 35, 30, 25}, one value per iteration, corrects 93.9% of the sequencing errors. Meanwhile, scanning with m = 50 for 6 iterations corrects only 88.8%, and scanning with m = 25 for 6 iterations corrects only 92%. Note that new errors can be introduced in this process, as some true polymorphic bases may be wrongly corrected. In this simulation, our results showed that only 3.9% of the total corrections were wrong. Pevzner et al. have argued that this is not a serious problem [71] because these new errors can be corrected back in the final assembly, when all the original reads are aligned for consensus calling.

To test how effectively this method works for read sets with higher error rates, we simulate another set of 720K reads with an error rate of 1.0% using Data A. Our method corrects 95.7% of sequencing errors while introducing only 2.0% new errors. This shows that our error-correction method works just as effectively for higher error rates.

Assembly results

We simulate a set of 100K reads with a 0.2% sequencing error rate using the 10 species in Data A. We assemble this dataset into template sequences using the Partition-Ligation algorithm implemented in MetaSEQ, setting the size of one segment at 150 basepairs. We cluster the template sequences using the center-based clustering algorithm with various similarity thresholds T_sim = 95%, 97%, 98%, and 99%, and estimate the abundance level for each cluster. To assess the accuracy of the clusters, we also cluster the 10 species using the same procedure, calculating the expected abundance level for each cluster along with their real abundance levels.

The consistency between the estimated clusters and the expected clusters is measured by a weighted L1 norm metric. For an underlying species u_i with abundance level A(u_i) and a cluster c_j with estimated abundance level Â(c_j), we can decide whether u_i is a member of c_j and use this membership information to compute the expected abundance level of c_j: E_A(c_j) = Σ_{i: u_i ∈ c_j} A(u_i). Now we have two vectors, E_A = (E_A(c_1), E_A(c_2), ...) and Â = (Â(c_1), Â(c_2), ...). The weighted L1 norm is defined by Δ(L1) = Σ_i |E_A(c_i) − Â(c_i)| · E_A(c_i). This weighted L1 norm places more emphasis on the major clusters with high expected abundance levels. The method is described in the section entitled "Clustering Template Sequences." Table 4.1 shows the comparison between the "Expected" abundance levels and the "Estimated" abundance levels for each cluster, together with the weighted L1 distance, Δ(L1). Although the 10 species in Data A are quite similar to each other, the results show that MetaSEQ performed well in assembling such a dataset, averaging Δ(L1) = 0.031, only slightly short of a perfect assembly (Δ(L1) = 0).
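A minimal sketch of the Δ(L1) computation follows, checked against the T_sim = 95% row of Table 4.1 (shown below); the numbers are taken directly from the table.

def weighted_l1(expected, estimated):
    # Delta(L1) = sum_i |E_A(c_i) - A_hat(c_i)| * E_A(c_i)
    return sum(abs(e - a) * e for e, a in zip(expected, estimated))

# T_sim = 95% row of Table 4.1, as fractions.
expected = [0.600, 0.400]
estimated = [0.569, 0.431]
print(weighted_l1(expected, estimated))  # ~0.031, i.e. the 3.1% in Table 4.1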
Table 4.1: Comparison of Expected abundance levels of clustered species (truth) and Estimated abundance levels of clustered template sequences (results from running MetaSEQ) under various similarity thresholds T_sim. Δ(L1) is the weighted L1 distance between the expected abundances and the estimated abundances.

T_sim              c1     c2     c3     c4     c5    c6    c7    c8    Δ(L1)
95%   Expected     60.0%  40.0%  -      -      -     -     -     -     3.1%
      Estimated    56.9%  43.1%  -      -      -     -     -     -
97%   Expected     44.1%  40.0%  15.9%  -      -     -     -     -     5.0%
      Estimated    52.7%  40.8%  6.5%   -      -     -     -     -
98%   Expected     40.0%  30.0%  15.0%  10.0%  5.0%  -     -     -     0.7%
      Estimated    40.8%  30.2%  13.6%  10.5%  4.9%  -     -     -
99%   Expected     40.0%  20.0%  10.0%  10.0%  5.0%  5.0%  5.0%  5.0%  3.5%
      Estimated    40.8%  9.5%   9.9%   0.2%   5.2%  4.9%  4.9%  3.6%
Average                                                                3.1%

4.2.3 Tests on multiple read sets

Tests on similar species

We simulate five sets of reads using Data A (10 species) and Data B (100 species) to test how the assembly results of MetaSEQ are affected by the following parameters: the number of underlying species (10 vs. 100), the sequencing error rate (0.2% vs. 1%), the number of reads (100K vs. 1M), and the similarity threshold T_sim (95%, 97%, 98%, and 99%). Table 4.2 shows the assembly results and performance of MetaSEQ on each read set under the various T_sim.

Table 4.2: Performance of MetaSEQ on assembling reads sampled from Datasets A and B. Δ(L1) is a metric based on the weighted L1 distance between the Expected clusters and the assembled clusters. The running time was measured on a regular 32-bit desktop computer.

Data  Read  #Species  Error  #Reads  T_sim  Expected #Clusters  #Templates  #Clusters  Δ(L1)   Time(s)
A     1     10        0.2%   100K    95%    2                   61          2          8.3%    4
            10        0.2%   100K    97%    3                   61          4          6.8%    4
            10        0.2%   100K    98%    4                   61          6          4.6%    4
            10        0.2%   100K    99%    8                   61          14         7.0%    4
      2     10        1.0%   100K    95%    2                   133         2          9.3%    7
            10        1.0%   100K    97%    3                   133         4          13.7%   7
            10        1.0%   100K    98%    4                   133         6          6.0%    7
            10        1.0%   100K    99%    8                   133         20         8.3%    7
B     3     100       0.2%   100K    95%    11                  208         2          10.9%   19
            100       0.2%   100K    97%    31                  208         4          2.9%    19
            100       0.2%   100K    98%    52                  208         9          7.3%    19
            100       0.2%   100K    99%    72                  208         19         7.2%    19
      4     100       0.2%   1M      95%    11                  348         2          18.9%   813
            100       0.2%   1M      97%    31                  348         5          1.3%    813
            100       0.2%   1M      98%    52                  348         10         2.1%    813
            100       0.2%   1M      99%    72                  348         31         3.2%    813
      5     100       1.0%   1M      95%    11                  234         3          2.2%    750
            100       1.0%   1M      97%    31                  234         6          2.6%    750
            100       1.0%   1M      98%    52                  234         8          2.8%    750
            100       1.0%   1M      99%    72                  234         26         5.4%    750

From Table 4.2, we observe the following properties of our assembly program:

- T_sim: The value of Δ(L1) fluctuates across the four thresholds of T_sim in every read set. Since the underlying species are highly similar, the clustering results are very sensitive to the threshold T_sim. For example, the number of expected clusters in read set 3, computed directly from the original species, jumps from 11 for T_sim = 95% to 72 for T_sim = 99%. As a result, even a couple of differing bases between the template sequences and the species can cause significant disagreement at the cluster level. Using Data A, we repeat the sampling of 100K reads 30 times to obtain 30 read sets and use them to test the performance of MetaSEQ. At the threshold T_sim = 98%, the average Δ(L1) is 4.6% and 4.9% for the 0.2% and 1.0% sequencing error rates, respectively; the standard deviations are 1.06% and 1.28%, respectively. Again, we do not observe any difference in computation time between these two error rates. This means that our algorithm efficiently corrects most errors and screens out chimera sequences.

- #Species: Increasing the number of underlying species (comparing read set 1 with read set 3) does not seriously affect the assembly results. The reads from rare species may potentially contribute to the assembly of some "chimera" template sequences that are similar to the original species (> 95% similarity), but distant enough (< 98% similarity) not to be clustered with them when T_sim is high.
On the other hand, the rare species are under-sampled and cannot be assembled at all. Figure 4.2 shows the distribution of the number of reads for each of the 100 species in read set 3, ordered from the highest to the lowest. Species ranked 40th and above are under-sampled and nearly invisible. Reads from these rare species may be assembled into other, similar species at higher abundance levels. This explains why MetaSEQ produces similar results for read set 1 and read set 3.

- Error rate: Contrary to what we expected, the error rate does not have a significant impact on the assembly results (comparing read set 1 with read set 2, and read set 4 with read set 5). This means that the error-correction algorithm works very well: it corrects most sequencing errors without making many mistakes.

- #Reads: Increasing the number of reads significantly improves the assembly results (comparing read set 3 with read set 4) in almost all cases except one. With 10 times more reads, the computation time increases only moderately, about 813/19 ≈ 43 times, somewhere between a linear increase (10 times) and a quadratic increase (100 times).

Figure 4.2: Distribution of coverage for read set 3 of Data B.

Table 4.2 shows that MetaSEQ accomplishes the assembly of every read set within 15 minutes on a regular 32-bit desktop. The computation time depends on the number of reads, as well as on the complexity of the underlying species and the error rate.

Test on dissimilar species

We simulate four read sets using Data C (100 species) and Data D (1000 species), each with 1M reads. For each dataset, the two read sets differ in the sequencing error rate, one with 0.2% and the other with 1.0%. The assembly results are shown in Table 4.4. We observe that the patterns of the results in Table 4.4 are similar to those in Table 4.2, indicating that the sequencing error rate has no major impact on the accuracy of the assembly and that more reads produce more accurate results. The difference is that the values of Δ(L1) are much more consistent across both thresholds of T_sim in both read sets because the underlying species are quite dissimilar. Again, MetaSEQ accomplishes the assembly of both read sets within a few minutes, although a longer running time is observed for the read set with more species.

Table 4.4: Performance of MetaSEQ on assembling reads sampled from Datasets C and D. Δ(L1) is a metric based on the weighted L1 distance between the Expected clusters and the assembled clusters. The running time was measured on a regular 32-bit desktop computer.

Data  Read  #Species  Error  #Reads  T_sim  Expected #Clusters  #Templates  #Clusters  Δ(L1)  Time(s)
C     6     100       0.2%   1M      95%    79                  36          16         7.3%   8
            100       0.2%   1M      98%    81                  36          21         7.4%   8
      7     100       1.0%   1M      95%    79                  84          11         4.2%   7
            100       1.0%   1M      98%    81                  84          15         5.8%   7
D     8     1000      0.2%   1M      95%    823                 335         41         4.6%   394
            1000      0.2%   1M      98%    831                 335         75         3.1%   394
      9     1000      1.0%   1M      95%    823                 245         50         5.1%   386
            1000      1.0%   1M      98%    831                 245         101        4.8%   386

4.3 Discussion

The common computational challenge in many metagenomic projects is to develop a robust de novo sequence assembly program that is capable of simultaneously assembling multiple, diverse, and yet highly similar microbial genomes or short regions. In this work, we developed such a program, termed MetaSEQ, which is the first to de novo assemble micro-reads of short regions in metagenomes. MetaSEQ consists of multiple novel statistical methods and algorithms that solve the challenging problems described above.
In particular, the proposed Partition-Ligation assembly algorithm greatly reduces the complexity of the assembly problem and can be easily scaled up for assembling much longer regions. Furthermore, the proposed algorithms for resolving the orientations of reads and for correcting sequencing errors are shown to be very efficient and accurate. Under all circumstances, MetaSEQ accurately assembles 1M reads and 1,000 species within a few minutes on a regular 32-bit desktop computer.

The error-correction algorithm is shown to be highly effective. However, we have also observed that many of the assembly errors were the result of this algorithm mistakenly correcting some true polymorphic bases. Because the underlying species are highly similar to one another, this type of mistake, although infrequent, can lead to some mis-assembled sequences and completely different clustering results. To reduce such mistakes, we can tune the threshold parameter t, which detects the increase or decrease of read coverage before and after a base, so as to reduce the total number of corrected bases. This results in fewer mistakes, but it also leaves more sequencing errors uncorrected.

4.4 Methods

4.4.1 Resolving Orientations of Reads

We formulate this as a simple two-category classification problem: each read is assigned either to the category Forward or to the category Reverse. Given any assignment of orientations to the reads, we can create a hash table H using the m-tuples in the reads, so that a read of length l has exactly l − m + 1 m-tuples. We define the weight of each m-tuple as the number of occurrences of that m-tuple in a defined set of reads, and we define w(r, H) to be the weight of a read r in the hash table H, calculated as the sum of the weights of its l − m + 1 m-tuples in H. The goal of the classification is therefore to maximize the total weight of all reads, Σ_r w(r, H).

We propose a simple randomized greedy algorithm to solve this classification problem. We randomly order the reads and compute the orientations of the reads one by one. Assume that we have assigned orientations to a set of i reads, S_i, and created a hash table H_i for S_i. Let a randomly selected (i+1)-th read be r. We then calculate the weight of the Forward strand of r in H_i as w(r, H_i), and the weight of the Reverse strand of r, denoted r', as w(r', H_i). If w(r, H_i) > w(r', H_i), we assign Forward to read r; otherwise we assign Reverse. Then, we add read r to S_i and expand H_i to H_{i+1}. We could also define a threshold t such that if w(r, H_i) − w(r', H_i) ≥ t we assign Forward to read r, and if w(r', H_i) − w(r, H_i) ≥ t we assign Reverse. If neither condition is satisfied, we either discard r or delay the assignment of r and proceed to the next random read.

This algorithm is obviously sensitive to the initial set of reads assigned orientations. To reduce this sensitivity, we repeat the above randomized algorithm B times and select the subset of reads whose orientation assignment is consistent over 95% · B runs, in other words, at a 95% confidence level.
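A minimal single-run sketch of this greedy pass is shown below. For brevity it omits the threshold t (undecided reads) and the B-fold repetition with the 95% consistency filter; both would wrap around this core.

import random
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(read):
    return read.translate(COMPLEMENT)[::-1]

def read_weight(read, table, m):
    # w(r, H): sum of the hash-table weights of the read's l - m + 1 m-tuples.
    return sum(table[read[i:i + m]] for i in range(len(read) - m + 1))

def resolve_orientations(reads, m):
    table = defaultdict(int)        # the hash table H_i
    oriented = []
    order = list(reads)
    random.shuffle(order)           # process the reads in random order
    for r in order:
        rc = revcomp(r)
        # Keep whichever strand overlaps the already-oriented reads more.
        chosen = r if read_weight(r, table, m) >= read_weight(rc, table, m) else rc
        oriented.append(chosen)
        for i in range(len(chosen) - m + 1):    # expand H_i to H_{i+1}
            table[chosen[i:i + m]] += 1
    return oriented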
4.4.2 Correcting Sequencing Errors

Pevzner and his colleagues proposed multiple methods to correct sequencing errors for the de Bruijn graph [11], [13], [71]. We take these natural insights to develop a different approach. Our approach is based on the assumptions that (1) in a read without sequencing errors, the weights of the m-tuples must change smoothly from the first m-tuple to the last m-tuple, (2) a sequencing error will cause a sharp decrease (or increase, if it occurs within the first m bases) of the weights of consecutive m-tuples, while correcting such an error should smooth the weights, and (3) an actual polymorphic base will most likely have multiple supporting reads (weights of multiple m-tuples > 1) unless it is a singleton. Therefore, we propose an algorithm that first hashes all m-tuples into a table, counting their weights, and then scans each read to detect bases causing a sharp, t-fold decrease or increase of the weights of the m-tuples. If more than one error occurs within a distance of m bases in a read, one scan will only correct the first error and miss the second. Thus, we perform multiple scans to correct all sequencing errors. The choice of m is a topic of study in itself; we have discussed it in the section entitled "Correcting sequencing errors."
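A minimal sketch of one scan, for a single m, is given below. It flags the first base whose m-tuple weights drop t-fold and substitutes whichever base best smooths the surrounding weights. The real procedure rescans several times over multiple values of m (e.g., 50 down to 25), and the detection of weight increases at the first m bases is omitted here for brevity.

from collections import defaultdict

def scan_and_correct(reads, m, t=10):
    table = defaultdict(int)
    for r in reads:
        for i in range(len(r) - m + 1):
            table[r[i:i + m]] += 1

    corrected = []
    for r in reads:
        w = [table.get(r[i:i + m], 0) for i in range(len(r) - m + 1)]
        for i in range(1, len(w)):
            if w[i - 1] >= t * max(w[i], 1):     # sharp t-fold drop
                pos = i + m - 1                   # base entering the low tuple
                # Try each substitution; keep the one whose m-tuples covering
                # pos have the largest total weight in the table.
                def score(b):
                    cand = r[:pos] + b + r[pos + 1:]
                    lo = max(0, pos - m + 1)
                    hi = min(pos, len(cand) - m)
                    return sum(table.get(cand[j:j + m], 0)
                               for j in range(lo, hi + 1))
                r = r[:pos] + max("ACGT", key=score) + r[pos + 1:]
                break                             # one correction per scan
        corrected.append(r)
    return corrected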
4.4.3 Constructing a de Bruijn Graph

We follow the conventional de Bruijn graph construction algorithm proposed by Idury and Waterman [40] to construct a graph using k-tuples for edges and (k − 1)-tuples for nodes. We add a source node pointing to the starting nodes and a sink node pointed to by the ending nodes. Although de novo assembly programs usually simplify the graph by contracting each simple path into one "superpath" [71], [74] or "supernode" [90] in order to resolve repeats and merge similar sequences, we do not apply this idea in our algorithm because we intend neither to resolve repeats nor to merge similar sequences.

4.4.4 Assembling using the Partition-Ligation algorithm

The graph-traversing algorithm

We propose the following three-step algorithm for de novo sequence assembly through graph traversal.

Step 1. Graph traversing. Our assembly algorithm requires only traversing the graph to find paths, also called template sequences, and then estimating their abundance levels. While traversing the de Bruijn graph from the source node to the sink node, we are interested in paths that satisfy the length constraints [D_min, D_max], which are generally known when short regions are selected for sequencing. The length constraint is important not only in restricting the number of paths being traversed but also in assembling repetitive sequences that form multiple cycles in the graph: the constraint keeps these cycles from being traversed too many or too few times. Generally, repetitive sequences are rare in short regions, as most of these regions are selected for sequencing precisely because they are conserved and free of repetitive sequences. Therefore, the length constraint provides a simple and effective solution to this problem.

Step 2. Validating paths. We call a read a supporting read for a path if all of its k-tuple edges are within the path, or simply, if the read is fully aligned with the path. We can exclude chimera sequences by examining supporting reads: for every edge on the path, we check whether it is covered by at least one supporting read for the path. Conversely, if any edge is not covered by any supporting read, the path represents a chimera sequence. Applying this process, we obtain a set of validated paths/template sequences.

Step 3. Estimating the abundance levels. Define N(t_j) to be the number of supporting reads for template sequence t_j, and A(t_j) to be the abundance level of t_j. Ideally, A(t_j) can be calculated as the fraction of the supporting reads for t_j among all reads: A(t_j) = N(t_j) / Σ_l N(t_l). However, this formula ignores the fact that one read can support multiple template sequences because of the similarities among the species sequences. Taking this into account, we propose an iterative method to estimate A(·). Define T(r_i) to be the set of template sequences supported by read r_i, and define w(r_i, t_j) to be the contribution of read r_i to template sequence t_j. In general, w(r_i, t_j) = 1 if T(r_i) = {t_j}, and w(r_i, t_j) = 0 if t_j ∉ T(r_i). The iterative method is as follows:

1. Initialize w(·): w(r_i, t_j) = 1/|T(r_i)| for t_j ∈ T(r_i), for all i.

2. Calculate the total contribution of the supporting reads for each template t_j: N(t_j) = Σ_{r_i supports t_j} w(r_i, t_j).

3. Calculate the abundance level of each t_j from the total contributions: A(t_j) = N(t_j) / Σ_l N(t_l).

4. Re-estimate the weight functions: w(r_i, t_j) = A(t_j) / Σ_{t_l ∈ T(r_i)} A(t_l) for t_j ∈ T(r_i), for all i.

5. Repeat 2, 3, and 4 until A(·) converges.
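A minimal sketch of this iterative estimator (an EM-style procedure) follows, assuming the supporting-read sets T(r_i) have already been computed from the validated paths.

def estimate_abundances(support, templates, max_iter=1000, tol=1e-9):
    # support: read id -> set of template ids it supports, i.e. T(r_i).
    w = {r: {t: 1.0 / len(ts) for t in ts} for r, ts in support.items()}  # 1.
    A = {t: 1.0 / len(templates) for t in templates}
    for _ in range(max_iter):
        N = {t: 0.0 for t in templates}                                   # 2.
        for r, ts in support.items():
            for t in ts:
                N[t] += w[r][t]
        total = sum(N.values())
        new_A = {t: N[t] / total for t in templates}                      # 3.
        for r, ts in support.items():                                     # 4.
            denom = sum(new_A[t] for t in ts)
            for t in ts:
                w[r][t] = new_A[t] / denom
        converged = max(abs(new_A[t] - A[t]) for t in templates) < tol
        A = new_A
        if converged:                                                     # 5.
            break
    return A

# Toy example: read 0 supports only "x"; read 1 supports both templates.
print(estimate_abundances({0: {"x"}, 1: {"x", "y"}}, ["x", "y"]))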
The graph-traversing algorithm works well if the number of template sequences is small. However, the number of template sequences grows exponentially with the length of the template sequences. In one simulation using 10 species with a length of 600 base pairs and an average similarity of 94.5%, and 200K reads with a length of 100 base pairs, we found over 2 million template sequences. Therefore, we propose a more efficient partition-ligation algorithm for assembly. This algorithm is not restricted by the length of the short regions being sequenced.

The Partition-Ligation Assembly Algorithm

Instead of traversing the graph from the source to the sink all at once, we partition the graph into multiple mutually exclusive segments, each consisting of a set of nodes within a certain distance range (defined by breadth-first search) of the source. For example, for short regions with a length of roughly 600 base pairs, we consider 4 segments of 150 base pairs each, [0, 150], [151, 300], [301, 450], and [451, 600], and partition the de Bruijn graph accordingly. The algorithm consists of the following steps.

1. Partition the de Bruijn graph into multiple segments.

2. Apply the graph-traversing algorithm to the first segment to obtain a set of template sequences with estimated abundance levels.

3. Merge template sequences with lower abundance levels into those with higher abundance levels if their sequence similarity exceeds a pre-defined threshold, e.g., 99%. Retain the top M (e.g., M = 1,000) template sequences according to the abundance levels.

4. Extend the template sequences into the next segment, applying the graph-traversing algorithm to obtain a new set of template sequences with estimated abundance levels.

5. Repeat steps 3 and 4 until the last segment is reached.

4.4.5 Clustering Template Sequences

There are numerous clustering algorithms available; we choose a simple center-based clustering algorithm. The basic idea is to first rank the template sequences according to their abundance levels. Then, starting from the first sequence, selected as the center sequence, we examine every succeeding sequence as follows: if the examined sequence is similar to the center sequence beyond a pre-defined similarity threshold, T_sim, we merge the examined sequence, with its abundance level, into the center sequence. We repeat this process until no other sequence in the list can be merged with the center sequence. Next, we move down the list, select the next sequence as the center sequence, and repeat this examination process until we reach the last sequence in the list.
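A minimal sketch of this center-based clustering follows. The similarity function here is a crude identity fraction over aligned positions, standing in for whatever alignment-based similarity measure the actual implementation uses.

def similarity(a, b):
    # Fraction of identical bases over the longer length (a crude stand-in).
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def center_cluster(templates, t_sim):
    # templates: list of (sequence, abundance_level) pairs.
    pending = sorted(templates, key=lambda x: -x[1])  # rank by abundance
    clusters = []
    while pending:
        center, level = pending.pop(0)
        remaining = []
        for seq, a in pending:
            if similarity(center, seq) >= t_sim:
                level += a       # merge the member's abundance into the center
            else:
                remaining.append((seq, a))
        clusters.append((center, level))
        pending = remaining      # move down the list to the next center
    return clusters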
Chapter 5

Conclusion and Future Work

5.1 Summary of the Research

In this dissertation, we addressed two main research problems of de novo sequence assembly. WEAV is under active development and still calls for the further analysis and upgrades discussed in Sec. 5.2, whereas MetaSEQ is no longer maintained; therefore, we focus on WEAV here. We used the de Bruijn graph-based method, which is the conventional approach to assembly with second-generation sequencing technologies. Although all such assemblers share highly similar logic, ideas, and difficulties, they differ from each other in unpublished details. Not only does WEAV inherit the advantageous features of the de Bruijn graph method, but it also implements features for transcriptome assembly. We have shown that WEAV can successfully assemble all the isoforms of the gene ABR, and that it is capable of assembling extremely noisy datasets on which all existing genome assembly programs failed. Unfortunately, we did not compare WEAV with the several transcriptome assemblers published in late 2010. Most of them use the Multiple-K method: they assemble a given transcriptome dataset with an existing genome assembler such as ABySS or Velvet, repeat the assembly with different K-mers, and then intersect the assembly results over the multiple values of K. This is the simplest way to reuse an existing assembler, but it is not clear that it is efficient in terms of time and memory. Finally, we showed that WEAV can assemble a large human brain transcriptome dataset with high specificity (> 96%) and sensitivity (> 72% at the level of basepairs).

5.2 Future Work

5.2.1 Mathematics for Choosing the Best K-mer

There is no mathematical solution for predicting the K-mer that yields the best assembly results, because the problem is extraordinarily difficult, and thus researchers intentionally look away from it. As a result, almost no progress has been made on this problem since Idury and Waterman proposed the de Bruijn method. However, we believe there should be a mathematical solution for estimating the best K-mer for a given coverage, read length, and sequencing error rate. We may ask various questions, all under given conditions such as K, coverage, read length, and sequencing error rate:

- What is the expected contig length? Let L be the length of the longest contig; what is P(L ≥ L_0) under the given conditions?

- Let s_U be the underlying sequence from which the read set R was sampled, let p_U = e_1 e_2 ... e_M be the path representing s_U, and let G = (V, E) be the de Bruijn graph built from R. How probable is it that p_U exists in G? In other words, what is P({e_1, e_2, ..., e_M} ⊆ E)?

- Every assembly program has its own policy for choosing reliable edges. We call the copy number of an edge in R the weight of the edge. Assume the traversing process iteratively chooses the heaviest edge and then its heaviest neighbor, and assume this process yields M sequences. What is the probability that s_U is among these M sequences?

We have done the preliminary work, and we still need to do the simulation work.

5.2.2 Minor Upgrades

There are several required minor upgrades which are not difficult to implement.

- Orientation: Directional sequencing is less efficient than random shotgun sequencing because it requires additional effort [54], [84]. Therefore, most datasets have the orientation problem, and it will continue to exist for a while. This is a comparatively easy problem, because considering both directions is not terribly expensive in terms of time and memory, and the clustering algorithm of Sec. 3.2.4 gives us one way to do it.

- Paired-End Reads: Paired-end reads are common these days. Sequencers read one end of each insert (also known as a library) and produce the results as a read set, which is a set of short DNA sequences. If sequencers read both ends of each insert, the two reads are called mate-paired or paired-end reads, and the length of the inserts determines the distances between mates, which follow a stochastic distribution. The key is how to use these statistics to untangle the complex structures of a given de Bruijn graph.

- Scaffolding: Together with paired-end reads, WEAV must provide scaffold information.

5.2.3 Major Upgrade: Distributed Memory

As datasets become bigger, limited resources become a serious problem, and we believe that distributed computing, especially distributed memory, is mandatory. ABySS already utilizes a distributed technique [78]. Two well-known libraries support distributed techniques.

OpenMP (Open Multi-Processing): OpenMP supports multi-threaded, shared-memory programming. It provides very easy interfaces, but it also has limitations for safe parallelism, so this library must be used carefully.

OpenMPI (Open Message Passing Interface): OpenMPI defines various communication tools among multiple boxes. Note that a box may have multiple CPUs and a CPU may have multiple cores; all the cores may share the cache of a CPU, and all the CPUs may share the memory of a box. OpenMPI wraps low-level TCP communication and offers various tools for broadcast and point-to-point communication. Therefore, OpenMPI enables distributed memory across many boxes, and OpenMP enables distributed computing on each box with multiple cores.

Bibliography

[1] C. Alkan, S. Sajjadian, and E. E. Eichler, "Limitations of next-generation genome sequence assembly," Nat Meth, advance online publication, Nov. 2010.

[2] J. L. Ashurst and J. E. Collins, "Gene annotation: prediction and testing," Annual Review of Genomics and Human Genetics, vol. 4, pp. 69–88, 2003, PMID: 14527297.

[3] K. F. Au, H. Jiang, L. Lin, Y. Xing, and W. H. Wong, "Detection of splice junctions from paired-end RNA-seq data by SpliceMap," Nucleic Acids Research, vol. 38, no. 14, pp. 4570–4578, 2010.

[4] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander, "ARACHNE: a whole-genome shotgun assembler," Genome Research, vol. 12, no. 1, pp. 177–189, Jan. 2002, PMID: 11779843.

[5] I. Birol, S. D. Jackman, C. B. Nielsen, J. Q. Qian, R. Varhol, G. Stazyk, R. D. Morin, Y. Zhao, M. Hirst, J. E. Schein, D. E. Horsman, J. M. Connors, R. D. Gascoyne, M. A. Marra, and S. J. M. Jones, "De novo transcriptome assembly with ABySS," Bioinformatics, vol. 25, no. 21, pp. 2872–2877, Nov. 2009.

[6] C. E. Bock, Z. F. Jones, and J. H. Bock, "Relationships between species richness, evenness, and abundance in a southwestern savanna," Ecology, vol. 88, no. 5, pp. 1322–1327, May 2007, PMID: 17536417.

[7] M. Breitbart, P. Salamon, B. Andresen, J. M. Mahaffy, A. M. Segall, D. Mead, F. Azam, and F. Rohwer, "Genomic analysis of uncultured marine viral communities," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 22, pp. 14250–14255, Oct. 2002, PMID: 12384570.

[8] R. A. Brualdi, Introductory Combinatorics.
Pearson Education, Limited, Apr. 2011.

[9] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe, "ALLPATHS: de novo assembly of whole-genome shotgun microreads," Genome Research, vol. 18, no. 5, pp. 810–820, May 2008, PMID: 18340039.

[10] J. Cao, K. Schneeberger, S. Ossowski, T. Gunther, S. Bender, J. Fitz, D. Koenig, C. Lanz, O. Stegle, C. Lippert, X. Wang, F. Ott, J. Muller, C. Alonso-Blanco, K. Borgwardt, K. J. Schmid, and D. Weigel, "Whole-genome sequencing of multiple Arabidopsis thaliana populations," Nat Genet, vol. 43, no. 10, pp. 956–963, Oct. 2011.

[11] M. Chaisson, P. Pevzner, and H. Tang, "Fragment assembly with short reads," Bioinformatics, vol. 20, no. 13, pp. 2067–2074, 2004.

[12] M. J. Chaisson, D. Brinza, and P. A. Pevzner, "De novo fragment assembly with short mate-paired reads: Does the read length matter?" Genome Research, vol. 19, no. 2, pp. 336–346, Feb. 2009, PMID: 19056694.

[13] M. J. Chaisson and P. A. Pevzner, "Short read fragment assembly of bacterial genomes," Genome Research, vol. 18, no. 2, pp. 324–330, Feb. 2008.

[14] K. Chen and L. Pachter, "Bioinformatics for whole-genome shotgun sequencing of microbial communities," PLoS Comput Biol, vol. 1, no. 2, p. e24, Jul. 2005.

[15] Y. Chen, T. Souaiaia, and T. Chen, "PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds," Bioinformatics, vol. 25, no. 19, pp. 2514–2521, Oct. 2009.

[16] N. Cloonan, A. R. R. Forrest, G. Kolle, B. B. A. Gardiner, G. J. Faulkner, M. K. Brown, D. F. Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J. Bruce, C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan, and S. M. Grimmond, "Stem cell transcriptome profiling via massive-scale mRNA sequencing," Nature Methods, vol. 5, no. 7, pp. 613–619, Jul. 2008, PMID: 18516046.

[17] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly, "A framework for variation discovery and genotyping using next-generation DNA sequencing data," Nat Genet, vol. 43, no. 5, pp. 491–498, May 2011.

[18] A. Disset, L. Cheval, O. Soutourina, J. Duong Van Huyen, G. Li, C. Genin, J. Tostain, A. Loupy, A. Doucet, and R. Rajerison, "Tissue compartment analysis for biomarker discovery by gene expression profiling," PLoS ONE, vol. 4, no. 11, p. e7779, Nov. 2009.

[19] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer, "SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing," Genome Research, vol. 17, no. 11, pp. 1697–1706, Nov. 2007.

[20] R. Drmanac, A. B. Sparks, M. J. Callow, A. L. Halpern, N. L. Burns, B. G. Kermani, P. Carnevali, I. Nazarenko, G. B. Nilsen, G. Yeung, F. Dahl, A. Fernandez, B. Staker, K. P. Pant, J. Baccash, A. P. Borcherding, A. Brownley, R. Cedeno, L. Chen, D. Chernikoff, A. Cheung, R. Chirita, B. Curson, J. C. Ebert, C. R. Hacker, R. Hartlage, B. Hauser, S. Huang, Y. Jiang, V. Karpinchyk, M. Koenig, C. Kong, T. Landers, C. Le, J. Liu, C. E. McBride, M. Morenzoni, R. E. Morey, K. Mutch, H. Perazich, K. Perry, B. A. Peters, J. Peterson, C. L. Pethiyagoda, K. Pothuraju, C. Richter, A. M. Rosenbaum, S. Roy, J. Shafto, U. Sharanhovich, K. W. Shannon, C. G. Sheppy, M. Sun, J. V. Thakuria, A. Tran, D. Vu, A. W. Zaranek, X. Wu, S. Drmanac, A. R. Oliphant, W. C. Banyai, B.
Martin, D. G. Ballinger, G. M. Church, and C. A. Reid, "Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays," Science, vol. 327, no. 5961, pp. 78–81, Jan. 2010.

[21] R. Edwards, B. Rodriguez-Brito, L. Wegley, M. Haynes, M. Breitbart, D. Peterson, M. Saar, S. Alexander, E. C. Alexander, and F. Rohwer, "Using pyrosequencing to shed light on deep mine microbial ecology," BMC Genomics, vol. 7, no. 1, p. 57, 2006.

[22] I. Elias, "Settling the intractability of multiple alignment," Journal of Computational Biology, vol. 13, no. 7, pp. 1323–1339, Sep. 2006, PMID: 17037961.

[23] B. Ewing and P. Green, "Base-calling of automated sequencer traces using phred. II. Error probabilities," Genome Research, vol. 8, no. 3, pp. 186–194, Mar. 1998, PMID: 9521922.

[24] B. Ewing, L. Hillier, M. C. Wendl, and P. Green, "Base-calling of automated sequencer traces using phred. I. Accuracy assessment," Genome Research, vol. 8, no. 3, pp. 175–185, Mar. 1998, PMID: 9521921.

[25] J. Gallant, D. Maier, and J. A. Storer, "On finding minimal length superstrings," Journal of Computer and System Sciences, vol. 20, no. 1, pp. 50–58, Feb. 1980.

[26] X. Gan, O. Stegle, J. Behr, J. G. Steffen, P. Drewe, K. L. Hildebrand, R. Lyngsoe, S. J. Schultheiss, E. J. Osborne, V. T. Sreedharan, A. Kahles, R. Bohnert, G. Jean, P. Derwent, P. Kersey, E. J. Belfield, N. P. Harberd, E. Kemen, C. Toomajian, P. X. Kover, R. M. Clark, G. Ratsch, and R. Mott, "Multiple reference genomes and transcriptomes for Arabidopsis thaliana," Nature, vol. 477, no. 7365, pp. 419–423, 2011.

[27] J. G. Gibbons, E. M. Janson, C. T. Hittinger, M. Johnston, P. Abbot, and A. Rokas, "Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics," Mol Biol Evol, vol. 26, no. 12, pp. 2731–2744, Dec. 2009.

[28] S. M. D. Goldberg, J. Johnson, D. Busam, T. Feldblyum, S. Ferriera, R. Friedman, A. Halpern, H. Khouri, S. A. Kravitz, F. M. Lauro, K. Li, Y. Rogers, R. Strausberg, G. Sutton, L. Tallon, T. Thomas, E. Venter, M. Frazier, and J. C. Venter, "A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes," Proceedings of the National Academy of Sciences, vol. 103, no. 30, pp. 11240–11245, Jul. 2006.

[29] G. G. Sutton, O. White, M. D. Adams, and A. R. Kerlavage, "TIGR Assembler: A new tool for assembling large shotgun sequencing projects," Genome Science and Technology, vol. 1, no. 1, pp. 9–19, Jan. 1995.

[30] R. Guigo and M. G. Reese, "EGASP: collaboration through competition to find human genes," Nat Meth, vol. 2, no. 8, pp. 575–577, 2005.

[31] M. Guttman, M. Garber, J. Z. Levin, J. Donaghey, J. Robinson, X. Adiconis, L. Fan, M. J. Koziol, A. Gnirke, C. Nusbaum, J. L. Rinn, E. S. Lander, and A. Regev, "Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs," Nat Biotech, vol. 28, no. 5, pp. 503–510, May 2010.

[32] D. Hahn, G. Ragland, D. Shoemaker, and D. Denlinger, "Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis," BMC Genomics, vol. 10, no. 1, p. 234, 2009.

[33] P. Hammer, M. S. Banck, R. Amberg, C. Wang, G. Petznick, S. Luo, I. Khrebtukova, G. P. Schroth, P. Beyerlein, and A. S. Beutler, "mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain," Genome Research, vol. 20, no. 6, pp.
847–860, Jun. 2010, PMID: 20452967.

[34] D. Hernandez, P. François, L. Farinelli, M. Østerås, and J. Schrenzel, "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer," Genome Research, vol. 18, no. 5, pp. 802–809, May 2008, PMID: 18332092.

[35] D. G. Hert, C. P. Fredlake, and A. E. Barron, "Advantages and limitations of next-generation sequencing technologies: A comparison of electrophoresis and non-electrophoresis methods," Electrophoresis, vol. 29, no. 23, pp. 4618–4626, 2008.

[36] M. S. Hossain, N. Azimi, and S. Skiena, "Crystallizing short-read assemblies around seeds," BMC Bioinformatics, vol. 10, Suppl 1, p. S16, Jan. 2009, PMID: 19208115, PMCID: 2648751.

[37] X. Huang and S. Yang, "Generating a genome assembly with PCAP," Current Protocols in Bioinformatics, A. D. Baxevanis et al., Eds., Chapter 11, Unit 11.3, Oct. 2005, PMID: 18428744.

[38] J. A. Huber, D. B. Mark Welch, H. G. Morrison, S. M. Huse, P. R. Neal, D. A. Butterfield, and M. L. Sogin, "Microbial population structures in the deep marine biosphere," Science, vol. 318, no. 5847, pp. 97–100, Oct. 2007, PMID: 17916733.

[39] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster, "MEGAN analysis of metagenomic data," Genome Research, vol. 17, no. 3, pp. 377–386, Mar. 2007, PMID: 17255551.

[40] R. M. Idury and M. S. Waterman, "A new algorithm for DNA sequence assembly," Journal of Computational Biology, vol. 2, pp. 291–306, 1995.

[41] D. B. Jaffe, J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, J. P. Mesirov, M. C. Zody, and E. S. Lander, "Whole-genome sequence assembly for mammalian genomes: Arachne 2," Genome Research, vol. 13, no. 1, pp. 91–96, Jan. 2003, PMID: 12529310.

[42] W. R. Jeck, J. A. Reinhardt, D. A. Baltrus, M. T. Hickenbotham, V. Magrini, E. R. Mardis, J. L. Dangl, and C. D. Jones, "Extending assembly of short DNA sequences to handle error," Bioinformatics, vol. 23, no. 21, pp. 2942–2944, Nov. 2007.

[43] W. Just, "Computational complexity of multiple sequence alignment with SP-score," Journal of Computational Biology, vol. 8, no. 6, pp. 615–623, 2001, PMID: 11747615.

[44] J. D. Kececioglu and E. W. Myers, "Combinatorial algorithms for DNA sequence assembly," Algorithmica, vol. 13, pp. 7–51, 1993.

[45] W. J. Kent, "BLAT - the BLAST-like alignment tool," Genome Research, vol. 12, no. 4, pp. 656–664, Apr. 2002, PMID: 11932250.

[46] S. Koren, T. J. Treangen, and M. Pop, "Bambus 2: Scaffolding metagenomes," Bioinformatics, Sep. 2011.

[47] R. E. Ley, M. Hamady, C. Lozupone, P. J. Turnbaugh, R. R. Ramey, J. S. Bircher, M. L. Schlegel, T. A. Tucker, M. D. Schrenzel, R. Knight, and J. I. Gordon, "Evolution of mammals and their gut microbes," Science, vol. 320, no. 5883, pp. 1647–1651, Jun. 2008, PMID: 18497261.

[48] R. Li, W. Fan, G. Tian, H. Zhu, L. He, J. Cai, Q. Huang, Q. Cai, B. Li, Y. Bai, Z. Zhang, Y. Zhang, W. Wang, J. Li, F. Wei, H. Li, M. Jian, J. Li, Z. Zhang, R. Nielsen, D. Li, W. Gu, Z. Yang, Z. Xuan, O. A. Ryder, F. C. Leung, Y. Zhou, J. Cao, X. Sun, Y. Fu, X. Fang, X. Guo, B. Wang, R. Hou, F. Shen, B. Mu, P. Ni, R. Lin, W. Qian, G. Wang, C. Yu, W. Nie, J. Wang, Z. Wu, H. Liang, J. Min, Q. Wu, S. Cheng, J. Ruan, M. Wang, Z. Shi, M. Wen, B. Liu, X. Ren, H. Zheng, D. Dong, K. Cook, G. Shan, H. Zhang, C. Kosiol, X. Xie, Z. Lu, H. Zheng, Y. Li, C. C. Steiner, T. T. Lam, S. Lin, Q. Zhang, G. Li, J. Tian, T. Gong, H. Liu, D. Zhang, L. Fang, C. Ye, J.
Zhang, W. Hu, A. Xu, Y. Ren, G. Zhang, M. W. Bruford, Q. Li, L. Ma, Y. Guo, N. An, Y. Hu, Y. Zheng, Y. Shi, Z. Li, Q. Liu, Y. Chen, J. Zhao, N. Qu, S. Zhao, F. Tian, X. Wang, H. Wang, L. Xu, X. Liu, T. Vinar, Y. Wang, T. Lam, S. Yiu, S. Liu, H. Zhang, D. Li, Y. Huang, X. Wang, G. Yang, Z. Jiang, J. Wang, N. Qin, L. Li, J. Li, L. Bolund, K. Kristiansen, G. K. Wong, M. Olson, X. Zhang, S. Li, H. Yang, J. Wang, and J. Wang, "The sequence and de novo assembly of the giant panda genome," Nature, vol. 463, no. 7279, pp. 311–317, Jan. 2010, PMID: 20010809.

[49] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang, "SNP detection for massively parallel whole-genome resequencing," Genome Research, vol. 19, no. 6, pp. 1124–1132, Jun. 2009.

[50] R. Li, Y. Li, H. Zheng, R. Luo, H. Zhu, Q. Li, W. Qian, Y. Ren, G. Tian, J. Li, G. Zhou, X. Zhu, H. Wu, J. Qin, X. Jin, D. Li, H. Cao, X. Hu, H. Blanche, H. Cann, X. Zhang, S. Li, L. Bolund, K. Kristiansen, H. Yang, J. Wang, and J. Wang, "Building the sequence map of the human pan-genome," Nature Biotechnology, vol. 28, no. 1, pp. 57–63, Jan. 2010, PMID: 19997067.

[51] R. Li, C. Yu, Y. Li, T. Lam, S. Yiu, K. Kristiansen, and J. Wang, "SOAP2: an improved ultrafast tool for short read alignment," Bioinformatics, vol. 25, no. 15, pp. 1966–1967, Aug. 2009.

[52] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang, "De novo assembly of human genomes with massively parallel short read sequencing," Genome Research, Dec. 2009.

[53] I. Maccallum, D. Przybylski, S. Gnerre, J. Burton, I. Shlyakhter, A. Gnirke, J. Malek, K. McKernan, S. Ranade, T. P. Shea, L. Williams, S. Young, C. Nusbaum, and D. B. Jaffe, "ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads," Genome Biology, vol. 10, no. 10, p. R103, 2009, PMID: 19796385.

[54] L. Mamanova, R. M. Andrews, K. D. James, E. M. Sheridan, P. D. Ellis, C. F. Langford, T. W. B. Ost, J. E. Collins, and D. J. Turner, "FRT-seq: amplification-free, strand-specific transcriptome sequencing," Nat Meth, vol. 7, no. 2, pp. 130–132, Feb. 2010.

[55] S. Marguerat, B. T. Wilhelm, and J. Bähler, "Next-generation sequencing: applications beyond genomes," Biochemical Society Transactions, vol. 36, no. Pt 5, pp. 1091–1096, Oct. 2008, PMID: 18793195.

[56] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. I. Alenquer, T. P. Jarvie, K. B. Jirage, J. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg, "Genome sequencing in microfabricated high-density picolitre reactors," Nature, vol. 437, no. 7057, pp. 376–380, Sep. 2005, PMID: 16056220.

[57] McGill, "Species abundance distributions: moving beyond single prediction theories to integration within an ecological framework," Ecology Letters, vol. 10, pp. 995–1015, 2007.

[58] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M.
Daly, and M. A. DePristo, "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, vol. 20, no. 9, pp. 1297–1303, Sep. 2010.

[59] G. J. McLachlan and D. Peel, Finite Mixture Models. John Wiley and Sons, Sep. 2000.

[60] M. L. Metzker, "Sequencing technologies - the next generation," Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, Jan. 2010, PMID: 19997069.

[61] M. S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes. CRC Press, 1995.

[62] J. R. Miller, A. L. Delcher, S. Koren, E. Venter, B. P. Walenz, A. Brownley, J. Johnson, K. Li, C. Mobarry, and G. Sutton, "Aggressive assembly of pyrosequencing reads with mates," Bioinformatics, vol. 24, no. 24, pp. 2818–2824, Dec. 2008.

[63] J. R. Miller, S. Koren, and G. Sutton, "Assembly algorithms for next-generation sequencing data," Genomics, vol. 95, no. 6, pp. 315–327, Jun. 2010.

[64] R. Morin, M. Bainbridge, A. Fejes, M. Hirst, M. Krzywinski, T. Pugh, H. McDonald, R. Varhol, S. Jones, and M. Marra, "Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing," BioTechniques, vol. 45, no. 1, pp. 81–94, Jul. 2008, PMID: 18611170.

[65] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq," Nat Meth, vol. 5, no. 7, pp. 621–628, Jul. 2008.

[66] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter, "A whole-genome assembly of Drosophila," Science, vol. 287, no. 5461, pp. 2196–2204, Mar. 2000, PMID: 10731133.

[67] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder, "The transcriptional landscape of the yeast genome defined by RNA sequencing," Science, vol. 320, no. 5881, pp. 1344–1349, Jun. 2008, PMID: 18451266.

[68] R. Nielsen, J. S. Paul, A. Albrechtsen, and Y. S. Song, "Genotype and SNP calling from next-generation sequencing data," Nature Reviews Genetics, vol. 12, no. 6, pp. 443–451, May 2011.

[69] B. Paşaniuc, N. Zaitlen, and E. Halperin, "Accurate estimation of expression levels of homologous genes in RNA-seq experiments," RECOMB, vol. 6044/2010, pp. 397–409, 2010.

[70] P. A. Pevzner, "l-Tuple DNA sequencing: computer analysis," Journal of Biomolecular Structure & Dynamics, vol. 7, no. 1, pp. 63–73, Aug. 1989, PMID: 2684223.

[71] P. A. Pevzner, "A new approach to fragment assembly in DNA sequencing," in Proc. 5th Annual International Conference on Computational Molecular Biology (RECOMB 01), vol. 98, pp. 256–267, 2001.

[72] P. A. Pevzner, H. Tang, and G. Tesler, "De novo repeat classification and fragment assembly," Genome Research, vol. 14, no. 9, pp. 1786–1796, Sep. 2004, PMID: 15342561.

[73] P. A. Pevzner and H. Tang, "Fragment assembly with double-barreled data," Bioinformatics, vol. 17, suppl 1, pp. S225–S233, Jun. 2001.

[74] P. A. Pevzner, H. Tang, and M. S. Waterman, "An Eulerian path approach to DNA fragment assembly," Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 17, pp. 9748–9753, 2001.

[75] N. Rusk, "Cheap third-generation sequencing," Nat Meth, vol. 6, no. 4, p.
244, Apr. 2009.

[76] M. C. Schatz, A. L. Delcher, and S. L. Salzberg, "Assembly of large genomes using second-generation sequencing," Genome Research, vol. 20, no. 9, pp. 1165–1173, 2010.

[77] K. Schneeberger, S. Ossowski, F. Ott, J. D. Klein, X. Wang, C. Lanz, L. M. Smith, J. Cao, J. Fitz, N. Warthmann, S. R. Henz, D. H. Huson, and D. Weigel, "Reference-guided assembly of four diverse Arabidopsis thaliana genomes," Proceedings of the National Academy of Sciences, Jun. 2011.

[78] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. M. Jones, and I. Birol, "ABySS: a parallel assembler for short read sequence data," Genome Research, vol. 19, no. 6, pp. 1117–1123, Jun. 2009, PMID: 19251739.

[79] M. L. Sogin, H. G. Morrison, J. A. Huber, D. Mark Welch, S. M. Huse, P. R. Neal, J. M. Arrieta, and G. J. Herndl, "Microbial diversity in the deep sea and the underexplored 'rare biosphere'," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 32, pp. 12115–12120, Aug. 2006, PMID: 16880384.

[80] Y. Surget-Groba and J. I. Montoya-Burgos, "Optimization of de novo transcriptome assembly from next-generation sequencing data," Genome Research, Aug. 2010, PMID: 20693479.

[81] T. T. Torres, M. Metta, B. Ottenwälder, and C. Schlötterer, "Gene expression profiling by massively parallel sequencing," Genome Research, vol. 18, no. 1, pp. 172–177, Jan. 2008, PMID: 18032722.

[82] C. Trapnell, L. Pachter, and S. L. Salzberg, "TopHat: discovering splice junctions with RNA-Seq," Bioinformatics, vol. 25, no. 9, pp. 1105–1111, May 2009.

[83] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter, "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nat Biotech, vol. 28, no. 5, pp. 511–515, May 2010.

[84] A. P. Vivancos, M. Güell, J. C. Dohm, L. Serrano, and H. Himmelbauer, "Strand-specific deep sequencing of the transcriptome," Genome Research, vol. 20, no. 7, pp. 989–999, Jul. 2010.

[85] P. K. Wall, J. Leebens-Mack, A. S. Chanderbali, A. Barakat, E. Wolcott, H. Liang, L. Landherr, L. P. Tomsho, Y. Hu, J. E. Carlson, H. Ma, S. C. Schuster, D. E. Soltis, P. S. Soltis, N. Altman, and C. W. dePamphilis, "Comparison of next generation sequencing technologies for transcriptome characterization," BMC Genomics, vol. 10, p. 347, 2009, PMID: 19646272.

[86] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," Journal of Computational Biology, vol. 1, no. 4, pp. 337–348, 1994, PMID: 8790475.

[87] Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, Jan. 2009, PMID: 19015660.

[88] R. L. Warren, G. G. Sutton, S. J. M. Jones, and R. A. Holt, "Assembling millions of short DNA sequences using SSAKE," Bioinformatics, vol. 23, no. 4, pp. 500–501, Feb. 2007.

[89] B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett, J. Rogers, and J. Bähler, "Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution," Nature, vol. 453, no. 7199, pp. 1239–1243, Jun. 2008, PMID: 18488015.

[90] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs," Genome Research, vol. 18, no. 5, pp. 821–829, Feb. 2008.

[91] D. R. Zerbino, G. K. McEwen, E. H. Margulies, and E.
Birney, \Pebble and rock band: heuristic resolution of repeats and scaolding in the velvet short-read de novo assembler," PloS One, vol. 4, no. 12, p. e8407, 2009, PMID: 20027311. 112
Abstract
The deep coverage of second-generation sequencing technology has enabled the study of complex biological samples that contain multiple DNA units at once, such as transcriptomes and metagenomes. Unlike general genome assembly, a DNA unit in these samples may be present in multiple copies with small or substantial structural variations and/or SNPs in the same experimental sample, so deep sequencing is necessary to resolve such variations concurrently.

This dissertation focuses on de novo transcriptome assembly, which requires the simultaneous assembly of multiple alternatively spliced transcripts of a gene. In practice, de novo transcriptome assembly is the only option for studying the transcriptome of organisms that lack a reference genome, and it can also be applied to identify novel transcripts and structural variations in the gene regions of model organisms. We propose WEAV for de novo transcriptome assembly; it consists of two separate processes: clustering and assembly.

In the clustering process, WEAV reduces the complexity of an RNA-seq dataset by partitioning it into clusters: a diverse dataset containing many genes is divided into many smaller clustered read sets, each covering only a few genes. The underlying idea is straightforward. A sequencer samples reads from random positions, so reads drawn from the same gene overlap one another when the sequencing depth is sufficient, and these overlaps link the reads that originate from that gene. We can therefore transform a dataset into a graph in which each read is a node and two reads are connected by an edge when they overlap; each connected component then becomes a clustered read set. As a result, we can assume that a cluster contains one or only a few genes.
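To make the graph formulation above concrete, the following is a minimal sketch of overlap-based read clustering. It is written under stated assumptions: two reads are treated as overlapping whenever they share an exact k-mer (a common proxy, not necessarily WEAV's actual overlap criterion), and the k-mer length k=15 and the function names (cluster_reads, find) are illustrative choices rather than part of the dissertation's implementation.

```python
# Minimal sketch of overlap-based read clustering (illustrative only;
# WEAV's actual overlap criterion and data structures may differ).
# Reads sharing at least one exact k-mer are treated as overlapping,
# and each connected component of the resulting graph is one cluster.

from collections import defaultdict

def find(parent, x):
    # Path-compressing find for a union-find forest.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_reads(reads, k=15):
    parent = list(range(len(reads)))   # each read starts in its own set
    seen = {}                          # k-mer -> index of a read containing it
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            kmer = read[p:p + k]
            if kmer in seen:
                # Shared k-mer = overlap edge: union the two components.
                ra, rb = find(parent, i), find(parent, seen[kmer])
                parent[ra] = rb
            else:
                seen[kmer] = i
    # Group reads by the root of their component.
    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(parent, i)].append(reads[i])
    return list(clusters.values())

if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGTAC",   # overlaps the next read
             "CGTACGTACGTACGTACG",
             "TTTTGGGGCCCCAAAATT"]   # shares no k-mer: its own cluster
    for c in cluster_reads(reads, k=15):
        print(c)
```

Running the small demo groups the two overlapping reads into one cluster and leaves the unrelated read in its own cluster, mirroring how each connected component of the overlap graph becomes a clustered read set.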
Conceptually similar
Breaking the plateau in de novo genome scaffolding
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
Fast search and clustering for large-scale next generation sequencing data
Isoform quantification and splicing regulation analysis in RNA-seq studies
Sharpening the edge of tools for microbial diversity analysis
De novo peptide sequencing and spectral alignment algorithm via tandem mass spectrometry
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
Validating structural variations: from traditional algorithms to deep learning approaches
Understanding the characteristic of single nucleotide variants
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
Network structures: graph theory, statistics, and neuroscience applications
Efficient short-read sequencing on long-read sequencers
DBSSC: density-based searchspace-limited subspace clustering
Clustering 16S rRNA sequences: an accurate and efficient approach
Genome-wide studies reveal the function and evolution of DNA shape
Block-based image steganalysis: algorithm and performance evaluation
Whole genome bisulfite sequencing: analytical methods and biological insights
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
Genome-wide studies reveal the isoform selection of genes
Feature engineering and supervised learning on metagenomic sequence data
Asset Metadata
Creator
Cho, Sungje (author)
Core Title
Techniques for de novo sequence assembly: algorithms and experimental results
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
07/30/2012
Defense Date
04/24/2012
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatics, computational biology, OAI-PMH Harvest, sequence assembly
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Chen, Ting (committee member), Leahy, Richard M. (committee member)
Creator Email
sungje.cho@gmail.com,sungjech@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-77527
Unique identifier
UC11289527
Identifier
usctheses-c3-77527 (legacy record id)
Legacy Identifier
etd-ChoSungje-1068.pdf
Dmrecord
77527
Document Type
Dissertation
Rights
Cho, Sungje
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA