Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Breaking the plateau in de novo genome scaffolding
(USC Thesis Other)
Breaking the plateau in de novo genome scaffolding
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
University of Southern California Doctoral Dissertation Breaking the Plateau in De Novo Genome Scaolding Author: Misagh Bagherian Supervisor: Dr. Ting Chen A dissertation presented to the FACULTY OF THE USC GRADUATE SCHOOL in partial fulllment of the requirements for the degree of Doctor of Philosophy (COMPUTATIONAL BIOLOGY AND BIOINFORMATICS) August, 2017 Copyright 2017 Misagh Bagherian Declaration of Authorship I, MisaghBagherian, declare that this thesis titled,\Breaking the Plateau in De Novo Genome Scaffolding" and the work presented in it are my own. I conrm that: This work was done wholly or mainly while in candidature for a research degree at this University. Where any part of this thesis has previously been submitted for a degree or any other qualication at this University or any other institution, this has been clearly stated. Where I have consulted the published work of others, this is always clearly attributed. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. I have acknowledged all main sources of help. Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself. Signed: Date: i UNIVERSITY OF SOUTHERN CALIFORNIA Abstract Computational Biology and Bioinformatics Program Dornsife College of Letters, Arts and Sciences Doctor of Philosophy Breaking the Plateau in De Novo Genome Scaolding by Misagh Bagherian Scaolding is the process of orienting and ordering contiguous sequences (contigs) pro- duced during genome assembly. To nish draft genomes it is crucial to have accurate scaolding as it facilitates the procedures needed to ll in the gaps between contigs which are costly and laborious. It has been proved that the scaolding problem is an NP-hard problem and in the case of presence of erroneous data (e.g. chimeric reads, mappings, contigs, etc.) even its sub-problems, like nding orientations, are compu- tationally intractable. That is why conventional formulations of scaolding programs rely on heuristics to nd approximate solutions, with potentially exponential run- ning time. But due to limitations of the short reads which are the main ingredient in de novo assembling and scaolding, most repetitive parts of the genome remain unsolved. As nearly all state-of-the-art scaolders remove suspicious repeat contigs, they all yield fragmented scaolds which are still far from nearly complete genomes or at least correctly ordered and oriented contigs. We present the rst repeat-aware scaolding model for second-generation sequencing data which like most current scaf- folders rely on paired reads, but it does not remove or ignore repeat contigs. On the contrary, our model tries to use repeats to solve the most challenging parts of the scaolding graph. We show how this repeat-aware second-generation sequencing iii model can solve repeats in small articial synthesized genomes, and why it is not capable of scaling up for more complex real genomes to break the plateau in de novo genome scaolding. Recently, the third-generation sequencing data has become more available to re- searchers and scientists. Long-read libraries can make it possible to cope with the scaolding problem in a more accurate and cost-eective way. They can improve the quality of draft assemblies constructed from second-generation sequencing data by making them more accurate and more complete. We propose a novel and scalable hy- brid scaolding model that aims to scaold pre-assembled contigs produced by short reads. We show how this new hybrid scaolding model performs to upgrade some bacterial draft genomes and might be capable of breaking the plateau in de novo genome scaolding. Keywords: De novo genome assembly; Genome scaolding; Repeats; Single molecule sequencing; Long reads; Genome nishing. Acknowledgements Firstly, I would like to express my sincere gratitude to my supervisor Professor Ting Chen for the continuous support of my Ph.D. studies and related research. His ex- pertise, understanding, and patience, added considerably to my graduate experience. Besides my advisor, I would like to thank Professor Michael Waterman and my dis- sertation and qualifying exam committee members: Dr. Aiichiro Nakano, Dr. Liang Chen, Dr. Fengzhu Sun and Dr. Remo Rohs for their insightful com- ments and encouragement, and questions incenting me to widen my research from various perspectives. I thank my fellow labmates for the stimulating discussions and all the fun we have had in years. I would also like to thank my family: my parents, my sister and my brother, for the support they provided through my entire life and in particular, I must acknowledge the joy of my life: my wife and my best friend, Zohreh! Without her love, support, criticism, and encouragement, I would not have nished this dissertation. In conclusion, I recognize that this research would not have been possible without the nancial assistance from National Science Foundation (NSF), National Institute of Health (NIH), the program of Computational Biology and Bioinformatics in USC Dornsife College of Letters, Arts and Sciences, and USC Viterbi Fellowship. I would express my gratitude to these agencies. iv Contents Declaration of Authorship i Abstract ii Acknowledgements iv Contents v List of Figures viii List of Tables x 1 Introduction 1 1.1 High-Throughput Low-Cost Sequencing Era . . . . . . . . . . . . . . 1 1.2 De Novo Genome Assemblers . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Overlap Computation . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.2 Overlap-Layout-Consensus . . . . . . . . . . . . . . . . . . . . 7 1.2.3 De Bruijn Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3 Genome Scaolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.4 Genome Finishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4.1 What is a nished genome? . . . . . . . . . . . . . . . . . . . 15 1.4.2 Gap Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.3 Why is it important? . . . . . . . . . . . . . . . . . . . . . . . 16 1.5 Other Post Assembly Improvements . . . . . . . . . . . . . . . . . . . 17 1.6 Our Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2 Genome Scaolding 20 2.1 Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2 Algorithmic Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 23 v Contents vi 2.3 Current Tools and Softwares . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.1 Bambus (2004) . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 SOPRA (2010) . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.3 Bambus-2 (2011) . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.4 SSPACE (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.5 Opera (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3.6 MIP (2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.7 GRASS (2012) . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.8 SCARPA (2013) . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.4.1 Contiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3 Repeats 41 3.1 De novo genome assembly and repeats . . . . . . . . . . . . . . . . . 42 3.2 Assembly errors caused by repeats . . . . . . . . . . . . . . . . . . . . 44 3.3 Strategies for handling repeats . . . . . . . . . . . . . . . . . . . . . . 46 4 Repeat-Aware Scaolding using Paired Reads 49 4.1 Scaolding graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.1.1 Scaolding graph . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Handling Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.3 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3.1 Contig Variables . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.3.2 Edge Satisfaction Variable . . . . . . . . . . . . . . . . . . . . 57 4.3.3 Distance Variable . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3.4 Overlap Prevention Variable . . . . . . . . . . . . . . . . . . . 58 4.3.5 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3.6 Constraints Imposed By Edges . . . . . . . . . . . . . . . . . 59 4.4 Results and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.1 Equivalent Candidate Solutions . . . . . . . . . . . . . . . . . 69 5 Hybrid Genome Scaolding 73 5.1 Third Generation Sequencing (TGS) . . . . . . . . . . . . . . . . . . 73 5.1.1 Pacic Biosciences Single-Molecule Real-Time (SMRT) sequenc- ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.1.2 Oxford Nanopore Single-Molecule Sequencing . . . . . . . . . 79 5.2 Hybrid Assembly and Scaolding . . . . . . . . . . . . . . . . . . . . 80 5.3 Our Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.3.1 Hybrid Input Libraries . . . . . . . . . . . . . . . . . . . . . . 86 5.3.2 Aligning Long Reads . . . . . . . . . . . . . . . . . . . . . . . 86 Contents vii 5.3.3 Contig Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3.4 Best Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3.5 Extending Chains to Superchains . . . . . . . . . . . . . . . . 100 5.3.6 Joining Superchains to Make Pre-scaolds . . . . . . . . . . . 103 5.3.7 Phase-I vs. Phase-II . . . . . . . . . . . . . . . . . . . . . . . 105 5.3.8 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.3.9 Estimating Distances . . . . . . . . . . . . . . . . . . . . . . . 108 5.4 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 108 6 Conclusion and Future Direction 115 A Scaolding Graphs and Extended Scaolding Graphs 119 Bibliography 131 List of Figures 1.1 Misassembly because of repeats . . . . . . . . . . . . . . . . . . . . . 5 3.1 Assembly errors caused by repeats . . . . . . . . . . . . . . . . . . . . 45 4.1 Overview of the proposed model . . . . . . . . . . . . . . . . . . . . . 50 4.2 Bidirected edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 Undirected and directed interpretation of bidirected edges . . . . . . 52 4.4 An example of a scaolding graph . . . . . . . . . . . . . . . . . . . . 53 4.5 Expanded scaolding graph to handle repeats . . . . . . . . . . . . . 55 4.6 Constraints imposed by four possible bidirectional edges . . . . . . . 60 4.7 An articial repetitive genome structure used as one of our test cases 68 4.8 Scaolding graph of test case of Figure 4.7 . . . . . . . . . . . . . . . 70 4.9 Extended graph representation of Figure 4.8 . . . . . . . . . . . . . . 71 4.10 Equivalent solution candidates . . . . . . . . . . . . . . . . . . . . . . 72 5.1 PacBio SMRT sequencing . . . . . . . . . . . . . . . . . . . . . . . . 78 5.2 Oxford Nanopore Technologies MinION . . . . . . . . . . . . . . . . . 81 5.3 Ratcheting of a DNA strand through a biological nanopore . . . . . . 81 5.4 Overview of our hybrid approach . . . . . . . . . . . . . . . . . . . . 85 5.5 Scaolding using long reads toy example . . . . . . . . . . . . . . . . 87 5.6 General structure of a hybrid model . . . . . . . . . . . . . . . . . . . 88 5.7 Work ow of our proposed hybrid model . . . . . . . . . . . . . . . . . 90 5.8 Contig chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.9 Consistency between aligned portions and orientations . . . . . . . . 94 5.10 Palindromic chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.11 Non-mutual best neighbor relationship in repeats . . . . . . . . . . . 98 5.12 Neighborhood of repeats and misassembled contigs . . . . . . . . . . 100 5.13 Contig super-chains . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.14 Chain extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.15 From superchains to pre-scaolds . . . . . . . . . . . . . . . . . . . . 104 5.16 Finalization of pre-scaolds . . . . . . . . . . . . . . . . . . . . . . . 107 A.1 A scaolding graph without any repeats . . . . . . . . . . . . . . . . 120 viii List of Figures ix A.2 A scaolding graph with one two-copy repeat contig . . . . . . . . . . 121 A.3 Extended graph representation of Figure A.2 . . . . . . . . . . . . . . 121 A.4 A scaolding graph with one three-copy repeat contig . . . . . . . . . 122 A.5 Extended graph representation of Figure A.4 . . . . . . . . . . . . . . 123 A.6 A scaolding graph with two repeat contigs . . . . . . . . . . . . . . 124 A.7 Extended graph representation of Figure A.6 . . . . . . . . . . . . . . 125 A.8 A scaolding graph with a tandem repeat contigs . . . . . . . . . . . 126 A.9 Extended graph representation of Figure A.8 . . . . . . . . . . . . . . 127 A.10 A scaolding graph with two regular repeat contigs and one tandem repeat contig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.11 Extended graph representation of Figure A.10 . . . . . . . . . . . . . 128 A.12 A scaolding graph with repeat of repeats structure . . . . . . . . . . 129 A.13 Extended graph representation of Figure A.12 . . . . . . . . . . . . . 130 List of Tables 5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5.2 Genome reconstruction of E. coli . . . . . . . . . . . . . . . . . . . . 110 5.3 Genome reconstruction of B. trehalosi . . . . . . . . . . . . . . . . . 110 5.4 Genome reconstruction of M. haemolytica . . . . . . . . . . . . . . . 111 5.5 Comparison of hybrid assemblies using long-read correction tools . . . 114 x To Zohreh. Thank you, Sunshine, for your support, patience, and belief in me. I love you! xi Chapter 1 Introduction 1.1 High-Throughput Low-Cost Sequencing Era Advances in sequencing technologies have always had great impact on whole-genome sequencing projects. Nowadays, deep coverage of most species can be generated in just a few days by sequencing technologies which have low cost and hight throughput. The vast improvement in sequencing technologies together with the decreasing cost of them will make it possible to accurately sequence almost any organism even those whose genomes have billions of base pair. This improvement will have a crucial impact on any further downstream analyses. Outstanding successes of complex genome sequencing (e.g. panda genome) and rese- quencing projects (e.g. human genome) motivated sequencing of many new species 1 Chapter 1. Introduction 2 genomes. The sequence data generated by one of these projects consist of millions or billions of short (usually less than 250 nucleotides) DNA sequences (a.k.a reads). These reads must then be assembled de novo before most genome analyses can be- gin. The aforementioned projects used 35-100 bp reads and depth of coverage in these projects ranged from 50x to 100x whereas earlier WGS projects using Sanger sequencing used read lengths of 750{800 bps and required 7x-10x coverage [1]. The main bottleneck with short-read sequencing technologies is their shorter read length and technically, their much higher coverage cannot compensate for it. A re- cent comparison of draft genomes produced with short-read data shows that those produced with longer reads have far better contiguity than the latest short-read as- semblies [2]. Some research studies [1, 3{5] were designed to clarify details regarding assemblies based on short reads and to nd out which assembly software (with what parameters) can produce the best results. Such details depend upon the design of the sequencing experiments, genome features, and the underlying algorithm used in the assembly software, but one can easily conclude that the result will be still a highly fragmented draft genome consisting of a set of contiguous sequences (contigs). To better understand why assembling large genomes from short reads remains a very challenging problem we should get more familiar with current state-of-the-art de novo assemblers and their main challenges in yielding a near complete genome. Chapter 1. Introduction 3 1.2 De Novo Genome Assemblers With all recent improvement in sequencing technologies, the current state-of-the-art sequencing platforms produce small or big pieces of the genome and their result is still far from complete. The result draft genomes are not error-free and one can still nd many small or big errors and misassemblies. That is why we still need some tools and elegant algorithms to reconstruct all these sequences (contigs) into as long as possible linear stretches of correctly oriented and ordered base pairs. To better understand the genome sequence assembly and scaolding problem one can see it as solving a one-dimensional puzzle with thousands of pieces and extra troublesome conditions [6]. Moreover, the original picture may have identical or almost identical sections (repeats) and some of these repeated regions may be mirrored (repeats with dierent orientation), the majority of the pieces do not t perfectly with any other and the picture depicted on them may be distorted. The only helpful element of the problem is the amount of pieces available, usually covering the original image by a lot more than one time which in bioinformatics terminology is called coverage. Coverage is dened as the total number of sequenced nucleotides divided by the (estimated) length of the genome; thus, Nx coverage implies that the genome is sequenced N times over. Here is another interesting example that helps better understand key concepts and challenges in de novo genome assembly. Suppose we have a very thick text book. One can consider its content as a single one dimensional character string. We use a copy machine to make several (e.g. 50) copies of the book. Chapter 1. Introduction 4 But the copier is not a perfect machine and sometimes it randomly makes mistakes in scanning and copying characters. Then we use a specic paper shredder to randomly shred all book copies into relatively small character strings (50 to 250 characters). Now here is the goal: we want to reconstruct the complete original version of the book using all available shredded pieces. To better understand how challenging this task is, one should be aware that a great body of the book content can be identical or nearly identical (repeats). We may also have some shredded papers of other irrelevant books mixed with our shreds (contamination in bioinformatics terminology). One can intuitively think that to assemble a complete genome they should rst nd the pairwise overlaps between the sequenced reads by aligning them against each other. Then overlapped reads must be merged together to build a minimal number longer sequences (ideally equal to the total number of chromosomes in the genome) until no further merging is possible. This procedure in which no other reference genome is available to guide the process is called de novo genome assembly. On the other hand, with increasing availability of (nearly) complete genomes, one can exploit the available closely related complete genomes as a reference in performing the assembly task. This type of genome assembly is referred as comparative or reference- based assembly in which the relative placement of reads can be inferred from their alignment to the related genome (or reference). Here in this dissertation we are mainly focused on de novo genome assembly and try to improve its results. The rst series of de novo genome assemblers were designed to assemble shorter Chapter 1. Introduction 5 Figure 1.1: A chimeric contig is created by merging two distant sections of the genome into a single contig. This can happen when overlapping reads that originate from dierent instances of a repeat are falsely connected together by the assembler. (e.g. bacterial) genomes. They were using a greedy heuristic to nd pairwise align- ments between overlapping Sanger reads and continue with merging the pair with best (longest) pairwise alignment. They were mainly using (dierent variants of) Smith- Waterman sequence alignment algorithm [7] which properly handles some sequencing errors in the overlapping sections of the sequence reads. In such methods, the merg- ing of longest overlaps continues until no further merging is possible. The resulted longer contiguous sequences are called contigs. If there were no sequencing errors and no repeats present in the genome, this heuristic greedy approach would suce and it could result in constructing the complete genome eciently. But even in the simplest genome sequencing projects, there are sequencing errors and chimeric reads and even the smallest genomes have repeated subsequences. These greedy algorithms can yield acceptable results in simpler genomes, but they would fail in precisely reconstructing repetitive sections and repeat boundaries in more complicated genomes. Repeats (repeated sections) occur in multiple copies throughout a genome. Some of Chapter 1. Introduction 6 these copies may be in the forward orientation while some other copies may be present in the reverse-complement orientation. Repeats are source of major challenges in de novo genome assembly and scaolding. The severity of such challenges is dierent in various sequencing technologies. But as a simple rule of thumb, one can assume that the longer the reads, the bigger the percentage of resolved repeats. Ideally, if we have reads longer than repeats, one can claim that repetitive sequences will no longer impose severe challenges to the genome assembly [8]. Figure 1.1 shows how a misassembled \chimeric" contig can be made by simply merging reads related to distant repeated sections of the genome [9]. Nearly all de novo assemblers suer from diculties with repeats. In most modern assembly methods, the assembly process is divided into two phases, pre-assembly and scaolding. In the pre-assembly phase, assembler collects overlapping information to merge unambiguous reads to make valid (non-chimeric) contigs which end at the repeat boundaries. These valid contigs are sometimes referred as unitigs. In the next phase, some sort of spanning information is needed to connect and merge unitigs with each other to form the so called scaolds (a.k.a supercontigs or scaftigs). For the generation of contigs (unitigs), short SGS (second-generation sequencing) reads are widely used because of their relatively low cost and low error rate. An exception to this is the Hierarchical Genome Assembly Process (HGAP [10]) that uses only PacBio long reads to assemble bacterial sized genomes which requires very large coverages of data (80x-100x) to compensate for the large error content [6]. For the next phase Chapter 1. Introduction 7 (scaolding phase), SGS paired sequences (e.g. Illumina's paired-end [11] and mate- pair [12] reads) have been common choices recently. As we will discuss later in Chapter 5, the use of TGS (third-generation sequencing) long reads for scaolding purposes is an interesting emerging trend. For now to better understand why de novo assembly is still a very challenging unsolved problem we should get more familiar with the approaches used in current state-of-the-art de novo assemblers. 1.2.1 Overlap Computation Overlap computation is a very crucial step in most genome assembly algorithms. In SGS libraries, because of the short read lengths a great number of reads is needed to achieve an acceptable coverage. This huge number of reads makes the overlap compu- tation task a very time consuming step for which a computationally ecient method must be incorporated. Current state-of-the-art genome assemblers use two dier- ent categories of methods to calculate pairwise overlaps: Overlap-Layout-Consensus methods and de Bruijn graphs. 1.2.2 Overlap-Layout-Consensus The idea behind this family of methods is to avoid checking and computing overlaps between each pair of reads. This idea can be implemented by a seed-based strategy to predict potential candidates for pairwise overlaps. In this seed-based strategy, reads Chapter 1. Introduction 8 sharing short xed-length (length k) substrings of reads (referred as k-mers) are fur- ther evaluated to exactly nd the overlapped portion (if any). In these methods, the overlap information is represented by a graph (known as Overlap-Layout-Consensus graph) in which every node represents a read and each edge indicates an overlap be- tween the adjacent nodes (reads). Choice of k-mer length can have a great impact on the quality of results of these assemblers. One can ne tune k based on dierent factors such as genome length, read length, coverage, and sequencing errors. If k is too short, computational time increases exponentially and the whole method can get impractical. On the other hand, longerk increases the running time but there would be a risk of missing legitimate overlaps with a too large choice ofk. Celera Assembler [13], published in 2000, used a derivative of this method and called it string graph which simplies the overlap graph by removing redundant transitive edges. SGA and Fermi are considered other assemblers in this family which further improved the idea behind Celera by using some sort of sophisticated indexing technique which is behind the scope of this dissertation. 1.2.3 De Bruijn Graphs De Bruijn graphs were rst presented for EULER assembler [14]. After that, many popular and state-of-the-art de novo genome assemblers, especially ones mainly using high-throughput short SGS libraries, like Velvet, ABySS, ALLPATHS, and SOAPde- novo used a de Bruijn graph approach. In such methods, rst each read is broken Chapter 1. Introduction 9 down intok-mers. K-mers in de Bruijn graph approaches are generally much shorter than overlap-layout-consensusk-mers. Each vertex in the de Bruijn graph represents a uniquek-mer and each edge between twok-mers indicates that the adjacentk-mers share exactly k 1 bases such that the rst k 1 bases of one k-mer are exactly equal to the last k 1 bases of the other k-mer. In this approach, overlaps between reads are not explicitly represented in the graph. By traversing non-branching paths in a de Bruijn graph, contigs can be constructed. But as we discussed before, due to sequencing errors and repetitive substructures in genomes, these non-branching paths are usually quite short in de Bruijn graphs. \Branches" and \bubbles" are troublesome patterns in de Bruijn graphs. To cope with this problems, assemblers usually try to detect and remove these artifacts from the de Bruijn graph to make the problem easier to solve. Most of the times, they don't solve the real problem, but by detecting and removing such complexities they can prevent further problems and misassemblies from happening. This is the real reason most draft genomes are actually a set of contigs rather than a complete contiguous genome or a set of completely constructed chromosomes. To obtain a higher quality result, before the actual assembly step an extra error-correction step would be very eective. But this approach has a major drawback of losing reads structure information in the de Bruijn graph. More specically, after moving a sliding window of size k over each read and breaking it down into a path of k-mers, the reads linear information Chapter 1. Introduction 10 may be lost. Especially if some distal reads share k-mers in the middle of them, branching nodes will be created that impose diculties for further steps and some extra processing will be required in order to detect and resolve them. While overlap-layout-consensus method are very computationally intensive, de Bruijn graph methods require huge amounts of memory to store the graph structure [6]. ABySS and SOAPdenovo are examples of de Bruijn graph assemblers which both have managed to be able to be run on some complex mammalian genomes like the human genome (resulting in a set of fragmented contigs). But most assemblers in this family of methods struggle with proper parallelization and that is why they cannot scale well when it comes to assembling more complex mammalian genomes. A general usage rule of thumb would suggest that for very short reads (<100bp), de Bruijn assemblers would be a better choice while for longer reads overlap-layout-consensus based assemblers might be a more appropriate one. 1.3 Genome Scaolding Genome scaolding is the extra step required after performing sequence assembly to decrease the fragmentation inside the draft assembly. During the scaolding process, sequences constructed in the assembly phase (pre-assembled contigs) should be placed in their correct relative position. Thus, the result of the scaolding phase must be a set of correctly oriented and ordered contigs with possible overlaps or gaps (with Chapter 1. Introduction 11 estimated distance) between contigs. This inferred structure over input contigs are called scaolds (or supercontigs/scaftigs) in some older literature. Gaps between contigs might be either sequencing gaps that can be lled by sequencing information covered by some reads (e.g. gaps induced because of repeats) or true physical gaps caused by lack of sequencing information in the input library which are unaccessible and uncovered (no reads inside the library covering these regions) sections of the DNA. The term \de novo" in de novo genome scaolding similarly indicates a family of scaolding methods in which there is no reference genome available to guide the process to orient, order, and correctly position contigs. Most state-of-the-art scaolders use paired sequences provided by popular high- throughput SGS platforms (e.g. Illumina and Roche 454) as the main source of linking information to orient, order, and position contigs. Paired sequences are ac- tually two reads sequenced from both ends of a single DNA fragment. These two paired reads may or may not overlap each other (there may be a gap between them). The relative orientation and distance of paired reads would be dierent in various sequencing technologies. Commonly there are two types of paired read libraries: (1) paired-end libraries [11] and (2) mate-pair libraries. Paired-end reads are constructed from smaller fragments of 180 to 500 bps (e.g. 100 bps each read with a gap of 20 (overlap) to 300 bps). On the other hand, mate-pair reads [12] are sequenced with a quite dierent technology to achieve longer fragment sizes of about 1 to 10 Kbps (Kilo base pairs). Reads in paired libraries are usually sequenced from dierent strands and while one of them is sequenced from the forward strand, the other one is Chapter 1. Introduction 12 sequenced from the reverse-complement strand. The relative orientation of paired-end and mate-pair reads are also slightly dierent. When dierent ends of a paired read align with dierent contigs, a relative orientation, order, and an approximate distance can be inferred. For example, when the reverse- complement of a contig aligns to an end of a paired read, then its relative orientation compared to the other contig (aligned to the other end) has to be ipped. In order to connect two contigs in the scaold, usually, more than one linking information are required as a single connection supported by only one paired read can be made by mistake because of the presence of sequencing errors in reads or contigs. Many scaolders (especially the ones which are not standalone and are parts of well- known genome assembler packages) incorporate a greedy paradigm. They start the process with the longest contig and connect more contigs as long as there are linking evidence available and there are no con icts caused by the new ones. Although most genome assemblers have a built-in scaolding module, but they are usually very basic and naive. Recently, standalone scaolders are of great interest. Standalone scaolders should not be bound and limited to outputs of any specic assembler and they have to be able to upgrade the quality of results of any assembler eectively. Published in 2004, Bambus [15] was the rst independent scaolder that brought mate-pair scaolding capabilities to virtually any assembler, but it was not designed to deal with SGS libraries. Bambus paved the way to a bunch of other stand-alone scaolders like SOPRA [16], Bambus 2 [17], MIP [18], Opera [19], SSPACE [20], Chapter 1. Introduction 13 GRASS [21] and SCARPA [22] all have been built in recent years. A comprehensive comparison of all these scaolders is included in Hunt et al. (2014) study [23]. As we are working on stand-alone scaolders and try to solve problems in current scaolders, we discuss scaolding approaches and challenges in more details in Chapter 2. 1.4 Genome Finishing In this section we describe why the result of de novo genome scaolding would be still a fragmented non-complete genome and further post-assembly improvements and specically gap lling (gap closure) is needed to have nearly complete genomes. Genome centers used to invest heavily in the nishing aspect of sequencing a genome. But today, the focus is more centered on generating many genomes as quickly as possible for a low cost. Part of the reason for being less emphasis on nishing eorts can be related to advances in technology. Next-generation sequencing has made it possible to sequence a human genome in a little more than a day. Although low-cost and high-throughput sequencing technologies have increased the amount of sequence data available exponentially, the process of obtaining complete genome sequences typically requires time and labor intensive techniques like BAC-based sequencing and fosmid sequencing in which researchers are less likely to invest. When the Human Genome Project announced that the entire human genome had been sequenced, the term \nished" was a matter of semantics. Technically, it was Chapter 1. Introduction 14 the most complete human genome, but researchers were still laboring away, putting bits and pieces together and trying to resolve some of the trickier regions. In fact, eorts to nish the human genome persist ten years later. The Genome Reference Consortium [24] is continuing to improve on the human reference genome by releasing patches that x various regions of the genome, and it is also constructing multiple iterations of regions that are too complex and variable to be represented by just one assembly. Currently more than a third of genome sequences available in public databases remain as drafts and many other projects are still incomplete because of limitations of short read second-generation sequencing (SGS) and assembly processes [25]. Sequencing er- rors, regions of high complexity and repeated sequences are the most common issues. The single molecule third-generation sequencing (TGS) technology solved some of these limitations with longer reads, but brought in others such as high error rate and higher cost. Thus, there is still a dependence on SGS platforms. The vast majority of genomes available today were sequenced using short reads and their assemblies can still be improved. Developments of the nishing process, which comprise error correc- tion, scaolding and gap closing, did not follow the speed of sequencing technologies. One strategy to reduce the number of gaps is to obtain data from dierent sequenc- ing technologies, aiming to reduce errors, compensate bias and improve quality and completeness of the genome sequence [10]. Another approach is to obtain alternative assemblies using the same raw data, but with dierent assemblers and parameters [1]. Chapter 1. Introduction 15 These strategies usually generate many datasets, which can be combined to improve the genome. 1.4.1 What is a nished genome? How one denes a nished genome is up for interpretation. The rst problem in nishing genomes is dening what a nished genome is. If one means there should not be any unknown bases or any error, that is feasible for certain organisms, like bacteria and worms. But for human and other animal genomes, there are regions in these large eukaryotic genomes that are hardly accessible for any sequencing tech- nology, like highly repetitive regions and centromeres. Additionally, if we look at the human genome, one person's genome is completely dierent from another one's in certain regions. Another issue with dening a nished genome is that humans and other animals are diploids, receiving half of their genome from their mother and half from their father. Plants, which are often polyploidy, are even more complex. Technically, a nished genome would have to contain both the mother's and father's haploid genomes. Currently, the reference genome is a haploid representation, which is an articial concept [26]. Cancer genomes add even more complexity to the ques- tion of what a nished genome is. Tumors are frequently a collection of variable genomes. There are dierent genome populations within the same tumor mass. Yet, sequencing does not distinguish between these various genomes, unless the researcher is doing single-cell sequencing. However, single-cell sequencing has its own technical Chapter 1. Introduction 16 challenges, such as amplifying DNA from one cell in an unbiased manner and generat- ing sucient coverage, that make de novo assembly and genome nishing extremely dicult and costly. 1.4.2 Gap Closure Even after performing genome scaolding right after genome assembly, most draft genomes would be still incomplete. There might be still gaps (with estimated lengths) between adjacent contigs in the scaold. Some of these gaps are caused by repeats and some of these gaps are because of low-coverage regions of the genome. The linking information of paired libraries can help with bridging these dicult areas by connecting contigs into larger scaolds, but scaolding cannot add novel sequences to the draft genome. As manual lling of incomplete regions with Sanger sequencing data is very expensive, developing tools for closing gaps of draft genome sequences that take advantage of dierent datasets is an important component of automatic genome nishing eorts [25, 27]. 1.4.3 Why is it important? The importance of having a nished genome highly depends on the specic research question. Finished genomes are not necessary for every research application. If the researcher is interested in a gene list, then having a complete, assembled genome may Chapter 1. Introduction 17 not be necessary. But if the researcher is developing a model system for gene regula- tion, for instance, then all the regions in between are also important. As longer read sequencing technologies are becoming cheaper, it is expected that more researchers would start nishing genomes. Finished human genomes will have important medical consequences because they will allow for the phasing of genomes and a better un- derstanding of the large structural variations in cancer genomes. In the future, once longer-read sequencing technologies (e.g. from PacBio, Moleculo (acquired by Illu- mina in 2012), Oxford Nanopore, etc.) become cheap and high-throughput enough to sequence whole genomes, we will probably revisit many of the genomes we sequenced and realize how many mistakes we have made [26]. 1.5 Other Post Assembly Improvements Generating a gap-free and fully annotated genome is the ultimate goal of genome projects. According to what have been discussed so far, genome scaolding and gap closure are two major post-assembly improvement procedures to yield better quality, less fragmented, and ideally nearly complete draft genomes. But there are other minor post-assembly procedures which can be applied before or after genome scaolding and gap closure to improve the contiguity and accuracy of the results. These procedures can be designed and incorporated within genome assemblers and scaolders or as stand-alone tools. Genome annotation is also a very important and time-consuming aspect of many de novo genome sequencing projects specially if they Chapter 1. Introduction 18 aim to generate high quality annotated genomes that may be subsequently used as reference genomes to facilitate the re-sequencing and annotation of many related species through comparative methods [28]. Most post-assembly improving tools are relatively time consuming even for smaller and less complex genomes. For example, applying Wellcome Trust Sanger Institute's PAGIT (Post-Assembly Genome-Improvement Toolkit [28]) to an E. coli assembly takes approximately 24 hours. It can annotate over 4300 gene models and double the average contig size [6]. ICORN (Iterative Correction of Reference Nucleotides) is a part of this post-assembly pipeline. By iteratively aligning reads to the sequence, ICORN enables error correction in consensus sequences. A variety of errors like sin- gle base-pair errors and small indels (insertions/deletions) can be corrected. RATT (Rapid Annotation Transfer Tool) is another module in their pipeline than can trans- fer annotation from reference genomes to the draft genome. There are other tools and software for error correction, improving scaolding, and generating annotations. Studying all these tools and listing their approaches together with pros and cons of each of them would be beyond the scope of this dissertation. 1.6 Our Goal Our projects are focused on proposing methods and frameworks to improve the vast publicly available draft genomes produced by SGS short read de novo assemblers. Our Chapter 1. Introduction 19 improvements are mainly categorized as stand-alone scaolding rather than genome nishing or gap closure (gap lling). Despite other currently available methods we want to properly order and orient the pre-assembled contigs without removing the repeats which are the main source of complication in de novo genome assembly and scaolding. Following chapters present more details on de novo genome scaold- ing (Chapter 2), the problem with repeats (Chapter 3), and our proposed methods (Chapters 4 and 5). Chapter 2 Genome Scaolding 2.1 Denitions Obtaining a nearly complete genome sequence is a key step in the molecular anal- ysis of an organism. As we discussed in Chapter 1, overlapping reads go through the process of de novo genome assembly, but the result would be far from a nearly complete genome. The high throughput, reduced price, and increased availability of second generation sequencing (SGS) technology have driven the development of many de novo genome assemblers most of which have some sort of built-in scaolders. The draft genome made by assemblers would be actually just a set of contigs. The total number and length of these contigs are highly dependent on underlying sequenc- ing technologies, sequencing coverage, and the complexity of the genome. Recent studies benchmarking de novo assemblers, like GAGE [1], Assemblathon-1 [4], and 20 Chapter 2. Genome Scaolding 21 Assemblathon-2 [5], reported large quality variations between assemblers, datasets and assembler parameters. By linking contigs into scaolds, assembly continuity can be improved a lot. In this chapter, we brie y describe and compare methods using paired libraries to bridge over regions of the genome which are dicult to sequence or to assemble. As we discussed in the previous chapter, there are two dierent types of paired SGS libraries (paired-end and mate-pair). In these libraries, paired reads can bridge over gaps ranging from tens to tens of thousands of base pairs. Again as we brie y discussed in Chapter 1 and we go through more details on it in Chapter 3, repeats are the source of major problems in de novo genome assembly and scaolding. Repeats are usually misassembled or collapsed during the assembly process. Furthermore, in the scaolding step, repeats cause numerous links arising from repetitive contigs. Scaolding is a very \hard" problem which does not scale linearly. It has been shown that it is computationally intractable [29]. A scaolding problem is usually represented as a graph in which nodes (vertices) cor- respond to contigs and edges (links) indicate linkage information (e.g. linking paired reads) between adjacent nodes (contigs). This scaolding graph (a.k.a contig graph) can be a weighted graph in which edge weights show how strong the linkage infor- mation is. For example, one simple scenario would be a weighted scaolding graph in which an edge weight shows how many paired reads support that specic relative orientation and positioning between adjacent contigs. One scaold per chromosome, Chapter 2. Genome Scaolding 22 together with gaps of the correct lengths, would be the ideal output from a genome scaolder. Published in 2004, Bambus [15] was the rst stand-alone scaolder, but it was not developed for large SGS libraries. Since 2010, several standalone de novo scaolders, specically developed for SGS data in mind, have been designed and implemented. In this chapter we brie y analyze Bambus2 [17], SSPACE [20], SOPRA [16], Opera [19], MIP [18], GRASS [21], and SCARPA [22]. We also brie y discuss the built-in scaolding modules from the assemblers ABySS [30], SGA [31] and SOAPdenovo2 [32]. All aforementioned scaolders use an existing read mapping software (simply called mapper) to know where each read have to be placed (if properly aligned) in the input contigs. This mapping step is usually a time-consuming step that has to be performed before the actual scaolding. There are many mappers available to perform this initial aligning task, however, some scaolders bundle popular mappers (like BWA and Bowtie) to perform the initial mapping of reads against contigs. Some assemblers and scaolders (like SOAPdenovo) use their own mapping tool to align input reads against pre-assembled contigs. A proper standalone scaolder should let the user run the initial mapping using their software of choice and provide the output alignment le, usually in SAM (Sequence Alignment Map) or BAM (compressed indexed binary version of SAM) format [33], as an input to the scaolding tool. Once the positions of reads alignments against pre-assembled contigs have been determined, scaolding Chapter 2. Genome Scaolding 23 algorithm can be started. 2.2 Algorithmic Complexity Scaolding can be viewed as a constraint satisfaction problem in which input contigs need to get properly oriented and ordered such that constraints imposed by linkage information (e.g. aligned paired reads) are satised. We need to emphasize that linkage information between contigs is not necessarily restricted to paired reads. Any other source of information (like mapping contigs against a related reference genome or against a physical map) that can imply some sort of relative placement can be considered as a valid imposed constraint [15]. When there are no repeats or sequencing errors, this constraint satisfaction problem can be solved eciently and trivially. But in presence of repeats and errors, it has been proved that ordering and orienting input contigs and satisfying all imposed constraints is a computationally intractable problem [29, 34]. Even smaller subproblems of the scaolding problem have been showed to be intractable. Solving orientation problem or solving the ordering problem (with the given true orientations) are both proved to be NP-hard problems. The orientation problem can be reduced to the problem of nding a maximum bipartite subgraph and the ordering problem can be reduced to the Optimal Linear Arrangement problem, both of which are NP-hard [6, 35]. That is why to practically solve the scaolding problem we need heuristics to nd approximate solutions. Chapter 2. Genome Scaolding 24 2.3 Current Tools and Softwares Here in this section we brie y discuss most important de novo genome scaolders which primarily rely on paired read data to order and orient the pre-assembled contigs. We emphasize on standalone scaolders rather than built-in modules inside de novo genome assemblers. After brie y describing and comparing their approaches, we will present more details about each of them in following subsections. Next section also provides a more detailed practical comparison result that shows most of these tools are not as practical and as successful as they claim. Bambus and SSPACE follow similar greedy approaches. In Bambus, rst, contigs with the biggest number of links are joined together and subsequent links con ict- ing with the existing (already joined) ones are ignored. SSPACE picks the longest contig rst and continues to join it with other contigs until no further junction is possible, either due to the lack of linkage information or because of the uncertainty with choosing the next contig to join (mainly at repeat boundaries). Other scaold- ers discussed in this chapter construct a scaolding graph (a.k.a contig graph) and use various methods to nd an approximate solution to the scaolding problem by processing this scaolding graph. Some methods try to simplify the problem by par- titioning the scaolding graph and solving each subgraph independently. Nearly all scaolders try to detect and remove repeat nodes (corresponding to repeating contigs) from the scaolding graph to make it simpler and easier to solve. Bambus2 identi- es and removes repeats (repetitive contigs), but keeps contigs arising from variants, Chapter 2. Genome Scaolding 25 then orients and orders them into scaolds and terminates with a variant detection stage. SOPRA is an iterative scaolder that in each iteration uses statistical opti- mizations to remove inconsistent edges and nodes (that increase spurious links) from the graph. These iterations are repeated until all nodes and edges in the graph are consistent with each other. Opera solves the problem very similarly, but it drops the distance constraints between adjacent contigs. SCARPA is a scaolder which tries to solve the orientation and ordering problems separately. It uses restrictions making the computation tractable. One of the key characteristics of SCARPA is that in a verication step it tries to detect and break misassembled contigs. SGA is a very conservative scaolder which tries to disallow any con ict in the graph. That is why it usually produces more fragmented scaolds, but it is more accurate than other scaolders. MIP scaolder partitions the scaolding graph into restricted size subgraphs. Then it transforms subgraph into mixed integer programming problems which can be solved independently and merged together later. GRASS uses mixed integer quadratic programming but like SOPRA it works iteratively to produce the nal results. In following subsections, we go through some more details of these well-known and highly cited standalone de novo genome scaolders for second-generation sequencing data. More comparison details (mainly based on Hunt et al. (2014) study [23]) will be discussed in the next section. Chapter 2. Genome Scaolding 26 2.3.1 Bambus (2004) Developed in 2004, Bambus [15] was the rst general-purpose standalone scaolder mainly designed for Sanger sequencing libraries. Bambus is not necessarily a pure de novo genome scaolder as it was not restricted to paired read libraries and lets using other sources of linking information (like alignments against related reference genomes, physical maps, contig overlaps, and information about the conservation of gene synteny). As an input, Bambus takes abstract contig links representing the relative orientation and relative position of adjacent contigs. Following a greedy paradigm, it joins contigs with the largest number of links rst and ignores subse- quent edges con icting with existing joins. In their algorithm, each contig link is assigned a weight (or priority) indicating its quality. In the rst iteration, top-quality links are used to generate the rst set of scaolds. Then during following iterations, lower-quality links will be added to grow scaolds from previous iterations. At each iteration, only links consistent with previously computed scaolds will be considered. Because of this heuristic, their approach has been claimed to be a hierarchical method. Bambus also includes a mechanism for detecting and removing suspected repeats to reduce the negative eects of misassemblies. 2.3.2 SOPRA (2010) Published in 2010, SOPRA [16] uses paired read libraries to solve the scaolding problem. It stands for Statistical Optimization of Paired Read Assembly. SOPRA is Chapter 2. Genome Scaolding 27 an iterative scaolder that in each iteration uses statistical optimizations to remove inconsistent edges and nodes (that increase spurious links) from the graph. These iterations are repeated until all nodes and edges in the graph are consistent with each other. Thus, it tries to select as large as possible subset of constraints which are simul- taneously satisable and make a balance between the quality and the size of resulted scaolds. SOPRA models the scaolding problem as an optimization problem with optimization variables associated with nodes and edges in the scaolding graph. Prob- lematic constraints (mainly caused by repeats, chimeric reads, and chimeric contigs) are removed. Other than that, all other constraints are treated equally by SOPRA's algorithm. In each iteration, SOPRA solves orientation and ordering problems sepa- rately. It rst tries to nd the optimal orientations for all contigs. Then it removes links inconsistent with the decided orientations. Finally, it models remaining links as springs connecting adjacent contigs to nd an optimal ordering solution in which the total number of compressed or stretched are minimized. SOPRA also accepts SOLiD sequencing information as input for which it incorporates a dynamic programming algorithm to convert the inputs from colors to bases. 2.3.3 Bambus-2 (2011) Being part of the AMOS package [36, 37], Bambus2 [17] is mainly designed to deal with some challenges which are more specic to metagenome assemblies. Genomic Chapter 2. Genome Scaolding 28 variation is the main source of challenges in metagenomes assembly and scaold- ing. Bambus2 suggests a method distinguishing genomic variation from repeats in the scaolding graph. Repeats and genomic variations introduce dierent complex substructures inside the scaolding graph. Repeats usually \tangle" the scaolding graph and mask the global structure of the genome while genomic variations (e.g. antigenic variation loci in dierent strains of an organism) introduce \bubbles" which are localized motifs. Bambus2 removes tangles (focal points in the graph) and keeps bubbles in the scaolding graph to better recover the global structure of metagenomes. More specically, Bambus2 replaces each bubble (graph nodes involved in the bubble) with a single node in the scaolding graph. After solving the scaolding problem, for each variant Bambus2 outputs the main sequence along with alternative varia- tions which correspond to the haplotypes. For repeats removal, Bambus2 uses a node centrality measure (similar to the vertex betweenness centrality). It also suggests a coverage-based repeat identication algorithm in which instead of using a global coverage statistic, the coverage variation within graph components is tracked and analyzed. 2.3.4 SSPACE (2011) SSPACE [20] (SSAKE-based [38] Scaolding of Pre-Assembled Contigs after Exten- sion) also follows a greedy approach. It picks the longest contig rst and continues to join it with other contigs until no further junction is possible, either due to the lack Chapter 2. Genome Scaolding 29 of linkage information or because of the uncertainty with choosing the next contig to join (mainly at repeat boundaries). In their implementation, rst reads with non- ACGT bases are ltered out. Then remaining valid paired reads are mapped to the pre-assembled contigs using Bowtie. It stores all mapping information (orientation and position of mapped reads) in a hash. After removing duplicate paired reads, op- tionally input contigs can be extended using unmapped reads. This is a useful feature especially when the input read library is dierent from the library used for the initial contig assembly. Before the real greedy scaolding step starts, putative contig pairs are made out of results of the initial read alignment step. The scaolding phase starts by picking the longest contig. Then, if the link between this contig and the contig paired with it is supported by at least k (a parameter) read pairs, this new contig would be added to the scaold and the procedure continues with this newly added contig in a similar way. The main problem arises when the scaolding algorithm reaches a \branch" where there are more than one candidate contigs (all with at least k supports) to continue with. SSPACE follows a simple heuristic in which if there are links between alternatives, it tries to place all alternatives properly in the current scaold. Otherwise (when there is no link between alternatives), it calculates a ratio between the two best alternatives. If this ratio meets some specic threshold, the alternative with the highest support will be added to the current scaold. Otherwise, the scaolder cannot decide to choose a contig to continue with, so halts and stops extending the current scaold. SSPACE iteratively repeats this process and creates and extends a new scaold until no unscaolded contig remains. As we will discuss Chapter 2. Genome Scaolding 30 in the comparison section in in this chapter, SSPACE is one of the simplest and most practical standalone scaolders and usually outperforms other more overblown sophisticated ones. 2.3.5 Opera (2011) Opera [19] works very similarly to SOPRA, but it drops contig distances constraints. To minimize the total number of unsatised links, Opera proposes a dynamic pro- gramming algorithm. Opera has been claimed to be scalable to scaold large draft genomes. They propose a graph contraction algorithm to ensure such scalability. Af- ter solving the orientations and orders, Opera formulates a quadratic programming problem to estimate the gap sizes between adjacent contigs. The logic behind these formulations is to avoid aggressively linking contigs together in scaolds, otherwise, although the result seems to have good contiguity, it would highly suer in accuracy and there would be lots of misassemblies with wrong connections in the nal scaolds. OPERA labels and removes a contig as a repeat contig if its read coverage is at least 1.5 times bigger than the average genomic coverage. Chapter 2. Genome Scaolding 31 2.3.6 MIP (2011) Published in 2011, MIP (Mixed Integer Programming) scaolder [18] partitions the scaolding graph into restricted size subgraphs. Biconnected components of the scaf- folding graph would be these restricted size subgraphs that have to be solved indepen- dently. MIP transforms graph partitions into mixed integer programming problem instances which can be solved independently and merged together later. Like pre- vious approaches, MIP also attempts to detect repeat contigs based on their high degrees (e.g. >50) in the scaolding graph or much higher than expected coverage (e.g. >2.5 times the average coverage). It removes all suspicious repeat contigs from the scaolding graph. 2.3.7 GRASS (2012) Similar to MIP, GRASS [21] (GeneRic ASsembly Scaolding) proposes a mixed inte- ger programming formulation of the scaolding problem, but it combines orientation, order, and distance estimation in a single optimization objective function. Further- more, GRASS has a quadratic objective function while MIP formulates a linear ob- jective function. Similar to some other scaolders, like SOPRA, GRASS follows an iterative approach to nd a consistent subset of contigs. However, they try to nd and remove inconsistent constraints rather than inconsistent contigs. They proposed an EM (expectation{maximization) strategy and an unconstrained binary quadratic programming approximation to optimize the related objective function. But despite Chapter 2. Genome Scaolding 32 these lofty claims and ideas, many users reported that they could not use it and put it to work practically. In a recent comprehensive evaluation of assembly scaolding tools [23], it was mentioned that GRASS did not join any contigs together in any of the test cases or using the S. aureus genome. It either resulted in scaolds sim- ply consisting of the original contigs or crashed. Therefore, it was not analyzed any further in that study. 2.3.8 SCARPA (2013) Published in 2013, SCARPA [22] is a scaolder trying to solve the orientation and ordering problems separately. It uses restrictions making the computation tractable. It rst tries to solve the orientation problem. In the modied version of the scaolding graph (G) used in their method, it can be shown that the contig orientation problem has a feasible solution if and only if G has no cycles containing an odd number of nodes. In real scenarios, one should attempt to nd the smallest set of nodes that can be removed from the graph to allow for a feasible solution. This is known as the Odd Cycle Transversal problem which is NP-hard, but there is a xed-parameter tractable algorithm for it. After removing odd cycles and assigning orientations to all contigs, the graph may still have some cycles that although do not cause any problem for orientations but make linear positioning and placement of contigs infeasible. All directed cycles should be eliminated to place the contigs into a linear order. This is the Feedback Arc Set problem, the NP-hard problem of nding a minimal set of edges Chapter 2. Genome Scaolding 33 whose removal makes a directed graph acyclic. They follow an available heuristic approach to solve this problem. Contig verication and misassembly removal is another key feature of SCARPA. There is an implicit assumption based on which other scaolders are designed and imple- mented, that paired read data is noisy but contigs are error-free, which is not always true. That is why unlike other scaolders, SCARPA includes a contig verication step that breaks contigs that appear to be misassembled. Most misassembled contigs are results of over-collapsing of some repeat regions. Like other scaolders, SCARPA also removes repeat contigs from the scaolding graph. They put a threshold on the maximum number of links a contig can make. If a contig exceeds this threshold, it is disconnected from the graph. 2.4 Comparisons Hunt et al. (2014) [23] published the rst independent analysis of de novo genome scaolders which mainly use second-generation sequencing data as their input. De- pending on the choice of scaolder and the dataset used as the input, large variations in the outputs quality have been found. Contrary to scaolders' published results, even very simple scenarios of perfect inputs produced some surprising results in most of these scaolders. In their study, both simulated and real sequencing data of various Chapter 2. Genome Scaolding 34 organisms (like Rhodobacter sphaeroides, Staphylococcus aureus, Plasmodium falci- parum, and Homo sapiens) were used to analyze the performance of these scaolders. Several tools managed to produce perfect results when the high quality simulated data was used as the input. On the other hand, when working on real data, at least 10% of joins remained unidentied. In terms of speed, usability, and the total number of correct and incorrect joins between contigs, studied scaolders vary a lot. Based on their overall performance, Hunt et al. (2014) [23] conclude that SSPACE, SGA, and SOPRA outperform other scaolders. It is very important to note that the quality of the results and the performance of the scaolder of choice are highly dependent upon the choice of incorporated read mapper and the inherent genome complexity. To better understand the results of Hunt et al.'s comparisons we should be more familiar with the comparison criteria and evaluation metrics that should be used to compare scaolding algorithms. 2.4.1 Contiguity Traditionally, it has been believed that longer and more continuous contigs and scaf- folds are better. That is why the most important metrics to evaluate the result of assemblers and scaolders are length and contiguity statistics. Thus, genome assem- blers and genome scaolders producing fewer and longer contigs and scaolds are generally considered more powerful. The philosophy of this assessment approach has roots in the intuition that we have already a set of fragmented contigs and we want Chapter 2. Genome Scaolding 35 to have a draft genome sequence as close as possible to the real complete genome sequence. Maximum length (length of the longest contig) and N50 (generally, Nx with x be- tween 0 and 100) are considered the most common length and contiguity metrics. N50 has an interpretation very similar to the median length, but N50 tends to ig- nore short contigs and gives more weight to long contigs and scaolds. To calculate the N50, one should rst calculate the \total sum" which would be the sum of the lengths of all contigs or scaolds. Then the set of contigs or scaolds has to be sorted in descending order, from longest to shortest contig/scaold. Finally, we need to add up lengths of long contigs one by one (starting from the longest contig) until this sum reaches the length of 50% of the \total sum". The length of the last added contig/s- caold would be the N50 of the assembly/scaolding. In other words, half of the total sequences of the draft assembly would be contained in contigs or scaolds with length at leastN50 base pairs. Thus the longerN50 the better and more contiguous the assembly. Similarly, one can dene and use other measures like N75 and N90 for a clearer understanding of the lengths distribution. If the length of the genome is known, one can substitute \total sum" with the genome length. Then similarlyNGx contiguity measures, likeNG50,NG75, andNG90 can be dened. Having the actual reference genome, one can map all contigs/scaolds against the reference genome and substitute the \total sum" with the sum of aligned blocks (parts of contigs which are successfully aligned to the reference genome). So NGAx contiguity measures, like NGA50, NGA75, and NGA90 can be dened. Chapter 2. Genome Scaolding 36 2.4.2 Accuracy The traditional and still widely used length and contiguity measures can only indicate the improvement on the draft assembly fragmentation. But they have given zero information on the precision and base-level accuracy of the results. For example, a dummy script that concatenates reads or contigs in a random order can yield a draft assembly with a single huge contig beating length and contiguity measures of best assemblers and scaolders. There is always a length vs. accuracy trade-o that has to be carefully considered. But how can we assess the accuracy of resulted contigs and scaolds after assembling and scaolding? For accuracy measures, we need to have the reference genome or closely related complete genomes. For example in benchmarking studies, researchers use the reference genomes from which the read libraries (either real or simulated) have originated as a template for assembly/scaolding accuracy assessment. Hunt et al. (2014) [23] track the ordered and oriented contigs inside the nal scaolds by mapping them against the available reference genomes to evaluate their relative orientations, orders, and to verify the estimated gaps (or overlaps) between adjacent contigs. First, they error-correct contigs to have them unambiguously aligned against the reference genome. Then they nd unique tags (short substrings) withing each contig. These unique contig tags would be used for contigs tracking. To do so, for each contig, the middle 50-base long substring is selected as a candidate tag and gets mapped to the reference genome and also to all other contigs. When veried as a Chapter 2. Genome Scaolding 37 unique tag, its information (showing to which contig it belongs) will be stored in a separate le. Otherwise, the length of the substring will increase until a proper unique tag is found. In the worst case, the whole contig can serve as a unique tag. Otherwise, as the whole contig is either a repeat or it is a smaller contig contained in a bigger contig, we can safely discard it from our assessments. During the evaluation, based on the location of these unique tags on the reference genome, the links (joins) between scaolded contigs can be validated. If the tags of two adjacent scaolded contigs are in the correct orientation and order and withing the acceptable distance range (length equal to the average fragment length would be the maximum acceptable dierence), the connection between these two contigs will be considered as a correct connection, otherwise an incorrect connection. There may also be some \skipped tags" (when there are missing contigs between a connection) that must be reported by the evaluation tool in this method. In Hunt et al. (2014) [23] research study on which we focus in this comparison section, the quality of each scaolder was assessed using the following ve key metrics: Correct joins: a pair of contigs correctly joined with the tags in the expected orientation and separated by the correct distance. Incorrect joins: where contigs from dierent locations, in the wrong orienta- tion or separated by the incorrect distance were joined together. Chapter 2. Genome Scaolding 38 Skipped tags: two contigs were correctly joined, but separated by a gap that should contain another contig. Lost tags: a tag that was completely absent from the output of the scaolder. Running time: the total CPU time used, including any pre-processing stages and read mapping. For each metric (within each dataset), a score in the [0; 1] interval is calculated such that the worst tool (e.g. the tool producing fewest correct joins) is scored 0, the best tool (e.g. the tool producing the highest number of correct joins) is scored 1, and the remaining tools receive scaled scores accordingly. A single summary score (averaging all ve scores) can be calculated for each tool within each dataset. But one can note that a summary score equal to 1 for a tool within a dataset does not necessarily mean that the tool has managed to produce the perfect result. It only indicates that for that specic dataset and using this set of accuracy evaluation metrics, this tool outperformed other similar tools. In the evaluation studies, MIP, Opera, and ABySS performed relatively well but they did not output results with the highest quality. MIP lost some date during the scaf- folding. MIP and Opera showed some random behavior in producing unpredictable outputs when there were no clear solutions to the scaolding problem. On both simu- lated and real test cases, SCARPA was reported to generate results inconsistent with the input data. It also lost some data and was generally outperformed by the other Chapter 2. Genome Scaolding 39 scaolders. SGA proved itself to be a conservative tool choosing the solution with the strongest evidence. Although acting randomly when there are more than one equally likely solutions, SGA has managed to produce results with high accuracy and low error rates. SSPACE and SOPRA generally performed well on simulated and read test cases. Although Bambus2's performance on simulated test cases was similar to that of SSPACE, its accuracy was variable using the real date. On the other hand, although SOAPdenovo2 could make few joins in simulated test cases, it could manage to make more connections than all other tools with a relatively low error rate. According to Hunt et al. (2014) [23], when the higher accuracy and the minimum possible error rate is more important to the user, SGA would be the best choice as it proved itself to be a conservative tool. But one should note that SGA makes much fewer connections compared to other scaolders, thus its results still tend to be frag- mented. To balance between accuracy and contiguity, SOPRA can be a reasonably good candidate. It has been also reported that when two dierent size libraries are given as the input, SOPRA outperforms other tools as it aggregates and combines the linking information into the scaolding graph. But it does not scale well and cannot be practically used for large datasets. SSPACE and SOAPdenovo2 would be best choices if a relatively short running time is important. They were fastest tools producing reasonably well (in terms of both contiguity and accuracy) results. It has been also reported that compared to SOAP- denovo2, SSPACE acted more conservatively on most datasets. SSPACE is the easiest Chapter 2. Genome Scaolding 40 tool to install, run, and interpret. That is why SSPACE is the most cited standalone scaolder among all discussed here. We will compare our own results with SSPACE in the following chapters. Generally, Hunt et al. (2014) [23] do not recommend us- ing OPERA, MIP, or SCARPA as they are usually outperformed by SGA, SOPRA, SOAPdenovo2, and SSPACE. Chapter 3 Repeats From bacteria to mammals, repetitive DNA sequences are abundant and quite com- mon in a broad range of species. Nealy half of the human genome contains repetitive sequences. Plant genomes are famous for containing high proportions of repeats. Even genomes of bacteria can have up to 40% repeat content [39]. Algorithms for sequence alignment, sequence assembly and scaolding have always suered from technical problems imposed by the presence of repeats. These problems got worse with the availability and popularity of SGS data which provide high volume short read data. Interpreting sequence alignment and assembly results have always been aected by biases and errors caused by ambiguities created by repeats. Simply removing or ignoring repeats (done by the most state of the art assemblers and scaf- folders) is not a true solution as these crucial biological phenomena play a key role in 41 Chapter 3. Repeats 42 all downstream analyses. Later in this chapter, we mainly discuss the computational challenges caused by repeats in de novo genome assembly and scaolding. Repeats can appear in dierent structures and sizes. Some repeats are widely inter- spersed while some other repeats are not (like the tandem or nested repeats). Their copy number (repeat count) can vary from two copies to millions of copies throughout the genome. Some repeats may be only mono- or dinucleotide repeats (1 or 2 bases re- peats respectively) while some other repeats may contain millions of base pairs. The human genome has well-characterized repeats usually categorized into two classes: \long interspersed repeats" and \short tandem repeats". Short tandem repeats are also called \microsatellites. Long interspersed repeats can be either SINEs (short in- terspersed nuclear elements) or LINEs (long interspersed nuclear elements). Covering about 11% of the human genome, Alu repeat elements are a very well-documented example of interspersed repeats [40]. 3.1 De novo genome assembly and repeats Compared to the traditional Sanger sequencing, SGS technologies can produce data with higher coverage depth at a far lower cost. That is why to compensate with SGS libraries' short read length, current assembly methods use deeper sequencing coverage. But diculties arisen by repeats cannot be overcome by higher coverage depths. In de novo genome assembly, repeat sequences longer read length create gaps Chapter 3. Repeats 43 in the nal result. Similarly, in de novo genome scaolding using paired read data, repeats longer than fragment lengths cannot be solved. That is why these days most draft assemblies that can be produced quickly and at a cheaper price are far from complete and are more fragmented compared to Sanger sequencing draft assemblies [9, 41]. Besides inducing sequencing gaps (not physical gaps), repeats can collapse and cause chimeric rearrangements and misassemblies. The contiguity and accuracy problems caused by repeats are highly dependent upon the input libraries' read length. Generally, the longer the reads, the fewer problems caused by repeats. There are millions of copies of repeats (200 to 500 base pairs long) present in the human genome. These repeats are generally longer than the length of SGS reads. That is why SGS data are not enough to assemble larger and more complex genomes like those of animals and plants. Even using the tradition Sanger sequencing date, longer repeats (e.g. LINEs) cannot be solved. To do so, exceptionally long reads or other long-range linking information is required. As mentioned in previous chapters, most de novo assemblers (either overlap-layout- consensus or de Bruijn graph assemblers) create graphs out of the input read library. To reconstruct the genome and produce results, these graphs need to be properly processed and traversed. Similarly, most de novo genome scaolders build scaolding graphs to perform their job and scaold the pre-assembled contigs. Technically, the presence of repeats has a drastically bad eect on the structure of these graphs. As brie y discussed before, repeats create \branches" in assembly graphs and \tangles" in Chapter 3. Repeats 44 scaolding graphs. To decide which branch/path to follow, assemblers and scaolders have to make a guess and their overall performance is highly aected by such guesses. Wrong guesses create incorrect connections which result in the creation of chimeric contigs and incorrect scaolds and possibly with erroneous copy numbers for repeated sequences. To achieve more accurate results, a more conservative algorithm should stop/halt at these branches, but this leads to more fragmented draft genomes with fairly small contigs/scaolds. 3.2 Assembly errors caused by repeats The key problem with repeats is that an assembler cannot distinguish dierent copies of a repeat from each other. That is why the anking sequences can easily be misas- sembled and joined together incorrectly. As depicted in Figure 3.1, the most common error caused by repeats is that an assembler may create a chimera by connecting two chromosomal regions which must be far from each other. As one can easily verify in this gure, all reads may align perfectly to a misassembled genome. If we have reads longer than a repeat region or paired reads sequenced from ends of a fragment whose length is bigger than repeat length the misassemblies may be detected. Paired reads are sequenced from ends of a single DNA fragment with known size. An as- sembler/scaolder can utilize the information regarding the relative orientation and the expected distance of such reads. If paired reads do not span a particular repeat, then it would be impossible to resolve the problem unambiguously. Chapter 3. Repeats 45 Figure 3.1: Assembly errors caused by repeats. (A) Rearrangement assembly er- ror caused by repeats. (B) A collapsed tandem repeat. (C) A collapsed interspersed repeat. The diculty of assembling large genomes using very short reads have been illustrated in two recent studies. According to Alkan et al. (2011) [42], human genome draft assemblies (produced by 2011) were about 16% shorter than the available reference genome, primarily because of repetitive sequences being missed. Particularly, these assemblies were lacking 420 Mbp of common repeats like Alu elements, LINE 1 ele- ments, and segmental duplications. In another study, Ye et al. (2011) [41] checked two Chapter 3. Repeats 46 draft assemblies of the chicken genome which were assembled using Sanger sequenc- ing data. Compared to the human genome, the chicken genome has less repetitive content (10% compared to 50%). In this study, 37 long (>10 kb) contigs were found to be misassembled because of the collapse of interspersed repeats (Figure 3.1). Tandem repeats are the source of another common assembly problem which is usually harder to resolve. It is very dicult for an assembler to determine the correct copy number of a tandem repeat as nearly-identical copies usually collapse into fewer copies. Figure 3.1b illustrates a collapsed repeat in which two identical copies are assembled into one. Note that again all reads may align perfectly, but probably inconsistent paired reads alignment and irregular coverage depths can reveal the misassemblies. 3.3 Strategies for handling repeats In assembly graphs (either overlap-layout-consensus or de Bruijn) or scaolding graphs, all copies of a repeat are initially collapsed into a single node (read,k-mer, or contig). Sequencing errors and repeat margins will show up as \branches" in these graphs. Complex repeats usually create complicated \tangles" in the graphs [14] that need to be untangled to solve the assembly and scaolding problems. To resolve such tangles, two major strategies have been used by dierent algorithms. In the rst and widely used approach, paired read libraries will be used to nd distal information between anking sequences (sequences on dierent sides of a repeat). As discussed in previous Chapter 3. Repeats 47 chapters, paired libraries fragments can have various length, approximately from 200 bps to 20 Kbps. Producing even longer fragments is possible using fosmid clones (30 to 40 Kbps) and BAC (bacterial articial chromosome) clones (150 Kbps). So if a paired read spans a repeat region, the assembling or scaolding algorithm can unambiguously traverse the repeat node to connect a unique sequence on one side of the repeat to another unique sequence on the other side of the repeat. So a fragment smaller than the repeat cannot be of any help to resolve the problem. Because of possible misalignments and sequencing errors in read libraries, most assembler and scaolders require more than one supporting read pairs to infer a link between two nodes in the graph. This is usually referred as \weight" or \support" of an edge in the graph processing algorithms. To see an example how long fragment libraries can help resolve some repetitive se- quences, we brie y discuss a study performed on the potato genome [39]. Being a highly repetitive sequence, potato genome length is 844 Mbps, 62% of which is con- sidered to be repetitive. The rst draft assembly of this genome was produced with a combination of 454 and Illumina reads. This draft genome was highly fragmented with tiny contigs (697 bpsN50) and small scaolds (8 KbpsN50). When reassembled (scaolded) using Illumina mate-pair reads (2 to 10 Kbps fragments), the scaolds' lengths grew linearly with the size of fragments. After using Sanger sequencing to create paired reads from fosmids (40 Kbps) and BACs (100 Kbps), the nal scaold N50 length was increased to 1.3 Mbps which shows a 100-fold increase. Chapter 3. Repeats 48 Analyzing the coverage depths is the second major strategy for dealing with repeats. Coverage depth statistics cannot help assemblers and scaolders to unambiguously resolve the repeats, but they can help in identifying the repeats which is the rst problem that needs to be solved. The key assumption in this approach is that the whole genome is uniformly covered by the input read library which is not a completely true assumption but it will help a lot in detecting repeats. For example, in a 60x coverage, one can assume that on average all non-repeat contigs must be covered 60 times. But the repeated contigs will have a substantially higher (deeper) coverage. The higher the coverage of a repeat contig, the stronger the evidence supporting it is a repeat. One can even design models to estimate the copy numbers of a repeat based on the coverage depth statistics. None of these strategies can completely solve the problems caused by repetitive se- quences, as the ultimate solution requires reads longer than most repetitive sequences [39]. In following chapters, we will present two scaolding approaches, one (Chapter 4) completely relying on short read paired data which is the mainstream today, and the other one (Chapter 5) which will be a hybrid scaolder that can incorporate long reads to improve assembly results and handle repeats better. Modeling and trying to incorporate repeats in the nal scaolds (instead of ignoring or removing them) will be the key aspect and main contribution of our works. Chapter 4 Repeat-Aware Scaolding using Paired Reads 4.1 Scaolding graph The input to the scaolder is a set of contigs produced by an assembler and a set of paired reads. First, we map the paired reads to the contigs. Bowtie2 [43, 44] is currently the best choice for mapping short read data. This mapping yields links between contigs that we represent as a scaolding graph. The graph structure is then used to formulate a mixed linear programming model. Figure 4.1 shows the overview of our suggested model. Dierent parts of this gure will be discussed in more details in this chapter. 49 Chapter 4. Repeat-Aware Scaolding using Paired Reads 50 Figure 4.1: A schematic overview of the main steps of the algorithm is depicted in this gure. Chapter 4. Repeat-Aware Scaolding using Paired Reads 51 4.1.1 Scaolding graph The input to the scaolding problem can be represented as a bidirected graph called the scaolding graph. Each contig is represented by a node in the graph. There is an edge between two contigs if there are paired reads linking them. Each edge is the representative of a group of paired-end links between contigs suggesting the same relative order and orientation between the adjacent contigs. The total number of such supporting links will be referred as support or weight of the corresponding edge. The edge is directed at both endpoints indicating the orientation of the contigs with regard to each other. So the edges are bidirected and the scaolding graph is a bidirected graph. A directed graph is a special case of a bidirected graph. Figures 4.2 and 4.3 show the four possible ways of linking two contigs and their corresponding interpretations. Here in these gures, paired reads are oriented as in Illumina paired- end data, where paired reads are sequenced from dierent strands of the genome. Another interpretation of bidirected edges sees them as one of the following three cases: (1) regular directed edges in which both ends agree with each other's direction, i.e. both point to one of the two adjacent nodes, (2) extraverted edges in which both ends are pointing outward, i.e. towards the adjacent vertices, and (3) introverted edges in which both ends are pointing inward, i.e. away from the adjacent vertices. Figure 4.4 shows an example of a scaolding graph using this interpretation. Chapter 4. Repeat-Aware Scaolding using Paired Reads 52 Figure 4.2: Bidirected edges. The four possible ways of linking two contigs with a paired-read and the bidirected edge representing each case is represented in this gure. <R1,R2> is a paired-read. It can be either paired-end (mapping FR by default) or mate-pair (mapping RF by default) (F:Forward, R:Reverse- complement). Figure 4.3: Undirected and directed interpretation of bidirected edges. The four possible bidirected edges in a scaolding graph and the possible undirected and directed interpretation of each case is depicted in this gure. For example, bidirected edge X>>Y is equivalent to an undirected edge between 3'-end of contig X and 5'-end of contig Y. In the directed scenario, it can be interpreted as a forward copy of contig X followed by a forward copy of contig Y, or the reverse-complement copy of contig Y followed by the reverse complement of contig X. Chapter 4. Repeat-Aware Scaolding using Paired Reads 53 Figure 4.4: An example of a scaolding graph. Another interpretation of bidi- rected edges sees them as one of following three cases: (1) regular directed edges in which both ends agree with each other's direction, i.e. both point to one of the two adjacent nodes, (2) extraverted edges in which both ends are pointing outward, i.e. towards the adjacent vertices, and (3) introverted edges in which both ends are pointing inward, i.e. away from the adjacent vertices. This gure shows an example of a scaolding graph using this interpretation. 4.2 Handling Repeats As emphasized in previous chapters, to the best of our knowledge, nearly all stan- dalone SGS scaolders discussed in Chapter 2, not only cannot solve repeats but also need to somehow identify and eliminate suspicious contigs before being able to go forward with ordering and orienting the remaining non-repeat contigs. Here in our Chapter 4. Repeat-Aware Scaolding using Paired Reads 54 model, we want to incorporate repeats into our model to be able to solve the most complex tangles in the scaolding graph. According to our model overview in Figure 4.1, our scaolding method is based on knowing the copy numbers (repeat counts) of all contigs. They can be estimated within the scaolder or calculated by other more sophisticated tools and be given to our model as an input le. Knowing all integer copy numbers of input contigs, we can convert the scaolding graph to an extended scaolding graph. In regular scaolding graph, the total number of graph vertices is equal to the total number of input contigs (that meet the minimum length criteria). But in the proposed extended scaolding graph, the total number of graph vertices will be equal to the total number of distinct contig copies. Figure 4.5 shows a small example in which both regular scaolding graph and extended scaolding graph are presented. Here is the main idea in building the extended contig graph. Non-repeat nodes will remain unchanged. But repeat nodes will be replaced with replicated identical copies. The number of these replicated identical nodes should be exactly the same as the copy number of the repeat node. Any edges that meet at least one repeat node (colored edges in Figure 4.5B) must be also replicated accordingly (Figure 4.5C). For example if one adjacent node of an edge is a non-repeat node and the other adjacent node is an n-copy repeat node, the edge should be replicated n times and although all of them share one adjacent node (the non-repeat node), the other adjacent node should be one of then identical copies of the repeat node. Similarly, if one adjacent node of an Chapter 4. Repeat-Aware Scaolding using Paired Reads 55 Figure 4.5: Expanded scaolding graph to handle repeats. (A) This gure shows a sequence of contigs on a genome in which contig X is a collapsed repeat contig with two copies. The rst instance of X is sitting between contigs B and C while its second instance is sitting between contigs D and E. (B) This gure shows the corresponding scaolding graph of contigs depicted in part A. Without loss of generality edge weights and edge types are eliminated, just for simplicity. The dashed edge shows an edge between contigs A and X spanning contig B in between. (C) This gure shows the expanded version of scaolding graph of part B. This is the graph we process to build and solve the mixed integer linear programming model. Colored edges are the edges that are extended to the number of copy number of repeat contig X. edge is ann-copy repeat node and the other adjacent node is another m-copy repeat node, the edge should be replicated nm times. Figure 4.5A shows a sequence of contigs on a genome in which contig X is a collapsed repeat contig with two copies. The rst instance of X is sitting between contigs B and C while its second instance is sitting between contigs D and E. Figure 4.5B shows the corresponding scaolding Chapter 4. Repeat-Aware Scaolding using Paired Reads 56 graph. For simplicity, edge weights and edge types are eliminated in this gure. The dashed edge shows an edge between contigs A and X spanning contig B in between. The expanded version of this scaolding graph is depicted in Figure 4.5C. Colored edges are the edges that are extended to the number of copy number of repeat contig X. This is the graph we process node by node and edge by edge to build and solve the mixed integer linear programming model which will be discussed in the next section. 4.3 Model Formulation In this section, we will develop a mixed integer linear programming (MILP) formu- lation for the scaolding problem. Suppose there are n nodes (contigs) and m edges (paired read link bundles) present in the scaolding graph. Let us denote the length of contigX byL X . LetG be an estimate on the upper bound of the length of the en- tire genome or the longest chromosome. G is one of the inputs in our proposed model (Figure 4.1). The solution to the MILP problem will assign values to the following variables. 4.3.1 Contig Variables For each contigX, we have considered two variables one of which is real-valued while the other one is a binary variable. Chapter 4. Repeat-Aware Scaolding using Paired Reads 57 Contig Position Variable P X will be the variable showing the position of contig X in the genome (1 P X G). The actual position of the contig should be an integer value, but as integer variables make the MILP problems more complex and more time-consuming to be solved (sometimes impossible to solve), the position of the contigs in the nal solution is modeled as a real-valued number. The position is always the location that aligns with the beginning of the contig. So if the contig aligns to the scaold in reverse- complement direction, the position is still the location where the beginning of the contig aligns. Therefor, there will be n contig position variables in our formulation. Contig Orientation Variable O X will be the binary variable showing the orientation of contig X in the solution, 0 meaning the forward orientation and 1 meaning the reverse-complement orientation. So there will be n contig orientation variables in our formulation. 4.3.2 Edge Satisfaction Variable E will be a binary variable for edge E in the scaolding graph, 0 meaning the constraints imposed by this edge is not satised in this solution, 1 meaning that the constraints imposed by this edge is satised in the current solution. So there will be m edge variables in our formulation. If there were not any errors in the inputs of Chapter 4. Repeat-Aware Scaolding using Paired Reads 58 our problem (reads, contigs, mappings, etc.), we might not need this additional set of variables as all constraints would be concordant with each other and the solution could be trivially found eciently. 4.3.3 Distance Variable D E will be the distance variable for edgeE in the scaolding graph. It is real-valued and its upper-bound and lower-bound are set according to the insert library statistics, e.g. [ 2; + 3] when and are mean and standard deviation of insert library respectively. The upper bounds and lower bounds can be adjusted for each edge E dierently according to the exact mapping positions and its supporting read pairs. 4.3.4 Overlap Prevention Variable xy will be a binary variable for contig pair (X;Y ) in the scaolding graph, 0 meaning in the solution contigX precedes contigY , 1 meaning in the solution contigX follows contig Y . So there will be n(n 1)=2 gamma variables in our formulation. 4.3.5 Objective Function Our objective is to minimize the number of edges whose imposed constraints are not satised. Where the sum is over all edges in the scaolding graph, max P E (! E E ) shows the objective function in our formulation where E is the edge satisfaction Chapter 4. Repeat-Aware Scaolding using Paired Reads 59 variable of edge E and ! E denotes weight (support) of edge E. This is equivalent to maximizing the number of satised pairs-reads. 4.3.6 Constraints Imposed By Edges Now that we presented the variables and the objective function of our problem formu- lation, we need to get familiar with the imposed constraints that should be satised in the model solution. The majority of constraints are added after processing the edges of the scaolding graph. All edges of the scaolding graphs should be processed one by one and after processing each one, a set of orientation and positioning constraints will be added to the model. Each case (Figure 4.6) imposes dierent set of constraints. Constraints on Orientations For fully directed edges (case 1 and case 4 in Figure 4.6), we should have O x = O y which means orientations of adjacent contigs should be the same as each other. Both of them would be either forward or reverse complement. Inequalities 4.1 show the corresponding constraints added to the model. Chapter 4. Repeat-Aware Scaolding using Paired Reads 60 Figure 4.6: Constraints imposed by four possible bidirectional edges. For each case, the two possible placements of contigs X and Y are presented. The top one would be the forward scenario and the bottom one would be the dual reverse- complement scenario. Chapter 4. Repeat-Aware Scaolding using Paired Reads 61 O x O y (1) 0 O x O y + (1) 0 (4.1) Note that if the edge satisfaction variable is set to 0 in the solution, both imposed constraints would be satised automatically for all possible orientations. But if the edge satisfaction variable is set to 1 in the solution, O X and O Y should have the same binary value to satisfy both constraints imposed by inequalities 4.1. For introvert (case 2 in Figure 4.6) and extrovert (case 3 in Figure 4.6) edges, we should haveO x 6=O y which means orientations of adjacent contigs should not be the same as each other. If one of them is set to be forward the other one should be set to be reverse-complement. Inequalities 4.2 show the corresponding constraints added to the model. O x +O y (1) 1 O x +O y + (1) 1 (4.2) Note that if the edge satisfaction variable is set to 0 in the solution, both imposed constraints would be satised automatically for all possible orientations. But if the Chapter 4. Repeat-Aware Scaolding using Paired Reads 62 edge satisfaction variable is set to 1 in the solution,O X andO Y must have dierent binary values to satisfy both constraints imposed by inequalities 4.2. Constraints on Positions Similarly, constraints on contigs positions and distance variables will be added to the model according to the type of the corresponding edge. Again we present the imposed constraints case by case according to the cases depicted in Figure 4.6. For case 1, we must havejP x P y j =L x +D. Four inequalities must be added to the model to impose this constraint. For the forward scenario (O x =O y = 0), inequalities 4.3, and for the reverse-complement scenario (O x =O y = 1), inequalities 4.4 capture the imposed constraint. One can note that for the forward scenario, inequalities 4.4 and for the reverse-complement scenario, inequalities 4.3 are automatically satised. Also setting the edge satisfaction variable to 0 in the solution, automatically satises all four inequalities, regardless of values assigned to orientation, position, and distance variables in these inequalities. P x +D +L x O x :1O y :1 (1):1P y P x +D +L x O x :1 +O y :1 + (1):1P y (4.3) P y +D +L x (1O x ):1 (1O y ):1 (1):1P x P y +D +L x + (1O x ):1 + (1O y ):1 + (1):1P x (4.4) Chapter 4. Repeat-Aware Scaolding using Paired Reads 63 For case 2, we must havejP x P y j =L x +D +L y . Four inequalities must be added to the model to impose this constraint. For the forward scenario (O x = 0 and O y = 1), inequalities 4.5, and for the reverse-complement scenario (O x = 1 and O y = 0), inequalities 4.6 capture the imposed constraint. One can note that for the forward scenario, inequalities 4.6, and for the reverse-complement scenario, inequalities 4.5 are automatically satised. Also setting the edge satisfaction variable to 0 in the solution, automatically satises all four inequalities, regardless of values assigned to orientation, position, and distance variables in these inequalities. P x +D +L x +L y 1O x :1 (1O y ):1 (1):1P y P x +D +L x +L y 1 +O x :1 + (1O y ):1 + (1):1P y (4.5) P y +D +L x +L y 1 (1O x ):1O y :1 (1):1P x P y +D +L x +L y 1 + (1O x ):1 +O y :1 + (1):1P x (4.6) For case 3, we must havejP x P y j = D. Four inequalities must be added to the model to impose this constraint. For the forward scenario (O x = 1 and O y = 0), inequalities 4.7, and for the reverse-complement scenario (O x = 0 and O y = 1), inequalities 4.8 capture the imposed constraint. One can note that for the forward scenario, inequalities 4.8, and for the reverse-complement scenario, inequalities 4.7 are automatically satised. Also setting the edge satisfaction variable to 0 in the solution, automatically satises all four inequalities, regardless of values assigned to Chapter 4. Repeat-Aware Scaolding using Paired Reads 64 orientation, position, and distance variables in these inequalities. P x +D + 1 (1O x ):1O y :1 (1):1P y P x +D + 1 + (1O x ):1 +O y :1 + (1):1P y (4.7) P y +D + 1O x :1 (1O y ):1 (1):1P x P y +D + 1 +O x :1 + (1O y ):1 + (1):1P x (4.8) For case 4, we must havejP x P y j =D +L y . Four inequalities must be added to the model to impose this constraint. For the forward scenario (O x =O y = 1), inequalities 4.9, and for the reverse-complement scenario (O x =O y = 0), inequalities 4.10 capture the imposed constraint. One can note that for the forward scenario, inequalities 4.10, and for the reverse-complement scenario, inequalities 4.9 are automatically satised. Also setting the edge satisfaction variable to 0 in the solution, automatically satises all four inequalities, regardless of values assigned to orientation, position, and distance variables in these inequalities. P x +D +L y (1O x ):1 (1O y ):1 (1):1P y P x +D +L y + (1O x ):1 + (1O y ):1 + (1):1P y (4.9) P y +D +L y O x :1O y :1 (1):1P x P y +D +L y +O x :1 +O y :1 + (1):1P x (4.10) Chapter 4. Repeat-Aware Scaolding using Paired Reads 65 No Pairwise Overlap Constraints To explicitly prevent contigs from sharing the same position or overlapping each other (more than maximum valid overlap length ), we can add the following extra set of constraints. If contig X precedes contig Y ( xy = 0) and if (O x = 0 and O y = 0), P y P x L x must be added to the constraints, if (O x = 0 and O y = 1), P y P x L x +L y 1 must be added to the constraints, if (O x = 1 and O y = 0), P y P x 1 must be added to the constraints, if (O x = 1 and O y = 1), P y P x L y must be added to the constraints. Putting all together, inequalities 4.11 capture all aforementioned inequalities for pair- wise overlap constraints. P y P x (1O x ):L x +O y :L y +OxOy xy :1 P x P y (1O y ):L x +O x :L x +OyOx (1 xy ):1 (4.11) Repeat Constraints As discussed in section 4.2 and Figure 4.5, node replications and edge replications are key elements of constructing the extended scaolding graph and this is the graph Chapter 4. Repeat-Aware Scaolding using Paired Reads 66 we processed node by node and edge by edge so far to build and solve the mixed integer linear programming model. According to our formulation, each edge in the extended graph is assigned a binary satisfaction variable and the objective function in our formulation tries to make as many as possible satised edges. But one should note that among the replicated copies of a regular edge in the original scaolding graph, at most one edge copy can be satised and all other satisfaction variables should be set to zero. For example in Figure 4.5C, at most one edge in each colored group can be set to one in the solution. So constraints 4.12 for all edges e within a replicated group, should be also added to our model constraints. X e e 1 (4.12) 4.4 Results and Conclusion In this section, we go through the results of some carefully simulated test cases. As we wanted to control the count, location, length, and relative order of repeats, we developed a variety of simple (no repeats) to complex (several repeats) hypothetical genomes. Appendix Figures A.1 through A.13 show the regular scaolding graphs and the corresponding extended scaolding graphs. Figures' captions specically describe the repeat structure of each test case. We tried to make the simple test case of Figure A.1 more complex in each step by adding repeat structures to it, increasing the repeat Chapter 4. Repeat-Aware Scaolding using Paired Reads 67 counts, changing the orientations of some repeats, making some tandem repeats, and also putting some complex repeat of repeat structures in the graph. Although the lengths of these articial genomes are not very long, due to the presence of articial repeats in these genomes, none of the standalone scaolders can fully solve these test cases. Because highly covered or high-degree nodes in the scaolding graph that makes tangles in the scaolding graph are usually removed from the graph. Even ignoring the repeats will put just one copy of them in the nal solution and location of other copies will be considered as gaps. To the best of our knowledge, our model is the only one that can handle these repeat structures. Removing suspicious repeat nodes and their connected edges makes the graph more sparse and disconnected and the number of connected components will increase. The total number of connected components in the scaolding graph puts a lower limit on the number of nal scaolds. But this repeat removal with other possible pre- processing like graph partitioning or following a totally greedy approach makes the algorithm more scalable and more ready to be applicable to real genomes especially the more complex mammalian ones. Currently, our focus is on making our method more scalable and practical on real genomes. Figure 4.7 shows an articial repetitive genome (16,443 bps long) used as one of our test cases. Corresponding regular and extended contig graphs are depicted in Figure 4.8 and 4.9 respectively. Several repeat contigs (colored in this gure) are harbored in this small articial genome, including tandem repeats and repeats with dierent Chapter 4. Repeat-Aware Scaolding using Paired Reads 68 Figure 4.7: An articial repetitive genome structure used as one of our test cases. Several repeat contigs are drawn in color. This articial repetitive genome harbors many repeat structures including the tandem repeats and repeats with dierent orientations. Our proposed model could successfully assign correct orientation and relative order to all input contigs. orientations. Our proposed model could successfully assign correct orientation and relative order to all input contigs. Illumina short reads (paired-end reads) used in these articial test cases (including test cases presented in Appendix) were simulated using SimSeq [45] short read library simulator with 500 bp mean insert sizes and 20 bp standard deviation. We used Bowtie2 [43, 44] to map paired-end reads onto contigs. In section 2.4 we discussed the scaolding quality assessment criteria, including the criteria regarding both the contiguity and the accuracy of results. For example, we mentioned that N50 is the widely used contiguity measure and mapping to the available reference genome is the best accuracy checking method when the reference genome is available (only for quality assessment and not for scaolding). But in our articial test cases that we constructed the repetitive genome and contigs ourselves and we know exactly the orientation and location of each contig (including the repeat ones), we can just check the values assigned to contig position variables (P X ) and Chapter 4. Repeat-Aware Scaolding using Paired Reads 69 contig orientation variables (O X ) and see whether the orientations and relative orders are correct or not. We implemented our scaolder in C++ and used Gurobi Optimization software 1 [46] to solve the constructed mixed linear integer programming models. 4.4.1 Equivalent Candidate Solutions When repeat is not ignored and not removed from the scaolding graph and especially plays an important role in correctly modeling the problem to capture the underlying truth, there is a possibility that there are several equivalent solution candidates that the scaolding graph of all of them and hence the constructed mixed integer linear programming would be exactly the same. Figure 4.10 shows a simple scenario in which all four possible solutions would have identical scaolding graph (duality of forward and reverse complement scenarios). All of these four dierent orderings have the same scaolding graph and hence same mixed integer linear programming model. So one might need to somehow enumerate the model solution to have the actual set of solution candidates. Paired reads from longer insert libraries or other distant information can help to pick the right solution among the pool of equivalent candidate solutions. Having paired reads from both short insert (e.g. Illumina paired-end) and long insert (e.g. Illumina mate-pair) libraries would decrease the probability of such situations. 1 Free academic license of its C++ API. Chapter 4. Repeat-Aware Scaolding using Paired Reads 70 Figure 4.8: Scaolding graph of test case of Figure 4.7. Edge weights are the paired-read support count of each edge. Node labels follow \contig- id(repeats)length" format. Sequence of contigs 3' (reverse of 3), 6, and 7 (grayed out) is repeated twice between contigs 5 and 8. Contig 9 (grayed out) is a repeat contig with two copies, one forward copy between contigs 8 and 10, and one reverse- complement copy between contig 19 and a copy of contig 20. Contig 20 (grayed out) is a tandem repeat contig with two tandem copies (one forward and the other one reverse-complement) both of which are sitting between contig 21 and a copy of contig 9. Chapter 4. Repeat-Aware Scaolding using Paired Reads 71 Figure 4.9: Extended graph representation of Figure 4.8. For the sake of sim- plicity, edge weights have been deleted. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Node labels follow \repeat-id:contig-id(repeats)length" format. Chapter 4. Repeat-Aware Scaolding using Paired Reads 72 Figure 4.10: Equivalent solution candidates. All of these four dierent orderings have the same scaolding graph and hence same mixed integer linear programming model. So one might need to somehow enumerate the model solution to have the actual set of solution candidates. Paired reads from longer insert libraries or other distal information can help to pick the right solution among the pool of equivalent candidate solutions. Chapter 5 Hybrid Genome Scaolding 5.1 Third Generation Sequencing (TGS) For a long time, Sanger sequencing chain termination method was dominating all other DNA sequencing technologies [47]. Announced in 2001, sequencing the human genome was a milestone in DNA sequencing which triggered and paved the road for further improvements and advancements. Then Next Generation Sequencing (NGS) methods made less expensive and parallel high-throughput sequencing a reality. Pro- ducing huge numbers of reads with lengths comparable to Sanger sequencing in a single run, 454 pyrosequencing [48] also became very popular. Later on, Illumina sequencers made outputting billions of reads in a single run possible which are still widely used and so popular. But this came at a price of having much shorter reads. Recent generation Illumina sequencers can produce high-quality reads which are 200 73 Chapter 5. Hybrid Genome Scaolding 74 to 300 bps long. Although presented and got popular with the name \NGS", nowadays these technologies are better to be referred as SGS (second generation sequencing). Ion Proton (a sequencer made by Ion Torrent company) is one of the most recent SGS sequencers [49]. Contrary to Sanger, 454, and Illumina sequencing, Ion Proton is a less expensive sequencer that provided a cheap and fast sequencing technology not relying on optical methods. As discussed in previous chapters, despite the cheap price tag, massively high-throughput performance, and high quality reads with low error rates, SGS libraries cannot be an ideal input for automatic de novo genome assembly and sequencing projects because of their short read lengths. Recently, third-generation sequencing (TGS) has become available. Nanopore se- quencing and Single Molecule Real-Time (SMRT) sequencing are proposed by lead- ing companies to produce much longer reads. Pacic Biosciences (PacBio) sequencers (e.g. RS I, RS II, and Sequel) use SMRT sequencing [50] while Oxford Nanopore Technologies (ONT) has developed a machine called MinION which uses nanopore sequencing [51]. Contrary to SGS methodologies, during TGS library preparation, there would be no amplication step and that is why these methods are called single molecule sequencing methods. Moreover, in TGS, much longer reads can be pro- duced with average read lengths more than 6 to 8 Kbps and longest reads as long as 30 to 150 Kbps [52]. In following subsections, we will brie y explain the underlying biotechnology and key aspects of PacBio's SMRT and ONT's MinION systems. TGS makes easy and fast productions of much longer reads possible. As we discussed Chapter 5. Hybrid Genome Scaolding 75 in previous chapters, de novo reconstruction of complete genomes using SGS short reads has remained a challenge. Although SGS libraries provide very high sequencing coverage with highly accurate reads, the short length of SGS reads cannot overcome technical problems caused by repetitive sequences. For example, in de Bruijn graph- based assemblers, SGS reads are broken down into shorter k-mers (usually 21 to 101 bps long). For genomes containing repeats longer the length of a k-mer (which is almost always the case), it would be impossible to resolve the repeats. Later on in this chapter, we will explain how TGS long reads can be used in de novo genome assembly and/or de novo scaolding of high-quality pre-assembled SGS contigs. For many organisms, there are multiple SGS draft assemblies most of which are highly fragmented. One can use newly available TGS long read libraries to upgrade such draft genomes by orienting, ordering, and distancing (or joining) such contigs. Higher quality consensus sequences can even be used for closing the remaining sequencing gaps in the nal draft genome. Long reads produced by PacBio SMRT sequencing have been shown to be powerful in resolving long repeats and are becoming the gold standard in genome sequencing of prokaryotes [53]. The resulted PacBio SMRT sequencing prokaryotic draft genomes were comparable in contiguity and accuracy to Sanger sequencing ones. It has been also shown that PacBio SMRT long reads could resolve some complex regions of the chimp and the human genome [54, 55]. But TGS cannot completely replace SGS because of its relatively high error rate. It is believed that for many genomics, Chapter 5. Hybrid Genome Scaolding 76 metagenomics, and phylogenomics research questions, SGS will remain state of the art for several years. 5.1.1 Pacic Biosciences Single-Molecule Real-Time (SMRT) sequencing The idea of synthesizing a strand of DNA by directly observing a single molecule of polymerase was rst implemented by PacBio. Although this idea was rst presented in 2001, it has been unimplemented because it was unknown how it can be possible to conne the enzyme to an observable volume which is small enough to call bases accurately while they are incorporated one by one. Zero-Mode Waveguide (ZMW) is the key structure in SMRT sequencing that made implementation of this idea possible [56]. ZMWs are like super tiny wells with a diameter of tens of nanometers on a 100 nm metal lm deposited on a glass substrate. ZMWs force the laser light ( 600 nm wavelength) to exponentially decay as it enters the wells, thus they prevent the laser light from passing entirely through. Inside every ZMW (at the bottom) a single polymerase molecule is attached using biotin/streptavidin interaction. Then all four types of nucleotides which are uorophore-labeled (dierent color for each nucleotide) are poured over an array of ZMWs. In a process called diusion, dierent types of nucleotides surround the polymerase to get correctly incorporated one by one. When Chapter 5. Hybrid Genome Scaolding 77 a specic nucleotide is added to the growing sequence strand, its uorescent tag re- mains in the bottom of the well emitting powerful light for enough time to make it recognizable against diusion noise. Sequencing sensors can distinguish dierent spikes of colors emitted from each ZMW for base calling. SMRT sequencing is per- formed in parallel on several SMRT cells each of which contains tens of thousands of ZMWs. Released in 2011, PacBio RS I was the rst commercialized SMRT sequenc- ing instrument that was able to read 8 SMRT cells with 150,000 (two sets of 75,000) ZMWs. In 2013, PacBio RS II was released which was capable of using all 150,000 ZMWs concurrently, thus reaching a throughput of 400 Mbps per run. Compared to SGS platforms, SMRT sequencing is much faster (minutes instead of days) as it does not have any \scan-and-wash" step and requires fewer amounts of chemicals and sample preparation. Furthermore, as there is no PCR amplication required, systematic amplication biases can be avoided in SMRT sequencing. Using the latest chemistry (P5-C3) in SMRT sequencing, an average read length of 8.5 Kbps and a maximum of 30 Kbps can be achieved. This opens the doors to new possibilities for de novo genome assembly, de novo genome scaolding, haplotypes detection, and phasing of entire chromosomes. PacBio platforms can provide two types of long reads: (1) Continuous Long Reads (CLR) and (2) Circular Consensus Sequencing (CCS) reads. CLRs are the typical long reads produced by PacBio platforms. CCS reads are shorter (<1.5 Kbps) than CLRs but they have better error rates (<5%) and are considered highly accurate Chapter 5. Hybrid Genome Scaolding 78 Figure 5.1: In Single Molecule Real-Time (SMRT) sequencing the emission spec- tra of uorescent labeled nucleotides are detected while being incorporated by the polymerase [52]. compared to error-prone CLRs. CCS reads can be used as an alternative to the traditional Sanger sequencing reads. Together with highly accurate SGS short reads, CCS reads can be used to perform error-correction on long CLRs. SMRT sequencing has a great potential to become the sequencing technology of the future if its major drawbacks can be resolved in a near future. The most important challenge with PacBio's CLR reads is their high error rate which can range from 10% to 15% dominated by insertions (8{12%) and deletions (1{2%). This is especially problematic when sequence alignment or sequence assembly must be performed using these reads. Moreover, performing sample preparation and purchasing sequencing instruments, reagents, and supplies are still quite expensive (price-to-throughput ratio of $0.33{$1.00 per 1 million base pairs in PacBio SMRT sequencing compared to $0.05{$0.15 in Illumina sequencing [57]). Chapter 5. Hybrid Genome Scaolding 79 5.1.2 Oxford Nanopore Single-Molecule Sequencing Oxford Nanopore Technologies manufactured MinION which is the rst device imple- menting the idea of nanopore sequencing. MinION devices are as small as an oce stapler or a small cell phone that can be plugged into the USB port of a digital device (e.g. a tablet or a laptop). This makes them an ideal portable sequencing device (Figure 5.2). In April 2016, NASA (National Aeronautics and Space Administration) announced that they were going to use a MinION device in the International Space Station to test the possibility of performing DNA sequencing in space. This opens the doors to monitor changes in microbes and humans genomes in response to space ight and it may pave the road towards detection of life elsewhere in the universe [58]. Biological nanopores (created by some toxins like the staphylococcal a-hemolysin protein) within a phospholipid bilayer are the key elements in nanopore sequencing. Nanopores are small channels measuring a few nanometers in diameter. When the phospholipid bilayer is placed in a salt solution, electrodes can form an ionic gradient. As we know DNA molecules are negatively charged and are pulled by this gradient to pass through the nanopore. When each nucleotide passes through the nanopore, the amplitude of the ion current changes (decreases) in a characteristic way specic to the nucleotide type. This characteristic change can be detected and read by a sensor making base calling possible. MinION has 512 channels containing nanopores, each detecting10 bps per second. For library preparation, a hairpin adapter is ligated to one end of the DNA molecule, Chapter 5. Hybrid Genome Scaolding 80 while a motor protein is ligated to the other end. The motor protein is responsible for ratcheting the DNA molecule through the nanopore and making the DNA single- stranded (Figure 5.3). But it should be emphasized that the two strands of one molecule remain connected by the hairpin adapter at the end. In an ideal scenario, after the rst strand, the hairpin adapter and the other strand should pass through the nanopore. This method is called \strand sequencing". Because of the length of the nanopore channel and the overall speed of the afore- mentioned strand sequencing process, always more than one nucleotides are present in the nanopore at a time. That is why usually signals of overlapping and sliding 5-mers are recorded. Thus, the base calling software (which is cloud-based and called MinKNOW) needs to distinguish 1024 (4 5 ) dierent ionic current states. Reads produced by nanopore sequencing are reported to have high error rate (ranging from 25% to 40%) so far. Recently published studies show a sequence production rate of 90 to 490 Mbps per 48 hours in which the mean read length was around 6 Kbps and the maximum reported read length was around 150 Kbps [52]. 5.2 Hybrid Assembly and Scaolding The inherent drawbacks and limitations of sequencing technologies are sill hindering the automatic accurate reconstruction of complete genomes. That is why results of Chapter 5. Hybrid Genome Scaolding 81 Figure 5.2: The MinION [59]. Image credit: Oxford Nanopore Technologies. Figure 5.3: Ratcheting of a DNA strand through a biological nanopore [52]. Chapter 5. Hybrid Genome Scaolding 82 advanced de novo genome assembly and de novo genome scaolders are still frag- mented drafts which are far from being accurate and complete. Due to their low error content, SGS reads can be assembled into highly accurate contigs. But as short SGS reads cannot resolve most repeats, the resulted contigs would be fairly small and any attempt to scaold these contigs which only relies on SGS libraries (even long insert paired ones) may lead to creation of incorrect joins between contigs and invalid rearrangements within the scaolds which may badly aect all downstream analyses. On the other hand, long TGS reads can be used to easily resolve many large repeats. But due to the high error rate in TGS reads, to use them for de novo assembly, huge sequencing coverages are required. Error correcting long TGS reads using short but highly accurate SGS reads or contigs can also be a time-consuming approach to coping with the current problem of errors in pure TGS libraries. Utilizing the best of both worlds, SGS and TGS libraries can be used together in a \hybrid model" for de novo genome assembling and scaolding. Generally, the process of assembling genomes using datasets containing reads from both sequencing generations is called \hybrid assembly". ALLPATHS-LG was the rst hybrid assem- bler to use Illumina paired-end and mate-pairs libraries together with PacBio long reads to reconstruct complete microbial genomes by assembling the Illumina reads using a de Bruijn graph-based approach and resolving repeats and lling the gaps with PacBio long reads [60, 61]. Tested on sixteen genomes, it could manage to completely assemble four genomes. In the other twelve genomes, some large repeats remained unresolved. One of the limitations of this method is that besides Illumina's Chapter 5. Hybrid Genome Scaolding 83 short paired-end libraries, it requires both PacBio long read and Illumina's mate-pair (jumping) libraries for a single run. Following another direction, some hybrid tools try to correct errors in long PacBio reads rst by mapping a high coverage accurate short read library (like Illumina, 454, or PacBio CCS reads) onto them. Then these error-corrected long reads can be assembled into signicantly large genomes. But even for small genomes, this error correction step can be very computationally intensive, thus almost impractical for large complex genomes. Moreover, short reads sequenced from dierent sections of the genome which are very similar to each other may create spurious adjacencies that can lead to misassemblies. Besides being useful in hybrid assembly models, PacBio long reads can be used in a genome nishing tool to ll in the gaps of almost-complete scaolds. PBJelly [62] is a standalone nishing tools which can yield impressive results even with low coverages. Using PacBio uncorrected long reads (CLRs) to scaold (as a standalone scaolder) high-quality pre-assembled contigs (made by de novo SGS assemblers) would be the best approach to follow due to its simplicity, speed, and ease of use. Currently, AHA [63] and SSPACE-LongRead [64] are the only implementation of this approach to de novo hybrid assembly. In both standalone scaolders, rst, PacBio long reads are mapped onto pre-assembled input SGS contigs using BLASR [65]. By analyzing mapping coordinates, connection links between contigs can be inferred. AHA stores this linkage information in a graph, while SSPACE-LongRead stores them in a hashing Chapter 5. Hybrid Genome Scaolding 84 data structure (similar to short read version of SSPACE). This hybrid approach can be fast and easy to setup and run as minimal preprocessing and parameterization is required and there is no limitation on the choice of pre-assembled contigs or long read libraries. 5.3 Our Proposed Model To break the plateau in de novo genome scaolding, we designed and implemented a hybrid approach in which the pre-assembled contigs made by SGS short-read libraries are scaolded by uncorrected TGS long reads. We used short reads produced by Illumina second-generation sequencers and long reads produced by Pacic Biosciences third-generation sequencers to evaluate our proposed model. This hybrid approach in scaolding draft genomes would be cheap, fast and reliable. Scalability would be a key factor as we want to use the advantage of long reads to scaold more complex genomes later. In brief, our model consists of three main steps depicted in Figure 5.4: (1) Alignment of pre-assembled contigs (or scaolds) against the long reads, (2) Building contig sequences that we call chains in our implementation, and (3) Assembling the contig sequences (chains) to produce the nal scaolds. Figure 5.4C shows what we exactly mean by contig sequences (chains). Contig sequences (chains) are partially ordered and oriented contigs inferred from each aligned long read. These contig sequences can be seen as mini-scaolds. Chapter 5. Hybrid Genome Scaolding 85 Figure 5.4: Overview of our hybrid approach. According to Figure 5.4 each contig sequence (chain) can be seen as a sequence on a non-ACGT alphabet. Distinct oriented contigs set would be the new alphabet instead. In other words, using this proposed general model, scaolding can be seen and modeled as a new assembly problem in which new sequences of contigs (instead of sequences of nucleotides) should be assembled to form the nal consensus. Figure 5.5 shows a toy example describing how the set of contig sequences (chains) can be assembled to form a consensus which will be the nal solution scaold(s). Figures 5.4 and 5.5 just show the general idea of our model's approach. Following gures and explanations clarify more details on how we want to break the plateau in de novo Chapter 5. Hybrid Genome Scaolding 86 genome scaolding. 5.3.1 Hybrid Input Libraries In this hybrid model, we need both SGS and TGS libraries as inputs. SGS higher quality short-read libraries are used in de novo assembly (usually by de Bruijn graph- based assemblers) which outputs high-quality contigs. Ideally, there should not be any misassemblies in such contigs and they should be trimmed at the unresolved re- peats boundaries. Such contigs together with TGS long read libraries are the main input of de novo scaolders which are the main focus of this thesis and this chapter. Ideally, scaolders should be able to produce as few as the total number of chromo- somes. Orientation, relative order and pairwise distance of consecutive contigs are estimated in scaolding. Both SGS and TGS libraries (either the same input libraries or any new unused ones) can be used in the nal step which is the nishing step in which sequencing gaps are closed and `N's in the output scaolds are lled using the information in the libraries. Figure 5.6 shows the general structure of a hybrid approach using both SGS and TGS libraries to produce the nal draft genome. 5.3.2 Aligning Long Reads Figure 5.7 shows a comprehensive work ow diagram presenting key steps in our pro- posed hybrid model. We will describe each step in details in this and following subsec- tions. In the rst step, one needs to align TGS long reads against the pre-assembled Chapter 5. Hybrid Genome Scaolding 87 Figure 5.5: A toy example showing the idea of assembling the contig sequences (chains) to build the nal scaold(s). Chapter 5. Hybrid Genome Scaolding 88 Figure 5.6: The general structure of a hybrid approach using both SGS and TGS libraries to produce the nal draft genome. SGS contigs. Recently many traditional and widely used rst- and second-generation mappers claim to be able to deal with third-generation libraries. BWA (BWA-SW and BWA-MEM) [66], Bowtie2 [67], BLAT [68], SMALT [69], BBmap [70] and MUM- mer [71] are all claimed to work with PacBio reads. LAST [72, 73] is also claimed to produce promising results for mapping Oxford Nanopore reads. As we use PacBio CLR long reads (with15% error rate) in our evaluations (next chapter), we recommend PacBio's native mapper, Basic Local Alignment with Suc- cessive Renement (BLASR) [65], to be used for mapping PacBio's CLR long reads against pre-assembled SGS contigs. Existing SGS alignment methods are either too Chapter 5. Hybrid Genome Scaolding 89 inecient for TGS long read datasets, or not sensitive enough to align single-molecule sequencing reads, which have a higher error rate than second-generation sequencing. The resulting alignment le contains information regarding alignments of all contigs against all long reads. Alignment score, alignment identity percentage, mapping qual- ity, alignment length and alignment coordinates (start and end positions) on contigs and on long reads are the most important pieces of information (besides long reads and contigs names and lengths) we incorporate into our model to be used in further steps. 5.3.3 Contig Chains The second step in the work ow of our proposed model is to process and analyze alignment information from the rst step to build as many as possible unique contig sequences that from now on we will simply call chains. Chains can be seen as mini- scaolds or scaold seeds that will grow and get rened step by step to make the nal scaolds set. In bacterial genome, this will be ideally a single scaold representing the whole genome. Figure 5.8 shows how chains are created by processing alignment information. In this step, each single long read has the potential to be translated to a new chain in our model. According to Figure 5.8, each bead of a chain conveys the alignment information of a single contig to that specic long reads. Chains which are useful for scaolding purposes must have at least two beads. The longer the TGS Chapter 5. Hybrid Genome Scaolding 90 Figure 5.7: A comprehensive work ow diagram presenting key steps in our pro- posed hybrid model. Chapter 5. Hybrid Genome Scaolding 91 read (compared to SGS contigs), the longer and the better the constructed chain for scaolding unscaolded contigs and solving correspondent repeats. Alignment information in beads of chains are summarized and stored in data struc- tures for further lookup. Distances, supports and co-occurrences are important in- formation to be processed and stored as depicted in Figure 5.7. Distance shows the distance between the end of a preceding contig and start of the immediately follow- ing contig in the chain. It is very important to know that based on the underlying assembly algorithm of de novo SGS assemblers, the distance between two consecutive contigs might be negative implying an overlap between the tail of the preceding contig and head of the following contig. As mentioned above, supports and co-occurrences are two other important informa- tion that should be stored in appropriate data structures for further lookup. Support is a non-negative integer representing how many long reads in the input TGS library, induce a specic pairwise order and orientation. For example if support of chain link X|Y 1 is equal to n, it means in exactly n long reads (chains), contig X is immediately followed byY or contigY is followed byX (the reverse-complement dual scenario). Similarly, co-occurrences is a non-negative integer representing in exactly how many long reads (chains) two contigs are co-occurred with a specic pairwise order and orientation. One should note that both pairwise orders and orientations play a crucial 1 Throughout this thesis, for any DNA sequence A, A or A 0 represents its reverse-complement. Chapter 5. Hybrid Genome Scaolding 92 role in these denitions as support and co-occurrence ofX|Y may be dierent from those of Y |X and X|Y . Figure 5.8: Contig chains. For the sake of brevity and scalability, we only store unique chains. This means that after constructing a chain out of contigs alignments against a long read and storing information regarding distances, supports, and co-occurrences, we will discard a chain if we have already a similar chain or a longer chain covering this chain in our chains repository. E.g. if the new chain would be X 1 |X 2 |X 3 and we already have a chainX 1 |X 2 |X 3 |X 4 , we can discard the newly constructed chain and only update information regarding distances, supports, and co-occurrences. Using this summarization technique in E. coli 2 , alignments of more than 350,000 long reads can be represented with about 200 chains. One should also note that aligned portion of almost all contigs in a chain, except the rst and last contigs in the chain, should be relatively high (>90%). But rst and 2 Escherichia coli Chapter 5. Hybrid Genome Scaolding 93 last contigs can be aligned with a very small or very big portion. These two marginal contigs play a key role in extending chains in following steps of our proposed model. In Figure 5.8, contigs X 2 to X n1 should be aligned almost with their full lengths but contigsX 1 andX n are exempt from this strong requirement. Because of chimeric long reads and also chimeric (misassembled) contigs, there may be many cases that such alignments should be skipped in producing chains. Another important information that should be processed and checked in the chains construction step is to verify whether the aligned portion of rst and last contigs of the chain is consistent with the inferred orientation or not. E.g. in Figure 5.9, green and red portions of contigs and the long read are properly aligned. ContigsA andB's alignments are valid if and only if orientations of both alignments are forward (not reverse-complement). On the contrary, contigsC andD's alignments are valid if and only if orientations of both alignments are reverse-complement (not forward). To be more precise, contig C (for example) cannot be a valid forward aligned contig unless the great majority of the blue portion is also properly aligned against the long read. Due to presence of chimeric long reads and also chimeric (misassembled) contigs, there may be many cases that such alignments should be skipped in producing chains. According to our analysis, there are many chimeric long reads in TGS libraries that are made up of dierent large sections of the reference genome. As a whole, they do not map to a contiguous section of the reference genome, and that is why we call them chimeric long reads. Most of the time, distant sections joined mistakenly in a long Chapter 5. Hybrid Genome Scaolding 94 read are very similar to each other or each other's reverse-complement. That is why chains made out of such long reads have repeated beads which are most of the time errors. Palindromic chains are a special case of chimeric chains made out of chimeric long reads depicted in Figure 5.10. This gure shows how two relatively long repeated sections in the reference genome can be mistakenly joined in TGS library production and how this chimeric long read can result in a chimeric palindromic chain. It is important to skip and ignore such chains in the chain production of the proposed model. If we have relatively small high-quality contigs (the case with older de novo assemblers like Velvet [74]) and relatively very long TGS long reads, we may have many long palindromic chains with more than four beads. But if we have relatively larger contigs or relatively shorter TGS long reads, most of our chains would have 2-4 beads. That is why it is safer and more conservative to skip and ignore chains with repeated beads in them. Figure 5.9: Consistency between aligned portions and orientations. Green and red portions of contigs and the long read are the properly aligned regions. Contigs A and B's alignments are valid if and only if orientations of both alignments are forward (not reverse-complement). On the contrary, contigs C and D's alignments are valid if and only if orientations of both alignments are reverse-complement (not forward). To be more precise, contig C (for example) cannot be a valid forward orientated aligned contig unless the great majority of blue portion is also properly aligned against the long read. Chapter 5. Hybrid Genome Scaolding 95 Figure 5.10: Palindromic chains are a special case of chimeric chains made out of chimeric long reads. This gure shows how two relatively long repeated sections in the reference genome can be mistakenly joined in TGS library production and how this chimeric long read can result in a chimeric palindromic chain. 5.3.4 Best Neighbors One of the key steps in our proposed model is \processing the neighborhood infor- mation" in chains. This is the rst step in which we are using the stored information regarding supports and co-occurrences of dierent contigs in our chains set. Finding and cutting weak (chimeric) links in our chains is one of the main results of this step in our proposed hybrid model. Finding and marking repeat and mis-assembled con- tigs is another important result of this step, as presented in Figure 5.7. We will label such probably repeated or misassembled contigs as high-risk contigs. These high-risk contigs play a key role in chain extension and chain joining steps later in our proposed hybrid model. In processing the neighborhood information in all chains we want to know for each Chapter 5. Hybrid Genome Scaolding 96 contig (in its forward orientation state), what are the best (most supported) two (one after and one before) neighbor contigs (possibly in forward or reverse-complement orientation state) that immediately precedes and follows this contig. E.g. if true relative orders and orientations of contigs X, Y and Z in the reference genome is as |X|Y |Z|, and none of these contigs are high-risk ones, the best neighbor ahead of contig Y must be Z and the best neighbor before contig Y must be X. Similarly, if true relative orders and orientations of contigsX,Y andZ in the reference genome is as |X|Y |Z|, and none of these contigs are high-risk ones, the best neigh- bor ahead of contig Y must be X and the best neighbor before contig Y must be Z. One should know that |X|Y |Z| and |Z|Y |X| are equivalent to each other. To nd best (most supported) neighboring orientated contigs for each contig, we keep oriented neighbor contigs (in each chain) in two priority queues, one con- taining all neighboring orientated contigs immediately following and one containing all neighboring orientated contigs immediately preceding the current contig. As mentioned above and also presented in Figure 5.7, nding and cutting weak (chimeric) links in all chains is one of the main steps in editing chains before starting chain extensions during nal scaolding. Suppose A and B are two contigs that are linked to each other (A|B) in some chains and the total number of chains support- ing this link (support of A|B) is equal to s 1 . But in this processing neighborhood information step we nd out that \none of them" are their best neighbors meaning that there exists a contig U and a contigV such thatA|U (orA|U) is a link with higher supports 2 >s 1 , and similarly,V |B (orV |B) is a link with higher support Chapter 5. Hybrid Genome Scaolding 97 s 3 > s 1 . This can be because of A or B being high-risk contigs (either repeated or misassembled), or because of errors in the initial alignment step or errors in TGS long read library. We will describe later how we distinguish high-risk contigs. For now, we can consider such links as weak chimeric links that need to be cut in this step. One might suggest that any link with small support (meaningfully smaller than the average support) can be considered as a weak link and possibly a chimeric link that needs to be cut. But our studies with several organisms showed that there are cases that because of the very low coverage of TGS long reads on some region there are very few (sometimes even 1 or 2) long reads supporting a link between two contigs. So we should beware of mistakenly cutting such links by misinterpreting them as errors and chimeric links. That is why we propose this best (most supported) neighborhood concept. If \neither of two contigs" connected by a link in a chain are each other's best neighbor, one should deactivate this chain and produce two new smaller chains by cutting the weak link. Generally, if there are n weak links in a chain, the chain gets deactivated and at most,n+1 new smaller chains are added to our chains set. It is important to emphasize that if there are already identical chains or longer chains covering these smaller chains in our repository, we can simply skip and ignore these new chains. Deactivated chains would be put aside until the very last step that we might check if we can use the information in them to close any gap and join any pair of scaolds or not. For nding and marking probable repeats and misassembled contigs (high-risk con- tigs) we need to nd pairs of contigs that are linked together in some chains and `only Chapter 5. Hybrid Genome Scaolding 98 one' of them is the other's best neighbor and the other one is not. E.g. in Figure 5.11, contigR is a repeated contig and contigsA,B,C andD are regular contigs. Suppose R is the best neighbor for all A, B, C and D and we have s 1 > s 3 and s 2 > s 4 . In this case, R's best neighbors would be A and B. So in the right chain links do not have mutual best neighbor property. Similarly, if we have s 1 > s 3 and s 2 < s 4 , links attached to B and C are failing to have mutual best neighbor property. We have to emphasize that all links of chains in Figure 5.11 are valid links and we should refrain from cutting them in this step. But it is important to know that whenever we observe situations like this, contig R would be a high-risk contig and probably a repeat contig. We will just mark such contigs as high-risk. We will describe how later in our work ow, such high-risk contigs play a key role in terminating the chain extension steps. Figure 5.11: Non-mutual best neighbor relationship in repeats. In this gure, contigR is a repeated contig and contigsA,B,C andD are regular contigs. SupposeR is best neighbor for allA,B,C andD and we haves 1 >s 3 ands 2 >s 4 . In this case, R's best neighbors would be A and B. So in the right chain links do not have mutual best neighbor property. Similarly if we have s 1 >s 3 and s 2 <s 4 , links attached to B and C are failing to have mutual best neighbor property. One should also note that for a regular correctly-assembled non-repeat contig, there Chapter 5. Hybrid Genome Scaolding 99 must be one oriented contig ahead and one oriented contig before claiming it as their best neighbors and it also must claim both of them as its best neighbors. But this is not necessarily the case for repeats and misassembled contigs. In Figure 5.11, there are two contigs (A and C) claiming R as their best neighbor ahead and there are also two contigs (B and D) claiming repeat contig R as their best neighbor before. This case is also depicted in Figure 5.12A. If the total number of contigs claiming a contig as their best neighbor ahead is equal to and consistent with the total number of contigs claiming a contig as their best neighbor before, we can say that this high- risk contig is almost surely a repeat. Otherwise (Figure 5.12B), it is probably a misassembled contig, especially if it is relatively a large contig (e.g. >10 Kbps). That is why one can assume that any contig receiving more than one best neighbor ahead or best neighbor before claim should be marked as a high-risk contig on this crucial step of our proposed hybrid model (Figure 5.12). We believe this approach in nding repeats is simple and more eective than analyzing only the coverage information in the alignment of contigs against long reads. As we do not need to estimate the copy number of repeat contigs in this proposed method, our model would be less sensitive to any possible over- or under-estimation of copy numbers of repeated contigs. Chapter 5. Hybrid Genome Scaolding 100 Figure 5.12: Neighborhood of repeats and misassembled contigs. If the total number of contigs claiming a contig as their best neighbor ahead is equal to and consistent with the total number of contigs claiming a contig as their best neighbor before, we can say that this high-risk contig is almost surely a repeat (A). Otherwise, it is probably a misassembled contig (B). That is why one can assume that any contig that receives more than one best neighbor ahead or best neighbor before claims should be marked as a high-risk contig. 5.3.5 Extending Chains to Superchains Now we have a set of high-quality and well-supported chains. In our proposed model chains extend forward and backward as much as possible. We will discuss this later in this subsection. Eventually, we will have a set of extended chains which are called superchains in our proposed model (Figure 5.13). If there are any unused contigs which are not incorporated into extended superchains, they would form a singleton superchain. One should note that superchains can be as short as a single bead (contig) or as long as a superchain containing all contigs. But according to our proposed extension algorithm which will be discussed shortly, superchains does not and cannot contain any high-risk contig in them. \A chain is only as strong as its weakest link. 3 " That is exactly how original pre- extension chains are scored and sorted in our hybrid model. The strength of a link 3 Proverb. Chapter 5. Hybrid Genome Scaolding 101 Figure 5.13: Contig super-chains. is its support, the total number of long reads supporting that link in a chain. There might be several links in a chain. The minimum of all supports of links of a chain would be its score in our model. We will process chains for extension (even a starting point or a candidate for getting merged with another chain) in descending order of their min-support scores. The chain with the highest score (largest min-support value) is picked for an extension. This initial chain must not contain any high-risk contig or an already used contig (a contig already visited and incorporated in the extension process). If there are no such chains left, this step of the work ow is terminated. We try to extend the selected chain forward as much as possible from its tail. When the chain forward extension has been halted or terminated, we repeat a similar procedure to extend it backwardly from its head. We describe the forward extension procedure here. The backward extension procedure would be quite similar. Chapter 5. Hybrid Genome Scaolding 102 After choosing the highest-score unstudied chain as the starting point, we look up all chain locations of the last contig. E.g. if the current last contig in the tail is contigX, we should nd all chain beads containing eitherX orX. This can be done quite fast if during the rst step of our hybrid model (processing the input alignment le and building chains), we record all locations of a contig to let it know where (position) and how (forward or backward orientation) it occurs in dierent chains. Then based on the orientation of the last contig (tail contig) in the candidate chain, we decide which side of the chain is needed for contig extension. We need to double check to make sure the other side is consistent with what we already have in the current chain. E.g. in Figure 5.14A and Figure 5.14B, left chains are the current chains which are getting extended and right chains are the corresponding candidate chains that are chosen for getting merged with the original chain to extend it. In Figure 5.14A the last contig is found in a forward orientation state while in Figure 5.14B the last contig is found in a reverse-complement orientation state. In both cases dashed sub-chains show parts of the current extending chain and the candidate chain that should be matching each other while the other part of the candidate chain is used for extension. It is important to say that if there are no extension candidates available, the exten- sion is terminated. On the other hand, if there are more than one candidate chains available, we continue with the highest score (larger min-support value) candidate chain. During the extension and merging the candidate chain with the original chain, we should halt extension as soon as we reach a high-risk contig or an already used contig. As discussed earlier in this section, the set of resulted extended chains which Chapter 5. Hybrid Genome Scaolding 103 Figure 5.14: Chain extension. Left chains are the current chains which are getting extended and right chains are the corresponding candidate chains that are chosen for getting merged with the original chain to extend it. (A) the last contig is found in a forward orientation state. (B) the last contig is found in a reverse-complement orientation state. In both cases (A) and (B) dashed sub- chains show parts of the current extending chain and the candidate chain that should be matching each other while the other part of the candidate chain is used for extension. do not contain any high-risk contig together with the set unused contigs in the form of singleton chains are called superchains in our proposed hybrid model (Figure 5.13). 5.3.6 Joining Superchains to Make Pre-scaolds If there are not any misassembled contig or any repeats in the genome, the previous step in the work ow would merge all chains to produce one single superchain and this single superchain would be our nal scaold answering all questions regarding the order, orientation, and distance of all input SGS contigs. But because of high-risk contigs, the set of resulted superchains contains more than one superchain some of which may be singletons and some others may be very long superchains. We should Chapter 5. Hybrid Genome Scaolding 104 emphasize it one more time that high-risk contigs (repeats and misassemblies) are not present in superchains and superchains are terminated at the repeat boundaries. The goal of this step in the work ow is to use all unused and unstudied chains (including all deactivated chains) to join some superchains heads and tails to each other to decrease the total number of superchains. This new set of joined superchains that will contain high-risk contigs are called pre-scaffolds. It is important to mention that we need to have at least one long read chain spanning each gap or repeat region to be able to join these superchains terminated at the gap or repeat boundary (Figure 5.15). If the gap or the repeat regions are too long (longer than all long reads in our TGS library), that gap or repeat should be considered as unsolvable. If there is more than one candidate chain (having high-risk contigs) for joining same superchain margins, we should go ahead with the longer and higher-score one. Figure 5.15: From superchains to pre-scaolds. We need to have at least one long read chain spanning each gap or repeat region to be able to join these superchains terminated at the gap or repeat boundary. If the gap or the repeat regions are too long (longer than all long reads in our TGS library), that gap or repeat should be considered as unsolvable. If there are more than one candidate chain (having high-risk contigs) for joining same superchain margins, we should go ahead with the longer and higher-score one. Contigs R 1 and R 2 are high-risk (repeat) contig in this gure. Chapter 5. Hybrid Genome Scaolding 105 5.3.7 Phase-I vs. Phase-II All discussed so far can be seen as the phase-I of the proposed model in scaolding the input contigs. Actually, the key parameter dierentiating phase-I and phase-II is the contigs length. Usually, mappers, assemblers, and scaolders have a mincontiglen parameter indicating that any contigs shorter than this parameter should be skipped and ignored in further analyses. This can be set to500 bps in bacterial genomes and larger values for more complex genomes. But as there might be still many high-risk contigs longer than mincontiglen complicating the whole procedure, we introduced a similar phaseIcontiglen parameter to dierentiate two phases in our proposed method. This can be set to1000 or1500 bps. This means that during phase-I, all contigs shorter than phaseIcontiglen are temporarily skipped and ignored in phase- I chains. At the end of suprchain production and pre-scaold production steps, only phase-I contigs are reported. In phase-II of our proposed model (Figure 5.7), phase-II chains containing both phase-I and phase-II contigs are used to join any pre-scaold margins to decrease the total number of pre-scaolds (`Join Pre-scaolds' step in Figure 5.7). Another important phase-II step mentioned in Figure 5.7 is to `Enhance Pre-scaolds' by trying to ll in the gaps between consecutive large phase-I contigs with any stretch of high-quality (highly supported) shorter phase-II contigs. Chapter 5. Hybrid Genome Scaolding 106 5.3.8 Finalization One more step might be needed to nalize the set of constructed scaolds. At the end of phase-II, if the total number of pre-scaolds is equal to the (expected) number of chromosomes, there is not any nalization step required. But if the total number of pre-scaolds is greater than the number of chromosomes, we might still be able to join some more pre-scaolds. In this nalization step, we only need to check pre- scaolds' marginal contigs to see if there are mutually best neighbors for each other or not. But it is important to emphasize that none of the margins should be high-risk contigs. Otherwise, these nal junctions might lead to misscaolding and results in misassemblies in the nal evaluation reports. In the nalization step, if a pre-scaold margin contig (which is not a high-risk contig itself) is mutually in the best neighbor relationship with a repeat contig, it is safe to append that repeat contig to the pre-scaold. But we cannot expect to continue this appending for one more step or expect to join pre-scaolds in this way. It is better to explain this in Figure 5.16. In this gure two pre-scaolds are shown, one of them has non-high-risk contig A as its tail and the other one has non-high-risk contig B as its head. Contig A's best neighbor ahead is high-risk contig R 1 and contig B's best neighbor before is high-risk contig R 2 . If there is a chain which has subchainA|R 1 |R 2 |B or subchain B|R 2 |R 1 |A, we can use it to join this two pre-scaolds. Actually, one can expect that there are not any chains long enough to have such sub-chains. If there were such chains, these pre-scaolds would have been Chapter 5. Hybrid Genome Scaolding 107 joined to each other when they were superchains in the previous step. But what if we have chains A|R 1 |R 2 and R 1 |R 2 |B in our chains repository. Can we still use these two chains to close this repeated region and join these pre-scaolds together or not? The answer is No! Because R 1 and R 2 are marked as high-risk contigs and there is a great probability that they are repeats. So R 1 andR 2 in chainA|R 1 |R 2 are not necessarily the same R 1 andR 2 in chainR 1 |R 2 |B. This is depicted in the lower part of Figure 5.16. Thus, the only nalization step that we can do in this last step of the work ow that will not result in misassemblies is appending R1 andR2 to the tail of pre-scaold 1 and head or pre-scaold 2. Figure 5.16: Finalization of prescaolds. Two pre-scaolds are depicted, one having non-high-risk contig A as its tail and the other one having non-high-risk contig B as its head. Contig A's best neighbor ahead is high-risk contig R 1 and contigB's best neighbor before is high-risk contig R 2 . If there is a chain which has subchain A|R 1 |R 2 |B or subchain B|R 2 |R 1 |A, we can use it to join this two pre-scaolds. If we just have chainsA|R 1 |R 2 andR 1 |R 2 |B we cannot use these two chains to close this repeated region and join these pre-scaolds together. Because R 1 and R 2 in chain A|R 1 |R 2 are not necessarily the same R 1 and R 2 in chain R 1 |R 2 |B. The only nalization step that can be done that will not result in mis-assemblies is appending R1 and R2 to tail of pre-scaold 1 and head or pre-scaold 2. Chapter 5. Hybrid Genome Scaolding 108 5.3.9 Estimating Distances Now that all contigs are ordered and orientated in scaolds, we need to estimate the distance of consecutive contigs pairs in scaolds. E.g. if contig X and Y are ordered and orientated az Y |X in a scaold and s is the support of such pairing between X and Y , we have previously recorded s dierent exact distances between Y and X based on their alignment coordinates in corresponding long reads. Based on the frequency of these dierent exact distances one needs to rst decide whether the distance is a positive value or a negative one. Positive distances mean there is a gap between Y and X and the most frequent exact distance is considered as the estimated positive distance. If this estimated positive distance is n, we need to put n Ns in the FASTA le between nucleotide sequences of Y and X. If the most frequent exact distance is a negative distance, we need to do a semi-global pairwise alignment to nd the exact overlap between these two contigs. If the two contigs are overlapped, their nucleotide sequences would be merged accordingly and there would not be any Ns between them in the nal FASTA le. 5.4 Evaluation and Discussion Now we should see how eective this work ow and the proposed hybrid model per- forms to scaold pre-assembled SGS bacterial contigs. Promising results and reliable Chapter 5. Hybrid Genome Scaolding 109 Table 5.1: Datasets used for evaluation of our proposed hybrid scaolder. E. coli B. trehalosi M. haemolytica Genome length 4,636,675 bps 2,407,846 bps 2,731,870 bps SGS library Sequencing platform Illumina MiSeq Illumina MiSeq Illumina MiSeq Total reads # 3,046.4 K 1,718.2 K 1,724.4 K Total bases # 460 M 249.2 M 249.4 M Mean read length 151 bps 145 bps 144 bps Sequencing coverage 100x 100x 90x TGS library Sequencing platform PacBio RS PacBio RS PacBio RS Total reads # 383.5 K 205.1 K 176 K Total bases # 929.1 M 499.9 M 531.2 M Mean read length 2,422 bps 2,437 bps 3,019 bps Sequencing coverage 200x 200x 194x improvements in a short time will open the door to scaold more complex draft genomes like those of plants and mammals. We evaluate our method in terms of contiguity and accuracy measures discussed in section 2.4. We use reports from \Quality Assessment Tool for Genome Assem- blies" (QUAST [75]) to show how our proposed hybrid model performs compared to PacBio's AHA (A Hybrid Assembler [76]) long read scaolder and the state-of-the- art standalone long read scaolder, SSPACE-LR (SSPACE-LongRead [64]), which is widely used and outperforms other competitors. In order to test the performance of our proposed model (we call it CHAINS in following tables) and compare it to current state-of-the-art standalone long-read scaf- folder (SSPACE-LongRead), we have performed in-depth analysis on three bacte- rial datasets: Escherichia coli K12 MG1655, Bibersteinia trehalosi, and Mannheimia haemolytica M42548 (Table 5.1). For each organism, paired-end Illumina MiSeq reads were used to create draft assemblies with IDBA de novo assembler [77]. Subsequently, Chapter 5. Hybrid Genome Scaolding 110 Table 5.2: Genome reconstruction of E. coli using SGS and TGS libraries. QUAST [75] evaluation metrics for input IDBA contigs, SSPACE-LongRead scaf- folds and CHAINS scaolds are presented in this table. Assembler / Scaolder IDBA SSPACE-LR CHAINS Contigs / Scaolds 168 19 1 Largest Contig 178,322 4,639,194 4,657,636 N50 58,008 4,639,194 4,657,636 NGA50 57,209 3,484,521 4,327,647 Total Length (bps) 4,570,181 4,657,165 4,657,636 # Mis-assemblies 0 3 4 # Mismatches (per 100 Kbps) 0.37 0.66 0.64 # Indels (per 100 Kbps) 0.18 0.35 0.24 # N's (per 100 Kbps) 0 1724.57 2357.53 Table 5.3: Genome reconstruction of B. trehalosi using SGS and TGS libraries. QUAST [75] evaluation metrics for input IDBA contigs, SSPACE-LongRead scaf- folds and CHAINS scaolds are presented in this table. Assembler / Scaolder IDBA SSPACE-LR CHAINS Contigs / Scaolds 73 24 6 Largest Contig 261,542 693,687 809,379 N50 83,683 281,804 591,744 NGA50 68,536 279,168 426,589 Total Length (bps) 2,349,231 2,369,584 2,461,025 # Mis-assemblies 1 3 5 # Mismatches (per 100 Kbps) 1.07 5.16 3.32 # Indels (per 100 Kbps) 2.05 3.12 2.67 # N's (per 100 Kbps) 0 833.14 1133.59 the draft contigs were scaolded using \uncorrected" PacBio RS long CLR reads with both SSPACE-LongRead and CHAINS. The results are displayed in Tables 5.2 to 5.4. To better understand QUAST evaluation reports in Tables 5.2 to 5.4, we brie y explain key comparison criteria in these tables as follows: # Contigs/Scaolds: The total number of contigs or scaolds in the assem- bly, Chapter 5. Hybrid Genome Scaolding 111 Table 5.4: Genome reconstruction of M. haemolytica using SGS and TGS libraries. QUAST [75] evaluation metrics for input IDBA contigs, SSPACE- LongRead scaolds and CHAINS scaolds are presented in this table. Assembler / Scaolder IDBA SSPACE-LR CHAINS Contigs / Scaolds 158 32 5 Largest Contig 256,637 2,339,412 2,303,653 N50 43,305 2,339,412 2,303,653 NGA50 43,305 325,889 2,240,518 Total Length (bps) 2,623,972 2,715,757 2,801,758 # Mis-assemblies 1 29 4 # Mismatches (per 100 Kbps) 2.86 30.52 5.59 # Indels (per 100 Kbps) 0.31 1.62 0.7 Largest contig: The length of the largest contig in the assembly, Total length: The total number of bases in the assembly, N50: The largest contig length, L, such that using contigs of length L accounts for at least 50% of the `bases in the assembly', N50: The largest contig length, L, such that using contigs of length L accounts for at least 50% of the `bases in the reference genome', NGA50: Contigs are broken into `aligned blocks', then NG50 statistic (instead of N50) is computed on these blocks, # of misassemblies: The number of misassemblies, using Plantagora's def- inition [78]. Plantagora denes a misassembly breakpoint as a position in the assembled contigs where the left anking sequence aligns over 1 kb away from the right anking sequence on the reference, or they overlap by >1 Kbp, or the anking sequences align on opposite strands or dierent chromosomes. Chapter 5. Hybrid Genome Scaolding 112 # of mismatches per 100 Kbp The average number of mismatches per 100,000 aligned bases. This metric does not distinguish between single-nucleotide polymorphisms, which are true dierences in the assembled genome versus the reference genome, and single-nucleotide errors, which are due to errors in reads or errors in the assembly algorithm. # indels per 100 Kbp The average number of single nucleotide insertions or deletions per 100,000 aligned bases. The assembly statistics in Tables 5.2 to 5.4 show that both SSPACE-LongRead and our proposed method, CHAINS, are able to signicantly reduce the number of input draft contigs using error-prone PacBio RS CLR reads as guidance. Overall the total assembly length remains relatively stable. It is apparent that CHAINS is better able to reconstruct continuous genome segments given the nal number of scaolds, number of mismatches, and number of indels are generally lower, especially for the more challenging B. trehalosi genome. It should also be remarked that the runtimes of these two tools are much shorter than PacBio's native AHA scaolder [64]. Our CHAINS model, implemented in C++ was running much faster than SSPACE-LongRead, implemented in Perl. This makes our model more suited for scaolding genomes on smaller computing systems. In terms of accuracy (through comparison with the corresponding reference genomes) our model introduced a few more misassemblies (one for E.coli and two for B. trehalosi) but this is clearly explained by the great improvement in the contiguity of the nal draft Chapter 5. Hybrid Genome Scaolding 113 assembly (especially for B. trehalosi). Moreover, it should be remarked that some of the apparent errors are actually true variations between the sequenced strain and the reference genome, an issue which is also explained by Koren et al. [79] as for each sequenced strain a close (but not necessarily the exact) reference was selected for quality assessment. To conclude this chapter, we also present results of another hybrid approach in which error-prone long reads get error-corrected rst (via a time-consuming error-correction method), then they are directly assembled to produce nal assembly draft. Lin and Liao (2015) [80] studies such hybrid methods. Table 5.5 shows the same QUAST contiguity and accuracy evaluation metrics presented for our hybrid model in Tables 5.2 for E. coli dataset. One can clearly observe that contiguity results and over- all runtime of this hybrid approach is far from ideal for bacterial genomes and the applicability of this approach for more complex genomes seems quite limited. Chapter 5. Hybrid Genome Scaolding 114 Table 5.5: Comparison of hybrid assemblies obtained from various long- read correction tools [80]. In this hybrid approach error-prone long reads get error-corrected rst (via a time-consuming error-correction method), then they are directly assembled to produce nal assembly draft. Assembler/ Scaolder ABySS LoRDEC + runCA LSC + runCA PBcR pipeline ECTools + runCA Proovread + runCA Contigs/ Scaolds 82 47 29 12 12 6 Largest Contig 285832 4188310 1954248 1889858 4644297 3523306 NGA50 129004 686189 572337 572347 956540 949310 Total Length (bps) 4646776 4791407 4769525 4667990 4695577 4674021 # Mis- assemblies 2 43 21 8 14 8 # Mismatches (per 100 Kbps) 2.92 3.30 2.26 2.03 9.33 5.73 # Indels (per 100 Kbps) 0.41 0.65 0.54 0.35 1.21 0.32 Overall Runtime 60m 15h 46m 21h 35m 4h 35m 11h 21m 5h 42m Chapter 6 Conclusion and Future Direction To conclude, we overview the key ideas we discussed in this dissertation. In Chapter 1, we presented the background of the eld. We described how high-throughput second- generation sequencing has had a great impact on the eld and made the low-cost, automatic, fast, and accurate production of draft genomes possible. We described and distinguished between de novo assembling, scaolding, and nishing to clarify why assemblers' results are highly fragmented and we need state-of-the-art standalone scaolders to order, orient, and correctly position fragmented contigs. We introduced and discussed several widely known standalone scaolders using SGS paired-read libraries in Chapter 2. We saw why scaolding is an NP-hard problem and why nearly all presented and discussed scaolders fail in producing a nearly complete genome even for less complicated bacterial genomes. 115 Chapter 6. Conclusion 116 The curse of repeats in de novo genome assembly and scaolding was presented and discussed in more details in Chapter 3. We explained dierent types of repeats in various species and discussed how each repeat type can aect the contiguity and ac- curacy of automatic reconstruction of draft genomes. We discussed the need for a repeat-aware scaolder to break this plateau in Chapter 4. We presented a repeat- aware formulation of the scaolding problem and well-known scaolding data struc- tures, like scaolding graphs. We also presented a mixed integer linear programming method to deal with the repeats. Using some small articial genomes we showed how eectively our model can solve the repeats and how state-of-the-art scaolders cannot produce a single large scaold out of these articial repetitive genomes. A very important conclusion of Chapter 4 was about the possibility of having multi- ple equivalent candidate solutions that no SGS scaolder (including out presented model) can distinguish. So one might need to somehow enumerate the model output solution to create the actual set of equivalent solution candidates. Paired reads from longer insert libraries or other distal information can help to pick the right solution in the pool of equivalent candidate solutions. Having paired reads from both short insert (e.g. Illumina paired-end) and long insert (e.g. Illumina mate-pair) libraries decreases the probability of these situations. Because of these limitations and the problem with the scalability of our proposed model to deal with more complex real genomes, we cannot claim that our proposed SGS scaolding model presented in Chapter 4 can break the plateau. To break the plateau in de novo genome scaolding, in Chapter 5, We proposed a novel hybrid Chapter 6. Conclusion 117 model for scaolding pre-assembled SGS contigs using TGS long read information. We showed that even uncorrected error-prone PacBio RS CLR reads can be well used to connect contigs into chains, chains to superchains, and superchains to scaf- folds. Our proposed model was successful in placing solvable repeats and conse- quently reconstructed bacterial genomes better than SSPACE-LongRead [64] which had already outperformed other competitors. Importantly, we showed that only two libraries are needed for the hybrid assembly of a bacterial genome. One Illumina MiSeq paired-end library is sucient for the construction of a proper draft assem- bly yielding high-quality SGS contigs using a state-of-the-art de Bruijn graph or overlap-layout-consensus assembler (for sure, the nal quality is in uenced by the exact method chosen). We claim that one PacBio RS library is sucient to nearly nish bacterial assemblies. We presented promising results using PacBio uncorrected CLR libraries with about 15% error-rate and 2 to 3 Kbps mean read lengths. It is likely that further improvements of the sequencing platforms and chemistry will show additional improvements, i.e. a complete genome closure at lower costs. Further- more, our model can be used for possible new sequencing platforms, such as Illumina and Oxford Nanopore Technologies systems, given that the input format (FASTA or FASTQ) is standardized and not specically bound to any specic platform. For future directions, it can be foreseen that additional methods will appear that per- form assemblies using only long read information. But there are some major issues with the overlap-consensus implementation (which requires an accurate prior esti- mation of the expected genome size) and the introduction of erroneous duplications Chapter 6. Conclusion 118 (which need to be manually removed). We feel the current study opens new doors to address the genome assembly question and can positively contribute to the auto- matic reconstruction of less fragmented and ideally, truly complete draft genomes or automatic and fast upgrade of current available fragmented draft genomes in vari- ous research centers and public libraries. We expect by improvements in length and accuracy of third-generation sequencing libraries, researchers can use the ideas in our presented hybrid model to scaold high-quality contigs of much more complex genomes. Appendix A Scaolding Graphs and Extended Scaolding Graphs In this Appendix, we go through the results of some carefully simulated test cases. As we wanted to control the count, location, length, and relative order of repeats, we developed a variety of simple (no repeats) to complex (several repeats) hypothetical genomes. Figures A.1 through A.13 show the regular scaolding graphs and the corresponding extended scaolding graphs. Figures' captions specically describe the repeat structure of each test case. We tried to make the simple test case of Figure A.1 more complex in each step by adding repeat structures to it, increasing the repeat counts, changing the orientations of some repeats, making some tandem repeats, and also putting some complex repeat of repeat structures in the graph. Although the lengths of these genomes are not very long, because of the repeats 119 Appendix A. Scaolding Graphs and Extended Scaolding Graphs 120 in these genomes none of the standalone scaolders can fully solve these test cases, because highly covered or high-degree nodes in the scaolding graph that makes tangles in the scaolding graph are usually removed from the graph. Even ignoring the repeats will put just one copy of them in the nal solution and location of other copies will be considered as gaps. To the best of our knowledge, our model is the only one that can handle these repeat structures. Figure A.1: A scaolding graph without any repeats. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 121 Figure A.2: A scaolding graph with one two-copy repeat contig. Contig 3 (grayed out) is a repeat contig with two copies, one between contigs 2 and 4, and the other one between contigs 12 and 13. Both copies of repeat contig 3 share the same orientation. FigureA.3: Extended graph representation of Figure A.2. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 122 Figure A.4: A scaolding graph with one three-copy repeat contig. Contig 3 (grayed out) is a repeat contig with three copies, one forward copy between contigs 2 and 4, one forward copy between contigs 12 and 13, and one reverse-complement copy between contigs 5 and 6. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 123 FigureA.5: Extended graph representation of Figure A.4. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 124 Figure A.6: A scaolding graph with two repeat contigs. Contig 3 (grayed out) is a repeat contig with three copies, one forward copy between contigs 2 and 4, one forward copy between contigs 12 and 13, and one reverse-complement copy between contigs 5 and 6. Contig 9 (grayed out) is a repeat contig with two copies, one forward copy between contigs 8 and 10, and one reverse-complement copy between contigs 19 and 20. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 125 FigureA.7: Extended graph representation of Figure A.6. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 126 Figure A.8: A scaolding graph with a tandem repeat contigs. Contig 11 (grayed out) is a tandem repeat contig with two tandem copies both of which are forward- oriented and both of them are sitting between contigs 10 and 12. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 127 FigureA.9: Extended graph representation of Figure A.8. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 128 FigureA.10: A scaolding graph with two regular repeat contigs and one tandem repeat contig. Contig 3 (grayed out) is a repeat contig with three copies, one forward copy between contigs 2 and 4, one forward copy between contigs 12 and 13, and one reverse-complement copy between contigs 5 and 6. Contig 9 (grayed out) is a repeat contig with two copies, one forward copy between contigs 8 and 10, and one reverse-complement copy between contigs 19 and 20. Contig 20 (grayed out) is a tandem repeat contig with two tandem copies (one forward and the other one reverse-complement) both of which are sitting between contig 21 and a copy of contig 9. Figure A.11: Extended graph representation of Figure A.10. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 129 FigureA.12: A scaolding graph with a repeat of repeats structure. The sequence of contigs 3' (reverse of 3), 6, and 7 (grayed out) is repeated twice between contigs 5 and 8. Contig 9 (grayed out) is a repeat contig with two copies, one forward copy between contigs 8 and 10, and one reverse-complement copy between contig 19 and a copy of contig 20. Contig 20 (grayed out) is a tandem repeat contig with two tandem copies (one forward and the other one reverse-complement) both of which are sitting between contig 21 and a copy of contig 9. Appendix A. Scaolding Graphs and Extended Scaolding Graphs 130 Figure A.13: Extended graph representation of Figure A.12. Our suggested method could successfully solve this simulated test case and correctly output the order and orientation of all contigs. Edge weights are the paired-read support count of each edge. Node labels follow \repeat-id:contig-id(repeats)length" format. Bibliography [1] Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, Guillaume Mar cais, Mihai Pop, and James A Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22(3):557{567, March 2012. [2] Sante Gnerre, Iain Maccallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, Aaron M Berlin, Daniel Aird, Maura Costello, Riza Daza, Louise Williams, Robert Nicol, Andreas Gnirke, Chad Nusbaum, Eric S Lander, and David B Jae. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U. S. A., 108(4):1513{1518, 25 January 2011. [3] The assemblathon. http://assemblathon.org/, . Accessed: 2015-8-23. [4] Dent Earl, Keith Bradnam, John St John, Aaron Darling, Dawei Lin, Joseph Fass, Hung On Ken Yu, Vince Bualo, Daniel R Zerbino, Mark Diekhans, Ngan Nguyen, Pramila Nuwantha Ariyaratne, Wing-Kin Sung, Zemin Ning, Matthias Haimel, Jared T Simpson, Nuno A Fonseca, _ Inan c Birol, T Roderick Docking, Isaac Y Ho, Daniel S Rokhsar, Rayan Chikhi, Dominique Lavenier, Guillaume Chapuis, Delphine Naquin, Nicolas Maillet, Michael C Schatz, David R Kelley, Adam M Phillippy, Sergey Koren, Shiaw-Pyng Yang, Wei Wu, Wen-Chi Chou, Anuj Srivastava, Timothy I Shaw, J Graham Ruby, Peter Skewes-Cox, Miguel Betegon, Michelle T Dimon, Victor Solovyev, Igor Seledtsov, Petr Kosarev, Denis Vorobyev, Ricardo Ramirez-Gonzalez, Richard Leggett, Dan MacLean, Fangfang Xia, Ruibang Luo, Zhenyu Li, Yinlong Xie, Binghang Liu, Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J Ribeiro, Shuangye Yin, Ted Sharpe, Giles Hall, Paul J Kersey, Richard Durbin, Shaun D Jackman, Jarrod A Chapman, Xiaoqiu Huang, Joseph L DeRisi, Mario Caccamo, Yingrui Li, David B Jae, Richard E Green, David Haussler, Ian Korf, and Benedict Paten. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res., 21(12):2224{2241, December 2011. 131 Bibliography 132 [5] Keith R Bradnam, Joseph N Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, Inan c Birol, S ebastien Boisvert, Jarrod A Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A Fonseca, Ganeshkumar Ganapathy, Richard A Gibbs, Sante Gnerre, El enie Godzaridis, Steve Goldstein, Matthias Haimel, Giles Hall, David Haussler, Joseph B Hiatt, Isaac Y Ho, Jason Howard, Martin Hunt, Shaun D Jackman, David B Jae, Erich D Jarvis, Huaiyang Jiang, Sergey Kazakov, Paul J Kersey, Jacob O Kitzman, James R Knight, Sergey Koren, Tak-Wah Lam, Dominique Lavenier, Fran cois Laviolette, Yingrui Li, Zhenyu Li, Binghang Liu, Yue Liu, Ruibang Luo, Iain Maccallum, Matthew D Macmanes, Nicolas Maillet, Sergey Melnikov, Delphine Naquin, Zemin Ning, Thomas D Otto, Benedict Paten, Oct avio S Paulo, Adam M Phillippy, Francisco Pina-Martins, Michael Place, Dariusz Przybylski, Xiang Qin, Carson Qu, Filipe J Ribeiro, Stephen Richards, Daniel S Rokhsar, J Graham Ruby, Simone Scalabrin, Michael C Schatz, David C Schwartz, Alexey Sergushichev, Ted Sharpe, Timothy I Shaw, Jay Shendure, Yujian Shi, Jared T Simpson, Henry Song, Fedor Tsarev, Francesco Vezzi, Riccardo Vicedomini, Bruno M Vieira, Jun Wang, Kim C Worley, Shuangye Yin, Siu-Ming Yiu, Jianying Yuan, Guojie Zhang, Hao Zhang, Shiguo Zhou, and Ian F Korf. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience, 2(1):10, 22 July 2013. [6] Konstantinos Patsis. Evaluation of DNA scaolding techniques using PacBio long reads. Master's thesis, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands, 2014. [7] T F Smith and M S Waterman. Identication of common molecular subsequences. J. Mol. Biol., 147(1):195{197, 25 March 1981. [8] Niranjan Nagarajan and Mihai Pop. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol., 16(7):897{908, July 2009. [9] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome Res., 20(9):1165{1173, September 2010. [10] Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, Stephen W Turner, and Jonas Korlach. Nonhybrid, nished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10(6):563{569, June 2013. Bibliography 133 [11] Paired-End sequencing. http://www.illumina.com/technology/ next-generation-sequencing/paired-end-sequencing_assay.html, . Accessed: 2015-8-23. [12] Mate pair sequencing. http://www.illumina.com/technology/ next-generation-sequencing/mate-pair-sequencing_assay.html, . Accessed: 2015-8-23. [13] Art L. Delcher Ian M. Dew Dan P. Fasulo Michael J. Flanigan Saul A. Kravitz Clark M. Mobarry Knut H. J. Reinert Karin A. Remington Eric L. Anson Randall A. Bolanos Hui-Hsien Chou Catherine M. Jordan Aaron L. Halpern Stefano Lonardi Ellen M. Beasley Rhonda C. Brandon Lin Chen Patrick J. Dunn Zhongwu Lai Yong Liang Deborah R. Nusskern Ming Zhan Qing Zhang Xiangqun Zheng Gerald M. Rubin Mark D. Adams J. Craig Venter Eugene W. Myers, Granger G. Sutton. A whole-genome assembly of drosophila. Science, 287(5461):2196{2204, 2000. [14] P A Pevzner, H Tang, and M S Waterman. An eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U. S. A., 98(17):9748{9753, 14 August 2001. [15] Mihai Pop, Daniel S Kosack, and Steven L Salzberg. Hierarchical scaolding with bambus. Genome Res., 14(1):149{159, January 2004. [16] Adel Dayarian, Todd P Michael, and Anirvan M Sengupta. SOPRA: Scaolding algorithm for paired reads via statistical optimization. BMC Bioinformatics, 11:345, 24 June 2010. [17] Sergey Koren, Todd J Treangen, and Mihai Pop. Bambus 2: scaolding metagenomes. Bioinformatics, 27(21):2964{2971, 1 November 2011. [18] Leena Salmela, Veli M akinen, Niko V alim aki, Johannes Ylinen, and Esko Ukkonen. Fast scaolding with small independent mixed integer programs. Bioinformatics, 27(23):3259{3265, 1 December 2011. [19] Song Gao, Wing-Kin Sung, and Niranjan Nagarajan. Opera: reconstructing optimal genomic scaolds with high-throughput paired-end sequences. J. Comput. Biol., 18(11):1681{1691, November 2011. [20] Marten Boetzer, Christiaan V Henkel, Hans J Jansen, Derek Butler, and Walter Pirovano. Scaolding pre-assembled contigs using SSPACE. Bioinformatics, 27(4):578{579, 15 February 2011. [21] Alexey A Gritsenko, Jurgen F Nijkamp, Marcel J T Reinders, and Dick de Ridder. GRASS: a generic algorithm for scaolding next-generation sequencing assemblies. Bioinformatics, 28(11):1429{1437, 1 June 2012. Bibliography 134 [22] Nilgun Donmez and Michael Brudno. SCARPA: scaolding reads with practical algorithms. Bioinformatics, 29(4):428{434, 15 February 2013. [23] Martin Hunt, Chris Newbold, Matthew Berriman, and Thomas D Otto. A comprehensive evaluation of assembly scaolding tools. Genome Biol., 15(3): R42, 3 March 2014. [24] Genome reference consortium. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/, . Accessed: 2015-8-23. [25] Vitor C Piro, Helisson Faoro, Vinicius A Weiss, Maria B R Steens, Fabio O Pedrosa, Emanuel M Souza, and Roberto T Raittz. FGAP: an automated gap closing tool. BMC Res. Notes, 7:371, 18 June 2014. [26] GenomeWeb feature: The lost art of genome nishing. https://www.genomeweb.com/sequencing/ genomeweb-feature-lost-art-genome-finishing, . Accessed: 2015-8-23. [27] Marten Boetzer and Walter Pirovano. Toward almost closed genomes with GapFiller. Genome Biol., 13(6):R56, 25 June 2012. [28] Martin T Swain, Isheng J Tsai, Samual A Assefa, Chris Newbold, Matthew Berriman, and Thomas D Otto. A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs. Nat. Protoc., 7(7): 1260{1284, July 2012. [29] Daniel H Huson, Knut Reinert, and Eugene W Myers. The greedy path-merging algorithm for contig scaolding. J. ACM, 49(5):603{615, 1 September 2002. [30] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven J M Jones, and Inan c Birol. ABySS: a parallel assembler for short read sequence data. Genome Res., 19(6):1117{1123, June 2009. [31] Jared T Simpson and Richard Durbin. Ecient de novo assembly of large genomes using compressed data structures. Genome Res., 22(3):549{556, March 2012. [32] Ruibang Luo, Binghang Liu, Yinlong Xie, Zhenyu Li, Weihua Huang, Jianying Yuan, Guangzhu He, Yanxiang Chen, Qi Pan, Yunjie Liu, Jingbo Tang, Gengxiong Wu, Hao Zhang, Yujian Shi, Yong Liu, Chang Yu, Bo Wang, Yao Lu, Changlei Han, David W Cheung, Siu-Ming Yiu, Shaoliang Peng, Zhu Xiaoqian, Guangming Liu, Xiangke Liao, Yingrui Li, Huanming Yang, Jian Wang, Tak-Wah Lam, and Jun Wang. SOAPdenovo2: an empirically improved memory-ecient short-read de novo assembler. Gigascience, 1(1):18, 27 December 2012. Bibliography 135 [33] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078{2079, 15 August 2009. [34] W J Kent and D Haussler. Assembly of the working draft of the human genome with GigAssembler. Genome Res., 11(9):1541{1548, September 2001. [35] M R Garey and D S Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. [36] Todd J Treangen, Dan D Sommer, Florent E Angly, Sergey Koren, and Mihai Pop. Next generation sequence assembly with AMOS. Curr. Protoc. Bioinformatics, Chapter 11:Unit 11.8, March 2011. [37] AMOS WIKI. http://amos.sourceforge.net/wiki/index.php/AMOS, . Accessed: 2015-8-22. [38] Ren e L Warren, Granger G Sutton, Steven J M Jones, and Robert A Holt. Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23 (4):500{501, 15 February 2007. [39] Todd J Treangen and Steven L Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet., 13(1): 36{46, January 2012. [40] Mark A Batzer and Prescott L Deininger. Alu repeats and human genomic diversity. Nat. Rev. Genet., 3(5):370{379, May 2002. [41] Liang Ye, Ladeana W Hillier, Patrick Minx, Nay Thane, Devin P Locke, John C Martin, Lei Chen, Makedonka Mitreva, Jason R Miller, Kevin V Haub, David J Dooling, Elaine R Mardis, Richard K Wilson, George M Weinstock, and Wesley C Warren. A vertebrate case study of the quality of assemblies derived from next-generation sequences. Genome Biol., 12(3):R31, 31 March 2011. [42] Can Alkan, Saba Sajjadian, and Evan E Eichler. Limitations of next-generation genome sequence assembly. Nature methods, 8(1):61{65, 2011. [43] Bowtie 2: fast and sensitive read alignment. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml, . Accessed: 2015-8-24. [44] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nat. Methods, 9(4):357{359, April 2012. [45] jstjohn. jstjohn/SimSeq. https://github.com/jstjohn/SimSeq. Accessed: 2015-8-25. Bibliography 136 [46] Gurobi. Gurobi optimization, inc. http://www.gurobi.com/. Accessed: 2015-8-25. [47] Frederick Sanger, Steven Nicklen, and Alan R Coulson. Dna sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463{5467, 1977. [48] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376{380, 2005. [49] Jonathan M Rothberg, Wolfgang Hinz, Todd M Rearick, Jonathan Schultz, William Mileski, Mel Davey, John H Leamon, Kim Johnson, Mark J Milgrew, Matthew Edwards, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356):348{352, 2011. [50] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geo Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, et al. Real-time dna sequencing from single polymerase molecules. Science, 323(5910):133{138, 2009. [51] Daniel Branton, David W Deamer, Andre Marziali, Hagan Bayley, Steven A Benner, Thomas Butler, Massimiliano Di Ventra, Slaven Garaj, Andrew Hibbs, Xiaohua Huang, et al. The potential and challenges of nanopore sequencing. Nature biotechnology, 26(10):1146{1153, 2008. [52] Christoph Bleidorn. Third generation sequencing: technology and its potential impact on evolutionary biodiversity research. Systematics and Biodiversity, 14 (1):1{8, 2016. [53] Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, et al. Nonhybrid, nished microbial genome assemblies from long-read smrt sequencing data. Nature methods, 10(6):563{569, 2013. [54] John Huddleston, Swati Ranade, Maika Malig, Francesca Antonacci, Mark Chaisson, Lawrence Hon, Peter H Sudmant, Tina A Graves, Can Alkan, Megan Y Dennis, et al. Reconstructing complex regions of genomes using long-read sequencing technology. Genome research, 24(4):688{696, 2014. [55] Mark JP Chaisson, John Huddleston, Megan Y Dennis, Peter H Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536):608{611, 2015. Bibliography 137 [56] John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geo Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, Arkadiusz Bibillo, Keith Bjornson, Bidhan Chaudhuri, Frederick Christians, Ronald Cicero, Sonya Clark, Ravindra Dalal, Alex Dewinter, John Dixon, Mathieu Foquet, Alfred Gaertner, Paul Hardenbol, Cheryl Heiner, Kevin Hester, David Holden, Gregory Kearns, Xiangxu Kong, Ronald Kuse, Yves Lacroix, Steven Lin, Paul Lundquist, Congcong Ma, Patrick Marks, Mark Maxham, Devon Murphy, Insil Park, Thang Pham, Michael Phillips, Joy Roy, Robert Sebra, Gene Shen, Jon Sorenson, Austin Tomaney, Kevin Travers, Mark Trulson, John Vieceli, Jerey Wegener, Dawn Wu, Alicia Yang, Denis Zaccarin, Peter Zhao, Frank Zhong, Jonas Korlach, and Stephen Turner. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133{138, 2 January 2009. [57] Konstantinos Patsis. Evaluation of DNA scaolding techniques using PacBio long reads. PhD thesis, TU Delft, Delft University of Technology, 2014. [58] Sequencing dna in the palm of your hand | nasa. http://www.nasa.gov/ mission_pages/station/research/news/biomolecule_sequencer. (Accessed on 08/12/2016). [59] Nanopore sequencing is here to stay - bio-it world. http://www.bio-itworld. com/2014/12/22/nanopore-sequencing-here-stay.html. (Accessed on 08/12/2016). [60] Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108 (4):1513{1518, 2011. [61] Sergey Koren, Michael C Schatz, Brian P Walenz, Jerey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature biotechnology, 30(7):693{700, 2012. [62] Adam C English, Stephen Richards, Yi Han, Min Wang, Vanesa Vee, Jiaxin Qu, Xiang Qin, Donna M Muzny, Jerey G Reid, Kim C Worley, and Richard A Gibbs. Mind the gap: upgrading genomes with pacic biosciences RS long-read sequencing technology. PLoS One, 7(11):e47768, 21 November 2012. [63] Ali Bashir, Aaron A Klammer, William P Robins, Chen-Shan Chin, Dale Webster, Ellen Paxinos, David Hsu, Meredith Ashby, Susana Wang, Paul Peluso, Robert Sebra, Jon Sorenson, James Bullard, Jackie Yen, Marie Valdovino, Emilia Mollova, Khai Luong, Steven Lin, Brianna LaMay, Amruta Joshi, Lori Rowe, Michael Frace, Cheryl L Tarr, Maryann Turnsek, Brigid M Davis, Andrew Kasarskis, John J Mekalanos, Matthew K Waldor, and Eric E Bibliography 138 Schadt. A hybrid approach for the automated nishing of bacterial genomes. Nat. Biotechnol., 30(7):701{707, July 2012. [64] Marten Boetzer and Walter Pirovano. SSPACE-LongRead: scaolding bacterial draft genomes using long read sequence information. BMC Bioinformatics, 15:211, 20 June 2014. [65] M. J. Chaisson and G. Tesler. Mapping single molecule sequencing reads using basic local alignment with successive renement (BLASR): application and theory. BMC Bioinformatics, 13:238, 2012. [66] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589{595, Mar 2010. [67] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357{359, 2012. [68] W James Kent. Blat|the blast-like alignment tool. Genome research, 12(4): 656{664, 2002. [69] Smalt | sanger institute. http://www.sanger.ac.uk/science/tools/smalt-0. (Accessed on 08/14/2016). [70] Brian Bushnell. Bbmap: A fast, accurate, splice-aware aligner. Technical report, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US), 2014. [71] Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):1, 2004. [72] Szymon M Kie lbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C Frith. Adaptive seeds tame genomic sequence comparison. Genome research, 21 (3):487{493, 2011. [73] Martin C Frith, Raymond Wan, and Paul Horton. Incorporating sequence quality data into alignment improves dna read mapping. Nucleic acids research, 38(7):e100{e100, 2010. [74] Daniel R Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5):821{829, 2008. [75] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. Quast: quality assessment tool for genome assemblies. Bioinformatics, 29(8): 1072{1075, 2013. Bibliography 139 [76] Aha pacicbiosciences/bioinformatics-training wiki github. https: //github.com/PacificBiosciences/Bioinformatics-Training/wiki/AHA. (Accessed on 08/16/2016). [77] Yu Peng, Henry CM Leung, Siu-Ming Yiu, and Francis YL Chin. Idba{a practical iterative de bruijn graph de novo assembler. In Annual International Conference on Research in Computational Molecular Biology, pages 426{440. Springer, 2010. [78] Roger Barthelson, Adam J McFarlin, Steven D Rounsley, and Sarah Young. Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS One, 6(12):e28436, 2011. [79] Sergey Koren, Gregory P Harhay, Timothy PL Smith, James L Bono, Dayna M Harhay, Scott D Mcvey, Diana Radune, Nicholas H Bergman, and Adam M Phillippy. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome biology, 14(9):1, 2013. [80] Hsin-Hung Lin and Yu-Chieh Liao. Evaluation and validation of assembling corrected pacbio long reads for microbial genome completion via hybrid approaches. PloS one, 10(12):e0144305, 2015.
Abstract (if available)
Abstract
Scaffolding is the process of orienting and ordering contiguous sequences (contigs) produced during genome assembly. To finish draft genomes it is crucial to have accurate scaffolding as it facilitates the procedures needed to fill in the gaps between contigs which are costly and laborious. It has been proved that the scaffolding problem is an NP-hard problem and in the case of presence of erroneous data (e.g. chimeric reads, mappings, contigs, etc.) even its sub-problems, like finding orientations, are computationally intractable. That is why conventional formulations of scaffolding programs rely on heuristics to find approximate solutions, with potentially exponential running time. But due to limitations of the short reads which are the main ingredient in de novo assembling and scaffolding, most repetitive parts of the genome remain unsolved. As nearly all state-of-the-art scaffolders remove suspicious repeat contigs, they all yield fragmented scaffolds which are still far from nearly complete genomes or at least correctly ordered and oriented contigs. We present the first repeat-aware scaffolding model for second-generation sequencing data which like most current scaffolders rely on paired reads, but it does not remove or ignore repeat contigs. On the contrary, our model tries to use repeats to solve the most challenging parts of the scaffolding graph. We show how this repeat-aware second-generation sequencing model can solve repeats in small artificial synthesized genomes, and why it is not capable of scaling up for more complex real genomes to break the plateau in de novo genome scaffolding. ❧ Recently, the third-generation sequencing data has become more available to researchers and scientists. Long-read libraries can make it possible to cope with the scaffolding problem in a more accurate and cost-effective way. They can improve the quality of draft assemblies constructed from second-generation sequencing data by making them more accurate and more complete. We propose a novel and scalable hybrid scaffolding model that aims to scaffold pre-assembled contigs produced by short reads. We show how this new hybrid scaffolding model performs to upgrade some bacterial draft genomes and might be capable of breaking the plateau in de novo genome scaffolding.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Techniques for de novo sequence assembly: algorithms and experimental results
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Big data analytics in metagenomics: integration, representation, management, and visualization
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Feature engineering and supervised learning on metagenomic sequence data
PDF
3D modeling of eukaryotic genomes
PDF
De novo peptide sequencing and spectral alignment algorithm via tandem mass spectrometry
PDF
Application of machine learning methods in genomic data analysis
PDF
Understanding the characteristic of single nucleotide variants
PDF
Unlocking capacities of genomics datasets through effective computational methods
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
Asset Metadata
Creator
Bagherian, Misagh
(author)
Core Title
Breaking the plateau in de novo genome scaffolding
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
07/20/2017
Defense Date
08/24/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
de novo genome assembly,genome finishing,genome scaffolding,long reads,OAI-PMH Harvest,repeats,single molecule sequencing
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Ting (
committee chair
), Chen, Liang (
committee member
), Nakano, Aiichiro (
committee member
)
Creator Email
bagheria@usc.edu,misagh.bagherian@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-405787
Unique identifier
UC11263160
Identifier
etd-BagherianM-5564.pdf (filename),usctheses-c40-405787 (legacy record id)
Legacy Identifier
etd-BagherianM-5564.pdf
Dmrecord
405787
Document Type
Dissertation
Rights
Bagherian, Misagh
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
de novo genome assembly
genome finishing
genome scaffolding
long reads
repeats
single molecule sequencing